This file analyses the data of the count table included in countData.txt
. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. One observation in a given condition has a clone in the second condition: this will be refered as the “genotype effect” in the rest of the document. The data were provided as a courtesy of Mohamed Zouine (ENSAT, Toulouse).
Download the following files: * http://www.nathalievialaneix.eu/doc/R/solution_edgeR-tomato.R (R script) * http://www.nathalievialaneix.eu/doc/txt/countData.txt (dataset, text format) * … (design, text format) and save them all in the same directory.
Then start RStudio and using the menu “File/Open File…”, open the file solution_edgeR-tomato.R (that contains the R script of this document). Also set your working directory to the directory where you have downloaded all files with the menu “Session / Set working directory / To source file location”.
The command lines in the R script can be executed using the button “Run” or the shortcut Ctrl+Enter
. The analysis has been performed using R and the R packages edgeR, limma, RColorBrewer, mixOmics and HTSFilter. Precise versions of the R software used in this document is given in the last section of this document.
The required packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievialaneix.eu/teaching/rnaseq.html to install it by yourself (HTSFilter
is probably not installed; you can not use the standard command install.packages
to install it because it is a Bioconductor package: check the webpage!).
## Do Not Run
# how to install required packages?
# install.packages(c("RColorBrewer","mixOmics"))
# source("http://bioconductor.org/biocLite.R")
# biocLite("edgeR")
## End DNR
library(edgeR)
library(limma)
library(RColorBrewer)
library(mixOmics)
library(VennDiagram)
library(HTSFilter)
If you open the file countData.txt
with a simple text reader (wordpad for instance), you see that:
the first column contains the gene names;
the next three columns contain the WT samples;
the last three columns contain the Mutant samples;
columns are separated by tabs.
If you open the file design.csv
with a simple text reader (wordpad for instance), you see that:
the first column contains the sample names;
the second column contains the condition (M for Mutant or WT for Wild Type);
columns are separated by commas;
the first row contains column names.
The two files are imported in, respectively, rawCountTable
and sampleInfo
with the function read.table
:
rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1)
sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1)
Their content can be checked with head
that print the first 6 rows:
head(rawCountTable)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1 0 0 0 0
## Solyc00g005020.1.1 0 0 0 0
## Solyc00g005040.2.1 0 0 0 0
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005060.1.1 0 0 0 0
## Solyc00g005070.1.1 0 0 0 0
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1 0 0
## Solyc00g005020.1.1 0 0
## Solyc00g005040.2.1 0 0
## Solyc00g005050.2.1 366 294
## Solyc00g005060.1.1 0 0
## Solyc00g005070.1.1 0 0
nrow(rawCountTable)
## [1] 34675
34675 genes are included in this file.
Similarly, informations about samples are contained into sampleInfo
. The entry genotype
, that corresponds to the clone number, is converted into a factor for subsequent analyses.
head(sampleInfo)
## condition genotype
## Cond.WT.Rep.1 WT 1
## Cond.WT.Rep.2 WT 2
## Cond.WT.Rep.3 WT 3
## Cond.Mt.Rep.1 M 1
## Cond.Mt.Rep.2 M 2
## Cond.Mt.Rep.3 M 3
sampleInfo$genotype <- as.factor(sampleInfo$genotype)
DGEList
objectA DGEList object is needed to process RNAseq datasets. This object is created using the count table and the design file:
dgeFull <- DGEList(rawCountTable, remove.zeros = TRUE)
## Removing 7492 rows with all zero counts
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 1 16602772 1
## Cond.WT.Rep.2 1 14374675 1
## Cond.WT.Rep.3 1 17427146 1
## Cond.Mt.Rep.1 1 18060148 1
## Cond.Mt.Rep.2 1 16010158 1
## Cond.Mt.Rep.3 1 13750228 1
The information about sample genotypes and conditions can be added to the $samples
entry with:
dgeFull$samples$condition <- relevel(sampleInfo$condition, ref = "WT")
dgeFull$samples$genotype <- sampleInfo$genotype
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 1 WT 1
## Cond.WT.Rep.2 1 14374675 1 WT 2
## Cond.WT.Rep.3 1 17427146 1 WT 3
## Cond.Mt.Rep.1 1 18060148 1 M 1
## Cond.Mt.Rep.2 1 16010158 1 M 2
## Cond.Mt.Rep.3 1 13750228 1 M 3
Exploratory analysis is generally perormed on \(\log_2\)-transformed counts to avoid problems due to skewness of the count distribution:
pseudoCounts <- log2(dgeFull$counts + 1)
head(pseudoCounts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 8.262095 8.974415 8.873444 8.531381
## Solyc00g005080.1.1 0.000000 0.000000 0.000000 0.000000
## Solyc00g005150.1.1 0.000000 0.000000 0.000000 1.584963
## Solyc00g005840.2.1 8.724514 8.625709 7.714246 8.784635
## Solyc00g005860.1.1 0.000000 1.000000 0.000000 2.000000
## Solyc00g006470.1.1 4.247928 5.523562 6.087463 4.000000
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 8.519636 8.204571
## Solyc00g005080.1.1 1.584963 0.000000
## Solyc00g005150.1.1 0.000000 1.000000
## Solyc00g005840.2.1 8.826548 8.539159
## Solyc00g005860.1.1 0.000000 0.000000
## Solyc00g006470.1.1 4.754888 4.087463
Standard exploratory analyses include:
Cond.WT.Rep.1
)hist(pseudoCounts[ ,"Cond.WT.Rep.1"], main = "", xlab = "counts")
par(mar = c(8,4,1,2))
boxplot(pseudoCounts, col = "gray", las = 3, cex.names = 1)
limma
package)limma::plotMA(pseudoCounts[ ,1:2], xlab = "M", ylab = "A", main = "")
abline(h = 0, col = "red")
limma
package)MDS is similar to PCA when Euclidean distances are used to assess the distances between samples. In limma
, only the 500 top genes (the most variable genes accross samples).
colConditions <- brewer.pal(3, "Set2")
colConditions <- colConditions[match(sampleInfo$condition,
levels(sampleInfo$condition))]
pchGenotypes <- c(8, 15, 16)[match(sampleInfo$genotype,
levels(sampleInfo$genotype))]
plotMDS(pseudoCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2],
legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16),
legend = levels(sampleInfo$genotype))
mixOmics
package)sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Cond.WT.Rep.1 0.0000 139.0220 198.5141 227.78237
## Cond.WT.Rep.2 139.0220 0.0000 145.2869 262.96305
## Cond.WT.Rep.3 198.5141 145.2869 0.0000 310.60653
## Cond.Mt.Rep.1 227.7824 262.9630 310.6065 0.00000
## Cond.Mt.Rep.2 235.9857 267.5978 315.1272 94.13883
## Cond.Mt.Rep.3 211.7681 238.1150 283.7170 111.54555
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Cond.WT.Rep.1 235.98574 211.7681
## Cond.WT.Rep.2 267.59783 238.1150
## Cond.WT.Rep.3 315.12718 283.7170
## Cond.Mt.Rep.1 94.13883 111.5456
## Cond.Mt.Rep.2 0.00000 107.4814
## Cond.Mt.Rep.3 107.48136 0.0000
cimColor <- colorRampPalette(rev(brewer.pal(9, "Reds")))(16)
cim(sampleDists, color = cimColor, symkey = FALSE, row.cex = 0.7, col.cex = 0.7)
dgeFull <- calcNormFactors(dgeFull, method="TMM")
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 3
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
Important: using calcNormFactors
does not change the counts: it just updates the column norm.factors
in $samples
. It is therefore recommanded that you use the same name (dgeFull
) to save the result of this function:
head(dgeFull$counts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Solyc00g006470.1.1 18 45 67 15
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## Solyc00g006470.1.1 26 16
dgeFull$samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 3
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
Normalized counts and pseudo-counts can be extracted from dgeFull
using the function cpm
:
normCounts <- cpm(dgeFull)
pseudoNormCounts <- cpm(dgeFull, log = TRUE, prior.count = 1)
par(mar = c(8,4,1,2))
boxplot(pseudoNormCounts, col = "gray", las = 3, cex.names = 1)
plotMDS(pseudoNormCounts, pch = pchGenotypes, col = colConditions)
legend("topright", lwd = 2, col = brewer.pal(3, "Set2")[1:2],
legend = levels(sampleInfo$condition))
legend("bottomright", pch = c(8, 15, 16),
legend = levels(sampleInfo$genotype))
A further analysis and comparison of the different normalizations provided in the R packages edgeR
and DESeq2
is provided in this document (the design of the dataset used in this practical application is similar to the design of the tomato dataset).
This section will compare the results of different types of approach to obtain genes which are differentially expressed between the wild type tomatoes and the mutants:
a standard NB exact test between two conditions;
a GLM with the plant and genotype effects.
In this first approach, the differences between the two groups (WT and M) is tested using an exact NB test between the two groups. This method is performed by:
creating a DGEList
object using the argument group
and using the same normalization factors than in dgeFull
;
estimating the dispersion for this object with the functions estimateCommonDisp
and estimateTagwiseDisp
;
performing the test with the function exactTest
.
group
in DGEList
A new DGEList
object is created with the argument group
. Normalization factors are updated from that in dgeFull
.
dgeFull.group <- DGEList(rawCountTable, remove.zeros = TRUE,
group = dgeFull$samples$condition)
## Removing 7492 rows with all zero counts
dgeFull.group$samples$norm.factors <- dgeFull$samples$norm.factors
dgeFull.group
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 0.9435083
## Cond.WT.Rep.2 WT 14374675 0.8879054
## Cond.WT.Rep.3 WT 17427146 0.6929932
## Cond.Mt.Rep.1 M 18060148 1.1762817
## Cond.Mt.Rep.2 M 16010158 1.2551958
## Cond.Mt.Rep.3 M 13750228 1.1666371
Common and then tagwise dispersions can be estimated with:
dgeFull.group <- estimateCommonDisp(dgeFull.group)
dgeFull.group <- estimateTagwiseDisp(dgeFull.group)
dgeFull.group
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 0.9435083
## Cond.WT.Rep.2 WT 14374675 0.8879054
## Cond.WT.Rep.3 WT 17427146 0.6929932
## Cond.Mt.Rep.1 M 18060148 1.1762817
## Cond.Mt.Rep.2 M 16010158 1.2551958
## Cond.Mt.Rep.3 M 13750228 1.1666371
##
## $common.dispersion
## [1] 0.0771487
##
## $pseudo.counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 3.118385e+02 6.274668e+02 6.181625e+02 277.174174
## Solyc00g005080.1.1 1.387779e-17 1.387779e-17 1.387779e-17 0.000000
## Solyc00g005150.1.1 1.387779e-17 1.387779e-17 1.387779e-17 1.576323
## Solyc00g005840.2.1 4.299823e+02 4.923710e+02 2.768288e+02 330.472737
## Solyc00g005860.1.1 1.788530e-03 1.182016e+00 2.506739e-02 2.455437
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 290.700664 292.5323378
## Solyc00g005080.1.1 1.698498 0.0000000
## Solyc00g005150.1.1 0.000000 0.9949206
## Solyc00g005840.2.1 359.808607 369.1486991
## Solyc00g005860.1.1 0.000000 0.0000000
## 27178 more rows ...
##
## $pseudo.lib.size
## [1] 15961436
##
## $AveLogCPM
## [1] 4.662164 -2.811589 -2.713178 4.566861 -2.620642
## 27178 more elements ...
##
## $prior.df
## [1] 10
##
## $prior.n
## [1] 2.5
##
## $tagwise.dispersion
## [1] 0.06868585 0.40945758 0.23730247 0.06042544 0.37974655
## 27178 more elements ...
##
## $span
## [1] 0.3
edgeR
also contains a function to estimate dispersions in a more robust way, that can be used if the previous approach seems to fail. This function is estimateGLMRobustDisp
. The quality of the variability estimatino can be assessed with the BCV versus average log CPM plot (that plots \(\phi_g\) versus the average normalized count for all genes):
plotBCV(dgeFull.group)
An exact perform an exact test for the difference in expression between the two conditions “WT” and “M”:
dgeExactTest <- exactTest(dgeFull.group)
dgeExactTest
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Solyc00g005050.2.1 -0.8550485 4.662164 0.00717063
## Solyc00g005080.1.1 2.4605333 -2.811589 0.53194167
## Solyc00g005150.1.1 2.9663412 -2.713178 0.27853427
## Solyc00g005840.2.1 -0.1792636 4.566861 0.54962937
## Solyc00g005860.1.1 0.8423155 -2.620642 1.00000000
## 27178 more rows ...
##
## $comparison
## [1] "WT" "M"
##
## $genes
## NULL
\(p\)-values are corrected with the function topTags
:
resExactTest <- topTags(dgeExactTest, n = nrow(dgeExactTest$table))
head(resExactTest$table)
## logFC logCPM PValue FDR
## Solyc07g053460.2.1 10.21752 8.535498 4.732511e-110 1.286438e-105
## Solyc03g098680.2.1 10.01232 7.704347 2.139525e-100 2.907936e-96
## Solyc01g009590.2.1 10.16384 8.694514 3.141090e-99 2.846141e-95
## Solyc07g015960.1.1 10.05634 7.328319 1.695548e-91 1.152252e-87
## Solyc06g064480.2.1 11.41320 6.803637 2.305164e-91 1.253226e-87
## Solyc01g086830.2.1 10.22500 7.374713 5.409676e-88 2.450854e-84
\(p\)-value and (BH) adjusted \(p\)-value distribution can be assessed with:
par(mfrow = c(1,2))
hist(resExactTest$table$PValue, xlab = "p-value", main = "raw p-values")
hist(resExactTest$table$FDR, xlab = "p-value", main = "adjusted p-values")
And finally, genes with a FDR smaller than 5% and a log Fold Change larger than 1 or smaller than -1 are extracted:
selectedET <- resExactTest$table$FDR < 0.05 & abs(resExactTest$table$logFC) > 1
selectedET <- resExactTest$table[selectedET, ]
nrow(selectedET)
## [1] 7010
head(selectedET)
## logFC logCPM PValue FDR
## Solyc07g053460.2.1 10.21752 8.535498 4.732511e-110 1.286438e-105
## Solyc03g098680.2.1 10.01232 7.704347 2.139525e-100 2.907936e-96
## Solyc01g009590.2.1 10.16384 8.694514 3.141090e-99 2.846141e-95
## Solyc07g015960.1.1 10.05634 7.328319 1.695548e-91 1.152252e-87
## Solyc06g064480.2.1 11.41320 6.803637 2.305164e-91 1.253226e-87
## Solyc01g086830.2.1 10.22500 7.374713 5.409676e-88 2.450854e-84
which shows that 7010 genes are found differential with this method. The column logFC
can be used to found up/down regulated genes in the M:
selectedET$updown <- factor(ifelse(selectedET$logFC > 0, "up", "down"))
head(selectedET)
## logFC logCPM PValue FDR updown
## Solyc07g053460.2.1 10.21752 8.535498 4.732511e-110 1.286438e-105 up
## Solyc03g098680.2.1 10.01232 7.704347 2.139525e-100 2.907936e-96 up
## Solyc01g009590.2.1 10.16384 8.694514 3.141090e-99 2.846141e-95 up
## Solyc07g015960.1.1 10.05634 7.328319 1.695548e-91 1.152252e-87 up
## Solyc06g064480.2.1 11.41320 6.803637 2.305164e-91 1.253226e-87 up
## Solyc01g086830.2.1 10.22500 7.374713 5.409676e-88 2.450854e-84 up
This list can be exported with:
write.table(selectedET, file = "tomatoDEG.csv", sep = ",")
Here, we fit a GLM to account for the genotype effect. The model writes \(K_{gj} \sim \mbox{NB}(\mu_{gj}, \phi_g)\) with \(\mathbb{E}(\log K_{gj}) = \log(s_j) + \log(\lambda_{gj})\) in which \(j\) is the sample number, \(s_j\) is the normalization factor and \(\log(\lambda_{gj})\) is explained by \(\log(\lambda_{gj}) = \lambda_{g0} + \beta_{g,j \textrm{ is M}} + \gamma_{g,j\textrm{ is clone 2}} + \gamma_{g,j\textrm{ is clone +}}\) (WT condition and clone number 1 are reference levels).
The model is first encoded in a design matrix:
design.matrix <- model.matrix(~ dgeFull$samples$condition +
dgeFull$samples$genotype)
design.matrix
## (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1 1 0 0
## 2 1 0 1
## 3 1 0 0
## 4 1 1 0
## 5 1 1 1
## 6 1 1 0
## dgeFull$samples$genotype3
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
Common, trended and then tagwise dispersions can be estimated with:
dgeFull <- estimateDisp(dgeFull, design.matrix)
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 2
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
##
## $design
## (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1 1 0 0
## 2 1 0 1
## 3 1 0 1
## 4 1 1 0
## 5 1 1 1
## 6 1 1 0
## dgeFull$samples$genotype3
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
##
##
## $common.dispersion
## [1] 0.09915901
##
## $trended.dispersion
## [1] 0.08445656 0.24312199 0.24312199 0.08432161 0.24312199
## 27178 more elements ...
##
## $tagwise.dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
##
## $AveLogCPM
## [1] 4.662877 -2.812956 -2.714771 4.566921 -2.622279
## 27178 more elements ...
##
## $trend.method
## [1] "locfit"
##
## $prior.df
## [1] 6.40573
##
## $prior.n
## [1] 3.202865
##
## $span
## [1] 0.2833739
The quality of the variability estimation can be assessed with the BCV versus average log CPM plot (that plots \(\phi_g\) versus the average normalized count for all genes):
plotBCV(dgeFull)
The GLM is fitted with the function glmFit
:
fit <- glmFit(dgeFull, design.matrix)
fit
## An object of class "DGEGLM"
## $coefficients
## (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1 -10.60593 -0.5474377
## Solyc00g005080.1.1 -19.84956 1.5895858
## Solyc00g005150.1.1 -18.22668 2.0788213
## Solyc00g005840.2.1 -10.60049 -0.0988238
## Solyc00g005860.1.1 -16.60440 0.4872586
## dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1 0.3578367 0.35273273
## Solyc00g005080.1.1 2.1070540 0.00265474
## Solyc00g005150.1.1 -2.1074340 -0.34003529
## Solyc00g005840.2.1 0.1125328 -0.14970142
## Solyc00g005860.1.1 -0.7308550 -2.38661673
## 27178 more rows ...
##
## $fitted.values
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 3.878720e+02 4.520405e+02 4.255504e+02 3.042097e+02
## Solyc00g005080.1.1 3.407412e-15 1.135343e-07 1.931898e-15 5.170012e-08
## Solyc00g005150.1.1 1.124398e-07 7.280992e-15 5.739921e-08 2.000000e+00
## Solyc00g005840.2.1 3.899910e+02 3.556151e+02 2.588498e+02 4.791059e+02
## Solyc00g005860.1.1 8.061300e-01 2.800752e-01 9.852187e-08 1.998478e+00
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 4.116214e+02 3.269036e+02
## Solyc00g005080.1.1 2.000000e+00 2.871013e-08
## Solyc00g005150.1.1 1.503615e-07 1.000000e+00
## Solyc00g005840.2.1 5.072162e+02 3.114648e+02
## Solyc00g005860.1.1 8.061300e-01 2.392275e-07
## 27178 more rows ...
##
## $deviance
## [1] 1.711039e+00 2.741461e-07 4.966424e-07 1.289240e+00 4.216594e+00
## 27178 more elements ...
##
## $iter
## [1] 5 17 16 5 15
## 27178 more elements ...
##
## $failed
## [1] FALSE FALSE FALSE FALSE FALSE
## 27178 more elements ...
##
## $method
## [1] "levenberg"
##
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $unshrunk.coefficients
## (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1 -10.60625 -0.54760473
## Solyc00g005080.1.1 -49.87975 16.23037094
## Solyc00g005150.1.1 -32.56778 16.38934864
## Solyc00g005840.2.1 -10.60081 -0.09884859
## Solyc00g005860.1.1 -16.78244 0.60324982
## dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1 0.3579379 0.3528334
## Solyc00g005080.1.1 17.5265040 -0.3073248
## Solyc00g005150.1.1 -16.3478196 -0.4122619
## Solyc00g005840.2.1 0.1125669 -0.1497501
## Solyc00g005860.1.1 -0.8523453 -15.6573515
## 27178 more rows ...
##
## $df.residual
## [1] 2 2 2 2 2
## 27178 more elements ...
##
## $design
## (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1 1 0 0
## 2 1 0 1
## 3 1 0 0
## 4 1 1 0
## 5 1 1 1
## 6 1 1 0
## dgeFull$samples$genotype3
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
##
##
## $offset
## [,1] [,2] [,3] [,4] [,5] [,6]
## x 16.56693 16.36209 16.3068 16.87158 16.81603 16.59069
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
##
## $dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
##
## $prior.count
## [1] 0.125
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 2
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
##
## $prior.df
## [1] 6.40573
##
## $AveLogCPM
## [1] 4.662877 -2.812956 -2.714771 4.566921 -2.622279
## 27178 more elements ...
Then tests can be performed with a log-ratio test (function glmRT
). For instance, to testing differential genes between WT and M, is equivalent to testing the nullity of the second coefficient (see the design matrix) :
dgeLRTtest <- glmLRT(fit, coef = 2)
dgeLRTtest
## An object of class "DGELRT"
## $coefficients
## (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1 -10.60593 -0.5474377
## Solyc00g005080.1.1 -19.84956 1.5895858
## Solyc00g005150.1.1 -18.22668 2.0788213
## Solyc00g005840.2.1 -10.60049 -0.0988238
## Solyc00g005860.1.1 -16.60440 0.4872586
## dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1 0.3578367 0.35273273
## Solyc00g005080.1.1 2.1070540 0.00265474
## Solyc00g005150.1.1 -2.1074340 -0.34003529
## Solyc00g005840.2.1 0.1125328 -0.14970142
## Solyc00g005860.1.1 -0.7308550 -2.38661673
## 27178 more rows ...
##
## $fitted.values
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 3.878720e+02 4.520405e+02 4.255504e+02 3.042097e+02
## Solyc00g005080.1.1 3.407412e-15 1.135343e-07 1.931898e-15 5.170012e-08
## Solyc00g005150.1.1 1.124398e-07 7.280992e-15 5.739921e-08 2.000000e+00
## Solyc00g005840.2.1 3.899910e+02 3.556151e+02 2.588498e+02 4.791059e+02
## Solyc00g005860.1.1 8.061300e-01 2.800752e-01 9.852187e-08 1.998478e+00
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 4.116214e+02 3.269036e+02
## Solyc00g005080.1.1 2.000000e+00 2.871013e-08
## Solyc00g005150.1.1 1.503615e-07 1.000000e+00
## Solyc00g005840.2.1 5.072162e+02 3.114648e+02
## Solyc00g005860.1.1 8.061300e-01 2.392275e-07
## 27178 more rows ...
##
## $deviance
## [1] 1.711039e+00 2.741461e-07 4.966424e-07 1.289240e+00 4.216594e+00
## 27178 more elements ...
##
## $iter
## [1] 5 17 16 5 15
## 27178 more elements ...
##
## $failed
## [1] FALSE FALSE FALSE FALSE FALSE
## 27178 more elements ...
##
## $method
## [1] "levenberg"
##
## $unshrunk.coefficients
## (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1 -10.60625 -0.54760473
## Solyc00g005080.1.1 -49.87975 16.23037094
## Solyc00g005150.1.1 -32.56778 16.38934864
## Solyc00g005840.2.1 -10.60081 -0.09884859
## Solyc00g005860.1.1 -16.78244 0.60324982
## dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1 0.3579379 0.3528334
## Solyc00g005080.1.1 17.5265040 -0.3073248
## Solyc00g005150.1.1 -16.3478196 -0.4122619
## Solyc00g005840.2.1 0.1125669 -0.1497501
## Solyc00g005860.1.1 -0.8523453 -15.6573515
## 27178 more rows ...
##
## $df.residual
## [1] 2 2 2 2 2
## 27178 more elements ...
##
## $design
## (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1 1 0 0
## 2 1 0 1
## 3 1 0 0
## 4 1 1 0
## 5 1 1 1
## 6 1 1 0
## dgeFull$samples$genotype3
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
##
##
## $offset
## [,1] [,2] [,3] [,4] [,5] [,6]
## x 16.56693 16.36209 16.3068 16.87158 16.81603 16.59069
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
##
## $dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
##
## $prior.count
## [1] 0.125
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 2
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
##
## $prior.df
## [1] 6.40573
##
## $AveLogCPM
## [1] 4.662877 -2.812956 -2.714771 4.566921 -2.622279
## 27178 more elements ...
##
## $table
## logFC logCPM LR PValue
## Solyc00g005050.2.1 -0.7897857 4.662877 5.4634125 0.01941869
## Solyc00g005080.1.1 2.2932875 -2.812956 1.7265297 0.18885469
## Solyc00g005150.1.1 2.9991052 -2.714771 2.9663231 0.08501488
## Solyc00g005840.2.1 -0.1425726 4.566921 0.1670542 0.68274322
## Solyc00g005860.1.1 0.7029656 -2.622279 0.2358925 0.62718864
## 27178 more rows ...
##
## $comparison
## [1] "dgeFull$samples$conditionM"
##
## $df.test
## [1] 1 1 1 1 1
## 27178 more elements ...
Testing differential genes between clone number 2 and 3 is equivalent to testing the equality of coefficients 3 and 4:
contrasts <- rep(0, ncol(design.matrix))
contrasts[3] <- 1
contrasts[4] <- -1
dgeLRTtest2 <- glmLRT(fit, contrast = contrasts)
dgeLRTtest2
## An object of class "DGELRT"
## $coefficients
## (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1 -10.60593 -0.5474377
## Solyc00g005080.1.1 -19.84956 1.5895858
## Solyc00g005150.1.1 -18.22668 2.0788213
## Solyc00g005840.2.1 -10.60049 -0.0988238
## Solyc00g005860.1.1 -16.60440 0.4872586
## dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1 0.3578367 0.35273273
## Solyc00g005080.1.1 2.1070540 0.00265474
## Solyc00g005150.1.1 -2.1074340 -0.34003529
## Solyc00g005840.2.1 0.1125328 -0.14970142
## Solyc00g005860.1.1 -0.7308550 -2.38661673
## 27178 more rows ...
##
## $fitted.values
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 3.878720e+02 4.520405e+02 4.255504e+02 3.042097e+02
## Solyc00g005080.1.1 3.407412e-15 1.135343e-07 1.931898e-15 5.170012e-08
## Solyc00g005150.1.1 1.124398e-07 7.280992e-15 5.739921e-08 2.000000e+00
## Solyc00g005840.2.1 3.899910e+02 3.556151e+02 2.588498e+02 4.791059e+02
## Solyc00g005860.1.1 8.061300e-01 2.800752e-01 9.852187e-08 1.998478e+00
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 4.116214e+02 3.269036e+02
## Solyc00g005080.1.1 2.000000e+00 2.871013e-08
## Solyc00g005150.1.1 1.503615e-07 1.000000e+00
## Solyc00g005840.2.1 5.072162e+02 3.114648e+02
## Solyc00g005860.1.1 8.061300e-01 2.392275e-07
## 27178 more rows ...
##
## $deviance
## [1] 1.711039e+00 2.741461e-07 4.966424e-07 1.289240e+00 4.216594e+00
## 27178 more elements ...
##
## $iter
## [1] 5 17 16 5 15
## 27178 more elements ...
##
## $failed
## [1] FALSE FALSE FALSE FALSE FALSE
## 27178 more elements ...
##
## $method
## [1] "levenberg"
##
## $unshrunk.coefficients
## (Intercept) dgeFull$samples$conditionM
## Solyc00g005050.2.1 -10.60625 -0.54760473
## Solyc00g005080.1.1 -49.87975 16.23037094
## Solyc00g005150.1.1 -32.56778 16.38934864
## Solyc00g005840.2.1 -10.60081 -0.09884859
## Solyc00g005860.1.1 -16.78244 0.60324982
## dgeFull$samples$genotype2 dgeFull$samples$genotype3
## Solyc00g005050.2.1 0.3579379 0.3528334
## Solyc00g005080.1.1 17.5265040 -0.3073248
## Solyc00g005150.1.1 -16.3478196 -0.4122619
## Solyc00g005840.2.1 0.1125669 -0.1497501
## Solyc00g005860.1.1 -0.8523453 -15.6573515
## 27178 more rows ...
##
## $df.residual
## [1] 2 2 2 2 2
## 27178 more elements ...
##
## $design
## (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1 1 0 0
## 2 1 0 1
## 3 1 0 0
## 4 1 1 0
## 5 1 1 1
## 6 1 1 0
## dgeFull$samples$genotype3
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
##
##
## $offset
## [,1] [,2] [,3] [,4] [,5] [,6]
## x 16.56693 16.36209 16.3068 16.87158 16.81603 16.59069
## attr(,"class")
## [1] "compressedMatrix"
## attr(,"repeat.row")
## [1] TRUE
## attr(,"repeat.col")
## [1] FALSE
##
## $dispersion
## [1] 0.07708625 0.24312199 0.24312199 0.08344663 0.24312199
## 27178 more elements ...
##
## $prior.count
## [1] 0.125
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 2
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
##
## $prior.df
## [1] 6.40573
##
## $AveLogCPM
## [1] 4.662877 -2.812956 -2.714771 4.566921 -2.622279
## 27178 more elements ...
##
## $table
## logFC logCPM LR PValue
## Solyc00g005050.2.1 0.007363416 4.662877 0.0003273673 0.9855644
## Solyc00g005080.1.1 3.036006369 -2.812956 2.0358241034 0.1536309
## Solyc00g005150.1.1 -2.549817295 -2.714771 1.4898460619 0.2222403
## Solyc00g005840.2.1 0.378324000 4.566921 0.7836039489 0.3760412
## Solyc00g005860.1.1 2.388759289 -2.622279 1.2301155435 0.2673846
## 27178 more rows ...
##
## $comparison
## [1] "1*dgeFull$samples$genotype2 -1*dgeFull$samples$genotype3"
##
## $df.test
## [1] 1 1 1 1 1
## 27178 more elements ...
Finally, DEGs can be extracted as previously, using the function topTags
:
resLRT <- topTags(dgeLRTtest, n = nrow(dgeFull$counts))
head(resLRT$table)
## logFC logCPM LR PValue FDR
## Solyc07g053460.2.1 10.216353 8.535374 345.4052 4.243561e-77 1.153527e-72
## Solyc03g098680.2.1 10.001557 7.704126 335.2817 6.799065e-75 9.240949e-71
## Solyc06g064480.2.1 11.427146 6.803229 324.1399 1.816171e-72 1.645632e-68
## Solyc01g009590.2.1 10.158952 8.694402 305.7543 1.837166e-68 1.248492e-64
## Solyc11g028040.1.1 11.810303 5.864389 303.3936 6.003952e-68 3.264109e-64
## Solyc11g018570.1.1 9.869822 6.583870 296.6953 1.728899e-66 7.832776e-63
selectedLRT <- resLRT$table$FDR < 0.05 & abs(resLRT$table$logFC) > 1
selectedLRT <- resLRT$table[selectedLRT, ]
nrow(selectedLRT)
## [1] 6890
head(selectedLRT)
## logFC logCPM LR PValue FDR
## Solyc07g053460.2.1 10.216353 8.535374 345.4052 4.243561e-77 1.153527e-72
## Solyc03g098680.2.1 10.001557 7.704126 335.2817 6.799065e-75 9.240949e-71
## Solyc06g064480.2.1 11.427146 6.803229 324.1399 1.816171e-72 1.645632e-68
## Solyc01g009590.2.1 10.158952 8.694402 305.7543 1.837166e-68 1.248492e-64
## Solyc11g028040.1.1 11.810303 5.864389 303.3936 6.003952e-68 3.264109e-64
## Solyc11g018570.1.1 9.869822 6.583870 296.6953 1.728899e-66 7.832776e-63
A Venn diagram comparing the two approaches is provided below:
vd <- venn.diagram(x = list("Exact test" = rownames(selectedET),
"GLM" = rownames(selectedLRT)),
fill = brewer.pal(3, "Set2")[1:2], filename = NULL)
grid.draw(vd)
More complex models and a more detailed comparison between the different approches for differential analysis is provided in this document (the same that was previously pointed for normalization comparison) and this document (with the analysis of an interaction effect).
Independant filtering can be performed with the package HTSFilter
after the dispersion has been estimated:
dgeFilt <- HTSFilter(dgeFull)$filteredData
dgeFilt
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g006470.1.1 18 45 67 15
## Solyc00g006490.2.1 533 345 435 618
## Solyc00g006690.2.1 13 35 101 13
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005840.2.1 453 371
## Solyc00g006470.1.1 26 16
## Solyc00g006490.2.1 616 527
## Solyc00g006690.2.1 24 24
## 23699 more rows ...
##
## $samples
## group lib.size norm.factors condition genotype
## Cond.WT.Rep.1 1 16602772 0.9435083 WT 1
## Cond.WT.Rep.2 1 14374675 0.8879054 WT 2
## Cond.WT.Rep.3 1 17427146 0.6929932 WT 2
## Cond.Mt.Rep.1 1 18060148 1.1762817 M 1
## Cond.Mt.Rep.2 1 16010158 1.2551958 M 2
## Cond.Mt.Rep.3 1 13750228 1.1666371 M 3
##
## $design
## (Intercept) dgeFull$samples$conditionM dgeFull$samples$genotype2
## 1 1 0 0
## 2 1 0 1
## 3 1 0 1
## 4 1 1 0
## 5 1 1 1
## 6 1 1 0
## dgeFull$samples$genotype3
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
## attr(,"assign")
## [1] 0 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$`dgeFull$samples$condition`
## [1] "contr.treatment"
##
## attr(,"contrasts")$`dgeFull$samples$genotype`
## [1] "contr.treatment"
##
##
## $common.dispersion
## [1] 0.09915901
##
## $trended.dispersion
## [1] 0.08445656 0.08432161 0.13000266 0.08502943 0.12649632
## 23699 more elements ...
##
## $tagwise.dispersion
## [1] 0.07708625 0.08344663 0.11902698 0.06969235 0.20709806
## 23699 more elements ...
##
## $AveLogCPM
## [1] 4.662877 4.566921 1.174171 4.987809 1.365946
## 23699 more elements ...
##
## $trend.method
## [1] "locfit"
##
## $prior.df
## [1] 6.40573
##
## $prior.n
## [1] 3.202865
##
## $span
## [1] 0.2833739
Then, the differential analysis (GLM approach) is performed:
fit <- glmFit(dgeFilt, design.matrix)
dgeLRTfilt <- glmLRT(fit, coef = 2)
resLRTfilt <- topTags(dgeLRTfilt, n = nrow(dgeFilt$counts))
selectedFilt <- resLRTfilt$table$FDR < 0.05 & abs(resLRTfilt$table$logFC) > 1
selectedFilt <- resLRTfilt$table[selectedFilt, ]
nrow(selectedFilt)
## [1] 6738
head(selectedFilt)
## logFC logCPM LR PValue FDR
## Solyc07g053460.2.1 10.216353 8.535374 345.4052 4.243561e-77 1.005894e-72
## Solyc03g098680.2.1 10.001557 7.704126 335.2817 6.799065e-75 8.058252e-71
## Solyc06g064480.2.1 11.427146 6.803229 324.1399 1.816171e-72 1.435017e-68
## Solyc01g009590.2.1 10.158952 8.694402 305.7543 1.837166e-68 1.088705e-64
## Solyc11g028040.1.1 11.810303 5.864389 303.3936 6.003952e-68 2.846354e-64
## Solyc11g018570.1.1 9.869822 6.583870 296.6953 1.728899e-66 6.830303e-63
A Venn diagram comparing the two approaches is provided below:
vd <- venn.diagram(x = list("No filtering" = rownames(selectedLRT),
"Filtering" = rownames(selectedFilt)),
fill = brewer.pal(3, "Set2")[1:2], filename = NULL)
grid.draw(vd)
To create a MA plot between M and WT, the entry $samples$group
of the DGEList
object must be filled with the indication of what the two groups are.
dgeFilt$samples$group <- dgeFilt$samples$condition
plotSmear(dgeFilt, de.tags = rownames(selectedFilt))
volcanoData <- cbind(resLRTfilt$table$logFC, -log10(resLRTfilt$table$FDR))
colnames(volcanoData) <- c("logFC", "negLogPval")
DEGs <- resLRTfilt$table$FDR < 0.05 & abs(resLRTfilt$table$logFC) > 1
point.col <- ifelse(DEGs, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5)
selY <- cpm(dgeFilt, log = TRUE, prior.count = 1)
selY <- selY[match(rownames(selectedFilt), rownames(dgeFilt$counts)), ]
finalHM <- cim(t(selY), color = cimColor, symkey = FALSE, row.cex = 0.7,
col.cex = 0.7)
If you are interested in the result of the gene clustering, the result of HAC is saved into $ddc
:
plot(finalHM$ddc, leaflab="none")
abline(h=10, lwd=2, col="pink")
Using this dendrogram, we might want to cut the tree at level \(h=10\) (for instance). This can be performed using the function cutree
, which will provide a cluster membership for each gene.
geneClust <- cutree(as.hclust(finalHM$ddc), h=10)
head(geneClust)
## Solyc07g053460.2.1 Solyc03g098680.2.1 Solyc06g064480.2.1
## 1 2 2
## Solyc01g009590.2.1 Solyc11g028040.1.1 Solyc11g018570.1.1
## 1 2 2
For instance, the number of clusters is equal to
length(unique(geneClust))
## [1] 16
and the genes in cluster 1 are:
names(which(geneClust == 1))
## [1] "Solyc07g053460.2.1" "Solyc01g009590.2.1" "Solyc02g032910.1.1"
## [4] "Solyc02g078370.1.1" "Solyc05g050700.1.1"
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] HTSFilter_1.16.0 VennDiagram_1.6.17 futile.logger_1.4.3
## [4] mixOmics_6.1.2 ggplot2_2.2.1 lattice_0.20-35
## [7] MASS_7.3-47 RColorBrewer_1.1-2 edgeR_3.18.0
## [10] limma_3.32.2
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.36.2 tidyr_0.6.1
## [3] jsonlite_1.4 splines_3.4.0
## [5] ellipse_0.3-8 Formula_1.2-1
## [7] shiny_1.0.3 assertthat_0.2.0
## [9] stats4_3.4.0 latticeExtra_0.6-28
## [11] GenomeInfoDbData_0.99.0 yaml_2.1.14
## [13] RSQLite_1.1-2 backports_1.0.5
## [15] digest_0.6.12 checkmate_1.8.2
## [17] GenomicRanges_1.28.2 XVector_0.16.0
## [19] colorspace_1.3-2 htmltools_0.3.6
## [21] httpuv_1.3.3 Matrix_1.2-10
## [23] plyr_1.8.4 DESeq2_1.16.1
## [25] XML_3.98-1.7 genefilter_1.58.1
## [27] zlibbioc_1.22.0 xtable_1.8-2
## [29] corpcor_1.6.9 scales_0.4.1
## [31] BiocParallel_1.10.1 htmlTable_1.9
## [33] tibble_1.3.0 annotate_1.54.0
## [35] IRanges_2.10.1 SummarizedExperiment_1.6.1
## [37] nnet_7.3-12 BiocGenerics_0.22.0
## [39] lazyeval_0.2.0 survival_2.41-3
## [41] magrittr_1.5 mime_0.5
## [43] memoise_1.1.0 evaluate_0.10
## [45] foreign_0.8-67 data.table_1.10.4
## [47] tools_3.4.0 matrixStats_0.52.2
## [49] stringr_1.2.0 S4Vectors_0.14.1
## [51] munsell_0.4.3 locfit_1.5-9.1
## [53] cluster_2.0.6 DelayedArray_0.2.2
## [55] AnnotationDbi_1.38.0 lambda.r_1.1.9
## [57] compiler_3.4.0 DESeq_1.28.0
## [59] GenomeInfoDb_1.12.0 RCurl_1.95-4.8
## [61] htmlwidgets_0.8 igraph_1.0.1
## [63] base64enc_0.1-3 bitops_1.0-6
## [65] rmarkdown_1.5 gtable_0.2.0
## [67] DBI_0.6-1 reshape2_1.4.2
## [69] R6_2.2.0 gridExtra_2.2.1
## [71] knitr_1.15.1 dplyr_0.5.0
## [73] Hmisc_4.0-3 rprojroot_1.2
## [75] futile.options_1.0.0 stringi_1.1.5
## [77] parallel_3.4.0 Rcpp_0.12.10
## [79] rpart_4.1-11 acepack_1.4.1
## [81] geneplotter_1.54.0 rgl_0.98.1