Where to start?
4.0. Starting from count table
- Preparing the files (count table and design)
- Importing the files
4.1. Starting from count table
4.4 Data exploration and quality assessment
4.5 Differential expression analysis
4.6 Independant filtering
4.7 Diagnostic plot for multiple testing
4.8 Inspecting the results
4.9 Interpreting the DE analysis results

This file analyses the data of the count table included in countData.txt. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. The analysis is organized as the document “Practical statistical analysis of RNA-Seq data” which is itself based on other data (the data pasilla included in the R package with the same name). The section numbers are taken from this document to ease the parallel between the present analysis and the example processed in the tutorial.

Where to start?

Start RStudio and create a new script using the menu “File/New File/R script” and save it on your computer or download the file http://www.nathalievialaneix.eu/doc/R/solution_edgeR-tomato.R if you feel unconfortable with R.

A good practice for the rest of the exercise is then to use RStudio menu “Session / Set working directory / To source file location” and to download all the files within the directory in which your script has been saved. The command lines that you write in your RScript file can be executed using the button “Run” or the shortcut Ctrl+Enter.

The analysis has been performed using R version 3.2.2 and the R packages edgeR (version 3.10.2), limma (version 3.24.15), mixOmics (version 5.1.2), RColorBrewer (version 1.0-5) and HTSFilter (version 1.8.0). These packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievialaneix.eu/teaching/rnaseq.html to install it by yourself (HTSFilter is probably not installed; you can not use the standard command install.packages to install it because it is a Bioconductor package: check the webpage!).

## Do Not Run
# how to install required packages?
# install.packages(c("RColorBrewer","mixOmics"))
# source("http://bioconductor.org/biocLite.R")
# biocLite("edgeR")
## End DNR
library(edgeR)

## Loading required package: limma

library(limma)
library(RColorBrewer)
library(mixOmics)

## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2

library(HTSFilter)

## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:limma':
## 
##     plotMA
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'

4.0. Starting from count table

Preparing the files (count table and design)

First download the data set at http://www.nathalievialaneix.eu/doc/txt/countData.txt. Save it in a directory on your computer and set the working directory of R so that it fits this directory (in RStudio, you can use the menu “Session / Set working directory”).

The following command lines can be used to list the files included in the working directory:

directory <- getwd()
dir(directory)

among which should be countData.txt.

If you open the file countData.txt with a simple text reader (wordpad for instance), you see that:

the first column contains the gene names;
the next three columns contain the WT samples;
the last three columns contain the Mutant samples;
columns are separated by tabs.

Use a spreadsheet (e.g., Excel) and create a document which contains a first column called “file” with the sample names and a second column called “condition” with the condition (WT or M)

files	condition
Cond.WT.Rep.1	WT
Cond.WT.Rep.2	WT
…	…
Cond.Mt.Rep.1	M
…	…

Save this file under the name design.csv (csv format) inside your working directory. In my case, this file is separated by commas, as in the following picture:

design file

Importing the files

Exercise 3.1 Using the function read.table, import the two files inside the R environment (properly set the options sep, header and row.names):

rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1)
sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1)

Exercise 3.2 Have a look at the count data:

head(rawCountTable)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1             0             0             0             0
## Solyc00g005020.1.1             0             0             0             0
## Solyc00g005040.2.1             0             0             0             0
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005060.1.1             0             0             0             0
## Solyc00g005070.1.1             0             0             0             0
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1             0             0
## Solyc00g005020.1.1             0             0
## Solyc00g005040.2.1             0             0
## Solyc00g005050.2.1           366           294
## Solyc00g005060.1.1             0             0
## Solyc00g005070.1.1             0             0

nrow(rawCountTable)

## [1] 34675

34675 genes are included in this file.

Warning: Exercises 3.3 and 3.4 can be skipped as explained below…

Exercise 3.3 Have a look at the sample information and order the count table in the same way that the sample information: as we created the file design.csv ourselves, this question is already solved.

Exercise 3.4 Create the ‘condition’ column: as we created the file design.csv ourselves, this question is already solved.

4.1. Starting from count table

Exercise 4.1 Create a DGEList data object

dgeFull <- DGEList(rawCountTable, group=sampleInfo$condition)
dgeFull

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1             0             0             0             0
## Solyc00g005020.1.1             0             0             0             0
## Solyc00g005040.2.1             0             0             0             0
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005060.1.1             0             0             0             0
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1             0             0
## Solyc00g005020.1.1             0             0
## Solyc00g005040.2.1             0             0
## Solyc00g005050.2.1           366           294
## Solyc00g005060.1.1             0             0
## 34670 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772            1
## Cond.WT.Rep.2    WT 14374675            1
## Cond.WT.Rep.3    WT 17427146            1
## Cond.Mt.Rep.1     M 18060148            1
## Cond.Mt.Rep.2     M 16010158            1
## Cond.Mt.Rep.3     M 13750228            1

Exercise 4.2 This exercise can be skipped. It is useful when you want to add additional information on the samples in the DGEList object.

4.4 Data exploration and quality assessment

Exercise 4.7 Extract pseudo-counts (ie $\log_2(K+1)$)

pseudoCounts <- log2(dgeFull$counts+1)
head(pseudoCounts)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1      0.000000      0.000000      0.000000      0.000000
## Solyc00g005020.1.1      0.000000      0.000000      0.000000      0.000000
## Solyc00g005040.2.1      0.000000      0.000000      0.000000      0.000000
## Solyc00g005050.2.1      8.262095      8.974415      8.873444      8.531381
## Solyc00g005060.1.1      0.000000      0.000000      0.000000      0.000000
## Solyc00g005070.1.1      0.000000      0.000000      0.000000      0.000000
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1      0.000000      0.000000
## Solyc00g005020.1.1      0.000000      0.000000
## Solyc00g005040.2.1      0.000000      0.000000
## Solyc00g005050.2.1      8.519636      8.204571
## Solyc00g005060.1.1      0.000000      0.000000
## Solyc00g005070.1.1      0.000000      0.000000

Exercise 4.8 Histogram for pseudo-counts (sample Cond.WT.Rep.1)

hist(pseudoCounts[,"Cond.WT.Rep.1"])

Exercise 4.9 Boxplot for pseudo-counts

boxplot(pseudoCounts, col="gray", las=3)

Exercise 4.10 MA-plots between WT or Mt samples

par(mfrow=c(1,2))
## WT1 vs WT2
# A values
avalues <- (pseudoCounts[,1] + pseudoCounts[,2])/2
# M values
mvalues <- (pseudoCounts[,1] - pseudoCounts[,2])
plot(avalues, mvalues, xlab="A", ylab="M", pch=19, main="treated")
abline(h=0, col="red")
## Mt1 vs Mt2
# A values
avalues <- (pseudoCounts[,4] + pseudoCounts[,5])/2
# M values
mvalues <- (pseudoCounts[,4] - pseudoCounts[,5])
plot(avalues, mvalues, xlab="A", ylab="M", pch=19, main="control")
abline(h=0, col="red")

Exercise 4.11 MDS for pseudo-counts (using limma package)

plotMDS(pseudoCounts)

Exercise 4.12 heatmap for pseudo-counts (using mixOmics package)

sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists

##               Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Cond.WT.Rep.1        0.0000      139.0220      198.5141     227.78237
## Cond.WT.Rep.2      139.0220        0.0000      145.2869     262.96305
## Cond.WT.Rep.3      198.5141      145.2869        0.0000     310.60653
## Cond.Mt.Rep.1      227.7824      262.9630      310.6065       0.00000
## Cond.Mt.Rep.2      235.9857      267.5978      315.1272      94.13883
## Cond.Mt.Rep.3      211.7681      238.1150      283.7170     111.54555
##               Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Cond.WT.Rep.1     235.98574      211.7681
## Cond.WT.Rep.2     267.59783      238.1150
## Cond.WT.Rep.3     315.12718      283.7170
## Cond.Mt.Rep.1      94.13883      111.5456
## Cond.Mt.Rep.2       0.00000      107.4814
## Cond.Mt.Rep.3     107.48136        0.0000

cimColor <- colorRampPalette(rev(brewer.pal(9, "Blues")))(16)
cim(sampleDists, color=cimColor, symkey=FALSE)

4.5 Differential expression analysis

Exercise 4.13 remove genes with zero counts for all samples

dgeFull <- DGEList(dgeFull$counts[apply(dgeFull$counts, 1, sum) != 0, ],
                   group=dgeFull$samples$group)
head(dgeFull$counts)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
## Solyc00g006470.1.1            18            45            67            15
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## Solyc00g006470.1.1            26            16

Exercise 4.14 estimate the normalization factors

dgeFull <- calcNormFactors(dgeFull, method="TMM")
dgeFull$samples

##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772    0.9435083
## Cond.WT.Rep.2    WT 14374675    0.8879054
## Cond.WT.Rep.3    WT 17427146    0.6929932
## Cond.Mt.Rep.1     M 18060148    1.1762817
## Cond.Mt.Rep.2     M 16010158    1.2551958
## Cond.Mt.Rep.3     M 13750228    1.1666371

Important: using calcNormFactors does not change the counts: it just updates the column norm.factors in $samples. It is therefore recommanded that you use the same name (dgeFull) to save the result of this function:

head(dgeFull$counts)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
## Solyc00g006470.1.1            18            45            67            15
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## Solyc00g006470.1.1            26            16

From the normalization factors and the original count table, find the normalized counts and use the $\log_2$-transformation to inspect them with boxplots and a MDS. Normalized counts can be extracted from dgeFull using the function cpm:

eff.lib.size <- dgeFull$samples$lib.size*dgeFull$samples$norm.factors
normCounts <- cpm(dgeFull)
pseudoNormCounts <- log2(normCounts + 1)
boxplot(pseudoNormCounts, col="gray", las=3)

plotMDS(pseudoNormCounts)

Exercise 4.15 estimate common and tagwise dispersion

dgeFull <- estimateCommonDisp(dgeFull)
dgeFull <- estimateTagwiseDisp(dgeFull)
dgeFull

## An object of class "DGEList"
## $counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1           306           502           468           369
## Solyc00g005080.1.1             0             0             0             0
## Solyc00g005150.1.1             0             0             0             2
## Solyc00g005840.2.1           422           394           209           440
## Solyc00g005860.1.1             0             1             0             3
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1           366           294
## Solyc00g005080.1.1             2             0
## Solyc00g005150.1.1             0             1
## Solyc00g005840.2.1           453           371
## Solyc00g005860.1.1             0             0
## 27178 more rows ...
## 
## $samples
##               group lib.size norm.factors
## Cond.WT.Rep.1    WT 16602772    0.9435083
## Cond.WT.Rep.2    WT 14374675    0.8879054
## Cond.WT.Rep.3    WT 17427146    0.6929932
## Cond.Mt.Rep.1     M 18060148    1.1762817
## Cond.Mt.Rep.2     M 16010158    1.2551958
## Cond.Mt.Rep.3     M 13750228    1.1666371
## 
## $common.dispersion
## [1] 0.0771487
## 
## $pseudo.counts
##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1  3.118402e+02    627.453178  618.14661023    277.172158
## Solyc00g005080.1.1  0.000000e+00      0.000000    0.00000000      0.000000
## Solyc00g005150.1.1  0.000000e+00      0.000000    0.00000000      1.576625
## Solyc00g005840.2.1  4.299819e+02    492.356242  276.85159629    330.468840
## Solyc00g005860.1.1  1.790464e-03      1.181911    0.02508957      2.456094
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1    290.701358   292.5323622
## Solyc00g005080.1.1      1.698763     0.0000000
## Solyc00g005150.1.1      0.000000     0.9949215
## Solyc00g005840.2.1    359.809587   369.1487554
## Solyc00g005860.1.1      0.000000     0.0000000
## 27178 more rows ...
## 
## $pseudo.lib.size
## [1] 15961436
## 
## $AveLogCPM
## [1]  4.662164 -2.811589 -2.713178  4.566861 -2.620642
## 27178 more elements ...
## 
## $prior.n
## [1] 2.5
## 
## $tagwise.dispersion
## [1] 0.06867039 0.40944624 0.23726001 0.06040654 0.37971489
## 27178 more elements ...

Exercise 4.16 perform an exact test for the difference in expression between the two conditions “WT” and “Mt”

dgeTest <- exactTest(dgeFull)
dgeTest

## An object of class "DGEExact"
## $table
##                         logFC    logCPM     PValue
## Solyc00g005050.2.1  0.8550483  4.662164 0.00716433
## Solyc00g005080.1.1 -2.4605336 -2.811589 0.53194084
## Solyc00g005150.1.1 -2.9663414 -2.713178 0.27852936
## Solyc00g005840.2.1  0.1792639  4.566861 0.54956898
## Solyc00g005860.1.1 -0.8423184 -2.620642 1.00000000
## 27178 more rows ...
## 
## $comparison
## [1] "M"  "WT"
## 
## $genes
## NULL

4.6 Independant filtering

Exercise 4.17 remove low expressed genes

filtData <- HTSFilter(dgeFull)$filteredData

dgeTestFilt <- exactTest(filtData)
dgeTestFilt

## An object of class "DGEExact"
## $table
##                         logFC   logCPM      PValue
## Solyc00g005050.2.1 0.85504826 4.662164 0.007164330
## Solyc00g005840.2.1 0.17926394 4.566861 0.549568980
## Solyc00g006470.1.1 1.75460476 1.164895 0.001333356
## Solyc00g006490.2.1 0.06847886 4.987762 0.805760711
## Solyc00g006690.2.1 1.84727375 1.355638 0.006188601
## 22936 more rows ...
## 
## $comparison
## [1] "M"  "WT"
## 
## $genes
## NULL

4.7 Diagnostic plot for multiple testing

Exercise 4.18 plot an histogram of unadjusted p-values

hist(dgeTest$table[,"PValue"], breaks=50)

Exercise 4.19 plot an histogram of unadjusted p-values after filtering

hist(dgeTestFilt$table[,"PValue"], breaks=50)

4.8 Inspecting the results

Exercise 4.20 extract a summary of the differential expression statistics

resNoFilt <- topTags(dgeTest, n=nrow(dgeTest$table))
head(resNoFilt)

## Comparison of groups:  WT-M 
##                        logFC   logCPM        PValue           FDR
## Solyc07g053460.2.1 -10.21752 8.535498 4.655152e-110 1.265410e-105
## Solyc03g098680.2.1 -10.01232 7.704347 2.109079e-100  2.866555e-96
## Solyc01g009590.2.1 -10.16384 8.694514  3.008506e-99  2.726007e-95
## Solyc07g015960.1.1 -10.05634 7.328319  1.662644e-91  1.129892e-87
## Solyc06g064480.2.1 -11.41320 6.803637  2.269855e-91  1.234029e-87
## Solyc01g086830.2.1 -10.22500 7.374713  5.285082e-88  2.394406e-84

resFilt <- topTags(dgeTestFilt, n=nrow(dgeTest$table))
head(resFilt)

## Comparison of groups:  WT-M 
##                        logFC   logCPM        PValue           FDR
## Solyc07g053460.2.1 -10.21752 8.535498 4.655152e-110 1.067938e-105
## Solyc03g098680.2.1 -10.01232 7.704347 2.109079e-100  2.419219e-96
## Solyc01g009590.2.1 -10.16384 8.694514  3.008506e-99  2.300604e-95
## Solyc07g015960.1.1 -10.05634 7.328319  1.662644e-91  9.535681e-88
## Solyc06g064480.2.1 -11.41320 6.803637  2.269855e-91  1.041455e-87
## Solyc01g086830.2.1 -10.22500 7.374713  5.285082e-88  2.020751e-84

Exercise 4.21 compare the number of differentially expressed genes with and without filtering (risk: 1%)

# before independent filtering
sum(resNoFilt$table$FDR < 0.01)

## [1] 6077

# after independent filtering
sum(resFilt$table$FDR < 0.01)

## [1] 6184

Exercise 4.22 extract and sort differentially expressed genes

sigDownReg <- resFilt$table[resFilt$table$FDR<0.01,]
sigDownReg <- sigDownReg[order(sigDownReg$logFC),]
head(sigDownReg)

##                        logFC   logCPM       PValue          FDR
## Solyc08g060930.2.1 -12.51991 4.508122 9.517060e-70 7.797532e-67
## Solyc07g042880.1.1 -12.50830 4.496547 1.914258e-70 1.626481e-67
## Solyc01g099570.2.1 -12.26036 4.251771 2.673194e-66 1.657453e-63
## Solyc11g028040.1.1 -11.80636 5.865162 1.143701e-82 2.385239e-79
## Solyc12g010560.1.1 -11.67222 3.672741 3.794059e-59 1.851905e-56
## Solyc07g065770.2.1 -11.42245 3.428110 2.240386e-50 8.030734e-48

sigUpReg <- sigDownReg[order(sigDownReg$logFC, decreasing=TRUE),]
head(sigUpReg)

##                       logFC     logCPM       PValue          FDR
## Solyc09g089500.2.1 9.202703 5.85908053 1.449894e-67 1.007940e-64
## Solyc02g089710.2.1 8.756017 0.75399745 3.849181e-22 2.830258e-20
## Solyc03g013340.2.1 8.026917 0.06652057 8.906932e-15 3.388622e-13
## Solyc03g078580.2.1 7.999501 0.06761050 3.418908e-09 6.161286e-08
## Solyc02g078080.1.1 7.719094 1.39812183 2.655482e-22 1.997358e-20
## Solyc03g098760.1.1 7.443860 8.09992520 1.856150e-82 3.548495e-79

Exercise 4.24 write the results in csv files

write.csv(sigDownReg, file="sigDownReg_tomato.csv")
write.csv(sigUpReg, file="sigUpReg_tomato.csv")

4.9 Interpreting the DE analysis results

Exercise 4.25 create a MA plot with 1% differentially expressed genes

plotSmear(dgeTestFilt,
          de.tags = rownames(resFilt$table)[which(resFilt$table$FDR<0.01)])

Exercise 4.26 create a Volcano plot

volcanoData <- cbind(resFilt$table$logFC, -log10(resFilt$table$FDR))
colnames(volcanoData) <- c("logFC", "negLogPval")
head(volcanoData)

##          logFC negLogPval
## [1,] -10.21752  104.97145
## [2,] -10.01232   95.61632
## [3,] -10.16384   94.63816
## [4,] -10.05634   87.02065
## [5,] -11.41320   86.98236
## [6,] -10.22500   83.69449

plot(volcanoData, pch=19)

Exercise 4.27 transform the normalized counts in log-counts-per-million

y <- cpm(dgeFull, log=TRUE, prior.count = 1)
head(y)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1     4.2924482      5.299855      5.278467      4.123588
## Solyc00g005080.1.1    -4.0280732     -4.028073     -4.028073     -4.028073
## Solyc00g005150.1.1    -4.0280732     -4.028073     -4.028073     -2.685565
## Solyc00g005840.2.1     4.7549185      4.950978      4.118282      4.376651
## Solyc00g005860.1.1    -4.0280732     -2.840165     -4.028073     -2.303913
## Solyc00g006470.1.1     0.2754417      1.842784      2.487764     -0.381981
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1     4.1917189    4.20074963
## Solyc00g005080.1.1    -2.6364953   -4.02807316
## Solyc00g005150.1.1    -4.0280732   -3.01585514
## Solyc00g005840.2.1     4.4984566    4.53535423
## Solyc00g005860.1.1    -4.0280732   -4.02807316
## Solyc00g006470.1.1     0.4383916    0.08230305

Exercise 4.28 select 1% differentially expressed genes and produce a heatmap

selY <- y[rownames(resFilt$table)[resFilt$table$FDR<0.01 & 
                                    abs(resFilt$table$logFC)>1.5],]
head(selY)

##                    Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc07g053460.2.1    -0.3892916     -0.713732   -0.64179564      9.435273
## Solyc03g098680.2.1    -1.1703336     -1.142283   -1.07306157      8.584728
## Solyc01g009590.2.1    -0.1662132     -1.142283   -0.04078523      9.464694
## Solyc07g015960.1.1    -0.9766573     -2.197638   -1.69103905      8.120533
## Solyc06g064480.2.1    -2.9984687     -2.197638   -4.02807316      7.573417
## Solyc01g086830.2.1    -1.1703336     -1.416220   -2.79487461      8.209483
##                    Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc07g053460.2.1      9.706334      9.441661
## Solyc03g098680.2.1      8.921504      8.569667
## Solyc01g009590.2.1      9.996001      9.559335
## Solyc07g015960.1.1      8.617938      8.183916
## Solyc06g064480.2.1      8.093823      7.679248
## Solyc01g086830.2.1      8.728357      8.094945

cimColor <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)[255:1]
finalHM <- cim(t(selY), color=cimColor, symkey=FALSE)

If you are interesting in the result of the gene clustering, it can be obtained from the previous command. More precisely, the result of HAC is stored into $ddc which can be plotted as shown below:

plot(finalHM$ddc, leaflab="none")
abline(h=10, lwd=2, col="pink")

Using this dendrogram, we might want to cut the tree at level $h=10$ (for instance), which can be performed using the function cutree, which will provide a cluster membership for each gene.

geneClust <- cutree(as.hclust(finalHM$ddc), h=10)
head(geneClust)

## Solyc07g053460.2.1 Solyc03g098680.2.1 Solyc01g009590.2.1 
##                  1                  2                  1 
## Solyc07g015960.1.1 Solyc06g064480.2.1 Solyc01g086830.2.1 
##                  2                  2                  2

For instance, the number of clusters is equal to

length(unique(geneClust))

## [1] 17

and the genes in cluster 1 are:

names(which(geneClust==1))

## [1] "Solyc07g053460.2.1" "Solyc01g009590.2.1" "Solyc02g078370.1.1"
## [4] "Solyc02g032910.1.1" "Solyc05g050700.1.1"

Practical statistical analysis of RNA-Seq data - edgeR - tomato data

Annick Moisan, Ignacio Gonzales, Nathalie Villa-Vialaneix

March, 10th 2015; updated September 6th