This file analyses the data of the count table included in countData.txt
. These data are RNA sequencing of tomatoes, with two conditions (wild type and mutant) and 3 replicates by condition. The analysis is organized as the document “Practical statistical analysis of RNA-Seq data” which is itself based on other data (the data pasilla
included in the R package with the same name). The section numbers are taken from this document to ease the parallel between the present analysis and the example processed in the tutorial.
Start RStudio and create a new script using the menu “File/New File/R script” and save it on your computer or download the file http://www.nathalievialaneix.eu/doc/R/solution_edgeR-tomato.R if you feel unconfortable with R.
A good practice for the rest of the exercise is then to use RStudio menu “Session / Set working directory / To source file location” and to download all the files within the directory in which your script has been saved. The command lines that you write in your RScript file can be executed using the button “Run” or the shortcut Ctrl+Enter
.
The analysis has been performed using R version 3.2.2 and the R packages edgeR (version 3.10.2), limma (version 3.24.15), mixOmics (version 5.1.2), RColorBrewer (version 1.0-5) and HTSFilter (version 1.8.0). These packages must be loaded prior the analysis. If one of them is not already installed, you can refer to the webpage http://www.nathalievialaneix.eu/teaching/rnaseq.html to install it by yourself (HTSFilter
is probably not installed; you can not use the standard command install.packages
to install it because it is a Bioconductor package: check the webpage!).
## Do Not Run
# how to install required packages?
# install.packages(c("RColorBrewer","mixOmics"))
# source("http://bioconductor.org/biocLite.R")
# biocLite("edgeR")
## End DNR
library(edgeR)
## Loading required package: limma
library(limma)
library(RColorBrewer)
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
library(HTSFilter)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:limma':
##
## plotMA
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
First download the data set at http://www.nathalievialaneix.eu/doc/txt/countData.txt. Save it in a directory on your computer and set the working directory of R so that it fits this directory (in RStudio, you can use the menu “Session / Set working directory”).
The following command lines can be used to list the files included in the working directory:
directory <- getwd()
dir(directory)
among which should be countData.txt
.
If you open the file countData.txt
with a simple text reader (wordpad for instance), you see that:
the first column contains the gene names;
the next three columns contain the WT samples;
the last three columns contain the Mutant samples;
columns are separated by tabs.
Use a spreadsheet (e.g., Excel) and create a document which contains a first column called “file” with the sample names and a second column called “condition” with the condition (WT or M)
files | condition |
---|---|
Cond.WT.Rep.1 | WT |
Cond.WT.Rep.2 | WT |
… | … |
Cond.Mt.Rep.1 | M |
… | … |
Save this file under the name design.csv
(csv format) inside your working directory. In my case, this file is separated by commas, as in the following picture:
Exercise 3.1 Using the function read.table
, import the two files inside the R environment (properly set the options sep
, header
and row.names
):
rawCountTable <- read.table("countData.txt", header=TRUE, sep="\t", row.names=1)
sampleInfo <- read.table("design.csv", header=TRUE, sep=",", row.names=1)
Exercise 3.2 Have a look at the count data:
head(rawCountTable)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1 0 0 0 0
## Solyc00g005020.1.1 0 0 0 0
## Solyc00g005040.2.1 0 0 0 0
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005060.1.1 0 0 0 0
## Solyc00g005070.1.1 0 0 0 0
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1 0 0
## Solyc00g005020.1.1 0 0
## Solyc00g005040.2.1 0 0
## Solyc00g005050.2.1 366 294
## Solyc00g005060.1.1 0 0
## Solyc00g005070.1.1 0 0
nrow(rawCountTable)
## [1] 34675
34675 genes are included in this file.
Warning: Exercises 3.3 and 3.4 can be skipped as explained below…
Exercise 3.3 Have a look at the sample information and order the count table in the same way that the sample information: as we created the file design.csv
ourselves, this question is already solved.
Exercise 3.4 Create the ‘condition’ column: as we created the file design.csv
ourselves, this question is already solved.
Exercise 4.1 Create a DGEList data object
dgeFull <- DGEList(rawCountTable, group=sampleInfo$condition)
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1 0 0 0 0
## Solyc00g005020.1.1 0 0 0 0
## Solyc00g005040.2.1 0 0 0 0
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005060.1.1 0 0 0 0
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1 0 0
## Solyc00g005020.1.1 0 0
## Solyc00g005040.2.1 0 0
## Solyc00g005050.2.1 366 294
## Solyc00g005060.1.1 0 0
## 34670 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 1
## Cond.WT.Rep.2 WT 14374675 1
## Cond.WT.Rep.3 WT 17427146 1
## Cond.Mt.Rep.1 M 18060148 1
## Cond.Mt.Rep.2 M 16010158 1
## Cond.Mt.Rep.3 M 13750228 1
Exercise 4.2 This exercise can be skipped. It is useful when you want to add additional information on the samples in the DGEList object.
Exercise 4.7 Extract pseudo-counts (ie \(\log_2(K+1)\))
pseudoCounts <- log2(dgeFull$counts+1)
head(pseudoCounts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005000.2.1 0.000000 0.000000 0.000000 0.000000
## Solyc00g005020.1.1 0.000000 0.000000 0.000000 0.000000
## Solyc00g005040.2.1 0.000000 0.000000 0.000000 0.000000
## Solyc00g005050.2.1 8.262095 8.974415 8.873444 8.531381
## Solyc00g005060.1.1 0.000000 0.000000 0.000000 0.000000
## Solyc00g005070.1.1 0.000000 0.000000 0.000000 0.000000
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005000.2.1 0.000000 0.000000
## Solyc00g005020.1.1 0.000000 0.000000
## Solyc00g005040.2.1 0.000000 0.000000
## Solyc00g005050.2.1 8.519636 8.204571
## Solyc00g005060.1.1 0.000000 0.000000
## Solyc00g005070.1.1 0.000000 0.000000
Exercise 4.8 Histogram for pseudo-counts (sample Cond.WT.Rep.1
)
hist(pseudoCounts[,"Cond.WT.Rep.1"])
Exercise 4.9 Boxplot for pseudo-counts
boxplot(pseudoCounts, col="gray", las=3)
Exercise 4.10 MA-plots between WT or Mt samples
par(mfrow=c(1,2))
## WT1 vs WT2
# A values
avalues <- (pseudoCounts[,1] + pseudoCounts[,2])/2
# M values
mvalues <- (pseudoCounts[,1] - pseudoCounts[,2])
plot(avalues, mvalues, xlab="A", ylab="M", pch=19, main="treated")
abline(h=0, col="red")
## Mt1 vs Mt2
# A values
avalues <- (pseudoCounts[,4] + pseudoCounts[,5])/2
# M values
mvalues <- (pseudoCounts[,4] - pseudoCounts[,5])
plot(avalues, mvalues, xlab="A", ylab="M", pch=19, main="control")
abline(h=0, col="red")
Exercise 4.11 MDS for pseudo-counts (using limma
package)
plotMDS(pseudoCounts)
Exercise 4.12 heatmap for pseudo-counts (using mixOmics
package)
sampleDists <- as.matrix(dist(t(pseudoCounts)))
sampleDists
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Cond.WT.Rep.1 0.0000 139.0220 198.5141 227.78237
## Cond.WT.Rep.2 139.0220 0.0000 145.2869 262.96305
## Cond.WT.Rep.3 198.5141 145.2869 0.0000 310.60653
## Cond.Mt.Rep.1 227.7824 262.9630 310.6065 0.00000
## Cond.Mt.Rep.2 235.9857 267.5978 315.1272 94.13883
## Cond.Mt.Rep.3 211.7681 238.1150 283.7170 111.54555
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Cond.WT.Rep.1 235.98574 211.7681
## Cond.WT.Rep.2 267.59783 238.1150
## Cond.WT.Rep.3 315.12718 283.7170
## Cond.Mt.Rep.1 94.13883 111.5456
## Cond.Mt.Rep.2 0.00000 107.4814
## Cond.Mt.Rep.3 107.48136 0.0000
cimColor <- colorRampPalette(rev(brewer.pal(9, "Blues")))(16)
cim(sampleDists, color=cimColor, symkey=FALSE)
Exercise 4.13 remove genes with zero counts for all samples
dgeFull <- DGEList(dgeFull$counts[apply(dgeFull$counts, 1, sum) != 0, ],
group=dgeFull$samples$group)
head(dgeFull$counts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Solyc00g006470.1.1 18 45 67 15
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## Solyc00g006470.1.1 26 16
Exercise 4.14 estimate the normalization factors
dgeFull <- calcNormFactors(dgeFull, method="TMM")
dgeFull$samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 0.9435083
## Cond.WT.Rep.2 WT 14374675 0.8879054
## Cond.WT.Rep.3 WT 17427146 0.6929932
## Cond.Mt.Rep.1 M 18060148 1.1762817
## Cond.Mt.Rep.2 M 16010158 1.2551958
## Cond.Mt.Rep.3 M 13750228 1.1666371
Important: using calcNormFactors
does not change the counts: it just updates the column norm.factors
in $samples
. It is therefore recommanded that you use the same name (dgeFull
) to save the result of this function:
head(dgeFull$counts)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Solyc00g006470.1.1 18 45 67 15
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## Solyc00g006470.1.1 26 16
From the normalization factors and the original count table, find the normalized counts and use the \(\log_2\)-transformation to inspect them with boxplots and a MDS. Normalized counts can be extracted from dgeFull
using the function cpm
:
eff.lib.size <- dgeFull$samples$lib.size*dgeFull$samples$norm.factors
normCounts <- cpm(dgeFull)
pseudoNormCounts <- log2(normCounts + 1)
boxplot(pseudoNormCounts, col="gray", las=3)
plotMDS(pseudoNormCounts)
Exercise 4.15 estimate common and tagwise dispersion
dgeFull <- estimateCommonDisp(dgeFull)
dgeFull <- estimateTagwiseDisp(dgeFull)
dgeFull
## An object of class "DGEList"
## $counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 306 502 468 369
## Solyc00g005080.1.1 0 0 0 0
## Solyc00g005150.1.1 0 0 0 2
## Solyc00g005840.2.1 422 394 209 440
## Solyc00g005860.1.1 0 1 0 3
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 366 294
## Solyc00g005080.1.1 2 0
## Solyc00g005150.1.1 0 1
## Solyc00g005840.2.1 453 371
## Solyc00g005860.1.1 0 0
## 27178 more rows ...
##
## $samples
## group lib.size norm.factors
## Cond.WT.Rep.1 WT 16602772 0.9435083
## Cond.WT.Rep.2 WT 14374675 0.8879054
## Cond.WT.Rep.3 WT 17427146 0.6929932
## Cond.Mt.Rep.1 M 18060148 1.1762817
## Cond.Mt.Rep.2 M 16010158 1.2551958
## Cond.Mt.Rep.3 M 13750228 1.1666371
##
## $common.dispersion
## [1] 0.0771487
##
## $pseudo.counts
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 3.118402e+02 627.453178 618.14661023 277.172158
## Solyc00g005080.1.1 0.000000e+00 0.000000 0.00000000 0.000000
## Solyc00g005150.1.1 0.000000e+00 0.000000 0.00000000 1.576625
## Solyc00g005840.2.1 4.299819e+02 492.356242 276.85159629 330.468840
## Solyc00g005860.1.1 1.790464e-03 1.181911 0.02508957 2.456094
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 290.701358 292.5323622
## Solyc00g005080.1.1 1.698763 0.0000000
## Solyc00g005150.1.1 0.000000 0.9949215
## Solyc00g005840.2.1 359.809587 369.1487554
## Solyc00g005860.1.1 0.000000 0.0000000
## 27178 more rows ...
##
## $pseudo.lib.size
## [1] 15961436
##
## $AveLogCPM
## [1] 4.662164 -2.811589 -2.713178 4.566861 -2.620642
## 27178 more elements ...
##
## $prior.n
## [1] 2.5
##
## $tagwise.dispersion
## [1] 0.06867039 0.40944624 0.23726001 0.06040654 0.37971489
## 27178 more elements ...
Exercise 4.16 perform an exact test for the difference in expression between the two conditions “WT” and “Mt”
dgeTest <- exactTest(dgeFull)
dgeTest
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Solyc00g005050.2.1 0.8550483 4.662164 0.00716433
## Solyc00g005080.1.1 -2.4605336 -2.811589 0.53194084
## Solyc00g005150.1.1 -2.9663414 -2.713178 0.27852936
## Solyc00g005840.2.1 0.1792639 4.566861 0.54956898
## Solyc00g005860.1.1 -0.8423184 -2.620642 1.00000000
## 27178 more rows ...
##
## $comparison
## [1] "M" "WT"
##
## $genes
## NULL
Exercise 4.17 remove low expressed genes
filtData <- HTSFilter(dgeFull)$filteredData
dgeTestFilt <- exactTest(filtData)
dgeTestFilt
## An object of class "DGEExact"
## $table
## logFC logCPM PValue
## Solyc00g005050.2.1 0.85504826 4.662164 0.007164330
## Solyc00g005840.2.1 0.17926394 4.566861 0.549568980
## Solyc00g006470.1.1 1.75460476 1.164895 0.001333356
## Solyc00g006490.2.1 0.06847886 4.987762 0.805760711
## Solyc00g006690.2.1 1.84727375 1.355638 0.006188601
## 22936 more rows ...
##
## $comparison
## [1] "M" "WT"
##
## $genes
## NULL
Exercise 4.18 plot an histogram of unadjusted p-values
hist(dgeTest$table[,"PValue"], breaks=50)
Exercise 4.19 plot an histogram of unadjusted p-values after filtering
hist(dgeTestFilt$table[,"PValue"], breaks=50)
Exercise 4.20 extract a summary of the differential expression statistics
resNoFilt <- topTags(dgeTest, n=nrow(dgeTest$table))
head(resNoFilt)
## Comparison of groups: WT-M
## logFC logCPM PValue FDR
## Solyc07g053460.2.1 -10.21752 8.535498 4.655152e-110 1.265410e-105
## Solyc03g098680.2.1 -10.01232 7.704347 2.109079e-100 2.866555e-96
## Solyc01g009590.2.1 -10.16384 8.694514 3.008506e-99 2.726007e-95
## Solyc07g015960.1.1 -10.05634 7.328319 1.662644e-91 1.129892e-87
## Solyc06g064480.2.1 -11.41320 6.803637 2.269855e-91 1.234029e-87
## Solyc01g086830.2.1 -10.22500 7.374713 5.285082e-88 2.394406e-84
resFilt <- topTags(dgeTestFilt, n=nrow(dgeTest$table))
head(resFilt)
## Comparison of groups: WT-M
## logFC logCPM PValue FDR
## Solyc07g053460.2.1 -10.21752 8.535498 4.655152e-110 1.067938e-105
## Solyc03g098680.2.1 -10.01232 7.704347 2.109079e-100 2.419219e-96
## Solyc01g009590.2.1 -10.16384 8.694514 3.008506e-99 2.300604e-95
## Solyc07g015960.1.1 -10.05634 7.328319 1.662644e-91 9.535681e-88
## Solyc06g064480.2.1 -11.41320 6.803637 2.269855e-91 1.041455e-87
## Solyc01g086830.2.1 -10.22500 7.374713 5.285082e-88 2.020751e-84
Exercise 4.21 compare the number of differentially expressed genes with and without filtering (risk: 1%)
# before independent filtering
sum(resNoFilt$table$FDR < 0.01)
## [1] 6077
# after independent filtering
sum(resFilt$table$FDR < 0.01)
## [1] 6184
Exercise 4.22 extract and sort differentially expressed genes
sigDownReg <- resFilt$table[resFilt$table$FDR<0.01,]
sigDownReg <- sigDownReg[order(sigDownReg$logFC),]
head(sigDownReg)
## logFC logCPM PValue FDR
## Solyc08g060930.2.1 -12.51991 4.508122 9.517060e-70 7.797532e-67
## Solyc07g042880.1.1 -12.50830 4.496547 1.914258e-70 1.626481e-67
## Solyc01g099570.2.1 -12.26036 4.251771 2.673194e-66 1.657453e-63
## Solyc11g028040.1.1 -11.80636 5.865162 1.143701e-82 2.385239e-79
## Solyc12g010560.1.1 -11.67222 3.672741 3.794059e-59 1.851905e-56
## Solyc07g065770.2.1 -11.42245 3.428110 2.240386e-50 8.030734e-48
sigUpReg <- sigDownReg[order(sigDownReg$logFC, decreasing=TRUE),]
head(sigUpReg)
## logFC logCPM PValue FDR
## Solyc09g089500.2.1 9.202703 5.85908053 1.449894e-67 1.007940e-64
## Solyc02g089710.2.1 8.756017 0.75399745 3.849181e-22 2.830258e-20
## Solyc03g013340.2.1 8.026917 0.06652057 8.906932e-15 3.388622e-13
## Solyc03g078580.2.1 7.999501 0.06761050 3.418908e-09 6.161286e-08
## Solyc02g078080.1.1 7.719094 1.39812183 2.655482e-22 1.997358e-20
## Solyc03g098760.1.1 7.443860 8.09992520 1.856150e-82 3.548495e-79
Exercise 4.24 write the results in csv files
write.csv(sigDownReg, file="sigDownReg_tomato.csv")
write.csv(sigUpReg, file="sigUpReg_tomato.csv")
Exercise 4.25 create a MA plot with 1% differentially expressed genes
plotSmear(dgeTestFilt,
de.tags = rownames(resFilt$table)[which(resFilt$table$FDR<0.01)])
Exercise 4.26 create a Volcano plot
volcanoData <- cbind(resFilt$table$logFC, -log10(resFilt$table$FDR))
colnames(volcanoData) <- c("logFC", "negLogPval")
head(volcanoData)
## logFC negLogPval
## [1,] -10.21752 104.97145
## [2,] -10.01232 95.61632
## [3,] -10.16384 94.63816
## [4,] -10.05634 87.02065
## [5,] -11.41320 86.98236
## [6,] -10.22500 83.69449
plot(volcanoData, pch=19)
Exercise 4.27 transform the normalized counts in log-counts-per-million
y <- cpm(dgeFull, log=TRUE, prior.count = 1)
head(y)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc00g005050.2.1 4.2924482 5.299855 5.278467 4.123588
## Solyc00g005080.1.1 -4.0280732 -4.028073 -4.028073 -4.028073
## Solyc00g005150.1.1 -4.0280732 -4.028073 -4.028073 -2.685565
## Solyc00g005840.2.1 4.7549185 4.950978 4.118282 4.376651
## Solyc00g005860.1.1 -4.0280732 -2.840165 -4.028073 -2.303913
## Solyc00g006470.1.1 0.2754417 1.842784 2.487764 -0.381981
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc00g005050.2.1 4.1917189 4.20074963
## Solyc00g005080.1.1 -2.6364953 -4.02807316
## Solyc00g005150.1.1 -4.0280732 -3.01585514
## Solyc00g005840.2.1 4.4984566 4.53535423
## Solyc00g005860.1.1 -4.0280732 -4.02807316
## Solyc00g006470.1.1 0.4383916 0.08230305
Exercise 4.28 select 1% differentially expressed genes and produce a heatmap
selY <- y[rownames(resFilt$table)[resFilt$table$FDR<0.01 &
abs(resFilt$table$logFC)>1.5],]
head(selY)
## Cond.WT.Rep.1 Cond.WT.Rep.2 Cond.WT.Rep.3 Cond.Mt.Rep.1
## Solyc07g053460.2.1 -0.3892916 -0.713732 -0.64179564 9.435273
## Solyc03g098680.2.1 -1.1703336 -1.142283 -1.07306157 8.584728
## Solyc01g009590.2.1 -0.1662132 -1.142283 -0.04078523 9.464694
## Solyc07g015960.1.1 -0.9766573 -2.197638 -1.69103905 8.120533
## Solyc06g064480.2.1 -2.9984687 -2.197638 -4.02807316 7.573417
## Solyc01g086830.2.1 -1.1703336 -1.416220 -2.79487461 8.209483
## Cond.Mt.Rep.2 Cond.Mt.Rep.3
## Solyc07g053460.2.1 9.706334 9.441661
## Solyc03g098680.2.1 8.921504 8.569667
## Solyc01g009590.2.1 9.996001 9.559335
## Solyc07g015960.1.1 8.617938 8.183916
## Solyc06g064480.2.1 8.093823 7.679248
## Solyc01g086830.2.1 8.728357 8.094945
cimColor <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)[255:1]
finalHM <- cim(t(selY), color=cimColor, symkey=FALSE)
If you are interesting in the result of the gene clustering, it can be obtained from the previous command. More precisely, the result of HAC is stored into $ddc
which can be plotted as shown below:
plot(finalHM$ddc, leaflab="none")
abline(h=10, lwd=2, col="pink")
Using this dendrogram, we might want to cut the tree at level \(h=10\) (for instance), which can be performed using the function cutree
, which will provide a cluster membership for each gene.
geneClust <- cutree(as.hclust(finalHM$ddc), h=10)
head(geneClust)
## Solyc07g053460.2.1 Solyc03g098680.2.1 Solyc01g009590.2.1
## 1 2 1
## Solyc07g015960.1.1 Solyc06g064480.2.1 Solyc01g086830.2.1
## 2 2 2
For instance, the number of clusters is equal to
length(unique(geneClust))
## [1] 17
and the genes in cluster 1 are:
names(which(geneClust==1))
## [1] "Solyc07g053460.2.1" "Solyc01g009590.2.1" "Solyc02g078370.1.1"
## [4] "Solyc02g032910.1.1" "Solyc05g050700.1.1"