This document has the following dependencies:
library(limma)
library(leukemiasEset)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("limma", "leukemiasEset"))
limma is a very popular package for analyzing microarray and RNA-seq data.
LIMMA stands for “linear models for microarray data”. Perhaps unsurprisingly, limma contains functionality for fitting a broad class of statistical models called “linear models”. Examples of such models include linear regression and analysis of variance. While most of the functionality of limma has been developed for microarray data, the model fitting routines of limma are useful for many types of data, and is not limited to microarrays. For example, I am using limma in my research on analysis of DNA methylation.
(The discussion in this section is not specific to limma.)
A very common analysis setup is having access to a matrix of numeric values representing some measurements; an example is gene expression. Traditionally in Bioconductor, and in computational biology more generally, columns of this matrix are samples and rows of the matrix are features. Features can be many things; in gene expression a feature is a gene. The feature by sample layout in Bioconductor is the transpose of the layout in classic statistics where the matrix is samples by features; this sometimes cause confusion.
A very common case of this type of data is gene expression data (either from microarrays or from RNA sequencing) where the features are individual genes.
Samples are usually few (usually less than a hundred, almost always less than a thousand) and are often grouped; arguably the most common setup is having samples from two different groups which we can call cases and controls. The objective of an analysis is frequently to discover which features (genes) are different between groups or stated differently: to discover which genes are differentially expressed between cases and controls. Of course, more complicated designs are also used; sometimes more than two groups are considered and sometimes there are additional important covariates such as age or sex. An example of a more complicated design is a time series experiment where each time point is a group and where it is sometimes important to account for the time elapsed between time points.
Broadly speaking, how samples are distributed between groups determines the design of the study. In addition to the design, there is one or more question(s) of interest(s) such as the difference between two groups. Such questions are usually formalized as contrasts; an example of a contrast is indeed the difference between two groups.
Samples are usually assumed to be independent but are sometimes paired; an example of pairing is when a normal and a cancer sample from the same patient is available. Pairing allows for more efficient inference because each sample has a sample specific control. If only some samples are paired, or if multiple co-linked samples exists, it becomes harder (but usually possible) to account for this structure in the statistical model.
As stated above, the most common design is a two group design with unpaired samples.
Features are often genes or genomic intervals, for example different promoters or genomic bins. The data is often gene expression data but could be histone modification abundances or even measurements from a Hi-C contact matrix.
Common to all these cases is the rectangular data structure (the matrix) with samples on columns and features on rows. This is exactly the data structure which is represented by an ExpressionSet
or a SummarizedExperiment
.
A number of different packages allows us to fit common types of models to this data structure
Extremely simplified, limma is useful for continuous data such as microarray data and edgeR / DESeq / DESeq2 are useful for count data such as high-throughput sequencing. But that is a very simplified statement.
In addition to the distributional assumptions, all of these packages uses something called empirical Bayes techniques to borrow information across features. As stated above, usually the number of samples is small and the number of features is large. It has been shown time and time again that you can get better results by borrowing information across features, for example by modeling a mean-variance relationship. This can be done in many ways and often depends on the data characteristics of a specific type of data. For example both edgeR and DESeq(2) are very popular in the analysis of RNA-seq data and all three packages uses models based on the negative binomial distribution. For a statistical point of view, a main difference between these packages (models) is how they borrow information across genes.
Fully understanding these classes of models as well as their strengths and limitations are beyond our scope. But we will still introduce aspects of these packages because they are so widely used.
Let us use the leukemiasEset
dataset from the leukemiasEset package; this is an ExpressionSet
.
library(leukemiasEset)
data(leukemiasEset)
leukemiasEset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 20172 features, 60 samples
## element names: exprs, se.exprs
## protocolData
## sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60
## total)
## varLabels: ScanDate
## varMetadata: labelDescription
## phenoData
## sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60
## total)
## varLabels: Project Tissue ... Subtype (5 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: genemapperhgu133plus2
table(leukemiasEset$LeukemiaType)
##
## ALL AML CLL CML NoL
## 12 12 12 12 12
This is data on different types of leukemia. The code NoL
means not leukemia, ie. normal controls.
Let us ask which genes are differentially expressed between the ALL
type and normal controls. First we subset the data and clean it up
ourData <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
ourData$LeukemiaType <- factor(ourData$LeukemiaType)
Now we do a standard limma model fit
design <- model.matrix(~ ourData$LeukemiaType)
fit <- lmFit(ourData, design)
fit <- eBayes(fit)
topTable(fit)
## logFC AveExpr t P.Value adj.P.Val
## ENSG00000163751 4.089587 5.819472 22.51729 9.894742e-18 1.733025e-13
## ENSG00000104043 4.519488 5.762115 21.98550 1.718248e-17 1.733025e-13
## ENSG00000008394 5.267835 7.482490 20.08250 1.374231e-16 9.240332e-13
## ENSG00000165140 3.206807 6.560163 19.41855 2.959391e-16 1.492421e-12
## ENSG00000204103 4.786273 7.774809 19.04041 4.628812e-16 1.867448e-12
## ENSG00000145569 2.845963 5.958707 18.46886 9.239404e-16 3.067090e-12
## ENSG00000038427 5.047670 6.496822 18.35375 1.064328e-15 3.067090e-12
## ENSG00000173391 4.282498 5.293222 17.89645 1.881511e-15 4.744229e-12
## ENSG00000138449 5.295928 6.999716 17.79655 2.134448e-15 4.784010e-12
## ENSG00000105352 2.521351 7.054018 17.62423 2.657074e-15 5.359850e-12
## B
## ENSG00000163751 30.15253
## ENSG00000104043 29.65066
## ENSG00000008394 27.73589
## ENSG00000165140 27.02056
## ENSG00000204103 26.60138
## ENSG00000145569 25.95084
## ENSG00000038427 25.81729
## ENSG00000173391 25.27803
## ENSG00000138449 25.15836
## ENSG00000105352 24.95031
What happens here is a common limma (and friends) workflow. First, the comparison of interest (and the design of the experiment) is defined through a so-called “design matrix”. This matrix basically encompasses everything we know about the design; in this case there are two groups (we have more to say on the design below). Next, the model is fitted. This is followed by borrowing strength across genes using a so-called empirical Bayes procedure (this is the step in limma which really works wonders). Because this design only has two groups there is only one possible comparison to make: which genes differs between the two groups. This question is examined by the topTable()
function which lists the top differentially expressed genes. In a more complicated design, the topTable()
function would need to be told which comparison of interest to summarize.
An important part of the output is logFC
which is the log fold-change. To interpret the sign of this quantity you need to know if this is ALL-NoL
(in which case positive values are up-regulated in ALL) or the reverse. In this case this is determined by the reference level which is the first level of the factor.
ourData$LeukemiaType
## [1] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL NoL NoL NoL NoL NoL
## [18] NoL NoL NoL NoL NoL NoL NoL
## Levels: ALL NoL
we see the reference level is ALL
so positive values means it is down-regulated in cancer. You can change the reference level of a factor using the relevel()
function. You can also confirm this by computing the logFC
by hand, which is useful to know. Let’s compute the fold-change of the top differentially expressed gene:
topTable(fit, n = 1)
## logFC AveExpr t P.Value adj.P.Val
## ENSG00000163751 4.089587 5.819472 22.51729 9.894742e-18 1.733025e-13
## B
## ENSG00000163751 30.15253
genename <- rownames(topTable(fit, n=1))
typeMean <- tapply(exprs(ourData)[genename,], ourData$LeukemiaType, mean)
typeMean
## ALL NoL
## 3.774678 7.864265
typeMean["NoL"] - typeMean["ALL"]
## NoL
## 4.089587
confirming the statement. It is sometimes useful to check things by hand to make sure you have the right interpretation. Finally, note that limma doesn’t do anything different from a difference of means when it computes logFC
; all the statistical improvements centers on computing better t-statistics and p-values.
The reader who has some experience with statistics will note that all we are doing is comparing two groups; this is the same setup as the classic t-statistic. What we are computing here is indeed a t-statistic, but one where the variance estimation (the denominator of the t-statistics) is moderated by borrowing strength across genes (this is what eBayes()
does); this is called a moderated t-statistic.
The output from topTable()
includes
logFC
: the log fold-change between cases and controls.t
: the t-statistic used to assess differential expression.P.Value
: the p-value for differential expression; this value is not adjusted for multiple testing.adj.P.Val
: the p-value adjusted for multiple testing. Different adjustment methods are available, the default is Benjamini-Horchberg.How to setup and interpret a design matrix for more complicated designs is beyond the scope of this course. The limma User’s Guide is extremely helpful here. Also, note that setting up a design matrix for an experiment is a standard task in statistics (and requires very little knowledge about genomics), so other sources of help is a local, friendly statistician or text books on basic statistics.
In the analysis in the preceding section we setup our model like this
design <- model.matrix(~ ourData$LeukemiaType)
and the we use F-statistics to get at our question of interest. We can it this easily because there is only really one interesting question for this design: is there differential expression between the two groups. But this formulation did not use contrasts; in the “Analysis Setup and Design” section we discussed how one specifies the question of interest using contrasts and we did not really do this here.
Let’s try. A contrast is interpreted relative to the design matrix one uses. One conceptual design may be represented by different design matrices, which is one of the reasons why design matrices and contrasts take a while to absorb.
Let’s have a look
head(design)
## (Intercept) ourData$LeukemiaTypeNoL
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
This matrix has two columns because there are two parameters in this conceptual design: the expression level in each of the two groups. In this parametrization column 1 represents the expression of the ALL
group and column 2 represents the difference in expression level from the NoL
group to the ALL
group. Testing that the two groups have the same expression level is done by testing whether the second parameter (equal to the difference in expression between the two groups) is equal to zero.
A different parametrization is
design2 <- model.matrix(~ ourData$LeukemiaType - 1)
head(design2)
## ourData$LeukemiaTypeALL ourData$LeukemiaTypeNoL
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
colnames(design2) <- c("ALL", "NoL")
In this design, the two parameters corresponding to the two columns of the design matrix, represents the expression levels in the two groups. And the scientific question gets translated into asking whether or not these two parameters are the same. Let us see how we form a contrast matrix for this
fit2 <- lmFit(ourData, design2)
contrast.matrix <- makeContrasts("ALL-NoL", levels = design2)
contrast.matrix
## Contrasts
## Levels ALL-NoL
## ALL 1
## NoL -1
Here we say we are interested in ALL-NoL
which has the opposite sign of what we were doing above (where it was NoL-ALL
; since NoL
is the natural reference group this makes a lot more sense. Now we fit
fit2C <- contrasts.fit(fit2, contrast.matrix)
fit2C <- eBayes(fit2C)
topTable(fit2C)
## logFC AveExpr t P.Value adj.P.Val
## ENSG00000163751 -4.089587 5.819472 -22.51729 9.894742e-18 1.733025e-13
## ENSG00000104043 -4.519488 5.762115 -21.98550 1.718248e-17 1.733025e-13
## ENSG00000008394 -5.267835 7.482490 -20.08250 1.374231e-16 9.240332e-13
## ENSG00000165140 -3.206807 6.560163 -19.41855 2.959391e-16 1.492421e-12
## ENSG00000204103 -4.786273 7.774809 -19.04041 4.628812e-16 1.867448e-12
## ENSG00000145569 -2.845963 5.958707 -18.46886 9.239404e-16 3.067090e-12
## ENSG00000038427 -5.047670 6.496822 -18.35375 1.064328e-15 3.067090e-12
## ENSG00000173391 -4.282498 5.293222 -17.89645 1.881511e-15 4.744229e-12
## ENSG00000138449 -5.295928 6.999716 -17.79655 2.134448e-15 4.784010e-12
## ENSG00000105352 -2.521351 7.054018 -17.62423 2.657074e-15 5.359850e-12
## B
## ENSG00000163751 30.15253
## ENSG00000104043 29.65066
## ENSG00000008394 27.73589
## ENSG00000165140 27.02056
## ENSG00000204103 26.60138
## ENSG00000145569 25.95084
## ENSG00000038427 25.81729
## ENSG00000173391 25.27803
## ENSG00000138449 25.15836
## ENSG00000105352 24.95031
Note that this is exactly the same output from topTable()
as above, except for the sign of the logFC
column.
As we see above, limma works directly on ExpressionSet
s. It also works directly on matrices. But limma also have a class RGList
which represents a two-color microarray. The basic data stored in this class is very ExpressionSet
-like, but it has at least two matrices of expression measurements R
(Red) and G
(Green) and optionally two additional matrices of background estimates (Rb
and Gb
). It has a slot called genes
which is basically equivalent to featureData
for ExpressionSet
s (ie. information about which genes are measured on the microarray) as well as a targets
slot which is basically the pData
information from ExpressionSet
.
limma introduced the concept of a so-called targets
file. This is just a simple text file (usually TAB or comma-separated) which holds the phenotype data. The idea is that it is easier for many users to create this text file in a spreadsheet program, and then read it into R and stored the information in the data object.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel methods stats graphics grDevices utils datasets
## [8] base
##
## other attached packages:
## [1] leukemiasEset_1.4.0 Biobase_2.28.0 BiocGenerics_0.14.0
## [4] limma_3.24.15 BiocStyle_1.6.0 rmarkdown_0.8
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.1 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 knitr_1.11 stringr_1.0.0
## [9] digest_0.6.8 evaluate_0.7.2