Identifying differentially expressed genes using linear models (part 1)
Overview
Teaching: 20 min
Exercises: 20 minLearning Objectives
Be able to use
limma
to identify differentially expressed genes.Understand the formula class of objects in R, and use it to specify the appropriate model for linear modeling.
What you should have
At this point, you should have two fully processed ExpressionSet objects. We will use the first one (GSE33146) in this lesson.
The formula class of objects
The formula class is the work horse of statistical modeling in R. We use ~
in the
specification of the model, where y ~ x
means the response y
is modelled by a linear variable
x
. More complex models are possible using the operators +
, *
, :
, and others. Refer to the
formula manual for more information on the different operators and their meaning in a
formula. Linear models expressed this way include an implicit ‘intercept’ term, which we can remove
from the model by indicating + 0
in our model formula. For a variety of reasons, it’s often
convenient to remove the intercept term when doing differential gene expression analysis.
Identifying differentially expressed features
In a transcriptomics experiment, whether using microarrays or RNA-seq, we often fit the data from
every transcript as a response against a common model of independent variables. The variables
describe the experiment, and we specify just the right side of a formula (the left side being used
to fit the data). For example, if we have a treatment treat
that can take several values,
and the actual value of treat
varies across samples, we might specify a model by
~ treat
or possibly
~ 0 + treat
The second formula explicitly removes the intercept. More on that later.
Specifying our model for differential gene expression analysis
In order to identify differentially expressed genes using linear models, we need to do two things:
- Generate a model matrix specifying the design, and
- Fit the model to the design.
In all cases except for the simple two-class comparison, we must also specify a contrast matrix, which we’ll get to shortly.
In our first data set, we only have one treatment: the change of culture conditions.
library(limma)
design <- model.matrix( ~ gse33146_eset[['culture medium:ch1']])
colnames(design)[2] <- "SCGM"
Notice that the design matrix has two columns: one specifiying the intercept and one specifying the change of culture conditions.
(Intercept) | SCGM |
---|---|
1 | 0 |
1 | 0 |
1 | 0 |
1 | 1 |
1 | 1 |
1 | 1 |
The lmFit()
function of limma fits a linear model for every row of our expression matrix. In the case of a two-class comparison, this is equivalent to a simple t-test.
fit <- lmFit(gse33146_eset,design)
Empirical Bayes correction in limma
Empirical Bayes (eBayes) is a method that borrows information about the distribution
across genes to calculate a robust test statistic. In limma
, this can be performed using
the eBayes()
function. The function requires that we provide an object returned from
fitting a linear model (or contrast matrix) to the data. Performing eBayes correction is easily done in R using the following
line of code:
fitted.ebayes <- eBayes(fit)
Extracting differentially expressed genes
We now have, model fits for each feature on the array, and we can arrange them in a table using
topTable()
By default, topTable will provide the top 10 features sorted by the “B” statistic,
which is the log odds of differential expression.
topTable(fitted.ebayes)
Removing intercept from test coefficients
logFC | AveExpr | t | P.Value | adj.P.Val | B | |
---|---|---|---|---|---|---|
204268_at | -4.013515 | 12.416222 | -64.54093 | 0 | 1e-07 | 17.05001 |
203691_at | -4.359692 | 9.194869 | -62.85603 | 0 | 1e-07 | 16.93255 |
228335_at | 5.036631 | 7.251430 | 62.22840 | 0 | 1e-07 | 16.88718 |
226560_at | -3.924272 | 7.475488 | -61.58967 | 0 | 1e-07 | 16.84008 |
1558846_at | -3.910434 | 6.392854 | -61.34606 | 0 | 1e-07 | 16.82186 |
201820_at | -5.150011 | 9.049732 | -61.29092 | 0 | 1e-07 | 16.81772 |
210809_s_at | 4.309197 | 10.827920 | 59.60159 | 0 | 1e-07 | 16.68715 |
204304_s_at | -3.724700 | 7.445306 | -59.22172 | 0 | 1e-07 | 16.65680 |
204971_at | -4.622910 | 9.764728 | -55.37340 | 0 | 1e-07 | 16.32707 |
206166_s_at | -4.268026 | 6.384286 | -55.22246 | 0 | 1e-07 | 16.31326 |
See how it removed the intercept? In this case, the “SCGM” coefficient is telling us all that we need to know.
Another good function is decideTests()
. This function returns which tested features of an array pass the test criteria. By default, this is a (Benjamini-Hochberg) p value of 0.05, and no log fold change cutoff. We can look at a summary with a log fold change cutoff of 1 (meaning 2 fold change in either direction).
summary(decideTests(fitted.ebayes[,"SCGM"],lfc=1))
SCGM
Down 633
NotSig 53460
Up 582
Using contrasts
The intercept is basically useless when doing these sorts of tests, and what we really want is to be comparing experimental groups, right? So can we create a design that reflects the changes in group means? We don’t have do do this for only two conditions, but it becomes essential for more complex experimental designs.
To accomplish this we have to specify contrasts within our experimental design. Contrasts are the comparisons we wish to make. If treat
represents two alternative treatments, we simply want to know if the gene expression is different for one treatment versus the other, we calculate the group means for each treatment, and the contrast (the difference) between them.
design <- model.matrix( ~ 0 + gse33146_eset[['culture medium:ch1']])
colnames(design) <- levels(as.factor(gse33146_eset[['culture medium:ch1']]))
MEGM | SCGM |
---|---|
1 | 0 |
1 | 0 |
1 | 0 |
0 | 1 |
0 | 1 |
0 | 1 |
If we remove the intercept, our coefficients now correspond to the conditions. This seems (to me) more natural, but it creates an extra step: deciding the contrasts between groups. In this case it’s easy because there are only two groups. In other cases, we’ll have more choices.
contrast_matrix <- makeContrasts(SCGM - MEGM, levels=design)
contrast_matrix
Contrasts
Levels SCGM - MEGM
MEGM -1
SCGM 1
In our contrast matrix, we are interested in finding out the
difference between the group grown in SCGM (the EMT phenotype) and
the control (grown in MEGM). For that reason, we used SCGM-MEGM
, with the latter being the reference.
This formulation means that when the log fold change is positive, expression is greater in SCGM than MEGM, and vice versa.
Naturally, you can perform more complex analysis with multiple comparisons depending on
the question of interest. You can read up more on using limma for more complex analysis
from the limma user guide (available at
https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf,
and in particular, pages 35-64), which demonstrates the use of limma for a wide range of
questions and even two-colored platforms.
Now we can move ahead to the fit.
fit <- lmFit(gse33146_eset,design)
fit2 <- contrasts.fit(fit,contrasts=contrast_matrix)
fit2 <- eBayes(fit2)
summary(decideTests(fit2,lfc=1))
SCGM - MEGM
Down 633
NotSig 53460
Up 582
Notice we found the exact same number of differentially expressed genes, but used the contrasts explicitly in the second case.
The importance of a good model
While the process of fitting a model to the data is not difficult, the difficulty that is most often encountered is the choice of an appropriate model. Many times, there are underlying confounders that are not immediately apparent but have significant impact on the results. One such confounder frequently encountered in microarrays is batch effect, which arises when samples are analyzed on different days by different people, leading to the introduction of technical artifacts. For this reason, exploratory data analysis (EDA) is critical to understanding the nature of data prior to model fitting. While outside the scope of this practical, it is a worthwhile investment to find out some of these methods and also how one can correct for these technical differences in a statistically robust manner.
Key Points
The
formula
class of objects in R enables us to represent a wide range of models to identify differentially expressed genes.