Identifying differentially expressed genes using linear models (part 1)

Overview

Teaching: 20 min
Exercises: 20 min
Learning Objectives
  • Be able to use limma to identify differentially expressed genes.

  • Understand the formula class of objects in R, and use it to specify the appropriate model for linear modeling.

What you should have

At this point, you should have two fully processed ExpressionSet objects. We will use the first one (GSE33146) in this lesson.

The formula class of objects

The formula class is the work horse of statistical modeling in R. We use ~ in the specification of the model, where y ~ x means the response y is modelled by a linear variable x. More complex models are possible using the operators +, *, :, and others. Refer to the formula manual for more information on the different operators and their meaning in a formula. Linear models expressed this way include an implicit ‘intercept’ term, which we can remove from the model by indicating + 0 in our model formula. For a variety of reasons, it’s often convenient to remove the intercept term when doing differential gene expression analysis.

Identifying differentially expressed features

In a transcriptomics experiment, whether using microarrays or RNA-seq, we often fit the data from every transcript as a response against a common model of independent variables. The variables describe the experiment, and we specify just the right side of a formula (the left side being used to fit the data). For example, if we have a treatment treat that can take several values, and the actual value of treat varies across samples, we might specify a model by

~  treat

or possibly

~ 0 + treat

The second formula explicitly removes the intercept. More on that later.

Specifying our model for differential gene expression analysis

In order to identify differentially expressed genes using linear models, we need to do two things:

  1. Generate a model matrix specifying the design, and
  2. Fit the model to the design.

In all cases except for the simple two-class comparison, we must also specify a contrast matrix, which we’ll get to shortly.

In our first data set, we only have one treatment: the change of culture conditions.

library(limma)
design <- model.matrix( ~ gse33146_eset[['culture medium:ch1']])
colnames(design)[2] <- "SCGM"

Notice that the design matrix has two columns: one specifiying the intercept and one specifying the change of culture conditions.

(Intercept) SCGM
1 0
1 0
1 0
1 1
1 1
1 1

The lmFit() function of limma fits a linear model for every row of our expression matrix. In the case of a two-class comparison, this is equivalent to a simple t-test.

fit <- lmFit(gse33146_eset,design)

Empirical Bayes correction in limma

Empirical Bayes (eBayes) is a method that borrows information about the distribution across genes to calculate a robust test statistic. In limma, this can be performed using the eBayes() function. The function requires that we provide an object returned from fitting a linear model (or contrast matrix) to the data. Performing eBayes correction is easily done in R using the following line of code:

fitted.ebayes <- eBayes(fit)

Extracting differentially expressed genes

We now have, model fits for each feature on the array, and we can arrange them in a table using topTable() By default, topTable will provide the top 10 features sorted by the “B” statistic, which is the log odds of differential expression.

topTable(fitted.ebayes)
Removing intercept from test coefficients
  logFC AveExpr t P.Value adj.P.Val B
204268_at -4.013515 12.416222 -64.54093 0 1e-07 17.05001
203691_at -4.359692 9.194869 -62.85603 0 1e-07 16.93255
228335_at 5.036631 7.251430 62.22840 0 1e-07 16.88718
226560_at -3.924272 7.475488 -61.58967 0 1e-07 16.84008
1558846_at -3.910434 6.392854 -61.34606 0 1e-07 16.82186
201820_at -5.150011 9.049732 -61.29092 0 1e-07 16.81772
210809_s_at 4.309197 10.827920 59.60159 0 1e-07 16.68715
204304_s_at -3.724700 7.445306 -59.22172 0 1e-07 16.65680
204971_at -4.622910 9.764728 -55.37340 0 1e-07 16.32707
206166_s_at -4.268026 6.384286 -55.22246 0 1e-07 16.31326

See how it removed the intercept? In this case, the “SCGM” coefficient is telling us all that we need to know.

Another good function is decideTests(). This function returns which tested features of an array pass the test criteria. By default, this is a (Benjamini-Hochberg) p value of 0.05, and no log fold change cutoff. We can look at a summary with a log fold change cutoff of 1 (meaning 2 fold change in either direction).

summary(decideTests(fitted.ebayes[,"SCGM"],lfc=1))
        SCGM
Down     633
NotSig 53460
Up       582

Using contrasts

The intercept is basically useless when doing these sorts of tests, and what we really want is to be comparing experimental groups, right? So can we create a design that reflects the changes in group means? We don’t have do do this for only two conditions, but it becomes essential for more complex experimental designs.

To accomplish this we have to specify contrasts within our experimental design. Contrasts are the comparisons we wish to make. If treat represents two alternative treatments, we simply want to know if the gene expression is different for one treatment versus the other, we calculate the group means for each treatment, and the contrast (the difference) between them.

design <- model.matrix( ~ 0 + gse33146_eset[['culture medium:ch1']])
colnames(design) <- levels(as.factor(gse33146_eset[['culture medium:ch1']]))
MEGM SCGM
1 0
1 0
1 0
0 1
0 1
0 1

If we remove the intercept, our coefficients now correspond to the conditions. This seems (to me) more natural, but it creates an extra step: deciding the contrasts between groups. In this case it’s easy because there are only two groups. In other cases, we’ll have more choices.

contrast_matrix <- makeContrasts(SCGM - MEGM, levels=design)
contrast_matrix
      Contrasts
Levels SCGM - MEGM
  MEGM          -1
  SCGM           1

In our contrast matrix, we are interested in finding out the difference between the group grown in SCGM (the EMT phenotype) and the control (grown in MEGM). For that reason, we used SCGM-MEGM, with the latter being the reference. This formulation means that when the log fold change is positive, expression is greater in SCGM than MEGM, and vice versa. Naturally, you can perform more complex analysis with multiple comparisons depending on the question of interest. You can read up more on using limma for more complex analysis from the limma user guide (available at https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf, and in particular, pages 35-64), which demonstrates the use of limma for a wide range of questions and even two-colored platforms.

Now we can move ahead to the fit.

fit <- lmFit(gse33146_eset,design)
fit2 <- contrasts.fit(fit,contrasts=contrast_matrix)
fit2 <- eBayes(fit2)
summary(decideTests(fit2,lfc=1))
       SCGM - MEGM
Down           633
NotSig       53460
Up             582

Notice we found the exact same number of differentially expressed genes, but used the contrasts explicitly in the second case.

The importance of a good model

While the process of fitting a model to the data is not difficult, the difficulty that is most often encountered is the choice of an appropriate model. Many times, there are underlying confounders that are not immediately apparent but have significant impact on the results. One such confounder frequently encountered in microarrays is batch effect, which arises when samples are analyzed on different days by different people, leading to the introduction of technical artifacts. For this reason, exploratory data analysis (EDA) is critical to understanding the nature of data prior to model fitting. While outside the scope of this practical, it is a worthwhile investment to find out some of these methods and also how one can correct for these technical differences in a statistically robust manner.

Key Points

  • The formula class of objects in R enables us to represent a wide range of models to identify differentially expressed genes.