Working with data

Overview

Teaching: 10 min
Exercises: 20 min
Learning Objectives
  • Be able to perform a wide range of basic statistical computations in R.

  • Be able to extend basic statistics functions that take a numeric vector as input so that they work on numeric data frames and matrices, using the apply techniques taught earlier.

  • Be aware of the importance of knowing the data at hand, including the nature of the data one is working with.

This episode is adapted from the Data Carpentry R genomics lesson.

Metadata from an experiment

In this example, we are studying a population of Escherichia coli (designated Ara-3), which was propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate, which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations. The metadata file describes the Ara-3 clones, and its columns represent:

Column       Description
sample       clone name
generation   generation when sample was frozen
clade        based on parsimony-based tree
strain       ancestral strain
cit          citrate-using mutant status
run          Sequence Read Archive sample ID
genome_size  size in Mbp (made-up data for this lesson)

The metadata file required for this lesson can be downloaded directly here or viewed on GitHub.

Tip

If you can’t find the Ecoli_metadata.csv file, have lost track of it, or want to do it in R, download the file directly using the R download.file() function:

download.file("https://raw.githubusercontent.com/datacarpentry/R-genomics/gh-pages/data/Ecoli_metadata.csv", "data/Ecoli_metadata.csv")

You are now ready to load the data. We are going to use the R function read.csv() to load the data file into memory (as a data.frame):

metadata <- read.csv('data/Ecoli_metadata.csv')
head(metadata)
    sample generation   clade strain     cit       run genome_size
1   REL606          0    <NA> REL606 unknown                  4.62
2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
3   ZDB409       5000 unknown REL606 unknown SRR098281        4.60
4   ZDB429      10000      UC REL606 unknown SRR098282        4.59
5   ZDB446      15000      UC REL606 unknown SRR098283        4.66
6   ZDB458      20000 (C1,C2) REL606 unknown SRR098284        4.63

You have now both downloaded the data from the internet and read it into a data frame, the first few rows of which are shown above.

You can investigate the structure of any object with the function str(). Try that now on the metadata data frame.

Solution

str(metadata)
'data.frame':	30 obs. of  7 variables:
 $ sample     : Factor w/ 30 levels "CZB152","CZB154",..: 7 6 18 19 20 21 22 23 24 25 ...
 $ generation : int  0 2000 5000 10000 15000 20000 20000 20000 25000 25000 ...
 $ clade      : Factor w/ 7 levels "(C1,C2)","C1",..: NA 7 7 6 6 1 1 1 2 4 ...
 $ strain     : Factor w/ 1 level "REL606": 1 1 1 1 1 1 1 1 1 1 ...
 $ cit        : Factor w/ 3 levels "minus","plus",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ run        : Factor w/ 30 levels "","SRR097977",..: 1 5 22 23 24 25 26 27 28 29 ...
 $ genome_size: num  4.62 4.63 4.6 4.59 4.66 4.63 4.62 4.61 4.65 4.59 ...

You can see here that the str output has a detailed description of the data structure.

As you can see, many of the columns in our data frame are of a special class called factor. Before we learn more about the data.frame class, we are going to talk about factors. They are very useful but not necessarily intuitive, and therefore require some attention.

Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting. Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

From str(metadata), we can see the names of the columns, and that some are described as, for example, "Factor w/ 30 levels". When we read in a file using read.csv(), any column that contains text is converted to a factor by default (note: in R 4.0 and later this is no longer the default; set stringsAsFactors = TRUE to reproduce the behaviour shown here). Once created, factors can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order. For instance, we see that cit is a factor with three levels: minus, plus and unknown.
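To see this integer-with-labels behaviour directly, here is a minimal sketch using a small hand-made factor (rather than the metadata columns, so it runs without the data file):

```r
# A small factor built by hand; levels default to alphabetical order
cit <- factor(c("unknown", "plus", "minus", "plus"))
levels(cit)        # "minus" "plus" "unknown"
nlevels(cit)       # 3

# Factors are stored as integers, with labels attached to each level
typeof(cit)        # "integer"
as.integer(cit)    # 3 2 1 2

# To recover the original strings, convert explicitly
as.character(cit)  # "unknown" "plus" "minus" "plus"
```

Note that as.numeric() on a factor returns the underlying integer codes, not the labels, which is a classic pitfall when a factor's labels look like numbers.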

Inspecting data.frame objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Several other functions help you get a sense of the content/structure of the data, such as dim(), nrow(), ncol(), tail(), names() and summary(); this list is not exhaustive.

Note: most of these functions are "generic"; they can be used on other types of objects besides data.frame.
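As a quick sketch of these inspection functions in action (using the built-in iris dataset as a stand-in, since any data frame works):

```r
data(iris)        # a built-in data frame; any data.frame behaves the same

dim(iris)         # rows and columns: 150 5
nrow(iris)        # number of rows
ncol(iris)        # number of columns
head(iris)        # first 6 rows
tail(iris)        # last 6 rows
names(iris)       # column names (same as colnames() for data frames)
str(iris)         # compact description of the structure
summary(iris)     # per-column summary statistics
```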

Challenge

  • What is the class of the object metadata?
  • How many rows and how many columns are in this object?
  • How many citrate+ mutants have been recorded in this population?

Solution

class(metadata)
[1] "data.frame"
## how many rows?
nrow(metadata)
[1] 30
## how many columns
ncol(metadata)
[1] 7
## how many citrate+
sum(metadata$cit == "plus")
[1] 9

Statistics in R

As the language favored by statisticians, R has a wealth of statistical functions (and an even greater wealth of packages that incorporate more advanced statistical methods). In this section, we will provide an overview of some basic statistics that can be done in R.

Setting up

For this section and the next, we will require some toy data to work with. We will use the famous iris dataset that is provided in R. To load the dataset, simply issue the command data(iris). This will create an object iris in the workspace. Throughout, we will refer to this dataset in the code snippets provided.

Summarizing data and descriptive statistics

The summary() function provides a way to quickly summarize the data at hand. The following is the output from summary() when we run it on iris:

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width 
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100 
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300 
 Median :5.800   Median :3.000   Median :4.350   Median :1.300 
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199 
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800 
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500 
       Species 
 setosa    :50 
 versicolor:50 
 virginica :50

Noticeably, R was able to figure out that the data in the last column (the species information) cannot be summarized with statistics such as the mean or the median, because it is a categorical variable. This is another key feature of R: the same function can behave differently depending on the type of data it is applied to. For further illustration, let's see what happens if iris is converted to a matrix instead of a data frame:

 > summary(as.matrix(iris))
  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width       Species 
 5.0    :10    3.0    :26   1.4    :13    0.2    :29   setosa    :50 
 5.1    : 9    2.8    :14   1.5    :13    1.3    :13   versicolor:50 
 6.3    : 9    3.2    :13   4.5    : 8    1.5    :12   virginica :50 
 5.7    : 8    3.4    :12   5.1    : 8    1.8    :12
 6.7    : 8    3.1    :11   1.3    : 7    1.4    : 8 
 5.5    : 7    2.9    :10   1.6    : 7    2.3    : 8 
 (Other):99    (Other):64   (Other):94    (Other):68 

Instead of summarizing each column with statistics such as the mean and the median, as in the previous output, summary now returns the frequency of each observed value. This is because a matrix can only hold one data type, so converting iris to a matrix coerced every column, including the numeric ones, to character.

Two other important statistics that are commonly calculated are the mean and the standard deviation. Unlike summary(), which can be called on data frames and matrices, the mean() and sd() functions require the input to be a numeric vector.
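As a minimal sketch, each function is applied to a single numeric column of iris:

```r
data(iris)

# mean() and sd() each expect a single numeric vector
mean(iris$Sepal.Length)   # 5.843333
sd(iris$Sepal.Length)     # 0.8280661

# Passing the whole data frame does not work the same way:
# mean(iris) returns NA with a warning, and sd(iris) throws an error,
# because neither function is defined for data frames.
```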

Finding row and column means

Although mean() is used to calculate the mean of a numeric vector, a common task is to calculate the averages of columns and/or rows. There are two ways to do so: one using apply() and one using built-in R functions. Using apply() to find the column means, one can do the following:

apply(iris[,1:4], 2, mean)

The snippet can be read as: "On the first 4 columns of iris, calculate the mean along the columns (MARGIN = 2)". Use '1' as the margin instead if you are interested in row means.

One can alternatively use the built-in R functions colMeans() and rowMeans(), respectively, for the same purpose.
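To check that the two approaches agree, here is a short side-by-side sketch:

```r
data(iris)
num <- iris[, 1:4]   # keep only the numeric columns

# Column means, two ways; the results are identical
apply(num, 2, mean)
colMeans(num)

# Row means likewise (MARGIN = 1 for rows)
head(apply(num, 1, mean))
head(rowMeans(num))
```

colMeans() and rowMeans() are also faster than the equivalent apply() calls, since they are implemented in optimized compiled code.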

Finding column and row standard deviations

Base R has no function analogous to rowMeans() and colMeans() for calculating standard deviations. How would you write a script to calculate the row and column standard deviations? Hint: you can use apply() to do so.
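One possible solution, following the same pattern as the column-mean example above, is to pass sd as the function argument to apply():

```r
data(iris)
num <- iris[, 1:4]   # numeric columns only

# Column standard deviations: apply sd down each column (MARGIN = 2)
col_sds <- apply(num, 2, sd)
col_sds

# Row standard deviations: MARGIN = 1
row_sds <- apply(num, 1, sd)
head(row_sds)
```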

Measuring correlations

From the manual

cor                   package:stats                    R Documentation
Correlation, Variance and Covariance (Matrices)
Description:

     'var', 'cov' and 'cor' compute the variance of 'x' and the
     covariance or correlation of 'x' and 'y' if these are vectors.  If
     'x' and 'y' are matrices then the covariances (or correlations)
     between the columns of 'x' and the columns of 'y' are computed.

     'cov2cor' scales a covariance matrix into the corresponding
     correlation matrix _efficiently_.

We can calculate correlation coefficients between two variables quickly and efficiently using cor(). For example, to find how well correlated sepal and petal width are, we can do the following:

> cor(iris$Sepal.Width, iris$Petal.Width)
[1] -0.3661259

In the above snippet, we calculated the correlation between two variables: sepal width and petal width. What if we wanted the pairwise correlations between all the variables? Do we have to compute every possible pair and tabulate the results ourselves? Fortunately not. From the manual, we see that 'x' can be a data frame or a matrix, in which case we get pairwise correlations between the columns. So, for instance:

> cor(iris)
Error in cor(iris) : 'x' must be numeric

What happened? The error was thrown because column 5 of our data contains the species information, which is not stored as numeric data, so a correlation cannot be calculated for it. A quick fix is to exclude the problematic column (i.e. include only the first 4 columns):

> cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Instead of a single number, cor() now gives us a correlation matrix between all 4 variables. Quite handy!

The importance of knowing what the default options are

It is important to be aware of what the output represents. Because we did not specify the type of correlation coefficient, R used the default, which is the Pearson correlation coefficient. R can also calculate other correlation coefficients, including the Spearman correlation coefficient and Kendall's tau. The Pearson correlation coefficient is not always the best, or even an appropriate, measure. This highlights the need to know and understand the default options whenever we do any sort of analysis in R.
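As a sketch, the alternative coefficients are selected via the method argument of cor():

```r
data(iris)

# The default is Pearson; other methods are one argument away
cor(iris$Sepal.Width, iris$Petal.Width)                       # -0.3661259
cor(iris$Sepal.Width, iris$Petal.Width, method = "spearman")
cor(iris$Sepal.Width, iris$Petal.Width, method = "kendall")

# The method argument also works when computing a correlation matrix
cor(iris[, 1:4], method = "spearman")
```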

Calculating correlations is one of the most common tasks in analysis. However, it is important to be aware of the limitations of using correlation coefficients as well as the underlying assumptions for each type of correlation coefficient.



Key Points

  • The summary function is used to provide a quick overview of numeric data frames.

  • The same function in R can give different results depending on the data type.

  • The mean and standard deviation of a vector are calculated using the mean and sd functions, respectively.

  • apply can be used to extend the mean and sd functions to perform summary statistics on data frames or matrices.

  • cor is used to calculate correlation coefficient between two vectors, or if a matrix is provided, pairwise correlation coefficients between variables.

  • It is important to know what type of data one has on hand, as functions require specific data types as inputs.

  • Default options might not always be the most appropriate for the data at hand, and hence it is important to know what the defaults are when writing scripts.