Working with data
Overview
Teaching: 10 min
Exercises: 20 min
Learning Objectives
Be able to perform a wide range of basic statistical computations in R.
Be able to extend, using the earlier lessons on apply, basic statistical functions that take a numeric vector as input so that they work on numeric data frames and matrices.
Be aware of the importance of knowing the data at hand, including the type and structure of the data one is working with.
This episode is adapted from the Data Carpentry R genomics lesson.
Metadata from an experiment
In this example, we are studying a population of Escherichia coli (designated Ara-3), which were propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations. This metadata describes information on the Ara-3 clones and the columns represent:
Column | Description |
---|---|
sample | clone name |
generation | generation when sample frozen |
clade | based on parsimony-based tree |
strain | ancestral strain |
cit | citrate-using mutant status |
run | Sequence read archive sample ID |
genome_size | size in Mbp (made up data for this lesson) |
The metadata file required for this lesson can be downloaded directly here or viewed on GitHub.
Tip
If you can't find the Ecoli_metadata.csv file, have lost track of it, or want to download it from within R, use the download.file() function:

download.file("https://raw.githubusercontent.com/datacarpentry/R-genomics/gh-pages/data/Ecoli_metadata.csv", "data/Ecoli_metadata.csv")
You are now ready to load the data. We are going to use the R function read.csv() to load the data file into memory (as a data.frame):
metadata <- read.csv('data/Ecoli_metadata.csv')
head(metadata)
    sample generation   clade strain     cit       run genome_size
1   REL606          0    <NA> REL606 unknown                  4.62
2 REL1166A       2000 unknown REL606 unknown SRR098028        4.63
3   ZDB409       5000 unknown REL606 unknown SRR098281        4.60
4   ZDB429      10000      UC REL606 unknown SRR098282        4.59
5   ZDB446      15000      UC REL606 unknown SRR098283        4.66
6   ZDB458      20000 (C1,C2) REL606 unknown SRR098284        4.63
You have now read the data into a data frame, the first few rows of which you can see above.
You can investigate the structure of any object with the function str(). Try that now on the metadata data frame.
Solution
str(metadata)
'data.frame': 30 obs. of  7 variables:
 $ sample     : Factor w/ 30 levels "CZB152","CZB154",..: 7 6 18 19 20 21 22 23 24 25 ...
 $ generation : int  0 2000 5000 10000 15000 20000 20000 20000 25000 25000 ...
 $ clade      : Factor w/ 7 levels "(C1,C2)","C1",..: NA 7 7 6 6 1 1 1 2 4 ...
 $ strain     : Factor w/ 1 level "REL606": 1 1 1 1 1 1 1 1 1 1 ...
 $ cit        : Factor w/ 3 levels "minus","plus",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ run        : Factor w/ 30 levels "","SRR097977",..: 1 5 22 23 24 25 26 27 28 29 ...
 $ genome_size: num  4.62 4.63 4.6 4.59 4.66 4.63 4.62 4.61 4.65 4.59 ...
You can see that the str() output gives a detailed description of the data structure.
As you can see, many of the columns in our data frame are of a special class called factor. Before we learn more about the data.frame class, we are going to talk about factors. They are very useful but not necessarily intuitive, and therefore require some attention.
Factors
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting. Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
From str(metadata), we can see the names of all the columns, and that some of them are described as, for example, "Factor w/ 30 levels".
When we read in a file using read.csv(), any column that contains text is automatically treated as a factor (this is the default in R versions before 4.0; in later versions you need to pass stringsAsFactors = TRUE to get this behaviour). Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order.
For instance, we see that cit is a factor with three levels: minus, plus and unknown.
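You can inspect a factor directly. For example, assuming the metadata data frame loaded above:
levels(metadata$cit)             # the pre-defined set of values: "minus", "plus", "unknown"
table(metadata$cit)              # counts of samples in each level
head(as.integer(metadata$cit))   # the integer codes stored under the hood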
Inspecting data.frame objects
We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content and structure of the data.
- Size:
  - dim() - returns a vector with the number of rows as the first element and the number of columns as the second element (the __dim__ensions of the object)
  - nrow() - returns the number of rows
  - ncol() - returns the number of columns
- Content:
  - head() - shows the first 6 rows
  - tail() - shows the last 6 rows
- Names:
  - names() - returns the column names (a synonym of colnames() for data.frame objects)
  - rownames() - returns the row names
- Summary:
  - str() - structure of the object and information about the class, length and content of each column
  - summary() - summary statistics for each column
Note: most of these functions are "generic"; they can be used on other types of objects besides data.frame.
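For example, using the metadata data frame from earlier:
tail(metadata)      # the last 6 rows
names(metadata)     # the column names
summary(metadata)   # summary statistics for each column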
Challenge
- What is the class of the object metadata?
- How many rows and how many columns are in this object?
- How many citrate+ mutants have been recorded in this population?
Solution
class(metadata)
[1] "data.frame"
## how many rows?
nrow(metadata)
[1] 30
## how many columns?
ncol(metadata)
[1] 7
## how many citrate+ mutants?
sum(metadata$cit == "plus")
[1] 9
Statistics in R
As the language favored by statisticians, R has a wealth of statistical functions (and an even greater wealth of packages that implement more advanced statistical methods). In this section, we will provide an overview of some basic statistics that can be done in R.
Setting up
For this section and the next, we will require some toy data to work with. We will use the famous iris dataset that is provided in R. To load the dataset, simply issue the command data(iris). This will create an object called iris in the workspace. Throughout, we will refer to this dataset in the code snippets provided.
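For instance, you can load the dataset and peek at its first few rows:
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.3  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa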
Summarizing data and descriptive statistics
The summary() function provides a way to quickly summarize the data at hand. The following is the output from summary() when we run it on iris:
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Noticeably, R was able to figure out that the data in the last column (the species information) cannot be represented using summary statistics such as the mean or the median, because it is a categorical variable. This is another key feature of R: the same function can have different effects depending on the type of data it is applied to. For further illustration, let's see what happens if iris is converted to a matrix instead of a data frame:
> summary(as.matrix(iris))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.0 :10 3.0 :26 1.4 :13 0.2 :29 setosa :50
5.1 : 9 2.8 :14 1.5 :13 1.3 :13 versicolor:50
6.3 : 9 3.2 :13 4.5 : 8 1.5 :12 virginica :50
5.7 : 8 3.4 :12 5.1 : 8 1.8 :12
6.7 : 8 3.1 :11 1.3 : 7 1.4 : 8
5.5 : 7 2.9 :10 1.6 : 7 2.3 : 8
(Other):99 (Other):64 (Other):94 (Other):68
Instead of summarizing the data using statistics such as the mean and the median as in the previous output, summary() now returns the frequency of each observed value: converting to a matrix coerced every column to character, so the values are treated as categories rather than numbers.
Two other important statistics that are commonly calculated are the mean and the standard deviation. Unlike summary(), which can be called on data frames and matrices, the mean() and sd() functions require that the input be a numeric vector.
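For example, on a single numeric column of iris:
mean(iris$Sepal.Length)   # mean of one numeric column
sd(iris$Sepal.Length)     # its standard deviation
mean(iris$Species)        # returns NA with a warning: the input is not numeric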
Finding row and column means
Although mean() is used to calculate the mean of a numeric vector, a common task is to calculate the average of the columns and/or rows. There are two ways to do so: one using apply() and one using built-in R functions. Using apply() to find the column means, one can do the following:
apply(iris[,1:4], 2, mean)
The snippet can be read as: "On the first 4 columns of iris, calculate the mean along the columns (margin = 2)". Instead of using '2' as the margin, one can use '1' to calculate row means.
One can alternatively use the built-in R functions colMeans() and rowMeans(), respectively, for the same purpose.
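For instance, these give the same results as the corresponding apply() calls:
colMeans(iris[,1:4])   # column means of the four numeric columns
rowMeans(iris[,1:4])   # per-row mean of the four measurements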
Finding column and row standard deviations
There is no function analogous to rowMeans() and colMeans() for calculating standard deviations. How would you write a script to calculate the row and column standard deviations? Hint: you can use apply() to do so.
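One possible approach, a sketch using apply() in the same way as the column-mean example above:
## column standard deviations of the four numeric columns
apply(iris[,1:4], 2, sd)
## row standard deviations
apply(iris[,1:4], 1, sd)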
Measuring correlations
From the manual (?cor):
cor                    package:stats                    R Documentation
Correlation, Variance and Covariance (Matrices)
Description:
'var', 'cov' and 'cor' compute the variance of 'x' and the
covariance or correlation of 'x' and 'y' if these are vectors. If
'x' and 'y' are matrices then the covariances (or correlations)
between the columns of 'x' and the columns of 'y' are computed.
'cov2cor' scales a covariance matrix into the corresponding
correlation matrix _efficiently_.
We can very quickly and efficiently calculate the correlation coefficient between two variables using cor(). For example, to find how well correlated sepal width and petal width are, we can do the following:
> cor(iris$Sepal.Width, iris$Petal.Width)
[1] -0.3661259
In the above snippet, we calculated the correlation between two variables: sepal width and petal width. What if we wanted to calculate the pairwise correlations of all the variables? Do we have to calculate every possible pair and tabulate the results ourselves? Fortunately not. From the manual, we see that 'x' can be a data frame or a matrix, and that in that case we get pairwise correlations. So, for instance:
> cor(iris)
Error in cor(iris) : 'x' must be numeric
What happened? The error was thrown because column 5 of our data contains the Species information, which is not stored as numeric data, so a correlation cannot be calculated for it. A quick fix is to exclude the problematic column (or, equivalently, include only the first 4 columns):
> cor(iris[,1:4])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Instead of a single number, cor() now gives us a correlation matrix between all 4 variables. Quite handy!
The importance of knowing what the default options are
It is important to be aware of what the output represents. Specifically, because we did not specify the type of correlation coefficient, R simply used the default, which is the Pearson correlation coefficient. However, it can also calculate a few other correlation coefficients, including the Spearman correlation coefficient and Kendall's tau. The Pearson correlation coefficient might not always be the best or even an appropriate measure. This highlights the need to know and understand exactly what the default options are when we do any sort of analysis in R.
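For example, the method argument of cor() selects the coefficient; to use a rank-based measure instead of the default Pearson:
cor(iris$Sepal.Width, iris$Petal.Width, method = "spearman")
cor(iris[,1:4], method = "kendall")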
Calculating correlations is one of the most common tasks in analysis. However, it is important to be aware of the limitations of using correlation coefficients as well as the underlying assumptions for each type of correlation coefficient.
Key Points
- The summary() function is used to provide a quick overview of numeric data frames.
- The same function in R can give different results depending on the data type.
- The mean and standard deviation of a vector are calculated using the mean() and sd() functions respectively.
- apply() can be used to extend the mean() and sd() functions to perform summary statistics on data frames or matrices.
- cor() is used to calculate the correlation coefficient between two vectors or, if a matrix is provided, pairwise correlation coefficients between variables.
- It is important to know what type of data one has on hand, as functions require specific data types as inputs.
- Default options might not always be the most appropriate for the data at hand, so it is important to know what the defaults are when writing scripts.