R data types
Overview
Teaching: 15 min
Exercises: 10 min
Learning Objectives
Recognise the different basic data types in R.
Be able to index and subset different classes of data in R.
Be able to read in and explore data in data frames.
R data has types
Objects in R represent data, and data comes in different types. The data type may be a simple type, such as numeric, logical, or character, or a complex type such as a data frame, a list, or a more sophisticated object. In R, variable typing is dynamic: you don’t have to specify the type of a variable before you assign a value to it. But the data type is important, because it determines what operations are valid: you can take the sum of numeric data, but not of character data.
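As a minimal sketch of both points (the variable names here are illustrative only), the same name can be rebound to a value of a different type, and sum() accepts numeric data but signals an error for character data:

```r
# Dynamic typing: the same name can hold values of different types over time.
value <- 42
class_before <- class(value)   # "numeric"
value <- "forty-two"
class_after <- class(value)    # "character"

# sum() works on numeric data ...
total <- sum(c(1, 2, 3))       # 6

# ... but signals an error for character data.
# tryCatch() captures the error message instead of stopping the script.
sum_error <- tryCatch(
  sum(c("a", "b")),
  error = function(e) conditionMessage(e)
)
print(sum_error)
```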
Atomic data types
The simplest data in R is atomic: it cannot be broken down into smaller pieces of data. R has six atomic data types:
- logical: Logical data is either TRUE or FALSE.
- integer: Integer data should be self-explanatory, but numbers are stored as real numbers unless you ask for an integer. As far as R is concerned, 1 is a real number, but 1L is an integer.
- numeric: Numeric data is the default for real numbers.
- complex: R supports complex numbers as a basic data type.
- character: A string, like “this is a string”, is called “character” data.
- raw: The “raw” type is not something we are likely to use.
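One way to see all six types side by side is to build one example value of each and ask R for its class. This is only a sketch; the list name `examples` and the particular values are arbitrary choices.

```r
# One example value for each of the six atomic data types.
examples <- list(
  logical   = TRUE,
  integer   = 1L,
  numeric   = 1.5,
  complex   = 1 + 2i,
  character = "this is a string",
  raw       = as.raw(255)
)

# class() reports the type label used in this lesson for each value.
classes <- vapply(examples, class, character(1))
print(classes)

# Note: typeof(1.5) reports "double", the storage type behind class "numeric".
print(typeof(examples$numeric))
```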
Most functions expect data to be of a certain type, but they may be written to handle different types. Remember sqrt()? Given a real value as its argument, it returns a real value as its result. But if you want to deal with complex numbers, you can:
sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN
# Try it with a complex number
sqrt(as.complex(-1))
[1] 0+1i
Object classes in R
All objects in R have an attribute called a “class”, which affects how the object behaves. You can find the class of an object foo with class(foo). You can even use class() on simple numbers:
class(1)
[1] "numeric"
class(1L)
[1] "integer"
class(as.complex(1L))
[1] "complex"
Vectors
Perhaps more surprisingly, all data in R is actually stored as vectors. A simple number, for example, is a vector of length 1. So is a simple quoted string. Try it!
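A quick way to try this is to ask for the length of a single number and of a single string (a sketch; the variable names are illustrative):

```r
# A single number and a single string are both vectors of length 1.
n_number <- length(1)
n_string <- length("a simple quoted string")
number_is_vector <- is.vector(1)

print(c(n_number, n_string))   # 1 1
print(number_is_vector)        # TRUE
```

Note that length() counts the elements of the vector, not the characters in the string.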
In our previous examples on classes, every result was preceded by a [1]; that indicates the index of the first entry of the vector. The index is shown even though the vector only has one element! If the vector were long enough to spill over multiple lines, the index of the first entry on each line would be indicated. For example, the integers from 1 to 100 (inclusive) can be represented by the convenient shorthand:
x <- 1:100
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
To access an element of a vector, put the index of the desired element in square brackets, e.g., x[10] for the tenth element of x. Indexing in R starts at 1: the first element of a vector is element 1.
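The indexing rules can be checked directly on the vector of integers from 1 to 100. This is a small sketch; the result names are illustrative.

```r
x <- 1:100

tenth <- x[10]         # the tenth element: 10
first <- x[1]          # indexing starts at 1, so this is 1
ends  <- x[c(1, 100)]  # several indices at once: 1 100
none  <- x[0]          # index 0 selects nothing: integer(0)

print(tenth)
print(ends)
print(length(none))    # 0
```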
A vector contains data of a single basic data type. The basic data types are described in more detail below.
A bit more about basic data types
Logical data can only take on TRUE or FALSE values, and are commonly encountered when we test if a certain condition is fulfilled. For example:
5 > 4
[1] TRUE
class(5 > 4)
[1] "logical"
The numeric data type is used for ordinary real numbers, including integers. You can also use the integer data type in the rare case where you know that’s appropriate and useful. The default in R, however, is to store numbers as the numeric data type, even if handed an integer. You can force the conversion with an “L” after the number, or by using as.integer().
> x <- 3.14
> class(x)
[1] "numeric"
> class(3)
[1] "numeric"
> class(3L)
[1] "integer"
> class(as.integer(3))
[1] "integer"
Explicit and implicit conversion
Explicitly converting a real number to an integer truncates, rather than rounds. For example:
as.integer(3.6)
[1] 3
Bear this behavior in mind, because it is easy to expect rounding to the nearest integer instead. If rounding is what you want, use round().
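The difference between truncating and rounding shows up most clearly with negative numbers. A short sketch (variable names are illustrative):

```r
# as.integer() truncates toward zero; it does not round.
pos_trunc <- as.integer(3.6)    # 3
neg_trunc <- as.integer(-3.6)   # -3

# round() goes to the nearest whole number; floor() always goes down.
pos_round <- round(3.6)         # 4
neg_floor <- floor(-3.6)        # -4

print(c(pos_trunc, neg_trunc, pos_round, neg_floor))
```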
R will implicitly convert (coerce) data between different classes to try to Do The Right Thing. For example, adding an integer to a numeric will promote the integer to a numeric, and result in a numeric, even though adding two integers would result in an integer.
> class(3L + 3L)
[1] "integer"
> class(3L + 3)
[1] "numeric"
A user can often force explicit conversion between different classes of data using the as.*() functions. For example, we were able to convert a numeric value to an integer using as.integer(). R will try to convert the value accordingly, but where it cannot do so, it returns NA with a warning message. For example:
> as.integer("TRUE")
[1] NA
Warning message:
NAs introduced by coercion
This conversion cannot be done because the character string "TRUE" cannot be read as a number, so R returns NA instead. (The logical value TRUE, without quotes, does convert: as.integer(TRUE) is 1.)
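The sketch below contrasts a few conversions that succeed with the one that fails. suppressWarnings() is used only to keep the output quiet; the variable names are illustrative.

```r
# The logical value TRUE converts to the integer 1.
from_logical <- as.integer(TRUE)

# A string of digits can be parsed as a number.
from_digits <- as.integer("42")

# The string "TRUE" is not a number, so the result is NA (with a warning).
from_string <- suppressWarnings(as.integer("TRUE"))

# The same string converts cleanly to a logical value.
to_logical <- as.logical("TRUE")

print(c(from_logical, from_digits, from_string))
print(to_logical)
```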
Working with vectors
Creating vectors
Vectors can be created from elements by combining them using the c() function, which takes an arbitrary number of arguments and combines them into a vector. If they are of different types, c() coerces them all to the most flexible type present, in the order logical, integer, numeric, complex, character. Coercion is often one-way: the number 1 can be coerced to the string “1”, but the string “hello” cannot be coerced to a number, and trying to do so explicitly gives NA with a warning.
# The number will be coerced to a string
c("hello",1)
[1] "hello" "1"
# The string "hello" cannot be coerced to a number
as.integer("hello")
Warning: NAs introduced by coercion
[1] NA
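To see which type wins when c() combines mixed elements, it can help to check the class of a few small combinations. A sketch with illustrative names:

```r
# Each combination is coerced to the most flexible type it contains.
lgl_int <- class(c(TRUE, 1L))       # "integer"
int_num <- class(c(1L, 2.5))        # "numeric"
num_chr <- class(c(2.5, "hello"))   # "character"

# Logical values become the strings "TRUE"/"FALSE" next to character data.
lgl_chr <- c(TRUE, "hello")         # "TRUE" "hello"

print(c(lgl_int, int_num, num_chr))
print(lgl_chr)
```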
Lists
Ordinary vectors in R can only contain simple data types, including the data types shown above, and the elements of an ordinary vector must all be of the same basic data type. Lists relax both these restrictions: a list can contain elements of any data type, even complex data types such as other lists, and the elements of a list can be of different data types. Each list element can be a different length. This allows for important and powerful abstractions.
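As a minimal sketch of these relaxed rules, the hypothetical list `mixed` below holds elements of different types and lengths, including another list:

```r
# A list whose elements differ in both type and length.
mixed <- list(1L, "two", c(TRUE, FALSE), list(3.5))

# Class of each element.
element_classes <- vapply(mixed, class, character(1))

# Length of each element.
element_lengths <- lengths(mixed)

print(element_classes)  # "integer" "character" "logical" "list"
print(element_lengths)  # 1 1 2 1
```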
Suppose we have three character vectors, each containing the names of different types of organisms.
birds <- c("myna","sparrow","swift","robin")
insects <- c("cockroach","butterfly","caterpillar")
mammals <- c("hamster","rat","human")
# Combine these into a list.
animals <- list(birds, insects, mammals)
animals
[[1]]
[1] "myna" "sparrow" "swift" "robin"
[[2]]
[1] "cockroach" "butterfly" "caterpillar"
[[3]]
[1] "hamster" "rat" "human"
As you can see, each element of animals is a character vector. We can give them names:
names(animals) <- c("birds", "insects", "mammals")
animals
$birds
[1] "myna" "sparrow" "swift" "robin"
$insects
[1] "cockroach" "butterfly" "caterpillar"
$mammals
[1] "hamster" "rat" "human"
The addition of names to the elements of the list facilitates subsetting, which will be discussed later. Objects with named items function a lot like Python dictionaries.
Matrices
A matrix is a two-dimensional vector that has rows and columns. Like vectors, all the entries in a matrix must be of the same basic data type, for example numeric, integer, logical or character. If there is more than one data type, R will simply convert everything to a compatible data type. A matrix can be created with the matrix() function, which has the general usage shown below:
args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
NULL
The nrow and ncol arguments specify the number of rows and columns respectively, while the byrow argument indicates whether the matrix will be filled row-wise or column-wise. The behavior of the byrow argument can be seen below:
matrix(1:20, nrow=2, ncol=10, byrow=TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 6 7 8 9 10
[2,] 11 12 13 14 15 16 17 18 19 20
matrix(1:20, nrow=2, ncol=10, byrow=FALSE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 3 5 7 9 11 13 15 17 19
[2,] 2 4 6 8 10 12 14 16 18 20
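A smaller sketch makes the two fill orders and the type conversion easy to verify by hand. The matrix names are illustrative.

```r
# Default fill is column-wise; byrow = TRUE fills row-wise.
by_col <- matrix(1:6, nrow = 2)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)

print(dim(by_col))     # 2 3
print(by_col[1, 2])    # 3: second column starts with the third value
print(by_row[1, 2])    # 2: first row holds 1, 2, 3

# Mixing types: every entry is converted to a compatible type (character here).
mixed <- matrix(c(1, "a", TRUE, 2), nrow = 2)
print(class(mixed[1, 1]))   # "character"
```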
Data frames
A data frame is the data structure you will probably use more than any other. You can think of a data frame as a hybrid of a list and a matrix. Like a matrix, a data frame has both rows and columns. But unlike a matrix, data frame columns can be of different types, and columns are often referred to by name, like the elements of a list. If you prefer, you can think of a data frame as a list of vectors, where all the vectors have the same length. One can create a data frame using the data.frame() function, which has the usage
args(data.frame)
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
NULL
For example, the following constructs a data frame in which we explicitly give names to both the rows and the columns.
data.frame(Gender=c("male","female"),
Count=c(10,5),
row.names=c("M","F"))
Gender Count
M male 10
F female 5
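The "list of equal-length vectors" view can be checked on the same small example. In this sketch the name `people` is illustrative, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the default of the installed R version.

```r
people <- data.frame(Gender = c("male", "female"),
                     Count = c(10, 5),
                     row.names = c("M", "F"),
                     stringsAsFactors = FALSE)

# Each column keeps its own type.
column_classes <- vapply(people, class, character(1))
print(column_classes)

# A data frame is a list underneath, with matrix-like dimensions.
print(is.list(people))        # TRUE
print(dim(people))            # 2 2

# Row and column names can be used to pick out a single entry.
print(people["F", "Count"])   # 5
```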
Subsetting in R
How can we refer to subsets, or specific entries, of a vector, list, matrix or data frame?
In R, subsetting can be done using [n], where n is the index of the entry that one wishes to extract. It is also worth noting that in R the first entry has index 1, not 0 as in some other languages (such as C). For example,
demoVector <- 1:10
demoVector[5]
[1] 5
For two-dimensional objects such as data frames and matrices, one uses [m,n] to extract the entry in the m-th row and n-th column.
For example, consider the following data frame:
df <- data.frame(A=1:10,B=11:20,C=21:30)
df
A B C
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
A few operations of extraction are illustrated below.
## Extract a single entry
df[1,1]
[1] 1
## Extract a row
df[1,]
A B C
1 1 11 21
## Extract a column as a vector
df[,1]
[1] 1 2 3 4 5 6 7 8 9 10
## Extract a column by name, still as a vector
df[,"A"]
[1] 1 2 3 4 5 6 7 8 9 10
## Extract a column by name, as a vector
df[["A"]]
[1] 1 2 3 4 5 6 7 8 9 10
## Extract a column by name, as a vector
df$A
[1] 1 2 3 4 5 6 7 8 9 10
## Extract a column by name, as a data frame!
df['A']
A
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
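Whether an extraction returns a vector or a data frame can be tested rather than read off the printed output. The following sketch rebuilds the same data frame; the result names are illustrative, and drop = FALSE is a standard argument of the [ operator that keeps the data frame structure.

```r
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)

as_vector <- df[, "A"]                # column as a plain vector
as_frame  <- df["A"]                  # column as a one-column data frame
kept      <- df[, "A", drop = FALSE]  # matrix-style indexing, structure kept

print(is.vector(as_vector))      # TRUE
print(is.data.frame(as_frame))   # TRUE
print(dim(kept))                 # 10 1
```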
Subsetting your data frame
Create your own data frame df similar to what we did above. Try out the extraction operations above. Now see if you can extract two columns instead of one. Hint: use the c() function.
Solution
df[c("A","B")]
A B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
df[1:2]
A B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
df[c(1,3)]
A C
1 1 21
2 2 22
3 3 23
4 4 24
5 5 25
6 6 26
7 7 27
8 8 28
9 9 29
10 10 30
Working with list subsets and extractions
Create an animals list as above.
- How many ways can you successfully extract “butterfly” from this list?
- Can you add an element to the insects without knowing the length?
Solution
animals[[2]][2]
[1] "butterfly"
animals$insects[2]
[1] "butterfly"
animals[['insects']][2]
[1] "butterfly"
## This one doesn't work. Why not?
animals['insects'][2]
$<NA>
NULL
animals[['insects']] <- c(animals$insects, "honey bee")
animals
$birds
[1] "myna" "sparrow" "swift" "robin"
$insects
[1] "cockroach" "butterfly" "caterpillar" "honey bee"
$mammals
[1] "hamster" "rat" "human"
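The failing extraction in the solution can be explained by comparing what single and double brackets return. This sketch rebuilds the list so that it runs on its own; `single` and `double` are illustrative names.

```r
animals <- list(birds   = c("myna", "sparrow", "swift", "robin"),
                insects = c("cockroach", "butterfly", "caterpillar"),
                mammals = c("hamster", "rat", "human"))

# Single brackets return a list (here, a list of length 1).
single <- animals["insects"]
# Double brackets return the element itself (a character vector).
double <- animals[["insects"]]

print(class(single))    # "list"
print(length(single))   # 1, so asking for element 2 finds nothing
print(class(double))    # "character"

# Stepping into the one-element list first makes the extraction work.
print(single[[1]][2])   # "butterfly"
```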
Key Points
Both data frames and matrices are two dimensional objects with rows and columns, but data frame columns can be of different types.
Subsetting can be done with [] or [[]] in R.