R data types

Overview

Teaching: 15 min
Exercises: 10 min
Learning Objectives
  • Recognise the different basic data types in R.

  • Be able to index and subset different classes of data in R.

  • Be able to read in and explore data in data frames.

R data has types

Objects in R represent data, and data comes in different types. The data type may be a simple type, such as numeric, logical, or character, or a complex type, such as a data frame, a list, or a more sophisticated object. In R, variable typing is dynamic: you don’t have to specify the type of a variable before you assign a value to it. But the data type is important, because it determines what operations are valid: you can take the sum of numeric data, but not character data.
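As a small illustration of why the type matters, the sketch below sums a numeric vector and then attempts the same with a character vector. The variable names are made up for this example, and tryCatch() is used only so that the script keeps running after the error.

```r
# Hypothetical example values, not part of the lesson data
numbers <- c(1, 2, 3)
words <- c("1", "2", "3")

# Numeric data can be summed
total <- sum(numbers)
total

# sum() on character data raises an error; catch it and keep the message
result <- tryCatch(sum(words), error = function(e) conditionMessage(e))
result
```

The second call fails even though the strings look like numbers: R does not guess that character data should be treated as numeric.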

Atomic data types

The simplest data in R is atomic: it cannot be broken down into smaller pieces of data. R has six atomic data types:

logical
Logical data is either TRUE or FALSE
integer
integer data should be self-explanatory, but numeric data is represented as real numbers unless you ask for it as integer. As far as R is concerned, 1 is a real number, but 1L is an integer.
numeric
numeric data is the default for real numbers.
complex
R supports complex numbers as a basic data type.
character
A string, like “this is a string”, is called “character” data.
raw
The “raw” type holds raw bytes. It is not something we are likely to use.

Most functions expect data to be of a certain type, but they may be written to handle different types. Remember sqrt()? Given a real value as its argument, it returns a real value as a result, which is why the square root of -1 produces NaN. But if you want to deal with complex numbers, you can:

sqrt(-1)
Warning in sqrt(-1): NaNs produced
[1] NaN
# Try it with a complex number
sqrt(as.complex(-1))
[1] 0+1i

Object classes in R

Every object in R has a “class”, which affects how the object behaves. You can find the class of an object foo with class(foo). You can even use class() on simple numbers:

class(1)
[1] "numeric"
class(1L)
[1] "integer"
class(as.complex(1L))
[1] "complex"
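The same check can be run on one value of each of the six atomic types listed earlier. This is only an illustrative sketch; the particular values are arbitrary.

```r
class(TRUE)        # "logical"
class(2L)          # "integer"
class(2.5)         # "numeric"
class(1 + 2i)      # "complex"
class("two")       # "character"
class(as.raw(2))   # "raw"
```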

Vectors

Perhaps more surprisingly, all data in R is actually stored as vectors. A simple number, for example, is a vector of length 1. So is a simple quoted string. Try it!
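One way to try it, sketched here with arbitrary values, is to ask for the length of a single number or string:

```r
length(42)        # 1: a single number is a vector of length 1
length("hello")   # 1: so is a single quoted string, whatever its number of characters
is.vector(42)     # TRUE
```

Note that length() counts the elements of the vector, not the characters in the string.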

In our previous examples on classes, every result was preceded by a [1]; that indicates the index of the first entry of the vector. The index is shown even though the vector only has one element! If the vector were long enough to spill over multiple lines, the index of the first entry on each line would be indicated. For example, the integers from 1 to 100 (inclusive) can be represented by the convenient shorthand:

x <- 1:100
x
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

To access an element of a vector, put the index of the desired element in square brackets, e.g., x[10] for the tenth element of x. Indexing in R starts at 1, so the first element is x[1].

A vector contains data of a single basic data type: one of the atomic types listed earlier (logical, integer, numeric, complex, character, or raw).

A bit more about basic data types

Logical data can only take on TRUE or FALSE values, and are commonly encountered when we test if a certain condition is fulfilled. For example:

5 > 4
[1] TRUE
class(5 > 4)
[1] "logical"

The numeric data type is used for ordinary real numbers, including integers. You can also use the integer data type in the rare case where you know that’s appropriate and useful. The default in R, however, is for numeric data to use the numeric data type, even if handed an integer. You can force the conversion with an “L” after the number, or by using as.integer().

x <- 3.14
class(x)
[1] "numeric"
class(3)
[1] "numeric"
class(3L)
[1] "integer"
class(as.integer(3))
[1] "integer"

Explicit and implicit conversion

Explicitly converting a real number to an integer truncates, rather than rounds. For example:

as.integer(3.6)
[1] 3

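This behavior is important to bear in mind, because many people expect a conversion to integer to round to the nearest whole number. If rounding is what you want, use round() instead.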
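A short sketch of the difference between truncating and rounding, using arbitrary values:

```r
as.integer(3.6)     # 3: the fractional part is dropped
as.integer(-3.6)    # -3: truncation is toward zero, not downward
round(3.6)          # 4: round() gives the nearest whole number
class(round(3.6))   # "numeric": round() does not change the type
```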

R will implicitly convert (coerce) data between different classes to try to Do The Right Thing. For example, adding an integer to a numeric will promote the integer to a numeric, and result in a numeric, even though adding two integers would result in an integer.

class(3L + 3L)
[1] "integer"
class(3L + 3)
[1] "numeric"
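Implicit conversion also applies to other types and operators. The following sketch, with values chosen only for illustration, shows logical values being promoted in arithmetic and integer division producing a numeric result:

```r
TRUE + TRUE          # 2: TRUE is treated as 1 in arithmetic
class(TRUE + TRUE)   # "integer": logicals are promoted to integers
3L / 2L              # 1.5
class(3L / 2L)       # "numeric": dividing two integers returns a numeric
```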

A user can often force explicit conversion between different classes of data using the as.*() functions. For example, we were able to convert a numeric value to an integer using as.integer(). R will try to convert the class accordingly, but where it cannot do so, it will return NA with a warning message. For example:

as.integer("TRUE")
[1] NA
Warning message:
NAs introduced by coercion

This conversion cannot be done because the character string “TRUE” cannot be parsed as a number, so R returns NA instead. (The logical value TRUE, written without quotes, does convert: it becomes 1.)
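For contrast, here is a sketch of a few explicit conversions that do succeed, alongside the failing one. The values are arbitrary, and suppressWarnings() is used only to keep the output quiet.

```r
as.integer(TRUE)     # 1: the logical value converts cleanly
as.integer("42")     # 42: a string that looks like a number can be parsed
as.numeric("3.14")   # 3.14
as.character(42L)    # "42"

# The quoted string "TRUE" still becomes NA
is.na(suppressWarnings(as.integer("TRUE")))   # TRUE
```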

Working with vectors

Creating vectors

Vectors can be created from elements by combining them using the c() function, which takes an arbitrary number of arguments and combines them into a vector. If they are of different types, c() coerces them to a common type, choosing the most general type present (for example, character if any element is a string). Coercion is often one-way: the number 1 can be coerced to the string “1”, but the string “hello” cannot be coerced to a number, and an explicit attempt to do so yields NA with a warning.

# The number will be coerced to a string
c("hello",1)
[1] "hello" "1"    
# The string "hello" cannot be coerced to a number
as.integer("hello")
Warning: NAs introduced by coercion
[1] NA
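The direction of coercion inside c() can be seen by mixing types step by step. This is an illustrative sketch with arbitrary values:

```r
class(c(TRUE, 2L))          # "integer": logical is promoted to integer
class(c(2L, 3.5))           # "numeric": integer is promoted to numeric
class(c(3.5, "hello"))      # "character": numeric is converted to a string
c(TRUE, 2L, 3.5, "hello")   # "TRUE" "2" "3.5" "hello"
```

In each case the result takes the most general type present, so no information is lost, although numbers stored as strings can no longer be used in arithmetic.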

Lists

Ordinary vectors in R can only contain simple data types, including the data types shown above, and the elements of an ordinary vector must all be of the same basic data type. Lists relax both these restrictions: a list can contain elements of any data type, even complex data types such as other lists, and the elements of a list can be of different data types. Each list element can be a different length. This allows for important and powerful abstractions.
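A minimal sketch of that flexibility, using a made-up list called mixed whose elements differ in both type and length:

```r
# Hypothetical list, for illustration only
mixed <- list(1:3, "a string", c(TRUE, FALSE), list("nested", 2))

length(mixed)           # 4
sapply(mixed, class)    # "integer" "character" "logical" "list"
sapply(mixed, length)   # 3 1 2 2
```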

Suppose we have three character vectors, each containing the names of different types of organisms.

birds <- c("myna","sparrow","swift","robin")
insects <- c("cockroach","butterfly","caterpillar")
mammals <- c("hamster","rat","human") 

# Combine these into a list.
animals <- list(birds, insects, mammals)
animals
[[1]]
[1] "myna"    "sparrow" "swift"   "robin"  

[[2]]
[1] "cockroach"   "butterfly"   "caterpillar"

[[3]]
[1] "hamster" "rat"     "human"  

As you can see, each element of animals is a character vector. We can give them names:

names(animals) <- c("birds", "insects", "mammals")
animals
$birds
[1] "myna"    "sparrow" "swift"   "robin"  

$insects
[1] "cockroach"   "butterfly"   "caterpillar"

$mammals
[1] "hamster" "rat"     "human"  

The addition of names to the elements of the list facilitates subsetting, which will be discussed later. Objects with named items function a lot like Python dictionaries.

Matrices

A matrix is a two-dimensional vector that has rows and columns. Like vectors, all the entries in a matrix must be the same basic data type – for example, numeric, integer, logical or character. If there is more than one data type, R will simply convert everything to a compatible data type. A matrix can be created with the matrix() function, which has the general usage shown here:

args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) 
NULL

The nrow and ncol arguments specify the number of rows and columns respectively, while the byrow argument indicates whether the matrix will be filled row-wise or column-wise. The behavior of the byrow argument can be seen below:

matrix(1:20, nrow=2, ncol=10, byrow=TRUE)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20
matrix(1:20, nrow=2, ncol=10, byrow=FALSE)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    3    5    7    9   11   13   15   17    19
[2,]    2    4    6    8   10   12   14   16   18    20
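A smaller sketch makes the fill order easy to check by hand. The names m_col and m_row are made up for this example, and the [row, column] indexing used here is covered in the subsetting section.

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # filled column-wise (the default)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled row-wise

dim(m_col)    # 2 3
m_col[1, 2]   # 3
m_row[1, 2]   # 2

# Mixing types coerces every entry, just as c() does
class(matrix(c(1, "a", TRUE, 2), nrow = 2)[1, 1])   # "character"
```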

Data frames

A data frame is the data structure you will probably use more than any other. You can think of a data frame as a hybrid of a list and a matrix. Like a matrix, a data frame has both rows and columns. But unlike a matrix, data frame columns can be of different types, and columns are often referred to by name, like the elements of a list. If you prefer, you can think of a data frame as a list of vectors, where all elements of the list have the same length. One can create a data frame using the data.frame() function, which has the usage

args(data.frame)
function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, 
    fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors()) 
NULL

For example, the following constructs a data frame in which we explicitly give names to both the rows and the columns.

data.frame(Gender=c("male","female"),
           Count=c(10,5),
           row.names=c("M","F"))
  Gender Count
M   male    10
F female     5
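Once a data frame exists, a few functions help confirm its shape and column types. In this sketch the name counts is made up, and stringsAsFactors = FALSE is set explicitly because its default differs between R versions.

```r
counts <- data.frame(Gender = c("male", "female"),
                     Count = c(10, 5),
                     row.names = c("M", "F"),
                     stringsAsFactors = FALSE)   # keep Gender as character in any R version

nrow(counts)            # 2
ncol(counts)            # 2
rownames(counts)        # "M" "F"
sapply(counts, class)   # Gender is "character", Count is "numeric"
```

The last line shows the key difference from a matrix: each column keeps its own type.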

Subsetting in R

How can we refer to subsets, or specific entries, of a vector, list, matrix or data frame?

In R, subsetting can be done using [n], where n is the index of the entry that one wishes to extract. It is also worth noting that in R, indexing starts from 1 and not from zero as in some other languages (such as C). For example,

demoVector <- 1:10
demoVector[5] 
[1] 5
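The index inside the brackets does not have to be a single number. The following sketch shows a few common variations on the same demoVector:

```r
demoVector <- 1:10
demoVector[c(2, 4)]          # 2 4: several entries at once
demoVector[8:10]             # 8 9 10: a range of entries
demoVector[-1]               # everything except the first entry
demoVector[demoVector > 7]   # 8 9 10: entries that satisfy a condition
```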

For two-dimensional objects such as data frames and matrices, one uses [m,n] to extract the entry in the m-th row and n-th column.

For example, consider the following data frame:

df <- data.frame(A=1:10,B=11:20,C=21:30)
df
    A  B  C
1   1 11 21
2   2 12 22
3   3 13 23
4   4 14 24
5   5 15 25
6   6 16 26
7   7 17 27
8   8 18 28
9   9 19 29
10 10 20 30

A few operations of extraction are illustrated below.

## Extract a single entry
df[1,1]
[1] 1
## Extract a row
df[1,]
  A  B  C
1 1 11 21
## Extract a column as a vector
df[,1]
 [1]  1  2  3  4  5  6  7  8  9 10
## Extract a column by name, still as a vector
df[,"A"]
 [1]  1  2  3  4  5  6  7  8  9 10
## Extract a column by name with [[ ]], as a vector
df[["A"]]
 [1]  1  2  3  4  5  6  7  8  9 10
## Extract a column by name with $, as a vector
df$A
 [1]  1  2  3  4  5  6  7  8  9 10
## Extract a column by name, as a data frame!
df['A'] 
    A
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10
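Checking the class of each result makes the difference between these forms explicit. This sketch rebuilds the same df; the drop = FALSE argument is an addition not used in the examples above.

```r
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)

class(df[, "A"])                 # "integer": a plain vector
class(df[["A"]])                 # "integer"
class(df["A"])                   # "data.frame"
class(df[, "A", drop = FALSE])   # "data.frame": drop = FALSE keeps the result two-dimensional
class(df[1, ])                   # "data.frame": a row keeps its columns
```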

Subsetting your data frame

Create your own data frame df similar to what we did above. Try out the extraction operations above. Now see if you can extract two columns instead of one. Hint: use the c() function.

Solution

df[c("A","B")]
    A  B
1   1 11
2   2 12
3   3 13
4   4 14
5   5 15
6   6 16
7   7 17
8   8 18
9   9 19
10 10 20
df[1:2]
    A  B
1   1 11
2   2 12
3   3 13
4   4 14
5   5 15
6   6 16
7   7 17
8   8 18
9   9 19
10 10 20
df[c(1,3)]
    A  C
1   1 21
2   2 22
3   3 23
4   4 24
5   5 25
6   6 26
7   7 27
8   8 28
9   9 29
10 10 30

Working with list subsets and extractions

Create an animals list as above.

  • How many ways can you successfully extract “butterfly” from this list?
  • Can you add an element to the insects vector without knowing its length?

Solution

animals[[2]][2]
[1] "butterfly"
animals$insects[2]
[1] "butterfly"
animals[['insects']][2] 
[1] "butterfly"
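## This one doesn't work. Why not? Single brackets return a list of length 1,
## so [2] asks for a second list element that does not exist.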
animals['insects'][2] 
$<NA>
NULL
## Append to insects without knowing its length
animals[['insects']] <- c(animals$insects, "honey bee")
animals
$birds
[1] "myna"    "sparrow" "swift"   "robin"  

$insects
[1] "cockroach"   "butterfly"   "caterpillar" "honey bee"  

$mammals
[1] "hamster" "rat"     "human"  
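The difference between single and double brackets on a list can be checked directly. This sketch rebuilds the named animals list in one step and inspects what each form returns:

```r
animals <- list(birds = c("myna", "sparrow", "swift", "robin"),
                insects = c("cockroach", "butterfly", "caterpillar"),
                mammals = c("hamster", "rat", "human"))

class(animals["insects"])     # "list": single brackets return a shorter list
length(animals["insects"])    # 1: that list has only one element
class(animals[["insects"]])   # "character": double brackets return the element itself
animals["insects"][[1]][2]    # "butterfly": unwrap the list first, then index
```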


Key Points

  • Both data frames and matrices are two dimensional objects with rows and columns, but data frame columns can be of different types.

  • Subsetting can be done with [] or [[]] in R.