Dependencies

This document has no dependencies.

Corrections

Improvements and corrections to this document can be submitted through its GitHub repository.

Overview

A very brief overview of core R object types and how to subset them.

Other Resources

Atomic Vectors

The most basic object in R is an atomic vector. Examples include numeric, integer, logical, character and factor. These objects have a single length and can have names, which can be used for indexing:

x <- 1:10
names(x) <- letters[1:10]
class(x)
## [1] "integer"
x[1:3]
## a b c 
## 1 2 3
x[c("a", "b")]
## a b 
## 1 2
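
Besides positions and names, vectors can also be indexed with logical conditions and with negative positions. The following is a minimal sketch using the same kind of named vector as above; the variable names `by_condition` and `by_exclusion` are made up for illustration.

```r
x <- 1:10
names(x) <- letters[1:10]

# Logical indexing: keep the elements for which the condition is TRUE
by_condition <- x[x > 8]

# Negative indexing: drop the first eight positions
by_exclusion <- x[-(1:8)]

# Both select the elements named "i" and "j"
by_condition
by_exclusion
```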

The following types of atomic vectors are used frequently: numeric, integer, character, factor and logical.

All vectors can have missing values.
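
As a small illustration of missing values, the sketch below uses a made-up numeric vector `v`. It shows how `NA` is detected and how it propagates through a summary unless it is explicitly removed.

```r
# Hypothetical vector with one missing value
v <- c(2, NA, 4)

# Which elements are missing?
missing_flags <- is.na(v)

# Missing values propagate: the mean of v is NA
mean_with_na <- mean(v)

# Remove missing values before computing the mean
mean_without_na <- mean(v, na.rm = TRUE)
```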

Note: names of vectors do not need to be unique. This can lead to subsetting problems:

x <- 1:3
names(x) <- c("A", "A", "B")
x
## A A B 
## 1 2 3
x["A"]
## A 
## 1

Note that you don’t even get a warning, so watch out for non-unique names! You can check for unique names by using the functions unique, duplicated or (easiest) anyDuplicated.

anyDuplicated(names(x))
## [1] 2
names(x) <- c("A", "B", "C")
anyDuplicated(names(x))
## [1] 0

anyDuplicated returns the index of the first duplicated name, so 0 indicates nothing is duplicated.
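
If you really do want every element that carries a given name, one option is to compare against the names directly. This is a sketch with a made-up vector `y`, not a recommendation over fixing the names themselves.

```r
# Hypothetical vector with a duplicated name
y <- c(A = 1L, A = 2L, B = 3L)

# Name-based indexing returns only the first match
first_match <- y["A"]

# Comparing against names() returns every match
all_matches <- y[names(y) == "A"]

# duplicated() flags the second and later occurrences of each name
dup_flags <- duplicated(names(y))
```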

Integers in R

The default in R is to represent numbers as numeric, NOT integer. This can usually be ignored, but you might run into some issues with it in Bioconductor. Note that even constructions that look like integers, such as 1, are really numeric, whereas sequences created with : are integer:

x <- 1
class(x)
## [1] "numeric"
x <- 1:3
class(x)
## [1] "integer"

The way to make sure you get an integer in R is to append L to the number:

x <- 1L
class(x)
## [1] "integer"
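
Integers can silently turn back into numerics during arithmetic. The sketch below (variable names are arbitrary) shows which common operations keep the integer type and which do not.

```r
a <- 1L + 1L    # integer + integer stays integer
b <- 1L + 1     # mixing with a numeric gives numeric
d <- 5L / 2L    # ordinary division always gives numeric
e <- 5L %/% 2L  # integer division stays integer

classes <- c(class(a), class(b), class(d), class(e))
classes
```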

So why distinguish between integer and numeric? Internally, computers represent and calculate with integers and numerics differently.

One consequence you can sometimes run into in Bioconductor is that integers have a maximum value. The maximum integer is

.Machine$integer.max
## [1] 2147483647
2^31 -1 == .Machine$integer.max
## [1] TRUE
round(.Machine$integer.max / 10^6, 1)
## [1] 2147.5

This number is smaller than the number of bases in the human genome, so we sometimes (accidentally) add up numbers which exceed it. The fix is to use as.numeric to convert the integer to numeric.
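
A minimal sketch of the overflow and the as.numeric fix follows. The warning R normally issues on overflow is suppressed here only so the example runs quietly; the variable names are arbitrary.

```r
big <- .Machine$integer.max

# Integer arithmetic past the maximum yields NA (normally with a warning)
overflowed <- suppressWarnings(big + 1L)
is.na(overflowed)

# Converting to numeric first avoids the overflow
ok <- as.numeric(big) + 1
ok == 2^31
```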

The maximum integer is also the traditional limit for how long an atomic vector can be, so an ordinary vector cannot be as long as the human genome. Since version 3.0.0, R has support for so-called “long vectors”, which basically are … long vectors. But the support for long vectors is not yet pervasive.

Matrices

A matrix is a two-dimensional object. All values in a matrix have to have the same type (numeric or character or any of the other atomic vector types). It is optional to have rownames or colnames, and these names do not have to be unique.

x <- matrix(1:9, ncol = 3, nrow = 3)
rownames(x) <- c("A","B", "B")
x
##   [,1] [,2] [,3]
## A    1    4    7
## B    2    5    8
## B    3    6    9
dim(x)
## [1] 3 3
nrow(x)
## [1] 3
ncol(x)
## [1] 3

Subsetting is two-dimensional; the first dimension is rows and the second is columns. You can even subset with a matrix of the same dimension, but watch out for the return object.

x[1:2,]
##   [,1] [,2] [,3]
## A    1    4    7
## B    2    5    8
x["B",]
## [1] 2 5 8
x[x >= 5]
## [1] 5 6 7 8 9

(Note how subsetting with a non-unique name does not lead to an error; only the first matching row is returned.) If you grab a single row or a single column from a matrix you get a vector. Sometimes, it is really nice to get a matrix; you do that by using drop=FALSE in the subsetting:

x[1,]
## [1] 1 4 7
x[1,,drop=FALSE]
##   [,1] [,2] [,3]
## A    1    4    7
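
The difference in return type can be checked with dim(). The sketch below uses a fresh matrix `m` (a made-up name) and also shows that subsetting with a logical matrix returns a plain vector, not a matrix.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

row_vec <- m[1, ]                # dimensions are dropped
row_mat <- m[1, , drop = FALSE]  # still a 1 x 3 matrix

dim(row_vec)  # NULL
dim(row_mat)  # 1 3

# Subsetting with a logical matrix also returns a plain vector
picked <- m[m >= 5]
is.matrix(picked)  # FALSE
```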

There are a lot of mathematical operations working on matrices, for example rowSums, colSums and things like eigen for eigenvector decomposition. I am a heavy user of the package matrixStats for the full suite of rowXX and colXX with XX being any standard statistical function such as sd(), var(), quantile() etc.
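
As a rough sketch of row-wise and column-wise summaries using only base R: apply() stands in here for the dedicated matrixStats functions (such as rowSds), which are typically faster but require the package to be installed. The matrix `m` is made up for illustration.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

row_totals <- rowSums(m)       # 12 15 18
col_totals <- colSums(m)       # 6 15 24

# Row-wise standard deviation via apply(); MARGIN = 1 means "over rows"
row_sds <- apply(m, 1, sd)     # 3 3 3
```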

Internally, a matrix is just a vector with a dimension attribute. In R we have column-first orientation, so the columns are filled up first:

matrix(1:9, 3, 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
matrix(1:9, 3, 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
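
To make the "vector with a dimension attribute" point concrete, this sketch turns a plain vector into a matrix simply by assigning to dim(). The name `v` is arbitrary.

```r
v <- 1:6
is.matrix(v)        # FALSE: just a vector

dim(v) <- c(2, 3)   # add a dimension attribute
is.matrix(v)        # TRUE: now a 2 x 3 matrix, filled column by column

v[1, 2]             # 3: row 1, column 2
v[3]                # 3: the same element via the underlying vector
```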

Lists

Lists are like vectors, but can hold objects of arbitrary kinds.

x <- list(1:3, letters[1:3], is.numeric)
x
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
## function (x)  .Primitive("is.numeric")
names(x) <- c("numbers", "letters", "function")
x[1:2]
## $numbers
## [1] 1 2 3
## 
## $letters
## [1] "a" "b" "c"
x[1]
## $numbers
## [1] 1 2 3
x[[1]]
## [1] 1 2 3

See how subsetting creates another list. To get to the actual content of the first element, you need double brackets [[. The distinction between [ and [[ is critical to understand.
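
One way to see why the distinction matters is to pass each result to a function that expects a vector. This is a small sketch with a made-up list `lst`; try() is used only to capture the error without stopping the script.

```r
lst <- list(numbers = 1:3, letters = c("a", "b", "c"))

class(lst[1])     # "list": a list of length one
class(lst[[1]])   # "integer": the element itself

total <- sum(lst[[1]])   # works on the integer vector

# sum() cannot work on a list, so single brackets fail here
failed <- inherits(try(sum(lst[1]), silent = TRUE), "try-error")
```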

You can use $ on a named list. However, R has something called “partial” matching for $:

x$letters
## [1] "a" "b" "c"
x["letters"]
## $letters
## [1] "a" "b" "c"
x$let
## [1] "a" "b" "c"
x["let"]
## $<NA>
## NULL
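
Double brackets sit between these two behaviours: by default they match names exactly, but partial matching can be requested with the exact argument. A brief sketch, using a made-up list `lst`:

```r
lst <- list(numbers = 1:3, letters = c("a", "b", "c"))

via_dollar  <- lst$let                        # partial match succeeds
via_exact   <- lst[["let"]]                   # NULL: exact matching by default
via_partial <- lst[["let", exact = FALSE]]    # partial match on request
```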

Trick: sometimes you want a list where each element is a single number. Use as.list:

as.list(1:3)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
list(1:3)
## [[1]]
## [1] 1 2 3

lapply and sapply

It is quite common to have a list where each element is of the same kind, for example a numeric vector. You can apply a function to each element in the list by using lapply(); this returns another list which is named if the input is.

x <- list(a = rnorm(3), b = rnorm(3))
lapply(x, mean)
## $a
## [1] -0.6467477
## 
## $b
## [1] 0.6932006

If the output of the function is of the same kind, you can simplify the output using sapply (simplify apply). This is particularly useful if the function in question returns a single number.

sapply(x, mean)
##          a          b 
## -0.6467477  0.6932006
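
Because sapply decides the output shape from the results, its return type can vary. The sketch below uses fixed (non-random) made-up data to show two shapes, and also shows vapply, a stricter base R relative not discussed above, which lets you state the expected result type up front.

```r
vals <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# One number per element: sapply returns a named vector
means <- sapply(vals, mean)

# Two numbers per element: sapply returns a matrix instead
ranges <- sapply(vals, range)

# vapply requires each result to be a single numeric value
means_checked <- vapply(vals, mean, numeric(1))
```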

Data frames

data.frames are fundamental to data analysis. They look like matrices, but each column can be a separate type, so you can mix and match different data types. Row names must be unique, and by default data.frame() makes column names unique as well. If no rownames are given, 1:nrow is used.

x <- data.frame(sex = c("M", "M", "F"), age = c(32,34,29))
x
##   sex age
## 1   M  32
## 2   M  34
## 3   F  29

You access columns by $ or [[:

x$sex
## [1] M M F
## Levels: F M
x[["sex"]]
## [1] M M F
## Levels: F M

Note how sex was converted into a factor. This is a frequent source of errors, so much so that I highly encourage users to make sure they never have factors in their data.frames. This conversion can be disabled with stringsAsFactors=FALSE (which became the default in R 4.0.0):

x <- data.frame(sex = c("M", "M", "F"), age = c(32,34,29), stringsAsFactors = FALSE)
x$sex
## [1] "M" "M" "F"
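
One example of the kind of error factors can cause, offered as an illustration rather than as the specific problem the text has in mind: converting a factor of number-like strings with as.numeric returns the internal level codes, not the numbers. The factor `f` is made up.

```r
f <- factor(c("10", "20", "10"))

# Returns the level codes, not the values
codes <- as.numeric(f)

# Going through character first recovers the actual numbers
values <- as.numeric(as.character(f))
```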

Behind the scenes, a data.frame is really a list. Why does this matter? Well, for one, it allows you to use lapply and sapply across the columns:

sapply(x, class)
##         sex         age 
## "character"   "numeric"
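
The list nature of a data.frame can be checked directly. In this sketch the data.frame is rebuilt under the made-up name `df` so the block is self-contained.

```r
df <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29),
                 stringsAsFactors = FALSE)

is.list(df)    # TRUE
length(df)     # 2: the number of columns, i.e. list elements

# Each list element is one column, all of the same length
col_lengths <- sapply(df, length)
```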

Conversion

We often have to convert R objects from one type to another. For basic R types (as described above), you have the as.XX family of functions, with XX being all the types of objects listed above.

x
##   sex age
## 1   M  32
## 2   M  34
## 3   F  29
as.matrix(x)
##      sex age 
## [1,] "M" "32"
## [2,] "M" "34"
## [3,] "F" "29"
as.list(x)
## $sex
## [1] "M" "M" "F"
## 
## $age
## [1] 32 34 29

When we convert the data.frame to a matrix it becomes a character matrix, because there is a character column and this is the only way to keep the contents.
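
If only numeric columns are selected before converting, the result stays numeric. A short sketch, again rebuilding the data.frame under the made-up name `df`:

```r
df <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29),
                 stringsAsFactors = FALSE)

# Mixed columns: everything is coerced to character
mixed <- as.matrix(df)
mode(mixed)   # "character"

# Numeric column only (drop = FALSE keeps it a data.frame)
numeric_only <- as.matrix(df[, "age", drop = FALSE])
mode(numeric_only)   # "numeric"
```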

For more “complicated” objects there is a suite of as() functions, which you use as follows:

library(methods)
as(x, "matrix")
##      sex age 
## [1,] "M" "32"
## [2,] "M" "34"
## [3,] "F" "29"

This is how you convert most Bioconductor objects.

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] BiocStyle_1.6.0 rmarkdown_0.7  
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.1     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   knitr_1.11      stringr_1.0.0  
##  [9] digest_0.6.8    evaluate_0.7.2