This document has no dependencies.
Improvements and corrections to this document can be submitted through its GitHub repository.
A very brief overview of core R object types and how to subset them.
The most basic object in R is an atomic vector. Examples include numeric, integer, logical, character and factor. These objects have a single length and can have names, which can be used for indexing:
x <- 1:10
names(x) <- letters[1:10]
class(x)
## [1] "integer"
x[1:3]
## a b c
## 1 2 3
x[c("a", "b")]
## a b
## 1 2
The following types of atomic vectors are used frequently:
- numeric: for numeric values.
- integer: for integer values.
- character: for characters (strings).
- factor: for factors.
- logical: for logical values.

All vectors can have missing values.
Note: names of vectors do not need to be unique. This can lead to subsetting problems, because subsetting by name returns only the first match:
x <- 1:3
names(x) <- c("A", "A", "B")
x
## A A B
## 1 2 3
x["A"]
## A
## 1
Note that you don’t even get a warning, so watch out for non-unique names! You can check for unique names by using the functions unique, duplicated or (easiest) anyDuplicated.
anyDuplicated(names(x))
## [1] 2
names(x) <- c("A", "B", "C")
anyDuplicated(names(x))
## [1] 0
anyDuplicated returns the index of the first duplicated name, so 0 indicates nothing is duplicated.
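If you do end up with duplicated names and want every matching element, one option is to compare against names() directly. This is a minimal sketch; the variable names first_only and all_matches are just for illustration.

```r
# Names are deliberately duplicated here
x <- 1:3
names(x) <- c("A", "A", "B")

# Subsetting by a character name returns only the first match
first_only <- x["A"]

# Comparing against names() gives a logical index that keeps every match
all_matches <- x[names(x) == "A"]
all_matches
```

The logical comparison is more verbose, but it does not silently drop the second "A".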
The default in R is to represent numbers as numeric, NOT integer. This is something that can usually be ignored, but you might run into some issues in Bioconductor with this. Note that even constructions that look like integer are really numeric:
x <- 1
class(x)
## [1] "numeric"
x <- 1:3
class(x)
## [1] "integer"
To make sure you get an integer in R, append L to the number:
x <- 1L
class(x)
## [1] "integer"
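A few quick checks can make the integer/numeric distinction more concrete. This is only a sketch; the variable names are made up for the example.

```r
# A bare number is numeric; the L suffix makes it an integer
is.integer(1)    # FALSE
is.integer(1L)   # TRUE

# Arithmetic stays integer only if both operands are integer
int_times_int <- 1:3 * 2L
int_times_num <- 1:3 * 2
class(int_times_int)   # "integer"
class(int_times_num)   # "numeric"

# The values compare equal, but the objects are not identical
same_value  <- 1L == 1
same_object <- identical(1L, 1)
```

Here same_value is TRUE while same_object is FALSE, which is the kind of difference that can surprise you when code checks types strictly.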
So why the distinction between integer and numeric? Internally, computers represent and calculate with integer and numeric values differently.
1. integer mathematics is different.
2. numeric can hold much larger values than integer.
3. numeric takes up slightly more RAM (but nothing to worry about).

Point 2 is something you can sometimes run into in Bioconductor. The maximum integer is
.Machine$integer.max
## [1] 2147483647
2^31 -1 == .Machine$integer.max
## [1] TRUE
round(.Machine$integer.max / 10^6, 1)
## [1] 2147.5
This number is smaller than the number of bases in the human genome. So we sometimes (accidentally) add up numbers whose total exceeds this. The fix is to use as.numeric to convert the integer to numeric.
This number is also the limit for how long an atomic vector can be. So you cannot have a single vector which is as long as the human genome. In R we are beginning to get support for something called “long vectors” which basically are … long vectors. But the support for long vectors is not yet pervasive.
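The overflow problem described above can be sketched as follows. The variable names are illustrative, and suppressWarnings() is used only to keep the example quiet; in an interactive session R would warn about the overflow.

```r
big <- .Machine$integer.max

# Adding 1L to the largest integer overflows and yields NA
overflowed <- suppressWarnings(big + 1L)
is.na(overflowed)

# Converting to numeric first avoids the overflow
as_double <- as.numeric(big) + 1
as_double == 2^31

# The same fix applies when summing integer counts
counts <- c(big, 1L)
total <- sum(as.numeric(counts))
total
```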
A matrix is a two-dimensional object. All values in a matrix have to have the same type (numeric or character or any of the other atomic vector types). It is optional to have rownames or colnames, and these names do not have to be unique.
x <- matrix(1:9, ncol = 3, nrow = 3)
rownames(x) <- c("A","B", "B")
x
## [,1] [,2] [,3]
## A 1 4 7
## B 2 5 8
## B 3 6 9
dim(x)
## [1] 3 3
nrow(x)
## [1] 3
ncol(x)
## [1] 3
Subsetting is two-dimensional; the first dimension is rows and the second is columns. You can even subset with a matrix of the same dimension, but watch out for the return object.
x[1:2,]
## [,1] [,2] [,3]
## A 1 4 7
## B 2 5 8
x["B",]
## [1] 2 5 8
x[x >= 5]
## [1] 5 6 7 8 9
(Note how subsetting with a non-unique name does not lead to an error.) If you grab a single row or a single column from a matrix you get a vector. Sometimes, it is really nice to get a matrix; you do that by using drop=FALSE in the subsetting:
x[1,]
## [1] 1 4 7
x[1,,drop=FALSE]
## [,1] [,2] [,3]
## A 1 4 7
There are a lot of mathematical operations working on matrices, for example rowSums, colSums and things like eigen for eigenvalue decomposition. I am a heavy user of the package matrixStats for the full suite of rowXX and colXX with XX being any standard statistical function such as sd(), var(), quantile() etc.
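As a small sketch of these row and column summaries using only base R: the apply() call below does by hand what a function like rowSds from matrixStats is meant to do, so treat it as an illustration rather than a replacement.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# Built-in row and column sums
row_totals <- rowSums(m)
col_totals <- colSums(m)

# Row-wise standard deviation via apply(); MARGIN = 1 means "over rows"
row_sd <- apply(m, 1, sd)

row_totals
col_totals
row_sd
```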
Internally, a matrix is just a vector with a dimension attribute. In R we have column-first orientation, so the columns are filled up first:
matrix(1:9, 3, 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
matrix(1:9, 3, 3, byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
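To see the "vector with a dimension attribute" idea directly, you can attach dimensions to an ordinary vector yourself. This is a minimal sketch with made-up variable names.

```r
# Start with a plain integer vector
v <- 1:6

# Setting the dim attribute turns it into a 2 x 3 matrix
dim(v) <- c(2, 3)
is.matrix(v)

# Because of column-first storage, element [2, 3] is the 6th value
by_position <- v[2, 3]
by_index <- v[6]

# The result is the same object matrix() would have built
same_as_matrix <- identical(v, matrix(1:6, nrow = 2))
```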
lists are like vectors, but can hold objects of arbitrary kinds.
x <- list(1:3, letters[1:3], is.numeric)
x
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "a" "b" "c"
##
## [[3]]
## function (x) .Primitive("is.numeric")
names(x) <- c("numbers", "letters", "function")
x[1:2]
## $numbers
## [1] 1 2 3
##
## $letters
## [1] "a" "b" "c"
x[1]
## $numbers
## [1] 1 2 3
x[[1]]
## [1] 1 2 3
See how subsetting creates another list. To get to the actual content of the first element, you need double brackets [[. The distinction between [ and [[ is critical to understand.
You can use $ on a named list. However, R has something called “partial” matching for $:
x$letters
## [1] "a" "b" "c"
x["letters"]
## $letters
## [1] "a" "b" "c"
x$let
## [1] "a" "b" "c"
x["let"]
## $<NA>
## NULL
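If you want to avoid being surprised by partial matching, [[ is the stricter choice: it matches names exactly unless you ask otherwise through its exact argument. A short sketch, with illustrative variable names:

```r
x <- list(numbers = 1:3, letters = letters[1:3])

# $ silently completes "let" to "letters"
partial_dollar <- x$let

# [[ requires an exact name by default, so this is NULL
strict <- x[["let"]]

# Partial matching with [[ has to be requested explicitly
partial_brackets <- x[["let", exact = FALSE]]
```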
Trick: sometimes you want a list where each element is a single number. Use as.list:
as.list(1:3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
list(1:3)
## [[1]]
## [1] 1 2 3
It is quite common to have a list where each element is of the same kind, for example a numeric vector. You can apply a function to each element in the list by using lapply(); this returns another list which is named if the input is.
x <- list(a = rnorm(3), b = rnorm(3))
lapply(x, mean)
## $a
## [1] -0.6467477
##
## $b
## [1] 0.6932006
If the output of the function is of the same kind, you can simplify the output using sapply (simplify apply). This is particularly useful if the function in question returns a single number.
sapply(x, mean)
## a b
## -0.6467477 0.6932006
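Here is the same comparison with fixed numbers instead of rnorm(), so the results are reproducible. The vapply() call is an addition of mine, not something used above: it behaves like sapply() but makes you state the expected type of each result.

```r
x <- list(a = c(1, 2, 3), b = c(10, 20, 30))

# lapply() always returns a list
as_list <- lapply(x, mean)

# sapply() simplifies to a named numeric vector here
as_vector <- sapply(x, mean)

# vapply() does the same, but fails if a result is not a single number
checked <- vapply(x, mean, numeric(1))

as_vector
```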
data.frames are fundamental to data analysis. They look like matrices, but each column can be a separate type, so you can mix and match different data types. Row names are required to be unique, and data.frame() makes column names unique by default. If no row names are given, 1:nrow is used.
x <- data.frame(sex = c("M", "M", "F"), age = c(32,34,29))
x
## sex age
## 1 M 32
## 2 M 34
## 3 F 29
You access columns by $ or [[:
x$sex
## [1] M M F
## Levels: F M
x[["sex"]]
## [1] M M F
## Levels: F M
Note how sex was converted into a factor. This is a frequent source of errors, so much so that I highly encourage users to make sure they never have factors in their data.frames. This conversion can be disabled by stringsAsFactors=FALSE (which has been the default since R 4.0.0; the output shown here is from an older version of R):
x <- data.frame(sex = c("M", "M", "F"), age = c(32,34,29), stringsAsFactors = FALSE)
x$sex
## [1] "M" "M" "F"
Behind the scenes, a data.frame is really a list. Why does this matter? Well, for one, it allows you to use lapply and sapply across the columns:
sapply(x, class)
## sex age
## "character" "numeric"
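A few more consequences of a data.frame being a list, sketched with the same toy data. The name numeric_cols is just for illustration.

```r
df <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29),
                 stringsAsFactors = FALSE)

# A data.frame passes the list test, and its length is the number of columns
is.list(df)
length(df)

# List-style [[ pulls out one column as a plain vector
ages <- df[["age"]]

# A logical vector from sapply() can select columns by type
numeric_cols <- df[sapply(df, is.numeric)]
names(numeric_cols)
```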
We often have to convert R objects from one type to another. For basic R types (as described above), you have the as.XX family of functions, with XX being all the types of objects listed above.
x
## sex age
## 1 M 32
## 2 M 34
## 3 F 29
as.matrix(x)
## sex age
## [1,] "M" "32"
## [2,] "M" "34"
## [3,] "F" "29"
as.list(x)
## $sex
## [1] "M" "M" "F"
##
## $age
## [1] 32 34 29
When we convert the data.frame to a matrix it becomes a character matrix, because there is a character column and this is the only way to keep the contents.
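If you need a numeric matrix, one approach is to select only the numeric columns before converting. This is a sketch under the assumption that dropping the character column is acceptable for your analysis.

```r
df <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29),
                 stringsAsFactors = FALSE)

# Converting everything forces a character matrix
m_all <- as.matrix(df)
typeof(m_all)

# Keeping only the numeric column (drop = FALSE keeps it a data.frame)
# gives a numeric matrix instead
m_num <- as.matrix(df[, "age", drop = FALSE])
typeof(m_num)
dim(m_num)
```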
For more “complicated” objects there is a suite of as() functions, which you use as follows:
library(methods)
as(x, "matrix")
## sex age
## [1,] "M" "32"
## [2,] "M" "34"
## [3,] "F" "29"
This is how you convert most Bioconductor objects.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] BiocStyle_1.6.0 rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.1 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 knitr_1.11 stringr_1.0.0
## [9] digest_0.6.8 evaluate_0.7.2