```{r front, child="front.Rmd", echo=FALSE}
```
## Dependencies
This document has no dependencies.
```{r dependencies, warning=FALSE, message=FALSE}
```
## Corrections
Improvements and corrections to this document can be submitted via its [source file on GitHub](https://github.com/kasperdanielhansen/genbioconductor/blob/master/Rmd/R_Base_Types.Rmd) in its [repository](https://github.com/kasperdanielhansen/genbioconductor).
## Overview
A very brief overview of core R object types and how to subset them.
## Other Resources
- "An Introduction to R" ships with R and can also be accessed on the web ([HTML](https://cran.r-project.org/doc/manuals/r-release/R-intro.html) |
[PDF](https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf)). This introduction contains a lot of useful material, but it is written very tersely; you will need to pay close attention to the details. It is useful to re-read this introduction after you have used R for a while; you are likely to learn details you missed at first.
## Atomic Vectors
The most basic object in R is an atomic vector. Examples include `numeric`, `integer`, `logical`, `character` and `factor`. These objects have a single length and can have names, which can be used for indexing:
```{r numeric}
x <- 1:10
names(x) <- letters[1:10]
class(x)
x[1:3]
x[c("a", "b")]
```
The following types of atomic vectors are used frequently:
- `numeric` - for numeric values.
- `integer` - for integer values.
- `character` - for characters (strings).
- `factor` - for factors.
- `logical` - for logical values.
All vectors can have missing values.
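As a small illustration of missing values (the vector names `num` and `chr` are made up for this sketch), `NA` can sit in any atomic vector type, and most summary functions propagate it unless told otherwise:

```{r missingValues}
# Hypothetical example vectors; NA takes on the type of the vector it sits in
num <- c(1.5, NA, 3)
chr <- c("a", NA, "c")
is.na(num)                  # TRUE where the value is missing
class(chr)                  # still a character vector
mean(num)                   # NA, because the missing value propagates
mean(num, na.rm = TRUE)     # drop missing values before averaging
```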
Note: names of vectors do not need to be unique. This can lead to subsetting problems:
```{r uNames}
x <- 1:3
names(x) <- c("A", "A", "B")
x
x["A"]
```
Note that subsetting by a duplicated name silently returns only the first match; you don't even get a warning, so watch out for non-unique names! You can check for unique names by using the functions `unique`, `duplicated` or (easiest) `anyDuplicated`.
```{r uNames2}
anyDuplicated(names(x))
names(x) <- c("A", "B", "C")
anyDuplicated(names(x))
```
`anyDuplicated` returns the index of the first duplicated name, so `0` indicates nothing is duplicated.
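The other two functions mentioned above can be sketched the same way. The vector `nms` is a hypothetical stand-in for `names(x)`, and `make.unique` is shown only as one possible way to repair duplicated names:

```{r uNames3}
nms <- c("A", "A", "B")     # hypothetical names with one repeat
duplicated(nms)             # flags the second "A"
unique(nms)                 # the distinct names
which(nms == "A")           # every position carrying the name "A"
make.unique(nms)            # one way to force distinct names
```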
### Integers in R
The default in R is to represent numbers as `numeric`, NOT `integer`. This is something that can usually be ignored, but you might run into some issues in Bioconductor with this. Note that even a literal that looks like an integer, such as `1`, is really `numeric`, whereas the `:` operator applied to whole numbers returns an `integer` vector:
```{r intNum}
x <- 1
class(x)
x <- 1:3
class(x)
```
The way to make sure to get an `integer` in R is to append `L` to the numbers:
```{r intNum2}
x <- 1L
class(x)
```
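A few more checks along the same lines may help; this is only a sketch using base R functions, and the specific values are arbitrary:

```{r intNum3}
is.integer(1)        # FALSE: a plain literal is numeric (double)
is.integer(1L)       # TRUE
typeof(1:3)          # "integer"
typeof(c(1, 2, 3))   # "double"
as.integer(3.7)      # conversion truncates toward zero, giving 3L
```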
So why distinguish between `integer` and `numeric`? Internally, the way computers represent and calculate with numbers differs between `integer` and `numeric`.
- `integer` arithmetic is exact, but a result outside the representable range overflows to `NA` (with a warning).
- `numeric` can hold much larger values than `integer`.
- `numeric` takes up more RAM (8 bytes per value instead of 4, but nothing to worry about).
Point 2 is something you can sometimes run into in Bioconductor. The maximum `integer` is
```{r machine}
.Machine$integer.max
2^31 - 1 == .Machine$integer.max
round(.Machine$integer.max / 10^6, 1)
```
This number is smaller than the number of bases in the human genome, so we sometimes (accidentally) add up numbers which exceed it. The fix is to use `as.numeric` to convert the `integer` to `numeric`.
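The overflow and the `as.numeric` fix can be sketched as follows. The lengths in `lens` are invented round numbers, not real chromosome sizes, and `suppressWarnings` is used only to keep the output quiet:

```{r overflow}
big <- .Machine$integer.max
# Adding 1L overflows; R returns NA and emits a warning
suppressWarnings(big + 1L)
# Converting to numeric first avoids the overflow
as.numeric(big) + 1
# Same idea when summing hypothetical lengths stored as integers
lens <- c(2000000000L, 1000000000L)
suppressWarnings(sum(lens))
sum(as.numeric(lens))
```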
Historically, this number was also the limit for how long an atomic vector can be, so you could not have a single vector as long as the human genome. Since version 3.0.0, 64-bit builds of R support so-called "long vectors", which can exceed this length, but support for long vectors is not yet pervasive.
## Matrices
A `matrix` is a two-dimensional object. All values in a `matrix` have to have the same type (`numeric` or `character` or any of the other atomic vector types). It is optional to have `rownames` or `colnames`, and these names do not have to be unique.
```{r matrices}
x <- matrix(1:9, ncol = 3, nrow = 3)
rownames(x) <- c("A","B", "B")
x
dim(x)
nrow(x)
ncol(x)
```
Subsetting is two-dimensional; the first dimension is rows and the second is columns. You can even subset with a logical matrix of the same dimension, but watch out for the return object: it is a vector, not a matrix.
```{r matrices2}
x[1:2,]
x["B",]
x[x >= 5]
```
(Note how subsetting with a non-unique name does not lead to an error.) If you grab a single row or a single column from a `matrix` you get a vector. Sometimes, it is really nice to get a `matrix`; you do that by using `drop=FALSE` in the subsetting:
```{r matrixSubset2}
x[1,]
x[1,,drop=FALSE]
```
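Returning to subsetting with a matrix of the same dimension: the following sketch (with a hypothetical matrix `m`) shows that the result is a plain vector, and that `which` with `arr.ind = TRUE` recovers the row and column positions if you need them:

```{r matrixLogical}
m <- matrix(1:9, nrow = 3, ncol = 3)   # hypothetical example matrix
idx <- m >= 5                          # logical matrix, same dimension as m
dim(idx)
m[idx]                                 # a plain vector; dimensions are dropped
which(idx, arr.ind = TRUE)             # row/column positions of the TRUE entries
```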
There are a lot of mathematical operations working on matrices, for example `rowSums`, `colSums` and things like `eigen` for eigenvector decomposition. I am a heavy user of the package `r CRANpkg("matrixStats")` for the full suite of `rowXX` and `colXX` with `XX` being any standard statistical function such as `sd()`, `var()`, `quantile()` etc.
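A minimal sketch of the row and column summaries using only base R (so no extra package is needed); the matrix `m` is hypothetical, and `apply` is shown as a generic fallback rather than as a replacement for the dedicated functions:

```{r matrixOps}
m <- matrix(1:6, nrow = 2)       # hypothetical 2 x 3 matrix
rowSums(m)
colSums(m)
# Base R fallback for a row-wise statistic without a dedicated function
apply(m, 1, sd)
```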
Internally, a `matrix` is just a `vector` with a dimension attribute. R uses column-major order, so the columns are filled up first:
```{r createMatrix}
matrix(1:9, 3, 3)
matrix(1:9, 3, 3, byrow = TRUE)
```
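To see the "vector with a dimension attribute" point directly, one can attach `dim` to an ordinary vector. This is a small sketch with a hypothetical vector `v`:

```{r matrixAttr}
v <- 1:6
dim(v) <- c(2, 3)     # adding a dim attribute turns the vector into a matrix
is.matrix(v)
attributes(v)
v[3] == v[1, 2]       # linear index 3 is row 1, column 2 (column-major)
```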
## Lists
`list`s are like `vector`s, but can hold together objects of arbitrary kind.
```{r list}
x <- list(1:3, letters[1:3], is.numeric)
x
names(x) <- c("numbers", "letters", "function")
x[1:2]
x[1]
x[[1]]
```
See how subsetting creates another `list`. To get to the actual content of the first element, you need double brackets `[[`. The distinction between `[` and `[[` is critical to understand.
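The difference is easy to check by looking at the class of what comes back. A short sketch on a hypothetical list `lst`:

```{r listBrackets}
lst <- list(numbers = 1:3, letters = letters[1:3])  # hypothetical list
class(lst[1])        # "list": single brackets keep the container
class(lst[[1]])      # "integer": double brackets extract the element
length(lst[1:2])     # `[` can select several elements
lst[["numbers"]][2]  # extract, then index into the extracted vector
```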
You can use `$` on a named list. However, R has something called "partial" matching for `$`: an unambiguous prefix of a name is enough, whereas `[` requires the exact name:
```{r list2}
x$letters
x["letters"]
x$let
x["let"]
```
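For completeness, here is how `[[` behaves in the same situation; a sketch on a hypothetical list `lst`. By default `[[` matches names exactly, and its `exact` argument lets you opt in to partial matching:

```{r list3}
lst <- list(letters = letters[1:3], numbers = 1:3)  # hypothetical list
lst$let                        # partial match succeeds
lst[["let"]]                   # NULL: `[[` matches exactly by default
lst[["let", exact = FALSE]]    # opt in to partial matching
names(lst["let"])              # NA: no element is named "let"
```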
Trick: sometimes you want a list where each element is a single number. Use `as.list`:
```{r as.list}
as.list(1:3)
list(1:3)
```
### lapply and sapply
It is quite common to have a `list` where each element is of the same kind, for example a `numeric` vector. You can apply a function to each element in the `list` by using `lapply()`; this returns another `list` which is named if the input is.
```{r lapply}
x <- list(a = rnorm(3), b = rnorm(3))
lapply(x, mean)
```
If the output of the function is of the same kind, you can simplify the output using `sapply` (simplify apply). This is particularly useful if the function in question returns a single number.
```{r sapply}
sapply(x, mean)
```
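What `sapply` returns depends on what the function returns, which can be surprising. The sketch below (hypothetical list `lst`) shows a case where the result is a matrix, and `vapply`, a stricter base R relative that makes you state the expected result type:

```{r vapplyExample}
lst <- list(a = c(1, 2, 3), b = c(4, 5, 6))   # hypothetical input
sapply(lst, mean)                 # named numeric vector
sapply(lst, range)                # length-2 results become a matrix
vapply(lst, mean, numeric(1))     # like sapply, but checks the result type
```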
## Data frames
`data.frame`s are fundamental to data analysis. They look like matrices, but each column can be a separate type, so you can mix and match different data types. Row names are required to be unique, and by default `data.frame()` also makes the column names unique. If no row names are given, `1:nrow` is used.
```{r df}
x <- data.frame(sex = c("M", "M", "F"), age = c(32,34,29))
x
```
You access columns by `$` or `[[`:
```{r dfColumns}
x$sex
x[["sex"]]
```
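In versions of R before 4.0.0, `sex` is converted into a `factor` by default (from R 4.0.0 onwards the default is `stringsAsFactors = FALSE`, so the column stays `character`). This conversion is a frequent source of errors, so much so that I highly encourage users to make sure they never have `factor`s in their `data.frame`s. The conversion can be disabled explicitly with `stringsAsFactors=FALSE`: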
```{r df2}
x <- data.frame(sex = c("M", "M", "F"), age = c(32,34,29), stringsAsFactors = FALSE)
x$sex
```
Behind the scenes, a `data.frame` is really a `list`. Why does this matter? Well, for one, it allows you to use `lapply` and `sapply` across the columns:
```{r dfApply}
sapply(x, class)
```
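The list nature of a `data.frame` can be checked directly. This sketch uses a separate, hypothetical copy of the example data called `dat`:

```{r dfList}
dat <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29),
                  stringsAsFactors = FALSE)   # hypothetical copy of the example
is.list(dat)
length(dat) == ncol(dat)   # the list elements are the columns
dat[["age"]][2]            # list-style extraction, then vector indexing
dat[dat$age > 30, ]        # matrix-style subsetting also works
```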
## Conversion
We often have to convert R objects from one type to another. For basic R types (as described above), you have the `as.XX` family of functions, with `XX` being all the types of objects listed above.
```{r as.X, error=TRUE}
x
as.matrix(x)
as.list(x)
```
When we convert the `data.frame` to a `matrix` it becomes a `character` matrix, because there is a `character` column and this is the only way to keep the contents.
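If you only need the numeric columns, one option is to select them before converting. A minimal sketch, again on a hypothetical copy of the data called `dat`:

```{r asMatrixNumeric}
dat <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29),
                  stringsAsFactors = FALSE)   # hypothetical copy of the example
typeof(as.matrix(dat))                          # "character"
typeof(as.matrix(dat[, "age", drop = FALSE]))   # "double"
```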
For more "complicated" objects there is a suite of `as()` methods, which you use as follows:
```{r as}
library(methods)
as(x, "matrix")
```
This is how you convert most Bioconductor objects.
## SessionInfo
\scriptsize
```{r sessionInfo, echo=FALSE}
sessionInfo()
```