This document has the following dependencies:
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer2)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("BSgenome", "BSgenome.Scerevisiae.UCSC.sacCer2"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
The BSgenome package contains infrastructure for representing genome sequences in Bioconductor.
The BSgenome package provides support for genomes. In Bioconductor, we have special classes for genomes, because the chromosomes can get really big. For example, the human genome takes up several GB of memory.
The available.genomes()
function lists which genomes are currently available from from Bioconductor (it is possible to make your own genome package). Note that there are several so-called “masked” genomes, where some parts of the genome are masked. We will avoid this subject for now.
Let us load the latest yeast genome
available.genomes()
## [1] "BSgenome.Alyrata.JGI.v1"
## [2] "BSgenome.Amellifera.BeeBase.assembly4"
## [3] "BSgenome.Amellifera.UCSC.apiMel2"
## [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Athaliana.TAIR.04232008"
## [6] "BSgenome.Athaliana.TAIR.TAIR9"
## [7] "BSgenome.Btaurus.UCSC.bosTau3"
## [8] "BSgenome.Btaurus.UCSC.bosTau3.masked"
## [9] "BSgenome.Btaurus.UCSC.bosTau4"
## [10] "BSgenome.Btaurus.UCSC.bosTau4.masked"
## [11] "BSgenome.Btaurus.UCSC.bosTau6"
## [12] "BSgenome.Btaurus.UCSC.bosTau6.masked"
## [13] "BSgenome.Btaurus.UCSC.bosTau8"
## [14] "BSgenome.Celegans.UCSC.ce10"
## [15] "BSgenome.Celegans.UCSC.ce2"
## [16] "BSgenome.Celegans.UCSC.ce6"
## [17] "BSgenome.Cfamiliaris.UCSC.canFam2"
## [18] "BSgenome.Cfamiliaris.UCSC.canFam2.masked"
## [19] "BSgenome.Cfamiliaris.UCSC.canFam3"
## [20] "BSgenome.Cfamiliaris.UCSC.canFam3.masked"
## [21] "BSgenome.Dmelanogaster.UCSC.dm2"
## [22] "BSgenome.Dmelanogaster.UCSC.dm2.masked"
## [23] "BSgenome.Dmelanogaster.UCSC.dm3"
## [24] "BSgenome.Dmelanogaster.UCSC.dm3.masked"
## [25] "BSgenome.Dmelanogaster.UCSC.dm6"
## [26] "BSgenome.Drerio.UCSC.danRer5"
## [27] "BSgenome.Drerio.UCSC.danRer5.masked"
## [28] "BSgenome.Drerio.UCSC.danRer6"
## [29] "BSgenome.Drerio.UCSC.danRer6.masked"
## [30] "BSgenome.Drerio.UCSC.danRer7"
## [31] "BSgenome.Drerio.UCSC.danRer7.masked"
## [32] "BSgenome.Ecoli.NCBI.20080805"
## [33] "BSgenome.Gaculeatus.UCSC.gasAcu1"
## [34] "BSgenome.Gaculeatus.UCSC.gasAcu1.masked"
## [35] "BSgenome.Ggallus.UCSC.galGal3"
## [36] "BSgenome.Ggallus.UCSC.galGal3.masked"
## [37] "BSgenome.Ggallus.UCSC.galGal4"
## [38] "BSgenome.Ggallus.UCSC.galGal4.masked"
## [39] "BSgenome.Hsapiens.NCBI.GRCh38"
## [40] "BSgenome.Hsapiens.UCSC.hg17"
## [41] "BSgenome.Hsapiens.UCSC.hg17.masked"
## [42] "BSgenome.Hsapiens.UCSC.hg18"
## [43] "BSgenome.Hsapiens.UCSC.hg18.masked"
## [44] "BSgenome.Hsapiens.UCSC.hg19"
## [45] "BSgenome.Hsapiens.UCSC.hg19.masked"
## [46] "BSgenome.Hsapiens.UCSC.hg38"
## [47] "BSgenome.Hsapiens.UCSC.hg38.masked"
## [48] "BSgenome.Mfascicularis.NCBI.5.0"
## [49] "BSgenome.Mfuro.UCSC.musFur1"
## [50] "BSgenome.Mmulatta.UCSC.rheMac2"
## [51] "BSgenome.Mmulatta.UCSC.rheMac2.masked"
## [52] "BSgenome.Mmulatta.UCSC.rheMac3"
## [53] "BSgenome.Mmulatta.UCSC.rheMac3.masked"
## [54] "BSgenome.Mmusculus.UCSC.mm10"
## [55] "BSgenome.Mmusculus.UCSC.mm10.masked"
## [56] "BSgenome.Mmusculus.UCSC.mm8"
## [57] "BSgenome.Mmusculus.UCSC.mm8.masked"
## [58] "BSgenome.Mmusculus.UCSC.mm9"
## [59] "BSgenome.Mmusculus.UCSC.mm9.masked"
## [60] "BSgenome.Osativa.MSU.MSU7"
## [61] "BSgenome.Ptroglodytes.UCSC.panTro2"
## [62] "BSgenome.Ptroglodytes.UCSC.panTro2.masked"
## [63] "BSgenome.Ptroglodytes.UCSC.panTro3"
## [64] "BSgenome.Ptroglodytes.UCSC.panTro3.masked"
## [65] "BSgenome.Rnorvegicus.UCSC.rn4"
## [66] "BSgenome.Rnorvegicus.UCSC.rn4.masked"
## [67] "BSgenome.Rnorvegicus.UCSC.rn5"
## [68] "BSgenome.Rnorvegicus.UCSC.rn5.masked"
## [69] "BSgenome.Rnorvegicus.UCSC.rn6"
## [70] "BSgenome.Scerevisiae.UCSC.sacCer1"
## [71] "BSgenome.Scerevisiae.UCSC.sacCer2"
## [72] "BSgenome.Scerevisiae.UCSC.sacCer3"
## [73] "BSgenome.Sscrofa.UCSC.susScr3"
## [74] "BSgenome.Sscrofa.UCSC.susScr3.masked"
## [75] "BSgenome.Tgondii.ToxoDB.7.0"
## [76] "BSgenome.Tguttata.UCSC.taeGut1"
## [77] "BSgenome.Tguttata.UCSC.taeGut1.masked"
library("BSgenome.Scerevisiae.UCSC.sacCer2")
Scerevisiae
## Yeast genome:
## # organism: Saccharomyces cerevisiae (Yeast)
## # provider: UCSC
## # provider version: sacCer2
## # release date: June 2008
## # release name: SGD June 2008 sequence
## # 18 sequences:
## # chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX
## # chrX chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM 2micron
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)
A BSgenome package contains a single object which is the second component of the name. At first, nothing is loaded into memory, which makes it very fast. You can get the length and names of the chromosomes without actually loading them.
seqlengths(Scerevisiae)
## chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX
## 230208 813178 316617 1531919 576869 270148 1090947 562643 439885
## chrX chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM 2micron
## 745742 666454 1078175 924429 784333 1091289 948062 85779 6318
seqnames(Scerevisiae)
## [1] "chrI" "chrII" "chrIII" "chrIV" "chrV" "chrVI" "chrVII"
## [8] "chrVIII" "chrIX" "chrX" "chrXI" "chrXII" "chrXIII" "chrXIV"
## [15] "chrXV" "chrXVI" "chrM" "2micron"
We load a chromosome by using the [[
or $
operators:
Scerevisiae$chrI
## 230208-letter "DNAString" instance
## seq: CCACACCACACCCACACACCCACACACCACACCA...GTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
We can now do things like compute the GC content of the first chromosome
letterFrequency(Scerevisiae$chrI, "CG", as.prob = TRUE)
## C|G
## 0.3927361
To iterate over chromosomes seems straightforward with lapply
. However, this function may end up using a lot of memory because the entire genome is loaded. Instead there is the bsapply
function which handles loading and unloading of different chromosomes. The interface to bsapply
is weird at first; you set up a BSparams
object which contains which function you are using and which genome you are using it on (and a bit more information). This paradigm is being used in other packages these days, for example BiocParallel. An example will make this clear:
param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency)
head(bsapply(param, letters = "GC"))
## $chrI
## G|C
## 90411
##
## $chrII
## G|C
## 311807
##
## $chrIII
## G|C
## 121998
##
## $chrIV
## G|C
## 580699
##
## $chrV
## G|C
## 222141
##
## $chrVI
## G|C
## 104636
note how the additional argument letters
to the letterFrequency
function is given as an argument to bsapply
, not to the BSParams
object. This gives us a list; you can simplify the output (like the difference between lapply
and sapply
) by
param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency, simplify = TRUE)
bsapply(param, letters = "GC")
## chrI.G|C chrII.G|C chrIII.G|C chrIV.G|C chrV.G|C chrVI.G|C
## 90411 311807 121998 580699 222141 104636
## chrVII.G|C chrVIII.G|C chrIX.G|C chrX.G|C chrXI.G|C chrXII.G|C
## 415227 216586 171122 286167 253728 414843
## chrXIII.G|C chrXIV.G|C chrXV.G|C chrXVI.G|C chrM.G|C 2micron.G|C
## 353167 303042 416443 360871 14676 2463
Note how the mitochondria chromosome is very different. To conclude, the GC percentage of the genome is
sum(bsapply(param, letters = "GC")) / sum(seqlengths(Scerevisiae))
## [1] 0.3814872
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.4.0
## [2] BSgenome_1.36.3
## [3] rtracklayer_1.28.8
## [4] Biostrings_2.36.3
## [5] XVector_0.8.0
## [6] GenomicRanges_1.20.5
## [7] GenomeInfoDb_1.4.2
## [8] IRanges_2.2.7
## [9] S4Vectors_0.6.3
## [10] BiocGenerics_0.14.0
## [11] BiocStyle_1.6.0
## [12] rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] knitr_1.11 magrittr_1.5
## [3] zlibbioc_1.14.0 GenomicAlignments_1.4.1
## [5] BiocParallel_1.2.20 stringr_1.0.0
## [7] tools_3.2.1 lambda.r_1.1.7
## [9] futile.logger_1.4.1 htmltools_0.2.6
## [11] yaml_2.1.13 digest_0.6.8
## [13] formatR_1.2 futile.options_1.0.0
## [15] bitops_1.0-6 RCurl_1.95-4.7
## [17] evaluate_0.7.2 stringi_0.5-5
## [19] BiocInstaller_1.18.4 Rsamtools_1.20.4
## [21] XML_3.98-1.3