Dependencies

This document has the following dependencies:

library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer2)

Use the following commands to install these packages in R.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("BSgenome", "BSgenome.Scerevisiae.UCSC.sacCer2"))

Corrections

Improvements and corrections to this document can be submitted on its GitHub in its repository.

Overview

The BSgenome package contains infrastructure for representing genome sequences in Bioconductor.

Other Resources

Genomes

The BSgenome package provides support for genomes. In Bioconductor, we have special classes for genomes, because the chromosomes can get really big. For example, the human genome takes up several GB of memory.

The available.genomes() function lists which genomes are currently available from from Bioconductor (it is possible to make your own genome package). Note that there are several so-called “masked” genomes, where some parts of the genome are masked. We will avoid this subject for now.

Let us load the latest yeast genome

available.genomes()
##  [1] "BSgenome.Alyrata.JGI.v1"                  
##  [2] "BSgenome.Amellifera.BeeBase.assembly4"    
##  [3] "BSgenome.Amellifera.UCSC.apiMel2"         
##  [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"  
##  [5] "BSgenome.Athaliana.TAIR.04232008"         
##  [6] "BSgenome.Athaliana.TAIR.TAIR9"            
##  [7] "BSgenome.Btaurus.UCSC.bosTau3"            
##  [8] "BSgenome.Btaurus.UCSC.bosTau3.masked"     
##  [9] "BSgenome.Btaurus.UCSC.bosTau4"            
## [10] "BSgenome.Btaurus.UCSC.bosTau4.masked"     
## [11] "BSgenome.Btaurus.UCSC.bosTau6"            
## [12] "BSgenome.Btaurus.UCSC.bosTau6.masked"     
## [13] "BSgenome.Btaurus.UCSC.bosTau8"            
## [14] "BSgenome.Celegans.UCSC.ce10"              
## [15] "BSgenome.Celegans.UCSC.ce2"               
## [16] "BSgenome.Celegans.UCSC.ce6"               
## [17] "BSgenome.Cfamiliaris.UCSC.canFam2"        
## [18] "BSgenome.Cfamiliaris.UCSC.canFam2.masked" 
## [19] "BSgenome.Cfamiliaris.UCSC.canFam3"        
## [20] "BSgenome.Cfamiliaris.UCSC.canFam3.masked" 
## [21] "BSgenome.Dmelanogaster.UCSC.dm2"          
## [22] "BSgenome.Dmelanogaster.UCSC.dm2.masked"   
## [23] "BSgenome.Dmelanogaster.UCSC.dm3"          
## [24] "BSgenome.Dmelanogaster.UCSC.dm3.masked"   
## [25] "BSgenome.Dmelanogaster.UCSC.dm6"          
## [26] "BSgenome.Drerio.UCSC.danRer5"             
## [27] "BSgenome.Drerio.UCSC.danRer5.masked"      
## [28] "BSgenome.Drerio.UCSC.danRer6"             
## [29] "BSgenome.Drerio.UCSC.danRer6.masked"      
## [30] "BSgenome.Drerio.UCSC.danRer7"             
## [31] "BSgenome.Drerio.UCSC.danRer7.masked"      
## [32] "BSgenome.Ecoli.NCBI.20080805"             
## [33] "BSgenome.Gaculeatus.UCSC.gasAcu1"         
## [34] "BSgenome.Gaculeatus.UCSC.gasAcu1.masked"  
## [35] "BSgenome.Ggallus.UCSC.galGal3"            
## [36] "BSgenome.Ggallus.UCSC.galGal3.masked"     
## [37] "BSgenome.Ggallus.UCSC.galGal4"            
## [38] "BSgenome.Ggallus.UCSC.galGal4.masked"     
## [39] "BSgenome.Hsapiens.NCBI.GRCh38"            
## [40] "BSgenome.Hsapiens.UCSC.hg17"              
## [41] "BSgenome.Hsapiens.UCSC.hg17.masked"       
## [42] "BSgenome.Hsapiens.UCSC.hg18"              
## [43] "BSgenome.Hsapiens.UCSC.hg18.masked"       
## [44] "BSgenome.Hsapiens.UCSC.hg19"              
## [45] "BSgenome.Hsapiens.UCSC.hg19.masked"       
## [46] "BSgenome.Hsapiens.UCSC.hg38"              
## [47] "BSgenome.Hsapiens.UCSC.hg38.masked"       
## [48] "BSgenome.Mfascicularis.NCBI.5.0"          
## [49] "BSgenome.Mfuro.UCSC.musFur1"              
## [50] "BSgenome.Mmulatta.UCSC.rheMac2"           
## [51] "BSgenome.Mmulatta.UCSC.rheMac2.masked"    
## [52] "BSgenome.Mmulatta.UCSC.rheMac3"           
## [53] "BSgenome.Mmulatta.UCSC.rheMac3.masked"    
## [54] "BSgenome.Mmusculus.UCSC.mm10"             
## [55] "BSgenome.Mmusculus.UCSC.mm10.masked"      
## [56] "BSgenome.Mmusculus.UCSC.mm8"              
## [57] "BSgenome.Mmusculus.UCSC.mm8.masked"       
## [58] "BSgenome.Mmusculus.UCSC.mm9"              
## [59] "BSgenome.Mmusculus.UCSC.mm9.masked"       
## [60] "BSgenome.Osativa.MSU.MSU7"                
## [61] "BSgenome.Ptroglodytes.UCSC.panTro2"       
## [62] "BSgenome.Ptroglodytes.UCSC.panTro2.masked"
## [63] "BSgenome.Ptroglodytes.UCSC.panTro3"       
## [64] "BSgenome.Ptroglodytes.UCSC.panTro3.masked"
## [65] "BSgenome.Rnorvegicus.UCSC.rn4"            
## [66] "BSgenome.Rnorvegicus.UCSC.rn4.masked"     
## [67] "BSgenome.Rnorvegicus.UCSC.rn5"            
## [68] "BSgenome.Rnorvegicus.UCSC.rn5.masked"     
## [69] "BSgenome.Rnorvegicus.UCSC.rn6"            
## [70] "BSgenome.Scerevisiae.UCSC.sacCer1"        
## [71] "BSgenome.Scerevisiae.UCSC.sacCer2"        
## [72] "BSgenome.Scerevisiae.UCSC.sacCer3"        
## [73] "BSgenome.Sscrofa.UCSC.susScr3"            
## [74] "BSgenome.Sscrofa.UCSC.susScr3.masked"     
## [75] "BSgenome.Tgondii.ToxoDB.7.0"              
## [76] "BSgenome.Tguttata.UCSC.taeGut1"           
## [77] "BSgenome.Tguttata.UCSC.taeGut1.masked"
library("BSgenome.Scerevisiae.UCSC.sacCer2")
Scerevisiae
## Yeast genome:
## # organism: Saccharomyces cerevisiae (Yeast)
## # provider: UCSC
## # provider version: sacCer2
## # release date: June 2008
## # release name: SGD June 2008 sequence
## # 18 sequences:
## #   chrI    chrII   chrIII  chrIV   chrV    chrVI   chrVII  chrVIII chrIX  
## #   chrX    chrXI   chrXII  chrXIII chrXIV  chrXV   chrXVI  chrM    2micron
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

A BSgenome package contains a single object which is the second component of the name. At first, nothing is loaded into memory, which makes it very fast. You can get the length and names of the chromosomes without actually loading them.

seqlengths(Scerevisiae)
##    chrI   chrII  chrIII   chrIV    chrV   chrVI  chrVII chrVIII   chrIX 
##  230208  813178  316617 1531919  576869  270148 1090947  562643  439885 
##    chrX   chrXI  chrXII chrXIII  chrXIV   chrXV  chrXVI    chrM 2micron 
##  745742  666454 1078175  924429  784333 1091289  948062   85779    6318
seqnames(Scerevisiae)
##  [1] "chrI"    "chrII"   "chrIII"  "chrIV"   "chrV"    "chrVI"   "chrVII" 
##  [8] "chrVIII" "chrIX"   "chrX"    "chrXI"   "chrXII"  "chrXIII" "chrXIV" 
## [15] "chrXV"   "chrXVI"  "chrM"    "2micron"

We load a chromosome by using the [[ or $ operators:

Scerevisiae$chrI
##   230208-letter "DNAString" instance
## seq: CCACACCACACCCACACACCCACACACCACACCA...GTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG

We can now do things like compute the GC content of the first chromosome

letterFrequency(Scerevisiae$chrI, "CG", as.prob = TRUE)
##       C|G 
## 0.3927361

To iterate over chromosomes seems straightforward with lapply. However, this function may end up using a lot of memory because the entire genome is loaded. Instead there is the bsapply function which handles loading and unloading of different chromosomes. The interface to bsapply is weird at first; you set up a BSparams object which contains which function you are using and which genome you are using it on (and a bit more information). This paradigm is being used in other packages these days, for example BiocParallel. An example will make this clear:

param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency)
head(bsapply(param, letters = "GC"))
## $chrI
##   G|C 
## 90411 
## 
## $chrII
##    G|C 
## 311807 
## 
## $chrIII
##    G|C 
## 121998 
## 
## $chrIV
##    G|C 
## 580699 
## 
## $chrV
##    G|C 
## 222141 
## 
## $chrVI
##    G|C 
## 104636

note how the additional argument letters to the letterFrequency function is given as an argument to bsapply, not to the BSParams object. This gives us a list; you can simplify the output (like the difference between lapply and sapply) by

param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency, simplify = TRUE)
bsapply(param, letters = "GC")
##    chrI.G|C   chrII.G|C  chrIII.G|C   chrIV.G|C    chrV.G|C   chrVI.G|C 
##       90411      311807      121998      580699      222141      104636 
##  chrVII.G|C chrVIII.G|C   chrIX.G|C    chrX.G|C   chrXI.G|C  chrXII.G|C 
##      415227      216586      171122      286167      253728      414843 
## chrXIII.G|C  chrXIV.G|C   chrXV.G|C  chrXVI.G|C    chrM.G|C 2micron.G|C 
##      353167      303042      416443      360871       14676        2463

Note how the mitochondria chromosome is very different. To conclude, the GC percentage of the genome is

sum(bsapply(param, letters = "GC")) / sum(seqlengths(Scerevisiae))
## [1] 0.3814872

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.4.0
##  [2] BSgenome_1.36.3                        
##  [3] rtracklayer_1.28.8                     
##  [4] Biostrings_2.36.3                      
##  [5] XVector_0.8.0                          
##  [6] GenomicRanges_1.20.5                   
##  [7] GenomeInfoDb_1.4.2                     
##  [8] IRanges_2.2.7                          
##  [9] S4Vectors_0.6.3                        
## [10] BiocGenerics_0.14.0                    
## [11] BiocStyle_1.6.0                        
## [12] rmarkdown_0.7                          
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.11              magrittr_1.5           
##  [3] zlibbioc_1.14.0         GenomicAlignments_1.4.1
##  [5] BiocParallel_1.2.20     stringr_1.0.0          
##  [7] tools_3.2.1             lambda.r_1.1.7         
##  [9] futile.logger_1.4.1     htmltools_0.2.6        
## [11] yaml_2.1.13             digest_0.6.8           
## [13] formatR_1.2             futile.options_1.0.0   
## [15] bitops_1.0-6            RCurl_1.95-4.7         
## [17] evaluate_0.7.2          stringi_0.5-5          
## [19] BiocInstaller_1.18.4    Rsamtools_1.20.4       
## [21] XML_3.98-1.3