Dependencies

This document has the following dependencies:

library(AnnotationHub)

Use the following commands to install these packages in R.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("AnnotationHub"))

Corrections

Improvements and corrections to this document can be submitted to its GitHub repository.

Overview

Annotation information is extremely important for putting your data into context. There are many online resources for doing this, and different databases organize different information using different approaches.

There are multiple ways to access annotation information in Bioconductor.

Here we discuss a new way of doing so, through the package AnnotationHub. This package provides access to a ton of online resources through a unified interface. However, each data resource has its own peculiarities, so a user still needs to understand what the different datasets are.

In a recent paper I was involved in (Hansen et al. 2014), I used AnnotationHub to interrogate my data against all transcription factor data available through the ENCODE project. I managed to write the code and conduct the analysis in a single evening, which I think is pretty awesome.

Usage

First we create an AnnotationHub instance. The first time you do this, it will create a local cache on your system, so that repeat queries for the same information (even in different R sessions) will be very fast.

ah <- AnnotationHub()
ah

As you can see, ah contains a lot of information. The information content is constantly changing, which is why there is a snapshotDate. While the object is big, it only contains pointers to online information; downloading all the resources available in an AnnotationHub would be prohibitive.

The object is organized as a vector, with single-dimension indexing. You can get information about a single resource by indexing with a single [; using two brackets ([[) downloads the object:

ah[1]
## AnnotationHub with 1 record
## # snapshotDate(): 2015-08-17 
## # names(): AH2
## # $dataprovider: Ensembl
## # $species: Ailuropoda melanoleuca
## # $rdataclass: FaFile
## # $title: Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa
## # $description: FASTA DNA sequence for Ailuropoda melanoleuca
## # $taxonomyid: 9646
## # $genome: ailMel1
## # $sourcetype: FASTA
## # $sourceurl: ftp://ftp.ensembl.org/pub/release-69/fasta/ailuropoda_mel...
## # $sourcelastmodifieddate: 2012-10-12
## # $sourcesize: 693412448
## # $tags: FASTA, ensembl, sequence 
## # retrieve record with 'object[["AH2"]]'
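
The difference between [ and [[ mirrors how ordinary R lists behave, which may help as a mental model. The sketch below uses a made-up list (mock_hub, a hypothetical stand-in and not a real hub), so nothing is downloaded when it runs:

```r
# A plain R list standing in for a hub; the entries are invented for
# illustration and are not real AnnotationHub resources.
mock_hub <- list(AH2 = "panda FASTA", AH133 = "human cDNA FASTA")

# Single bracket: the result is still a list (a smaller container).
one_record <- mock_hub["AH2"]

# Double bracket: the result is the element stored under that name.
one_object <- mock_hub[["AH2"]]

class(one_record)
class(one_object)
```

With a real hub the single bracket returns a one-record hub like the one printed above, while the double bracket retrieves the resource itself, as the last line of the printed output suggests.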

The way you use AnnotationHub is by using various tools to narrow down your hub to a single dataset or a small number of datasets. Then you download these datasets for your own usage.

Let us first explore some high-level features of the hub:

unique(ah$dataprovider)
##  [1] "Ensembl"                              
##  [2] "EncodeDCC"                            
##  [3] "UCSC"                                 
##  [4] "Inparanoid8"                          
##  [5] "NCBI"                                 
##  [6] "NHLBI"                                
##  [7] "ChEA"                                 
##  [8] "Pazar"                                
##  [9] "NIH Pathway Interaction Database"     
## [10] "RefNet"                               
## [11] "Haemcode"                             
## [12] "GEO"                                  
## [13] "BroadInstitute"                       
## [14] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
## [15] "dbSNP"
unique(ah$rdataclass)
##  [1] "FaFile"           "GRanges"          "Inparanoid8Db"   
##  [4] "OrgDb"            "TwoBitFile"       "ChainFile"       
##  [7] "SQLiteConnection" "data.frame"       "biopax"          
## [10] "BigWigFile"       "ExpressionSet"    "VcfFile"

We will discuss many of these data classes in future sessions.
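
Beyond listing the distinct values, it can be useful to count how many records fall under each one. Here is a minimal sketch using base R's table() on a small mock data.frame. The meta table and its rows are invented for illustration and are not taken from a hub:

```r
# Mock metadata; the rows are made up and only borrow names seen above.
meta <- data.frame(
  dataprovider = c("Ensembl", "UCSC", "UCSC", "BroadInstitute"),
  rdataclass   = c("FaFile", "GRanges", "ChainFile", "GRanges"),
  stringsAsFactors = FALSE
)

# unique() lists the distinct values; table() additionally counts them.
provider_counts <- table(meta$dataprovider)

# Show the most common providers first.
sort(provider_counts, decreasing = TRUE)
```

Since unique() works on ah$dataprovider, the same table() call should presumably work on it as well, although that is not shown in this document.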

You can narrow down the hub by using one (or more) of the following strategies: subset for standard R subsetting, query for searching across the fields of the hub, and display for browsing the hub in a web browser.

It is often useful to start with a very rough subsetting, for example to data from a specific species. The subset function is useful for doing standard R subsetting (the function also works on data.frames).

ah <- subset(ah, species == "Homo sapiens")
ah
## AnnotationHub with 24236 records
## # snapshotDate(): 2015-08-17 
## # $dataprovider: BroadInstitute, UCSC, Ensembl, dbSNP, NIH Pathway Inte...
## # $species: Homo sapiens
## # $rdataclass: GRanges, BigWigFile, ChainFile, FaFile, VcfFile, biopax,...
## # additional mcols(): taxonomyid, genome, description, tags,
## #   sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH133"]]' 
## 
##             title                                    
##   AH133   | Homo_sapiens.GRCh37.69.cdna.all.fa       
##   AH134   | Homo_sapiens.GRCh37.69.dna.toplevel.fa   
##   AH135   | Homo_sapiens.GRCh37.69.dna_rm.toplevel.fa
##   AH136   | Homo_sapiens.GRCh37.69.dna_sm.toplevel.fa
##   AH137   | Homo_sapiens.GRCh37.69.ncrna.fa          
##   ...       ...                                      
##   AH49184 | Homo_sapiens.GRCh38.dna_rm.toplevel.fa   
##   AH49185 | Homo_sapiens.GRCh38.dna_sm.toplevel.fa   
##   AH49186 | Homo_sapiens.GRCh38.dna.toplevel.fa      
##   AH49187 | Homo_sapiens.GRCh38.ncrna.fa             
##   AH49188 | Homo_sapiens.GRCh38.pep.all.fa
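
Because subset also works on data.frames, its behaviour can be tried out without a hub. The following is a sketch on a mock data.frame. The meta table is hypothetical; its rows are loosely modeled on records printed in this document, and the class assigned to each row is an assumption made for illustration:

```r
# Mock metadata table (not a real hub); values are illustrative only.
meta <- data.frame(
  id         = c("AH2", "AH133", "AH23256"),
  species    = c("Ailuropoda melanoleuca", "Homo sapiens", "Homo sapiens"),
  rdataclass = c("FaFile", "FaFile", "GRanges"),
  stringsAsFactors = FALSE
)

# Column names are evaluated inside the data.frame, so no meta$ prefix is needed.
human <- subset(meta, species == "Homo sapiens")

# Conditions can be combined with & to narrow the selection further.
human_granges <- subset(meta, species == "Homo sapiens" & rdataclass == "GRanges")
human_granges$id
```

Starting with a rough condition and then adding more conditions keeps each step easy to check by printing the intermediate result.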

We can use query to search the hub. The (possible) drawback to query is that it searches over different fields in the hub, so be careful with very non-specific search terms. The query is a regular expression, which by default is case-insensitive. Here we locate all datasets on the H3K4me3 histone modification (in H. sapiens, because we selected this species above).

query(ah, "H3K4me3")
## AnnotationHub with 2018 records
## # snapshotDate(): 2015-08-17 
## # $dataprovider: BroadInstitute, UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description, tags,
## #   sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH23256"]]' 
## 
##             title                                                         
##   AH23256 | wgEncodeBroadHistoneGm12878H3k4me3StdPk.broadPeak.gz          
##   AH23273 | wgEncodeBroadHistoneH1hescH3k4me3StdPk.broadPeak.gz           
##   AH23297 | wgEncodeBroadHistoneHelas3H3k4me3StdPk.broadPeak.gz           
##   AH23311 | wgEncodeBroadHistoneHepg2H3k4me3StdPk.broadPeak.gz            
##   AH23324 | wgEncodeBroadHistoneHmecH3k4me3StdPk.broadPeak.gz             
##   ...       ...                                                           
##   AH46826 | UW.Fetal_Muscle_Leg.H3K4me3.H-24644.Histone.DS21536.gappedP...
##   AH46833 | UW.Fetal_Muscle_Trunk.H3K4me3.H-24851.Histone.DS23302.gappe...
##   AH46839 | UW.Fetal_Placenta.H3K4me3.H-24996.Histone.DS23300.gappedPea...
##   AH46845 | UW.Fetal_Stomach.H3K4me3.H-24639.Histone.DS22598.gappedPeak.gz
##   AH46851 | UW.Fetal_Thymus.H3K4me3.H-24644.Histone.DS21539.gappedPeak.gz
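
To illustrate why a non-specific search term can return far more than intended, here is a small base R sketch of a case-insensitive regular expression search over several fields. Both mock_query and the meta table are hypothetical. They mimic the behaviour described above and are not AnnotationHub's implementation:

```r
# Mock metadata; titles are borrowed from the output above, and the
# provider assigned to each row is an assumption for illustration.
meta <- data.frame(
  title = c("wgEncodeBroadHistoneGm12878H3k4me3StdPk.broadPeak.gz",
            "Homo_sapiens.GRCh37.69.cdna.all.fa",
            "UW.Fetal_Thymus.H3K4me3.H-24644.Histone.DS21539.gappedPeak.gz"),
  dataprovider = c("UCSC", "Ensembl", "BroadInstitute"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows where the pattern matches in ANY column,
# ignoring case.
mock_query <- function(df, pattern) {
  per_column <- lapply(df, function(col) grepl(pattern, col, ignore.case = TRUE))
  hits <- Reduce(`|`, per_column)  # TRUE if any column matched for that row
  df[hits, , drop = FALSE]
}

specific <- mock_query(meta, "h3k4me3")  # matches "H3k4me3" and "H3K4me3"
vague    <- mock_query(meta, "s")        # matches every row in this table
nrow(specific)
nrow(vague)
```

The lower-case pattern still finds both spellings of the histone mark, while the single-letter pattern matches every row because it is checked against every field.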

Another way of searching a hub is by using a browser. Notice how we assign the output of display, so that we can capture the selection made in the browser:

hist <- display(ah)

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] AnnotationHub_2.0.3 BiocStyle_1.6.0     rmarkdown_0.7      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0                  AnnotationDbi_1.30.1        
##  [3] knitr_1.11                   magrittr_1.5                
##  [5] BiocGenerics_0.14.0          IRanges_2.2.7               
##  [7] xtable_1.7-4                 R6_2.1.0                    
##  [9] stringr_1.0.0                httr_1.0.0                  
## [11] GenomeInfoDb_1.4.2           tools_3.2.1                 
## [13] parallel_3.2.1               Biobase_2.28.0              
## [15] DBI_0.3.1                    htmltools_0.2.6             
## [17] yaml_2.1.13                  digest_0.6.8                
## [19] interactiveDisplayBase_1.6.0 shiny_0.12.2                
## [21] formatR_1.2                  S4Vectors_0.6.3             
## [23] curl_0.9.2                   mime_0.3                    
## [25] evaluate_0.7.2               RSQLite_1.0.0               
## [27] stringi_0.5-5                BiocInstaller_1.18.4        
## [29] stats4_3.2.1                 httpuv_1.3.3

References

Hansen, Kasper D, Sarven Sabunciyan, Ben Langmead, Noemi Nagy, Rebecca Curley, Georg Klein, Eva Klein, Daniel Salamon, and Andrew P Feinberg. 2014. “Large-scale hypomethylated blocks associated with Epstein-Barr virus-induced B-cell immortalization.” Genome Research 24 (2): 177–84. doi:10.1101/gr.157743.113.