Dependencies

This document has the following dependencies:

library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

Use the following commands to install these packages in R.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicFeatures", "TxDb.Hsapiens.UCSC.hg19.knownGene"))

Corrections

Improvements and corrections to this document can be submitted on its GitHub in its repository.

Overview

The GenomicFeatures package contains functionality for so-called transcript database or TxDb objects. These objects contains a coherent interface to transcripts. Transcripts are complicated because higher organisms usually have many different transcripts for each gene locus.

Other Resources

Examples

We will show the TxDb functionality by examining a database of human transcripts. Unlike genomes in Bioconductor, there is no shorthand object; we reassign the long name to a shorter for convenience:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-03-19 13:55:51 -0700 (Thu, 19 Mar 2015)
## # GenomicFeatures version at creation time: 1.19.32
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

A TxDb object is really an interface to a SQLite database. You can query the database using a number of tools detailed in the package vignette, but usually you use convenience functions to extract the relevant information.

Extract basic quantities

Extract quantities and group

(Note: there are grouping functions without the non-grouping function and vice versa; there is for example no introns() function.

Other functions

Mapping between genome and transcript coordinates

Caution: Terminology

The TxDb object approach is powerful but it suffers (in my opinion) from a lack of clearly defined terminology. Even worse, the meaning of terminology changes depending on the function. For example transcript is sometimes used to refer to un-spliced transcripts (pre-mRNA) and sometimes to splices transcripts.

Gene, exons and transcripts

Let us start by examining genes, exons and transcripts. Let us focus on a single gene on chr1: DDX11L1.

gr <- GRanges(seqnames = "chr1", strand = "+", ranges = IRanges(start = 11874, end = 14409))
subsetByOverlaps(genes(txdb), gr)
## GRanges object with 1 range and 1 metadata column:
##             seqnames         ranges strand |     gene_id
##                <Rle>      <IRanges>  <Rle> | <character>
##   100287102     chr1 [11874, 14409]      + |   100287102
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
subsetByOverlaps(genes(txdb), gr, ignore.strand = TRUE)
## GRanges object with 2 ranges and 1 metadata column:
##             seqnames         ranges strand |     gene_id
##                <Rle>      <IRanges>  <Rle> | <character>
##   100287102     chr1 [11874, 14409]      + |   100287102
##      653635     chr1 [14362, 29961]      - |      653635
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

The genes() output contains a single gene with these coordinates, overlapping another gene on the opposite strand. Note that the gene is represented as a single range; so this output tells us nothing about exons and splicing. There is a single identifier called gene_id. If you look at the output of txdb you’ll see that this is an “Entrex Gene ID”.

subsetByOverlaps(transcripts(txdb), gr)
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames         ranges strand |     tx_id     tx_name
##          <Rle>      <IRanges>  <Rle> | <integer> <character>
##   [1]     chr1 [11874, 14409]      + |         1  uc001aaa.3
##   [2]     chr1 [11874, 14409]      + |         2  uc010nxq.1
##   [3]     chr1 [11874, 14409]      + |         3  uc010nxr.1
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

The gene has 3 transcripts; again we only have coordinates of the pre-mRNA here. There are 3 different transcript names (tx_name) which are identifiers from UCSC and then we have a TxDb specific transcript id (tx_id) which is an integer. Let’s look at exons:

subsetByOverlaps(exons(txdb), gr)
## GRanges object with 6 ranges and 1 metadata column:
##       seqnames         ranges strand |   exon_id
##          <Rle>      <IRanges>  <Rle> | <integer>
##   [1]     chr1 [11874, 12227]      + |         1
##   [2]     chr1 [12595, 12721]      + |         2
##   [3]     chr1 [12613, 12721]      + |         3
##   [4]     chr1 [12646, 12697]      + |         4
##   [5]     chr1 [13221, 14409]      + |         5
##   [6]     chr1 [13403, 14409]      + |         6
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Here we get 6 exons, but no indication of which exons makes up which transcripts. To get this, we can do

subsetByOverlaps(exonsBy(txdb, by = "tx"), gr)
## GRangesList object of length 3:
## $1 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand |   exon_id   exon_name exon_rank
##          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 [11874, 12227]      + |         1        <NA>         1
##   [2]     chr1 [12613, 12721]      + |         3        <NA>         2
##   [3]     chr1 [13221, 14409]      + |         5        <NA>         3
## 
## $2 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12595, 12721]      + |       2      <NA>         2
##   [3]     chr1 [13403, 14409]      + |       6      <NA>         3
## 
## $3 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12646, 12697]      + |       4      <NA>         2
##   [3]     chr1 [13221, 14409]      + |       5      <NA>         3
## 
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome

Here we now finally see the structure of the three transcripts in the form of a GRangesList.

Let us include the coding sequence (CDS). Now, it can be extremely hard to computationally infer the coding sequence from a fully spliced mRNA.

subsetByOverlaps(cds(txdb), gr)
## GRanges object with 3 ranges and 1 metadata column:
##       seqnames         ranges strand |    cds_id
##          <Rle>      <IRanges>  <Rle> | <integer>
##   [1]     chr1 [12190, 12227]      + |         1
##   [2]     chr1 [12595, 12721]      + |         2
##   [3]     chr1 [13403, 13639]      + |         3
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
subsetByOverlaps(cdsBy(txdb, by = "tx"), gr)
## GRangesList object of length 1:
## $2 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand |    cds_id    cds_name exon_rank
##          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 [12190, 12227]      + |         1        <NA>         1
##   [2]     chr1 [12595, 12721]      + |         2        <NA>         2
##   [3]     chr1 [13403, 13639]      + |         3        <NA>         3
## 
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome

The output of cds() is not very useful by itself, since each range is part of a CDS, not the entire cds. We need to know how these ranges together form a CDS, and for that we need cdsBy(by = "tx"). We can see that only one of the three transcripts has a CDS by looking at their CDS lengths:

subset(transcriptLengths(txdb, with.cds_len = TRUE), gene_id == "100287102")
##   tx_id    tx_name   gene_id nexon tx_len cds_len
## 1     1 uc001aaa.3 100287102     3   1652       0
## 2     2 uc010nxq.1 100287102     3   1488     402
## 3     3 uc010nxr.1 100287102     3   1595       0

(here we subset a data.frame).

Note: as an example of terminology mixup, consider that the output of transcripts() are coordinates for the unspliced transcript, whereas extractTranscriptSeqs() is the RNA sequence of the spliced transcripts.

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] XVector_0.8.0                          
##  [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
##  [3] GenomicFeatures_1.20.2                 
##  [4] AnnotationDbi_1.30.1                   
##  [5] Biobase_2.28.0                         
##  [6] GenomicRanges_1.20.5                   
##  [7] GenomeInfoDb_1.4.2                     
##  [8] IRanges_2.2.7                          
##  [9] S4Vectors_0.6.3                        
## [10] BiocGenerics_0.14.0                    
## [11] BiocStyle_1.6.0                        
## [12] rmarkdown_0.7                          
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.11              magrittr_1.5           
##  [3] GenomicAlignments_1.4.1 zlibbioc_1.14.0        
##  [5] BiocParallel_1.2.20     stringr_1.0.0          
##  [7] tools_3.2.1             DBI_0.3.1              
##  [9] lambda.r_1.1.7          futile.logger_1.4.1    
## [11] htmltools_0.2.6         yaml_2.1.13            
## [13] digest_0.6.8            rtracklayer_1.28.8     
## [15] formatR_1.2             futile.options_1.0.0   
## [17] bitops_1.0-6            RCurl_1.95-4.7         
## [19] biomaRt_2.24.0          evaluate_0.7.2         
## [21] RSQLite_1.0.0           stringi_0.5-5          
## [23] Rsamtools_1.20.4        Biostrings_2.36.3      
## [25] XML_3.98-1.3