This document has the following dependencies:
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicFeatures", "TxDb.Hsapiens.UCSC.hg19.knownGene"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
The GenomicFeatures package contains functionality for so-called transcript database or TxDb objects. These objects contains a coherent interface to transcripts. Transcripts are complicated because higher organisms usually have many different transcripts for each gene locus.
We will show the TxDb
functionality by examining a database of human transcripts. Unlike genomes in Bioconductor, there is no shorthand object; we reassign the long name to a shorter for convenience:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-03-19 13:55:51 -0700 (Thu, 19 Mar 2015)
## # GenomicFeatures version at creation time: 1.19.32
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
A TxDb
object is really an interface to a SQLite
database. You can query the database using a number of tools detailed in the package vignette, but usually you use convenience functions to extract the relevant information.
Extract basic quantities
genes()
transcripts()
cds()
exons()
microRNAs()
tRNAs()
promoters()
Extract quantities and group
transcriptsBy(by = c("gene", "exon", "cds"))
cdsBy(by = c("tx", "gene"))
exonsBy(by = c("tx", "gene"))
intronsByTranscript()
fiveUTRsByTranscript()
threeUTRsByTranscript()
(Note: there are grouping functions without the non-grouping function and vice versa; there is for example no introns()
function.
Other functions
transcriptLengths()
(optionally include CDS length etc).XXByOverlaps()
(select features based on overlaps with XX
being transcript
, cds
or exon
).Mapping between genome and transcript coordinates
extractTranscriptSeqs()
(getting RNA sequencing of the transcripts).The TxDb
object approach is powerful but it suffers (in my opinion) from a lack of clearly defined terminology. Even worse, the meaning of terminology changes depending on the function. For example transcript
is sometimes used to refer to un-spliced transcripts (pre-mRNA) and sometimes to splices transcripts.
Let us start by examining genes, exons and transcripts. Let us focus on a single gene on chr1: DDX11L1.
gr <- GRanges(seqnames = "chr1", strand = "+", ranges = IRanges(start = 11874, end = 14409))
subsetByOverlaps(genes(txdb), gr)
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 100287102 chr1 [11874, 14409] + | 100287102
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
subsetByOverlaps(genes(txdb), gr, ignore.strand = TRUE)
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 100287102 chr1 [11874, 14409] + | 100287102
## 653635 chr1 [14362, 29961] - | 653635
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The genes()
output contains a single gene with these coordinates, overlapping another gene on the opposite strand. Note that the gene is represented as a single range; so this output tells us nothing about exons and splicing. There is a single identifier called gene_id
. If you look at the output of txdb
you’ll see that this is an “Entrex Gene ID”.
subsetByOverlaps(transcripts(txdb), gr)
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr1 [11874, 14409] + | 1 uc001aaa.3
## [2] chr1 [11874, 14409] + | 2 uc010nxq.1
## [3] chr1 [11874, 14409] + | 3 uc010nxr.1
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The gene has 3 transcripts; again we only have coordinates of the pre-mRNA here. There are 3 different transcript names (tx_name
) which are identifiers from UCSC and then we have a TxDb
specific transcript id (tx_id
) which is an integer. Let’s look at exons:
subsetByOverlaps(exons(txdb), gr)
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 [11874, 12227] + | 1
## [2] chr1 [12595, 12721] + | 2
## [3] chr1 [12613, 12721] + | 3
## [4] chr1 [12646, 12697] + | 4
## [5] chr1 [13221, 14409] + | 5
## [6] chr1 [13403, 14409] + | 6
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Here we get 6 exons, but no indication of which exons makes up which transcripts. To get this, we can do
subsetByOverlaps(exonsBy(txdb, by = "tx"), gr)
## GRangesList object of length 3:
## $1
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 [11874, 12227] + | 1 <NA> 1
## [2] chr1 [12613, 12721] + | 3 <NA> 2
## [3] chr1 [13221, 14409] + | 5 <NA> 3
##
## $2
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## [1] chr1 [11874, 12227] + | 1 <NA> 1
## [2] chr1 [12595, 12721] + | 2 <NA> 2
## [3] chr1 [13403, 14409] + | 6 <NA> 3
##
## $3
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## [1] chr1 [11874, 12227] + | 1 <NA> 1
## [2] chr1 [12646, 12697] + | 4 <NA> 2
## [3] chr1 [13221, 14409] + | 5 <NA> 3
##
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Here we now finally see the structure of the three transcripts in the form of a GRangesList
.
Let us include the coding sequence (CDS). Now, it can be extremely hard to computationally infer the coding sequence from a fully spliced mRNA.
subsetByOverlaps(cds(txdb), gr)
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | cds_id
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 [12190, 12227] + | 1
## [2] chr1 [12595, 12721] + | 2
## [3] chr1 [13403, 13639] + | 3
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
subsetByOverlaps(cdsBy(txdb, by = "tx"), gr)
## GRangesList object of length 1:
## $2
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 [12190, 12227] + | 1 <NA> 1
## [2] chr1 [12595, 12721] + | 2 <NA> 2
## [3] chr1 [13403, 13639] + | 3 <NA> 3
##
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
The output of cds()
is not very useful by itself, since each range is part of a CDS, not the entire cds. We need to know how these ranges together form a CDS, and for that we need cdsBy(by = "tx")
. We can see that only one of the three transcripts has a CDS by looking at their CDS lengths:
subset(transcriptLengths(txdb, with.cds_len = TRUE), gene_id == "100287102")
## tx_id tx_name gene_id nexon tx_len cds_len
## 1 1 uc001aaa.3 100287102 3 1652 0
## 2 2 uc010nxq.1 100287102 3 1488 402
## 3 3 uc010nxr.1 100287102 3 1595 0
(here we subset a data.frame
).
Note: as an example of terminology mixup, consider that the output of transcripts()
are coordinates for the unspliced transcript, whereas extractTranscriptSeqs()
is the RNA sequence of the spliced transcripts.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] XVector_0.8.0
## [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [3] GenomicFeatures_1.20.2
## [4] AnnotationDbi_1.30.1
## [5] Biobase_2.28.0
## [6] GenomicRanges_1.20.5
## [7] GenomeInfoDb_1.4.2
## [8] IRanges_2.2.7
## [9] S4Vectors_0.6.3
## [10] BiocGenerics_0.14.0
## [11] BiocStyle_1.6.0
## [12] rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] knitr_1.11 magrittr_1.5
## [3] GenomicAlignments_1.4.1 zlibbioc_1.14.0
## [5] BiocParallel_1.2.20 stringr_1.0.0
## [7] tools_3.2.1 DBI_0.3.1
## [9] lambda.r_1.1.7 futile.logger_1.4.1
## [11] htmltools_0.2.6 yaml_2.1.13
## [13] digest_0.6.8 rtracklayer_1.28.8
## [15] formatR_1.2 futile.options_1.0.0
## [17] bitops_1.0-6 RCurl_1.95-4.7
## [19] biomaRt_2.24.0 evaluate_0.7.2
## [21] RSQLite_1.0.0 stringi_0.5-5
## [23] Rsamtools_1.20.4 Biostrings_2.36.3
## [25] XML_3.98-1.3