This document has the following dependencies:
library(GenomeInfoDb)
library(GenomicRanges)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomeInfoDb", "GenomicRanges"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
The GRanges
class contains seqinfo
information about the length and the names of the chromosomes. Here we will briefly discuss strategies for harmonizing this information.
The GenomeInfoDb package addresses a seemingly small, but consistent problem: different online resources uses different naming conventions for chromosomes. In more technical jargon, this package helps keeping your seqinfo
and seqlevels
harmonized.
It is common to want to remove seqlevels
from a GRanges
object. Here are some equivalent methods
gr <- GRanges(seqnames = c("chr1", "chr2"),
ranges = IRanges(start = 1:2, end = 4:5))
seqlevels(gr, force=TRUE) <- "chr1"
gr
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [1, 4] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
In GenomeInfoDb (loaded when you load GenomicRanges) you find dropSeqlevels()
and keepSeqlevels()
.
gr <- GRanges(seqnames = c("chr1", "chr2"),
ranges = IRanges(start = 1:2, end = 4:5))
dropSeqlevels(gr, "chr1")
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 [2, 5] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
keepSeqlevels(gr, "chr2")
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 [2, 5] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
You can also just get rid of weird looking chromosome names with keepStandardChromosomes()
.
gr <- GRanges(seqnames = c("chr1", "chrU345"),
ranges = IRanges(start = 1:2, end = 4:5))
keepStandardChromosomes(gr)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [1, 4] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
It is an inconvenient truth that different online resources uses different naming convention for chromosomes. This can even be different from organism to organism. For example, for the fruitfly (Drosophila Melanogaster) NCBI and Ensembl uses “2L” and UCSC uses “chr2L”. But NCBI and Ensembl differs on some contigs: NCBI uses “Un” and Ensembl used “U”.
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = 1:2, width = 2))
Let us remap
newStyle <- mapSeqlevels(seqlevels(gr), "NCBI")
gr <- renameSeqlevels(gr, newStyle)
This can in principle go wrong, if the original set of seqlevels
are inconsistent (not a single style).
The GenomeInfoDb also contains information for dropping / keeping various classes of chromosomes:
BSgenome
packages contains seqinfo
on their genome objects. This contains seqlengths
and other information. An easy trick is to use these packages to correct your seqinfo
information.
Hopefully, in Bioconductor 3.2 we will get support for seqlengths
in GenomeInfoDb so we can avoid using the big genome packages.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] GenomicRanges_1.20.5 GenomeInfoDb_1.4.2 IRanges_2.2.7
## [4] S4Vectors_0.6.3 BiocGenerics_0.14.0 BiocStyle_1.6.0
## [7] rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 formatR_1.2 magrittr_1.5 evaluate_0.7.2
## [5] stringi_0.5-5 XVector_0.8.0 tools_3.2.1 stringr_1.0.0
## [9] yaml_2.1.13 htmltools_0.2.6 knitr_1.11