Dependencies

This document has the following dependencies:

library(GenomeInfoDb)
library(GenomicRanges)

Use the following commands to install these packages in R.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomeInfoDb", "GenomicRanges"))

Corrections

Improvements and corrections to this document can be submitted on its GitHub in its repository.

Overview

The GRanges class contains seqinfo information about the length and the names of the chromosomes. Here we will briefly discuss strategies for harmonizing this information.

The GenomeInfoDb package addresses a seemingly small, but consistent problem: different online resources uses different naming conventions for chromosomes. In more technical jargon, this package helps keeping your seqinfo and seqlevels harmonized.

Other Resources

Drop and keep seqlevels

It is common to want to remove seqlevels from a GRanges object. Here are some equivalent methods

gr <- GRanges(seqnames = c("chr1", "chr2"),
              ranges = IRanges(start = 1:2, end = 4:5))
seqlevels(gr, force=TRUE) <- "chr1"
gr
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    [1, 4]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

In GenomeInfoDb (loaded when you load GenomicRanges) you find dropSeqlevels() and keepSeqlevels().

gr <- GRanges(seqnames = c("chr1", "chr2"),
              ranges = IRanges(start = 1:2, end = 4:5))
dropSeqlevels(gr, "chr1")
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2    [2, 5]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
keepSeqlevels(gr, "chr2")
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2    [2, 5]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

You can also just get rid of weird looking chromosome names with keepStandardChromosomes().

gr <- GRanges(seqnames = c("chr1", "chrU345"),
              ranges = IRanges(start = 1:2, end = 4:5))
keepStandardChromosomes(gr)
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    [1, 4]      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Changing style

It is an inconvenient truth that different online resources uses different naming convention for chromosomes. This can even be different from organism to organism. For example, for the fruitfly (Drosophila Melanogaster) NCBI and Ensembl uses “2L” and UCSC uses “chr2L”. But NCBI and Ensembl differs on some contigs: NCBI uses “Un” and Ensembl used “U”.

gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = 1:2, width = 2))

Let us remap

newStyle <- mapSeqlevels(seqlevels(gr), "NCBI")
gr <- renameSeqlevels(gr, newStyle)

This can in principle go wrong, if the original set of seqlevels are inconsistent (not a single style).

The GenomeInfoDb also contains information for dropping / keeping various classes of chromosomes:

Using information from BSgenome packages

BSgenome packages contains seqinfo on their genome objects. This contains seqlengths and other information. An easy trick is to use these packages to correct your seqinfo information.

Hopefully, in Bioconductor 3.2 we will get support for seqlengths in GenomeInfoDb so we can avoid using the big genome packages.

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
## [1] GenomicRanges_1.20.5 GenomeInfoDb_1.4.2   IRanges_2.2.7       
## [4] S4Vectors_0.6.3      BiocGenerics_0.14.0  BiocStyle_1.6.0     
## [7] rmarkdown_0.7       
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.8    formatR_1.2     magrittr_1.5    evaluate_0.7.2 
##  [5] stringi_0.5-5   XVector_0.8.0   tools_3.2.1     stringr_1.0.0  
##  [9] yaml_2.1.13     htmltools_0.2.6 knitr_1.11