```{r front, child="front.Rmd", echo=FALSE}
```

## Dependencies

This document has the following dependencies:

```{r dependencies, warning=FALSE, message=FALSE}
library(GenomeInfoDb)
library(GenomicRanges)
```

Use the following commands to install these packages in R.

```{r biocLite, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomeInfoDb", "GenomicRanges"))
```

## Corrections

Improvements and corrections to this document can be submitted on its [GitHub](https://github.com/kasperdanielhansen/genbioconductor/blob/master/Rmd/GenomicRanges_seqinfo.Rmd) in its [repository](https://github.com/kasperdanielhansen/genbioconductor).

## Overview

The `GRanges` class contains `seqinfo` information about the lengths and names of the chromosomes. Here we will briefly discuss strategies for harmonizing this information.

The `r Biocpkg("GenomeInfoDb")` package addresses a seemingly small but recurring problem: different online resources use different naming conventions for chromosomes. In more technical jargon, this package helps keep your `seqinfo` and `seqlevels` harmonized.

## Other Resources

- The vignette from the [GenomeInfoDb webpage](http://bioconductor.org/packages/GenomeInfoDb).

## Drop and keep seqlevels

It is common to want to remove `seqlevels` from a `GRanges` object. Here are some equivalent methods.

```{r seqlevelsForce}
gr <- GRanges(seqnames = c("chr1", "chr2"),
              ranges = IRanges(start = 1:2, end = 4:5))
# Keep only chr1; force=TRUE also drops the ranges on the removed seqlevel.
# Note: newer GenomeInfoDb releases replace force=TRUE with pruning.mode="coarse".
seqlevels(gr, force=TRUE) <- "chr1"
gr
```

In `r Biocpkg("GenomeInfoDb")` (loaded when you load `r Biocpkg("GenomicRanges")`) you find `dropSeqlevels()` and `keepSeqlevels()`.

```{r dropSeqlevels}
gr <- GRanges(seqnames = c("chr1", "chr2"),
              ranges = IRanges(start = 1:2, end = 4:5))
# Both calls leave only chr2
dropSeqlevels(gr, "chr1")
keepSeqlevels(gr, "chr2")
```

You can also just get rid of weird-looking chromosome names with `keepStandardChromosomes()`.
```{r keepStandard}
gr <- GRanges(seqnames = c("chr1", "chrU345"),
              ranges = IRanges(start = 1:2, end = 4:5))
keepStandardChromosomes(gr)
```

## Changing style

It is an inconvenient truth that different online resources use different naming conventions for chromosomes. These conventions can even differ from organism to organism. For example, for the fruit fly (*Drosophila melanogaster*) NCBI and Ensembl use "2L" and UCSC uses "chr2L". But NCBI and Ensembl differ on some contigs: NCBI uses "Un" and Ensembl uses "U".

```{r GRanges}
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = 1:2, width = 2))
```

Let us remap the `seqlevels` to the NCBI style.

```{r seqStyle}
# Named vector mapping the current seqlevels to their NCBI-style names
newStyle <- mapSeqlevels(seqlevels(gr), "NCBI")
gr <- renameSeqlevels(gr, newStyle)
```

This can in principle go wrong if the original set of `seqlevels` is inconsistent (not a single style).

The `r Biocpkg("GenomeInfoDb")` package also contains information for dropping / keeping various classes of chromosomes.

## Using information from BSgenome packages

`BSgenome` packages contain `seqinfo` on their genome objects. This contains `seqlengths` and other information. An easy trick is to use these packages to correct your `seqinfo` information.

Hopefully, in Bioconductor 3.2 we will get support for `seqlengths` in `r Biocpkg("GenomeInfoDb")` so we can avoid using the big genome packages.

## SessionInfo

\scriptsize

```{r sessionInfo, echo=FALSE}
sessionInfo()
```