This document has the following dependencies:
library(GenomicRanges)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
The S4Vectors package introduced the DataFrame
class. This class is very similar to the base data.frame
class from R, but it allows columns of any class, provided a number of required methods are supported. For example, DataFrame
can have IRanges
as columns, unlike data.frame
:
ir <- IRanges(start = 1:2, width = 3)
df1 <- DataFrame(iranges = ir)
df1
## DataFrame with 2 rows and 1 column
## iranges
## <IRanges>
## 1 [1, 3]
## 2 [2, 4]
df1$iranges
## IRanges of length 2
## start end width
## [1] 1 3 3
## [2] 2 4 3
df2 <- data.frame(iranges = ir)
df2
## iranges.start iranges.end iranges.width
## 1 1 3 3
## 2 2 4 3
In the data.frame
case, the IRanges
gives rise to 4 columns, whereas it is a single column when a DataFrame
is used.
Think of this as an expanded and more versatile class.
GRanges
(unlike IRanges
) may have associated metadata. This is immensely useful. The formal way to access and set this metadata is through values
or elementMetadata
or mcols
, like
gr <- GRanges(seqnames = "chr1", strand = c("+", "-", "+"),
ranges = IRanges(start = c(1,3,5), width = 3))
values(gr) <- DataFrame(score = c(0.1, 0.5, 0.3))
gr
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 [1, 3] + | 0.1
## [2] chr1 [3, 5] - | 0.5
## [3] chr1 [5, 7] + | 0.3
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A much easier way to set and access metadata is through the $
operator
gr$score
## [1] 0.1 0.5 0.3
gr$score2 = gr$score * 0.2
gr
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score score2
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 [1, 3] + | 0.1 0.02
## [2] chr1 [3, 5] - | 0.5 0.1
## [3] chr1 [5, 7] + | 0.3 0.06
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
findOverlaps
works exactly as for IRanges
. But the strand
information can be confusing. Let us make an example
gr2 <- GRanges(seqnames = c("chr1", "chr2", "chr1"), strand = "*",
ranges = IRanges(start = c(1, 3, 5), width = 3))
gr2
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [1, 3] *
## [2] chr2 [3, 5] *
## [3] chr1 [5, 7] *
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
gr
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score score2
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 [1, 3] + | 0.1 0.02
## [2] chr1 [3, 5] - | 0.5 0.1
## [3] chr1 [5, 7] + | 0.3 0.06
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note how the ranges
in the two GRanges
object are the same coordinates, they just have different seqnames
and strand
. Let us try to do a standard findOverlaps
:
findOverlaps(gr, gr2)
## Hits object with 4 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 2 1
## [3] 2 3
## [4] 3 3
## -------
## queryLength: 3
## subjectLength: 3
Notice how the *
strand overlaps both +
and -
. There is an argument ignore.strand
to findOverlaps
which will … ignore the strand information (so +
overlaps -
). Several other functions in GenomicRanges
have an ignore.strand
argument as well.
A common operation is to select only certain ranges from a GRanges
which overlap something else. Enter the convenience function subsetByOverlaps
subsetByOverlaps(gr, gr2)
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | score score2
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 [1, 3] + | 0.1 0.02
## [2] chr1 [3, 5] - | 0.5 0.1
## [3] chr1 [5, 7] + | 0.3 0.06
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A common situation is that you have data which looks like a GRanges
but is really stored as a classic data.frame
, with chr
, start
etc. The makeGRangesFromDataFrame
converts this data.frame
into a GRanges
. An argument tells you whether you want to keep any additional columns.
df <- data.frame(chr = "chr1", start = 1:3, end = 4:6, score = 7:9)
makeGRangesFromDataFrame(df)
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [1, 4] *
## [2] chr1 [2, 5] *
## [3] chr1 [3, 6] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr1 [1, 4] * | 7
## [2] chr1 [2, 5] * | 8
## [3] chr1 [3, 6] * | 9
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Here are some simple usecases with pseudo-code showing how we can accomplish various tasks with GRanges
objects and functionality.
Suppose we want to identify transcription factor (TF) binding sites that overlaps known SNPs.
Input objects are
snps: a GRanges
(of width 1)
TF: a GRanges
pseudocode:
findOverlaps(snps, TF)
(watch out for strand)
Suppose we have a set of differentially methylation regions (DMRs) (think genomic regions) and a set of CpG Islands and we want to find all DMRs within 10kb of a CpG Island.
Input objects are
dmrs: a GRanges
islands: a GRanges
pseudocode:
big_islands <- resize(islands, width = 20000 + width(islands), fix = "center")
findOverlaps(dmrs, big_islands)
(watch out for strand)
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] GenomicRanges_1.20.5 GenomeInfoDb_1.4.2 IRanges_2.2.7
## [4] S4Vectors_0.6.3 BiocGenerics_0.14.0 BiocStyle_1.6.0
## [7] rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 formatR_1.2 magrittr_1.5 evaluate_0.7.2
## [5] stringi_0.5-5 XVector_0.8.0 tools_3.2.1 stringr_1.0.0
## [9] yaml_2.1.13 htmltools_0.2.6 knitr_1.11
Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for computing and annotating genomic ranges.” PLoS Computational Biology 9 (8): e1003118. doi:10.1371/journal.pcbi.1003118.