This document has the following dependencies:
library(IRanges)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("IRanges"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
A surprising amount of objects/tasks in computational biology can be formulated in terms of integer intervals, manipulation of integer intervals and overlap of integer intervals.
Objects: A transcript (a union of integer intervals), a collection of SNPs (intervals of width 1), transcription factor binding sites, a collection of aligned short reads.
Tasks: Which transcription factor binding sites hit the promoter of genes (overlap between two sets of intervals), which SNPs hit a collection of exons, which short reads hit a predetermined set of exons.
IRanges
are collections of integer intervals. GRanges
are like IRanges
, but with an associated chromosome and strand, taking care of some book keeping.
Here we discuss IRanges
, which provides the foundation for GRanges
. This package implements (amongst other things) an algebra for handling integer intervals.
Specify IRanges by 2 of start, end, width (SEW).
ir1 <- IRanges(start = c(1,3,5), end = c(3,5,7))
ir1
## IRanges of length 3
## start end width
## [1] 1 3 3
## [2] 3 5 3
## [3] 5 7 3
ir2 <- IRanges(start = c(1,3,5), width = 3)
all.equal(ir1, ir2)
## [1] TRUE
An IRanges
consist of separate intervals; each interval is called a range. So ir1
above contains 3 ranges.
Assessor methods: start()
, end()
, width()
and also replacement methods.
start(ir1)
## [1] 1 3 5
width(ir2) <- 1
ir2
## IRanges of length 3
## start end width
## [1] 1 1 1
## [2] 3 3 1
## [3] 5 5 1
They may have names
names(ir1) <- paste("A", 1:3, sep = "")
ir1
## IRanges of length 3
## start end width names
## [1] 1 3 3 A1
## [2] 3 5 3 A2
## [3] 5 7 3 A3
They have a single dimension
dim(ir1)
## NULL
length(ir1)
## [1] 3
Because of this, subsetting works like a vector
ir1[1]
## IRanges of length 1
## start end width names
## [1] 1 3 3 A1
ir1["A1"]
## IRanges of length 1
## start end width names
## [1] 1 3 3 A1
Like vectors, you can concatenate two IRanges
with the c()
function
c(ir1, ir2)
## IRanges of length 6
## start end width names
## [1] 1 3 3 A1
## [2] 3 5 3 A2
## [3] 5 7 3 A3
## [4] 1 1 1
## [5] 3 3 1
## [6] 5 5 1
A normal IRanges is a minimal representation of the IRanges viewed as a set. Each integer only occur in a single range and there are as few ranges as possible. In addition, it is ordered. Many functions produce a normal IRanges
. Created by reduce()
.
ir
## IRanges of length 4
## start end width
## [1] 1 4 4
## [2] 3 4 2
## [3] 7 8 2
## [4] 9 10 2
reduce(ir)
## IRanges of length 2
## start end width
## [1] 1 4 4
## [2] 7 10 4
Answers: “Given a set of overlapping exons, which bases belong to an exon?”
From some perspective, disjoin()
is the opposite of reduce()
. An example explains better:
disjoin(ir1)
Answers: “Give a set of overlapping exons, which bases belong to the same set of exons?”
“Intra-range” manipulations are manipulations where each original range gets mapped to a new range.
Examples of these are: shift()
, narrow()
, flank()
, resize()
, restrict()
.
For example, resize()
can be extremely useful. It has a fix
argument controlling where the resizing occurs from. Use fix="center"
to resize around the center of the ranges; I use this a lot.
resize(ir, width = 1, fix = "start")
## IRanges of length 4
## start end width
## [1] 1 1 1
## [2] 3 3 1
## [3] 7 7 1
## [4] 9 9 1
resize(ir, width = 1, fix = "center")
## IRanges of length 4
## start end width
## [1] 2 2 1
## [2] 3 3 1
## [3] 7 7 1
## [4] 9 9 1
The help page is ?"intra-range-methods"
(note that there is both a help page in IRanges and GenomicRanges).
Manipulating IRanges
as sets means that we view each IRanges
as a set of integers; individual integers is either contained in one or more ranges or they are not. This is equivalent to calling reduce()
on the IRanges
first.
Once this is done, we can use standard: union()
, intersect()
, setdiff()
, gaps()
between two IRanges
(which all returns normalized IRanges
).
ir1 <- IRanges(start = c(1, 3, 5), width = 1)
ir2 <- IRanges(start = c(4, 5, 6), width = 1)
union(ir1, ir2)
## IRanges of length 2
## start end width
## [1] 1 1 1
## [2] 3 6 4
intersect(ir1, ir2)
## IRanges of length 1
## start end width
## [1] 5 5 1
Because they return normalized IRanges
, an alternative to union()
is
reduce(c(ir1, ir2))
## IRanges of length 2
## start end width
## [1] 1 1 1
## [2] 3 6 4
There is also an element-wise (pair-wise) version of these: punion()
, pintersect()
, psetdiff()
, pgap()
; this is similar to say pmax
from base R. In my experience, these functions are seldom used.
Finding (pairwise) overlaps between two IRanges
is done by findOverlaps()
. This function is very important and amazingly fast!
ir1 <- IRanges(start = c(1,4,8), end = c(3,7,10))
ir2 <- IRanges(start = c(3,4), width = 3)
ov <- findOverlaps(ir1, ir2)
ov
## Hits object with 3 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 1
## [2] 2 1
## [3] 2 2
## -------
## queryLength: 3
## subjectLength: 2
It returns a Hits
object which describes the relationship between the two IRanges
. This object is basically a two-column matrix of indicies into the two IRanges
.
The two columns of the hits object can be accessed by queryHits()
and subjectHits()
(often used with unique()
).
For example, the first row of the matrix describes that the first range of ir1
overlaps with the first range of ir2
. Or said differently, they have a non-empty intersection:
intersect(ir1[subjectHits(ov)[1]],
ir2[queryHits(ov)[2]])
## IRanges of length 0
The elements of unique(queryHits)
gives you the indices of the query ranges which actually had an overlap; you need unique
because a query range may overlap multiple subject ranges.
queryHits(ov)
## [1] 1 2 2
unique(queryHits(ov))
## [1] 1 2
The list of arguments to findOverlaps()
is long; there are a few hidden treasures here. For example, you can ask to only get an overlap if two ranges overlap by a certain number of bases.
args(findOverlaps)
## function (query, subject, maxgap = 0L, minoverlap = 1L, type = c("any",
## "start", "end", "within", "equal"), select = c("all", "first",
## "last", "arbitrary"), algorithm = c("nclist", "intervaltree"),
## ...)
## NULL
For efficiency, there is also countOverlaps()
, which just returns the number of overlaps. This function is faster and takes up less memory because it does not have to keep track of which ranges overlap, just the number of overlaps.
countOverlaps(ir1, ir2)
## [1] 1 2 0
Sometimes you have two sets of IRanges
and you need to know which ones are closest to each other. Functions for this include nearest()
, precede()
, follow()
. Watch out for ties!
ir1
## IRanges of length 3
## start end width
## [1] 1 3 3
## [2] 4 7 4
## [3] 8 10 3
ir2
## IRanges of length 2
## start end width
## [1] 3 5 3
## [2] 4 6 3
nearest(ir1, ir2)
## [1] 1 1 2
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] IRanges_2.2.7 S4Vectors_0.6.3 BiocGenerics_0.14.0
## [4] BiocStyle_1.6.0 rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2 tools_3.2.1 htmltools_0.2.6
## [5] yaml_2.1.13 stringi_0.5-5 knitr_1.11 stringr_1.0.0
## [9] digest_0.6.8 evaluate_0.7.2