This document has the following dependencies:
library(GenomicRanges)
library(rtracklayer)
library(AnnotationHub)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges", "rtracklayer", "AnnotationHub"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
We are going to use AnnotationHub and GenomicRanges to access ENCODE data on the H3K4me3 histone modification in a specific cell line. This histone modification is believed to mark active promoters, and we are going to attempt to verify this statement. This involves
findOverlaps
in GenomicRanges.First we use AnnotationHub to get data on homo sapiens.
ah <- AnnotationHub()
ah <- subset(ah, species == "Homo sapiens")
Next we search for two keywords: H3K4me3 and Gm12878 which is the name of the cell line we are interested in.
qhs <- query(ah, "H3K4me3")
qhs <- query(qhs, "Gm12878")
(Note: I like to keep my full annotation hub around, so I can re-do my query with a different search term in case I end up with no hits. This is why I start assigning output to the qhs
object and not ah
).
Lets have a look
qhs
## AnnotationHub with 11 records
## # snapshotDate(): 2015-08-17
## # $dataprovider: BroadInstitute, UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH23256"]]'
##
## title
## AH23256 | wgEncodeBroadHistoneGm12878H3k4me3StdPk.broadPeak.gz
## AH27075 | wgEncodeUwHistoneGm12878H3k4me3StdHotspotsRep1.broadPeak.gz
## AH27076 | wgEncodeUwHistoneGm12878H3k4me3StdHotspotsRep2.broadPeak.gz
## AH27077 | wgEncodeUwHistoneGm12878H3k4me3StdPkRep1.narrowPeak.gz
## AH27078 | wgEncodeUwHistoneGm12878H3k4me3StdPkRep2.narrowPeak.gz
## ... ...
## AH30747 | E116-H3K4me3.narrowPeak.gz
## AH31690 | E116-H3K4me3.gappedPeak.gz
## AH32869 | E116-H3K4me3.fc.signal.bigwig
## AH33901 | E116-H3K4me3.pval.signal.bigwig
## AH40294 | E116-H3K4me3.imputed.pval.signal.bigwig
Note how some of these hits don’t contain Gm12878 in their title. This is a useful illustration of how query
searches over multiple fields.
Lets have a closer look at this
qhs$title
## [1] "wgEncodeBroadHistoneGm12878H3k4me3StdPk.broadPeak.gz"
## [2] "wgEncodeUwHistoneGm12878H3k4me3StdHotspotsRep1.broadPeak.gz"
## [3] "wgEncodeUwHistoneGm12878H3k4me3StdHotspotsRep2.broadPeak.gz"
## [4] "wgEncodeUwHistoneGm12878H3k4me3StdPkRep1.narrowPeak.gz"
## [5] "wgEncodeUwHistoneGm12878H3k4me3StdPkRep2.narrowPeak.gz"
## [6] "E116-H3K4me3.broadPeak.gz"
## [7] "E116-H3K4me3.narrowPeak.gz"
## [8] "E116-H3K4me3.gappedPeak.gz"
## [9] "E116-H3K4me3.fc.signal.bigwig"
## [10] "E116-H3K4me3.pval.signal.bigwig"
## [11] "E116-H3K4me3.imputed.pval.signal.bigwig"
qhs$dataprovider
## [1] "UCSC" "UCSC" "UCSC" "UCSC"
## [5] "UCSC" "BroadInstitute" "BroadInstitute" "BroadInstitute"
## [9] "BroadInstitute" "BroadInstitute" "BroadInstitute"
This result is a great illustration of the mess of public data. It turns our that E116 is a Roadmap Epigenetics code for the Gm12878 cell line. The first 5 hits are from ENCODE, hosted at UCSC and the last 6 hits are from Roadmap Epigenomics hosted at the Broad Institute. The Roadmap data is different representation (and peaks) from the same underlying data. For the ENCODE data, two different centers did the same experiment in the same cell line (Broad
, hit 1) and (Uw
, hit 2-5), where Uw
exposed data on two replicates (whatever that means). These two experiments seems to be analyzed using different algorithms. It is even possible that the Roadmap data is from the same raw data but just analyzed using different algorithms.
Lets take a look at the narrowPeak
data:
gr1 <- subset(qhs, title == "wgEncodeUwHistoneGm12878H3k4me3StdPkRep1.narrowPeak.gz")[[1]]
gr1
## GRanges object with 74470 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 [713301, 713450] * | <NA> 0
## [2] chr1 [713501, 713650] * | <NA> 0
## [3] chr1 [713881, 714030] * | <NA> 0
## [4] chr1 [714181, 714330] * | <NA> 0
## [5] chr1 [714481, 714630] * | <NA> 0
## ... ... ... ... ... ... ...
## [74466] chrX [154492741, 154492890] * | <NA> 0
## [74467] chrX [154493401, 154493550] * | <NA> 0
## [74468] chrX [154560441, 154560590] * | <NA> 0
## [74469] chrX [154562121, 154562270] * | <NA> 0
## [74470] chrX [154842061, 154842210] * | <NA> 0
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <numeric>
## [1] 91 112.7680 -1 -1
## [2] 25 26.7181 -1 -1
## [3] 81 77.4798 -1 -1
## [4] 32 106.5650 -1 -1
## [5] 122 153.8320 -1 -1
## ... ... ... ... ...
## [74466] 43 52.20930 -1 -1
## [74467] 122 203.16800 -1 -1
## [74468] 8 4.49236 -1 -1
## [74469] 8 4.41978 -1 -1
## [74470] 125 170.20100 -1 -1
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
gr2 <- subset(qhs, title == "E116-H3K4me3.narrowPeak.gz")[[1]]
gr2
## GRanges object with 76188 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr9 [123553464, 123557122] * | Rank_1 623
## [2] chr3 [ 53196213, 53197995] * | Rank_2 622
## [3] chr18 [ 9137534, 9142676] * | Rank_3 583
## [4] chr11 [ 75110593, 75111943] * | Rank_4 548
## [5] chr13 [ 41343776, 41345943] * | Rank_5 543
## ... ... ... ... ... ... ...
## [76184] chr8 [ 98881411, 98881613] * | Rank_76184 20
## [76185] chr9 [ 26812360, 26812533] * | Rank_76185 20
## [76186] chrX [ 49019816, 49020016] * | Rank_76186 20
## [76187] chrX [ 55932872, 55933045] * | Rank_76187 20
## [76188] chr5 [118322235, 118322408] * | Rank_76188 20
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <numeric>
## [1] 23.09216 62.34132 52.85911 1736
## [2] 24.91976 62.22835 52.85911 439
## [3] 22.48938 58.39180 50.67302 442
## [4] 22.47056 54.84914 47.77176 752
## [5] 20.82804 54.31837 47.29083 909
## ... ... ... ... ...
## [76184] 2.63471 2.02539 0.27458 94
## [76185] 2.63471 2.02539 0.27458 32
## [76186] 2.63471 2.02539 0.27458 82
## [76187] 2.63471 2.02539 0.27458 41
## [76188] 2.62206 2.00970 0.26790 52
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
(Note: I use this code, where I use title
to refer to the different resources, to make this script more robust over time).
This gives us two GRanges
objects. Let us look at the distribution of peak widths:
summary(width(gr1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150.0 150.0 150.0 150.3 150.0 410.0
table(width(gr1))
##
## 150 210 230 250 270 290 390 410
## 74313 1 13 24 37 77 4 1
summary(width(gr2))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 174 243 387 662 770 12340
Turns out that almost all peaks in gr1
have a width of 150bp, whereas gr2
is much more variable. This is likely a product of the data processing algorithm; it can be very hard to figure out the details of this.
At this time one can spend a lot of time thinking about which datasets is best. We will avoid this (important) discussion; since we referred to ENCODE data above (in the Overview section), we will stick with gr1
.
Now we need to get some promoter coordinates. There are multiple ways to do this in Bioconductor, but here I will do a quick lookup for RefSeq in my annotation hub. RefSeq is a highly curated (aka conservative) collection of genes.
Lets get started
qhs <- query(ah, "RefSeq")
qhs
## AnnotationHub with 8 records
## # snapshotDate(): 2015-08-17
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH5040"]]'
##
## title
## AH5040 | RefSeq Genes
## AH5041 | Other RefSeq
## AH5155 | RefSeq Genes
## AH5156 | Other RefSeq
## AH5306 | RefSeq Genes
## AH5307 | Other RefSeq
## AH5431 | RefSeq Genes
## AH5432 | Other RefSeq
So this gives us multiple datasets, all with very similar names. We probably need the thing called RefSeq Genes
and not Other RefSeq
but why are there multiple resources with the same name?
Turns out the answer makes sense:
qhs$genome
## [1] "hg19" "hg19" "hg18" "hg18" "hg17" "hg17" "hg16" "hg16"
This looks like the same resources, but in different genome builds. We have
genome(gr1)
## chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2
## NA NA NA NA NA NA NA NA NA NA NA NA
## chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX
## NA NA NA NA NA NA NA NA NA NA NA
so we know which one to get:
refseq <- qhs[qhs$genome == "hg19" & qhs$title == "RefSeq Genes"]
refseq
## AnnotationHub with 1 record
## # snapshotDate(): 2015-08-17
## # names(): AH5040
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $title: RefSeq Genes
## # $description: GRanges object from UCSC track 'RefSeq Genes'
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: UCSC track
## # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/dat...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: refGene, UCSC, track, Gene, Transcript, Annotation
## # retrieve record with 'object[["AH5040"]]'
refseq <- refseq[[1]] ## Downloads
Lets have a look
refseq
## UCSC track 'refGene'
## UCSCData object with 50066 ranges and 5 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 [66999825, 67210768] + |
## [2] chr1 [ 8378145, 8404227] + |
## [3] chr1 [48998527, 50489626] - |
## [4] chr1 [16767167, 16786584] + |
## [5] chr1 [16767167, 16786584] + |
## ... ... ... ... ...
## [50062] chr19_gl000209_random [ 57209, 68123] + |
## [50063] chr19_gl000209_random [ 46646, 68123] + |
## [50064] chr19_gl000209_random [ 98135, 112667] + |
## [50065] chr19_gl000209_random [ 70071, 84658] + |
## [50066] chr19_gl000209_random [131433, 145745] + |
## name score itemRgb thick
## <character> <numeric> <character> <IRanges>
## [1] NM_032291 0 <NA> [67000042, 67208778]
## [2] NM_001080397 0 <NA> [ 8378169, 8404073]
## [3] NM_032785 0 <NA> [48999845, 50489468]
## [4] NM_001145277 0 <NA> [16767257, 16785491]
## [5] NM_001145278 0 <NA> [16767257, 16785385]
## ... ... ... ... ...
## [50062] NM_002255 0 <NA> [ 57249, 67717]
## [50063] NM_001258383 0 <NA> [ 57132, 67717]
## [50064] NM_012313 0 <NA> [ 98146, 112480]
## [50065] NM_001083539 0 <NA> [ 70108, 83979]
## [50066] NM_012312 0 <NA> [131468, 145120]
## blocks
## <IRangesList>
## [1] [ 1, 227] [91706, 91769] [98929, 98953] ...
## [2] [ 1, 102] [6222, 6642] [7214, 7306] ...
## [3] [ 1, 1439] [2036, 2062] [6788, 6884] ...
## [4] [ 1, 182] [2961, 3061] [7199, 7303] ...
## [5] [ 1, 104] [2961, 3061] [7199, 7303] ...
## ... ...
## [50062] [ 1, 80] [ 280, 315] [1182, 1466] ...
## [50063] [ 1, 86] [10414, 10643] [10843, 10878] ...
## [50064] [ 1, 46] [1523, 1557] [4002, 4301] ...
## [50065] [ 1, 71] [1071, 1106] [1851, 2135] ...
## [50066] [ 1, 69] [ 862, 897] [3334, 3633] ...
## -------
## seqinfo: 93 sequences from hg19 genome
Let us look at the number of isoforms per gene name:
table(table(refseq$name))
##
## 1 2 3 4 5 6 7 8 9 10 12 13
## 45178 568 102 27 74 52 171 129 13 19 1 1
## 19
## 5
(the table(table())
construction may seems weird at first, but its a great way to get a quick tabular summary of occurrences with the same name). This shows that almost all genes have a single transcript, which reflects how conservative RefSeq is.
Notice that each transcript is a separate range. Therefore, each transcript is represented as the “outer” coordinates of the gene; the ranges does not exclude introns. Because we got this from UCSC I happen to know that the $blocks
metadata really contains the coordinates of the different exons. In a later example we will introduce a gene representation where we keep track of exons, introns and transcripts from the same gene, through something known as a TxDb
(transcript database) object.
For now, we just need the promoters, so we don’t really care about introns. We need to keep track of which strand each transcript is on, to get the transcription start site.
Or we could just use the convenience function promoters()
:
promoters <- promoters(refseq)
table(width(promoters))
##
## 2200
## 50066
args(promoters)
## NULL
There are many definitions of promoters based on transcription start site. The default in this function is to use 2kb upstream and 200bp downstream of the start site. Let’s keep this.
Now we compute which promoters have a H3K4me3 peak in them:
ov <- findOverlaps(promoters, gr1)
ov
## Hits object with 46022 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 2 387
## [2] 3 2523
## [3] 4 774
## [4] 5 774
## [5] 6 1114
## ... ... ...
## [46018] 47457 47491
## [46019] 47459 47493
## [46020] 47459 47494
## [46021] 47460 47493
## [46022] 47460 47494
## -------
## queryLength: 50066
## subjectLength: 74470
Let us compute how many percent of the peaks are in a promoter
length(unique(queryHits(ov))) / length(gr1)
## [1] 0.3333691
and how many percent of promoters have a peak in them
length(unique(subjectHits(ov))) / length(promoters)
## [1] 0.440878
My rule of thumb is that any cell type has at most 50% of genes expressed, which fits well with these numbers. We also see that there are many H3K4me3 peaks which do not lie in a genic promoter. This is actually expected.
Now, the overlap is non-empty, but is it interesting or “significant”. Answering this question thoroughly requires thinking deeply about background distributions etc. We will avoid a careful statistical approach to this question, and will instead do a few back-of-the-envelope calculations to get a feel for it.
First we notice that both promoters and peaks are small compared to the size of the human genome:
sum(width(reduce(gr1))) / 10^6
## [1] 11.19044
sum(width(reduce(promoters))) / 10^6
## [1] 64.3005
(result is in megabases; the human genome is 3000 megabases). Just this fact alone should convince you that the overlap is highly unlikely happen purely by chance. Let us calculate the overlap in megabases:
sum(width(intersect(gr1, promoters))) / 10^6
## [1] 0
That’s weird; why is the overlap empty when we know (using findOverlaps
) that there is plenty of overlap? Turns out the answer is strand. We need to ignore the strand when we do this calculation:
sum(width(intersect(gr1, promoters, ignore.strand = TRUE))) / 10^6
## [1] 3.019608
This brings us back to the question of the size of the promoter set above; this is also affected by strand when there is a promoter on each strand which overlaps. Contrast
sum(width(reduce(promoters))) / 10^6
## [1] 64.3005
sum(width(reduce(promoters, ignore.strand = TRUE))) / 10^6
## [1] 62.27957
Strand can be troublesome!
Let us compute a small 2x2 matrix for which bases are in promoters and/or peaks:
prom <- reduce(promoters, ignore.strand = TRUE)
peaks <- reduce(gr1)
both <- intersect(prom, peaks)
only.prom <- setdiff(prom, both)
only.peaks <- setdiff(peaks, both)
overlapMat <- matrix(0,, ncol = 2, nrow = 2)
colnames(overlapMat) <- c("in.peaks", "out.peaks")
rownames(overlapMat) <- c("in.promoters", "out.promoter")
overlapMat[1,1] <- sum(width(both))
overlapMat[1,2] <- sum(width(only.prom))
overlapMat[2,1] <- sum(width(only.peaks))
overlapMat[2,2] <- 3*10^9 - sum(overlapMat)
round(overlapMat / 10^6, 2)
## in.peaks out.peaks
## in.promoters 3.02 59.26
## out.promoter 8.17 2929.55
Here we have just used the genome size of 3 billion bases. This is not correct. There are many of these bases which have not been sequenced and in addition, there are many bases which cannot be mapped using short reads. This will reduce the genome size.
Nevertheless, let us compute an odds-ratio for this table:
oddsRatio <- overlapMat[1,1] * overlapMat[2,2] / (overlapMat[2,1] * overlapMat[1,2])
oddsRatio
## [1] 18.26938
This odds-ratio shows an enrichment of peaks in promoters. We can get a feel for how much the genome size (which we use incorrectly) affects our result by using a lower bound on the genome size. Let us say 1.5 billion bases:
overlapMat[2,2] <- 1.5*10^9
oddsRatio <- overlapMat[1,1] * overlapMat[2,2] / (overlapMat[2,1] * overlapMat[1,2])
oddsRatio
## [1] 9.354362
The odds-ratio got smaller, but it is still bigger than 1.
Here we basically says that each base can be assigned to peaks or promoters independently, which is definitely false. So there are many, many reasons why this calculation is not the “right” one. Nevertheless, it appears that we have some enrichment, as expected.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] AnnotationHub_2.0.3 rtracklayer_1.28.8 GenomicRanges_1.20.5
## [4] GenomeInfoDb_1.4.2 IRanges_2.2.7 S4Vectors_0.6.3
## [7] BiocGenerics_0.14.0 BiocStyle_1.6.0 rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 BiocInstaller_1.18.4
## [3] formatR_1.2 futile.logger_1.4.1
## [5] XVector_0.8.0 bitops_1.0-6
## [7] futile.options_1.0.0 tools_3.2.1
## [9] zlibbioc_1.14.0 digest_0.6.8
## [11] evaluate_0.7.2 RSQLite_1.0.0
## [13] shiny_0.12.2 DBI_0.3.1
## [15] curl_0.9.2 yaml_2.1.13
## [17] stringr_1.0.0 httr_1.0.0
## [19] knitr_1.11 Biostrings_2.36.3
## [21] Biobase_2.28.0 R6_2.1.0
## [23] AnnotationDbi_1.30.1 XML_3.98-1.3
## [25] BiocParallel_1.2.20 lambda.r_1.1.7
## [27] magrittr_1.5 Rsamtools_1.20.4
## [29] htmltools_0.2.6 GenomicAlignments_1.4.1
## [31] mime_0.3 interactiveDisplayBase_1.6.0
## [33] xtable_1.7-4 httpuv_1.3.3
## [35] stringi_0.5-5 RCurl_1.95-4.7