This document has the following dependencies:
library(Biostrings)
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer2)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biostrings", "BSgenome",
"BSgenome.Scerevisiae.UCSC.sacCer2", "AnnotationHub"))
Improvements and corrections to this document can be submitted on its GitHub in its repository.
We continue our treatment of Biostrings and BSgenome
We often want to find patterns in (long) sequences. Biostrings have a number of functions for doing so
matchPattern
and vmatchPattern
: match a single sequence against one sequence (matchPattern
) or more than one (vmatchPattern
) sequences.matchPDict
and vmatchPDict
: match a (possibly large) set of sequences against one sequence (matchPDict
) or more than one (vmatchPDict
) sequences.These functions allows a small set of mismatches and some small indels. The Dict
term is used because the function builds a “dictionary” over the sequences.
There are also functions with similar naming using count
instead of match
(eg. countPatterns
). These functions returns the number of matches instead of precise information about where the matches occur.
In many ways, these functions are similar to using short read aligners like Bowtie. But these functions are designed to be comprehensive (return all matches satisfying certain criteria). Having this functionality available in Bioconductor can sometimes be very useful.
dnaseq <- DNAString("ACGTACGT")
matchPattern(dnaseq, Scerevisiae$chrI)
## Views on a 230208-letter DNAString subject
## subject: CCACACCACACCCACACACCCACACACCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
## views:
## start end width
## [1] 57932 57939 8 [ACGTACGT]
countPattern(dnaseq, Scerevisiae$chrI)
## [1] 1
vmatchPattern(dnaseq, Scerevisiae)
## GRanges object with 170 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI [ 57932, 57939] +
## [2] chrI [ 57932, 57939] -
## [3] chrII [ 49581, 49588] +
## [4] chrII [411291, 411298] +
## [5] chrII [491129, 491136] +
## ... ... ... ...
## [166] chrXVI [195477, 195484] -
## [167] chrXVI [683620, 683627] -
## [168] chrXVI [837296, 837303] -
## [169] chrXVI [906938, 906945] -
## [170] chrXVI [943045, 943052] -
## -------
## seqinfo: 18 sequences from an unspecified genome
head(vcountPattern(dnaseq, Scerevisiae))
## seqname strand count
## 1 chrI + 1
## 2 chrI - 1
## 3 chrII + 4
## 4 chrII - 4
## 5 chrIII + 3
## 6 chrIII - 3
See how we use vmatchPattern
to examine across all chromosomes.
First, note how the return object of vmatchPattern
is a GRanges
given the exact information of where the string matches. Note sequence we search for is its own reverse complement, so we get hits on both strands (which makes sense). Obviously, not all sequences are like this
dnaseq == reverseComplement(dnaseq)
## [1] TRUE
Second, note how the return object of matchPattern
looks like an IRanges
but is really something called a Views
(see another session).
There are a number of other, specialized, alignment functions in Biostrings. They include
matchPWM
: a position weight matrix is a common way to represent for example a transcription factor binding motif (think sequence logos). This function allows you to search for such motifs in the genome.pairwiseAlignment
: This function implements pairwise alignments using dynamic programming; providing an interface to both the Smith-Waterman local alignment problem and the Needleman-Wunsch global alignment problems, see a thorough description in (Durbin et al. 1998).trimLRpattern
(trim left-right pattern): Takes a set of sequences and looks for whether they start or end with a given (other sequence), for example a sequencing adapter. Used for trimming reads based on adapter sequences.For now, we will avoid further discussion of these functions.
One note: pairwiseAlignment
allows you to do pairwise alignments of millions of short reads against a single sequence, for example a gene or a transposable element. Few people use these algorithms for short read data, because the algorithms scale badly with the length of the sequence (ie. the genome), but they work fine for millions of reads as long as the reference sequence is short. In my opinion this approach might be very fruitful if you are particular interested in high-quality alignments to a specific small gene or region of the genome.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.4.0
## [2] BSgenome_1.36.3
## [3] rtracklayer_1.28.8
## [4] GenomicRanges_1.20.5
## [5] GenomeInfoDb_1.4.2
## [6] Biostrings_2.36.3
## [7] XVector_0.8.0
## [8] IRanges_2.2.7
## [9] S4Vectors_0.6.3
## [10] BiocGenerics_0.14.0
## [11] BiocStyle_1.6.0
## [12] rmarkdown_0.7
##
## loaded via a namespace (and not attached):
## [1] knitr_1.11 magrittr_1.5
## [3] zlibbioc_1.14.0 GenomicAlignments_1.4.1
## [5] BiocParallel_1.2.20 stringr_1.0.0
## [7] tools_3.2.1 lambda.r_1.1.7
## [9] futile.logger_1.4.1 htmltools_0.2.6
## [11] yaml_2.1.13 digest_0.6.8
## [13] formatR_1.2 futile.options_1.0.0
## [15] bitops_1.0-6 RCurl_1.95-4.7
## [17] evaluate_0.7.2 stringi_0.5-5
## [19] Rsamtools_1.20.4 XML_3.98-1.3
Durbin, Richard M, Sean R Eddy, Anders Krogh, Graeme Mitchison, and Sean R Eddy. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.