Biostrings - Matching

Dependencies

This document has the following dependencies:

library(Biostrings)
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer2)

Use the following commands to install these packages in R.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Biostrings", "BSgenome",
           "BSgenome.Scerevisiae.UCSC.sacCer2", "AnnotationHub"))

Corrections

Improvements and corrections to this document can be submitted on its GitHub in its repository.

Overview

We continue our treatment of Biostrings and BSgenome

Pattern matching

We often want to find patterns in (long) sequences. Biostrings have a number of functions for doing so

matchPattern and vmatchPattern: match a single sequence against one sequence (matchPattern) or more than one (vmatchPattern) sequences.
matchPDict and vmatchPDict: match a (possibly large) set of sequences against one sequence (matchPDict) or more than one (vmatchPDict) sequences.

These functions allows a small set of mismatches and some small indels. The Dict term is used because the function builds a “dictionary” over the sequences.

There are also functions with similar naming using count instead of match (eg. countPatterns). These functions returns the number of matches instead of precise information about where the matches occur.

In many ways, these functions are similar to using short read aligners like Bowtie. But these functions are designed to be comprehensive (return all matches satisfying certain criteria). Having this functionality available in Bioconductor can sometimes be very useful.

dnaseq <- DNAString("ACGTACGT")
matchPattern(dnaseq, Scerevisiae$chrI)

##   Views on a 230208-letter DNAString subject
## subject: CCACACCACACCCACACACCCACACACCACAC...GTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
## views:
##     start   end width
## [1] 57932 57939     8 [ACGTACGT]

countPattern(dnaseq, Scerevisiae$chrI)

## [1] 1

vmatchPattern(dnaseq, Scerevisiae)

## GRanges object with 170 ranges and 0 metadata columns:
##         seqnames           ranges strand
##            <Rle>        <IRanges>  <Rle>
##     [1]     chrI [ 57932,  57939]      +
##     [2]     chrI [ 57932,  57939]      -
##     [3]    chrII [ 49581,  49588]      +
##     [4]    chrII [411291, 411298]      +
##     [5]    chrII [491129, 491136]      +
##     ...      ...              ...    ...
##   [166]   chrXVI [195477, 195484]      -
##   [167]   chrXVI [683620, 683627]      -
##   [168]   chrXVI [837296, 837303]      -
##   [169]   chrXVI [906938, 906945]      -
##   [170]   chrXVI [943045, 943052]      -
##   -------
##   seqinfo: 18 sequences from an unspecified genome

head(vcountPattern(dnaseq, Scerevisiae))

##   seqname strand count
## 1    chrI      +     1
## 2    chrI      -     1
## 3   chrII      +     4
## 4   chrII      -     4
## 5  chrIII      +     3
## 6  chrIII      -     3

See how we use vmatchPattern to examine across all chromosomes.

First, note how the return object of vmatchPattern is a GRanges given the exact information of where the string matches. Note sequence we search for is its own reverse complement, so we get hits on both strands (which makes sense). Obviously, not all sequences are like this

dnaseq == reverseComplement(dnaseq)

## [1] TRUE

Second, note how the return object of matchPattern looks like an IRanges but is really something called a Views (see another session).

Specialized alignments

There are a number of other, specialized, alignment functions in Biostrings. They include

matchPWM: a position weight matrix is a common way to represent for example a transcription factor binding motif (think sequence logos). This function allows you to search for such motifs in the genome.
pairwiseAlignment: This function implements pairwise alignments using dynamic programming; providing an interface to both the Smith-Waterman local alignment problem and the Needleman-Wunsch global alignment problems, see a thorough description in (Durbin et al. 1998).
trimLRpattern (trim left-right pattern): Takes a set of sequences and looks for whether they start or end with a given (other sequence), for example a sequencing adapter. Used for trimming reads based on adapter sequences.

For now, we will avoid further discussion of these functions.

One note: pairwiseAlignment allows you to do pairwise alignments of millions of short reads against a single sequence, for example a gene or a transposable element. Few people use these algorithms for short read data, because the algorithms scale badly with the length of the sequence (ie. the genome), but they work fine for millions of reads as long as the reference sequence is short. In my opinion this approach might be very fruitful if you are particular interested in high-quality alignments to a specific small gene or region of the genome.

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] BSgenome.Scerevisiae.UCSC.sacCer2_1.4.0
##  [2] BSgenome_1.36.3                        
##  [3] rtracklayer_1.28.8                     
##  [4] GenomicRanges_1.20.5                   
##  [5] GenomeInfoDb_1.4.2                     
##  [6] Biostrings_2.36.3                      
##  [7] XVector_0.8.0                          
##  [8] IRanges_2.2.7                          
##  [9] S4Vectors_0.6.3                        
## [10] BiocGenerics_0.14.0                    
## [11] BiocStyle_1.6.0                        
## [12] rmarkdown_0.7                          
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.11              magrittr_1.5           
##  [3] zlibbioc_1.14.0         GenomicAlignments_1.4.1
##  [5] BiocParallel_1.2.20     stringr_1.0.0          
##  [7] tools_3.2.1             lambda.r_1.1.7         
##  [9] futile.logger_1.4.1     htmltools_0.2.6        
## [11] yaml_2.1.13             digest_0.6.8           
## [13] formatR_1.2             futile.options_1.0.0   
## [15] bitops_1.0-6            RCurl_1.95-4.7         
## [17] evaluate_0.7.2          stringi_0.5-5          
## [19] Rsamtools_1.20.4        XML_3.98-1.3

References

Durbin, Richard M, Sean R Eddy, Anders Krogh, Graeme Mitchison, and Sean R Eddy. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.