```{r front, child="front.Rmd", echo=FALSE} ``` ## Dependencies This document has the following dependencies: ```{r dependencies, warning=FALSE, message=FALSE} library(Biostrings) ``` Use the following commands to install these packages in R. ```{r biocLite, eval=FALSE} source("http://www.bioconductor.org/biocLite.R") biocLite(c("Biostrings")) ``` ## Corrections Improvements and corrections to this document can be submitted on its [GitHub](https://github.com/kasperdanielhansen/genbioconductor/blob/master/Rmd/Biostrings.Rmd) in its [repository](https://github.com/kasperdanielhansen/genbioconductor). ## Overview The `r Biocpkg("Biostrings")` package contains classes and functions for representing biological strings such as DNA, RNA and amino acids. In addition the package has functionality for pattern matching (short read alignment) as well as a pairwise alignment function implementing Smith-Waterman local alignments and Needleman-Wunsch global alignments used in classic sequence alignment (see [@Durbin:1998] for a description of these algorithms). There are also functions for reading and writing output such as FASTA files. ## Other Resources ## Representing sequences There are two basic types of containers for representing strings. One container represents a single string (say a chromosome or a single short read) and the other container represents a set of strings (say a set of short reads). There are different classes intended to represent different types of sequences such as DNA or RNA sequences. ```{r DNAString, error=TRUE} dna1 <- DNAString("ACGT-N") dna1 DNAStringSet("ADE") dna2 <- DNAStringSet(c("ACGT", "GTCA", "GCTA")) dna2 ``` Note that the alphabet of a `DNAString` is an extended alphabet: `-` (for insertion) and `N` are allowed. In fact, IUPAC codes are allowed (these codes represent different characters, for example the code "M" represents either and "A" or a "C"). A list of IUPAC codes can be obtained by ```{r IUPAC} IUPAC_CODE_MAP ``` Indexing into a `DNAString` retrieves a subsequence (similar to the standard R function `substr`), whereas indexing into a `DNAStringSet` gives you a subset of sequences. ```{r DNAStringSubset} dna1[2:4] dna2[2:3] ``` Note that `[[` allows you to get a single element of a `DNAStringSet` as a `DNAString`. This is very similar to `[` and `[[` for lists. ```{r DNAStringSubset2} dna2[[2]] ``` `DNAStringSet` objects can have names, like ordinary vectors ```{r DNAStringSetNames} names(dna2) <- paste0("seq", 1:3) dna2 ``` The full set of string classes are - `DNAString[Set]`: DNA sequences. - `RNAString[Set]`: RNA sequences. - `AAString[Set]`: Amino Acids sequences (protein). - `BString[Set]`: "Big" sequences, using any kind of letter. In addition you will often see references to `XString[Set]` in the documentation. An `XString[Set]` is basically any of the above classes. These classes seem very similar to standard `characters()` from base R, but there are important differences. The differences are mostly about efficiencies when you deal with either (a) many sequences or (b) very long strings (think whole chromosomes). ## Basic functionality Basic character functionality is supported, like - `length`, `names`. - `c` and `rev` (reverse the sequence). - `width`, `nchar` (number of characters in each sequence). - `==`, `duplicated`, `unique`. - `as.charcater` or `toString`: converts to a base `character()` vector. - `sort`, `order`. - `chartr`: convert some letters into other letters. - `subseq`, `subseq<-`, `extractAt`, `replaceAt`. - `replaceLetterAt`. Examples ```{r basicFunc} width(dna2) sort(dna2) rev(dna2) rev(dna1) ``` Note that `rev` on a `DNAStringSet` just reverse the order of the elements, whereas `rev` on a `DNAString` actually reverse the string. ## Biological functionality There are also functions which are related to the biological interpretation of the sequences, including - `reverse`: reverse the sequence. - `complement`, `reverseComplement`: (reverse) complement the sequence. - `translate`: translate the DNA or RNA sequence into amino acids. ```{r bioFunc} translate(dna2) reverseComplement(dna1) ``` ## Counting letters We very often want to count sequences in various ways. Examples include: - Compute the GC content of a set of sequences. - Compute the frequencies of dinucleotides in a set of sequences. - Compute a position weight matrix from a set of aligned sequences. There is a rich set of functions for doing this quickly. - `alphabetFrequency`, `letterFrequency`: Compute the frequency of all characters (`alphabetFrequency`) or only specific letters (`letterFrequency`). - `dinucleotideFrequency`, `trinucleotideFrequency`, `oligonucleotideFrequeny`: compute frequencies of dinucleotides (2 bases), trinucleotides (3 bases) and oligonucleotides (general number of bases). - `letterFrequencyInSlidingView`: letter frequencies, but in sliding views along the string. - `consensusMatrix`: consensus matrix; almost a position weight matrix. Let's look at some examples, note how the output expands to a matrix when you use the functions on a `DNAStringSet`: ```{r counting} alphabetFrequency(dna1) alphabetFrequency(dna2) letterFrequency(dna2, "GC") consensusMatrix(dna2, as.prob = TRUE) ``` (most functions allows the return of probabilities with `as.prob = TRUE`). ## SessionInfo \scriptsize ```{r sessionInfo, echo=FALSE} sessionInfo() ``` ## References