This document has the following dependencies:
library(ShortRead)
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("ShortRead"))
The ShortRead package contains functionality for reading and examining raw sequence reads (typically in FASTQ format).
The ShortRead package was one of the first Bioconductor packages to deal with low-level analysis of high-throughput sequencing data. Some of its functionality has now been superseded by other packages, but there is still relevant functionality left.
The FASTQ file format is the standard way of representing raw (unaligned) next generation sequencing reads, particular for the Illumina platform. The format basically consists of 4 lines per read, with the lines containing
Paired-end reads are stored in two separate files, where the reads are ordered the same (this is obviously fragile; what if reads are re-ordered in one file and not the other).
These files are read by readFastq()
which produces an object of class ShortReadQ
fastqDir <- system.file("extdata", "E-MTAB-1147", package = "ShortRead")
fastqPath <- list.files(fastqDir, pattern = ".fastq.gz$", full = TRUE)[1]
reads <- readFastq(fastqPath)
reads
## class: ShortReadQ
## length: 20000 reads; width: 72 cycles
Here we directly point the function to the file path. A paradigm which is often used in Bioconductor is to first put the file path into an object which represents a specific file type and then read it; see
fqFile <- FastqFile(fastqPath)
fqFile
## class: FastqFile
## path: /Library/Frameworks/R.framework/Versio.../ERR127302_1_subset.fastq.gz
## isOpen: FALSE
reads <- readFastq(fqFile)
This appears to make little sense in this situation, but for really big files it makes sense to access them in chunks, see below for a BAM file example.
The ShortReadQ
class is very similar to a DNAStringSet
but it has two sets of strings: one for the read nucleotides and one for the base qualities. They are accessed as
sread(reads)[1:2]
## A DNAStringSet instance of length 2
## width seq
## [1] 72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2] 72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
quality(reads)[1:2]
## class: FastqQuality
## quality:
## A BStringSet instance of length 2
## width seq
## [1] 72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2] 72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
id(reads)[1:2]
## A BStringSet instance of length 2
## width seq
## [1] 53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2] 53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1
Note how the quality scores are listed as characters. You can convert them into standard 0-40 integer quality scores by
as(quality(reads), "matrix")[1:2,1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 39 39 39 39 39 39 39 39 39 39
## [2,] 40 40 40 40 39 40 40 40 38 40
In this conversion, each letter is matched to an integer between 0 and 40. This matching is known as the “encoding” of the quality scores and there has been different ways to do this encoding. Unfortunately, it is not stored in the FASTQ file which encoding is used, so you have to know or guess the encoding. The ShortRead package does this for you.
These numbers are supposed to related to the probability that the reported base is different from the template fragment (ie. a sequence error). One should be aware that this probabilistic interpretation is not always true; methods such as “quality-remapping” helps to ensure this.
In the early days of next generation sequencing, there was no standardized alignment output format. different aligners produced different output file, including Bowtie and MAQ. Later on, the SAM / BAM format was introduced and this is now the standard alignment output. ShortRead contains tools for reading these older alignment formats through the readAligned()
function (the type
argument support options such as type="Bowtie"
and type="MAQMap"
and type="MAQMapShort"
).
The package has some very old support for parsing BAM files, but use Rsamtools and GenomicAlignments for this task instead.
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] ShortRead_1.26.0 GenomicAlignments_1.4.1
## [3] Rsamtools_1.20.4 GenomicRanges_1.20.6
## [5] GenomeInfoDb_1.4.2 Biostrings_2.36.4
## [7] XVector_0.8.0 IRanges_2.2.7
## [9] S4Vectors_0.6.5 BiocParallel_1.2.20
## [11] BiocGenerics_0.14.0 BiocStyle_1.6.0
## [13] rmarkdown_0.8
##
## loaded via a namespace (and not attached):
## [1] knitr_1.11 magrittr_1.5 zlibbioc_1.14.0
## [4] lattice_0.20-33 hwriter_1.3.2 stringr_1.0.0
## [7] tools_3.2.1 grid_3.2.1 Biobase_2.28.0
## [10] latticeExtra_0.6-26 lambda.r_1.1.7 futile.logger_1.4.1
## [13] htmltools_0.2.6 yaml_2.1.13 digest_0.6.8
## [16] RColorBrewer_1.1-2 formatR_1.2 futile.options_1.0.0
## [19] bitops_1.0-6 evaluate_0.7.2 stringi_0.5-5