Dependencies

This document has the following dependencies:

library(ShortRead)

Use the following commands to install these packages in R.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("ShortRead"))

Overview

The ShortRead package contains functionality for reading and examining raw sequence reads (typically in FASTQ format).

Other Resources

ShortRead

The ShortRead package was one of the first Bioconductor packages to deal with low-level analysis of high-throughput sequencing data. Some of its functionality has now been superseded by other packages, but there is still relevant functionality left.

Reading FASTQ files

The FASTQ file format is the standard way of representing raw (unaligned) next generation sequencing reads, particular for the Illumina platform. The format basically consists of 4 lines per read, with the lines containing

  1. Read name (sometimes includes flowcell ID or other information).
  2. Read nucleotides
  3. Either empty or a repeat of line 1
  4. Encoded read quality scores

Paired-end reads are stored in two separate files, where the reads are ordered the same (this is obviously fragile; what if reads are re-ordered in one file and not the other).

These files are read by readFastq() which produces an object of class ShortReadQ

fastqDir <- system.file("extdata", "E-MTAB-1147", package = "ShortRead")
fastqPath <- list.files(fastqDir, pattern = ".fastq.gz$", full = TRUE)[1]
reads <- readFastq(fastqPath)
reads
## class: ShortReadQ
## length: 20000 reads; width: 72 cycles

Here we directly point the function to the file path. A paradigm which is often used in Bioconductor is to first put the file path into an object which represents a specific file type and then read it; see

fqFile <- FastqFile(fastqPath)
fqFile
## class: FastqFile 
## path: /Library/Frameworks/R.framework/Versio.../ERR127302_1_subset.fastq.gz
## isOpen: FALSE
reads <- readFastq(fqFile)

This appears to make little sense in this situation, but for really big files it makes sense to access them in chunks, see below for a BAM file example.

The ShortReadQ class is very similar to a DNAStringSet but it has two sets of strings: one for the read nucleotides and one for the base qualities. They are accessed as

sread(reads)[1:2]
##   A DNAStringSet instance of length 2
##     width seq
## [1]    72 GTCTGCTGTATCTGTGTCGGCTGTCTCGCGG...CAATGAAGGCCTGGAATGTCACTACCCCCAG
## [2]    72 CTAGGGCAATCTTTGCAGCAATGAATGCCAA...GTGGCTTTTGAGGCCAGAGCAGACCTTCGGG
quality(reads)[1:2]
## class: FastqQuality
## quality:
##   A BStringSet instance of length 2
##     width seq
## [1]    72 HHHHHHHHHHHHHHHHHHHHEBDBB?B:BBG...FEFBDBD@DDECEE3>:?;@@@>?=BAB?##
## [2]    72 IIIIHIIIGIIIIIIIHIIIIEGBGHIIIIH...IHIIIHIIIIIGIIIEGIIGBGE@DDGGGIG
id(reads)[1:2]
##   A BStringSet instance of length 2
##     width seq
## [1]    53 ERR127302.8493430 HWI-EAS350_0441:1:34:16191:2123#0/1
## [2]    53 ERR127302.21406531 HWI-EAS350_0441:1:88:9330:2587#0/1

A word on quality scores

Note how the quality scores are listed as characters. You can convert them into standard 0-40 integer quality scores by

as(quality(reads), "matrix")[1:2,1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]   39   39   39   39   39   39   39   39   39    39
## [2,]   40   40   40   40   39   40   40   40   38    40

In this conversion, each letter is matched to an integer between 0 and 40. This matching is known as the “encoding” of the quality scores and there has been different ways to do this encoding. Unfortunately, it is not stored in the FASTQ file which encoding is used, so you have to know or guess the encoding. The ShortRead package does this for you.

These numbers are supposed to related to the probability that the reported base is different from the template fragment (ie. a sequence error). One should be aware that this probabilistic interpretation is not always true; methods such as “quality-remapping” helps to ensure this.

Reading alignment files

In the early days of next generation sequencing, there was no standardized alignment output format. different aligners produced different output file, including Bowtie and MAQ. Later on, the SAM / BAM format was introduced and this is now the standard alignment output. ShortRead contains tools for reading these older alignment formats through the readAligned() function (the type argument support options such as type="Bowtie" and type="MAQMap" and type="MAQMapShort").

The package has some very old support for parsing BAM files, but use Rsamtools and GenomicAlignments for this task instead.

SessionInfo

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] ShortRead_1.26.0        GenomicAlignments_1.4.1
##  [3] Rsamtools_1.20.4        GenomicRanges_1.20.6   
##  [5] GenomeInfoDb_1.4.2      Biostrings_2.36.4      
##  [7] XVector_0.8.0           IRanges_2.2.7          
##  [9] S4Vectors_0.6.5         BiocParallel_1.2.20    
## [11] BiocGenerics_0.14.0     BiocStyle_1.6.0        
## [13] rmarkdown_0.8          
## 
## loaded via a namespace (and not attached):
##  [1] knitr_1.11           magrittr_1.5         zlibbioc_1.14.0     
##  [4] lattice_0.20-33      hwriter_1.3.2        stringr_1.0.0       
##  [7] tools_3.2.1          grid_3.2.1           Biobase_2.28.0      
## [10] latticeExtra_0.6-26  lambda.r_1.1.7       futile.logger_1.4.1 
## [13] htmltools_0.2.6      yaml_2.1.13          digest_0.6.8        
## [16] RColorBrewer_1.1-2   formatR_1.2          futile.options_1.0.0
## [19] bitops_1.0-6         evaluate_0.7.2       stringi_0.5-5