```{r front, child="front.Rmd", echo=FALSE} ``` ## Dependencies This document has the following dependencies: ```{r dependencies, warning=FALSE, message=FALSE} library(minfi) library(GEOquery) ``` Use the following commands to install these packages in R. ```{r biocLite, eval=FALSE} source("http://www.bioconductor.org/biocLite.R") biocLite(c("minfi", "GEOquery")) ``` ## Overview The Illumina 450k DNA methylation microarray is a cheap and relatively comprehensive array for assaying DNA methylation. It is the platform of choice for large sample profiling of DNA methylation, especially for so-called EWAS (Epigenome-wide association studies). The studies are like GWAS (genome-wide association studies) but instead of associated a phenotype (like disease) with genotype (typically measured on SNP arrays or exome/whole-genome sequencing) they associated phenotype with epigenotype. ## Other Resources - The vignette from the [minfi webpage](http://bioconductor.org/packages/minfi). ## DNA methylation DNA methylation is a chemical modification of DNA. In humans, it occurs at CpG dinucleotides where the 'C' can be methylated or not. The methylation state of a given locus in a single cell is binary (technically tertiary since we have two copies of most chromosomes) but we measure DNA methylation across a population of cells. We observe that some loci has intermediate methylation values (between 0 and 1) and we use the methylation percentage (or Beta-value) to describe this. The goal of most analyses of DNA methylation is to associate changes in phenotype with changes in methylation in one or more loci. ## Array Design The 450k array has a very unusual design, which to some extent impact analysis. It is really a mixture of a two-color array and two one-color arrays. There are two main types of probes (type I and type II) and the probe design affects the signal distribution of the probe. The raw data format for the 450k array is known as IDAT. Because the array is measured in two different colors, there are two files for each sample, typically with the extention `_Grn.idat` and `_Red.idat`. Illumina's software suite for analysis of this array is called GenomeStudio. It is not unusual for practitioners to only have access to processed data from GenomeStudio instead of the raw IDAT files, but I and others have shown that there is information in the IDAT files which are beneficial to analysis. ## Data We will access a dataset created with the intention of studying acute mania. Serum samples were obtained from individuals hospitalized with acute mania as well as unaffected controls. We want to obtain the IDAT files which are available as supplementary data. Far from all 450k datasets on GEO has IDAT files available. First we download the files. ```{r geoquery} library(GEOquery) getGEOSuppFiles("GSE68777") untar("GSE68777/GSE68777_RAW.tar", exdir = "GSE68777/idat") head(list.files("GSE68777/idat", pattern = "idat")) ``` Currently `r Biocpkg("minfi")` does not support reading compressed IDAT files. This is clearly a needed functionality and (as the maintainer of this package) I will address this. But for now we will need to decompress the files. ```{r decompress} idatFiles <- list.files("GSE68777/idat", pattern = "idat.gz$", full = TRUE) sapply(idatFiles, gunzip, overwrite = TRUE) ``` Now we read the IDAT files using `read.450k.exp()` which (in this case) reads all the IDAT files in a directory. ```{r readExp} rgSet <- read.450k.exp("GSE68777/idat") rgSet pData(rgSet) head(sampleNames(rgSet)) ``` Now we have the data, but note that we have no pheno data. And the filenames are very unhelpful here. These names consists of a GEO identifier (the `GSM` part) followed by a standard IDAT naming convention with a 10 digit number which is an array identifier followed by an identifier of the form `R01C01`. This is because each array actually allows for the hybridization of 12 samples in a 6x2 arrangement. The `5958091020_R01C0` means row 1 and column 1 on chip `5958091020`. This is all good, but does not help us understand which samples are cases and which are controls. We now get the standard GEO representation to get the phenotype data stored in GEO. Most of the columns in this phenotype data are irrelevant (contains data such as the address of the person who submitted the data); we keep the useful ones. Then we clean it. ```{r geoPheno} geoMat <- getGEO("GSE68777") pD.all <- pData(geoMat[[1]]) pD <- pD.all[, c("title", "geo_accession", "characteristics_ch1.1", "characteristics_ch1.2")] head(pD) names(pD)[c(3,4)] <- c("group", "sex") pD$group <- sub("^diagnosis: ", "", pD$group) pD$sex <- sub("^Sex: ", "", pD$sex) ``` We now need to merge this pheno data into the methylation data. To do so, we need a common sample identifier and we make sure we re-order the phenotype data in the same order as the methylation data. Finally we put the phenotype data inside the methylation data. ```{r merge} sampleNames(rgSet) <- sub(".*_5", "5", sampleNames(rgSet)) rownames(pD) <- pD$title pD <- pD[sampleNames(rgSet),] pData(rgSet) <- pD rgSet ``` ## Preprocessing The `rgSet` object is a class called `RGChannelSet` which represents two color data with a green and a red channel, very similar to an `ExpressionSet`. The first step is usually to preprocess the data, using a number of functions including - `preprocessRaw()` : do nothing. - `preprocessIllumina()` : use Illumina's standard processing choices. - `preprocessQuantile()` : use a version of quantile normalization adapted to methylation arrays. - `preprocessNoob()` : use the NOOB background correction method. - `preprocessSWAN()` : use the SWAN method. - `preprocessFunnorm()` : use functional normalization. These functions output different types of objects. The class hierarchy in minfi is as follows: data can be stored in an Methylation and Unmethylation channel or in a percent methylation (called Beta) channel. For the first case we have the class `MethylSet`, for the second case we have the class `RatioSet`. When you have methylation / unmethylation values you can still compute Beta values on the fly. You convert from a `MethylSet` to a `RatioSet` with `ratioConvert()`. In addition to these two classes, we have `GenomicMethylSet` and `GenomicRatioSet`. The `Genomic` indicates that the data has been associated with genomic coordinates using the `mapToGenome()` function. The starting point for most analyses ought to be a `GenomicRatioSet` class. If your preprocessing method of choice does not get you there, use `ratioConvert()` and `mapToGenome()` to go the last steps. Let us run `preprocessQuantile()` which arrives at a `GenomicRatioSet`: ```{r preprocess} grSet <- preprocessQuantile(rgSet) grSet ``` This is like a `SummarizedExperiment`; we can get the location of the CpGs by ```{r granges} granges(grSet) ``` The usual methylation measure is called "Beta" values; equal to percent methylation and defined as `Meth` divided by `Meth + Unmeth`. ```{r getBeta} getBeta(grSet)[1:3,1:3] ``` CpGs forms clusters known as "CpG Islands". Areas close to CpG Islands are known as CpG Shores, followed by CpG Shelfs and finally CpG Open Sea probes. An easy way to get at this is to use ```{r getIslandStatus} head(getIslandStatus(grSet)) ``` ## Differential Methylation Once the data has been normalized, one possibility is to identify differentially methylated CpGs by using `r Biocpkg("limma")` on the Beta values. Another possibility is to look for clusters of CpGs all changing in the same direction. One method for doing this is through the `bumphunter()` function which interfaces to the `r Biocpkg("bumphunter")` package. ## SessionInfo \scriptsize ```{r sessionInfo, echo=FALSE} sessionInfo() ``` ## References