Statistical Analysis of Genetic and Gene Expression Data

20-24 March 2006, Pavia, Italy

http://www.unipv.it/molpage_training/training1/index.php

Practical workshop for March 20, 2006

Conducted by Ben Bolstad with assistance from Arief Gusnanto and Brian Tom.

Introduction

The aim of this session is for you to gain introductory experience with techniques in pre-processing and other techniques in low-level analysis of microarray gene expression data. Because the session for this workshop is fairly short you will note that some sections of these lab instructions are denoted for reference only. This means that the section can be safely skipped to complete the practical. However, they typically contain topics that build on material in the lab, or are useful if you wish to apply these techniques to your own datasets. In other words they are intended for your future use.

The first section of practical material will focus on pre-processing techniques as applied to two-color microarrays. Later the lab will explore data from Affymetrix GeneChip microarrays.

Although this lab makes extensive use of R and BioConductor you are not expected to have previous knowledge or expertise with these pieces of software. In general, step by step instructions are provided for you to follow. It is expected that you will work at this material in a self-paced manner.

Getting Started

Note you may be using a computer which already has the requisite software installed. If so you can safely skip this step and move to the next section.

Before you begin you will need to get the requisite software installed on your computer. You will need R-2.2.1 (available from www.r-project.org) and a number of BioConductor (www.bioconductor.org) packages. All the exercises can be done on a computer with 512 MB of RAM (or more, which is the recommended memory requirements), and most of them can be done with as little as 256 MB (which is the minimum required memory). If you are not an expert and are using the Microsoft Windows operating system, the recommended procedure is to use the supplied installer. If you consider yourself an expert, or are using the Linux operating system you should read the instructions in the Appendix.

The installer will install R-2.2.1 customized with all the BioConductor packages required for this lab, plus the required datasets. Note that you can safely use this installer even if you have another version of R installed on your computer provided you install this version to a separate directory. You can download the installer at http://bmbolstad.com/PaviaWorkshop/software/WorkshopSetup.exe.

Exploring, pre-processing and normalizing two-color microarray data

1. Details on the files required for analysis (For reference only)

After conducting a microarray experiment on one or more arrays printed with a particular library of probes, the arrays are scanned to produce TIFF images, one for each channel (Cy3 and Cy5). The TIFF images are processed using an image analysis program such as ArrayVision, ImaGene, GenePix, QuantArray or SPOT to acquire the red and green foreground and background intensities for each spot, along with other measurements. The spot intensities are then exported from the image analysis program into a series of text files. There should be one file for each array or, in the case of ImaGene, two files for each array.

To analyze microarray data, we require (i) a file which describes the probes, often a GenePix Array List (GAL) file, and (ii) the image analysis output files. In most cases it is also desirable to have a Targets File, describing which RNA sample was hybridized to each channel of each array. A further optional file is the Spot Types file (STF) which identifies special probes such as control spots.

The Targets File
The Targets File is normally in tab-delimited text format. It should contain a row for each microarray in your experiment. It should contain a FileName column, giving the file from image-analysis containing raw foreground and background intensities for each slide, a Cy3 column giving the RNA type reverse transcribed and labeled with Cy3 dye for that slide (e.g. Wild Type) and a Cy5 column giving the RNA type reverse transcribed and labeled with Cy5 dye for that slide. For ImaGene files, the FileName column is split into a FileNameCy3 column and a FileNameCy5. As well as the essential columns, you can have a Name column giving an alternative slide name to the default name, "Slide n", where n is the SlideNumber and you can have a Date column, listing the date of the hybridization. Additional columns are allowed, provided that the column names are unique. Targets Files can be created in excel or a text editor, and should be saved in Text (Tab Delimited) .txt format.

The Spot Types File
The Spot Types File (STF) is a tab-delimited text file which allows you to identify different types of spots from the gene list. The STF is typically used to distinguish control spots from those corresponding to genes of interest, to distinguish positive from negative controls, ratio from calibration controls and so on. In the first column of this file (named SpotType), names for each class of spot (eg gene, control) on the array should be specified. One or more other columns should have the same names as columns in the gene list file and should contain patterns or regular expressions sufficient to identify the spot-type. Asterisks are wildcards which can represent anything. Be careful to use upper or lower case as appropriate and don't insert any extra spaces. Any other columns are assumed to contain plotting parameters, such as colors (column name Color) or plotting characters (column name cex) to be associated with the different types of points. STF can be created in excel or a text editor, and should be saved in Text (Tab Delimited) .txt format. The STF uses simplified regular expressions to match patterns. For example, 'AA*' means any string starting with 'AA', '*AA' means any code ending with 'AA', 'AA' means exactly these two letters, '*AA*' means any string containing 'AA', 'AA.' means 'AA' followed by exactly one other character and 'AA\.' means exactly 'AA' followed by a period and no other characters. For those familiar with regular expressions, any other regular expressions are allowed but the codes ^ for beginning of string and $ for end of string should be excluded. Note that the patterns are matched sequentially from first to last, so more general patterns should be included first. For example, it is often a good idea to include a default spot-type as the first line in the STF with pattern '*' for all the pattern-matching columns and with default plotting parameters.

2. Swirl Experiment

Background. The experiment was carried out using zebrafish as a model organism to study the early development in vertebrates. Swirl is a point mutant in the BMP2 gene that affects the dorsal/ventral body axis. The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild-type zebrafish.

The hybridizations. Two sets of dye-swap experiments were performed making a total of four replicate hybridizations. Each of the arrays compares RNA from swirl fish with RNA from normal ("wild type") fish. The experimenters have prepared a tab-delimited targets file called "SwirlSample.txt" which describes the four hybridizations:

images/SwirlSampleInExcel.png

On slides 81 and 93, swirl RNA was labeled with green (Cy3) dye and wild type RNA was labeled with red (Cy5) dye. On slides 82 and 94, the labeling was the reversed.

Each of the four hybridized arrays was scanned on an Axon scanner to produce a TIFF image, which was then processed using the image analysis software SPOT. The data from the arrays are stored in the four output files listed under FileName.

The arrays. The microarrays used in this experiment were printed with 8448 probes (spots), including 768 control spots. The array printer uses a print head with a 4x4 arrangement of print-tips and so the microarrays are partitioned into a 4x4 grid of tip groups. Each grid consists of 22x24 spots that were printed with a single print-tip. The gene name associated with each spot is recorded in a GenePix Array List (GAL) file named "fish.gal".

For this example we assume that these files, along with a Targets file ("SwirlSample.txt") and STF ("SpotTypes.txt") are available in the same directory.

2.1 Reading the data using limmaGUI

Before loading limmaGUI, be advised that if you are a Windows user, it is best to run Rgui in Single Document Interface (SDI) mode, otherwise, Rgui often "steals" the focus from limmaGUI. This can be done by selecting "GUI preferences" from the "Edit" menu, selecting "SDI" saving preferences and restarting R. (Or alternatively, you can edit the file rw2011\etc\Rconsole.)

The limmaGUI library needs to be loaded by typing

 library(limmaGUI)

and selecting Yes. The main GUI screen should be displayed. If the GUI window is closed, the command

 limmaGUI()

will re-open it.

Reading the data

From the File menu, select "New".

images/limmaGUINewFileSelected.png

You will be asked to choose a working directory. Select the directory containing the Swirl dataset (if you used the installer by default this will be c:\microarrays\swirl) and click OK.

images/ChooseSwirlWorkingDir.png

Now you can open a GAL (GenePix Array List) file, an RNA Targets file (listing the hybridizations), and a Spot Types file.

images/OpenGALandTargetsandSpotTypesFiles.png

Clicking on the "Select GAL File" button gives the following dialog. Open "fish.gal".

images/OpeningFishGAL.png

Now click on the "Select Targets" file button to open the RNA Targets file.

images/OpeningSwirlTargets.png

66.

Finally, the Spot Types file, which for this experiment is called "SpotTypes.txt" and has the following format

images/SwirlSpotTypesInExcel.png

is selected.

images/OpeningSwirlSpotTypes.png

Once the GAL, Targets and Spot Types files have been selected, click OK.

images/GALTargetsandSpotTypesFilesSelected.png

Now select the type of image-processing file listed in the RNA Targets file ("Spot") and click OK.

images/ImageProcessingFileType.png

For Spot files, using background correction is highly recommended (choose Yes).

images/UseBackgroundCorrection.png

A number of background correction methods are available. Subtract will be used for this dataset.

images/BackgroundCorrectionMethod.png

For the Swirl data set, we will not use any spot quality weighting, so click No.

images/SpotQualityWeighting.png

When prompted for a name for this data set, type in "Swirl".

images/DataSetNameSwirl.png

Once the data set has been loaded, its name is displayed on top of the left status window. The status window shows that Red and Green background-corrected intensities (R and G) have been loaded, and that there is no spot-quality weighting. The data set name can be later modified with the "Data Set Name" button. The data set name is not the same as the file name, displayed in the title bar. For example, the same data set (Swirl) could be saved at two different stages of the analysis, e.g. SwirlArraysLoaded.lma and SwirlLinearModelComputed.lma.

images/limmaGUImainWindowSwirlArraysLoaded.png

Now we can check that the Targets have been read in correctly. From the RNA Targets menu, click on "RNA Targets" to display the information in a table. An Edit menu is provided to allow the user to copy the Targets table to the clipboard.

images/SwirlRNATargets.png

The Spot Types table can viewed from within limmaGUI in a similar manner. Unlike the RNA Targets table, the Spot Types table is actually editable within limmaGUI. You can change the default colors associated with each spot type and you can even create new spot types and save the table to a tab-delimited text file. The arrow keys can be used to select the active cell in the table, while holding down Control and using the arrow keys will move the cursor within the text in one cell. The Rows menu can be used to add or delete spot types.

images/SwirlTypesTable.png

The layout should also be checked by selecting the "Layout Parameters" menu item from the Layout menu. A dialog box with the number of rows and columns of blocks on an array and rows and columns of spots in each block will be displayed. If these values are appropriate, click OK.

images/limmaGUISwirlLayout.png

2.2. Diagnostic plots and normalization using limmaGUI

Once the data has been loaded, various diagnostic plots can be generated by choosing an appropriate option from the Plot menu.

images/ChoosePlotMenu.png

Image plots. It can be interesting to look at the variation of foreground and background values over an array. Consider an image plot of the red foreground for the first array, by selecting:

Plot > Image Array Plot > Choose a slide: "Slide81" > OK > Choose which variable to plot: "R" > OK > Plot title: "Image array plot of R for Slide 81" > OK.

An image plot should be displayed in another window. The top left of the array is on the bottom left of the plot, which represents a counter-clockwise rotation of 90 degrees. We can see a bright streak across the middle two grids of the 3rd row caused by a scratch or dust on the array. Spots which are affected by this artifact will have suspect M-values. The streak also shows up as brighter regions in the image plots for the background. Clicking on a particular spot on this image will bring up a window displaying its ID information retrieved from the GAL file. Other variables which may be plotted in this way include "Mraw" and "Araw", for un-normalized M and A values, "G" for green foreground and "Rb" and "Gb" for red and green background.

MA-plots. An MA-plot plots the log-ratio of R vs G against the overall intensity of each spot. The log-ratio is represented by the M-value, M = log2(R)-log2(G), and the overall intensity by the A-value, A = (log2(R)+log2(G))/2. To get an MA-plot of the un-normalized values for the first array, try the following:

Plot > M A Plot with lowess curves (for one slide) > Choose a slide: "Slide81" > OK > Normalization Within A Single Array: No > Lowess Curve(s) Options: "Print-Tip Group Lowess Curves" > OK > Plot title: "M A Plot for slide 81 with no normalization" > OK.

A different colored curve is displayed for each print-tip group. To see individual MA-plots for each of the print-tip groups on this array, with lowess curves, try the following:

Plot > Print-Tip Group M A Plot (for one slide) > Choose a slide: "Slide81" > OK.

You should be able to notice the points which make up the red streak (Note: this plot is not rotated). The affected spots are in grids 10 and 11, and have very large positive M values at high intensities.

Normalization. Several normalization options are available in the Normalization menu. By choosing the "Select Within-Array Normalization Method" item from the menu,

images/ChooseNormalizationMenu.png

a dialog box with various normalization options is displayed, as below.

images/ChooseNormalizationOption.png

Print-tip group loess normalization is the default method, and will be used for this data. Click OK.

Next choose the "Normalize / Update M and A" item from the Normalization menu

images/ChooseNormalizeUpdateMandA.png

Dialog boxes which ask whether you'd like to "Normalize Within Arrays" (choose Yes) and "Normalize Between Arrays" (choose No) should follow. Now check the status window to see that "Within-Array Normalized" M and A values are available. Try generating an MA plot of the normalized values, with the control spots highlighted (Hint: Plot > Color-Coded M A Plot (for one slide)).

2.3 Reading the data using command line functions (for reference only)

You may prefer to load your data using the command line functions available in limma. Ensure that the R working directory is set to the directory containing the Swirl files (using setwd(), or from the R Console in Windows, select File > Change dir... > Change working directory to: "<directory>"). To load the limma library, type

 library(limma)

at the R command prompt. Next read the targets file using the command:

 targets <- readTargets("SwirlSample.txt")

To read in the intensity data, the function read.maimages is used.

 RG <- read.maimages(targets$FileName, source="spot")

The default for SPOT output is that Rmean and Gmean columns of each file are used as foreground intensities and morphR and morphG are used as background intensities. The object RG is an RGList object which contains a foreground and background intensity for each of the red and green channels for every gene (spot) on every array. To see a summary of the contents of this object, simply type its name and press enter.

 RG

To read in the .gal file, infer the slide layout, and assign this information to the RGList object, use the commands

 RG$genes <- readGAL("fish.gal")

and

 RG$printer <- getLayout(RG$genes)

2.4 Diagnostic plots and normalization using command line functions (for reference only)

Image plots. Consider image plots of the red and green background for the first array:

 imageplot(log2(RG$Rb[,1]), RG$printer, low="white", high="red")
 imageplot(log2(RG$Gb[,1]), RG$printer, low="white", high="green")

MA-plots. The M and A values can be calculated using the function normalizeWithinArrays. The option method="none" calculates raw (un-normalized) M and A values.

 MA <- normalizeWithinArrays(RG, method="none")

or equivalently

 MA <- MA.RG(RG)

Note that 'subtraction' of the background from the foreground is the default background correction method used to construct these log-ratios. Other options are possible using the backgroundCorrect function on the RGList. For example, if you don't want to background correct the data, you would use the following

 RGnobg <- backgroundCorrect(RG, method="none")

and then proceed as before to calculate the M and A values, using RGnobg instead of RG. To plot the raw M and A values for the first array, use the following command

 plotMA(MA, array=1)

By incrementing the array argument (eg array=2), MA plots for other slides can be generated.

Now plot the individual MA-plots for each of the print-tip groups on this array, together with the loess curves which will be used for normalization:

 plotPrintTipLoess(MA)

Normalization. For print-tip loess normalization, use the command:

 MA <- normalizeWithinArrays(RG)

Print-tip loess is the default normalization method in normalizeWithinArrays, however other options are possible, and are specified by the method argument (for example use method="loess" for global intensity based loess normalization of each slide). To plot the normalized M versus A values by print-tip group, type:

 plotPrintTipLoess(MA)

3. Cell line comparison data (for reference only)

In this exercise we consider an experiment where two RNA sources are compared directly on 6 replicate arrays.

Background. This data is taken from a set of quality control hybridizations, which were performed to assess how well the spots were printed on the arrays, rather than a question of biological interest. The RNA compared was from two very different cell lines, which ensured many of the genes were differentially expressed at varying degrees. Data for each slide is available from two image analysis packages (GenePix and SPOT), and will be used to highlight some of the differences between the programs, particularly in terms of the background levels measured for each spot.

The hybridizations. Six microarrays were each hybridized with RNA from cell line A (Cy5) and cell line B (Cy3).

The arrays. A total of 10944 spots were printed on each array. They were arranged in 6x4 print-tip grids, each containing 19x24 rows and columns of spots. The Lucidea Universal ScoreCard controls (Samartzidou et al, 2001) were printed on the arrays, and spiked into the RNA mixes of each experiment. This control series is made up of artificial genes, some of which are added in equal quantities, or at 3 and 10 times as much in one sample compared to the other to give known fold-changes for particular spots.

For this example, you will need the GAL file called "genelist.gal" and the raw SPOT (.spot) and GenePix (.gpr) output files in the current working directory.(if you used the installer default you will find these in c:\microarrays\CellLine)

3.1 Defining Targets files (for reference only)

For the cell line comparison arrays, a targets file would look like

images/CellLineTargetsInExcelSpot.png

for the Spot data, and

images/CellLineTargetsInExcelGenePix.png

for the GenePix data. Files named "TargetsSpot.txt" and "TargetsGenePix.txt" are included in the zip file for the cell line data.

3.2 Defining a Spot Types file (for reference only)

An appropriate STF for these arrays which contain Lucidea Universal ScoreCard controls is:

images/CellLineSpotTypesInExcel.png

A STF named "SpotTypes.txt" is included in the zip file for the cell line data.

3.3 Loading the data using limma (for reference only)

To load the limma library, type

 library(limma)

To read in the STF and targets information, use the following commands

 SpotTypes <- readSpotTypes()

and

 TargetsSpot <- readTargets("TargetsSpot.txt")

The file names stored in the targets file can be used to read in the intensity information contained in each of the image analysis output files with the command

 RGspot <- read.maimages(TargetsSpot$FileName, source="spot")

The object RGspot is an RGList object which contains five components: R, G, Rb, Gb and targets. You can verify this by typing:

 names(RGspot)

The foreground intensities are stored in RGspot$R (Cy5) and RGspot$G (Cy3) and the backgrounds are stored in RGspot$Rb (Cy5) and RGspot$Gb (Cy3).

Other useful information can also be added to the RGList, such as the probe ID information

 RGspot$genes <- readGAL()

This information is read in from a file in the current working directory with an extension .gal. If no such file, or multiple files with this extension exist, an error message will be generated.

The SpotTypes information can now be used to determine the status of each spot (ie control or gene), using the command
 RGspot$genes$Status <- controlStatus(SpotTypes, RGspot$genes)

This information will be used later to highlight the control spots on an MA plot. The printer information is also needed if intensity based print-tip normalization is to be applied to the data. The easiest way to acquire this information is from the GAL file, ie

 RGspot$printer <- getLayout(RGspot$genes)

Now all of the necessary information is stored in the RGList object, RGspot. View a summary of the contents of RGspot by typing its name and pressing enter.

 RGspot

This process can be repeated for the GenePix data:

 TargetsGenePix <- readTargets("TargetsGenePix.txt")
 RGgpr <- read.maimages(TargetsGenePix$FileName, source="genepix")

 RGgpr$genes <- readGAL()
 RGgpr$printer <- getLayout(RGgpr$genes)
 RGgpr$genes$Status <- controlStatus(SpotTypes, RGgpr$genes)

The default for GenePix output is that the F635 Mean (Cy5) and F532 Mean (Cy3) columns of each .gpr file are used as foreground intensities and B635 Median (Cy5) and B532 Median (Cy3) are used as background intensities.

An alternative way of reading data from the image analysis output files without using the Targets file, is

 spotFiles <- dir(pattern="\\.spot$")
 RGspot <- read.maimages(spotFiles, source="spot")

for SPOT data and

 gprFiles <- dir(pattern="\\.gpr$")
 RGgpr <- read.maimages(gprFiles, source="genepix")

for GenePix data. In this case, the order of the files in the working directory (which depends on their names) will determine the order they are read in. A benefit of using a targets file is that the order is controlled by the user. The targets file can also be used later in the linear modeling analysis to set up the design matrix. Assigning the gene ID's, control status of the spots and the printer layout will need to be repeated as before.

The pattern argument of the dir function accepts a regular expression. A detailed understanding of regular expressions is certainly not required to perform basic microarray analysis. The double backslash (\\) tells R that the dot immediately after it should be treated literally as a dot rather than as a special character (as it normally would be in regular expressions). The dollar sign ($) tells R that not only must the file names contain the pattern ".spot" (or ".gpr"), but they must in fact end with ".spot" (or ".gpr").

3.4 Normalization using limma (for reference only)

Un-normalized M and A values for each spot can be constructed using the function MA.RG.

 MAspot <- MA.RG(RGspot)
 MAgpr <- MA.RG(RGgpr)

MA-plots. The MA-plot of the un-normalized values for the first array is obtained using:

 plotMA(MAspot, array=1)

 # Now open a new plot window (system-dependent)
 windows()  # For Windows users
 quartz()   # For Mac users
 x11()      # For Linux users

 plotMA(MAgpr, array=1)

Notice the automatic highlighting of the control spots specified in the STF. To do the same plot for other arrays, the array argument can be changed.

Normalization. For print-tip group normalization, try


 MAspot <- normalizeWithinArrays(RGspot)
 MAgpr <- normalizeWithinArrays(RGgpr)

To do MA plots, 6 to a page and save the output in .png format, the commands

 plotMA3by2(MAspot, prefix="MAspot")
 plotMA3by2(MAgpr, prefix="MAgpr")

can be used. Comparing the MA plots for a given slide, you should notice that the dynamic range of the A values is less for SPOT data compared to GenePix. There is also some fanning of the M values at lower intensities for the GenePix plots. This is a result of the different background estimates used in the two packages. The morphological background in SPOT tends to give a low, stable estimate of slide background, whereas the median of the background pixels used in GenePix is an over-estimate (see Yang et al (2002)), resulting in numerical instability for many of the low intensity log-ratios. Not background correcting GenePix data may be a good option in some instances, ie

 RGgprnobg <- backgroundCorrect(RGgpr, method="none")

then proceed as before to normalize the data, using RGgprnobg instead of RGgpr.

3.5 Diagnostic plots using arrayQuality (for reference only)

For some other diagnostic plots, try the following functions from the arrayQuality library.

 library(arrayQuality)
 controlCode <- as.matrix(read.table("controlCode.txt", header=TRUE,sep="\t"))
 rawdata <- gpQuality(dir(pattern="\\.gpr$"), compBoxplot=FALSE,
                         controlId="Name")

For each slide, a summary diagnostic plot "diagPlot.<slidename>.png" will be saved in the working directory. Each slide summary shows MA plots, image plots of M (before and after normalization) and A values, dot plots of the M and A values for each class of control spot (provided controlCode is set correctly), as well as single channel intensity plots. These plots allow the overall quality of a slide to be assessed visually. The arrayQuality package can also provide a more quantitative measurement of slide quality by comparing each slide to a set of reference slides, where available (not applicable for this example).

Exploring, pre-processing and normalizing Affymetrix GeneChip data

1. Introduction

This lab introduces the basic tools provided for the low-level analysis of Affymetrix data. By the end of this lab you should be able to perform basic inspection of Affymetrix datasets, process raw data to produce expression measures and assess the quality of the arrays.

To begin this lab it is necessary to load the packages and dataset that will be used. Loading the affyPLM package will also load the other packages required in this analysis. The jsmHyperdip package contains the dataset which we use in this lab. Use:

 library(affyPLM)
 library(jsmHyperdip)

to load the packages. To make sure we have access to the data we must attach it to the current R session. This can be done by typing:

 data(jsmHyperdip)

More specific details about this dataset can be found by reading the help page:
?jsmHyperdip

This data is stored in an AffyBatch object. The AffyBatch object is used for storing raw probe-level data. It encapsulates raw probe-intensities and related phenotypic data into a single object. Many R functions can be applied to AffyBatch objects and this lab explores a subset of these methods. Typing the name of the object will show some of the information stored in the object:

 jsmHyperdip

To see the phenotypic data in this AffyBatch type:

 pData(jsmHyperdip)

which in this case contains sample numbers and the original filenames (in this lab the 6 arrays are referred to using the first 6 letters of the alphabet) for this data.

The pm and mm functions provide access to raw probe intensities. Calling the pm function with only the name of the AffyBatch returns all the PM probe intensities for all 6 arrays. However, displaying all the PM values on the screen may take considerable time and is not typical useful for interactive analysis. Often it is more useful to look at the probe intensities for a particular probeset. For instance, to see the PM intensities for all probes in the probeset 101_at across all 6 arrays type:

 pm(jsmHyperdip,"101_at")

The PM probe intensities for a probeset can also be examined graphically. Specifically, for this same probeset use:

 par(mfrow=c(1,2))
 matplot(pm(jsmHyperdip,"101_at"),type="l",ylab="PM intensity",xlab="Probe Number")
 matplot(t(pm(jsmHyperdip,"101_at")),type="l",ylab="PM intensity",xlab="Array Number")

to see the probe-intensity behavior for the probeset across Probe Number and across Array Number.

The functions sampleNames and geneNames show the names of the arrays and the names of the probesets on the array. Typing:

 sampleNames(jsmHyperdip)
 geneNames(jsmHyperdip)[1:10]

shows some of this information for the jsmHyperdip data.

2.1 Loading data using ReadAffy (for reference only)

For this lab the dataset has already been supplied to you in an R library and you don't need to worry about reading it into R. However, in general you start with CEL files (raw intensity data) and read it into R using the ReadAffy function. Reading all the CEL files in the current directory and storing can be accomplished using:

 my.Data <- ReadAffy()

the current working directory is shown by getwd() and may be set using setwd() (on Windows you may also do this using the Change Dir ... option in the File menu). The filenames argument is supplied to the ReadAffy function to read specific files only. For example:

 my.Data <- ReadAffy(filenames=c("A.CEL","B.CEL","C.CEL"))

would read in 3 CEL files and store them in an AffyBatch called my.Data.

3. Exploratory Data Analysis

3.1 Raw Images

Before commencing more involved analysis it is advisable to examine the unprocessed data. A number of functions are provided for doing this in a variety of different ways. An examination of the raw images is the first step in the analysis. The image function can be used to examine log transformed PM probe intensities. To examine images for the first two chips in the dataset use:

 par(mfrow=c(1,2))
 image(jsmHyperdip[,1])
 image(jsmHyperdip[,2])

Examining these two images we see that array B has a clear artifact.

Because of great differences in magnitude between intensities examining images of the untransformed data is less informative. Examining chip B in this manner obscures the previously obvious artifact:

 par(mfrow=c(1,1))
 image(jsmHyperdip[,2],transfo=function(x){x})

3.2 Histogram and Boxplots

It is also often useful to graphically examine the distribution of probe intensities for each array in the dataset. The first method for doing this is to use the hist function. Typing:


 hist(jsmHyperdip,col=1:6,main="Histogram of log2 PM probe intensities")
 legend(13,1.5,sampleNames(jsmHyperdip),col=1:6,lty=1:5,lwd=2)

gives the histograms for the jsmHyperdip dataset. Most differences in position and spread are typically removed by normalization and do not indicate potential problems. However, histograms that have significantly different shapes are often of lesser quality. In this case we see that the histogram for array B has a shape that is distinctly different from the others. This is the array observed earlier to have a large artifact.

Boxplots can also be used to examine probe-intensity distributions. Applying the boxplot function to the jsmHyperdip dataset using:

 boxplot(jsmHyperdip)

shows differences in the spread and center across the arrays. As we shall see momentarily, most of the differences visible on these boxplots are reduced after normalization.

3.3 MA plots

On two channel microarray platforms MA plots are used to compare the two color channels on each arrays. Since there is only a single channel on an Affymetrix chip MA plots can not be used in the same way. Instead, MA plots are used to compare between chips. In this context M values are log fold changes and A values are average log intensities between two arrays. While it would be possible to look at MA plots for every possible pair of arrays, this will be an immense number of plots for any dataset consisting of more than arrays. To reduce the number of possible comparisons a probe-wise median array can be created and then each array compared to this pseudo-array. The MAplot function creates these plots. For the jsmHyperdip data type:

 par(mfrow=c(2,3))
 MAplot(jsmHyperdip)

3.4 Other MA plot options (for reference only)

You can also use MAplot to produce MAplots where each array is compared to a chosen array in the dataset. For instance, suppose we wanted to compare all the arrays with one particular array. We do this using the ref argument of MAplot in the following manner.
 MAplot(jsmHyperdip,ref=2)
And it could be further useful to look at these for only a subset of the arrays. These can be specified using the which argument.
 MAplot(jsmHyperdip,ref=2,which=4:6)
If you don't have a large number of arrays in your dataset it might be useful to look at all pairwise MA plots. This can be done using the pairs argument
MAplot(jsmHyperdip,pairs=TRUE)

4. Producing Expression Values

All the functions that produce expression values take an AffyBatch object and return an exprSet object. Expression values, phenotypic and other related information are stored in exprSet objects. Routines that produce expression values usually carry out a specific sequence of pre-processing steps: background correction, normalization and finally summarization.

4.1 RMA

A popular expression measure known as RMA is computed using the rma function.

 eset.rma <- rma(jsmHyperdip)

Typing the name of the exprSet at the command line will show some basic information about the stored expression values. The computed expression values stored in an exprSet object can be accessed using the exprs function. Using:

 eset.rma
 exprs(eset.rma)[1:10,]

shows some of this information for the RMA values computed for the jsmHyperdip dataset.

As we noted before, often normalization removes significant differences in position and spread between arrays. Using:

 par(mfrow=c(1,2))
 boxplot(jsmHyperdip,col="red")
 title("Raw PM probe intensities")
 boxplot(eset.rma,col="blue")
 title("RMA expression values")

shows this to be true for the RMA expression values we computed.

4.2 GCRMA

Another popular expression measure is known as GCRMA. It replaces the background correction step in RMA with a more sophisticated algorithm which utilizes probe sequence information. To compute GCRMA expression values for the jsmHyperdip data use:

 eset.gcrma <- gcrma(jsmHyperdip)

4.3 Other expression measures (for reference only)

While RMA and GCRMA are the expression measures most widely used by users of this software, they are certainly not the only possibilities. Another popular expression measure algorithm, proposed by Affymetrix, is called MAS 5.0. In the BioConductor software this is implemented in the mas5 function. For example to compute MAS 5.0 expression values for the jsmHyperdip dataset use:

 eset.mas <- mas5(jsmHyperdip)

Some users prefer to construct their own expression measure by combining their own sequence of pre-processing steps. Two functions expresso and threestep are available for this purpose. Of the two functions expresso provides slightly greater flexibility while threestep is faster and more memory efficient.

For example suppose you wanted expression values where you used the MAS 5.0 background correction, a quantile normalization step, subtract the Mismatch (in the manner implemented by MAS 5.0) and then averaged the PM probe intensities. You could do this using:

 eset1 <- expresso(jsmHyperdip,bgcorrect.method="mas",normalize.method="quantiles",
                      pmcorrect.method="mas",summary.method="avgdiff")

An expression measure consisting of the GCRMA background correction, a scaling normalization and then summarization by taking the log of the 2nd highest PM probe intensity could be computed using:

 eset2 <- threestep(jsmHyperdip,background.method="GCRMA",normalize.method="scaling",
                       summary.method="log.2nd.largest")

For further details on these functions the user is referred to the Vignettes named affy: Built-in Processing Methods and affyPLM: the threestep function. Both can be accessed by typing openVignette() or on the web at http://www.bioconductor.org/viglistingindex.html.

5. Assessing Dataset Quality

The probe-level modeling procedures provided by the affyPLM give useful tools for dataset quality assessment. The primary routine for this purpose is the function fitPLM which operates on AffyBatch object and returns a PLMset object. To fit the default model on our dataset using fitPLM you should type:

 Pset <- fitPLM(jsmHyperdip)

For quality assessment purposes we will use the standard errors, weights and residuals stored in the PLMset object. Raw access to these values are available using the se(), weights() and resids() functions. The quality assessment tools are based on visual representations of these quantities.

5.1 Images of weights and residuals

While images of the raw probe intensities make some artifacts clearly visible others are invisible. More useful are chip pseudo-images created using the weights and residuals stored in the PLMset object. These can be created using the image() function. By default the weights are imaged. To create images of all 6 chips in this dataset use:

 par(mfrow=c(2,3))
 image(Pset)

For the weights images, darker green means lower weight. The artifact we saw earlier is clearly visible on the chip pseudo-image for array B. Instead of using the weights, we could look at residuals by using the type="resids" argument. Residual images for the six chips in our data set are created by using:

 par(mfrow=c(2,3))
 image(Pset,type="resids")

Residuals images are colored so that high positive residuals are red, low negative residuals are blue and residuals close to 0 are white. Look closely at the chip image for array A (you may need to enlarge the graphic window), do you notice anything?

Using the which argument we can more closely examine the images for array A. Using type="pos.resids" gives only the positive residuals, type="neg.resids" gives the negative residuals and type="sign.resids" looks only at the sign of the residuals. These images for array A can be produced using:

 par(mfrow=c(2,2))
 image(Pset,which=1,type="resids")
 image(Pset,which=1,type="pos.resids")
 image(Pset,which=1,type="neg.resids")
 image(Pset,which=1,type="sign.resids")

The artifact identified on this array is not easily visible in the image of raw probe intensities.

5.2 NUSE: Normalized Unscaled Standard Errors

One method of deciding whether or not an array is problematics from a quality standpoint is NUSE. The goal of NUSE is to identify any arrays which have elevated standard errors relative to other arrays in the dataset. This is done by standardizing the SE across arrays to have median 1 for each probeset. Our graphical tool consists of boxplots of these quantities for each array. Type:

 par(mfrow=c(1,1))
 NUSE(Pset)
 title("NUSE for jsmHyperdip dataset")

to create the plot. You will notice that array B has a discordant boxplot. This indicates it is of poorer quality relative to the rest of the dataset (not surprising given the large artifact). While we also saw a recognizable artifact on array A, because it was of small size, it did not have large appreciable effect on the overall quality of the array.

Instead of visually examining these quantities suitable numerical summaries such as the median and IQR NUSE could be used. These can be found using:

 NUSE(Pset,type="stats")

5.3 RLE: Relative Log Expression

Another tool for making a decision about whether an array should be removed from subsequent analysis because of poor quality is RLE. These are the log-scale expression values relative to the median expression value computed on a probeset by probeset basis. As with NUSE, we examine these quantities using boxplots. For our data this is accomplished using:

 RLE(Pset)
 title("RLE for jsmHyperdip dataset")

As before, array B has a significantly different boxplot.

The median and IQR RLE may also be used in place of the boxplots. To do this use:

 RLE(Pset,type="stats")

6. Memory Usage (for reference only)

Probe-level analysis of Affymetrix data is very memory intensive. In general you will be able to handle more arrays with more memory. For this lab most activities should be successfully handled using 512MB of RAM.

Note that on Windows operating systems you have some control over the maximum memory available to R. Specifically, by default R will be restricted to the smaller of the amount of physical RAM and 1 GB. The function memory.limit() can be used to alter this limit when R is running. Command line arguments can be supplied to change these limits at launch time. For further details refer to the R for Windows FAQ at http://cran.r-project.org/bin/windows/base/rw-FAQ.html.

On Linux and other Unix based operating systems you are not required to set the maximum memory size.

Two additional functions, justRMA and justGCRMA are available for computing RMA and GCRMA expression measures respectively. These compute the expression measures directly from CEL files. This makes them much more memory efficient but you will not be able to carry out additional probe-level analysis.

Appendix: References

  • Gentleman R, Carey V, Huber W, Irizarry R, and Dudoit S. (Eds.) (2005) Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer.: A thorough discussion of the BioConductor software and its applications
  • Bolstad BM, Irizarry RA, Gautier L, and Wu Z.(2005) Preprocessing High-density Oligonucleotide Arrays. In Bioinformatics and Computation al Biology Solutions Using R and Bioconductor. Gentleman R, Carey V, Huber W, Irizarry R, and Dudoit S. (Eds.), Springer, 2005.: Discusses using BioConductor software to pre-process Affymetrix arrays
  • Bolstad BM, Collin F, Brettschneider J, Simpson K, Cope L, Irizarry RA, and Speed TP. (2005) Quality Assessment of Affymetrix GeneChip Data in Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Gentleman R, Carey V, Huber W, Irizarry R, and Dudoit S. (Eds.), Springer: Discussion of quality assessment for Affymetrix arrays
  • Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P., A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics, 19, 185 (2003). : Paper discussing normalization methods
  • Cope, LM, Irizarry, RA, Jaffee, H, Wu, Z, Speed, TP (2004) A Benchmark for Affymetrix GeneChip Expression Measures. Bioinformatics 20: 323-331.: Paper discussing how to compare different expression measures. See also the related website at http://affycomp.biostat.jhsph.edu/
  • Irizarry, RA, Bolstad BM, Collin, F, Cope, LM, Hobbs, B, and, Speed, TP (2003) Summaries of Affymetrix GeneChip Probe Level Data. Nucleic Acids Research. Vol. 31, No. 4 e15: Discusses specific comparisons between RMA, MAS 5.0 and dChip.
  • Wu, Z., Irizarry, R., Gentleman, R., Martinez Murillo, F. Spencer, F. A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of American Statistical Association 99, 909-917 (2004).: Discusses the GCRMA expression measure.
  • Wettenhall, J. M., and Smyth, G. K. (2004). limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 20, 3705-3706.
  • Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15.
  • Dudoit, S., Yang, Y. H., Callow, M. J., Speed, T. P. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12:111--139.
  • Samartzidou, H., Turner, L., Houts, T., Frome, M., Worley, J., and Albertsen, H. (2001) Lucidea Microarray ScoreCard: An integrated analysis tool for microarray experiments, Life Science News.

Appendix: Where to go for additional documentation and help

Appendix: Installing the software

Before installing R and the required R packages, you should ensure that you have at least 500 MB of free hard disk space. The installation of this software will not require that much space, but whenever installing large software bundles you should allow for some extra space because filling up your hard disk completely can lead to serious performance problems. If you have an existing R installation, you do not need to replace it if you don't want to. You can install the new version of R in a different directory.

Installing the pre-customized bundle for Microsoft Windows Operating systems

Installation should be straightforward as the series of images below demonstrate. The setup proceeds by first installing the necessary datafiles and then launching a separate installer to install the customized version of R. By default the data files are installed to c:\microarrays. If you have another version of R installed on your computer you may choose to install this version to another location, otherwise the default location is probably ok.
images/Setup1.png images/Setup2.png
images/Setup3.png images/Setup4.png
images/Setup5.png images/Setup6.png
images/Setup7.png images/Setup8.png
images/Setup9.png images/Setup10.png
images/Setup11.png images/Setup12.png
images/Setup13.png images/Setup14.png

Special note for users of Microsoft Windows Operating systems

Note that the customized version of R provided for this lab should be sufficient for running limmaGUI, although you might notice warning messages. This can be fixed by installing ActiveTcl from: http://www.ActiveState.com/ActiveTcl. You will need to do this if you choose to manually create your own installation rather than using the supplied installer.

Installing the required software by letting R access the internet (Manual Option)

First you will need to retrieve and install R from www.r-project.org. Pre-compiled versions are provided for Microsoft Windows operating systems and you can download source code for Linux and other Unix operating systems.

Once you have installed R. Launch it and at the R-prompt type:

source("http://www.bioconductor.org/biocLite.R")
biocLite()
this will download and install a core set of BioConductor packages. For this workshop there are some additional package groups we will need. Some can also be downloaded, with all the necessary pre-requisites, automatically from the BioConductor website. To do this type:
source("http://www.bioconductor.org/getBioC.R")
getBioC("limmaGUI")
getBioC("arrayQuality")
getBioC("tkrplot")
getBioC("hgu95av2cdf")
getBioC("hgu95av2probe")

You will also need the jsmHyperdip package which contains a small Affymetrix dataset. You must install this manually. For windows download jsmHyperdip_1.0.zip. For linux/unix operating systems jsmHyperdip_1.0.tar.gz.

Finally, you will need to download the required microarray data files. You should unzip these and place them in an accessible directory (the pre-customized bundle tries to install these in c:\microarray by default. Specifically, download swirl.zip and CellLine.zip.

Linux/Unix users only If you are using Linux or another Unix operating system you will need Tcl/Tk to use limmaGUI. The two Tcl/Tk extensions required by limmaGUI (Tktable and BWidget) can be installed in /usr/local/lib/ or /usr/lib/ or anywhere else within the R-Tcl/Tk search path. To see your current R-Tcl/Tk search path from within R, type:

library(tcltk)
tclvalue("auto_path")
To add a new directory to the R-Tcl/Tk search path, you can use:
library(tcltk)
addTclPath("/custom/TclTk/path")
To test whether the Tcl/Tk extensions Tktable and BWidget can be found within the R-Tcl/Tk search path, type:
library(tcltk)
tclRequire("Tktable")
tclRequire("BWidget")
These Tcl/Tk extensions can be downloaded from the following sites:
PackageSource
Tktable http://tktable.sourceforge.net/
BWidget http://tcllib.sourceforge.net/

Acknowledgements

Portions of the section on two-color microarray data were previously used by James Westenhall, Matt Ritchie and Gordon Smyth for previous workshops.