Statistical Analysis of Low-level High Density Oligonucleotide Array Data
Benjamin M. Bolstad
University of California, Berkeley
Berkeley CA
USA
bolstad@stat.berkeley.edu
When running an experiment using the Affymetrix GeneChip(R) system, it is important to have
high-quality gene expression estimates. At the low level, we work with Perfect Match (PM) probe
intensity data. A GeneChip(R) carries multiple PM probes for each gene, and information from these
probes may be combined to compute an expression estimate for that gene.
I will discuss the three major steps used when computing the Robust Multi-chip Average (RMA)
expression measure: background correction, normalization and summarization.
Background correction is the process of removing noise, particularly in the low intensity
range. The goal of normalization is to remove unwanted non-biological sources of variation.
To normalize data from multiple arrays, RMA makes use of a simple non-parametric, non-linear
algorithm called quantile normalization. Summarization is the process by which information from
multiple probes is combined to compute an expression measure. In the context of RMA, this is
done by fitting a robust linear model to probe information from multiple chips.
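
The normalization and summarization steps described above can be illustrated with a minimal sketch. It is not the reference RMA implementation: the function names are hypothetical, ties are not given any special treatment, background correction is omitted, and median polish is assumed here as the robust fitting procedure for the probe-level linear model on log2-scale data.

```python
import numpy as np


def quantile_normalize(x):
    """Give every array (column) the same empirical distribution.

    x is a probes-by-arrays matrix of intensities. Ties are ranked in
    order of appearance, which is a simplification.
    """
    x = np.asarray(x, dtype=float)
    # Rank the probes within each array.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of each quantile across arrays.
    mean_quantiles = sorted_x.mean(axis=1)
    # Write the reference quantiles back at the original probe positions.
    out = np.empty_like(x)
    np.put_along_axis(out, order, mean_quantiles[:, None], axis=0)
    return out


def median_polish_summary(log_pm, n_iter=10, tol=1e-6):
    """Summarize one probe set by median polish (assumed robust fit).

    log_pm is a probes-by-arrays matrix of log2 PM intensities for a
    single probe set. Fits log_pm[i, j] ~ overall + probe[i] + chip[j]
    and returns overall + chip[j], one expression value per array.
    """
    resid = np.array(log_pm, dtype=float)
    overall = 0.0
    probe = np.zeros(resid.shape[0])
    chip = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # Sweep out probe (row) medians.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe += row_med
        delta = np.median(chip)
        chip -= delta
        overall += delta
        # Sweep out chip (column) medians.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        chip += col_med
        delta = np.median(probe)
        probe -= delta
        overall += delta
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall + chip
```

In this sketch, quantile normalization would be applied to the full probes-by-arrays matrix first, and the summary function would then be called once per probe set on the log2 of the normalized PM values.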
I will then compare RMA with other established expression measures using two key statistical
criteria: bias and variance.