Some FAQ about computing the RMA expression measure

Written by Ben Bolstad

What is RMA?

RMA is the Robust Multichip Average. It consists of three steps: a background adjustment, quantile normalization (see the Bolstad et al reference) and finally summarization. Some references (currently published) for the RMA methodology are:
Bolstad, B.M., Irizarry R. A., Astrand M., and Speed, T.P. (2003), A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2):185-193 Supplemental information
Rafael. A. Irizarry, Benjamin M. Bolstad, Francois Collin, Leslie M. Cope, Bridget Hobbs and Terence P. Speed (2003), Summaries of Affymetrix GeneChip probe level data Nucleic Acids Research 31(4):e15
Irizarry, RA, Hobbs, B, Collin, F, Beazer-Barclay, YD, Antonellis, KJ, Scherf, U, Speed, TP (2003) Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics .Vol. 4, Number 2: 249-264 [Abstract, PDF, PS, Complementary Color Figures-PDF, Software]

Which paper should I cite for RMA?

Ideally you will cite all three. However, if you wish to cite only a single paper then my recommendation is that you cite
Rafael. A. Irizarry, Benjamin M. Bolstad, Francois Collin, Leslie M. Cope, Bridget Hobbs and Terence P. Speed (2003), Summaries of Affymetrix GeneChip probe level data Nucleic Acids Research 31(4):e15
because it most accurately reflects the core creatorship of the RMA method (BB, RI, FC, TS) which neither of the other papers do completely. You should cite the normalization paper (Bolstad et al, 2003) if you are using quantile normalization apart from the RMA method. If you wish to use Irizarry et al, 2003 Biostatistics as your reference for RMA, I'd request you also cite (Bolstad et al, 2003).

You can download bibtex citation information for all three papers by clicking here.

How does RMA compare to other expression measures?

The Irizarry et al (2003) Nucleic Acids Research paper compares RMA to the dChip and MAS 5.0 algorithms.

Another place to look is the affycomp website which provides a framework and competition for comparing expression measures.

What software is available for computing RMA expression measures?

There are several freely available options for computing the RMA expression measure.
  • Bioconductor - A collection of packages in the R language for analysis and manipulation of biological data. The appropriate package for computing the RMA expression measure is the affy package.
  • RMAExpress - A standalone program for the Windows environment that focuses only on computing the RMA expression measure.
There are perhaps other free non-commericial software for doing so, but the two above are those that I personally support.

I want to use software X to compute RMA expression values, is this ok?

Your guess is as good as mine as to whether it does the right thing or not. Unless it is my own software I can not vouch for its peformance one way or the other. Note that in the past I have seen a lot of people call things RMA, which I would not necessarily call RMA. If you find for instance that the RMA expression values you get from software X differ from those you get from Y you should take the issue up with the provider of that software, not with the RMA authors.

How do I use the Bioconductor software to compute RMA expression measures?

  • First install R using the appropriate method for your operating system.
  • Launch R and then install Bioconductor. An easy way to do this is to source the getBioC.R script directly from the Bioconductor website. To do this from within your R session type source("http://www.bioconductor.org/getBioC.R") then press enter. To begin the installation process type getBioC("affy","release"). The appropriate software will be downloaded and installed for you. Visit the Bioconductor website for more information on the installation process.
  • Now load the appropriate software with library(affy)
  • Now to read all the CEL files in your current working directory (you can find this information by using the getwd() and setwd() commands) by using the ReadAffy() command. In particular the suggested method might be something like
    data <- ReadAffy()
    Eventually your data should read in. This might be a slow process.
  • Now your data is loaded it is time to compute the RMA expression measure. The way we do this is using the rma() command. In particular a command like:
    eset <- rma(data)
    Should be all that it takes to start the process.
  • When it is done the exprSet eset should contain the computed Robust Multichip average
  • You may write out the results to a text file using the write.exprs() command. In particular you would do
    write.exprs(eset,file="myresults.txt")
    another option is to use
    exprs2excel(eset) which will write the results to a file called tmp.csv.

I am using Bioconductor, can I use expresso or justRMA instead of rma?

Absolutely. If you prefer to use these functions you are free to do so. The function rma() should be considered the canonical implementation of RMA, but both expresso() and justRMA() should give you the same expression estimates (as of current writing which would refer to the 1.2.27 version of the affy package). If you use versions of the package prior to the Bioconductor 1.2 release the values might not agree.

To use the expresso() to compute RMA expression measures
eset <- expresso(data, bgcorrect.method="rma", normalize.method="quantiles", pmcorrect.method="pmonly", summary.method="medianpolish")

The other option is to use the justRMA() command. This command was contributed to the affy package by James MacDonald (jmacdon@med.umich.edu). In the underlying code it calls the same C routines used by rma() but it differs from the rma() and expresso() commands in that it does not require you to use ReadAffy() instead you specify the filenames of your CEL files in the call to the function and it produces an exprSet with the RMA expression measures. The easiest way to use justRMA() is to use
eset <- justRMA()

I really don't want to use R, can I get RMA expression measures another way?

If you use a machine with the Windows operating system you might want to consider using RMAExpress which is a stand alone program for computing RMA values. This program is open-sourced and can also be compiled on Linux/Unix systems.

I have a really huge dataset with hundreds (or thousands of arrays) that I want to process together. What should I do?

In this situation you are likely to really struggle with memory problems if you just try using rma(). As mentioned above justRMA() is another option which is more memory efficient but may still reach memory problems. A third option is to try justRMAlite() in AffyExtensions. If you want to go non R based, then is another option with lower memory overhead and ability to scale to large datasets.

Can I get standard error estimates since rma() does not seem to return any?

Yes. If you use expresso() then you will get ad-hoc estimates of the standard errors. Another option is to use the affyPLM package which provides other methods of robust linear model fitting (beyond the medianpolish for which there is no specific way of computing a standard error estimate) and standard error estimates.

Should I compute RMA expression measures separately for different treament groups?

We generally recommend that you compute your expression measure for all chips, irrespective of treatment (condition), together as one batch. This is because we have observed that non biological variation is not as easily reduced if chips are analysed in groups.

I have MGU74A and MGU74Av2 chips which I want to combine in an analysis. What should I do?

One option is to work with the two sets separately. Then you may some how combine the two datasets together.

The other option is to remove the probesets which are not on both versions of the array and work only with the common probesets. You can read more about that option here: Mixture CDFenv.

I am using affy/Bioconductor. How much memory do I need? What is the maximum number of arrays I can process to get RMA estimates?

The general rule is the more memory you can afford the better. I would recommed that if you want to do a moderate amount low level analysis you have at leat one GB of RAM.

The number of arrays that you can process (and how long it might take) depends on a number of things including

  • Processor Speed
  • Memory - RAM and Virtual(swap)
  • Disk speed
  • Operating System
  • version of R and bioconductor packages
  • the function used
  • the chip type
In an attempt to quantify the maximum number of arrays that may be processed I have run some simulations

Can you please explain the RMA background?

The traditional RMA background is to assume that the observed signal is the convolution of a Normal background (N) mean mu variance sigma^2 and exponential signal (S) with mean alpha. That is we observe O = S + N. Under these conditions the background corrected intensity values are given by E(S | O). You can see the exact formula in this document. You can find a derivation of this formula with greater discussion on pages 17-21 of this Dissertation.

I just want to run the background and/or normalization steps of the RMA method (no summarization). How might I do it?

If you are interested in applying the RMA preprocessing to just the PM probes of an affybatch Data. You might find the following code slightly more efficient than applying the standard operations bg.correct(),normalize() to an affybatch.

my.PM <- apply(pm(Data), 2, bg.adjust)
my.PM <- normalize.quantiles(my.PM)

Are there any papers where the RMA expression measure has been used?

Yes. There have been a number, a subsample is below. If you would to add your paper to this list please email me.
Barash, Y., Dehan, E., and Krupsky, M., Franklin, W., Geraci, M., Friedman, N., and Kaminski, N. (2004), Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays. Bioinformatics 20(6):839-46 Abstract
Poirot, Laurent., Benoist, Christophe., and Mathis, Diane. (2004) Natural killer cells distinguish innocuous and destructive forms of pancreatic islet autoimmunity. PNAS 101(21):8102-8107 Abstract
Schlecht, U., Demougin, P., Koch, R., Hermida, L., Wiederkehr, C., Descombes, P., Pineau, C., Jegou, B., and Primig, M. (2004), Expression Profiling of Mammalian Male Meiosis and Gametogenesis Identifies Novel Candidate Genes for Roles in the Regulation of Fertility. Mol. Biol. Cell 15(3): 1031-1043 Abstract
Abeyta, Michael J., Clark, Amander T., Rodriguez, Ryan T., Bodnar, Megan S., Pera, Renee A. Reijo, and Firpo, Meri T. (2004), Unique gene expression signatures of independently-derived human embryonic stem cell lines. Hum. Mol. Genet. 13(6):601-608 Abstract
Scott, Linda A., Vass, J. Keith., Parkinson, E. Kenneth, Gillespie, David A. F., Winnie, Joseph N., and Ozanne, Bradford W. (2004), Invasion of Normal Human Fibroblasts Induced by v-Fos Is Independent of Proliferation, Immortalization, and the Tumor Suppressors p16INK4a and p53. Mol. Cell. Biol. 24(4):1540-1559 Abstract
Tsuchiya, Tomoshi., Dhahbi, Joseph M., Cui, Xinping., Mote, Patricia L., Bartke, Andrzej., and Spindler, Stephen R. (2004), Additive regulation of hepatic gene expression by dwarfism and caloric restriction. Physiol. Genomics 17(3):307-315 Abstract
Parmigiani, Giovanni., Garrett-Mayer, Elizabeth S., Anbazhagan, Ramaswamy., and Gabrielson, Edward (2004), A Cross-Study Comparison of Gene Expression Studies for the Molecular Classification of Lung Cancer. Clin Cancer Res 10(9):2922-2927 Abstract
Li, Jun Z., Vawter, Marquis P., Walsh, David M., Tomita, Hiroaki., Evans, Simon J., Choudary, Prabhakara V., Lopez, Juan F., Avelar, Abigail., Shokoohi, Vida., Chung, Tisha., Mesarwi, Omar., Jones, Edward G., Watson, Stanley J., Akil, Huda., Bunney, William E., Jr and Myers, Richard M. (2004) Systematic changes in gene expression in postmortem human brains associated with tissue pH and terminal medical conditions. Hum. Mol. Genet. 13(6):609-616 Abstract
Barczak, A., Rodriguez, M. W., Hanspers, K., Koth, L. L., Tai, Y. C., Bolstad, B. M., Speed, T. P., and Erle, D. J. (2003), Spotted Long Oligonucleotide Arrays for Human Gene Expression Analysis Genome Research Abstract

Is it proper to say "I normalized my data using RMA"?

It is not really precise enough to say this. Normalization is one stage in the process of computing the RMA expression measure, but certainly not the only process. It is more correct to talk about having "computed RMA expression values" or that you used "RMA pre-processing". More details on this issue are addressed in this BioC mailing list posting.

All that said, the terminology normalization has become pretty synonmous with pre-processing in the microarray analysis context, and I am not the one to fight against this tidal wave of behavior, so you are free to make your own choiced.

How come the distribution of expression values across arrays is not identical? I thought quantile normalization was meant to ensure that, wasn't it?

Answering the second question first, yes indeed quantile normalization as it is implemented in the RMA procedure does indeed insure identical distributions on all the arrays normalized together. However, in RMA the quantile normalization step is carried out at the probe-level, rather than the probeset level. This means that the distribution of probe intensities is identical across arrays before the median polish summarization. After summarization wou will typically find that the distributions are close, but not identical. Really poor data quality arrays may have significantly different distributions after summarization.
DEFUNCT QUESTION

The RMA values I computed using Bioconductor version 1.1 differ from those I compute in version 1.2, how do I duplicate 1.1 results using 1.2?

If you are using the the affy package from the 1.2 release, you may duplicate results by using the parameter bgversion=1 in your rma() function call. The two versions differ only in the background correction step.