PLM timing simulations using preprocessCore

Written by Ben Bolstad
email bmb@bmbolstad.com

Background

From time to time a question comes up on the BioConductor mailing list about the performance of affyPLM::fitPLM. One such thread discusses the issue in a little more detail, and a message in that thread gives an R script you can download here. The key point is that the PLM fitting procedure for the default model, log2 signal = sample effect + probe effect + error, is optimized for a large number of samples and a small number of probes in each probeset.

Between release 2.0 and 2.1 of BioConductor I developed a new package, preprocessCore, that separates the core preprocessing routines from the code designed to deal with more application-specific data structures. Included in preprocessCore is the PLM fitting code for this model. A wrapper function, rcModelPLM, fits the model to an arbitrary matrix, with columns treated as arrays and rows as probes. Internally it uses the same function that fitPLM uses.
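As a rough illustration of the wrapper described above, the following R sketch simulates a matrix from the additive model and passes it to rcModelPLM. The matrix size, seed, and simulated effects are made up for illustration, and the call is guarded so the script still runs when preprocessCore is not installed. The layout of the Estimates vector (column effects followed by row effects) is assumed from the package documentation; check ?rcModelPLM for your installed version.

```r
# Simulate one probeset: 11 probes (rows) by 20 arrays (columns).
# All sizes and effect values here are made up for illustration.
set.seed(1)
n_probes <- 11
n_arrays <- 20
probe_effect <- rnorm(n_probes)
probe_effect <- probe_effect - mean(probe_effect)  # probe effects sum to zero
sample_effect <- rnorm(n_arrays, mean = 8)         # log2-scale expression
noise <- matrix(rnorm(n_probes * n_arrays, sd = 0.1), n_probes, n_arrays)
y <- outer(probe_effect, sample_effect, "+") + noise

fit <- NULL
if (requireNamespace("preprocessCore", quietly = TRUE)) {
  fit <- preprocessCore::rcModelPLM(y)
  # Assumed layout: column (array) effects first, then row (probe) effects.
  est_sample <- fit$Estimates[seq_len(n_arrays)]
  est_probe <- fit$Estimates[n_arrays + seq_len(n_probes)]
  print(round(cbind(true = sample_effect, estimated = est_sample), 2))
} else {
  message("preprocessCore is not installed; skipping the fit")
}
```

With noise this small, the estimated sample effects should track the simulated ones closely; the printed table is only a quick sanity check, not a benchmark.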

The aim of this page is to show more clearly how the fitting code actually performs as the number of rows (probes in a probeset) and the number of columns (arrays) are varied. Here is the simulation code used. The simulations were run using R-2.6.0-alpha and preprocessCore 0.99.20.

Simulation 1: Comparing performance when holding one dimension fixed and varying the other

The first simulation demonstrates that the fitting algorithm scales much better when the number of columns increases and the number of rows is held fixed than vice versa. In both cases the fixed dimension was set to 11.
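A comparison of this kind can be sketched in a few lines of R. This is not the simulation script linked above, only a minimal stand-in: the helper time_fit, the grid of sizes, and the use of random normal data are all assumptions. If preprocessCore is not installed, the sketch falls back to stats::medpolish so that it still runs, but median polish is a different algorithm and its timings say nothing about the PLM code.

```r
# Time a single fit on a random n_rows x n_cols matrix (elapsed seconds).
# time_fit is a hypothetical helper, not part of preprocessCore.
time_fit <- function(n_rows, n_cols, fit_fun) {
  y <- matrix(rnorm(n_rows * n_cols), n_rows, n_cols)
  unname(system.time(fit_fun(y))["elapsed"])
}

# Prefer rcModelPLM; otherwise use median polish purely as a stand-in.
fit_fun <- if (requireNamespace("preprocessCore", quietly = TRUE)) {
  preprocessCore::rcModelPLM
} else {
  function(y) stats::medpolish(y, trace.iter = FALSE)
}

fixed <- 11                     # size of the dimension held fixed
sizes <- c(11, 50, 100, 200)    # hypothetical sizes for the varied dimension

timings <- data.frame(
  size = sizes,
  vary_cols = sapply(sizes, function(n) time_fit(fixed, n, fit_fun)),
  vary_rows = sapply(sizes, function(n) time_fit(n, fixed, fit_fun))
)
print(timings)
```

Single timings at these sizes are noisy; repeating each fit several times and averaging gives a steadier picture.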

Simulation 2: Comparing performance when varying both dimensions

In the second simulation both the number of rows and the number of columns are varied. The first plot uses terrain colors to show the running time for each combination of rows and columns. Contours are added to show row and column combinations that have the same run time. The contours demonstrate that, for a given run time, you can process a much larger matrix when the column dimension is large and the row dimension is small than vice versa.
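A plot of this kind can be approximated with base R graphics. The sketch below is an assumption about how such a figure might be built, not the code behind the plot described here: the grid of sizes, the helper time_fit, and the axis labels are hypothetical, and stats::medpolish is used only as a stand-in when preprocessCore is unavailable.

```r
# Time a single fit on a random n_rows x n_cols matrix (elapsed seconds).
# time_fit is a hypothetical helper, not part of preprocessCore.
time_fit <- function(n_rows, n_cols, fit_fun) {
  y <- matrix(rnorm(n_rows * n_cols), n_rows, n_cols)
  unname(system.time(fit_fun(y))["elapsed"])
}

# Prefer rcModelPLM; otherwise use median polish purely as a stand-in.
fit_fun <- if (requireNamespace("preprocessCore", quietly = TRUE)) {
  preprocessCore::rcModelPLM
} else {
  function(y) stats::medpolish(y, trace.iter = FALSE)
}

sizes <- c(10, 25, 50, 100)   # hypothetical grid used for both dimensions

# elapsed[i, j] is the run time for sizes[i] rows and sizes[j] columns.
elapsed <- outer(
  sizes, sizes,
  Vectorize(function(n_r, n_c) time_fit(n_r, n_c, fit_fun))
)

# Terrain-colored image of run time with contours of equal run time on top.
image(sizes, sizes, elapsed, col = terrain.colors(20),
      xlab = "Rows (probes)", ylab = "Columns (arrays)")
contour(sizes, sizes, elapsed, add = TRUE)
```

On a grid this small the timings may be close to the timer resolution, in which case the contours carry little information; a larger grid is needed to see the asymmetry between rows and columns.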

A second plot shows performance when increasing the number of rows or the number of columns while holding the other dimension fixed. Each line, drawn in its own color, represents a particular value of the fixed dimension, with heat colors indicating increasing values of the fixed dimension.

Simulation 3: Pushing the algorithm to the limits of performance

While the PLM fitting code is efficient, it does have its limits. The third simulation shows the performance of the algorithm when the number of columns (i.e. the number of arrays) is increased as far as 10,000, while holding the number of rows fixed at 11 (a typical number of probes in a probeset on many, but certainly not all, arrays). As can be seen from this plot, there is a significant slowdown at 10,000 columns, even though the run time is very efficient up to 2000-3000 columns. Of course, exact performance will vary depending on your hardware configuration.