How many arrays can I RMA process?

Written by Ben Bolstad

Special Note

This page was originally compiled in mid-2003 (except for this section) and is now somewhat outdated. In the intervening time, 64-bit processors have become more common and dataset limits have increased. What is called just.rma2() below became just.rma(). In addition, RMAExpress as of version 0.4 can process extremely large datasets.

Simulation Outline

Using this script we simulate the process of reading in and RMA-processing an arbitrary number of arrays. The test.cel file was from an HG_U95Av2 array. We compared
  • rma()
  • expresso()
  • just.rma()
  • just.rma2()
For a fair comparison, the times reported for rma() and expresso() also include the time taken to read the data in with ReadAffy(). The read.affybatch2() function was used in place of read.affybatch(); it is a little faster and has a lower memory overhead.
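
The following is a rough sketch of how such a simulation can be timed, not the original script linked above: one CEL file is copied n times and each route is wrapped in system.time(). The file name test.cel and the helper simulate.arrays() are placeholders, and justRMA() is used here as the direct-from-CEL entry point provided by current releases of the affy package.

    library(affy)

    ## Build n "arrays" by copying one CEL file n times into a scratch
    ## directory ("test.cel" is a placeholder name).
    simulate.arrays <- function(n, cel = "test.cel") {
      dir <- tempfile("simcel")
      dir.create(dir)
      ok <- file.copy(rep(cel, n),
                      file.path(dir, paste0("array", seq_len(n), ".CEL")))
      stopifnot(all(ok))
      dir
    }

    cel.dir <- simulate.arrays(50)

    ## Time the AffyBatch route and the direct-from-CEL route separately.
    t.read <- system.time(abatch <- ReadAffy(celfile.path = cel.dir))
    t.rma  <- system.time(eset   <- rma(abatch))
    t.just <- system.time(eset2  <- justRMA(celfile.path = cel.dir))

    ## The rma() column in the tables below is the elapsed ReadAffy() + rma()
    ## time; the figure in parentheses is the rma() step on its own.
    t.read["elapsed"] + t.rma["elapsed"]
    t.just["elapsed"]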

Results

The times shown here are real-world (elapsed) times, in seconds, as reported by system.time(). The figures in parentheses are the times excluding the ReadAffy() step. A dash marks a run that could not be completed, either because R could not allocate the necessary memory or because the operating system killed the process (again, because the required memory could not be allocated).

System 1 represents a higher-end Linux machine. The raw results for system 1.

Results using system 1 (specifications below)
# arrays | ReadAffy() | rma()           | expresso()        | just.rma() | just.rma2()
5        | 3.51       | 20.05 (16.54)   | 98.35 (94.84)     | 11.48      | 11.86
10       | 6.31       | 25.73 (19.42)   | 120.94 (114.63)   | 21.01      | 19.9
15       | 8.91       | 34.35 (25.44)   | 158.14 (149.23)   | 30.19      | 27.94
20       | 11.35      | 43 (31.65)      | 187.48 (176.13)   | 39.57      | 36.09
25       | 13.91      | 52 (38.09)      | 231.07 (217.16)   | 48.46      | 43.97
30       | 16.51      | 60.43 (43.92)   | 271.19 (254.68)   | 61.97      | 52.58
35       | 19.02      | 69.2 (50.18)    | 351.11 (332.09)   | 71.69      | 60.55
40       | 21.6       | 78.08 (56.48)   | 398.56 (376.96)   | 85.79      | 69.69
50       | 31.79      | 112.34 (80.55)  | 817.12 (785.33)   | 114.77     | 85.52
60       | 32.91      | 123.27 (90.36)  | 1330.58 (1277.24) | 120.42     | 107.45
70       | 36.96      | 138.97 (102.01) | -                 | 137.6      | 122.69
80       | 50.67      | 166.62 (115.95) | -                 | 157.51     | 139.01
90       | 89.82      | 224.53 (134.71) | -                 | 175.64     | 154.8
100      | 99.06      | 285.96 (186.9)  | -                 | 207.25     | 173.85
125      | 193.77     | 460.14 (266.37) | -                 | 261.96     | 219.77
150      | -          | -               | -                 | 320.72     | 264.3
175      | -          | -               | -                 | 395.06     | 309.39
200      | -          | -               | -                 | 447.93     | 379.73
250      | -          | -               | -                 | 603.84     | 507.97
300      | -          | -               | -                 | 750.93     | 699.6
350      | -          | -               | -                 | 948.7      | 889.16
400      | -          | -               | -                 | 1218.95    | 1084.71
500      | -          | -               | -                 | 1891.04    | 1648.08

System 2 represents a more modest Linux machine. The raw results for system 2.

Results using system 2 (specifications below)
# arrays | ReadAffy() | rma()             | expresso()        | just.rma() | just.rma2()
5        | 5.94       | 33.04 (27.1)      | 160.28 (154.34)   | 19.88      | 20.54
10       | 10.71      | 44 (33.29)        | 211.48 (200.77)   | 36.85      | 34.89
15       | 15.21      | 59.2 (43.99)      | 404.67 (389.46)   | 53.79      | 50.14
20       | 19.77      | 75.62 (55.85)     | 647.09 (627.32)   | 72.61      | 64.96
25       | 24.11      | 90.99 (66.88)     | 1128.41 (1104.3)  | 105.72     | 80.52
30       | 27.89      | 112.33 (84.44)    | 1744.07 (1678.78) | 106.8      | 97.64
35       | 53         | 167.47 (114.47)   | 2682.55 (2574.9)  | 123.8      | 113.61
40       | 38.86      | 143.48 (104.62)   | -                 | 139.56     | 127.03
50       | 127.41     | 298.13 (170.72)   | -                 | 174.81     | 156.62
60       | 308.58     | 672.27 (363.69)   | -                 | 260.84     | 189.01
70       | 489.28     | 1247.48 (758.2)   | -                 | 278.75     | 223.81
80       | 606.88     | 1738.64 (1131.76) | -                 | 359.88     | 255.99
90       | 871.76     | 2256.15 (1384.39) | -                 | 393.73     | 287.93
100      | -          | -                 | -                 | 435.42     | 313.9
125      | -          | -                 | -                 | 649.41     | 470.33
150      | -          | -                 | -                 | 812.56     | 706.42
175      | -          | -                 | -                 | 976.76     | 845.67
200      | -          | -                 | -                 | 1435.81    | 957.63
250      | -          | -                 | -                 | -          | 2578.11
300      | -          | -                 | -                 | -          | 4526.76
350      | -          | -                 | -                 | -          | 11640.77
400      | -          | -                 | -                 | -          | -

System 3 represents a Windows machine. The raw results for system 3.

Results using system 3 (specifications below)
# arrays | ReadAffy() | rma()           | expresso()      | just.rma() | just.rma2()
5        | 5.39       | 23.56 (18.17)   | 117.72 (112.33) | 15.79      | 15.19
10       | 9.99       | 32.55 (22.56)   | 146.32 (136.33) | 28.91      | 26.24
15       | 14.32      | 44.3 (29.98)    | 217.4 (203.08)  | 51.81      | 38.09
20       | 41.12      | 99.78 (58.66)   | 343.6 (302.48)  | 82.12      | 49.97
25       | 23.02      | 69.92 (46.9)    | 370.41 (347.39) | 88.83      | 61.68
30       | 27.38      | 83.95 (56.57)   | 583.15 (555.77) | 168.94     | 73.13
35       | 31.78      | 94.08 (62.3)    | -               | 96.71      | 85.57
40       | 40.25      | 110.94 (70.69)  | -               | 109.75     | 95.42
50       | 45.15      | 130.58 (85.43)  | -               | 137.22     | 118.68
60       | 53.93      | 172.54 (118.61) | -               | 166.36     | 141.49
70       | 64.52      | 207.06 (142.54) | -               | 218.14     | 168.17
80       | 99.74      | 384.01 (284.27) | -               | 253.03     | 195.12
90       | 132.14     | 690.56 (558.42) | -               | 333.43     | 250.19
100      | 275.79     | 724.78 (448.99) | -               | 441.18     | 303.77
125      | -          | -               | -               | 528.84     | 601.93
150      | -          | -               | -               | 506.53     | 403.85
175      | -          | -               | -               | 594.5      | 537.17
200      | -          | -               | -               | 738.45     | 1055.84

Plots of running times for system 1, system 2 and system 3.

Comparing Windows to Linux on the same hardware: rma() and just.rma()

TO COME: a simulation with the HGU133A chip (more probes and probesets).

Discussion

Currently we would rank the methods, in terms of the number of chips you will be able to process, in this order: just.rma2, just.rma, rma, expresso.

There is some upward bias in the times due to the way the simulation was run. Because of the way the operating system moves memory pages in and out of swap, a routine can also suffer if a large amount of memory was allocated just before it ran. Generally, beyond about 100 arrays each iteration of the simulation was carried out in a fresh R session to reduce this problem (the remaining bias comes from wanting to avoid babysitting the machine too much). Where this bias was heavily apparent, the simulation was rerun for the individual method to get a more accurate estimate of the time.
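
One low-tech way to get a fresh session per run is to drive separate R processes from a controlling session. This is only a sketch of that idea, not the original setup; time_one_run.R is a hypothetical script that performs a single timing for the given number of arrays and appends its result to a file.

    for (n in c(100, 125, 150, 200, 250)) {
      ## each run starts with a clean heap, so memory allocated by an
      ## earlier iteration cannot inflate the measured time
      system(paste("Rscript time_one_run.R", n))
    }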

It is important to note that processes have a 3 GB memory limit on 32-bit x86 Linux machines. On Windows machines this is typically 2 GB.

read.affybatch2() suffers when the AffyBatch object is instantiated; the instantiation process seems to use much more memory than necessary. Neither rma() nor expresso() can be carried out without an AffyBatch object, which is why the rma() times stop at the point where the AffyBatch can no longer be created.
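
The contrast looks roughly like this. It is a sketch against the current affy API, with justRMA() standing in for the just.rma()/just.rma2() functions timed above and "celfiles" a placeholder directory.

    library(affy)

    ## rma() and expresso() both need the full AffyBatch in memory first;
    ## building it is the step that hits the memory ceiling.
    abatch   <- ReadAffy(celfile.path = "celfiles")
    eset.rma <- rma(abatch)
    eset.exp <- expresso(abatch,
                         bgcorrect.method = "rma",
                         normalize.method = "quantiles",
                         pmcorrect.method = "pmonly",
                         summary.method   = "medianpolish")

    ## The "just" route goes straight from the CEL files to expression
    ## values without ever holding a full AffyBatch.
    eset.just <- justRMA(celfile.path = "celfiles")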

On the Windows machine the memory limit was 1.6 GB (this was as far as I could raise it and still have a stable R session); see the R for Windows FAQ on this matter. The results for just.rma2 were affected by it running in sequence after just.rma (the CPU times show that just.rma2 is faster).
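
For reference, on 32-bit Windows builds of R from that era the limit could be inspected and, up to a point, raised along these lines. memory.limit() is Windows-only and has been removed in recent versions of R, so treat this as a historical sketch rather than current advice.

    memory.limit()             # current cap on the R heap, in MB
    memory.limit(size = 1600)  # ask for roughly 1.6 GB; may fail or be unstable
    ## the same cap can also be set at startup, e.g.  Rgui --max-mem-size=1600M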

Test machine specifications

System 1 specifications
Component        | Description
Processor        | AMD Athlon XP 2500+ (Barton)
RAM              | 1 GB
Swap             | 6 GB
Operating System | Red Hat Linux 9
Kernel           | 2.4.21-rc7-ac1
R                | 1.7.1
affy             | 1.3.6

System 2 specifications
Component        | Description
Processor        | Intel Pentium 4M 1.7 GHz
RAM              | 640 MB
Swap             | 1 GB
Operating System | Red Hat Linux 9
Kernel           | 2.4.20-18.9
R                | 1.7.1
affy             | 1.3.6

System 3 specifications
Component        | Description
Processor        | AMD Athlon XP 2500+ (Barton)
RAM              | 1 GB
Swap             | 6 GB
Operating System | Windows XP Pro
R                | 1.7.1
affy             | 1.3.6