Which R package is more efficient to use for parallelization?

ktrn · April 13, 2018, 2:48pm

I found a few ways to parallelize my loop in R: mclapply() (from the parallel package) and parallel + foreach are the most popular in my group. Which approach is most efficient in an HPC environment when using a single node with a large number of cores?

CURATOR: Katia

jpessin1 · May 25, 2018, 9:38pm

Are you asking about “standard R” methods that work well in a typical (or specific) HPC environment? Or about taking advantage of the special setup, i.e. using other methods like MPI or MQs?

ktrn · May 31, 2018, 6:23pm

I am mostly interested in running R within an HPC environment, parallelizing over multiple cores on the same node. Is there any difference in performance (or memory usage) between mclapply() and a parallel loop using foreach()? I did not find any measurable difference, while some people suggest otherwise.

jpessin1 · June 12, 2018, 8:49pm

The answer to this may vary depending on local conditions (version, build tools, etc), sharing arrangement on the node and how the program is coded.

The best thing to do is to test it.
To get the timing information, use the shell’s time command.

I’d recommend picking a small test case for what you’re doing.

If you call the test scripts test1.R, test2.R, …:

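module load myFavoriteREnvironment
# bash's time reports on stderr, so group the command and redirect stderr
{ time Rscript test1.R > test1.out ; } 2> test1.time
{ time Rscript test2.R > test2.out ; } 2> test2.time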

Each .time file will have “real”, “user” and “sys” times.

“real” is probably what you want; it’s the user-experienced or wall clock time.

“user” is the CPU time spent running the program’s own code, summed over all processes. In a parallel run it can be larger than “real” because it adds up the time from every core.

“sys” is the CPU time the system (kernel) spends on the program’s behalf (reading or writing data, starting processes, etc.).
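
As a rough sketch of what such a small test case might look like, the script below times the same loop with mclapply() and with foreach from inside R using system.time(), whose printed result lists user, system and elapsed (wall clock) time. Everything specific here is an assumption for illustration: slow_task() is a hypothetical stand-in for the real loop body, the task count and core count are arbitrary, and doParallel is assumed as the foreach backend (the foreach part is skipped if those packages are not installed). The two halves could equally be split into test1.R and test2.R.

```r
library(parallel)

# Hypothetical CPU-bound task standing in for the real loop body.
slow_task <- function(i) {
  x <- 0
  for (k in seq_len(2e5)) x <- x + sqrt(k)
  x
}

n_tasks <- 64
# Assumption: 2 cores for the sketch; set this to the number of cores
# allocated to the job. On Windows mclapply() only supports mc.cores = 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Approach 1: mclapply() from the parallel package (forked workers).
t_mclapply <- system.time(
  res_mclapply <- mclapply(seq_len(n_tasks), slow_task, mc.cores = n_cores)
)
print(t_mclapply)

# Approach 2: foreach with a doParallel backend (assumed backend).
if (requireNamespace("foreach", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  library(foreach)
  library(doParallel)
  registerDoParallel(cores = n_cores)
  t_foreach <- system.time(
    res_foreach <- foreach(i = seq_len(n_tasks)) %dopar% slow_task(i)
  )
  stopImplicitCluster()
  print(t_foreach)
}
```

A single run of a small case is noisy, so it is probably worth repeating each timing a few times and using a task size close to the real workload before drawing conclusions.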