How do you facilitate installation of R packages on your HPC/Cluster platform?

A substantial portion of support requests on our HPC system come from researchers wanting to install packages in R and running into errors. One of my facilitators is an expert at finessing these installs, but it is eating up a good amount of his time and we want to make the whole process more robust.

Currently we have R and RStudio installed as lmod modules. Most of our researchers interface with RStudio via an interactive application through Open OnDemand, which loads these modules as appropriate. Naturally, most researchers expect to be able to install packages from within RStudio, without errors. They are often coming from a laptop/desktop/workstation experience where they have sufficient privileges/control over the local system.

The errors reported often involve compiler incompatibilities and/or missing operating-system-level shared libraries. The shared libraries are often development libraries that require sudo privileges to install and are unavailable via, e.g., Anaconda. Even if they were available via Anaconda, the researcher would still need to remember to modify LD_LIBRARY_PATH, which is a pain point we’d like to do away with.

We’ve considered managing R and RStudio through bioconda, but from what we’ve seen that may be trading one collection of pain points for another.

How do you all approach and manage these pain points on your systems?

This is a very interesting question! Over the years, I have heard some people advocate using conda environments to deal with various problems they encountered with R package installation. It would be interesting to hear how that approach helps, as I personally prefer to keep R and Python environments separate. Let me divide my answer into 3 problems/questions:

  1. Should R packages be installed into the user’s space or should they reside in the “central” location and be available to all users?
  2. What is the best way to resolve conflicts between the versions of various R packages?
  3. How to make the process of installing new R packages (and re-installing these packages into new R versions) less time-consuming?

I have been installing and maintaining the R packages on our cluster for more than 10 years now and here is the approach that works very well for us.

Let’s start with the first two questions. The first one obviously depends on many factors. On our cluster, users have only 10 GB in their home directory. R packages are usually tiny (compared to Python packages), so they easily fit into this limit even if a user installs dozens of them. However, almost everyone who uses R wants packages like “tidyverse”. Users then start to mix, in the same directory, packages built under similar but distinct R versions, like R/4.1.2 and R/4.1.3, and in that case a version conflict is almost unavoidable.

So we install many popular R packages into the central location (alongside the appropriate R version) so that users do not need to. Once a package is installed for a particular R version, we never update it within that R version; if a user needs a newer version, they install it into their home environment. For the same reason, when we install a new Bioconductor package we always answer “none” to the question “Which packages would you like to update (all/some/none)?”. If a user needs a package that is not published on CRAN or Bioconductor (and as a result is not as thoroughly vetted), that package is installed into the user’s space.

Using this approach, we have not had a single R package version conflict within our central R installations in the past 10 years, and we have installed over a thousand R packages for almost every R version. One problem this approach solves is that we do not need to “untangle” version problems for nearly as many users as we otherwise would. In 99% of the cases when users come to us saying they have a problem with their R packages, we recommend that they delete all of them, since we usually have the packages already installed centrally. If they then need to add one or two more, it is much easier.
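For the user-space installs, one small piece of configuration helps avoid exactly that mixing problem (the path and layout here are my own suggestion, not part of our setup): pointing R_LIBS_USER at a per-version directory in ~/.Renviron, since R expands %v to the running R’s major.minor version.

```shell
# Suggested layout: one personal library per R major.minor version.
# R expands %v in R_LIBS_USER to the major.minor version of the running R,
# so packages built under R/4.1.x and R/4.2.x land in separate directories.
mkdir -p ~/R/library
echo 'R_LIBS_USER=~/R/library/%v' >> ~/.Renviron
```

With this in place, install.packages() falls back to the version-specific personal library whenever the central library is read-only.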

Now to move to a more difficult question: how to make the installation process easier for the trickier packages (like rgdal). There will always be some packages that depend on other modules and cannot simply be installed with the install.packages() function. First of all, for many years now, whenever we install ANY software on our cluster we create a file (in the directory where we go through the installation process) recording every single line we execute to install that software, so the next person who needs to install a new version spends minimal time. We do the same for R packages. We obviously do not document the packages that can simply be installed through install.packages(), but we thoroughly document those that need extra steps.

When a new R version is installed, we dump the names of the packages installed for the previous R version and feed this list into install.packages(); the packages that require no additional effort are beautifully installed, and we only need to take care of the “tricky” ones. It turns out there are only a handful of approaches one needs to know to get “difficult” packages installed, so when a new difficult package comes along, we go through our documentation and try those approaches to see which one works. We also aim to ensure that our users do not need to load any additional modules when they load R packages.
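That carry-forward step can be sketched roughly as follows (module names and versions are illustrative; this assumes both R versions are available as modules):

```shell
# Dump the package list from the previous R version...
module load R/4.1.2
Rscript -e 'writeLines(rownames(installed.packages()), "pkglist.txt")'
module unload R/4.1.2

# ...then bulk-install into the new version; only the packages that
# fail here need individual attention afterwards.
module load R/4.2.3
Rscript -e 'pkgs <- readLines("pkglist.txt");
            install.packages(setdiff(pkgs, rownames(installed.packages())),
                             repos = "https://cloud.r-project.org")'
```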

Here, for example, are our notes for installing the udunits2 R package:

module purge
module load R/4.2.3
module load udunits/2.2.26

export UDUNITS2_INCLUDE=$SCC_UDUNITS_INCLUDE
export UDUNITS2_LIB=$SCC_UDUNITS_LIB
export MAKEFLAGS="LDFLAGS=-L$SCC_UDUNITS_LIB\ -Wl,-rpath=$SCC_UDUNITS_LIB"

wget https://cran.r-project.org/src/contrib/udunits2_0.13.2.1.tar.gz

R CMD INSTALL --configure-args="--with-udunits2-lib=$SCC_UDUNITS_LIB --with-udunits2-include=$SCC_UDUNITS_INCLUDE"  udunits2_0.13.2.1.tar.gz

Once the package is installed this way, users do not have to worry about loading the udunits module (or remembering which version of it to load); they only need to load the R module.
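A quick way to confirm the rpath actually got baked in (a sketch; the exact shared-object name inside the package’s libs/ directory may differ):

```shell
# Inspect the compiled shared object of the freshly installed package;
# an RPATH/RUNPATH entry pointing at the udunits install means users
# will not need the udunits module at runtime.
module load R/4.2.3
SO=$(Rscript -e 'cat(file.path(find.package("udunits2"), "libs", "udunits2.so"))')
readelf -d "$SO" | grep -iE 'rpath|runpath'
ldd "$SO" | grep -i udunits
```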

Or here are our notes for another tricky package - sf:

wget https://cran.r-project.org/src/contrib/sf_1.0-12.tar.gz
module purge
module load R/4.2.3
module load gdal/3.4.3
module load udunits/2.2.26
module load proj/8.2.1
module load geos/3.10.2

export MAKEFLAGS="LDFLAGS=-L$SCC_GDAL_DIR/lib-unified\ -L$SCC_GDAL_LIB\ -L$SCC_UDUNITS_LIB\ -Wl,-rpath=$SCC_GDAL_LIB\ -Wl,-rpath=$SCC_UDUNITS_LIB\ -Wl,-rpath=$SCC_GDAL_DIR/lib-unified\ -Wl,-rpath=$SCC_GDAL_LIB"

R CMD INSTALL sf_1.0-12.tar.gz

And one more. This time rgeos:

module load geos/3.10.2
module load R/4.2.1
export MAKEFLAGS="LDFLAGS=-L/share/pkg.7/geos/3.10.2/install/lib\ -Wl,-rpath=/share/pkg.7/geos/3.10.2/install/lib"
export PKG_CONFIG_PATH="/share/pkg.7/geos/3.10.2/install/lib/pkgconfig":$PKG_CONFIG_PATH
export LD_LIBRARY_PATH=/share/pkg.7/geos/3.10.2/install/lib:$LD_LIBRARY_PATH
wget https://cran.r-project.org/src/contrib/rgeos_0.5-9.tar.gz
R CMD INSTALL rgeos_0.5-9.tar.gz

As you can see, the general approach is very similar.
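To spell out the pattern the three recipes share (the module name, variables, and tarball below are placeholders, not real software on any system):

```shell
# 1. Start clean and load R plus every system library the package links to.
module purge
module load R/4.2.3
module load somelib/1.0

# 2. Point the package's build at the library via linker flags, and bake
#    an rpath so the resulting .so finds the library without the module.
export MAKEFLAGS="LDFLAGS=-L$SOMELIB_LIB\ -Wl,-rpath=$SOMELIB_LIB"

# 3. Install from the source tarball (plus any --configure-args the
#    package's configure script understands).
wget https://cran.r-project.org/src/contrib/somepkg_1.0.tar.gz
R CMD INSTALL somepkg_1.0.tar.gz
```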


We have tried a few different ways of managing R and have settled on the “Really Big R Module” solution that seems to be favored by EasyBuild.

When I started with our group ~5 years ago I thought conda-R was going to be a great solution, but it failed to meet the R user base where it is. People like being able to run install.packages() or use devtools to build R packages, and the selection of packages available via conda was frequently lacking. This led to a mixture of conda-installed packages and those built via R itself, which often caused more headaches.

Currently, our R software module pulls in ~100 module dependencies (like GDAL and GEOS) and installs ~1100 R packages, covering the vast majority of what our researchers require. EasyBuild also provides “bundles” of R packages for things like Bioconductor.
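If you go the EasyBuild route, building such a module looks roughly like this (the easyconfig name is an example; check which ones your toolchain provides):

```shell
# --robot resolves and builds the whole dependency chain (GDAL, GEOS, ...)
# before building R itself and the bundled packages.
eb R-4.2.1-foss-2022a.eb --robot

# The resulting module then exposes everything at once:
module load R/4.2.1-foss-2022a
```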

Since our clusters are heterogeneous mixtures of node types, our most common issue used to be researchers compiling R packages on new hardware and then running them on older nodes. This led to lots of emails about “illegal instruction” errors. To fix it, we set the -march compiler option to match the oldest generation of hardware in a Makevars.site file inside the software module’s $R_HOME/etc:

CFLAGS = -O2 -ftree-vectorize -march=broadwell -fno-math-errno
CXXFLAGS = -O2 -ftree-vectorize -march=broadwell -fno-math-errno
CXX11FLAGS = -O2 -ftree-vectorize -march=broadwell -fno-math-errno
CXX14FLAGS = -O2 -ftree-vectorize -march=broadwell -fno-math-errno
CXX17FLAGS = -O2 -ftree-vectorize -march=broadwell -fno-math-errno
CXX20FLAGS = -O2 -ftree-vectorize -march=broadwell -fno-math-errno

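For reference, the precedence here (per Writing R Extensions): flags in the site file apply to every user’s package builds, but a user’s personal ~/.R/Makevars, if present, still overrides them. A quick sketch to inspect both (assumes the R module is loaded):

```shell
module load R/4.2.3

# Site-wide build flags applied to all users' package compilations:
cat "$(R RHOME)/etc/Makevars.site"

# A user's personal overrides, if any:
cat ~/.R/Makevars 2>/dev/null
```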
We also have RStudio modules which people can load and launch via Open OnDemand. In general, this seems to be working out pretty well for us.


Thank you for the detailed reply! This seems like a relatively streamlined solution. About how much time per month would you say you spend on maintaining R-related software? Do you have a regular release cadence for R versions, and/or a list of the shared packages?

Thank you for the reply! This is helpful to know about compilation and illegal instruction errors. The build process sounds daunting. Is it a big effort to maintain?

I usually install 2 versions of R each year: the x.y.1 release (I try to avoid installing the very first x.y.0 release, which comes around May, as it usually has more bugs) and then one of the later releases that comes around wintertime.

Since the installation process barely changes, I spend around an hour installing R itself and making fresh notes for that version, and then I start a script that installs all the packages we want to add (this usually runs overnight, as we install many of them). Finally, I spend 2-3 hours or less installing “special” packages like rgdal, since with each new R version I need to download new versions of these packages, and it would be tricky to write a script that does this automatically. I also make new notes for these packages.

Between these installations, we occasionally receive tickets where users report problems installing R packages. I would say 2-4 a month, but in most cases they are trying to install packages that are already installed, and we just tell them to use our installation. In rare cases where there is a new “problematic” package, we try one of our “recipes” and install it. It may take an hour, but usually 15-20 minutes at most. I would be happy to share my notes.