This is a very interesting question! Over the years, I have heard some people advocate using conda environments to deal with various problems they encountered with R package installation. It would be interesting to hear how that approach helps, as I personally prefer to keep R and Python environments separate. Let me divide my answer into three problems/questions:
- Should R packages be installed into the user’s space or should they reside in the “central” location and be available to all users?
- What is the best way to resolve conflicts between the versions of various R packages?
- How to make the process of installing new R packages (and re-installing these packages into new R versions) less time-consuming?
I have been installing and maintaining R packages on our cluster for more than 10 years now, and here is the approach that works very well for us.
Let’s start with the first two questions. The first one obviously depends on many factors. On our cluster, users have only 10 GB in their home directory. R packages are usually tiny (compared to Python packages), so even a user who installs dozens of them easily fits into this limit. However, almost everyone who uses R wants packages like “tidyverse”. Users then start to mix, in the same directory, packages built under neighboring R versions, like R/4.1.2 and R/4.1.3, and in that case a version conflict is almost unavoidable. So we install many popular R packages into the central location (along with the appropriate R version) so that users do not need to do it themselves. Once a package is installed for a particular R version, we never update it (inside that R version); if a user needs a newer version, they will need to install it into their home environment. For the same reason, when we install a new Bioconductor package we always answer “none” to the question “Which packages would you like to update (all, some, none)?” (a scripted equivalent is sketched after this paragraph). If a user needs a package that is not published on CRAN or Bioconductor (and is therefore not as thoroughly vetted as if it were), that package is installed into the user’s space.

Using this approach, we have not had a single R package version conflict within our central R installations in the past 10 years (and we have installed over a thousand R packages for almost every R version). One problem this approach solves is that we do not need to “untangle” version problems for nearly as many users as we otherwise would. And in 99% of the cases when users come to us with a problem in their R packages, we recommend that they simply delete all of them, since we usually have them installed centrally already; if they then need to add one or two more, that is much easier.
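As a concrete illustration of the “none” policy, here is a minimal sketch of a non-interactive Bioconductor install that leaves everything already present in the library untouched (the package name “limma” is only a placeholder):

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## update = FALSE is the scripted equivalent of answering "none":
## nothing already installed for this R version gets upgraded.
BiocManager::install("limma", update = FALSE, ask = FALSE)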
Now to move to a more difficult question: how to make the installation process easier for more difficult packages (like rgdal). Yes, there will always be some packages that depend on other modules and cannot simply be installed with the install.packages() function. First of all, for many years now, whenever we install ANY software on our cluster we create a file (in the directory where we go through the installation process) with every single line we execute to install that software, so the next person who needs to install a new version spends minimal time on it. We do the same for R packages. We obviously do not document the packages that can simply be installed with the install.packages() function, but we thoroughly document those that need some extra steps.
When a new R version is installed, we collect the names of the packages installed for the previous R version and feed this list into the install.packages() function; the packages that require no additional effort are beautifully installed, and we only need to take care of the “tricky” ones. It turns out that there are only a handful of approaches one needs to know to get “difficult” packages installed, so when a new difficult package comes along, we simply go through our documentation and try those approaches to see which one works. We usually want to make sure that our users do not need to load any additional modules when they load R packages.
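For the straightforward part of that step, a minimal sketch in R (the old library path is only a hypothetical example; the real location depends on where the previous R version keeps its site library):

old_lib <- "/path/to/previous/R/site-library"             # hypothetical path to the old R version's package library
pkgs <- rownames(installed.packages(lib.loc = old_lib))   # names of everything installed there
base_pkgs <- rownames(installed.packages(priority = c("base", "recommended")))
install.packages(setdiff(pkgs, base_pkgs))                # clean installs succeed; the "tricky" ones fail and are handled by hand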
Here, for example, are our notes for installing the udunits2 package:
module purge
module load R/4.2.3
module load udunits/2.2.26
export UDUNITS2_INCLUDE=$SCC_UDUNITS_INCLUDE   # header directory set by the udunits module
export UDUNITS2_LIB=$SCC_UDUNITS_LIB           # library directory set by the udunits module
export MAKEFLAGS="LDFLAGS=-L$SCC_UDUNITS_LIB\ -Wl,-rpath=$SCC_UDUNITS_LIB"   # embed the library path as an rpath so the module is not needed at run time
wget https://cran.r-project.org/src/contrib/udunits2_0.13.2.1.tar.gz
R CMD INSTALL --configure-args="--with-udunits2-lib=$SCC_UDUNITS_LIB --with-udunits2-include=$SCC_UDUNITS_INCLUDE" udunits2_0.13.2.1.tar.gz
Once the package is installed this way, users do not have to worry about loading the udunits module (or remembering which version of it they need to load); they only need to load the R module.
And here are our notes for another tricky package, sf:
wget https://cran.r-project.org/src/contrib/sf_1.0-12.tar.gz
module purge
module load R/4.2.3
module load gdal/3.4.3
module load udunits/2.2.26
module load proj/8.2.1
module load geos/3.10.2
export MAKEFLAGS="LDFLAGS=-L$SCC_GDAL_DIR/lib-unified\ -L$SCC_GDAL_LIB\ -L$SCC_UDUNITS_LIB\ -W`Preformatted text`l,-rpath=$SCC_GDAL_LIB\ -Wl,-rpath=$SCC_UDUNITS_LIB\ -Wl,-rpath=$SCC_GDAL_DIR/lib-unified\ -Wl,-rpath=$SCC_GDAL_LIB"
R CMD INSTALL sf_1.0-12.tar.gz
And one more, this time rgeos:
module load geos/3.10.2
module load R/4.2.1
export MAKEFLAGS="LDFLAGS=-L/share/pkg.7/geos/3.10.2/install/lib\ -Wl,-rpath=/share/pkg.7/geos/3.10.2/install/lib"
export PKG_CONFIG_PATH="/share/pkg.7/geos/3.10.2/install/lib/pkgconfig":$PKG_CONFIG_PATH
export LD_LIBRARY_PATH=/share/pkg.7/geos/3.10.2/install/lib:$LD_LIBRARY_PATH
wget https://cran.r-project.org/src/contrib/rgeos_0.5-9.tar.gz
R CMD INSTALL rgeos_0.5-9.tar.gz
As you can see, the general approach is very similar.