How to prepare projects in R for archiving?



How can I prepare an R project for transfer and archiving?

I have several project directories that are between 100 and 700 gigabytes each; by far the largest files are .Rdata. How can I clean up and condense them for archiving without losing reproducibility?


For existing workspaces in .Rdata, I can think of 2 options:

If rerunning the script from the beginning with the same inputs produces the same files (check with a diff), the workspace probably matches the script, and you can archive the inputs and the script on their own instead of the workspace. If the outputs differ, something was probably changed interactively along the way.
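A byte-level diff of two .Rdata files can be awkward, so one alternative is to compare the objects themselves. This is only a sketch of that idea, not the only way to do the check: `compare_workspaces()` is a made-up helper name, and the file names in the usage comment are hypothetical.

```r
# Compare the objects stored in two workspace files.
# compare_workspaces() is a hypothetical helper, not part of base R.
compare_workspaces <- function(old_path, new_path) {
  old_env <- new.env()
  new_env <- new.env()
  load(old_path, envir = old_env)
  load(new_path, envir = new_env)

  old_names <- ls(old_env, all.names = TRUE)
  new_names <- ls(new_env, all.names = TRUE)
  shared <- intersect(old_names, new_names)

  # TRUE where an object is (nearly) equal in both workspaces
  same <- vapply(
    shared,
    function(nm) {
      isTRUE(all.equal(get(nm, envir = old_env), get(nm, envir = new_env)))
    },
    logical(1)
  )

  list(
    only_in_old = setdiff(old_names, new_names),
    only_in_new = setdiff(new_names, old_names),
    differing   = shared[!same]
  )
}

# Usage (hypothetical file names):
# compare_workspaces("archived.RData", "rerun.RData")
```

If all three elements come back empty, the rerun reproduced the saved workspace, at least up to the numeric tolerance that `all.equal()` allows. Note that both files are loaded into memory at once, so this may not be practical for the very largest workspaces.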

Separately, you can compress the workspace. It’s anecdotal, but I mostly hear that .xz gives great (relative) compression in a moderate amount of time.
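For a workspace file that already exists, something like the following might work. It relies on `tools::resaveRdaFiles()`, which ships with R; the wrapper name `recompress_rdata()` and the file name in the usage comment are just assumptions for illustration.

```r
# Re-save an existing workspace file in place with a different compression.
# recompress_rdata() is a hypothetical wrapper around tools::resaveRdaFiles().
recompress_rdata <- function(path, method = "xz") {
  size_before <- file.size(path)
  tools::resaveRdaFiles(path, compress = method)
  c(before = size_before, after = file.size(path))
}

# Usage (hypothetical file name):
# recompress_rdata("analysis.RData")
```

Keep in mind that the file has to be loaded into memory to be re-saved, so for workspaces in the hundreds of gigabytes this needs a machine with enough RAM.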

The advantage of a workspace is its editability, but like most good things this can also create challenges. Workspace files store everything in active memory, including unused information in data frames and variables that you reran under a new name but did not remove; this can lead to bulky workspace files. It also means that the operations that produced an object in memory might not be identical to the operations in the script, which raises questions about the reproducibility of the code.
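To see what is actually taking up the space before archiving, a rough sketch like this can help. `object_sizes()` is a made-up helper name, and the object names in the usage comments are hypothetical.

```r
# List every object in an environment by in-memory size (bytes), largest first.
# object_sizes() is a hypothetical helper, not part of base R.
object_sizes <- function(env = globalenv()) {
  nms <- ls(env, all.names = TRUE)
  sizes <- vapply(
    nms,
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sort(sizes, decreasing = TRUE)
}

# Usage (hypothetical object names):
# object_sizes()                        # inspect what is in the workspace
# rm(old_model_v1, tmp_df)              # drop leftovers you no longer need
# save(final_model, clean_data, file = "results.RData")  # keep named objects only
```

Saving only the objects you name with `save()`, rather than the whole session with `save.image()`, keeps stale copies out of the archived file.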

It is often preferable to write in chunks and rerun if you are developing in RStudio. This can be as simple as clicking the Source button after you’ve added each section (if you have time-consuming models, you can save the model matrix as a file and read it in to use it). Once the code is done, do a full run-through with Rscript --vanilla, and you can reasonably expect that other people doing the same will get the same results.
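One way to handle the time-consuming steps is a small caching helper, roughly like the sketch below. It works for any expensive result (a model matrix, a fitted model, and so on). `cached()` is a made-up name, and the file names in the usage comments are hypothetical.

```r
# Return the cached result if the file exists; otherwise compute and cache it.
# cached() is a hypothetical helper, not part of base R.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Usage (hypothetical file name; mtcars ships with R):
# fit <- cached("fit_mtcars.rds", function() lm(mpg ~ wt, data = mtcars))
```

For the final check, delete the cache files and run the whole script from a terminal in a clean session, for example `Rscript --vanilla analysis.R` (where `analysis.R` stands in for your own script).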


Added note: for compression, R by default uses gzip (at its default compression level of 6).
You will probably get the same or better results by adjusting the setting within R than by running the .Rdata file through a separate compression program, especially if it has already been compressed.
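As a small illustration of changing the setting within R, `save()` takes `compress` and `compression_level` arguments. The data here is made up purely so the example runs on its own; real results will depend on what is in your workspace.

```r
# Save the same object with R's default gzip and with xz, then compare sizes.
x <- rep(c("control", "treated"), 5e4)   # made-up, highly repetitive data

gz_file <- tempfile(fileext = ".RData")
xz_file <- tempfile(fileext = ".RData")

save(x, file = gz_file)                                          # default: gzip
save(x, file = xz_file, compress = "xz", compression_level = 9)  # xz, max level

# File sizes in bytes, gzip first, then xz
file.size(c(gz_file, xz_file))
```

`save.image()` accepts a `compress` argument as well, so the same choice is available when saving a whole session.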