Using RAM Disk on the queuing system to optimize file I/O

hpc-cluster-architecture

#1

Hi All,

Is anyone using a RAM disk (tmpfs) with your cluster queuing system to optimize file I/O?

Thanks,
Sarvani Chadalapaka
HPC Administrator
University of California Merced, Office of Information Technology
schadalapaka@ucmerced.edu | it.ucmerced.edu


#2

I’m a little confused by the wording of your question, but I’ll try to answer as best I can. Typically the queueing system itself does not do enough file I/O to need an optimized file system, so I doubt many sites have the queueing system itself writing to tmpfs. The user jobs that the queueing system runs may have many different file I/O patterns, though, and a tmpfs can be useful there. At our site we have a tmpfs filesystem mounted on all compute nodes as /dev/shm/ that is available for users to use in their jobs.
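For what it’s worth, here is a minimal sketch of how that kind of mount is commonly set up – the 50% cap and the /etc/fstab entry below are illustrative assumptions, not a description of any particular site’s config:

```bash
# Hypothetical /etc/fstab entry: a tmpfs at /dev/shm capped at 50% of RAM.
#   tmpfs  /dev/shm  tmpfs  defaults,size=50%  0  0

# The cap can also be adjusted on a running node without a reboot:
mount -o remount,size=50% /dev/shm
```

Anything users write there sits in memory (and can spill to swap) until the files are deleted or the node reboots, so some cleanup policy is still needed.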


#3

Let me rephrase for schadalapaka here,

We have a number of users running jobs that are heavy on file I/O. Some of them are aware of /dev/shm and (try to) use it properly, but it is annoying because they have to 1) learn about it and 2) clean up after themselves.

We saw that systemd mounts /run/user/$(id -u $USER) and tears it down for interactive sessions, though Slurm jobs do not seem to trigger it.

Does anyone have experience hooking something similar into a scheduler, so that users get easy access and discoverability while still making sure resources get cleaned up?
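For reference, a quick way to see the difference between a login session and a batch job – just a diagnostic sketch, assuming a fairly standard Slurm + systemd-logind setup:

```bash
# In an SSH/interactive session, pam_systemd/logind normally creates the
# per-user runtime directory:
echo "$XDG_RUNTIME_DIR"            # typically /run/user/<uid>
ls -ld "/run/user/$(id -u)"

# Inside a Slurm job the directory is often missing on the compute node,
# even when the variable was propagated from the submit host:
srun --pty bash -c 'echo "XDG_RUNTIME_DIR=${XDG_RUNTIME_DIR:-unset}"; ls -ld "/run/user/$(id -u)"'
```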


#4

Hmmm… If I’m understanding you, @mbussonnier, you’re looking for a simple way to make these accessible and to automate clean-up for reuse (1), and you’re stuck on the automated clean-up (2)?

I believe using the queue scheduler’s Prolog & Epilog options might work for that.
These are typically for set-up and tear-down that you want to keep separate from the job itself.
In most cases they can be set up at multiple ‘independent’ levels, including system (scheduler) and user/job, and they can also distinguish batch versus interactive jobs.

  • That should be enough to script out your specifics using the matching environment variables.

Since you mentioned Slurm (Grid Engine and PBS also have prolog/epilog configuration):
https://slurm.schedmd.com/prolog_epilog.html
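To make that concrete, here is a minimal sketch of a per-job scratch directory handled by a prolog/epilog pair. The script names, the /dev/shm location, and the assumption that SLURM_JOB_ID and SLURM_JOB_USER are available in the prolog environment are all illustrative – check the prolog_epilog page above for what your Slurm version actually provides:

```bash
#!/bin/bash
# prolog.sh (hypothetical) -- run by slurmd (normally as root) on the compute
# node before the job's tasks start: create a private scratch dir on tmpfs.
JOBDIR="/dev/shm/job_${SLURM_JOB_ID}"
mkdir -p "$JOBDIR"
chown "${SLURM_JOB_USER}" "$JOBDIR"
chmod 700 "$JOBDIR"
exit 0
```

```bash
#!/bin/bash
# epilog.sh (hypothetical) -- runs on each node when the job ends, so the
# scratch space is reclaimed even if the job crashed or forgot to clean up.
rm -rf "/dev/shm/job_${SLURM_JOB_ID}"
exit 0
```

These get wired up with the Prolog= and Epilog= settings in slurm.conf; a TaskProlog that prints a line like `export JOB_SCRATCH=/dev/shm/job_$SLURM_JOB_ID` can additionally expose the location in the user’s job environment, which helps with the discoverability part.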


A follow-up: some users will likely do things that fill the whole available space.
Has anyone had issues with this? How were they addressed?

I haven’t tried anything yet, but my current thinking is that a separate mount to ‘partition’ off a limited fraction of the memory seems plausible?
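One way to do that partitioning, sketched on the assumption that a root prolog/epilog is in place and that SLURM_JOB_ID / SLURM_JOB_UID are available there (the 8G cap is a made-up number – ideally it would be derived from the job’s memory request):

```bash
# Hypothetical prolog fragment: give the job its own size-capped tmpfs instead
# of sharing /dev/shm. Writes beyond the cap fail with ENOSPC rather than
# quietly eating the rest of the node's memory.
JOBTMP="/mnt/jobtmp_${SLURM_JOB_ID}"
mkdir -p "$JOBTMP"
mount -t tmpfs -o size=8G,mode=700,uid="${SLURM_JOB_UID}" tmpfs "$JOBTMP"

# ...and the matching epilog fragment:
umount "$JOBTMP" && rmdir "$JOBTMP"
```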


#5

Many thanks, I’ll have a look at that – relatively new to Slurm; I’ll see if I can integrate it to make the experience seamless and will try to report back. I’d still like to figure out how systemd does this, so as not to reinvent the wheel. I also encountered software this weekend that works in interactive sessions but not in batch jobs, because /run/user/$UID was not available in batch jobs while $XDG_RUNTIME_DIR was still set.

I believe that /dev/shm and /run/user/$UID are limited by default to 1/2 the available RAM, and at least /run/user/$UID is destroyed when a user no longer has a login session, so you are pretty much guaranteed that the resources will be cleaned up.


#6

It is configurable, but I’ve seen enough variation to take the position of ‘check the specifics for each setup’.

/dev/shm is often half of RAM where it exists, but not /run/user/$UID (at least where I’ve been).
For example, on a local workstation with a default install of CentOS 7, /run/user/$UID for a single user is only 10% of RAM (3.2 GB of 32 GB), and on some systems it doesn’t exist at all.
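The current limits are easy to read off any given node with standard tools, for example:

```bash
# Size and usage of the usual tmpfs mounts for the current user:
df -h /dev/shm "/run/user/$(id -u)"

# Or list every tmpfs mount with its configured size:
findmnt --df -t tmpfs
```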

More importantly, even assuming half of RAM by default, it’s a potential issue when allocating individual cores: every job sharing the node sees a /dev/shm sized at half the node’s total RAM, so the apparent space adds up to far more memory than actually exists, leading to users having/causing issues with thrashing – though user education seems to be the solution here.

Much less of an issue if you’re scheduling at the node level.

<But what do I know, I’m out of my league here…>


Systemd – IDK, it might be calling systemd-tmpfiles.
See systemd-tmpfiles(8) and tmpfiles.d(5).
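In case it helps, this is roughly what a tmpfiles.d rule looks like. The path and the 10-day cleanup age are made up for illustration; also note that on many systems /run/user/$UID is set up per login session by systemd-logind/pam_systemd rather than by a static tmpfiles.d rule, so it’s worth checking which mechanism actually applies on your nodes:

```bash
# Hypothetical tmpfiles.d rule (format: type path mode user group age):
# create /scratch/local at boot and let the periodic cleaner remove entries
# that have not been used for 10 days. See tmpfiles.d(5) for the full syntax.
cat > /etc/tmpfiles.d/scratch.conf <<'EOF'
d /scratch/local 1777 root root 10d
EOF

# Apply the rule immediately instead of waiting for the next boot/timer run:
systemd-tmpfiles --create /etc/tmpfiles.d/scratch.conf
```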