If a SLURM job fails or terminates unexpectedly, what mechanisms are available for making sure that temporary data, especially that on compute nodes, is cleaned up?
If you use node-local storage (such as $TMPDIR) for temporary data, it is automatically purged when the job exits. The location of $TMPDIR and any size limits differ between clusters, so check with your local HPC provider.
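As a sketch of that pattern (the file names, program, and #SBATCH options are hypothetical; the actual location and size limits of $TMPDIR depend on your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=tmpdir-demo
# Stage input to node-local scratch, work there, copy results back.
# $TMPDIR is purged automatically when the job exits, so the
# intermediate files need no explicit cleanup.

cp "$HOME/input.dat" "$TMPDIR/"        # hypothetical input file
cd "$TMPDIR"
my_program input.dat > output.dat      # hypothetical program
cp output.dat "$HOME/results/"         # copy results to persistent storage
```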
If there is other cleanup you need to do, you can use bash's error handling. The following is based on a response on Stack Overflow: interrupt handling - How to trap ERR when using 'set -e' in Bash - Stack Overflow
```bash
#!/bin/bash
set -eE  # same as: `set -o errexit -o errtrace`
trap 'cleanup' ERR

function cleanup(){
    echo "FAILED! Cleaning up..."
    # rm stuff ...
}

function func(){
    ls /root/
}

func
```
The call to func will fail (a non-root user cannot list /root/), which causes cleanup to be called. Note that any command failure that is not part of a conditional will trigger this cleanup function and make the whole script exit. See the trap builtin in the bash man page: bash(1): GNU Bourne-Again SHell - Linux man page
You can also trap EXIT if you want the handler to run when the script completes as well; an EXIT trap runs regardless of success or failure. If both ERR and EXIT are trapped, the ERR trap runs first and the EXIT trap second.
One thing I like to do is build my runtime inside a shell script and have SLURM execute that script. All of the essential elements within the called script (for example, an R script or a series of R scripts) pass their return codes back. When control returns to the SLURM script, I check the return code of the called script and then execute a set of post-processing scripts or instructions that take action based on that code. This post-processing step takes care of any housekeeping I'd like done.
This approach handles most 'normal' job failures, but it doesn't save you from yourself if the job terminates abnormally because it ran over its time limit or because the node it is running on fails.
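For the time-limit case specifically, sbatch's --signal option asks SLURM to send the job a warning signal some seconds before the limit, and you can trap it to run cleanup. A sketch (the program name and timings are illustrative, and this still does not help with a node failure):

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120   # send SIGUSR1 to the batch shell 120s before the time limit

function cleanup_on_timeout(){
    echo "Approaching time limit; cleaning up" >&2
    # rm stuff, copy partial results out of $TMPDIR, etc.
    exit 1
}
trap 'cleanup_on_timeout' USR1

# Run the work in the background and `wait`, so the trap can fire
# while the shell would otherwise be blocked on a foreground command.
my_program &                  # hypothetical program
wait
```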