Sometimes when I run a SLURM job, I receive the following message in my stderr file: “slurmstepd: error: Exceeded step memory limit at some point”. What does this mean? Should I be worried about this? How do I avoid getting this error?

ANSWER: This error indicates that at some point, your job (or a task in your job) used more
memory than was allocated to it, and Slurm killed your job.

You can use the sacct command to find the maximum resident set size (MaxRSS)
for any task in your job; see “How can I use SLURM’s sacct command to show memory usage statistics for a job that I am running?”
for more details.
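
As a rough sketch of that lookup, the snippet below wraps a plain sacct call and adds a small converter so that MaxRSS values can be compared with a request given in MB. The function names job_mem and rss_to_mb are hypothetical helpers, not part of Slurm, and the converter assumes sacct prints sizes with a K, M, G, or T suffix in 1024-based units.

```shell
#!/bin/bash
# Show per-step peak resident memory (MaxRSS) next to the requested memory.
# "job_mem" is a hypothetical wrapper; the sacct options and fields are standard.
job_mem() {
    sacct -j "$1" --format=JobID,JobName,MaxRSS,ReqMem,State
}

# Convert a size such as 2048K, 300M, or 1.5G to whole MB, rounded up.
# Assumption: the value ends in K, M, G, or T (1024-based units).
rss_to_mb() {
    awk -v v="$1" 'BEGIN {
        unit = toupper(substr(v, length(v)))
        num  = substr(v, 1, length(v) - 1) + 0
        if      (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else { print "unrecognized size: " v > "/dev/stderr"; exit 1 }
        r = int(mb)
        if (r < mb) r++        # round up to the next whole MB
        print r
    }'
}
```

For example, job_mem followed by a job ID prints one row per job step, and rss_to_mb applied to a MaxRSS value from that table gives a number that can be compared directly with the MEMSIZE you requested.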

In the simple case that your job just needs a bit more memory than you requested,
you can try increasing the amount of memory that you request. This is usually
specified in Slurm with either the --mem-per-cpu=MEMSIZE or --mem=MEMSIZE
parameter to sbatch; the former sets the memory per allocated CPU core, the latter
sets the memory required per node. Setting the memory per allocated CPU core
usually makes more sense. In either case, MEMSIZE is the amount of memory in MB
by default; a K, M, G, or T suffix selects a different unit.
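
A minimal batch script illustrating the per-core request might look like the sketch below. The job name, core count, memory value, and the commented-out program name are placeholders chosen for illustration; pick values that match your own job and your site's limits.

```shell
#!/bin/bash
#SBATCH --job-name=mem-example   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2048       # MB per allocated core: 4 cores x 2048 MB = 8192 MB
# Per-node alternative (do not combine with --mem-per-cpu); disabled by the extra '#':
##SBATCH --mem=8192

# Inside a job, Slurm exports the request in these variables. The fallback
# values only mirror the directives above so the script also runs outside Slurm.
mem_per_cpu="${SLURM_MEM_PER_CPU:-2048}"
cpus="${SLURM_CPUS_PER_TASK:-4}"
total_mb=$(( mem_per_cpu * cpus ))
echo "Memory available to this task: ${total_mb} MB"

# Replace with the real workload, for example:
# srun ./my_program              # hypothetical executable
```

The same options can also be given on the sbatch command line, where they take precedence over the #SBATCH lines in the script.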

You should also estimate how much memory your code is expected to need, to make
sure that you are not hitting a memory leak. A leak will consume however much
memory you give the job, and it can only be fixed in the code.


