Ask.Cyberinfrastructure

"slurmstepd: error: Exceeded step memory limit at some point"

slurm
scheduler

#1

Sometimes when I run a SLURM job, I receive the following message in my stderr file: “slurmstepd: error: Exceeded step memory limit at some point”. What does this mean? Should I be worried about this? How do I avoid getting this error?

CURATOR: Scott Yockel


#2

ANSWER: This error indicates that at some point, your job (or a task in your job)
reached the limit of the memory that was allocated to it, and Slurm may have killed
your job as a result.

You can use the sacct command to find the maximum resident set size (MaxRSS)
for any task in your job; see "How can I use SLURM’s sacct command to show memory
usage statistics for a job that I am running?" for more details.
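
As a rough sketch of that lookup, the helper below reads sacct output from standard
input and prints the largest MaxRSS in MB. The function name peak_rss_mb and the job
ID 1234567 are hypothetical, and the sketch assumes sacct was run with -n (no header)
and -P (pipe-separated fields) so that each line looks like "JobID|MaxRSS".

```shell
#!/usr/bin/env bash
# Print the largest MaxRSS, in whole MB, from pipe-separated sacct output on stdin.
# Assumes two fields per line: JobID|MaxRSS (e.g. "1234567.batch|2048000K").
peak_rss_mb() {
  awk -F'|' '
    $2 != "" {
      value = $2 + 0                   # leading numeric part of the field
      unit = substr($2, length($2))    # trailing unit letter, if any
      if (unit == "K")      mb = value / 1024
      else if (unit == "M") mb = value
      else if (unit == "G") mb = value * 1024
      else                  mb = value / 1048576   # assumption: no suffix means bytes
      if (mb > max) max = mb
    }
    END { printf "%d\n", max }
  '
}

# Usage on a cluster (hypothetical job ID):
#   sacct -j 1234567 -n -P --format=JobID,MaxRSS | peak_rss_mb
```

Comparing this number with the memory you requested shows how close the job came
to its limit.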

In the simple case that your job just needs a bit more memory than you requested,
you can try increasing the amount of memory that you request. This is usually
specified in Slurm with either the --mem-per-cpu=MEMSIZE or the --mem=MEMSIZE
parameter to sbatch; the former sets the memory per allocated CPU core, the latter
sets the memory required per node. Usually it makes more sense to set the memory
per allocated CPU core. In either case, MEMSIZE is the amount of memory in MB
unless you append a unit suffix (K, M, G, or T).
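
For illustration, a minimal batch script using these options might look like the
following. The job name, task count, time limit, memory values, and the executable
./my_program are all placeholders, not recommendations for any particular cluster.

```shell
#!/bin/bash
# Hypothetical job name and resource values; adjust for your workload.
#SBATCH --job-name=mem-example
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
# Request 2048 MB for each allocated CPU core (about 8 GB in total here).
#SBATCH --mem-per-cpu=2048
# Alternative: request memory per node instead. Do not combine with --mem-per-cpu.
##SBATCH --mem=8192
#SBATCH --time=00:30:00

# Hypothetical executable standing in for your real program.
srun ./my_program
```

The same options can also be given on the command line, for example
sbatch --mem-per-cpu=2048 job.sh, where job.sh is a hypothetical script name.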

You should also estimate how much memory your code ought to need, to make sure
that you are not hitting a memory leak, which will consume however much memory
you throw at the job and can only be fixed in the code.
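
One way to make such an estimate is simple arithmetic on the data the program holds
in memory. The sketch below assumes a hypothetical job that keeps a number of dense
n x n matrices of 8-byte doubles in memory at once; the function name
estimate_mem_mb and the workload are illustrative only.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate for a hypothetical job that holds
# `count` dense n x n matrices of 8-byte doubles in memory at the same time.
estimate_mem_mb() {
  local n=$1 count=$2
  local bytes=$(( n * n * 8 * count ))
  # Round up to whole MB (1 MB = 1048576 bytes here).
  echo $(( (bytes + 1048575) / 1048576 ))
}

# Two 20000 x 20000 matrices: prints 6104 (MB), before any program overhead.
estimate_mem_mb 20000 2
```

If the measured MaxRSS stays near an estimate like this, requesting somewhat more
memory is reasonable. If it keeps growing well beyond the estimate no matter how
much you request, a leak is the more likely explanation.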


#3

AC COMMENT: The answer by payerle to this question does address the issue in detail. However, for the question itself to be StackExchange-ready, it needs further work in the following ways:

  1. First, the person asking the question should follow the guidance at: https://math.stackexchange.com/help/how-to-ask

“Have you thoroughly searched for an answer before asking your question? Sharing your research helps everyone. Tell us what you found and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and above all, it helps you get a more specific and relevant answer!”

Simply copy-pasting this question title into a Google search yields some good hits in the first few results: https://www.google.com/search?q=My+job+died+with+a+“slurmstepd%3A+error%3A+Exceeded+step+memory+limit+at+some+point”+please+help&oq=My+job+died+with+a+“slurmstepd%3A+error%3A+Exceeded+step+memory+limit+at+some+point”+please+help

In the StackExchange community, there are often several good answers to a question rather than a single one, depending on how specific the question is and whether it actually has a single solution. Given that a Google search reveals several different answers, another person could reference those to answer the question, in addition to what payerle has posted.