Ask.Cyberinfrastructure

Understanding SLURM accounting fields

As many of our HPC sites use SLURM, I wonder whether anybody has taken the time to write down the meaning of the accounting fields printed by SLURM’s sacct command. Its own manual page is very terse:

https://slurm.schedmd.com/sacct.html

It lists all the available fields but gives no explanation whatsoever. For example, what is the time unit of the “ElapsedRaw” field (apparently it is seconds)? What’s the difference between “JobID” and “JobIDRaw”? Things along these lines are what I am looking for. I would like to be able to analyze the accounting fields manually. While tools such as XDMOD exist to provide many aggregate quantities, I would like to be able to do a “deep drill” into the accounting data directly.
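One way to check a unit empirically is to convert the formatted “Elapsed” value to seconds and compare it with “ElapsedRaw” for the same job. The sketch below assumes the `[DD-[HH:]]MM:SS` layout that the sacct manual gives for Elapsed; the function name `elapsed_to_seconds` is made up for illustration.

```shell
# Convert a Slurm-style duration ([DD-[HH:]]MM:SS) to whole seconds.
# elapsed_to_seconds is a hypothetical helper, not part of Slurm.
elapsed_to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        # Split off the optional day count in front of the dash.
        i = index(rest, "-")
        if (i > 0) { days = substr(rest, 1, i - 1); rest = substr(rest, i + 1) }
        # What remains is either HH:MM:SS or MM:SS.
        n = split(rest, p, ":")
        if (n == 3) { h = p[1]; m = p[2]; s = p[3] }
        else        { h = 0;    m = p[1]; s = p[2] }
        printf "%d\n", days * 86400 + h * 3600 + m * 60 + int(s)
    }'
}

elapsed_to_seconds 1-02:03:04   # 1 day + 2 h + 3 min + 4 s = 93784
```

On a cluster, the result can be compared against the raw field with something like `sacct -j <jobid> -X -n -P --format=Elapsed,ElapsedRaw`. If the two numbers agree, ElapsedRaw is in seconds.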

hey @wirawan0 did you see this section?

https://slurm.schedmd.com/sacct.html#OPT_ALL

For example, if you look at “Elapsed” it shows the format of the time. And then for JobID vs JobIDRaw:

JobID
The number of the job or job step. It is in the form: job.jobstep.

JobIDRaw
In case of job array print the JobId instead of the ArrayJobId. For non job arrays the output is the JobId in the format job.jobstep.
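To make the difference concrete: for a job array, JobID is typically shown as `ArrayJobID_ArrayTaskID` (for example `1234_5`), while JobIDRaw shows the underlying job number, and either can carry a step suffix such as `.batch` or `.0`. Here is a rough sketch that splits such an identifier into its parts. The helper name `split_jobid` is hypothetical, and heterogeneous jobs (which use `+`) are not handled.

```shell
# Split a sacct JobID into "job|array_task|step".
# Empty fields mean that part is absent. The body runs in a subshell,
# so the variables do not leak into the caller.
split_jobid() (
    id="$1"
    step=""
    task=""
    # Everything after the first dot is the job step (batch, extern, 0, ...).
    case "$id" in
        *.*) step="${id#*.}"; id="${id%%.*}" ;;
    esac
    # An underscore separates the array job ID from the array task ID.
    case "$id" in
        *_*) task="${id#*_}"; id="${id%%_*}" ;;
    esac
    printf '%s|%s|%s\n' "$id" "$task" "$step"
)

split_jobid 1234_5.batch   # array task 5 of job 1234, batch step
split_jobid 1239.0         # plain job number (as in JobIDRaw), step 0
```

Running `sacct -j <jobid> --format=JobID,JobIDRaw` on an array job shows both forms side by side.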

Are you looking for better clarification than is provided? Here is a nice link to show how to use squeue and sacct for monitoring:

https://wiki.rc.hms.harvard.edu/display/O2/Using+Slurm+Basic#UsingSlurmBasic-MonitoringJobs

For example, you can issue this command (and put in a comma-separated list of the fields you want to include):

# get statistics on a completed job
# you can find all the fields you can specify with the --format parameter by running sacct -e
# you can specify the width of a field with % and a number, for example --format=JobID%15 for 15 characters
sacct -j <jobid> --format=JobId,AllocCPUs,State,ReqMem,MaxRSS,Elapsed,TimeLimit,CPUTime,ReqTres
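For the kind of manual analysis asked about above, it can be easier to work with pipe-delimited output (`-P`, i.e. `--parsable2`) and look fields up by header name rather than by position. The following is a minimal sketch, not an official recipe: `core_seconds` is a made-up helper that reads such output on stdin and prints each JobID with ElapsedRaw multiplied by AllocCPUS, which, going by the manual's description of CPUTime as elapsed time times CPU count, should match CPUTimeRAW.

```shell
# Read pipe-delimited sacct output (header line first) on stdin and print
# "JobID core-seconds", where core-seconds = ElapsedRaw * AllocCPUS.
# core_seconds is a hypothetical helper, not part of Slurm.
core_seconds() {
    awk -F'|' '
        NR == 1 {
            # Map each header name (upper-cased) to its column index.
            for (i = 1; i <= NF; i++) col[toupper($i)] = i
            next
        }
        {
            printf "%s %d\n", $col["JOBID"], $col["ELAPSEDRAW"] * $col["ALLOCCPUS"]
        }'
}

# Intended use on a cluster (not run here):
#   sacct -j <jobid> -P --format=JobID,AllocCPUS,ElapsedRaw | core_seconds
```

Because the columns are found by name, the order of fields in `--format` does not matter, and extra fields can be added without changing the script.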

It’s these labels that @wirawan0 is asking specifically about, and I’ve reached out to the SLURM team to get some updates on their sacct page. In the meantime, do others have any additional documentation on tricks for monitoring, or custom commands that have worked nicely in the past? Please share your thoughts on this thread!

heyo again @wirawan0! I wanted to let you know that I reached out to our friends at SLURM, and they did some work on that documentation, so things should be a bit clearer now. The changes are here: https://github.com/SchedMD/slurm/commit/1a563823285b13a87ac74e7982c1963997491d11