SLURM: how can I get more details about why a job is still pending execution?

pending-jobs
slurm
scheduler

#1

Is there a command/option you can run to determine the specifics of why a SLURM job is still pending execution, beyond the REASON code given by the squeue command (with default options)? E.g., what resources is it waiting on, and/or what currently running or pending jobs might be competing for those resources?

CURATOR: Jack Smith


#2

Here’s one example, using the scontrol command with grep to filter out the other 30ish lines:

scontrol -d show job <JOBID> | grep Reason

JobState=PENDING Reason=launch_failed_requeued_held Dependency=(null)

#4

I don’t think the standard Slurm commands will give you much beyond the REASON code.
And I believe that the REASON code is from the last time the job was examined in the normal scheduler run (I don’t think it gets updated by the backfill process). Depending on where the job is in the queue, the scontrol output may include a SchedNodeList field, which shows which nodes Slurm is thinking about using for this job (I believe this is available if REASON=Resources). Also note that the StartTime field may hold the estimated start time for the job. That’s about all I ever found really usable for jobs with REASON=Resources.
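As a rough sketch of how to pull just those fields out of the scontrol output, something like the following might help. job_fields is a made-up helper name, not a Slurm command, and it assumes the selected values contain no spaces. It splits the output into one Key=Value per line and keeps only the fields mentioned above.

```shell
# Hypothetical helper (not part of Slurm): reads `scontrol show job`
# output on stdin and prints only the fields discussed above.
# Assumes the selected values contain no spaces.
job_fields() {
    # Put each Key=Value pair on its own line, then keep the ones we want.
    tr -s ' ' '\n' | grep -E '^(JobState|Reason|StartTime|SchedNodeList)='
}

# Usage (replace <JOBID> with a real job ID):
#   scontrol show job <JOBID> | job_fields
#
# squeue can also print the estimated start time directly:
#   squeue --start -j <JOBID>
```

SchedNodeList may simply be absent from the output, in which case that line is not printed.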

REASON=Priority and the various held states are pretty self-explanatory. Stuff like QOSResourceLimit and AssocGrpCPUMinsLimit might take a little work to figure out which limit is being hit, but it’s usually not too bad to do so.
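For the limit-related reasons, one possible starting point is to list the configured limits with sacctmgr and compare them against what the job asks for. This is only a sketch: show_limits is a made-up helper name, the format fields chosen here are just a guess at the commonly useful ones, and your site may restrict what sacctmgr shows to regular users.

```shell
# Hypothetical helper (not part of Slurm): prints the limits that
# QOS* and Assoc* reason codes usually refer to.
show_limits() {
    # Limits attached to each QOS
    sacctmgr show qos format=Name,GrpTRES,MaxTRESPerUser,MaxJobsPerUser,MaxWall
    # Limits attached to the current user's associations
    sacctmgr show assoc user="$USER" format=Account,User,Partition,QOS,GrpTRES,GrpTRESMins
}

# Usage:
#   show_limits
```

The job's own QOS and account are listed in the scontrol show job output, which tells you which rows to look at.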

Other reasons, like “launch_failed_requeued” typically indicate something abnormal in the system, and the sysadmin should examine logs on the node the job ran on to see what is up.


#3

The most common Reason code is “Resources”, and if that is the case then a good place to look is your job’s priority. That can be queried with the sprio command, which should list all pending jobs with their priority number along with the priority factors used to calculate the overall number. You may want to restrict the output to the partition that your job was submitted to. Several factors may weigh into your job’s priority, depending on your site’s configuration. You can see those weights with “sprio -w”.
Another thing that you may want to look at is if this job was submitted with any Dependency.
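A sketch of what those checks could look like on the command line. The two function names are made up for illustration; the sprio and squeue options are standard ones, but check the man pages for your Slurm version.

```shell
# Hypothetical helpers; the names are not part of Slurm.

# Show the configured priority weights, then the per-factor breakdown
# for the current user's pending jobs.
my_priority() {
    sprio -w
    sprio -l -u "$USER"
}

# List pending jobs in one partition, highest priority first, with
# priority (%Q), reason (%r) and remaining dependencies (%E).
pending_in_partition() {
    squeue -p "$1" -t PENDING --sort=-p -o '%.12i %.10u %.10Q %.14r %E'
}

# Usage:
#   my_priority
#   pending_in_partition <PARTITION>
```

Jobs listed above yours in the second command are the ones most likely to be competing for the same resources, and a non-empty last column means the job is still waiting on a dependency.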