Ask.Cyberinfrastructure

How to determine if jobs are dying on their own or from the scheduler?

monitoring
sge
scheduler

#1

I have a set of jobs that die without an error message. How can I tell whether the cause is the job itself, the scheduler (I'm on an SGE cluster), or both?

If the cause is the job itself, how can I get more information for troubleshooting?


#2

This is a partial answer that covers only the SLURM job scheduler. When SLURM kills a job because it reached its time limit, an error message like the following appears at (or near) the very end of the output file (the stderr file, if you use the -e option):

slurmstepd: error: *** JOB 8841014 ON coreV2-22-017 CANCELLED AT 2019-03-08T11:30:03 DUE TO TIME LIMIT ***
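
If you have many output files, a quick way to find the ones that ended this way is to search for that message. The following is only a sketch: the function name find_timed_out_jobs is made up for illustration, and the pattern assumes the message has the same wording as the line above (it may differ between SLURM versions).

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): print the names of the files
# that contain SLURM's "DUE TO TIME LIMIT" cancellation message.
# Usage: find_timed_out_jobs FILE...
find_timed_out_jobs() {
    # -l prints only the names of matching files; -E enables extended regex.
    # The pattern assumes the message wording shown in this post.
    grep -lE 'slurmstepd: error: \*\*\* JOB [0-9]+ ON .* CANCELLED AT .* DUE TO TIME LIMIT \*\*\*' "$@"
}
```

For example, find_timed_out_jobs *.o* would list the timed-out jobs among output files named with the %x.o%j pattern.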

Here is an example job that will time out:

#!/bin/bash
# 20190308
# Test SLURM
# Demo for a job that will timeout

# For 1 task
#SBATCH -n 1

# Job name
#SBATCH -J Timeout
#SBATCH -t 00:01:00
#SBATCH -o %x.o%j
## Additional switches may need to be specified on your system

echo "Start date: $(date)"
echo
echo "Sleeping"
set -x
sleep 10m

The output is:

Start date: Fri Mar  8 11:28:49 EST 2019

Sleeping
+ sleep 10m
slurmstepd: error: *** JOB 8841014 ON coreV2-22-017 CANCELLED AT 2019-03-08T11:30:03 DUE TO TIME LIMIT ***
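
If the output file is missing or inconclusive, SLURM's accounting records may still say how the job ended. The following is a rough sketch, assuming sacct is available and job accounting is enabled on your cluster. The function names explain_state and job_cause are made up for illustration, and the mapping covers only a few common states.

```shell
#!/bin/bash
# Hypothetical helper: map a SLURM accounting state to a likely cause.
# The mapping is a simplification and covers only a few common states.
explain_state() {
    case "$1" in
        TIMEOUT)        echo "scheduler: time limit reached" ;;
        OUT_OF_MEMORY)  echo "scheduler: memory limit exceeded" ;;
        CANCELLED*)     echo "scheduler: cancelled by a user or administrator" ;;
        NODE_FAIL)      echo "scheduler: node failure" ;;
        FAILED)         echo "job: exited with a non-zero exit code" ;;
        COMPLETED)      echo "job: finished normally" ;;
        *)              echo "unknown: $1" ;;
    esac
}

# Hypothetical wrapper: look up the state of one job ID with sacct.
# -X limits output to the job allocation, -n drops the header,
# -P prints unpadded, pipe-delimited fields.
job_cause() {
    local state
    state=$(sacct -j "$1" -X -n -P --format=State | head -n 1)
    explain_state "$state"
}
```

For the job above, job_cause 8841014 would be expected to report the time limit, provided the accounting record is still available.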