User-Initiated GPU Monitoring during 'Draining' State


I frequently encounter a situation where, after I submit a job to the HPC cluster, the job remains queued for an extended period. When I reached out to the system administrator, I was told the GPUs were in a ‘draining’ state.

I’d like to know about a tool for monitoring GPU state along with job status that works universally, or at least on most HPC systems. That way, as users, we can check independently instead of having to depend on system administrators for this information.

NVIDIA has a whole chapter on this draining state in the NVML API reference. One of the functions in that chapter can be used to check whether a GPU is in a draining state: nvmlReturn_t nvmlDeviceQueryDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t* currentState )
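As a rough sketch of how that function might be called, the C program below looks up the first GPU, fetches its PCI info (which is how nvmlDeviceQueryDrainState identifies the device), and prints the drain state. This assumes the NVML headers and library are installed (compile with -lnvidia-ml); note that on some systems drain-state queries may require elevated privileges, so check the NVML documentation for your driver version.

```c
#include <stdio.h>
#include <nvml.h>

int main(void) {
    /* Initialize NVML before any other NVML call. */
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    nvmlDevice_t dev;
    nvmlPciInfo_t pci;
    nvmlEnableState_t draining;

    /* GPU index 0 is used here purely as an example; loop over
       nvmlDeviceGetCount() to check every GPU on the node. */
    rc = nvmlDeviceGetHandleByIndex(0, &dev);
    if (rc == NVML_SUCCESS)
        rc = nvmlDeviceGetPciInfo(dev, &pci);
    if (rc == NVML_SUCCESS)
        rc = nvmlDeviceQueryDrainState(&pci, &draining);

    if (rc == NVML_SUCCESS)
        printf("GPU 0 draining: %s\n",
               draining == NVML_FEATURE_ENABLED ? "yes" : "no");
    else
        fprintf(stderr, "query failed: %s\n", nvmlErrorString(rc));

    nvmlShutdown();
    return rc == NVML_SUCCESS ? 0 : 1;
}
```

Keep in mind this only works when run on the node that hosts the GPUs, which on many clusters means you cannot run it from the login node against a compute node's GPUs.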

sinfo -a | grep -i drain ; echo '--------' ; squeue -a | grep "$USER" || echo "no jobs"
will print a list of any nodes in a draining (drng) or drained (drain) state, followed by a list of the invoking user’s jobs, whether pending (PD) or running (R). If the user has no jobs, it prints “no jobs” instead. This command works on HPC systems that use Slurm as their scheduler; other schedulers need a different command.