Ask.Cyberinfrastructure

How do I check that all mount points are in place, and mark a node out of service if a mount point is missing before a job is scheduled in Slurm?

slurm

#1

Before scheduling a job using Slurm, is there a way to check if the allocated compute node has the required filesystem mounted and accessible?

CURATOR: Raminder Singh


#2

ANSWER:

Your best bet is probably to enable node health checks. Slurm has a built-in facility for this, although you need to provide the actual check program yourself; there are some ready-made packages available for that.

Basically, in the Slurm configuration (slurm.conf) you can set the following (a minimal sketch follows the list):

  • HealthCheckProgram — to the path of a health check program to use

  • HealthCheckInterval — how often the health check should run on each node (in seconds)

  • HealthCheckNodeState – comma-separated list of node states (or ANY for any state) for which the checks should run.
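
For concreteness, a minimal slurm.conf sketch might look like the following. The script path and the 300-second interval are placeholders, not recommendations; substitute whatever fits your site:

  HealthCheckProgram=/usr/local/sbin/node-health-check.sh   # hypothetical path to your check script
  HealthCheckInterval=300                                    # run the check every 5 minutes
  HealthCheckNodeState=ANY                                   # run it on nodes in any state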

The HealthCheckProgram should run whatever checks you want and take the desired action if a check fails (typically placing the node in the DRAIN state with an explanation of why). Once a node is in the DRAIN state, no new jobs will be placed on it until an administrator fixes the problem and UNDRAINs it; any job already running on the node is allowed to complete. More drastic actions can be chosen if desired. A sketch of such a check script follows.
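
As a hand-rolled illustration (not the NHC package discussed in the next paragraph), a minimal HealthCheckProgram that checks a few mount points and drains the node might look roughly like this; the mount-point list is an assumption to replace with your own:

  #!/bin/bash
  # Sketch of a Slurm HealthCheckProgram that drains the node if a
  # required mount point is missing. Paths below are examples only.
  REQUIRED_MOUNTS="/home /scratch /opt/apps"

  for mp in $REQUIRED_MOUNTS; do
      # mountpoint(1) stats the path, so this can block on a hung NFS
      # mount -- see the discussion of Detached Mode in the next post.
      if ! mountpoint -q "$mp"; then
          scontrol update NodeName="$(hostname -s)" \
              State=DRAIN Reason="health check: $mp not mounted"
          exit 1
      fi
  done
  exit 0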

As stated, Slurm has built-in support for running node health checks, but you are responsible for providing the health check code. You do not have to write it from scratch, though: the Warewulf/LBNL Node Health Check (NHC) package is a reliable, flexible framework. You provide a config file listing the mount points (and whatever else) you want checked, and point HealthCheckProgram at the installed nhc script.
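
As an illustration, an NHC configuration fragment for mount-point checks might look something like this. The paths are placeholders, and you should confirm the exact check names and options against the NHC documentation for your version:

  # Example nhc.conf fragment -- hypothetical paths, adapt to your site.
  # Verify that the required filesystems are mounted read-write on every node.
  * || check_fs_mount_rw -f /
  * || check_fs_mount_rw -f /home
  * || check_fs_mount_rw -f /scratch

If one of these checks fails, NHC takes the node out of service (for Slurm, by draining it with an explanatory reason), which is exactly the behavior asked about in the question.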


#3

A useful thing to be aware of when applying the Warewulf/LBNL node health check package to this specific task ("check that all mount points are in place, and mark a node out of service if a mount point is missing") is how to handle filesystem checks that can hang.

Some filesystems (for example NFS mounts) can, and do, get stuck in a so-called uninterruptible sleep. When this happens, the process performing the health check hangs and will never report back.

The Warewulf/LBNL node health check package implements something called Detached Mode to handle this. This mode does not strictly satisfy the notion of checking immediately before each job is scheduled; instead it is used to check periodically via the Slurm

HealthCheckProgram=COMMANDNAME
HealthCheckInterval=TIMEINSECONDS

settings mentioned in the answer above.

If the health check fails, the node is marked so that it no longer accepts subsequent jobs. Jobs that arrive and are dispatched to a node between a Detached Mode check that passed and the next one that fails will still be launched. This is usually preferable to running a health check that might hang inside a Slurm prolog, where a stuck check would block job launch. A sketch of enabling detached mode follows.
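
If you use NHC, detached mode is enabled on the node side rather than in slurm.conf. A minimal sketch, assuming NHC reads its defaults from a sysconfig file and supports a DETACHED_MODE switch (verify both against the NHC documentation for your version):

  # /etc/sysconfig/nhc (assumed location)
  # Run NHC detached so a hung filesystem check cannot block the caller;
  # the detached run records its result for the next invocation to report.
  DETACHED_MODE=1

With this in place the HealthCheckProgram invocation returns quickly, and a node with a hung mount is drained on a subsequent health check cycle.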