Hi All,
Does anyone have any experience configuring SLURM to prevent CPU-only jobs from running on GPU nodes?
TIA
Delilah
We use partition QOSes on all our partitions. For the gpu partition we set the MinTRESPerJob limit to require every job to request at least 1 GPU (QOS Docs). It doesn’t do anything to make sure people actually use the GPU they request, but it does keep CPU-only jobs from taking over the GPU partition.
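For reference, that limit lives on a QOS attached to the partition; a minimal sketch, assuming the QOS and the partition are both called “gpu” and with made-up node names:

# slurm.conf: attach the QOS to the GPU partition
PartitionName=gpu Nodes=gpu[01-04] QOS=gpu

# require at least one GPU per job on that QOS
sacctmgr modify qos gpu set MinTRESPerJob=gres/gpu=1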
I wonder if it can be done without a QOS setup, as we don’t have one. MinTRESPerJob definitely points me in the right direction.
Thank you so much!
Best,
Delilah
We do that with a job submit plugin. Slurm provides a job_submit_lua.so plugin if it was built with Lua support (i.e. with the lua-devel package installed). You can enable it with the following line in your slurm.conf:
JobSubmitPlugins=lua
You can then create a job_submit.lua script in the same directory as slurm.conf and have it check fields of the job submission to see whether the job requested a GPU.
Our script is a little convoluted, but cleaned up, the basic idea looks something like the sketch below.
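This is only a sketch, assuming the GPU partition is literally named “gpu”; the job_desc fields worth checking differ by Slurm version (recent releases expose tres_per_*, older ones a gres field), so adjust for your site:

-- job_submit.lua: bare-bones check, not a production script
local function wants_gpu(spec)
    return spec ~= nil and string.find(spec, "gpu") ~= nil
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- only police jobs that target the gpu partition
    if job_desc.partition ~= "gpu" then
        return slurm.SUCCESS
    end

    -- accept the job if any of the GPU-capable request fields mention a GPU
    if wants_gpu(job_desc.tres_per_job) or wants_gpu(job_desc.tres_per_node)
       or wants_gpu(job_desc.tres_per_socket) or wants_gpu(job_desc.tres_per_task) then
        return slurm.SUCCESS
    end

    slurm.log_user("Jobs in the gpu partition must request at least one GPU, e.g. --gres=gpu:1")
    return slurm.ERROR
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end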
Another option is to use features and constraints. You can assign features to nodes and then specify a constraint at job submission. It may seem odd, but you could assign a feature of “nogpu” or “gpu” and have jobs request one specifically. The lua plugin others suggested is probably the more efficient enforcement, but this works without any plugins.
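For example, a sketch of that setup (node names, CPU counts, and feature names here are made up, so adjust to your layout):

# slurm.conf: tag nodes with arbitrary features
NodeName=cpu[001-020] CPUs=64 Feature=nogpu
NodeName=gpu[01-04]   CPUs=64 Gres=gpu:4 Feature=gpu

# CPU-only jobs then keep themselves off the GPU nodes:
sbatch --constraint=nogpu job.sh

# and GPU jobs can target GPU nodes explicitly:
sbatch --constraint=gpu --gres=gpu:1 job.sh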
Warmest regards,
Jason
This might be more appropriate for its own discussion thread, but here goes:
If there is a GPU partition, a user running jobs there without requesting a GPU is not necessarily a bad thing. Yes, there are users who just want compute and may not know (or care) that by grabbing all the CPUs on a node they block the GPU resources from the users who need them. On the other hand, on our cluster we often see a great many CPUs in the GPU partition sit idle while GPU jobs using only a few CPUs chug away. The GPU partition offers a huge amount of CPU compute, and if the CPU-only folks get a little guidance on how to run in the GPU partition (e.g. don’t use all the available cores while GPUs are idle, or only run on GPU nodes whose GPUs are already fully allocated), resources can be shared efficiently.
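One way to get that kind of sharing without relying purely on user etiquette (a sketch, with made-up partition names, node names, and core counts) is to overlap a second partition on the GPU nodes and cap it with MaxCPUsPerNode, so a core budget is always left for GPU jobs:

# slurm.conf: both partitions cover the same nodes, but CPU-only work
# submitted to gpu-shared can never take more than 32 cores per node
PartitionName=gpu        Nodes=gpu[01-04] QOS=gpu
PartitionName=gpu-shared Nodes=gpu[01-04] MaxCPUsPerNode=32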
Though this thread is dated, I’d like to contribute by elaborating on the QOS approach. You’ll need to incorporate your GPU GRES into slurm.conf under AccountingStorageTRES and attach a QOS to the GPU partition, similar to the following:
AccountingStorageTRES=gres/gpu,gres/gpu:h100,gres/gpu:a100,gres/gpu:rtx6000
PartitionName=gpu ... QOS=gpu
Afterward, execute:
systemctl restart slurmctld
scontrol reconfigure
Additionally, add QOS settings:
sacctmgr add qos gpu set MinTRESPerJob=gres/gpu=1
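Once that is in place, a quick check (the job script names below are just placeholders) is that a CPU-only submission to the gpu partition gets blocked, either left pending against the limit or rejected outright if the QOS also carries Flags=DenyOnLimit, while a job that asks for a GPU is accepted:

sacctmgr show qos gpu                 # confirm the limit is set
sbatch -p gpu cpu_only.sh             # blocked by MinTRESPerJob
sbatch -p gpu --gres=gpu:1 train.sh   # accepted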