Hi All,
Does anyone have any experience configuring SLURM to prevent CPU-only jobs from running on GPU nodes?
TIA
Delilah
We use partition QOSes on all our partitions. For the gpu partition we set the MinTRESPerJob limit to require every job to request at least 1 GPU (QOS Docs). It doesn’t do anything to make sure people actually use the GPU they request, but it does keep CPU-only jobs from taking over the GPU partition.
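For reference, that limit lives on a QOS attached to the partition; a minimal sketch, assuming the QOS and the partition are both called “gpu” and with made-up node names:

# slurm.conf: attach the QOS to the GPU partition
PartitionName=gpu Nodes=gpu[01-04] QOS=gpu

# require at least one GPU per job on that QOS
sacctmgr modify qos gpu set MinTRESPerJob=gres/gpu=1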
I wonder if it can be done without a QOS setup, as we don’t have one. MinTRESPerJob definitely points me in the right direction.
Thank you so much!
Best,
Delilah
We do that with a job submit plugin. Slurm provides a job_submit_lua.so plugin if it was built with Lua support (i.e. with the lua-devel package installed). You can enable it with the following line in your slurm.conf:
JobSubmitPlugins=lua
You can then create a job_submit.lua script in the same directory as slurm.conf and have it check fields of the job submission to see whether the job requested a GPU.
Our script is a little convoluted, but cleaned up, the basic idea looks something like the sketch below.
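This is only a sketch, assuming the GPU partition is literally named “gpu”; the job_desc fields worth checking differ by Slurm version (recent releases expose tres_per_*, older ones a gres field), so adjust for your site:

-- job_submit.lua: bare-bones check, not a production script
local function wants_gpu(spec)
    return spec ~= nil and string.find(spec, "gpu") ~= nil
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- only police jobs that target the gpu partition
    if job_desc.partition ~= "gpu" then
        return slurm.SUCCESS
    end

    -- accept the job if any of the GPU-capable request fields mention a GPU
    if wants_gpu(job_desc.tres_per_job) or wants_gpu(job_desc.tres_per_node)
       or wants_gpu(job_desc.tres_per_socket) or wants_gpu(job_desc.tres_per_task) then
        return slurm.SUCCESS
    end

    slurm.log_user("Jobs in the gpu partition must request at least one GPU, e.g. --gres=gpu:1")
    return slurm.ERROR
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end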
Another option is to use features and constraints. You can assign features to nodes and then specify a constraint at job submission. It may seem odd, but you could assign a feature of “nogpu” or “gpu” and have jobs request one specifically. The lua plugin others suggested is probably the more efficient enforcement, but this works without any plugins.
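For example, a sketch of that setup (node names, CPU counts, and feature names here are made up, so adjust to your layout):

# slurm.conf: tag nodes with arbitrary features
NodeName=cpu[001-020] CPUs=64 Feature=nogpu
NodeName=gpu[01-04]   CPUs=64 Gres=gpu:4 Feature=gpu

# CPU-only jobs then keep themselves off the GPU nodes:
sbatch --constraint=nogpu job.sh

# and GPU jobs can target GPU nodes explicitly:
sbatch --constraint=gpu --gres=gpu:1 job.sh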
Warmest regards,
Jason
This might be more appropriate for its own discussion thread, but here goes:
If there is a GPU partition, a user running jobs there without requesting a GPU is not necessarily a bad thing. Yes, there are users who just want compute and may not know (or care) that by grabbing all the CPUs on a node they block the GPU resources from the users who need them. On the other hand, on our cluster we often see a great many CPUs in the GPU partition sit idle while GPU jobs using only a few CPUs chug away. The GPU partition offers a huge amount of CPU compute, and if the CPU-only folks get a little guidance on how to run in the GPU partition (e.g. don’t use all the available cores while GPUs are idle, or only run on GPU nodes whose GPUs are already fully allocated), resources can be shared efficiently.
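One way to get that kind of sharing without relying purely on user etiquette (a sketch, with made-up partition names, node names, and core counts) is to overlap a second partition on the GPU nodes and cap it with MaxCPUsPerNode, so a core budget is always left for GPU jobs:

# slurm.conf: both partitions cover the same nodes, but CPU-only work
# submitted to gpu-shared can never take more than 32 cores per node
PartitionName=gpu        Nodes=gpu[01-04] QOS=gpu
PartitionName=gpu-shared Nodes=gpu[01-04] MaxCPUsPerNode=32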
Though this thread is dated, I’d like to contribute by elaborating on the QOS approach. You’ll need to incorporate your GPU GRES into slurm.conf under AccountingStorageTRES and attach a QOS to the GPU partition, similar to the following:
AccountingStorageTRES=gres/gpu,gres/gpu:h100,gres/gpu:a100,gres/gpu:rtx6000
PartitionName=gpu ... QOS=gpu
Afterward, execute:
systemctl restart slurmctld
scontrol reconfigure
Additionally, add QOS settings:
sacctmgr add qos gpu set MinTRESPerJob=gres/gpu=1
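Once that is in place, a quick check (the job script names below are just placeholders) is that a CPU-only submission to the gpu partition gets blocked, either left pending against the limit or rejected outright if the QOS also carries Flags=DenyOnLimit, while a job that asks for a GPU is accepted:

sacctmgr show qos gpu                 # confirm the limit is set
sbatch -p gpu cpu_only.sh             # blocked by MinTRESPerJob
sbatch -p gpu --gres=gpu:1 train.sh   # accepted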