MPI + Singularity

We have a user who is trying to launch multiple container instances under MPI on our Slurm cluster, but the job hangs with no output.

Here’s a minimal reproducible example:

Running this works fine:
$ srun -n 5 apptainer exec docker://alpine cat /etc/alpine-release

However, things hang when running the following:
$ srun --mpi=pmix -n 5 apptainer exec docker://alpine cat /etc/alpine-release

We see the same behavior with the self-built container our researcher is using, except that running it without --mpi produces the expected MPI errors, whereas running it with --mpi=pmix hangs with no output.
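One thing I have not yet ruled out is whether pmix is actually among the PMI plugins our Slurm build provides; srun can list them directly (output varies by cluster, but pmix should appear here if the plugin is installed):

$ srun --mpi=list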

Can anyone provide some insight as to why this is happening?

I am able to run this at PSC on Bridges-2, except that I swap in --mpi=pmi2, as there is no pmix option on Bridges-2:

$ srun --mpi=pmi2 -n 5 apptainer exec docker://alpine cat /etc/alpine-release
srun: job 30687221 queued and waiting for resources
srun: job 30687221 has been allocated resources
INFO:    Using cached SIF image
INFO:    Using cached SIF image
INFO:    Using cached SIF image
INFO:    Using cached SIF image
INFO:    Using cached SIF image
3.21.3
3.21.3
3.21.3
3.21.3
3.21.3

On the PSU RC cluster, I cannot even get your minimal reproducible test case to run from an interactive desktop session:

$ srun --mpi=pmi2 -n 5 apptainer exec docker://alpine cat /etc/alpine-release
srun: warning: can't honor --ntasks-per-node set to 1 which doesn't match the requested tasks 5 with the maximum number of requested nodes 1. Ignoring --ntasks-per-node.
srun: error: Unable to create step for job 37007252: More processors requested than permitted
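That "More processors requested than permitted" error suggests the interactive desktop session only holds a single-task allocation, so there is nothing for srun to spread five tasks over. A quick check from inside the session (a sketch; $SLURM_JOB_ID is set inside any Slurm allocation):

$ scontrol show job $SLURM_JOB_ID | grep -E 'NumNodes|NumCPUs|NumTasks'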

I suspect it may have to do with the way SLURM handles environment export to the job's tasks.
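A quick way to test that suspicion (just a sketch, nothing cluster-specific) is to compare how many variables srun actually hands a task with and without the export set:

$ srun -n 1 env | wc -l
$ export SLURM_EXPORT_ENV=ALL
$ srun -n 1 env | wc -l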

As your minimal reproducible test case with pmi2 runs fine at PSC, I suspect the PSU RC cluster has a configuration issue either with SLURM or with the default MPI.
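If it is the default MPI, that part is easy to check: the MpiDefault and MpiParams settings can be read straight out of the running configuration with scontrol:

$ scontrol show config | grep -iE 'MpiDefault|MpiParams'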

OK, an update to my update.

I amended my SLURM commands on the PSU RC cluster, and your minimal test case now works if I use pmi2 instead of pmix:

$ export SLURM_EXPORT_ENV=ALL
$ salloc --partition=standard --account=test_credits_rise --nodes=5 --ntasks-per-node=1 --time=1:00:00
salloc: Pending job allocation 37007997
salloc: job 37007997 queued and waiting for resources
salloc: job 37007997 has been allocated resources
salloc: Granted job allocation 37007997
salloc: Waiting for resource configuration
salloc: Nodes p-sc-[2145-2149] are ready for job

$ srun --mpi=pmi2 -n 5 apptainer exec docker://alpine cat /etc/alpine-release
INFO:    Using cached SIF image
INFO:    Using cached SIF image
INFO:    Using cached SIF image
INFO:    Using cached SIF image
INFO:    Using cached SIF image
3.21.3
3.21.3
3.21.3
3.21.3
3.21.3
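
I have not isolated whether the SLURM_EXPORT_ENV=ALL export or the switch to pmi2 was the decisive change; rerunning with pmix inside the same allocation would disambiguate:

$ srun --mpi=pmix -n 5 apptainer exec docker://alpine cat /etc/alpine-release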