GNU Parallel, SLURM, and srun/mpirun

nucci · August 6, 2024, 12:24pm

Greetings.

I am trying to design a workflow that will marshall a set of smaller parallel jobs inside the scope of a larger script. The idea would be to allocate one or more nodes of 24 cores each. Inside this larger job, a series of smaller parallel jobs will be run, some processing done after they complete, followed by another series of smaller parallel jobs. Wash/rinse/repeat until convergence.

I thought about using GNU Parallel inside the SLURM script to launch the series of simultaneous smaller parallel jobs.

Let’s assume for this example I have 16 cores available to me, so I would like to launch 8 2-core MPI jobs.

If I do something like the following:

parallel -j 8 “srun -n 2 ./my_MPI_job” ::: {1…8}

I do get 8 “srun” sessions launched, but they are executed sequentially. At first, 7 of the 8 2-core jobs are waiting while the firs tjob runs. When it completes the second run runs, and so on.

I’ve tried a series of options to srun, I even used mpirun instead, but the processes are still serialized.

Has anyone successfully implemented this, or is there a better way to manage multiple smaller MPI jobs within the scope of a larger SLURM session?

Thanks.

langford · August 6, 2024, 11:25pm

What are you actually trying to achieve with this workflow? Why do this instead of something like Slurm job-arrays?

Do you have to get whole nodes at your HPC center, or can you run smaller jobs?

mweiner3 · August 7, 2024, 1:05pm

We’ve had success simply using srun natively with fractional resources inside the script, leading us to mostly retire GNU Parallel. The key is to fractionally assign all resources, or the steps will still run serially.
Quoting from Georgia Tech PACE’s documentation (this part is unfortunately now behind a login barrier at GT SSO or Guest Redirect):

On the Slurm scheduler, it is possible to run multiple processes in parallel natively with srun. This can be an alternative to Pylauncher, GNU Parallel, or job arrays for running a large number of smaller tasks at once in a single job. The method supports the execution of many small tasks in parallel, enabling HTC-style work-flows on HPC systems, such as PACE.

Your Slurm script will contain multiple srun lines. There are several key requirements for them to run simultaneously:

Ensure that each srun command asks for a fraction of the CPU and memory resource of the full job, with lines that should run simultaneously requesting less than or equal to the job’s total. Each task will start in order as soon as sufficient resources have become available for it.

Include -c1 if using 1 CPU per task, which is standard.

Include & at the end of each line to have the commands run simultaneously in the background.

Include wait at the end of the sequence of srun commands, to avoid having the job end while the processes are running in the background.

In this example, we’ll have six total tasks to run and want to run two at a time, each allocated half (12 cores and 84 GB of memory) of the job’s resources (24 cores and 168 GB of memory). The third task can start as soon as either of the first two ends, and so on.

#!/bin/bash
#SBATCH -N1 --ntasks-per-node=24                # Number of nodes and cores per node required
#SBATCH --mem-per-cpu=7G                        # Memory per core

srun --quiet -n12 -c1 --mem=84G ./executable1 &
srun --quiet -n12 -c1 --mem=84G ./executable2 &
srun --quiet -n12 -c1 --mem=84G ./executable3 &
srun --quiet -n12 -c1 --mem=84G ./executable4 &
srun --quiet -n12 -c1 --mem=84G ./executable5 &
srun --quiet -n12 -c1 --mem=84G ./executable6 &
wait

nucci · August 7, 2024, 4:43pm

Thanks much, your response is very appreciated. The way our proposed workflow is going, this looks like a good way to manage the smaller pieces within the larger scope of the workflow.

nucci · August 7, 2024, 6:46pm

The total workflow consists of running simultaneous smaller MPI jobs that form a larger collective. Whole nodes are desired because of exclusive access to memory and local disk.

Pseudocode looks something like this:

Assume each Icompute node has C cores

Set master SBATCH to span potentially days of wall clock time, on MAX(M*R, N) cores, which might be larger than C

Do some setup tasks

For each time step T: (this will loop is potentially 10^4 iterations)
# Forecast Step
For member 1 to P: (say P=16)
For member 1 to R in group Q: (say Q=8; we want to run R=2 copies of the model, Q=8 times sequentially, or a total of P=16 ensemble members)

              Call model runscript on M cores (assume M<C for now; may wish to change this further in the future)
  
              Confirm that all R runs in Q have completed
        # keep this loop going until all P members have run
  Confirm that all P runs have completed
  
  Do some file management (if M*R > C, this may complicate this task)

  Assimilation step
  
  Call MPI assimilation code using N Infiniband cores (can assume N<C for now; may wish to change this further in the future)

  Confirm that assimilation has been completed

  Do some file management

Do some concluding tasks

chris_blanton1 · August 8, 2024, 1:48pm

Pylauncher can do this workflow. GitHub - TACC/pylauncher

I really liked some of the cool things like an adaptive workflow.