Ask.Cyberinfrastructure

What are good uses for job arrays?

job-array

#1

Hey HPC nerds! I’m putting together a little tutorial and introduction to job arrays (very simple), and the most important piece is a list of compelling reasons to use them in the first place, say, over submitting each job individually with sbatch. Let’s put our heads together and think! I’m relatively new to using them, so my list is likely limited.

  • Running a randomized simulation many times, with output files numbered 1…N. The array index serves as the variable that names each output file.
  • Running an analysis over many inputs, where each input is named according to the array index and the outputs follow suit. The same idea applies to directory names. (A sketch of both cases follows this list.)
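
For instance, a minimal Slurm job script along those lines might look like this; the simulate binary, the file names, and the range 1-100 are placeholders, not part of any particular workflow:

#!/bin/bash
#SBATCH --job-name=sim-array
#SBATCH --array=1-100
#SBATCH --output=logs/sim_%A_%a.out   # %A = job ID, %a = array task ID

# The array index both seeds the run and names its files
# (simulate, input_N.dat and result_N.dat are hypothetical).
./simulate --seed "${SLURM_ARRAY_TASK_ID}" \
           --input "input_${SLURM_ARRAY_TASK_ID}.dat" \
           --output "result_${SLURM_ARRAY_TASK_ID}.dat"

Submitted once with sbatch, this fans out into 100 independent tasks, one per index.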

What do others think?


#3

An example which reads multiple parameters from a separate file is handy, e.g. in the job script:

#!/bin/bash
# An input file with a line for each array element
# and parameters separated by spaces.
PARAMS=/path/to/parameter/file.txt
# "Nq;d" makes sed print only line N, so each task picks out the line
# matching its own array index.
read -ra params <<<"$(sed "${SLURM_ARRAY_TASK_ID}q;d" "$PARAMS")"

After this, params is a bash array holding every field from line ${SLURM_ARRAY_TASK_ID} of the $PARAMS file. Using an array allows the number of parameters to vary from line to line, so the next lines in the script might be

command -t "${params[0]}" -b "${params[1]}" "${params[@]:2}"

to use the first two elements to set parameters and then pass everything else to the command.
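
For concreteness, a parameter file for that command might contain lines like the following (the values and extra flags are invented); the first field feeds -t, the second -b, and anything left over is passed straight through:

0.1 8 --verbose
0.2 8
0.5 16 --restart checkpoint.dat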

Once a parameter file is ready (by whatever means it is created), the job can then be submitted with

sbatch --array=1-$(awk 'END{print NR}' /path/to/parameter/file.txt) job_array.sh

edit: Although I use the line read from the file as parameters here, there’s no reason the entire command can’t be in the parameter file, so a job array can really be used to run any arbitrary set of commands as its tasks.
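
In that variant the job script body shrinks to something like the sketch below; the command file is hypothetical, and each of its lines is assumed to be a complete shell command:

CMDS=/path/to/command/file.txt
# Run whichever command sits on the line matching this task's index.
eval "$(sed "${SLURM_ARRAY_TASK_ID}q;d" "$CMDS")"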

Also, when working on a cluster that limits the number of job array elements, it can be useful to use a step in the array spec and have each array task run several commands for its step. Say with

--array=1-100:10

Then each element can work on lines ${SLURM_ARRAY_TASK_ID} through ${SLURM_ARRAY_TASK_ID} + 9 of the input file, i.e. ten lines per task. Season to taste, of course.
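
A sketch of that pattern, reusing the parameter file from above and with command standing in for the real program as in the earlier example (it assumes the file has a multiple of ten lines):

#!/bin/bash
#SBATCH --array=1-100:10

STEP=10
PARAMS=/path/to/parameter/file.txt
# Each task covers lines SLURM_ARRAY_TASK_ID .. SLURM_ARRAY_TASK_ID+STEP-1.
for (( line = SLURM_ARRAY_TASK_ID; line < SLURM_ARRAY_TASK_ID + STEP; line++ )); do
    read -ra params <<<"$(sed "${line}q;d" "$PARAMS")"
    command -t "${params[0]}" -b "${params[1]}" "${params[@]:2}"
done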


#2

Any large data operation where the results are not dependent on each other and the inputs and outputs can be designated by the iterator. Here’s what we put together to demo the problem and some approaches:

In this case, the scheduler is SGE, but the approaches (and some of the problems addressed) are identical among schedulers:

https://goo.gl/cEWNJ6
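
For reference, the Slurm examples above translate almost directly: in SGE the array range is requested with -t and the index arrives in $SGE_TASK_ID. A minimal hypothetical sketch:

#!/bin/bash
#$ -t 1-100
#$ -o logs/

# Same pattern as the Slurm scripts: the task ID picks the input
# and names the output (analyse and the file names are placeholders).
./analyse "input_${SGE_TASK_ID}.dat" > "result_${SGE_TASK_ID}.out"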
ta,
Harry