Hello everyone @csim, @solj, @sasmitam,
I have another question about parametric runs. I am using the following script to run multiple parametric jobs with launcher.
#!/bin/bash
#SBATCH -J launcher
#SBATCH -N 3
#SBATCH -n 48
#SBATCH -p normal
#SBATCH -o out/slurm-out/Parametric.%j.out
#SBATCH -e out/slurm-out/Parametric.%j.err
#SBATCH -t 36:00:00
#------------------------------------------------------
module load launcher
export LAUNCHER_SCHED=interleaved
export LAUNCHER_WORKDIR=~/monte
export LAUNCHER_JOB_FILE=script/montecmd1
$LAUNCHER_DIR/paramrun
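For context, the file that LAUNCHER_JOB_FILE points to lists one command per line, and launcher runs each line as a separate task. A minimal sketch of how such a file could be generated follows; `make_job_file`, `./monte_sim`, and the `--seed` flag are hypothetical placeholders for illustration, not my actual executable or parameters.

```shell
#!/bin/bash
# Sketch: write a launcher job file with one command per line.
# Assumption: ./monte_sim and --seed are placeholder names, not the real program.

make_job_file() {
    local n=$1      # number of tasks to generate
    local out=$2    # path of the job file to write
    local i=1
    : > "$out"      # create or truncate the job file
    while [ "$i" -le "$n" ]; do
        # Same executable on every line; only the parameter changes.
        echo "./monte_sim --seed $i" >> "$out"
        i=$((i + 1))
    done
}

# Example (not run here): make_job_file 48 script/montecmd1
```

Each generated line is independent of the others, which is what lets launcher spread the tasks across nodes.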
The jobs start normally and seem to run on multiple nodes as expected.
[rxz074000@europa ~]$ squeue -u$USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1089 normal launcher rxz07400 R 1-06:50:09 3 compute-1-1-[1-3]
However, when I check the output, the jobs appear to run only on the first allocated node and never start on the remaining nodes. The Slurm error output shows the following messages.
[rxz074000@europa monte]$ vi out/slurm-out/Parametric.1089.err
using /tmp/launcher.1089.hostlist.anax6Zvm to get hosts
starting job on compute-1-1-1
starting job on compute-1-1-2
starting job on compute-1-1-3
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-2: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-2: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-2: Name or service not known^M
ssh: Could not resolve hostname compute-1-1-3: Name or service not known^M
...
I cannot figure out whether this is caused by something I am doing or whether there is another issue. The executable is the same for all jobs; only the parameters differ.
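In case it helps narrow this down, here is a minimal sketch of a check that could be run on the first compute node to see which node names resolve there. It assumes a text file with one hostname per line; the function names `resolves` and `check_hosts` are my own, and `getent hosts` is used only as one way to query the system resolver.

```shell
#!/bin/bash
# Sketch: report which hostnames in a list resolve on the current node.
# Assumption: the input file holds one hostname per line.

# Succeed if the name resolves through the system resolver (/etc/hosts, DNS, ...).
resolves() {
    getent hosts "$1" > /dev/null 2>&1
}

# Print "OK <host>" or "FAIL <host>" for every non-empty line of the file.
check_hosts() {
    local host
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue    # skip blank lines
        if resolves "$host"; then
            echo "OK $host"
        else
            echo "FAIL $host"
        fi
    done < "$1"
}

# Run only when a hostlist file is passed as the first argument.
if [ "$#" -gt 0 ]; then
    check_hosts "$1"
fi
```

Inside the job, a list of node names could be produced with `scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.txt` and then passed to the script. A FAIL line for compute-1-1-2 or compute-1-1-3 would suggest the problem is name resolution on the compute node rather than anything in the job file.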