I am trying to run a program that uses Spark on our MPI queue.
However, Spark creates very large files in the /tmp directory, so I would like to configure Spark to write all these temporary files to a specific directory instead.
I tried setting various environment variables:
export SPARK_WORK_DIR=$(pwd)
export SPARK_WORKER_DIR=${SPARK_WORK_DIR}/work
export SPARK_LOG_DIR=${SPARK_WORK_DIR}/log
export SPARK_LOCAL_DIRS=${SPARK_WORK_DIR}/tmp
export JAVA_OPTS="$JAVA_OPTS -Djava.io.tmpdir=${SPARK_WORK_DIR}/tmp"
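As far as I understand, the standalone start scripts also source conf/spark-env.sh, so I tried putting the same settings there before launching the master and workers (a sketch; the path is just an example):
# conf/spark-env.sh -- sourced by the standalone start scripts (directory below is only an example)
SPARK_WORK_DIR=/scratch/$USER/spark
export SPARK_LOCAL_DIRS=${SPARK_WORK_DIR}/tmp
export SPARK_WORKER_DIR=${SPARK_WORK_DIR}/work
export SPARK_LOG_DIR=${SPARK_WORK_DIR}/log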
I also passed various configuration properties on the spark-submit command line itself:
spark-submit \
--master spark://$master:$port \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf SPARK_LOCAL_DIRS=${SPARK_WORK_DIR}/tmp \
--conf "spark.local.dir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.worker.dir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.driver.local.dir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.executor.local.dir=${SPARK_WORK_DIR}/tmp" \
etc.
But the large temporary files are still being written to /tmp.
Any suggestions on what else I can try?
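Concretely, this is roughly how I check while the job is running (assuming $SPARK_WORK_DIR is exported in that shell too):
# quick check on a node where executors are running
ls -lh /tmp | grep -Ei 'spark|blockmgr'   # the big scratch directories keep appearing here
ls -lh "$SPARK_WORK_DIR/tmp"              # ...while the directory I configured stays empty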