I am trying to run a program that uses Spark on our MPI queue.
However, Spark creates very large files in the /tmp directory, so I would like to configure Spark to write all these temporary files to a specific directory instead.
I tried setting various environment variables:
export SPARK_WORK_DIR=$(pwd)
export SPARK_WORKER_DIR=${SPARK_WORK_DIR}/work
export SPARK_LOG_DIR=${SPARK_WORK_DIR}/log
export SPARK_LOCAL_DIRS=${SPARK_WORK_DIR}/tmp
export JAVA_OPTS="$JAVA_OPTS -Djava.io.tmpdir=${SPARK_WORK_DIR}/tmp"
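As far as I understand, the standalone start scripts also source conf/spark-env.sh, so I tried putting the same settings there before launching the master and workers (a sketch; the path is just an example):
# conf/spark-env.sh -- sourced by the standalone start scripts (directory below is only an example)
SPARK_WORK_DIR=/scratch/$USER/spark
export SPARK_LOCAL_DIRS=${SPARK_WORK_DIR}/tmp
export SPARK_WORKER_DIR=${SPARK_WORK_DIR}/work
export SPARK_LOG_DIR=${SPARK_WORK_DIR}/log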
I also passed various configuration properties on the spark-submit command line itself:
spark-submit \
--master spark://$master:$port \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf SPARK_LOCAL_DIRS=${SPARK_WORK_DIR}/tmp \
--conf "spark.local.dir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.worker.dir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.driver.local.dir=${SPARK_WORK_DIR}/tmp" \
--conf "spark.executor.local.dir=${SPARK_WORK_DIR}/tmp" \
etc.
But the large temporary files are still being written to /tmp.
Any suggestions on what else I can try?
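Concretely, this is roughly how I check while the job is running (assuming $SPARK_WORK_DIR is exported in that shell too):
# quick check on a node where executors are running
ls -lh /tmp | grep -Ei 'spark|blockmgr'   # the big scratch directories keep appearing here
ls -lh "$SPARK_WORK_DIR/tmp"              # ...while the directory I configured stays empty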