How to reduce memory footprint without restructuring MPI code



My MPI job requires more memory than is allocated in my current node configuration. What tricks can I use to reduce its memory footprint without restructuring the code?

CURATOR: toreliza



A good approach is to run the job with a minimal set of #SBATCH directives in the submit-script preamble, specifying only the number of tasks and the CPUs per task. For example:

#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2

This is because the --ntasks option tells the Slurm controller the maximum number of tasks that will be run, and lets the controller allocate the appropriate resources. Adding the --cpus-per-task option lifts the default of one CPU per task, allowing more than one task per node to launch.
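A minimal submit script built on this idea might look like the following sketch. The job name and the executable (my_mpi_app) are placeholders for illustration, not part of the original scenario:

```shell
#!/bin/bash
#SBATCH --job-name=mpi-mem-test   # hypothetical job name
#SBATCH --ntasks=8                # total MPI tasks; the scheduler chooses the nodes
#SBATCH --cpus-per-task=2         # CPUs allocated to each task

# Launch the MPI program (binary name is a placeholder).
srun ./my_mpi_app
```

With no --nodes line, the controller is free to place the 8 tasks wherever the requested CPUs and memory are available.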

To illustrate the above, let's use a specific scenario. Suppose the job is submitted to 4 nodes, each with 16 cores and 4 GB of total RAM. Attempting to submit the job to the cluster results in an error indicating that not enough memory is allocated for the job to execute successfully.

The above setup (I will use Slurm notation) implies that the preamble of the submit script includes:

#SBATCH --nodes=4

This alone restricts the total memory available for allocation to what four nodes provide (4 × 4 GB = 16 GB). If the job needs more than 16 GB of RAM, it will not run.

Let’s include some options that might also be part of the script:

#SBATCH --ntasks=16

With four nodes and the default of one task per node, the resources are clearly insufficient for 16 tasks.

Let's add another option:

#SBATCH --cpus-per-task=4

Now four tasks per node are allowed, and with four nodes, 16 tasks in total are possible. As long as each task requires no more than 1 GB of memory, the job can run: (4 GB / node) / (4 tasks / node) = 1 GB / task.
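The per-task arithmetic above can be sketched with plain shell arithmetic; the figures are the ones from the scenario (4 GB per node, 4 tasks per node):

```shell
# Scenario figures: 4 GB of RAM per node, 4 tasks per node.
mem_per_node_gb=4
tasks_per_node=4

# Integer division is fine here because the numbers divide evenly.
mem_per_task_gb=$(( mem_per_node_gb / tasks_per_node ))
echo "${mem_per_task_gb} GB per task"
```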

Let's say, however, that each task requires 2 GB to run. We can either request more nodes (8), or fewer CPUs per task (2), and therefore fewer tasks overall (8). The first option relies on the constraints of the current configuration being lifted (i.e., more than four nodes being available); the second works within the existing four nodes but runs fewer tasks.
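Following the same accounting as above, the two rebalanced preambles could be sketched as (directive values chosen to match the 2 GB-per-task requirement in this scenario):

```shell
# Option 1: expand to 8 nodes, keeping 16 tasks.
# 2 tasks land on each 4 GB node, so each task gets 2 GB.
#SBATCH --nodes=8
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4

# Option 2: stay on 4 nodes but run only 2 tasks per node (8 tasks total).
# Again (4 GB / node) / (2 tasks / node) = 2 GB / task.
#SBATCH --nodes=4
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
```

Only one option would appear in a real preamble; they are shown together here for comparison.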

Alternatively, if we had not specified the four-node restriction, and had just requested the number of tasks and the number of CPUs per task, the scheduler would be free to choose the configuration and quantity of resources that best satisfy the job's memory requirements.
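A sketch of such an unrestricted preamble follows. The --mem-per-cpu directive is a standard Slurm option for stating the memory requirement explicitly; the 2G value is taken from the 2 GB-per-task scenario above:

```shell
# No --nodes line: the scheduler decides how many nodes to use.
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G   # 2 GB per CPU, matching the 2 GB-per-task need
```

Stating the memory need directly lets the controller reject or queue the job sensibly rather than failing at run time.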



Depending on the range of compute resources available, there are two main approaches to resolve the memory issue. If the job is constrained by the current resource configuration (no more nodes are available, for example), adjusting options such as --ntasks and --cpus-per-task can rebalance memory usage per task into something workable. If the current configuration can be expanded, a good strategy is to specify only --ntasks and --cpus-per-task in the submit-script preamble. This gives the scheduler the flexibility to allocate cluster resources efficiently to satisfy the job's memory requirements, while optimizing both time spent in the queue and time spent running the calculation.