My job uses more memory than is available on one node in a cluster. What considerations are involved in spreading the job over more nodes?
Considerations:
Are there more nodes available to you for your computation? (Assume yes.)
What is the amount of RAM per node? (Assume homogeneous group of nodes.)
How many cores per node? (This gives the amount of RAM per core: (RAM per node) ÷ (cores per node) = RAM per core; see the sketch after this list for one way to query these figures.)
How much memory does your job require?
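If the per-node core and memory figures are not already known to you, the scheduler can usually report them. A minimal sketch, assuming your site runs SLURM and that "compute" stands in for whatever partition name your site actually uses:

    # List node name, CPU count, and memory (in MB) for each node in a partition.
    sinfo --partition=compute --format="%n %c %m"

    # Or inspect one node in detail (node001 is a placeholder name).
    scontrol show node node001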
If your cluster is heterogeneous, there are several ways to manage your submission. You could choose parameters that the nodes with the least memory and fewest cores can tolerate. This gives you potential access to the entire cluster, but may leave a substantial portion of a given node unused, or ‘wasted’. (One work-around might be to run in shared mode rather than exclusive.) Alternatively, you could specify actual node numbers or names when submitting your job. Or, if there are many lower-capacity nodes and you wish to avoid occupying the higher-core, higher-memory nodes, you could restrict the maximum number of cores per node, or memory per node, so that those nodes are not involved in your run at all.
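In SLURM terms (the scheduler used in the examples below), the last two approaches might be sketched as follows; the node names and limits here are placeholders, not recommendations:

    # Name the nodes to use explicitly:
    #SBATCH --nodelist=node001,node002

    # Or cap the per-node request so that the smaller nodes qualify:
    #SBATCH --ntasks-per-node=16   # no more tasks per node than the smallest node has cores
    #SBATCH --mem=4000             # per-node memory request, in MB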
First, determine how resources are currently allocated to your MPI job. If you are running on only one node, it is reasonable to assume that, since the job is described as an MPI job, it can run over more than one node without being restructured.
Some basic analysis can provide information that will also help to optimize the allocation of resources you request. I’ll use specific numbers in an example to demonstrate.
Let’s assume, for simplicity, your cluster is homogeneous. (Or, that you have access to a queue of homogeneous nodes.) Each node has 16 cores and a total of 4 GB of RAM. With this established, the considerations that will have the most impact are scheduler parameters. I will use SLURM options in this example, but in general, the options are either analogous in another scheduler, or the values can be reformulated to comply with another scheduler’s conventions. (There are alternative combinations of SLURM options that achieve equivalent arrangements as well.)
If your code has been built with OpenMP, taking advantage of this capability can affect memory usage (and performance). Let’s say you are using MPI and running on one node, with 16 tasks per node (one task per core). The amount of memory on the node is then divided by 16; each task has only 1/16 of the node’s total memory. Alternatively, you could run one task per node; with MPI, the default is one core per task, giving that one task access to all of the node’s memory, but running on only one core. Instead, you could run with one task per node and 16 threads per task (by invoking OpenMP). The single task per node still has access to all of the node’s memory, but 16 cores instead of one are doing the processing.
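A sketch of the two single-node layouts just described, in SLURM terms; ./myprog is a placeholder for your executable, and these are alternatives, not one script:

    # (a) Pure MPI: 16 tasks, 1 core each; each task sees ~1/16 of the node's RAM.
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --cpus-per-task=1
    srun ./myprog

    # (b) Hybrid MPI+OpenMP: 1 task with 16 threads; the task sees the whole node's RAM.
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=16
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./myprog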
With respect to the total memory available, however, the above reformulation will not have an effect: rearranging tasks and threads on a single node does not give the job more than that node’s RAM. Let’s say you want to run 16 tasks total, but one node does not provide sufficient memory. By ensuring that the SLURM options ntasks and cpus-per-task do not violate the restriction imposed by 16 cores per node, one can vary the combination of these values to evaluate memory usage scenarios.
Let’s say the job in question needs only 8 GB of RAM in total to run. Thinking in terms of RAM per node, we note that 2 nodes should suffice for the memory requirement. If we adhere to running 16 tasks, we need 8 tasks per node, which on a 16-core node allows 2 cpus-per-task. We could also opt for 4 nodes and run 4 tasks per node with 4 cpus per task. The deciding factor is how much memory per task is required to run the calculation, and, of course, whether resources can be allocated to satisfy that requirement.
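Those two arrangements could be sketched as follows (directives only; 16 tasks in each case, on the 16-core, 4 GB nodes of this example):

    # 2 nodes: 8 tasks per node x 2 cpus per task = 16 cores on each node.
    #SBATCH --nodes=2
    #SBATCH --ntasks=16
    #SBATCH --ntasks-per-node=8
    #SBATCH --cpus-per-task=2

    # 4 nodes: 4 tasks per node x 4 cpus per task = 16 cores on each node.
    #SBATCH --nodes=4
    #SBATCH --ntasks=16
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=4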
For the sake of comparison, let’s say we decide to run on 16 nodes. If we submit with the default of one task per node, requesting resources for 16 tasks means we need 16 nodes. At 4 GB per node, we have 64 GB available, far more than the 8 GB we need.
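A sketch of that request (again, directives only):

    # 16 nodes, one task per node: each task has the full 4 GB of its node,
    # 64 GB in total, but only one core per node is doing work.
    #SBATCH --nodes=16
    #SBATCH --ntasks=16
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=1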