HPC access models: many nodes for short periods of time



How do other HPC clusters offer access to researchers with jobs that regularly require many nodes for short periods of time?

We offer:
1. a free tier model that provides up to 500,000 hours in a general access pool with lots of nodes, which can accommodate the bursty nature of these jobs well but only until they hit the 500,000 SU limit;
2. a paid condo model for researchers to purchase their own nodes for their exclusive access; the cost of purchasing many (20+) nodes is prohibitive, especially if they are not needed by the researcher most of the time.

Neither of these models seems like a good fit for researchers who need access to many nodes on a regular basis but only for short periods of time.

Does your institution have an option for researchers in this situation?


We generally do not provide the “condo” model but instead something (for lack of a better term) called the “coop” model. When a researcher contributes funds, we purchase hardware to add to the cluster, but that hardware is NOT for the researcher’s exclusive use. It is added to a general pool, and the researcher gets SUs, replenishing quarterly, proportional to the SUs added to the cluster by the hardware they contributed.
These SUs can be used to run jobs on any hardware in the pool, although you are not guaranteed that jobs will start immediately.

So if a researcher contributes 1 node to the cluster, they can do any of the following:

  1. run jobs on 1 node 100% of the time (for the entire quarter)
  2. run jobs on 2 nodes 50% of the time
  3. run jobs on 10 nodes 10% of the time, etc.
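
The trade-off in that list is just a fixed SU budget spread over more or fewer nodes. Here is a rough sketch of the arithmetic, not our actual accounting code. It assumes 1 SU = 1 node-hour and a 91-day quarter; both are illustrative assumptions, and the function names are made up for this example.

```python
# Sketch of the coop-model arithmetic. Assumptions (not from any real
# accounting system): 1 SU = 1 node-hour, and a quarter is 91 days long.
HOURS_PER_QUARTER = 91 * 24


def quarterly_allocation(nodes_contributed, hours_per_quarter=HOURS_PER_QUARTER):
    """SUs credited each quarter for the hardware a researcher contributed."""
    return nodes_contributed * hours_per_quarter


def max_time_fraction(nodes_contributed, nodes_used):
    """Fraction of the quarter the allocation covers when running on nodes_used nodes."""
    if nodes_used <= 0:
        raise ValueError("nodes_used must be positive")
    # The budget is fixed, so using more nodes shortens the time proportionally.
    return min(1.0, nodes_contributed / nodes_used)


if __name__ == "__main__":
    budget = quarterly_allocation(1)
    print(f"1 contributed node -> {budget} SUs per quarter")
    for nodes_used in (1, 2, 10):
        fraction = max_time_fraction(1, nodes_used)
        print(f"{nodes_used:>2} nodes for {fraction:.0%} of the quarter")
```

The same budget covers any combination of node count and duration whose product stays within the allocation, which is what makes the model a reasonable fit for bursty workloads.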

Depending on how heavily the cluster is being used, the amount of time jobs wait in the queue can vary, but it usually averages no more than a few hours for larger jobs. Shorter jobs that don’t need many nodes usually start sooner, since the scheduler can backfill them into idle gaps.