Uptime requirement for HPC cluster

ktrn · October 3, 2025, 2:55pm

From the CaRCC community discussion

Hello all,
I am struggling trying to find the requirement of wole system uptime
for HPC clusters, e.g. from NIH funding supported systems. I found NSF
asks for <5% downtime. But not sure if there is any similar
requirement from NIH?
Thanks!

Best,

Feng Zhang
HPC Architect
Research & Innovation, Division of Information Technology
Stony Brook University

ktrn · October 3, 2025, 2:58pm

Hello Feng,

The 95% uptime seems fairly standard, and perhaps even generous. At least a while back, I believe NSF accepted 90% uptime, given the often cutting-edge nature of the hardware, etc. While I don’t know what NIH requires, 95% seems reasonable.

There is a difference, as well, between an HPC system that is, as mentioned, “bleeding edge” that may result in frequent downtime, tuning, updates, etc., vs. a system meant to be stable and unchanging that might therefore have an expected uptime of closer to 99%, or even some log-level of 9s beyond that. Both could be “HPC” broadly considered, but there is a fundamental difference in their purposes and builds.

Warmest regards,

Jason Simms

Manager of research and high-performance computing environments,

Swarthmore College

ktrn · October 3, 2025, 3:00pm

FWIW: I have the impression that NSF has historically funded shared infrastructure: networking, supercomputers, HPC clusters, Public Cloud on-ramp (‘CloudBank’) … telescopes & particle accelerators … whereas NIH has not. I would be surprised if NIH has an opinion on.

Stuart Kendrick, Network Engineer at Allen Institute

ktrn · October 3, 2025, 3:03pm

Well, here is one example of an NIH funded HPC system

https://insidehpc.com/2024/08/nih-funds-3-15m-for-anton-hpc-at-psc/

Ian Kaufman

Principal Systems Integration Engineer

UC San Diego, Research IT Services

ktrn · October 3, 2025, 3:04pm

I am unclear how you would define uptime for a ‘whole cluster’. To my mind, if the scheduler is scheduling and there is at least one node available for jobs, the cluster is up. So, uptime would be the amount of time not taken by scheduled cluster downtime.

Isn’t part of the point of a cluster that individual nodes can go down, but jobs continue to run so the cluster is ‘up’?

Maybe what you want is to define it yourself. How many days is the whole cluster down and unavailable completely as a percentage of the time in a whole year. Then, you could supplement that by reporting what the average percentage of nodes available over the year is?

That seems like it would provide a good summary of how well the cluster is running and summarize its availability to users well.

Bennet Fauber

Vermont Advanced Computing Center

ktrn · October 3, 2025, 3:21pm

Is HIPAA compliance all you really need? Are you sure that FedRamp compliance is not going to appear commonly as a term in the grant awards given to the people you are supporting? (There are other options as well but FedRamp seems at the moment like the requirement that you might most often encounter). I’m not as current as I once was, but I’d be pretty surprised that something as relatively weak in today’s world as HIPAA compliance would do for supporting a large group of NIH-funded researchers.

Thoughts from people more up on the current regulatory environment?

Sincerely, Craig Alan Stewart

ktrn · October 3, 2025, 3:22pm

Looking at the Anton system funded by NIH, operated at PSC: ( https://insidehpc.com/2024/08/nih-funds-3-15m-for-anton-hpc-at-psc/ )

I would recommend two things:

1. Ask PSC director Barr von Oehsen what requirements they’re held to. (I collaborated with him for a while and pulled him into this thread, but maybe his email filters will work against us.)

2. Examine any requirements described for the NIH grant program that funded Anton and/or other systems. (There may be different expectations for systems intended to address different use cases, computing modalities, storage needs, data access/privacy levels, etc.).

'Hope that helps,

Lauren

Lauren Michael - Principal Consultant, LMichael Consulting LLC; Internet2 Research Engagement

ktrn · October 3, 2025, 3:23pm

We’ve gotten a couple of awards via the NIH S10 HEI program for compute cluster equipment. I cannot recall ever seeing a requirement for uptime (though we like to put our uptime capabilities in the proposal). I’ve never seen nor heard of a general uptime requirement from the NIH for HPC systems.

The S10 program does have annual reporting requirements which includes reporting usage but I’m not aware of a specific number they require there. We typically report CPU*hours allocated for that. So like Lauren says, I expect any uptime requirements to be called out by the funding program.

Michael Gutteridge

FHCC IT - Scientific Computing

ktrn · October 3, 2025, 3:24pm

My understanding is that NIH is trending towards 800-171 for security controls. They require that forresearchers working with data from their 20 controlled repositories (like dbGaP), though some contracts ask for FISMA compliance (which is the security standard to certify a service on the FedRAMP marketplace). If you look at the 800-171 spec at:

https://nvlpubs.nist.gov/nistpubs/SpecialPublications/800-171r3/NIST.SP.800-171r3.html

You can see the following (that states availability is NOT included as a requirement). Which makes sense as the focus of 800-171 is on confidentiality and integrity:

"SP 800-171 security requirements represent a subset of the controls that are necessary to protect the confidentiality of CUI. The security requirements are organized into 17 families, as illustrated in Table 1. Each family contains the requirements related to the general security topic of the family. Certain families from SP 800-53 are not included due to the tailoring criteria. For example, the PII Processing and Transparency (PT) family is not included because personally identifiable information (PII) is a category of CUI, and therefore, no additional requirements are specified for confidentiality protection. The Program Management (PM) family is not included because it is not associated with any control baseline. Finally, the Contingency Planning (CP) family is not included because it addresses availability”.

I would think that if you track your uptime by Scheduled (maintenance windows) versus Unscheduled (incidents that take the system offline) and provide that as documentation of your uptime, you are at least declaring your expected availability, kind of to Michael’s point.

Best, Bill

--

William K. Barnett, Ph.D.

Chief Research Computing Officer

UMass Chan Medical School