Ask.Cyberinfrastructure

Automated HPC testing tools and techniques?

Hello all, does anyone have experience to share with automated testing of HPC clusters? Specifically, we would like a test suite that we can run on our cluster whenever we make a configuration change or upgrade, to exercise the most commonly used software and commands.

We’d love to hear about other groups’ experiences in this area.

Hey @mrich! I can speak a little for some of our clusters. I don’t think most of them have tests set up akin to continuous integration triggered by code changes; rather, we have continuous monitoring that reflects the status of servers (up, down, usage, etc.). I’ll ping other members of my group to see if we can provide more feedback, because automated testing of HPC is a cool idea!

Thanks, that would be great if you could ask around. For now we are going with a simple shunit2 script that will run on a login node and just load our most popular modules and run “hello world” scripts to make sure nothing is egregiously broken.
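For readers who want to try the same idea without shunit2, here is a minimal Python sketch of a login-node smoke test. It is not the script described above. The module names and commands in `SMOKE_TESTS` are hypothetical placeholders, and the sketch assumes that a login shell (`bash -lc`) defines the `module` command, which depends on how Lmod or Environment Modules is initialized at your site.

```python
import subprocess

# Hypothetical module names and "hello world" commands; replace with your own.
SMOKE_TESTS = {
    "python": "python3 -c 'print(\"hello world\")'",
    "gcc": "gcc --version",
}


def build_command(module, command):
    """Return a shell line that loads a module and then runs a command."""
    return f"module load {module} && {command}"


def run_smoke_test(module, command, runner=subprocess.run, timeout=60):
    """Run one check in a login shell; return True if it exits with 0.

    Assumption: `bash -lc` sources the profile scripts that define `module`.
    """
    result = runner(
        ["bash", "-lc", build_command(module, command)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0


def main():
    """Run every smoke test, print PASS/FAIL, and return an exit status."""
    failed = []
    for module, command in SMOKE_TESTS.items():
        ok = run_smoke_test(module, command)
        print(f"{'PASS' if ok else 'FAIL'}  {module}")
        if not ok:
            failed.append(module)
    return 1 if failed else 0
```

Calling `main()` on a login node after a change gives a quick pass/fail list. The `runner` parameter exists only so the check can be exercised without a real module system.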

It really depends on how a cluster is operated, so one size doesn’t fit all. What I’ve found is that if, every time something breaks, I roll the root cause analysis and fix into a health check (I use NHC with Slurm) and/or a test job I can submit, then over time a pretty good testing regimen gets assembled. It is not just a change-driven test but an ongoing monitoring system that looks for known errors specific to the current environment. Once the health checks define the “known good state”, redefining that state (changing the expected kernel version, for instance) turns the health check into a handy tool for rolling reboots for kernel updates, BIOS/firmware updates, etc., making it easy to drain a cluster node by node based on anything you can write a script to test for.
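The "known good state" idea can be sketched roughly as follows. This is not NHC, which is a shell-based framework with its own configuration format; it is a plain Python illustration of the same pattern. The expected kernel string and the node name are hypothetical placeholders, and the sketch only builds the Slurm `scontrol` drain command rather than running it.

```python
import platform

# Hypothetical "known good state"; change this value to start a rolling update.
EXPECTED_KERNEL = "5.14.0-362.el9.x86_64"


def check_kernel(expected, actual=None):
    """Compare the running kernel release against the expected one."""
    actual = platform.release() if actual is None else actual
    if actual == expected:
        return True, "kernel ok"
    return False, f"kernel {actual} != expected {expected}"


def run_checks(checks):
    """Run each zero-argument check; return the messages of those that fail."""
    failures = []
    for check in checks:
        ok, message = check()
        if not ok:
            failures.append(message)
    return failures


def drain_command(node, failures):
    """Build the scontrol invocation that would drain a failing node."""
    reason = "; ".join(failures)
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]
```

A wrapper could call `run_checks([lambda: check_kernel(EXPECTED_KERNEL)])` on each node and pass the result to `drain_command` only when the list is non-empty. Each new root cause then becomes one more function in the list of checks.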

For the unknown errors…good luck predicting those. https://dilbert.com/search_results?terms=Unplanned%20Outages

griznog


We are successfully using the ReFrame testing framework at OSC and are presenting a paper on our use of it at PEARC19: https://pearc19.conference-program.com/presentation/?id=pap152&sess=sess204

The idea is to write automated regression tests that submit batch jobs that load modules and exercise the software enough that the output can be automatically tested and verified. Testpilot might be another option, but I don’t know whether it is available to other centers.
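The verification step of such a regression test can be illustrated with a small Python sketch. It does not use ReFrame's API and is not how OSC's tests are written; it only shows the general idea of checking captured job output against expected and forbidden patterns. The function name and the sample patterns are hypothetical.

```python
import re


def verify_output(text, must_match, must_not_match=()):
    """Return a list of problems found in captured job output.

    An empty list means the output passed every check.
    """
    problems = []
    for pattern in must_match:
        # Every expected pattern must appear somewhere in the output.
        if re.search(pattern, text, re.MULTILINE) is None:
            problems.append(f"missing expected pattern: {pattern}")
    for pattern in must_not_match:
        # Forbidden patterns (error messages) are matched case-insensitively.
        if re.search(pattern, text, re.MULTILINE | re.IGNORECASE):
            problems.append(f"found forbidden pattern: {pattern}")
    return problems
```

For example, after a batch job finishes, its output file could be read and passed in as `verify_output(text, [r"^hello world$"], [r"segmentation fault"])`, with the test failing whenever the returned list is non-empty.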