Automated HPC testing tools and techniques?

mrich · June 20, 2019, 2:46pm

Hello all, does anyone have any experience they can share about doing automated testing of HPC clusters? As in, we would like to have a test suite that we can run on our cluster any time we make a configuration change or upgrade, in order to exercise the most commonly used software and commands.

We’d love to hear about other groups’ experiences in this area.

vsoch · July 12, 2019, 3:22pm

Hey @mrich! I can speak a little for some of our clusters. I don’t think most of them have tests set up akin to continuous integration, where tests run on every change; rather, we have continuous monitoring that reflects the status of servers (up, down, usage, etc.). I’ll ping other members of my group to see if we can provide more feedback, because automated testing of HPC is a cool idea!

mrich · July 12, 2019, 3:43pm

Thanks, that would be great if you could ask around. For now we are going with a simple shunit2 script that will run on a login node and just load our most popular modules and run “hello world” scripts to make sure nothing is egregiously broken.
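For anyone who would rather not use shell, the same kind of smoke test can be sketched in Python. This is only an illustration of the approach, not our actual shunit2 script: the module names and commands in `CHECKS` are hypothetical placeholders, and it assumes that `module` is available in a bash login shell on the login node.

```python
import subprocess
import sys

# Hypothetical checks: (name, shell command, text expected in stdout).
# Replace the module names and commands with your site's popular software.
CHECKS = [
    ("python", "module load python && python -c 'print(\"hello world\")'", "hello world"),
    ("gcc", "module load gcc && gcc --version", "gcc"),
]


def run_check(command, expected, shell=("bash", "-lc"), timeout=60):
    """Run one command; pass if it exits 0 and stdout contains `expected`.

    A login shell is used by default because `module` is typically a shell
    function defined by the login profile (an assumption about the site).
    """
    try:
        result = subprocess.run(
            [*shell, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing shell or a hung command counts as a failed check.
        return False
    return result.returncode == 0 and expected in result.stdout


def failed_checks(checks):
    """Return the names of all checks that did not pass."""
    return [name for name, command, expected in checks
            if not run_check(command, expected)]
```

Calling `failed_checks(CHECKS)` after a configuration change returns the names of the checks that broke; an empty list means nothing is egregiously wrong.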

griznog · July 12, 2019, 4:03pm

It really depends on how a cluster is operated, so one size doesn’t fit all. What I’ve found is that if, every time something breaks, I roll the root cause analysis/fix into a health check (I use NHC with Slurm) and/or a test job I can submit, then over time a pretty good testing regimen is assembled. It is not just a change-driven test but an ongoing monitoring system looking for known errors specific to the current environment. Once the health checks define the “known good state”, redefining that state (changing the expected kernel version, for instance) turns the health check into a handy tool for rolling reboots for kernel updates, BIOS/firmware updates, etc., making it easy to drain a cluster node-by-node based on anything you can write a script to test for.
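The “known good state” idea can be sketched roughly as follows. This is a plain Python illustration and not NHC's own configuration format; the `EXPECTED` dictionary and the kernel string in it are hypothetical example values.

```python
import platform

# Hypothetical "known good" state. Changing a value here (for example the
# expected kernel) makes every node that has not been updated fail the check.
EXPECTED = {"kernel": "4.18.0-example"}


def observe():
    """Collect the current node state for the keys tracked in EXPECTED."""
    return {"kernel": platform.release()}


def find_mismatches(expected, observed):
    """Describe every tracked value that differs from the expected state."""
    return [
        f"{key}: expected {want!r}, found {observed.get(key)!r}"
        for key, want in expected.items()
        if observed.get(key) != want
    ]


def node_is_healthy(expected, observed):
    """A node is healthy only when no tracked value deviates."""
    return not find_mismatches(expected, observed)
```

A wrapper could drain any node where `find_mismatches(EXPECTED, observe())` is non-empty and use the returned strings as the drain reason, which is how a changed expected value ends up driving a rolling update.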

For the unknown errors…good luck predicting those. https://dilbert.com/search_results?terms=Unplanned%20Outages

griznog

ericfranz · July 16, 2019, 7:30pm

We are successfully using the ReFrame testing framework at OSC and are presenting a paper on our use of it at PEARC19: https://pearc19.conference-program.com/presentation/?id=pap152&sess=sess204

The idea is to write automated regression tests that submit batch jobs that load modules and possibly exercise the software to a degree that the output can be automatically tested and verified. Testpilot might be another option but I don’t know if it is available for other centers or not.
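To give a feel for the "automatically tested and verified" part, here is a minimal sketch of an output check in plain Python. It is not ReFrame's API; the function names, the regular expression, and the reference value in the usage example are all hypothetical, and submitting the batch job and collecting its output is left out.

```python
import re


def extract_value(output, pattern):
    """Return the first captured group of `pattern` in `output` as a float.

    Returns None when the pattern does not match, e.g. because the job failed
    before printing its result.
    """
    match = re.search(pattern, output)
    return float(match.group(1)) if match else None


def output_is_valid(output, pattern, reference, rel_tol=1e-6):
    """Pass only if the value is present and within rel_tol of the reference."""
    value = extract_value(output, pattern)
    if value is None:
        return False
    return abs(value - reference) <= rel_tol * abs(reference)
```

For example, with job output containing the line `Total energy: -1234.5678 eV`, calling `output_is_valid(output, r"Total energy:\s+(-?\d+\.\d+)", -1234.5678)` passes, while missing or changed output fails the regression test.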

ericfranz · August 5, 2019, 2:47pm

Here is our PEARC19 paper, “A Continuous Integration-Based Framework for Software Management”, on an alternative to EasyBuild and Spack in which we use a ReFrame test suite and GitLab webhooks to automate the execution of those tests on our three clusters.

vsoch · August 5, 2019, 7:20pm

Do you have a version (a direct link to pdf or one you can upload) that isn’t behind a login / paywall?

ericfranz · August 6, 2019, 3:14pm

Oops, https://dl.acm.org/citation.cfm?id=3332219 is the link. I’ll fix my previous comment with the correct link.

ericfranz · August 6, 2019, 8:44pm

Oh sorry, that link is now paywall-ed…

Shahzeb_Siddiqui · July 11, 2020, 4:10am

Take a look at buildtest (https://github.com/buildtesters/buildtest); it targets HPC system and software testing. The framework is based on YAML files that are validated with jsonschema. I have been doing some prototype tests for different HPC sites.

Hope that helps you