As we continue our progress towards tool maturity, we are finding that the answer to the question “is the cluster up?” is not as straightforward as it seems. We are curious to learn how the community defines what it means for a cluster to be up, and what you are all watching to make sure the services are available.
I can’t necessarily speak to what it means for a cluster to be “up” globally; I’m not sure we’re at that stage yet. We are exploring a few tools to better understand and automate health at the node level, and to that end we have a stack of tools we use. Caveat: as a facilitator I’m a consumer of these tools, not a developer, so my understanding of their operation and how they fit together is fuzzy right now.
Our batch scheduler is Slurm. We use Prometheus to gather Slurm usage and node hardware data and feed it into Loki and Grafana for visualization. We’ve also got some Node Health Check automation which sets nodes to “drain” automatically if load or context switching gets too high. This prevents researchers from landing new jobs on poorly-performing nodes. Even at this level of use and automation, complaints about poor performance, and overall outages, have decreased substantially. We are still encountering occasional issues with GPFS getting “overwhelmed” in some sense, causing a global lack of I/O responsiveness, but we’re making headway.
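To give a flavor of the drain automation, the core idea is roughly the sketch below. This is illustrative rather than our actual Node Health Check configuration: the threshold values, the context_switch_rate helper, and the scontrol invocation are assumptions about how one might wire this up on a Slurm node.

    # Illustrative drain check (not a production NHC config); thresholds are assumptions.
    import os
    import socket
    import subprocess
    import time

    LOAD_THRESHOLD = 2.0 * (os.cpu_count() or 1)  # assumed: flag load well above core count
    CTXT_THRESHOLD_HZ = 100_000                   # roughly the 100 kHz level mentioned below

    def context_switch_rate(interval=5.0):
        """System-wide context switches per second, from the 'ctxt' counter in /proc/stat."""
        def read_ctxt():
            with open("/proc/stat") as f:
                for line in f:
                    if line.startswith("ctxt"):
                        return int(line.split()[1])
        start = read_ctxt()
        time.sleep(interval)
        return (read_ctxt() - start) / interval

    load1, _, _ = os.getloadavg()
    ctxt_hz = context_switch_rate()

    if load1 > LOAD_THRESHOLD or ctxt_hz > CTXT_THRESHOLD_HZ:
        # Drain the node so Slurm stops scheduling new jobs onto it.
        subprocess.run(
            ["scontrol", "update",
             f"NodeName={socket.gethostname()}",
             "State=DRAIN",
             f"Reason=healthcheck_load{load1:.1f}_ctxt{ctxt_hz:.0f}"],
            check=True,
        )

In practice something like this would run periodically (cron or a systemd timer), and a human would review and resume drained nodes once the cause is understood.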
The metric visualizations also give us facilitators the opportunity to identify which nodes may have misbehaving workflows, enabling us to investigate root causes more rapidly and easily. I’ve personally done some investigations that helped several researchers with misconceptions about multiprocessing-style workflows. We’ve learned that context switching above about 100 kHz results in performance degradation, and that high context switching is almost always caused by a multiprocessing workflow with more workers than cores. Helping the researchers match workers to cores (using $SLURM_CPUS_ON_NODE, for example) fixes the issue. They’re usually also really happy to have good conversations about parallelism, which gives us a starting point for further facilitation. The correlations between metrics and symptoms aren’t perfect, but our cluster health is much easier to manage than it was six months ago, before we had these tools.
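The worker-to-core matching we suggest looks roughly like this in Python. It is a minimal sketch: the work function and problem size are placeholders, and the fallback to os.cpu_count() is an assumption for running outside a Slurm allocation.

    # Size the multiprocessing pool to the CPUs Slurm allocated on this node.
    import os
    from multiprocessing import Pool

    def work(item):
        return item * item  # placeholder for the researcher's per-item task

    # SLURM_CPUS_ON_NODE is set inside a job; fall back to the machine's core count otherwise.
    n_workers = int(os.environ.get("SLURM_CPUS_ON_NODE", os.cpu_count()))

    if __name__ == "__main__":
        with Pool(processes=n_workers) as pool:
            results = pool.map(work, range(1000))
        print(f"{n_workers} workers, {len(results)} results")

The point of the pattern is simply that the pool size follows the allocation rather than a hard-coded worker count, so the job never oversubscribes the cores it was given.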