Application-agnostic vs application-aware HPC scheduling

If an HPC scheduler had knowledge of the job-specific characteristics of the applications it runs, would that information significantly improve the accuracy of runtime predictions and other performance metrics for HPC jobs? I am particularly interested in ML/AI workloads, given their diverse behavior and the varying system demands that depend on the specific ML architecture used by the job.
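
To make the intuition concrete, here is a minimal, hedged sketch with synthetic data: a toy group-mean runtime predictor, fit once on scheduler-visible features only (node count) and once with an application-aware feature added (the job's ML architecture). The feature names and numbers are purely illustrative, not from any real trace.

```python
# Toy illustration: does an application-aware feature (ML architecture)
# sharpen runtime predictions over resource-request features alone?
# Synthetic jobs and a group-mean predictor; pure stdlib.
from statistics import mean
from collections import defaultdict

# (requested_nodes, architecture, actual_runtime_hours) -- synthetic
jobs = [
    (4, "transformer", 10.0), (4, "transformer", 11.0),
    (4, "cnn", 2.0),          (4, "cnn", 2.5),
    (8, "transformer", 20.0), (8, "cnn", 4.0),
]

def fit(jobs, key):
    """Group-mean runtime predictor keyed by the given feature extractor."""
    groups = defaultdict(list)
    for nodes, arch, runtime in jobs:
        groups[key(nodes, arch)].append(runtime)
    return {k: mean(v) for k, v in groups.items()}

def mae(jobs, model, key):
    """Mean absolute error of the model on the given jobs."""
    return mean(abs(rt - model[key(n, a)]) for n, a, rt in jobs)

# Application-agnostic: only the scheduler-visible node count
agnostic = fit(jobs, lambda n, a: n)
err_agnostic = mae(jobs, agnostic, lambda n, a: n)

# Application-aware: node count plus the job's ML architecture
aware = fit(jobs, lambda n, a: (n, a))
err_aware = mae(jobs, aware, lambda n, a: (n, a))

print(err_agnostic, err_aware)  # the architecture feature cuts the error
```

In this contrived sample, transformer and CNN jobs with identical resource requests have very different runtimes, so the agnostic predictor averages across them and misses badly, while the aware predictor separates the two populations. Whether the same gap shows up on real traces is exactly the empirical question.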

It’s interesting that very few research works incorporate job-level characteristics, like the specific model architecture or the primary library used in ML applications. I recently came across XALT, a tool that records which executables and libraries jobs actually run on a cluster, and its data does shed some light on a job's primary application.
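
One way such data could feed a scheduler is by mapping a job's loaded libraries to an ML-framework label. Below is a hedged sketch of that idea; the record shape, field names (`executable`, `linked_libraries`), and the `FRAMEWORK_HINTS` table are simplified assumptions of mine, not XALT's actual schema.

```python
# Hypothetical sketch: tagging a job's primary ML framework from an
# XALT-style record of linked libraries. Record shape is assumed, not
# XALT's real schema.
import json

# Assumed substring hints mapping shared-library names to frameworks
FRAMEWORK_HINTS = {
    "libtorch": "pytorch",
    "libtensorflow": "tensorflow",
}

def primary_framework(record_json: str) -> str:
    """Guess the ML framework from the libraries a job linked against."""
    record = json.loads(record_json)
    for lib in record.get("linked_libraries", []):
        for hint, framework in FRAMEWORK_HINTS.items():
            if hint in lib:
                return framework
    return "unknown"

# Example record (synthetic)
rec = json.dumps({
    "executable": "/home/user/train.py",
    "linked_libraries": ["/opt/conda/lib/libtorch_cpu.so", "libc.so.6"],
})
print(primary_framework(rec))  # -> pytorch
```

A label like this could then become one categorical feature among others (batch size, dataset size, GPU count) in a runtime model, though library usage alone is a fairly coarse proxy.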

But XALT alone doesn’t quite cover all the bases: knowing which libraries a job loads says little about, say, the model architecture or training configuration. How can we close this gap more comprehensively, especially for ML applications, to capture a fuller picture of a job’s characteristics and needs?