I am researching opportunities in orchestrating ML/AI workloads on HPC systems. Currently, my evaluation relies on a profiled dataset that covers only a limited number of ML models and resource configurations. This makes the evaluation non-representative of real-world HPC traces for ML workloads and also raises questions about scalability. I am seeking suggestions regarding the availability of public or private datasets, possibly with on-demand access, that I can incorporate into my experiments to assess the system more comprehensively. Your insights and recommendations would be greatly appreciated.
Orchestrating ML/AI workloads on HPC systems requires a diverse and scalable dataset to obtain representative results. A profiled dataset covering only a few ML models and resource configurations is unlikely to capture the scale and workload diversity of real-world HPC scenarios.
To improve the comprehensiveness of experiments, consider both public and private sources. Note that repositories such as the UCI Machine Learning Repository, Kaggle, and the AWS Open Data Program host datasets for training models rather than workload traces, so they are useful mainly as inputs when constructing your own HPC workloads. For workload characteristics themselves, the MLPerf benchmarking suite (maintained by MLCommons) provides standardized ML/AI workloads with published performance results, and publicly released cluster traces, such as Microsoft's Philly GPU traces, Alibaba's cluster traces, and Google's Borg cluster traces, are closer to the real-world scheduler traces you describe.
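Whichever trace you obtain, a small ingestion layer helps normalize it for your orchestrator. The sketch below summarizes a scheduler trace in CSV form; the column names (`submit_time`, `num_gpus`, `duration`) and the inline sample are illustrative assumptions, since each public trace uses its own schema and would need remapping:

```python
# Hedged sketch: summarizing a scheduler trace in CSV form.
# Column names are assumptions; real public traces (e.g. Philly,
# Alibaba clusterdata) use their own schemas and need remapping.
import csv
import io
import statistics

# Tiny illustrative sample, stands in for a downloaded trace file.
SAMPLE = """job_id,submit_time,num_gpus,duration
j1,0,1,3600
j2,120,8,7200
j3,300,4,1800
"""

def summarize(trace_file):
    """Compute basic workload statistics from a trace file object."""
    rows = list(csv.DictReader(trace_file))
    submits = sorted(float(r["submit_time"]) for r in rows)
    gaps = [b - a for a, b in zip(submits, submits[1:])]
    return {
        "jobs": len(rows),
        "mean_interarrival_s": statistics.mean(gaps) if gaps else 0.0,
        # GPU-hours = sum over jobs of (GPUs requested * runtime in hours)
        "gpu_hours": sum(
            int(r["num_gpus"]) * float(r["duration"]) for r in rows
        ) / 3600.0,
    }

if __name__ == "__main__":
    print(summarize(io.StringIO(SAMPLE)))
```

Statistics like mean inter-arrival time and total GPU-hours give a quick sanity check that a candidate trace actually matches the scale you want to evaluate.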
Private datasets from industry partners or research institutions may be more indicative of specific real-world scenarios, but access is often restricted by confidentiality agreements. Alternatively, synthetic trace generation can produce large, customizable workloads that simulate particular HPC-ML scenarios, though synthetic data may not capture the burstiness and heavy-tailed behavior of real workloads.
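As a minimal sketch of the synthetic-generation option: the generator below draws jobs from a Poisson arrival process with lognormal runtimes. All field names and distribution parameters are assumptions chosen for illustration, not values fitted to any real HPC trace:

```python
# Minimal sketch of a synthetic ML-workload trace generator.
# Field names and distribution parameters are illustrative assumptions,
# not fitted to any real HPC trace.
import random

def generate_trace(n_jobs=100, mean_interarrival=60.0, seed=42):
    """Generate synthetic job records with Poisson arrivals."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    trace, t = [], 0.0
    for job_id in range(n_jobs):
        # Exponential inter-arrival gaps yield a Poisson arrival process.
        t += rng.expovariate(1.0 / mean_interarrival)
        trace.append({
            "job_id": job_id,
            "submit_time_s": round(t, 1),
            "gpus_requested": rng.choice([1, 2, 4, 8]),  # assumed GPU shapes
            # Lognormal runtimes approximate the heavy tail seen in
            # many production scheduler traces.
            "runtime_s": int(rng.lognormvariate(7.0, 1.0)),
        })
    return trace

if __name__ == "__main__":
    for job in generate_trace(5):
        print(job)
```

If you later gain access to a real trace, the same parameters (arrival rate, runtime distribution, GPU shapes) can be re-fitted from it, which keeps the synthetic path and the trace-driven path interchangeable in your evaluation harness.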