Dear Ask Cyberinfrastructure community,
As a small non-profit academic organization running NVIDIA DGX servers, we are seeking recommendations from HPC/AI specialists and enthusiasts worldwide. We would like to know how you maintain your HPC/AI clusters for machine-learning training, and whether there are any online HPC clusters available where such a setup can be tested.
Additionally, we would appreciate any best practices you could recommend for small academic organizations, as well as any online training programs you could suggest.
If anyone has experience with Bright Cluster Manager, we would like to hear your thoughts on whether it is worth adopting.
Regarding the choice between Kubernetes and Slurm for small organizations, we would like to know which option would be better. Can Kubernetes provide the same functionality as Slurm in terms of job scheduling?
Thank you for your time and assistance.
Best regards,
Shakhizat
It looks as though you have a lot of questions related to how to effectively use the resources you have available. If I were solving the same problem, I would use nationally available testbeds to evaluate the hardware and applications I am interested in. I would identify some users on my campus, then get a Campus Champions allocation from ACCESS. I would give those identified users access and watch how they use the resources. I would then make it a staple of my service offering.
These are some testbeds that I know of. I have never used them myself, but I have been in close communication with them:
Some free computing resources that you can have access to are:
- ACCESS: the Campus Champions allocation will work well for you. It worked for me when I was at R2 and R3 institutions.
- ALCF: find a contact and talk to them. They also offer training for facilitators.
- OLCF: you can find resources and training there. This is where I was first exposed to HPC.
- NERSC: the resources here are mostly for scientific applications, and they also have training.
As far as Slurm and Kubernetes go, you can use them in different ways:
- You can use Slurm to manage jobs, especially on your bare-metal nodes.
- If you virtualize your GPU servers and provision vGPUs, Kubernetes can be very useful for managing the resources and running applications. I think this option would work well for you. I can say more about how to use it if you are interested.
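To make the comparison concrete, here is a rough sketch of the same hypothetical GPU training job expressed both ways. The script name (`train.py`), container image tag, and resource sizes are illustrative assumptions, not recommendations. First, a minimal Slurm batch script:

```shell
#!/bin/bash
#SBATCH --job-name=train-model   # name shown in squeue
#SBATCH --gres=gpu:1             # request one GPU on the node
#SBATCH --cpus-per-task=8        # CPU cores for data loading
#SBATCH --mem=32G                # host memory for the job
#SBATCH --time=04:00:00          # wall-clock limit; job is killed after this

# train.py is a placeholder for your own training script
srun python train.py
```

And a roughly equivalent Kubernetes Job manifest, assuming the NVIDIA device plugin is installed so that `nvidia.com/gpu` is a schedulable resource:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 0          # do not retry a failed training run
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3   # illustrative image tag
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU, exposed by the device plugin
```

One caveat on the "same functionality" question: vanilla Kubernetes has no batch queue, so if the GPU is busy the Job's pod simply sits in Pending, whereas Slurm gives you priorities, fair-share, and backfill out of the box. Add-on projects such as Volcano or Kueue can layer batch-style scheduling on top of Kubernetes if you go that route.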