HPC job managers and migrating to the cloud

gce
azure
aws
cloud
zeta:for-sc18
scheduler
question

#1

What job managers are available, how do I easily move between them, and how do they map to tools in the cloud?

I am currently using PBS to run my analysis and want to migrate to the cloud to collaborate on a multi-institution project. Is there a recommended type of cloud service? And what tools will I need to learn to help with this migration?

CURATOR: Raminder Singh


#2

When considering whether and how to migrate your workload from an HPC cluster environment to the cloud, you may first wish to check whether your campus research computing group provides consulting to help with that.

There are also national-scale resources beyond the campus that may help, such as the XSEDE Extended Collaborative Support Service (ECSS) and the Science Gateways Community Institute (SGCI). Both services can help you determine what’s needed for a multi-institution project. In particular, they can help you decide whether your workflow is suitable as a Science Gateway, which provides a web front end to HPC cluster resources (so that you don’t have to change your workflow, except perhaps to switch to a different job scheduler such as SLURM, which is available on the newer XSEDE clusters), or whether it is better suited to a cloud environment, which may not require a batch scheduler at all but may require you to change your workflows and related software.
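If the only change turns out to be a switch from PBS to SLURM, much of the work is rewriting the `#PBS` directives at the top of each job script. The following is a minimal sketch of that translation, not a complete converter: it assumes Torque-style `-l nodes=N:ppn=M,walltime=...` resource requests, handles only a handful of common options, and marks anything it does not recognise so you can fix it by hand. The function names are made up for illustration.

```python
import re

# One-to-one option mappings from PBS/Torque directives to Slurm options.
SIMPLE_FLAGS = {
    "-N": "--job-name",
    "-q": "--partition",   # a PBS queue roughly corresponds to a Slurm partition
    "-A": "--account",
    "-o": "--output",
    "-e": "--error",
}


def translate_resources(spec):
    """Translate a '-l' resource list; return None if any item is unknown."""
    options = []
    for item in spec.split(","):
        nodes = re.fullmatch(r"nodes=(\d+)(?::ppn=(\d+))?", item)
        if nodes:
            options.append(f"--nodes={nodes.group(1)}")
            if nodes.group(2):
                options.append(f"--ntasks-per-node={nodes.group(2)}")
        elif item.startswith("walltime="):
            options.append("--time=" + item[len("walltime="):])
        else:
            return None
    return options


def translate_line(line):
    """Return the Slurm line(s) corresponding to one line of a PBS script."""
    stripped = line.strip()
    if not stripped.startswith("#PBS"):
        # Ordinary shell lines pass through; only the submit-directory
        # variable is renamed here, other variables need manual review.
        return [line.replace("$PBS_O_WORKDIR", "$SLURM_SUBMIT_DIR")]
    parts = stripped.split(None, 2)  # e.g. ['#PBS', '-N', 'analysis']
    if len(parts) == 3:
        flag, value = parts[1], parts[2]
        if flag in SIMPLE_FLAGS:
            return [f"#SBATCH {SIMPLE_FLAGS[flag]}={value}"]
        if flag == "-l":
            options = translate_resources(value)
            if options is not None:
                return [f"#SBATCH {option}" for option in options]
    # Anything else is left as a comment for a human to translate.
    return [f"# UNTRANSLATED: {stripped}"]


def translate_script(text):
    """Translate a whole PBS job script into a Slurm job script."""
    translated = []
    for line in text.splitlines():
        translated.extend(translate_line(line))
    return "\n".join(translated)
```

Calling `translate_script` on the text of an existing job script returns text you can review and then submit with `sbatch` instead of `qsub`. Site-specific queue names, accounts, and memory or GPU requests will still need checking against the target cluster's documentation.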


#4

A couple of quick comments and then an answer

Comments

  1. It is hard to tell what question is being asked from this post. It might be worth rewording to make it easier to get a focussed answer. The most useful answer may depend on what sort of workflow and what application/problem you are trying to solve. If it's a specific question about spinning up a virtual cluster with a specific scheduler and software stack, then the question should clarify that.

  2. @raminder I think the post title might be better reading

    What HPC job managers are available on cloud platforms?

    as it is currently written it is not a question.

  3. The explanatory sentence does not make sense to me. Are you asking what job managers are available anywhere and how you move between SLURM, PBS, SGE, LSF etc., or are you asking about using job managers in the cloud and moving between cloud providers? The text isn’t very clear.

  4. Overall it might be useful to clarify whether this is a narrow, specific question such as “Is there a recipe I can use to create a PBS cluster on AWS?” or whether the question is broader, like “What cloud platforms can I use for a PBS workflow that I currently execute on a bare metal cluster?”.

Now an answer

Assuming you are asking about using cloud resources with traditional HPC-style workload managers, the answer is that nearly all cloud platforms, in their most basic form, provide networked virtual machines. As such, you can configure pretty much any scheduler software that you can use on networked physical machines. As noted above, each of AWS, GCE and Azure provides some tools to streamline the process of launching multiple virtual machines in such a way that they can form a cluster.

There are also a number of independent tools that can perform similar tasks.

With some work, you can also script the steps of creating a cluster yourself.
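As one illustration of what such a script might contain: once the virtual machines are running, the scheduler has to be told which hosts belong to the cluster. The sketch below generates the node and partition lines of a Slurm configuration from a list of hostnames. It is a fragment under stated assumptions, not a complete `slurm.conf`: the function name, hostnames, and partition name are hypothetical, all compute nodes are assumed to have the same CPU count, and a working configuration needs further settings (authentication, state directories, and so on).

```python
def slurm_conf_fragment(cluster_name, head_node, compute_nodes,
                        cpus_per_node, partition="compute"):
    """Build the cluster, node, and partition lines of a slurm.conf.

    All arguments are illustrative inputs; in a real script they would
    come from whatever tool launched the virtual machines.
    """
    if not compute_nodes:
        raise ValueError("at least one compute node is required")
    if cpus_per_node < 1:
        raise ValueError("cpus_per_node must be positive")

    node_list = ",".join(compute_nodes)
    lines = [
        f"ClusterName={cluster_name}",
        # Host that runs the central controller daemon (slurmctld).
        f"SlurmctldHost={head_node}",
        # One line describing every (identical) compute node.
        f"NodeName={node_list} CPUs={cpus_per_node} State=UNKNOWN",
        # A single default partition containing all compute nodes.
        f"PartitionName={partition} Nodes={node_list} "
        "Default=YES MaxTime=INFINITE State=UP",
    ]
    return "\n".join(lines) + "\n"
```

The returned text would be written to the Slurm configuration file on every node before the daemons are started. The cluster-launching tools mentioned in this thread automate this step along with shared storage and user setup, which is the main reason to prefer them over a hand-rolled script.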

If you are looking more for a service, several efforts are trying to address this. For example

  • Rescale provides a commercial service that uses cloud resources to streamline running workflows for many typical research computing applications (e.g. Gaussian, NAMD etc.) that are normally executed through job managers like PBS.

#3

I’d recommend looking into intra- or extra-institutional support first (see @aculich’s great answer for an example).

That said, there is nothing specific to the cloud that will keep you from rolling your own setup using whatever you want.

Additionally, pretty much all the cloud vendors support, provide, or encourage tools for easy setup and configuration with common schedulers, as well as alternatives based on serverless and/or container models and pre-built ‘marketplace’ options.

The tools according to the big three:
CfnCluster https://cfncluster.readthedocs.io/en/latest/
AWS supported

ElastiCluster http://gc3-uzh-ch.github.io/elasticluster/
GCE recommended (can also be used with AWS & private OpenStack clouds)


Azure

all of which officially support SGE, Slurm, and Torque (PBS Pro is an active request in both ElastiCluster and CfnCluster)

note: I’ve personally only dealt with AWS & GCE, mostly with SGE so YMMV.
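Since the everyday user commands differ between these schedulers, one way to keep workflow scripts portable is a thin lookup layer. This is only a sketch: the dictionary covers just the three basic actions, the scheduler keys and function name are my own labels, and options passed after the command are not translated between schedulers.

```python
# Basic user commands for the schedulers mentioned above.
# Torque and SGE share command names but not all of their options.
COMMANDS = {
    "slurm": {"submit": "sbatch", "status": "squeue", "cancel": "scancel"},
    "torque": {"submit": "qsub", "status": "qstat", "cancel": "qdel"},
    "sge": {"submit": "qsub", "status": "qstat", "cancel": "qdel"},
}


def build_command(scheduler, action, *args):
    """Return an argument list such as ['sbatch', 'job.sh'].

    Raises ValueError for an unknown scheduler or action.
    """
    try:
        executable = COMMANDS[scheduler.lower()][action]
    except KeyError:
        raise ValueError(
            f"unsupported scheduler/action: {scheduler!r}/{action!r}"
        ) from None
    return [executable, *args]
```

The result can be passed to `subprocess.run(build_command("slurm", "submit", "job.sh"), check=True)` on a machine where the scheduler is installed, so that moving a workflow to a cluster with a different scheduler means changing one string rather than every call site.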


#5

Another couple of options to mention are Alces Flight (AWS & Azure) and CloudyCluster (AWS and soon GCP, autoscaling with the CCQ meta-scheduler).