HPC job managers and migrating to the cloud



What job managers are available, how do I easily move between them, and how do they map to tools in the cloud?

I am currently using PBS to run my analysis and want to migrate to the cloud to collaborate on a multi-institution project. Is there any recommendation on the type of cloud service? And what tools will I need to learn to help with this migration?

When considering whether and how to migrate your workload from an HPC cluster environment to the cloud, you may first wish to check whether your campus research computing group provides consulting to help with that.

There are also resources beyond the campus at the national scale that may help, such as the XSEDE Extended Collaborative Support Service (ECSS) and the Science Gateways Community Institute (SGCI). Both services can help you determine what is needed for a multi-institution project. In particular, they can help you judge whether your workflow is suitable as a Science Gateway that provides a web front end to HPC cluster resources (so that you don’t have to change your workflow, except perhaps to switch to a different job scheduler such as SLURM, which is available on the newer XSEDE clusters), or whether it is better suited to a cloud environment that may not require a batch scheduler at all but may require you to change your workflows and related software.
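If the only change is a switch from PBS to SLURM, much of the work is rewriting the directive lines at the top of each job script. Here is a minimal sketch of that translation in Python. The function name `pbs_to_slurm` and the small mapping table are hypothetical, and only a handful of common options are covered. Note that mapping `ppn` to `--ntasks-per-node` is a common convention rather than an exact equivalence, so check the result against your site's documentation.

```python
import re

# Hypothetical, deliberately small mapping from common PBS/Torque
# options to their usual sbatch counterparts.
SIMPLE_FLAGS = {
    "-N": "--job-name",
    "-q": "--partition",
    "-o": "--output",
    "-e": "--error",
}


def pbs_to_slurm(line: str) -> list[str]:
    """Translate one '#PBS' directive line into one or more '#SBATCH' lines.

    Lines that are not '#PBS <flag> <value>' directives are returned unchanged.
    """
    parts = line.split(None, 2)
    if len(parts) < 3 or parts[0] != "#PBS":
        return [line]
    flag, value = parts[1], parts[2].strip()
    if flag in SIMPLE_FLAGS:
        return [f"#SBATCH {SIMPLE_FLAGS[flag]}={value}"]
    if flag == "-l":
        out = []
        for resource in value.split(","):
            match = re.fullmatch(r"nodes=(\d+)(?::ppn=(\d+))?", resource)
            if match:
                out.append(f"#SBATCH --nodes={match.group(1)}")
                if match.group(2):
                    out.append(f"#SBATCH --ntasks-per-node={match.group(2)}")
            elif resource.startswith("walltime="):
                out.append(f"#SBATCH --time={resource.split('=', 1)[1]}")
            else:
                # Anything else is flagged rather than guessed at.
                out.append(f"# TODO: translate by hand: {resource}")
        return out
    return [f"# TODO: translate by hand: {line}"]


def convert_script(text: str) -> str:
    """Apply pbs_to_slurm to every line of a job script."""
    converted = []
    for line in text.splitlines():
        converted.extend(pbs_to_slurm(line))
    return "\n".join(converted) + "\n"
```

For example, `pbs_to_slurm("#PBS -l nodes=2:ppn=8,walltime=01:00:00")` yields three `#SBATCH` lines for `--nodes`, `--ntasks-per-node` and `--time`. Environment variables used in the script body (such as `$PBS_O_WORKDIR`) are not touched by this sketch and still need attention.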


Assuming you are asking about using cloud resources with traditional HPC-style workload managers, the answer is that nearly all cloud platforms, in their most basic form, provide networked virtual machines. As such, you can configure pretty much any scheduler software that you can use on networked physical machines. As noted above, each of AWS, GCE and Azure provides some tools to streamline the process of launching multiple virtual machines so that they can form a cluster.

There are also a number of independent tools that can perform similar tasks. Some of these include

You can also script the steps of creating a cluster yourself, with some work.
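As a rough illustration of what scripting it yourself can involve, the sketch below covers just one step: turning a list of already-launched virtual machines into a minimal Slurm configuration fragment. The `VM` record, the `render_slurm_conf` function and the example hostnames and addresses are all hypothetical, and launching the machines is left to whatever your cloud provider offers. A working `slurm.conf` needs more settings than shown here (authentication, state and log locations, and so on).

```python
from dataclasses import dataclass


@dataclass
class VM:
    """A virtual machine as this sketch sees it (hypothetical record)."""
    name: str      # hostname the scheduler will use
    address: str   # private IP address inside the cloud network
    cpus: int      # number of virtual CPUs


def render_slurm_conf(cluster_name: str, head: VM, workers: list[VM],
                      partition: str = "main") -> str:
    """Render a minimal slurm.conf fragment for one head node and its workers."""
    lines = [
        f"ClusterName={cluster_name}",
        f"SlurmctldHost={head.name}",
    ]
    # One NodeName line per worker VM.
    for vm in workers:
        lines.append(
            f"NodeName={vm.name} NodeAddr={vm.address} CPUs={vm.cpus} State=UNKNOWN"
        )
    # A single default partition containing every worker.
    node_list = ",".join(vm.name for vm in workers)
    lines.append(
        f"PartitionName={partition} Nodes={node_list} "
        "Default=YES MaxTime=INFINITE State=UP"
    )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    head = VM("head", "10.0.0.1", 2)
    workers = [VM("node1", "10.0.0.11", 4), VM("node2", "10.0.0.12", 4)]
    print(render_slurm_conf("demo", head, workers), end="")
```

The same idea applies to other schedulers: once the machines exist and can reach each other, the rest is generating each scheduler's host and queue definitions and copying them to every node.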

If you are looking more for a managed service, several efforts are trying to address this. For example:

  • Rescale provides a commercial service that uses cloud resources to streamline running workflows for many typical research computing applications (e.g. Gaussian, NAMD, etc.) that are normally executed through job managers like PBS.


I’d recommend looking into intra- and extra-institutional support first (see @aculich’s great answer for an example).

That said, there is nothing specific to the cloud that will keep you from rolling your own setup using whatever you want.

Additionally, pretty much all the cloud vendors also support, provide or encourage tools for easy setup and configuration with common schedulers, as well as alternatives based on serverless and/or container models and pre-built ‘marketplace’ options.

The tools according to the big three:
CfnCluster https://cfncluster.readthedocs.io/en/latest/
AWS supported

ElastiCluster http://gc3-uzh-ch.github.io/elasticluster/
GCE recommended (can also be used with AWS and private OpenStack clouds)


All of these officially support SGE, Slurm, and Torque (PBS Pro support is an active request in both ElastiCluster and CfnCluster).

Note: I’ve personally only dealt with AWS and GCE, mostly with SGE, so YMMV.


Another couple of options to mention are Alces Flight (AWS and Azure) and CloudyCluster (AWS and soon GCP, with autoscaling via the CCQ meta-scheduler).