Ask.Cyberinfrastructure

How do I create my own cluster?

qow
deployment
cluster
#1

I’ve been seeing posts like this around, and I’m genuinely interested in what it takes to create your own little cluster. A lot of these discussions start with the assumption that I have a crapton of old equipment lying around, or that I have a general plan and need feedback on the details (e.g., what software to use? how to do networking?), but I’m actually interested in a higher-level description of how someone like me could deploy a cluster. What are the general steps? How do I make decisions? For example, how could I do this with Raspberry Pis in my closet? How could I do it on a couple of instances in the cloud over a weekend?

I think this is also important to talk about so we can shed light on some of the real differences between what the cloud is calling HPC, and what is actually HPC.

#2

At this point in time I do everything with virtualization on my Mac with something like VMware Fusion. This gives the greatest flexibility: there are no unexpected costs in the cloud when one forgets to turn instances off, and there’s no pile of hardware to carry around (or stow away in the closet).

Just to get a general feel for things, one can start by manually creating a basic “system” with a login node and 1–2 compute nodes. The login node can also double as shared storage for home directories and software installations, and host the scheduler. Now you have a platform to experiment with Linux, NFS (or BeeGFS, etc.), Slurm, and the rest of the stack.
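
As a sketch of what that “login node doing everything” setup might look like, assuming hostnames login01/node01/node02 and a 10.0.0.0/24 virtual network (all names and addresses hypothetical):

```
# /etc/exports on login01 -- share /home with the compute nodes over NFS
/home  10.0.0.0/24(rw,no_root_squash,sync)

# Minimal slurm.conf fragment (same file distributed to every node)
ClusterName=closet
SlurmctldHost=login01
NodeName=node[01-02] CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
```

Check the Slurm and NFS documentation for the full set of required settings; this is just the shape of the two files.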

By using service names for the various layers, such as login, xfer, and storage, the basic architecture is abstracted away from the actual server or virtual machine the service is running on.
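
One cheap way to get that abstraction is with /etc/hosts aliases (names and addresses hypothetical): clients talk to the service name, and the name can be remapped later without touching any client configuration:

```
# /etc/hosts -- service names as aliases on whichever box currently
# provides the service; move a service by moving the alias
10.0.0.10   vm01  login  xfer
10.0.0.11   vm02  storage
10.0.0.21   node01
10.0.0.22   node02
```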

From there, move to a more sophisticated setup that starts with a PXE server to install the smallest possible images. Puppet or another configuration-management solution can then be used to apply a specific configuration to each node to make it a login, compute, or file-transfer node.
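
For the PXE step, dnsmasq can act as the DHCP and TFTP server in a single process. A minimal sketch, assuming an isolated 10.0.0.0/24 virtual network and a syslinux `pxelinux.0` image under /srv/tftp (interface name and paths are hypothetical):

```
# /etc/dnsmasq.conf -- DHCP + TFTP for PXE-booting compute nodes
interface=vmnet2              # the isolated virtual network
dhcp-range=10.0.0.100,10.0.0.200,12h
dhcp-boot=pxelinux.0          # BIOS clients fetch this boot loader first
enable-tftp
tftp-root=/srv/tftp
```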

The catch?

  1. Having a laptop with enough storage and memory to run a handful of VMs.
  2. Or running on external storage, ideally an SSD on a fast interface.
  3. Setting up PXE under VMware and managing the virtual network can be a pain in the neck.
  4. Taking responsibility for learning the entire stack top to bottom (this is the interesting part, however).
  5. Sometimes having “real” hardware can just be more fun.
#3

The terms “cluster” and “HPC” are about as useful as “cloud” at this point; they can describe pretty much anything someone wants them to. If we strip away all the fluff:

  • cluster: two or more hosts that cooperate on a computational problem.
  • HPC cluster: two or more hosts that cooperate very efficiently on a fine-grained computational problem.

Two hosts with an OS installed, copies of the same applications in the same locations, and working ssh between them are a cluster, so a basic setup for learning is simple to build from any two of anything: VMs, Pis, open-box specials from Best Buy, cloud instances, those old laptops you can’t bear to throw away… all can be used to learn “cluster” computing. It’s also entirely possible to learn everything you need to know about parallel computing on a single system, now that everything under the sun has multiple cores and you can get a CUDA-aware GPU in a laptop.
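
To make the single-system point concrete, here is a tiny task-farm sketch using `xargs -P` (a common GNU/BSD extension): it fans independent jobs out across local cores, which is the same shape of parallelism a scheduler gives you across nodes (file name hypothetical):

```shell
# Run 4 independent "jobs" in parallel across local cores with xargs -P,
# collecting one line of output per job -- a one-machine task farm.
seq 1 4 | xargs -P 4 -I{} sh -c 'echo "job {} done on $(hostname)"' > results.txt

# Jobs finish in whatever order the cores get to them; sort for display.
sort results.txt
```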

From that basic starting point it’s all turd polishing by adding as much extra stuff as desired:

  • Scheduler (Slurm, Moab/Maui/Torque, LSF, PBS, Condor…)
  • Shared $HOME (I like NFS, but the next item can work for this)
  • Shared parallel filesystem (BeeGFS, Lustre, GPFS,…)
  • Common software stack (Lmod/Environment Modules, EasyBuild, Spack, etc.…)
  • Provisioning tool (Warewulf, xCAT, a gazillion others)
  • Configuration management (SaltStack, Ansible, CFEngine, Puppet, Chef,…)
  • Interconnect (Ethernet, InfiniBand, proprietary foo)
  • Grouchy HPC Sysadmin to tell users “NO!” :slight_smile:
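
Once the scheduler box is ticked (say, with Slurm), work arrives as batch scripts. A minimal sketch, with job name and sizes as placeholders:

```shell
#!/bin/sh
#SBATCH --job-name=hello          # Slurm reads these "#SBATCH" comment
#SBATCH --nodes=2                 # directives; plain sh ignores them
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
# The script body runs on the first allocated node; srun inside the
# body would launch one task per allocated slot.
msg="hello from $(hostname)"
echo "$msg"
```

Under Slurm this would be queued with `sbatch hello.sh`; run outside Slurm it is just a shell script that prints one line.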

There’s nothing magical about clusters, although the marketing would have us believe otherwise. IMHO the most important thing is to keep a good grasp on the high level view of “what do I want to accomplish with this?” because the problem being solved should drive the cluster, not the other way around. If the goal is to become a cluster sysadmin, then hit every bullet point hard and try multiple tools for each. If the goal is to learn parallel programming, skip it all and just run MPI or whatever interests you on your daily driver system.

I think the most important thing we can take away from that list is how critical it is that we keep my boss convinced that the last item there is the one that matters most.

5 Likes
#4

@griznog covered the main ingredients of a cluster. I want to point out that on the software side, one could look into the OpenHPC effort:

https://openhpc.community/

It does not provide the base OS, apparently, but it does provide a lot of the tools and software packages commonly used on HPC systems. This can significantly cut down the time to get an HPC system up and running.