What are the relative benefits of a stateful vs. stateless cluster configuration?

cluster-management
question

#1

I am preparing to rebuild a compute cluster, and am deciding whether to go with a stateless or a stateful configuration for it.

What are the relative benefits and drawbacks of each approach?


#3

Maybe these examples will help you.

Here is a stateless installation on a compute node:

[root@compute-1-14 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           3.9G  692M  3.3G  18% /
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           3.9G  8.9M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           797M     0  797M   0% /run/user/0
10.2.1.1:/opt/ohpc/pub  1.9T   63G  1.8T   4% /opt/ohpc/pub

Here is a stateful installation on a compute node:

[root@compute-1-14 ~]# df -h
Filesystem              Size  Used Avail Use% Mounted on
/dev/sda3               147G  1.8G  138G   2% /
devtmpfs                3.9G     0  3.9G   0% /dev
tmpfs                   3.9G     0  3.9G   0% /dev/shm
tmpfs                   3.9G   17M  3.9G   1% /run
tmpfs                   3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1               463M   45M  395M  11% /boot
10.2.1.1:/opt/ohpc/pub  1.9T   63G  1.8T   4% /opt/ohpc/pub
tmpfs                   798M     0  798M   0% /run/user/0

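As the listings above show, in the stateless case the root filesystem (/) is a tmpfs held in RAM, so the operating system doesn’t use the local hard drive, while in the stateful case the operating system is installed on the local hard drive (/dev/sda3 for /, /dev/sda1 for /boot).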
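If you want to check this from a script rather than by eye, here is a rough sketch that classifies a node from the text of its `df` output. The function name `classify_root` and the rule it applies (a RAM-backed root means stateless, a `/dev/...` root means stateful) are my own assumptions for illustration. It is only a heuristic: an NFS-mounted root, for example, would come back as "unknown".

```python
import subprocess


def classify_root(df_output: str) -> str:
    """Guess 'stateless', 'stateful', or 'unknown' from `df` output text.

    Heuristic only (an assumption, not an official rule): a RAM-backed
    root filesystem suggests stateless, a block device suggests stateful.
    """
    for line in df_output.splitlines():
        fields = line.split()
        # The first column is the filesystem source, the last is the mount point.
        # The header line ends in "on", so it never matches "/".
        if len(fields) >= 6 and fields[-1] == "/":
            source = fields[0]
            if source in ("tmpfs", "rootfs", "ramfs"):
                return "stateless"
            if source.startswith("/dev/"):
                return "stateful"
            return "unknown"
    return "unknown"


if __name__ == "__main__":
    # Run `df -h` on the current node and report the guess.
    result = subprocess.run(["df", "-h"], capture_output=True, text=True, check=True)
    print(classify_root(result.stdout))
```

Run on the two nodes shown above, this should report "stateless" for the first listing and "stateful" for the second.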
From a computing-efficiency standpoint:

If the compute node has a local disk to use:
If an application writes temporary files, or if certain software packages need to be installed locally on the node, the stateful case will be more efficient, because it doesn’t require reading from or writing to a network-mounted device.

On the other hand, if the compute node doesn’t have a local disk, stateless would be the best option.

Arash


#2

I had to look up the terms stateless and stateful in the context of cluster computing. If I understand correctly, you mean diskless (on compute nodes). Is that right?

If so, one advantage of having a disk drive in each compute node is the ability to use it as local scratch space on that node, for temporary files that won’t need to move to any global filesystem by the end of the run.

We see only a modest number of users needing that – most don’t produce much in the way of temporary files during a run – but our ATLAS high energy physics group uses the local drive all the time: they produce GB per run, but only retain MB at the end of the run.
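To make that usage pattern concrete, here is a minimal sketch of a job that keeps its large intermediates on node-local scratch and copies only a small result to shared storage. The function name `run_with_local_scratch`, the file names, and the 1 MiB dummy payload are all made up for illustration. The scratch and results paths are whatever your site provides. This is not how the ATLAS group actually does it.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(scratch_root: str, results_dir: str) -> Path:
    """Write bulky temporaries to local scratch; keep only a small summary.

    `scratch_root` is assumed to be a directory on the node's local disk,
    `results_dir` a directory on the shared (global) filesystem.
    """
    results = Path(results_dir)
    results.mkdir(parents=True, exist_ok=True)

    # The temporary directory is deleted when the `with` block exits,
    # so the intermediates never touch the shared filesystem.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        scratch_path = Path(scratch)

        # Stand-in for the large temporary data a real job would produce.
        intermediate = scratch_path / "intermediate.dat"
        intermediate.write_bytes(b"\0" * 1024 * 1024)

        # Small final result derived from the intermediates.
        summary = scratch_path / "summary.txt"
        summary.write_text(f"bytes_processed={intermediate.stat().st_size}\n")

        # Copy only the small result to shared storage.
        final = Path(shutil.copy2(summary, results / summary.name))

    return final
```

On a stateless node without a local disk, the scratch directory would have to be in RAM or on a network mount, which is where the stateful setup has the edge for this kind of workload.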


#4

We just started using a stateless configuration on our system, and as the sysadmin, I have found it very helpful to be able to reboot misbehaving nodes and have the head node hand out a completely fresh install of the compute node image. Our equipment is hand-me-down and has been temperamental, so this saves me a lot of time and headaches.