The lifecycle of a cluster

I saw a recent post on Twitter about the end of life of the Titan cluster, and it made me a little sad.

But it also prompts a lot of questions that would be interesting to discuss. For example, servers and disks have a finite life, and they are often replaced well before the infrastructure they serve is taken down. Hardware also changes so quickly that at some point, given sufficient funding, it can be easier to build a new cluster than to try to restore an older one.

So with this in mind, I want to ask: What is the typical lifecycle of a research cluster? What factors determine whether a cluster has a shorter or a longer life? What are the challenges in maintaining an older cluster? A newer one? It occurs to me that documentation bases might be especially hard to maintain, simply because they need to be completely redone for a new system every 5 to 10 years. It also seems likely that there is some balance between providing the newest and trendiest things users might want and maintaining something stable and reliable. Given whatever common lifecycle you have in mind, is there any potential future in which change is less frequent and clusters are more stable? Will clusters ever be able to have longer lives?

Looking forward to hearing what people think!

It really does depend on the type of research and how well the OS and the research code are supported. Assuming the machine can still receive software and security updates, its lifespan can be extended as long as the data requirements don't change. One could theoretically run mathematical simulations that output kilobyte files on nearly any machine, but accommodating 5 petabytes of radio telescope data would take some upgrades for most machines.

For most scientific purposes, the amount of data to process is ever increasing, which takes more disk space, memory, and network capacity. There is also a bias in development toward writing code that is easier for the developer to understand than for the machine to execute efficiently. This means people write horribly inefficient code in MATLAB rather than better code in a more efficient language like C. The solution at most institutions just seems to be to buy faster computers instead of fixing their code so it runs more efficiently.

There are things that can be done to extend clusters' lives, but it takes knowing how to use them within the framework of the research you're doing.