How busy is your Hadoop cluster?

ktrn · May 24, 2019, 3:44pm

I’m working with a colleague at another institution to set up a Hadoop installation on their cluster. Since my experience is primarily with our own Hadoop cluster, we’d like to gather some information about how other campuses are using Hadoop. Could you please take a few minutes to answer this three-question survey? You can answer either here on Ask.CI or in this Google form.

Do your researchers have access to an installation of Hadoop on your local cluster?
What is the level of usage?
Which research groups dominate Hadoop usage at your institution?

hoang · May 29, 2019, 7:03pm

We previously had a Hadoop cluster with dedicated Hadoop nodes, but usage was very low, so those resources were repurposed for general use. But we retain the ability to stand up a dedicated Hadoop cluster in case there is demand from researchers.

Historically, our researchers who have expressed interest in Hadoop have generally been from bioinformatics, biostatistics, economics, or finance. In many cases it was easy to create a simple proof-of-concept example, but very often Hadoop turned out not to be the best tool for the job. Having to fit things into a map-shuffle-reduce paradigm, along with the need to move data in and out of HDFS, was often awkward enough that researchers decided not to use Hadoop.
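To give a sense of what fitting a problem into map-shuffle-reduce involves, here is a minimal sketch in plain Python. It is not Hadoop code and is not taken from any of our users' projects; the function names and the word-count task are made up purely for illustration. Every analysis has to be expressed as these three stages, which is easy for counting but gets awkward for anything iterative.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Chain the three stages, as a MapReduce framework would do for you."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    lines = ["gene expression data", "expression data"]
    print(word_count(lines))
```

In a real Hadoop job the framework handles the shuffle and distributes the map and reduce work across nodes, but the researcher still has to recast the analysis in this shape.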

But that was a few years ago, and the Hadoop ecosystem has grown since then. We’ve actually had some success using one part of that ecosystem, Spark, with some of our finance users for certain types of analysis. We like that Spark doesn’t require your data to be loaded into HDFS; it can work with data that is already sitting on the filesystem. There is also no need to create a dedicated Hadoop cluster, so Spark jobs can run as just another type of parallel processing job, like MPI jobs.
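As a rough sketch of what that looks like, the PySpark snippet below counts the lines of a file read straight from the ordinary filesystem through a file:// URI. It assumes the pyspark package is installed, and the "local[*]" master, the application name, and the helper names are assumptions for illustration only; they are not our actual configuration, and on a cluster you would normally launch Spark through your scheduler instead.

```python
from pathlib import Path


def local_uri(path):
    """Turn a filesystem path into a file:// URI that Spark can read."""
    return Path(path).resolve().as_uri()


def count_lines(path):
    """Count lines in a text file on the regular filesystem (no HDFS)."""
    # Imported here so the helper above works even without pyspark installed.
    from pyspark.sql import SparkSession

    # Assumption: "local[*]" runs Spark on the cores of the current machine.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("filesystem-example")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # spark.read.text yields one row per line of the input file.
        return spark.read.text(local_uri(path)).count()
    finally:
        spark.stop()
```

Calling count_lines on any text file you can already see from the node is enough to try this out; nothing has to be copied into HDFS first.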