We previously had a Hadoop cluster with dedicated Hadoop nodes, but usage was very low, so those resources were repurposed for general use. But we retain the ability to stand up a dedicated Hadoop cluster in case there is demand from researchers.
Historically, the researchers who have expressed interest in Hadoop have generally come from bioinformatics, biostatistics, economics, or finance. In many cases it was easy to create a simple proof-of-concept example, but very often Hadoop turned out not to be the best tool for the job. Having to fit the problem into a map-shuffle-reduce paradigm, together with the need to move data in and out of HDFS, was often awkward enough that researchers decided against using Hadoop.
But that was a few years ago, and the Hadoop ecosystem has grown since then. We've actually had some success with one part of that ecosystem, Spark, in working with some of our finance users on certain types of analysis. We like Spark because it doesn't require your data to be loaded into HDFS: it can work directly with data already sitting on the filesystem. And since there is no need to stand up a dedicated Hadoop cluster, Spark jobs can run as just another type of parallel processing job, alongside MPI jobs.