Ask.Cyberinfrastructure

Economics of JupyterHub at Scale?

jupyter

#1

I’m wondering how various institutions are handling questions of deploying notebook-type technologies at scale, especially in terms of their economics. In particular, we would like to deploy JupyterHub as a campus service. So far, it has mainly been used as part of a pilot in two courses, with a total of about 60 students. The pilot has been rather successful, and we would like to make it available to the campus at large. We are also in the process of evaluating RStudio Server Pro, which is similar, and may consider a broader campus deployment of it soon as well.

My problem is that these technologies don’t seem to scale well. My research and conversations with JH developers indicate that each session requires about 1GB of memory as a reasonable floor, and it’s common to cap sessions at about 2GB. The issue is that for 90% or more of the semester, we might see 0-4 concurrent sessions, and that traffic can be handled easily by a comparatively small server/VM. But - students being students - they will regularly wait until the night before a deadline to work on homework or projects, and it’s not uncommon to see 40-60 concurrent sessions. We are facing the classic problem of needing a large amount of resources for a very small sliver of time.
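For reference, those per-session figures map directly onto JupyterHub’s spawner memory settings. Here is a minimal sketch of the relevant lines in a jupyterhub_config.py; note that mem_guarantee and mem_limit are standard Spawner traits, but whether they are actually enforced depends on which spawner is in use.

```python
# jupyterhub_config.py -- sketch of per-session memory settings.
# mem_guarantee / mem_limit are standard Spawner traits, but enforcement is
# spawner-dependent (e.g. KubeSpawner and SystemdSpawner enforce them; the
# default LocalProcessSpawner treats them as advisory only).

c = get_config()  # noqa: F821 -- provided by JupyterHub when it loads this file

c.Spawner.mem_guarantee = '1G'  # the ~1 GB "reasonable floor" per session
c.Spawner.mem_limit = '2G'      # the common ~2 GB per-session cap
```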

We don’t have a system on campus that can be dedicated to this, as it would require 128GB or so of memory to be allocated. (Cores and disk space are less of a constraint, though not irrelevant; RAM is the primary limiting factor.) For example, Digital Ocean’s “droplet” VM with 128GB of memory comes in around $640/mo., whereas most of our typical concurrent needs could be met with, say, a VM with 16GB of RAM for $80/mo. There seems to be no effective way to “burst up” for those specific times that require additional resources.
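To make the peak-sizing arithmetic explicit (these are just the numbers above, not measurements):

```python
# Back-of-the-envelope sizing using the figures in this post; the numbers are
# illustrative, not vendor quotes.

mem_per_session_gb = 2   # ~2 GB cap per notebook session
peak_sessions = 60       # worst-case night-before-deadline load
typical_sessions = 4     # what we see 90%+ of the semester

peak_ram_gb = peak_sessions * mem_per_session_gb        # 120 GB -> a ~128 GB node
typical_ram_gb = typical_sessions * mem_per_session_gb  # 8 GB -> a 16 GB VM is plenty

print(f"Peak RAM needed:    ~{peak_ram_gb} GB (roughly the $640/mo. 128GB droplet)")
print(f"Typical RAM needed: ~{typical_ram_gb} GB (roughly the $80/mo. 16GB droplet)")
```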

I thought, perhaps, that I could accomplish this with Kubernetes through Google Cloud, but that doesn’t really seem to be an option. I can provision a few smaller nodes and allow Kubernetes to “load balance” requests across them, but it looks like I pay for the availability of those nodes rather than their actual usage. And in this case, it comes out to somewhere around $700/mo. to offer a Kubernetes cluster with three smaller VMs.
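For reference, this is roughly the kind of configuration involved: a minimal sketch assuming KubeSpawner (the Zero to JupyterHub Helm chart generates something equivalent from its values file). Each user pod declares a memory request/limit, and a node pool with autoscaling enabled only adds nodes when pending pods don’t fit; even so, the nodes that do exist are billed for as long as they exist, whether or not sessions are running on them.

```python
# jupyterhub_config.py -- sketch for a Kubernetes deployment with KubeSpawner.
# Assumes the GKE node pool has cluster autoscaling enabled (min/max node
# counts are set on the GKE side); the hub only declares per-pod resources.

c = get_config()  # noqa: F821 -- provided by JupyterHub when it loads this file

c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'

# Each user pod requests/limits memory; the scheduler packs pods onto nodes,
# and the autoscaler adds a node only when a pending pod does not fit.
c.KubeSpawner.mem_guarantee = '1G'
c.KubeSpawner.mem_limit = '2G'
```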

Ultimately, I’d like a solution where I can burst up when needed, paying a modest fixed base cost plus whatever is needed to support the intermittent additional load. If we are facing these struggles trying to provide a service for 60 students, I can’t imagine trying to offer something that 200-300 students might need each semester.

I would welcome any thoughts or guidance.

Warmest regards,
Jason Simms


#2

We are currently using Open OnDemand, rather than JupyterHub, to launch Jupyter notebooks. We do have both JupyterHub and Open OnDemand, but our JupyterHub is just a single VM and will probably not be able to sustain a lot of students, so Open OnDemand is the better choice for us.

We are also looking into using batchspawner to launch Jupyter notebooks on the cluster. One thing that JupyterHub does better is integrating nbgrader, which makes it easy to assign, collect, and grade assignments. nbgrader could also be set up on a cluster, but I have not yet done so. (One class is fine, but what about several concurrent classes?)
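For anyone curious what the batchspawner route looks like, here is a minimal sketch of the relevant jupyterhub_config.py lines using its SlurmSpawner; the partition name, memory, and walltime are placeholders rather than our actual site configuration.

```python
# jupyterhub_config.py -- sketch of launching single-user notebook servers as
# Slurm batch jobs via batchspawner. Partition, memory, and walltime values
# are placeholders.

import batchspawner  # noqa: F401 -- needed so batchspawner registers its API handler

c = get_config()  # noqa: F821 -- provided by JupyterHub when it loads this file

c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'

c.SlurmSpawner.req_partition = 'interactive'  # hypothetical partition name
c.SlurmSpawner.req_memory = '2gb'             # per-session memory request
c.SlurmSpawner.req_runtime = '4:00:00'        # hypothetical walltime limit
```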

Another possibility is XSEDE JetStream instead of GCP for Kubernetes, which would solve the cost problem. I’d love to learn how to launch Kubernetes on JetStream!


#3

Hmmm … that kind of sounds like it might be a config issue …

GCP (including the Kubernetes Engine) bills primarily on occupancy; is it possible that parts of your system are staying up when they don’t need to be?
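One concrete thing to check along those lines is whether idle single-user servers are being culled; servers that never shut down keep nodes occupied, and occupied nodes keep getting billed. A sketch of enabling the idle-culler service (the package is jupyterhub-idle-culler, and the one-hour timeout is just an example):

```python
# jupyterhub_config.py -- sketch of culling idle single-user servers so nodes
# can scale back down. Requires the jupyterhub-idle-culler package; the
# one-hour timeout is an arbitrary example. Depending on the JupyterHub
# version, the service may also need a role (or admin flag) that allows it
# to stop other users' servers.

import sys

c = get_config()  # noqa: F821 -- provided by JupyterHub when it loads this file

c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "command": [sys.executable, "-m", "jupyterhub_idle_culler", "--timeout=3600"],
    }
]
```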

Though, as KrisP mentioned, JetStream is a good resource, with help and experience attached.