Setting up a flexibly-configured, Dask-powered JupyterHub for the DIRECT research group has been a goal of mine for a while. Back in 2019, I ran some experiments demonstrating that Dask could be a really effective way to scale some of the computations that we do with neuroimaging data. Those experiments ran on a JupyterHub that I set up on AWS using a set of snakemake scripts that I got from Scott Henderson, who had originally created them as part of his work with the Pangeo project. They worked well after only minimal tweaking, but it turned out that there were issues with this setup, and we had to take down our JupyterHub. I got distracted by other things and left it aside.

In the time that has passed since, I also worked with Erik Sundell on another hub I had set up for another project (described here). I had originally set up that hub on GCP, but due to the changing needs of the project, and some administrative quirks that I will not go into, we decided to move it to AWS. When we did that, Erik moved our configuration from the one that I had originally designed, which fairly closely matched the Pangeo configuration as of early 2019, to one that is much more generic and flexible (but, as you'll see, still leans heavily on Pangeo). Recently, we got some cloud credits from Azure for our DIRECT work, which gave me the opportunity and impetus to try this again. In the meanwhile, work on Pangeo and on Dask itself had given rise to the very useful daskhub helm chart, which can now (in principle) be used to deploy and configure these hubs.

So, I set about creating an AKS cluster, following the relevant Dask documentation. The first step was to follow the Zero to JupyterHub documentation to create the cluster itself; something like the sketch below. The only thing that needed tweaking there was the Kubernetes version passed to the az aks create call: I ended up with 1.19.6, based mostly on the fact that I have 1.19.3 installed locally. I also set this up with a minimum of 3 instances and a maximum of 100, which should be enough.
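
For reference, here is roughly what that call looked like (resource group and cluster names are placeholders, and the exact flags may have differed slightly in my case):

az aks create \
    --resource-group my-resource-group \
    --name daskhub-cluster \
    --kubernetes-version 1.19.6 \
    --node-count 3 \
    --enable-cluster-autoscaler --min-count 3 --max-count 100 \
    --generate-ssh-keys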

The Dask documentation is not very explicit about how to use multiple files for configuration (e.g., config.yaml vs. secrets.yaml). Helm itself will merge multiple --values files (there's an example of that further down, after the helm command), but since I wasn't 100% sure how this chart expects things to be split up, to be on the safe side I just kept everything together in one file called secrets.yaml. After an initial install of the helm chart following the Dask documentation, I got the IP address for the cluster and set up GitHub authentication. After doing that, I set up a custom Docker image and installed it into the hub (more about this image below). After progressively changing this file, I ended up with the following ('xxx' are my secrets):

jupyterhub:
  singleuser:
    extraEnv:
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
      DASK_DISTRIBUTED__DASHBOARD_LINK: '/user/{JUPYTERHUB_USER}/proxy/{port}/status'
      DASK_LABEXTENSION__FACTORY__MODULE: 'dask_gateway'
      DASK_LABEXTENSION__FACTORY__CLASS: 'GatewayCluster'
    image:
      name: ghcr.io/nrdg/autofq-daskhub
      tag: 0546ce5f054b

  proxy:
    secretToken: "xxx"
  hub:
    networkPolicy:
      enabled: false

    readinessProbe:
      enabled: false
    services:
      dask-gateway:
        apiToken: "xxx"
    config:
      Authenticator:
        admin_users:
        - arokem
      GitHubOAuthenticator:
        client_id: xxx
        client_secret: xxx
        oauth_callback_url: http://xxx/hub/oauth_callback
      JupyterHub:
        admin_access: true
        authenticator_class: github
  cull:
    enabled: false

dask-gateway:
  gateway:
    auth:
      jupyterhub:
        apiToken: "xxx"
    extraConfig:
      optionHandler: |
            from dask_gateway_server.options import Options, Integer, Float, String
            def option_handler(options):
                if ":" not in options.image:
                    raise ValueError("When specifying an image you must also provide a tag")
                return {
                    "image": options.image,
                }
            c.Backend.cluster_options = Options(
                String("image", default="ghcr.io/nrdg/autofq-daskhub:0546ce5f054b", label="Image"),
                handler=option_handler,
            )
  traefik:
    service:
      type: ClusterIP  # Access Dask-Gateway through JupyterHub.

A few things to note about this:

  1. I am cargo-culting Erik’s work here. For example, I am not sure whether the traefik section at the end is necessary, or whether the readinessProbe section under jupyterhub.hub is. Neither seems to do any harm.
  2. The networkPolicy.enabled: false bit, however, is essential for GatewayCluster to work. This is based on Erik’s work on the l2l hub, and also on this issue (I get a ServerDisconnectedError otherwise).
  3. For things to work smoothly, I had to change the machine type that Azure uses for the AKS node pool to the beefier ‘Standard_D8s_v3’ (from the ‘Standard_D4s_v3’ that was there originally, I believe); one way to do this with the Azure CLI is sketched right after this list. Otherwise, scale-up events (e.g., adding workers) can bring everything crashing down. I think this is because scale-ups require more memory than each of these machines had.
  4. I don’t like it when JupyterHub culls pods. In particular, it seems that the criterion for culling is not whether any computation is happening on the pod, but whether there has been recent interaction with it. This can be very frustrating if you are trying to run long-running computations while doing something else (like, say, writing this).
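
As mentioned in point 3, one way to switch to the beefier VM size without recreating the whole cluster is to add a new node pool with the Azure CLI and then remove the old one. A rough sketch (resource group, cluster, and pool names are placeholders, and this is not necessarily exactly what I did):

az aks nodepool add \
    --resource-group my-resource-group \
    --cluster-name daskhub-cluster \
    --name beefypool \
    --node-vm-size Standard_D8s_v3 \
    --node-count 3 \
    --enable-cluster-autoscaler --min-count 3 --max-count 100

az aks nodepool delete \
    --resource-group my-resource-group \
    --cluster-name daskhub-cluster \
    --name nodepool1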

The ghcr.io/nrdg/autofq-daskhub image configuration is maintained in this repository.

A few important points about the Docker configuration:

  1. I spent some time trying to inherit from an image higher up in the dependency chain (as suggested here), but I ended up using the Dockerfile that Erik had written for the l2l hub instead.
  2. It intentionally does everything directly, without using any of the onbuild tricks that the Pangeo stack uses. Again, this follows Erik’s lead on the l2l image config.
  3. I still have some work to do to figure out how to install some tricky dependencies. In particular, some of our software relies on cvxpy for convex optimization, and that turns out to be a bit of a pain to install, even with conda (why oh why are there so many conflicts with conda? Is that a new thing?). One thing worth trying is sketched right after this list.
  4. One big gotcha with ghcr is that images are private by default, and you have to explicitly make an image public for the hub to be able to pull it.
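
For the cvxpy issue in point 3, one thing worth trying is enforcing strict conda-forge channel priority and letting mamba do the solve. A minimal sketch, which I have not fully verified in this image yet:

# Prefer conda-forge consistently, which tends to reduce solver conflicts:
conda config --add channels conda-forge
conda config --set channel_priority strict
# mamba resolves the environment much faster than conda:
mamba install --yes cvxpy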

The initial installation of the hub and upgrades are done via:

helm upgrade --wait --install --render-subchart-notes \
    dhub dask/daskhub \
    --version=4.5.7 \
    --values=secrets.yaml
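
As an aside, and relevant to the question above about splitting the configuration across files: helm accepts multiple --values flags and merges them, with later files taking precedence. So the non-secret parts could in principle live in a separate config.yaml (I haven't actually switched to this layout yet):

helm upgrade --wait --install --render-subchart-notes \
    dhub dask/daskhub \
    --version=4.5.7 \
    --values=config.yaml \
    --values=secrets.yaml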

The code I use to start a cluster and use it in notebooks:

from dask_gateway import GatewayCluster

gateway = GatewayCluster()     # start a new cluster through the gateway
n_workers = 8                  # adjust to taste
gateway.scale(n_workers)
client = gateway.get_client()  # subsequent Dask computations run on this cluster
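
A couple of other GatewayCluster methods are worth knowing about: adaptive scaling, which lets the gateway add and remove workers based on load, and an explicit shutdown when you are done (the numbers here are just for illustration):

gateway.adapt(minimum=2, maximum=20)  # add/remove workers between 2 and 20, based on load
# ... do some work ...
gateway.shutdown()  # release the cluster's resources when done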

Things I still need to sort out:

  1. This issue is still happening with this configuration. That is, using the Dask labextension currently doesn’t work (and could confuse users, as it launches a LocalCluster). I can still get a view of the work that the cluster is doing by clicking on the provided dashboard link.
  2. The initialization of the GatewayCluster class instance is still a bit flaky. It looks like it can take a while for the dask-scheduler pod to get started on a cold cluster, so I sometimes need to repeat this step a couple of times before it works (a crude retry workaround is sketched right after this list).
  3. My current configuration has 2GB RAM per worker. That might be a little low for the kinds of things we’d like to do with this.
  4. It would be great to automate everything, so that changes to the configuration immediately trigger an update to the hub. Right now, I have to make a PR for the Docker image to build (at least that part is automated!), then merge the PR and wait for it to build, copy the hash of the new version of the image into my secrets.yaml file, and run the upgrade from my laptop. I know it’s possible to have CI/CD do all of this, but I would still need to piece it together. I realize that this would save more time in the long run. Sigh.
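
For the flakiness in point 2 above, a crude workaround is to simply retry the cluster creation a few times, with a pause in between. A minimal sketch (the numbers are arbitrary):

import time
from dask_gateway import GatewayCluster

def start_cluster(n_workers=8, retries=5, wait=30):
    # Retry GatewayCluster creation; the dask-scheduler pod can take a
    # while to come up on a cold cluster.
    for attempt in range(retries):
        try:
            gateway = GatewayCluster()
            gateway.scale(n_workers)
            return gateway
        except Exception as err:
            print(f"Attempt {attempt + 1} failed ({err}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("Could not start a Gateway cluster")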

Finally, something a bit more meta. All this made me think of conversations that I was recently part of, where people said things like “we are scientists, not cloud engineers!”, suggesting that large research collaborations should rely (exclusively?) on non-scientists with specialized knowledge to design, implement, and maintain systems like this. I’d like to push against that notion just a bit. It’s true that I am relying on Erik and others with specialized expertise as a starting point for this work, but I think that understanding how these systems and technologies work a bit better is relevant to scientific knowledge creation in many fields where this technology is going to be used. Just like I believe software engineering is. Or math. This doesn’t necessarily mean that we should drop everything and just focus on cloud engineering issues, but we should try to make sure that it has a place within our work. For me personally, one of my main learning points from all this was to use kubectl describe pod more effectively to debug problems that come up, and to interpret the events that caused them. There’s definitely more to learn.
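
In case it is useful, the two kubectl incantations I found myself reaching for most often were along these lines (the pod name and namespace are placeholders):

kubectl describe pod <pod-name> --namespace <namespace>
kubectl get events --namespace <namespace> --sort-by='.lastTimestamp'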

EDIT (20210215): After first writing this post, I figured out some things that were originally wrong, so I fixed them in the config file that I posted here. I also discovered (maybe that shouldn’t have been a surprise?) that some issues that come up can be “fixed” by turning the damn thing off and then on again. In this case, I ran into some really puzzling behavior, with Dask schedulers dying on me left and right, leaving me unable to run anything on the cluster (basically a persistent form of issue 2 above). Power cycling the Azure VMs seems to have resolved that issue; a command-line sketch of one way to do this is below.
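
For the record, that kind of power cycling can also be done from the command line, by restarting the VM scale set that AKS creates in its node resource group. A rough sketch, with placeholder names, that I have not scripted carefully:

# Find the resource group that holds the cluster's node VMs:
az aks show --resource-group my-resource-group --name daskhub-cluster \
    --query nodeResourceGroup --output tsv

# List the scale sets in that resource group, then restart the relevant one:
az vmss list --resource-group <node-resource-group> --query '[].name' --output tsv
az vmss restart --resource-group <node-resource-group> --name <scale-set-name>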