Rokem Research

A typology of datatypes?

2023-02-24T00:00:00+00:00

We are writing up a project that explored the potential benefits of using neural network algorithms to model relationships between white matter tract profiles and phenotypes. One of the main questions we are asking in this work is whether NNs would be able to capitalize on the structure that exists in the data and discover non-linear relationships between white matter tissue properties and chronological age of the participants. It turned out that the benefits in model accuracy were rather small (although we also know of use cases where the benefits are much more dramatic).

When presenting this work at the eScience Institute core staff meeting a few months ago, Anissa Tanweer raised the interesting question of whether we could reason about the data in advance of the rather extensive empirical evaluation that we did to tell us whether this data would benefit from the additional flexibility and power of an NN algorithm. My current hunch is that this is not possible, because it’s hard to discover a specific non-linear relationship in high dimensional data without the non-linear model. However, I think that there could be a way to reason about data that would bring us closer to this interesting silver bullet. This could have to do with the fact that different types of data have different characteristics. In the paper that we are writing, we touch on the fact that tract profiles are derived from images, but they are no longer images. In fact, when they are naively observed, they are tabular data. But we also know that they have group structure, and that they contain spatially contiguous data in neighboring nodes within a tract, but not across tracts. In addition, there could be other kinds of structure. For example, some tissue properties that we calculate could be correlated because of physical relationships between them (for example, diffusion FA and estimates of axonal water fraction are related to each other). Another example is that there are known correlations between the same tract in the two hemispheres (a fact beautifully exploited by Lerma et al.).

One way to analyze and discover this structure may be through unsupervised learning approaches (linear and non-linear). If this analysis reveals the kind of structure that would benefit from the strengths of the NN, then it might be worth pursuing.
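As a rough illustration of what such an analysis might look for, the sketch below builds synthetic tract profiles (not real data) in which values are smooth along the nodes of each tract but independent across tracts. It then flattens them into a table and compares correlations between neighboring columns within a tract and across tract boundaries. The sizes, the smoothing width, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_tracts, n_nodes, width = 200, 4, 50, 10  # hypothetical sizes

# White noise smoothed along the node axis with a moving average,
# independently for every subject and tract.
noise = rng.standard_normal((n_subjects, n_tracts, n_nodes + width - 1))
windows = np.lib.stride_tricks.sliding_window_view(noise, width, axis=-1)
profiles = windows.mean(axis=-1)  # shape: (subjects, tracts, nodes)

# "Naive" tabular view: one row per subject, one column per tract/node pair.
table = profiles.reshape(n_subjects, n_tracts * n_nodes)

# Correlation between each column and its right-hand neighbor.
corr = np.corrcoef(table, rowvar=False)
adjacent = np.diag(corr, k=1)

# Neighbor pairs that straddle two different tracts.
is_boundary = np.zeros(adjacent.size, dtype=bool)
is_boundary[np.arange(1, n_tracts) * n_nodes - 1] = True

within = adjacent[~is_boundary].mean()
across = np.abs(adjacent[is_boundary]).mean()
print(f"within-tract neighbors: {within:.2f}, across-tract neighbors: {across:.2f}")
```

In this synthetic case, neighboring columns are strongly correlated within a tract and nearly uncorrelated across tract boundaries, which is the kind of group structure that a plain table hides.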

This also relates to a typology of different kinds of data that we might encounter with more or less of this kind of structure. For example, things that are more “image-like” vs. things that are more “table-like”. It also relates to findings about the relative disadvantage of CNNs in analysis of large tabular data, where other kinds of non-linear models may be better suited.

pyAFQ 1.0 is out; some thoughts on documentation

2023-02-11T00:00:00+00:00

It’s been a little while. One reason to write now is that we finally released version 1.0 of our software pyAFQ. This has been a long time in the making. This is a project that I started working on together with Jason Yeatman soon after first arriving at UW in 2015. However, it really only took off in earnest after John Kruper joined us in this effort in 2019, and with the funding that we received in late 2019.

One of the things that I hope to do next is an overhaul of the documentation. I have noticed that we field a set of the same questions from collaborators and I am hopeful that rethinking the documentation will have an impact on our ability to collaborate with researchers from a variety of backgrounds who want to use the methods, and will also enable us to more effectively capture the knowledge that we accumulate through these interactions. I am excited about the framework that is laid out in divio’s documentation system, and we are currently working to implement this with Qiqi Liang, who recently joined the group as a research assistant. This is currently work in progress in a PR that will hopefully replace our current documentation before too long.

Setting up a scalable dask Jupyterhub (cargo-cult edition, as of 20210214)

2021-02-14T00:00:00+00:00

Setting up a flexibly-configured Dask-powered Jupyterhub for the DIRECT research group has been a goal of mine for a while. Back in 2019, I did some experiments that demonstrated that we could use Dask as a really effective way to scale some of the computations that we do with neuroimaging data. These experiments were done on a Jupyterhub that I set up on AWS using a set of snakemake scripts that I got from Scott Henderson, who had originally created these as part of his work with the Pangeo project. These worked well after only some minimal tweaking, but it turned out that there were issues with this setup, and we had to take down our Jupyterhub. I got distracted by other things and left that aside. Over the time that has passed since, I also worked with Erik Sundell, on another hub I had set up for another project (described here). I had originally set up this hub on GCP, but due to changing needs of the project, and some administrative quirks that I will not go into, we decided to move our hub to AWS. When we did that, Erik moved our configuration from the one that I had originally designed, that fairly closely matched the Pangeo configuration as of early 2019, to one that is much more generic and more flexible (but as you’ll see, still leans heavily on pangeo). Recently, we got some cloud credits from Azure for our DIRECT work, which gave me the opportunity and impetus to try this again. Over the time that had passed between me setting up the GCP hub and the move to AWS, another thing that happened is that work on Pangeo and other work on Dask had given rise to the very useful daskhub helm chart that can now be (in principle) used to deploy these hubs and configure them.

So, I set about to create an AKS cluster following the relevant Dask documentation. The first step was to use the zero to jupyterhub documentation to create an AKS cluster. The only thing that needed tweaking there was the kubernetes version used in the aks cluster create call. I ended up with 1.19.6, based mostly on the fact that I have 1.19.3 installed locally. I also set this up with a minimum of 3 instances and a max of 100. That should be enough.

The Dask documentation is not very explicit about how to use multiple files for configuration (e.g., config.yaml vs. secrets.yaml). I think that I can split this up arbitrarily, but not 100% sure how to do that, so to be on the safe side, I just kept everything together in one file called secrets.yaml. After an initial install of the helm chart following the Dask documentation, I got the IP address for the cluster, and set up GitHub authentication. After doing that, I set up a custom docker image and installed that into the hub (more about this image below). After progressively changing this file, I ended up with the following ('xxx' are my secrets):

jupyterhub:
  singleuser:
    extraEnv:
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
      DASK_DISTRIBUTED__DASHBOARD_LINK: '/user/{JUPYTERHUB_USER}/proxy/{port}/status'
      DASK_LABEXTENSION__FACTORY__MODULE: 'dask_gateway'
      DASK_LABEXTENSION__FACTORY__CLASS: 'GatewayCluster'
    image:
      name: ghcr.io/nrdg/autofq-daskhub
      tag: 0546ce5f054b

  proxy:
    secretToken: "xxx"
  hub:
    networkPolicy:
      enabled: false

    readinessProbe:
      enabled: false
    services:
      dask-gateway:
        apiToken: "xxx"
    config:
      Authenticator:
        admin_users:
        - arokem
      GitHubOAuthenticator:
        client_id: xxx
        client_secret: xxx
        oauth_callback_url: http://xxx/hub/oauth_callback
      JupyterHub:
        admin_access: true
        authenticator_class: github
  cull:
    enabled: false

dask-gateway:
  gateway:
    auth:
      jupyterhub:
        apiToken: "xxx"
    extraConfig:
      optionHandler: |
            from dask_gateway_server.options import Options, Integer, Float, String
            def option_handler(options):
                if ":" not in options.image:
                    raise ValueError("When specifying an image you must also provide a tag")
                return {
                    "image": options.image,
                }
            c.Backend.cluster_options = Options(
                String("image", default="ghcr.io/nrdg/autofq-daskhub:0546ce5f054b", label="Image"),
                handler=option_handler,
            )
  traefik:
    service:
      type: ClusterIP  # Access Dask-Gateway through JupyterHub.

A few things to note about this:

I am cargo-culting Erik’s work here. For example, I am not sure whether the traefik section at the end or the readinessProbe section under jupyterhub.hub is necessary. They don’t seem to do any harm.
The networkPolicy.enabled == false bit, however, is essential for GatewayCluster to work. This is based on Erik’s work on the l2l hub, and also on this issue (I get a ServerDisconnectedError error otherwise).
For things to work smoothly, I had to change the machine type that Azure uses on the AKS nodepool to the beefier ‘Standard_D8s_v3’ (from Standard_D4s_v3 that was there originally, I believe). Otherwise, scaleup events (e.g., adding workers) can cause everything to come crashing down. I think it’s because scaleups require more memory than each one of these machines had.
I don’t like it when Jupyterhub culls pods. In particular, it seems that the criterion for culling is not whether any computing is happening on the pod, but if there is enough interaction. This can be very frustrating if you are trying to run long-running computations, while doing something else (like, say, writing this).

The ghcr.io/nrdg/autofq-daskhub image configuration is maintained on this repository.

A few important points about the Docker configuration:

I spent some time trying to inherit from an image higher up in the dependency chain (as suggested here), but I ended up using the Dockerfile that Erik had written for the l2l hub instead.
It intentionally does everything, without using any of the onbuild tricks that the pangeo stack uses. Again, this relies on Erik’s lead on the l2l image config.
I still have some work to do to figure out how to install some tricky dependencies. In particular, some of our software relies on cvxpy for convex optimization methods, and that turns out to be a bit of a pain to install, even with conda (why oh why are there so many conflicts with conda? Is that a new thing?).
One big gotcha with ghcr is that images are private by default and you have to explicitly make an image public for the hub to be able to pull it.

The initial installation of the hub and upgrades are done via:

helm upgrade --wait --install --render-subchart-notes \
    dhub dask/daskhub \
    --version=4.5.7 \
    --values=secrets.yaml
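On the question of splitting the configuration: Helm accepts the --values flag more than once (for example, --values=config.yaml --values=secrets.yaml) and merges the files, with later files taking precedence. The sketch below mimics that merge in plain Python to show how a split would combine. The two dictionaries are a hypothetical division of the file shown above, and merge_values is an illustrative helper, not part of Helm.

```python
def merge_values(base, override):
    """Recursively merge two nested dicts; values in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = value
    return merged


# Hypothetical split: non-sensitive settings in one file ...
config = {
    "jupyterhub": {
        "hub": {"networkPolicy": {"enabled": False}},
        "cull": {"enabled": False},
    }
}
# ... and secrets in another.
secrets = {
    "jupyterhub": {
        "proxy": {"secretToken": "xxx"},
        "hub": {"services": {"dask-gateway": {"apiToken": "xxx"}}},
    }
}

merged = merge_values(config, secrets)
print(sorted(merged["jupyterhub"]["hub"]))
```

Keys that appear in only one file survive the merge, so the secrets can live in a file that is kept out of version control.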

The code I use to start a cluster and use it in notebooks:

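from dask_gateway import GatewayCluster

n_workers = 10  # example value; set to the number of Dask workers you need
gateway = GatewayCluster()  # note: despite the name, this is the cluster object
gateway.scale(n_workers)
client = gateway.get_client()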

Things I still need to sort out:

1. This issue is still happening with this configuration. That is, using the dask labextension currently doesn’t work (and could confuse users, as it launches a LocalCluster). I can still get a view on the work that the cluster is doing by clicking on the provided dashboard link.
2. The initialization of the GatewayCluster class instance is still a bit flaky. It looks like maybe it takes a while for the dask-scheduler pod to get started on a cold cluster, so I sometimes need to repeat this step a couple of times, before it works.
3. My current configuration has 2GB RAM per worker. That might be a little low for the kinds of things we’d like to do with this.
4. It would be great to automate everything, so that changes to configuration immediately trigger an update to the hub. Right now, I have to make a PR for the docker image to build (at least that’s automated!), then merge the PR and wait for it to build, copy the hash of the new version of the image into my secrets.yaml file and run the upgrade on my laptop. I know it’s possible to have the CI/CD do all of this, but I would still need to piece it together. I realize that this will save more time in the long run. Sigh.
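Until the flaky GatewayCluster start-up in the list above is understood, one stopgap is to wrap the call in a small retry loop rather than re-running the cell by hand. This is a minimal sketch: retry is a hypothetical helper, and the number of attempts and the delay are arbitrary defaults, not values tuned for this hub.

```python
import time


def retry(factory, attempts=5, delay=10.0, sleep=time.sleep):
    """Call `factory()` until it succeeds, waiting `delay` seconds between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except Exception as err:  # broad on purpose: the failure mode is unclear
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Intended use on the hub (not executed here):
#     from dask_gateway import GatewayCluster
#     gateway = retry(GatewayCluster)
```

The sleep argument is only there so that the waiting can be replaced in tests.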

Finally, something a bit more meta. All this made me think of conversations that I was recently part of where people said things like “we are scientists, not cloud engineers!”, suggesting that large research collaborations should rely (exclusively?) on non-scientists with specialized knowledge to design, implement and maintain systems like this. I’d like to push against that notion just a bit. It’s true that I am relying on Erik and others with the specialized expertise as a starting point for this work, but I think that understanding how these systems and technologies work a bit better is relevant to scientific knowledge creation in many fields where this technology is going to be used. Just like I believe software engineering is. Or math. This doesn’t necessarily mean that we should drop everything and just focus on cloud engineering issues, but we should try to make sure that it has a place within our work. For me personally, one of my main learning points from all this was to more effectively use kubectl describe pod to debug issues that come up and try to interpret events that happened that caused issues to arise. There’s definitely more to learn.

EDIT (20210215): After first writing this post, I figured out some things that were originally wrong, so I fixed them in the config file that I posted here. I also discovered (maybe that shouldn’t have been a surprise?) that some issues that come up can be “fixed” by turning the damn thing off and then on again. In this case, I ran into some really puzzling behaviors with dask schedulers dying on me left and right, leaving me unable to run anything on the cluster (basically a persistent form of issue 2 above). Power cycling the Azure VMs seems to have resolved that issue.

Finding visual pathways

2020-11-09T00:00:00+00:00

Some recent work is focusing in on finding the primary visual pathways. This is motivated by a project in Jason Yeatman’s lab that is looking at these pathways, and particularly some exciting work by Sendy Caffara that focuses on comparisons between properties of these pathways and physiological responses to visual stimuli. It’s pretty amazing that you can compare properties of the brain at such different scales and find pretty tight correspondences. At any rate, Sendy has found a few ROIs that can be used to define endpoints and exclusion regions for streamlines in large tractographies (order 10M streamlines) that provide very nice optic radiations. David Bloom in my group has diligently worked to engineer a solution that makes pyAFQ even more flexible than before in defining custom bundles, so that we can integrate Sendy’s work into the software. Finally, I have been working on integrating this solution into the pyAFQ API and on writing an example that shows how to find the visual pathways. After a few experiments, I think that this is doable. The key, I believe, is to generate a lot of streamlines around the part of the brain that we are interested in, massively oversampling this part of the brain, and then use the ROIs to refine down to only the OR. I think that we can come up with a combination of fast tractography, and ROI-based and streamline-based bundle segmentation, that will provide nice OR segmentations. But I’ll still need to demonstrate that.

Motion correction for dMRI

2020-10-19T00:00:00+00:00

One of the things that I have been thinking about in the last couple of weeks is head motion correction for diffusion MRI. A few things have come together: the first is that I have been interested in the idea originally proposed in this paper for a few years. The idea, which is now used in some of the most popular algorithms for motion and eddy current correction, is that direct registration of a diffusion-weighted image to a non-diffusion-weighted image (for example, to the b0 normalization image in a particular scan) is not a great idea, because diffusion-weighted images are supposed to look different from the non-diffusion-weighted images. In particular, in parts of the image that contain large bundles that curve around, slightly different diffusion-weighting directions should cause slight shifts in the location of the brightest pixels. If registered directly, this would cause all kinds of mis-registrations. The idea is instead to use the available data to create a model of the signal in each voxel and to use this model to predict what the image should look like in a particular volume (diffusion-weighting direction) and then register to the predicted image. An algorithm based on this idea would be useful in dmriprep, so I have been working to implement it. Another thing that should help with this is that my previous work demonstrated that a regularized sparse fascicle model provides very accurate predictions of diffusion data. Previously, we used the Elastic Net algorithm to fit this kind of model, with non-negativity constraints. But for motion correction, we don’t really care if the parameters are aphysical (i.e., negative weights in the fODF), so it makes sense to use ridge regression, as implemented in fracridge. In parallel, DIPY now has a simplified API to registration, which should make implementation of the registration part much more straightforward. Taking all these bits and putting them all together is going to be non-trivial, though. 
An initial sketch of an implementation is here, but I should really move that into a PR on dmriprep soon.
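The predict-then-register idea can be sketched for a single voxel as a leave-one-volume-out loop: fit a signal model to all other volumes, predict the held-out volume, and use that prediction as the registration target. To keep the example short and self-contained, it swaps in a plain diffusion tensor model on the log-signal in place of the sparse fascicle model, and a hand-written ridge solve in place of fracridge. The gradient table is synthetic, and the registration step is only indicated in a comment.

```python
import numpy as np


def design_matrix(bvals, bvecs):
    """Log-signal design for a tensor model: log S = log S0 - b * g^T D g."""
    gx, gy, gz = bvecs.T
    return np.column_stack([
        np.ones_like(bvals),
        -bvals * gx**2, -bvals * gy**2, -bvals * gz**2,
        -2 * bvals * gx * gy, -2 * bvals * gx * gz, -2 * bvals * gy * gz,
    ])


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: (X^T X + alpha I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def predict_left_out(log_signal, X, alpha=1e-8):
    """Predict each volume from a model fit to all the other volumes."""
    n_volumes = X.shape[0]
    predicted = np.empty_like(log_signal)
    for i in range(n_volumes):
        keep = np.arange(n_volumes) != i
        weights = ridge_fit(X[keep], log_signal[keep], alpha)
        predicted[i] = X[i] @ weights
        # In a real pipeline, volume i would now be registered to the image
        # formed by these predictions across all voxels.
    return predicted


# Synthetic acquisition: 3 b=0 volumes and 30 directions at b=1 ms/um^2.
rng = np.random.default_rng(0)
directions = rng.standard_normal((30, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
bvecs = np.vstack([np.zeros((3, 3)), directions])
bvals = np.concatenate([np.zeros(3), np.ones(30)])

X = design_matrix(bvals, bvecs)
# Noise-free signal: S0 = 100 and a tensor with diagonal (1.7, 0.3, 0.3) um^2/ms.
true_weights = np.array([np.log(100.0), 1.7, 0.3, 0.3, 0.0, 0.0, 0.0])
log_signal = X @ true_weights
predicted = predict_left_out(log_signal, X)
print(np.max(np.abs(predicted - log_signal)))
```

With noise-free synthetic data the held-out predictions match the signal almost exactly. With real data, the ridge penalty alpha would need to be chosen with care.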

Life support

2020-08-24T00:00:00+00:00

Over the weekend, and inspired by a conversation we had on Friday in the weekly telecon of our data intensive research in connectomics (DIRIC) group, I started working on a new project that I call life support (because what I really need right now are new projects…). The idea is rather simple and is not mine (it is due to my postdoc advisor, Brian Wandell): if you have a tractography and you try to create tract profiles, you often find that some features of the tract profile are more affected by the environment that the streamlines are passing through than by the characteristics of that bundle. This can muddy the interpretation of the tract profile in all kinds of ways, because the features may be due to the size of a certain narrow passage, or due to changes in other crossing pathways. To deal with this, one way is to generate a bundle-specific signal, based on a model, such as LiFE and then generate the tract profile based on this predicted signal. The advantage of this approach is that it could effectively remove confounding effects. The notebook I have in that repo is a first pass at doing this. For now, it’s not working as well as I’d hoped, but one initial observation is how easy it is to start prototyping something like this with the new version of pyAFQ.

NWB+HDF5+ZARR+Dask

2020-08-17T00:00:00+00:00

This was not meant to be a usual year of research, after all. Sigh. Trying to pick this back up, though.

One of the things that I have been working on in the past week or so is a new and rather exciting development, spearheaded by Ben Dichter and Daniel Sotoude. Based on previous work in the geoscience community, they have been working on enabling reading of NWB files stored as HDF5 into ZARR: https://github.com/catalystneuro/HDF5Zarr. This means that a large NWB file containing multi-channel recordings of neurophysiology can be stored in cloud storage and, through a combination of gcsfs and their work, read into a ZARR accessible on a jupyterhub node.

My first set of experiments with this approach uses a 70 GB neurophysiology recording, provided by Yoni Browning and Beth Buffalo. The details of their experimental setup and what exactly is in these time-series is interesting, but not so important for what follows.

The first thing that can be done in this setup is to look at the data. All of it. Even though the hub is running on a machine with only 13 GB of RAM, we can access it through ZARR/gcsfs at near-interactive speed. For example, I implemented the following widget:

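import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets

# `arr` is the array opened from the ZARR store earlier (not shown here)
chan = widgets.SelectMultiple(
    options=np.arange(arr.shape[1]),
    value=[0],
    description='Channels:',
    disabled=False
)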

time = widgets.IntSlider(
    value=0,
    min=0,
    max=arr.shape[-1]-5000,
    step=1,
    description='Starting timepoint:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)

time_d = widgets.IntSlider(
    value=5000,
    min=0,
    max=10000,
    step=1,
    description='Window duration',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)



def f(time, time_d=5000, chan=[0]):
    fig, ax = plt.subplots()
    for c in chan:
        if c < 124:
            ax.plot(arr[c, time:time+time_d], label=f"Channel {c}")
    ax.legend()
    ax.set_xlabel("Time (samples)")
    ax.set_ylabel("Voltage")
    plt.show()

out = widgets.interactive_output(f, {'chan': chan, 'time': time, 'time_d': time_d})

widgets.HBox([widgets.VBox([chan, time, time_d]), out])

This is not painful: I can select a channel, or even several channels and see their time-series appear within a very short time. In fact, I suspect that most of that time is Matplotlib dealing with the rendering of all this data locally. The key seems to be that data access patterns matter. So, data should be addressed along the chunks stored in the zarr. That is arr[chan, x:x+delta] is good, but arr[chan1:chan2, x:x+delta] can be disastrous. But I need to experiment a bit more.
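To make the access-pattern point concrete, here is a small sketch that counts how many stored chunks a given slice touches. The chunk shape used below (one channel by 65,536 samples) is a hypothetical layout chosen for illustration; the actual chunking of the ZARR store is not stated above.

```python
def chunks_touched(slices, chunk_shape):
    """Count the chunks overlapped by half-open (start, stop) ranges, one per axis."""
    count = 1
    for (start, stop), size in zip(slices, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        count *= last - first + 1
    return count


chunk_shape = (1, 65536)  # hypothetical: one channel per chunk

# One channel, one short window: a single chunk is read.
one_channel = chunks_touched([(3, 4), (0, 5000)], chunk_shape)

# 124 channels over the same window: one chunk per channel.
many_channels = chunks_touched([(0, 124), (0, 5000)], chunk_shape)

# A window that straddles a chunk boundary in time reads two chunks.
straddling = chunks_touched([(3, 4), (60000, 70000)], chunk_shape)

print(one_channel, many_channels, straddling)
```

Every touched chunk is fetched and decompressed in full, so under this layout a slice across many channels costs far more than the number of requested samples suggests.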

Second, by attaching this hub to a Dask kube cluster, we can process the data. For example, if we want to do a spectral decomposition using wavelets, we can write something like:

from mne.time_frequency import tfr_array_morlet
from functools import partial
freqs = np.logspace(1, 2, 20)
my_morlet = partial(tfr_array_morlet, sfreq=fs, freqs=freqs)
morlet_arr = arr.map_blocks(my_morlet, dtype=np.complex128, new_axis=2)

This generates the Dask array that could compute the Morlet wavelet decomposition for the entire dataset, for 20 frequency bands. This would result in a rather large variable: about 0.5 TB here. But this can still be viewed at near interactive speed:

chan = widgets.Select(
    options=np.arange(arr.shape[1]),
    value=0,  # Select takes a single option as its value, not a list
    description='Channel:',
    disabled=False
)

time = widgets.IntSlider(
    value=0,
    min=0,
    max=arr.shape[-1]-500,
    step=1,
    description='Starting timepoint:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)

time_d = widgets.IntSlider(
    value=50,
    min=0,
    max=1000,
    step=1,
    description='Window duration',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)



def f(time, time_d=50, chan=0):
    fig, ax = plt.subplots()
    # Magnitude of the wavelet coefficients: frequency bins by time samples
    result = np.abs(morlet_arr[0, chan, :, time:time+time_d].compute())
    ax.matshow(result)
    ax.set_xlabel("Time (samples)")
    ax.set_ylabel("Frequency bin")
    plt.show()

out = widgets.interactive_output(f, {'chan': chan, 'time': time, 'time_d': time_d})

widgets.HBox([widgets.VBox([chan, time, time_d]), out])

Again, this is not painful, although it can only show one channel at a time and does this with a small, maybe one second, delay.

But this doesn’t do everything we’d like to do. For example, we want to normalize to z-score in every frequency band in every channel. I think that this could be done by creating a new array:

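import dask.array as da
# keepdims=True keeps a length-1 time axis, so the statistics broadcast against morlet_arr
morlet_mean = da.mean(morlet_arr, axis=-1, keepdims=True)
morlet_std = da.std(morlet_arr, axis=-1, keepdims=True)
zscored = (morlet_arr[:, 0, :, :] - morlet_mean[:, 0, :, :]) / morlet_std[:, 0, :, :]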

Calculating the z-scored value here is challenging, because this would require aggregating across all of the channel data. And this is where Dask should help. I haven’t done this entire computation yet, but calculating the mean for the channel is about 10 minutes of computing. We do also need to worry about the edges of each block when doing this computation, so one of the next set of experiments to do would use the Dask arr.map_overlap method and a more elaborate Morlet function that also windows each chunk to combine with neighboring chunks.
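The edge problem, and the way overlapping blocks address it, can be shown with NumPy alone. In this sketch a moving average stands in for the wavelet transform, and blockwise is a hypothetical helper that mimics by hand what a map_overlap-style call does: each block is extended by depth samples on both sides, processed, and then trimmed back to its original extent.

```python
import numpy as np


def smooth(x, width=5):
    """Centered moving average; a stand-in for a windowed transform."""
    return np.convolve(x, np.ones(width) / width, mode="same")


def blockwise(x, func, chunk, depth=0):
    """Apply `func` chunk by chunk, borrowing `depth` samples from each neighbor."""
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), chunk):
        stop = min(start + chunk, len(x))
        lo = max(start - depth, 0)
        hi = min(stop + depth, len(x))
        result = func(x[lo:hi])
        # Trim the borrowed samples so that only this chunk's extent is kept.
        out[start:stop] = result[start - lo:start - lo + (stop - start)]
    return out


rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)

reference = smooth(signal)                                   # whole array at once
naive = blockwise(signal, smooth, chunk=100)                 # no overlap
overlapped = blockwise(signal, smooth, chunk=100, depth=2)   # overlap of width // 2

print(np.abs(naive - reference).max(), np.abs(overlapped - reference).max())
```

Without overlap, the samples next to each chunk boundary differ from the whole-array result. With an overlap of half the kernel width they match. For a Morlet wavelet the required depth would depend on the length of the longest wavelet, that is, on the lowest frequency.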

Paper about ML for glaucoma

2020-02-18T00:00:00+00:00

We’re almost almost ready to submit a paper about an automated algorithm for detection of glaucoma based on data from the UK Biobank. This paper is the result of a lot of work by Parmita Mehta, who is a PhD student in Computer Science and a long-time collaborator, and a continuation of a long series of collaborations with Aaron Lee. I will not say here too much about the results, except to say that giving a talk about these results last week at a local seminar (slides here) made me think about the value of such work. There is the (obvious?) value of demonstrating the potential utility of such algorithms. And it’s (vaguely?) possible that some variant of this algorithm will find its way into clinical application one day. But for now, I think that one important and potentially quite fruitful outcome of this kind of work is in considering the interplay between “brute force” machine learning, that aims to find the most accurate representation of the data for predictive accuracy, and an interpretive methodology that tries to pick apart the results of an accurate algorithm, to derive some insights. Here, the interpretive methodology takes three different forms: the first is pixel-by-pixel allocation of credit in deep learning algorithms. The spatial maps provided by such an analysis can be quite compelling. Another approach uses sub-sampling of the data, to ask what information is provided by different parts of the data. This kind of approach is then also further formalized in using SHAP values. One thing that became clear in giving a talk about this work is that it would be worth coming up with a simple and intuitive explanation of how SHAPs work. But even when these values and maps are derived there is still often a challenge to synthesize what it is that the algorithm is telling us. So this interplay is further complemented by a lot of domain knowledge. 
In this case, the knowledge is derived from close collaboration with ophthalmologists (primarily Aaron and also Christine Petersen, who have been working with us closely on the manuscript) and from reading the literature. Here, there is a long line of literature on the effects of glaucoma on different parts of the retina (whoa, eyes are complicated…).
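One possible starting point for such an intuitive explanation is the exact Shapley value from cooperative game theory, on which SHAP values are built: a feature’s credit is its marginal contribution to the model output, averaged over every order in which features could be revealed. The toy model and feature names below are invented for illustration, and real SHAP implementations rely on approximations rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(value, players):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_model(features):
    """Hypothetical score: 'a' and 'b' contribute alone, 'a' and 'c' interact."""
    score = 0.0
    if "a" in features:
        score += 2.0
    if "b" in features:
        score += 1.0
    if "a" in features and "c" in features:
        score += 3.0
    return score


phi = shapley_values(toy_model, ["a", "b", "c"])
print(phi)
```

The interaction between "a" and "c" is split evenly between them, and the credits add up to the difference between the score with all features and the score with none.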

One cool potential conclusion of the work, reiterating previous results that we’ve found in at least one more case, is that deep learning algorithms can be sensitive to information that is “hidden in plain sight” in features of an image that are very subtle and would be hard to extract in a top-down manner, just based on what we think that an image represents. This allows the algorithm to point to parts of the retina that would not have been considered useful for a diagnosis, and in which you might think no information should be present based on the standard analysis of the images. This is a rather interesting and important conclusion, as data-driven approaches to analysis of images become much more central.

dmriprep development sprint

2020-01-23T00:00:00+00:00

Reliable, robust and efficient preprocessing of MRI data is hard. So many things can go wrong. Building a general-purpose pipeline for preprocessing also faces the challenge that even for just one type of data (e.g., dMRI) there are multiple variations on the manner in which the data can be collected (for example, are multiple gradient strengths collected in each scan, or in separate scans? Are fieldmaps collected for susceptibility distortion correction, or b0 scans with reverse phase encode directions? Etc.). fmriprep provides an excellent template to follow as an example of a robust, general-purpose pipeline for preprocessing of data collected in many different kinds of fMRI experiments. And so, for a while now, we have been thinking about a dmriprep that would emulate the success of fmriprep for the dMRI community. Initially, this was a local effort (with Adam Richie-Halford and Anisha Keshavan at the helm), but after presenting our initial work on this at OHBM, very quickly we were able to get together other members of the community: Oscar Esteban (at Stanford), who is the lead developer of fmriprep, was already interested in expanding fmriprep to an ecosystem of niprep tools and wanted to make sure that we did it right. Matt Cieslak (Penn), who has in the meanwhile created qsiprep, which does a lot of what we might want a dmriprep to do, was also interested in contributing his (extensive) knowledge and experience to a community-oriented effort. Over the course of the last few months, several others joined the effort as well: Gari Lerma (Stanford), Derek Pisner (UT Austin), Erin Dickie and Michael Joseph (both at CAMH). We were able to pull in Jelle Veraart (NYU) into some of our discussions, to contribute from his expertise on the physics of dMRI and particularly on the way to mitigate noise and other artifacts in dMRI data processing. 
Jelle has thought a lot about a process for generating community consensus around dMRI processing (including a session at ISMRM devoted to starting this process), and we’d like to be part of this process, so his contribution is crucial. Ross Lawrence (JHU) has also more recently joined the effort, as part of the contributions that Joshua Vogelstein and his team are making to open source (Ross is part of Jovo’s team).

Distributed software development is challenging. It is hard to figure out who is doing what, and what the overall architecture should look like, even if we are following a well-worn template laid out by fmriprep. We had started doing bi-weekly telecons, but we needed an opportunity to get together in person and hammer out our process and coordinate our expectations. Luckily, I had some funding available from the Moore and Sloan Data Science Environments grant here at UW eScience that I could use to support travel and accommodation for a three-day code sprint. And so, on January 13th - 15th, we all congregated in Seattle (with the exception of Jelle, who couldn’t make it). The sprint gave us just the opportunity we needed to lay the groundwork for the library, in terms of development infrastructure (testing, documentation, continuous integration, etc.) and an excellent opportunity to have some in-depth discussions about the things we would like dmriprep to do for us. At the end of all this, we could even go so far as to write down a roadmap for future developments during this year. This lays the groundwork for the telecons that we will continue to have on a bi-weekly basis (and that are open to anyone to join…).

For me personally, three things stand out as highlights. The first important thing that I take from this sprint is how I might implement the philosophy of “release early and often” more seriously in my own work in other projects. For example, it took us several years to finally release a 0.1 of pyAFQ, but it really shouldn’t have. And if we adopt some of the approaches that are part of the genetic make-up of dmriprep, with its origins in fmriprep, we will be releasing more and hopefully leveraging this to make more rapid progress and detect/fix problems with the software more rapidly.

The second thing that I was excited about is an approach that Matt developed to correcting for head motion and potentially also for eddy currents. This approach has its roots (at least in my mind) in a 2012 paper by Amitay, Jones and Assaf. The idea is that we are limited in how we can register different volumes to each other because differences between volumes are due both to artifactual effects that we’d like to correct for (motion and eddy currents) and to systematic effects that we’d like to retain: different parts of the tissue lose signal because of the different gradients applied in different scans. This is particularly pernicious for high b-values (where a lot of the signal is lost) and for parts of the brain in which orientation changes gradually. The approach proposed by Amitay et al. is that a model of the diffusion in high b-values could be used to predict what the image should look like. This prediction is then used as a target for registration. This approach was subsequently popularized by Andersson and Sotiropoulos in FSL’s popular eddy tool (and their approach is also described in a paper). Their model of diffusion is slightly more complex than the CHARMED model used by Amitay et al., but it’s not clear that it more accurately represents the data. Matt has really run with this approach by using the well-motivated 3D SHORE model to fit and predict the data using a cross-validation approach (he calls it SHORELine). However, in line with the goals of qsiprep, this approach would be limited to multi-shell diffusion. For dmriprep, we need to expand this approach slightly to also work for single-shell data. So, we need a model that accurately predicts the data. Matt’s previous experiments suggest that DTI systematically fails in some places (this is well-understood as a consequence of complex fiber configurations that are not well-captured by DTI). On the other hand, CSD seems to overfit.
Luckily, during my postdoc, I developed a model that does exactly that: fits the data and predicts it really accurately, and fortunately this model is implemented in DIPY, including both fitting and prediction. One of the next stages in development (already prototyped by Derek here) uses this Sparse Fascicle Model as the predictive model in the heart of a SHORELine-like algorithm. To be continued!
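
To make the "predict, then register" idea a bit more concrete, here is a minimal NumPy sketch of just the prediction half: each volume is held out in turn, a model is fit to the remaining volumes, and the held-out volume is predicted from that fit. This is not SHORELine and not the dmriprep implementation. For brevity it uses a log-linear tensor (DTI) fit as a stand-in, even though DTI is known to fail in places, as noted above; in practice a better predictive model (such as the Sparse Fascicle Model in DIPY) would take its place. The function names and the synthetic single-voxel data are assumptions made for illustration, and the registration step itself is omitted.

```python
import numpy as np


def design_matrix(bvals, bvecs):
    """Log-linear tensor design matrix, one row per volume (illustrative)."""
    x, y, z = bvecs.T
    return np.column_stack([
        -bvals * x * x, -bvals * y * y, -bvals * z * z,
        -2 * bvals * x * y, -2 * bvals * x * z, -2 * bvals * y * z,
        np.ones_like(bvals),  # intercept: log of the non-diffusion-weighted signal
    ])


def leave_one_out_targets(signal, bvals, bvecs):
    """Predict every volume from a model fit to all the other volumes.

    signal has shape (n_voxels, n_volumes) and must be positive.
    The returned array has the same shape; column i is the prediction
    for volume i, which would serve as that volume's registration target.
    """
    X = design_matrix(bvals, bvecs)
    log_signal = np.log(signal)
    n_volumes = signal.shape[1]
    targets = np.empty_like(signal)
    for i in range(n_volumes):
        train = np.arange(n_volumes) != i  # hold out volume i
        coef, *_ = np.linalg.lstsq(X[train], log_signal[:, train].T, rcond=None)
        targets[:, i] = np.exp(X[i] @ coef)
    return targets


# Synthetic single-shell data for one voxel: two b=0 volumes and 30 directions.
rng = np.random.default_rng(0)
bvecs = rng.normal(size=(32, 3))
bvecs /= np.linalg.norm(bvecs, axis=1, keepdims=True)
bvals = np.concatenate([np.zeros(2), np.full(30, 1000.0)])
tensor = np.diag([1.7e-3, 0.3e-3, 0.3e-3])
adc = np.einsum("ij,jk,ik->i", bvecs, tensor, bvecs)
signal = (100.0 * np.exp(-bvals * adc))[None, :]

targets = leave_one_out_targets(signal, bvals, bvecs)
```

Because the held-out volume never enters the fit, its prediction is not pulled toward whatever motion or distortion that volume contains, which is what makes it a sensible registration target.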

Finally, the last take-home is my optimism about community-led projects that pool knowledge, talent and resources across disparate groups and institutions. Working together towards shared goals, when possible, makes a lot of sense. The potential to save duplicated effort and to produce outcomes that take into account more different use-cases is tantalizing. The challenges of bridging between different work cultures, different scientific goals and inclinations, and different incentive structures governing the contributions of individuals are non-trivial, but learning more about the patterns that facilitate productive and happy collaborations is a worthwhile endeavour in and of itself. Maybe, like families, all happy collaborations are alike in some essential way? Hopefully, that’s exactly where dmriprep is headed.

Releasing pyAFQ 0.1

2020-01-15T00:00:00+00:00

Just as the year started, and after several years of slow development, we finally released a version 0.1 of pyAFQ. The core functionality of this software is to segment tractography solutions into bundles of streamlines that represent major brain tracts, such as the corticospinal tract, which connects the primary motor cortex with the spinal cord, or the arcuate fasciculus, which connects the temporal lobe with the frontal lobe and plays an important role in language functions. The algorithm for bundle segmentation is based on previous work by Jason Yeatman that was published a while ago. A large part of the motivation for reimplementing these ideas in Python is that this allows us to take advantage of the developments we are making in DIPY. For example, it should allow us to use the large collection of diffusion MRI reconstruction algorithms implemented in DIPY. The core segmentation algorithm combines a probabilistic atlas with waypoint ROIs through which each streamline in a bundle must cross. It was implemented over a rather long period of time, and tested on some sample datasets, as well as on the Human Connectome Project dataset.

We had initially implemented a command-line interface that used argparse to parse input arguments, but there are many options that can be used in modeling the diffusion in each voxel, in using these models in tractography and in creating scalar values, and in the manner in which segmentation is executed (e.g., whether or not and how to clean a bundle after segmentation, to produce a more compact and coherent bundle that conforms with our idea of a tract). This made the CLI rather clunky. To deal with this, we now use an additional configuration file that customizes each one of these steps. This is a TOML file that is divided into sections based on the stages of the analysis, with a set of keyword arguments in each section. One nice thing is that introspection allows us to set defaults once in the executing function (for example, in a tracking function) and then use a simple introspection pattern (thank you, SO!) to read out these values and replace them only if they are defined in the configuration file. This should hopefully make a rather complicated API, with multiple steps of analysis and lots of knobs to twiddle, rather straightforward to control and run.
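
The introspection pattern described above can be sketched roughly as follows. This is an illustration of the general idea using the standard library's `inspect` module, not the actual pyAFQ code: the `track` function, its parameters, and the `TRACTOGRAPHY` section name are all hypothetical, and a plain dictionary stands in for the parsed TOML file.

```python
import inspect


def track(seeds="white_matter", step_size=0.5, max_angle=30.0):
    """Hypothetical tracking function; its defaults are defined here only."""
    return {"seeds": seeds, "step_size": step_size, "max_angle": max_angle}


def default_kwargs(func):
    """Read the keyword defaults out of a function's signature."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


def kwargs_from_config(func, section):
    """Start from the function's defaults and override only what the config sets."""
    kwargs = default_kwargs(func)
    unknown = set(section) - set(kwargs)
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"Unknown options for {func.__name__}: {sorted(unknown)}")
    kwargs.update(section)
    return kwargs


# Stand-in for a parsed TOML file: one section per stage of the analysis.
config = {"TRACTOGRAPHY": {"step_size": 0.25}}

result = track(**kwargs_from_config(track, config.get("TRACTOGRAPHY", {})))
```

Here only `step_size` is overridden, and the other two arguments keep the defaults written in the function signature, so those defaults never need to be duplicated in the configuration-handling code.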