<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://arokem.github.io/rokem-research/rokem-research/feed.xml" rel="self" type="application/atom+xml" /><link href="https://arokem.github.io/rokem-research/rokem-research/" rel="alternate" type="text/html" /><updated>2024-12-13T06:44:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/feed.xml</id><title type="html">Rokem Research</title><subtitle>arokem&apos;s r&amp;d</subtitle><entry><title type="html">Debugging workflow for R development</title><link href="https://arokem.github.io/rokem-research/rokem-research/2024/12/12/r-debugging.html" rel="alternate" type="text/html" title="Debugging workflow for R development" /><published>2024-12-12T00:00:00+00:00</published><updated>2024-12-12T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2024/12/12/r-debugging</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2024/12/12/r-debugging.html"><![CDATA[<p>Over the time that I have developed software with Python, I have grown to like
a rather specific workflow for discovering and addressing bugs. I like to think
of this workflow as fairly efficient. I might start with a few simple tests of
the code that I am writing, to make sure that I am on the right path, and that
the code runs through, producing reasonable outputs for some set of simple
inputs. I use <a href="https://docs.pytest.org/en/stable/"><code class="language-plaintext highlighter-rouge">pytest</code></a> as a test
harness, and so a typical development cycle might look like:</p>

<ul>
  <li>Start to develop some new code.</li>
  <li>Pretty early on, write some tests for this code (sometimes, but frankly rather rarely, I might even start with the tests, aka test-driven development or TDD).</li>
  <li>Iterate:
    <ul>
      <li>Run the tests (usually they will fail the first time through).</li>
      <li>Fix up the code</li>
      <li>Run the tests; stop when they pass.</li>
    </ul>
  </li>
  <li>Go back to the code and hone the ideas (yes, I often don’t really know what I need to do until I start doing it…)</li>
  <li>Iterate:
    <ul>
      <li>Run the tests.</li>
      <li>Fix up the code and/or the tests (at this point, I might discover the tests that I originally wrote are not the right ones).</li>
      <li>Run the tests; stop when they pass.</li>
    </ul>
  </li>
  <li>Etc.</li>
</ul>

<p>This is nice in the Python/pytest setting. The fixing-up part often requires
knowledge of the state of the program at various points. This is because at the
point where something needs fixing, I am usually facing some kind of bug in my
code, and without the ability to explore the state of the program (i.e., the values of variables within the program up and down the stack), it is sometimes
hard to know what’s wrong. Pytest enables this kind of examination through its
<code class="language-plaintext highlighter-rouge">--pdb</code> flag. I run the tests with this flag (using a <a href="https://github.com/tractometry/AFQ-Insight/pull/13">recent example</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pytest afqinsight/tests/test_parametric.py --pdb
</code></pre></div></div>

<p>When the program hits an error, it will drop me into good old
<a href="https://docs.python.org/3/library/pdb.html#debugger-commands">pdb, which is the Python debugging environment</a>.
If I am getting results that don’t make sense, I will often need to introduce
“breakpoints” into the program. That is, I will want to examine the program
at a point earlier than the one at which the error is raised. This is
because, for example, by the time an assertion error is raised in my tests, the
state of the program that produced the error is no longer inspectable. One
sneaky way to do this is to insert a 1/0 into the code at the point at which
you want to examine the state of the program. This will raise a
ZeroDivisionError, at which point the program will drop me into pdb, and I can
get to work (I believe I may have picked this one up from Matthew Brett, when I
was wet behind the ears). I have been doing this for so long that writing the
“appropriate” code for inserting a breakpoint (<code class="language-plaintext highlighter-rouge">import pdb; pdb.set_trace()</code>)
seems like far too many characters. By the way, this approach works just
as well if I am not doing development on the command line, but I am instead
working inside of a Jupyter notebook and I hit some results that don’t make
sense. This is thanks to the Jupyter <code class="language-plaintext highlighter-rouge">%debug</code> magic command, which also drops me
into the debugger at the point at which an error was raised. This is all good and
simple.</p>
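
<p>As a minimal sketch of the <code class="language-plaintext highlighter-rouge">1/0</code> trick (the function and test names below are made up for illustration and do not come from any real library), the commented-out line marks the spot where an error would be forced, so that running the tests with <code class="language-plaintext highlighter-rouge">--pdb</code> opens the debugger right there. Since Python 3.7 there is also the built-in <code class="language-plaintext highlighter-rouge">breakpoint()</code>, which by default does the same thing as the <code class="language-plaintext highlighter-rouge">pdb.set_trace()</code> incantation with fewer characters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def normalize(values):
    """Scale a list of numbers so that they sum to 1 (hypothetical example)."""
    total = sum(values)
    # 1/0  # uncomment to raise ZeroDivisionError here; `pytest --pdb` then opens pdb at this line
    # breakpoint()  # the built-in alternative (Python 3.7+), no import needed
    return [v / total for v in values]


def test_normalize():
    # Values chosen so that the expected result is exact in floating point
    assert normalize([1, 2, 1]) == [0.25, 0.5, 0.25]
</code></pre></div></div>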

<p>However, I more recently started developing software in R (<a href="https://github.com/tractometry/tractable">here</a>), and I have been hankering for a similar
workflow. Much of what I described above does work quite well. For example, R provides an excellent test harness in the <a href="https://testthat.r-lib.org/"><code class="language-plaintext highlighter-rouge">testthat</code></a> library, so I can replicate my quasi-TDD development cycle quite nicely
(adding in frequent visits to R documentation, Stack Overflow and Google to figure out how to do the most basic things in this language, because I am quite
clueless about it for now).</p>

<p>But when it comes to debugging, testthat seems to completely resist the kind
of facilities that I have been accustomed to in my Python development. For
example, there is no way (that I could figure out) to run testthat in a way
that would drop you into a debugging session when an error is raised (either in
the test code, or in the program that is exercised by the test), which would be
akin to the <code class="language-plaintext highlighter-rouge">--pdb</code> flag. So, for example, if I am working in the zsh shell
(because, yes, I am a cliche) and execute:</p>

<p><code class="language-plaintext highlighter-rouge">Rscript -e "devtools::test()"</code><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>Errors will get recorded and reported, but as far as I can tell, there is no
input to the <code class="language-plaintext highlighter-rouge">test</code> function that will make R stop running when it hits that
error and drop me into an interactive debugging session. If I introduce
breakpoints by adding <code class="language-plaintext highlighter-rouge">browser()</code> calls to the code,
testthat will happily report that a browser debugging session happened, but
will just keep going. So, how does one (especially one as inexperienced with R
as me) make any progress debugging their code?</p>

<h2 id="enter-positron">Enter Positron</h2>

<p>So here’s what I end up doing (for now?). I use the newly-launched
<a href="https://positron.posit.co/">Positron IDE</a>.
I would’ve preferred to be in VS Code (my usual text editor) for all this, but
fortunately, Positron is a fork of VS Code, so I am pretty close to where I want to be.
I can still use the terminal to run the tests and identify failures, repeating my
development cycle of choice, but in addition, I add to the top of each of my
test files (haven’t done it yet, to be completely honest, <a href="https://github.com/tractometry/tractable/pull/12">but it might happen soon</a>)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(testthat)
</code></pre></div></div>

<p>and prepend the necessary <code class="language-plaintext highlighter-rouge">tractable::</code> to all the calls to library code, so
that when the test file is run independently in a Positron session, the code
can be run and all the functions that are defined within the tests can properly
be found by R. Thus, I open up the test files and ask Positron to run them for
me, and when I do that, I can now hit the calls to <code class="language-plaintext highlighter-rouge">browser</code> and Positron will
conveniently drop me into a browser session, where I can haplessly bumble
through the variables defined in my code and try to make sense of things. For
now, that seems to solve the main problem for me, and I can get on with causing some
real damage. I will also add that the Makefile I have for this library has
a <code class="language-plaintext highlighter-rouge">make reinstall</code> rule, which runs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R CMD REMOVE tractable
R CMD INSTALL .
</code></pre></div></div>

<p>I do this because I can’t figure out how to install an R package in “editable” mode (i.e.,
the equivalent of the Python <code class="language-plaintext highlighter-rouge">pip install -e</code> flag). So whenever I make any
changes to the code, I can run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make reinstall
make test
</code></pre></div></div>

<p>to see what I’ve broken this time.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Natch, I actually run <a href="https://github.com/tractometry/tractable/blob/main/Makefile"><code class="language-plaintext highlighter-rouge">make test</code></a> because that’s too many characters to be typing into a terminal every time I want to test my code. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Over the time that I have developed software with Python, I have grown to like a rather specific workflow for discovering and addressing bugs. I like to think of this workflow as fairly efficient. I might start with a few simple tests of the code that I am writing, to make sure that I am on the right path, and that the code runs through, producing reasonable outputs for some set of simple outputs. I use pytest as a test harness, and so a typical development cycle might look like:]]></summary></entry><entry><title type="html">A typology of datatypes?</title><link href="https://arokem.github.io/rokem-research/rokem-research/2023/02/24/types-of-data.html" rel="alternate" type="text/html" title="A typology of datatypes?" /><published>2023-02-24T00:00:00+00:00</published><updated>2023-02-24T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2023/02/24/types-of-data</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2023/02/24/types-of-data.html"><![CDATA[<p>We are writing up a project that explored the potential benefits of using
neural network algorithms to model relationships between white matter tract
profiles and phenotypes. One of the main questions we are asking in this work
is whether NNs would be able to capitalize on the structure that exists in the
data and discover non-linear relationships between white matter tissue
properties and chronological age of the participants. It turned out that the
benefits in model accuracy were rather small (although we also know of <a href="https://www.biorxiv.org/content/10.1101/2023.01.17.524459v1">use cases</a> where the
benefits are much more dramatic).</p>

<p>When presenting this work at the eScience Institute core staff meeting a few
months ago, Anissa Tanweer raised the interesting question of whether we could
reason about the data in advance of the rather extensive empirical evaluation
that we did to tell us whether this data would benefit from the additional
flexibility and power of an NN algorithm. My current hunch is that this is not
possible, because it’s hard to discover a specific non-linear relationship in
high dimensional data without the non-linear model. However, I think that there
could be a way to reason about data that would bring us closer to this
interesting silver bullet. This could have to do with the fact that different
types of data have different characteristics. In the paper that we are writing,
we touch on the fact that tract profiles are derived from images, but they are
no longer images. In fact, when they are naively observed, they are tabular
data. But we also know that they have group structure, and that they contain
spatially contiguous data in neighboring nodes within a tract, but not across
tracts. In addition, there could be other kinds of structure. For example, some
tissue properties that we calculate could be correlated because of physical
relationships between them (for example, diffusion FA and estimates of axonal
water fraction are related to each other). Another example is that there are
known correlations between the same tract in the two hemispheres (a fact beautifully exploited by <a href="https://stanford.edu/~wandell/data/papers/2019-Replication-Generalization-Neuroimage.pdf">Lerma et al.</a>).</p>

<p>One way to analyze and discover this structure may be through unsupervised
learning approaches (linear and non-linear). If this analysis reveals the kind
of structure that would benefit from the strengths of the NN, then it might be
worth pursuing.</p>

<p>This also relates to a typology of different kinds of data that we might
encounter with more or less of this kind of structure. For example, things that
are more “image-like” vs. things that are more “table-like”. It also relates to
findings about the relative disadvantage of CNNs in analysis of large tabular
data, where <a href="https://arxiv.org/abs/2207.08815">other kinds of non-linear models may be better suited</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[We are writing up a project that explored the potential benefits of using neural network algorithms to model relationships between white matter tract profiles and phenotypes. One of the main questions we are asking in this work is whether NNs would be able to capitalize on the structure that exists in the data and discover non-linear relationships between white matter tissue properties and chronological age of the participants. It turned out that the benefits in model accuracy where rather small (although we also know of use cases where the benefits are much more dramatic).]]></summary></entry><entry><title type="html">pyAFQ 1.0 is out; some thoughts on documentation</title><link href="https://arokem.github.io/rokem-research/rokem-research/2023/02/11/documentation.html" rel="alternate" type="text/html" title="pyAFQ 1.0 is out; some thoughts on documentation" /><published>2023-02-11T00:00:00+00:00</published><updated>2023-02-11T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2023/02/11/documentation</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2023/02/11/documentation.html"><![CDATA[<p>It’s been a little while. One reason to write now is that we finally release a version 1.0 of our software <a href="https://yeatmanlab.github.io/pyAFQ/">pyAFQ</a>. This has been a long time in the making. This is a project that I started working on together with Jason Yeatman soon after first arriving at UW in 2015. 
However, it really only took off in earnest after John Kruper joined us in this effort in 2019, and with the <a href="https://projectreporter.nih.gov/project_info_details.cfm?aid=9886761&amp;icde=46874320&amp;ddparam=&amp;ddvalue=&amp;ddsub=&amp;cr=2&amp;csb=default&amp;cs=ASC&amp;pball=">funding</a> that we received in late 2019.</p>

<p>One of the things that I hope to do next is an overhaul of the documentation. I have noticed that we field a set of the same questions from collaborators and I am hopeful that rethinking the documentation will have an impact on our ability to collaborate with researchers from a variety of backgrounds who want to use the methods, and will also enable us to more effectively capture the knowledge that we accumulate through these interactions. I am excited about the framework that is laid out in <a href="https://documentation.divio.com/">divio’s documentation system</a>, and we are currently working to implement this with Qiqi Liang, who recently joined the group as a research assistant. This is currently work in progress in a <a href="https://github.com/yeatmanlab/pyAFQ/pull/948">PR</a> that will hopefully replace our current documentation before too long.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[It’s been a little while. One reason to write now is that we finally release a version 1.0 of our software pyAFQ. This has been a long time in the making. This is a project that I started working on together with Jason Yeatman soon after first arriving at UW in 2015. 
However, it really only took off in earnest after John Kruper joined us in this effort in 2019, and with the funding that we received in late 2019.]]></summary></entry><entry><title type="html">Setting up a scalable dask Jupyterhub (cargo-cult edition, as of 20210214)</title><link href="https://arokem.github.io/rokem-research/rokem-research/2021/02/14/daskhub.html" rel="alternate" type="text/html" title="Setting up a scalable dask Jupyterhub (cargo-cult edition, as of 20210214)" /><published>2021-02-14T00:00:00+00:00</published><updated>2021-02-14T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2021/02/14/daskhub</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2021/02/14/daskhub.html"><![CDATA[<p>Setting up a flexibly-configured Dask-powered Jupyterhub for the <a href="https://autofq.org">DIRECT</a> research group has been a goal of mine for a while. Back in 2019, I did some <a href="https://github.com/nrdg/hcp-pangeo-experiments">experiments</a> that demonstrated that we could use Dask as a really effective way to scale some of the computations that we do with neuroimaging data. These experiments were done on a Jupyterhub that I set up on AWS using a set of snakemake scripts that I got from Scott Henderson, who had originally created these as part of his work with the <a href="https://pangeo.io/">Pangeo project</a>. These worked well after only some minimal tweaking, but it turned out that there were issues with this setup, and we had to take down our Jupyterhub. I got distracted by other things and left that aside. Over the time that has passed since, I also worked with Erik Sundell, on another hub I had set up for another project (described <a href="https://arokem.github.io/2019-BRAINI-PanNeuro-slides/#/">here</a>). I had originally set up this hub on GCP, but due to changing needs of the project, and some administrative quirks that I will not go into, we decided to move our hub to AWS. 
When we did that, Erik moved our configuration from the one that I had originally designed, that fairly closely matched the Pangeo configuration as of early 2019, to one that is much more generic and more flexible (but as you’ll see, still leans heavily on pangeo). Recently, we got some <a href="https://escience.washington.edu/azure-cloud-credits-available/">cloud credits from Azure</a> for our DIRECT work, which gave me the opportunity and impetus to try this again. Over the time that had passed between me setting up the GCP hub and the move to AWS, another thing that happened is that work on Pangeo and other work on Dask had given rise to the very useful <a href="https://github.com/dask/helm-chart/tree/main/daskhub"><code class="language-plaintext highlighter-rouge">daskhub</code> helm chart</a> that can now be (in principle) used to deploy these hubs and configure them.</p>

<p>So, I set about creating an AKS cluster following the <a href="https://docs.dask.org/en/latest/setup/kubernetes-helm.html#helm-install-dask-for-mulitple-users">relevant Dask documentation</a>. The first step was to use <a href="https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/microsoft/step-zero-azure-autoscale.html">the Zero to JupyterHub documentation</a> to create an AKS cluster. The only thing that needed tweaking there was the Kubernetes version used in the <code class="language-plaintext highlighter-rouge">az aks create</code> call. I ended up with 1.19.6, based mostly on the fact that I have 1.19.3 installed locally. I also set this up with a minimum of 3 instances and a max of 100. That should be enough.</p>

<p>The Dask documentation is not very explicit about how to use multiple files for configuration (e.g., <code class="language-plaintext highlighter-rouge">config.yaml</code> vs. <code class="language-plaintext highlighter-rouge">secrets.yaml</code>). I think that I can split this up arbitrarily, but I am not 100% sure how to do that, so to be on the safe side, I just kept everything together in one file called <code class="language-plaintext highlighter-rouge">secrets.yaml</code>. After an initial install of the helm chart following the Dask documentation, I got the IP address for the cluster, and set up GitHub authentication. After doing that, I set up a custom Docker image and installed that into the hub (more about this image below). After progressively changing this file, I ended up with the following (<code class="language-plaintext highlighter-rouge">'xxx'</code> are my secrets):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">jupyterhub</span><span class="pi">:</span>
  <span class="na">singleuser</span><span class="pi">:</span>
    <span class="na">extraEnv</span><span class="pi">:</span>
      <span class="na">DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE</span><span class="pi">:</span> <span class="s1">'</span><span class="s">{JUPYTER_IMAGE_SPEC}'</span>
      <span class="na">DASK_DISTRIBUTED__DASHBOARD_LINK</span><span class="pi">:</span> <span class="s1">'</span><span class="s">/user/{JUPYTERHUB_USER}/proxy/{port}/status'</span>
      <span class="na">DASK_LABEXTENSION__FACTORY__MODULE</span><span class="pi">:</span> <span class="s1">'</span><span class="s">dask_gateway'</span>
      <span class="na">DASK_LABEXTENSION__FACTORY__CLASS</span><span class="pi">:</span> <span class="s1">'</span><span class="s">GatewayCluster'</span>
    <span class="na">image</span><span class="pi">:</span>
      <span class="na">name</span><span class="pi">:</span> <span class="s">ghcr.io/nrdg/autofq-daskhub</span>
      <span class="na">tag</span><span class="pi">:</span> <span class="s">0546ce5f054b</span>

  <span class="na">proxy</span><span class="pi">:</span>
    <span class="na">secretToken</span><span class="pi">:</span> <span class="s2">"</span><span class="s">xxx"</span>
  <span class="na">hub</span><span class="pi">:</span>
    <span class="na">networkPolicy</span><span class="pi">:</span>
      <span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>

    <span class="na">readinessProbe</span><span class="pi">:</span>
      <span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>
    <span class="na">services</span><span class="pi">:</span>
      <span class="na">dask-gateway</span><span class="pi">:</span>
        <span class="na">apiToken</span><span class="pi">:</span> <span class="s2">"</span><span class="s">xxx"</span>
    <span class="na">config</span><span class="pi">:</span>
      <span class="na">Authenticator</span><span class="pi">:</span>
        <span class="na">admin_users</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">arokem</span>
      <span class="na">GitHubOAuthenticator</span><span class="pi">:</span>
        <span class="na">client_id</span><span class="pi">:</span> <span class="s">xxx</span>
        <span class="na">client_secret</span><span class="pi">:</span> <span class="s">xxx</span>
        <span class="na">oauth_callback_url</span><span class="pi">:</span> <span class="s">http://xxx/hub/oauth_callback</span>
      <span class="na">JupyterHub</span><span class="pi">:</span>
        <span class="na">admin_access</span><span class="pi">:</span> <span class="no">true</span>
        <span class="na">authenticator_class</span><span class="pi">:</span> <span class="s">github</span>
  <span class="na">cull</span><span class="pi">:</span>
    <span class="na">enabled</span><span class="pi">:</span> <span class="no">false</span>

<span class="na">dask-gateway</span><span class="pi">:</span>
  <span class="na">gateway</span><span class="pi">:</span>
    <span class="na">auth</span><span class="pi">:</span>
      <span class="na">jupyterhub</span><span class="pi">:</span>
        <span class="na">apiToken</span><span class="pi">:</span> <span class="s2">"</span><span class="s">xxx"</span>
    <span class="na">extraConfig</span><span class="pi">:</span>
      <span class="na">optionHandler</span><span class="pi">:</span> <span class="pi">|</span>
            <span class="s">from dask_gateway_server.options import Options, Integer, Float, String</span>
            <span class="s">def option_handler(options):</span>
                <span class="s">if ":" not in options.image:</span>
                    <span class="s">raise ValueError("When specifying an image you must also provide a tag")</span>
                <span class="s">return {</span>
                    <span class="s">"image": options.image,</span>
                <span class="s">}</span>
            <span class="s">c.Backend.cluster_options = Options(</span>
                <span class="s">String("image", default="ghcr.io/nrdg/autofq-daskhub:0546ce5f054b", label="Image"),</span>
                <span class="s">handler=option_handler,</span>
            <span class="s">)</span>
  <span class="na">traefik</span><span class="pi">:</span>
    <span class="na">service</span><span class="pi">:</span>
      <span class="na">type</span><span class="pi">:</span> <span class="s">ClusterIP</span>  <span class="c1"># Access Dask-Gateway through JupyterHub.</span>

</code></pre></div></div>

<p>A few things to note about this:</p>

<ol>
  <li>I am cargo-culting Erik’s work <a href="https://github.com/learning-2-learn/l2lhub-deployment/tree/aws-2020/deployments/l2l">here</a>. For example, I am not sure whether the <code class="language-plaintext highlighter-rouge">traefik</code> section at the end or the <code class="language-plaintext highlighter-rouge">readinessProbe</code> section under <code class="language-plaintext highlighter-rouge">jupyterhub.hub</code> is necessary. They don’t seem to do any harm.</li>
  <li>The <code class="language-plaintext highlighter-rouge">networkPolicy.enabled == false</code> bit, however, is essential for GatewayCluster to work. This is based on Erik’s work on the l2l hub, and also on <a href="https://github.com/dask/helm-chart/issues/142">this</a> issue (I get a <code class="language-plaintext highlighter-rouge">ServerDisconnectedError</code> otherwise).</li>
  <li>For things to work smoothly, I had to change the machine type that Azure uses on the AKS nodepool to the beefier ‘Standard_D8s_v3’ (from Standard_D4s_v3 that was there originally, I believe). Otherwise, scaleup events (e.g., adding workers) can cause everything to come crashing down. I think it’s because scaleups require more memory than each one of these machines had.</li>
  <li>I don’t like it when Jupyterhub culls pods. In particular, it seems that the criterion for culling is not whether any computing is happening on the pod, but whether there is enough interaction. This can be very frustrating if you are trying to run long-running computations while doing something else (like, say, writing this).</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">ghcr.io/nrdg/autofq-daskhub</code> image configuration is maintained on <a href="https://github.com/nrdg/autofq-daskhub">this repository</a>.</p>

<p>A few important points about the Docker configuration:</p>

<ol>
  <li>I spent some time trying to inherit from an image higher up in the dependency chain (as suggested <a href="https://github.com/pangeo-data/pangeo-docker-images/issues/187">here</a>), but I ended up using the Dockerfile that Erik had written for the l2l hub instead.</li>
  <li>It intentionally does everything, without using any of the onbuild tricks that the pangeo stack uses. Again, this relies on Erik’s lead on the l2l image config.</li>
  <li>I still have some work to do to figure out how to install some tricky dependencies. In particular, some of our software relies on cvxpy for convex optimization methods, and that turns out to be a bit of a pain to install, even with conda (why oh why are there so many conflicts with conda? Is that a new thing?).</li>
  <li>One big gotcha with ghcr is that images are private by default and you have to explicitly make an image public for the hub to be able to pull it.</li>
</ol>

<p>The initial installation of the hub and upgrades are done via:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm upgrade <span class="nt">--wait</span> <span class="nt">--install</span> <span class="nt">--render-subchart-notes</span> <span class="se">\</span>
    dhub dask/daskhub <span class="se">\</span>
    <span class="nt">--version</span><span class="o">=</span>4.5.7 <span class="se">\</span>
    <span class="nt">--values</span><span class="o">=</span>secrets.yaml
</code></pre></div></div>

<p>The code I use to start a cluster and use it in notebooks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dask_gateway</span> <span class="kn">import</span> <span class="n">GatewayCluster</span>
<span class="n">gateway</span> <span class="o">=</span> <span class="n">GatewayCluster</span><span class="p">()</span>
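<span class="n">n_workers</span> <span class="o">=</span> <span class="mi">4</span>  <span class="c1"># example value; number of Dask workers to request</span>
<span class="n">gateway</span><span class="p">.</span><span class="n">scale</span><span class="p">(</span><span class="n">n_workers</span><span class="p">)</span>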
<span class="n">client</span> <span class="o">=</span> <span class="n">gateway</span><span class="p">.</span><span class="n">get_client</span><span class="p">()</span>
</code></pre></div></div>

<p>Things I still need to sort out:</p>

<ol>
  <li><del><a href="https://github.com/dask/helm-chart/issues/131">This issue</a> is still happening with this configuration. That is, using the dask labextension currently doesn’t work (and could confuse users, as it launches a <code class="language-plaintext highlighter-rouge">LocalCluster</code>). I can still get a view on the work that the cluster is doing by clicking on the provided dashboard link.</del></li>
  <li>The initialization of the <code class="language-plaintext highlighter-rouge">GatewayCluster</code> class instance is still a bit flaky. It looks like maybe it takes a while for the <code class="language-plaintext highlighter-rouge">dask-scheduler</code> pod to get started on a cold cluster, so I sometimes need to repeat this step a couple of times before it works.</li>
  <li>My current configuration has 2GB RAM per worker. That might be a little low for the kinds of things we’d like to do with this.</li>
  <li>It would be great to automate everything, so that changes to configuration immediately trigger an update to the hub. Right now, I have to make a PR for the docker image to build (at least that’s automated!), then merge the PR and wait for it to build, copy the hash of the new version of the image into my <code class="language-plaintext highlighter-rouge">secrets.yaml</code> file and run the upgrade on my laptop. I know it’s possible to have the CI/CD do all of this, but I would still need to piece it together. I realize that this will save more time in the long run. Sigh.</li>
</ol>
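
<p>For item 2 above, one stopgap is to wrap the flaky step in a retry loop rather than re-running the cell by hand. What follows is only a minimal sketch: the <code class="language-plaintext highlighter-rouge">retry</code> helper, its parameter names and the chosen delays are hypothetical, not something <code class="language-plaintext highlighter-rouge">dask_gateway</code> provides, and a real version should probably catch narrower exceptions.</p>

```python
import time


def retry(make, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call make() until it succeeds, waiting longer after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return make()
        except Exception as error:  # a real version would catch narrower errors
            last_error = error
            if attempt != attempts - 1:
                # Wait delay, then delay * backoff, then delay * backoff ** 2, ...
                sleep(delay * backoff ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Hypothetical usage on the hub (assumes dask_gateway is installed there):
#     from dask_gateway import GatewayCluster
#     gateway = retry(GatewayCluster, attempts=5, delay=10.0)
```

<p>This does not explain why the scheduler pod is slow to start on a cold cluster, it only hides the repetition.</p>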

<p>Finally, something a bit more meta. All this made me think of conversations that I was recently part of where people said things like “we are scientists, not cloud engineers!”, suggesting that large research collaborations should rely (exclusively?) on non-scientists with specialized knowledge to design, implement and maintain systems like this. I’d like to push against that notion just a bit. It’s true that I am relying on Erik and others with the specialized expertise as a starting point for this work, but I think that understanding how these systems and technologies work a bit better is relevant to scientific knowledge creation in many fields where this technology is going to be used. Just like I believe software engineering is. Or math. This doesn’t necessarily mean that we should drop everything and just focus on cloud engineering issues, but we should try to make sure that it has a place within our work. For me personally, one of my main learning points from all this was to more effectively use <code class="language-plaintext highlighter-rouge">kubectl describe pod</code> to debug issues that come up and try to interpret events that happened that caused issues to arise. There’s definitely more to learn.</p>

<p><strong>EDIT (20210215)</strong>: After first writing this post, I figured out some things that were originally wrong, so I fixed them in the config file that I posted here. I also discovered (maybe that shouldn’t have been a surprise?) that some issues that come up can be “fixed” by turning the damn thing off and then on again. In this case, I ran into some really puzzling behaviors with dask schedulers dying on me left and right, leaving me unable to run anything on the cluster (basically a persistent form of issue 2 above). Power cycling the Azure VMs seems to have resolved that issue.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Setting up a flexibly-configured Dask-powered Jupyterhub for the DIRECT research group has been a goal of mine for a while. Back in 2019, I did some experiments that demonstrated that we could use Dask as a really effective way to scale some of the computations that we do with neuroimaging data. These experiments were done on a Jupyterhub that I set up on AWS using a set of snakemake scripts that I got from Scott Henderson, who had originally created these as part of his work with the Pangeo project. These worked well after only some minimal tweaking, but it turned out that there were issues with this setup, and we had to take down our Jupyterhub. I got distracted by other things and left that aside. Over the time that has passed since, I also worked with Erik Sundell, on another hub I had set up for another project (described here). I had originally set up this hub on GCP, but due to changing needs of the project, and some administrative quirks that I will not go into, we decided to move our hub to AWS. When we did that, Erik moved our configuration from the one that I had originally designed, that fairly closely matched the Pangeo configuration as of early 2019, to one that is much more generic and more flexible (but as you’ll see, still leans heavily on pangeo). 
Recently, we got some cloud credits from Azure for our DIRECT work, which gave me the opportunity and impetus to try this again. Over the time that had passed between me setting up the GCP hub and the move to AWS, another thing that happened is that work on Pangeo and other work on Dask had given rise to the very useful daskhub helm chart that can now be (in principle) used to deploy these hubs and configure them.]]></summary></entry><entry><title type="html">Finding visual pathways</title><link href="https://arokem.github.io/rokem-research/rokem-research/2020/11/09/visual-pathways-wip.html" rel="alternate" type="text/html" title="Finding visual pathways" /><published>2020-11-09T00:00:00+00:00</published><updated>2020-11-09T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2020/11/09/visual-pathways-wip</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2020/11/09/visual-pathways-wip.html"><![CDATA[<p>Some recent work is focusing in on finding the primary visual pathways. This is motivated by a project in Jason Yeatman’s lab that is looking at these pathways, and particularly some exciting work by Sendy Caffara that focuses on comparisons between properties of these pathways and physiological responses to visual stimuli. It’s pretty amazing that you can compare properties of the brain at such different scales and find pretty tight correspondences. At any rate, Sendy has found a few ROIs that can be used to define endpoints and exclusion regions for streamlines in large tractographies (order 10M streamlines) that provide very nice optic radiations. David Bloom in my group has diligently worked to engineer a solution that makes pyAFQ even more flexible than before in defining custom bundles, so that we can integrate Sendy’s work into the software. Finally, I have been working on integrating this solution into the pyAFQ API and to write an example that shows how to find the visual pathways. 
After a few experiments, I think that this is doable. The key, I believe is to generate a lot of streamlines around the part of the brain that we are interested, massively oversampling this part of the brain. And then use the ROIs to refine down to only the OR. I think that we can come up with a combination of fast tractography, and ROI-based and streamline-based bundle segmentation, that will provide nice OR segmentations. But I’ll still need to demonstrate that.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Some recent work is focusing in on finding the primary visual pathways. This is motivated by a project in Jason Yeatman’s lab that is looking at these pathways, and particularly some exciting work by Sendy Caffara that focuses on comparisons between properties of these pathways and physiological responses to visual stimuli. It’s pretty amazing that you can compare properties of the brain at such different scales and find pretty tight correspondences. At any rate, Sendy has found a few ROIs that can be used to define endpoints and exclusion regions for streamlines in large tractographies (order 10M streamlines) that provide very nice optic radiations. David Bloom in my group has diligently worked to engineer a solution that makes pyAFQ even more flexible than before in defining custom bundles, so that we can integrate Sendy’s work into the software. Finally, I have been working on integrating this solution into the pyAFQ API and to write an example that shows how to find the visual pathways. After a few experiments, I think that this is doable. The key, I believe is to generate a lot of streamlines around the part of the brain that we are interested, massively oversampling this part of the brain. And then use the ROIs to refine down to only the OR. I think that we can come up with a combination of fast tractography, and ROI-based and streamline-based bundle segmentation, that will provide nice OR segmentations. 
But I’ll still need to demonstrate that.]]></summary></entry><entry><title type="html">Motion correction for dMRI</title><link href="https://arokem.github.io/rokem-research/rokem-research/2020/10/19/hmc.html" rel="alternate" type="text/html" title="Motion correction for dMRI" /><published>2020-10-19T00:00:00+00:00</published><updated>2020-10-19T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2020/10/19/hmc</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2020/10/19/hmc.html"><![CDATA[<p>One of the things that I have been thinking about in the last couple of weeks is head motion correction for diffusion MRI. A few things have come together: the first is that I have been interested in the idea originally proposed in <a href="https://onlinelibrary.wiley.com/doi/full/10.1002/mrm.23186">this paper</a> for a few years. The idea, which is now used in some of the most popular algorithms for motion and eddy current correction, is that direct registration of a diffusion-weighted image to a non-diffusion-weighted image (for example, to the b0 normalization image in a particular scan) is not a great idea, because diffusion-weighted images are supposed to look different from the non-diffusion-weighted images. In particular, in parts of the image that contain large bundles that curve around, slightly different diffusion-weighting directions should cause slight shifts in the location of the brightest pixels. If registered directly, this would cause all kinds of mis-registrations. The idea is instead to use the available data to create a model of the signal in each voxel and to use this model to predict what the image should look like in a particular volume (diffusion-weighting direction) and then register to the predicted image. An algorithm based on this idea would be useful in <a href="https://github.com/nipreps/dmriprep">dmriprep</a>, so I have been working to implement it. 
Another thing that should help with this is that my previous work demonstrated that a regularized sparse fascicle model provides very accurate predictions of diffusion data. Previously, we used the Elastic Net algorithm to fit this kind of model, with non-negativity constraints. But for motion correction, we don’t really care if the parameters are aphysical (i.e., negative weights in the fODF), so it makes sense to use ridge regression, as implemented in <a href="https://nrdg.github.io/fracridge/">fracridge</a>. In parallel, DIPY <a href="https://github.com/dipy/dipy/pull/2025">now</a> has a simplified API to registration, which should make implementation of the registration part much more straightforward. Taking all these bits and putting them all together is going to be non-trivial, though. An initial sketch of an implementation is <a href="https://github.com/nrdg/hmc">here</a>, but I should really move that into a PR on dmriprep soon.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[One of the things that I have been thinking about in the last couple of weeks is head motion correction for diffusion MRI. A few things have come together: the first is that I have been interested in the idea originally proposed in this paper for a few years. The idea, which is now used in some of the most popular algorithms for motion and eddy current correction, is that direct registration of a diffusion-weighted image to a non-diffusion-weighted image (for example, to the b0 normalization image in a particular scan) is not a great idea, because diffusion-weighted images are supposed to look different from the non-diffusion-weighted images. In particular, in parts of the image that contain large bundles that curve around, slightly different diffusion-weighting directions should cause slight shifts in the location of the brightest pixels. If registered directly, this would cause all kinds of mis-registrations. 
The idea is instead to use the available data to create a model of the signal in each voxel and to use this model to predict what the image should look like in a particular volume (diffusion-weighting direction) and then register to the predicted image. An algorithm based on this idea would be useful in dmriprep, so I have been working to implement it. Another thing that should help with this is that my previous work demonstrated that a regularized sparse fascicle model provides very accurate predictions of diffusion data. Previously, we used the Elastic Net algorithm to fit this kind of model, with non-negativity constraints. But for motion correction, we don’t really care if the parameters are aphysical (i.e., negative weights in the fODF), so it makes sense to use ridge regression, as implemented in fracridge. In parallel, DIPY now has a simplified API to registration, which should make implementation of the registration part much more straightforward. Taking all these bits and putting them all together is going to be non-trivial, though. 
An initial sketch of an implementation is here, but I should really move that into a PR on dmriprep soon.]]></summary></entry><entry><title type="html">Life support</title><link href="https://arokem.github.io/rokem-research/rokem-research/2020/08/24/life-support.html" rel="alternate" type="text/html" title="Life support" /><published>2020-08-24T00:00:00+00:00</published><updated>2020-08-24T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2020/08/24/life-support</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2020/08/24/life-support.html"><![CDATA[<p>Over the weekend, and inspired by a conversation we had on Friday in the weekly telecon of our data intensive research in connectomics (DIRIC) group, I started working on a new project that I call <a href="https://github.com/nrdg/life-support">life support</a> (because what I really need right now are new projects…). The idea is rather simple and is not mine (it is due to my postdoc advisor, Brian Wandell): if you have a tractography and you try to create tract profiles, you often find that some features of the tract profile are more affected by the environment that the streamlines are passing through than by the characteristics of that bundle. This can muddy the interpretation of the tract profile in all kinds of ways, because the features may be due to the size of a certain narrow passage, or due to changes in other crossing pathways. To deal with this, one way is to generate a bundle-specific signal, based on a model, such as LiFE and then generate the tract profile based on this predicted signal. The advantage of this approach is that it could effectively remove confounding effects. The <a href="https://github.com/nrdg/life-support/blob/master/2020-08-22-ls1.ipynb">notebook</a> I have in that repo is a first pass at doing this. 
For now, it’s not working as well as I’d hoped, but one initial observation is how easy it is to start prototyping something like this with the new version of pyAFQ.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Over the weekend, and inspired by a conversation we had on Friday in the weekly telecon of our data intensive research in connectomics (DIRIC) group, I started working on a new project that I call life support (because what I really need right now are new projects…). The idea is rather simple and is not mine (it is due to my postdoc advisor, Brian Wandell): if you have a tractography and you try to create tract profiles, you often find that some features of the tract profile are more affected by the environment that the streamlines are passing through than by the characteristics of that bundle. This can muddy the interpretation of the tract profile in all kinds of ways, because the features may be due to the size of a certain narrow passage, or due to changes in other crossing pathways. To deal with this, one way is to generate a bundle-specific signal, based on a model, such as LiFE and then generate the tract profile based on this predicted signal. The advantage of this approach is that it could effectively remove confounding effects. The notebook I have in that repo is a first pass at doing this. 
For now, it’s not working as well as I’d hoped, but one initial observation is how easy it is to start prototyping something like this with the new version of pyAFQ.]]></summary></entry><entry><title type="html">NWB+HDF5+ZARR+Dask</title><link href="https://arokem.github.io/rokem-research/rokem-research/2020/08/17/nwb-hdf5-zarr-dask.html" rel="alternate" type="text/html" title="NWB+HDF5+ZARR+Dask" /><published>2020-08-17T00:00:00+00:00</published><updated>2020-08-17T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2020/08/17/nwb-hdf5-zarr-dask</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2020/08/17/nwb-hdf5-zarr-dask.html"><![CDATA[<p>This was not meant to be a usual year of research, after all. Sigh. 
Trying to pick this back up, though.</p>

<p>One of the things that I have been working on in the past week or so is 
a new and rather exciting development, spearheaded by Ben Dichter and 
Daniel Sotoude. Based on <a href="https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314">previous work</a> in the geoscience community, they have been working on enabling reading of NWB files stored as HDF5 into ZARR: https://github.com/catalystneuro/HDF5Zarr. 
This means that a large NWB file containing multi-channel recordings of neurophysiology can be stored in 
cloud storage and through a combination of gcsfs and their work read into a ZARR accessible on a 
jupyterhub node.</p>

<p>My first set of experiments with this approach uses a 70 GB neurophysiology recording, provided by Yoni Browning and Beth Buffalo. 
The details of their experimental setup and what exactly is in these time-series are interesting, but not so important for 
what follows.</p>

<p>The first thing that can be done in this setup is to look at the data. All of it. Even though the hub is running on a 
machine with only 13 GB of RAM, we can access it through ZARR/gcsfs at near-interactive speed. For example, 
I implemented the following widget:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np

chan = widgets.SelectMultiple(
    options=np.arange(arr.shape[1]),
    value=[0],
    description='Channels:',
    disabled=False
    
)

time = widgets.IntSlider(
    value=0,
    min=0,
    max=arr.shape[-1]-5000,
    step=1,
    description='Starting timepoint:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)

time_d = widgets.IntSlider(
    value=5000,
    min=0,
    max=10000,
    step=1,
    description='Window duration',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)



def f(time, time_d=5000, chan=(0,)):
    # Plot the selected channels over the window [time, time + time_d)
    fig, ax = plt.subplots()
    for c in chan:
        if c &lt; 124:
            ax.plot(arr[c, time:time+time_d], label=f"Channel {c}")
    ax.legend()
    ax.set_xlabel("Time (samples)")
    ax.set_ylabel("Voltage")
    plt.show()

out = widgets.interactive_output(f, {'chan': chan, 'time': time, 'time_d': time_d})

widgets.HBox([widgets.VBox([chan, time, time_d]), out])

</code></pre></div></div>
<p>This is not painful: I can select a channel, or even several channels and see their time-series appear within a very short time. In fact, 
I suspect that most of that time is Matplotlib dealing with the rendering of all this data locally. The key seems to be that data access patterns matter: data should be addressed along the chunks stored in the zarr. That is, <code class="language-plaintext highlighter-rouge">arr[chan, x:x+delta]</code> is good, but <code class="language-plaintext highlighter-rouge">arr[chan1:chan2, x:x+delta]</code> can be disastrous. But I need to experiment a bit more.</p>
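
<p>To make the access-pattern point concrete, here is a small sketch that counts how many stored chunks a rectangular selection overlaps. The chunk layout used (one channel by 100,000 samples) is an assumption for illustration, not the layout of the actual file, and <code class="language-plaintext highlighter-rouge">chunks_touched</code> is a hypothetical helper rather than part of zarr.</p>

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks that a rectangular selection overlaps.

    selection: one (start, stop) pair per axis, stop exclusive.
    chunk_shape: chunk length along each axis.
    """
    total = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        total *= last - first + 1
    return total


# Hypothetical layout: each chunk holds 1 channel by 100,000 samples.
chunk_shape = (1, 100_000)

# One channel, a 5,000-sample window: a single chunk is fetched.
one_channel = chunks_touched([(3, 4), (20_000, 25_000)], chunk_shape)

# 64 channels over the same window: 64 separate chunks are fetched.
many_channels = chunks_touched([(0, 64), (20_000, 25_000)], chunk_shape)

print(one_channel, many_channels)
```

<p>Under that assumed layout, every extra channel in a slice costs another chunk read from cloud storage, which would be consistent with multi-channel slices being so much slower.</p>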

<p>Second, by attaching this hub to a Dask kube cluster, we can process the data. For example, if we want to do a spectral decomposition using wavelets, we can write something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
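import numpy as np
from functools import partial
from mne.time_frequency import tfr_array_morlet

# Assumes `arr` is a Dask array shaped (n_epochs, n_channels, n_times), the
# layout tfr_array_morlet expects, and `fs` is the sampling rate in Hz.
freqs = np.logspace(1, 2, 20)  # 20 frequencies from 10 to 100 Hz
my_morlet = partial(tfr_array_morlet, sfreq=fs, freqs=freqs)
# new_axis=2 accounts for the frequency axis added to each block's output
morlet_arr = arr.map_blocks(my_morlet, dtype=np.complex128, new_axis=2)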

</code></pre></div></div>

<p>This generates the Dask array that could compute the Morlet wavelet decomposition for the entire dataset, for 
20 frequency bands. This would result in a rather large variable: about 0.5 TB here. But this can still be 
viewed at near interactive speed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chan = widgets.Select(
    options=np.arange(arr.shape[1]),
    value=0,
    description='Channel:',
    disabled=False
    
)

time = widgets.IntSlider(
    value=0,
    min=0,
    max=arr.shape[-1]-500,
    step=1,
    description='Starting timepoint:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)

time_d = widgets.IntSlider(
    value=50,
    min=0,
    max=1000,
    step=1,
    description='Window duration',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
)



def f(time, time_d=5000, chan=0):
    fig, ax = plt.subplots()
    # Only this slice of the lazy Dask array is computed when the widget updates
    result = np.abs(morlet_arr[0, chan, :, time:time+time_d].compute())
    ax.matshow(result)
    ax.set_xlabel("Time (samples)")
    ax.set_ylabel("Frequency bin")
    plt.show()

out = widgets.interactive_output(f, {'chan': chan, 'time': time, 'time_d': time_d})

widgets.HBox([widgets.VBox([chan, time, time_d]), out])
</code></pre></div></div>

<p>Again, this is not painful, although it can only show one channel at a time and does this with a 
small, maybe one second, delay.</p>

<p>But this doesn’t do everything we’d like to do. For example, we want to normalize to z-score in 
every frequency band in every channel. I think that this could be done by creating a new array:</p>

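<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import dask.array as da

# keepdims=True keeps a length-1 time axis so the statistics broadcast against morlet_arr
morlet_mean = da.mean(morlet_arr, axis=-1, keepdims=True)
morlet_std = da.std(morlet_arr, axis=-1, keepdims=True)
zscored = (morlet_arr[:, 0, :, :] - morlet_mean[:, 0, :, :]) / morlet_std[:, 0, :, :]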
</code></pre></div></div>
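
<p>The mean and standard deviation for one channel and one frequency band have to be accumulated over every chunk along the time axis. As a rough, pure-Python sketch of that kind of reduction (not a description of how Dask implements it): each chunk contributes a count, a sum and a sum of squares, and these are merged at the end. The toy values and the helper names <code class="language-plaintext highlighter-rouge">chunk_stats</code> and <code class="language-plaintext highlighter-rouge">combine</code> are hypothetical, and the values are assumed to be real-valued magnitudes rather than the complex coefficients themselves.</p>

```python
import math


def chunk_stats(values):
    """Return (count, sum, sum of squares) for one chunk of samples."""
    return len(values), sum(values), sum(v * v for v in values)


def combine(stats):
    """Merge per-chunk (count, sum, sum of squares) into a mean and std."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    # Population variance (ddof=0); clamp tiny negative rounding errors to 0
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(max(variance, 0.0))


# Toy time-series for one channel and one frequency band, split into chunks.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]]
mean, std = combine([chunk_stats(c) for c in chunks])

# Second pass: z-score each chunk using the global statistics.
zscored = [[(v - mean) / std for v in c] for c in chunks]
```

<p>The sum-of-squares formula is easy to read but can lose precision on long recordings, so this is only meant to show why a full pass over the data is needed before any value can be z-scored.</p>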

<p>Calculating the z-scored value here is challenging, because this would require aggregating across all of the channel 
data. And this is where Dask should help. I haven’t done this entire computation yet, but 
calculating the mean for the channel is about 10 minutes of computing. We do also need to worry about the 
edges of each block when doing this computation, so one of the next set of experiments to do would use 
the Dask <code class="language-plaintext highlighter-rouge">arr.map_overlap</code> method and a more elaborate Morlet function that also windows each chunk to 
combine with neighboring chunks.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This was not meant to be a usual year of research, after all. Sigh. Trying to pick this back up, though.]]></summary></entry><entry><title type="html">Paper about ML for glaucoma</title><link href="https://arokem.github.io/rokem-research/rokem-research/2020/02/18/glaucoma-ml.html" rel="alternate" type="text/html" title="Paper about ML for glaucoma" /><published>2020-02-18T00:00:00+00:00</published><updated>2020-02-18T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2020/02/18/glaucoma-ml</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2020/02/18/glaucoma-ml.html"><![CDATA[<p>We’re almost almost ready to submit a paper about an automated algorithm for
detection of glaucoma based on data from the UK Biobank. This paper is the
result of a lot of work by Parmita Mehta, who is a PhD student in Computer
Science and a long-time collaborator, and a continuation of a long series of
collaborations with Aaron Lee. I will not say here too much about the results,
except to say that giving a talk about these results last week at a local
seminar (slides <a href="https://arokem.github.io/2020-02-12-UWIN/#/">here</a>) made me
think about the value of such work. There is the (obvious?) value of
demonstrating the potential utility of such algorithms. And it’s (vaguely?)
possible that some variant of this algorithm will find its way into clinical
application one day. But for now, I think that one important and potentially
quite fruitful outcome of this kind of work is in considering the interplay
between “brute force” machine learning, that aims to find the most accurate
representation of the data for predictive accuracy, and an interpretive
methodology that tries to pick apart the results of an accurate algorithm, to
derive some insights. Here, the interpretive methodology takes three different
forms: the first is pixel-by-pixel allocation of credit in deep learning
algorithms. The spatial maps provided by such an analysis can be quite
compelling. Another approach uses sub-sampling of the data, to ask what
information is provided by different parts of the data. This kind of approach is
then also further formalized in using SHAP values. One thing that became clear
in giving a talk about this work is that it would be worth coming up with a
simple and intuitive explanation of how SHAPs work. But even when these values
and maps are derived there is still often a challenge to synthesize what it is
that the algorithm is telling us. So this interplay is further complemented by
<em>a lot</em> of domain knowledge. In this case, the knowledge is derived from close
collaboration with ophthalmologists (primarily Aaron and also Christine
Petersen, who have been working with us closely on the manuscript) and from
reading the literature. Here, there is a long line of literature on the effects of
glaucoma on different parts of the retina (whoa, eyes are complicated…).</p>

<p>One cool potential conclusion of the work, reiterating previous results that
we’ve found in at least one more
<a href="https://www.nature.com/articles/s41598-019-42042-y">case</a> is that deep learning
algorithms can be sensitive to information that is “hidden in plain sight” in
features of an image that are very subtle and would be hard to extract in a
top-down manner, just based on what we think that an image represents. This
allows the algorithm to point to parts of the retina that would not have been
considered useful for a diagnosis, and in which you might think no information
should be present based on the standard analysis of the images. This is a rather
interesting and important conclusion, as the data-driven approach to analysis of
images becomes much more central.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[We’re almost almost ready to submit a paper about an automated algorithm for detection of glaucoma based on data from the UK Biobank. This paper is the result of a lot of work by Parmita Mehta, who is a PhD student in Computer Science and a long-time collaborator, and a continuation of a long series of collaborations with Aaron Lee. I will not say here too much about the results, except to say that giving a talk about these results last week at a local seminar (slides here) made me think about the value of such work. There is the (obvious?) value of demonstrating the potential utility of such algorithms. And it’s (vaguely?) possible that some variant of this algorithm will find its way into clinical application one day. But for now, I think that one important and potentially quite fruitful outcome of this kind of work is in considering the interplay between “brute force” machine learning, that aims to find the most accurate representation of the data for predictive accuracy, and an interpretive methodology that tries to pick apart the results of an accurate algorithm, to derive some insights. Here, the interpretive methodology takes three different forms: the first is pixel-by-pixel allocation of credit in deep learning algorithms. The spatial maps provided by such an analysis can be quite compelling. Another approach uses sub-sampling of the data, to ask what information is provided by different parts of the data. This kind of approach is then also further formalized in using SHAP values. One thing that became clear in giving a talk about this work is that it would be worth coming up with a simple and intuitive explanation of how SHAPs work. But even when these values and maps are derived there is still often a challenge to synthesize what it is that the algorithm is telling us. So this interplay is further complemented by a lot of domain knowledge. 
In this case, the knowledge is derived from close collaboration with ophthalmologists (primarily Aaron and also Christine Petersen, who have been working with us closely on the manuscript) and from reading the literature. Here, a long line of literature on the effects of glaucoma on different parts of the retina (whoa, eyes are complicated…).]]></summary></entry><entry><title type="html">dmriprep development sprint</title><link href="https://arokem.github.io/rokem-research/rokem-research/2020/01/23/dmriprep-sprint.html" rel="alternate" type="text/html" title="dmriprep development sprint" /><published>2020-01-23T00:00:00+00:00</published><updated>2020-01-23T00:00:00+00:00</updated><id>https://arokem.github.io/rokem-research/rokem-research/2020/01/23/dmriprep-sprint</id><content type="html" xml:base="https://arokem.github.io/rokem-research/rokem-research/2020/01/23/dmriprep-sprint.html"><![CDATA[<p>Reliable, robust and efficient preprocessing of MRI data is hard. So many things
can go wrong. Building a general-purpose pipeline for preprocessing also faces
the challenge that even for just one type of data (e.g., dMRI) there are
multiple variations on the manner in which the data can be collected (for
example, are multiple gradient strengths collected in each scan, or in separate
scans? Are fieldmaps collected for susceptibility distortion correction, or b0
scans with reverse phase encode directions? Etc.).
<a href="https://fmriprep.readthedocs.io/en/stable/"><code class="language-plaintext highlighter-rouge">fmriprep</code></a> provides an excellent
template to follow as an example of a robust, general-purpose
pipeline for preprocessing of data collected in many different kinds of fMRI
experiments. And so, for a while now, we have been thinking about a <code class="language-plaintext highlighter-rouge">dmriprep</code>
that would emulate the success of <code class="language-plaintext highlighter-rouge">fmriprep</code> for the dMRI community. Initially,
this was a local effort (with Adam Richie-Halford and Anisha Keshavan at the
helm), but after presenting our initial work on this at OHBM, very quickly we
were able to get together other members of the community: Oscar Esteban (at
Stanford), who is the lead developer of <code class="language-plaintext highlighter-rouge">fmriprep</code>, was already interested in
expanding <code class="language-plaintext highlighter-rouge">fmriprep</code> to an ecosystem of
<a href="https://github.com/nipreps"><code class="language-plaintext highlighter-rouge">niprep</code> tools</a> and wanted to make sure that we did it
right. Matt Cieslak (Penn), who has in the meanwhile created
<a href="https://qsiprep.readthedocs.io/en/latest/"><code class="language-plaintext highlighter-rouge">qsiprep</code></a>, which does a lot of what
we might want a <code class="language-plaintext highlighter-rouge">dmriprep</code> to do, was also interested in contributing his
(extensive) knowledge and experience to a community-oriented effort. Over the
course of the last few months, several others joined the effort as well: Gari Lerma (Stanford), Derek
Pisner (UT Austin), Erin Dickie and Michael Joseph (both at CAMH). We were able
to pull Jelle Veraart (NYU) into some of our discussions, to contribute from
his expertise on the physics of dMRI and particularly on the way to mitigate
noise and other artifacts in dMRI data processing. Jelle has thought a lot
about a process for generating community consensus around dMRI processing
(including a session at ISMRM devoted to starting this process), and we’d
like to be part of this process, so his contribution is crucial.
Ross Lawrence (JHU) has also more recently joined the effort, as part of the
contributions that Joshua Vogelstein and his team are making to open source
(Ross is part of Jovo’s team).</p>

<p>Distributed software development is challenging. It is hard to figure out who is
doing what, and what the overall architecture should look like, even if we are
following a well-worn template laid out by <code class="language-plaintext highlighter-rouge">fmriprep</code>. We had started doing
bi-weekly telecons, but we needed an opportunity to get together in person and
hammer out our process and coordinate our expectations. Luckily, I had some
funding available from the
<a href="http://msdse.org/">Moore and Sloan Data Science Environments</a> grant here at
UW eScience that I could use to support travel and accommodation for a three-day
code sprint. And so, on January 13th - 15th, we all congregated in Seattle
(with the exception of Jelle, who couldn’t make it). The sprint gave us just
the opportunity we needed to lay the groundwork for the library, in terms of
development infrastructure (testing, documentation, continuous integration,
etc.) and an excellent opportunity to have some in-depth discussions about the
things we would like <code class="language-plaintext highlighter-rouge">dmriprep</code> to do for us. At the end of all this, we could
even go so far as to write down a
<a href="https://nipreps.github.io/dmriprep/roadmap.html">roadmap</a> for future developments
during this year. This lays the groundwork for the telecons that we will continue
to have on a bi-weekly basis (and that are open to anyone to join…).</p>

<p>For me personally, three things stand out as highlights. The first important thing
that I take from this sprint is how I might implement the philosophy of “release
early and often” more seriously in my own work in other projects. For example,
it took us several years to finally <a href="https://arokem.github.io/rokem-research/2020/01/15/pyAFQ-01.html">release a 0.1 of
pyAFQ</a>, but it
really shouldn’t have. And if we adopt some of the approaches that are part of
the genetic make-up of <code class="language-plaintext highlighter-rouge">dmriprep</code>, with its origins in <code class="language-plaintext highlighter-rouge">fmriprep</code>, we will be
releasing more often and hopefully leveraging this to make more rapid progress and
detect/fix problems with the software more rapidly.</p>

<p>The second thing that I was excited about is an approach that Matt developed to
correcting for head motion and potentially also for eddy currents. This approach
has its roots (at least in my mind) in a
<a href="https://onlinelibrary.wiley.com/doi/full/10.1002/mrm.23186">2012 paper</a> by Amitay,
Jones and Assaf. The idea is that we are limited in how well we can register
different volumes to each other, because differences between volumes are due
both to artifacts that we’d like to correct for (motion and eddy currents) and
to systematic effects that we’d like to retain: different parts of the tissue
lose signal because of the different gradients applied in
different scans. This is particularly pernicious for high b-values (where a lot
of the signal is lost) and for parts of the brain in which orientation changes
gradually. The approach proposed by Amitay et al. is that a model of the
diffusion in high b-values could be used to predict what the image should look
like. This prediction is then used as a target for registration. This approach
was subsequently popularized by Andersson and Sotiropoulos in FSL’s widely used
<code class="language-plaintext highlighter-rouge">eddy</code> tool (and their approach is also described in a
<a href="https://www.sciencedirect.com/science/article/pii/S1053811915009209">paper</a>).
Their model of diffusion is slightly more complex than the CHARMED model used by
Amitay et al., but it’s not clear that it more accurately represents the data.
Matt has really run with this approach by using the well-motivated 3D SHORE
model to fit and predict the data using a cross-validation approach (he calls it
<a href="https://github.com/mattcieslak/ohbm_shoreline/blob/master/cieslakOHBM2019.pdf">SHORELine</a>).
However, in line with the goals of <code class="language-plaintext highlighter-rouge">qsiprep</code>, this approach would be limited to
multi-shell diffusion. For <code class="language-plaintext highlighter-rouge">dmriprep</code>, we need to expand this approach slightly
to also work for single-shell data. So, we need a model that accurately predicts
the data. Matt’s previous experiments suggest that DTI systematically fails in
some places (this is well-understood as a consequence of complex fiber
configurations that are not well-captured by DTI). On the other hand, CSD seems
to overfit. Luckily, during my postdoc, I developed a model that does exactly
that: <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123272">fits the data and predicts it really accurately</a>, and fortunately this model
is implemented in DIPY, including both fitting and prediction. One of the next
stages in development (already prototyped by Derek <a href="https://github.com/nipreps/dmriprep/pull/62">here</a>) uses this Sparse Fascicle Model as the predictive model in the heart of a SHORELine-like algorithm. To be continued!</p>
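
<p>To make the predict-then-register idea a bit more concrete, here is a minimal sketch of the leave-one-out step, using only NumPy. It is not the SHORELine or <code class="language-plaintext highlighter-rouge">dmriprep</code> implementation: the function names (<code class="language-plaintext highlighter-rouge">design_matrix</code>, <code class="language-plaintext highlighter-rouge">leave_one_out_targets</code>) are made up for illustration, and a log-linear diffusion tensor fit stands in for 3D SHORE or the Sparse Fascicle Model purely because it fits in a few lines. The registration itself is not shown; the sketch only produces the predicted image that each volume would be registered to.</p>

```python
import numpy as np


def design_matrix(bvals, bvecs):
    """Log-linear tensor design matrix: one row per volume, 7 columns."""
    x, y, z = bvecs.T
    return np.column_stack([
        np.ones_like(bvals),                      # log(S0)
        -bvals * x * x, -bvals * y * y, -bvals * z * z,
        -2 * bvals * x * y, -2 * bvals * x * z, -2 * bvals * y * z,
    ])


def leave_one_out_targets(data, bvals, bvecs):
    """Predict every volume from a model fit to all the *other* volumes.

    data  : (n_voxels, n_volumes) array of positive signal values
    bvals : (n_volumes,) float array of b-values
    bvecs : (n_volumes, 3) array of unit gradient directions

    Assumes that dropping any single volume still leaves the fit
    well-determined (for example, at least two b=0 volumes).
    """
    X = design_matrix(bvals, bvecs)
    log_data = np.log(data)
    targets = np.empty(data.shape, dtype=float)
    for i in range(data.shape[1]):
        keep = np.arange(data.shape[1]) != i
        # Fit on the held-in volumes only, for all voxels at once.
        coefs, *_ = np.linalg.lstsq(X[keep], log_data[:, keep].T, rcond=None)
        # Predict the held-out volume; this is its registration target.
        targets[:, i] = np.exp(X[i] @ coefs)
    return targets
```

<p>In a real pipeline, each measured volume would then be registered to its predicted counterpart, and the fit-predict-register loop might be repeated a few times. Swapping in a model that predicts held-out data more accurately than the tensor is the whole point of the approach described above.</p>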

<p>Finally, the last take-home is my optimism about community-led projects that
pool knowledge, talent and resources across disparate groups and institutions.
Working together towards shared goals, when possible, makes a lot of sense. The
potential to save duplicated effort and to produce outcomes that take into
account more different use-cases is tantalizing. The challenges of bridging
between different work cultures, different scientific goals and inclinations, as
well as between the incentive structure governing the contributions of
individuals are non-trivial, but learning more about the patterns of
collaboration that facilitate productive and happy collaborations is a
worthwhile endeavour in and of itself. Maybe, like families, all happy
collaborations are alike in some essential way? Hopefully, that’s exactly
where <code class="language-plaintext highlighter-rouge">dmriprep</code> is headed.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Reliable, robust and efficient preprocessing of MRI data is hard. So many things can go wrong. Building a general-purpose pipeline for preprocessing also faces the challenge that even for just one type of data (e.g., dMRI) there are multiple variations on the manner in which the data can be collected (for example, are multiple gradient strengths collected in each scan, or in separate scans? Are fieldmaps collected for susceptibility distortion correction, or b0 scans with reverse phase encode directions? Etc.). fmriprep provides an excellent template to follow as an example of a robust, general-purpose pipeline for preprocessing of data collected in many different kinds of fMRI experiments. And so, for a while now, we have been thinking about a dmriprep that would emulate the success of fmriprep for the dMRI community. Initially, this was a local effort (with Adam Richie-Halford and Anisha Keshavan at the helm), but after presenting our initial work on this at OHBM, very quickly we were able to get together other members of the community: Oscar Esteban (at Stanford), who is the lead developer of fmriprep, was already interested in expanding fmriprep to an ecosystem of niprep tools and wanted to make sure that we did it right. Matt Cieslak (Penn), who has in the meanwhile created qsiprep, which does a lot of what we might want a dmriprep to do, was also interested in contributing his (extensive) knowledge and experience to a community-oriented effort. Over the course of the last few months, several others joined the effort as well: Gari Lerma (Stanford), Derek Pisner (UT Austin), Erin Dickie and Michael Joseph (both at CAMH). 
We were able to pull in Jelle Veraart (NYU) into some of our discussions, to contribute from his expertise on the physics of dMRI and particularly on the way to mitigate noise and other artifacts in dMRI data processing. Jelle has thought a lot about a process for generating community consensus around dMRI processing (including a session at ISMRM devoted to starting this process), and we’d like to be part of this process, so his contribution is crucial. Ross Lawrence (JHU) has also more recently joined the effort, as part of the contributions that Joshua Vogelstein and his team are making to open source (Ross is part of Jovo’s team).]]></summary></entry></feed>