Indiana University Psychology Department
November 5, 2015

Data Science meets Neuroscience at the University

Ariel Rokem, University of Washington eScience Institute

Follow along at http://arokem.github.io/2015-11-05-IU/

Outline

Data Science is an interdisciplinary solution to many of the problems facing modern-day research

An example from neuroimaging

Data Science comes to the University

All research is becoming data-intensive research
Including neuroimaging...
Van Horn and Toga (2014)

The fourth paradigm of science

1. Empirical (experimental)

2. Theoretical (mathematical)

3. Simulation (computational)

4. Data-intensive (eScience)


Jim Gray

The eScience Institute
Our mission: "All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data... In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields"
DSE sponsors

Data Science?

In a variety of fields

Data is enabling new ways of doing things

But data also poses challenges

The 4 V's: Volume, Velocity, Variety, Veracity/Validity

Michael Bloomberg: made his wealth wrangling financial data.

As mayor, started 311, and made NYC data openly available.

Data Science?


Data science
Drew Conway

Data Science?

Statistics and machine learning

Programming and software engineering

Data management

Data visualization and communication

A focus on reproducibility and openness

Neuroimaging and Data Science

Normal behavior is supported by brain connectivity

Image from Catani and ffytche (2015)

Not just passive cables

Brain connections change with development

Individual differences account for differences in behaviour

Adapt with learning

This has clinical significance

Diffusion MRI

Isotropic diffusion

Diffusion MRI

Anisotropic diffusion

Diffusion MRI

Modeling diffusion

Basser, Mattiello and Le Bihan (1994)
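
The tensor model describes the signal measured along a unit gradient direction g with diffusion weighting b as S = S0 * exp(-b * g^T D g), where D is a symmetric 3x3 diffusion tensor. The sketch below is a minimal plain-Python illustration of that equation. The function name `tensor_signal` and all numerical values are assumptions chosen for illustration; they are not from the talk.

```python
import math

def tensor_signal(S0, bval, bvec, D):
    """Predicted signal S0 * exp(-b * g^T D g) for a unit gradient direction g."""
    gDg = sum(bvec[i] * D[i][j] * bvec[j] for i in range(3) for j in range(3))
    return S0 * math.exp(-bval * gDg)

# Illustrative values (assumptions): diffusivities in mm^2/s, b in s/mm^2
iso = [[0.7e-3, 0, 0], [0, 0.7e-3, 0], [0, 0, 0.7e-3]]     # isotropic diffusion
aniso = [[1.7e-3, 0, 0], [0, 0.2e-3, 0], [0, 0, 0.2e-3]]   # fastest along x

along_x = (1.0, 0.0, 0.0)
along_y = (0.0, 1.0, 0.0)

iso_x = tensor_signal(1.0, 1000, along_x, iso)
iso_y = tensor_signal(1.0, 1000, along_y, iso)
aniso_x = tensor_signal(1.0, 1000, along_x, aniso)
aniso_y = tensor_signal(1.0, 1000, along_y, aniso)
```

With isotropic diffusion the signal is the same in every direction. With anisotropic diffusion the signal drops most along the direction of fastest diffusion.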

Diffusion statistics

Mean diffusivity
Fractional anisotropy
Principal diffusion direction
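
These three statistics are all derived from the eigenvalues and eigenvectors of the diffusion tensor. The sketch below uses the standard definitions and assumes the eigen-decomposition has already been computed. The function names are illustrative and are not DIPY's API.

```python
import math

def mean_diffusivity(evals):
    """Average of the three tensor eigenvalues."""
    return sum(evals) / 3.0

def fractional_anisotropy(evals):
    """FA in [0, 1]: 0 for isotropic diffusion, 1 for diffusion along one axis only."""
    l1, l2, l3 = evals
    num = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0:
        return 0.0
    return math.sqrt(0.5 * num / den)

def principal_direction(evals, evecs):
    """Eigenvector paired with the largest eigenvalue (evecs[i] goes with evals[i])."""
    return evecs[max(range(3), key=lambda i: evals[i])]

# Illustrative eigenvalues (assumption), in mm^2/s
evals = (1.7e-3, 0.2e-3, 0.2e-3)
evecs = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

md = mean_diffusivity(evals)
fa = fractional_anisotropy(evals)
pdd = principal_direction(evals, evecs)
```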

From diffusion to tracks

Diffusion MRI: the challenge of validation

Algorithm 1
Algorithm 2

A statistical learning approach

In-vivo validation
(following Breiman, 2001, and others)
Measurement #1
Measurement #2
Test-retest reliability
Model
Cross-validation
Rokem et al. (2015)
Corpus callosum
Corticospinal tract
Superior longitudinal fasciculus
DTI
Crossing fiber model
Rokem et al. (2015)

Diffusion MRI: the challenge of reproducibility

"An article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result" Buckheit and Donoho (1995)

DIPY: Diffusion MRI in Python

Part of the NIPY community

Started in 2009 by Eleftherios Garyfallidis

Contributors from at least six different countries and many different labs

Why Python?

The lingua franca of reproducible computational science

Open source

Easy to learn

Phenomenal ecosystem of open-source tools

The scipy & nipy ecosystem

The solar system

DIPY

"Open" as in "Open development"

Openly available source code is good

Open development is better!

Dipy cross-validation API

gtab = gradient_table(...)

model = ReconstModel(gtab, ...)

fit = model.fit(data, ...) # => ReconstFit

prediction = fit.predict(gtab, ...)

For example

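import numpy as np
import dipy.reconst.dti as dti

model = dti.TensorModel(gtab)

fit = model.fit(data1)

prediction = fit.predict(gtab)

# Model error: prediction from measurement #1 compared with measurement #2
RMSE = np.sqrt(np.mean((prediction - data2) ** 2, -1))

# Relative to test-retest error between the two measurements
rRMSE = RMSE / np.sqrt(np.mean((data1 - data2) ** 2, -1))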
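
To make the relative error concrete, here is a minimal single-voxel sketch in plain Python. The signal values are made up for illustration, and `rmse` is a hypothetical helper, not a DIPY function. It assumes two repeated measurements of the same voxel and a prediction from a model fit to the first.

```python
import math

def rmse(a, b):
    """Root-mean-square error between two equal-length signal vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Made-up signals in four diffusion directions (assumption)
data1 = [0.52, 0.31, 0.78, 0.45]        # measurement #1
data2 = [0.50, 0.35, 0.74, 0.47]        # measurement #2
prediction = [0.51, 0.33, 0.76, 0.46]   # model fit to data1

model_error = rmse(prediction, data2)   # model vs. second measurement
retest_error = rmse(data1, data2)       # test-retest reliability
rrmse = model_error / retest_error
```

An rRMSE below 1 means the model predicts the second measurement better than the first measurement itself does.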

When you've only measured once

k-fold cross-validation

# Use a k of 2

dti_pred = kfold_xval(dti_model, data, 2)

csd_pred = kfold_xval(csd_model, data, 2)
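
The general idea of k-fold cross-validation can be sketched in plain Python: split the measurements into k folds, fit on all but one fold, predict the held-out fold, and repeat until every measurement has a prediction. This is not DIPY's implementation. The helper names and the stand-in "mean" model are hypothetical and only illustrate the mechanics.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_predict(signal, k, fit, predict, seed=0):
    """Predict every measurement from a model fit without that measurement."""
    n = len(signal)
    out = [None] * n
    for held_out in kfold_indices(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        params = fit(train, [signal[i] for i in train])
        for i, p in zip(held_out, predict(params, held_out)):
            out[i] = p
    return out

# Hypothetical stand-in model: predicts the mean of the training signal
def fit_mean(indices, values):
    return sum(values) / len(values)

def predict_mean(params, indices):
    return [params for _ in indices]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pred = kfold_predict(signal, 2, fit_mean, predict_mean)
```

Because each prediction comes from data the model never saw, comparing `pred` with `signal` estimates out-of-sample error from a single measurement.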

Also: LiFE (Pestilli et al., 2014)

fiber_model = life.FiberModel(gtab)

fit = fiber_model.fit(data, tracks)

prediction = fit.predict(gtab)

# Keep only the tracks that were assigned a positive weight
optimized_tracks = [t for t, b in zip(tracks, fit.beta) if b > 0]
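
The selection step rests on finding non-negative weights, one per track, that best explain the measured signal. Tracks whose weight is zero contribute nothing and can be dropped. The toy sketch below illustrates that idea with projected gradient descent on a 3x3 problem. The solver, the matrix `A` and the signal `y` are assumptions for illustration and are not the LiFE implementation.

```python
def nnls_projected_gradient(A, y, step=0.1, n_iter=2000):
    """Minimize ||A w - y||^2 subject to w >= 0 by projected gradient descent."""
    n_rows, n_cols = len(A), len(A[0])
    w = [0.0] * n_cols
    for _ in range(n_iter):
        resid = [sum(A[r][c] * w[c] for c in range(n_cols)) - y[r]
                 for r in range(n_rows)]
        grad = [2.0 * sum(A[r][c] * resid[r] for r in range(n_rows))
                for c in range(n_cols)]
        # Gradient step, then clip negative weights back to zero
        w = [max(0.0, w[c] - step * grad[c]) for c in range(n_cols)]
    return w

# Toy problem (assumption): each column is one track's predicted signal
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0]]
# Signal generated by tracks 0 and 1 only; track 2 is a false positive
y = [2.0, 1.0, 3.0]

w = nnls_projected_gradient(A, y)
kept = [i for i, wi in enumerate(w) if wi > 1e-6]
```

The fixed step size is small enough for this toy matrix. A real problem would need a step chosen from the data or a dedicated non-negative least-squares solver.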

http://tinyurl.com/dipy-wmm
(powered by http://mybinder.org)

Data Science comes to the University

DSE
Pi-shaped
(Following Alex Szalay; Jake Vanderplas)

New role for data scientists

Facilitate data-intensive research in different fields
(inter- and cross- disciplinary)

Focus on methodology

Focus on reproducibility

Contribute to openly available tools, rather than/in addition to peer-reviewed publications

"Career paths for data scientists that recognize and reward contributions in methodology, computation, or development of tools are important."

(From a recent NIH BD2K RFA)

Data science education:

Degree programs

Specialized courses

Workshops (Software Carpentry, ...)

Project-oriented training

Incubator projects

Focused, intensive, collaborative projects

Data scientists + domain scientists

Results that wouldn't be possible otherwise

The eScience Institute

Data Science for Social Good

Urban@UW

Inspired by the DSSG programs at U Chicago and Georgia Tech

10-week internship program

16 DSSG fellows/students

6 high-school students from ALVA program

4 projects (+project leads!)

+ Data scientist mentors

Predictors of Permanent Housing for Homeless Families

Housing
Project Leads: Anjana Sundaram, Neil Roche, Bill & Melinda Gates Foundation
DSSG Fellows: Joan Wang, Jason Portenoy, Fabliha Ibnat, Chris Suberlak
ALVA Students: Cameron Holt, Xilalit Sanchez
eScience Data Scientist Mentors: Ariel Rokem, Bryna Hazelton
Family Trajectories through Programs
http://tinyurl.com/dssg-homeless

A few lessons we learned

It is possible to both:

  • Do interesting things with data, with social good implications
  • Provide highly effective training

Trainee diversity poses a challenge in formal settings

But might be a strength in the context of project work!

Stakeholder involvement is important
(no projects "thrown over the fence")

In-house expertise (data scientists, program managers) is an important asset

But (hypothesis) DSSG can be translated into other settings

We wrote a paper with some ideas.

Outline

Data Science is an interdisciplinary solution to many of the problems facing modern-day research

An example from neuroimaging

Data Science comes to the University

http://arokem.org
arokem@gmail.com
@arokem
github.com/arokem