Ariel Rokem, University of Washington eScience Institute
Follow along at
Data Science is an interdisciplinary solution to many of the problems facing modern-day research
An example from neuroimaging
Data Science comes to the University
1. Empirical (experimental)
2. Theoretical (mathematical)
3. Simulation (computational)
4. Data-intensive (eScience)
In a variety of fields
Data is enabling new ways of doing things
But data also poses challenges
The 4 V's: Volume, Velocity, Variety, Veracity/Validity
Made his wealth wrangling financial data.
As mayor, started 311, and made NYC data openly available.
Statistics and machine learning
Programming and software engineering
Data management
Data visualization and communication
A focus on reproducibility and openness
Brain connections change with development
Individual differences account for differences in behaviour
Adapt with learning
This has clinical significance
Started in 2009 by Eleftherios Garyfallidis
Contributors from at least six different countries and many different labs
The lingua franca of reproducible computational science
Open source
Easy to learn
Phenomenal ecosystem of open-source tools
Openly available source code is good
Open development is better!
gtab = gradient_table(...)
model = ReconstModel(gtab, ...)
fit = model.fit(data, ...) # => ReconstFit
prediction = fit.predict(gtab, ...)
For example
model = dti.TensorModel(gtab)
fit = model.fit(data1)
prediction = fit.predict(gtab)
RMSE = np.sqrt(
    np.mean((prediction - data2) ** 2, -1))
rRMSE = RMSE / np.sqrt(
    np.mean((data1 - data2) ** 2, -1))
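The two error measures above can be tried on synthetic arrays. This is a minimal sketch, assuming `data1` and `data2` are repeated measurements with the signal along the last axis. The helper names `rmse` and `relative_rmse` and the toy numbers are illustrative and are not part of DIPY.

```python
import numpy as np


def rmse(a, b):
    """Root-mean-square error between a and b along the last axis."""
    return np.sqrt(np.mean((a - b) ** 2, axis=-1))


def relative_rmse(prediction, data1, data2):
    """Model error on data2, scaled by the test-retest error.

    Values below 1 mean that the model fit to data1 predicts data2
    better than data1 itself does.
    """
    return rmse(prediction, data2) / rmse(data1, data2)


# Toy example (made-up numbers): 2 voxels, 4 measurements each.
rng = np.random.default_rng(0)
truth = np.array([[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]])
data1 = truth + rng.normal(scale=0.1, size=truth.shape)
data2 = truth + rng.normal(scale=0.1, size=truth.shape)

# Stand-in for a model prediction: here, the noiseless signal.
rrmse = relative_rmse(truth, data1, data2)
print(rrmse.shape)  # one value per voxel
```

Averaging over the last axis gives one error value per voxel, so the result can be displayed as a map.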
When you've only measured once
k-fold cross-validation
# Use a k of 2
dti_pred = kfold_xval(dti_model, data, 2)
csd_pred = kfold_xval(csd_model, data, 2)
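A minimal sketch of the idea behind k-fold cross-validation, not DIPY's `kfold_xval` implementation: the measurements are split into k folds, the model is fit on all folds but one, and the held-out fold is predicted from that fit. The `LineModel` and `LineFit` classes and the `kfold_predict` helper are hypothetical stand-ins that mimic the model/fit/predict pattern with a straight-line fit.

```python
import numpy as np


class LineFit:
    """Hypothetical fit object: holds the coefficients of a straight line."""

    def __init__(self, coeffs):
        self.coeffs = coeffs

    def predict(self, x):
        return np.polyval(self.coeffs, x)


class LineModel:
    """Hypothetical model: a straight line over the design points x."""

    def __init__(self, x):
        self.x = x

    def fit(self, y):
        return LineFit(np.polyfit(self.x, y, deg=1))


def kfold_predict(model_class, x, y, folds, seed=0):
    """Predict each measurement from a fit that excluded its fold."""
    order = np.random.default_rng(seed).permutation(len(x))
    prediction = np.empty(len(y), dtype=float)
    for held_out in np.array_split(order, folds):
        train = np.setdiff1d(order, held_out)
        fit = model_class(x[train]).fit(y[train])
        prediction[held_out] = fit.predict(x[held_out])
    return prediction


x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0  # noiseless toy data
pred = kfold_predict(LineModel, x, y, folds=2)
print(np.allclose(pred, y))
```

Because every measurement is predicted by a fit that never saw it, the prediction can be compared with the data even when there is only one measurement.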
Also: LiFE (Pestilli et al., 2014)
fiber_model = life.FiberModel(gtab)
fit = fiber_model.fit(data, tracks)
prediction = fit.predict(gtab)
optimized_tracks = tracks[fit.beta>0]
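The last line keeps only the streamlines that received a positive weight. A sketch of that selection step, assuming `tracks` is a plain Python list of arrays, in which case boolean indexing does not work directly. The weights and toy streamlines below are made up for illustration and stand in for `fit.beta` and real tractography output.

```python
import numpy as np

# Three toy streamlines with different numbers of 3D points (made-up data).
tracks = [np.zeros((5, 3)), np.ones((8, 3)), np.full((3, 3), 2.0)]

# Hypothetical weights, standing in for fit.beta.
beta = np.array([0.7, 0.0, 0.2])

# Keep the streamlines whose weight is positive.
optimized_tracks = [t for t, b in zip(tracks, beta) if b > 0]
print(len(optimized_tracks))
```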
http://tinyurl.com/dipy-wmm
(powered by http://mybinder.org )
Data Science comes to the University
(Following Alex Szalay; Jake Vanderplas)
New role for data scientists
Facilitate data-intensive research in different fields
(inter- and cross-disciplinary)
Focus on methodology
Focus on reproducibility
Contribute to openly available tools, rather than, or in addition to, peer-reviewed publications
"Career paths for data scientists that recognize and reward contributions in methodology, computation, or development of tools are important." (From a recent NIH BD2K RFA)
Data science education:
Degree programs
Workshops (Software Carpentry, ...)
Project-oriented training
Incubator projects
Focused, intensive, collaborative projects
Data scientists + domain scientists
Results that wouldn't be possible otherwise
Data Science for Social Good
Inspired by DSSG program at U Chicago, GA Tech
10-week internship program
16 DSSG fellows/students
6 high-school students from ALVA program
4 projects (+project leads!)
+ Data scientist mentors
Predictors of Permanent Housing for Homeless Families
Project Leads: Anjana Sundaram, Neil Roche, Bill & Melinda Gates Foundation
DSSG Fellows: Joan Wang, Jason Portenoy, Fabliha Ibnat, Chris Suberlak
ALVA Students: Cameron Holt, Xilalit Sanchez
eScience Data Scientist Mentors: Ariel Rokem, Bryna Hazelton
Family Trajectories through Programs
http://tinyurl.com/dssg-homeless
A few lessons we learned
It is possible to both:
- Do interesting things with data, with social good implications
- Provide highly effective training
Trainee diversity poses a challenge in formal settings
But might be a strength in the context of project work!
Stakeholder involvement is important
(no projects "thrown over the fence")
In-house expertise (data scientists, program managers) is an important asset
But (hypothesis) DSSG can be translated into other settings
We wrote a paper with some ideas.
Outline
Data Science is an interdisciplinary solution to many of the problems facing modern-day research
An example from neuroimaging
Data Science comes to the University
http://arokem.org
arokem@gmail.com
@arokem
github.com/arokem