Ariel Rokem, University of Washington eScience Institute
Follow along at:
Data science education
Development of tools and practices for reproducible research
Building a data science community: open, rigorous and ethical
Data-driven research
1. Empirical (experimental)
2. Theoretical (mathematical)
3. Simulation (computational)
4. Data-driven scientific discovery
Brain connections develop and mature with age
Individual differences account for differences in behaviour
Adapt and change with learning
Human Connectome Project (HCP), N = 1,200
Pediatric Imaging Neurocognition and Genetics (PING), N > 1,000
Healthy Brain Network (HBN), N = 10,000
Adolescent Brain Cognitive Development (ABCD), N = 10,000
UK Biobank, N = 500,000
New data sets will enable important new discoveries
Data-driven discovery
How do we adapt research to big datasets?
Use deep learning and citizen science
How do we analyze and interpret multi-dimensional data?
Leverage machine learning techniques that use known brain structure
How do we collaborate, publish and teach?
Open science: sharing data, software and training
Some methods require expert examination
Time consuming, tedious
→ Do not scale well!
Expert → results
Expert → training data → machine learning → results
A family of machine learning algorithms
Biologically inspired
Inspired by the visual system
Capitalize on spatial correlations in images
10 years (2006-2016)
9,285 patients
43,328 OCT volumes
2.64 million OCT images
2.5 TB of data
Linked to EPIC electronic medical records
For each OCT we know:
Visual acuity
OCT interpretation
Diagnosis
Treatment determinations
In some cases: longitudinal measurements
Patient-level AUC = 0.97 (Lee et al., 2016)
Expert → results
Expert → training data → machine learning → results
But: for many tasks, not enough training data
→ Amplify labeled data-sets with citizen science
Expert → citizen science → training data → machine learning → results
Are you at work but feel like playing Tinder? Why not play braindr (https://t.co/yXw191Q7Hy) instead, and help neuroscientists rate the quality of brain images? Swipe left to fail bad quality images! Built with @vuejs and @Firebase #citizenscience pic.twitter.com/tpI9Y3UKOb
— anisha (@akeshavan_) February 7, 2018
Use images classified by citizen scientists as training data
Train a neural network on aggregated citizen scientist labels
Transfer learning with VGG16/ImageNet
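A minimal sketch of the aggregation step, assuming each citizen scientist rating is a (image, rater, vote) triple with vote 1 for pass and 0 for fail. Plain averaging is an assumption made here for illustration; the actual pipeline may weight raters differently. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def aggregate_ratings(ratings):
    """Average the pass/fail votes that each image received.

    ratings: iterable of (image_id, rater_id, vote), vote is 1 (pass) or 0 (fail).
    Returns a dict mapping image_id to the mean vote in [0, 1].
    """
    votes = defaultdict(list)
    for image_id, _rater_id, vote in ratings:
        votes[image_id].append(vote)
    # The mean vote can serve as a soft training label for the network.
    return {image_id: sum(v) / len(v) for image_id, v in votes.items()}


example = [
    ("img-a", "u1", 1), ("img-a", "u2", 1), ("img-a", "u3", 0),
    ("img-b", "u1", 0), ("img-b", "u2", 0),
]
soft_labels = aggregate_ratings(example)
```

The resulting soft labels (one value per image) can then be thresholded or used directly as targets when training the classifier.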
Application: cross-modal image registration
How do we adapt research to big datasets?
Use deep learning and citizen science
How do we analyze and interpret multi-dimensional data?
Leverage machine learning techniques that use known brain structure
How do we collaborate, publish and teach?
Open science: sharing data, software and training
Accurately predict individual variability
Expose important biological features
Tracts provide the anatomically correct coordinate frame
Tractometry is feature engineering
Amyotrophic Lateral Sclerosis (ALS)
Mass univariate analysis: reduced statistical power
Focus on a region of interest: lose the full picture
The objective:
But in our case p (number of variables) >> n (number of subjects)
Regularization: the Lasso
Where l are groups of variables
p(l) is the number of variables in group l
In our case: all the measurements of a tissue property within a tract
Enforces selection of groups
But does not enforce L1 sparsity within included groups
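The equations shown on these slides did not survive in the text. A sketch of the standard forms follows, assuming a linear model with response y for n subjects, an n × p design matrix X and coefficients β. The squared-error loss, the 1/(2n) scaling and the symbol names are assumptions, not necessarily the notation used in the talk; for classification a logistic loss would take the place of the squared error.

```latex
% Least-squares objective (not uniquely solvable when p >> n)
\hat{\beta} = \arg\min_{\beta} \frac{1}{2n} \lVert y - X\beta \rVert_2^2

% Lasso: add an L1 penalty on the coefficients
\hat{\beta} = \arg\min_{\beta} \frac{1}{2n} \lVert y - X\beta \rVert_2^2
              + \lambda \lVert \beta \rVert_1

% Group Lasso: penalize the L2 norm of each group l, weighted by group size
\hat{\beta} = \arg\min_{\beta} \frac{1}{2n} \lVert y - X\beta \rVert_2^2
              + \lambda \sum_{l} \sqrt{p^{(l)}} \, \lVert \beta^{(l)} \rVert_2
```

Here β^(l) denotes the coefficients belonging to group l. Because the L2 norm is not squared, a whole group is either dropped or kept, which is why selection happens at the group level.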
Enforces sparsity both at the group level and the within-group level
Subsumes the Lasso (λ1 = 0)
And the Group Lasso (λ2 = 0)
But more meta-parameters
→ Nested cross validation + hyperoptimization
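A minimal sketch of the sparse group lasso penalty and its proximal operator, assuming the standard formulation with the slide's convention that λ1 weights the group term and λ2 weights the L1 term (so λ1 = 0 gives the Lasso and λ2 = 0 gives the Group Lasso). The function names, the `step` parameter and the representation of groups as index arrays are hypothetical.

```python
import numpy as np


def sgl_penalty(beta, groups, lam1, lam2):
    """lam1 * sum_l sqrt(p_l) * ||beta_l||_2  +  lam2 * ||beta||_1."""
    group_term = sum(np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups)
    return lam1 * group_term + lam2 * np.sum(np.abs(beta))


def sgl_prox(beta, groups, lam1, lam2, step=1.0):
    """Proximal operator of step * sgl_penalty.

    Element-wise soft-thresholding (L1 part) followed by group-wise
    soft-thresholding (group part).
    """
    out = np.zeros_like(beta, dtype=float)
    for idx in groups:
        # Within-group sparsity: shrink each coefficient toward zero.
        b = np.sign(beta[idx]) * np.maximum(np.abs(beta[idx]) - step * lam2, 0.0)
        norm = np.linalg.norm(b)
        threshold = step * lam1 * np.sqrt(len(idx))
        # Group-level sparsity: drop the whole group if its norm is too small.
        if norm > threshold:
            out[idx] = (1.0 - threshold / norm) * b
    return out


beta = np.array([3.0, -0.5, 0.2, 0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
shrunk = sgl_prox(beta, groups, lam1=0.5, lam2=0.3)
```

In this toy example the second group is set to zero as a whole, while the first group is kept and shrunk. An operator like this is the building block of proximal gradient solvers; λ1 and λ2 would be chosen in the inner loop of the nested cross-validation.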
Classification accuracy of 84% (+/- 1%)
AUC of 0.91 (+/- 0.01)
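For reference, a minimal sketch of how the two reported metrics are defined, assuming binary labels (1 = patient, 0 = control) and a continuous classifier score per subject. The function names and the 0.5 threshold are hypothetical; the values in the toy example are made up and unrelated to the reported results.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of subjects whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win (equivalent to the Mann-Whitney U statistic).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_acc = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Unlike accuracy, the AUC does not depend on the choice of threshold, which is why the two are usually reported together.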
Capitalizes on brain structure
Predicts continuous measures
(e.g., "brain age", reading skills)
Classifies different states (patient/control)
Identifies dense or sparse biological feature sets
Healthy Brain Network dataset
ABCD dataset
Multi-region, multi-neuron recordings
Neurons → features
Brain regions → groups
Trials → observations
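The mapping above can be sketched as follows, assuming the recordings are available as one array per brain region with shape (trials, neurons). The function name, the input layout and the region names in the example are hypothetical.

```python
import numpy as np


def build_design(rates_by_region):
    """Stack per-region recordings into one design matrix with group indices.

    rates_by_region: dict mapping region name to an array (n_trials, n_neurons).
    Returns X (n_trials, total_neurons), a list of column-index arrays
    (one per region) and the region names in column order.
    """
    names = sorted(rates_by_region)
    # Trials are observations (rows); neurons are features (columns).
    X = np.hstack([rates_by_region[name] for name in names])
    groups, start = [], 0
    for name in names:
        n_neurons = rates_by_region[name].shape[1]
        # Each brain region forms one group of columns.
        groups.append(np.arange(start, start + n_neurons))
        start += n_neurons
    return X, groups, names


recordings = {"V1": np.zeros((5, 3)), "LIP": np.ones((5, 2))}
X, groups, names = build_design(recordings)
```

The `groups` list has the same form that a group-structured penalty expects, so the same regularized model can be fit to this kind of data.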
How do we adapt research to big datasets?
Use deep learning and citizen science
How do we analyze and interpret multi-dimensional data?
Leverage machine learning techniques that use known brain structure
How do we collaborate, publish and teach?
Open science: sharing data, software and training
Data sharing is not incentivized and is not easy enough
Results from large multi-dimensional datasets are hard to understand
Hard to communicate
Hard to reproduce
A web-based application
Leverages modern visualization frameworks
Builds a web-site for a diffusion MRI dataset
Automatically uploads the website to GitHub
Enhances published results
Linked visualizations facilitate easy exploration
Enables new discoveries in old datasets
Dimensionality-reduced data in tidy table format
Facilitates interdisciplinary collaboration
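A minimal sketch of what a tidy table could look like, assuming the dimensionality-reduced data are per-tract profiles of tissue properties sampled at nodes along each tract. The column names (`subject`, `tract`, `node`), the metric names and the input layout are hypothetical and not taken from the application itself.

```python
import csv
import io


def profiles_to_tidy(profiles):
    """One row per (subject, tract, node), one column per tissue property.

    profiles: dict mapping (subject, tract) to a dict of
    metric name -> sequence of values along the tract.
    """
    rows = []
    for (subject, tract), metrics in sorted(profiles.items()):
        n_nodes = len(next(iter(metrics.values())))
        for node in range(n_nodes):
            row = {"subject": subject, "tract": tract, "node": node}
            for name, values in metrics.items():
                row[name] = values[node]
            rows.append(row)
    return rows


def to_csv(rows):
    """Serialize tidy rows to CSV text that any analysis tool can read."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


toy = {("sub-01", "CST_L"): {"fa": [0.4, 0.5], "md": [0.7, 0.8]}}
table = profiles_to_tidy(toy)
csv_text = to_csv(table)
```

Because every row is a single observation and every column a single variable, collaborators from other fields can load such a file directly into a data frame or spreadsheet without knowing the imaging pipeline.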
Image processing at scale
Machine learning
Exploration, visualization and data sharing
Methods in data science are rapidly changing
Learning often requires substantial hands-on experience
Week-long events
Combination of learning and project work
A fine balance of pedagogy and hacking
How do we adapt research to big datasets?
Use deep learning and citizen science
How do we analyze and interpret multi-dimensional data?
Leverage machine learning techniques that use known brain structure
How do we collaborate, publish and teach?
Open science: sharing data, software and training