Ariel Rokem, University of Washington eScience Institute
Follow along at:
The era of brain observatories
Allen Institute for Brain Science
n=1,200
n=500,000
New data sets will enable important new discoveries
New methods
Data-driven discovery
Methods that work in standard use may not apply to large datasets
=> Train machine learning algorithms to replace expert decision making
Tools are needed for data exploration and transparent sharing of results
=> Build browser-based applications for exploratory data analysis and data sharing
Algorithms are needed to extract information from complex high-dimensional data
=> Translate statistical techniques into practice in neuroscience
Sociotechnical structures are strained: collaboration, publication, training
=> Open source software collaborations and science-focused hack weeks
Some methods require expert examination
Time-consuming and tedious
=> Do not scale well!
Expert => results
Expert => training data => machine learning => results
10 years (2006-2016)
9,285 patients
43,328 OCT volumes
2.64 million OCT images
2.5 TB of data
Linked to clinical information
For each OCT we know:
Visual acuity
OCT interpretation
Diagnosis
Treatment determinations
In some cases - longitudinal measurements
A family of machine learning algorithms
Biologically inspired
Implement a cascade of linear/non-linear operations
Capitalize on spatial correlations in images
Inspired by the mammalian visual system
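A minimal sketch of such a cascade, in tf.keras (layer sizes and input shape are illustrative, not the architecture used in this work):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A cascade of linear (convolution) and non-linear (ReLU) operations;
# convolutions capitalize on spatial correlations in the image.
model = tf.keras.Sequential([
    layers.Input(shape=(224, 224, 1)),        # e.g., one grayscale OCT image
    layers.Conv2D(32, 3, activation="relu"),  # linear filter bank + non-linearity
    layers.MaxPooling2D(),                    # pool over local neighborhoods
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),    # binary decision
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```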
Binary classification doesn't model clinical decision making
Patients can have any one of several diseases
Or more than one disease
=> Train several networks and integrate across them
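One way to integrate across them, sketched with made-up probabilities standing in for the per-disease networks' outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# probs[i, j] stands in for P(disease_j | image_i), the output of the
# j-th binary network on the i-th image.
probs = rng.random((5, 4))               # 5 images, 4 candidate diseases
predictions = probs > 0.5                # multi-label: any subset can be positive
n_diagnoses = predictions.sum(axis=1)    # a patient may have 0, 1, or several
```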
Expert => results
Expert => training data => machine learning => results
But: for many tasks, not enough training data
=> Amplify labeled datasets with citizen science
Expert => citizen science => training data => machine learning => results
Are you at work but feel like playing Tinder? Why not play braindr (https://t.co/yXw191Q7Hy) instead, and help neuroscientists rate the quality of brain images? Swipe left to fail bad quality images! Built with @vuejs and @Firebase #citizenscience pic.twitter.com/tpI9Y3UKOb
— anisha (@akeshavan_) February 7, 2018
When there is enough training data: deep learning
When we need to scale up: citizen scientists
Model of expertise (random forest) for aggregation
Model of perception (neural network) for automation and scaling
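Schematically, with hypothetical arrays standing in for the braindr ratings (a sketch, not the exact pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
citizen_ratings = rng.integers(0, 2, size=(500, 30)).astype(float)  # image x rater
expert_scores = rng.random(500)                                     # gold standard

# Model of expertise: learn how to weight raters so that their
# aggregate reproduces the expert quality scores.
aggregator = RandomForestRegressor(n_estimators=100, random_state=0)
aggregator.fit(citizen_ratings, expert_scores)
consensus_labels = aggregator.predict(citizen_ratings)

# Model of perception: a CNN (as sketched earlier) is then trained on
# consensus_labels, automating quality control at scale.
```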
Other tasks
Tumor segmentation in MRI
Other types of data and other procedures
Methods that work in standard use may not apply to large datasets
=> Train machine learning algorithms to replace expert decision making
Tools are needed for data exploration and transparent sharing of results
=> Build browser-based applications for exploratory data analysis and data sharing
Results from large datasets are hard to understand
Hard to communicate
Hard to reproduce
Data sharing is not incentivized and is not easy enough
Brain connections develop and mature with age
Individual differences in brain connections account for differences in behaviour
Brain connections adapt and change with learning
Classify patients based on the tissue properties in this part of the brain
Random Forest algorithm => 80% accuracy
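A sketch of that analysis with scikit-learn (the data here are random placeholders; only the shapes are meaningful):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))      # tissue properties along the tract, per subject
y = rng.integers(0, 2, size=100)     # diagnostic labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # ~0.8 on the real data
```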
A web-based application
Builds a website for a diffusion MRI dataset
Automatically uploads the website to GitHub
Enhances published results
Linked visualizations facilitate easy exploration
Enables new discoveries in old datasets
Exploratory data analysis
Automated data sharing
Dimensionality reduced data in tidy table format
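The tidy table makes downstream reuse straightforward; a sketch with pandas, assuming the AFQ-Browser column naming (subjectID, tractID, nodeID, plus one column per tissue property):

```python
import pandas as pd

# One row per (subject, tract, node); tissue properties as columns.
nodes = pd.read_csv("nodes.csv")

# Example: the group-average FA profile along each tract.
mean_profiles = (nodes
                 .groupby(["tractID", "nodeID"])["fa"]
                 .mean()
                 .reset_index())
```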
Other analysis pipelines
Dimensionality reduction in multi-channel neural recordings
Methods that work in standard use may not apply to large datasets
=> Train machine learning algorithms to replace expert decision making
Tools are needed for data exploration and transparent sharing of results
=> Build browser-based applications for exploratory data analysis and data sharing
Algorithms are needed to extract information from complex high-dimensional data
=> Translate statistical techniques into practice in neuroscience
But in our case p (number of variables) >> n (number of subjects)
The Lasso penalty, $\lambda \lVert \beta \rVert_1$, enforces sparsity
But ignores group structure in the data
Accuracy: ~71% (AUC: ~0.71)
Does not discover the right features
Top 10 features include some CST, but also other parts of the brain
The Group Lasso penalty: $\lambda \sum_{l=1}^{m} \sqrt{p_l} \, \lVert \beta^{(l)} \rVert_2$
where $l$ indexes groups of variables
$p_l$ is the number of variables in group $l$
In our case: all the measurements of a tissue property within a tract
Enforces selection of groups
But does not enforce L1 sparsity within included groups
The Sparse Group Lasso penalty: $\lambda_1 \sum_{l=1}^{m} \sqrt{p_l} \, \lVert \beta^{(l)} \rVert_2 + \lambda_2 \lVert \beta \rVert_1$
Enforces sparsity both at the group level and the within-group level
Subsumes the Lasso (λ1 = 0)
And the Group Lasso (λ2 = 0)
But introduces additional meta-parameters (λ1, λ2) to tune
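A sketch of fitting such a model, assuming the groupyr package; treat the import path and parameter names as assumptions to check against its documentation:

```python
import numpy as np
from groupyr import LogisticSGL  # assumed class name and import path

# Each group collects all measurements of one tissue property within
# one tract: here, 20 tracts x 100 nodes (illustrative indices only).
groups = [np.arange(i * 100, (i + 1) * 100) for i in range(20)]

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2000))      # n (subjects) << p (variables)
y = rng.integers(0, 2, size=80)

# l1_ratio balances the within-group (L1) and group-level penalties.
model = LogisticSGL(groups=groups, l1_ratio=0.5, alpha=0.1)
model.fit(X, y)
```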
Meta-parameters (λ1, λ2) are tuned with nested cross-validation
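Schematically, with scikit-learn (a generic L1-penalized logistic regression stands in for the Sparse Group Lasso; data are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))
y = rng.integers(0, 2, size=80)

# Inner loop: choose the regularization strength on training folds only.
inner = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
# Outer loop: estimate accuracy on folds never used for tuning, so the
# reported number is not inflated by meta-parameter selection.
outer_scores = cross_val_score(inner, X, y, cv=5)
```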
Classification accuracy of ~84% (AUC of 0.9)
Top 10 features selected include CST
Sparse Group Lasso accurately discovers structure in dMRI data
Classification of disease states
In a regression setting, prediction of continuous measures
(e.g., "brain age", IQ, reading skills)
Multi-region, multi-neuron recordings
Neurons => features
Brain regions => groups
Trials => observations
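A sketch of that mapping (region sizes are made up):

```python
import numpy as np

# Trials x neurons design matrix: each brain region contributes a
# block of neuron columns, and each block becomes one group.
region_sizes = [40, 25, 60]                  # neurons per region
bounds = np.cumsum([0] + region_sizes)
groups = [np.arange(bounds[i], bounds[i + 1])
          for i in range(len(region_sizes))]
```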
Methods that work in standard use may not apply to large datasets
=> Train machine learning algorithms to replace expert decision making
Tools are needed for data exploration and transparent sharing of results
=> Build browser-based applications for exploratory data analysis and data sharing
Algorithms are needed to extract information from complex high-dimensional data
=> Translate statistical techniques into practice in neuroscience
Sociotechnical structures are strained: collaboration, publication, training
=> Open source software collaborations and science-focused hack weeks
Open source software is required for reproducibility
It enables building on previous work
Methods in data science are rapidly changing
Learning them often requires substantial hands-on experience
=> Hack weeks
Week-long events
Combination of learning and project work
Participant driven
Astrohackweek
Neurohackweek
Geohackweek
Methods that work in standard use may not apply to large datasets
=> Train machine learning algorithms to replace expert decision making
Tools are needed for data exploration and transparent sharing of results
=> Build browser-based applications for exploratory data analysis and data sharing
Algorithms are needed to extract information from complex high-dimensional data
=> Translate statistical techniques into practice in neuroscience
Sociotechnical structures are strained: collaboration, publication, training
=> Open source software collaborations and science-focused hack weeks
"All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data... In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields"