Neuroinformatics and data science

Data Science and Neuroinformatics

Ariel Rokem, University of Washington eScience Institute

Slides available at: https://arokem.github.io/2018-09-18-CNC

Analysis

Interpretation

Kasthuri et al. (2015)
Bellec et al. (2016)
Huber et al. (under review)

The era of brain observatories

Allen Institute for Brain Science

n=1200

n=~10,000

n=500,000

Opportunities

New data sets will enable important new discoveries

Data-driven discovery

Analysis

Interpretation

Kasthuri et al. (2015)
Bellec et al. (2016)
Huber et al. (under review)

Challenges

Data arriving at unprecedented volume, variety and velocity

=> Instead of moving the data to the compute, we need to bring the compute to the data

=> Need new tools and approaches to process, analyze and interpret these data

=> Web-based analysis tools become first-class citizens in our tool-kit

=> New sociotechnical structures needed to facilitate training and collaboration

Bringing the compute to the data

To the cloud!

Infinitely scalable

"Elastic"

But the cloud is hard to operate

Challenge: accessible cloud computing

Adam Richie-Halford

Cloudknot: package your code and run it on AWS Batch


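    import cloudknot as ck

    # Any Python function can be packaged (arguments elided on the slide)
    def awesome_func(...):
        ...

    # Wrap the function in a Knot, which manages the AWS Batch resources
    knot = ck.Knot(func=awesome_func)

    ...

    # Submit the function over the inputs in args; results come back as futures
    future = knot.map(args)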

Richie-Halford and Rokem (2018)
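The map-and-collect pattern in the snippet above can be tried locally before involving AWS. The sketch below is not Cloudknot: it uses Python's standard-library `concurrent.futures` as a stand-in, and both the body of `awesome_func` and the helper `run_map` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def awesome_func(x):
    # Hypothetical stand-in for an expensive analysis step.
    return x * x


def run_map(func, args, max_workers=4):
    # Submit one task per argument and keep the futures in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, a) for a in args]
        # Block until each task finishes and collect its result.
        return [f.result() for f in futures]


results = run_map(awesome_func, [1, 2, 3])
```

Keeping the function free of local state (no globals, no open files) is what makes it plausible to ship the same function to remote workers later.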

Scaling expertise with citizen science

Anisha Keshavan

Jason Yeatman

https://braindr.us

Keshavan, Yeatman & Rokem (in review)

Multiple ratings per image

Keshavan, Yeatman & Rokem (in review)

Aggregating across raters

XGBoost (Chen & Guestrin, 2016)

Keshavan, Yeatman & Rokem (in prep)
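As a rough illustration of combining several ratings per image into one score, the sketch below averages pass/fail votes per image. This is a deliberately simpler stand-in, not the XGBoost-based aggregation named on the slide; the image IDs, rater names, and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict


def aggregate_ratings(ratings):
    """Average binary ratings per image.

    ratings: iterable of (image_id, rater_id, vote), with vote 1 (pass) or 0 (fail).
    Returns a dict mapping image_id to the mean vote.
    """
    votes = defaultdict(list)
    for image_id, _rater_id, vote in ratings:
        votes[image_id].append(vote)
    return {image_id: sum(v) / len(v) for image_id, v in votes.items()}


# Hypothetical ratings: three raters on img_a, two on img_b.
ratings = [
    ("img_a", "rater_1", 1),
    ("img_a", "rater_2", 1),
    ("img_a", "rater_3", 0),
    ("img_b", "rater_1", 0),
    ("img_b", "rater_2", 0),
]
scores = aggregate_ratings(ratings)
# Label an image as passing when its mean vote exceeds a hypothetical 0.5 threshold.
passed = {image_id: score > 0.5 for image_id, score in scores.items()}
```

A plain mean treats every rater as equally reliable; a learned aggregator can instead weight raters by how well their votes agree with expert labels.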

Scaling expertise using citizen scientist ratings

Keshavan, Yeatman & Rokem (in prep)


Challenge: tools for exploration of complex data

Results from large datasets are hard to understand

Hard to communicate

Hard to reproduce

Data sharing is not incentivized and is not easy enough

Sharing and exploring large datasets on the web

Jason Yeatman

Adam Richie-Halford

Josh Smith

Anisha Keshavan

Amyotrophic Lateral Sclerosis (ALS)

Sarica et al. (2017)

https://yeatmanlab.github.io/Sarica_2017

Yeatman, Richie-Halford, Smith, Keshavan & Rokem (2018), Nature Communications

Automatic data sharing

Yeatman, Richie-Halford, Smith, Keshavan & Rokem (2018), Nature Communications

Further exploration

Yeatman, Richie-Halford, Smith, Keshavan & Rokem (2018), Nature Communications

Statistical learning approaches to large datasets

Adam Richie-Halford

Noah Simon

Jason Yeatman

Diffusion MRI data has group structure

Sparse Group Lasso
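The sparse group lasso combines an L1 penalty on individual coefficients with an L2 penalty on each group, so it can drop whole groups while also zeroing features within the groups it keeps. The sketch below computes that penalty and its proximal operator in plain Python, using a common formulation from the sparse group lasso literature. The grouping, `lam`, `alpha`, and `step` values are hypothetical, and this is not the AFQ-Insight implementation.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Penalty: alpha*lam*||beta||_1 + (1-alpha)*lam*sum_g sqrt(p_g)*||beta_g||_2.

    groups: list of index lists, one per non-overlapping group.
    """
    l1 = sum(abs(b) for b in beta)
    l2 = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[i] ** 2 for i in g))
        for g in groups
    )
    return alpha * lam * l1 + (1 - alpha) * lam * l2


def sgl_prox(beta, groups, lam, alpha, step=1.0):
    """Proximal operator: soft-threshold each entry, then shrink each group."""
    out = [0.0] * len(beta)
    for g in groups:
        # Elementwise soft-thresholding (L1 part).
        s = [
            math.copysign(max(abs(beta[i]) - step * alpha * lam, 0.0), beta[i])
            for i in g
        ]
        norm = math.sqrt(sum(v * v for v in s))
        # Group-wise shrinkage (L2 part); a group with a small norm is zeroed entirely.
        thresh = step * (1 - alpha) * lam * math.sqrt(len(g))
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0 else 0.0
        for i, v in zip(g, s):
            out[i] = scale * v
    return out


# Hypothetical coefficients: one strong group and one weak group.
beta = [3.0, -4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = sgl_prox(beta, groups, lam=1.0, alpha=0.5)
```

In this toy example the weak group is removed entirely while the strong group is only shrunk, which is the behavior that makes the method a natural fit for features that come in groups.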

Classification accuracy of ~84% (AUC of 0.9)
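Accuracy and AUC measure different things: accuracy depends on a decision threshold, while AUC is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The sketch below computes both from scratch on made-up labels and scores; the data and the 0.5 threshold are hypothetical and unrelated to the reported result.

```python
def accuracy(labels, scores, threshold=0.5):
    # Fraction of cases where the thresholded score matches the label.
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    # Pairwise (Mann-Whitney) definition: ties count as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = patient, 0 = control) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
acc = accuracy(labels, scores)
area = auc(labels, scores)
```

Because AUC ignores the threshold, it can be noticeably higher than accuracy when the scores rank cases well but the chosen cutoff is not ideal.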

Top 10 features selected include the corticospinal tract (CST)

Sarica et al. (2017)

Open source software is a necessary complement to brain observatories

Required for reproducibility

Enables building on previous work

Braindr:
https://github.com/akeshavan/braindr-analysis

AFQ-Browser:
https://github.com/yeatmanlab/pyAFQ
https://github.com/yeatmanlab/AFQ-Browser

Sparse Group Lasso:
https://github.com/richford/AFQ-Insight

How do we learn about all of these things?

Data Science Option

Similar to the Comp Neuro option

Take a few courses:

Data management

Data analysis

Machine learning and stats

Neuroinformatics Special Interest Group

Discussions

Tutorials

Code reviews

...

Neurohackademy

A Summer Institute in Neuroscience and Data Science (=> 2021)

Huppenkothen, Arendt, Hogg, Ram, Vanderplas & Rokem (2018)

Contact information

http://arokem.org

arokem@gmail.com

@arokem

github.com/arokem