Cloudknot:

Scaling your existing (Python) code in the (AWS) cloud

Adam Richie-Halford, University of Washington Dept of Physics

Ariel Rokem, University of Washington eScience Institute

Follow along at http://arokem.github.io/2018-10-23-cloudknot/

License

What we do

Ariel

  • Data Scientist, UW eScience Institute

Adam

  • PhD Candidate, UW Physics
  • Nuclear theory: quantum Monte Carlo studies of neutron matter
  • Neuroscience: visualization and machine learning tools for neuroimaging data

We use Python

  • Post-processing of QMC results
  • Computational neuroanatomy, DIPY
  • White matter tractometry, pyAFQ

We love working in our Python environment

pyAFQ screenshot
Diffusion MRI analysis using pyAFQ in a Jupyter notebook

We also like to use cloud computing

Pros:

  • Linear scalability
  • Elasticity
  • Ability to handle large datasets

Cons:

  • Learning curve
  • Difficult to choose instances
  • Additional overhead of provisioning resources
Easy EC2 Instance Info
ec2instances.info

Challenge

Reap the benefits of AWS from the comfort of our Python env

Have an adventure without leaving The Shire

The Shire with Python logo

Single Program


    import cloudknot as ck

    # Single program: an existing Python function
    def awesome_func(...):
        ...

    # Wrap the function in a Knot
    knot = ck.Knot(func=awesome_func)

Cloudknot workflow

Multiple Data


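    import cloudknot as ck

    # Single program: an existing Python function
    def awesome_func(...):
        ...

    # Wrap the function in a Knot
    knot = ck.Knot(func=awesome_func)

    ...

    # Multiple data: map the Knot over the arguments
    future = knot.map(args)

Cloudknot workflow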
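
The "single program, multiple data" pattern above can be tried locally before involving AWS. The sketch below is not Cloudknot: it swaps in Python's standard concurrent.futures module to show the same shape (one function, many inputs, results collected from futures). The body of awesome_func and the contents of args are placeholders made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def awesome_func(x):
    """Placeholder for the single program: any function of one argument."""
    return x * x


# Placeholder for the multiple data: one entry per independent task.
args = [1, 2, 3, 4]

# Submit one task per input and keep the futures in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(awesome_func, a) for a in args]
    # result() blocks until the corresponding task has finished.
    results = [f.result() for f in futures]

print(results)
```

If a function works in this form (a pure function of its arguments, with no dependence on shared local state), it is likely a reasonable candidate for the map-style workflow shown on the slide.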

Analysis of MRI data

Analysis of diffusion MRI data
By everyone's idle (originally posted to Flickr as A brain - I has it) [CC BY-SA 2.0], via Wikimedia Commons

Analysis of diffusion MRI data
Yeatman, et al.; Nature Communications 9, 940 (2018)

Analysis of MRI data

Compared against Dask, Myria, and Spark using a previous benchmark study.

Analysis of MRI data

Takeaway

  • The previous MRI benchmark was performed by a team of 4 graduate students and postdocs over 6 months
  • The Cloudknot implementation took Ariel one day
  • For 25 subjects, Cloudknot was 10-25% slower
  • Cloudknot favors workloads where development time is more important than execution time
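
The trade-off in the last bullet can be made concrete with a back-of-the-envelope calculation: how many times would an analysis have to run before a per-run slowdown cancels out the development time saved? The function below is only a sketch of that arithmetic. Its name and all three inputs are assumptions for illustration, and the example numbers are hypothetical apart from the 25% slowdown, which is the upper end of the range quoted above.

```python
def break_even_runs(dev_hours_saved, baseline_run_hours, slowdown):
    """Number of runs after which the extra execution time equals the
    development time saved.

    dev_hours_saved    -- development hours saved (hypothetical input)
    baseline_run_hours -- wall-clock hours per run in the faster framework
    slowdown           -- fractional slowdown per run, e.g. 0.25 for 25%
    """
    extra_hours_per_run = baseline_run_hours * slowdown
    return dev_hours_saved / extra_hours_per_run


# Hypothetical example: 100 development hours saved, 2-hour baseline runs,
# 25% slowdown (upper end of the 10-25% range from the slide).
runs = break_even_runs(dev_hours_saved=100, baseline_run_hours=2, slowdown=0.25)
print(runs)
```

Under these made-up inputs the slower tool stays ahead for a few hundred runs, which matches the slide's point that many exploratory analyses never reach the break-even count.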

Analysis of microscopy data

Sample output from diff_classifier visualization tools
Chad Curtis. diff_classifier, 2018.

Analysis of microscopy data

  • Complicated software dependencies
    • ImageJ and TrackMate
    • Written in Java, scripted in Jython
    • Installation can be managed in a Docker image
  • Large datasets, measured in TB
    • Managed using an AMI that includes a larger volume
    • Default Batch AMI volume is limited to 30 GB
  • Takeaways
    • Longer development time due to custom AMI and Docker image
    • This lab has completely transitioned from a bespoke cluster to AWS
    • Enables capability computing (rather than capacity computing)

The eScience Incubator

  • Work with an eScience data scientist on your project
  • 1 quarter, 2 days/week
  • Proposals due: November 9th
  • eScience website
  • Info session: this Thursday (25th) at 1 PM at the DSS (Phy/Astro tower, 6th floor)

Conclusion

  • Cloudknot favors workloads where development time matters more than execution time.
  • For many data science problems, this is an acceptable trade.
  • Simplified API makes cloud computing more accessible.
    1. import cloudknot
    2. cloudknot.Knot()
    3. map() method

Future developments

  • Generalize to other clouds

GitHub repo: https://github.com/richford/cloudknot

Documentation: https://richford.github.io/cloudknot/index.html

We welcome issues and contributions!