Page 1: 2015 03-28-eb-final

How to train the next generation for Big Data Projects: building a curriculum

Christopher G. Wilson, Ph.D. Associate Professor Physiology and Pediatrics Center for Perinatal Biology

Experimental Biology, Mar 28th, 2015

Page 2

Outline

•  Assessing the need for a “Big Data” Analytics course
•  Structure and grading of the course
•  Overview of the curriculum
•  Advantages to Python/IPython
•  Examples/use cases
•  Coalition institutions and participating faculty

Page 3

Is a Big Data analytics course necessary?

•  “Back in the day, when *I* was a graduate student…”
•  First year Physics lab as a training ground…
•  Contemporary students live in a digital world…
•  Office suites are NOT suited to large-scale data analytics!

Page 4

Work-flow of “Big Data” analysis

Page 5

Or…

•  Obtain data
•  Scrub data
•  Explore data
•  Model the data
•  Interpret the data
•  Present the data
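As a minimal sketch of this pipeline (the dataset and column names here are made up for illustration), the steps map naturally onto a few lines of Python using pandas and NumPy, two of the tools the course introduces later:

```python
import numpy as np
import pandas as pd

# Obtain: load data (a small made-up dataset stands in for a real source)
df = pd.DataFrame({
    "dose":     [0.0, 0.5, 1.0, 1.5, 2.0, None, 3.0],
    "response": [0.1, 0.9, 2.1, 2.9, 4.2, 5.0,  6.1],
})

# Scrub: drop incomplete records
clean = df.dropna()

# Explore: summary statistics
print(clean.describe())

# Model: fit a simple linear trend
slope, intercept = np.polyfit(clean["dose"], clean["response"], 1)

# Interpret/Present: report the fitted relationship
print(f"fit: response = {slope:.2f} * dose + {intercept:.2f}")
```

In a real project the “Obtain” step would pull from files, databases, or instruments, and “Present” would feed plots and reports, but the shape of the workflow is the same.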

Page 6
Page 7

Why use Free/Open-Source Software?

•  In this era of shrinking science funding, free software makes more economic sense.
•  Bugs/security issues are fixed FASTER than in proprietary software.
•  With access to the source code, we can customize the software to fit OUR needs.
•  Reproducibility of analyses and algorithms is easier when all code is free and can be shared and examined/dissected.
•  Free/Open-source software tends to be more reliable and stable.
•  See Eric Raymond’s The Cathedral and the Bazaar for a more comprehensive explanation.

Page 8

Using a “flipped” classroom

•  On-line material or reading is provided to the student either before or during the class meeting time
•  The instructor provides a short summary/overview lecture (~20 min)
•  The remaining class time is spent working on the subject matter as individuals and groups—with the instructor and TA present
•  More effective for learning “hands on” skills like programming, bioinformatics, web design, etc.

Page 9

Why use a flipped classroom model instead of lecturing for 50 minutes and assigning homework?

Page 10

The data analytics team

•  Project manager—responsible for setting clear project objectives and deliverables. The project manager should be someone with more experience in data analysis and a more comprehensive background than the other team members.

•  Statistician—should have a strong mathematics/statistics background and will be responsible for reporting and developing the statistics workflow for the project.

•  Visualization specialist—responsible for the design/development of data visualization (figures/animation) for the project.

•  Database specialist—develops ontology/meta-tags to represent the data and incorporates this information into the team's chosen database schema.

•  Content Expert—has the strongest background in the focus area of the project (Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and is responsible for providing background material relevant to the project's focus.

•  Web developer/integrator—responsible for web-content related to the project, including the final report formatting (for web/hardcopy display).

•  Data analyst—the most junior member of the team will take on general responsibilities to assist the other team members. This is a learning opportunity for a team member who is new to data analysis and needs time to develop the skills necessary to fully participate in the workflow.

Page 11

Student self-assessment

Survey created using Google Forms

Page 12

Student self-assessment

From Doing Data Science by Cathy O’Neil and Rachel Schutt

Page 13

Grading

•  Pass/No Pass
•  Weekly quizzes (concepts from short lectures, on-line resources, simple code fragments/pseudo-code, etc.)
•  Projects

•  One individual project (basics of using IPython, simple statistics computed via interaction with R—or using Pandas—and simple visualization of a dataset).

•  Two short projects (small group, designed to develop team-based distribution of workload, team roles assigned by instructor).

•  Larger scale project using a Big Data dataset (students will “self-organize” their team roles). This project is envisioned as the final exam for the class and each team will present their results and project summary to the class.

•  Final projects will be posted on the class website along with IPython notebooks and supporting materials used for the project.

Page 14

Syllabus Overview (10-week course)

Foundations 1: Using text editors, using the IPython notebook for data exploration, using version control software (git), using the class wiki.

Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas, data visualization in IPython.

Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms, bars, etc.), dynamical systems analyses of data variability, information theory measures (entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum), wavelets.

Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-array, proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays.

Data set storage and retrieval: Basics of relational databases, SQL vs. NoSQL, cloud storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology for biomedical/patient data (XML), using secure databases (REDCap).

Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA) and what it means for data management, de-identifying patient data (handling PHI), data security best practices, making data available to the public—implications for data transparency and large-scale data mining.
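As a small taste of the frequency-domain material, a sketch along these lines (a synthetic signal, not actual course code) recovers the dominant frequency of a noisy sine wave with NumPy's FFT:

```python
import numpy as np

# Synthetic "physiological" signal: a 5 Hz sine plus noise, sampled at 100 Hz
fs = 100.0
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.standard_normal(t.size)

# Power spectrum via the real-valued FFT
spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# The largest spectral peak (ignoring the DC bin) should sit at 5 Hz
peak_freq = freqs[np.argmax(spectrum[1:]) + 1]
print(f"dominant frequency: {peak_freq:.1f} Hz")
```

The same pattern—transform, locate spectral features, interpret them against the physiology—carries over to heart-rate and respiratory variability data.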

Page 15

Why Python?

•  Python is an easy-to-learn, complete programming language that has rapidly become an important scientific programming and data analysis environment with usage across multiple disciplines.

•  Python was originally developed with a philosophy of “easy to read” code incorporating object-oriented, imperative, and functional programming styles.

•  Python allows the incorporation of specialized modules based upon low-level code (C/C++) so it can run very fast.

•  Python has modules developed specifically for scientific computing and signal processing (NumPy/SciPy).

•  Python has well-documented import/export hooks into databases (both SQL and NoSQL) that are key to working with Big Data.
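For example, Python's standard library includes an SQLite driver, and pandas can read a query result straight into a DataFrame; the table and column names below are invented for illustration:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real data store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitals (subject_id INTEGER, hr REAL)")
conn.executemany(
    "INSERT INTO vitals VALUES (?, ?)",
    [(1, 72.0), (1, 75.0), (2, 64.0), (2, 66.0)],
)
conn.commit()

# Pull an aggregate query directly into a pandas DataFrame
df = pd.read_sql_query(
    "SELECT subject_id, AVG(hr) AS mean_hr FROM vitals GROUP BY subject_id",
    conn,
)
print(df)
```

Swapping SQLite for a production database is mostly a matter of changing the connection object, which is what makes these hooks attractive for larger projects.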

Page 16

Why IPython?

•  IPython is an interactive data exploration and visualization shell that supports the inclusion of code, inline text, mathematical expressions, 2D/3D plotting, multimedia, and dynamic widgets.

•  IPython is a suite of tools designed to cover scientific workflow from interactive data transformation and analysis to publication.

•  The IPython notebook uses a web browser as its display “front end” and provides a rich interactive environment similar to that seen in Mathematica.

•  IPython notebooks make it possible to save analysis procedures and output—providing reproducible, curatable data analysis and an easy way to share algorithms/methods.

•  IPython supports parallel coding and distributed data analysis to take advantage of cloud/high-performance clusters.

Page 17

Python as a data analytics environment

Page 18

IPython interface

http://ipython.org

Page 19

Line plots with error bars

import numpy as np
import matplotlib.pyplot as plt

# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)

plt.errorbar(x, y, xerr=0.2, yerr=0.4)
plt.show()

Page 20

Heatmaps

import numpy as np
import matplotlib.pyplot as plt

# Generate some test data
x = np.random.randn(8873)
y = np.random.randn(8873)

heatmap, xedges, yedges = np.histogram2d(x, y, bins=50)
extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

plt.clf()
# Transpose and use origin='lower' so the histogram's x/y axes line up
# with the plot axes (imshow otherwise draws row 0 at the top)
plt.imshow(heatmap.T, extent=extent, origin='lower')
plt.show()

Page 21

Scatterplots

import numpy as np
import matplotlib.pyplot as plt

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()

Page 22

3D contour map

from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated
X, Y, Z = axes3d.get_test_data(0.05)

ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
# Project contours of the surface onto each coordinate plane
ax.contour(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
ax.contour(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
ax.contour(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)

ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)
plt.show()

Page 23

Example: Patient physiology waveforms + EMR

Page 24

Example: Interrogating sequence data

Page 25

Summary

•  Free/Libre Open-Source software provides a viable “tool stack” for Big Data analytics.
•  Python provides a robust, easy-to-use foundation for data analytics.
•  IPython provides an easy-to-use interactive front-end for data transformation, analysis, visualization, presentation, and distribution.
•  Team-based science depends upon developing a wide range of data analytics skills.
•  We have developed a coalition of institutions to serve students who wish to become data scientists.

Page 26

Coalition Institutions

Page 27

The coding Queen and her Court…

Abby Dobyns

Princesses of Python

Rhaya Johnson, Regie Felix, and Adaeze Anyanwu

And a Princeling….

Jamie Tillett

Page 28

Acknowledgements

Loma Linda
•  Traci Marin
•  Charles Wang
•  Wilson Aruni
•  Valery Filippov

UC Riverside
•  Thomas Girke (Bioinformatics)

La Sierra University
•  Marvin Payne

CSU San Bernardino
•  Art Concepcion (Bioinformatics)

UC Irvine
•  Alex Nicolau (Comp Sci/Bioinf)

My laboratory’s git repository: https://github.com/drcgw/bass

Page 29

Further reading

•  Doing Data Science by Cathy O’Neil and Rachel Schutt
•  Data Analysis with Open-Source Tools by Philipp Janert
•  The Art of R Programming by Norman Matloff
•  R for Everyone by Jared P. Lander
•  Python for Data Analysis by Wes McKinney
•  Think Python by Allen B. Downey
•  Think Stats by Allen B. Downey
•  Think Complexity by Allen B. Downey
•  Every one of Edward Tufte’s books (The Visual Display of Quantitative Information, Visual Explanations, Envisioning Information, Beautiful Evidence)

Page 30

Questions?!