Data Science, Data Curation, Human-Data Interaction Bill Howe, Ph.D. Associate Professor, Information School Adjunct Associate Professor, Computer Science & Engineering Associate Director and Senior Data Science Fellow, eScience Institute 7/26/2016 Bill Howe, UW 1
Data Science, Data Curation, and Human-Data Interaction
Data Science,
Data Curation,
Human-Data Interaction
Bill Howe, Ph.D.
Associate Professor, Information School
Adjunct Associate Professor, Computer Science & Engineering
Associate Director and Senior Data Science Fellow, eScience Institute
Dave Beck, Director of Research, Life Sciences; Ph.D., Medicinal Chemistry, Biomolecular Structure & Design
Jake VanderPlas, Director of Research, Physical Sciences; Ph.D., Astronomy
Valentina Staneva, Data Scientist; Ph.D., Applied Mathematics and Statistics
Ariel Rokem, Data Scientist; Ph.D., Neuroscience
Andrew Gartland, Research Scientist; Ph.D., Biostatistics
Bryna Hazelton, Research Scientist; Ph.D., Physics
Bernease Herman, Data Scientist; BS, Stats; was SE at Amazon
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the bottleneck to data sharing
Maxim Gretchkin
Hoifung Poon
color = labels supplied as metadata
clusters = 1st two PCA dimensions on the gene expression data itself
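The figure described above places each sample by its first two principal components and colors it by its metadata label. A minimal sketch of that projection follows, assuming a tiny made-up expression matrix (rows are samples, columns are genes) and made-up tissue labels. The power-iteration routine and every name here are illustrative and are not the pipeline used for the talk.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract each column's mean so every gene has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def leading_direction(rows, iters=200, seed=0):
    """Power iteration on X^T X: unit vector along the largest variance."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in rows[0]]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [dot(row, v) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project samples onto their first two principal components."""
    data = center(rows)
    coords = []
    for _ in range(2):
        v = leading_direction(data)
        scores = [dot(row, v) for row in data]
        coords.append(scores)
        # Deflate: remove the component just found before finding the next.
        data = [[x - s * vj for x, vj in zip(row, v)] for row, s in zip(data, scores)]
    return list(zip(coords[0], coords[1]))

# Hypothetical data: 4 samples x 3 genes, with free-text-derived labels.
expression = [
    [1.0, 2.0, 0.0],
    [1.1, 2.1, 0.1],
    [5.0, 6.0, 3.0],
    [5.2, 5.9, 3.1],
]
labels = ["liver", "liver", "brain", "brain"]
points = pca_2d(expression)
for label, (pc1, pc2) in zip(labels, points):
    print(f"{label}: PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

When a sample's label disagrees with the cluster it lands in, that sample is a candidate for a missing or incorrect label.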
Can we use the expression data directly to curate algorithmically?
Maxim Gretchkin
Hoifung Poon
The expression data and the text labels appear to disagree
Maxim Gretchkin
Hoifung Poon
Better Tissue Type Labels
Domain knowledge (Ontology)
Expression data
Free-text Metadata
2 Deep Networks
text
expr
SVM
Deep Curation
Maxim Gretchkin
Hoifung Poon
Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.
Free-text classifier
Expression classifier
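The co-learning idea can be sketched as a small co-training loop: two classifiers, one per view of the same samples, take turns labeling the unlabeled samples they are most sure about and add them to a shared pool. This is a minimal sketch under stated assumptions. It swaps in a nearest-centroid classifier for the deep networks and SVM named on the slides, and the feature values, seed labels, `co_train`, and its parameters are all hypothetical.

```python
import math

def centroids(X, y):
    """Mean feature vector per label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(cents, x):
    """Return (label, margin); margin is the gap between the two nearest centroids."""
    dists = sorted((math.dist(x, c), lab) for lab, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def co_train(text_X, expr_X, seed_labels, rounds=5, per_round=1):
    """Grow a shared label pool; each view contributes its most confident picks."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        unlabeled = [i for i in range(len(text_X)) if i not in labels]
        if not unlabeled:
            break
        new = {}
        for view in (text_X, expr_X):
            idx = list(labels)
            cents = centroids([view[i] for i in idx], [labels[i] for i in idx])
            scored = [(predict(cents, view[i]), i) for i in unlabeled]
            scored.sort(key=lambda t: t[0][1], reverse=True)  # largest margin first
            for (lab, _), i in scored[:per_round]:
                new.setdefault(i, lab)  # first view to claim a sample wins
        labels.update(new)
    return labels

# Hypothetical toy data: each sample is ambiguous in one view, clear in the other.
text_X = [[1, 0], [0, 1], [0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.5, 0.5]]
expr_X = [[5, 1], [1, 5], [3, 3], [4.8, 1.2], [3, 3], [1.1, 4.9]]
seeds = {0: "liver", 1: "brain"}  # e.g. labels obtained by distant supervision

labels = co_train(text_X, expr_X, seeds)
print(sorted(labels.items()))
```

In this toy run each view labels the samples the other view cannot separate, which is the sense in which both models benefit from each other's output.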
VIZIOMETRICS: COMPREHENDING VISUAL INFORMATION
IN THE SCIENTIFIC LITERATURE
Human-Data Interaction
Observations
• Figures in the literature are the currency of scientific ideas
• Almost entirely unexplored
• Our thought: Mine patterns in the visual literature
Step 1: Dismantling Composite Figures
Poshen Lee
ICPRAM 2015
Step 2: Classification
• Divide images into small patches
• Take a random sample
• Run k-means on samples (k = 200)
• For each figure in training set, generate a length-200 feature vector by similarity to clusters. Train a model.
• For each test image, create the vector and classify by the model
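The steps above can be sketched end to end on toy data. This is a minimal sketch under stated assumptions, not the Viziometrics implementation. "Similarity to clusters" is interpreted here as a histogram of nearest-cluster assignments, the model is a 1-nearest-neighbor stand-in, k-means uses deterministic farthest-first seeding, and the 4x4 "figures" are made up. The slides use k = 200, while the toy uses k = 2.

```python
import math
import random

def patches(img, size):
    """Cut a 2-D grayscale image (list of rows) into flattened size x size patches."""
    return [
        [img[r + i][c + j] for i in range(size) for j in range(size)]
        for r in range(0, len(img) - size + 1, size)
        for c in range(0, len(img[0]) - size + 1, size)
    ]

def nearest(centers, p):
    """Index of the center closest to patch p."""
    return min(range(len(centers)), key=lambda k: math.dist(centers[k], p))

def kmeans(points, k, iters=10):
    # Assumption: farthest-first seeding, so the toy run is repeatable.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for i, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def featurize(img, centers, size):
    """Length-k vector: fraction of the figure's patches nearest to each center."""
    ps = patches(img, size)
    vec = [0.0] * len(centers)
    for p in ps:
        vec[nearest(centers, p)] += 1 / len(ps)
    return vec

def classify(train_vecs, train_labels, vec):
    """Stand-in model: label of the nearest training vector (1-NN)."""
    best = min(range(len(train_vecs)), key=lambda i: math.dist(train_vecs[i], vec))
    return train_labels[best]

# Toy 4x4 "figures": horizontal stripes vs. vertical stripes (made-up data).
horizontal = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
vertical = [[1, 0, 1, 0]] * 4
train_imgs = [horizontal, vertical, horizontal, vertical]
train_labels = ["horizontal", "vertical", "horizontal", "vertical"]

SIZE, K = 2, 2
all_patches = [p for img in train_imgs for p in patches(img, SIZE)]
sample = random.Random(0).sample(all_patches, 12)  # random sample of patches
centers = kmeans(sample, K)

train_vecs = [featurize(img, centers, SIZE) for img in train_imgs]
test_img = [[0.9] * 4, [0.1] * 4, [0.9] * 4, [0.1] * 4]
test_vec = featurize(test_img, centers, SIZE)
predicted = classify(train_vecs, train_labels, test_vec)
print(test_vec, predicted)
```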
Do high-impact papers have fewer equations, as indicated by Fawcett and Higginson? (Yes)
Poshen Lee
Jevin West
high impact papers low impact papers
Do high-impact papers have more diagrams? (Yes)
Poshen Lee
Jevin West
Do papers in top journals tend to involve more or less visual information? (More)
Poshen Lee
Jevin West
7/26/2016 Poshen Lee, UW 52
viziometrics.org
Burrows-Wheeler Alignment
Computation
DNA Sequencing
Citations: 7807 +11 since 2016
Eigenfactor: 0.0000574719
DNA Methylation Brain Cancer
Chromosomal Aberrations
Cancer Genome Atlas
Citations: 2094 +7 since 2016
Eigenfactor: 0.0000279023
Memory-efficient Computation
DNA Sequencing
Citations: 7459 +17 since 2016
Eigenfactor: 0.0000875579
Molecular biology
Genetics
Genomics
DNA
Citations: 3766 +15 since 2016
Eigenfactor: 0.0000183255
viziometrics.org
INFORMATION EXTRACTION
FROM FIGURES
Information-critical figures
Metabolic pathway diagrams
Phylogenetic heat maps
Architecture diagrams
Sean Yang
Normalize
Sean Yang
Corner Detection Line Detection
Extract Tree Structure
Sean Yang
VISUALIZATION
RECOMMENDATION
Example of a Learned Rule (1)
low x-entropy => bad scatter plot
bad scatter plot
good scatter plot
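A minimal sketch of how the rule "low x-entropy => bad scatter plot" could be checked. It assumes that "x-entropy" means the Shannon entropy of the x values after equal-width binning. The bin count, the threshold, and the function names are hypothetical choices for illustration, not values from the talk.

```python
import math
from collections import Counter

def binned_entropy(values, bins=10):
    """Shannon entropy (bits) of values after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # every point shares one x value
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def scatter_looks_bad(xs, threshold=1.0):
    """Hypothetical rule: flag a scatter plot whose x values pile up in few bins."""
    return binned_entropy(xs) < threshold

spread_out = list(range(100))             # x values cover the axis evenly
piled_up = [0] * 95 + [1, 2, 3, 4, 5]     # almost all points share one x value

print(binned_entropy(spread_out), scatter_looks_bad(spread_out))
print(binned_entropy(piled_up), scatter_looks_bad(piled_up))
```

Evenly spread x values score near the maximum of log2(10) bits for ten bins, while a column where nearly every point shares one x value scores well under one bit.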
Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
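The periodicity definition above translates directly into code. This is a minimal sketch, assuming that "successive values" means the x values in sorted order and that a variance of zero (perfectly even spacing) maps to infinite periodicity. The threshold in `suggests_timeseries` is a hypothetical choice.

```python
import math
from statistics import pvariance

def periodicity(xs):
    """1 / variance of the gaps between successive (sorted) values."""
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two values")
    var = pvariance(gaps)
    return math.inf if var == 0 else 1.0 / var

def suggests_timeseries(xs, threshold=1.0):
    """Hypothetical rule: evenly spaced x values suggest a time series plot."""
    return periodicity(xs) > threshold

regular = [0, 1, 2, 3, 4]       # e.g. one reading per hour
irregular = [0, 1, 5, 6, 20]    # gaps of 1, 4, 1, 14

print(periodicity(regular), suggests_timeseries(regular))
print(periodicity(irregular), suggests_timeseries(irregular))
```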
Voyager
Kanit “Ham” Wongsuphasawat
Dominik Moritz
InfoVis 15
Jeff Heer Jock Mackinlay
Anushka Anand
SCALABLE GRAPH
CLUSTERING
Seung-Hee Bae
Scalable Graph Clustering
Version 1: Parallelize Best-known Serial Algorithm
ICDM 2013
Version 2: Free 30% improvement for any algorithm
TKDD 2014 SC 2015
Version 3: Distributed approx. algorithm, 1.5B edges
Recap
• “Human-Data Interaction” is the bottleneck!
– SQLShare: Mining SQL logs to uncover user behavior
– Myria/RACO: Polystore Optimization
– Deep Curation: Zero-training labeling of scientific
• OCCs: Big Data / Database researcher with broad impact and expertise in research data management
• Democratizing Data Science
– Ourselves: Reduce overhead in attention-scarce regimes
– Other fields: Reduce overhead of interdisciplinary research
– The public: Reduce overhead of communicating with the public and policymakers
• SQLShare
– Why? What? Impact?
– Key: RDM, NSF-funded, hundreds of users
– Are these workloads any different than a typical database?
• HaLoop
– Why? What? Impact?
– Key: Papers, new subfield in big data
• Myria
– Why? What? Impact?
– Key: Funding
• Viziometrics
– Why? What? Impact?
• Data Curation through an Algorithmic Lens
– Why? What? Impact?
– Volume, variety, velocity. Volume: tasks that scale with the number of records: movement, validation. Variety: tasks that scale with the number of datasets: metadata attachment, cataloging, metadata verification. Velocity: tasks that scale with the time since release. Data journalism, legal cases
– Example? Maxim’s work. Prevalence of missing and incorrect labels.
– Is this dataset what it says it is?
– Why? Reproducibility crisis
– Is this fully automatic? No. Training data, computational steering