Ground: Managing Metadata in the Big Data Ecosystem
Vikram Sreekanti
AMP Lab, U.C. Berkeley
What is data?
• 20th Century Data: Accounting
• “02/16: Sally withdrew $100 from checking.”
• 21st Century Data: Raw materials
• Sally’s online purchase records…
• ... and timeseries data from her FitBit
• ... and popular films for various demographics
• ... and weather forecast for the next 48 hours.
What is Metadata?
• Data about data
• This used to be so simple!
• But... schema on use
• One of many changes
What is Metadata?
• Interpretation
• Analysis
• Interoperability
• Reproducibility
• Governance & The Collective
Analysis
Case: Data Analysis
Wrangle
Visualize
Analyze
Data
Results
METAMNESIA
“One of the things that my research advisor Mike Harrison taught me to do is to WRITE THINGS DOWN.” (JIM GRAY)
TENSION: “I’M IN THE FLOW.” vs. “WRITE THINGS DOWN.”
You will never know your data better than when you are wrangling and analyzing it.
The flow state
TAKE ACTION
Data Analytics Infrastructure team
“Write down what you can, we’ll fill in the rest.”
Taking Action: Football
• Video data Annotations.
• Metadata from manual annotation
Taking Action: Football
• Video data Annotations.
• Passive metadata: sensor streams
• NFL + MS = Cool.
• Metadata + Simulation
• NFL + MS + EA = POV.
Capture what people do with data. Augment as appropriate. Interpolate as needed.
Taking Action: Data Analysis
Analysis
• tap the flow
• fill in the rest
metadata-on-use
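One way to picture "tap the flow" is a wrapper that records what an analyst does as a side effect of doing it. The sketch below is illustrative only: `tap`, `USAGE_LOG`, and `drop_missing` are hypothetical names, not part of Ground or any tool named in the talk.

```python
import functools
import time

# Hypothetical in-memory store; a real system would persist these events.
USAGE_LOG = []

def tap(func):
    """Record each call to `func` as a metadata event, then return its result unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        USAGE_LOG.append({
            "step": func.__name__,          # what was done
            "when": time.time(),            # when it was done
            "n_inputs": len(args) + len(kwargs),
            "result_type": type(result).__name__,
        })
        return result
    return wrapper

@tap
def drop_missing(rows):
    """A stand-in wrangling step: discard rows containing a missing value."""
    return [r for r in rows if None not in r]

cleaned = drop_missing([(1, "a"), (2, None)])
```

The analyst only calls `drop_missing`; the log entry is captured passively, and anything not captured here (intent, hypothesis) is what would have to be "filled in" later.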
Interoperability
CASE: Data Debugging
Relationships
Master Data on Customers
Call detail from HDFS
Data Wrangling Script
Python
Numpy
Churn Analysis
Hypothesis
Wrangle
Python v2.7
Numpy v1.9.3
Wrangle v3.0
Versioned Relationships
Master Data on Customers: MDM 10/11/15
Call detail from HDFS: v1.26
Data Wrangling Script: git hash 0x6987a68a9876b7
Churn Analysis: git hash 0x987667e876f033
Hypothesis
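A minimal sketch of versioned relationships, using the names and version labels from the diagram. The edges in `DEPENDS_ON` are an assumption made for illustration; the slide text lists the nodes but not which one points to which.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """A named thing pinned to one specific version."""
    name: str
    version: str

python = Artifact("Python", "v2.7")
numpy = Artifact("Numpy", "v1.9.3")
customers = Artifact("Master Data on Customers", "MDM 10/11/15")
calls = Artifact("Call detail from HDFS", "v1.26")
script = Artifact("Data Wrangling Script", "git hash 0x6987a68a9876b7")
churn = Artifact("Churn Analysis", "git hash 0x987667e876f033")

# ASSUMPTION: these edges are illustrative, not taken from the slide.
DEPENDS_ON = {
    script: [python, numpy, customers, calls],
    churn: [script],
}

def closure(artifact):
    """Return every pinned version that `artifact` transitively depends on."""
    seen = []
    stack = list(DEPENDS_ON.get(artifact, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.append(dep)
            stack.extend(DEPENDS_ON.get(dep, []))
    return seen

for dep in closure(churn):
    print(dep.name, "@", dep.version)
```

When a result changes unexpectedly, a listing like this narrows data debugging to "which pinned version moved?".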
Common ground?
• SW market exploding
• n² connections
• Need a shared place to Write it down, Link it up
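A quick arithmetic sketch of the "n² connections" point: if every tool integrates directly with every other tool, the number of pairwise links is n(n-1)/2, which grows on the order of n², while a shared place needs one link per tool. The function names are hypothetical.

```python
def pairwise_links(n_tools):
    """Point-to-point integrations if every tool talks to every other tool."""
    return n_tools * (n_tools - 1) // 2

def shared_hub_links(n_tools):
    """Integrations if every tool talks only to one shared metadata service."""
    return n_tools

for n in (5, 20, 100):
    print(n, pairwise_links(n), shared_hub_links(n))
# prints:
# 5 10 5
# 20 190 20
# 100 4950 100
```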
Interpretation
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
CASE: Recommender System
• Consider a recommender system like Netflix
• Consists of data (user views & ratings, movie features) and a statistical model.
• For any piece of data: “Sally watched The Shining”
• This fact is much more meaningful with a model: the model makes the recommendations!
• The model is also no good without data: data is used to train & refine the model.
CASE: Recommender System
• Any machine learning system has this coupling.
• Interpretation of the data depends on the model we choose.
• Models are parametrized by data.
• The meaning & value of data in any context is the coupling of model + data.
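The coupling of model and data can be sketched with a toy co-occurrence recommender. This is not how Netflix works; the view log below is invented for illustration, and `train` and `recommend` are hypothetical names.

```python
from collections import Counter, defaultdict
from itertools import permutations

def train(view_log):
    """Build a co-occurrence model: how often two films share a viewer."""
    by_user = defaultdict(set)
    for user, film in view_log:
        by_user[user].add(film)
    model = defaultdict(Counter)
    for films in by_user.values():
        for a, b in permutations(films, 2):
            model[a][b] += 1
    return model

def recommend(model, film, k=1):
    """Interpret one fact ("someone watched `film`") through the trained model."""
    return [title for title, _ in model[film].most_common(k)]

# Hypothetical view log: (user, film) pairs.
VIEW_LOG = [
    ("Ann", "The Shining"), ("Ann", "Alien"),
    ("Bob", "The Shining"), ("Bob", "Alien"), ("Bob", "Up"),
    ("Cy", "Up"), ("Cy", "Toy Story"),
]

model = train(VIEW_LOG)
print(recommend(model, "The Shining"))  # prints ['Alien']
```

The fact "Sally watched The Shining" yields a recommendation only through `model`, and `model` exists only because of `VIEW_LOG`: change either one and the meaning of the fact changes.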
Reproducibility
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use
Can metadata cure cancer?
No.
But it’s going to be useful.
Case: Cancer Genomics
General population data
(“1000 genomes”)
Compare
Clustering Algorithm
Patient Data
Put leukemia cells on slide
Robot puts chemistry on slides
Robot puts slide on gene sequencer
× 1000 patients
Data Lineage
Back to tissue and bar codes on slides!
Logical & Physical
• Tissue
• Data (and metadata)
• Code
It gets messier
General population data
(“1000 genomes”)
Compare
Put leukemia cells on slide
Robot puts chemistry on slides
Robot puts slide on gene sequencer
× 1000 patients
Parameter Sweep
Parameters
Clustering Algorithm
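A parameter sweep multiplies the lineage to track: every parameter combination is its own run, and failed runs are worth recording too. The sketch below is a hedged illustration; `sweep` and `toy_clustering` are hypothetical stand-ins, not the pipeline from the slide.

```python
import itertools

def sweep(run, grid):
    """Call `run(**params)` for every combination in `grid`, logging each outcome."""
    notebook = []
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        entry = {"params": params}
        try:
            entry["result"] = run(**params)
            entry["status"] = "ok"
        except Exception as exc:
            # Failures are part of the lineage as well.
            entry["error"] = repr(exc)
            entry["status"] = "failed"
        notebook.append(entry)
    return notebook

def toy_clustering(k, min_points):
    """Hypothetical stand-in for the clustering step; fails when k is too large."""
    n_points = 6
    if k * min_points > n_points:
        raise ValueError("not enough points for k clusters")
    return {"clusters": k}

log = sweep(toy_clustering, {"k": [2, 3, 4], "min_points": [2]})
for entry in log:
    print(entry["params"], entry["status"])
```

Here the run with `k=4` fails and is still logged, so a later reader can see which corners of the parameter space were already tried.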
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Reproducibility
• instrumentation
• lineage: success & failure
lab notebook-on-use
Governance & The Collective
Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use
Back at the Enterprise
We’re talking Governance.
• And self-service for end users
CASE: Jupyter Notebook
• An electronic lab notebook
• Evolution of IPython Notebook
• Writing it down since 2011
Running a Class from Notebooks
Assignments are notebooks
• Students create versions
• Staff solution is a version
Grading
• Execute each notebook on some data
• Annotate the notebook with grades
• Update a grades spreadsheet
Homework Governance
Skools ’n rools!
• Students can’t see each other’s HW
• Students can’t see solution
• Unless they’ve turned in theirs and it’s after April 12 and they have a Berkeley login
• Graders can’t see student names
• Students can’t update grade spreadsheet
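Rules like these are small predicates over metadata about people, versions, and dates. A minimal sketch follows, assuming hypothetical names (`Student`, `can_see_solution`, `grader_view`); the year in the example date is a placeholder, since the slide gives only "April 12".

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Student:
    has_berkeley_login: bool
    has_submitted: bool

def can_see_solution(student, today, release_day):
    """The solution is visible only if all three conditions from the slide hold."""
    return (
        student.has_submitted
        and today > release_day
        and student.has_berkeley_login
    )

def grader_view(submission):
    """Return a copy of a submission with the student's name removed."""
    return {k: v for k, v in submission.items() if k != "student_name"}

# Placeholder year: the slide does not say which year's April 12 is meant.
RELEASE_DAY = date(2016, 4, 12)
sally = Student(has_berkeley_login=True, has_submitted=True)
print(can_see_solution(sally, date(2016, 4, 13), RELEASE_DAY))  # prints True
```

Each condition reads from metadata that a notebook-based workflow already produces (a submitted version, a timestamp, a login), which is why governance can sit on top of the same metadata store.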
Collective Intelligence
Rules should be a small part of school.
If we do things well…
• People get smarter
• Educational software gets smarter
• Organizations get smarter
Fueled by observing, learning, iterating.
Write things down, fill in later.
Collective, Intelligent Governance
By the people. Emergent governance.
• Sandbox → Annotations → Awareness → Reuse → Debate → Consensus
For the people. Collective Intelligence emerges.
http://blogs.forrester.com/michele_goetz/15-09-24-are_data_preparation_tools_changing_data_governance
Analysis
• tap the flow
• fill in the rest
metadata-on-use
Interoperability
• metadata as protocol
• general formats
standards-on-use
Reproducibility
• instrumentation
• lineage: success & failure
lab notebook-on-use
Governance & The Collective
• by & for the people
• collective intelligence
governance-on-use
Interpretation
• models interpret data
• data conditions models
(data + metadata)-on-use
What we’re doing: Ground
• Focus our design on useful & interesting challenges for real problems
• Develop a general but expressive data model
• Don’t prescribe design principles; support as many as possible
Data Model: Core
• “Thing”: basic logical object
• Immutable
• Every “Thing” has a version history.
Models
Usage
Versions
Design Principle: Immutability & Versioning
• Recall versioning
• Reproducibility = time travel.
• Alternate histories: What-if?
Python v2.7
Numpy v1.9.3
Master Data on Customers: MDM 10/11/15
Call detail from HDFS: v1.26
Data Wrangling Script: git hash 0x6987a68a9876b7
Churn Analysis: git hash 0x987667e876f033
Hypothesis
Wrangle v3.0
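The immutability and versioning ideas above can be sketched with frozen records linked by parent pointers. This is an illustration under assumptions, not Ground's actual data model or API; `Version` and `history` are hypothetical names, and the payload strings are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Version:
    """One immutable snapshot of a thing; `parent` links to the prior snapshot."""
    thing: str
    payload: str
    parent: Optional["Version"] = None

def history(version):
    """Walk parent links from the given snapshot back to the first one."""
    out = []
    while version is not None:
        out.append(version.payload)
        version = version.parent
    return out

v1 = Version("wrangling-script", "drop null rows")
v2 = Version("wrangling-script", "drop null rows; dedupe", parent=v1)
# An alternate history ("what-if?"): a second child of v1; v2 is untouched.
v2_alt = Version("wrangling-script", "impute null rows", parent=v1)

print(history(v2))      # prints ['drop null rows; dedupe', 'drop null rows']
print(history(v2_alt))  # prints ['impute null rows', 'drop null rows']
```

Because nothing is ever overwritten, "time travel" is just following `parent` links, and a what-if branch costs one new record.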
Data Model: Mantle
• Structures, Nodes, Edges, Graphs
• Subclasses of “Thing”
• Allows for modeling of data
Models
Usage
Versions
Data Model: Crust
• Lineage relationships between “Thing”s
Models
Usage
Versions
Design Principle: Lineage
• Lineage is fundamental to any metadata system
• Versions track the state of things (“who”, “what”, and “when”)
• Lineage captures causes & influences (“how”)
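The split between versions (who, what, when) and lineage (how) can be sketched as two record types. The names, the example records, and the `explain` helper are all hypothetical and only illustrate the distinction drawn above.

```python
from collections import namedtuple

# A version records state: what it is, who made it, and when.
Version = namedtuple("Version", ["what", "who", "when"])
# A lineage edge records how one version was derived from another.
LineageEdge = namedtuple("LineageEdge", ["source", "target", "how"])

raw = Version("call detail", "ingest job", "day 1")
clean = Version("cleaned call detail", "analyst", "day 2")
report = Version("churn analysis", "analyst", "day 3")

EDGES = [
    LineageEdge(raw, clean, "data wrangling script"),
    LineageEdge(clean, report, "churn model"),
]

def explain(target):
    """Trace back through lineage edges, listing how `target` came to be."""
    steps = []
    frontier = [target]
    while frontier:
        node = frontier.pop()
        for edge in EDGES:
            if edge.target == node:
                steps.append(
                    f"{edge.source.what} -> {edge.target.what} via {edge.how}"
                )
                frontier.append(edge.source)
    return steps

for step in explain(report):
    print(step)
```

Versions alone could say that `report` exists and who made it; only the edges say which script and which inputs produced it.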
Design Principle: Postel’s Law
• “Be conservative in what you do, be liberal in what you accept from others.”
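Applied to metadata, Postel's Law might look like accepting many spellings of a version tag while emitting exactly one. A minimal sketch, assuming hypothetical helpers `accept_version` and `emit_version`; the canonical "v1.9.3" form is chosen here only because it matches the labels used earlier in the slides.

```python
def accept_version(raw):
    """Liberal input: tolerate several spellings of a dotted version tag."""
    text = str(raw).strip().lower()
    for prefix in ("version", "ver", "v"):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return tuple(int(part) for part in text.strip(" .").split("."))

def emit_version(parts):
    """Conservative output: always one canonical form, such as 'v1.9.3'."""
    return "v" + ".".join(str(p) for p in parts)

for raw in ("v1.9.3", " Version 2.7 ", "Ver.3.0", "1.26"):
    print(repr(raw), "->", emit_version(accept_version(raw)))
```

Input that cannot be read as dotted integers still raises `ValueError`; being liberal about spelling does not mean guessing at content.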
Design Principle: Neutrality
• Open-source & vendor-neutral
• Related to Postel’s Law: Be as diverse as possible while still being useful.
Check out what we’ve done so far: https://github.com/ground-metadata/ground
Reach out if you’re interested: @viksree
Most slides were taken from Joe Hellerstein’s Strata NYC 2015 Talk: “Time to go Meta (on use)”.