Optimizing the Data Supply Chain for Data Science
Post on 19-Jan-2017
61 Broadway Suite 1105 New York, NY 10006
info@vital.ai http://www.vital.ai
Marc Hadfield, CEO, Vital A.I.
about: vital ai
Software Applications: Artificial Intelligence, Machine Learning, Data Science.
Software Vendor & Consulting Services
agenda
• Data Models
• How A.I., Data Science, & Data Governance relate
• Data Supply Chain & the Data Product
• Problem: the “Telephone Game” across the DSC
• Architecture Transition from Data Warehouse to DSC
• Data Models and DSC; a Framework for Solutions
• Examples
• Collaboration & Visualization
note: general methodology, with some specific examples from Vital AI implementations.
takeaways:
• The Data Supply Chain is a supply chain to deliver Data Products
• Data Models can capture the implicit meaning of data (and that is the goal!)
• Data Models can help negotiate the implicit differences across the DSC
• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners
about data models:
Semantic Models
big data:
volume, velocity, variety, veracity
variety: data models
“Product”: different meaning in Manufacturing vs Retail context
Healthcare, same entity: “Patient”, “InsuredPerson”, “BillableEntity”
example:
Class: Person
Property: birthday
• Standardized Unique Global Identifier (URI)
• data type: date
• relationship with property: age
• allowed range of values (can’t be born in the future)
• typical (average/expected) value… (Birthdays in Wikipedia vs Customer Database)
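A minimal sketch of how such a property definition could be written down in code, assuming a plain Python representation. The `PropertyDef` class, the example URI, and both helper functions are hypothetical and only illustrate the idea of keeping identifier, data type, related property, and allowed range together; this is not the VitalSigns format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class PropertyDef:
    uri: str                        # globally unique identifier of the property
    datatype: type                  # expected type of the value
    related: Tuple[str, ...] = ()   # properties that can be derived from this one


# Hypothetical URI, for illustration only.
BIRTHDAY = PropertyDef(
    uri="http://example.org/ontology#hasBirthday",
    datatype=date,
    related=("age",),
)


def validate_birthday(value, today: date) -> List[str]:
    """Return a list of problems; an empty list means the value is acceptable."""
    if not isinstance(value, BIRTHDAY.datatype):
        return ["birthday must be a date"]
    if value > today:
        return ["birthday cannot be in the future"]
    return []


def age_on(birthday: date, today: date) -> int:
    """Derive the related 'age' property from the birthday."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years
```

A "typical value" check (for example, flagging birthdays far from the expected distribution of a customer database) could be added as a further rule in the same style.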
about: vital ai tech
Vital AI Development Kit (VDK)
VitalSigns: Data Modeling & Code Generation
VitalService: Common API for Databases, Machine Learning, Apache Spark, Data Transforms
about: vital ai tech
VitalServiceQuery
ExecutableQuery
Query Generator
Common Query API:
• Relational DB (SQL)
• Graph DB (SPARQL)
• Key/Value Store
• NoSQL DB
• Document DB
• Apache Spark
• Hive (Hadoop)
• Predictive Models (a query for an unknown value)
Goal: Build A.I. applications across a variety of infrastructure with a consistent API & Models.
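To make the "consistent API" idea concrete, here is a small sketch assuming two toy in-memory backends. The `QueryBackend` interface and both store classes are hypothetical stand-ins, not the VitalService API; the point is only that application code is written once against the interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class QueryBackend(ABC):
    """Hypothetical common interface; each backend answers the same query."""

    @abstractmethod
    def select(self, type_name: str, **equals) -> List[Dict]:
        """Return records of the given type whose properties equal the given values."""


def _matches(record: Dict, type_name: str, equals: Dict) -> bool:
    return record.get("type") == type_name and all(
        record.get(key) == value for key, value in equals.items()
    )


class DocumentStore(QueryBackend):
    """Toy stand-in for a document database."""

    def __init__(self, documents: List[Dict]):
        self._documents = list(documents)

    def select(self, type_name: str, **equals) -> List[Dict]:
        return [d for d in self._documents if _matches(d, type_name, equals)]


class KeyValueStore(QueryBackend):
    """Toy stand-in for a key/value store."""

    def __init__(self, items: Dict[str, Dict]):
        self._items = dict(items)

    def select(self, type_name: str, **equals) -> List[Dict]:
        return [v for v in self._items.values() if _matches(v, type_name, equals)]


def people_named(backend: QueryBackend, name: str) -> List[Dict]:
    # Application code depends only on the interface, not on the backend.
    return backend.select("Person", name=name)
```

Swapping `DocumentStore` for `KeyValueStore` leaves `people_named` unchanged, which is the property the slide is after.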
example data:
Message -> hasSender -> Person: Sender
Message -> hasRecipient -> Person: Recipient
example “MetaQL” query:
GRAPH {
    value segments: ["mydata"]
    ARC {
        node_constraint { Message.class }
        constraint { "?person1 != ?person2" }
        ARC_AND {
            ARC {
                edge_constraint { Edge_hasSender.class }
                node_constraint { Person.props().emailAddress.equalTo("john@example.org") }
                node_constraint { Person.class }
                node_provides { "person1 = URI" }
            }
            ARC {
                edge_constraint { Edge_hasRecipient.class }
                node_constraint { Person.class }
                node_provides { "person2 = URI" }
            }
        }
    }
}
“Person” may have subtypes, like Student or Employee.
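The query above can be read as "find messages sent by john@example.org to a different person". A rough in-memory sketch of that pattern follows, assuming a toy graph of nodes and `(edge type, source, target)` tuples. The class names mirror the slide, but the function and data layout are hypothetical and not how MetaQL is executed. Note how `isinstance` lets a subtype such as `Employee` satisfy the `Person` constraint.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Person:
    uri: str
    email: str


@dataclass(frozen=True)
class Employee(Person):   # subtype: still matches a Person constraint
    pass


@dataclass(frozen=True)
class Message:
    uri: str


Edge = Tuple[str, str, str]  # (edge type, source URI, target URI)


def sender_recipient_pairs(nodes, edges: List[Edge], sender_email: str):
    """Return (message, person1, person2) URIs where person1 sent the message
    to person2, person1 has the given email, and person1 != person2."""
    by_uri = {n.uri: n for n in nodes}
    results = []
    for msg in (n for n in nodes if isinstance(n, Message)):
        senders = [by_uri[t] for (e, s, t) in edges if e == "hasSender" and s == msg.uri]
        recipients = [by_uri[t] for (e, s, t) in edges if e == "hasRecipient" and s == msg.uri]
        for p1 in senders:
            if not isinstance(p1, Person) or p1.email != sender_email:
                continue
            for p2 in recipients:
                if isinstance(p2, Person) and p1.uri != p2.uri:
                    results.append((msg.uri, p1.uri, p2.uri))
    return results
```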
a.i. and data quality
data models & machine learning:
using the meaning of classes and properties, automatically generate predictive models.
predictive model features: birthday, zip code, …
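As a rough illustration of deriving features from what the model says about each property, assume the model tags `birthday` as a date and `zipCode` as a category. The `PROPERTY_KINDS` table and `features_for` function are hypothetical, and the age calculation is deliberately approximate (days divided by 365).

```python
from datetime import date
from typing import Dict

# Hypothetical metadata taken from the data model: property name -> kind.
PROPERTY_KINDS = {"birthday": "date", "zipCode": "category"}


def features_for(record: Dict, today: date) -> Dict[str, int]:
    """Turn a record into numeric features using only the model's property kinds."""
    features = {}
    for name, kind in PROPERTY_KINDS.items():
        value = record.get(name)
        if value is None:
            continue  # missing values simply produce no feature
        if kind == "date":
            # Dates become an approximate age in whole years.
            features[name + "_age_years"] = (today - value).days // 365
        elif kind == "category":
            # Categories become a one-hot style indicator.
            features[name + "=" + str(value)] = 1
    return features
```

Because the mapping is driven by the model rather than by hand-written code per dataset, adding a property to the model is enough to get a corresponding feature.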
data governance = defining the meaning of data = feature (pre)engineering
critical aspect of data science
Progression of Analytics:
where a.i. happens
Progression of Analytics:
Garbage In = Garbage Out = Bad A.I.
data governance required for Good A.I.
one more point on data governance…
think outside the box (data warehouse)
data governance: data in motion
inside data warehouse vs. outside data warehouse
supply chain
supply chain:
product
data supply chain:
data product
data product:
Retail Recommendations…
Shipping/Logistics Optimization…
Compliance, Auditing, Security, Fraud Detection…
why data supply chain?
Partner DW Your DW
“No matter who you are, most of the smartest people work for someone else.” - Bill Joy.
why data supply chain?
Partner DW Your DW
“No matter who you are, most of the smartest data works for someone else.” - Bill Joy (revised: “people” becomes “data”)
data supply chain
Partner DW
Your DW
why not ETL?
Partner DW
Extract…
not quite as expected…
Transform…
a bit extreme…
Load…
a bit messy…
Clean…
a lot of manual effort…
… your imported data
Your DW
Partner DW
Why?
what goes wrong?
telephone game…
You
Partner
Model “A”
Model “B”
Implicit Model
Reason: Implicit assumptions in the data. ETL can’t see the forest for the trees (or it’s very difficult with missing assumptions).
Resolution: Make explicit the implicit. Align Data Models.
Example: Internet of Things
Predictive Analytics
“Nest for Office Buildings”
Office Tower with Building Management System (BMS) containing 100,000 monitored points (temperature, energy usage of chiller, fan speed, etc.) with significant missing data, errors, and noise.
Reconciliation of data to produce predictive models to minimize energy usage.
Rules for data correctness.
Sensor Data Validation:
Source data had temperature values of “0” (zero), which meant either that the temperature was 0 degrees or that the sensor had an error.
The Data Model “knows” that it’s rarely 0 degrees in July (many standard deviations from the expected value), and that on a day in December the temperature can be compared to weather data for reasonableness. If the Data Model also knows the maintenance schedule for the sensors, then it “knows” when to expect “0” error values and can exclude them.
Missing Maintenance Assumptions. Fill in secondary (weather) data for validation.
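The validation logic described above can be sketched as a small rule, assuming the reading comes from an outside-air temperature sensor, that temperatures are in degrees Celsius, and that a secondary weather value is available for the same day. The function name, the labels it returns, and the 15-degree tolerance are all hypothetical choices for illustration.

```python
from datetime import date
from typing import Set


def classify_reading(reading_c: float, day: date, weather_c: float,
                     maintenance_days: Set[date], tolerance_c: float = 15.0) -> str:
    """Label one temperature reading as 'ok', 'maintenance', or 'suspect'."""
    if reading_c != 0:
        return "ok"                 # only the ambiguous zero values need a decision
    if day in maintenance_days:
        return "maintenance"        # zero is the expected error value: exclude it
    if abs(reading_c - weather_c) <= tolerance_c:
        return "ok"                 # a genuine 0 degrees is plausible on this day
    return "suspect"                # zero disagrees with the weather data
```

For example, a zero reading on a July day with 29 degrees outside is labelled `suspect`, while the same reading on a December day with -2 degrees outside is accepted.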
how did we get here?
Architecture Review: a quick step back…
What is a Data Supply Chain architecture?
“traditional” data warehouse:
ETL within the organization.
Data Governance across the organization.
DW
tech co. “agile” data warehouse:
storage
compute
HDFS
Spark
DataSets
Jobs
Batch/Streaming
Build Predictive Models
Realtime: Spark/Storm
hadoop cluster
enterprise: data lake
storage
compute
HDFS
Spark
X (save $)
“Data Swamp”
aside: Data Lake
better analogy: Scriptorium
library, manuscript copying, & book distribution.
but not as pithy as “Lake”…
tech co. microservices (micro-SOA):
storage
compute
service
“Composed” App
external: social data, weather API
independent clusters, local data expertise
optimize development processes, scale up.
microservices example:
Amazon: product search uses 170 independent microservices
including services for predicting customer characteristics, getting product images, etc.
http://www.infoworld.com/article/2903144/application-development/how-to-succeed-with-microservices-architecture.html
Netflix: similar architecture
Data Supply Chain:
storage
compute
service
Data Product
“ETL”
Owner “A” Owner “B”
optimize development processes, scale up.
independent clusters, local data, ownership
Interaction Points:
Data Product
service
compute ETL
Owner “A” Owner “B”
Data Lineage: Cloudera Navigator
…within a Data Warehouse
trace back jobs that produced every data field.
Data Supply Chain with Provenance:
include provenance data directly in the imported dataset, and use it in rules to interpret the data.
entity-123 | hasSource | datasource-A
entity-123 | name | “John Doe”
Data Warehouse B
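A small sketch of using provenance in a rule, assuming the triples above are held as Python tuples. The `NAME_FORMAT_BY_SOURCE` table, `datasource-B`, and its "last, first" convention are invented for illustration; the idea is only that the source recorded alongside the data decides how a value is interpreted.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TRIPLES: List[Triple] = [
    ("entity-123", "hasSource", "datasource-A"),
    ("entity-123", "name", "John Doe"),
]

# Hypothetical knowledge about each source's conventions.
NAME_FORMAT_BY_SOURCE = {"datasource-A": "first last", "datasource-B": "last, first"}


def value_of(triples: List[Triple], subject: str, predicate: str) -> Optional[str]:
    """Return the first object recorded for (subject, predicate), if any."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def normalized_name(triples: List[Triple], subject: str) -> Optional[str]:
    """Interpret the name according to the conventions of its recorded source."""
    name = value_of(triples, subject, "name")
    source = value_of(triples, subject, "hasSource")
    if name is not None and NAME_FORMAT_BY_SOURCE.get(source) == "last, first":
        last, first = [part.strip() for part in name.split(",", 1)]
        return first + " " + last
    return name
```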
Interaction Points: Data Models
Data Product
service
compute ETL
Data Models: Gatekeepers & Transform
Owner “A” Owner “B”
Data Supply Chain using Models:
storage
compute
service
Data Product
ETL
Owner “A” Owner “B”
Model Server
Data Models: focus of data governance
Semantic Data Models:
Make explicit the meaning of data
Transformation and Validation Rules leverage the Model and Meaning.
Such Rules may be packaged with the Model, and managed together.
Protect against implicit assumptions
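One way to picture rules being packaged and versioned with a model is sketched below, assuming records are plain dictionaries. `ModelPackage`, its `admit` method, and the two example rules are hypothetical names for illustration, not part of VitalSigns or OWL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def rename_zip(record: Dict) -> Dict:
    """Transformation rule: the partner says 'zip'; our model says 'zipCode'."""
    out = dict(record)
    if "zip" in out:
        out["zipCode"] = out.pop("zip")
    return out


def zip_is_five_digits(record: Dict) -> List[str]:
    """Validation rule: return a list of problems (empty when the record passes)."""
    value = str(record.get("zipCode", ""))
    return [] if value.isdigit() and len(value) == 5 else ["zipCode must be 5 digits"]


@dataclass
class ModelPackage:
    """A model version together with the rules that travel with it."""
    name: str
    version: str
    transforms: List[Callable[[Dict], Dict]] = field(default_factory=list)
    validators: List[Callable[[Dict], List[str]]] = field(default_factory=list)

    def admit(self, record: Dict) -> Tuple[Dict, List[str]]:
        """Apply every transform, then collect problems from every validator."""
        for transform in self.transforms:
            record = transform(record)
        problems = [p for check in self.validators for p in check(record)]
        return record, problems


PERSON_MODEL = ModelPackage("person-model", "1.0.0", [rename_zip], [zip_is_five_digits])
```

Calling `PERSON_MODEL.admit(record)` at an interaction point gives the receiving owner both the transformed record and a list of problems, so it can choose to be tolerant or strict.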
Example: Financial Services
A B C
Service Provider
Reconciliation of Corporate Structure across 1,000s of organizations.
Compliance Rules barring communication between “researchers” and “traders”.
Rules to infer if “Mary” is a “researcher” or “trader”.
Conflicting concepts of “Branch-Office”, “Direct-Report”, etc. across the globe.
Example: Hospital Group
A B C
Data Analytics
Reconciliation across Patient Records, Insurance, & Billing for Patient Predictive Analytics.
Rules for identity: “same person”
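A deliberately simple sketch of a “same person” rule, assuming each system exposes a name and a birthday for its Patient, InsuredPerson, or BillableEntity record. Real identity rules would weigh more evidence (addresses, identifiers, fuzzy matching); the function names and the matching criteria here are assumptions.

```python
from datetime import date
from typing import Dict


def normalize_name(name: str) -> str:
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def same_person(a: Dict, b: Dict) -> bool:
    """Treat two records as the same person when name and birthday both agree."""
    return (normalize_name(a["name"]) == normalize_name(b["name"])
            and a["birthday"] == b["birthday"])
```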
Data Models:
OWL: Semantic Ontology Model (W3C Standard, Various Standards for Rules)
VitalSigns: Generate Code (validation, transformation, …)
VitalSigns: Versioning, Dependencies, Exchange, Storage, Change Management (Semantic “Diff”)
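To give a feel for change management between model versions, here is a rough sketch that compares two versions of a model held as nested dictionaries (class name, then property name, then data type). This is only a structural comparison under that assumed representation; a true semantic “diff” over OWL would compare axioms and URIs, and the function below is not the VitalSigns implementation.

```python
from typing import Dict, List

Model = Dict[str, Dict[str, str]]  # class name -> property name -> data type


def model_diff(old: Model, new: Model) -> Dict[str, List[str]]:
    """List properties that were added, removed, or changed type between versions."""

    def flatten(model: Model) -> Dict[str, str]:
        return {
            cls + "." + prop: dtype
            for cls, props in model.items()
            for prop, dtype in props.items()
        }

    before, after = flatten(old), flatten(new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in before.keys() & after.keys() if before[k] != after[k]),
    }
```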
Example: Personally Identifiable Information
Data Governance determines that “Profession” and “ZipCode” cannot be used together. (Maybe a single “Dentist” in a small town…)
Within a single Data Warehouse we can bar these data elements from being combined. But:
Microservice A provides value of “Profession”
Microservice B provides value of “ZipCode”
How to enforce that these two microservices cannot be combined?
Example: Personally Identifiable Information
Validation code enforcing data usage:
Person person123 = get_person_details("entity-123")

// this call works:
person123.profession = get_profession(person123)

// this call blocks because of data model validation:
// person123 already has the "profession" property
person123.zipcode = get_zipcode(person123)
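The same behaviour can be sketched in runnable form, assuming the forbidden combination is declared once alongside the model and checked whenever a property is set. `GuardedRecord`, `UsageViolation`, and `FORBIDDEN_COMBINATIONS` are hypothetical names; generated VitalSigns code may enforce this differently.

```python
class UsageViolation(Exception):
    """Raised when setting a property would create a forbidden combination."""


# Declared with the data model: these properties may not appear together.
FORBIDDEN_COMBINATIONS = [frozenset({"profession", "zipcode"})]


class GuardedRecord:
    def __init__(self, uri: str):
        # Bypass the guarded __setattr__ for the record's own bookkeeping.
        object.__setattr__(self, "uri", uri)
        object.__setattr__(self, "_values", {})

    def __setattr__(self, name, value):
        present = set(self._values) | {name}
        for combo in FORBIDDEN_COMBINATIONS:
            if combo <= present:
                raise UsageViolation(
                    "cannot combine: " + ", ".join(sorted(combo))
                )
        self._values[name] = value

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for model properties.
        try:
            return self.__dict__["_values"][name]
        except KeyError:
            raise AttributeError(name)
```

Because the check lives with the record type rather than inside either microservice, it applies no matter which service supplied each value.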
Semantic Data Models:
• Gatekeepers
• Externally managed.
• Active not passive, more like “code”.
• Defining what should exist, not cataloguing what exists.
• Can decide when to be tolerant or strict.
Collaborative Conversations:
Infrastructure DevOps
Data Scientists
Business + Domain Experts
Developers
Semantic Model
Collaborative Conversations:
Business + Domain Experts
Semantic Model
Business + Domain Experts
Semantic Model
Partner A Partner B
Model Alignment
What Concepts to combine, not what Tables to combine (that comes later).
Authoring Tool: OWL IDE Protégé
Visualization: Semantic Data
Visualization: WebVOWL
in conclusion, takeaways:
• The Data Supply Chain is a supply chain to deliver Data Products
• Data Models can capture the implicit meaning of data (and that is the goal!)
• Data Models can help negotiate the implicit differences across the DSC
• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners
Questions?
Marc Hadfield, CEO, Vital A.I.
marc@vital.ai