Optimizing the Data Supply Chain for Data Science

Post on 19-Jan-2017


Category:

Data & Analytics


Optimizing the Data Supply Chain for Data Science

61 Broadway Suite 1105 New York, NY 10006

info@vital.ai http://www.vital.ai

Marc Hadfield, CEO, Vital A.I.

about: vital ai


Software Applications: Artificial Intelligence, Machine Learning, Data Science.

Software Vendor & Consulting Services

agenda


• Data Models
• How A.I., Data Science, & Data Governance relate
• Data Supply Chain & the Data Product
• Problem: the “Telephone Game” across the DSC
• Architecture Transition from Data Warehouse to DSC
• Data Models and DSC; a Framework for Solutions
• Examples
• Collaboration & Visualization

note: general methodology, with some specific examples from Vital AI implementations.

takeaways:


• The Data Supply Chain is a supply chain to deliver Data Products

• Data Models can capture the implicit meaning of data (and that is the goal!)

• Data Models can help negotiate the implicit differences across the DSC

• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners

about data models:


Semantic Models

big data:


volume, velocity, variety, veracity

variety: data models

“Product”: different meaning in Manufacturing vs Retail context

Healthcare, same entity: “Patient”, “InsuredPerson”, “BillableEntity”

example:


Class: Person

Property: birthday

Standardized Unique Global Identifier (URI)
data type: date
relationship with property: age
allowed range of values (can’t be born in the future)
typical (average/expected) value…
(Birthdays in Wikipedia vs Customer Database)
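The property description above can be sketched in code. This is only an illustration, not VitalSigns output: the `DateProperty` class, the example URI, and the 1900 lower bound are assumptions made for the sketch. It shows a date-typed property with a unique identifier, an allowed range (no birthdays in the future), and the derived relationship to age.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DateProperty:
    """Hypothetical description of a date-valued property in a data model."""
    uri: str                          # globally unique identifier of the property
    earliest: Optional[date] = None   # assumed lower bound of the allowed range

    def validate(self, value: date, today: date) -> List[str]:
        """Return a list of problems; an empty list means the value is acceptable."""
        problems = []
        if value > today:
            problems.append("value is in the future")
        if self.earliest is not None and value < self.earliest:
            problems.append("value is before the earliest allowed date")
        return problems


def age_from_birthday(birthday: date, today: date) -> int:
    """Derived property: age in whole years, computed from the birthday."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


# Example URI and lower bound are placeholders, not part of any real model.
BIRTHDAY = DateProperty(uri="http://example.org/model#birthday", earliest=date(1900, 1, 1))
```

A "typical value" check (for example, comparing against the expected distribution of birthdays in a customer database) could be added as a further rule on the same property.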

about: vital ai tech


Vital AI Development Kit (VDK)

VitalSigns: Data Modeling & Code Generation

VitalService: Common API for Databases, Machine Learning, Apache Spark, Data Transforms

about: vital ai tech


VitalServiceQuery

ExecutableQuery

Query Generator

Common Query API:

Relational DB (SQL)
Graph DB (SPARQL)
Key/Value Store
NoSQL DB
Document DB
Apache Spark
Hive (Hadoop)
Predictive Models (a query for an unknown value)

Goal: Build A.I. applications across variety of infrastructure with consistent API & Models.

example data:


Person:Sender <-- hasSender -- Message -- hasRecipient --> Person:Recipient

example “MetaQL” query:


GRAPH {
    value segments: ["mydata"]
    ARC {
        node_constraint { Message.class }
        constraint { "?person1 != ?person2" }
        ARC_AND {
            ARC {
                edge_constraint { Edge_hasSender.class }
                node_constraint { Person.props().emailAddress.equalTo("john@example.org") }
                node_constraint { Person.class }
                node_provides { "person1 = URI" }
            }
            ARC {
                edge_constraint { Edge_hasRecipient.class }
                node_constraint { Person.class }
                node_provides { "person2 = URI" }
            }
        }
    }
}

“Person” may have subtypes, like Student or Employee.
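To make the shape of that query concrete, here is a rough in-memory equivalent in Python. It is not MetaQL and not the VitalService API; the classes and the `recipients_of` function are hypothetical stand-ins. It finds messages whose sender has a given email address and returns the sender and recipient URIs, and it shows that a subtype such as `Student` matches wherever a `Person` is required.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Person:
    uri: str
    email: str


@dataclass(frozen=True)
class Student(Person):  # subtype: matches anywhere a Person is required
    pass


@dataclass(frozen=True)
class Message:
    uri: str
    sender: Person      # stands in for the hasSender edge
    recipient: Person   # stands in for the hasRecipient edge


def recipients_of(messages: List[Message], sender_email: str) -> List[Tuple[str, str]]:
    """Return (sender URI, recipient URI) pairs for messages sent from sender_email."""
    return [
        (m.sender.uri, m.recipient.uri)
        for m in messages
        if isinstance(m.sender, Person)
        and isinstance(m.recipient, Person)
        and m.sender.email == sender_email
        and m.sender.uri != m.recipient.uri  # the "?person1 != ?person2" constraint
    ]
```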

a.i. and data quality


data models & machine learning:


using the meaning of classes and properties, automatically generate predictive models.

predictive model features: birthday, zip code, …
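One way to picture "using the meaning of properties to generate features" is the sketch below. It is an assumption-laden illustration, not the Vital AI implementation: the `PERSON_MODEL` metadata and the two encoding rules (dates become elapsed days, categories become one-hot indicators) are invented for the example.

```python
from datetime import date
from typing import Any, Dict

# Hypothetical model metadata: property name -> declared data type.
PERSON_MODEL: Dict[str, str] = {"birthday": "date", "zipCode": "category"}


def derive_features(record: Dict[str, Any], model: Dict[str, str], today: date) -> Dict[str, float]:
    """Turn a record into numeric features using only the types declared in the model."""
    features: Dict[str, float] = {}
    for name, datatype in model.items():
        value = record.get(name)
        if value is None:
            continue  # property missing from this record: emit no feature
        if datatype == "date":
            # A date becomes an elapsed-time feature (days before `today`).
            features[f"{name}_days_ago"] = float((today - value).days)
        elif datatype == "category":
            # A categorical value becomes a one-hot indicator.
            features[f"{name}={value}"] = 1.0
    return features
```

Because the encoding is driven by the model rather than by each dataset, adding a property to the model is enough to make it available as a feature.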

data governance = defining the meaning of data = feature (pre)engineering


critical aspect of data science


Progression of Analytics:


where a.i. happens

Progression of Analytics:

Garbage In = Garbage Out


= Bad A.I.

data governance required for Good A.I.


one more point on data governance…

think outside the box (data warehouse)

data governance: data in motion

inside data warehouse vs. outside data warehouse

supply chain


supply chain:


product

data supply chain:


data product

data product:

Retail Recommendations…

Shipping/Logistics Optimization…

Compliance, Auditing, Security, Fraud Detection…

why data supply chain?


Partner DW Your DW

"No matter who you are, most of the smartest people work for someone else." (Bill Joy)

why data supply chain?


Partner DW Your DW

"No matter who you are, most of the smartest data works for someone else." (Bill Joy, revised: "people" replaced with "data")

data supply chain


Partner DW

Your DW

why not ETL?

Partner DW


Extract…


not quite as expected…


Transform…

a bit extreme…

Load…


a bit messy…

Clean…


a lot of manual effort…

… your imported data

Your DW

Partner DW

Why?


what goes wrong?

telephone game…

You

Partner

Model “A”

Model “B”

Implicit Model

Resolution: Make explicit the implicit. Align Data Models.


Reason: Implicit assumptions in the data. ETL can’t see the forest for the trees. (or it’s very difficult with missing assumptions)

Example: Internet of Things


Predictive Analytics

“Nest for Office Buildings”

Office Tower with Building Management System (BMS) containing 100,000 monitored points (temperature, energy usage of chiller, fan speed, etc.) with significant missing data, errors, and noise.

Reconciliation of data to produce predictive models to minimize energy usage.

Rules for data correctness.

Sensor Data Validation:


Source data had temperature values of “0” (zero), which meant either that the temperature was 0 degrees or that the sensor had an error.

The Data Model “knows” that it’s rarely 0 degrees in July (such a reading is many standard deviations from the expected value), and that a 0 reading on a day in December can be compared to weather data for reasonableness.

If the Data Model also knows the maintenance schedule for the sensors, then it “knows” when to expect “0” error values and can exclude them.

Missing Maintenance Assumptions. Fill in secondary (weather) data for validation.
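The validation rule described above might look like the following sketch. The function name, the labels it returns, and the 15-degree tolerance are assumptions for illustration; the actual rules used in the project are not given in these slides.

```python
from typing import Optional


def classify_reading(
    reading: float,
    weather_temp: Optional[float],
    in_maintenance: bool,
    tolerance: float = 15.0,  # assumed: maximum plausible gap to the weather data
) -> str:
    """Label a temperature reading as 'valid', 'sensor_error', or 'unknown'."""
    if reading != 0.0:
        return "valid"            # only the ambiguous "0" value needs a rule
    if in_maintenance:
        return "sensor_error"     # a 0 is expected while the sensor is serviced
    if weather_temp is None:
        return "unknown"          # no secondary data to check against
    # Compare against weather data: a 0 in July is implausible, in December it may be real.
    if abs(reading - weather_temp) > tolerance:
        return "sensor_error"
    return "valid"
```

The maintenance schedule and the weather feed are exactly the "missing assumptions" the source data does not carry on its own, so the rule needs both supplied from outside.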

how did we get here?

Architecture Review: a quick step back…

What is a Data Supply Chain architecture?

“traditional” data warehouse:

ETL within the organization.

Data Governance across the organization.

DW

tech co. “agile” data warehouse:


storage

compute

HDFS

Spark

DataSets

Jobs

Batch/Streaming

Build Predictive Models

Realtime: Spark/Storm

hadoop cluster

enterprise: data lake


storage

compute

HDFS

Spark

X (save $)

“Data Swamp”

aside: Data Lake


better analogy: Scriptorium

library, manuscript copying, & book distribution.

but not as pithy as “Lake”…

tech co. microservices (micro-SOA):


storage

compute

service

“Composed” App

external: social data, weather API

independent clusters, local data expertise

optimize development processes, scale up.

microservices example:


Amazon: product search uses 170 independent microservices

including services for predicting customer characteristics, getting product images, etc.

http://www.infoworld.com/article/2903144/application-development/how-to-succeed-with-microservices-architecture.html

Netflix similar architecture

Data Supply Chain:


storage

compute

service

Data Product

“ETL”

Owner “A” Owner “B”

optimize development processes, scale up.

independent clusters, local data, ownership

Interaction Points:


Data Product

service

compute ETL

Owner “A” Owner “B”

Data Lineage: Cloudera Navigator


…within a Data Warehouse

trace back jobs that produced every data field.

Data Supply Chain with Provenance:


include provenance data directly in imported dataset.
use in rules to interpret the data.

entity-123 | hasSource | datasource-A
entity-123 | name | “John Doe”
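As a sketch of how a rule could use that provenance, the example below stores the two statements as triples and looks up the source before interpreting the name. The `SOURCE_TRUST` table and the "verified"/"unverified" labels are hypothetical; the slides only say that provenance travels with the data and is available to rules.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)

TRIPLES: List[Triple] = [
    ("entity-123", "hasSource", "datasource-A"),
    ("entity-123", "name", "John Doe"),
]

# Hypothetical rule input: how much each data source is trusted.
SOURCE_TRUST: Dict[str, str] = {"datasource-A": "verified"}


def value_of(triples: List[Triple], subject: str, prop: str) -> Optional[str]:
    """Return the first value stored for (subject, prop), or None if absent."""
    for s, p, v in triples:
        if s == subject and p == prop:
            return v
    return None


def name_with_trust(
    triples: List[Triple], subject: str, trust: Dict[str, str]
) -> Tuple[Optional[str], str]:
    """Return the entity's name together with a trust label derived from its source."""
    source = value_of(triples, subject, "hasSource")
    label = trust.get(source, "unverified") if source is not None else "unverified"
    return value_of(triples, subject, "name"), label
```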

Data Warehouse B

Interaction Points: Data Models


Data Product

service

compute ETL

Data Models: Gatekeepers & Transform

Owner “A” Owner “B”

Data Supply Chain using Models:


storage

compute

service

Data Product

ETL

Owner “A” Owner “B”

Model Server

Data Models: focus of data governance

Semantic Data Models:


Make explicit the meaning of data

Transformation and Validation Rules leverage the Model and Meaning. Such Rules may be packaged with the Model, and managed together.

Protect against implicit assumptions

Example: Financial Services


A B C

Service Provider

Reconciliation of Corporate Structure across 1,000’s of organizations.

Compliance Rules barring communication between “researchers” and “traders”.

Rules to infer if “Mary” is a “researcher” or “trader”.

Conflicting concepts of “Branch-Office”, “Direct-Report”, etc. across the Globe.

Example: Hospital Group


A B C

Data Analytics

Reconciliation across Patient Records, Insurance, & Billing for Patient Predictive Analytics.

Rules for identity: “same person”

Data Models:


OWL: Semantic Ontology Model (W3C Standard, Various Standards for Rules)

VitalSigns: Generate Code (validation, transformation, …)

VitalSigns: Versioning, Dependencies, Exchange, Storage, Change Management (Semantic “Diff”)

Example: Personally Identifiable Information


Data Governance determines that “Profession” and “ZipCode” cannot be used together. (Maybe a single “Dentist” in a small town…)

Within a single Data Warehouse we can bar these data elements from being combined. But:

Microservice A provides value of “Profession”

Microservice B provides value of “ZipCode”

How to enforce that these two microservices cannot be combined?

Example: Personally Identifiable Information


Validation code enforcing data usage:

Person person123 = get_person_details("entity-123")

// this call works:
person123.profession = get_profession(person123)

// this call blocks because of data model validation:
// person123 already has the "profession" property
person123.zipcode = get_zipcode(person123)
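A minimal runnable sketch of that behaviour follows, assuming the rule is expressed as a set of property names that may never appear together on one object. `GuardedPerson`, `DataModelViolation`, and `FORBIDDEN_COMBINATIONS` are hypothetical names; generated VitalSigns code may enforce the rule quite differently.

```python
from typing import Any, Dict, FrozenSet, List

# Assumed governance rule: these properties must not be combined on one object.
FORBIDDEN_COMBINATIONS: List[FrozenSet[str]] = [frozenset({"profession", "zipcode"})]


class DataModelViolation(Exception):
    """Raised when setting a property would break a data model rule."""


class GuardedPerson:
    def __init__(self, uri: str) -> None:
        self.uri = uri
        self._props: Dict[str, Any] = {}

    def set(self, name: str, value: Any) -> None:
        """Set a property unless it would complete a forbidden combination."""
        present = set(self._props) | {name}
        for combo in FORBIDDEN_COMBINATIONS:
            if combo <= present:
                raise DataModelViolation(
                    f"{sorted(combo)} may not be combined on {self.uri}"
                )
        self._props[name] = value

    def get(self, name: str) -> Any:
        return self._props[name]
```

Because the check lives in the object generated from the model, it applies no matter which microservice supplied each value.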

Semantic Data Models:

Gatekeepers

Externally Managed.

Active not Passive, more like “code”.

Defining what should exist, not cataloguing what exists.

Can decide when to be tolerant or strict.

Collaborative Conversations:


Infrastructure DevOps

Data Scientists

Business + Domain Experts

Developers

Semantic Model

Collaborative Conversations:


Partner A: Business + Domain Experts, Semantic Model

Partner B: Business + Domain Experts, Semantic Model

Model Alignment

What Concepts to combine, not what Tables to combine (that comes later).

Authoring Tool: OWL IDE Protege


Visualization: Semantic Data


Visualization: WebVOWL


in conclusion, takeaways:


• The Data Supply Chain is a supply chain to deliver Data Products

• Data Models can capture the implicit meaning of data (and that is the goal!)

• Data Models can help negotiate the implicit differences across the DSC

• Data Models offer a means to collaborate on data standards (meaning) across the DSC partners

Questions?


Marc Hadfield, CEO, Vital A.I.

marc@vital.ai
