LutzFinger.com How to extract significant business value from big data
Workshop Data Manager

Apr 13, 2017

Lutz Finger
Transcript
Page 1: Workshop Data Manager

LutzFinger.com

How to extract significant business value from big data

Page 2: Workshop Data Manager

Lutz

Page 3: Workshop Data Manager

Disclaimer

This presentation is solely my opinion and not necessarily the opinion of my employers Harvard, LinkedIn, or Cornell.

Page 4: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data

Break
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 5: Workshop Data Manager

Why is there such hype?

Page 6: Workshop Data Manager

PREDICTING

Page 7: Workshop Data Manager

The ones who predict:

image by Mike under Creative Commons

Page 8: Workshop Data Manager

McK Study forecasted:

10 Times More Managers per Data Savvy Person

Page 9: Workshop Data Manager

?

Page 10: Workshop Data Manager

Actionable Insights

Page 11: Workshop Data Manager

ASK the right Questions.

MEASURE the right data – even if it is not Big data.

Take Actions and LEARN from them.

?

Page 12: Workshop Data Manager

Page 13: Workshop Data Manager

Google had the right Question

is difficult to find

Page 14: Workshop Data Manager

Fisheye Learning

Page 15: Workshop Data Manager

Already Known Asks

by rgieseking under Creative Commons (CC BY 2.0)

Who should get an E-Shot?

Territory Planning for my Sales Force

Budget Planning of Marketing Spend

Online Product Recommendation

Real-Time Bidding for Ad Spaces

Customer Segmentation

Social Media Influencers

Call Center Routing based on Questions

Capacity Forecasting … and more

Page 16: Workshop Data Manager

The “So-What” Test

Page 17: Workshop Data Manager

Data by itself is USELESS

Information by itself is USELESS

Only action counts!

Page 18: Workshop Data Manager

Data by itself is USELESS

Information by itself is USELESS

Let’s connect...

Page 19: Workshop Data Manager

Benchmarking

Recommending

Page 20: Workshop Data Manager

A good ‘so what’?

Data by LinkedIn

Example: Laboratory Manager?

Page 21: Workshop Data Manager

A bad ‘so what’

300+ Million Members at LinkedIn

60,000 with a Job Title that might fit

19,000 who switched after 3 to 8 years

24 who had the same career path

Page 22: Workshop Data Manager

Benchmarking in Health Care

Page 23: Workshop Data Manager

Benchmarking is Overhyped

Page 24: Workshop Data Manager

Recommendations: Your Focus

People You May Know

Groups You May Like

Ads in Which You May Be Interested

Companies You May Want to Follow

Pulse

Similar Profiles

Page 25: Workshop Data Manager

Recommendation in Health Care

Page 26: Workshop Data Manager

Many Good Examples

Benchmark Recommendations

Page 27: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 28: Workshop Data Manager

“Data is the new oil” - World Economic Forum

Page 29: Workshop Data Manager

“DATA IS THE NEW OIL”

Oil

Mine the oil

Use the oil

Goal

Page 30: Workshop Data Manager

THE 4 Vs OF “BIG DATA”

Volume: data at scale (TB, PB …)

Variety: data in many forms (structured, unstructured …)

Velocity: speed (streaming, real time, near time …)

Veracity: uncertainty (imprecise, not always up-to-date …)

Page 31: Workshop Data Manager

1st. Round Monopoly

Photo by William Warby under the Creative Commons (CC BY 2.0)

Page 32: Workshop Data Manager

$3.2 billion

Page 33: Workshop Data Manager

Prediction

Photo by KOMUnews under the Creative Commons (CC BY 2.0)

Boring could be the New Sexy!

Page 34: Workshop Data Manager

Data Might (Not) Be A Barrier To Entry

Page 35: Workshop Data Manager

Data Might (Not) Be A Barrier To Entry

Page 36: Workshop Data Manager

Public Data is Not

Page 37: Workshop Data Manager

DATA

Categorical:
• Ordinal: Monday, Tuesday, Wednesday
• Nominal: Man, Woman

Quantitative:
• Ratio: Kelvin, Height, Weight
• Interval: Celsius, Fahrenheit

Structure:
• Structured
• Unstructured
• Semi-structured / Meta data

Read more: “On the Theory of Scales of Measurement”, S. S. Stevens, 1946

Page 38: Workshop Data Manager

Data Is King

but not all data is equal.

Page 39: Workshop Data Manager

The Tale of “Social Media” Data

Source: ‘Ask Measure Learn’ by O’Reilly Media

Page 40: Workshop Data Manager

Structured Data Is Often Better

New York Weather in April 2013

Source: ‘Ask Measure Learn’ by O’Reilly Media

Page 41: Workshop Data Manager

Sometimes, it’s worth it.

RE @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever du to ur incompetence

Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.

@SouthWestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection

Page 42: Workshop Data Manager

But Data Is King

This will give birth to devices (i.e., the Star Trek Tricorder) that allow you, the consumer, to self-diagnose, anytime, anywhere.

Page 43: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 44: Workshop Data Manager

About Innovation

By Alistair Croll

Page 45: Workshop Data Manager

The media industry has changed! The retail industry has changed! The education sector is changing! The finance industry and the healthcare sector are under attack. Which industry will be next?

Page 46: Workshop Data Manager

Team Work

Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)

A. The Ask
○ Is it actionable? (“So what?”)
○ Is it benchmarking or is it recommendation?

B. The Data
○ Do only you have this data?
○ Do you have a feedback loop?

Page 47: Workshop Data Manager

LUNCH

Page 48: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 49: Workshop Data Manager

Are Retail Banks ‘dead’?

Page 50: Workshop Data Manager

Decision Trees Step by Step

by Maciej Lewandowski under Creative Commons (CC BY-SA 2.0)

Page 51: Workshop Data Manager

Split Apples & Mandarins

Page 52: Workshop Data Manager

What Is The Target Variable?

Page 53: Workshop Data Manager

What Are The Features To Describe The Target?

Page 54: Workshop Data Manager

What Are The Features To Describe The Target?

• Weight: light, medium, heavy - or x grams
• Size: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …

Page 55: Workshop Data Manager

Which Feature Works Best?

● The variable that carries the most information about the target variable.

● The variable that splits the group into subsets that are as homogeneous as possible with respect to the target variable (pure vs. impure).

Page 56: Workshop Data Manager

Color Red?

Color Orange?

Split on Color Red vs. Split on Color Orange

Which One Is Better?

Page 57: Workshop Data Manager

We Need A Way To Describe Chaos

"Claude Elwood Shannon (1916-2001)" by Source. Licensed under Fair use via Wikipedia

Page 58: Workshop Data Manager

ENTROPY

Entropy is a measure of disorder.

Entropy only tells us how impure one individual subset is.

Page 59: Workshop Data Manager

ENTROPY & PROBABILITY

entropy = -p1 * log2(p1) - p2 * log2(p2) - …
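The formula above can be sketched as a small function. This is a minimal illustration, assuming base-2 logarithms (the base that reproduces the 0.9968 value quoted for the 8-apples / 7-mandarins example); the function name `entropy` is chosen here for illustration only.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits for a list of class probabilities."""
    # Classes with probability 0 contribute nothing, so they are skipped.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# A perfectly pure group has entropy 0; a 50/50 group has entropy 1.
pure = entropy([1.0])
mixed = entropy([0.5, 0.5])
fruit = entropy([8 / 15, 7 / 15])  # apples vs. mandarins before any split
print(pure, mixed, round(fruit, 4))
```

The closer the value is to 1 (for two classes), the more mixed the group is.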

Page 60: Workshop Data Manager

● Highest Entropy Reduction

● Highest Information Gain

Page 61: Workshop Data Manager

1st. Entropy Without Split

entropy = -p1 * log2(p1) - p2 * log2(p2)

Apples: 8 out of 15, p(apple) = 8/15

Mandarins: 7 out of 15, p(mandarin) = 7/15

ENTROPY (Without Split):

-p(apple)*log2(p(apple)) - p(mandarin)*log2(p(mandarin))

= 0.996791632 ≈ 1

very impure

Page 62: Workshop Data Manager

entropy = -p1 * log2(p1) - p2 * log2(p2)

Color Red?

ENTROPY (Split on Red=’no’) = -6/8*log2(6/8) - 2/8*log2(2/8) = 0.81

ENTROPY (Split on Red=’yes’) = -6/7*log2(6/7) - 1/7*log2(1/7) = 0.59

ENTROPY (After Split on Red) = 8/15 * ENTROPY (Split on Red=’no’) + 7/15 * ENTROPY (Split on Red=’yes’) = 0.43 + 0.28 = 0.71

INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29

Color Orange?

ENTROPY (Split on Orange=’yes’) = -6/6*log2(6/6) = 0

ENTROPY (Split on Orange=’no’) = -8/9*log2(8/9) - 1/9*log2(1/9) = 0.50

ENTROPY (After Split on Orange) = 6/15 * ENTROPY (Split on Orange=’yes’) + 9/15 * ENTROPY (Split on Orange=’no’) = 0 + 0.30 = 0.30

INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
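The two candidate splits can be recomputed from the branch counts. The sketch below is an illustration under assumptions: each branch is weighted by its share of the 15 fruits, and the assignment of apples and mandarins to branches is inferred from the fractions on the slide (the slide does not say which class is which). The names `entropy_from_counts` and `information_gain` are hypothetical.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a group described by per-class counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    after = sum(sum(b) / total * entropy_from_counts(b) for b in branches)
    return entropy_from_counts(parent_counts) - after

# Counts are [apples, mandarins]; the class assignment is an assumption.
parent = [8, 7]
split_red = [[2, 6], [6, 1]]     # Red = no (8 fruits), Red = yes (7 fruits)
split_orange = [[0, 6], [8, 1]]  # Orange = yes (6 fruits), Orange = no (9 fruits)

ig_red = information_gain(parent, split_red)
ig_orange = information_gain(parent, split_orange)
print(round(ig_red, 2), round(ig_orange, 2))
```

Whichever rounding is used, splitting on Orange yields the larger gain, so it is the better first split.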

Page 63: Workshop Data Manager

INFORMATION GAIN (IG)

Information gain measures how much a given feature improves (decreases) entropy over the whole segmentation it creates.

How important is this feature for the prediction?

Page 64: Workshop Data Manager

Decision Tree

Color Orange? ROOT NODE

LEAVES

Page 65: Workshop Data Manager

Decision Tree

Color Orange?

Decision Tree Structure

Page 66: Workshop Data Manager

Which Feature Would Be Better?

Page 67: Workshop Data Manager

Heavy?

Always Start With Highest IG

Page 68: Workshop Data Manager

Hyperplanes

Page 69: Workshop Data Manager

Hyperplane (2 dimensions)

[Figure: axes Color (Red, Green) and Weight (Light, Heavy); label: Mandarines]

Page 70: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 71: Workshop Data Manager

Back To The Lending Industry

Page 72: Workshop Data Manager

BIG ML

Competitors:

● Algorithms.io
● SnapAnalytx
● wise.io
● Predixion Software
● Google Prediction API

Page 73: Workshop Data Manager

Real Data Set

Page 74: Workshop Data Manager

Build Database

How Do You Deal With Categorical vs. Numeric Variables in Decision Trees?

screenshot from bigML tool

Page 75: Workshop Data Manager

Configure And Build Model

Select The Objective Field - What To Train The Model On?

That is the Row ID - surely no impact.

screenshot from bigML tool

Page 76: Workshop Data Manager

screenshot from bigML tool

Page 77: Workshop Data Manager

screenshot from bigML tool

Page 78: Workshop Data Manager

Highest Information Gain

screenshot from bigML tool

Page 79: Workshop Data Manager

Using The Model

screenshot from bigML tool

Page 80: Workshop Data Manager

Using The Model

screenshot from bigML tool

Page 81: Workshop Data Manager

Using The Model

screenshot from bigML tool

Page 82: Workshop Data Manager

Found 2,470 New Instances

screenshot from bigML tool

Page 83: Workshop Data Manager

How Can I Now Improve Quality?

Page 84: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 85: Workshop Data Manager

Pitfalls with Data

Data & Ethics
More Data & Overfitting
Confidence & Cut-off
Cause & Correlation
…

Page 86: Workshop Data Manager

How Did They Improve Scoring?

Page 87: Workshop Data Manager

Social Network Info

Could social network data improve the quality of our prediction?

Who is more creditworthy?

a. Tim, whose friends are all very creditworthy

b. Tom, whose friends are not creditworthy

Page 88: Workshop Data Manager

Ethical?

Page 89: Workshop Data Manager

Nobel Worthy!

Muhammad Yunus

Photo by University of Salford under Creative Commons CC BY 2.0

Page 90: Workshop Data Manager

In the EU insurers will no longer be allowed to take the gender of their customers into account for insurance premiums:

● young men's premiums will fall by up to 10%

● young women's premiums will rise by up to 30%

by: BBC News: http://www.bbc.com/news/business-12608777

Not Everything That Is Possible Is Legal

Page 91: Workshop Data Manager

Pitfalls with Data

Data & Ethics
More Data & Overfitting
Confidence & Cut-off
Cause & Correlation
…

Page 92: Workshop Data Manager

The Tale of Big Data

Page 93: Workshop Data Manager

Overfitting

To tailor a model to training data at the expense of being generalizable for previously unseen data points. The model becomes perfect in describing noise and spurious correlations.

TRADE OFF

Complexity of a Model & Overfitting Likelihood
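A toy illustration of this trade-off, using made-up data: a "memorizer" model that stores every training label scores perfectly on the training set, including its two noisy labels, but does worse on clean unseen data than a simple threshold rule. All data points, the threshold of 5, and the function names are hypothetical and chosen only to make the effect visible.

```python
# (x, label) pairs. The underlying rule is "label is 1 when x > 5";
# the training set contains two noisy labels (x = 3 and x = 8).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
test = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]

def accuracy(model, data):
    """Share of points for which the model predicts the recorded label."""
    return sum(model(x) == y for x, y in data) / len(data)

lookup = dict(train)

def memorizer(x):
    """Complex model: reproduces every training label, noise included."""
    return lookup[x]

def threshold_rule(x):
    """Simple model: one split, ignores the noisy points."""
    return 1 if x > 5 else 0

for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorizer wins on the training data and loses on the test data, which is the pattern to watch for when a model grows more complex.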

Page 94: Workshop Data Manager

How Trustworthy Is This Prediction?

• 45 instances
• 59% confidence

screenshot from bigML tool
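One common way to turn "how many instances support this prediction" into a cautious confidence figure is the lower bound of the Wilson score interval. The sketch below is an illustration only: whether the tool in the screenshot computes its confidence exactly this way is an assumption, and the count of 33 matching instances out of 45 is hypothetical.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower bound of the Wilson score interval (z = 1.96 for about 95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denominator = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denominator

# Same observed proportion, ten times more data: the bound moves up.
small_sample = wilson_lower_bound(33, 45)
large_sample = wilson_lower_bound(330, 450)
print(round(small_sample, 2), round(large_sample, 2))
```

The point for a manager: a leaf backed by only a few dozen instances deserves much less trust than the raw proportion suggests.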

Page 95: Workshop Data Manager

The Need for Domain Knowledge

Page 96: Workshop Data Manager

Pitfalls with Data

Data & Ethics
More Data & Overfitting
Confidence & Cut-off
Cause & Correlation
…

Page 97: Workshop Data Manager

Give Credit or Not?

49% Confidence

screenshot from bigML tool

Page 98: Workshop Data Manager

CONFUSION MATRIX

Classifier \ Reality | Pregnant (60) | Not pregnant (940)
Pregnant | (A) true positive | (B) false positive
Not pregnant | (C) false negative | (D) true negative
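The four cells of the confusion matrix can be turned into the usual quality measures. In the sketch below only the totals (60 pregnant, 940 not pregnant) come from the slide; the split into A, B, C and D is hypothetical and chosen purely for illustration.

```python
# Hypothetical cell values; only the column totals 60 and 940 are from the slide.
true_positive = 40    # (A) classifier says pregnant, reality pregnant
false_positive = 50   # (B) classifier says pregnant, reality not pregnant
false_negative = 20   # (C) classifier says not pregnant, reality pregnant
true_negative = 890   # (D) classifier says not pregnant, reality not pregnant

total = true_positive + false_positive + false_negative + true_negative

accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

# Baseline: a "model" that always answers "not pregnant".
baseline_accuracy = (false_positive + true_negative) / total

print(accuracy, round(precision, 2), round(recall, 2), baseline_accuracy)
```

With classes this unbalanced, always answering "not pregnant" already reaches 94% accuracy, so accuracy alone says little; where to set the cut-off depends on what a false positive and a false negative each cost.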

Page 99: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 100: Workshop Data Manager

Who Invented Big Data Infrastructure?

Page 101: Workshop Data Manager

Who Invented Big Data Infrastructure?

Page 102: Workshop Data Manager

Issue Of Yahoo

CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data

Scan 100 TB datasets on a 1000 node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min

MAKE SYSTEMS FAULT TOLERANT
1000 nodes - a machine a day will break
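The scan times can be checked with back-of-the-envelope arithmetic. The quoted 165 min and 8 min match a 100 TB dataset spread evenly over 1000 nodes (100 GB per node); that reading, the decimal units, and the assumption that all nodes scan in parallel are stated here as assumptions. The function name `scan_minutes` is hypothetical.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes for every node to scan its equal share of the dataset in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # decimal units: 1 TB = 1,000,000 MB
    return mb_per_node / mb_per_second / 60

remote = scan_minutes(100, 1000, 10)   # data pulled over the network
local = scan_minutes(100, 1000, 200)   # data read from the node's own disk
print(round(remote), round(local))
```

Reading locally is 20 times faster here, which is the argument for moving the computation to the data rather than the data to the computation.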

Page 103: Workshop Data Manager

The Vision

CHEAP Systems
• can run on commodity hardware

Computations are done DECENTRALLY
• ability to ‘dispatch’ a task
• parallelize work-streams

Fault TOLERANT
no matter where and when, a break is not an issue

Page 104: Workshop Data Manager

Page 105: Workshop Data Manager

How To Access HDFS

Hadoop Storage (HDFS / HBase / Solr)

Map Reduce
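The Map Reduce idea can be shown with a toy word count. This is a single-process sketch of the programming model only; it does not use the Hadoop API, and the function names and sample documents are hypothetical. In a real cluster the map and reduce steps would be dispatched to many machines, each working on the data stored locally.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

documents = ["data is king", "data is a team sport"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps parallelize naturally.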

Page 106: Workshop Data Manager

Via The Normal Languages

Hadoop Storage (HDFS / HBase / Solr)

Map Reduce

MapReduce
Hive: SQL Like
Pig/Cascading: Scripting Like
Giraph: Graph Oriented
Mahout: ML Engine

Page 107: Workshop Data Manager

Pro & Con

Hadoop Storage (HDFS / HBase / Solr)

Map Reduce

MapReduce
Hive: SQL Like
Pig/Cascading: Scripting Like
Giraph: Graph Oriented
Mahout: ML Engine

Store

ETL: Extract / Transform / Load

DB / Key Value Store

Visualize

Pro: way better than traditional BI

Con: Heavy tech involvement. 12-18 months for a non-tech company to implement a schema

Page 108: Workshop Data Manager

Pro & Con

Hadoop Storage (HDFS / HBase / Solr)

Map Reduce

MapReduce
Hive: SQL Like
Pig/Cascading: Scripting Like
Giraph: Graph Oriented
Mahout: ML Engine

DB / Key Value Store

Visualize

New Approaches:

● Spark
● Tez
● Flink

Page 109: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 110: Workshop Data Manager

Why Is It So Hard To Become Data Driven

Page 111: Workshop Data Manager

Ingredients of Data Products

Ask
• The question?
• The need?
• The Why?

Measure
• The Data?
• The features?
• The algorithms?

Team
• The right Skills?
• Collaboration

All of them are necessary - None of them is sufficient!

Page 112: Workshop Data Manager

How To Ingest Ideas

Hack-Days & Incubator

Internal Process

External Competition

Close Collaboration between Business & Data Scientists

“All we do is Data” - Jeff Weiner

Page 113: Workshop Data Manager

What Would You Need To Do To Be A Leader In Data

Page 114: Workshop Data Manager

Agenda

The right Ask
Data is King
Team-Work: Discover an Ask

Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team

Page 115: Workshop Data Manager

Old vs. New

Aspect | Old School | Today / Big Data
Data Amount | Gigabytes & Terabytes | Petabytes & Exabytes
IT Infrastructure | Centralized | Decentralized / Parallelized
Data Types | Structured | Structured & unstructured
Schema | Stable schema | Schema on the fly
When and how is the ASK formulated? | Set ask | Ad-hoc ask

Page 116: Workshop Data Manager

How to build a Data Team

Page 117: Workshop Data Manager

Data Scientist

Page 118: Workshop Data Manager

Data Scientist Confusion

Page 119: Workshop Data Manager

New Ways To Automate

Page 120: Workshop Data Manager

Data Scientist

BI Analyst

Engineer

Product Manager

Communication Skills

Domain Knowledge

Page 121: Workshop Data Manager

You Learned

image by Mike under Creative Commons

• The Ask is the most Important part - you need Domain Knowledge

• Data Science is NO Rocket Science

• Data is King & There is a Monopoly Game happening

• Data Can be misleading

• Data is a Team Sport

Page 122: Workshop Data Manager

Thank You

Page 123: Workshop Data Manager

What to MEASURE?

• Error
• Correlation
• Cost &
• Privacy

Workbook “Measure” at LutzFinger.com