LutzFinger.com How to extract significant business value from big data
Lutz
Disclaimer
This presentation is solely my opinion and not necessarily the
opinion of my employers Harvard, LinkedIn, or Cornell.
Agenda
The right Ask
Data is King
Team-Work: Discover an Ask
Lunch
Decision Tree
Team-Work: Your Model
Pitfalls with Data
Break
Technology
Team-Work: Become Data Driven?
BI vs. Data Science
Build A Team
Why is there such hype?
PREDICTING
The ones who predict:
image by Mike under Creative Commons
McKinsey study forecast:
10 Times More Managers per Data Savvy Person
?
Actionable Insights
ASK the right Questions.
MEASURE the right data – even if it is not Big data.
Take Actions and LEARN from them.
?
Google had the right Question
is difficult to find
Fisheye Learning
Already Known Asks
by rgieseking under Creative Commons (CC BY 2.0)
Who should get an E-Shot?
Territory Planning for my Sales Force
Budget Planning of Marketing Spend
Online Product Recommendation
Real-Time Bidding for Ad Spaces
Customer Segmentation
Social Media Influencers
Call Center Routing based on Questions
Capacity Forecasting …. and more
The “So-What” Test
Data by itself is USELESS
Information by itself is USELESS
Only action counts!
Let’s connect...
Benchmarking
Recommending
A good ‘so what’?
Data by LinkedIn
Example: Laboratory Manager?
A bad ‘so what’
300+ million members at LinkedIn
60,000 with a job title that might fit
19,000 who switched after 3 to 8 years
24 who had the same career path
Benchmarking in Health Care
Recommendations: Your Focus
People You May Know
Groups You May Like
Ads in Which You May Be Interested
Companies You May Want to Follow
Pulse
Similar Profiles
Recommendation in Health Care
Many Good Examples
Benchmark Recommendations
“Data is the new oil” - World Economic Forum
“DATA IS THE NEW OIL”
Oil Mine the oil
Use the oil
Goal
THE 4 Vs OF “BIG DATA”
Volume: Data at scale (TB, PB, …)
Variety: Data in many forms (structured, unstructured, …)
Velocity: Speed (streaming, real time, near time, …)
Veracity: Uncertainty (imprecise, not always up-to-date, …)
1st. Round Monopoly
Photo by William Warby under the Creative Commons (CC BY 2.0)
$3.2 billion
Prediction
Photo by KOMUnews under the Creative Commons (CC BY 2.0)
Boring could be the New Sexy!
Data Might (Not) Be A Barrier To Entry
Public Data is Not
DATA
Categorical:
• Ordinal: Monday, Tuesday, Wednesday
• Nominal: Man, Woman
Quantitative:
• Ratio: Kelvin, Height, Weight
• Interval: Celsius, Fahrenheit
Structure:
• Structured
• Unstructured
• Semi-structured / Meta data
Read more: “On the Theory of Scales of Measurement”, S. S. Stevens, 1946
Data Is King
but not all data is equal.
The Tale of “Social Media” Data
Source: ‘Ask Measure Learn’ by O’Reilly Media
Structured Data Is Often Better
New York Weather in April 2013
Source: ‘Ask Measure Learn’ by O’Reilly Media
Sometimes, it’s worth it.
RE @dave_mcgregor: Publicly pledging to never fly @delta again. The worst airline ever. U have lost my patronage forever du to ur incompetence
Completely unimpressed with @continental or @united. Poor communication, goofy reservations systems and all to turn my trip into a mess.
@SouthWestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection
But Data Is King
This will give birth to devices (i.e., the Star Trek Tricorder) that allow you, the consumer, to self-diagnose, anytime, anywhere.
About Innovation
By Alistair Croll
The media industry has changed! The retail industry has changed! The education sector is changing! The finance industry and the healthcare sector are under attack. Which industry will be next?
Team Work
Photo by Creative Sustainability under the Creative Commons (CC BY 2.0)
A. The Ask
○ Is it actionable? “So What?”
○ Is it Benchmarking or is it Recommendation?
B. The Data
○ Do only you have this data?
○ Do you have a feedback loop?
LUNCH
Are Retail Banks ‘dead’?
Decision Trees Step by Step
by Maciej Lewandowski under Creative Commons (CC BY-SA 2.0)
Split Apples & Mandarins
What Is The Target Variable?
What Are The Features To Describe The Target?
• Weight: light, medium, heavy - or x grams
• Size: round or not
• Color: green, orange, red
• Surface: flat or porous surface
• …
Which Feature Works Best?
● The variable that carries the most information about the target variable.
● The variable that splits the group into subsets that are as homogeneous as possible with respect to the target variable (pure vs. impure).
Color Red?
Color Orange?
Split on Color Red vs. Split on Color Orange
Which One Is Better?
We Need A Way To Describe Chaos
"Claude Elwood Shannon (1916-2001)" by Source. Licensed under Fair use via Wikipedia
ENTROPY
Entropy is a measure of disorder.
Entropy only tells us how impure one individual subset is.
ENTROPY & PROBABILITY
entropy = -p1 * log2(p1) - p2 * log2(p2) - …
● Highest Entropy Reduction
● Highest Information Gain
1st. Entropy Without Split
entropy = -p1 * log2(p1) - p2 * log2(p2)
Apples: 8 out of 15, p(apple) = 8/15
Mandarins: 7 out of 15, p(mandarin) = 7/15
ENTROPY (Without Split):
= -p(apple)*log2(p(apple)) - p(mandarin)*log2(p(mandarin))
= 0.996791632 ≈ 1
very impure
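The calculation above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `entropy` and the use of raw class counts as input are illustrative choices.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a group, given the count of each class."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count == 0:
            continue  # an empty class contributes nothing to the entropy
        p = count / total
        result -= p * math.log2(p)
    return result

# 8 apples and 7 mandarins, as on the slide
before_split = entropy([8, 7])
print(round(before_split, 4))  # 0.9968, i.e. close to 1 (very impure)
```

A perfectly pure group, for example `entropy([6])`, gives 0.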
entropy = -p1 * log2(p1) - p2 * log2(p2)
Color Red?
ENTROPY (Split on Red=’no’): = -6/8*log2(6/8) - 2/8*log2(2/8) = 0.81
ENTROPY (Split on Red=’yes’): = -6/7*log2(6/7) - 1/7*log2(1/7) = 0.59
ENTROPY (After Split on Red):
= 8/15 * ENTROPY (Split on Red=’no’) + 7/15 * ENTROPY (Split on Red=’yes’)
= 0.43 + 0.28 = 0.71
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.71 = 0.29
Color Orange?
ENTROPY (Split on Orange=’yes’): = -6/6*log2(6/6) = 0
ENTROPY (Split on Orange=’no’): = -8/9*log2(8/9) - 1/9*log2(1/9) = 0.50
ENTROPY (After Split on Orange):
= 6/15 * ENTROPY (Split on Orange=’yes’) + 9/15 * ENTROPY (Split on Orange=’no’)
= 0 + 0.30 = 0.30
INFORMATION GAIN = Entropy (Before) - Entropy (After) = 1 - 0.30 = 0.70
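The comparison of the two splits can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's code: the helper names are hypothetical, and the per-branch fruit counts are inferred from the fractions on the slide (for example, 6/7 and 1/7 for Red=’yes’ is read as 6 apples and 1 mandarin). The code uses unrounded entropies, so results differ slightly from a hand calculation that rounds the starting entropy to 1.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a group, given the count of each class."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = sum(parent)
    after = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - after

parent = [8, 7]                  # [apples, mandarins]
red_split = [[6, 1], [2, 6]]     # assumed counts: Red = yes, Red = no
orange_split = [[0, 6], [8, 1]]  # assumed counts: Orange = yes, Orange = no

gain_red = information_gain(parent, red_split)        # about 0.29
gain_orange = information_gain(parent, orange_split)  # about 0.69
print(round(gain_red, 2), round(gain_orange, 2))
```

Orange has the higher information gain, so it is the better first split.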
INFORMATION GAIN (IG)
Information gain measures how much a given feature improves (decreases) entropy over the whole segmentation it creates.
How important is this feature for the prediction?
Decision Tree
Color Orange? ROOT NODE
LEAVES
Decision Tree
Color Orange?
Decision Tree Structure
Which Feature Would Be Better?
Heavy?
Always Start With Highest IG
Hyperplanes
Hyperplane (2 dimensions)
[Figure labels: Mandarins; Red, Green; Light, Heavy]
Back To The Lending Industry
BIG ML
Competitors:
● Algorithms.io
● SnapAnalytx
● wise.io
● Predixion Software
● Google Prediction API
Real Data Set
Build Database
How Do You Deal With Categorical vs. Numeric Variables in Decision Trees?
screenshot from bigML tool
Configure And Build Model
Select The Objective Field - What To Train The Model On?
That is the Row ID - surely no impact.
screenshot from bigML tool
Highest Information Gain
screenshot from bigML tool
Using The Model
screenshot from bigML tool
Found 2,470 New Instances
screenshot from bigML tool
How Can I Now Improve Quality?
Pitfalls with Data
Data & Ethics
More Data & Overfitting
Confidence & Cut-off
Cause & Correlation
…
How Did They Improve Scoring?
Social Network Info
Could social network information improve the quality of our prediction?
Who is more creditworthy?
a. Tim, whose friends are all very creditworthy
b. Tom, whose friends are not creditworthy
Ethical?
Nobel Worthy!
Muhammad Yunus
Photo by University of Salford under Creative Commons CC BY 2.0
In the EU insurers will no longer be allowed to take the gender of their customers into account for insurance premiums:
● young men's premiums will fall by up to 10%
● young women's premiums will rise by up to 30%
by: BBC News: http://www.bbc.com/news/business-12608777
Not Everything That Is Possible Is Legal
The Tale of Big Data
Overfitting
Tailoring a model to the training data at the expense of its ability to generalize to previously unseen data points. The model becomes perfect at describing noise and spurious correlations.
TRADE OFF
Complexity of a Model & Overfitting Likelihood
How Trustworthy Is This Prediction?
• 45 instances
• 59% confidence
screenshot from bigML tool
The Need for Domain Knowledge
Give Credit or Not?
49% Confidence
screenshot from bigML tool
CONFUSION MATRIX
                          Reality: Pregnant (60)    Reality: Not pregnant (940)
Classifier: Pregnant      (A) true positive         (B) false positive
Classifier: Not pregnant  (C) false negative        (D) true negative
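The four cells of the matrix can be turned into the usual quality measures. The sketch below is illustrative only: the slide gives just the column totals (60 pregnant, 940 not pregnant), so the individual cell counts are hypothetical values chosen to match those totals.

```python
# Hypothetical cell counts; only the totals 60 and 940 come from the slide.
true_positive = 45    # (A) classifier says pregnant, reality pregnant
false_positive = 47   # (B) classifier says pregnant, reality not pregnant
false_negative = 15   # (C) classifier says not pregnant, reality pregnant
true_negative = 893   # (D) classifier says not pregnant, reality not pregnant

total = true_positive + false_positive + false_negative + true_negative

# Share of all cases that were classified correctly
accuracy = (true_positive + true_negative) / total
# Of the cases flagged as pregnant, how many really are
precision = true_positive / (true_positive + false_positive)
# Of the cases that really are pregnant, how many were found
recall = true_positive / (true_positive + false_negative)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With such an unbalanced population, accuracy alone can look high even when precision is poor, which is why the choice of cut-off matters.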
Who Invented Big Data Infrastructure?
Issue Of Yahoo
CENTRALIZED SYSTEMS ARE EXPENSIVE
• diminishing returns in power (overhead issue)
• exponential cost to scale
• slow to transport (ETL) the data
Scan a 100 TB dataset on a 1000-node cluster:
• Remote Storage @ 10 MB/s = 165 min
• Local Storage @ 200 MB/s = 8 min
MAKE SYSTEMS FAULT TOLERANT
• with 1000 nodes, a machine a day will break
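The scan times quoted above work out when each node reads about 100 GB, that is, a 100 TB dataset spread evenly over 1000 nodes that all read in parallel. The sketch below shows that arithmetic; the function name and the even-split assumption are illustrative and not from the slides.

```python
def scan_minutes(dataset_tb, nodes, mb_per_second):
    """Minutes to scan a dataset when every node reads its share in parallel."""
    bytes_per_node = dataset_tb * 1e12 / nodes      # assumes an even split
    seconds = bytes_per_node / (mb_per_second * 1e6)
    return seconds / 60

remote = scan_minutes(100, 1000, 10)    # remote storage at 10 MB/s
local = scan_minutes(100, 1000, 200)    # local storage at 200 MB/s
print(round(remote), round(local))      # roughly 167 and 8 minutes
```

Reading from local disks is about 20 times faster here, which is the argument for moving the computation to the data.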
The Vision
CHEAP Systems
• can run on commodity hardware
Computations are done DECENTRALLY
• ability to ‘dispatch’ a task
• parallelize work-streams
Fault TOLERANT
• no matter where or when a machine breaks, it is not an issue
How To Access HDFS
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
Via The Normal Languages
Hadoop Storage (HDFS / HBase / Solr)
Map Reduce
Map Reduce
Hive: SQL Like
Pig / Cascading: Scripting Like
Giraph: Graph Oriented
Mahout: ML Engine
Pro & Con
Store
ETL: Extract / Transform / Load
DB / Key Value Store
Visualize
Pro: way better than traditional BI
Con: heavy tech involvement; 12-18 months for a non-tech company to implement a schema
New Approaches:
● Spark
● Tez
● Flink
Why Is It So Hard To Become Data Driven?
Ingredients of Data Products
The question?
Ask
The need?
The Why?
Measure
The Data?
The features?
Team
All of them are necessary - None of them is sufficient!
The algorithms?
The right Skills?
Collaboration
How To Ingest Ideas
Hack-Days & Incubator
Internal Process
External Competition
Close Collaboration between Business & Data Scientists
“All we do is Data” - Jeff Weiner
What Would You Need To Do To Be A Leader In Data?
Old vs. New
                                      Old School               Today / Big data
Data Amount                           Gigabytes & Terabytes    Petabytes & Exabytes
IT Infrastructure                     Centralized              Decentralized / Parallelized
Data Types                            Structured               Structured & unstructured
Schema                                Stable schema            Schema on the fly
When and How is the ASK formulated?   Set ask                  Ad-hoc ask
How to build a Data Team
Data Scientist
Data Scientist Confusion
New Ways To Automate
Data Scientist
BI Analyst
Engineer
Product Manager
Communication Skills Domain Knowledge
You Learned
image by Mike under Creative Commons
• The Ask is the most Important part - you need Domain Knowledge
• Data Science is NO Rocket Science
• Data is King & There is a Monopoly Game happening
• Data Can be misleading
• Data is a Team Sport
Thank You
What to MEASURE?
• Error
• Correlation
• Cost &
• Privacy
Workbook “Measure” at LutzFinger.com