Data Science
Developing a New Profession
©2014 Gary Rector
Influences
Data Science
Math
Data Engineering
Scientific Method
Business Knowledge
Advanced Computing
Visualization
Curiosity
Based on a diagram by Calvin Andrus
“Sexiest Job of the 21st Century”
Harvard Business Review, October 2012:
“…distributed file system processing…related open-source tools, cloud computing, and data visualization…are important breakthroughs, [but] at least as important are the people with the skill set (and the mind-set) to put them to good use. On this front, demand has raced ahead of supply.”
Some Relevant Skills
• Math
– Probability and statistics
– Algebra, calculus, logic, set theory
• Data and Software Engineering
– Algorithms and programming
– Representation, modeling
– Pattern recognition, data mining
• Business Knowledge
– Domain-specific
• Communication skills!
The Scientific Method
• Question! (Be fearless.)
• Observe, Research, Model (Work.)
• Hypothesize, Predict (Think.)
• Experiment, Test, Document (Work!)
• Analyze, Revise (OK to be wrong.)
• Communicate! (Share.)
Definition
Data Science is:
The discipline of applying the scientific method to collections of data, using appropriate technology, to reveal previously-unknown information.
A Little History
I am a member of the 3rd generation of modern computer scientists.
• Gen 0: Babbage, Lovelace, Jacquard, …
• Gen 1: Turing, von Neumann, Hopper, Eckert, …
• Gen 2: Wang, Cray, Dijkstra, Knuth, Wirth, …
• Gen 3: Yourdon, Thompson, Cerf, Berners-Lee, …
• Gen 4: Brin, Page, Zuckerberg, Stone, …
The first commercial computer was installed a year after I was born.
Actuarial Science
• John Graunt, 1662
– mortality tables
• James Dodson, 1762
– Equitable Life Assurance Society
• National Council on Workmen’s Compensation, 1920
– Calculation of rates required 2 full months of continual work by actuary teams
• 1930s & 40s
– Development of stochastic techniques
Source: Wikipedia
Turing and Enigma, circa 1942
Alan Turing and the Bletchley Park crew used the Bombe machines, mathematics, and luck to analyze mountains of radio transcriptions and crack the German Enigma code, helping to end WWII.
Some of Turing’s work remains classified.
The US Presidential Election of 1952
Walter Cronkite and Charles Collingwood report live on CBS that the frontrunner is Stevenson, but by 8:30 pm EST, with only a tiny percentage of votes counted, their UNIVAC predicts a landslide win for Eisenhower at 100-to-1 odds.
Data science enters the public mind!
Data is the Driver
Source: Volvo
A Big Bonus
• Netflix awarded $1 million to a team of scientists who improved the Netflix recommendation system’s ability to predict which movies you will like.
Source: Netflix
Saving Lives
At Stanford University, a machine learned to diagnose breast cancer better than human doctors by discovering an innovative method that considers more factors in a tissue sample.
Source: Stanford University School of Medicine
Dramatic Cost Reduction
UPS saves $600 million per year in fuel costs by avoiding left turns (less time waiting)
Source: NY Times 12/9/2007
Disease Surveillance
©2013, Association for Computing Machinery
Increasing the Value of Data
Data → Information → Knowledge → Wisdom
Selective Use
It’s all about the “I” in “IT”
Any hardware technology is only a peripheral part of a data science solution.
The heart of a solution is the algorithm.
The value of a solution lies in the resultant information.
A Brief Aside: “Metrics”
A metric is an abstraction of the notion of distance. A true metric has 4 properties:
1. D(x,y) >= 0
2. D(x,y) = 0 if and only if x=y
3. D(x,y) = D(y,x)
4. D(x,z) <= D(x,y) + D(y,z)
Most business uses of “metric” simply mean “measurement”, and some have no precise meaning at all; they may be opinions.
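As a minimal sketch of the four properties above, the Python below brute-force checks them on a small set of sample points. The helper name `is_metric`, the tolerance, and the sample points are illustrative assumptions, not part of the original slides. Passing on a sample only shows that no counterexample was found there; it is not a proof.

```python
from itertools import product
import math

def is_metric(dist, points, tol=1e-9):
    """Check the four metric properties on every combination of sample points."""
    for x, y in product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:                         # 1. D(x,y) >= 0
            return False
        if (abs(d) <= tol) != (x == y):      # 2. D(x,y) = 0 if and only if x = y
            return False
        if abs(d - dist(y, x)) > tol:        # 3. D(x,y) = D(y,x)
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:   # 4. triangle inequality
            return False
    return True

def euclidean(p, q):
    return math.dist(p, q)

def squared_euclidean(p, q):
    # A common "distance-like" measurement that is not a true metric.
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical sample points, including three on one line.
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(is_metric(euclidean, sample))          # True
print(is_metric(squared_euclidean, sample))  # False
```

Squared Euclidean distance fails property 4 on this sample: from (0,0) to (2,0) it gives 4, but going through (1,0) gives 1 + 1 = 2.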
Business Goals
Deliver actionable information through data analytics to:
• Reduce risks
• Reduce costs
• Increase revenue
• Increase efficiency
BI and Analytics
Many enterprises associate data science with Business Intelligence and place it in a group doing “advanced analytics”.
This is a valid view, but data science can contribute much more than just reports.
Business Value
Of the 4 major categories of Business Intelligence (BI):
• Operational Reporting
• Analysis
• Modeling
• Prediction
Prediction has the highest business value.
Unfortunately, it is also the most complex.
Source: The Data Warehousing Institute (TDWI)
React vs. Prevent vs. Predict
EPRI reports costs (per horsepower per year) of:
– $17 to $18 for reactive maintenance
– $11 to $13 for preventive maintenance
– $7 to $9 for predictive maintenance
(Roughly half the cost of reactive maintenance!)
Source: EPRI Advanced Electric Motor Predictive Maintenance Project
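To make the arithmetic concrete, here is a small sketch that applies the EPRI ranges above to a single motor. The 500 hp motor size and the names `COST_PER_HP_YEAR`, `annual_cost_range`, and `midpoint` are hypothetical, chosen only for illustration.

```python
# EPRI cost ranges from the slide, in dollars per horsepower per year.
COST_PER_HP_YEAR = {
    "reactive": (17, 18),
    "preventive": (11, 13),
    "predictive": (7, 9),
}

def annual_cost_range(strategy, horsepower):
    """Return the (low, high) annual maintenance cost for one motor."""
    low, high = COST_PER_HP_YEAR[strategy]
    return low * horsepower, high * horsepower

def midpoint(cost_range):
    low, high = cost_range
    return (low + high) / 2

motor_hp = 500  # hypothetical motor size
for strategy in COST_PER_HP_YEAR:
    low, high = annual_cost_range(strategy, motor_hp)
    print(f"{strategy:>10}: ${low:,} to ${high:,} per year")

# Compare the midpoints of the predictive and reactive ranges.
ratio = midpoint(COST_PER_HP_YEAR["predictive"]) / midpoint(COST_PER_HP_YEAR["reactive"])
print(f"predictive / reactive (midpoints): {ratio:.2f}")  # 0.46
```

Using range midpoints ($8 versus $17.50), predictive maintenance comes to about 46% of the reactive cost.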
Fuel for Decision-making
• Analytics can drill deep or reveal new big-picture perspectives
• But data often has a short “shelf life”
• Software delivers analyses fast, helping managers respond quickly
Some Tools
• Statistical analysis packages
• Probabilistic graphical models
• Markov Chain Monte Carlo algorithms
• Simulated annealing
• Textual disambiguation
• Visualization methods
• Cluster and pattern detection
Visualization
Graphics communicate faster than spreadsheets or tabular reports for:
• Dashboards, scorecards, alerts
• Geographic and Spatial Information
• Visual discovery and analysis
Which is Easier to Understand?
[Bar chart: # Visitors (axis 0 to 200) by # Pages Read]
# Pages Read    # Visitors
1               190
2               82
3               30
4               15
5               8
6               4
7               7
8               7
9               3
10 or more      27
Pattern Detection
Relationships in multi-dimensional data are hard to find without software help.
What is hidden in this sample data?
x     y     z
0.5   2.5   9.5
1     4     5.4
1.2   1.2   2.4
1.2   6.7   8.8
1.3   7.6   6.2
1.6   5.6   4.2
2.2   0.6   9.7
2.4   3.3   1.2
2.5   2.6   11.3
2.5   6.3   1.9
2.6   4.4   0.9
2.6   8.1   3.4
3.3   1.4   1.7
3.3   5.3   2.2
3.4   2.5   8.9
3.4   6.3   4.8
3.5   4.2   5
3.5   7.2   11.9
3.7   0.3   11.5
4     7     7.5
4.3   6     11.8
Here is part of a list of 300 data points in 3 dimensions. For example, they might represent Work Order, Crew, and Material or Temperature, Hour, and Load. Looking at the raw data spreadsheet is not helpful, so I’ve plotted this data to show the relationship between pairs of dimensions…
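One way to start looking at pairs of dimensions without a plotting tool is to compute a simple statistic for each pair. The sketch below uses only the 21 rows visible in the sample (not the full list of 300 points) and computes the Pearson correlation for each pair; the names `ROWS`, `pearson`, and `pairwise` are illustrative. Correlation only detects linear relationships, so a value near zero does not rule out a cluster or curve that a plot would reveal.

```python
from itertools import combinations
from math import sqrt

# The 21 sample rows shown on the slide, as (x, y, z).
ROWS = [
    (0.5, 2.5, 9.5), (1, 4, 5.4), (1.2, 1.2, 2.4), (1.2, 6.7, 8.8),
    (1.3, 7.6, 6.2), (1.6, 5.6, 4.2), (2.2, 0.6, 9.7), (2.4, 3.3, 1.2),
    (2.5, 2.6, 11.3), (2.5, 6.3, 1.9), (2.6, 4.4, 0.9), (2.6, 8.1, 3.4),
    (3.3, 1.4, 1.7), (3.3, 5.3, 2.2), (3.4, 2.5, 8.9), (3.4, 6.3, 4.8),
    (3.5, 4.2, 5), (3.5, 7.2, 11.9), (3.7, 0.3, 11.5), (4, 7, 7.5),
    (4.3, 6, 11.8),
]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

# Split the rows into one column per dimension.
columns = dict(zip("xyz", zip(*ROWS)))

# One coefficient for each pair of dimensions: (x, y), (x, z), (y, z).
pairwise = {
    (p, q): pearson(columns[p], columns[q])
    for p, q in combinations("xyz", 2)
}
for (p, q), r in pairwise.items():
    print(f"{p} vs {q}: r = {r:+.2f}")
```

With dozens of dimensions the number of pairs grows quickly, which is one reason software help matters here.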
[Figure: three scatter plots, one for each pair of dimensions]
This example has 3 dimensions, but a real warehouse may have dozens of dimensions…
What is waiting to be discovered?
Descriptive Applications
• Typical BI trend reports
• eDiscovery
• Data loss prevention
• Phone call metadata mining
• NASA’s 60 yrs. in 15 seconds video
Predictive Applications
• Spare Parts Inventory Management
• Crew Scheduling
• Theft & Fraud Detection
• Demand Forecasting
• Weather Forecasting
Case Histories
• Data loss prevention
• Mortgage risk
• Golf tee-time pricing
• Retail price optimization
• Valentine’s Day promotions
We are Pioneers
This is a very new discipline. There is no single “DSBOK”.
The challenges are more cultural than technical. Ethics matter.
We must seize this opportunity now to make a difference for the profession and for the world at large.
A Plan for Success
1. Continue Personal Development
2. Integrate Data Science into Business
3. Nurture our Analytics Community
Personal Development
Self-assessment:
Weaknesses & strengths, likes & dislikes.
Decide:
Generalist or specialist? Which specialty?
Commit:
Never stop learning. Do no evil.
OCM
• Teams
• Expectations
• Basics
• Next steps
• Stages of data use
• Difficulties of achieving prescriptive use
Overload!
• No one knows ALL of this!
• Multi-disciplinary teams are needed
• Use academic sources
• Leverage vendor resources
• Nurture in-house subject experts
Set realistic expectations:
We are all in SALES.
Firmly Establish Basics
• Data Architecture
• Data Governance
• Data Quality
Getting the basics right is an absolute prerequisite to doing advanced science sustainably.
Apply Data Engineering
• Master Data Management
• Reference Data Management
• Normalization, Reduction, Projection
• Software Development Disciplines
Allow Greater Freedom
Bending some corporate rules may help nurture creativity needed for discovery.
Data Science might be best started with a skunkworks. (That may have already happened in larger enterprises.)
At the very least, give your data science team the freedom to experiment without punishment for mistakes.
User Support Needs
1. Basic reports are self-service.
2. Complex reports need developers.
3. Experts are needed to create models for forecasting and statistical analysis.
Stages of Data Use
1. Descriptive
– What do we have?
– What happened? When?
2. Predictive
– What is likely to happen?
– What are expected costs?
3. Prescriptive
– Recommended actions & timing
– Possible automation of actions
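The three stages above can be sketched on a toy example. Everything in it is hypothetical: the weekly demand figures, the stock levels, and the choice of a straight-line trend as the predictive model are assumptions made for illustration, not a recommended forecasting method.

```python
# Hypothetical weekly demand for one spare part (illustrative numbers only).
demand = [12, 15, 14, 18, 20, 21]
on_hand = 10        # hypothetical current stock
safety_stock = 5    # hypothetical buffer

# Stage 1: what happened?
n = len(demand)
average = sum(demand) / n

# Stage 2: what is likely to happen?
# Fit an ordinary least-squares line through (week, demand) and extend it one week.
weeks = range(n)
mean_week = sum(weeks) / n
slope = (
    sum((w - mean_week) * (d - average) for w, d in zip(weeks, demand))
    / sum((w - mean_week) ** 2 for w in weeks)
)
intercept = average - slope * mean_week
forecast = intercept + slope * n

# Stage 3: what action is recommended?
# Order enough to cover the forecast plus the buffer, never a negative amount.
order_quantity = max(0, round(forecast + safety_stock - on_hand))

print(f"average weekly demand: {average:.1f}")
print(f"forecast for next week: {forecast:.1f}")
print(f"recommended order: {order_quantity}")
```

Each stage builds on the one before it: the recommendation is only as good as the forecast, and the forecast is only as good as the recorded history.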
Moving to Stage Two
• Establish data quality and governance
• Build a community of analysts
– Idea exchange, discussions
– Peer support
• Embed data science with the business
• Employ more visualizations in support of predictive analytics
• Let the business guide the science
Stage Three Difficulties
• People don’t want change
• People don’t trust the technology
• People fear losing their jobs
• Myths hinder progress
• Even the law can be a barrier
Prescriptive Applications
• Price optimization
• Manufacturing plant scheduling
• Computerized securities trading
• Aircraft autopilot
• Self-driving cars
• Medical devices
Ethical Questions
• Privacy versus Public Security
• Quality Control and Correction
• Opt-in and Opt-out Rights
• Ownership and Monetization
• Limits of Liability and Responsibility
Summary
• Business wants more Data Science
• Data Science is a team effort
• Data Science is immature
• More and more data that can be mined for predictive uses is now available
• People problems > technical problems