AUTOMATING ANALYSIS CYPHER 2017
LET’S TAKE ONE DAY CRICKET DATA
Country | Player | Runs | ScoreRate | MatchDate | Ground | Versus
Australia | Michael J Clarke | 99* | 93.39 | 30-06-2010 | The Oval | England
Australia | Dean M Jones | 99* | 128.57 | 28-01-1985 | Adelaide Oval | Sri Lanka
Australia | Bradley J Hodge | 99* | 115.11 | 04-02-2007 | Melbourne Cricket Ground | New Zealand
India | Virender Sehwag | 99* | 99 | 16-08-2010 | Rangiri Dambulla International Stad. | Sri Lanka
New Zealand | Bruce A Edgar | 99* | 72.79 | 14-02-1981 | Eden Park | India
Pakistan | Mohammad Yousuf | 99* | 95.19 | 15-11-2007 | Captain Roop Singh Stadium | India
West Indies | Richard B Richardson | 99* | 70.21 | 15-11-1985 | Sharjah CA Stadium | Pakistan
West Indies | Ramnaresh R Sarwan | 99* | 95.19 | 15-11-2002 | Sardar Patel Stadium | India
Zimbabwe | Andrew Flower | 99* | 89.18 | 24-10-1999 | Harare Sports Club | Australia
Zimbabwe | Alistair D R Campbell | 99* | 79.83 | 01-10-2000 | Queens Sports Club | New Zealand
Zimbabwe | Malcolm N Waller | 99* | 133.78 | 25-10-2011 | Queens Sports Club | New Zealand
Australia | David C Boon | 98* | 82.35 | 08-12-1994 | Bellerive Oval | Zimbabwe
Australia | Graeme M Wood | 98* | 63.22 | 11-01-1981 | Melbourne Cricket Ground | India
England | Ian J L Trott | 98* | 84.48 | 20-10-2011 | Punjab Cricket Association Stadium | India
India | Yuvraj Singh | 98* | 89.09 | 01-08-2001 | Sinhalese Sports Club Ground | Sri Lanka
Ireland | Kevin J O'Brien | 98* | 94.23 | 10-07-2010 | VRA Ground | Scotland
Kenya | Collins O Obuya | 98* | 75.96 | 13-03-2011 | M.Chinnaswamy Stadium | Australia
Netherlands | Ryan N ten Doeschate | 98* | 73.68 | 01-09-2009 | VRA Ground | Afghanistan
New Zealand | James E C Franklin | 98* | 142.02 | 07-12-2010 | M.Chinnaswamy Stadium | India
Pakistan | Ijaz Ahmed | 98* | 112.64 | 28-10-1994 | Iqbal Stadium | South Africa
South Africa | Jacques H Kallis | 98* | 74.24 | 06-02-2000 | St George's Park | Zimbabwe
Against which countries are
higher averages scored?
Which countries’ players
score more per match?
Which player scores the
most per ball?
The player with the highest strike
rate is an obscure South African
whose name most of us have never
heard of.
In fact, this list is filled with players
we have never heard of.
RELATIVE IMPACT CAN BE QUANTIFIED SYSTEMATICALLY
Take every column in the data
Find the impact of that column
Versus has an impact of 16%. Play against Namibia
Ground has an impact of 12%. MAC, not Eden Park
Country has an impact of 8%. South Africa, not USA
Weekday has an impact of 3%. Tuesday, not Wednesday
Player has no significant impact
MatchDate has no significant impact
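The two steps above (take every column, find its impact) can be sketched in a few lines. This is a minimal sketch, not the presenter's implementation: it assumes "impact" means the relative improvement from the overall average to the best group's average, that a higher metric is better, and it skips the significance filter. The function name `column_impact` and the four sample rows are made up for illustration.

```python
from collections import defaultdict

def column_impact(rows, column, metric):
    """Return (best group, relative gain of its mean over the overall mean)."""
    overall = sum(r[metric] for r in rows) / len(rows)
    groups = defaultdict(list)
    for r in rows:
        groups[r[column]].append(r[metric])
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    best = max(means, key=means.get)  # assumes higher is better
    return best, (means[best] - overall) / overall

# Illustrative rows only; these are NOT taken from the cricket dataset.
rows = [
    {"Versus": "A", "Ground": "X", "Runs": 60},
    {"Versus": "A", "Ground": "Y", "Runs": 50},
    {"Versus": "B", "Ground": "X", "Runs": 30},
    {"Versus": "B", "Ground": "Y", "Runs": 20},
]

# Apply the same calculation to every input column.
impacts = {c: column_impact(rows, c, "Runs") for c in ("Versus", "Ground")}
for column, (best, gain) in impacts.items():
    print(f"{column} has an impact of {gain:.0%}. Best group: {best}")
```

Because the calculation only needs a grouping column and a metric, the same loop works on any tabular dataset.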
BUT BEFORE I PROCEED, LET ME CLARIFY TWO THINGS
I refuse to entertain – because
people mistake entertainment for
education.
-- Bret Victor
THIS IS A SIMPLE TUTORIAL.
NO ML, ANN, DNN, ETC.
There are dramatic exceptions to
my argument that the
generalization of software
packages has changed little over
the years: electronic spreadsheets
and simple database systems.
-- Fred Brooks (No Silver Bullet)
WE’LL USE
SPREADSHEETS
OVER 100 QUESTIONS EACH, ADMINISTERED TO
STUDENTS, TEACHERS AND SCHOOLS
… AS WELL AS ASSESSMENT OF MARKS IN
MATHS, READING, SCIENCE & SOCIAL SCIENCE
THIS IS WHAT THE DATA LOOKED LIKE
http://s-anand.net/test/nas.csv - grab a copy while it lasts
THE STRIKING THING IS THAT
THERE ARE NO NUMBERS – JUST
CATEGORIES
LET’S DO AN EXERCISE
DO CALCULATORS HELP
SCORE IN MATHS?
DO COMPUTERS HELP
SCORE IN MATHS?
WHICH ONE HELPS MORE?
ARE THESE MEANINGFUL?
OR JUST RANDOM?
Correlation is not causation but it
sure is a hint.
-- Edward Tufte
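One way to answer "meaningful or just random?" is a permutation test: shuffle the group labels many times and count how often a gap at least as large as the observed one appears by chance. The slides do not say which test was used, so this is only a sketch of one common option. The scores below are invented, and `permutation_p_value` is a hypothetical helper name.

```python
import random

def permutation_p_value(group_a, group_b, trials=10_000, seed=0):
    """Share of random relabelings whose mean gap is >= the observed gap."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / trials

# Invented maths scores, for illustration only.
with_calculator = [62, 70, 68, 75, 71, 66, 73, 69]
without_calculator = [50, 55, 48, 52, 58, 49, 53, 51]

p = permutation_p_value(with_calculator, without_calculator)
print("likely meaningful" if p < 0.05 else "could be random", p)
```

A small p-value says the gap is unlikely to be noise. It still says nothing about causation, which is the point of the quote above.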
FACTORS IMPACTING POULTRY PRODUCTIVITY
We group by every
input factor
… and calculate the
impact on every metric.
By moving from average to the best
group, what’s the improvement?
The actual performance
by each group is shown
0-3m 3-6m 6m-1yr 1-2 yrs > 2 yrs
11 12.3 12.7 15.3 16.1
Our product can create visualisations from data automatically, without any supervision.
Above is an example. Irrespective of the dataset, this visual shows which input parameters
have a significant impact on the output.
Only significant results shown
WHAT EXPLAINS POULTRY MORTALITY?
SERVICE REQUEST WORKFLOW
Navigation filters
Process flow diagram
indicating bottlenecks
& volume of requests
Automated analysis to
identify areas which
need work and which
can create maximum
impact
AUTO-PICKING A PRICE FORECASTING MODEL
Product | Moving Average | Auto-regression | Single Exponential Smoothing | ARIMA | Exponential Smoothing Over State Space Model | Hybrid Model | Neural Network | Linear Regression With All Variables
Product 1 | 65.13 | 54.13 | 65.98 | 66.16 | 71.67 | 73.24 | 78.96 | 70.46
Product 2 | 66.89 | 56.66 | 66.74 | 68.12 | 74.41 | 74.65 | 89.15 | 73.87
Product 3 | 37.53 | 9.84 | 44.55 | 42.28 | 50.49 | 46.86 | 61.35 | 53.03
Product 4 | 37.16 | 4.92 | 50.22 | 43.50 | 52.19 | 53.40 | 68.63 | 53.15
Product 5 | 68.83 | 71.24 | 68.38 | 68.12 | 75.58 | 71.47 | 90.80 | 72.69
Product 6 | 69.41 | 69.60 | 69.24 | 70.16 | 77.55 | 75.75 | 80.41 | 75.09
Product 7 | 69.27 | 64.76 | 68.61 | 69.21 | 73.39 | 74.06 | 82.10 | 75.20
Product 8 | 64.54 | 52.50 | 63.93 | 64.41 | 68.31 | 70.82 | 79.70 | 70.78
Product 9 | 57.97 | 52.64 | 57.40 | 58.53 | 63.90 | 63.15 | 78.80 | 63.04
Product 10 | 53.61 | 55.90 | 54.54 | 56.47 | 59.78 | 58.63 | 90.28 | 61.96
Product 11 | 52.02 | 26.49 | 54.92 | 53.65 | 60.80 | 63.89 | 78.40 | 52.23
Product 12 | 45.83 | 28.50 | 53.59 | 49.43 | 56.09 | 53.63 | 85.34 | 48.33
Product 13 | 41.30 | 28.98 | 40.51 | 38.88 | 50.84 | 47.57 | 63.76 | 50.55
Product 14 | 41.14 | 17.41 | 41.51 | 38.05 | 45.95 | 48.69 | 71.55 | 44.10
Product 15 | 86.40 | 84.00 | 86.58 | 87.29 | 88.80 | 90.78 | 99.91 | 88.04
Product 16 | 85.76 | 83.83 | 85.66 | 85.59 | 85.30 | 88.43 | 91.76 | 78.59
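Once every candidate model has a score per product, picking a model automatically comes down to taking the best column in each row. The sketch below rests on two assumptions the slide does not state: that a higher score is better, and that the eight score columns follow the order given in `MODELS`. Only three rows of the table are copied in, and `pick_best` is a hypothetical helper.

```python
# Assumed column order for the eight scores in each row of the table.
MODELS = [
    "Moving Average",
    "Auto-regression",
    "Single Exponential Smoothing",
    "ARIMA",
    "Exponential Smoothing Over State Space Model",
    "Hybrid Model",
    "Neural Network",
    "Linear Regression With All Variables",
]

# Three rows copied from the table above.
SCORES = {
    "Product 1": [65.13, 54.13, 65.98, 66.16, 71.67, 73.24, 78.96, 70.46],
    "Product 5": [68.83, 71.24, 68.38, 68.12, 75.58, 71.47, 90.80, 72.69],
    "Product 16": [85.76, 83.83, 85.66, 85.59, 85.30, 88.43, 91.76, 78.59],
}

def pick_best(scores, models=MODELS):
    """Map each product to its (model, score) pair with the highest score."""
    picks = {}
    for product, row in scores.items():
        i = max(range(len(row)), key=row.__getitem__)  # assumes higher is better
        picks[product] = (models[i], row[i])
    return picks

picks = pick_best(SCORES)
for product, (model, score) in picks.items():
    print(product, "->", model, score)
```

If the scores were error measures instead, `max` would become `min`; the rest stays the same.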
AUTOMATING CLUSTER DETECTION
A manufacturing firm asked the
question: “How can we predict
which employees will leave me
next?”
One part of the answer is to
take the network of email
traffic among employees.
Those in close contact with
an alumnus, exchanging emails
with them, are likely
candidates for attrition.
The firm was able to put in
place a retention and defense
mechanism for these
employees.
This is augmented with
additional signals:
• Disengaged employees
• Active on LinkedIn
• Dip in performance
• Atypical browsing
• Collateral downloads
• Peer feedback
• Reduced working hours
• Increased sick leave
The outcome is a monthly list
identifying employees at risk,
and the behaviors that led to
this conclusion.
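The email-network part of the answer can be illustrated with a simple count: for each current employee, how many emails were exchanged with people who have already left. This is a toy sketch, not the firm's model. The names, the `min_emails` threshold and the function `alumni_contacts` are all hypothetical, and the other signals listed above are left out.

```python
from collections import Counter

def alumni_contacts(emails, alumni, min_emails=3):
    """Employees who exchanged at least `min_emails` emails with alumni.

    `emails` is an iterable of (sender, recipient) pairs.
    """
    counts = Counter()
    for sender, recipient in emails:
        if sender in alumni and recipient not in alumni:
            counts[recipient] += 1
        elif recipient in alumni and sender not in alumni:
            counts[sender] += 1
    return sorted(name for name, n in counts.items() if n >= min_emails)

# Made-up email log: "ravi" has already left the firm.
alumni = {"ravi"}
emails = [
    ("asha", "ravi"), ("ravi", "asha"), ("asha", "ravi"),
    ("meena", "ravi"), ("meena", "asha"), ("kiran", "meena"),
]

at_risk = alumni_contacts(emails, alumni)
print(at_risk)
```

In practice this count would be one column among many, combined with the additional signals before producing the monthly list.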
TELECOM CHURN
“Churn of customers is a
particularly severe problem in
the telecom industry.
The challenge is to identify
the propensity of churn up to
a month in advance, even
before a customer moves out,
so that proactive
interventions can begin”
Actual \ Prediction | No churn | Churn
No churn | OK | WASTED (marketing cost Rs 40)
Churn | MISSED (acquisition cost Rs 80) | OK

MODELS: Base
MISSED: 8.3%
WASTED: 0.0%
COST PER CUST.: 6.61
IMPROVEMENT: 0.0%
[Decision tree diagram: splits on OUTGOING CALL (0, 0 - 4, 5-14, 15+), RECHARGE AMT > RS 65 (Y/N) and > 1 RECHARGE (Y/N); leaves predict 0 or 1]

MODELS: Decision Tree
MISSED: 3.2%
WASTED: 3.6%
COST PER CUST.: 4.01
IMPROVEMENT: 39%
MODELS: SVM
MISSED: 0.6%
WASTED: 2.5%
COST PER CUST.: 2.21
IMPROVEMENT: 66%
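The cost-per-customer figure on these model slides appears to weight each kind of error by what it costs: a missed churner costs an acquisition (Rs 80) and a wasted intervention costs the marketing spend (Rs 40). The formula below is an assumption inferred from those labels, not something the slides spell out. With the rounded percentages shown for the base case and the decision tree, it gives roughly Rs 6.6 and Rs 4.0, close to the reported 6.61 and 4.01.

```python
def cost_per_customer(missed, wasted, acquisition_cost=80.0, marketing_cost=40.0):
    """Expected cost per customer from the two error rates (as fractions)."""
    return missed * acquisition_cost + wasted * marketing_cost

def improvement(cost, base_cost):
    """Relative saving of a model over the base case."""
    return 1 - cost / base_cost

# Error rates as shown on the slides (rounded to one decimal place there).
base = cost_per_customer(missed=0.083, wasted=0.0)
tree = cost_per_customer(missed=0.032, wasted=0.036)
gain = improvement(tree, base)

print(f"base: Rs {base:.2f}, decision tree: Rs {tree:.2f}, improvement: {gain:.0%}")
```

Framing the comparison in rupees rather than accuracy makes the trade-off explicit: a model may waste more marketing spend and still come out cheaper if it misses fewer churners.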
Actual \ Prediction | No churn | Churn
No churn | OK | WASTED (marketing cost $1.8)
Churn | MISSED (acquisition cost $4.1) | OK
SEGMENTING INDIA GEO-DEMOGRAPHICALLY
Previously, the client was treating contiguous regions as a
homogeneous entity, from a channel content perspective.
To deliver targeted content, we divided India into 6
clusters based on their demographic behavior. Specifically,
three composite indices were created based on the
economic development lifecycle:
• Education (literacy, higher education) that leads to...
• Skilled jobs (in mfg or services) that leads to...
• Purchasing power (higher income, asset ownership)
Districts were divided (at the average cut-off) by
purchasing power, then skilled jobs, then education.
Offering targeted content to these clusters will reach a
more homogeneous demographic population.
The 6 clusters are:

Purchasing power | Skilled jobs | Education | Cluster
Poorer | Unskilled | Uneducated | Poor
Poorer | Unskilled | Educated | Breakout
Poorer | Skilled | (either) | Aspirant
Richer | Unskilled | (either) | Owner
Richer | Skilled | Uneducated | Business
Richer | Skilled | Educated | Rich

Poor: Rural, uneducated agri workers. Young population with low income and asset ownership. Mostly in Bihar, Jharkhand, UP, MP.
Breakout: Rural, educated agri workers poised for skilled labour. Higher asset ownership. Parts of UP, Bihar, MP.
Aspirant: Regions with skilled labour pools but low purchasing power. Cusp of economic development. Mostly WB, Odisha, parts of UP.
Owner: Regions with unskilled labour but high economic prosperity (landlords, etc.). Mostly AP, TN, parts of Karnataka, Gujarat.
Business: Lower education but working in skilled jobs, and prosperous. Typical of business communities. Parts of Gujarat, TN, Urban UP, Punjab, etc.
Rich: Urban educated population working in skilled jobs. All metros, large cities, parts of Kerala, TN.
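Read as a decision tree (poorer or richer, then unskilled or skilled, then uneducated or educated), the six cluster descriptions translate directly into code. The sketch below is one reading of those descriptions, not the client deliverable. The index values are invented, and `classify_district` and `average_cutoffs` are hypothetical names.

```python
INDICES = ("purchasing_power", "skilled_jobs", "education")

def average_cutoffs(districts):
    """Cut each composite index at its average across districts."""
    return {k: sum(d[k] for d in districts) / len(districts) for k in INDICES}

def classify_district(district, cutoffs):
    """Assign one of the six clusters from three above/below-average flags."""
    richer = district["purchasing_power"] >= cutoffs["purchasing_power"]
    skilled = district["skilled_jobs"] >= cutoffs["skilled_jobs"]
    educated = district["education"] >= cutoffs["education"]
    if not richer:
        if skilled:
            return "Aspirant"
        return "Breakout" if educated else "Poor"
    if not skilled:
        return "Owner"
    return "Rich" if educated else "Business"

# Invented index values on a 0-1 scale, for illustration only.
districts = [
    {"name": "D1", "purchasing_power": 0.2, "skilled_jobs": 0.1, "education": 0.2},
    {"name": "D2", "purchasing_power": 0.3, "skilled_jobs": 0.2, "education": 0.8},
    {"name": "D3", "purchasing_power": 0.9, "skilled_jobs": 0.9, "education": 0.9},
    {"name": "D4", "purchasing_power": 0.8, "skilled_jobs": 0.1, "education": 0.3},
]

cutoffs = average_cutoffs(districts)
clusters = {d["name"]: classify_district(d, cutoffs) for d in districts}
print(clusters)
```

Three binary splits would give eight combinations; education is only consulted on two branches, which is why six clusters result.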
WORLD BANK: INNOVATION, TECHNOLOGY & ENTREPRENEURSHIP
Does access to new Technology facilitate Innovation? Does it
facilitate Entrepreneurship? The Global Information Technology
Report findings tell us that "innovation is increasingly based on
digital technologies and business models, which can drive economic
and social gains from ICTs...".
We were curious about whether the data on TCData360 could tell a
story about influential factors on innovation and entrepreneurship.
With over 1800 indicators, we focused on the Networked Readiness
Index, as it has indicators on entrepreneurship, technology, and
innovation.
WHAT YOU SHOULD TAKE AWAY
PATTERNS OF ANALYSIS ARE
RECURRENT ACROSS DOMAINS
THESE PATTERNS OF ANALYSIS
CAN BE AUTOMATED
BLACK-BOX MODELS NEED
INTERPRETATION (EVEN MORE)
VISUAL INTERACTION HELPS
AUGMENT OUR UNDERSTANDING