THE NINE LAWS OF DATA MINING
Duncan Ross, @duncan3ross
Based on the 9 Laws of Data Mining by Tom Khabaza
Advanced Analytics
Nov 15, 2014
04/08/2023 @duncan3ross
• The last two algorithms you need to know!
• An explanation of Bayes’ theorem
• The name of the software that will make you $ millions
> Not even a comparison of different software!
What you won’t get from this presentation
The grave of Thomas Bayes (probably) – near “silicon roundabout” Image via Wikimedia
Data Mining laws also work as Data Science laws
THE 0TH LAW
• This question generates more arguments than answers
• Common features
> Predicting or classifying things
> Based on historical cases (with or without outcomes)
> Machine learning techniques
> No predefined underlying model assumed
What is data mining?
Image via Wikimedia
What, where, why and how of data mining
9 Laws
CRISP-DM
What?
Where? Unified data architecture
Who?
Why?
How?
CRISP-DM created to help
Prediction increases information locally by generalisation
THE 7TH LAW
• Data mining learns from generalisations
> Historical cases build a model of reality
• These general models then predict an outcome that is local to a case and a time
> How likely is it that someone will purchase product ‘x’?
> Will person A influence person B?
> What number will the ball land on in roulette?
• The knowledge gained may have been implied in the data, but it is new and valuable
This may seem obvious
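The general-to-local step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the segments, the outcomes, and the helper names `build_model` and `score` are invented for the example and do not come from the talk. Historical cases are generalised into a purchase rate per segment, and that general model is then applied to one new case.

```python
from collections import defaultdict

# Hypothetical historical cases: (customer segment, bought product 'x'?)
history = [
    ("student", True), ("student", False), ("student", False), ("student", False),
    ("retired", True), ("retired", True), ("retired", True), ("retired", False),
]

def build_model(cases):
    """Generalise: estimate a purchase rate for each segment."""
    bought = defaultdict(int)
    seen = defaultdict(int)
    for segment, outcome in cases:
        seen[segment] += 1
        bought[segment] += int(outcome)
    return {segment: bought[segment] / seen[segment] for segment in seen}

def score(model, segment):
    """Localise: apply the general model to a single new case."""
    return model[segment]

model = build_model(history)
new_case_probability = score(model, "retired")
print(model)
print(new_case_probability)
```

The probability for the new case was never stored in the data as such; it only exists once the cases have been generalised.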
• Results need to be thought of at a group level for assessment
> Individual results may be poor even when generated from a great model
• Two levels of value
> Prediction (what, when etc…)
> Model (how…)
• The gap between the general and the local is the difference between model building and scoring
> Hadoop?
> R?
Why the 7th Law is important
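A small simulation may make the group-level point concrete. It assumes, purely for illustration, a "great model" that knows the true probability (20%) of an outcome for every case; the rate, the sample size, and the seed are arbitrary choices, not figures from the talk.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

TRUE_RATE = 0.2      # assumed real probability of the outcome for every case
N_CASES = 10_000

# The "great model": it scores every case with exactly the true probability.
scores = [TRUE_RATE] * N_CASES
outcomes = [random.random() < TRUE_RATE for _ in range(N_CASES)]

# Individual level: the best single guess for each case is "no",
# which is wrong for every case where the outcome did happen.
individual_misses = sum(outcomes)

# Group level: the expected number of outcomes is the sum of the scores.
expected_total = sum(scores)
actual_total = sum(outcomes)
relative_gap = abs(actual_total - expected_total) / expected_total

print(individual_misses, expected_total, actual_total, round(relative_gap, 3))
```

Every individual miss is a case where the model "got it wrong", yet the group total is what a campaign or a budget would actually be judged on.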
There are always patterns
THE 5TH LAW
… is taking the 5th Law to heart
• A major difference between the approach of data mining and data science is in the “Field of Dreams”
> Data mining (usually) requires measurable ROI prior to projects
> Data science is trading on probable ROI prior to projects
• Fortunately there is still a lot of gold in those hills
> And as technologies and data increase, the number of hills is also increasing
The heart of data science…
Graph of hills vs gold extracted
• Just because there are always patterns doesn’t mean that they are useful
> Algorithms can (and will) cluster a cloud
> Without Laws 1 and 2 patterns may not be a good thing
But…
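"Clustering a cloud" is easy to reproduce. The sketch below is a plain k-means written from scratch in Python (an assumption for illustration; the talk does not name an algorithm), run on points drawn uniformly at random, so there is no real structure to find. The function name `kmeans`, the choice of k = 3, and the seeds are all arbitrary.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns a cluster label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        labels = [
            min(range(k), key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2)
            for x, y in points
        ]
        # Move each centre to the mean of its points (keep it if it has none).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centres[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels

# A structureless "cloud": points drawn uniformly from the unit square.
rng = random.Random(42)
cloud = [(rng.random(), rng.random()) for _ in range(300)]

labels = kmeans(cloud, k=3)
cluster_sizes = [labels.count(c) for c in range(3)]
print(cluster_sizes)
```

The algorithm hands back a label for every point regardless; whether those labels mean anything is a question only the business context can answer.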
Business objectives are the origin of every data mining solution
THE 1ST LAW
Business knowledge is central to every step of the data mining process
THE 2ND LAW
• This story begins with a gains curve…
The sad tale of churn
• To predict churn
• What was the definition of churn?
• What did the business actually want to do?
> Predict “churn”?
> Predict people who became inactive?
> Predict people who became inactive but might not have if contacted?
What was the business objective?
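The three readings of "churn" above pick out different people, which a toy example can show. Everything here is hypothetical: the records, the field names, and the 90-day inactivity threshold are invented for illustration, and `responds_to_contact` in particular is something a real project would have to estimate rather than look up.

```python
# Hypothetical customer records; field names are illustrative only.
customers = [
    {"id": 1, "cancelled": True,  "days_inactive": 120, "responds_to_contact": False},
    {"id": 2, "cancelled": False, "days_inactive": 95,  "responds_to_contact": True},
    {"id": 3, "cancelled": False, "days_inactive": 100, "responds_to_contact": False},
    {"id": 4, "cancelled": False, "days_inactive": 3,   "responds_to_contact": True},
]

INACTIVE_AFTER_DAYS = 90  # assumed threshold

def is_churn_by_cancellation(c):
    return c["cancelled"]

def is_churn_by_inactivity(c):
    return c["days_inactive"] >= INACTIVE_AFTER_DAYS

def is_saveable_churn(c):
    # Inactive, but likely to stay if contacted.
    return is_churn_by_inactivity(c) and c["responds_to_contact"]

targets = {
    rule.__name__: [c["id"] for c in customers if rule(c)]
    for rule in (is_churn_by_cancellation, is_churn_by_inactivity, is_saveable_churn)
}
print(targets)
```

A model trained against one of these definitions is answering a different business question from a model trained against another.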
• Because we aren’t doing this for the fun of it
> Or at least not just for the fun of it
• At every stage ask:
> Does this relate to the business question?
> Is the original business question still valid?
> Is there a better question that could be asked of this data?
> Can this be acted on?
> What does this actually mean?
• Document the answers, and refer back to them
Why the 1st and 2nd Laws are important
There is no free lunch for the data miner
THE 4TH LAW
• Is….
• I spent a lot of time on this in the 1990s
> Neural nets
> Regression
> Decision trees
• If you know in advance which technique you need to use, the problem has already been solved
The last algorithm you will need to learn
The case that worked... then didn’t
Campaign Topic: Identify fingerprint of churners
Description: SNA offers an opportunity to detect potential churners earlier (possibly before they have completely ceased all on-net activity) and also identifies the individuals who are likely to have the best chance of persuading them to return. The aim of this campaign format is to use SNA to detect potential churners during the process of leaving and motivate them to stay.
[Diagram comparing the Current Approach and the New Approach: customer states Active and Inactive, with the point at which churn is detected under each]
• Solutions are not generally reproducible
> It may work here, but not there
• Methodologies are reproducible
• Learnings may have value
• Time will invalidate even the best models
Why the 4th Law is important
Data preparation is more than half of every data mining process
THE 3RD LAW
Data preparation through a case…
The problems of text data
Data quality raises its head…
CREATE dimension table wrk.npath_reboot_5events AS
SELECT path, COUNT(*) AS path_count
FROM nPath (
    ON wrk.w_event_f
    PARTITION BY srv_id
    ORDER BY evt_ts desc
    MODE (NONOVERLAPPING)
    PATTERN ('X{0,5}.reboot')
    SYMBOLS (true as X, evt_name = 'REBOOT' AS reboot)
    RESULT (FIRST (srv_id OF X) AS srv_id,
            ACCUMULATE (evt_name OF ANY (X, reboot)) AS path)
)
GROUP BY 1;

SELECT * FROM GraphGen (
    ON (SELECT * from wrk.npath_reboot_5events ORDER BY path_count LIMIT 30)
    PARTITION BY 1
    ORDER BY path_count desc
    item_format('npath')
    item1_col('path')
    score_col('path_count')
    output_format('sankey')
    justify('right')
);
Note the number of paths with a reboot following another reboot!
What events lead up to a reboot?
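For readers without access to Aster, the idea behind the path query can be approximated in plain Python. This is a rough analogue under stated assumptions, not a reimplementation of nPath: the event names other than `REBOOT`, the sample rows, and the helper `paths_before_reboot` are invented, and unlike the NONOVERLAPPING mode used above, every reboot here gets its own (possibly overlapping) path.

```python
from collections import Counter, defaultdict

# Hypothetical event log rows: (srv_id, evt_ts, evt_name)
events = [
    (1, 1, "SYNC_LOSS"), (1, 2, "CRC_ERROR"), (1, 3, "REBOOT"), (1, 4, "REBOOT"),
    (2, 1, "SYNC_LOSS"), (2, 2, "CRC_ERROR"), (2, 3, "REBOOT"),
]

def paths_before_reboot(rows, max_preceding=5):
    """Count the event sequences (up to max_preceding events) that end in a reboot."""
    by_server = defaultdict(list)
    for srv_id, evt_ts, evt_name in rows:
        by_server[srv_id].append((evt_ts, evt_name))

    counts = Counter()
    for history in by_server.values():
        # Order each server's events by timestamp before looking for paths.
        names = [name for _, name in sorted(history)]
        for i, name in enumerate(names):
            if name == "REBOOT":
                start = max(0, i - max_preceding)
                counts[tuple(names[start:i + 1])] += 1
    return counts

path_counts = paths_before_reboot(events)
for path, count in path_counts.most_common():
    print(count, " > ".join(path))
```

Paths that end in two consecutive reboots show up as their own entries, which is the kind of oddity the slide draws attention to.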
There appears to be an issue with the data from 30th September onwards: the reboot data for October seems to have been aggregated and added to 30th September.
More data issues
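A check for this kind of problem can be very simple. The sketch below is one possible approach, not the one used in the project: it flags days with no rows and days with far more rows than usual. The counts, the year, the `spike_factor` of 5, and the function name `suspicious_days` are all assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import median

# Hypothetical daily reboot counts; October's rows have been piled onto 30 September.
daily_reboots = {date(2014, 9, d): 100 for d in range(24, 30)}
daily_reboots[date(2014, 9, 30)] = 3200

def suspicious_days(counts, start, end, spike_factor=5):
    """Flag days with no data, or with far more rows than a typical day."""
    typical = median(counts.values())
    flagged = {}
    day = start
    while day <= end:
        n = counts.get(day, 0)
        if n == 0:
            flagged[day] = "missing"
        elif n > spike_factor * typical:
            flagged[day] = "spike"
        day += timedelta(days=1)
    return flagged

issues = suspicious_days(daily_reboots, date(2014, 9, 24), date(2014, 10, 3))
for day, reason in sorted(issues.items()):
    print(day.isoformat(), reason)
```

A spike immediately followed by missing days is a strong hint that rows have been loaded against the wrong date rather than that the network misbehaved.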
• Duncan’s theorem
> The usefulness of a variable in a model is inversely related to the amount of time you spend creating it
• Edouard’s corollary
> If it turns out to be useful you could have created it in the time indicated by Duncan’s theorem
Data preparation is tough
• Data just got noisier and less consistent
• Maintaining an analytical data dictionary just moved from vital to really really vital
Welcome to the world of big data
• Because data prep is such a huge task you need to plan for it well
> Assume that you will need to do it at least twice
– Experimentation
– Model building
– Deployment
• Look for software that makes it easy
> And repeatable
> And documentable
– Scripts ≠ documentation
• Documentation of your data is even more important than documentation of your models
> Models can be very sensitive to data inputs
Why the 3rd Law is important
Data mining amplifies perception in the business domain
THE 6TH LAW
srv_id    dslam    err_cnt  srvid_cnt  nra_id  dslam_cnt  errorspersrvid
20785675  lgp44-2  2        248        MZL     2          15
22254516  ltc56-1  4        314        BOT     10         15
21059184  bch66-1  2        184        RIV     15         15
21149846  tsm83-1  2        308        LCR     3          13
20833837  did75-4  10       216        DID     23         13
22295785  gbw68-1  36       170        HRS     1          12
21807750  gmo34-1  2        117        BER     17         12
21374927  bgl93-1  2        246        G5Y     8          12
20291116  ien11-1  2        211        ALZ     2          12
21459244  pai34-1  4        210        M7C     3          11
21027647  bel60-1  4        223        TRO     10         11
20551629  pla13-1  10       332        BED     4          11
20633112  crj95-2  2        332        G5Y     8          11
20585199  bau06-1  46       349        BLA     21         10
21477790  cvl92-1  4        180        IMS     35         10
21292874  che78-1  2        163        PIT     2          10
Look for patterns in Network Infrastructure
• Too many end customers to visualise as a graph, but the network has a hierarchy
> Internet Gateway Area Hub Customer Router
• Create a table using standard SQL to join the reference data plus the Customer Hub error data into a single view
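The join described above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch with made-up data: the table names `ref_customer` and `hub_error`, the view `hub_summary`, and the sample rows are hypothetical stand-ins, and only the column names `srv_id`, `dslam`, and `nra_id` are borrowed from the table shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical reference data: the hub (DSLAM) and area each customer router hangs off.
    CREATE TABLE ref_customer (srv_id INTEGER, dslam TEXT, nra_id TEXT);
    -- Hypothetical error events, one row per error seen on a customer router.
    CREATE TABLE hub_error (srv_id INTEGER);

    INSERT INTO ref_customer VALUES
        (1, 'hubA-1', 'AAA'), (2, 'hubA-1', 'AAA'), (3, 'hubB-1', 'BBB');
    INSERT INTO hub_error VALUES (1), (1), (1), (2), (3);

    -- One row per hub: how many customers it serves and how many errors they saw.
    CREATE VIEW hub_summary AS
    SELECT r.dslam,
           r.nra_id,
           COUNT(DISTINCT r.srv_id) AS customers,
           COUNT(e.srv_id)          AS errors
    FROM ref_customer r
    LEFT JOIN hub_error e ON e.srv_id = r.srv_id
    GROUP BY r.dslam, r.nra_id;
""")

rows = conn.execute(
    "SELECT dslam, nra_id, customers, errors FROM hub_summary ORDER BY errors DESC"
).fetchall()
for row in rows:
    print(row)
```

The `LEFT JOIN` keeps hubs whose customers reported no errors, so they still appear in the view with a count of zero.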
Size of Node = number of customers
Width of Edge = number of errors
SELECT * FROM graphgen (
    ON (SELECT DISTINCT dmt_act_dslam, nra_id, nbr_of_srvid, errorspersrv, nbr_of_dslam
        FROM wrk.srvid_dslam_err)
    PARTITION BY 1
    ORDER BY errorspersrv
    item_format('cfilter')
    item1_col('dmt_act_dslam')
    item2_col('nra_id')
    score_col('errorspersrv')
    cnt1_col('nbr_of_srvid')
    cnt2_col('nbr_of_dslam')
    output_format('sigma')
    directed('false')
    width_max(10)
    width_min(1)
    nodesize_max(3)
    nodesize_min(1)
);
Visualise as a Graph using Aster GraphGen
Zoom in on area where the edge width/colour indicates a problem
Add churn information
• Add churn information to find customers connected to this Hub that have cancelled their accounts
Synch Issues by Hub Type
Error and Complaint rates by equipment type
• We don’t exist in a vacuum
> We need to sell the results of analysis
• This is a virtuous feedback loop
Why the 6th Law is important
The value of data mining results is not determined by the accuracy or stability of predictive models
THE 8TH LAW
• Or if it’s right 1 time in 35?
If your model is 98% accurate – so what?
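Some back-of-envelope arithmetic shows why neither number means much on its own. All figures below are invented for illustration: a 2% response rate, a contact cost of 1, and a value of 100 per responder. Model A never flags anyone; Model B is right only 1 time in 35.

```python
# Hypothetical population: 2% of customers will respond.
N = 10_000
positives = 200
negatives = N - positives

# Model A: always predicts "no". 98% accurate, finds nobody.
accuracy_a = negatives / N
found_a = 0

# Model B: flags 3,500 customers and is right 1 time in 35.
flagged_b = 3_500
found_b = flagged_b // 35              # true positives
false_alarms_b = flagged_b - found_b   # false positives
accuracy_b = (found_b + negatives - false_alarms_b) / N

# Assumed economics: each contact costs 1, each responder is worth 100.
COST_PER_CONTACT = 1
VALUE_PER_RESPONSE = 100
profit_a = found_a * VALUE_PER_RESPONSE
profit_b = found_b * VALUE_PER_RESPONSE - flagged_b * COST_PER_CONTACT

print(accuracy_a, profit_a)
print(accuracy_b, profit_b)
```

Under these assumed prices the far less accurate model is the only one that makes money; with different prices the ranking could flip, which is the point of the law.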
• Type I and Type II errors
> What is the cost (opportunity and actual) of a false positive?
> What is the cost of a false negative?
• Gains curves
> But beware the over-accurate curve
• Don’t forget the user
> Decision trees fight back
How can you evaluate models?
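Both ideas, pricing the two error types and reading a gains curve, fit in a short Python sketch. The scores, outcomes, error counts, and prices are hypothetical, and the helper names `expected_cost` and `gains_curve` are chosen for the example rather than taken from any library.

```python
def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's errors (correct calls are assumed to cost nothing)."""
    return fp * cost_fp + fn * cost_fn

def gains_curve(scores, outcomes):
    """Cumulative share of positives captured as cases are contacted in score order."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(outcomes)
    captured = 0
    curve = []
    for _, outcome in ranked:
        captured += outcome
        curve.append(captured / total_positives)
    return curve

# Hypothetical scores and outcomes (1 = churned) for ten customers.
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1,   1,   0,   1,   0,   0,   1,   0,   0,   0]

curve = gains_curve(scores, outcomes)
print(curve)

# Same error counts, very different cost once the errors are priced (assumed prices).
cheap_false_positives = expected_cost(fp=2, fn=1, cost_fp=1, cost_fn=50)
cheap_false_negatives = expected_cost(fp=2, fn=1, cost_fp=50, cost_fn=1)
print(cheap_false_positives, cheap_false_negatives)
```

The same confusion matrix can look cheap or expensive depending on which error the business fears more, so the prices have to come from the business rather than from the model.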
All patterns are subject to change
THE 9TH LAW
0 Listen to data miners…
7 Data mining brings new knowledge
5 And there will always be new knowledge
1 Start with the business
2 Keep going back to the business
4 It won’t get easier with time
3 Especially given the state your data is in
6 But you will improve business results
8 As long as you look for the right outputs
9 Goto 0
SUMMARY
• http://khabaza.codimension.net/index_files/9laws.htm
• The Society of Data Miners (coming soon)
> Available on LinkedIn
• CRISP-DM
RESOURCES