Guerrilla Analytics: Tactics for Coping with Data Science Reality
Enda Ridge, PhD
23 February 2015
#GuerrillaAnalytics Copyright Enda Ridge 2015
Jul 14, 2015
What we are told about Data Science
“Data is the new science. Big data holds the answers.”
“the sexy job in the next 10 years will be statisticians”
“Data Scientist: The Sexiest Job of the 21st Century”
“Information is the oil of the 21st century, and analytics is the combustion engine.”
http://www.gapminder.org/
http://www.statistics.com/data-science-quotes/
https://github.com/mbostock/d3/wiki/Gallery
Hi, we need an update on the insurance policy classification work. It’s going to the Head of Underwriting this afternoon.
Um. Which work? Jo and I are trying two different approaches. And Jo’s on holidays.
I’ll check my mailbox and send you my spreadsheet from last week.
Just need the change in uplift since last week.
Err... the policy population changed with the extra system extract on Tuesday.
And we added a bunch of business rules to accommodate that... so we can’t go back to the earlier numbers.
The Data Science Reality
My Journey
Mechanical Engineer
PhD Computer Science
• “Design of Experiments for the Tuning of Algorithms”
Boutique Consultancy
Forensic Data Analytics
Senior Manager
Constraints
Computation takes time!
Dynamic, Repeatable
Reproducible
Dynamic, Constrained
Dynamic, Constrained, Reproduce
Test, Audit
What is Data Science?
Data Analytics Insight
Common Misconception
Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22
Project Reality – Dynamic
• Data
• People
• Understanding
• Rules
• Code
Project Reality – Constraints
• Time
• People
• Technology
• Data
Project Reality – Transparency
• Explainable
• Testable
• Reproducible
• Repeatable
Guerrilla Analytics
Data
• Extraction
• Receipt
• Loading
Analytics
• Transform
• Algorithms
• Consolidate
Insight
• Reporting
• Work Products
Disruptions
Guerrilla Analytics Principles
Maintaining Data Provenance mitigates the effect of disruptions on your work
Guerrilla Analytics Principles
• Principle 1: Space is cheap, confusion is expensive
• Principle 2: Prefer simple, visual project structures over heavily documented and project-specific rules
• Principle 3: Prefer automation with program code over manual graphical methods
• Principle 5: Version control changes to data and program code
Etc...
Data Receipt
Guerrilla Analytics Environment
• Lost Data
• Multiple Copies of data
• No supporting information
• Local copies of data
• Renamed data
Data Receipt
Guerrilla Analytics Approach
• Have one data location
• Give data unique identifiers
• Keep a data log
• Keep supporting material near its data
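The four points above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the method used on the slides: the `receive_data` function, the `D001`-style identifiers, the `data_log.csv` file name and its columns are all hypothetical.

```python
import csv
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def receive_data(source: Path, data_root: Path, received_from: str) -> str:
    """Copy a delivered file into the single data location and log it."""
    data_root.mkdir(parents=True, exist_ok=True)
    log_path = data_root / "data_log.csv"  # hypothetical log file name
    is_new_log = not log_path.exists()

    # Next unique identifier: one more than the number of deliveries logged so far.
    existing = 0
    if not is_new_log:
        with log_path.open(newline="") as f:
            existing = sum(1 for _ in csv.DictReader(f))
    data_uid = f"D{existing + 1:03d}"  # hypothetical identifier format

    # Each delivery gets its own folder; the original file name is kept.
    # Supporting material (emails, specifications) could sit in the same folder.
    target_dir = data_root / data_uid
    target_dir.mkdir()
    shutil.copy2(source, target_dir / source.name)

    # A checksum makes it possible to show later that the stored file is unchanged.
    checksum = hashlib.sha256(source.read_bytes()).hexdigest()

    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new_log:
            writer.writerow(["data_uid", "file_name", "received_from", "received_on", "sha256"])
        writer.writerow([data_uid, source.name, received_from, date.today().isoformat(), checksum])
    return data_uid


# Example: two deliveries of the same file get separate identifiers.
workspace = Path(tempfile.mkdtemp())
delivery = workspace / "FNU810A.csv"
delivery.write_text("id,city\nA,London\n")
first_uid = receive_data(delivery, workspace / "data", "client finance team")
second_uid = receive_data(delivery, workspace / "data", "client finance team")
print(first_uid, second_uid)
```

Storing a repeated delivery under a new identifier, rather than overwriting the earlier one, follows the "space is cheap, confusion is expensive" principle.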
Data Load
Files
Crazy-name spreadsheet 1
Crazy-name spreadsheet 2
Crazy-name spreadsheet 3
FNU810A
A_very_long_named_file_v0.2.1.pdf
Analytics Environment
User_markups
Customer_Table
Finance_Report_v1.0
Guerrilla Environment
• Renamed files
• Scattered inconsistent locations
• Multiple versions of files
• Replacements of files
Data Load
Files
Crazy-name spreadsheet 1
Crazy-name spreadsheet 2
Crazy-name spreadsheet 3
FNU810A
A_very_long_named_file_v0.2.1.pdf
Analytics Environment
Crazy-name spreadsheet 1
Crazy-name spreadsheet 2
Crazy-name spreadsheet 3
FNU810A
A_very_long_named_file_v0.2.1.pdf
Guerrilla Analytics Approach
• One-to-one mapping from files to datasets
• Keep crazy names
• Minimize prep work
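A minimal sketch of this loading approach, assuming the files are CSVs and the analytics environment is a SQLite database (both are assumptions; the slides do not name a format or a database). The `load_file` helper is hypothetical. Each file becomes exactly one table, the table keeps the file's name, and every column is loaded as text so that no preparation happens before the load.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def quote(name: str) -> str:
    """Double-quote an identifier so names with spaces or dashes stay legal."""
    return '"' + name.replace('"', '""') + '"'


def load_file(conn: sqlite3.Connection, path: Path) -> str:
    """Load one CSV file into one table named after the file; return the table name."""
    table = path.stem  # keep the original "crazy" name, minus the extension
    with path.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Load every column as TEXT: typing and cleaning happen later, in code.
    columns = ", ".join(f"{quote(c)} TEXT" for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {quote(table)} ({columns})")
    conn.executemany(f"INSERT INTO {quote(table)} VALUES ({placeholders})", rows)
    return table


# Example: two delivered files become two identically named tables.
folder = Path(tempfile.mkdtemp())
(folder / "Crazy-name spreadsheet 1.csv").write_text("ID,City\nA,London\nB,Dublin\n")
(folder / "FNU810A.csv").write_text("ID,Rate\nA,0.5\n")

conn = sqlite3.connect(":memory:")
loaded = [load_file(conn, p) for p in sorted(folder.glob("*.csv"))]
print(loaded)
```

Because the table name is derived mechanically from the file name, anyone can trace a dataset back to the file it came from without consulting documentation.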
Guerrilla Analytics Environment
• Multiple languages
• Many code files
• Variety of outputs
• Data manipulation on file system
• Data manipulation in analytics environment
• Combinations of tools
Analytics: Code
Guerrilla Analytics Environment
WP_024
Rates cleaned.SQL
Rates_by_city_v1_FINAL.R
Rates_by_city_v2.R
MAP_POSTCODES.SQL
Guerrilla Analytics Approach
WP_024
010_MAP_POSTCODES.SQL
030_Rates cleaned.SQL
050_Rates_by_cityv2.R
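Assuming the numeric prefixes in the second listing denote execution order, the order of a work product's scripts can be recovered mechanically. The sketch below is illustrative: the `execution_order` function and the three-digit prefix rule are assumptions, and the scripts are only listed, not run.

```python
import re
import tempfile
from pathlib import Path

# Assumed convention: a three-digit number, then an underscore, then the name.
PREFIX = re.compile(r"^(\d{3})_")


def execution_order(work_product: Path) -> list[str]:
    """Return script names sorted by numeric prefix; reject unnumbered files."""
    numbered = []
    for script in work_product.iterdir():
        match = PREFIX.match(script.name)
        if match is None:
            raise ValueError(f"{script.name} has no execution-order prefix")
        numbered.append((int(match.group(1)), script.name))
    return [name for _, name in sorted(numbered)]


# Example: the WP_024 folder from the slide, created in arbitrary order.
wp = Path(tempfile.mkdtemp()) / "WP_024"
wp.mkdir()
for name in ["030_Rates cleaned.SQL", "050_Rates_by_cityv2.R", "010_MAP_POSTCODES.SQL"]:
    (wp / name).touch()
print(execution_order(wp))
```

Leaving gaps between the numbers (010, 030, 050) allows a new script to be slotted in later without renaming the others.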
Analytics: Data
ID  Addr_1           City
A   10 Main St       London
C   5 Junct          London
B   54 Shop Rd       Dublin
B   123 Middle Str.  Galway

ID   Addr_1          City
A    10 MAIN STREET  London
B    54 SHOP ROAD    Dublin
C    5 JUNCTION      London
...  ...             ...
Analytics: Data
ID  Addr_1           City
A   10 Main St       London
C   5 Junct          London
B   54 Shop Rd       Dublin
B   123 Middle Str.  Galway

ID  Addr_1           Addr_1_cln         City    IS_IN_SCOPE
A   10 Main St       10 MAIN STREET     London  YES
C   5 Junct          5 JUNCTION         London  YES
B   54 Shop Rd       54 SHOP ROAD       Dublin  YES
B   123 Middle Str.  123 MIDDLE STREET  Galway  NO
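The second table keeps the raw address, adds the cleaned value as a new column, and flags out-of-scope rows instead of deleting them. A minimal sketch of that pattern follows. The abbreviation list is inferred from the example rows, and the scope rule (London and Dublin in scope) is purely hypothetical, since the slide shows only the resulting flags.

```python
# Abbreviation expansions inferred from the example rows on the slide.
ABBREVIATIONS = {"ST": "STREET", "STR.": "STREET", "RD": "ROAD", "JUNCT": "JUNCTION"}

# Hypothetical scope rule: the slide shows the flags, not the rule behind them.
IN_SCOPE_CITIES = {"London", "Dublin"}


def clean_address(raw: str) -> str:
    """Upper-case the address and expand known abbreviations."""
    words = raw.upper().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def enrich(record: dict) -> dict:
    """Return a copy with a cleaned column and a scope flag; raw values are untouched."""
    enriched = dict(record)
    enriched["Addr_1_cln"] = clean_address(record["Addr_1"])
    enriched["IS_IN_SCOPE"] = "YES" if record["City"] in IN_SCOPE_CITIES else "NO"
    return enriched


raw_rows = [
    {"ID": "A", "Addr_1": "10 Main St", "City": "London"},
    {"ID": "C", "Addr_1": "5 Junct", "City": "London"},
    {"ID": "B", "Addr_1": "54 Shop Rd", "City": "Dublin"},
    {"ID": "B", "Addr_1": "123 Middle Str.", "City": "Galway"},
]
enriched_rows = [enrich(row) for row in raw_rows]
for row in enriched_rows:
    print(row)
```

Every input row survives with its original values, so a reviewer can compare the raw and cleaned addresses side by side and see which records were excluded.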
Reporting – Guerrilla Analytics approach
1. Select min/max of transaction_time (WP_030)
2. Select min/max of customer_age (WP_035)
5. Purchases by type (WP_042)
Why consolidate?
Raw
Duplicates
Customers Clean_Cust
Deduped New_dupes
Work Product
Why consolidate?
Raw
Duplicates
Customers Clean_Cust
Deduped New_dupes
Duplicates_02
Customers
Duplicates
Deduped Clean_cust New_dupes
Work Product
Consolidating with a Build
Deduped
Clean_cust
New_dupes
Duplicates_02
Duplicates
Customers
Dupes_latest
Cust_Latest
Raw Latest Clean Rules Interface
Version Controlled Code
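One possible reading of the build diagram is that work products query a stable interface such as `Cust_Latest`, while the build decides which versioned dataset sits behind it. The sketch below illustrates that idea with SQLite views. The `publish` function, the `_v01` and `_v02` table suffixes and the sample rows are assumptions made for illustration and do not come from the slides.

```python
import sqlite3


def publish(conn: sqlite3.Connection, dataset: str, version: int, interface_name: str) -> None:
    """Point the stable interface view at one versioned table."""
    conn.execute(f'DROP VIEW IF EXISTS "{interface_name}"')
    conn.execute(
        f'CREATE VIEW "{interface_name}" AS SELECT * FROM "{dataset}_v{version:02d}"'
    )


conn = sqlite3.connect(":memory:")

# Two builds of the same dataset, kept side by side (hypothetical sample rows).
conn.execute('CREATE TABLE "Customers_v01" (id TEXT, name TEXT)')
conn.executemany('INSERT INTO "Customers_v01" VALUES (?, ?)', [("A", "Ann"), ("B", "Bob")])
conn.execute('CREATE TABLE "Customers_v02" (id TEXT, name TEXT)')
conn.executemany(
    'INSERT INTO "Customers_v02" VALUES (?, ?)', [("A", "Ann"), ("B", "Bob"), ("C", "Cat")]
)

# Work products always query Cust_Latest; the build moves it between versions.
publish(conn, "Customers", 1, "Cust_Latest")
count_v1 = conn.execute('SELECT COUNT(*) FROM "Cust_Latest"').fetchone()[0]
publish(conn, "Customers", 2, "Cust_Latest")
count_v2 = conn.execute('SELECT COUNT(*) FROM "Cust_Latest"').fetchone()[0]
print(count_v1, count_v2)
```

Because the earlier version is kept rather than overwritten, a number reported last week can still be reproduced by querying `Customers_v01` directly.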
Open Questions
• Workflows
• Testing
• ‘Big Data’
• Engineering
Keep in Touch!
@Enda_Ridge
www.guerrilla-analytics.net