Data Mining: Introduction, Techniques, Case Studies & Benchmarking
Milan, April 2nd, 2008
Franco Orsogna, Deloitte Enterprise Risk Services Italia
Data Mining: Introduction and Case Studies ©2008 Deloitte Touche Tohmatsu
Index
•Objectives
•Definition
•Main Techniques
•Industry Sectors & Application Fields
•Case Studies with WizRule tool
•Conclusions & possible uses
•Appendix
  - Data Mining tools & Benchmark
  - Focus on WizRule
  - WizRule vs Leading DM tools
Objectives
• Define Data Mining
• Show the main DM techniques available
• Give examples of the main application fields
• Present the results of applying the WizRule software to 3 case studies
The appendix includes a benchmark of several Data Mining tools available on the market
Definition (1/2)
“A fast and inexpensive way of summarizing, exploring, understanding, and analyzing data…without requiring human intervention” (*)
“Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” (**)
(*) J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, 2000
(**) Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, “Advances in Knowledge Discovery and Data Mining”, 1996
Definition (2/2)
Data Mining is a discipline that brings together knowledge from different fields:
•Statistics
•Pattern recognition
•Operating Logic
•Algorithm Theory
•Artificial Intelligence
•Etc…
Main Techniques (1/2)
The main Data Mining Techniques are:
• Association Rules: this kind of algorithm finds items correlated by “if-then” rules
e.g.: if the customer has a high income and owns a car, then he could be interested in purchasing a car warranty extension
• Clustering: splits data into subsets that naturally belong together
e.g.: a set of customers can be split into subsets of customers with similar consumption patterns
• Decision Trees: build a pattern of nested “if-then” rules
[Figure: example decision tree. Questions: “Salary > €30.000?” (Yes / No), “# children?” (<3 / >2), “state-employees?” (Yes / No). Outcomes: “Reliable”, “Not reliable”.]
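A decision tree of this kind can be read as nested if-then rules. The following Python sketch uses the three questions named in the example (salary, number of children, state employee). The branch layout, the `Customer` type and the `classify` function are assumptions made for illustration, not the exact tree shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    salary_eur: float
    children: int
    state_employee: bool


def classify(c: Customer) -> str:
    """Hypothetical tree: each nested 'if' corresponds to one question (node)."""
    if c.salary_eur > 30_000:
        return "Reliable"
    # Salary at or below the threshold: look at the number of children.
    if c.children < 3:
        # Assumed layout: state employees with few children are still reliable.
        return "Reliable" if c.state_employee else "Not reliable"
    return "Not reliable"


print(classify(Customer(salary_eur=25_000, children=1, state_employee=True)))
```

Each path from the root to a leaf is one if-then rule, which is why decision trees are easy to explain to non-specialists.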
Main Techniques (2/2)
• Classification: builds a predictive model to classify data
e.g.: analyzing existing bank customers in order to create a model for characterizing new ones
• Summarization: reduces data to a smaller subset that is representative of the original population
e.g.: selecting some customers as a sample of the entire population in the data
• Deviation Detection: discovers instances (records) that differ from the rest of the population
e.g.: atypical users (i.e. users with atypical access privileges) in a corporate DB
• Etc…
Industry Sectors & Application Fields (1/2)
Many industry sectors use Data Mining:
•Banking
•Telecommunications
•Retail
•Etc…
Application fields include, for example:
•Customer acquisition
•Cross-sell, co-marketing
•Credit Risk analysis
•Fraud analysis
•Etc…
Industry Sectors & Application Fields (2/2)
Two kinds of DM services are possible:
• Data quality, data cleaning, data analysis, etc.
• Fraud detection (Audit Support / External Clients)
Typical examples of frauds to be detected are:
• Credit card fraud
• Money laundering
• Securities fraud
• Telecom fraud
Case Studies: Introduction
We acquired a DM tool to evaluate the potential of Data Mining in Audit Support services and in possible new ERS services.
We chose WizRule for its simplicity and low license cost.
We applied the tool to datasets from previous IT Audit support on FSA engagements. The 3 “case studies” that follow concern:
• Journal entries
• Stock in/outflows
• Timesheet entries
Case Study 1: Journal Entries Dataset
• Content of dataset: journal entries related to a company for an entire fiscal year.
• Number of records: 57.342
• Number of fields: 15
• Analysis parameter settings:
Minimum Probability of If-then Rules: 0,99
Minimum Accuracy Level of Formula Rules: 0,99
Minimum Number of Cases in a Rule: 300
Case Study 1: Journal Entries Processing results and Performance
• Rules found: 3.729
• Spelling deviations found: 28
• Rule deviations found: 1.253
• Time spent by the tool*: 3 minutes
* on a laptop PC with the following specifications:
- Processor: Intel Pentium M 1,73 GHz
- RAM: 1 GB
- Free HD Space: 25 GB
Case Study 1: Journal Entries Rules and deviations found – examples (1/3)
RULE:
28) If Account is 2203010101
Then
User is UFFAMMCICT1
Rule's probability: 1,000
The rule exists in 425 records.
Significance Level: Error probability is almost 0
Case Study 1: Journal Entries Rules and deviations found – examples (2/3)
RULE DEVIATION:
27) If Account is 1701040100
Then
User is UFFAMMCICT4
Rule's probability: 0,993
The rule exists in 963 records.
Significance Level: Error probability is almost 0
Deviations (records' serial numbers):
33008, 19420, 19422, 24640, 24643, 29022, 29025
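The reported probability is consistent with reading it as the share of records matching the rule's condition that also satisfy its conclusion. The short Python check below assumes that “The rule exists in 963 records” counts the conforming records and that the 7 listed serial numbers are all the deviations. Both are assumptions about how the tool counts, and `rule_probability` is a hypothetical helper.

```python
def rule_probability(conforming: int, deviating: int) -> float:
    """Share of records matching the condition that also match the conclusion."""
    return conforming / (conforming + deviating)


# Serial numbers of the deviating records listed in the report.
deviations = [33008, 19420, 19422, 24640, 24643, 29022, 29025]

p = rule_probability(conforming=963, deviating=len(deviations))
print(f"{p:.3f}")  # prints 0.993 (the report writes it with a decimal comma: 0,993)
```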
Case Study 1: Journal Entries Rules and deviations found – examples (3/3)
SPELLING DEVIATION:
Deviation #2 (out of 28)
Record No. 10157
Field / Value (X marks the deviating field)
Entry_Date: 28/02/2006
Entry_N: 3800000524
X User: DIRSAP
Trial_Bal_Acc.: 5002060101
Posting_Date: 07/03/2006
…
Rules explaining how the case deviates from the norm:
The value DIR2SAP appears 3.208 times in the User field.
There are 2 case(s) containing similar value(s): 10157, 10158.
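A spelling deviation of this kind can be approximated by looking for rare values that closely resemble a frequent one. The sketch below uses Python's standard `difflib` module. The thresholds, the function name `spelling_deviations` and the toy data are assumptions for illustration and do not describe WizRule's actual algorithm.

```python
from collections import Counter
from difflib import SequenceMatcher


def spelling_deviations(values, max_rare=5, min_frequent=100, min_similarity=0.85):
    """Return (rare_value, frequent_value) pairs that look like misspellings."""
    counts = Counter(values)
    frequent = [v for v, n in counts.items() if n >= min_frequent]
    suspects = []
    for value, n in counts.items():
        if n > max_rare:
            continue  # only rarely used values are suspicious
        for candidate in frequent:
            if SequenceMatcher(None, value, candidate).ratio() >= min_similarity:
                suspects.append((value, candidate))
    return suspects


# Toy data mirroring the example: one user name occurs 3208 times, a near-identical one twice.
users = ["DIR2SAP"] * 3208 + ["DIRSAP"] * 2 + ["UFFAMMCICT1"] * 425
print(spelling_deviations(users))  # prints [('DIRSAP', 'DIR2SAP')]
```

Lowering `min_similarity` or raising `max_rare` widens the net and produces longer reports, at the cost of more false positives.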
Case Study 2: Stocks Accounting Dataset
• Content of dataset: quarterly stock in/outflows
• Number of records: 2.359.335
• Number of fields: 14
• Analysis parameter settings:
Minimum Probability of If-then Rules: 0,99
Minimum Accuracy Level of Formula Rules: 0,99
Minimum Number of Cases in a Rule: 10.000
Case Study 2: Stocks Accounting Performance
• Rules found: 268
• Spelling deviations found: 25
• Rule deviations found: 5.889
• Time spent by the tool*: 14 minutes
* on a laptop PC with the following specifications:
- Processor: Intel Pentium M 1,73 GHz
- RAM: 1 GB
- Free HD Space: 25 GB
Case Study 2: Stocks Accounting Rules and deviations found – examples (1/4)
RULE:
33) If WAREHOUSE_CODE is 3002
Then
UNIT OF MEASURE is NR (= number of items)
Rule's probability: 1,000
The rule exists in 27.151 records.
Significance Level: Error probability is almost 0
Case Study 2: Stocks Accounting Rules and deviations found – examples (2/4)
RULE:
34) If COD_MAG is 8001
Then
DATA is 2005-03-16
Rule's probability: 1,000
The rule exists in 12.755 records.
Significance Level: Error probability is almost 0
Case Study 2: Stocks Accounting Rules and deviations found – examples (3/4)
RULE DEVIATION:
3532) If MAT_LOC is SDOP
Then
UNIT OF MEASURE is NR is MP
Rule's probability: 0,998
The rule exists in 13668 records.
Significance Level: Error probability is almost 0
Deviations (records' serial numbers):
47613, 880530, 880745, 881033, 881654, 881700,
881702, 881726, 881748, 881778, …
Case Study 2: Stocks Accounting Rules and deviations found – examples (4/4)
SPELLING DEVIATION:
No significant spelling deviations found.
Case Study 3: Time Sheet Entries Dataset
• Content of dataset: time sheet by customer
• Number of records: 10.978
• Number of fields: 10
• Analysis parameter settings:
Minimum Probability of If-then Rules: 0,90
Minimum Accuracy Level of Formula Rules: 0,90
Minimum Number of Cases in a Rule: 70
Case Study 3: Time Sheet Performance
• Rules found: 346
• Spelling deviations found: 24
• Rule deviations found: 625
• Time spent by the tool*: 20 seconds
* on a laptop PC with the following specifications:
- Processor: Intel Pentium M 1,73 GHz
- RAM: 1 GB
- Free HD Space: 25 GB
Case Study 3: Time Sheet Rules and deviations found – examples (1/3)
RULE:
3) If CUSTOMER is ABC COMPANY LTD
Then
CUST_CODE is 1.286,00
Rule's probability: 1,000
The rule exists in 664 records.
Significance Level: Error probability is almost 0
Case Study 3: Time Sheet Rules and deviations found – examples (2/3)
RULE DEVIATION:
191) If CUSTOMER is XYZ COMPANY LTD
Then
OVERTIME is 0,00
Rule's probability: 0,945
The rule exists in 156 records.
Significance Level: Error probability is almost 0
Deviations (records' serial numbers):
3492, 3494, 3585, 3578, 3463, 3476, 3523, 3526, 3549
Case Study 3: Time Sheet Rules and deviations found – examples (3/3)
SPELLING DEVIATION:
Deviation #1 (out of 24)
Record No. 3598
Field / Value (X marks the deviating field)
CUST_CODE: 11179.000000
X CUSTOMER: XYZ COMPANY LTD
CONTRACT: 70383.000000
BRANCH: Milan, March 22 (S)
REG_NUM: 86878.000000
…
Rules explaining how the case deviates from the norm:
The value XYZ COMPANY LTD appears 165 times in the CUSTOMER field.
There are 3 case(s) containing similar value(s): 3598, 3599, 3600.
Case Studies – Some Considerations
WizRule Pros
•Easy to use: just set the “quantity” and “quality” fields
•Good processing performance (depends mainly on the number of fields)
•Effective discovery of hidden knowledge in the datasets
WizRule Cons
•The reports produced are generally long
•Requires sufficiently deep knowledge of the data being analyzed
•A small change in the parameter settings can considerably change the reports produced
Case Studies - Possible uses
•Data cleaning
•Limited use in IT Audit support on FSA: it could add value, but it requires additional time
•The deviation analysis could support fraud detection activities
Data Mining Tools
• Enterprise Miner (SAS Institute Inc.) http://www.sas.com/technologies/analytics/datamining/miner/
• Clementine (SPSS Inc.) http://www.spss.com/clementine/
• StatSoft Statistica Data Miner http://www.statsoft.com/products/dataminer.htm
• IBM DB2 Intelligent Miner http://www-06.ibm.com/software/uk/forms/pdf_form_catalog_uk.html
• Waikato Weka Project http://www.cs.waikato.ac.nz/~ml/weka/
• Microsoft SQL Server 2005 (beta version) http://www.microsoft.com/sql/2005/
• WizSoftware WizRule http://www.wizsoft.com
Note: in the following slides, terms and prices are indicative
SAS Enterprise Miner (1/2)
•Software house: SAS, Inc.
•Platform:
  - Server: Unix/Linux, Solaris and MS-Windows
  - Client: Java-based
•User Interface: “user-friendly”
•Integrated with forecasting and other analysis SAS modules
•General Purpose Tool that includes many DM algorithms (Rule Induction, DT, NN, Clustering, …)
•Price: not available
SAS Enterprise Miner (2/2)
Enterprise Miner’s Front-end.
SPSS Clementine (1/2)
•Software house: SPSS, Inc.
•Platform:
  - Server: Unix/Linux, Solaris and MS-Windows
  - Client: Java-based
•User Interface: “user-friendly”
•Integrated with SPSS statistical modules
•Price: ~ 60.000 €
SPSS Clementine (2/2)
Clementine’s visual interface
StatSoft Statistica Data Miner (1/2)
•Software house: StatSoft, Inc. (UK)
•Platform: MS-Windows, Unix
•User Interface: “user-friendly”
•General Purpose Tool that includes many DM algorithms (HMM, NN, DT, AR, K-Means Clustering, …)
•C/C++ Programming Interface
•Price: ~18.000 € + 3.000 €/year
StatSoft Statistica Data Miner (2/2)
Example of graphic representation of association rules found by the tool.
IBM DB2 Intelligent Miner (1/2)
• Platform: MS-Windows, AIX, OS/400
• It can be easily integrated with DB2 (it also supports Oracle DB)
• It supports an interactive process: data processing, statistical analysis and visualization of the results
• Many DM algorithms implemented
• Scalable processing
• Designed to process huge datasets
• Price: > 50.000 €
IBM DB2 Intelligent Miner (2/2)
Front-end and available algorithms in DB2 Intelligent Miner.
Waikato Weka Project (1/2)
•Developed by the University of Waikato (NZ)
•Platform: Java
•Very versatile (Clustering, Visualization, Analysis, ANN, …)
•Analysis and visualization tools
•Data filters
•Open Source
•Not very easy for “non DM” users
•Association rules represented only through text
•Price: free
Waikato Weka Project (2/2)
Visualization of association rules in Weka
Microsoft SQL Server 2005 (1/2)
•Platform: MS-Windows
•Analysis Services Module is part of SQL Server
•Built-in business rules, tools and wizards to help the analysis of:
  - Semi-additive measures
  - Time intelligence
  - Account intelligence
  - Financial aggregations
•Price: ~ 4.000 $ for “Workgroup” version or ~ 730 $ for 5 licenses
Microsoft SQL Server 2005 (2/2)
Visualization of Decision Trees in SQL Server 2005
WizRule
•Platform: Windows
•Easy to use
•Reports easy to understand
•Audit perspective*
•It can acquire data from many sources
•Price: ~ 1.400 $ (for 1 license)
* through the Deviation Report it is possible to immediately investigate anomalies in the data
WizRule: what does it find? (1/2)
WizRule finds association rules and formula relationships (of specific types) among fields.
It also shows cases that deviate from the rules it found, provided that they:
• are not explained by other rules, and
• have a low frequency relative to the overall frequency.
The tool treats this kind of deviation as a suspected error.
WizRule: what does it find? (2/2)
The rules it finds are of this kind:
• if C1=a and C2=b … Ci=w then Cz = y
• if C1 starts with “abcde” then C2=y
• if C1 is between a and b, then C2=y
where Cx is the value of the x-th field and “abcde” is a string of at most 5 characters.
It is also able to find formula relationships (of specific types) among fields, such as:
• C1 = a × C2 + b
• C1 = a / C2
where Cx is the value of the x-th field, and a and b are constants.
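The rule shapes listed above can be expressed as simple predicates over a record. In this minimal Python sketch, the helper functions, the `net`/`gross` fields and their values are hypothetical. They only illustrate the three condition types and a check of the linear formula C1 = a × C2 + b, not how WizRule is implemented.

```python
import math


def equals(field, value):
    """Condition 'Cx = a'."""
    return lambda rec: rec[field] == value


def starts_with(field, prefix):
    """Condition 'Cx starts with <prefix>'."""
    return lambda rec: str(rec[field]).startswith(prefix)


def between(field, low, high):
    """Condition 'Cx is between a and b' (bounds included)."""
    return lambda rec: low <= rec[field] <= high


def check_rule(rec, conditions, then_field, then_value):
    """None if the conditions do not match; otherwise whether the conclusion holds."""
    if not all(cond(rec) for cond in conditions):
        return None
    return rec[then_field] == then_value


def linear_formula_holds(rec, c1, c2, a, b, rel_tol=1e-9):
    """Check the formula relationship C1 = a * C2 + b on one record."""
    return math.isclose(rec[c1], a * rec[c2] + b, rel_tol=rel_tol)


record = {"account": "2203010101", "user": "UFFAMMCICT1", "net": 100.0, "gross": 122.0}

print(check_rule(record, [equals("account", "2203010101")], "user", "UFFAMMCICT1"))
print(check_rule(record, [starts_with("account", "17010")], "user", "UFFAMMCICT4"))
print(linear_formula_holds(record, "gross", "net", a=1.22, b=0.0))
```

A record for which `check_rule` returns `False` is a candidate deviation: the conditions match but the conclusion does not.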
WizRule: how does it work?
The user:
• selects the source of data
• “fine-tunes” the analysis parameters
Then the tool reveals through specific reports:
• the rules governing the data
• suspected errors/deviations
WizRule: what source of data?
The tool is able to acquire data from numerous sources:
• .dbf files (dBase, FoxPro, Clipper, etc.)
• MS Access files
• MS SQL Server tables
• Oracle tables
• ODBC compliant databases
• OLE DB compliant databases
• ASCII-type text files
WizRule: “fine-tuning”
The user can tune the following analysis parameters to increase or decrease the number of rules found:
•Minimum probability of “if-then” rules (confidence level)
•Minimum accuracy level of formula rules
•Minimum number of cases of a rule (support level)
•Minimum number of conditions
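The effect of the minimum probability and the minimum number of cases can be illustrated with a small single-condition rule miner: a candidate rule “if A = a then B = b” is kept only if enough records support it and its probability is high enough. This Python sketch is an illustration under stated assumptions. The function `mine_rules`, the toy data and the restriction to one condition are hypothetical, and it is not WizRule's algorithm.

```python
from collections import Counter, defaultdict


def mine_rules(records, if_field, then_field, min_probability=0.99, min_cases=300):
    """Find rules 'if if_field = a then then_field = b' that pass both thresholds."""
    by_condition = defaultdict(Counter)
    for rec in records:
        by_condition[rec[if_field]][rec[then_field]] += 1
    rules = []
    for a, outcomes in by_condition.items():
        b, cases = outcomes.most_common(1)[0]  # most frequent conclusion for this condition
        probability = cases / sum(outcomes.values())
        if cases >= min_cases and probability >= min_probability:
            rules.append((a, b, probability, cases))
    return rules


# Toy data: account X is posted by user U1 in 99 of 100 records; account Y is split evenly.
records = (
    [{"account": "X", "user": "U1"}] * 99
    + [{"account": "X", "user": "U2"}]
    + [{"account": "Y", "user": "U1"}] * 50
    + [{"account": "Y", "user": "U2"}] * 50
)

strict = mine_rules(records, "account", "user", min_probability=0.95, min_cases=50)
loose = mine_rules(records, "account", "user", min_probability=0.50, min_cases=50)
print(len(strict), len(loose))  # prints 1 2
```

Lowering either threshold admits weaker rules, which is why small parameter changes can noticeably lengthen the reports.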
WizRule: what does it show?
Once the tool has processed the data, it creates the following reports:
1. Rule Report
2. Deviation Report
3. Spelling Report
WizRule: reports (1/3)
Rule Report (Screenshot)
WizRule: reports (2/3)
Spelling Report (Screenshot)
WizRule: reports (3/3)
Deviation Report (Screenshot)
Benchmark of DM Tools: Supported Platforms (1/3)
Columns compared: Database Connectivity, Windows Server/PC Client, Unix Server/PC Client, Unix Standalone, PC Standalone (Windows)
Marks per tool (X = supported):
WizRule: XX
Enterprise Miner: XXXX
Clementine: XXXX
Benchmark of DM Tools: Data Input & Model Output (2/3)
Columns compared: Output Source Code, Summary Report, Native Database Driver, ODBC, Automatic Header
Marks per tool (X = supported):
WizRule: XXXX
Enterprise Miner: XXXX
Clementine: XXX
Benchmark of DM Tools: Algorithms (3/3)
Columns compared: Kohonen, Association Rules, K-Means, Generalized Linear Models, Rule Induction, Radial Basis Functions, Multi-layer Perceptrons, Linear/Statistical, Decision Trees
Marks per tool (X = supported):
WizRule: X (1)
Enterprise Miner: XXXXXXX
Clementine: XXXXXXX
(1) with WizWhy Module
Member of Deloitte Touche Tohmatsu
Deloitte refers to one or more of Deloitte Touche Tohmatsu, a Swiss Verein, its member firms, and their respective subsidiaries and affiliates. As a Swiss Verein (association), neither Deloitte Touche Tohmatsu nor any of its member firms has any liability for each other’s acts or omissions. Each of the member firms is a separate and independent legal entity operating under the names “Deloitte”, “Deloitte & Touche”, “Deloitte Touche Tohmatsu”, or other related names. Services are provided by the member firms or their subsidiaries or affiliates and not by the Deloitte Touche Tohmatsu Verein.