Data Mining: Introduction, Techniques, Case Studies ... · Data Mining: Introduction, Techniques, Case Studies & Benchmarking Milan April 2nd , 2008 Franco Orsogna Deloitte Enterprise

Data Mining: Introduction, Techniques, Case Studies & Benchmarking

Milan, April 2nd, 2008

Franco Orsogna, Deloitte Enterprise Risk Services Italia

Data Mining: Introduction and Case Studies ©2008 Deloitte Touche Tohmatsu

Index

•Objectives

•Definition

•Main Techniques

•Industry Sectors & Application Fields

•Case Studies with WizRule tool

•Conclusions & possible uses

•Appendix

•Data Mining tools & Benchmark

•Focus on WizRule

•WizRule vs Leading DM tools


Objectives

• Provide a definition of Data Mining

• Show the main techniques that DM offers

• Give examples of the main application fields

• Present the results of applying the WizRule software to 3 case studies

The appendix includes a benchmark of several Data Mining tools available on the market.


Definition (1/2)

“A fast and inexpensive way of summarizing, exploring, understanding, and analyzing data…without requiring human intervention” (*)

“Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data” (**)

(*) J. Han and M. Kamber, “Data Mining: Concepts and Techniques” , 2000

(**) Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, “Advances in Knowledge Discovery and Data Mining”, 1996


Definition (2/2)

Data Mining is a discipline that brings together knowledge from several fields:

•Statistics

•Pattern recognition

•Operating Logic

•Algorithm Theory

•Artificial Intelligence

•Etc…


Main Techniques (1/2)

The main Data Mining Techniques are:

• Association Rules: this kind of algorithm finds items that are correlated by “if-then” rules

e.g.: if a customer has a high income and owns a car, then he may be interested in purchasing a car warranty extension

• Clustering: splits data into subsets that naturally belong together

e.g.: a set of customers can be split into subsets of customers with similar consumption patterns

• Decision Trees: build a pattern of nested “if-then” rules

[Figure: example decision tree. Decision nodes: “Salary > €30.000?” (Yes/No), “# children?” (<3 / >2), “state-employees?” (Yes/No). Leaves: “Reliable”, “Not reliable”.]
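
As a rough illustration of the association-rule idea above, the following Python sketch counts how often a rule of the form "high income and owns a car, then interested in a warranty extension" holds in a toy customer list. The records, field names, and the `rule_stats` helper are all invented for illustration and are not taken from any tool discussed here.

```python
# Toy customer records; every value is made up for illustration.
customers = [
    {"income": "high", "owns_car": True,  "bought_warranty": True},
    {"income": "high", "owns_car": True,  "bought_warranty": True},
    {"income": "high", "owns_car": True,  "bought_warranty": False},
    {"income": "low",  "owns_car": True,  "bought_warranty": False},
    {"income": "high", "owns_car": False, "bought_warranty": False},
]

def rule_stats(records, condition, conclusion):
    """Return (number of records matching the 'if' part, rule confidence)."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0, 0.0
    hits = sum(1 for r in matching if conclusion(r))
    return len(matching), hits / len(matching)

cases, confidence = rule_stats(
    customers,
    condition=lambda r: r["income"] == "high" and r["owns_car"],
    conclusion=lambda r: r["bought_warranty"],
)
print(f"rule applies to {cases} records, confidence {confidence:.2f}")
```

The confidence is the share of records satisfying the "if" part that also satisfy the "then" part; a mining tool would search many candidate rules and keep only those above chosen thresholds.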


Main Techniques (2/2)

• Classification: builds a predictive model to classify data

e.g.: analyzing existing bank customers in order to create a model for characterizing new ones

• Summarization: reduces data to a smaller subset that still represents the original population well

e.g.: selecting some customers as a sample of the entire population in the data

• Deviation Detection: discovers instances (records) that differ significantly from the rest of the population

e.g.: atypical users (for instance, users with unusual access privileges) in a corporate DB

• Etc…


Industry Sectors & Application Fields (1/2)

Many industry sectors use Data Mining:

•Banking

•Telecommunications

•Retail

•Etc…

Application fields include, for example:

•Customer acquisition

•Cross-sell, co-marketing

•Credit Risk analysis

•Fraud analysis

•Etc…


Industry Sectors & Application Fields (2/2)

Two kinds of DM services are possible:

• Data quality, data cleaning, data analysis, etc.

• Fraud detection (Audit Support / External Clients)

Typical examples of fraud to be detected are:

• Credit card fraud

• Money laundering

• Securities fraud

• Telecom fraud


Case Studies: Introduction

We acquired a DM tool in order to evaluate the potential of Data Mining in Audit Support services and in possible new ERS services.

We chose WizRule for its simplicity and low license cost.

We applied the tool to datasets that came from previous IT Audit support on FSA engagements. The 3 “case studies” that follow relate to:

• Journal entries

• Stock in/outflows

• Timesheet entries


Case Study 1: Journal Entries - Dataset

• Content of dataset: journal entries related to a company for an entire fiscal year.

• Number of records: 57.342

• Number of fields: 15

• Analysis parameter settings:

Minimum Probability of If-then Rules: 0,99

Minimum Accuracy Level of Formula Rules: 0,99

Minimum Number of Cases in a Rule: 300


Case Study 1: Journal Entries - Processing results and Performance

• Rules found: 3.729

• Spelling deviations found: 28

• Rule deviations found: 1.253

• Time spent by the tool*: 3 minutes

* on a laptop PC with the following specifications:

- Processor: Intel Pentium M 1,73 GHz

- RAM: 1 GB

- Free HD Space: 25 GB


Case Study 1: Journal Entries Rules and deviations found – examples (1/3)

RULE:

28) If Account is 2203010101

Then

User is UFFAMMCICT1

Rule's probability: 1,000

The rule exists in 425 records.

Significance Level: Error probability is almost 0



RULE DEVIATION:

27) If Account is 1701040100

Then

User is UFFAMMCICT4



Deviations (records' serial numbers):

33008, 19420, 19422, 24640, 24643, 29022, 29025



SPELLING DEVIATION:

Deviation #2 (out of 28)

Record No. 10157

Field: Value

Entry_Date: 28/02/2006

Entry_N: 3800000524

X User: DIRSAP

Trial_Bal_Acc.: 5002060101

Posting_Date: 07/03/2006

…

Rules explaining how the case deviates from the norm:

The value DIR2SAP appears 3.208 times in the User field.

There are 2 case(s) containing similar value(s): 10157, 10158.
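
A spelling deviation of this kind can be approximated with a simple heuristic: flag a rare value that closely resembles a much more frequent one. The sketch below uses Python's standard `difflib` module. The thresholds, the toy `users` list, and the function name are assumptions, and this is not WizRule's actual algorithm.

```python
from collections import Counter
from difflib import SequenceMatcher

def spelling_deviations(values, max_rare=2, min_common=10, min_ratio=0.8):
    """Pair each rare value with a frequent look-alike value, if one exists."""
    counts = Counter(values)
    common = [v for v, n in counts.items() if n >= min_common]
    suspects = []
    for value, n in counts.items():
        if n > max_rare:
            continue  # frequent enough to be treated as a legitimate value
        for candidate in common:
            # ratio() is a similarity score between 0.0 and 1.0
            if SequenceMatcher(None, value, candidate).ratio() >= min_ratio:
                suspects.append((value, candidate))
    return suspects

# Toy data: one frequent user ID and a rare, nearly identical variant.
users = ["DIR2SAP"] * 50 + ["DIRSAP"] * 2 + ["UFFAMMCICT1"] * 30
print(spelling_deviations(users))
```

Each returned pair names the suspicious value and the frequent value it resembles, which mirrors the "similar value" explanation in the report.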


Case Study 2: Stocks Accounting - Dataset

• Content of dataset: quarterly stock in/outflows

• Number of records: 2.359.335





Minimum Number of Cases in a Rule: 10.000


Case Study 2: Stocks Accounting Performance

• Rules found: 268


• Rule deviations found: 5.889

• Time spent by the tool*: 14 minutes

* on a laptop PC with the following specifications:


- RAM: 1 GB



Case Study 2: Stocks Accounting Rules and deviations found – examples (1/4)

RULE:

33) If WAREHOUSE_CODE is 3002

Then

UNIT OF MEASURE is NR (= number of items)

Rule's probability: 1,000

The rule exists in 27.151 records.




RULE:

34) If COD_MAG is 8001

Then

DATA is 2005-03-16


The rule exists in 12.755 records.




RULE DEVIATION:

3532) If MAT_LOC is SDOP

Then

UNIT OF MEASURE is NR is MP


The rule exists in 13.668 records.



47613, 880530, 880745, 881033, 881654, 881700,

881702, 881726, 881748, 881778, …



SPELLING DEVIATION:

no significant spelling deviation.


Case Study 3: Time Sheet Entries - Dataset

• Content of dataset: time sheet by customer

• Number of records: 10.978





Minimum Number of Cases in a Rule: 70


Case Study 3: Time Sheet Performance

• Rules found: 346


• Rule deviations found: 625

• Time spent by the tool*: 20 seconds

* on a laptop PC with the following specifications:


- RAM: 1 GB



Case Study 3: Time Sheet Rules and deviations found – examples (1/3)

RULE:

3) If CUSTOMER is ABC COMPANY LTD

Then

CUST_CODE is 1.286,00


The rule exists in 664 records.




RULE DEVIATION:

191) If CUSTOMER is XYZ COMPANY LTD

Then

OVERTIME is 0,00




3492, 3494, 3585, 3578, 3463, 3476, 3523, 3526, 3549



SPELLING DEVIATION:

Deviation #1 (out of 24)

Record No. 3598

Field: Value

CUST_CODE: 11179.000000

X CUSTOMER: XYZ COMPANY LTD

CONTRACT: 70383.000000

BRANCH: Milan, March 22 (S)

REG_NUM: 86878.000000

…

Rules explaining how the case deviates from the norm:

The value XYZ COMPANY LTD appears 165 times in the CUSTOMER field.

There are 3 case(s) containing similar value(s): 3598, 3599, 3600.


Case Studies – Some Considerations

WizRule Pros

•Easy to use: just set the “quantity” and “quality” fields

•Good processing performance (depends mainly on the number of fields)

•Effective discovery of hidden knowledge in the datasets

WizRule Cons

•The reports produced are generally long

•Requires sufficiently deep knowledge of the data being analyzed

•A small change in the parameter settings can considerably change the reports produced


Case Studies - Possible uses

•Data cleaning

•Limited use in IT Audit support on FSA engagements: it could add value, but adopting it requires additional time

•The deviation analysis could support fraud detection activities


Appendix


Data Mining Tools

• Enterprise Miner (SAS Institute Inc.): http://www.sas.com/technologies/analytics/datamining/miner/

• Clementine (SPSS Inc.): http://www.spss.com/clementine/

• StatSoft Statistica Data Miner: http://www.statsoft.com/products/dataminer.htm

• IBM DB2 Intelligent Miner: http://www-06.ibm.com/software/uk/forms/pdf_form_catalog_uk.html

• Waikato Weka Project: http://www.cs.waikato.ac.nz/~ml/weka/

• Microsoft SQL Server 2005 (beta version): http://www.microsoft.com/sql/2005/

• WizSoftware WizRule: http://www.wizsoft.com

Note: the terms and prices given in the following slides are indicative


SAS Enterprise Miner (1/2)

•Software house: SAS, Inc.

•Platform:

- Server: Unix/Linux, Solaris and MS-Windows

- Client: Java-based

•User Interface: “user-friendly”

•Integrated with forecasting and other analysis SAS modules

•General Purpose Tool that includes many DM algorithms (Rule Induction, DT, NN, Clustering, …)

•Price: not available


SAS Enterprise Miner (2/2)

Enterprise Miner’s Front-end.


SPSS Clementine (1/2)

•Software house: SPSS, Inc.

•Platform:

- Server: Unix/Linux, Solaris and MS-Windows

- Client: Java-based


•Integrated with SPSS statistical modules

•Price: ~ 60.000 €


SPSS Clementine (2/2)

Clementine’s visual interface


StatSoft Statistica Data Miner (1/2)

•Software house: StatSoft, Inc. (UK)

•Platform: MS-Windows, Unix


•General Purpose Tool that includes many DM algorithms (HMM, NN, DT, AR, K-Means Clustering, …)

•C/C++ Programming Interface

•Price: ~18.000 € + 3.000 €/year


StatSoft Statistica Data Miner (2/2)

Example of graphic representation of association rules found by the tool.


IBM DB2 Intelligent Miner (1/2)

• Platform: MS-Windows, AIX, OS/400

• It can be easily integrated with DB2 (it also supports Oracle DB)

• It supports interactive process: data processing, statistical analysis and visualization of the results

• Many DM algorithms implemented

• Scalable processing

• Designed to process very large datasets

• Price: > 50.000 €


IBM DB2 Intelligent Miner (2/2)

Front-end and available algorithms in DB2 Intelligent Miner.


Waikato Weka Project (1/2)

•Developed by the University of Waikato (NZ)

•Platform: Java

•Very versatile (Clustering, Visualization, Analysis, ANN, …)

•Analysis and visualization tools

•Data filters

•Open Source

•Not very easy for “non DM” users

•Association rules represented only through text

•Price: free


Waikato Weka Project (2/2)

Visualization of association rules in Weka


Microsoft SQL Server 2005 (1/2)

•Platform: MS-Windows

•Analysis Services Module is part of SQL Server

•Built-in business rules, tools and wizards to help with the analysis of:

- Semi-additive measures

- Time Intelligence

- Account intelligence

- Financial Aggregations

•Price: ~ 4.000 $ for “Workgroup” version or ~ 730 $ for 5 licenses


Microsoft SQL Server 2005 (2/2)

Visualization of Decision Trees in SQL Server 2005


WizRule

•Platform: Windows

•Easy to use

•Reports easy to understand

•Audit perspective*

•It can acquire data from many sources

•Price: ~ 1.400 $ (for 1 license)

* through the Deviation Report it is possible to immediately investigate anomalies in the data


WizRule: what does it find? (1/2)

WizRule finds association rules and formula relationships (of specific types) among fields.

It also shows cases that deviate from the rules it found, provided that they:

• are not explained by other rules, and

• occur with a low frequency relative to the overall frequency.

The tool treats these kinds of deviations as suspected errors.
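
A minimal sketch of the deviation idea, assuming a single if-then rule has already been found: list the records that satisfy the condition but not the conclusion, and report them only when they are rare. The field names echo the journal-entry case study, but the records, the `max_share` threshold, and the function name are hypothetical, and the check against other rules described above is omitted.

```python
def rule_deviations(records, if_field, if_value, then_field, then_value,
                    max_share=0.05):
    """Return 1-based serial numbers of the rare records that break the rule."""
    matching = [(i, r) for i, r in enumerate(records, start=1)
                if r[if_field] == if_value]
    broken = [i for i, r in matching if r[then_field] != then_value]
    if not matching or len(broken) / len(matching) > max_share:
        return []  # too many exceptions: the rule does not really hold
    return broken

# Toy data: 40 records follow the rule, one record (serial number 11) does not.
records = [{"Account": "1701040100", "User": "UFFAMMCICT4"}] * 40
records.insert(10, {"Account": "1701040100", "User": "UFFAMMCICT1"})
print(rule_deviations(records, "Account", "1701040100", "User", "UFFAMMCICT4"))
```

The low-frequency condition is what separates a suspected error from a pattern that simply is not a rule.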


WizRule: what does it find? (2/2)

The rules it finds are of this kind:

• if C1=a and C2=b … Ci=w then Cz = y

• if C1 starts with “abcde” then C2=y

• if C1 is between a and b, then C2=y

where Cx is the value of the x-th field and “abcde” is a string of at most 5 characters.

It is also able to find formula relationships (of specific types) among fields, such as:

• C1 = a × C2 + b

• C1 = a / C2

where Cx is the value of the x-th field, and a and b are constants.
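
To make the formula-rule idea concrete, the sketch below fits C1 = a × C2 + b by ordinary least squares and reports the share of records that satisfy the fitted formula within a tolerance, which plays a role similar to an accuracy level. The data, the tolerance, and the function name are assumptions; how WizRule itself derives formula rules is not described here.

```python
def fit_linear_rule(c2, c1, tolerance=1e-6):
    """Fit c1 = a * c2 + b; return (a, b, share of records that comply).

    Assumes c2 contains at least two distinct values.
    """
    n = len(c2)
    mean_x = sum(c2) / n
    mean_y = sum(c1) / n
    sxx = sum((x - mean_x) ** 2 for x in c2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(c2, c1))
    a = sxy / sxx
    b = mean_y - a * mean_x
    compliant = sum(1 for x, y in zip(c2, c1)
                    if abs(a * x + b - y) <= tolerance)
    return a, b, compliant / n

# Toy data that follows amount = 2.5 * quantity + 10 exactly.
quantity = [1, 2, 3, 4, 5]
amount = [12.5, 15.0, 17.5, 20.0, 22.5]
a, b, accuracy = fit_linear_rule(quantity, amount)
print(a, b, accuracy)
```

A plain least-squares fit is pulled away from the true coefficients when erroneous records are present, so a production tool would need a more robust estimate before counting deviations.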


WizRule: how does it work?

The user:

• selects the source of data

• “fine-tunes” the analysis parameters

Then the tool reveals through specific reports:

• the rules governing the data

• suspected errors/deviations


WizRule: what source of data?

The tool is able to acquire data from numerous sources:

• .dbf files (dBase, Fox Pro, Clipper etc)

• Ms Access files

• Ms SQL Server tables

• Oracle tables

• ODBC compliant databases

• OLE DB compliant databases

• ASCII-type text files


WizRule: “fine-tuning”

The user can tune the following analysis parameters to increase or decrease the number of rules found:

•Minimum probability of “if-then” rules (confidence level)

•Minimum accuracy level of formula rules

•Minimum number of cases of a rule (support level)

•Minimum number of conditions
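
The effect of the probability and number-of-cases thresholds can be sketched with a brute-force search for single-condition if-then rules. The function below is a simplified stand-in, not WizRule's algorithm; its name, the toy records, and the threshold values are assumptions.

```python
from collections import Counter, defaultdict

def find_rules(records, min_probability=0.99, min_cases=300):
    """Yield (if_field, if_value, then_field, then_value, probability, cases)."""
    fields = list(records[0])
    for if_field in fields:
        for then_field in fields:
            if then_field == if_field:
                continue
            # For each value of if_field, count the values seen in then_field.
            outcomes = defaultdict(Counter)
            for r in records:
                outcomes[r[if_field]][r[then_field]] += 1
            for if_value, counter in outcomes.items():
                then_value, cases = counter.most_common(1)[0]
                probability = cases / sum(counter.values())
                if probability >= min_probability and cases >= min_cases:
                    yield (if_field, if_value, then_field, then_value,
                           probability, cases)

# Toy data: 150 records with two fields.
records = ([{"Account": "A1", "User": "U1"}] * 99
           + [{"Account": "A1", "User": "U2"}]
           + [{"Account": "A2", "User": "U2"}] * 50)

strict = list(find_rules(records, min_probability=0.985, min_cases=60))
loose = list(find_rules(records, min_probability=0.95, min_cases=40))
print(len(strict), len(loose))
```

Lowering either threshold lets more candidate rules through, which is why small parameter changes can noticeably lengthen the reports.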


WizRule: what does it show?

Once the tool has processed the data, it creates the following reports:

1. Rule Report

2. Deviation Report

3. Spelling Report


WizRule: reports (1/3)

Rule Report (Screenshot)



Spelling Report (Screenshot)



Deviation Report (Screenshot)


Benchmark of DM Tools - Supported Platforms (1/3)

Platforms compared: Database Connectivity, Windows Server/PC Client, Unix Server/PC Client, Unix Standalone, PC Standalone (Windows).

Marks (X) per tool (column positions not recoverable):

• WizRule: XX

• Enterprise Miner: XXXX

• Clementine: XXXX


Benchmark of DM Tools - Data Input & Model Output (2/3)

Features compared: Output Source Code, Summary Report, Native Database Driver, ODBC, Automatic Header.

Marks (X) per tool (column positions not recoverable):

• WizRule: XXXX

• Enterprise Miner: XXXX

• Clementine: XXX


Benchmark of DM Tools - Algorithms (3/3)

Algorithms compared: Kohonen, Association Rules, K-Means, Generalized Linear Mod., Rule Induction, Radial Basis Functions, Multi-layer Perceptrons, Linear/Statistical, Decision Trees.

Marks (X) per tool (column positions not recoverable):

• WizRule: X(1)

• Enterprise Miner: XXXXXXX

• Clementine: XXXXXXX

(1) with WizWhy Module

Member of Deloitte Touche Tohmatsu


Deloitte refers to one or more of Deloitte Touche Tohmatsu, a Swiss Verein, its member firm, and their respective subsidiaries and affiliates. As a Swiss Verein (association), neither Deloitte Touche Tohmatsu nor any of its member firms has any liability for each other’s acts or omissions. Each of the member firms is a separate and independent legal entity operating under the names “Deloitte”, “Deloitte & Touche”, “Deloitte Touche Tohmatsu”, or other related names. Services are provided by the member firms or their subsidiaries or affiliates and not by the Deloitte Touche Tohmatsu Verein.
