Page 1: INTRODUCTION TO DATA MINING

INTRODUCTION TO DATA MINING

Pinakpani Pal, Electronics & Communication Sciences Unit

Indian Statistical Institute, [email protected]

Page 2: INTRODUCTION TO DATA MINING

Introduction to Data Mining 2

Main Sources

• Data Mining Concepts and Techniques – Jiawei Han and Micheline Kamber, 2007

• Handbook of Data Mining and Discovery – Willi Klosgen and Jan M Zytkow, 2002

• Fast algorithms for mining association rules and sequential patterns – R.Srikant, Ph.D. Thesis at the University of Wisconsin-Madison, 1996.

• “Parallel & distributed association mining: a survey,” –M. J. Zaki, IEEE Concurrency, 7(4), pp.14-25, 1999.

Page 3: INTRODUCTION TO DATA MINING

Introduction to Data Mining 3

Prelude

•Data Mining is a method of finding interesting trends or patterns in large datasets.

•Data collection may be incomplete, heterogeneous and historical.

•Since data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.

•Data Mining tools are expected to involve minimal user intervention.

Page 4: INTRODUCTION TO DATA MINING

Introduction to Data Mining 4

Prelude

• Data mining deals with finding patterns in data that are either

– user-defined (pre-specified by the user),
– interesting (identified with the help of an interestingness measure), or
– valid (validity pre-defined).

•Discovered patterns help and guide the appropriate authority in taking future decisions. So, Data Mining is regarded as a tool for Decision Support.

Page 5: INTRODUCTION TO DATA MINING

Introduction to Data Mining 5

Data Mining Communities

•Statistics: Provides the background for the algorithms.

•Artificial Intelligence: Provides the required heuristics for machine learning / conceptual clustering.

•Database: Provides the platform for storage and retrieval of raw and summary data.

Page 6: INTRODUCTION TO DATA MINING

Introduction to Data Mining 6

Data Mining

Mining knowledge from large amounts of data.
Evolution:
• Data collection
• Database creation
• Data management
– Data storage
– Retrieval
– Transaction processing

Page 7: INTRODUCTION TO DATA MINING

Introduction to Data Mining 7

Data Mining

• Advanced data analysis
– data warehouse and data mining

Page 8: INTRODUCTION TO DATA MINING

Introduction to Data Mining 8

Data Mining Components

Information Repository: single or multiple heterogeneous data sources

Data Server: storing or retrieving relevant data

Knowledgebase: concept hierarchies, constraints, thresholds, metadata

Pattern Extraction: characterization, discrimination, association, classification, prediction, clustering, various statistical analyses

Pattern Evaluation: interestingness measures

Page 9: INTRODUCTION TO DATA MINING

Introduction to Data Mining 9

Stages of the Data Mining Process

Misconception: Data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.

Steps:
• [Data Collection]
– web crawling / warehousing

Page 10: INTRODUCTION TO DATA MINING

Introduction to Data Mining 10

Stages of the Data Mining Process

Steps (contd.):
• Data Preprocessing & Feature Extraction
– Data cleaning: elimination of erroneous and irrelevant data
– Data integration: from multiple sources
– Data selection / reduction: to accept only the interesting attributes of the data according to the problem domain
– Data transformation: normalization, aggregation

Page 11: INTRODUCTION TO DATA MINING

Introduction to Data Mining 11

Stages of the Data Mining Process

Steps (contd.):
• Pattern Extraction & Evaluation
– Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
– Making it easily understandable
• Evaluation of results
– Not every software-discovered fact is useful for human beings!

Page 12: INTRODUCTION TO DATA MINING

Introduction to Data Mining 12

Data Preprocessing

Data Cleaning: Data may be incomplete, noisy and inconsistent. Attempts are made to identify outliers to smooth out noise, fill in missing values and correct inconsistencies.

Page 13: INTRODUCTION TO DATA MINING

Introduction to Data Mining 13

Data Preprocessing

Data Integration: Data analysis may involve data integration from different sources as in Data Warehouse. The sources may include Databases, Data cubes or flat files.

Page 14: INTRODUCTION TO DATA MINING

Introduction to Data Mining 14

Data Preprocessing

Data Reduction: Since both data volume and attribute set may be too large, data reduction becomes necessary. It includes activities like, Removal of irrelevant and redundant attributes, Data Compression and Aggregation or Generation of Summary Data.

Page 15: INTRODUCTION TO DATA MINING

Introduction to Data Mining 15

Data Preprocessing

Transformation: Data needs to be transformed or consolidated into forms suitable for mining. This may include activities like generalization, normalization (e.g. attribute values converted from absolute values to ranges), construction of new attributes, etc.

Page 16: INTRODUCTION TO DATA MINING

Introduction to Data Mining 16

Patterns

•Descriptive – characterizing general properties of the data

• Predictive – inference on the current data in order to make predictions

• Discover:
– multiple kinds of patterns to accommodate different user expectations / applications (the user may specify hints to guide the search)
– patterns at various granularities

Page 17: INTRODUCTION TO DATA MINING

Introduction to Data Mining 17

Frequent Patterns

Patterns that occur frequently in the data.
Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)

Page 18: INTRODUCTION TO DATA MINING

Introduction to Data Mining 18

Discovery of Association Rules

To identify the features or items in a problem domain that tend to appear together. These features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions, and additionally, to extract rules on how a subset of items influences the presence of another subset.

Page 19: INTRODUCTION TO DATA MINING

Introduction to Data Mining 19

Association Rule: Example

A user studying the buying habits of customers may choose to mine association rules of the form:

P(X:customer, W) ^ Q(X, Y) ⇒ buys(X, Z) [support = n%, confidence = m%]

Meta rules such as the following can be specified:
occupation(X, “student”) ^ age(X, “20...29”) ⇒ buys(X, “mobile”)

[1.4%, 70%]

Page 20: INTRODUCTION TO DATA MINING

Introduction to Data Mining 20

Association Rule: Single/Multi

Single-dimensional association rule:
buys(X, “computer”) ⇒ buys(X, “antivirus”)

[1.1%, 55%]

OR  “computer” ⇒ “antivirus” (A ⇒ B)

[1.1%, 55%]

Multi-dimensional association rule:
occupation(X, “student”) ^ age(X, “20...29”) ⇒ buys(X, “mobile”)

[1.4%, 70%]

Page 21: INTRODUCTION TO DATA MINING

Introduction to Data Mining 21

Metrics for Interestingness measures

Interestingness measures in knowledge discovery help to identify the relevance of the patterns discovered during the mining process.

Page 22: INTRODUCTION TO DATA MINING

Introduction to Data Mining 22

Interestingness measures

•Used to confine the number of uninteresting patterns returned by the process.

•Based on the structure of patterns and statistics underlying them.

•Associate a threshold which can be controlled by the user

– patterns not meeting the threshold are not presented to the user.

Page 23: INTRODUCTION TO DATA MINING

Introduction to Data Mining 23

Interestingness measures: objective

Objective measures of pattern interestingness:• simplicity• utility (support)• certainty (confidence)• novelty

Page 24: INTRODUCTION TO DATA MINING

Introduction to Data Mining 24

Interestingness measures: simplicity

Simplicity: a pattern’s interestingness is based on its overall simplicity for human comprehension.

e.g. Rule length is a simplicity measure

Page 25: INTRODUCTION TO DATA MINING

Introduction to Data Mining 25

Interestingness measures: support

Utility (support): usefulness of a pattern
support(A ⇒ B) = P(A ∪ B)

The support for an association rule {A} ⇒ {B} is the % of all the transactions under analysis that contain this itemset.

Page 26: INTRODUCTION TO DATA MINING

Introduction to Data Mining 26

Interestingness measures: confidence

Certainty (confidence): assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:

confidence(A ⇒ B) = P(B | A)
The confidence for an association rule {A} ⇒ {B} is the % of cases that follow the rule.

Association rules that satisfy both the confidence and support thresholds are referred to as strong association rules.
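As a small illustration of these two measures, the Python sketch below computes support and confidence directly from their definitions; the transaction list is made up for illustration and is not data from these slides.

# Hypothetical transaction database (illustrative only).
transactions = [
    {"pen", "ink", "diary"},
    {"pen", "diary"},
    {"pen", "ink"},
    {"diary", "writing pad"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # confidence(LHS => RHS) = support(LHS ∪ RHS) / support(LHS)
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"pen", "ink"}, transactions))       # 0.5
print(confidence({"pen"}, {"ink"}, transactions))   # 2/3 ≈ 0.67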

Page 27: INTRODUCTION TO DATA MINING

Introduction to Data Mining 27

Interestingness measures: novelty

Novelty: Patterns contributing new information to the given pattern set are called novel patterns.

e.g: Data exception.

Removing redundant patterns is a strategy for detecting novelty.

Page 28: INTRODUCTION TO DATA MINING

Introduction to Data Mining 28

Market Basket data analysis

Let, a transaction be defined as the variety of items purchased by a customer in one visit, irrespective of the quantity of each item purchased. The problem is to find the items that a customer tends to buy together.

Page 29: INTRODUCTION TO DATA MINING

Introduction to Data Mining 29

Market Basket data analysis

An association rule is an expression of the form X ⇒ Y,

where X and Y are sets of items. The intuitive meaning of the expression is that transactions containing X tend to contain Y as well. The inverse may not be true. Since only the presence or absence of items is considered, and not the quantity purchased, these rules are called Binary Association Rules.

Page 30: INTRODUCTION TO DATA MINING

Introduction to Data Mining 30

Market Basket data analysis

The purpose is to study consumers’ purchase patterns in departmental stores. Consider four possible transactions:

1 – {Pen, Ink, Diary, Writing Pad}
2 – {Pen, Ink, Diary}
3 – {Pen, Diary}
4 – {Pen, Ink, Writing Pad}

Page 31: INTRODUCTION TO DATA MINING

Introduction to Data Mining 31

Market Basket data analysis

A possible association rule:
“Purchase of Pen implies the purchase of Ink or Diary”

{Pen} ⇒ {Ink} or {Pen} ⇒ {Diary}

Basically, the rule is of the form {LHS} ⇒ {RHS} where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.

Page 32: INTRODUCTION TO DATA MINING

Introduction to Data Mining 32

Binary Association Rule Mining

Two Step Process
1. Find all frequent itemsets
– An itemset will be considered for mining rules if its support is above a threshold called minsup.
2. Generate strong association rules from frequent itemsets
– Acceptance of a rule is once again through a threshold, called minconf.

Page 33: INTRODUCTION TO DATA MINING

Introduction to Data Mining 33

Finding Frequent Itemsets

If there are N items in a market basket and the association is studied for all possible item combinations, a total of 2^N combinations are to be checked.

Page 34: INTRODUCTION TO DATA MINING

Introduction to Data Mining 34

Finding Frequent Itemsets

All nonempty subsets of a frequent itemset must also be frequent (the anti-monotone property).

Apriori Algorithm

An itemset is frequent when its occurrence in the total dataset exceeds minsup.
If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets up to N-itemsets.

Page 35: INTRODUCTION TO DATA MINING

Introduction to Data Mining 35

Apriori Algorithm

The algorithm has two steps,1. Join step2. Prune step

1. Join step: here candidate k-itemsets are computed by joining the frequent (k−1)-itemsets.

2. Prune step: if a k-itemset fails to cross the minsup threshold, all the supersets of the concerned k-itemset are no longer considered for association rule discovery.

Page 36: INTRODUCTION TO DATA MINING

Introduction to Data Mining 36

Apriori Algorithm

• Let Lk be the set of frequent k-itemsets
• Let Ck be the set of candidate k-itemsets
Each member of these sets has two fields – itemset and support count.

Page 37: INTRODUCTION TO DATA MINING

Introduction to Data Mining 37

Apriori Algorithm

1. Let k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) OR (k = N), goto Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Goto Step 3
7. Stop
Output: ∪k Lk

Page 38: INTRODUCTION TO DATA MINING

Introduction to Data Mining 38

Apriori Algorithm

Join()
forall (i, j) where i ∈ Lk−1 and j ∈ Lk−1, i ≠ j
    select all possible k-itemsets and insert into Ck
endfor

If L3 = {{{1 2 3}, s123}, {{1 2 4}, s124}, {{1 3 4}, s134}, {{1 3 5}, s135}, {{2 3 4}, s234}}
then C4 = {{{1 2 3 4}, s1234}, {{1 3 4 5}, s1345}}

Page 39: INTRODUCTION TO DATA MINING

Introduction to Data Mining 39

Apriori Algorithm

Prune()
forall itemsets c ∈ Ck do
    forall (k−1)-subsets s of c do
        if (s ∉ Lk−1) then delete c from Ck
        endif
    endfor
endfor
Lk ← Ck

L4 = {{{1 2 3 4}, s1234}}
({1 3 4 5} is pruned because its subset {1 4 5} is not in L3.)
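Putting the join and prune steps together, the following Python sketch is one possible levelwise implementation of Apriori. It is illustrative only; the usage at the end reuses the four stationery transactions of the market basket example with an assumed minsup of 0.5.

from itertools import combinations

def apriori(transactions, minsup):
    # Levelwise mining: join L(k-1) with itself to get Ck, prune candidates
    # with an infrequent (k-1)-subset, then count supports to get Lk.
    n = len(transactions)
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) / n >= minsup}   # L1
    all_frequent = set(L)
    k = 2
    while L:
        # Join: unions of two frequent (k-1)-itemsets that form a k-itemset
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count support and keep only the frequent candidates
        L = {c for c in C if sum(c <= t for t in transactions) / n >= minsup}
        all_frequent |= L
        k += 1
    return all_frequent

T = [{"pen", "ink", "diary", "writing pad"},
     {"pen", "ink", "diary"},
     {"pen", "diary"},
     {"pen", "ink", "writing pad"}]
print(apriori(T, minsup=0.5))   # 11 frequent itemsets, e.g. {pen, ink, diary}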

Page 40: INTRODUCTION TO DATA MINING

Introduction to Data Mining 40

Rule Generation

Rule generation needs to ensure only that the produced rules satisfy the minimum confidence threshold

– because rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold.

Given a frequent itemset li, find all non-empty subsets f ⊂ li such that f ⇒ (li − f) satisfies the minimum confidence requirement.

• If |li| = k, then there are 2^k − 2 candidate association rules.

Page 41: INTRODUCTION TO DATA MINING

Introduction to Data Mining 41

Rule Generation

Algorithm:

forall frequent itemsets li, |li| ≥ 2, do
    call genrule(li, li)
endfor

Page 42: INTRODUCTION TO DATA MINING

Introduction to Data Mining 42

Rule Generation

genrule(lk, fm)
F ← {(m−1)-itemsets fm−1 | fm−1 ⊂ fm}
forall fm−1 ∈ F do
    conf ← sup(lk) / sup(fm−1)
    if (conf ≥ minconf) then
        print rule “fm−1 ⇒ (lk − fm−1)”, with confidence conf and support sup(lk)
        if (m−1 > 1) then
            call genrule(lk, fm−1)
        endif
    endif
endfor
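For illustration, a simpler non-recursive Python sketch of the same step enumerates every non-empty proper subset of a frequent itemset as a candidate LHS and keeps the rules that meet minconf; the supports in the usage example are those of the Pen/Ink transactions used earlier.

from itertools import combinations

def gen_rules(freq_itemset, support_of, minconf):
    # Enumerate LHS => RHS rules from one frequent itemset and keep those whose
    # confidence = sup(itemset) / sup(LHS) meets minconf.
    # `support_of` maps frozensets to supports computed during frequent-itemset mining.
    itemset = frozenset(freq_itemset)
    rules = []
    for r in range(1, len(itemset)):              # size of the LHS
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support_of[itemset] / support_of[lhs]
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

sup = {frozenset({"pen"}): 1.0, frozenset({"ink"}): 0.75,
       frozenset({"pen", "ink"}): 0.75}
print(gen_rules({"pen", "ink"}, sup, minconf=0.6))
# keeps {pen} => {ink} (conf 0.75) and {ink} => {pen} (conf 1.0)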

Page 43: INTRODUCTION TO DATA MINING

Introduction to Data Mining 43

Rule Generation

If {A,B,C,D} is a frequent itemset, candidate rules:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A},
{AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB},
{A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}

Page 44: INTRODUCTION TO DATA MINING

Introduction to Data Mining 44

Rule Generation

In general, confidence does not have an anti-monotone property

c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D})

But confidence of rules generated from the same itemset has an anti-monotone property

– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

e.g., L = {A,B,C,D}: c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})

Page 45: INTRODUCTION TO DATA MINING

Introduction to Data Mining 45

Case Study

To find the Association among the species of trees present in a forest.

The problem is to find a set of association rules which would indicate the species of trees that usually appear together and also whether a set of species ensures the presence of another set of species with a minimum degree of confidence specified apriori.

Page 46: INTRODUCTION TO DATA MINING

Introduction to Data Mining 46

Data Collection

A forest area is divided into a number of transects. A group of surveyors walks through each such transect to identify the different species of trees and their numbers of occurrences.

Page 47: INTRODUCTION TO DATA MINING

Introduction to Data Mining 47

Data

Species \ Transect    1    2    3    …    1008
1                     7    0    1    …    13
2                     0    5    9    …    0
3                    16    4    0    …    2
⁞                     ⁞    ⁞    ⁞    …    ⁞
398                   6    2   25    …    7

Page 48: INTRODUCTION TO DATA MINING

Introduction to Data Mining 48

Converting the Data

Species \ Transect    1    2    3    …    1008
1                     1    0    1    …    1
2                     0    1    1    …    0
3                     1    1    0    …    1
⁞                     ⁞    ⁞    ⁞    …    ⁞
398                   1    1    1    …    1

Page 49: INTRODUCTION TO DATA MINING

Introduction to Data Mining 49

Drawbacks

Support and confidence as used by Apriori allow a lot of rules which are not necessarily interesting.

Two options to extract interesting rules:
• Using subjective knowledge
• Using objective measures (measures better than confidence)

Page 50: INTRODUCTION TO DATA MINING

Introduction to Data Mining 50

Subjective approaches

•Visualization – users allowed to interactively verify the discovered rules

•Template-based approach – filter out rules that do not fit the user specified templates

• Subjective interestingness measure – filter out rules that are obvious (bread ⇒ butter) and that are non-actionable (do not lead to profits)

Page 51: INTRODUCTION TO DATA MINING

Introduction to Data Mining 51

Objective Measures

TID   A  B  C  D
 1    1  1  0  0        Support(A) = 0.7
 2    0  0  1  0        Support(B) = 0.6
 3    1  1  1  1        Support(C) = 0.5
 4    1  0  0  0        Support(D) = 0.5
 5    0  1  0  1        Support(AB) = 0.4
 6    1  1  0  0        Support(CD) = 0.4
 7    0  1  1  1        minsup = 0.3
 8    1  0  1  1
 9    1  1  0  0        How to infer A ⇒ B or C ⇒ D?
10    1  0  1  1

Page 52: INTRODUCTION TO DATA MINING

Introduction to Data Mining 52

Dissociation

• Dissociation of an itemset is the % of transactions in which one or more items, but not all, are absent.

Dissociation(AB) = 0.5

Dissociation(CD) = 0.2

•Extract frequent itemsets from a set of transactions under high association but low dissociation.

Page 53: INTRODUCTION TO DATA MINING

Introduction to Data Mining 53

Togetherness

Let Si = the subset of transactions containing item i.
SA ∩ SB = the subset of transactions containing both A and B.
SA ∪ SB = the subset of transactions containing either A or B.

Togetherness(AB) = |SA ∩ SB| / |SA ∪ SB|

Similar to minsup, a threshold min_togetherness can be defined to find frequent itemsets.
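A small Python sketch of both measures on the ten transactions of slide 51; the dissociation values match the ones quoted above, while the togetherness values are simply computed.

rows = [ {"A","B"}, {"C"}, {"A","B","C","D"}, {"A"}, {"B","D"},
         {"A","B"}, {"B","C","D"}, {"A","C","D"}, {"A","B"}, {"A","C","D"} ]

def dissociation(itemset, transactions):
    # % of transactions where some, but not all, items of the itemset are present.
    itemset = set(itemset)
    return sum(0 < len(itemset & t) < len(itemset) for t in transactions) / len(transactions)

def togetherness(itemset, transactions):
    # |transactions with all items| / |transactions with at least one item|
    itemset = set(itemset)
    both = sum(itemset <= t for t in transactions)
    either = sum(bool(itemset & t) for t in transactions)
    return both / either

print(dissociation({"A","B"}, rows), dissociation({"C","D"}, rows))   # 0.5  0.2
print(togetherness({"A","B"}, rows), togetherness({"C","D"}, rows))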

Page 54: INTRODUCTION TO DATA MINING

Introduction to Data Mining 54

Objective Measures

• Weka uses other objective measures
– Lift(A ⇒ B) = confidence(A ⇒ B) / support(B) = support(A ∪ B) / (support(A) × support(B))
– Leverage(A ⇒ B) = support(A ∪ B) − support(A) × support(B)
– Conviction(A ⇒ B) = support(A) × support(not B) / support(A ∪ not B)
– conviction inverts the lift ratio and also computes support for the RHS not being true
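A minimal Python sketch of these three measures, evaluated on the ten-transaction table of slide 51; it follows the formulas stated here rather than Weka's source code.

rows = [ {"A","B"}, {"C"}, {"A","B","C","D"}, {"A"}, {"B","D"},
         {"A","B"}, {"B","C","D"}, {"A","C","D"}, {"A","B"}, {"A","C","D"} ]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(a, b, ts):
    return support(set(a) | set(b), ts) / (support(a, ts) * support(b, ts))

def leverage(a, b, ts):
    return support(set(a) | set(b), ts) - support(a, ts) * support(b, ts)

def conviction(a, b, ts):
    # support(A) * support(not B) / support(A and not B)
    a, b = set(a), set(b)
    not_b = sum(not (b <= t) for t in ts) / len(ts)
    a_not_b = sum(a <= t and not (b <= t) for t in ts) / len(ts)
    return float("inf") if a_not_b == 0 else support(a, ts) * not_b / a_not_b

print(lift({"A"}, {"B"}, rows))        # 0.4 / (0.7 * 0.6) ≈ 0.95
print(leverage({"A"}, {"B"}, rows))    # 0.4 - 0.42 = -0.02
print(conviction({"A"}, {"B"}, rows))  # 0.7 * 0.4 / 0.3 ≈ 0.93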

Page 55: INTRODUCTION TO DATA MINING

Introduction to Data Mining 55

Modifications of Apriori Algorithm

• Reduce computation time by:
– Hash based techniques
– Transaction reduction
– Sampling
– Dynamic itemset counting

Page 56: INTRODUCTION TO DATA MINING

Introduction to Data Mining 56

Frequent Pattern Mining Variations

• Type of value handled
• Levels of abstraction
• Number of data dimensions
• Kinds of patterns to be mined
• Completeness of patterns to be mined
• Kind of rules to be mined

Page 57: INTRODUCTION TO DATA MINING

Introduction to Data Mining 57

Type of Value Handled

Binary / Boolean
• Absence of items helps in improving the discovery of association rules but does not directly contribute to rule mining.

Quantitative
• In certain applications, absence of items may sometimes be as important as their presence.
• In medical applications, it has been found that both presence and absence of symptoms need to be considered in discovering association rules.

Page 58: INTRODUCTION TO DATA MINING

Introduction to Data Mining 58

Quantitative Association Rules

For numeric attributes like age, salary etc., binary association rule mining is not applicable. The attribute domain can be categorized in two basic approaches regarding the treatment of quantitative attributes:

• Static
• Dynamic

Page 59: INTRODUCTION TO DATA MINING

Introduction to Data Mining 59

Static Discretisation

Quantitative attributes are discretised using predefined concept hierarchies.

Say, for income, the original numeric values of the attribute may be replaced by interval labels

“0…10K”, “11…20K” … and so on.

Page 60: INTRODUCTION TO DATA MINING

Introduction to Data Mining 60

Dynamic Discretisation

Quantitative attributes are discretised (clustered) into “bins” based on the distribution of the data.
After verification of the minsup and minconf thresholds, the following rules may be obtained:
age(x, 5) ⇒ studies(x, “in school”)
age(x, 6) ⇒ studies(x, “in school”)
⁞
age(x, 17) ⇒ studies(x, “in school”)
age(x, 18) ⇒ studies(x, “in school”)

Page 61: INTRODUCTION TO DATA MINING

Introduction to Data Mining 61

Dynamic Discretisation

• ARCS (Association Rule Clustering System), used for mining quantitative rules, may be used for classification in the form

Aquant1 ^ Aquant2 ^ … ^ Aquantn ⇒ Acat

where Aquant1, Aquant2, etc. are tests on numeric attribute ranges and Acat is the class label assigned after the training step.

Page 62: INTRODUCTION TO DATA MINING

Introduction to Data Mining 62

Dynamic Discretisation

Using ARCS (Association Rule Clustering System), a composite rule may be formed as

age(x, “5…18”) ⇒ studies(x, “in school”)

In a similar way, two-dimensional quantitative rules can also be formed:

age(x, “25…40”) ^ income(x, “20K…40K”) ⇒ buys(x, “new car”)

Page 63: INTRODUCTION TO DATA MINING

Introduction to Data Mining 63

Levels of Abstractions

[Taxonomy diagram over the stationery items: the root “All” branches into Pen, Ink and Writing Pad; Pen into Dot and Fountain (with brands such as Parker, Pioneer, Pilot, …); Ink into Bottle and Cartridge (brands such as Oxford, Link); Writing Pad into Blank and Ruled.]

Page 64: INTRODUCTION TO DATA MINING

Introduction to Data Mining 64

Multilevel Association Rule

Using
• Uniform minimum support
• Reduced minimum support at lower levels
• Group based minimum support

Page 65: INTRODUCTION TO DATA MINING

Introduction to Data Mining 65

Rules over Taxonomies

• The items used for rule mining may not be at the same level. There can be an in-built taxonomy among the items. An example of a taxonomy applicable to market basket data:

Clothes → Outerwear, Shirts; Outerwear → Track Suits, Track Pants; Footwear → Shoes, Snickers

This taxonomy implies:
• Track Suits is-a Outerwear, Outerwear is-a Clothes, etc.

Page 66: INTRODUCTION TO DATA MINING

Introduction to Data Mining 66

Rules over Taxonomies

Application domain may need rules at different levels of the taxonomy.

Trivial Rule: If Ŷ is an ancestor of Y, then the rule Y ⇒ Ŷ is trivial.
Shoes ⇒ Footwear (a rule with 100% confidence)

Page 67: INTRODUCTION TO DATA MINING

Introduction to Data Mining 67

Rules across Levels

• The rule Outerwear ⇒ Snickers does not infer either Track Suits ⇒ Snickers or Track Pants ⇒ Snickers.

So, a rule at a higher level does not infer the same rule at a lower level of the taxonomy.

Page 68: INTRODUCTION TO DATA MINING

Introduction to Data Mining 68

Rules across Levels

• The rule Track Suits ⇒ Snickers definitely infers the rule Outerwear ⇒ Snickers.

So, a rule at a lower level definitely infers the same rule at the higher level of the taxonomy.

Page 69: INTRODUCTION TO DATA MINING

Introduction to Data Mining 69

Interest Measure

• To find rules whose support is more than R times the expected value, or whose confidence is more than R times the expected value, for some user-specified constant R.

Page 70: INTRODUCTION TO DATA MINING

Introduction to Data Mining 70

Rule (with Taxonomies) Generation

Steps
1. Find frequent itemsets
2. Use frequent itemsets to generate the desired rules
3. Prune all uninteresting rules from this set
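One common way to find frequent itemsets over a taxonomy (following the generalized-rule approach in the Srikant reference) is to extend every transaction with the ancestors of its items and then run ordinary Apriori on the extended transactions. A minimal Python sketch using the Clothes/Footwear taxonomy of these slides:

# Parent links of the taxonomy on the slides (child -> parent).
parent = {
    "Track Suits": "Outerwear", "Track Pants": "Outerwear",
    "Outerwear": "Clothes", "Shirts": "Clothes",
    "Shoes": "Footwear", "Snickers": "Footwear",
}

def ancestors(item):
    # All ancestors of `item` in the taxonomy.
    out = []
    while item in parent:
        item = parent[item]
        out.append(item)
    return out

def extend(transaction):
    # Add every ancestor of every purchased item; ordinary Apriori on the
    # extended transactions then yields multi-level (generalized) itemsets.
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

print(extend({"Track Suits", "Snickers"}))
# {'Track Suits', 'Outerwear', 'Clothes', 'Snickers', 'Footwear'}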

Page 71: INTRODUCTION TO DATA MINING

Introduction to Data Mining 71

The Database

TID   Items
1     Shirts
2     Track Suits, Snickers
3     Track Pants, Snickers
4     Shoes
5     Shoes
6     Track Suits

minsup = 30%, minconf = 60%

Page 72: INTRODUCTION TO DATA MINING

Introduction to Data Mining 72

Frequent Itemset & Taxonomies

Itemsets                 Sup (out of 6)

{Track Suits} 2

{Outerwear} 3

{Clothes} 4

{Shoes} 2

{Snickers} 2

{Footwear} 4

{Outerwear, Snickers} 2

{Clothes, Snickers} 2

{Outerwear, Footwear} 2

{Clothes, Footwear} 2

(Taxonomy: Clothes → Outerwear, Shirts; Outerwear → Track Suits, Track Pants; Footwear → Shoes, Snickers)

Page 73: INTRODUCTION TO DATA MINING

Introduction to Data Mining 73

Rules

Rule                      Sup%   Conf%
Outerwear ⇒ Snickers      33     66
Outerwear ⇒ Footwear      33     66
Snickers ⇒ Outerwear      33     100
Snickers ⇒ Clothes        33     100

Page 74: INTRODUCTION TO DATA MINING

Introduction to Data Mining 74

Rule under Item Constraints

Some applications may need association rules under user specified constraints on items. When a taxonomy is present, these constraints may be specified using the taxonomy.

Page 75: INTRODUCTION TO DATA MINING

Introduction to Data Mining 75

Rule under Item Constraints

(Track Suits ∧ Shoes) ∨ (descendants(Clothes) ∧ ¬ancestors(Snickers))

• A Boolean expression representing a constraint.
• It allows rules that either contain both Track Suits and Shoes, or contain Clothes or any descendant of Clothes while containing neither Snickers nor its ancestor Footwear.

Page 76: INTRODUCTION TO DATA MINING

Introduction to Data Mining 76

Rule under Item Constraints

Exploitation of the hierarchy does not stop the generation of association rules among items at the same level. Thus, these types of association rules are called Generalized Association Rules.

Page 77: INTRODUCTION TO DATA MINING

Introduction to Data Mining 77

Number of Data Dimensions

• Single Dimension
– Discrete Predicate: buy(X, “Pen”) ⇒ buy(X, “Ink”)

• Multidimension
– Discrete Predicate: age(X, “9..21”) ^ occupation(X, “Student”) ⇒ buy(X, “Pen”)
– Multiple occurrence of Predicate: age(X, “9..21”) ^ occupation(X, “Student”) ^ buy(X, “Pen”) ⇒ buy(X, “Ink”)

Page 78: INTRODUCTION TO DATA MINING

Introduction to Data Mining 78

Sequential Patterns

A sequential pattern always provides an order.
• In a market basket application, the interest is not in the set of items appearing within a single transaction; instead one tries to find inter-transaction purchase patterns. So the transactions need to be ordered.

Page 79: INTRODUCTION TO DATA MINING

Introduction to Data Mining 79

Sequential Patterns

It is assumed that a customer can have only one transaction at a given transaction time.

• An itemset (I) is a non-empty set of items (ij): I = {i1 i2 … in}

• A sequence (s) is an ordered list of itemsets or events (ej): s = {e1 e2 … em}, where ei occurs before ej (i < j)

Page 80: INTRODUCTION TO DATA MINING

Introduction to Data Mining 80

Sequential Patterns

A sequence is contained in another sequence if each itemset of the first sequence is contained in some itemset of the second sequence, in the same order. The sequence {(3) (4 5) (8)} is contained in the sequence {(7) (3 8) (9) (4 5 6) (8)} since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). The sequence {(3) (5)} is not contained in {(3 5)}, and vice versa.
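A short Python sketch of this containment test (greedy, order-preserving subset matching), checked against the two examples above:

def contains(big, small):
    # True if sequence `small` is contained in sequence `big`: each itemset of
    # `small` must be a subset of a later itemset of `big`, preserving order.
    i = 0
    for event in big:
        if i < len(small) and set(small[i]) <= set(event):
            i += 1
    return i == len(small)

big   = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
small = [{3}, {4, 5}, {8}]
print(contains(big, small))            # True, as in the slide
print(contains([{3, 5}], [{3}, {5}]))  # False: (3)(5) is not contained in (3 5)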

Page 81: INTRODUCTION TO DATA MINING

Introduction to Data Mining 81

Sequential Patterns

• In a set of sequences, a sequence s is maximal if it is not contained in any other sequence.
• For a sequence to be frequent, it must at least cross the minimum support threshold.
• A frequent sequence is called a sequential pattern.
• A sequential pattern of length l is called an l-pattern.

Page 82: INTRODUCTION TO DATA MINING

Introduction to Data Mining 82

Discovery of Sequential Patterns

Sequence Support

{(10)} 1

{(20)} 1

{(30)} 4

{(40)} 2

{(50)} 1

{(60)} 1

{(70)} 3

{(90)} 3

CustId   Date         Items
001      13/05/2012   30
001      14/05/2012   90
002      13/05/2012   10, 20
002      15/05/2012   30
002      16/05/2012   40, 60, 70
003      17/05/2012   30, 50, 70
004      13/05/2012   30
004      14/05/2012   40, 70
004      16/05/2012   90
005      13/05/2012   90

minsup = 25%

Page 83: INTRODUCTION TO DATA MINING

Introduction to Data Mining 83

Discovery of Sequential Patterns

• L1 = {{(30)}, {(40)}, {(70)}, {(90)}}
• Candidate 2-sequences (22 in total): C2 = {{(30) (30)}, {(30) (40)}, {(30) (70)}, {(30) (90)}, …, {(90) (90)}, {(30 40)}, …, {(70 90)}}

Sequence      Support      Sequence      Support
(10 20)       1            (30) (70)     2
(10) (30)     1            (30) (90)     2
(20) (30)     1            (40) (90)     1
(30) (40)     2            (70) (90)     1
(30) (60)     1            (40 70)       2

Page 84: INTRODUCTION TO DATA MINING

Introduction to Data Mining 84

Discovery of Sequential Patterns

• L2 = {{(30) (40)}, {(30) (70)}, {(30) (90)}, {(40 70)}}
Candidate sequences generated from L2: {{(30) (30) (70)}, {(30) (30) (90)}, {(30) (40 70)}, …, {(40) (30) (70)}, {(40) (30) (90)}, {(40) (40 70)}, …, {(30) (40) (30) (70)}, {(30) (40) (30) (90)}, {(30) (40) (40 70)}, …, {(40 70) (40 70)}, …, {(30) (40 70 90)}}

Sequence        Support
(30) (40 70)    2

Page 85: INTRODUCTION TO DATA MINING

Introduction to Data Mining 85

Discovery of Sequential Patterns

CustId   Sequence
1        (30) (90)
2        (10 20) (30) (40 60 70)
3        (30 50 70)
4        (30) (40 70) (90)
5        (90)


With minsup = 0.25 (say) for maximal sequences, the acceptable sequential patterns are {(30) (90)} and {(30) (40 70)}.
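As a check, a small Python sketch counts the support of a candidate sequential pattern over the five customer sequences above:

def contains(big, small):
    # Is `small` contained in `big` (ordered, itemset-wise subset matching)?
    i = 0
    for event in big:
        if i < len(small) and set(small[i]) <= set(event):
            i += 1
    return i == len(small)

sequences = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

def seq_support(pattern):
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

print(seq_support([{30}, {90}]))       # 0.4 (customers 1 and 4)
print(seq_support([{30}, {40, 70}]))   # 0.4 (customers 2 and 4)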

Page 86: INTRODUCTION TO DATA MINING

Introduction to Data Mining 86

Specification of Time Windows

•User may define a time window within which the patterns are to be discovered.

• If a pattern is found without adequate support within a time window but crosses minsup across different time windows, it would not be considered as a valid sequential pattern.

• This effort helps in studying seasonal purchase patterns in case of market basket analysis.

Page 87: INTRODUCTION TO DATA MINING

Introduction to Data Mining 87

Sequential Patterns over Taxonomies

Similar to rule mining, the items under consideration may not be at the same level.

From the available transactions if a sequential pattern is found as {(Track Suits) (Shoes)}, it would also support patterns like, {(Outerwear)(Shoes)},{(Outerwear) (Footwear)} etc. These are called generalized sequential patterns.

(Taxonomy: Clothes → Outerwear, Shirts; Outerwear → Track Suits, Track Pants; Footwear → Shoes, Snickers)

Page 88: INTRODUCTION TO DATA MINING

Introduction to Data Mining 88

Data Classification

•Classification is a method where the data instances in a problem domain are distributed among different pre-defined classes or concepts.

• Usually a data instance is placed in only one class.
• For the purpose of classification, definite criteria / rules are defined for the membership of each class.

Page 89: INTRODUCTION TO DATA MINING

Introduction to Data Mining 89

Data Classification

•Classification is usually done under the supervision of domain experts of the problem domain under consideration. So, classification process involves supervised learning.

• Clustering, on the other hand, is the result of unsupervised learning. Here the class or concept label of each data instance or each cluster is not known. The number of such classes or concepts is pre-defined intuitively.

Page 90: INTRODUCTION TO DATA MINING

Introduction to Data Mining 90

Data Classification

The classification process has two steps:
1. Build the model from the training data set
– Learning a mapping function y = f(X), where y is the associated class label for an instance X.
2. Classify unknown data.

Page 91: INTRODUCTION TO DATA MINING

Introduction to Data Mining 91

Comparison of Classification Methods

Properties for the comparison:
• Predictive Accuracy: ability of a model to correctly predict the class label for a new data instance.
• Speed: computation cost, in terms of time, required to generate (train) the model and then to classify data.

Page 92: INTRODUCTION TO DATA MINING

Introduction to Data Mining 92

Comparison of Classification Methods

Properties for the comparison:
• Robustness: ability of a model to make correct classifications under noisy data or data with missing values.
• Scalability: the response of a model, in the training and classification steps, to an increase in data volume.

Page 93: INTRODUCTION TO DATA MINING

Introduction to Data Mining 93

Classification by Decision Tree Induction

• A Decision Tree is a tree structure.
• Classification is done against a concept.
• The tree is formed by testing an attribute or attribute combination at each node.
• Each branch of the tree corresponds to an outcome of this test.
• The leaf nodes represent the classes.

Page 94: INTRODUCTION TO DATA MINING

Introduction to Data Mining 94

Decision Tree Concept: Buy New Car

[Decision tree for the concept “Buy New Car”: the root tests INCOME. For INCOME ≤ 20K the tree tests MARITAL STATUS (Married → NO, Single → YES); for INCOME 20–50K it tests AGE (< 40 → YES, > 40 → NO); for INCOME > 50K the leaf is YES.]

Page 95: INTRODUCTION TO DATA MINING

Introduction to Data Mining 95

Decision Tree Induction Algorithm

1. Tree starts as a single node on which training samples are tested.

2. If all the training samples are of the same class the node becomes the leaf and it is labeled with that class.

3. Running an attribute selection algorithm, an attribute is chosen for tree generation (attribute INCOME in the example).

Page 96: INTRODUCTION TO DATA MINING

Introduction to Data Mining 96

Decision Tree Induction Algorithm

4. A branch is created for each value of the chosen attribute and the samples are partitioned accordingly(three branches under INCOME).

5. Algorithm repeats steps 3 and 4 recursively to form decision tree for the samples at each partition. Once an attribute is considered in a node, it is not considered in any of its descendent nodes.

Page 97: INTRODUCTION TO DATA MINING

Introduction to Data Mining 97

Decision Tree Induction Algorithm

6. The recursive procedure stops when
i. all samples for each node belong to the same class according to the domain expert;
ii. there is no other attribute on which the samples can be further partitioned. Majority Voting may be employed here to convert the node to a leaf node, labeled with the class that covers the majority of its samples;
iii. there are no tuples for a given branch.
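A minimal ID3-style Python sketch of this induction procedure, using information gain as the attribute selection measure and majority voting when no attribute is left; the tiny buy-new-car training table is invented for illustration, and stopping case (iii) is not handled.

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def best_attribute(rows, labels, attributes):
    # Attribute with the highest information gain on this partition.
    base = entropy(labels)
    def gain(a):
        rem = 0.0
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return base - rem
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                    # (i) all samples in one class
        return labels[0]
    if not attributes:                           # (ii) no attribute left: majority voting
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {"attribute": a, "branches": {}}
    for v in {r[a] for r in rows}:               # one branch per observed value
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node["branches"][v] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [x for x in attributes if x != a])
    return node

rows = [
    {"income": "<=20K", "marital": "married", "age": "<40"},
    {"income": "<=20K", "marital": "single",  "age": ">40"},
    {"income": "20-50K", "marital": "married", "age": "<40"},
    {"income": "20-50K", "marital": "single",  "age": ">40"},
    {"income": ">50K",   "marital": "married", "age": ">40"},
]
labels = ["no", "yes", "yes", "no", "yes"]
print(build_tree(rows, labels, ["income", "marital", "age"]))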

Page 98: INTRODUCTION TO DATA MINING

Introduction to Data Mining 98

Tree Pruning

Tree pruning is done to avoid overfitting the data at different nodes. Statistical measures are used to identify and remove branches that are not reliable enough. This process results in faster classification and better classification of unknown data.

• Prepruning
• Postpruning

Page 99: INTRODUCTION TO DATA MINING

Introduction to Data Mining 99

Prepruning

The tree generation process is stopped after every partitioning. As a result, all the new nodes generated become leaf nodes, with the membership of samples decided by Majority Voting. Goodness of partitioning is then tested by measures like χ², information gain, etc. If any result goes below a pre-specified threshold, further partitioning of the affected subset of samples is stopped.

Page 100: INTRODUCTION TO DATA MINING

Introduction to Data Mining 100

Prepruning

• A high threshold would generate an over-simplified tree, while a low threshold may cause hardly any pruning.

Page 101: INTRODUCTION TO DATA MINING

Introduction to Data Mining 101

Postpruning

• Branches are removed from a fully grown tree. Here the expected error rate at each non-leaf node is computed as if its sub-tree were pruned. It is compared with the combined error rates along each of its branches, weighted by the proportion of the participating samples. If the expected error rate is lower, the sub-tree is removed.

Page 102: INTRODUCTION TO DATA MINING

Introduction to Data Mining 102

Classification Rule Generation

Each path of a decision tree from the root to a leaf gives rise to an IF-THEN classification rule. From the decision tree in the example, rules may be formed as:

IF income ≤ 20K AND marital-status = “married” THEN buys-new-car = “no”
IF income > 50K THEN buys-new-car = “yes”
etc.

Page 103: INTRODUCTION TO DATA MINING

Introduction to Data Mining 103

Classification Rule Generation

Either during Rule Generation or during Postpruning, the redundant paths are pruned. For example, if the following rules are found:

IF income ≤ 20K AND marital-status = “married” THEN buys-new-car = “no”
IF income ≤ 20K AND marital-status = “widow” THEN buys-new-car = “no”

Page 104: INTRODUCTION TO DATA MINING

Introduction to Data Mining 104

Classification Rule Generation

the 2 paths are pruned to 1 path as:
IF income ≤ 20K AND marital-status = (“married” OR “widow”) THEN buys-new-car = “no”

Other well known classification methods are Bayesian Classification, Classification by Backpropagation, k-Nearest Neighbor Classifiers, etc.

Page 105: INTRODUCTION TO DATA MINING

Introduction to Data Mining 105

Case Study: Dynamic Classification Hierarchy

Classification of Archaeological data:
• A Classification Hierarchy is created over a Backend Database to generate and update Association Rules. Continuous restructuring of the Classification Hierarchy is done as the database is updated.

• On arrival of a new instance, the system tries to place it in the existing hierarchy. If it fails to classify the instance, it treats it as an Exception to the class found to be the closest.

Page 106: INTRODUCTION TO DATA MINING

Introduction to Data Mining 106

Case Study: Dynamic Classification Hierarchy

Classification of Archaeological data:
• The system initiates restructuring when the number of Exceptions exceeds a predefined threshold value.
Three important operations are used:
1. ADD: adds a new branch to the hierarchy.
2. FUSE: merges more than one class into one.
3. BREAK: decomposes a class into more than one class.

Page 107: INTRODUCTION TO DATA MINING

Introduction to Data Mining 107

Initial Transaction

• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}

Transactions:
I1 = {a0, a1, a2, a3, a4}
I2 = {a0, a1, a2, a5, a6}
I3 = {a0, b0, b1, b2}
I4 = {a0, b0, b3, b4}
I5 = {a0, b0, b5, b6}

Page 108: INTRODUCTION TO DATA MINING

Introduction to Data Mining 108

Initial Hierarchy

Exact match at leaf level classes
• 5 leaf classes

[Initial hierarchy: root C0 ({a0}) with children C1 ({a1, a2}) and C2 ({b0}); C1 has leaves C11 ({a3, a4}) and C12 ({a5, a6}); C2 has leaves C21 ({b1, b2}), C22 ({b3, b4}) and C23 ({b5, b6}).]

Page 109: INTRODUCTION TO DATA MINING

Introduction to Data Mining 109

Add

I6 = {a0, a3, a4, b0, b1, b2, b3, b4}
Approximate match – up to an intermediate level (exception)

A large number of exceptions may generate a new class.

[Hierarchy after ADD: a new leaf class C24 ({b1, b2, b3, b4}) is added under C2; the rest of the hierarchy is unchanged.]

Page 110: INTRODUCTION TO DATA MINING

Introduction to Data Mining 110

Fuse

[Fuse: two peer leaf classes under C1, with attribute sets {a1, a2} and {a3, a4}, are merged into a single class with attribute set {a1, a2, a3, a4}; the rest of the hierarchy (C0 ({a0}), C2 and its children) is unchanged.]

Page 111: INTRODUCTION TO DATA MINING

Introduction to Data Mining 111

Fuse

• The fuse of two peer classes K1 and K2 is not allowed if there exists any other peer class K3 with

AK1 ∩ AK2 ⊆ AK3 ∩ AK2

(that is, some other peer class K3 shares with K2 at least as much as K1 does).

Page 112: INTRODUCTION TO DATA MINING

Introduction to Data Mining 112

Further Transaction

• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}

Transactions:
I7 = {a0, a3, a4, b0, b1, b2, b3, b4}
I8 = {a0, a5, a6, b0, b1, b2, b3, b4}
I9 = {a0, a3, a5, b0, b1, b2, b3, b4}
I10 = {a0, a3, a5, b0, b1, b2, b5}
I11 = {a0, a3, b0, b1, b2, b3, b4}

Page 113: INTRODUCTION TO DATA MINING

Introduction to Data Mining 113

Break

[Hierarchy after BREAK: following transactions I7–I11, the class C24 ({b1, b2, b3, b4}) under C2 is decomposed into sub-classes C41 ({a3, a4}) and C42 ({a5, a6}); the remaining classes (C0, C1, C11, C12, C2, C21, C22, C23) are unchanged.]

Page 114: INTRODUCTION TO DATA MINING

Introduction to Data Mining 114

Cluster Analysis

•The process of partitioning a set of data objects into groups of similar objects is called Clustering. The objects belonging to same cluster are supposed to be similar whereas those in different clusters should be dissimilar under the same similarity measure.

Page 115: INTRODUCTION TO DATA MINING

Introduction to Data Mining 115

Cluster Analysis

•A good clustering algorithm should have the following properties :

• Scalability
• Ability to handle different data types
• Insensitivity to the order of input records
• Working under minimum intervention
• Constraint based clustering
• Accept high dimensionality

Page 116: INTRODUCTION TO DATA MINING

Introduction to Data Mining 116

Clustering Algorithms

• Partitioning Method: In the presence of n objects or data instances, a partitioning method constructs k partitions, where k ≤ n. Each group/partition must have at least one object. Each object must belong to only one group (this may not be true for a fuzzy partitioning algorithm).

Page 117: INTRODUCTION TO DATA MINING

Introduction to Data Mining 117

k-Means Algorithm or a Centroid-based Technique

Accepts an input parameter k and partitions n objects into k clusters where intra-cluster similarity is high and inter-cluster similarity is low. Similarity is measured with respect to the mean value of the objects in a cluster, called the centroid of the cluster.

Page 118: INTRODUCTION TO DATA MINING

Introduction to Data Mining 118

Centroid-based Technique

1. Arbitrarily choose k objects out of n as initial cluster centers;
2. assign or reassign each object to the cluster to which it is most similar, with respect to the mean value;
3. re-compute the cluster means;
4. repeat steps 2 and 3 until there is no further change or an exit condition is met.

Page 119: INTRODUCTION TO DATA MINING

Introduction to Data Mining 119

Centroid-based Technique

k-means is an iterative algorithm that works on the convergence of a squared-error criterion of the form

E = Σ(i=1..k) Σ(p ∈ Ci) |p − mi|²

where E is the sum of the squared error over all objects, p is a given object and mi is the centroid of cluster Ci.
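A plain Python sketch of steps 1–4 on two-dimensional points; the data and the choice of k are illustrative only.

import random

def dist2(p, q):
    # Squared Euclidean distance, the |p - mi|^2 term of the criterion E.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    # Plain k-means: assign to the nearest centroid, recompute the means,
    # repeat until the assignment stops changing (or `iters` is exhausted).
    random.seed(seed)
    centroids = random.sample(points, k)          # step 1: arbitrary initial centers
    assignment = None
    for _ in range(iters):
        # step 2: (re)assign each point to its closest centroid
        new_assignment = [min(range(k), key=lambda i: dist2(p, centroids[i]))
                          for p in points]
        if new_assignment == assignment:          # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # step 3: recompute the mean of every cluster
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(pts, k=2))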

Page 120: INTRODUCTION TO DATA MINING

Introduction to Data Mining 120

k-Medoids Algorithm

k-means algorithm is sensitive to outliers where a very large value may distort the distribution of data among clusters. In order to overcome it, instead of the mean a medoid is used as the reference point of a cluster. A medoid is the most centrally located object in a cluster.

Page 121: INTRODUCTION TO DATA MINING

Introduction to Data Mining 121

k-Medoids Algorithm

1.arbitrarily choose k objects out of n as initial medoids;

2.assign each remaining object to the cluster with the nearest medoid;

3.randomly select a non-medoid object, Orandom ;

Page 122: INTRODUCTION TO DATA MINING

Introduction to Data Mining 122

k-Medoids Algorithm

4. Compute the total cost S of swapping a current medoid Oj with Orandom (the cost function calculates the difference in squared-error value if a current medoid is replaced by a non-medoid object);

5. if S<0 then swap Oj with Orandom to form new set of k-medoids (the total cost of swapping is the sum of costs incurred by all nonmedoid objects);

Page 123: INTRODUCTION TO DATA MINING

Introduction to Data Mining 123

k-Medoids Algorithm

6. Repeat steps 2 to 5 until there is no change.

• To judge the quality of replacement of Oj by Orandom, each non-medoid object p is examined under the following four cases.

• If p currently belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to Oi where i ≠ j, then reassign p to Oi.

• If p currently belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.

Page 124: INTRODUCTION TO DATA MINING

Introduction to Data Mining 124

k-Medoids Algorithm

• If p belongs to the cluster of Oi, where i ≠ j, Oj is replaced by Orandom, and p is still closest to Oi, then the assignment of p does not change.

• If p belongs to the cluster of Oi, where i ≠ j, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
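A minimal Python sketch of the swap-cost evaluation at the heart of this procedure: the cost of a configuration is the total distance of every object to its nearest medoid, and a swap is accepted when the cost difference S is negative. The points and medoids below are illustrative only.

def dist(p, q):
    # Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(points, medoids):
    # Sum over all objects of the distance to their closest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in points)

def try_swap(points, medoids, o_j, o_random):
    # Cost difference S of replacing medoid o_j by the non-medoid o_random;
    # the swap is accepted when S < 0.
    new_medoids = [o_random if m == o_j else m for m in medoids]
    return total_cost(points, new_medoids) - total_cost(points, medoids)

pts = [(1, 1), (2, 1), (8, 8), (8, 9), (25, 80)]
medoids = [(1, 1), (2, 1)]
S = try_swap(pts, medoids, (2, 1), (8, 8))
print(S)   # about -25.5: moving a medoid into the second cluster lowers the cost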

Page 125: INTRODUCTION TO DATA MINING

Introduction to Data Mining 125

Parallel Association Rule Mining Algorithms

Challenges include:• synchronization and communication minimization• disk I/O minimization•workload balancing

Page 126: INTRODUCTION TO DATA MINING

Introduction to Data Mining 126

Parallel Association Rule Mining Algorithms

Strategies are:
• Distributed vs. shared memory architecture – shared memory (SM) needs more synchronization through locking etc., whereas for distributed memory (DM) message passing incurs a higher communication overhead.
• Data vs. task parallelism.
• Static vs. dynamic parallelism.

Page 127: INTRODUCTION TO DATA MINING

Introduction to Data Mining 127

Sources & References

1.Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, 2007

2.Willi Klosgen and Jan M Zytkow, “Handbook of Data Mining and Discovery”, 2002

3.R.Srikant, “Fast algorithms for mining association rules and sequential patterns”, Ph.D. Thesis at the University of Wisconsin-Madison, 1996.

4.R.Agrawal, T.Imielimski & A.Swami, “Mining association rules between sets of items in large databases,” Proc. ACM SIGMOD, pp.207-216, 1993.

Page 128: INTRODUCTION TO DATA MINING

Introduction to Data Mining 128

Sources & References

5.R.Agrawal & R.Srikant, “Fast algorithms for mining association rules,” Proc. International Conference for Very Large databases, 1994.

6.J.S.Park, M.S.Chen & P.S.Yu, “An effective hash based algorithm for mining association rules,” Proc. ACM SIGMOD,1995.

7.R.Srikant, Q.Vu & R.Agrawal, “Mining association rules with item constraints,” Proc. International Conference on Knowledge Discovery in Databases, 1997.

Page 129: INTRODUCTION TO DATA MINING

Introduction to Data Mining 129

Sources & References

8. K.Ali, S.Manganaris & R.Srikant, “Partial classification using association rules,” Proc. International Conference on Knowledge Discovery in Databases, 1997.

9.S Pal and A Bagchi, “Association against Dissociation: some pragmatic considerations for Frequent Itemset generation under Fixed and Variable Thresholds,” ACM SigKDD Explorations, Vol.7, Issue 2, Dec.2005, pp. 151-159.

Page 130: INTRODUCTION TO DATA MINING

Introduction to Data Mining 130

Sources & References

10.S Ray and A Bagchi, “Rule Generation by Boolean Minimization – Experience with Coronary Bifurcation Stenting in Angioplasty,” ReTIS 2006.

11.S.Maitra & A.Bagchi, “Dynamic restructuring of classification hierarchy towards data mining,” Proc. International Conference on Management of Data, 1998.

12.T.G.Dietterich & R.S.Michalski, “Discovering patterns in sequences of events,” Artificial Intelligence, vol.25, pp.187-232, 1985.

Page 131: INTRODUCTION TO DATA MINING

Introduction to Data Mining 131

Sources & References

13.R.Agrawal & R.Srikant, “Mining sequential patterns” Proc. IEEE International Conference on Data Engineering, 1995.

14. R.Srikant & R.Agrawal, “Mining sequential patterns: generalizations and performance improvements,” Proc. International Conference on Extending Database Technology, 1996.

15.M.J.Zaki, “Parallel & distributed association mining: a survey,” IEEE Concurrency, 7(4), pp.14-25, 1999.

Page 132: INTRODUCTION TO DATA MINING

Introduction to Data Mining 132

Research Challenges

Areas:•Query Language•Architecture•Text Mining•Multimedia Mining•Spatial / Temporal Analysis•Graph-Mining

Page 133: INTRODUCTION TO DATA MINING

THANK YOU