INTRODUCTION TO DATA MINING
Pinakpani Pal
Electronics & Communication Sciences Unit
Indian Statistical Institute
[email protected]
Main Sources
• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2007
• Willi Klosgen and Jan M. Zytkow, “Handbook of Data Mining and Knowledge Discovery”, 2002
• R. Srikant, “Fast Algorithms for Mining Association Rules and Sequential Patterns”, Ph.D. Thesis, University of Wisconsin-Madison, 1996
• M. J. Zaki, “Parallel and distributed association mining: a survey”, IEEE Concurrency, 7(4), pp. 14-25, 1999
Prelude
•Data Mining is a method of finding interesting trends or patterns in large datasets.
•Data collection may be incomplete, heterogeneous and historical.
•Since data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.
•Data Mining tools are expected to involve minimal user intervention.
Prelude
• Data mining deals with finding patterns in data that are either
– user-defined (pre-specified by the user),
– interesting (judged with the help of an interestingness measure), or
– valid (validity pre-defined).
• Discovered patterns help guide the appropriate authority in making future decisions, so data mining is regarded as a tool for Decision Support.
Data Mining Communities
•Statistics: Provides the background for the algorithms.
•Artificial Intelligence: Provides the required heuristics for machine learning / conceptual clustering.
•Database: Provides the platform for storage and retrieval of raw and summary data.
Data Mining
Mining knowledge from large amounts of data.
Evolution:
• Data collection
• Database creation
• Data management
– Data storage
– Retrieval
– Transaction processing
Data Mining
• Advanced data analysis: data warehousing and data mining
Data Mining Components
Information Repository: single or multiple heterogeneous data sources
Data Server: storing and retrieving relevant data
Knowledge Base: concept hierarchies, constraints, thresholds, metadata
Pattern Extraction: characterization, discrimination, association, classification, prediction, clustering, various statistical analyses
Pattern Evaluation: interestingness measures
Stages of the Data Mining Process
Misconception: Data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
Steps:
• Data Collection
– web crawling / warehousing
Stages of the Data Mining Process
Steps (contd.):
• Data Preprocessing & Feature Extraction
– Data cleaning: elimination of erroneous and irrelevant data
– Data integration: combining data from multiple sources
– Data selection / reduction: accepting only those attributes of the data that are interesting for the problem domain
– Data transformation: normalization, aggregation
Stages of the Data Mining Process
Steps (contd.):
• Pattern Extraction & Evaluation
– Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
– Making the results easily understandable
• Evaluation of results
– Not every software-discovered fact is useful for human beings!
Data Preprocessing
Data Cleaning: Data may be incomplete, noisy and inconsistent. Attempts are made to fill in missing values, identify outliers, smooth out noise, and correct inconsistencies.
Data Preprocessing
Data Integration: Data analysis may involve integrating data from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files.
Data Preprocessing
Data Reduction: Since both the data volume and the attribute set may be too large, data reduction becomes necessary. It includes activities like removal of irrelevant and redundant attributes, data compression, and aggregation or generation of summary data.
Data Preprocessing
Transformation: Data may need to be transformed or consolidated into forms suitable for mining. This may include activities like generalization, normalization (e.g., attribute values converted from absolute values to ranges), construction of new attributes, etc.
Patterns
• Descriptive – characterizing general properties of the data
• Predictive – performing inference on the current data in order to make predictions
• Discover:
– multiple kinds of patterns, to accommodate different user expectations / applications (the user may specify hints to guide the search)
– patterns at various granularities
Frequent Patterns
Patterns that occur frequently in the data.
Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)
Discovery of Association Rules
To identify the features or items in a problem domain that tend to appear together. These features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions, and additionally, to extract rules on how a subset of items influences the presence of another subset.
Association Rule: Example
A user studying the buying habits of customers may choose to mine association rules of the form:
P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z) [support = n%, confidence = m%]
Meta-rules such as the following can be specified:
occupation(X, “student”) ∧ age(X, “20...29”) ⇒ buys(X, “mobile”) [1.4%, 70%]
Association Rule: Single/Multi
Single-dimensional association rule:
buys(X, “computer”) ⇒ buys(X, “antivirus”) [1.1%, 55%]
or, equivalently:
“computer” ⇒ “antivirus” (A ⇒ B) [1.1%, 55%]
Multi-dimensional association rule:
occupation(X, “student”) ∧ age(X, “20...29”) ⇒ buys(X, “mobile”) [1.4%, 70%]
Metrics for Interestingness measures
Interestingness measures in knowledge discovery help to identify the relevance of the patterns discovered during the mining process.
Interestingness measures
•Used to confine the number of uninteresting patterns returned by the process.
•Based on the structure of patterns and statistics underlying them.
•Associate a threshold which can be controlled by the user
– patterns not meeting the threshold are not presented to the user.
Interestingness measures: objective
Objective measures of pattern interestingness:
• simplicity
• utility (support)
• certainty (confidence)
• novelty
Interestingness measures: simplicity
Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension.
e.g. Rule length is a simplicity measure
Interestingness measures: support
Utility (support): the usefulness of a pattern.
support(A ⇒ B) = P(A ∪ B)
The support of an association rule {A} ⇒ {B} is the percentage of all transactions under analysis that contain the itemset A ∪ B.
Interestingness measures: confidence
Certainty (confidence): assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:
confidence(A ⇒ B) = P(B|A)
The confidence of an association rule {A} ⇒ {B} is the percentage of cases that follow the rule.
Association rules that satisfy both the confidence and support thresholds are referred to as strong association rules.
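A minimal Python sketch of these two measures over a small, hypothetical transaction list (the Pen/Ink items echo the market basket example later in this deck):

transactions = [
    {"Pen", "Ink", "Diary", "Writing Pad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "Writing Pad"},
]

def support(itemset, txns):
    # fraction of transactions containing every item of `itemset`
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(lhs, rhs, txns):
    # confidence(LHS => RHS) = P(RHS | LHS) = support(LHS u RHS) / support(LHS)
    return support(lhs | rhs, txns) / support(lhs, txns)

print(support({"Pen", "Ink"}, transactions))       # 0.75
print(confidence({"Pen"}, {"Ink"}, transactions))  # 0.75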
Interestingness measures: novelty
Novelty: Patterns contributing new information to the given pattern set are called novel patterns.
e.g., data exceptions.
Removing redundant patterns is a strategy for detecting novelty.
Market Basket data analysis
Let a transaction be defined as the variety of items purchased by a customer in one visit, irrespective of the quantity of each item purchased. The problem is to find the items that a customer tends to buy together.
Market Basket data analysis
An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of the expression is that transactions containing X tend to contain Y as well. The inverse may not be true. Since only the presence or absence of items is considered, and not the quantity purchased, these rules are called Binary Association Rules.
Market Basket data analysis
The purpose is to study consumers' purchase patterns in departmental stores. Consider four possible transactions:
1 - {Pen, Ink, Diary, Writing Pad}
2 - {Pen, Ink, Diary}
3 - {Pen, Diary}
4 - {Pen, Ink, Writing Pad}
Market Basket data analysis
A possible association rule:
“Purchase of Pen implies the purchase of Ink or Diary”
{Pen} ⇒ {Ink} or {Pen} ⇒ {Diary}
Basically, the rule is of the form {LHS} ⇒ {RHS} where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.
Binary Association Rule Mining
Two-Step Process
1. Find all frequent itemsets
– An itemset will be considered for mining rules if its support is above a threshold called minsup.
2. Generate strong association rules from the frequent itemsets
– Acceptance of a rule is once again through a threshold, called minconf.
Finding Frequent Itemsets
If there are N items in a market basket and the association is studied for all possible item combinations, a total of 2^N combinations are to be checked.
Finding Frequent Itemsets
All nonempty subsets of a frequent itemset must also be frequent (the anti-monotone property).

Apriori Algorithm
An itemset is frequent when its support in the total dataset exceeds minsup. If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets up to N-itemsets.
Apriori Algorithm
The algorithm has two steps:
1. Join step: candidate k-itemsets are computed by joining frequent (k−1)-itemsets.
2. Prune step: if a k-itemset fails to cross the minsup threshold, all supersets of that k-itemset are no longer considered for association rule discovery.
Apriori Algorithm
• Let Lk be the set of frequent k-itemsets.
• Let Ck be the set of candidate k-itemsets. Each member of this set has two fields: itemset and support count.
Apriori Algorithm
1. Let k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) OR (k = N), go to Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Go to Step 3
7. Stop
Output: ∪k Lk
Apriori Algorithm
Join()
forall (i, j) where i ∈ Lk−1, j ∈ Lk−1 and i ≠ j
    select all possible k-itemsets and insert them into Ck
endfor

If L3 = {⟨{1 2 3}, s123⟩, ⟨{1 2 4}, s124⟩, ⟨{1 3 4}, s134⟩, ⟨{1 3 5}, s135⟩, ⟨{2 3 4}, s234⟩}
then C4 = {⟨{1 2 3 4}, s1234⟩, ⟨{1 3 4 5}, s1345⟩}
Apriori Algorithm
Prune()
forall itemsets c ∈ Ck do
    forall (k−1)-subsets s of c do
        if (s ∉ Lk−1) then delete c from Ck endif
    endfor
endfor
Lk ← Ck

L4 = {⟨{1 2 3 4}, s1234⟩}
({1 3 4 5} is pruned because its subset {1 4 5} ∉ L3)
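The following compact Python sketch mirrors the join, prune and support-counting loop above; it is illustrative only, not the optimized algorithm of the cited papers, and the sample transactions are hypothetical:

from itertools import combinations

def apriori(transactions, minsup):
    # return all frequent itemsets with support >= minsup (a fraction)
    n = len(transactions)
    sup = lambda s: sum(1 for t in transactions if s <= t) / n
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk))) + 1
        # Join: unite pairs of frequent (k-1)-itemsets into k-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: drop candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Support counting keeps only the truly frequent candidates
        Lk = {c for c in Ck if sup(c) >= minsup}
        frequent |= Lk
    return frequent

txns = [{"Pen", "Ink", "Diary"}, {"Pen", "Diary"}, {"Pen", "Ink"}, {"Ink", "Diary"}]
print(apriori(txns, minsup=0.5))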
Rule Generation
Rule generation only needs to ensure that the rules produced satisfy the minimum confidence threshold
– because the rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold.
Given a frequent itemset li, find all non-empty subsets f ⊂ li such that the rule f ⇒ (li − f) satisfies the minimum confidence requirement.
• If |li| = k, then there are 2^k − 2 candidate association rules.
Rule Generation
Algorithm:
forall frequent itemsets li with |li| ≥ 2 do
    call genrule(li, li)
endfor
Rule Generation
genrule(lk, fm)
    F ← {(m−1)-itemsets fm−1 | fm−1 ⊂ fm}
    forall fm−1 ∈ F do
        conf ← sup(lk) / sup(fm−1)
        if (conf ≥ minconf) then
            print rule “fm−1 ⇒ (lk − fm−1)”, with confidence = conf and support = sup(lk)
            if (m−1 > 1) then
                call genrule(lk, fm−1)
            endif
        endif
    endfor
Rule Generation
If {A,B,C,D} is a frequent itemset, the candidate rules are:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A},
{AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB},
{A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}
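A plain Python sketch of this step: it enumerates all 2^k − 2 candidate LHS/RHS splits of one frequent itemset and keeps the confident ones (the recursive genrule above additionally prunes via the anti-monotonicity of confidence noted just below; the support values here are hypothetical):

from itertools import combinations

def gen_rules(itemset, sup, minconf):
    # sup maps frozensets to support values; emit (LHS, RHS, confidence)
    rules = []
    for r in range(len(itemset) - 1, 0, -1):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = sup[itemset] / sup[lhs]
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

sup = {frozenset("AB"): 0.4, frozenset("A"): 0.7, frozenset("B"): 0.6}
print(gen_rules(frozenset("AB"), sup, minconf=0.5))
# [({'A'}, {'B'}, 0.57...), ({'B'}, {'A'}, 0.66...)]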
Rule Generation
In general, confidence does not have an anti-monotone property:
c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D}).
But the confidence of rules generated from the same itemset does have an anti-monotone property:
– confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
e.g., for L = {A,B,C,D}: c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})
Case Study
To find the Association among the species of trees present in a forest.
The problem is to find a set of association rules which would indicate the species of trees that usually appear together, and also whether a set of species ensures the presence of another set of species with a minimum degree of confidence specified a priori.
Data Collection
A forest area is divided into a number of transects. A group of surveyors walks through each transect to identify the different species of trees and their numbers of occurrences.
Data
Species \ Transects | 1  | 2 | 3  | … | 1008
1                   | 7  | 0 | 1  | … | 13
2                   | 0  | 5 | 9  | … | 0
3                   | 16 | 4 | 0  | … | 2
⁞                   | ⁞  | ⁞ | ⁞  | … | ⁞
398                 | 6  | 2 | 25 | … | 7
Converting the Data
Species \ Transects | 1 | 2 | 3 | … | 1008
1                   | 1 | 0 | 1 | … | 1
2                   | 0 | 1 | 1 | … | 0
3                   | 1 | 1 | 0 | … | 1
⁞                   | ⁞ | ⁞ | ⁞ | … | ⁞
398                 | 1 | 1 | 1 | … | 1
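A small sketch of this conversion, assuming the counts sit in a NumPy array (only a hypothetical 3×4 slice of the 398×1008 matrix is shown):

import numpy as np

counts = np.array([[7, 0, 1, 13],
                   [0, 5, 9, 0],
                   [16, 4, 0, 2]])

# 1 if the species occurs in the transect at all, 0 otherwise
binary = (counts > 0).astype(int)
print(binary)
# [[1 0 1 1]
#  [0 1 1 0]
#  [1 1 0 1]]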
Drawbacks
Support and confidence as used by Apriori admit many rules which are not necessarily interesting.
Two options to extract interesting rules:
• Using subjective knowledge
• Using objective measures (measures better than confidence)
Subjective approaches
• Visualization – users are allowed to interactively verify the discovered rules
• Template-based approach – filter out rules that do not fit user-specified templates
• Subjective interestingness measure – filter out rules that are obvious (bread ⇒ butter) and rules that are non-actionable (do not lead to profits)
Objective Measures
TID | A B C D
1   | 1 1 0 0        Support(A) = 0.7
2   | 0 0 1 0        Support(B) = 0.6
3   | 1 1 1 1        Support(C) = 0.5
4   | 1 0 0 0        Support(D) = 0.5
5   | 0 1 0 1        Support(AB) = 0.4
6   | 1 1 0 0        Support(CD) = 0.4
7   | 0 1 1 1        minsup = 0.3
8   | 1 0 1 1
9   | 1 1 0 0        How to infer: A ⇒ B or C ⇒ D?
10  | 1 0 1 1
Dissociation
• The dissociation of an itemset is the percentage of transactions in which one or more of its items, but not all, are absent.
Dissociation(AB) = 0.5
Dissociation(CD) = 0.2
•Extract frequent itemsets from a set of transactions under high association but low dissociation.
Togetherness
Let Si = the subset of transactions containing item i.
SA ∩ SB = the subset of transactions containing both A and B.
SA ∪ SB = the subset of transactions containing either A or B.
Togetherness(AB) = |SA ∩ SB| / |SA ∪ SB|
Similar to minsup, a threshold min_togetherness can be defined to find frequent itemsets.
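A sketch of both measures over the 10-transaction A/B/C/D table above; for an item pair, dissociation reduces to an exclusive-or test:

def dissociation(a, b, txns):
    # fraction of transactions containing exactly one of {a, b}
    return sum(1 for t in txns if (a in t) != (b in t)) / len(txns)

def togetherness(a, b, txns):
    # |S_a intersect S_b| / |S_a union S_b|
    both = sum(1 for t in txns if a in t and b in t)
    either = sum(1 for t in txns if a in t or b in t)
    return both / either

txns = [{"A","B"}, {"C"}, {"A","B","C","D"}, {"A"}, {"B","D"},
        {"A","B"}, {"B","C","D"}, {"A","C","D"}, {"A","B"}, {"A","C","D"}]
print(dissociation("A", "B", txns), dissociation("C", "D", txns))  # 0.5 0.2
print(togetherness("A", "B", txns))                                # 0.444...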
Objective Measures
• Weka uses other objective measures:
– Lift(A ⇒ B) = confidence(A ⇒ B) / support(B) = support(A ∪ B) / (support(A) × support(B))
– Leverage(A ⇒ B) = support(A ∪ B) − support(A) × support(B)
– Conviction(A ⇒ B) = support(A) × support(¬B) / support(A ∪ ¬B)
– Conviction inverts the lift ratio and also computes the support for the RHS not being true.
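A sketch of the three measures from plain support fractions; applied to the table above, they separate C ⇒ D from A ⇒ B even though both rules have support 0.4:

def measures(sup_a, sup_b, sup_ab):
    lift = sup_ab / (sup_a * sup_b)
    leverage = sup_ab - sup_a * sup_b
    # support(A and not B) = support(A) - support(AB)
    conviction = sup_a * (1 - sup_b) / (sup_a - sup_ab) if sup_a > sup_ab else float("inf")
    return lift, leverage, conviction

print(measures(0.7, 0.6, 0.4))  # A => B: lift ~0.95, leverage -0.02, conviction ~0.93
print(measures(0.5, 0.5, 0.4))  # C => D: lift 1.60, leverage 0.15, conviction 2.50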
Modifications of Apriori Algorithm
• Techniques to reduce computation time:
• Hash-based techniques
• Transaction reduction
• Sampling
• Dynamic itemset counting
Frequent Pattern Mining Variations
• Type of values handled
• Levels of abstraction
• Number of data dimensions
• Kinds of patterns to be mined
• Completeness of patterns to be mined
• Kind of rules to be mined
Type of Value Handled
Binary / Boolean
• Absence of items may help in improving the discovery of association rules but does not directly contribute to rule mining.
Quantitative
• In certain applications, the absence of items may sometimes be as important as their presence.
• In medical applications, it has been found that both the presence and the absence of symptoms need to be considered in discovering association rules.
Quantitative Association Rules
For numeric attributes like age, salary, etc., binary association rule mining is not applicable. There are two basic approaches to the treatment of quantitative attributes:
• Static
• Dynamic
Static Discretisation
Quantitative attributes are discretised using predefined concept hierarchies.
Say, the original numeric values of the income attribute may be replaced by interval labels such as
“0…10K”, “11…20K”, and so on.
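A minimal sketch of static discretisation as a lookup that maps numeric values to the predefined interval labels (the boundaries are assumptions for illustration):

def income_bucket(income):
    # map a numeric income to its predefined interval label
    if income <= 10_000:
        return "0...10K"
    elif income <= 20_000:
        return "11...20K"
    else:
        return "21...30K"  # further intervals follow the same pattern

print([income_bucket(v) for v in (4_000, 15_000, 22_000)])
# ['0...10K', '11...20K', '21...30K']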
Dynamic Discretisation
Quantitative attributes are discretised (clustered) into “bins” based on the distribution of the data. After verification against the minsup and minconf thresholds, the following rules may be obtained:
age(x, 5) ⇒ studies(x, “in school”)
age(x, 6) ⇒ studies(x, “in school”)
⁞
age(x, 17) ⇒ studies(x, “in school”)
age(x, 18) ⇒ studies(x, “in school”)
Dynamic Discretisation
• ARCS (Association Rule Clustering System), used for mining quantitative rules, may be used for classification with rules of the form
Aquant1 ∧ Aquant2 ∧ … ∧ Aquantn ⇒ Acat
where Aquant1, Aquant2, etc. are tests on numeric attribute ranges and Acat is the class label assigned after the training step.
Dynamic Discretisation
Using ARCS (Association Rule Clustering System), a composite rule may be formed as
age(x, “5…18”) ⇒ studies(x, “in school”)
In a similar way, two-dimensional quantitative rules can also be formed:
age(x, “25…40”) ∧ income(x, “20K…40K”) ⇒ buys(x, “new car”)
Levels of Abstractions
[Concept hierarchy figure: root “All” with children Pen, Ink and Writing Pad; below them, types such as Dot and Fountain pens, Bottle and Cartridge ink, Blank and Ruled pads; at the lowest level, brands such as Parker, Pioneer, Pilot, Oxford, Link, …]
Multilevel Association Rule
Using
• Uniform minimum support
• Reduced minimum support at lower levels
• Group-based minimum support
Rules over Taxonomies
• The items used for rule mining may not be at the same level. There can be an in-built taxonomy among the items. An example of a taxonomy applicable to market basket data:
[Taxonomy: Clothes → {Outerwear, Shirts}; Outerwear → {Track Suits, Track Pants}; Footwear → {Shoes, Snickers}]
This taxonomy implies:
• Track Suits is-a Outerwear, Outerwear is-a Clothes, etc.
Rules over Taxonomies
Application domains may need rules at different levels of the taxonomy.
Trivial Rule: if Ŷ is an ancestor of Y, then the rule Y ⇒ Ŷ is trivial.
Shoes ⇒ Footwear (a rule with 100% confidence)
Rules across Levels
• The rule Outerwear ⇒ Snickers does not imply either Track Suits ⇒ Snickers or Track Pants ⇒ Snickers.
So, a rule at a higher level does not imply the same rule at a lower level of the taxonomy.
Rules across Levels
• The rule Track Suits ⇒ Snickers definitely implies the rule Outerwear ⇒ Snickers.
So, a rule at a lower level definitely implies the same rule at a higher level of the taxonomy.
Interest Measure
• To find rules whose support is more than R times the expected value, or whose confidence is more than R times the expected value, for some user-specified constant R.
Rule (with Taxonomies) Generation
Steps
1. Find frequent itemsets.
2. Use the frequent itemsets to generate the desired rules.
3. Prune all uninteresting rules from this set.
The Database
TID | Items
1   | Shirts
2   | Track Suits, Snickers
3   | Track Pants, Snickers
4   | Shoes
5   | Shoes
6   | Track Suits

minsup = 30%, minconf = 60%
Frequent Itemset & Taxonomies
Itemset                | Sup (out of 6)
{Track Suits}          | 2
{Outerwear}            | 3
{Clothes}              | 4
{Shoes}                | 2
{Snickers}             | 2
{Footwear}             | 4
{Outerwear, Snickers}  | 2
{Clothes, Snickers}    | 2
{Outerwear, Footwear}  | 2
{Clothes, Footwear}    | 2
Rules
Rule                   | Sup% | Conf%
Outerwear ⇒ Snickers   | 33   | 66
Outerwear ⇒ Footwear   | 33   | 66
Snickers ⇒ Outerwear   | 33   | 100
Snickers ⇒ Clothes     | 33   | 100
Rule under Item Constraints
Some applications may need association rules under user-specified constraints on items. When a taxonomy is present, these constraints may be specified using the taxonomy.
Rule under Item Constraints
(Track Suits ∧ Shoes) ∨ (descendants(Clothes) ∧ ¬ancestors(Snickers))
• A Boolean expression representing a constraint.
• Allow rules that contain either both Track Suits and Shoes, or Clothes or any descendant of Clothes, and that do not contain Snickers or its ancestor Footwear.
Rule under Item Constraints
Exploiting the hierarchy does not stop the generation of association rules among items at the same level. Such rules are therefore called Generalized Association Rules.
Number of Data Dimensions
• Single dimension
– discrete predicate: buy(X, “Pen”) ⇒ buy(X, “Ink”)
• Multi-dimension
– discrete predicates: age(X, “9..21”) ∧ occupation(X, “Student”) ⇒ buy(X, “Pen”)
– multiple occurrences of a predicate: age(X, “9..21”) ∧ occupation(X, “Student”) ∧ buy(X, “Pen”) ⇒ buy(X, “Ink”)
Sequential Patterns
A sequential pattern always provides an order.
• In a market basket application, one is not interested in the set of items appearing within a single transaction but tries to find an inter-transaction purchase pattern. So the transactions need to be ordered.
Sequential Patterns
It is assumed that a customer can have only one transaction at a given transaction time.
• An itemset (I) is a non-empty set of items (ij): I = {i1, i2, …, in}
• A sequence (s) is an ordered list of itemsets or events (ej): s = {e1 e2 … em}, where ei occurs before ej for i < j
Sequential Patterns
A sequence is contained in another sequence if each itemset of the first sequence is contained in some itemset of the second sequence, preserving order. The sequence {(3) (4 5) (8)} is contained in {(7) (3 8) (9) (4 5 6) (8)} since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). The sequence {(3) (5)} is not contained in {(3 5)}, and vice versa.
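A sketch of this containment test: each itemset of the first sequence is greedily matched, in order, against a superset itemset of the second sequence:

def contains(s1, s2):
    # is sequence s1 (a list of itemsets) contained in sequence s2?
    pos = 0
    for e in s1:
        while pos < len(s2) and not set(e) <= set(s2[pos]):
            pos += 1              # scan forward for a superset itemset
        if pos == len(s2):
            return False          # ran out of itemsets in s2
        pos += 1                  # preserve order: move past the match
    return True

print(contains([(3,), (4, 5), (8,)], [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))  # True
print(contains([(3,), (5,)], [(3, 5)]))                                       # False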
Sequential Patterns
• In a set of sequences, a sequence s is maximal if it is not contained in any other sequence.
• For a sequence to be frequent, it must cross the minimum support threshold.
• A frequent sequence is called a sequential pattern.
• A sequential pattern of length l is called an l-pattern.
Discovery of Sequential Patterns
Sequence | Support
{(10)}   | 1
{(20)}   | 1
{(30)}   | 4
{(40)}   | 2
{(50)}   | 1
{(60)}   | 1
{(70)}   | 3
{(90)}   | 3

CustId | Date       | Items
001    | 13/05/2012 | 30
001    | 14/05/2012 | 90
002    | 13/05/2012 | 10, 20
002    | 15/05/2012 | 30
002    | 16/05/2012 | 40, 60, 70
003    | 17/05/2012 | 30, 50, 70
004    | 13/05/2012 | 30
004    | 14/05/2012 | 40, 70
004    | 16/05/2012 | 90
005    | 13/05/2012 | 90

minsup = 25%
Discovery of Sequential Patterns
• L1 = {{(30)}, {(40)}, {(70)}, {(90)}}
• Candidate 2-sequences: C2 = {{(30) (30)}, {(30) (40)}, {(30) (70)}, {(30) (90)}, …, {(90) (90)}, {(30 40)}, …, {(70 90)}}

Sequence  | Support
(10 20)   | 1
(10) (30) | 1
(20) (30) | 1
(30) (40) | 2
(30) (60) | 1
(30) (70) | 2
(30) (90) | 2
(40) (90) | 1
(70) (90) | 1
(40 70)   | 2
Discovery of Sequential Patterns
• L2 = {{(30) (40)}, {(30) (70)}, {(30) (90)}, {(40 70)}}
• Candidate sequences C3 = {{(30) (30) (70)}, {(30) (30) (90)}, {(30) (40 70)}, …, {(40) (30) (70)}, {(40) (30) (90)}, {(40) (40 70)}, …, {(30) (40) (30) (70)}, {(30) (40) (30) (90)}, {(30) (40) (40 70)}, …, {(40 70) (40 70)}, …, {(30) (40 70 90)}}

Sequence    | Support
(30) (40 70) | 2
Discovery of Sequential Patterns
CustId | Sequence
1      | (30) (90)
2      | (10 20) (30) (40 60 70)
3      | (30 50 70)
4      | (30) (40 70) (90)
5      | (90)
If minsup for a maximal sequence = 0.25 (say), then the acceptable sequential patterns are {(30) (90)} and {(30) (40 70)}.
Specification of Time Windows
•User may define a time window within which the patterns are to be discovered.
• If a pattern lacks adequate support within a single time window, even though it crosses minsup when counted across different time windows, it is not considered a valid sequential pattern.
• This effort helps in studying seasonal purchase patterns in case of market basket analysis.
Sequential Patterns over Taxonomies
Similar to rule mining, the items under consideration may not be at the same level.
From the available transactions, if a sequential pattern {(Track Suits) (Shoes)} is found, it also supports patterns like {(Outerwear) (Shoes)}, {(Outerwear) (Footwear)}, etc. These are called generalized sequential patterns.
[Taxonomy as before: Clothes → {Outerwear, Shirts}; Outerwear → {Track Suits, Track Pants}; Footwear → {Shoes, Snickers}]
Data Classification
•Classification is a method where the data instances in a problem domain are distributed among different pre-defined classes or concepts.
• Usually a data instance is placed in only one class. • For the purpose of classification, definite criteria /
rules are defined for the membership of each class.
Data Classification
•Classification is usually done under the supervision of domain experts of the problem domain under consideration. So, classification process involves supervised learning.
• Clustering, on the other hand, is the result of unsupervised learning. Here the class or concept label of each data instance or each cluster is not known. The number of such classes or concepts is pre-defined intuitively.
Data Classification
The classification process has two steps:
1. Build the model from the training data set
– learning a mapping function y = f(X), where y is the associated class label for an instance X.
2. Classify unknown data.
Comparison of Classification Methods
Properties for the comparison:
• Predictive accuracy: the ability of a model to correctly predict the class label of a new data instance.
• Speed: the computational cost, in terms of time, required to generate the model (i.e., to train the classes) and then to classify data.
Comparison of Classification Methods
Properties for the comparison:
• Robustness: the ability of a model to make correct classifications given noisy data or data with missing values.
• Scalability: the response of a model, in the training and classification steps, to increases in data volume.
Classification by Decision Tree Induction
• A Decision Tree is a tree structure.
• Classification is done against a concept.
• The tree is formed by testing an attribute or attribute combination at each node.
• Each branch of the tree corresponds to an outcome of this test.
• The leaf nodes represent the classes.
Decision Tree Concept: Buy New Car
[Decision tree: the root tests INCOME (≤20K / 20-50K / >50K); the ≤20K branch tests MARITAL STATUS (Married → NO, Single → YES); the 20-50K branch tests AGE (<40 → YES, >40 → NO); the >50K branch leads directly to YES]
Decision Tree Induction Algorithm
1. Tree starts as a single node on which training samples are tested.
2. If all the training samples are of the same class, the node becomes a leaf and is labeled with that class.
3. Running an attribute selection algorithm, an attribute is chosen for tree generation (attribute INCOME in the example).
Decision Tree Induction Algorithm
4. A branch is created for each value of the chosen attribute and the samples are partitioned accordingly (three branches under INCOME).
5. Algorithm repeats steps 3 and 4 recursively to form decision tree for the samples at each partition. Once an attribute is considered in a node, it is not considered in any of its descendent nodes.
Decision Tree Induction Algorithm
6. The recursive procedure stops when
i. all samples at a node belong to the same class according to the domain expert;
ii. there is no other attribute on which the samples can be further partitioned; Majority Voting may be employed here to convert the node to a leaf labeled with the class that covers the majority of its samples;
iii. there are no tuples for a given branch.
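As an illustrative sketch (not the induction procedure above itself), scikit-learn's decision tree with the entropy criterion learns a comparable tree from a toy encoding of the buy-new-car example; the training rows below are hypothetical:

from sklearn.tree import DecisionTreeClassifier, export_text

# income: 0 = "<=20K", 1 = "20-50K", 2 = ">50K"; married: 1 = yes; age in years
X = [[0, 1, 35], [0, 0, 28], [1, 1, 30], [1, 0, 45], [2, 1, 50], [2, 0, 25]]
y = ["no", "yes", "yes", "no", "yes", "yes"]   # buys-new-car

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["income", "married", "age"]))
print(tree.predict([[2, 0, 40]]))   # classify an unknown instance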
Tree Pruning
Tree pruning is done to avoid overfitting the data at different nodes. Statistical measures are used to identify and remove branches that are not reliable enough. This results in faster and better classification of unknown data.
• Prepruning
• Postpruning
Prepruning
The tree generation process is stopped after every partitioning. As a result, all newly generated nodes become leaf nodes, with the membership of samples decided by Majority Voting. The goodness of the partitioning is then tested by measures like χ², information gain, etc. If the result falls below a pre-specified threshold, further partitioning of the affected subset of samples is stopped.
Prepruning
• A high threshold would generate an over-simplified tree, while a low threshold may cause hardly any pruning.
Postpruning
• Branches are removed from a fully grown tree. The expected error rate at each non-leaf node is computed as if its sub-tree were pruned. It is compared with the combined error rates along each of its branches, weighted by the proportion of participating samples. If the expected error rate is lower, the sub-tree is removed.
Classification Rule Generation
Each path of a decision tree from the root to a leaf gives rise to an IF-THEN classification rule. From the decision tree in the example, rules may be formed as:
IF income = “≤20K” AND marital-status = “married” THEN buys-new-car = “no”
IF income = “>50K” THEN buys-new-car = “yes”
etc.
Classification Rule Generation
Either during rule generation or during postpruning, redundant paths are pruned. For example, if the following rules are found:
IF income = “≤20K” AND marital-status = “married” THEN buys-new-car = “no”
IF income = “≤20K” AND marital-status = “widow” THEN buys-new-car = “no”
Classification Rule Generation
the two paths are pruned to one as:
IF income = “≤20K” AND marital-status = (“married” OR “widow”) THEN buys-new-car = “no”
Other well-known classification methods are Bayesian Classification, Classification by Backpropagation, k-Nearest Neighbor Classifiers, etc.
Case Study: Dynamic Classification Hierarchy
Classification of Archaeological data:
• A Classification Hierarchy is created over a backend database to generate and update association rules. The Classification Hierarchy is continuously restructured as the database is updated.
• On arrival of a new instance, the system tries to place it in the existing hierarchy. If classification fails, the instance is considered an Exception to the class found to be the closest.
Case Study: Dynamic Classification Hierarchy
Classification of Archaeological data:
• The system initiates restructuring when the number of Exceptions exceeds a predefined threshold value.
Three important operations are used:
1. ADD: adds a new branch to the hierarchy.
2. FUSE: merges two or more classes into one.
3. BREAK: decomposes a class into two or more classes.
Initial Transaction
• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I1 = {a0, a1, a2, a3, a4}
I2 = {a0, a1, a2, a5, a6}
I3 = {a0, b0, b1, b2}
I4 = {a0, b0, b3, b4}
I5 = {a0, b0, b5, b6}
Initial Hierarchy
Exact match at leaf-level classes
• 5 leaf classes
[Hierarchy figure: root C0 {a0} with children C1 {a1, a2} and C2 {b0}; C1 has leaves C11 {a3, a4} and C12 {a5, a6}; C2 has leaves C21 {b1, b2}, C22 {b3, b4} and C23 {b5, b6}]
Add
I6 = {a0, a3, a4, b0, b1, b2, b3, b4}
Approximate match, up to an intermediate level (an exception).
A large number of exceptions may generate a new class.
[Figure: the initial hierarchy with a new leaf C24 {b1, b2, b3, b4} added under C2]
Fuse
[Figure: FUSE. Before: C0 {a0} has children C1 {a1, a2} and C2; C1 has children C11 {a3, a4} … C1n, and C2 has children C21 … C2m. After: C11 is merged into C1, giving C1 {a1, a2, a3, a4}; the other classes are unchanged]
Fuse
• The fuse of two peer classes K1 and K2 is not allowed if there exists any other peer class K3 with
A_K2 ⊆ A_K1 and
A_K2 ⊆ A_K3
Further Transaction
• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I7 = {a0, a3, a4, b0, b1, b2, b3, b4}
I8 = {a0, a5, a6, b0, b1, b2, b3, b4}
I9 = {a0, a3, a5, b0, b1, b2, b3, b4}
I10 = {a0, a3, a5, b0, b1, b2, b5}
I11 = {a0, a3, b0, b1, b2, b3, b4}
Break
[Figure: BREAK. The hierarchy of the Add step, with class C24 {b1, b2, b3, b4} decomposed into subclasses C41 {a3, a4} and C42 {a5, a6}]
Cluster Analysis
• The process of partitioning a set of data objects into groups of similar objects is called Clustering. Objects belonging to the same cluster are supposed to be similar, whereas those in different clusters should be dissimilar under the same similarity measure.
Cluster Analysis
• A good clustering algorithm should have the following properties:
• Scalability
• Ability to handle different data types
• Insensitivity to the order of input records
• Working with minimal user intervention
• Constraint-based clustering
• Ability to handle high dimensionality
Clustering Algorithms
• Partitioning Method: given n objects or data instances, a partitioning method constructs k partitions, where k ≤ n. Each group/partition must have at least one object, and each object must belong to exactly one group (this may not hold for a fuzzy partitioning algorithm).
k-Means Algorithm, a Centroid-based Technique
Accepts an input parameter k and partitions n objects into k clusters where intra-cluster similarity is high and inter-cluster similarity is low. Similarity is measured with respect to the mean value of the objects in a cluster, called the centroid of the cluster.
Centroid-based Technique
1. Arbitrarily choose k objects out of n as the initial cluster centers;
2. assign or reassign each object to the cluster to which it is most similar, with respect to the mean value;
3. re-compute the cluster means;
4. repeat steps 2 and 3 until there is no further change or an exit condition is met.
Centroid-based Technique
k-means is an iterative algorithm that works on the convergence of a squared-error criterion of the form
E = Σ(i=1..k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of squared error over all objects, p is a given object and mi is the centroid of cluster Ci.
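A minimal k-means sketch for 2-D points following these steps (random initial centers, squared Euclidean distance; illustrative only):

import random

def kmeans(points, k, iterations=100):
    centers = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # update step: recompute each cluster mean (keep the old center if empty)
        new_centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                       if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:    # convergence: E no longer decreases
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans(pts, k=2)[0])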
k-Medoids Algorithm
The k-means algorithm is sensitive to outliers: a very large value may distort the distribution of data among clusters. To overcome this, a medoid, rather than the mean, is used as the reference point of a cluster. A medoid is the most centrally located object in a cluster.
k-Medoids Algorithm
1. Arbitrarily choose k objects out of n as the initial medoids;
2. assign each remaining object to the cluster with the nearest medoid;
3. randomly select a non-medoid object, Orandom;
k-Medoids Algorithm
4. Compute the total cost S of swapping a current medoid Oj with Orandom (the cost function calculates the difference in squared-error value if the current medoid is replaced by a non-medoid object);
5. if S < 0, swap Oj with Orandom to form a new set of k medoids (the total cost of swapping is the sum of the costs incurred by all non-medoid objects);
k-Medoids Algorithm
6. Repeat steps 2 to 5 until there is no change.
• To judge the quality of a replacement of Oj by Orandom, each non-medoid object p is examined under the following four cases:
• If p belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to some Oi with i ≠ j, then reassign p to Oi.
• If p belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
k-Medoids Algorithm
• If p belongs to the cluster of Oi with i ≠ j, Oj is replaced by Orandom, and p is still closest to Oi, then the assignment of p does not change.
• If p belongs to the cluster of Oi with i ≠ j, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
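A sketch of the swap-cost test at the heart of these steps, using Manhattan distance and hypothetical 2-D objects; a negative S means the swap lowers the total cost and is accepted:

def total_cost(points, medoids):
    # sum of distances from every object to its nearest medoid
    return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
               for p in points)

def swap_cost(points, medoids, out_medoid, candidate):
    # S = cost after swapping out_medoid for candidate, minus the current cost
    trial = [candidate if m == out_medoid else m for m in medoids]
    return total_cost(points, trial) - total_cost(points, medoids)

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (100, 100)]  # note the outlier
medoids = [(1, 1), (8, 8)]
print(swap_cost(pts, medoids, (8, 8), (9, 8)))  # -1: the swap is accepted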
Parallel Association Rule Mining Algorithms
Challenges include:
• synchronization and communication minimization
• disk I/O minimization
• workload balancing
Parallel Association Rule Mining Algorithms
Strategies:
• Distributed vs. shared-memory architecture: SM needs more synchronization (locking, etc.), while for DM, message passing incurs higher communication overhead.
• Data vs. task parallelism.
• Static vs. dynamic parallelism.
Sources & References
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2007.
2. Willi Klosgen and Jan M. Zytkow, "Handbook of Data Mining and Knowledge Discovery", 2002.
3. R. Srikant, "Fast Algorithms for Mining Association Rules and Sequential Patterns", Ph.D. Thesis, University of Wisconsin-Madison, 1996.
4. R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases", Proc. ACM SIGMOD, pp. 207-216, 1993.
Sources & References
5. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proc. International Conference on Very Large Databases, 1994.
6. J. S. Park, M. S. Chen and P. S. Yu, "An effective hash based algorithm for mining association rules", Proc. ACM SIGMOD, 1995.
7. R. Srikant, Q. Vu and R. Agrawal, "Mining association rules with item constraints", Proc. International Conference on Knowledge Discovery in Databases, 1997.
Sources & References
8. K. Ali, S. Manganaris and R. Srikant, "Partial classification using association rules", Proc. International Conference on Knowledge Discovery in Databases, 1997.
9. S. Pal and A. Bagchi, "Association against Dissociation: some pragmatic considerations for Frequent Itemset generation under Fixed and Variable Thresholds", ACM SIGKDD Explorations, Vol. 7, Issue 2, Dec. 2005, pp. 151-159.
Sources & References
10. S. Ray and A. Bagchi, "Rule Generation by Boolean Minimization: Experience with Coronary Bifurcation Stenting in Angioplasty", ReTIS 2006.
11. S. Maitra and A. Bagchi, "Dynamic restructuring of classification hierarchy towards data mining", Proc. International Conference on Management of Data, 1998.
12. T. G. Dietterich and R. S. Michalski, "Discovering patterns in sequences of events", Artificial Intelligence, vol. 25, pp. 187-232, 1985.
Sources & References
13. R. Agrawal and R. Srikant, "Mining sequential patterns", Proc. IEEE International Conference on Data Engineering, 1995.
14. R. Srikant and R. Agrawal, "Mining sequential patterns: generalizations and performance improvements", Proc. International Conference on Extending Database Technology, 1996.
15. M. J. Zaki, "Parallel and distributed association mining: a survey", IEEE Concurrency, 7(4), pp. 14-25, 1999.
Research Challenges
Areas:
• Query languages
• Architecture
• Text mining
• Multimedia mining
• Spatial / temporal analysis
• Graph mining
THANK YOU