INTRODUCTION TO DATA MINING
Pinakpani Pal
Electronics & Communication Sciences Unit
Indian Statistical Institute
[email protected]
Main Sources
• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2007
• Willi Klosgen and Jan M. Zytkow, “Handbook of Data Mining and Knowledge Discovery”, 2002
• R. Srikant, “Fast Algorithms for Mining Association Rules and Sequential Patterns”, Ph.D. Thesis, University of Wisconsin-Madison, 1996
• M. J. Zaki, “Parallel and distributed association mining: a survey”, IEEE Concurrency, 7(4), pp. 14-25, 1999
Prelude
•Data Mining is a method of finding interesting trends or patterns in large datasets.
•Data collection may be incomplete, heterogeneous and historical.
•Since data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.
•Data Mining tools are expected to involve minimal user intervention.
Prelude
• Data mining deals with finding patterns in data that are either
– user-defined (pre-specified by the user),
– interesting (judged with the help of an interestingness measure), or
– valid (validity pre-defined).
• Discovered patterns help guide the appropriate authority in making future decisions, so data mining is regarded as a tool for Decision Support.
Data Mining Communities
•Statistics: Provides the background for the algorithms.
•Artificial Intelligence: Provides the required heuristics for machine learning / conceptual clustering.
•Database: Provides the platform for storage and retrieval of raw and summary data.
Data Mining
Mining knowledge from large amounts of data.
Evolution:
• Data collection
• Database creation
• Data management
– Data storage
– Retrieval
– Transaction processing
Data Mining
• Advanced data analysis: data warehousing and data mining
Data Mining Components
Information Repository: single or multiple heterogeneous data sources
Data Server: storing and retrieving relevant data
Knowledge Base: concept hierarchies, constraints, thresholds, metadata
Pattern Extraction: characterization, discrimination, association, classification, prediction, clustering, various statistical analyses
Pattern Evaluation: interestingness measures
Stages of the Data Mining Process
Misconception: Data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
Steps:
• Data Collection
– web crawling / warehousing
Stages of the Data Mining Process
Steps (contd.):
• Data Preprocessing & Feature Extraction
– Data cleaning: elimination of erroneous and irrelevant data
– Data integration: combining data from multiple sources
– Data selection / reduction: accepting only those attributes of the data that are interesting for the problem domain
– Data transformation: normalization, aggregation
Stages of the Data Mining Process
Steps (contd.):
• Pattern Extraction & Evaluation
– Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
– Making the results easily understandable
• Evaluation of results
– Not every software-discovered fact is useful for human beings!
Data Preprocessing
Data Cleaning: Data may be incomplete, noisy and inconsistent. Attempts are made to fill in missing values, identify outliers, smooth out noise, and correct inconsistencies.
Data Preprocessing
Data Integration: Data analysis may involve integrating data from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files.
Data Preprocessing
Data Reduction: Since both the data volume and the attribute set may be too large, data reduction becomes necessary. It includes activities like removal of irrelevant and redundant attributes, data compression, and aggregation or generation of summary data.
Data Preprocessing
Transformation: Data may need to be transformed or consolidated into forms suitable for mining. This may include activities like generalization, normalization (e.g., attribute values converted from absolute values to ranges), construction of new attributes, etc.
Patterns
• Descriptive – characterizing general properties of the data
• Predictive – performing inference on the current data in order to make predictions
• Discover:
– multiple kinds of patterns, to accommodate different user expectations / applications (the user may specify hints to guide the search)
– patterns at various granularities
Frequent Patterns
Patterns that occur frequently in the data.
Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)
Discovery of Association Rules
To identify the features or items in a problem domain that tend to appear together. These features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions, and additionally, to extract rules on how a subset of items influences the presence of another subset.
Association Rule: Example
A user studying the buying habits of customers may choose to mine association rules of the form:
P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z) [support = n%, confidence = m%]
Meta-rules such as the following can be specified:
occupation(X, “student”) ∧ age(X, “20...29”) ⇒ buys(X, “mobile”) [1.4%, 70%]
Association Rule: Single/Multi
Single-dimensional association rule:
buys(X, “computer”) ⇒ buys(X, “antivirus”) [1.1%, 55%]
or, equivalently:
“computer” ⇒ “antivirus” (A ⇒ B) [1.1%, 55%]
Multi-dimensional association rule:
occupation(X, “student”) ∧ age(X, “20...29”) ⇒ buys(X, “mobile”) [1.4%, 70%]
Metrics for Interestingness measures
Interestingness measures in knowledge discovery help to identify the relevance of the patterns discovered during the mining process.
Interestingness measures
•Used to confine the number of uninteresting patterns returned by the process.
•Based on the structure of patterns and statistics underlying them.
•Associate a threshold which can be controlled by the user
– patterns not meeting the threshold are not presented to the user.
Interestingness measures: objective
Objective measures of pattern interestingness:
• simplicity
• utility (support)
• certainty (confidence)
• novelty
Interestingness measures: simplicity
Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension.
e.g. Rule length is a simplicity measure
Interestingness measures: support
Utility (support): the usefulness of a pattern.
support(A ⇒ B) = P(A ∪ B)
The support of an association rule {A} ⇒ {B} is the percentage of all transactions under analysis that contain the itemset A ∪ B.
Interestingness measures: confidence
Certainty (confidence): assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:
confidence(A ⇒ B) = P(B|A)
The confidence of an association rule {A} ⇒ {B} is the percentage of cases that follow the rule.
Association rules that satisfy both the confidence and support thresholds are referred to as strong association rules.
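A minimal Python sketch of these two measures over a small, hypothetical transaction list (the Pen/Ink items echo the market basket example later in this deck):

transactions = [
    {"Pen", "Ink", "Diary", "Writing Pad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "Writing Pad"},
]

def support(itemset, txns):
    # fraction of transactions containing every item of `itemset`
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(lhs, rhs, txns):
    # confidence(LHS => RHS) = P(RHS | LHS) = support(LHS u RHS) / support(LHS)
    return support(lhs | rhs, txns) / support(lhs, txns)

print(support({"Pen", "Ink"}, transactions))       # 0.75
print(confidence({"Pen"}, {"Ink"}, transactions))  # 0.75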
Interestingness measures: novelty
Novelty: Patterns contributing new information to the given pattern set are called novel patterns.
e.g., data exceptions.
Removing redundant patterns is a strategy for detecting novelty.
Market Basket data analysis
Let a transaction be defined as the variety of items purchased by a customer in one visit, irrespective of the quantity of each item purchased. The problem is to find the items that a customer tends to buy together.
Market Basket data analysis
An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of the expression is that transactions containing X tend to contain Y as well. The inverse may not be true. Since only the presence or absence of items is considered, and not the quantity purchased, these rules are called Binary Association Rules.
Market Basket data analysis
The purpose is to study consumers' purchase patterns in departmental stores. Consider four possible transactions:
1 - {Pen, Ink, Diary, Writing Pad}
2 - {Pen, Ink, Diary}
3 - {Pen, Diary}
4 - {Pen, Ink, Writing Pad}
Market Basket data analysis
A possible association rule:
“Purchase of Pen implies the purchase of Ink or Diary”
{Pen} ⇒ {Ink} or {Pen} ⇒ {Diary}
Basically, the rule is of the form {LHS} ⇒ {RHS} where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.
Binary Association Rule Mining
Two-Step Process
1. Find all frequent itemsets
– An itemset will be considered for mining rules if its support is above a threshold called minsup.
2. Generate strong association rules from the frequent itemsets
– Acceptance of a rule is once again through a threshold, called minconf.
Finding Frequent Itemsets
If there are N items in a market basket and the association is studied for all possible item combinations, a total of 2^N combinations are to be checked.
Finding Frequent Itemsets
All nonempty subsets of a frequent itemset must also be frequent (the anti-monotone property).

Apriori Algorithm
An itemset is frequent when its support in the total dataset exceeds minsup. If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets up to N-itemsets.
Apriori Algorithm
The algorithm has two steps:
1. Join step: candidate k-itemsets are computed by joining frequent (k−1)-itemsets.
2. Prune step: if a k-itemset fails to cross the minsup threshold, all supersets of that k-itemset are no longer considered for association rule discovery.
Apriori Algorithm
• Let Lk be the set of frequent k-itemsets.
• Let Ck be the set of candidate k-itemsets. Each member of this set has two fields: itemset and support count.
Apriori Algorithm
1. Let k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) OR (k = N), go to Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Go to Step 3
7. Stop
Output: ∪k Lk
Apriori Algorithm
Join()
forall (i, j) where i ∈ Lk−1, j ∈ Lk−1 and i ≠ j
    select all possible k-itemsets and insert them into Ck
endfor

If L3 = {⟨{1 2 3}, s123⟩, ⟨{1 2 4}, s124⟩, ⟨{1 3 4}, s134⟩, ⟨{1 3 5}, s135⟩, ⟨{2 3 4}, s234⟩}
then C4 = {⟨{1 2 3 4}, s1234⟩, ⟨{1 3 4 5}, s1345⟩}
Apriori Algorithm
Prune()
forall itemsets c ∈ Ck do
    forall (k−1)-subsets s of c do
        if (s ∉ Lk−1) then delete c from Ck endif
    endfor
endfor
Lk ← Ck

L4 = {⟨{1 2 3 4}, s1234⟩}
({1 3 4 5} is pruned because its subset {1 4 5} ∉ L3)
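The following compact Python sketch mirrors the join, prune and support-counting loop above; it is illustrative only, not the optimized algorithm of the cited papers, and the sample transactions are hypothetical:

from itertools import combinations

def apriori(transactions, minsup):
    # return all frequent itemsets with support >= minsup (a fraction)
    n = len(transactions)
    sup = lambda s: sum(1 for t in transactions if s <= t) / n
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk))) + 1
        # Join: unite pairs of frequent (k-1)-itemsets into k-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: drop candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Support counting keeps only the truly frequent candidates
        Lk = {c for c in Ck if sup(c) >= minsup}
        frequent |= Lk
    return frequent

txns = [{"Pen", "Ink", "Diary"}, {"Pen", "Diary"}, {"Pen", "Ink"}, {"Ink", "Diary"}]
print(apriori(txns, minsup=0.5))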
Rule Generation
Rule generation only needs to ensure that the rules produced satisfy the minimum confidence threshold
– because the rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold.
Given a frequent itemset li, find all non-empty subsets f ⊂ li such that the rule f ⇒ (li − f) satisfies the minimum confidence requirement.
• If |li| = k, then there are 2^k − 2 candidate association rules.
Rule Generation
Algorithm:
forall frequent itemsets li with |li| ≥ 2 do
    call genrule(li, li)
endfor
Rule Generation
genrule(lk, fm)
    F ← {(m−1)-itemsets fm−1 | fm−1 ⊂ fm}
    forall fm−1 ∈ F do
        conf ← sup(lk) / sup(fm−1)
        if (conf ≥ minconf) then
            print rule “fm−1 ⇒ (lk − fm−1)”, with confidence = conf and support = sup(lk)
            if (m−1 > 1) then
                call genrule(lk, fm−1)
            endif
        endif
    endfor
Rule Generation
If {A,B,C,D} is a frequent itemset, the candidate rules are:
{ABC} ⇒ {D}, {ABD} ⇒ {C}, {ACD} ⇒ {B}, {BCD} ⇒ {A},
{AB} ⇒ {CD}, {AC} ⇒ {BD}, {AD} ⇒ {BC}, {BC} ⇒ {AD}, {BD} ⇒ {AC}, {CD} ⇒ {AB},
{A} ⇒ {BCD}, {B} ⇒ {ACD}, {C} ⇒ {ABD}, {D} ⇒ {ABC}
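A plain Python sketch of this step: it enumerates all 2^k − 2 candidate LHS/RHS splits of one frequent itemset and keeps the confident ones (the recursive genrule above additionally prunes via the anti-monotonicity of confidence noted just below; the support values here are hypothetical):

from itertools import combinations

def gen_rules(itemset, sup, minconf):
    # sup maps frozensets to support values; emit (LHS, RHS, confidence)
    rules = []
    for r in range(len(itemset) - 1, 0, -1):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = sup[itemset] / sup[lhs]
            if conf >= minconf:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

sup = {frozenset("AB"): 0.4, frozenset("A"): 0.7, frozenset("B"): 0.6}
print(gen_rules(frozenset("AB"), sup, minconf=0.5))
# [({'A'}, {'B'}, 0.57...), ({'B'}, {'A'}, 0.66...)]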
Rule Generation
In general, confidence does not have an anti-monotone property:
c({ABC} ⇒ {D}) can be larger or smaller than c({AB} ⇒ {D}).
But the confidence of rules generated from the same itemset does have an anti-monotone property:
– confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
e.g., for L = {A,B,C,D}: c({ABC} ⇒ {D}) ≥ c({AB} ⇒ {CD}) ≥ c({A} ⇒ {BCD})
Case Study
To find the Association among the species of trees present in a forest.
The problem is to find a set of association rules which would indicate the species of trees that usually appear together, and also whether a set of species ensures the presence of another set of species with a minimum degree of confidence specified a priori.
Data Collection
A forest area is divided into a number of transects. A group of surveyors walks through each transect to identify the different species of trees and their numbers of occurrences.
Data
Species \ Transects | 1  | 2 | 3  | … | 1008
1                   | 7  | 0 | 1  | … | 13
2                   | 0  | 5 | 9  | … | 0
3                   | 16 | 4 | 0  | … | 2
⁞                   | ⁞  | ⁞ | ⁞  | … | ⁞
398                 | 6  | 2 | 25 | … | 7
Converting the Data
Species \ Transects | 1 | 2 | 3 | … | 1008
1                   | 1 | 0 | 1 | … | 1
2                   | 0 | 1 | 1 | … | 0
3                   | 1 | 1 | 0 | … | 1
⁞                   | ⁞ | ⁞ | ⁞ | … | ⁞
398                 | 1 | 1 | 1 | … | 1
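A small sketch of this conversion, assuming the counts sit in a NumPy array (only a hypothetical 3×4 slice of the 398×1008 matrix is shown):

import numpy as np

counts = np.array([[7, 0, 1, 13],
                   [0, 5, 9, 0],
                   [16, 4, 0, 2]])

# 1 if the species occurs in the transect at all, 0 otherwise
binary = (counts > 0).astype(int)
print(binary)
# [[1 0 1 1]
#  [0 1 1 0]
#  [1 1 0 1]]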
Drawbacks
Support and confidence as used by Apriori admit many rules which are not necessarily interesting.
Two options to extract interesting rules:
• Using subjective knowledge
• Using objective measures (measures better than confidence)
Subjective approaches
• Visualization – users are allowed to interactively verify the discovered rules
• Template-based approach – filter out rules that do not fit user-specified templates
• Subjective interestingness measure – filter out rules that are obvious (bread ⇒ butter) and rules that are non-actionable (do not lead to profits)
Objective Measures
TID | A B C D
1   | 1 1 0 0        Support(A) = 0.7
2   | 0 0 1 0        Support(B) = 0.6
3   | 1 1 1 1        Support(C) = 0.5
4   | 1 0 0 0        Support(D) = 0.5
5   | 0 1 0 1        Support(AB) = 0.4
6   | 1 1 0 0        Support(CD) = 0.4
7   | 0 1 1 1        minsup = 0.3
8   | 1 0 1 1
9   | 1 1 0 0        How to infer: A ⇒ B or C ⇒ D?
10  | 1 0 1 1
Dissociation
• The dissociation of an itemset is the percentage of transactions in which one or more of its items, but not all, are absent.
Dissociation(AB) = 0.5
Dissociation(CD) = 0.2
•Extract frequent itemsets from a set of transactions under high association but low dissociation.
Togetherness
Let Si = the subset of transactions containing item i.
SA ∩ SB = the subset of transactions containing both A and B.
SA ∪ SB = the subset of transactions containing either A or B.
Togetherness(AB) = |SA ∩ SB| / |SA ∪ SB|
Similar to minsup, a threshold min_togetherness can be defined to find frequent itemsets.
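A sketch of both measures over the 10-transaction A/B/C/D table above; for an item pair, dissociation reduces to an exclusive-or test:

def dissociation(a, b, txns):
    # fraction of transactions containing exactly one of {a, b}
    return sum(1 for t in txns if (a in t) != (b in t)) / len(txns)

def togetherness(a, b, txns):
    # |S_a intersect S_b| / |S_a union S_b|
    both = sum(1 for t in txns if a in t and b in t)
    either = sum(1 for t in txns if a in t or b in t)
    return both / either

txns = [{"A","B"}, {"C"}, {"A","B","C","D"}, {"A"}, {"B","D"},
        {"A","B"}, {"B","C","D"}, {"A","C","D"}, {"A","B"}, {"A","C","D"}]
print(dissociation("A", "B", txns), dissociation("C", "D", txns))  # 0.5 0.2
print(togetherness("A", "B", txns))                                # 0.444...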
Objective Measures
• Weka uses other objective measures:
– Lift(A ⇒ B) = confidence(A ⇒ B) / support(B) = support(A ∪ B) / (support(A) × support(B))
– Leverage(A ⇒ B) = support(A ∪ B) − support(A) × support(B)
– Conviction(A ⇒ B) = support(A) × support(¬B) / support(A ∪ ¬B)
– Conviction inverts the lift ratio and also computes the support for the RHS not being true.
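A sketch of the three measures from plain support fractions; applied to the table above, they separate C ⇒ D from A ⇒ B even though both rules have support 0.4:

def measures(sup_a, sup_b, sup_ab):
    lift = sup_ab / (sup_a * sup_b)
    leverage = sup_ab - sup_a * sup_b
    # support(A and not B) = support(A) - support(AB)
    conviction = sup_a * (1 - sup_b) / (sup_a - sup_ab) if sup_a > sup_ab else float("inf")
    return lift, leverage, conviction

print(measures(0.7, 0.6, 0.4))  # A => B: lift ~0.95, leverage -0.02, conviction ~0.93
print(measures(0.5, 0.5, 0.4))  # C => D: lift 1.60, leverage 0.15, conviction 2.50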
Modifications of Apriori Algorithm
• Techniques to reduce computation time:
• Hash-based techniques
• Transaction reduction
• Sampling
• Dynamic itemset counting
Frequent Pattern Mining Variations
• Type of values handled
• Levels of abstraction
• Number of data dimensions
• Kinds of patterns to be mined
• Completeness of patterns to be mined
• Kind of rules to be mined
Type of Value Handled
Binary / Boolean
• Absence of items may help in improving the discovery of association rules but does not directly contribute to rule mining.
Quantitative
• In certain applications, the absence of items may sometimes be as important as their presence.
• In medical applications, it has been found that both the presence and the absence of symptoms need to be considered in discovering association rules.
Quantitative Association Rules
For numeric attributes like age, salary, etc., binary association rule mining is not applicable. There are two basic approaches to the treatment of quantitative attributes:
• Static
• Dynamic
Static Discretisation
Quantitative attributes are discretised using predefined concept hierarchies.
Say, the original numeric values of the income attribute may be replaced by interval labels such as
“0…10K”, “11…20K”, and so on.
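A minimal sketch of static discretisation as a lookup that maps numeric values to the predefined interval labels (the boundaries are assumptions for illustration):

def income_bucket(income):
    # map a numeric income to its predefined interval label
    if income <= 10_000:
        return "0...10K"
    elif income <= 20_000:
        return "11...20K"
    else:
        return "21...30K"  # further intervals follow the same pattern

print([income_bucket(v) for v in (4_000, 15_000, 22_000)])
# ['0...10K', '11...20K', '21...30K']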
Dynamic Discretisation
Quantitative attributes are discretised (clustered) into “bins” based on the distribution of the data. After verification against the minsup and minconf thresholds, the following rules may be obtained:
age(x, 5) ⇒ studies(x, “in school”)
age(x, 6) ⇒ studies(x, “in school”)
⁞
age(x, 17) ⇒ studies(x, “in school”)
age(x, 18) ⇒ studies(x, “in school”)
Dynamic Discretisation
• ARCS (Association Rule Clustering System), used for mining quantitative rules, may be used for classification with rules of the form
Aquant1 ∧ Aquant2 ∧ … ∧ Aquantn ⇒ Acat
where Aquant1, Aquant2, etc. are tests on numeric attribute ranges and Acat is the class label assigned after the training step.
Dynamic Discretisation
Using ARCS (Association Rule Clustering System), a composite rule may be formed as
age(x, “5…18”) ⇒ studies(x, “in school”)
In a similar way, two-dimensional quantitative rules can also be formed:
age(x, “25…40”) ∧ income(x, “20K…40K”) ⇒ buys(x, “new car”)
Levels of Abstractions
[Concept hierarchy figure: root “All” with children Pen, Ink and Writing Pad; below them, types such as Dot and Fountain pens, Bottle and Cartridge ink, Blank and Ruled pads; at the lowest level, brands such as Parker, Pioneer, Pilot, Oxford, Link, …]
Multilevel Association Rule
Using
• Uniform minimum support
• Reduced minimum support at lower levels
• Group-based minimum support
Rules over Taxonomies
• The items used for rule mining may not be at the same level. There can be an in-built taxonomy among the items. An example of a taxonomy applicable to market basket data:
[Taxonomy: Clothes → {Outerwear, Shirts}; Outerwear → {Track Suits, Track Pants}; Footwear → {Shoes, Snickers}]
This taxonomy implies:
• Track Suits is-a Outerwear, Outerwear is-a Clothes, etc.
Rules over Taxonomies
Application domains may need rules at different levels of the taxonomy.
Trivial Rule: if Ŷ is an ancestor of Y, then the rule Y ⇒ Ŷ is trivial.
Shoes ⇒ Footwear (a rule with 100% confidence)
Rules across Levels
• The rule Outerwear ⇒ Snickers does not imply either Track Suits ⇒ Snickers or Track Pants ⇒ Snickers.
So, a rule at a higher level does not imply the same rule at a lower level of the taxonomy.
Rules across Levels
• The rule Track Suits ⇒ Snickers definitely implies the rule Outerwear ⇒ Snickers.
So, a rule at a lower level definitely implies the same rule at a higher level of the taxonomy.
Interest Measure
• To find rules whose support is more than R times the expected value, or whose confidence is more than R times the expected value, for some user-specified constant R.
Rule (with Taxonomies) Generation
Steps
1. Find frequent itemsets.
2. Use the frequent itemsets to generate the desired rules.
3. Prune all uninteresting rules from this set.
The Database
TID | Items
1   | Shirts
2   | Track Suits, Snickers
3   | Track Pants, Snickers
4   | Shoes
5   | Shoes
6   | Track Suits

minsup = 30%, minconf = 60%
Frequent Itemset & Taxonomies
Itemset                | Sup (out of 6)
{Track Suits}          | 2
{Outerwear}            | 3
{Clothes}              | 4
{Shoes}                | 2
{Snickers}             | 2
{Footwear}             | 4
{Outerwear, Snickers}  | 2
{Clothes, Snickers}    | 2
{Outerwear, Footwear}  | 2
{Clothes, Footwear}    | 2
Rules
Rule                   | Sup% | Conf%
Outerwear ⇒ Snickers   | 33   | 66
Outerwear ⇒ Footwear   | 33   | 66
Snickers ⇒ Outerwear   | 33   | 100
Snickers ⇒ Clothes     | 33   | 100
Rule under Item Constraints
Some applications may need association rules under user-specified constraints on items. When a taxonomy is present, these constraints may be specified using the taxonomy.
Rule under Item Constraints
(Track Suits ∧ Shoes) ∨ (descendants(Clothes) ∧ ¬ancestors(Snickers))
• A Boolean expression representing a constraint.
• Allow rules that contain either both Track Suits and Shoes, or Clothes or any descendant of Clothes, and that do not contain Snickers or its ancestor Footwear.
Rule under Item Constraints
Exploiting the hierarchy does not stop the generation of association rules among items at the same level. Such rules are therefore called Generalized Association Rules.
Number of Data Dimensions
• Single dimension
– discrete predicate: buy(X, “Pen”) ⇒ buy(X, “Ink”)
• Multi-dimension
– discrete predicates: age(X, “9..21”) ∧ occupation(X, “Student”) ⇒ buy(X, “Pen”)
– multiple occurrences of a predicate: age(X, “9..21”) ∧ occupation(X, “Student”) ∧ buy(X, “Pen”) ⇒ buy(X, “Ink”)
Sequential Patterns
A sequential pattern always provides an order.
• In a market basket application, one is not interested in the set of items appearing within a single transaction but tries to find an inter-transaction purchase pattern. So the transactions need to be ordered.
Sequential Patterns
It is assumed that a customer can have only one transaction at a given transaction time.
• An itemset (I) is a non-empty set of items (ij): I = {i1, i2, …, in}
• A sequence (s) is an ordered list of itemsets or events (ej): s = {e1 e2 … em}, where ei occurs before ej for i < j
Sequential Patterns
A sequence is contained in another sequence if each itemset of the first sequence is contained in some itemset of the second sequence, preserving order. The sequence {(3) (4 5) (8)} is contained in {(7) (3 8) (9) (4 5 6) (8)} since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). The sequence {(3) (5)} is not contained in {(3 5)}, and vice versa.
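A sketch of this containment test: each itemset of the first sequence is greedily matched, in order, against a superset itemset of the second sequence:

def contains(s1, s2):
    # is sequence s1 (a list of itemsets) contained in sequence s2?
    pos = 0
    for e in s1:
        while pos < len(s2) and not set(e) <= set(s2[pos]):
            pos += 1              # scan forward for a superset itemset
        if pos == len(s2):
            return False          # ran out of itemsets in s2
        pos += 1                  # preserve order: move past the match
    return True

print(contains([(3,), (4, 5), (8,)], [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))  # True
print(contains([(3,), (5,)], [(3, 5)]))                                       # False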
Sequential Patterns
• In a set of sequences, a sequence s is maximal if it is not contained in any other sequence.
• For a sequence to be frequent, it must cross the minimum support threshold.
• A frequent sequence is called a sequential pattern.
• A sequential pattern of length l is called an l-pattern.
Discovery of Sequential Patterns
Sequence | Support
{(10)}   | 1
{(20)}   | 1
{(30)}   | 4
{(40)}   | 2
{(50)}   | 1
{(60)}   | 1
{(70)}   | 3
{(90)}   | 3

CustId | Date       | Items
001    | 13/05/2012 | 30
001    | 14/05/2012 | 90
002    | 13/05/2012 | 10, 20
002    | 15/05/2012 | 30
002    | 16/05/2012 | 40, 60, 70
003    | 17/05/2012 | 30, 50, 70
004    | 13/05/2012 | 30
004    | 14/05/2012 | 40, 70
004    | 16/05/2012 | 90
005    | 13/05/2012 | 90

minsup = 25%
Discovery of Sequential Patterns
• L1 = {{(30)}, {(40)}, {(70)}, {(90)}}
• Candidate 2-sequences: C2 = {{(30) (30)}, {(30) (40)}, {(30) (70)}, {(30) (90)}, …, {(90) (90)}, {(30 40)}, …, {(70 90)}}

Sequence  | Support
(10 20)   | 1
(10) (30) | 1
(20) (30) | 1
(30) (40) | 2
(30) (60) | 1
(30) (70) | 2
(30) (90) | 2
(40) (90) | 1
(70) (90) | 1
(40 70)   | 2
Discovery of Sequential Patterns
• L2 = {{(30) (40)}, {(30) (70)}, {(30) (90)}, {(40 70)}}
• Candidate sequences C3 = {{(30) (30) (70)}, {(30) (30) (90)}, {(30) (40 70)}, …, {(40) (30) (70)}, {(40) (30) (90)}, {(40) (40 70)}, …, {(30) (40) (30) (70)}, {(30) (40) (30) (90)}, {(30) (40) (40 70)}, …, {(40 70) (40 70)}, …, {(30) (40 70 90)}}

Sequence    | Support
(30) (40 70) | 2
Discovery of Sequential Patterns
CustId | Sequence
1      | (30) (90)
2      | (10 20) (30) (40 60 70)
3      | (30 50 70)
4      | (30) (40 70) (90)
5      | (90)
If minsup for a maximal sequence = 0.25 (say), then the acceptable sequential patterns are {(30) (90)} and {(30) (40 70)}.
Specification of Time Windows
•User may define a time window within which the patterns are to be discovered.
• If a pattern lacks adequate support within a single time window, even though it crosses minsup when counted across different time windows, it is not considered a valid sequential pattern.
• This effort helps in studying seasonal purchase patterns in case of market basket analysis.
Sequential Patterns over Taxonomies
Similar to rule mining, the items under consideration may not be at the same level.
From the available transactions, if a sequential pattern {(Track Suits) (Shoes)} is found, it also supports patterns like {(Outerwear) (Shoes)}, {(Outerwear) (Footwear)}, etc. These are called generalized sequential patterns.
[Taxonomy as before: Clothes → {Outerwear, Shirts}; Outerwear → {Track Suits, Track Pants}; Footwear → {Shoes, Snickers}]
Data Classification
•Classification is a method where the data instances in a problem domain are distributed among different pre-defined classes or concepts.
• Usually a data instance is placed in only one class. • For the purpose of classification, definite criteria /
rules are defined for the membership of each class.
Data Classification
•Classification is usually done under the supervision of domain experts of the problem domain under consideration. So, classification process involves supervised learning.
• Clustering, on the other hand, is the result of unsupervised learning. Here the class or concept label of each data instance or each cluster is not known. The number of such classes or concepts is pre-defined intuitively.
Data Classification
The classification process has two steps:
1. Build the model from the training data set
– learning a mapping function y = f(X), where y is the associated class label for an instance X.
2. Classify unknown data.
Comparison of Classification Methods
Properties for the comparison:
• Predictive accuracy: the ability of a model to correctly predict the class label of a new data instance.
• Speed: the computational cost, in terms of time, required to generate the model (i.e., to train the classes) and then to classify data.
Comparison of Classification Methods
Properties for the comparison:
• Robustness: the ability of a model to make correct classifications given noisy data or data with missing values.
• Scalability: the response of a model, in the training and classification steps, to increases in data volume.
Classification by Decision Tree Induction
• A Decision Tree is a tree structure.
• Classification is done against a concept.
• The tree is formed by testing an attribute or attribute combination at each node.
• Each branch of the tree corresponds to an outcome of this test.
• The leaf nodes represent the classes.
Decision Tree Concept: Buy New Car
[Decision tree: the root tests INCOME (≤20K / 20-50K / >50K); the ≤20K branch tests MARITAL STATUS (Married → NO, Single → YES); the 20-50K branch tests AGE (<40 → YES, >40 → NO); the >50K branch leads directly to YES]
Decision Tree Induction Algorithm
1. Tree starts as a single node on which training samples are tested.
2. If all the training samples are of the same class, the node becomes a leaf and is labeled with that class.
3. Running an attribute selection algorithm, an attribute is chosen for tree generation (attribute INCOME in the example).
Decision Tree Induction Algorithm
4. A branch is created for each value of the chosen attribute and the samples are partitioned accordingly (three branches under INCOME).
5. Algorithm repeats steps 3 and 4 recursively to form decision tree for the samples at each partition. Once an attribute is considered in a node, it is not considered in any of its descendent nodes.
Decision Tree Induction Algorithm
6. The recursive procedure stops when
i. all samples at a node belong to the same class according to the domain expert;
ii. there is no other attribute on which the samples can be further partitioned; Majority Voting may be employed here to convert the node to a leaf labeled with the class that covers the majority of its samples;
iii. there are no tuples for a given branch.
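As an illustrative sketch (not the induction procedure above itself), scikit-learn's decision tree with the entropy criterion learns a comparable tree from a toy encoding of the buy-new-car example; the training rows below are hypothetical:

from sklearn.tree import DecisionTreeClassifier, export_text

# income: 0 = "<=20K", 1 = "20-50K", 2 = ">50K"; married: 1 = yes; age in years
X = [[0, 1, 35], [0, 0, 28], [1, 1, 30], [1, 0, 45], [2, 1, 50], [2, 0, 25]]
y = ["no", "yes", "yes", "no", "yes", "yes"]   # buys-new-car

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["income", "married", "age"]))
print(tree.predict([[2, 0, 40]]))   # classify an unknown instance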
Tree Pruning
Tree pruning is done to avoid overfitting the data at different nodes. Statistical measures are used to identify and remove branches that are not reliable enough. This results in faster and better classification of unknown data.
• Prepruning
• Postpruning
Prepruning
The tree generation process is stopped after every partitioning. As a result, all newly generated nodes become leaf nodes, with the membership of samples decided by Majority Voting. The goodness of the partitioning is then tested by measures like χ², information gain, etc. If the result falls below a pre-specified threshold, further partitioning of the affected subset of samples is stopped.
Prepruning
• A high threshold would generate an over-simplified tree, while a low threshold may cause hardly any pruning.
Postpruning
• Branches are removed from a fully grown tree. The expected error rate at each non-leaf node is computed as if its sub-tree were pruned. It is compared with the combined error rates along each of its branches, weighted by the proportion of participating samples. If the expected error rate is lower, the sub-tree is removed.
Classification Rule Generation
Each path of a decision tree from the root to a leaf gives rise to an IF-THEN classification rule. From the decision tree in the example, rules may be formed as:
IF income = “≤20K” AND marital-status = “married” THEN buys-new-car = “no”
IF income = “>50K” THEN buys-new-car = “yes”
etc.
Classification Rule Generation
Either during rule generation or during postpruning, redundant paths are pruned. For example, if the following rules are found:
IF income = “≤20K” AND marital-status = “married” THEN buys-new-car = “no”
IF income = “≤20K” AND marital-status = “widow” THEN buys-new-car = “no”
Classification Rule Generation
the two paths are pruned to one as:
IF income = “≤20K” AND marital-status = (“married” OR “widow”) THEN buys-new-car = “no”
Other well-known classification methods are Bayesian Classification, Classification by Backpropagation, k-Nearest Neighbor Classifiers, etc.
Case Study: Dynamic Classification Hierarchy
Classification of Archaeological data:
• A Classification Hierarchy is created over a backend database to generate and update association rules. The Classification Hierarchy is continuously restructured as the database is updated.
• On arrival of a new instance, the system tries to place it in the existing hierarchy. If classification fails, the instance is considered an Exception to the class found to be the closest.
Case Study: Dynamic Classification Hierarchy
Classification of Archaeological data:
• The system initiates restructuring when the number of Exceptions exceeds a predefined threshold value.
Three important operations are used:
1. ADD: adds a new branch to the hierarchy.
2. FUSE: merges two or more classes into one.
3. BREAK: decomposes a class into two or more classes.
Initial Transaction
• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I1 = {a0, a1, a2, a3, a4}
I2 = {a0, a1, a2, a5, a6}
I3 = {a0, b0, b1, b2}
I4 = {a0, b0, b3, b4}
I5 = {a0, b0, b5, b6}
Initial Hierarchy
Exact match at leaf-level classes
• 5 leaf classes
[Hierarchy figure: root C0 {a0} with children C1 {a1, a2} and C2 {b0}; C1 has leaves C11 {a3, a4} and C12 {a5, a6}; C2 has leaves C21 {b1, b2}, C22 {b3, b4} and C23 {b5, b6}]
Add
I6 = {a0, a3, a4, b0, b1, b2, b3, b4}
Approximate match, up to an intermediate level (an exception).
A large number of exceptions may generate a new class.
[Figure: the initial hierarchy with a new leaf C24 {b1, b2, b3, b4} added under C2]
Fuse
[Figure: FUSE. Before: C0 {a0} has children C1 {a1, a2} and C2; C1 has children C11 {a3, a4} … C1n, and C2 has children C21 … C2m. After: C11 is merged into C1, giving C1 {a1, a2, a3, a4}; the other classes are unchanged]
Fuse
• The fuse of two peer classes K1 and K2 is not allowed if there exists any other peer class K3 with
A_K2 ⊆ A_K1 and
A_K2 ⊆ A_K3
Further Transaction
• Universal attribute set:
A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I7 = {a0, a3, a4, b0, b1, b2, b3, b4}
I8 = {a0, a5, a6, b0, b1, b2, b3, b4}
I9 = {a0, a3, a5, b0, b1, b2, b3, b4}
I10 = {a0, a3, a5, b0, b1, b2, b5}
I11 = {a0, a3, b0, b1, b2, b3, b4}
Break
[Figure: BREAK. The hierarchy of the Add step, with class C24 {b1, b2, b3, b4} decomposed into subclasses C41 {a3, a4} and C42 {a5, a6}]
Cluster Analysis
• The process of partitioning a set of data objects into groups of similar objects is called Clustering. Objects belonging to the same cluster are supposed to be similar, whereas those in different clusters should be dissimilar under the same similarity measure.
Cluster Analysis
• A good clustering algorithm should have the following properties:
• Scalability
• Ability to handle different data types
• Insensitivity to the order of input records
• Working with minimal user intervention
• Constraint-based clustering
• Ability to handle high dimensionality
Clustering Algorithms
• Partitioning Method: given n objects or data instances, a partitioning method constructs k partitions, where k ≤ n. Each group/partition must have at least one object, and each object must belong to exactly one group (this may not hold for a fuzzy partitioning algorithm).
k-Means Algorithm, a Centroid-based Technique
Accepts an input parameter k and partitions n objects into k clusters where intra-cluster similarity is high and inter-cluster similarity is low. Similarity is measured with respect to the mean value of the objects in a cluster, called the centroid of the cluster.
Centroid-based Technique
1. Arbitrarily choose k objects out of n as the initial cluster centers;
2. assign or reassign each object to the cluster to which it is most similar, with respect to the mean value;
3. re-compute the cluster means;
4. repeat steps 2 and 3 until there is no further change or an exit condition is met.
Centroid-based Technique
k-means is an iterative algorithm that works on the convergence of a squared-error criterion of the form
E = Σ(i=1..k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of squared error over all objects, p is a given object and mi is the centroid of cluster Ci.
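A minimal k-means sketch for 2-D points following these steps (random initial centers, squared Euclidean distance; illustrative only):

import random

def kmeans(points, k, iterations=100):
    centers = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # update step: recompute each cluster mean (keep the old center if empty)
        new_centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                       if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:    # convergence: E no longer decreases
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans(pts, k=2)[0])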
k-Medoids Algorithm
The k-means algorithm is sensitive to outliers: a very large value may distort the distribution of data among clusters. To overcome this, a medoid, rather than the mean, is used as the reference point of a cluster. A medoid is the most centrally located object in a cluster.
k-Medoids Algorithm
1. Arbitrarily choose k objects out of n as the initial medoids;
2. assign each remaining object to the cluster with the nearest medoid;
3. randomly select a non-medoid object, Orandom;
k-Medoids Algorithm
4. Compute the total cost S of swapping a current medoid Oj with Orandom (the cost function calculates the difference in squared-error value if the current medoid is replaced by a non-medoid object);
5. if S < 0, swap Oj with Orandom to form a new set of k medoids (the total cost of swapping is the sum of the costs incurred by all non-medoid objects);
k-Medoids Algorithm
6. Repeat steps 2 to 5 until there is no change.
• To judge the quality of a replacement of Oj by Orandom, each non-medoid object p is examined under the following four cases:
• If p belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to some Oi with i ≠ j, then reassign p to Oi.
• If p belongs to the cluster of Oj, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
k-Medoids Algorithm
• If p belongs to the cluster of Oi with i ≠ j, Oj is replaced by Orandom, and p is still closest to Oi, then the assignment of p does not change.
• If p belongs to the cluster of Oi with i ≠ j, Oj is replaced by Orandom, and p is closest to Orandom, then reassign p to Orandom.
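A sketch of the swap-cost test at the heart of these steps, using Manhattan distance and hypothetical 2-D objects; a negative S means the swap lowers the total cost and is accepted:

def total_cost(points, medoids):
    # sum of distances from every object to its nearest medoid
    return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
               for p in points)

def swap_cost(points, medoids, out_medoid, candidate):
    # S = cost after swapping out_medoid for candidate, minus the current cost
    trial = [candidate if m == out_medoid else m for m in medoids]
    return total_cost(points, trial) - total_cost(points, medoids)

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (100, 100)]  # note the outlier
medoids = [(1, 1), (8, 8)]
print(swap_cost(pts, medoids, (8, 8), (9, 8)))  # -1: the swap is accepted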
Parallel Association Rule Mining Algorithms
Challenges include:
• synchronization and communication minimization
• disk I/O minimization
• workload balancing
Parallel Association Rule Mining Algorithms
Strategies:
• Distributed vs. shared-memory architecture: SM needs more synchronization (locking, etc.), while for DM, message passing incurs higher communication overhead.
• Data vs. task parallelism.
• Static vs. dynamic parallelism.
Sources & References
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2007.
2. Willi Klosgen and Jan M. Zytkow, "Handbook of Data Mining and Knowledge Discovery", 2002.
3. R. Srikant, "Fast Algorithms for Mining Association Rules and Sequential Patterns", Ph.D. Thesis, University of Wisconsin-Madison, 1996.
4. R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases", Proc. ACM SIGMOD, pp. 207-216, 1993.
Sources & References
5. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proc. International Conference on Very Large Databases, 1994.
6. J. S. Park, M. S. Chen and P. S. Yu, "An effective hash based algorithm for mining association rules", Proc. ACM SIGMOD, 1995.
7. R. Srikant, Q. Vu and R. Agrawal, "Mining association rules with item constraints", Proc. International Conference on Knowledge Discovery in Databases, 1997.
Sources & References
8. K. Ali, S. Manganaris and R. Srikant, "Partial classification using association rules", Proc. International Conference on Knowledge Discovery in Databases, 1997.
9. S. Pal and A. Bagchi, "Association against Dissociation: some pragmatic considerations for Frequent Itemset generation under Fixed and Variable Thresholds", ACM SIGKDD Explorations, Vol. 7, Issue 2, Dec. 2005, pp. 151-159.
Sources & References
10. S. Ray and A. Bagchi, "Rule Generation by Boolean Minimization: Experience with Coronary Bifurcation Stenting in Angioplasty", ReTIS 2006.
11. S. Maitra and A. Bagchi, "Dynamic restructuring of classification hierarchy towards data mining", Proc. International Conference on Management of Data, 1998.
12. T. G. Dietterich and R. S. Michalski, "Discovering patterns in sequences of events", Artificial Intelligence, vol. 25, pp. 187-232, 1985.
Sources & References
13. R. Agrawal and R. Srikant, "Mining sequential patterns", Proc. IEEE International Conference on Data Engineering, 1995.
14. R. Srikant and R. Agrawal, "Mining sequential patterns: generalizations and performance improvements", Proc. International Conference on Extending Database Technology, 1996.
15. M. J. Zaki, "Parallel and distributed association mining: a survey", IEEE Concurrency, 7(4), pp. 14-25, 1999.
Research Challenges
Areas:
• Query languages
• Architecture
• Text mining
• Multimedia mining
• Spatial / temporal analysis
• Graph mining
THANK YOU