Homework 4: Solutions CS4445/B12 Provided by: Kenneth J. Loomis
Dec 18, 2015
Homework 4 Solutions
CLASSIFICATION RULES: RIPPER ALGORITHM
RIPPER: First Rule
• The first thing that needs to be determined is the consequent of the rule. Recall that a rule is made up of an antecedent and a consequent.
• The table below contains the frequency counts of the possible consequents of the rules from the userprofile dataset, using budget as the classification attribute:
Rule Frequency
… -> budget=low 35
… -> budget=medium 91
… -> budget=high 5
… -> budget=? 7
• We can see that budget=high has the lowest frequency count in our training dataset, so we choose it as the first consequent for which we will find rules.
• Note: I have included missing values here, as one could classify the target as missing. Alternatively, these instances could be removed.
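The choice of the first consequent can be sketched in a few lines of Python. This is only an illustration of "pick the least frequent class first"; the dictionary simply restates the frequency table above, and the function name is hypothetical.

```python
# Class frequencies for the budget attribute, copied from the table above.
class_counts = {"low": 35, "medium": 91, "high": 5, "?": 7}

def classes_by_increasing_frequency(counts):
    """Return the class labels sorted from least to most frequent."""
    return sorted(counts, key=counts.get)

order = classes_by_increasing_frequency(class_counts)
first_target = order[0]  # the class RIPPER builds rules for first

print(order)
print(first_target)
```

If the instances with a missing budget value were removed instead, the "?" entry would simply be dropped from the dictionary; the first target would still be budget=high.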
RIPPER: First Rule
• Next we attempt to find the first condition in the antecedent. We need only look at possible conditions that exist in the 5 instances that have budget=high.
• The list of possible conditions is in the table below.
Rule: ___ -> budget=high
smoker=true ambience=family personality=hard-worker
smoker=false ambience=friends personality=conformist
drink_level=abstemious transport=car owner personality=hunter-ostentatious
drink_level=casual drinker transport=public personality=thrifty-protector
drink_level=social drinker marital_status=single religion=none
dress_preference=no preference interest=technology religion=mormon
dress_preference=informal interest=none religion=christian
dress_preference=formal interest=variety activity=student
RIPPER: First Rule
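• Next we determine the information gain for each of the candidate rules in the table.
• Below is a detailed example of the calculation for the rule smoker = true -> budget = high.
Given:
p_0 is the number of instances such that budget=high
n_0 is the number of instances such that budget ≠ high
p_1 is the number of instances such that smoker=true and budget=high
n_1 is the number of instances such that smoker=true but budget ≠ high
\[ \text{Gain} = p_1 \left[ \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right] = 1 \times \left[ \log_2 \frac{1}{26} - \log_2 \frac{5}{138} \right] \approx 0.0862 \]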
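As a rough check of the gain calculation, here is a minimal Python sketch of the gain that RIPPER uses, p1 × [log2(p1/(p1+n1)) − log2(p0/(p0+n0))], where p0/n0 are the positive/negative counts before adding the condition and p1/n1 the counts after. The function name is hypothetical. The counts are assumptions inferred from this document: 138 instances in total (the sum of the budget frequencies, including the 7 missing values), 5 of them with budget=high, 26 instances with smoker=true of which 1 has budget=high, and a single religion=mormon instance that has budget=high.

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """Information gain of adding one condition to a rule.

    p0, n0: positive/negative instances covered before adding the condition.
    p1, n1: positive/negative instances covered after adding the condition.
    """
    if p1 == 0:
        return 0.0  # convention chosen for this sketch: no positives, no gain
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Baseline for the empty rule "-> budget=high" (assumed counts, see lead-in).
P0, N0 = 5, 133

gain_smoker_true = foil_gain(P0, N0, p1=1, n1=25)
gain_mormon = foil_gain(P0, N0, p1=1, n1=0)

print(round(gain_smoker_true, 4))  # 0.0862, as in the table of gains
print(round(gain_mormon, 4))       # 4.7866, as in the table of gains
```

Both values agree with the information-gain table for the first condition, which suggests the assumed counts are the ones used on the slides.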
RIPPER: First Rule
• Here we see a list of the information gain for each of the possible first conditions in the antecedent.
Rule: ___ -> budget=high  Info Gain  Rule: ___ -> budget=high  Info Gain
smoker=true 0.0862 marital_status=single 0.8889
smoker=false 0.07365 interest=technology 3.6049
drink_level=abstemious 2.0974 interest=none -0.1203
drink_level=casual drinker -0.7680 interest=variety 3.6049
drink_level=social drinker -0.5353 personality=hard-worker -1.1441
dress_preference=no preference 0.1174 personality=conformist 1.9792
dress_preference=informal -0.3426 personality=hunter-ostentatious 1.2016
dress_preference=formal -0.5710 personality=thrifty-protector -0.1428
ambience=family -0.6854 religion=none -0.1203
ambience=friends 2.5440 religion=mormon 4.7866
transport=car owner 6.7865 religion=christian 1.9792
transport=public -1.5710 activity=student -0.1343
RIPPER: First Rule
• Since the following rule results in the highest information gain, we select that as the first condition of our rule:
transport = car owner -> budget = high
• Now we can use the numbers of instances covered by this rule as the new baseline counts (4 instances with budget=high and 30 with budget ≠ high), and we calculate the gain for all the possible second conditions in the next set of calculations.
RIPPER: First Rule
• Next we attempt to find the second condition in the antecedent. We need only look at possible conditions that exist in the 4 instances that have transport = car owner and budget=high.
• The list of possible conditions is in the table below.
Rule: transport=car owner and ___ -> budget=high
smoker=false ambience=friends personality=thrifty-protector
drink_level=abstemious marital_status=single religion=none
drink_level=casual drinker interest=technology religion=mormon
dress_preference=no preference interest=none religion=christian
dress_preference=informal interest=variety activity=student
dress_preference=elegant personality=hard-worker
ambience=family personality=hunter-ostentatious
RIPPER: First Rule
• Here we see a list of the information gain for each of the possible second conditions in the antecedent.
Rule: transport=car owner and ___ -> budget=high  Info Gain  Rule: transport=car owner and ___ -> budget=high  Info Gain
smoker=false 2.5121 interest=none 0.0875
drink_level=abstemious 5.0173 interest=variety 2.5602
drink_level=casual drinker -0.6130 personality=hard-worker -1.1605
dress_preference=no preference -.06097 personality=hunter-ostentatious 0.7655
dress_preference=informal 0.7655 personality=thrifty-protector 1.5311
dress_preference=elegant 3.0875 religion=none -0.0824
ambience=family -0.6130 religion=mormon 3.0875
ambience=friends 1.5075 religion=christian 3.0875
marital_status=single 2.7570 activity=student -0.0840
interest=technology 2.5602
RIPPER: First Rule
• Since the following rule results in the highest information gain, we select that as the second condition of our rule:
transport = car owner and drink_level = abstemious -> budget = high
• Now we can use the numbers of instances covered by this rule as the new baseline counts (3 instances with budget=high and 5 with budget ≠ high), and we calculate the gain for all the possible third conditions in the next set of calculations.
RIPPER: First Rule
• Next we attempt to find the third condition in the antecedent. We need only look at possible conditions that exist in the 3 instances that have transport = car owner and drink_level = abstemious and budget=high.
• The list of possible conditions is in the table below.
Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high
smoker=false interest=technology personality=thrifty-protector
dress_preference=no preference interest=none religion=none
dress_preference=formal interest=variety religion=catholic
ambience=family personality=hard-worker religion=christian
ambience=friends personality=hunter-ostentatious activity=student
marital_status=single
RIPPER: First Rule
• Here we see a list of the information gain for each of the possible third conditions in the antecedent.
Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high  Info Gain  Rule: transport=car owner and drink_level=abstemious and ___ -> budget=high  Info Gain
smoker=false 0 interest=variety 0.4515
dress_preference=no preference -0.3399 personality=hard-worker -0.5850
dress_preference=formal 1.4513 personality=hunter-ostentatious 1.4150
ambience=family -0.5850 personality=thrifty-protector -0.1699
ambience=friends 2.8300 religion=none -0.1699
marital_status=single 1.2415 religion=catholic -0.5850
interest=technology 0.4515 religion=christian 1.4150
interest=none -0.5850 activity=student .01826
RIPPER: First Rule
• Since the following rule results in the highest information gain, we select that as the third condition of our rule:
transport = car owner and drink_level = abstemious and ambience = friends -> budget = high
• Note that this rule covers only positive examples (i.e., budget=high data instances). Since it doesn't cover any negative examples, there is no need to add more conditions to the rule. RIPPER's construction of the first rule is now complete.
RIPPER: Pruning the First Rule
First rule: transport = car owner and drink_level = abstemious and ambience = friends -> budget = high
In order to decide if/how to prune this rule, RIPPER will:
• use a validation set (that is, a piece of the training set that was kept apart and not used to construct the rule)
• use a metric for pruning: v = (p-n)/(p+n), where
  • p: # of positive examples covered by the rule in the validation set
  • n: # of negative examples covered by the rule in the validation set
• pruning method: delete the final sequence of conditions whose removal maximizes v. That is, RIPPER calculates v for each of the following pruned versions of the rule and keeps the version of the rule with maximum v:
  • transport = car owner & drink_level = abstemious & ambience = friends -> budget = high
  • transport = car owner & drink_level = abstemious -> budget = high
  • transport = car owner -> budget = high
  • -> budget = high
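The pruning step can be sketched as follows. This is a minimal illustration, not the homework's actual computation: the validation-set counts are hypothetical (the slides do not report them), and the function names are made up for this sketch.

```python
def prune_value(p, n):
    """Pruning metric v = (p - n) / (p + n) on the validation set."""
    if p + n == 0:
        return float("-inf")  # the rule covers nothing in the validation set
    return (p - n) / (p + n)

def best_pruned_rule(conditions, coverage):
    """Return the prefix of `conditions` with the highest v.

    `coverage` maps the number of kept conditions to (p, n) on the
    validation set. Longer prefixes are tried first, so ties keep the
    more specific rule.
    """
    prefixes = [conditions[:k] for k in range(len(conditions), -1, -1)]
    return max(prefixes, key=lambda pre: prune_value(*coverage[len(pre)]))

rule = ["transport=car owner", "drink_level=abstemious", "ambience=friends"]

# HYPOTHETICAL validation counts (p, n), keyed by number of conditions kept.
hypothetical_coverage = {3: (1, 0), 2: (2, 1), 1: (3, 4), 0: (3, 40)}

pruned = best_pruned_rule(rule, hypothetical_coverage)
print(pruned)
```

With these made-up counts the full three-condition rule has v = 1 and is kept unchanged; different validation counts could favor a shorter prefix.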
Homework 4 Solutions
ASSOCIATION RULES: APRIORI ALGORITHM
Apriori: Level 1
• We begin the Apriori algorithm by determining the order of the items:
  • Here I will use the order in which the attributes appear, with the values for each attribute in alphabetical order.
• Then all the possible single-item itemsets are generated and the support is calculated for each of them.
• The following slide shows the complete list of possible items.
• Support is calculated in the following manner:
supp(X) = (number of instances that contain X) / (total number of instances)
• Since we know the minimum acceptable support count is 55, we need only look at the numerator of this ratio to determine whether or not to keep this item.
Apriori: Level 1
Candidate Itemsets with Support Count
smoker=false 109
smoker=true 26
drink_level=abstemious 51
drink_level=casual drinker 47
drink_level=social drinker 40
dress_preference=elegant 4
dress_preference=formal 41
dress_preference=informal 53
dress_preference=no preference 35
ambience=family 70
ambience=friends 46
ambience=solitary 16
transport=car owner 34
transport=on foot 14
transport=public 82
marital_status=single 122
marital_status=married 10
interest=eco-friendly 16
interest=none 30
interest=technology 36
interest=variety 50
personality=conformist 7
personality=hard-worker 61
personality=hunter-ostentatious 12
personality=thrifty-protector 58
religion=catholic 99
religion=christian 7
religion=jewish 1
religion=mormon 1
religion=none 30
activity=professional 15
activity=student 113
activity=unemployed 2
activity=working-class 1
budget=high 5
budget=low 35
budget=medium 91
• We keep only the items that meet the minimum support count of 55.
Apriori: Level 1
Itemsets with Support Count
smoker=false 109
ambience=family 70
transport=public 82
marital_status=single 122
personality=hard-worker 61
personality=thrifty-protector 58
religion=catholic 99
activity=student 113
budget=medium 91
• We keep the item sets above as they have enough support, and use them to generate candidate item sets for the next level.
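The level 1 filtering step can be sketched in Python as below. The counts are copied from the candidate table (only a subset of the infrequent items is included to keep the example short). The total of 138 instances is an assumption, taken as the sum of the budget frequencies including the 7 missing values; it is only used to display support as a ratio.

```python
MIN_SUPPORT_COUNT = 55
N_INSTANCES = 138  # assumption: 35 + 91 + 5 + 7 instances in the dataset

# Support counts copied from the level 1 candidate table (partial list).
item_counts = {
    "smoker=false": 109, "smoker=true": 26,
    "drink_level=abstemious": 51, "dress_preference=informal": 53,
    "ambience=family": 70, "ambience=friends": 46,
    "transport=car owner": 34, "transport=public": 82,
    "marital_status=single": 122, "marital_status=married": 10,
    "interest=variety": 50,
    "personality=hard-worker": 61, "personality=thrifty-protector": 58,
    "religion=catholic": 99, "religion=none": 30,
    "activity=student": 113,
    "budget=low": 35, "budget=medium": 91, "budget=high": 5,
}

# Keep the items whose support count reaches the threshold.
frequent_1 = {item: c for item, c in item_counts.items() if c >= MIN_SUPPORT_COUNT}

for item, count in frequent_1.items():
    print(f"{item}: count={count}, support={count / N_INSTANCES:.3f}")
```

Comparing counts against 55 is equivalent to comparing the support ratio against 55/138, which is why only the numerator needs to be inspected.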
Apriori: Level 2
• We merge pairs from the level 1 set. Since there are no prefixes here, we must consider all combinations. (Continued on next slide)
Candidate Itemsets with Support Count
smoker=false, ambience=family 59
smoker=false, transport=public 69
smoker=false, marital_status=single 98
smoker=false, personality=hard-worker 49
smoker=false, personality=thrifty-protector 48
smoker=false, religion=catholic 79
smoker=false, activity=student 90
smoker=false, budget=medium 75
ambience=family, transport=public 46
ambience=family, marital_status=single 63
ambience=family, personality=hard-worker 26
ambience=family, personality=thrifty-protector 33
ambience=family, religion=catholic 57
ambience=family, activity=student 61
ambience=family, budget=medium 54
transport=public, marital_status=single 76
transport=public, personality=hard-worker 28
transport=public, personality=thrifty-protector 44
transport=public, religion=catholic 62
transport=public, activity=student 71
transport=public, budget=medium 54
Apriori: Level 2
Candidate Itemsets with Support Count
marital_status=single, personality=hard-worker 52
marital_status=single, personality=thrifty-protector 51
marital_status=single, religion=catholic 91
marital_status=single, activity=student 107
marital_status=single, budget=medium 79
personality=hard-worker, personality=thrifty-protector 0
personality=hard-worker, religion=catholic 40
personality=hard-worker, activity=student 46
personality=hard-worker, budget=medium 40
personality=thrifty-protector, religion=catholic 45
personality=thrifty-protector, activity=student 50
personality=thrifty-protector, budget=medium 41
religion=catholic, activity=student 84
religion=catholic, budget=medium 67
activity=student, budget=medium 71
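Because level 1 itemsets have no shared prefix to match, the level 2 candidates are simply all pairs of frequent items. A minimal Python sketch, using the nine frequent items in the ordering chosen for this solution:

```python
from itertools import combinations

# Frequent level 1 items, in the attribute order used in this solution.
frequent_1 = [
    "smoker=false", "ambience=family", "transport=public",
    "marital_status=single", "personality=hard-worker",
    "personality=thrifty-protector", "religion=catholic",
    "activity=student", "budget=medium",
]

# combinations() keeps the input order, so each pair is already sorted.
candidates_2 = list(combinations(frequent_1, 2))

print(len(candidates_2))  # 36 candidate pairs
print(candidates_2[0])
```

The 36 pairs correspond to the 21 + 15 candidates listed on the two level 2 slides. The pair of two personality values is generated as well, and its support count of 0 removes it in the next step.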
Apriori: Level 2
Itemsets with Support Count
smoker=false, ambience=family 59
smoker=false, transport=public 69
smoker=false, marital_status=single 98
smoker=false, religion=catholic 79
smoker=false, activity=student 90
smoker=false, budget=medium 75
ambience=family, marital_status=single 63
ambience=family, religion=catholic 57
ambience=family, activity=student 61
transport=public, marital_status=single 76
transport=public, religion=catholic 62
transport=public, activity=student 71
marital_status=single, religion=catholic 91
marital_status=single, activity=student 107
marital_status=single, budget=medium 79
religion=catholic, activity=student 84
religion=catholic, budget=medium 67
activity=student, budget=medium 71
• We keep the item sets above as they have enough support, and use them to generate candidate item sets for the next level.
Apriori: Level 3
• We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.
Itemsets from Level 2
smoker=false, ambience=family
smoker=false, transport=public
smoker=false, marital_status=single
smoker=false, religion=catholic
smoker=false, activity=student
smoker=false, budget=medium
ambience=family, marital_status=single
ambience=family, religion=catholic
ambience=family, activity=student
transport=public, marital_status=single
transport=public, religion=catholic
transport=public, activity=student
marital_status=single, religion=catholic
marital_status=single, activity=student
marital_status=single, budget=medium
religion=catholic, activity=student
religion=catholic, budget=medium
activity=student, budget=medium
Apriori: Level 3
• First we determine the candidates by "joining" itemsets with like prefixes (i.e., the first k-1 items in the itemsets are the same).
• Here we need only match the first item in each of the Level 2 itemsets.
Apriori: Level 3
• That results in this set of potential candidate itemsets.
Potential Candidate Itemsets
smoker=false, ambience=family, transport=public
smoker=false, ambience=family, marital_status=single
smoker=false, ambience=family, religion=catholic
smoker=false, ambience=family, activity=student
smoker=false, ambience=family, budget=medium
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, religion=catholic
smoker=false, transport=public, activity=student
smoker=false, transport=public, budget=medium
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, religion=catholic
ambience=family, marital_status=single, activity=student
ambience=family, religion=catholic, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium
religion=catholic, activity=student, budget=medium
Apriori: Level 3
• We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 2 in each of these itemsets also exist in the level 2 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets.
• The following itemsets can be removed, as the subset noted next to each one does not appear in the Level 2 itemsets. This leaves us the candidate itemsets on the next slide.
Candidate Itemsets That Can be Removed
smoker=false, ambience=family, transport=public (missing subset: ambience=family, transport=public)
smoker=false, ambience=family, budget=medium (missing subset: ambience=family, budget=medium)
smoker=false, transport=public, budget=medium (missing subset: transport=public, budget=medium)
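The subset check can be sketched in Python as follows. The set `frequent_2` restates the 18 level 2 itemsets kept in this solution; the short variable names and the helper function are hypothetical and only serve this illustration.

```python
from itertools import combinations

sf, af, tp = "smoker=false", "ambience=family", "transport=public"
ms, rc = "marital_status=single", "religion=catholic"
st, bm = "activity=student", "budget=medium"

# The 18 frequent level 2 itemsets, each written in the chosen item order.
frequent_2 = {
    (sf, af), (sf, tp), (sf, ms), (sf, rc), (sf, st), (sf, bm),
    (af, ms), (af, rc), (af, st),
    (tp, ms), (tp, rc), (tp, st),
    (ms, rc), (ms, st), (ms, bm),
    (rc, st), (rc, bm), (st, bm),
}

def infrequent_subsets(candidate, frequent_prev):
    """Return the (k-1)-subsets of `candidate` missing from `frequent_prev`."""
    k = len(candidate)
    return [s for s in combinations(candidate, k - 1) if s not in frequent_prev]

for cand in [(sf, af, tp), (sf, af, bm), (sf, tp, bm), (sf, af, ms)]:
    missing = infrequent_subsets(cand, frequent_2)
    print(cand, "-> prune" if missing else "-> keep", missing)
```

The first three candidates are pruned because of one missing pair each, while (smoker=false, ambience=family, marital_status=single) passes the check and goes on to have its support counted.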
Apriori: Level 3
Candidate Itemsets with Support Count
smoker=false, ambience=family, marital_status=single 53
smoker=false, ambience=family, religion=catholic 46
smoker=false, ambience=family, activity=student 52
smoker=false, transport=public, marital_status=single 63
smoker=false, transport=public, religion=catholic 52
smoker=false, transport=public, activity=student 58
smoker=false, marital_status=single, religion=catholic 72
smoker=false, marital_status=single, activity=student 85
smoker=false, marital_status=single, budget=medium 65
smoker=false, activity=student, budget=medium 58
ambience=family, marital_status=single, religion=catholic 50
ambience=family, marital_status=single, activity=student 57
ambience=family, religion=catholic, activity=student 51
transport=public, marital_status=single, religion=catholic 57
transport=public, marital_status=single, activity=student 67
transport=public, religion=catholic, activity=student 59
marital_status=single, religion=catholic, activity=student 80
marital_status=single, religion=catholic, budget=medium 80
marital_status=single, activity=student, budget=medium 59
religion=catholic, activity=student, budget=medium 53
Apriori: Level 3
Level 3 Itemsets with Support Count
smoker=false, transport=public, marital_status=single 63
smoker=false, transport=public, activity=student 58
smoker=false, marital_status=single, religion=catholic 72
smoker=false, marital_status=single, activity=student 85
smoker=false, marital_status=single, budget=medium 65
smoker=false, activity=student, budget=medium 58
ambience=family, marital_status=single, activity=student 57
transport=public, marital_status=single, religion=catholic 57
transport=public, marital_status=single, activity=student 67
transport=public, religion=catholic, activity=student 59
marital_status=single, religion=catholic, activity=student 80
marital_status=single, religion=catholic, budget=medium 80
marital_status=single, activity=student, budget=medium 59
• We keep the item sets above as they have enough support, and use them to generate candidate item sets for the next level.
Apriori: Level 4
• We generate the next level of candidate sets, but before we calculate the support we can use the Apriori principle to determine if they are viable candidates.
Level 3 Itemsets
smoker=false, transport=public, marital_status=single
smoker=false, transport=public, activity=student
smoker=false, marital_status=single, religion=catholic
smoker=false, marital_status=single, activity=student
smoker=false, marital_status=single, budget=medium
smoker=false, activity=student, budget=medium
ambience=family, marital_status=single, activity=student
transport=public, marital_status=single, religion=catholic
transport=public, marital_status=single, activity=student
transport=public, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, budget=medium
marital_status=single, activity=student, budget=medium
• First we determine the candidates by "joining" itemsets with like prefixes (i.e., the first k-1 items in the itemsets match).
• Here we need only match the first two items in the itemset.
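The join step just described can be sketched in Python as below, using the 13 frequent level 3 itemsets from this solution. The helper name, the short variable names, and the explicit `RANK` table (which encodes the attribute ordering chosen at the start) are assumptions made for this sketch.

```python
sf, af, tp = "smoker=false", "ambience=family", "transport=public"
ms, rc = "marital_status=single", "religion=catholic"
st, bm = "activity=student", "budget=medium"

# Position of each item in the ordering used throughout this solution.
RANK = {item: i for i, item in enumerate([sf, af, tp, ms, rc, st, bm])}

frequent_3 = [
    (sf, tp, ms), (sf, tp, st), (sf, ms, rc), (sf, ms, st), (sf, ms, bm),
    (sf, st, bm), (af, ms, st), (tp, ms, rc), (tp, ms, st), (tp, rc, st),
    (ms, rc, st), (ms, rc, bm), (ms, st, bm),
]

def apriori_join(frequent_prev):
    """Join itemsets that agree on everything except their last item."""
    candidates = []
    for a in frequent_prev:
        for b in frequent_prev:
            # Same prefix, and a's last item comes before b's last item.
            if a[:-1] == b[:-1] and RANK[a[-1]] < RANK[b[-1]]:
                candidates.append(a + (b[-1],))
    return candidates

candidates_4 = apriori_join(frequent_3)
for cand in candidates_4:
    print(", ".join(cand))
```

On this input the join yields six four-item candidates, one for each pair of level 3 itemsets that share their first two items; the subset check is then applied to these before any support is counted.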
Apriori: Level 4
Potential Candidate Item Sets
smoker=false, transport=public, marital_status=single, activity=student
smoker=false, marital_status=single, religion=catholic, activity=student
smoker=false, marital_status=single, religion=catholic, budget=medium
smoker=false, marital_status=single, activity=student, budget=medium
transport=public, marital_status=single, religion=catholic, activity=student
marital_status=single, religion=catholic, activity=student, budget=medium
• That results in this set of candidate itemsets.
• We have one final step before calculating the support: we can eliminate unnecessary candidates. We must check that all subsets of size 3 in each of these itemsets also exist in the level 3 set. We can make this a little easier by ignoring the prefix subsets, as we know those exist because we used them to create the itemsets.
• Here we again eliminate candidates from consideration; the offending subsets are bolded.
Apriori: Level 4
Candidate Itemsets with Support Count
smoker=false, marital_status=single, religion=catholic, activity=student 63
smoker=false, marital_status=single, activity=student, budget=medium 53
• In the end we keep only a single itemset that has enough support for this level.
• The following slide depicts the complete set of frequent itemsets.
Level 4 Itemsets with Support Count
smoker=false, marital_status=single, religion=catholic, activity=student 63
Apriori: Complete Itemset
Itemsets with Support Count
smoker=false 109
ambience=family 70
marital_status=single 122
personality=hard-worker 61
transport=public 82
religion=catholic 99
activity=student 113
budget=medium 91
smoker=false, ambience=family 59
smoker=false, transport=public 69
smoker=false, marital_status=single 98
smoker=false, religion=catholic 79
smoker=false, activity=student 90
smoker=false, budget=medium 75
ambience=family, marital_status=single 63
ambience=family, religion=catholic 57
ambience=family, activity=student 61
transport=public, marital_status=single 76
transport=public, religion=catholic 62
transport=public, activity=student 71
marital_status=single, religion=catholic 91
marital_status=single, activity=student 107
marital_status=single, budget=medium 79
religion=catholic, activity=student 84
religion=catholic, budget=medium 67
activity=student, budget=medium 71
smoker=false, transport=public, marital_status=single 63
smoker=false, transport=public, activity=student 58
smoker=false, marital_status=single, religion=catholic 72
smoker=false, marital_status=single, activity=student 85
smoker=false, marital_status=single, budget=medium 65
smoker=false, activity=student, budget=medium 58
ambience=family, marital_status=single, activity=student 57
transport=public, marital_status=single, religion=catholic 57
transport=public, marital_status=single, activity=student 67
transport=public, religion=catholic, activity=student 59
marital_status=single, religion=catholic, activity=student 80
marital_status=single, religion=catholic, budget=medium 80
marital_status=single, activity=student, budget=medium 59
smoker=false, marital_status=single, religion=catholic, activity=student 63
Rule Construction
Largest itemset: Let's call this itemset I4:
I4: smoker=false, marital_status=single, religion=catholic, activity=student
Rules constructed from I4 with 2 items in the antecedent:
R1: smoker=false, marital_status=single -> religion=catholic, activity=student
conf(R1) = supp(I4)/supp(smoker=false, marital_status=single) = 63/98 = 64.29%
R2: smoker=false, religion=catholic -> marital_status=single, activity=student
conf(R2) = supp(I4)/supp(smoker=false, religion=catholic) = 63/79 = 79.75%
R3: smoker=false, activity=student -> marital_status=single, religion=catholic
conf(R3) = supp(I4)/supp(smoker=false, activity=student) = 63/90 = 70%
R4: marital_status=single, religion=catholic -> smoker=false, activity=student
conf(R4) = supp(I4)/supp(marital_status=single, religion=catholic) = 63/91 = 69.23%
R5: marital_status=single, activity=student -> smoker=false, religion=catholic
conf(R5) = supp(I4)/supp(marital_status=single, activity=student) = 63/107 = 58.88%
R6: religion=catholic, activity=student -> smoker=false, marital_status=single
conf(R6) = supp(I4)/supp(religion=catholic, activity=student) = 63/84 = 75%
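The confidence values can be recomputed with a short Python sketch. The support counts are the ones reported for I4 and for the six two-item antecedents; the dictionary layout and variable names are only illustrative, and percentages are rounded to two decimals.

```python
SUPP_I4 = 63  # support count of the four-item itemset I4

# Support counts of the two-item antecedents of R1..R6.
antecedent_support = {
    "R1": 98,   # smoker=false, marital_status=single
    "R2": 79,   # smoker=false, religion=catholic
    "R3": 90,   # smoker=false, activity=student
    "R4": 91,   # marital_status=single, religion=catholic
    "R5": 107,  # marital_status=single, activity=student
    "R6": 84,   # religion=catholic, activity=student
}

# conf(R) = supp(I4) / supp(antecedent of R)
confidence = {rule: SUPP_I4 / supp for rule, supp in antecedent_support.items()}

for rule, conf in confidence.items():
    print(f"conf({rule}) = {SUPP_I4}/{antecedent_support[rule]} = {conf:.2%}")

best_rule = max(confidence, key=confidence.get)
print(best_rule)  # R2 has the highest confidence
```

Since every rule shares the same numerator, the rule whose antecedent has the smallest support count (R2, with 79) necessarily has the highest confidence.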