ECML / PKDD 2004 Discovery Challenge
Mining Strong Associations and Exceptions in the STULONG Data Set
Eduardo Corrêa Gonçalves and Alexandre Plastino*
*work sponsored by CNPq research grant 300879/00-8
Universidade Federal Fluminense
Department of Computer Science
Niterói, Rio de Janeiro, Brazil
{egoncalves,plastino}@ic.uff.br - http://www.ic.uff.br
R1 is a strong association rule, while R2 is not true.
In order to mine interesting information, we need to evaluate the type of dependence between the antecedent and the consequent of a rule.
Lift: how much more frequent B is when A occurs.
Lift(A → B) = Conf(A → B) / Sup(B)
RI - Rule Interest (G. Piatetsky-Shapiro, 1991): computes the percentage of tuples matched by an association rule beyond the percentage expected if A and B were independent.
RI(A → B) = Sup(A → B) - Sup(A) x Sup(B)
We believe that the use of different interest measures (Sup, Conf, Lift and RI) provides alternative analyses of the same data, giving a better understanding of the associations.
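As a rough illustration of how these measures relate, the sketch below derives Conf, Lift and RI from three support values. The function name `rule_measures` and the dictionary layout are assumptions made for this example and are not taken from the talk. The inputs are the Group A supports reported later in the talk for the rule (Alcohol = "No") → (Cholesterol = "desirable").

```python
def rule_measures(sup_a, sup_b, sup_ab):
    """Return Conf, Lift and RI for a rule A -> B.

    sup_a, sup_b: supports of the antecedent and the consequent (fractions).
    sup_ab: support of the rule, i.e. fraction of tuples matching A and B.
    """
    conf = sup_ab / sup_a          # Conf(A -> B) = Sup(A -> B) / Sup(A)
    lift = conf / sup_b            # Lift(A -> B) = Conf(A -> B) / Sup(B)
    ri = sup_ab - sup_a * sup_b    # RI(A -> B) = Sup(A -> B) - Sup(A) x Sup(B)
    return {"conf": conf, "lift": lift, "ri": ri}


# Group A supports for (Alcohol = "No") -> (Cholesterol = "desirable").
m = rule_measures(sup_a=0.0870, sup_b=0.3370, sup_ab=0.0507)
print({name: round(value, 4) for name, value in m.items()})
```

A Lift above 1 and an RI above 0 both indicate positive dependence. Because the inputs are already rounded to four decimals, the results can differ slightly in the last digits from the figures shown on the slides.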
Lift and RI
1. Atherosclerosis Data Set
2. Multidimensional Association Rules
3. Exceptions
4. Data Preparation
5. Results
6. Summary
Outline of the talk
Exceptions
In our approach, exceptions are association rules that become much weaker in some specific subsets of the database.
Example: Does the rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) become weaker on any subset of the database?
An exception found for this rule means: “among the men who are 50 years old or above, the support value of the association between being a heavy beer consumer and being a heavy smoker is surprisingly smaller than what is expected”.
This exception was obtained because the conventional rule (DailyBeerCons = “>1l”) & (Age = “≥50”) → (Smoking = “>20 cig/day”) did not achieve its expected support.
This expected support is evaluated from the support of the original rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) and the support of the condition (Age = “≥50”).
Let D be a database relation.
Let R: A → B be a multidimensional association rule.
Let Z = {Z1 = z1, ..., Zk = zk} be a set of conditions defined over D, where Z ∩ A ∩ B = ∅. Z is called the probe set.
An exception related to the positive rule R is an implication of the form:
A Z B
Exceptions: Formal Definition
Exceptions are extracted from candidate exceptions. A candidate exception is an expression in the form:
A ∧ Z → B
Exceptions are mined only if the candidates do not achieve an expected support.
This expectation is evaluated based on the support of the original rule A B and the support of the conditions that compose the probe set Z:
ExpSup(A ∧ Z → B) = Sup(A → B) x Sup(Z)
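The expectation above can be sketched in a few lines. The function names and the numeric supports below are hypothetical and chosen only for illustration. The talk does not state how far below the expectation a candidate must fall, so the plain `<` comparison is an assumption.

```python
def expected_support(sup_rule, sup_probe):
    """ExpSup(A & Z -> B) = Sup(A -> B) x Sup(Z), with supports as fractions."""
    return sup_rule * sup_probe


def falls_short(actual_sup, sup_rule, sup_probe):
    """True when the candidate A & Z -> B does not reach its expected support.

    Assumption: any shortfall counts; a real miner would likely use a threshold.
    """
    return actual_sup < expected_support(sup_rule, sup_probe)


# Hypothetical supports, not taken from the STULONG data.
sup_rule, sup_probe, actual = 0.10, 0.30, 0.012
print(expected_support(sup_rule, sup_probe))
print(falls_short(actual, sup_rule, sup_probe))
```

The product form treats the probe set Z as independent of the original rule, so a candidate whose actual support is well below this product signals a negative dependence worth inspecting.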
Candidate Exceptions
The Interest Measure (IM) Index
We developed two interest measures to evaluate the degree of interestingness of an exception.
The IM (Interest Measure) index evaluates the strength (relevance) of an exception.
IM(E) = 1 - (Sup(A ∧ Z → B) / ExpSup(A ∧ Z → B))
An exception E is potentially interesting if the actual support of A ∧ Z → B is much lower than its expected support.
This measure captures the type of dependence between Z and A → B. The closer the value is to 1, the stronger the negative dependence.
The expected support for A ∧ Z → B can be computed as 4.48% x 9.47% = 0.42%.
The actual support for this candidate rule is 0.00%.
IM(A ∧ Z → B) = 1 - (0.00 / 0.42) = 1.00.
However, this exception represents information that is obvious. The IM index could not detect the strong negative dependence between A and Z.
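The worked example can be reproduced with a minimal sketch of the IM index. The function name `im_index` and the guard against a non-positive expectation are assumptions added for this illustration.

```python
def im_index(actual_sup, expected_sup):
    """IM(E) = 1 - Sup(A & Z -> B) / ExpSup(A & Z -> B)."""
    if expected_sup <= 0:
        raise ValueError("expected support must be positive")
    return 1.0 - actual_sup / expected_sup


# Sup(A -> B) = 4.48% and Sup(Z) = 9.47%, as in the example.
expected = 0.0448 * 0.0947
print(round(expected * 100, 2))   # expected support in percent: 0.42
print(im_index(0.0, expected))    # candidate never occurs: 1.0
```

An IM of 1 only says that the candidate never occurs. It does not say whether this is caused by Z conflicting with the whole rule or simply by Z conflicting with A alone, which is the limitation noted in the example.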
Degree of Unexpectedness
The DU (Degree of Unexpectedness) index is used to determine the validity of an exception.
This measure captures how much higher the negative dependence between a probe set Z and a rule A → B is than the negative dependence between Z and either A or B.
Drinking a lot and smoking for more than 20 years are positively dependent in groups A, B, and C (Lift and RI columns).
However, there are far fewer smokers in Group A (SupB column). In groups B and C, most of the heavy beer consumers have smoked cigarettes for more than 20 years (Conf column).
Men from group B tend to smoke and drink more (SupA, SupB and Sup columns).
Alcohol Consumption x Cholesterol
Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.0870  0.3370  0.0507  0.5833  1.731  0.0214
B      0.0861  0.1828  0.0186  0.2162  1.183  0.0029
C      0.1316  0.1316  0.0263  0.2000  1.520  0.0090
(Alcohol = “No”) → (Cholesterol = “desirable”)
Not drinking alcohol and having the cholesterol in the desirable range are positively dependent in groups A, B, and C (Lift and RI columns).
There are fewer alcohol consumers in Group C (SupA column).
In group A, most of the men who do not drink alcohol have cholesterol in the desirable range (Conf column).
Education x Smoking
Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.3949  0.5109  0.2210  0.5596  1.095  0.0193
B      0.2526  0.1793  0.0664  0.2627  1.465  0.0211
C      0.1667  0.2018  0.0877  0.5263  2.608  0.0541
(Education = “university”) → (Smoking = “no”)
People with the highest education degree are less likely to be smokers (Lift and RI columns).
In groups A and C, the majority of men with a university degree do not smoke (Conf column). The support of this rule is very high in group A.
In group B, most men with a university degree are smokers (Conf column). However, not smoking and having a university degree are still strongly positively dependent (Lift and RI columns).
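The contrast between confidence and dependence in group B can be checked directly from the table above. The short sketch below is only an illustration. The variable names are made up for this example, and the 0.5 cut-off for "majority" is an assumption.

```python
# (SupB, Conf) per group for (Education = "university") -> (Smoking = "no"),
# copied from the table above.
rows = {"A": (0.5109, 0.5596), "B": (0.1793, 0.2627), "C": (0.2018, 0.5263)}

results = {}
for group, (sup_b, conf) in rows.items():
    lift = conf / sup_b                  # Lift = Conf / Sup(B)
    majority_non_smokers = conf > 0.5    # most university-educated men do not smoke
    positively_dependent = lift > 1.0    # more frequent than under independence
    results[group] = (round(lift, 3), majority_non_smokers, positively_dependent)
    print(group, results[group])
```

Group B is the only group where the confidence is below one half while the lift is still above 1. This is why confidence alone would hide the positive dependence there.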
Skin Folds x Body Mass Index
Group  SupA    SupB    Sup     Conf    Lift   RI
A      0.2319  0.5326  0.1558  0.6719  1.261  0.0323
B      0.2154  0.3586  0.1478  0.6865  1.914  0.0706
C      0.1140  0.2632  0.0789  0.6923  2.631  0.0489
(Skin Folds = “ 20”) → (BMI = “normal”)
Most of the men classified into the lowest range of the attribute Skin Folds have a body mass index in the normal range (Conf column).
Both attributes are highly positively dependent (Lift and RI columns).
There are far fewer people with a normal BMI in Group C (SupB column).
Original rule: “people whose education degree is apprentice school tend to smoke a lot”.
Exception: Among the men who practice physical activities intensely in free time, the support value of the original rule is 47.55% smaller than what is expected.
Original rule: “people with the highest education degree tend to have the body mass index in the normal range”.
Exception: Among the men who belong to Group C, the support value of the original rule is 70.18% smaller than what is expected.
The degree of unexpectedness is equal to 30.52%.
1. Atherosclerosis Data Set
2. Multidimensional Association Rules
3. Exceptions
4. Data Preparation
5. Results
6. Summary
Outline of the talk
Summary
We presented some strong association rules and exceptions mined from the STULONG Data Set, concerning the entry examinations.
Strong association rules were used to compare the correlations among patient characteristics across the three basic groups.
Exceptions indicated negative patterns associated with previously known strong positive rules. These exceptions were mined from candidates that do not achieve an expected support value.
Apply the same approach to the relations: Letter, Control and Death.
Besides mining rules with large deviation between the actual and the expected support, we intend to investigate the interestingness of rules with large deviation between the actual and the expected confidence value.
Future Work
Universidade Federal Fluminense