A Multi-Relational Decision Tree Learning Algorithm – Implementation and Experiments

Anna Atramentov
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Drena Leigh Dobbs
Yan-Bin Jia
Iowa State University, Ames, Iowa
2003
KDD and Relational Data Mining
The term KDD stands for Knowledge Discovery in Databases. Traditional KDD techniques work with instances represented by a single table.

Relational Data Mining is a subfield of KDD in which instances are represented by several tables.
Day  Outlook   Temp-re  Humidity  Wind    Play Tennis
d1   Sunny     Hot      High      Weak    No
d2   Sunny     Hot      High      Strong  No
d3   Overcast  Hot      High      Weak    Yes
d4   Overcast  Cold     Normal    Weak    No

Department
ID  Specialization    #Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position           Salary
p1  Dale    d1          Professor          70-80k
p2  Martin  d3          Postdoc            30-40k
p3  Victor  d2          Visitor Scientist  40-50k
p4  David   d3          Professor          80-100k

Graduate Student
ID  Name    GPA  #Publications  Advisor  Department
s1  John    2.0  4              p1       d3
s2  Lisa    3.5  10             p4       d3
s3  Michel  3.9  3              p4       d4
Motivation
Importance of relational learning:
- Growth of data stored in multi-relational databases (MRDBs)
- Techniques for learning from unstructured data often extract the data into an MRDB

Promising approach to relational learning:
- MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
- MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals:
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First order extensions of probabilistic models
- Bayesian Logic Programs (Kersting et al., 2000)
- Combining First Order Logic and Probability Theory
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRMs extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
Problem Formulation
Example of multi-relational database
Given: data stored in a relational database.
Goal: build a decision tree for predicting the target attribute in the target table.
Schema:

Department: ID, Specialization, #Students
Staff: ID, Name, Department, Position, Salary
Grad.Student: ID, Name, GPA, #Publications, Advisor, Department
Tree_induction(D: data)
    A = optimal_attribute(D)
    if stopping_criterion(D)
        return leaf(D)
    else
        D_left := split(D, A)
        D_right := splitcomplement(D, A)
        child_left := Tree_induction(D_left)
        child_right := Tree_induction(D_right)
        return node(A, child_left, child_right)
Propositional decision tree algorithm. Construction phase
[Figure: propositional tree construction on the PlayTennis table. The test Outlook = sunny sends d1 and d2 (both No) to one branch; the remaining rows {d3, d4} are split on Temperature, hot vs. not hot, giving a Yes leaf for {d3} and a No leaf for {d4}.]
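The Tree_induction procedure above can be sketched as runnable Python. The information-gain attribute selection and the purity-based stopping criterion are standard choices assumed here, not spelled out on the slides; the data is the PlayTennis table shown earlier.

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum_i p_i log2 p_i over the class frequencies
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def optimal_attribute(data, attributes, target):
    # Pick the binary test (attribute, value) with the highest information gain.
    def gain(attr, val):
        left = [r for r in data if r[attr] == val]
        right = [r for r in data if r[attr] != val]
        n = len(data)
        remainder = sum(len(part) / n * entropy([r[target] for r in part])
                        for part in (left, right) if part)
        return entropy([r[target] for r in data]) - remainder
    candidates = {(a, r[a]) for a in attributes for r in data}
    return max(candidates, key=lambda av: gain(*av))

def tree_induction(data, attributes, target):
    labels = [r[target] for r in data]
    if len(set(labels)) == 1:                       # stopping criterion: pure node
        return labels[0]                            # leaf(D)
    attr, val = optimal_attribute(data, attributes, target)
    left = [r for r in data if r[attr] == val]      # split(D, A)
    right = [r for r in data if r[attr] != val]     # splitcomplement(D, A)
    if not left or not right:                       # no informative split remains
        return Counter(labels).most_common(1)[0][0]
    return (attr, val,
            tree_induction(left, attributes, target),
            tree_induction(right, attributes, target))

def predict(tree, row):
    while isinstance(tree, tuple):
        attr, val, left_child, right_child = tree
        tree = left_child if row[attr] == val else right_child
    return tree

data = [
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak", "PlayTennis": "No"},
]
tree = tree_induction(data, ["Outlook", "Temp", "Humidity", "Wind"], "PlayTennis")
```

On these four rows the induced tree reproduces the figure: the first split separates the sunny days (all No), and a Temperature test then separates d3 from d4.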
MR setting. Splitting data with Selection Graphs
[Figure: splitting the example database with selection graphs. The selection graph Staff → Grad.Student with condition GPA > 2.0 selects the staff members who have at least one graduate student with GPA > 2.0 ({p4}); the complement selection graphs cover the remaining staff: those with no graduate students ({p2, p3}) and those all of whose graduate students have GPA ≤ 2.0 ({p1}).]
What is a selection graph?

- It corresponds to a subset of the instances of the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- Open edge = “have at least one”
- Closed edge = “have none of”

[Figure: example selection graphs over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]
Transforming selection graphs into SQL queries

A selection graph consisting of a single Staff node with the condition Position = Professor translates to:

select distinct T0.id
from Staff T0
where T0.position = 'Professor'

General query template:

select …
from table_list
where join_list and condition_list
group by Staff.Salary
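The translation can be sketched for the simplest case: a linear selection graph with only open edges and no closed (complement) edges. The dict encoding and the table/column names below are illustrative assumptions, not Knobbe's exact data structures.

```python
def selection_graph_to_sql(nodes):
    """Emit the 'select distinct T0.id from table_list where join_list and
    condition_list' query shape for a linear selection graph.
    nodes[0] is the target table; each later node joins to the previous one
    via node['join'] = (column_in_previous_table, column_in_this_table).
    Only open ('have at least one') edges are handled in this sketch."""
    aliases = [f"T{i}" for i in range(len(nodes))]
    table_list = ", ".join(f"{n['table']} {a}" for n, a in zip(nodes, aliases))
    join_list = [
        f"{aliases[i - 1]}.{nodes[i]['join'][0]} = {aliases[i]}.{nodes[i]['join'][1]}"
        for i in range(1, len(nodes))
    ]
    condition_list = [
        f"{a}.{cond}"
        for n, a in zip(nodes, aliases)
        for cond in n.get("conditions", [])
    ]
    where = " and ".join(join_list + condition_list)
    return f"select distinct T0.id from {table_list}" + (f" where {where}" if where else "")

# Staff members who are professors:
q1 = selection_graph_to_sql([
    {"table": "Staff", "conditions": ["position = 'Professor'"]},
])

# Staff members with at least one graduate student with GPA > 2.0
# (hypothetical column names id / advisor / gpa):
q2 = selection_graph_to_sql([
    {"table": "Staff"},
    {"table": "Grad_Student", "join": ("id", "advisor"), "conditions": ["gpa > 2.0"]},
])
```

Each node gets a fresh alias T0, T1, …, so the same table can appear in the graph more than once without ambiguity.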
A way to speed up: eliminate redundant calculations

[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]

Problem: for a selection graph with 162 nodes, executing a query takes more than 3 minutes!

Redundancy in calculation: for this selection graph, the tables Staff and Grad.Student are joined over and over for all the children refinements of the tree.

A way to fix it: compute the join only once and save it for all further calculations.
Speed Up Method. Sufficient tables
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]

Sufficient table S:

Staff_ID  Grad.Student_ID  Dep_ID  Salary
p1        s1               d1      c1
p2        s1               d1      c1
p3        s6               d4      c1
p4        s3               d3      c1
p5        s1               d2      c2
p6        s9               d3      c2
…         …                …       …
Speed Up Method. Sufficient tables
[Figure: the selection graph (Staff, Grad.Student with GPA > 3.9, Department with Specialization = math) and its class counts n_1, n_2, …]

Entropy associated with this selection graph:

E = − Σ_i (n_i / N) log (n_i / N)

Query associated with the counts n_i:

select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary

The result of the query is a list of pairs (c_i, n_i), one per class label.
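The counts query and the entropy formula can be checked end to end with an in-memory SQLite copy of the sufficient table S, using the rows shown on the slide (the column Grad.Student_ID is renamed GradStudent_ID to be a legal SQL identifier):

```python
import math
import sqlite3

# Build the sufficient table S from the slide in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("create table S (Staff_ID text, GradStudent_ID text, Dep_ID text, Salary text)")
rows = [("p1", "s1", "d1", "c1"), ("p2", "s1", "d1", "c1"), ("p3", "s6", "d4", "c1"),
        ("p4", "s3", "d3", "c1"), ("p5", "s1", "d2", "c2"), ("p6", "s9", "d3", "c2")]
conn.executemany("insert into S values (?, ?, ?, ?)", rows)

# Query associated with the counts n_i (one row per class label c_i).
counts = conn.execute(
    "select S.Salary, count(distinct S.Staff_ID) from S group by S.Salary").fetchall()

# E = -sum_i (n_i / N) log2 (n_i / N)
N = sum(n for _, n in counts)
E = -sum((n / N) * math.log2(n / N) for _, n in counts)
```

Here count(distinct S.Staff_ID) matters: staff member p1 appears in several rows of S (one per joined grad student), but contributes to n_c1 only once.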
Speed Up Method. Sufficient tables
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]

Query associated with the add-condition refinement:

select S.Salary, X.A, count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A

Calculations for the complement refinement:
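One reading of the complement-refinement step is that the complement's class counts come from subtraction rather than a second join: the refinement's counts plus the complement's counts must equal the parent graph's counts. A minimal sketch of that bookkeeping, with illustrative numbers rather than values from the slides:

```python
def complement_counts(parent_counts, refinement_counts):
    """Class counts for the complement refinement, obtained by subtracting the
    refinement's counts from the parent selection graph's counts, with no
    extra database query."""
    return {c: parent_counts[c] - refinement_counts.get(c, 0) for c in parent_counts}

# Illustrative numbers (assumed, not from the slides): counts for the parent
# selection graph and counts returned by the add-condition refinement query.
parent = {"c1": 4, "c2": 2}
refined = {"c1": 3, "c2": 1}
comp = complement_counts(parent, refined)
```

This is where the sufficient table pays off twice: one grouped query yields the refinement's counts, and the complement's counts follow for free.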
Describes molecules of certain nitroaromatic compounds.
Goal: predict their mutagenic activity (the label attribute), i.e., the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
Two subsets: regression friendly (188 molecules) and regression unfriendly (42 molecules). We used only the regression friendly subset.
Five levels of background knowledge (B0, B1, B2, B3, B4) provide increasingly richer descriptions of the examples. We used the B2 level.
Experimental results. Mutagenesis
Results of 10-fold cross-validation on the regression friendly set:

Data Set     Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
mutagenesis  87.5%     3                      9          28.45               52.15

The best-known reported accuracy is 86%.

[Figure: schema of the mutagenesis database.]
Consists of a variety of details about the various genes of one particular type of organism.
Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
Two tasks: prediction of gene/protein localization and of gene function. 862 training genes, 381 test genes.
Experimental results. KDD Cup 2001
Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table.
The database consists of 5 tables. The target table contains 1239 records. The task is to predict the degree of thrombosis, an attribute of the ANTIBODY_EXAM table.
Results of 5×2 cross-validation:

Data Set    Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
thrombosis  98.1%     31                     71         127.75              198.22

The best-known reported accuracy is 99.28%.
[Figure: schema of the thrombosis database: PATIENT_INFO, DIAGNOSIS, THROMBOSIS, ANTIBODY_EXAM, ANA_PATTERN.]
Summary

- The algorithm significantly outperforms MRDTL in terms of running time.
- The accuracy results are comparable with the best reported results obtained using different data mining algorithms.
Future work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularizations
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
Thanks to
- Dr. Honavar, for providing guidance, help, and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members, Drena Dobbs and Yan-Bin Jia, for their help
- The professors and lecturers of the Computer Science department, for the knowledge they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part