A Multi-Relational Decision Tree Learning Algorithm – Implementation and Experiments

Anna Atramentov
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Drena Leigh Dobbs
Yan-Bin Jia
Iowa State University, Ames, Iowa
2003
KDD and Relational Data Mining
The term KDD stands for Knowledge Discovery in Databases. Traditional KDD techniques work with instances represented by a single table.

Relational Data Mining is a subfield of KDD in which instances are represented by several tables.
Day  Outlook   Temp-re  Humidity  Wind    Play Tennis
d1   Sunny     Hot      High      Weak    No
d2   Sunny     Hot      High      Strong  No
d3   Overcast  Hot      High      Weak    Yes
d4   Overcast  Cold     Normal    Weak    No

Department
ID  Specialization    #Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff
ID  Name    Department  Position           Salary
p1  Dale    d1          Professor          70-80k
p2  Martin  d3          Postdoc            30-40k
p3  Victor  d2          Visitor Scientist  40-50k
p4  David   d3          Professor          80-100k

Graduate Student
ID  Name    GPA  #Publications  Advisor  Department
s1  John    2.0  4              p1       d3
s2  Lisa    3.5  10             p4       d3
s3  Michel  3.9  3              p4       d4
Motivation
Importance of relational learning:
- Growth of data stored in multi-relational databases (MRDBs)
- Techniques for learning from unstructured data often extract the data into an MRDB

Promising approach to relational learning:
- MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
- MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)

Goals:
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm
Relational Learning Literature
- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First order extensions of probabilistic models
- Bayesian Logic Programs (Kersting et al., 2000)
- Combining First Order Logic and Probability Theory
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRMs extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
Problem Formulation
Example of multi-relational database
Given: data stored in a relational database.
Goal: build a decision tree for predicting the target attribute in the target table.
Schema:

Department: ID, Specialization, #Students
Staff: ID, Name, Department, Position, Salary
Grad.Student: ID, Name, GPA, #Publications, Advisor, Department
Tree_induction(D: data)
    A = optimal_attribute(D)
    if stopping_criterion(D)
        return leaf(D)
    else
        D_left := split(D, A)
        D_right := splitcomplement(D, A)
        child_left := Tree_induction(D_left)
        child_right := Tree_induction(D_right)
        return node(A, child_left, child_right)
Propositional decision tree algorithm. Construction phase
[Figure: propositional tree construction on the PlayTennis table. The test Outlook = sunny sends d1 and d2 (both No) to one branch; the remaining rows {d3, d4} are split on Temperature, hot vs. not hot, giving a Yes leaf for {d3} and a No leaf for {d4}.]
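The Tree_induction procedure above can be sketched as runnable Python. The information-gain attribute selection and the purity-based stopping criterion are standard choices assumed here, not spelled out on the slides; the data is the PlayTennis table shown earlier.

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum_i p_i log2 p_i over the class frequencies
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def optimal_attribute(data, attributes, target):
    # Pick the binary test (attribute, value) with the highest information gain.
    def gain(attr, val):
        left = [r for r in data if r[attr] == val]
        right = [r for r in data if r[attr] != val]
        n = len(data)
        remainder = sum(len(part) / n * entropy([r[target] for r in part])
                        for part in (left, right) if part)
        return entropy([r[target] for r in data]) - remainder
    candidates = {(a, r[a]) for a in attributes for r in data}
    return max(candidates, key=lambda av: gain(*av))

def tree_induction(data, attributes, target):
    labels = [r[target] for r in data]
    if len(set(labels)) == 1:                       # stopping criterion: pure node
        return labels[0]                            # leaf(D)
    attr, val = optimal_attribute(data, attributes, target)
    left = [r for r in data if r[attr] == val]      # split(D, A)
    right = [r for r in data if r[attr] != val]     # splitcomplement(D, A)
    if not left or not right:                       # no informative split remains
        return Counter(labels).most_common(1)[0][0]
    return (attr, val,
            tree_induction(left, attributes, target),
            tree_induction(right, attributes, target))

def predict(tree, row):
    while isinstance(tree, tuple):
        attr, val, left_child, right_child = tree
        tree = left_child if row[attr] == val else right_child
    return tree

data = [
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal", "Wind": "Weak", "PlayTennis": "No"},
]
tree = tree_induction(data, ["Outlook", "Temp", "Humidity", "Wind"], "PlayTennis")
```

On these four rows the induced tree reproduces the figure: the first split separates the sunny days (all No), and a Temperature test then separates d3 from d4.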
MR setting. Splitting data with Selection Graphs
[Figure: splitting the example database with selection graphs. The selection graph Staff → Grad.Student with condition GPA > 2.0 selects the staff members who have at least one graduate student with GPA > 2.0 ({p4}); the complement selection graphs cover the remaining staff: those with no graduate students ({p2, p3}) and those all of whose graduate students have GPA ≤ 2.0 ({p1}).]
What is a selection graph?

- It corresponds to a subset of the instances of the target table
- Nodes correspond to tables in the database
- Edges correspond to associations between tables
- Open edge = “have at least one”
- Closed edge = “have none of”

[Figure: example selection graphs over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]
Transforming selection graphs into SQL queries

A selection graph consisting of a single Staff node with the condition Position = Professor translates to:

select distinct T0.id
from Staff T0
where T0.position = 'Professor'

General query template:

select …
from table_list
where join_list and condition_list
group by Staff.Salary
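The translation can be sketched for the simplest case: a linear selection graph with only open edges and no closed (complement) edges. The dict encoding and the table/column names below are illustrative assumptions, not Knobbe's exact data structures.

```python
def selection_graph_to_sql(nodes):
    """Emit the 'select distinct T0.id from table_list where join_list and
    condition_list' query shape for a linear selection graph.
    nodes[0] is the target table; each later node joins to the previous one
    via node['join'] = (column_in_previous_table, column_in_this_table).
    Only open ('have at least one') edges are handled in this sketch."""
    aliases = [f"T{i}" for i in range(len(nodes))]
    table_list = ", ".join(f"{n['table']} {a}" for n, a in zip(nodes, aliases))
    join_list = [
        f"{aliases[i - 1]}.{nodes[i]['join'][0]} = {aliases[i]}.{nodes[i]['join'][1]}"
        for i in range(1, len(nodes))
    ]
    condition_list = [
        f"{a}.{cond}"
        for n, a in zip(nodes, aliases)
        for cond in n.get("conditions", [])
    ]
    where = " and ".join(join_list + condition_list)
    return f"select distinct T0.id from {table_list}" + (f" where {where}" if where else "")

# Staff members who are professors:
q1 = selection_graph_to_sql([
    {"table": "Staff", "conditions": ["position = 'Professor'"]},
])

# Staff members with at least one graduate student with GPA > 2.0
# (hypothetical column names id / advisor / gpa):
q2 = selection_graph_to_sql([
    {"table": "Staff"},
    {"table": "Grad_Student", "join": ("id", "advisor"), "conditions": ["gpa > 2.0"]},
])
```

Each node gets a fresh alias T0, T1, …, so the same table can appear in the graph more than once without ambiguity.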
A way to speed up: eliminate redundant calculations

[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]

Problem: for a selection graph with 162 nodes, executing a query takes more than 3 minutes!

Redundancy in calculation: for this selection graph, the tables Staff and Grad.Student are joined over and over for all the children refinements of the tree.

A way to fix it: compute the join only once and save it for all further calculations.
Speed Up Method. Sufficient tables
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]

Sufficient table S:

Staff_ID  Grad.Student_ID  Dep_ID  Salary
p1        s1               d1      c1
p2        s1               d1      c1
p3        s6               d4      c1
p4        s3               d3      c1
p5        s1               d2      c2
p6        s9               d3      c2
…         …                …       …
Speed Up Method. Sufficient tables
[Figure: the selection graph (Staff, Grad.Student with GPA > 3.9, Department with Specialization = math) and its class counts n_1, n_2, …]

Entropy associated with this selection graph:

E = − Σ_i (n_i / N) log (n_i / N)

Query associated with the counts n_i:

select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary

The result of the query is a list of pairs (c_i, n_i), one per class label.
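The counts query and the entropy formula can be checked end to end with an in-memory SQLite copy of the sufficient table S, using the rows shown on the slide (the column Grad.Student_ID is renamed GradStudent_ID to be a legal SQL identifier):

```python
import math
import sqlite3

# Build the sufficient table S from the slide in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("create table S (Staff_ID text, GradStudent_ID text, Dep_ID text, Salary text)")
rows = [("p1", "s1", "d1", "c1"), ("p2", "s1", "d1", "c1"), ("p3", "s6", "d4", "c1"),
        ("p4", "s3", "d3", "c1"), ("p5", "s1", "d2", "c2"), ("p6", "s9", "d3", "c2")]
conn.executemany("insert into S values (?, ?, ?, ?)", rows)

# Query associated with the counts n_i (one row per class label c_i).
counts = conn.execute(
    "select S.Salary, count(distinct S.Staff_ID) from S group by S.Salary").fetchall()

# E = -sum_i (n_i / N) log2 (n_i / N)
N = sum(n for _, n in counts)
E = -sum((n / N) * math.log2(n / N) for _, n in counts)
```

Here count(distinct S.Staff_ID) matters: staff member p1 appears in several rows of S (one per joined grad student), but contributes to n_c1 only once.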
Speed Up Method. Sufficient tables
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math).]

Query associated with the add-condition refinement:

select S.Salary, X.A, count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A

Calculations for the complement refinement:
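One reading of the complement-refinement step is that the complement's class counts come from subtraction rather than a second join: the refinement's counts plus the complement's counts must equal the parent graph's counts. A minimal sketch of that bookkeeping, with illustrative numbers rather than values from the slides:

```python
def complement_counts(parent_counts, refinement_counts):
    """Class counts for the complement refinement, obtained by subtracting the
    refinement's counts from the parent selection graph's counts, with no
    extra database query."""
    return {c: parent_counts[c] - refinement_counts.get(c, 0) for c in parent_counts}

# Illustrative numbers (assumed, not from the slides): counts for the parent
# selection graph and counts returned by the add-condition refinement query.
parent = {"c1": 4, "c2": 2}
refined = {"c1": 3, "c2": 1}
comp = complement_counts(parent, refined)
```

This is where the sufficient table pays off twice: one grouped query yields the refinement's counts, and the complement's counts follow for free.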
Describes molecules of certain nitroaromatic compounds.
Goal: predict their mutagenic activity (the label attribute), i.e., the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
Two subsets: regression friendly (188 molecules) and regression unfriendly (42 molecules). We used only the regression friendly subset.
Five levels of background knowledge (B0, B1, B2, B3, B4) provide increasingly richer descriptions of the examples. We used the B2 level.
Experimental results. Mutagenesis
Results of 10-fold cross-validation on the regression friendly set:

Data Set     Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
mutagenesis  87.5%     3                      9          28.45               52.15

The best-known reported accuracy is 86%.

[Figure: schema of the mutagenesis database.]
Consists of a variety of details about the various genes of one particular type of organism.
Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
Two tasks: prediction of gene/protein localization and of gene function. 862 training genes, 381 test genes.
Experimental results. KDD Cup 2001
Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table.
The database consists of 5 tables. The target table contains 1239 records. The task is to predict the degree of thrombosis, an attribute of the ANTIBODY_EXAM table.
Results of 5×2 cross-validation:

Data Set    Accuracy  Sel. graph size (max)  Tree size  Time with speed up  Time without speed up
thrombosis  98.1%     31                     71         127.75              198.22

The best-known reported accuracy is 99.28%.
[Figure: schema of the thrombosis database: PATIENT_INFO, DIAGNOSIS, THROMBOSIS, ANTIBODY_EXAM, ANA_PATTERN.]
Summary

- The algorithm significantly outperforms MRDTL in terms of running time.
- The accuracy results are comparable with the best reported results obtained using different data mining algorithms.
Future work
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularizations
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
Thanks to
- Dr. Honavar, for providing guidance, help, and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members, Drena Dobbs and Yan-Bin Jia, for their help
- The professors and lecturers of the Computer Science department, for the knowledge they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part