Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Germany Relational Classification Using Automatically Extracted Relations by Record Linkage

Dec 18, 2015


Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim
Germany

Relational Classification Using Automatically Extracted Relations by Record Linkage


Outline

• Motivation

• Relation Extraction and Multi-Relational Classification Framework

• Relation Extraction

• Multi-Relational Classification

• Evaluation

• Conclusion


Motivation

• Example:

[Figure: graph with the publications P1, P2 and P3 as nodes]

Publication  Title                                      Author      Conference  Category
1            Classification of scientific publications  John Smith  ICDM        Data Mining
2            Classification of Hypertext                John Smith  KDD         Data Mining
3            Hierarchical Clustering                    Dan Miller  ICDM        Data Mining


Motivation

• Traditional classifiers take only local attributes such as keywords, title and abstract into account

• Assumption: instances are independent

• But: the assumption does not hold

– Instances can be related to other documents through authorship, citations, the same conference, etc.

→ These relations should be exploited and combined in order to improve classification accuracy.

• But: manual extraction of relations by experts is expensive

→ Automatic extraction of relations from noisy attributes.


Relation Extraction and Relational Classification Framework

Publication  Title                                      Author      Conference                                   Category
1            Classification of scientific publications  J. Smith    ICDM 2005                                    Data Mining
2            Classification of Hypertext                John Smith  KDD                                          Data Mining
3            Hierarchical Clustering                    Dan Miller  5th International Conference on Data Mining  Data Mining

• Relation Extraction Component
– Extraction of relations from objects with noisy attributes

• Multi-Relational Classification Component
– Use extracted relations instead of, or in addition to, local attributes for classification

[Figure labels: objects $x \in X$, attributes $a$, relation $R$]


Relation Extraction

• Pairwise feature extraction
– from noisy attributes $a: X \to V$ with several similarity measures (e.g. TFIDF, cosine similarity, Levenshtein), giving pairwise features $f: X^2 \to \mathbb{R}^l$

• Probabilistic pairwise decision model
– Use extracted similarities as features for a probabilistic classifier $\hat{C}: \mathbb{R}^l \to [0, 1]$ and build a model on the training data, i.e. on pairs $(x, y)$ from $X_{tr}$
– And apply it on unknown pairs

• Collective decision model
– If $R$ is an equivalence relation, use constrained clustering (e.g. HAC) with the pairwise decision model as a learned similarity measure to transform its output into a binary relation

[Figure: pipeline from Attributes through pairwise feature extraction, the probabilistic pairwise decision model and the collective decision model to Relations $R$]
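The pairwise feature extraction step can be illustrated with a minimal sketch. It assumes records are plain dictionaries of strings and uses only two of the similarity measures named in the slides (Levenshtein and Jaccard); the function names and the normalisation to [0, 1] are illustrative choices, not the authors' implementation. The resulting vector is the kind of input a probabilistic pairwise classifier would be trained on.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic program, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1] (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_sim(a: str, b: str) -> float:
    """Jaccard coefficient on lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pair_features(x: dict, y: dict, attributes: list) -> list:
    """Feature vector for the pair (x, y): two similarities per attribute."""
    feats = []
    for name in attributes:
        feats.append(levenshtein_sim(x[name], y[name]))
        feats.append(jaccard_sim(x[name], y[name]))
    return feats


# Two noisy records from the example table.
p1 = {"title": "Classification of scientific publications", "author": "J. Smith"}
p2 = {"title": "Classification of Hypertext", "author": "John Smith"}
features = pair_features(p1, p2, ["title", "author"])
```

With l attributes-times-measures features per pair, each labelled training pair contributes one row to the training set of the pairwise decision model.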


Relation Extraction: Collective Decision Model

[Figure labels: Initialisation, Must Links, Cannot Links]
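A collective decision step with must-link and cannot-link constraints could look roughly like the sketch below. It is an assumption-laden illustration: it uses average linkage, merges must-linked objects during initialisation, never merges clusters that contain a cannot-linked pair, and stops at a similarity `threshold`. The function name, the threshold and the way constraints are passed in are all hypothetical; the slides do not specify these details.

```python
from itertools import combinations


def constrained_hac(n, sim, must=(), cannot=(), threshold=0.5):
    """Average-linkage agglomerative clustering of objects 0..n-1 under constraints."""
    cannot = {frozenset(p) for p in cannot}
    clusters = [{i} for i in range(n)]

    def merge(a, b):
        # Replace clusters a and b by their union.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)

    def blocked(a, b):
        # A merge is forbidden if it would join a cannot-linked pair.
        return any(frozenset((i, j)) in cannot for i in a for j in b)

    def avg_sim(a, b):
        return sum(sim(i, j) for i in a for j in b) / (len(a) * len(b))

    # Initialisation: must-linked objects start in the same cluster.
    for i, j in must:
        a = next(c for c in clusters if i in c)
        b = next(c for c in clusters if j in c)
        if a is not b:
            merge(a, b)

    # Repeatedly merge the most similar admissible pair of clusters.
    while True:
        candidates = [(avg_sim(a, b), a, b)
                      for a, b in combinations(clusters, 2)
                      if not blocked(a, b)]
        if not candidates:
            break
        best, a, b = max(candidates, key=lambda t: t[0])
        if best < threshold:
            break
        merge(a, b)
    return clusters


# Toy learned similarity: pairs not listed get a low default score.
scores = {frozenset((0, 1)): 0.9, frozenset((2, 3)): 0.8}


def sim(i, j):
    return scores.get(frozenset((i, j)), 0.1)


unconstrained = constrained_hac(4, sim)
with_cannot = constrained_hac(4, sim, cannot=[(0, 1)])
with_must = constrained_hac(4, sim, must=[(0, 2)], threshold=0.95)
```

In the framework, `sim` would be the output of the learned pairwise decision model, and each resulting cluster is one equivalence class of the extracted relation.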


Multi-Relational Classification

• Relational classification problem:
– Make use of additional information of related objects (i.e. their classes or attributes)
– Propositionalize the relational data, e.g. with:

$\mathrm{freq}_c(x) := |\{x' \in N_x \mid c(x') = c\}|$

where

$N_x := \{x' \in X \mid (x, x') \in R,\ c(x') \neq .\}$

is the neighborhood of $x$
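The propositionalization above counts, for each class, how many neighbors of an object carry that class. A minimal sketch follows, assuming the relation is stored as a set of pairs and treated as symmetric, and that an unknown class is represented by `None`; the second category name is made up for the example.

```python
from collections import Counter


def neighborhood(x, relation, labels):
    """Objects related to x whose class is known (relation treated as symmetric)."""
    nbrs = set()
    for a, b in relation:
        if a == x:
            nbrs.add(b)
        elif b == x:
            nbrs.add(a)
    return {y for y in nbrs if labels.get(y) is not None}


def class_frequencies(x, relation, labels, classes):
    """One count per class: number of neighbors of x labelled with that class."""
    counts = Counter(labels[y] for y in neighborhood(x, relation, labels))
    return [counts[c] for c in classes]


# P1 is unlabelled; P4 is related to P1 but has no known class either.
same_author = {("P1", "P2"), ("P1", "P3"), ("P4", "P1")}
labels = {"P1": None, "P2": "Data Mining", "P3": "Data Mining", "P4": None}
classes = ["Data Mining", "Databases"]  # "Databases" is a hypothetical category

freq = class_frequencies("P1", same_author, labels, classes)
```

The resulting vector can be used as an ordinary attribute vector by any propositional learner.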


Multi-Relational Classification

• Algorithm:

1. For each relation $R_i$, $i = 1, \dots, m$:
   (a) Build an undirected weighted graph $G = (X, E)$ with edge weights $w: X \times X \to \mathbb{R}$
   (b) Perform relational classification simultaneously for all instances in the test set
   (c) Output a probability distribution
2. Apply ensemble classification to the resulting probability distributions of these relations
3. Output final classification

[Figure: one relational classification step per relation, all feeding into ensemble classification]


Multi-Relational Classification

• Simple Relational Methods
– Probabilistic Relational Neighbor Classifier (EPRN) [Macskassy and Provost 2003]

$P^{(t)}(c \mid x) = \frac{1}{Z} \sum_{x' \in N_x} w(x, x') \, P^{(t-1)}(c \mid x')$

where $Z$ is a normalization factor, $w$ is the weight and $t$ is the iteration

– EPRN2HOP
  • Additionally takes the neighbors of the direct neighbors into account if the direct neighborhood size is small

$P^{(t)}(c \mid x) = \frac{1}{Z} \Big( \sum_{x' \in N_x} w(x, x') \, P^{(t-1)}(c \mid x') + \sum_{x'' \in N_{x'}} w(x', x'') \, P^{(t-1)}(c \mid x'') \Big)$

where the second sum is used only if the size of $N_x$ is small relative to the threshold $d$
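The iterative neighbor-based update can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: objects with a known class keep a fixed distribution, unlabelled objects start from a prior supplied by the caller, all neighbors contribute their current estimate, and the normalization factor is chosen so that each distribution sums to one. The names `prn_step` and `prn` are hypothetical.

```python
def prn_step(probs, weights, known):
    """One synchronous update of the class distribution of every unlabelled object."""
    new = {}
    for x, dist in probs.items():
        if x in known:
            new[x] = dist  # labelled objects are never changed
            continue
        nbrs = weights.get(x, {})
        # Weighted sum of the neighbors' previous distributions, per class.
        scores = [sum(w * probs[y][c] for y, w in nbrs.items())
                  for c in range(len(dist))]
        z = sum(scores)  # normalization factor
        new[x] = [s / z for s in scores] if z > 0 else dist
    return new


def prn(probs, weights, known, iterations=10):
    """Repeat the update for a fixed number of iterations."""
    for _ in range(iterations):
        probs = prn_step(probs, weights, known)
    return probs


# A and B are labelled; T is unlabelled and starts from a uniform prior.
probs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "T": [0.5, 0.5]}
weights = {"T": {"A": 3.0, "B": 1.0}}  # edge weights w(T, A) and w(T, B)
result = prn(probs, weights, known={"A", "B"}, iterations=3)
```

A two-hop variant would extend `nbrs` with the neighbors of neighbors whenever `len(nbrs)` falls below a chosen threshold.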


Multi-Relational Classification

• Aggregation-based Relational Learning Methods
– Use aggregation functions in order to propositionalize the set-valued attribute
– Use aggregated values as attributes for traditional machine learning methods
– We used Logistic Regression as classifier

[Figure labels: Category 1, Category 2, Category 3, Category 1]


Ensemble Classification

• Methods which combine different models
• Increases classification accuracy
• Usage
– Combine results achieved by relational classification for different relations
– Combine results of relational and local models

• Voting

$P(c \mid x) = \frac{1}{L} \sum_{l=1}^{L} P_{M_l}(c \mid x)$

• Stacking
– Use a meta-classifier to learn a model on the results of different models
– Build new instances

$x_{new} = (P_1(c_1 \mid x), \dots, P_1(c_n \mid x), \dots, P_L(c_1 \mid x), \dots, P_L(c_n \mid x), c)$, with $P_l := P_{M_l}$

– Apply cross validation
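Both combination schemes are easy to illustrate on per-model class distributions. The sketch below assumes each model outputs a list of class probabilities in the same class order; the function names are illustrative, and the meta-classifier itself is left out.

```python
def vote(distributions):
    """Voting: average the class distributions of the L models."""
    num_models = len(distributions)
    num_classes = len(distributions[0])
    return [sum(d[c] for d in distributions) / num_models
            for c in range(num_classes)]


def stacking_instance(distributions, label):
    """Stacking: concatenate all class probabilities and append the true class."""
    features = [p for d in distributions for p in d]
    return features + [label]


# Distributions predicted for one paper from two relations (made-up numbers).
author_model = [0.5, 0.5]
journal_model = [1.0, 0.0]

combined = vote([author_model, journal_model])
new_instance = stacking_instance([author_model, journal_model], "Data Mining")
```

For stacking, the base-model distributions used to build the new instances would come from cross validation, so that the meta-classifier is not trained on predictions for data the base models have already seen.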


Evaluation

• Data
– CompuScience data set
  • 147 571 scientific papers
  • 77 topics (categories)
  • Relations: authors, reviewer, journals
– Cora deduplication data set
  • 1 295 citations
  • 112 unique publications
  • Relation: samePaper
– Cora data set
  • 3298 papers
  • 12 categories
  • Relations: conferences, authors, citations


Evaluation – Relation Extraction

Evaluation set  single linkage  complete linkage  average linkage
X_tst           0.90            0.74              0.92
X               0.92            0.71              0.93

F1 measure for finding the SamePaper relation on Cora

Pairwise feature extraction with TFIDF, Levenshtein, Jaccard, Cosine on all attributes
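One common way to score an extracted equivalence relation is F1 over object pairs: a pair counts as positive if both objects fall into the same cluster. Whether the reported numbers were computed exactly this way is an assumption; the sketch only shows the pairwise variant of the measure.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered pairs of objects that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def pairwise_f1(predicted, truth):
    """F1 of the predicted same-cluster pairs against the true ones."""
    pred = same_cluster_pairs(predicted)
    true = same_cluster_pairs(truth)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Citations 1-3 refer to the same paper; citation 4 is a different paper.
truth = [{1, 2, 3}, {4}]
predicted = [{1, 2}, {3, 4}]
f1 = pairwise_f1(predicted, truth)
```

Here one of the two predicted pairs is correct (precision 0.5) and one of the three true pairs is found (recall 1/3), which gives an F1 of 0.4.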


Evaluation – Multi-Relational Classification

• The ensemble of relational and content-based text classification achieved a significantly higher F-measure than the pure text classifier

3-fold cross validation on CompuScience for the Author, Reviewer and Journal relations


Evaluation – Multi-Relational Classification using automatically extracted relations

• 50%/50% splits, 10 runs

[Figure: "Author Relation": accuracy (y-axis, 0.5 to 0.75) over runs 1 to 10 (x-axis) for the annotated relation and the learned relation]


Conclusion and Future Work

• Summary:
– Presented framework for relation extraction and multi-relational classification
  • Automatic relation extraction with record linkage
  • Relational classification using each extracted relation for classification and fusing the results with ensemble methods

• Future Work
– Evaluate our framework on different data sets and relations
– Evaluate the relational classifiers' quality depending on the quality of the extracted relations


Questions?

www.ismll.uni-hildesheim.de

Christine Preisach

[email protected]

Steffen Rendle

[email protected]

Lars Schmidt-Thieme

[email protected]

Thank you