Relational Classification Using Automatically Extracted Relations by Record Linkage
Christine Preisach, Steffen Rendle and Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Dec 18, 2015
Outline
• Motivation
• Relation Extraction and Multi-Relational Classification Framework
• Relation Extraction
• Multi-Relational Classification
• Evaluation
• Conclusion
Motivation
• Example:
[Figure: graph of the publications P1, P2 and P3]
Publication  Title                                      Author      Conference  Category
1            Classification of scientific publications  John Smith  ICDM        Data Mining
2            Classification of Hypertext                John Smith  KDD         Data Mining
3            Hierarchical Clustering                    Dan Miller  ICDM        Data Mining
Motivation
• Traditional classifiers take only local attributes such as keywords, title and abstract into account
• Assumption: Instances are independent
• But: The assumption does not hold
– Instances can be related to other documents through authorship, citations, the same conference, etc.
These relations should be exploited and combined in order to improve classification accuracy.
• But: Manual extraction of relations by experts is expensive
Automatic extraction of relations from noisy attributes.
Relation Extraction and Relational Classification Framework
Publication  Title                                      Author      Conference                                   Category
1            Classification of scientific publications  J. Smith    ICDM 2005                                    Data Mining
2            Classification of Hypertext                John Smith  KDD                                          Data Mining
3            Hierarchical Clustering                    Dan Miller  5th International Conference on Data Mining  Data Mining
• Relation Extraction Component
– Extraction of relations from objects with noisy attributes
• Multi-Relational Classification Component
– Use extracted relations instead of, or in addition to, local attributes for classification
[Figure: framework diagram; labels: x ∈ X, a, R]
Relation Extraction
• Pairwise feature extraction $f : X^2 \to \mathbb{R}^l$
– from noisy attributes $a : X \to V$ with several similarity measures (e.g. TFIDF, cosine similarity, Levenshtein)
• Probabilistic pairwise decision model
– Use the extracted similarities as features for a probabilistic classifier $\hat{C} : \mathbb{R}^l \to [0, 1]$ and build a model on the training data, i.e. pairs $(x, y)$ of objects from $X_{tr}$
– Apply it to unknown pairs
• Collective decision model
– If $R$ is an equivalence relation, use constrained clustering (e.g. HAC) with the pairwise decision model as a learned similarity measure to transform the pairwise decisions into a binary relation $R$
[Figure: pipeline from attributes to relations: pairwise feature extraction, probabilistic pairwise decision model, collective decision model]
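As a minimal sketch of the pairwise feature extraction step, the following Python computes a small feature vector for a pair of records from two string similarity measures. The record layout (dicts with `title` and `author` keys), the function names and the choice of a normalized Levenshtein similarity plus a token-level Jaccard overlap are illustrative assumptions, not the exact features used in the talk.

```python
def levenshtein(s, t):
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_sim(s, t):
    """Edit distance rescaled to a similarity in [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def jaccard_sim(s, t):
    """Jaccard overlap of the lower-cased token sets."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return 1.0 if not (a | b) else len(a & b) / len(a | b)


def pair_features(x, y, attributes=("title", "author")):
    """Hypothetical f(x, y): one Levenshtein and one Jaccard value per attribute."""
    feats = []
    for name in attributes:
        feats.append(levenshtein_sim(x[name], y[name]))
        feats.append(jaccard_sim(x[name], y[name]))
    return feats


p1 = {"title": "Classification of scientific publications", "author": "J. Smith"}
p2 = {"title": "Classification of Hypertext", "author": "John Smith"}
print(pair_features(p1, p2))
```

Vectors of this kind would then serve as input to a probabilistic classifier trained on labeled pairs.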
Relation Extraction: Collective Decision Model
[Figure: constrained clustering example; panels: Initialisation, Must Links, Cannot Links]
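The collective decision step can be sketched as agglomerative clustering under must-link and cannot-link constraints. The version below is only an illustration under stated assumptions: it uses average linkage, a stopping threshold of 0.5, and a toy lookup table `toy` standing in for the learned pairwise model; all names are hypothetical.

```python
from itertools import combinations


def constrained_hac(items, sim, threshold=0.5, must=(), cannot=()):
    """Average-linkage agglomerative clustering with pairwise constraints.

    sim(a, b) stands in for the probability given by the learned pairwise model.
    """
    clusters = [{x} for x in items]

    def find(x):
        return next(c for c in clusters if x in c)

    # Initialisation: merge clusters joined by a must-link.
    for a, b in must:
        ca, cb = find(a), find(b)
        if ca is not cb:
            clusters.remove(cb)
            ca |= cb

    def violates(c1, c2):
        # A merge is forbidden if it would put a cannot-linked pair together.
        return any((a in c1 and b in c2) or (a in c2 and b in c1)
                   for a, b in cannot)

    def avg_sim(c1, c2):
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        candidates = [(avg_sim(c1, c2), i, j)
                      for (i, c1), (j, c2) in combinations(enumerate(clusters), 2)
                      if not violates(c1, c2)]
        if not candidates:
            break
        best, i, j = max(candidates)
        if best < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


def relation_from_clusters(clusters):
    """All unordered pairs inside a cluster form the extracted relation."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


# Toy similarities between four citations (made-up numbers).
toy = {frozenset(("c1", "c2")): 0.9,
       frozenset(("c2", "c3")): 0.8,
       frozenset(("c1", "c3")): 0.7}
items = ["c1", "c2", "c3", "c4"]


def sim(a, b):
    return toy.get(frozenset((a, b)), 0.1)


print(constrained_hac(items, sim))
print(constrained_hac(items, sim, cannot=[("c1", "c3")]))
```

Because the output is a partition, the derived relation is reflexive, symmetric and transitive by construction, which is the reason for clustering instead of thresholding each pair independently.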
Multi-Relational Classification
• Relational classification problem:
– Make use of additional information of related objects (i.e. their classes or attributes)
– Propositionalize the relational data, e.g. with:
$$\mathrm{freq}_c(x) := |\{x' \in N_x \mid c(x') = c\}|$$
where
$$N_x := \{x' \in X \mid (x, x') \in R,\ c(x') \neq .\}$$
is the neighborhood of $x$ (related objects whose class is known; "." denotes a missing class)
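A minimal sketch of this propositionalization: for one object, count how many of its labeled neighbors carry each class. The assumptions here are that the relation is given as a set of pairs and treated as symmetric, and that an unknown class is represented by `None` or a missing key; the function names are illustrative.

```python
from collections import Counter


def neighborhood(x, relation, labels):
    """N_x: objects related to x whose class is known (label is not None)."""
    related = ({b for a, b in relation if a == x}
               | {a for a, b in relation if b == x})
    return {n for n in related if labels.get(n) is not None}


def class_counts(x, relation, labels, classes):
    """One count feature per class: how many neighbors of x carry that class."""
    counts = Counter(labels[n] for n in neighborhood(x, relation, labels))
    return [counts[c] for c in classes]


relation = {("p1", "p2"), ("p1", "p3"), ("p3", "p4")}
labels = {"p2": "Data Mining", "p3": "Databases", "p4": None}
classes = ["Data Mining", "Databases"]

print(class_counts("p1", relation, labels, classes))
print(class_counts("p4", relation, labels, classes))
```

The resulting fixed-length vectors can be fed to any attribute-value learner, since the set-valued neighborhood has been flattened into one column per class.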
Multi-Relational Classification
• Algorithm:
1. For each relation $R_i$, $i = 1$ to $m$:
(a) Build an undirected weighted graph $G = (X, E)$ with weights $w : X \times X \to \mathbb{R}$
(b) Perform relational classification simultaneously for all instances in the test set
(c) Output a probability distribution
2. Apply ensemble classification to the resulting probability distributions of these relations
3. Output the final classification
[Figure: one relational classification step per relation, all feeding into ensemble classification]
Multi-Relational Classification
• Simple Relational Methods
– Probabilistic Relational Neighbor Classifier (EPRN) [Macskassy and Provost 2003]
$$P^{(t)}(c \mid x) = \frac{1}{Z} \sum_{x' \in N_x} w(x, x') \, P^{(t-1)}(c \mid x')$$
where $Z$ is a normalization factor, $w$ is the weight and $t$ is the iteration
– EPRN2HOP
• Additionally takes the neighbors of the direct neighbors into account if the direct neighborhood size is small
$$P^{(t)}(c \mid x) = \frac{1}{Z} \Big( \sum_{x' \in N_x} w(x, x') \, P^{(t-1)}(c \mid x') + \sum_{x'' \in N_{x'},\ |N_x| < d} w(x', x'') \, P^{(t-1)}(c \mid x'') \Big)$$
where $d$ is the threshold on the direct neighborhood size
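The one-hop update can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: class distributions are plain dicts, `Z` is taken to be the sum of the unnormalized class scores, objects with known labels are held fixed, and an object without informative neighbors keeps its previous estimate.

```python
def prn_step(probs, neighbors, weight, classes):
    """One iteration: P_t(c|x) = 1/Z * sum over x' in N_x of w(x, x') * P_{t-1}(c|x')."""
    updated = {}
    for x, dist in probs.items():
        scores = {c: sum(weight(x, n) * probs[n][c] for n in neighbors.get(x, ()))
                  for c in classes}
        z = sum(scores.values())
        # Keep the previous estimate when x has no informative neighbors.
        updated[x] = {c: s / z for c, s in scores.items()} if z > 0 else dict(dist)
    return updated


def prn(probs, neighbors, weight, classes, fixed=(), iterations=10):
    """Repeat the update; objects in `fixed` (known labels) are never changed."""
    for _ in range(iterations):
        new = prn_step(probs, neighbors, weight, classes)
        probs = {x: (probs[x] if x in fixed else new[x]) for x in probs}
    return probs


classes = ["DM", "DB"]
probs = {"p1": {"DM": 1.0, "DB": 0.0},   # known label
         "p2": {"DM": 0.0, "DB": 1.0},   # known label
         "p3": {"DM": 0.5, "DB": 0.5}}   # to be classified
neighbors = {"p1": ["p3"], "p2": ["p3"], "p3": ["p1", "p2"]}
weights = {frozenset(("p1", "p3")): 2.0, frozenset(("p2", "p3")): 1.0}


def weight(a, b):
    return weights[frozenset((a, b))]


result = prn(probs, neighbors, weight, classes, fixed={"p1", "p2"})
print(result["p3"])
```

In this toy graph the unlabeled object `p3` is pulled toward the class of its more heavily weighted neighbor `p1`.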
Multi-Relational Classification
• Aggregation-based Relational Learning Methods
– Use aggregation functions in order to propositionalize the set-valued attribute
– Use aggregated values as attributes for traditional machine learning methods
– We used Logistic Regression as classifier
[Figure: example with the category labels Category 1, Category 2, Category 3, Category 1]
Ensemble Classification
• Methods which combine different models
• Increases classification accuracy
• Usage
– Combine results achieved by relational classification for different relations
– Combine results of relational and local models
• Voting
$$P(c \mid x) = \frac{1}{L} \sum_{l=1}^{L} P_l(c \mid x)$$
where $P_l$ is the distribution predicted by model $M_l$
• Stacking
– Use a meta-classifier to learn a model on the results of different models
– Build new instances
$$x_{new} = (P_1(c_1 \mid x), \ldots, P_1(c_n \mid x), \ldots, P_L(c_1 \mid x), \ldots, P_L(c_n \mid x), c)$$
– Apply cross validation
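Both combination schemes are easy to illustrate. The sketch below assumes each model returns a dict mapping class to probability; the function names and the two example distributions are hypothetical.

```python
def vote(distributions):
    """Unweighted voting: average the L class distributions for one instance."""
    classes = distributions[0].keys()
    return {c: sum(d[c] for d in distributions) / len(distributions)
            for c in classes}


def stacking_instance(distributions, classes, true_class=None):
    """Concatenate the L distributions into one meta-level feature vector.

    The true class is appended only for training instances.
    """
    features = [d[c] for d in distributions for c in classes]
    return features if true_class is None else features + [true_class]


classes = ["DM", "DB"]
author_model = {"DM": 0.8, "DB": 0.2}    # e.g. output for the author relation
journal_model = {"DM": 0.4, "DB": 0.6}   # e.g. output for the journal relation

print(vote([author_model, journal_model]))
print(stacking_instance([author_model, journal_model], classes, true_class="DM"))
```

For stacking, the base-model distributions used to build training instances would normally come from held-out folds, so that the meta-classifier does not learn from predictions made on data the base models were trained on.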
Evaluation
• Data
– CompuScience data set
• 147 571 scientific papers
• 77 topics (categories)
• Relations: authors, reviewers, journals
– Cora deduplication data set
• 1 295 citations
• 112 unique publications
• Relation: SamePaper
– Cora data set
• 3298 papers
• 12 categories
• Relations: conferences, authors, citations
Evaluation – Relation Extraction
Pairwise feature extraction with TFIDF, Levenshtein, Jaccard, Cosine on all attributes
F1 measure for finding the SamePaper relation on Cora:
Evaluation set  single linkage  complete linkage  average linkage
X_tst           0.90            0.74              0.92
X               0.92            0.71              0.93
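For reference, an F1 score for an extracted equivalence relation can be computed over object pairs, as in the sketch below. Whether the reported numbers were computed over pairs in exactly this way is an assumption; the helper names and the toy clusterings are hypothetical.

```python
from itertools import combinations


def pairs(clusters):
    """Unordered pairs of objects that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(predicted_clusters, true_clusters):
    """F1 over the pairs placed in the same cluster."""
    pred, true = pairs(predicted_clusters), pairs(true_clusters)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)


true_clusters = [{1, 2, 3}, {4}]
predicted_clusters = [{1, 2}, {3, 4}]
print(pairwise_f1(predicted_clusters, true_clusters))
```

In the toy case one of two predicted pairs is correct (precision 1/2) and one of three true pairs is found (recall 1/3).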
Evaluation – Multi-Relational Classification
3-fold cross validation on CompuScience for the Author, Reviewer and Journal relations
• The ensemble of relational and content-based text classification achieved a significantly higher F-measure than the pure text classifier
Evaluation – Multi-Relational Classification using automatically extracted relations
• 50%/50% splits, 10 runs
[Figure: Author Relation; accuracy (y-axis, 0.5 to 0.75) over runs 1 to 10 (x-axis); series: Annotated Relation, Learned Relation]
Conclusion and Future Work
• Summary:
– Presented a framework for relation extraction and multi-relational classification
• Automatic relation extraction with record linkage
• Relational classification using each extracted relation for classification and fusing the results with ensemble methods
• Future Work
– Evaluate our framework on different data sets and relations
– Evaluate the relational classifiers' quality depending on the quality of the extracted relations
Questions?
www.ismll.uni-hildesheim.de
Christine Preisach
Steffen Rendle
Lars Schmidt-Thieme
Thank you