Seed Selection for Distantly Supervised Web-Based Relation Extraction
Isabelle Augenstein, Department of Computer Science, University of Sheffield, UK
[email protected]
August 24, 2014
Semantic Web for Information Extraction (SWAIE) Workshop, COLING 2014
Slides of my presentation on "Seed Selection for Distantly Supervised Web-Based Relation Extraction" at the Semantic Web and Information Extraction (SWAIE) workshop at COLING 2014. Download link for the paper: http://staffwww.dcs.shef.ac.uk/people/I.Augenstein/SWAIE2014-Seed.pdf
Transcript
Seed Selection for Distantly Supervised Web-Based Relation Extraction
Isabelle Augenstein, Department of Computer Science, University of Sheffield, UK
Semantic Web for Information Extraction (SWAIE) Workshop, COLING 2014
Motivation
• Goal: extraction of relations in text on Web pages (e.g. Mashable) with respect to a knowledge base (e.g. Freebase)
• What are possible methodologies?
  • Supervised learning: manually annotate text, train a machine learning classifier -> manual effort
  • Unsupervised learning: extract language patterns, cluster similar ones -> difficult to map to the KB, lower precision than supervised methods
  • Semi-supervised learning: start with a small number of language patterns, iteratively learn more (bootstrapping) -> still manual effort, semantic drift (unwanted shift in meaning)
  • Distant supervision: automatically label text with relations from the knowledge base, train a machine learning classifier -> allows extracting relations with respect to the KB, reasonably high precision, no manual effort
6
Distant Supervision
Seed Selection for Distantly Supervised Web-Based Relation Extraction (SWAIE Workshop)
Creating positive & negative training
examples
Feature Extraction
Classifier Training
Prediction of New
Relations
Supervised learning
Automatically generated training data
+
Generating training data
"If two entities participate in a relation, any sentence that contains those two entities might express that relation." (Mintz et al., 2009)
Example sentences:
• Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
• Blur helped to popularise the Britpop genre.
• Beckham rose to fame with the all-female pop group Spice Girls.
Knowledge base entries (note the different lexicalisations of each entity):

Name | Genre
Amy Winehouse / Amy Jade Winehouse / Wino | R&B, soul, jazz
Blur | Britpop
Spice Girls | pop
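The Mintz-style labelling step above can be sketched as follows. This is a minimal illustration with a toy KB and the slide's example sentences; the entity names, data structures and matching by plain substring search are simplifying assumptions, not the paper's actual pipeline (which queries Freebase).

```python
# Toy KB: subject -> relation -> set of object lexicalisations (hypothetical)
kb = {
    "Amy Winehouse": {"genre": {"R&B", "soul", "jazz"}},
    "Blur": {"genre": {"Britpop"}},
    "Spice Girls": {"genre": {"pop"}},
}

# Alternative lexicalisations of each subject entity
subject_lexicalisations = {
    "Amy Winehouse": {"Amy Winehouse", "Amy Jade Winehouse", "Wino"},
    "Blur": {"Blur"},
    "Spice Girls": {"Spice Girls"},
}

sentences = [
    "Amy Jade Winehouse was a singer known for R&B, soul and jazz.",
    "Blur helped to popularise the Britpop genre.",
    "Beckham rose to fame with the all-female pop group Spice Girls.",
]

def label_sentences(sentences, kb, subject_lexicalisations):
    """Return (sentence, subject, relation, object) training examples:
    any sentence containing both a subject lexicalisation and an object
    lexicalisation is labelled with that relation."""
    examples = []
    for sentence in sentences:
        for subject, relations in kb.items():
            # does any lexicalisation of the subject occur in the sentence?
            if not any(lex in sentence for lex in subject_lexicalisations[subject]):
                continue
            for relation, objects in relations.items():
                for obj in objects:
                    if obj in sentence:
                        examples.append((sentence, subject, relation, obj))
    return examples

for example in label_sentences(sentences, kb, subject_lexicalisations):
    print(example)
```

Matching on surface strings like this is exactly what makes the labelling noisy, which motivates the seed selection methods on the following slides.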
Generating training data: is it that easy?
Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

Name | Album | Track
The Beatles | Let It Be | Let It Be

• Use 'Let It Be' mentions as positive training examples for album or for track?
• Problem: if both mentions of 'Let It Be' are used to extract features for both album and track, the wrong weights are learnt
• How can such ambiguous examples be detected? Develop methods to detect, then automatically discard, potentially ambiguous positive and negative training data
Seed Selection: ambiguity within an entity
• Example: Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.
• Let It Be can be both an album and a track of the musical artist The Beatles
• For every relation consisting of a subject, a property and an object (s, p, o): is the subject related to (at least) two different objects with the same lexicalisation which express two different relations?
• Unam:
  • Retrieve the number of such senses using the Freebase API
  • Discard the lexicalisation of the object as positive training data if it has at least two different senses within an entity
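The Unam heuristic can be sketched as below. The triples and the in-memory sense lookup are illustrative assumptions; the actual system retrieves sense counts via the Freebase API.

```python
from collections import defaultdict

# (subject, property, object-lexicalisation) triples for one entity
# (hypothetical snapshot of KB data)
triples = [
    ("The Beatles", "album", "Let It Be"),
    ("The Beatles", "track", "Let It Be"),
    ("The Beatles", "album", "Abbey Road"),
]

def unambiguous_seeds(triples):
    """Keep only (s, p, o) triples whose object lexicalisation fills
    exactly one property of that subject; lexicalisations with two or
    more senses within the entity are discarded as positive seeds."""
    senses = defaultdict(set)  # (subject, object lexicalisation) -> properties
    for s, p, o in triples:
        senses[(s, o)].add(p)
    return [(s, p, o) for s, p, o in triples if len(senses[(s, o)]) == 1]

print(unambiguous_seeds(triples))
# Both 'Let It Be' triples are dropped; 'Abbey Road' survives.
```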
Seed Selection: ambiguity across classes
• Example: common names of book authors or common genres, e.g. "Jack mentioned that he read On the Road", in which Jack is falsely recognised as the author Jack Kerouac.
• Stop: remove common words that are stopwords
• Stat: estimate how ambiguous a lexicalisation of an object is compared to other lexicalisations of objects of the same relation
  • For every lexicalisation of an object of a relation, retrieve the number of senses using the Freebase API (example: for Jack, n=1066)
  • Compute a frequency distribution per relation with min, max, median (50th percentile), lower quartile (25th percentile) and upper quartile (75th percentile) (example: for author: min=0, max=3059, median=10, lower=4, upper=32)
  • For every lexicalisation of an object of a relation, if the number of senses > the upper quartile (or the lower quartile, or the median, depending on the model), discard it (example: 1066 > 32 -> Jack is discarded)
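The Stat heuristic can be sketched as follows. The sense counts below are toy values standing in for real Freebase API lookups, and the exact percentile interpolation is an implementation assumption.

```python
from statistics import quantiles

# object lexicalisation -> number of Freebase senses (hypothetical values
# for the 'author' relation)
sense_counts = {
    "Jack": 1066,        # very ambiguous common name
    "Jack Kerouac": 3,
    "Truman Capote": 2,
    "Harper Lee": 4,
    "J. D. Salinger": 2,
}

def filter_by_percentile(sense_counts, percentile=75):
    """Discard lexicalisations whose sense count exceeds the chosen
    percentile (25th, 50th or 75th, depending on the model variant)
    of the relation's sense-count distribution."""
    counts = sorted(sense_counts.values())
    q1, median, q3 = quantiles(counts, n=4)  # 25th, 50th, 75th percentiles
    threshold = {25: q1, 50: median, 75: q3}[percentile]
    return {lex for lex, n in sense_counts.items() if n <= threshold}

print(filter_by_percentile(sense_counts, percentile=75))
# 'Jack' exceeds the upper quartile and is discarded.
```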
Seed Selection: discarding negative seeds
• Creating negative training data: all entities which appear in the sentence with the subject, but are not in a relation with it, are used as negative training data
• Problem: knowledge bases are incomplete
• Idea: object lexicalisations are often shared across entities, e.g. for the relation genre
• Check if an unknown lexicalisation is a lexicalisation of a different relation
• Incomp: for every lexicalisation l of a property, discard it as negative training data if any of the properties of the class we examine has an object lexicalisation l
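The Incomp heuristic can be sketched as below. The class-level lexicalisation sets are hypothetical; in the actual system they come from the knowledge base.

```python
# For the class MusicalArtist: property -> all object lexicalisations
# observed anywhere in the KB (hypothetical values)
class_object_lexicalisations = {
    "genre": {"Britpop", "pop", "R&B", "soul", "jazz"},
    "origin": {"London", "Dublin"},
}

def keep_as_negative(lexicalisation, class_object_lexicalisations):
    """Return True if the lexicalisation is safe to use as a negative
    training example, i.e. it matches no known object lexicalisation
    of any property of the class (the KB might just be incomplete
    for this particular entity)."""
    return all(
        lexicalisation not in objects
        for objects in class_object_lexicalisations.values()
    )

# 'Britpop' is a genre elsewhere in the KB, so it is discarded as a
# potentially false negative; 'Beckham' is kept.
print(keep_as_negative("Britpop", class_object_lexicalisations))  # False
print(keep_as_negative("Beckham", class_object_lexicalisations))  # True
```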
Distant Supervision system: corpus
• Web crawl corpus, created using entity-specific search queries, consisting of 450k Web pages

Class | Properties / Relations
Book | author, characters, publication date, genre, ISBN, original language
Musical Artist | album, active (start), active (end), genre, record label, origin, track

Merging and Ranking: aggregate predictions for occurrences with the same surface form
• E.g. Dublin could have the predictions MusicalArtist:album, origin and NONE
• Compute the mean of the confidence values per relation and select the highest-ranked one
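The merging-and-ranking step can be sketched as follows. The prediction tuples and confidence values are hypothetical; the sketch only shows the aggregation by mean confidence described above.

```python
from collections import defaultdict
from statistics import mean

# (surface form, predicted relation, confidence) for each occurrence
# (hypothetical classifier output)
predictions = [
    ("Dublin", "MusicalArtist:album", 0.4),
    ("Dublin", "MusicalArtist:origin", 0.9),
    ("Dublin", "NONE", 0.3),
    ("Dublin", "MusicalArtist:origin", 0.7),
]

def merge_and_rank(predictions):
    """Group predictions by surface form, average the confidence per
    predicted relation, and return the highest-ranked relation."""
    grouped = defaultdict(lambda: defaultdict(list))
    for surface, relation, confidence in predictions:
        grouped[surface][relation].append(confidence)
    return {
        surface: max(by_relation, key=lambda r: mean(by_relation[r]))
        for surface, by_relation in grouped.items()
    }

print(merge_and_rank(predictions))  # {'Dublin': 'MusicalArtist:origin'}
```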
Evaluation
• Split the corpus equally into training and test parts
• Hand-annotate the portion of the test corpus which has a NONE prediction (no representation in Freebase)
Results: precision per model, ranked by confidence
[Figure: precision (y-axis, 0.7 to 1.0) against minimum confidence (x-axis, 0 to 0.9; e.g. 0.1 means all occurrences with confidence >= 0.1) for the models unam_stop_stat25, unam_stop_stat50, unam_stop_stat75, unam_stop, stop, baseline and incompl]
Results Summary
• The best-performing model (unam_stop_stat25) has a precision of 0.896, compared to the baseline model's precision of 0.825, reducing the error rate by 35%
• However, these seed selection methods all come at a small loss in the number of extractions (20% fewer) because they reduce the amount of training data
• Removing potentially false negative training data (incomp) does not perform well:
  • Too many training examples are removed
  • The removed training examples are lexicalisations which have the same types of values, and those are crucial for learning
  • Performance is especially poor for numerical values
Related work on dealing with noise for distant supervision
• At-least-one models:
  • Relaxed distant supervision assumption: assume that just one of the relation mentions is a true relation mention
  • Graphical models, inference based on ranking
  • Very challenging to train
• Hierarchical topic models:
  • Only learn from positive training examples
  • Pre-processing with a multi-layer topic model to group extraction patterns and determine which ones are specific to each relation and which are not
• Pattern correlations:
  • Probabilistic graphical model to group extraction patterns
Comparison of our approach to dealing with ambiguity with related approaches
• Related approaches all try to solve the problem of ambiguity within the machine learning model
• Our approach deals with ambiguity as a pre-processing step when creating training data
• While related approaches try to address the problem of noisy data with more complicated models, we explored how to exploit background data from the KB even further
• We explored how simple statistical methods based on data already present in the knowledge base can help to filter out unreliable training data
Conclusions
• Simple statistical methods based on background knowledge present in the KB perform well at detecting ambiguous training examples
• An error reduction of up to 35% can be achieved by strategically selecting seed data
• The increase in precision is encouraging; however, it comes at the expense of the number of extractions (20% fewer)
• Higher recall could be achieved by increasing the number of training instances initially:
  • Use a bigger corpus
  • Make better use of the knowledge contained in the corpus
Future Work
• Distantly supervised named entity classification for relation extraction, to improve performance for non-standard entities; joint models for NER and RE
• Relax the distant supervision assumption to achieve a higher number of extractions: extract relations across sentence boundaries, coreference resolution
• Combined extraction models for information from text, lists and tables on Web pages, to improve precision and recall