Distant supervision for relation extraction without labeled data
Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky
ACL 2009
Introduced by Makoto Morishita
Apr 15, 2017
Contribution of this paper
• Proposed “distant supervision” for the first time.
• With distant supervision, we can extract relations between entities from sentences without manual annotation work.
Current training methods
• Supervised learning
• Unsupervised learning
• Self-training
• Active learning
Supervised learning
• Use only annotated data to train a model.
• Creating the annotated data is expensive.
[Figure: annotated data]
Unsupervised learning
• Use only unannotated data.
• The output may not map onto the labels that a particular task needs.
[Figure: unannotated data]
Self-training
• Use annotated data as the seed to train a model, then let the model itself annotate the unlabeled data.
• It may have low precision and inherit a bias from the annotated seed data.
[Figure: unannotated data, annotated data]
Active learning
• Use the existing model to evaluate which data should be annotated next, then annotate the selected data.
[Figure: unannotated data, annotated data; steps: evaluate, annotate]
Distant supervision
• Use an existing database and unannotated data to train a classifier, then use the classifier to annotate new data.
[Figure: an existing database and unannotated data train the classifier; the classifier annotates further unannotated data]
What we want to do
• Extract the relation between entities from sentences.
• e.g.
- sentence: Kyoto, the famous place in Japan.
- entities: Japan, Kyoto
- relation: location-contains <Japan, Kyoto>
In this work…
• Freebase: 102 relations, 940k entities, 1.8M instances.
[Figure: Freebase (existing database) and Wikipedia (unannotated data) train a multiclass logistic regression classifier, which then annotates Wikipedia text]
Training
• Find sentences that contain two entities with a known relation.
- Such a sentence tends to express that relation.
- Entities are found by a named entity tagger.
• Train the classifier.
- The features are explained later.
Example
• Known relations:
- location-contains <Virginia, Richmond>
- location-contains <France, Nantes>
• We find sentences like:
- Richmond, the capital of Virginia.
- Henry’s Edict of Nantes helped the Protestants of France.
• Train the classifier using these sentences.
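The labeling step in this example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the names `KNOWN_RELATIONS`, `SENTENCES`, and `label_sentences` are hypothetical, and a plain substring match stands in for the named entity tagger.

```python
# Toy knowledge base built from the known relations in the example:
# (entity1, entity2) -> relation name.
KNOWN_RELATIONS = {
    ("Virginia", "Richmond"): "location-contains",
    ("France", "Nantes"): "location-contains",
}

# Unannotated sentences; the last one mentions no known entity pair.
SENTENCES = [
    "Richmond, the capital of Virginia.",
    "Henry's Edict of Nantes helped the Protestants of France.",
    "Kyoto, the famous place in Japan.",
]


def label_sentences(sentences, known_relations):
    """Pair each sentence with every known relation whose two entities it mentions."""
    examples = []
    for sentence in sentences:
        for (entity1, entity2), relation in known_relations.items():
            # Assumption: substring matching replaces a real named entity tagger.
            if entity1 in sentence and entity2 in sentence:
                examples.append((sentence, entity1, entity2, relation))
    return examples


training_examples = label_sentences(SENTENCES, KNOWN_RELATIONS)
for example in training_examples:
    print(example)
```

Only the first two sentences become training examples. The Kyoto sentence is left unlabeled because <Japan, Kyoto> is not in this toy database.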
Testing
• Find sentences that contain two entities.
- Such a sentence tends to express a relation between them.
- Entities are found by a named entity tagger.
• Using the trained classifier, predict which relation holds between these entities.
Features
• Lexical features:
- specific words between and surrounding the two entities in the sentence.
• Syntactic features:
- dependency path
Lexical features
• The sequence of words between the two entities.
• The part-of-speech tags of these words.
• A flag indicating which entity came first in the sentence.
• A window of k words to the left of Entity 1 and their part-of-speech tags.
• A window of k words to the right of Entity 2 and their part-of-speech tags.
Astronomer Edwin Hubble was born in Marshfield, Missouri.
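As a rough illustration, the lexical features can be computed for this sentence as follows. The tokenization, the part-of-speech tags, and the entity spans are written by hand here (an assumption, in place of a real tagger), and the function name `lexical_features` is hypothetical.

```python
def lexical_features(tokens, tags, span1, span2, k=1):
    """Collect lexical features for two entity spans.

    span1 and span2 are (start, end) token indices with end exclusive.
    """
    first, second = sorted([span1, span2])

    def tagged(indices):
        # Pair each selected word with its part-of-speech tag.
        return [(tokens[i], tags[i]) for i in indices]

    return {
        # Words (and tags) between the two entities.
        "between": tagged(range(first[1], second[0])),
        # Flag: does the entity given as span1 come first in the sentence?
        "entity1_first": span1 < span2,
        # Up to k words to the left of the first entity.
        "left_window": tagged(range(max(0, first[0] - k), first[0])),
        # Up to k words to the right of the second entity.
        "right_window": tagged(range(second[1], min(len(tokens), second[1] + k))),
    }


# Hand-written tokens and tags (assumption, not tagger output).
tokens = ["Astronomer", "Edwin", "Hubble", "was", "born", "in",
          "Marshfield", ",", "Missouri", "."]
tags = ["NN", "NNP", "NNP", "VBD", "VBN", "IN", "NNP", ",", "NNP", "."]

# "Edwin Hubble" covers tokens 1-2, "Marshfield" is token 6.
features = lexical_features(tokens, tags, span1=(1, 3), span2=(6, 7), k=1)
print(features)
```

With k = 1, the words between the entities are "was born in", the left window is "Astronomer", and the right window is the comma after "Marshfield".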
Syntactic features
• A dependency path between the two entities.
• For each entity, one “window” node that is not part of the dependency path.
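A minimal sketch of these two features on the sentence "Astronomer Edwin Hubble was born in Marshfield, Missouri." is given below. The dependency edges are written by hand for illustration and are not the output of any particular parser; edge labels and directions are left out, and the names `EDGES`, `dependency_path`, and `window_nodes` are hypothetical.

```python
from collections import deque

# Hand-written (head, dependent) edges; an assumed parse, for illustration only.
EDGES = [
    ("born", "Edwin Hubble"),
    ("Edwin Hubble", "Astronomer"),
    ("born", "was"),
    ("born", "in"),
    ("in", "Marshfield"),
    ("Marshfield", "Missouri"),
]


def build_graph(edges):
    """Turn the edge list into an undirected adjacency map."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)
    return graph


def dependency_path(graph, start, goal):
    """Breadth-first search for the shortest path between two nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def window_nodes(graph, entity, path):
    """Neighbors of the entity that are not on the dependency path."""
    return [node for node in graph[entity] if node not in path]


graph = build_graph(EDGES)
path = dependency_path(graph, "Edwin Hubble", "Marshfield")
left_windows = window_nodes(graph, "Edwin Hubble", path)
right_windows = window_nodes(graph, "Marshfield", path)
print(path, left_windows, right_windows)
```

Under this assumed parse, the path runs from "Edwin Hubble" through "born" and "in" to "Marshfield", and the window candidates are "Astronomer" and "Missouri".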
Conclusion
• With this method, we can extract relations from unlabeled text.
• Because the labels come from a database, the extracted relations fit that database.
• The extracted relations appear to be accurate.
Example usage of distant supervision
Existing database | Target annotation
Freebase (relations between entities) | Wikipedia sentences (find new relations)
Emoticons | Tweets (annotate positive, negative)
Dependency parse trees, knowledge base | Semantic parser
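The emoticon row can be illustrated with a small sketch: emoticons act as the "existing database" of labels, and tweets are the text to annotate. The emoticon lists and the function `label_tweet` are assumptions made for this example, not part of the slides.

```python
# Assumed emoticon lists; a real system would use a larger inventory.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def label_tweet(text):
    """Return (cleaned_text, label), or None if the emoticons give no clear label."""
    has_positive = any(emoticon in text for emoticon in POSITIVE)
    has_negative = any(emoticon in text for emoticon in NEGATIVE)
    # Skip tweets with no emoticon or with conflicting emoticons.
    if has_positive == has_negative:
        return None
    label = "positive" if has_positive else "negative"
    # Remove the emoticons so a classifier cannot simply memorize them.
    cleaned = text
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    return cleaned.strip(), label


print(label_tweet("Great talk today :)"))
print(label_tweet("Missed my train :("))
print(label_tweet("No emoticon here"))
```

As with the Freebase case, the labels come for free from an existing signal, so they are noisy but need no manual annotation.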
Comments
• Distant supervision can be useful for other tasks.
- Currently, this method is used mainly for the relation extraction task.
• However, it assumes that a large database already exists.