Modeling Missing Data in Distant Supervision for Information Extraction
Alan Ritter (CMU), Luke Zettlemoyer (University of Washington), Mausam (University of Washington), Oren Etzioni (Vulcan Inc.)
TACL, 1, 367-378, 2013.
Presented by Naoaki Okazaki (Tohoku University), 2014-09-05
Relation instance extraction
Input: Steven Spielberg’s film Saving Private Ryan is loosely based on the brothers’ story.
Extractor output (film-director relation):
Film | Director
Saving Private Ryan | Steven Spielberg
• Fully-supervised learning (Zhou+ 05, …)
  • Uses ACE corpora to build relation-instance classifiers
  • Suffers from the limited amount of training data
• Unsupervised information extraction (Banko+ 07, …)
  • Extracts relational patterns between entities, and clusters the patterns into relations
  • Difficult to map clusters into relations of interest
• Bootstrap learning (Brin 98, …)
  • Uses seed instances to extract a new set of relational patterns
  • Often suffers from low precision (semantic drift)
• Distant supervision (Mintz+ 09, …)
  • Combines the advantages of the above approaches
Distant supervision (Mintz+, 09)
Person | Birthplace
Edwin Hubble | Marshfield
… | …
Automatic annotation: Astronomer Edwin Hubble was born in Marshfield, Missouri.
Feature extraction*
* Each row presents a single feature. Features from different sentences containing the same entity pair are concatenated.
Mintz et al. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.
Problem: an entity pair cannot be assigned multiple relations, although, e.g., both Founded(Jobs, Apple) and CEO-of(Jobs, Apple) are true.
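The automatic annotation step above can be sketched in a few lines of Python. This is a minimal illustration, not Mintz et al.'s implementation: the toy knowledge base, the extra sentences, the plain substring matching, and the helper names `annotate`, `words_between` and `pair_features` are all assumptions made for the example.

```python
from collections import defaultdict

# Toy knowledge base (hypothetical data): (entity1, entity2) -> relation.
KB = {("Edwin Hubble", "Marshfield"): "born-in"}

SENTENCES = [
    "Astronomer Edwin Hubble was born in Marshfield, Missouri.",
    "Edwin Hubble grew up far from Marshfield.",
    "Steven Spielberg directed Saving Private Ryan.",
]


def annotate(kb, sentences):
    """Label every sentence that mentions both entities of a KB pair."""
    labeled = defaultdict(list)
    for (e1, e2), relation in kb.items():
        for sentence in sentences:
            # Assumption: naive substring matching stands in for entity linking.
            if e1 in sentence and e2 in sentence:
                labeled[(e1, e2, relation)].append(sentence)
    return dict(labeled)


def words_between(sentence, e1, e2):
    """Crude lexical feature: words between the mentions (assumes e1 precedes e2)."""
    start = sentence.index(e1) + len(e1)
    end = sentence.index(e2)
    return sentence[start:end].split()


def pair_features(sentences, e1, e2):
    """Concatenate the features of all sentences that contain the same pair."""
    return ["between:" + " ".join(words_between(s, e1, e2)) for s in sentences]


labeled = annotate(KB, SENTENCES)
for (e1, e2, relation), matched in labeled.items():
    print(relation, pair_features(matched, e1, e2))
```

Both Hubble sentences receive the born-in label because both mention the pair, even though only the first one states a birthplace. The pair also gets a single label, which is the limitation noted in the problem statement above.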
MultiR (Hoffmann+, 11)
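Introduces latent variables $z_i$ to indicate the relation expressed by sentence $x_i$

For the entity pair (Steve Jobs, Apple):
• $x_i$: a sentence containing the entity pair
• $y_r \in \{0,1\}$: 1 if the knowledge base includes the pair with relation $r$, 0 otherwise
• $z_i \in R$: the relation expressed by sentence $x_i$

[Figure: graphical model connecting $\boldsymbol{y}$, $\boldsymbol{z}$ and $\boldsymbol{x}$]
• $\boldsymbol{y}$: $y_{\text{born-in}} = 0$, $y_{\text{founder}} = 1$, $y_{\text{CEO-of}} = 1$, $y_{\text{capital-of}} = 0$
• $\boldsymbol{z}$: $z_1$ = Founder, $z_2$ = Founder, $z_3$ = CEO-of
• $\boldsymbol{x}$:
  • $x_1$: Steve Jobs was founder of Apple.
  • $x_2$: Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.
  • $x_3$: Steve Jobs is CEO of Apple.

$$p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x}) = \frac{1}{Z_x} \prod_r \Phi^{\text{join}}(y_r, \boldsymbol{z}) \prod_i \Phi^{\text{extract}}(z_i, x_i)$$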
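To make the factorization concrete, here is a small Python sketch that evaluates $p(\boldsymbol{y}, \boldsymbol{z} \mid \boldsymbol{x})$ for the Steve Jobs example by brute force. It assumes the deterministic-OR join factor described by Hoffmann et al. (2011), where $y_r = 1$ exactly when some $z_i = r$, and a log-linear extract factor. The `NONE` label, the keyword-cue features and the weights are hypothetical values chosen only for illustration.

```python
import math
from itertools import product

RELATIONS = ["born-in", "founder", "CEO-of", "capital-of"]
NONE = "NONE"  # assumed label for "this sentence expresses no relation"


def phi_join(y_r, relation, z):
    """Deterministic OR: y_r must be 1 exactly when some sentence expresses r."""
    return 1.0 if bool(y_r) == (relation in z) else 0.0


def phi_extract(z_i, x_i, weights):
    """Log-linear sentence factor over hypothetical keyword-cue features."""
    score = sum(w for (rel, cue), w in weights.items() if rel == z_i and cue in x_i)
    return math.exp(score)


def unnormalized(y, z, x, weights):
    """Product of all join factors and all extract factors."""
    join = math.prod(phi_join(y[r], r, z) for r in RELATIONS)
    extract = math.prod(phi_extract(z_i, x_i, weights) for z_i, x_i in zip(z, x))
    return join * extract


def joint_probability(y, z, x, weights):
    """p(y, z | x), computing Z_x by enumeration (feasible for toy sizes only)."""
    z_x = 0.0
    for z_alt in product(RELATIONS + [NONE], repeat=len(x)):
        # For a fixed z, only one y has a non-zero join factor.
        y_alt = {r: int(r in z_alt) for r in RELATIONS}
        z_x += unnormalized(y_alt, z_alt, x, weights)
    return unnormalized(y, z, x, weights) / z_x


x = [
    "Steve Jobs was founder of Apple.",
    "Steve Jobs, Steve Wozniak and Ronald Wayne founded Apple.",
    "Steve Jobs is CEO of Apple.",
]
# Hypothetical weights: (relation, cue word) -> weight.
weights = {("founder", "founder"): 2.0, ("founder", "founded"): 2.0, ("CEO-of", "CEO"): 2.0}
y = {"born-in": 0, "founder": 1, "CEO-of": 1, "capital-of": 0}
z = ("founder", "founder", "CEO-of")
print(joint_probability(y, z, x, weights))
```

Any assignment in which $\boldsymbol{y}$ disagrees with $\boldsymbol{z}$, for example $y_{\text{founder}} = 0$ while some sentence is labeled founder, gets probability zero under this join factor.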
• Knowledge base: Freebase (instances and their categories)
• Text corpus: tweets
• Hold-out evaluation
Results
17% increase in area under the curve.
Incorporating popularity yielded a 27% increase over the baseline.
This evaluation underestimates precision because many facts correctly extracted from text are missing from the database.
DNMAR doubled the recall.
Ritter et al. (2013) Modeling Missing Data in Distant Supervision for Information Extraction, TACL(1), 367-378.
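The following Python sketch illustrates why a hold-out evaluation against an incomplete database understates precision. It is a toy illustration only: the ranked extractions, the two gold sets, and the trapezoidal area computation are assumptions, and the paper's exact evaluation protocol may differ.

```python
def pr_curve(ranked, gold):
    """Return (recall, precision) after each rank; assumes gold is non-empty."""
    points = []
    hits = 0
    for k, fact in enumerate(ranked, start=1):
        hits += fact in gold  # True counts as 1
        points.append((hits / len(gold), hits / k))
    return points


def area_under_curve(points):
    """Trapezoidal area over recall, starting at recall 0 with the first precision."""
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area


# Hypothetical extractions, sorted by decreasing confidence.
ranked = ["A", "B", "C", "D"]
true_facts = {"A", "B", "C"}  # what is actually correct
held_out_kb = {"A", "C"}      # B is correct but missing from the database

auc_kb = area_under_curve(pr_curve(ranked, held_out_kb))
auc_true = area_under_curve(pr_curve(ranked, true_facts))
print(auc_kb, auc_true)
```

Fact B is counted as an error when scored against the held-out database, so the measured curve lies below the one computed from the true facts.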
Conclusion
• Investigated the problem of missing data in distant supervision
• Presented an extension of MultiR to handle missing data
• Could incorporate the popularity of facts to be included in the knowledge base and text
• Presented a scalable inference algorithm based on greedy hill-climbing
• Demonstrated the effectiveness of the modeling
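As a rough illustration of greedy hill-climbing over sentence-level labels, the Python sketch below changes one $z_i$ at a time and keeps any change that raises a score. It is a generic local search, not the paper's inference algorithm: the objective in `make_score`, its penalty for disagreeing with the knowledge base, the per-sentence scores, and the `NONE` label are all hypothetical.

```python
NONE = "NONE"  # assumed label for "no relation expressed"


def make_score(local, kb, penalty):
    """Toy objective: per-sentence scores minus a penalty per KB disagreement."""
    def score(z):
        expressed = {label for label in z if label != NONE}
        total = sum(local[i][label] for i, label in enumerate(z))
        total -= penalty * len(expressed - kb)  # extracted, but not in the KB
        total -= penalty * len(kb - expressed)  # in the KB, but not extracted
        return total
    return score


def hill_climb(z, labels, score):
    """Relabel one sentence at a time; stop when no single change improves."""
    z = list(z)
    best = score(z)
    improved = True
    while improved:
        improved = False
        for i in range(len(z)):
            for label in labels:
                if label == z[i]:
                    continue
                candidate = z[:i] + [label] + z[i + 1:]
                s = score(candidate)
                if s > best:  # strict improvement guarantees termination
                    z, best, improved = candidate, s, True
    return z, best


labels = ["founder", "CEO-of", NONE]
# Hypothetical per-sentence scores for two sentences.
local = [
    {"founder": 2.0, "CEO-of": 0.0, NONE: 0.5},
    {"founder": 0.0, "CEO-of": 1.0, NONE: 0.5},
]
kb = {"founder"}  # CEO-of is missing from the toy knowledge base

strict = hill_climb([NONE, NONE], labels, make_score(local, kb, penalty=2.0))
soft = hill_climb([NONE, NONE], labels, make_score(local, kb, penalty=0.25))
print(strict, soft)
```

With the larger penalty the search keeps the second sentence at `NONE`. With the smaller one it labels that sentence CEO-of even though the toy knowledge base lacks the fact, which is the kind of behavior a model of missing data is meant to allow.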
References
• Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, Daniel S. Weld. (2011) Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. ACL-2011, pages 541–550.
• Slides and codes
• Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky. (2009) Distant supervision for relation extraction without labeled data. ACL-2009, pages 1003–1011.