Information Extraction MAS.S60 Catherine Havasi Rob Speer
Feb 25, 2016
Wikipedia as a corpus
• 3.9 million English articles, 284 languages
• 2 billion words
– Brown has 1 million
• DBpedia and Freebase
Text reveals relations
• “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.”
• “These were performed in town halls and other large buildings...”
• “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
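The three sentences above share the surface pattern "X, Y, and other Z", which suggests that X and Y are kinds of Z. A minimal sketch of matching it with a regular expression follows. The names `HEARST_AND_OTHER` and `extract_and_other` are made up for illustration, and the regex only sees single words, so it is not a full noun-phrase extractor.

```python
import re

# "<item>(, <item>)*[,] and other <category>", one word per slot.
# Assumption: single-word items and categories; no tagger or chunker is used.
HEARST_AND_OTHER = re.compile(
    r"(?P<items>\w+(?:, \w+)*),? and other (?P<category>\w+)"
)

def extract_and_other(sentence):
    """Return (item, category) pairs suggested by 'X, Y, and other Z'."""
    pairs = []
    for m in HEARST_AND_OTHER.finditer(sentence):
        category = m.group("category")
        for item in m.group("items").split(", "):
            pairs.append((item, category))
    return pairs

sentence = ("Various explanations of the overabundance of carbon, oxygen, "
            "nitrogen, and other elements have been proposed.")
print(extract_and_other(sentence))
```

On the second sentence this sketch pairs "halls" with "large" rather than "town halls" with "large buildings", which is one reason a part-of-speech tagger or chunker may be needed to find the full phrases.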
NACLO puzzle
Would it be plausible to describe something as “danty but sloshful”?
Possible patterns
• both X and Y
• X but not Y
• use NP to VP
• [Un]fortunately, VP
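The first two patterns can be tried directly as regular expressions over raw text. This is a rough sketch: the `PATTERNS` table, the `match_patterns` function, and the example sentence are invented for illustration, and X and Y are limited to single words. The NP and VP patterns would need a tagger or parser and are left out.

```python
import re

# Hypothetical pattern table; each pattern captures X and Y as single words.
PATTERNS = {
    "both_and": re.compile(r"\bboth (\w+) and (\w+)", re.IGNORECASE),
    "but_not": re.compile(r"\b(\w+) but not (\w+)", re.IGNORECASE),
}

def match_patterns(text):
    """Return (pattern_name, X, Y) for every match of every pattern."""
    results = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            results.append((name, m.group(1), m.group(2)))
    return results

# Made-up example sentence.
print(match_patterns("The soup was both hot and spicy, cheap but not bland."))
```

A match for "both X and Y" hints that X and Y are compatible, while "X but not Y" hints at a contrast, which is the kind of evidence the "danty but sloshful" puzzle relies on.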
Constraints using named entities
Constraints using named entities and parts of speech
TextRunner
• Starts out with some seed patterns
• Label: Uses those to label possible extractions in a sentence
• Learn: Using a graphical model
• Extract: Using the learned pattern, extract from the sentence
• Problem: 200,000–300,000 labeled training points needed
ReVerb
• Syntactic Constraint
– Requires extraction to match syntactic patterns
• Lexical Constraint
– Phrases must have many different arguments in the corpus
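The lexical constraint can be pictured as a frequency filter: a relation phrase is kept only if it occurs with enough distinct argument pairs. The sketch below assumes extractions are already available as (arg1, relation, arg2) triples. The function name, the toy data, and the threshold of 2 are illustrative choices, not values from ReVerb itself.

```python
from collections import defaultdict

def filter_by_lexical_constraint(extractions, min_distinct_args=2):
    """Keep triples whose relation phrase has enough distinct argument pairs.

    min_distinct_args is a hypothetical threshold chosen for this toy example.
    """
    args_by_relation = defaultdict(set)
    for arg1, relation, arg2 in extractions:
        args_by_relation[relation].add((arg1, arg2))
    return [
        (arg1, relation, arg2)
        for arg1, relation, arg2 in extractions
        if len(args_by_relation[relation]) >= min_distinct_args
    ]

# Toy triples, invented for illustration.
triples = [
    ("Neubronner", "invented", "pigeon photography"),
    ("Edison", "invented", "the phonograph"),
    ("The deal", "is offering only modest gains to", "investors"),
]
print(filter_by_lexical_constraint(triples))
```

Here the overly specific phrase "is offering only modest gains to" appears with a single argument pair and is dropped, while "invented" survives.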
Accuracy of IE
• Incoherent extractions make up 15-30% of extracted knowledge bits
• Uninformative extractions 3-7%
Tom Mitchell (NELL)
• Unsupervised learning machine
Categories on Wikipedia (Dan Weld)
How Kylin Works
Word senses on Wikipedia
Named entities on Wikipedia?
[[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
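The double-bracket links in wikitext like the line above can serve as rough named-entity candidates. Here is a minimal sketch that pulls out link targets and their display text. `WIKILINK` and `extract_links` are names made up for this example, and the regex handles only plain `[[target]]` and piped `[[target|label]]` links, not templates or nested markup.

```python
import re

# [[target]] or [[target|label]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def extract_links(wikitext):
    """Return (target, label) pairs; the label defaults to the target."""
    links = []
    for m in WIKILINK.finditer(wikitext):
        target = m.group(1).strip()
        label = m.group(2) if m.group(2) is not None else target
        links.append((target, label))
    return links

text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
        "invented in 1907 by the German apothecary [[Julius Neubronner]]...")
for target, label in extract_links(text):
    print(target)
```

Whether a linked title is a named entity (Julius Neubronner) or a common concept (aerial photography) still has to be decided separately, for example by capitalization or category information.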
Downloading Wikipedia and other Wikimedia projects
• A 2200-article sample is available on the class web site
Lab
• Find an information pattern besides the ones we’ve listed
• Run it over the Wikipedia front page corpus
• Does it need a tagger? A named entity extractor?
Assignment
• Choose and refine an information extractor
• Hand-tag some examples
• Add a classifier for good vs. bad matches
• You are allowed to work in groups
• Sharing code is fine, but one writeup per person