Information Extraction MAS.S60 Catherine Havasi Rob Speer

Information Extraction

Feb 25, 2016


Transcript
Page 1: Information Extraction

Information Extraction

MAS.S60

Catherine Havasi

Rob Speer

Page 2: Information Extraction

Wikipedia as a corpus
• 3.9 million English articles, 284 languages
• 2 billion words
  – Brown has 1 million
• DBpedia and Freebase

Page 3: Information Extraction

Text reveals relations
• “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.”
• “These were performed in town halls and other large buildings...”
• “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
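All three sentences contain "X and other Y", which suggests that each X is a kind of Y. A minimal sketch of pulling such pairs out with a regular expression follows. It is not the course's code: the function names, the stop-word list, and the guess that the category ends at the first word ending in "s" are all assumptions standing in for a real noun-phrase chunker.

```python
import re

# Assumed stop words: a member phrase is taken to start after the last of these.
STOP = {"a", "an", "the", "of", "in", "on", "at", "by", "for", "to", "with", "and", "or"}

# "and other", up to two modifiers, then a word ending in "s"
# (a crude guess at a plural head noun).
AND_OTHER = re.compile(r",?\s+and other ((?:\w+ ){0,2}?\w+s)\b")

def trailing_phrase(text):
    """Keep the words after the last stop word (stand-in for NP chunking)."""
    phrase = []
    for word in reversed(text.split()):
        if word.lower() in STOP:
            break
        phrase.append(word)
    return " ".join(reversed(phrase))

def and_other_pairs(sentence):
    """Return (member, category) pairs suggested by 'X and other Y'."""
    pairs = []
    for match in AND_OTHER.finditer(sentence):
        category = match.group(1)
        left_context = sentence[:match.start()]
        # Each comma-separated item before the pattern is a candidate member.
        for item in left_context.split(","):
            member = trailing_phrase(item)
            if member:
                pairs.append((member, category))
    return pairs

if __name__ == "__main__":
    print(and_other_pairs(
        "These were performed in town halls and other large buildings..."
    ))
```

The heuristics are tuned to these three sentences and will misfire on many others, which is where a tagger or chunker would come in.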

Page 4: Information Extraction

NACLO puzzle

Would it be plausible to describe something as “danty but sloshful”?

Page 5: Information Extraction

Possible patterns
• both X and Y
• X but not Y
• use NP to VP
• [Un]fortunately, VP
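The four patterns above can be roughed out as regular expressions. This is only a sketch under strong assumptions: single words stand in for X, Y, NP, and VP, the rest of the sentence stands in for the VP after "[Un]fortunately,", and the names `PATTERNS` and `find_patterns` are made up for illustration.

```python
import re

# One regex per pattern from the slide; single words approximate X, Y, NP, VP.
PATTERNS = {
    "both X and Y": re.compile(r"\bboth (\w+) and (\w+)", re.IGNORECASE),
    "X but not Y": re.compile(r"\b(\w+) but not (\w+)", re.IGNORECASE),
    "use NP to VP": re.compile(r"\buse (?:a |an |the )?(\w+) to (\w+)", re.IGNORECASE),
    "[Un]fortunately, VP": re.compile(r"\b((?:un)?fortunately), ([^.!?]+)", re.IGNORECASE),
}

def find_patterns(text):
    """Return (pattern name, captured groups) for every match in the text."""
    hits = []
    for name, regex in PATTERNS.items():
        for match in regex.finditer(text):
            hits.append((name, match.groups()))
    return hits

if __name__ == "__main__":
    sample = ("The soup was both cheap and tasty. It was filling but not heavy. "
              "You can use a spoon to eat it. Unfortunately, the kitchen closes early.")
    for hit in find_patterns(sample):
        print(hit)
```

Replacing the `\w+` placeholders with tagged or chunked phrases is the natural next step once a part-of-speech tagger is available.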

Page 6: Information Extraction

Constraints using named entities

Page 7: Information Extraction

Constraints using named entities and parts of speech

Page 8: Information Extraction

TextRunner
• Starts out with some seed patterns
• Label: Uses those to label possible extractions in a sentence
• Learn: Using a graphical model
• Extract: Using the learned pattern, extract from the sentence
• Problem: 200,000 – 300,000 labeled training points needed

Page 9: Information Extraction

ReVerb
• Syntactic Constraint
  – Requires extraction to match syntactic patterns
• Lexical Constraint
  – Phrases must have many different arguments in the corpus
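The two constraints can be illustrated on hand-tagged input. This is a simplified sketch, not ReVerb's implementation: the one-letter tag set, the pattern "a verb, optionally followed by other words and then a preposition", the threshold `min_pairs`, and both function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# One letter per coarse POS tag (an assumed tag set for this sketch):
# V verb, N noun, J adjective, R adverb, O pronoun, D determiner,
# P preposition, particle, or infinitive marker.
# Syntactic constraint: a verb, optionally followed by words ending in a P.
RELATION_TAGS = re.compile(r"V(?:[NJROD]*P)?")

def relation_phrases(tagged):
    """Return word spans whose tag sequence matches the syntactic pattern."""
    tags = "".join(tag for _, tag in tagged)
    words = [word for word, _ in tagged]
    return [" ".join(words[m.start():m.end()])
            for m in RELATION_TAGS.finditer(tags)]

def passes_lexical_constraint(extractions, min_pairs=2):
    """Keep relation phrases seen with at least min_pairs distinct argument pairs."""
    pairs_by_relation = defaultdict(set)
    for arg1, relation, arg2 in extractions:
        pairs_by_relation[relation].add((arg1, arg2))
    return {relation for relation, pairs in pairs_by_relation.items()
            if len(pairs) >= min_pairs}

if __name__ == "__main__":
    sentence = [("Faust", "N"), ("made", "V"), ("a", "D"), ("deal", "N"),
                ("with", "P"), ("the", "D"), ("devil", "N")]
    print(relation_phrases(sentence))
```

The lexical check is what filters out overly specific phrases: a phrase that only ever appears with one pair of arguments is unlikely to be a reusable relation. A real corpus would call for a much higher threshold than the toy default used here.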

Page 10: Information Extraction

Accuracy of IE
• Incoherent extractions make up 15-30% of extracted knowledge bits
• Uninformative extractions 3-7%

Page 11: Information Extraction

Tom Mitchell (NELL)
• Unsupervised learning machine

Page 12: Information Extraction

Categories on Wikipedia (Dan Weld)

Page 13: Information Extraction

How Kylin Works

Page 14: Information Extraction

Word senses on Wikipedia

Page 15: Information Extraction

Named entities on Wikipedia?

[[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
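One way to explore the question is to pull the `[[...]]` link targets out of the wikitext and treat them as candidate entities. The sketch below assumes plain `[[target]]` and piped `[[target|text]]` links only; the function name `wiki_links` is made up, and nested markup such as templates or image links is not handled.

```python
import re

# [[target]] or [[target|displayed text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def wiki_links(wikitext):
    """Return (target, displayed text) for each link in the wikitext."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # An unpiped link displays its own target.
        label = (match.group(2) or target).strip()
        links.append((target, label))
    return links

if __name__ == "__main__":
    text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
            "invented in 1907 by the German apothecary [[Julius Neubronner]]...")
    for target, label in wiki_links(text):
        print(target, "|", label)
```

Not every link is a named entity: here "Julius Neubronner" is one, while "aerial photography" is a general concept, so the links are candidates that still need filtering.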

Page 16: Information Extraction

Downloading Wikipedia and other Wikimedia projects

• A 2200-article sample is available on the class web site

Page 17: Information Extraction

Lab
• Find an information pattern besides the ones we’ve listed
• Run it over the Wikipedia front page corpus
• Does it need a tagger? A named entity extractor?

Page 18: Information Extraction

Assignment
• Choose and refine an information extractor
• Hand-tag some examples
• Add a classifier for good vs. bad matches
• You are allowed to work in groups
• Sharing code is fine, but one writeup per person