Information Extraction

MAS.S60

Catherine Havasi

Rob Speer

Wikipedia as a corpus
• 3.9 million English articles, 284 languages
• 2 billion words
– Brown has 1 million
• DBpedia and Freebase

Text reveals relations
• “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.”

• “These were performed in town halls and other large buildings...”

• “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
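All three sentences share the shape "X and other Y", which suggests that X is a kind of Y. A minimal sketch of that pattern as a regular expression follows. The function name `extract_and_other` is made up for illustration, and the code runs on plain text with no tagger, so it guesses phrase boundaries crudely: an item is at most two words, and the category is assumed to end in a plural "-s" word.

```python
import re

# "<item>, <item>, ... and other <category>"
# Assumptions: items are one or two words; the category is an optional
# modifier plus a word ending in "s" (a rough stand-in for a plural noun).
AND_OTHER = re.compile(
    r"(?P<items>(?:\w+, )*\w+(?: \w+)?),? and other "
    r"(?P<category>(?:\w+ )??\w+s)\b"
)

def extract_and_other(sentence):
    """Return (item, category) pairs for each 'X and other Y' match."""
    pairs = []
    for match in AND_OTHER.finditer(sentence):
        category = match.group("category")
        for item in match.group("items").split(", "):
            pairs.append((item, category))
    return pairs

print(extract_and_other(
    "These were performed in town halls and other large buildings..."
))
```

Without part-of-speech information the two-word limit will pick up noise (for example a preposition in front of a one-word item), which is one reason to ask whether a pattern needs a tagger.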

NACLO puzzle

Would it be plausible to describe something as “danty but sloshful”?

Possible patterns
• both X and Y
• X but not Y
• use NP to VP
• [Un]fortunately, VP
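The two patterns that involve only single words, "both X and Y" and "X but not Y", can be sketched with plain regular expressions. The names `BOTH_AND`, `BUT_NOT`, and `find_pairs` and the test sentences are invented for this example. The patterns with NP and VP slots would need a tagger or chunker and are not attempted here.

```python
import re

# X and Y are taken to be single words; this is an assumption of the sketch.
BOTH_AND = re.compile(r"\bboth (\w+) and (\w+)")
BUT_NOT = re.compile(r"\b(\w+) but not (\w+)")

def find_pairs(text):
    """Return (pattern_name, X, Y) triples found in the text."""
    found = []
    for x, y in BOTH_AND.findall(text):
        found.append(("both_and", x, y))
    for x, y in BUT_NOT.findall(text):
        found.append(("but_not", x, y))
    return found

# Made-up sentences, not drawn from any corpus.
print(find_pairs("She is both clever and kind."))
print(find_pairs("The soup was hot but not spicy."))
```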

Constraints using named entities

Constraints using named entities and parts of speech

TextRunner
• Starts out with some seed patterns
• Label: Uses those to label possible extractions in a sentence
• Learn: Using a graphical model
• Extract: Using the learned pattern, extract the sentence
• Problem: 200,000 – 300,000 labeled training points needed

ReVerb
• Syntactic Constraint
– Requires extraction to match syntactic patterns
• Lexical Constraint
– Phrases must have many different arguments in the corpus
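A rough sketch of the two constraints, not ReVerb's actual implementation. It assumes the sentence is already tagged with Penn Treebank tags, collapses each tag to one letter, and matches a simplified relation shape (verbs, optionally followed by other words and a preposition). The helper names and the `min_pairs` threshold are hypothetical.

```python
import re
from collections import defaultdict

def tag_class(tag):
    """Collapse a Penn Treebank tag to V (verb), P (prep/particle/'to'),
    W (noun, adjective, adverb, pronoun, determiner), or O (other)."""
    if tag.startswith("VB"):
        return "V"
    if tag in ("IN", "RP", "TO"):
        return "P"
    if tag.startswith(("NN", "JJ", "RB", "PRP")) or tag == "DT":
        return "W"
    return "O"

# Simplified syntactic constraint: one or more verbs, optionally followed
# by any number of W words and a closing P word.
RELATION = re.compile(r"V+(?:W*P)?")

def relation_phrases(tagged):
    """tagged: list of (word, tag). Return candidate relation phrases."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    phrases = []
    for m in RELATION.finditer(classes):
        words = [word for word, _ in tagged[m.start():m.end()]]
        phrases.append(" ".join(words))
    return phrases

def passes_lexical_constraint(extractions, min_pairs):
    """Keep relation phrases seen with at least min_pairs distinct
    (arg1, arg2) pairs. extractions: list of (arg1, relation, arg2)."""
    pairs = defaultdict(set)
    for arg1, rel, arg2 in extractions:
        pairs[rel].add((arg1, arg2))
    return {rel for rel, seen in pairs.items() if len(seen) >= min_pairs}

sentence = [("Faust", "NNP"), ("made", "VBD"), ("a", "DT"), ("deal", "NN"),
            ("with", "IN"), ("the", "DT"), ("devil", "NN")]
print(relation_phrases(sentence))
```

The lexical constraint is what filters out overly specific phrases: a relation phrase that only ever appears with one pair of arguments is unlikely to be a real relation.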

Accuracy of IE
• Incoherent extractions make up 15-30% of extracted knowledge bits
• Uninformative extractions 3-7%

Tom Mitchell (NELL)
• Unsupervised learning machine

Categories on Wikipedia (Dan Weld)

How Kylin Works

Word senses on Wikipedia

Named entities on Wikipedia?

[[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
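The double-bracket links in the markup can be pulled out with a regular expression, giving a list of candidate entities. A minimal sketch, assuming only the two basic link forms `[[Target]]` and `[[Target|label]]`; the function name `wiki_links` is made up here, and deciding which link targets are actually named entities is left as a separate step.

```python
import re

# [[Target]] or [[Target|label]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def wiki_links(markup):
    """Return (target, displayed_text) for each wiki link in the markup."""
    links = []
    for target, label in WIKILINK.findall(markup):
        # With no "|label" part, the link displays its target.
        links.append((target, label or target))
    return links

text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
        "invented in 1907 by the German apothecary [[Julius Neubronner]]...")
for target, shown in wiki_links(text):
    print(target)
```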

Downloading Wikipedia and other Wikimedia projects

• A 2200-article sample is available on the class web site

Lab
• Find an information pattern besides the ones we’ve listed
• Run it over the Wikipedia front page corpus
• Does it need a tagger? A named entity extractor?

Assignment
• Choose and refine an information extractor
• Hand-tag some examples
• Add a classifier for good vs. bad matches
• You are allowed to work in groups
• Sharing code is fine, but one writeup per person
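One possible shape for the good vs. bad classifier is a perceptron trained on the hand-tagged examples. This is only a sketch of one option, not a required approach. The feature names (`bias`, `category_is_plural`, `item_is_pronoun`) are hypothetical, and any other classifier would fit the same interface.

```python
from collections import defaultdict

def train_perceptron(examples, epochs=10):
    """examples: list of (features, is_good), where features is a list of
    feature names and is_good is True for a hand-tagged good match."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, is_good in examples:
            score = sum(weights[f] for f in features)
            predicted = score > 0
            if predicted != is_good:
                # Move the weights toward the correct label.
                step = 1.0 if is_good else -1.0
                for f in features:
                    weights[f] += step
    return weights

def is_good_match(weights, features):
    """Classify a match from its feature names."""
    return sum(weights.get(f, 0.0) for f in features) > 0

# Two hand-tagged examples with made-up feature names.
tagged = [
    (["bias", "category_is_plural"], True),
    (["bias", "item_is_pronoun"], False),
]
weights = train_perceptron(tagged)
print(is_good_match(weights, ["bias", "category_is_plural"]))
```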
