Information Extraction Rayid Ghani IR Seminar - 11/28/00

Dec 30, 2015

Page 1: Information Extraction

Information Extraction

Rayid Ghani

IR Seminar - 11/28/00

Page 2: Information Extraction

What is IE?
- Analyze unrestricted text in order to extract specific types of information
- Attempt to convert unstructured text documents into database entries
- Operate at many levels of the language

Page 3: Information Extraction

Task: Extract Speaker, Title, Location, Time, Date from Seminar Announcement

Dr. Gibbons is spending his sabbatical from Bell Labs with us. His work bridges databases, datamining and theory, with several patents and applications to commercial DBMSs.

Christos

Date: Monday, March 20, 2000
Time: 3:30-5:00 (Refreshments provided)
Place: 4623 Wean Hall

Phil Gibbons
Carnegie Mellon University

The Aqua Approximate Query Answering System

In large data recording and warehousing environments, providing an exact answer to a complex query can take minutes, or even hours, due to the amount of computation and disk I/O required. Moreover, given the current trend towards data analysis over gigabytes, terabytes, and even petabytes of data, these query response times are increasing despite improvements in

Page 4: Information Extraction

Task: Extract question/answer pairs from FAQ

X-NNTP-Poster: NewsHound v1.33
Archive-name: acorn/faq/part2
Frequency: monthly

2.6) What configuration of serial cable should I use?

Here follows a diagram of the necessary connections for common terminal programs to work properly. They are as far as I know the informal standard agreed upon by commercial comms software developers for the Arc.

Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This is to avoid the well known serial port chip bugs. The modem’s DCD (Data Carrier Detect) signal has been re-routed to the Arc’s RI (Ring Indicator) most modems broadcast a software RING signal anyway, and even then it’s really necessary to detect it for the model to answer the call.

2.7) The sound from the speaker port seems quite muffled. How can I get unfiltered sound from an Acorn machine?

All Acorn machine are equipped with a sound filter designed to remove high frequency harmonics from the sound output. To bypass the filter, hook into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip 39) and and hook the capacitor like this:

Page 5: Information Extraction

Task: Extract Title, Author, Institution & Abstract from research paper

www.cora.whizbang.com (previously www.cora.justresearch.com)

Page 6: Information Extraction

Task: Extract Acquired and Acquiring Companies from WSJ Article

Sara Lee to Buy 30% of DIM

Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest in Paris-based DIM S.A., a subsidiary of BIC S.A., at cost of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars.

The investment includes the purchase of 5 million newly issued DIM shares valued at about 5 million dollars, and a loan of about 15 million dollars, it said. The loan is convertible into an additional 16 million DIM shares, it noted.

The proposed agreement is subject to approval by the French government, it said.

Page 7: Information Extraction

Types of IE systems
- Structured texts (such as web pages with tabular information)
- Semi-structured texts (such as online personals)
- Free text (such as news articles)

Page 8: Information Extraction

Problems with Manual IE
- Cannot adapt to domain changes
- Lots of human effort needed
  - 1500 human hours (Riloff 95)

Solution: Use Machine Learning

Page 9: Information Extraction

Why is IE difficult?

There are many ways of expressing the same fact:
- BNC Holdings Inc named Ms G Torretta as its new chairman.
- Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
- Ms. Gina Torretta took the helm at BNC Holdings Inc.
- After a long boardroom struggle, Mr Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms Torretta.

Page 10: Information Extraction

Named Entity Extraction
- Can be either a two-step or a single-step process
  - Extraction => Classification
  - Extraction-Classification
- Classification (Collins & Singer 99)

Page 11: Information Extraction

Information Extraction with HMMs

[Seymore & McCallum ‘99] [Freitag & McCallum ‘99]

Page 12: Information Extraction

- Parameters = P(s|s’), P(o|s) for all states in S={s1,s2,…}
- Emissions = word
- Training = Maximize probability of training observations (+ prior).
- For IE, states indicate “database field”.
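As a minimal sketch of how such an HMM labels text, the following Viterbi decoder finds the most likely sequence of "database field" states for a sequence of words. The state names, the probability tables, and the `unk` fallback for unseen words are made-up illustrations, not values from the talk.

```python
import math

def viterbi_hmm(observations, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most likely state sequence for `observations`.

    trans_p[prev][cur] plays the role of P(s|s'), emit_p[s][word] of P(o|s).
    `unk` is an assumed small probability for words a state never emitted.
    """
    # Work in log space to avoid underflow on long sequences.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(observations[0], unk))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the best predecessor state for s at time t.
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s].get(observations[t], unk))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy, made-up parameters: two fields of a seminar announcement.
states = ["speaker", "location"]
start_p = {"speaker": 0.6, "location": 0.4}
trans_p = {"speaker": {"speaker": 0.7, "location": 0.3},
           "location": {"speaker": 0.2, "location": 0.8}}
emit_p = {"speaker": {"phil": 0.5, "gibbons": 0.5},
          "location": {"wean": 0.5, "hall": 0.5}}

print(viterbi_hmm(["phil", "gibbons", "wean", "hall"], states, start_p, trans_p, emit_p))
```

Each word is assigned the field whose state generated it on the best path, which is how the state sequence doubles as an extraction.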

Page 13: Information Extraction

Regrets with HMMs

1. Would prefer richer representation of text: multiple overlapping features, whole chunks of text.

Example line features:
- length of line
- line is centered
- percent of non-alphabetics
- total amount of white space
- line contains two verbs
- line begins with a number
- line is grammatically a question

Example word features:
- identity of word
- word is in all caps
- word ends in “-tion”
- word is part of a noun phrase
- word is in bold font
- word is on left hand side of page
- word is under node X in WordNet

2. HMMs are generative models of the text: P({s…},{o…}). Generative models do not easily handle overlapping, non-independent features. Would prefer a conditional model: P({s…}|{o…}).

Page 14: Information Extraction

Solution: New probabilistic sequence model

- Traditional HMM: P(s|s’) and P(o|s)
- Maximum Entropy Markov Model: P(s|o,s’) (represented by exponential model fit by maximum entropy)

(For the time being, capture dependency on s’ with |S| independent functions, Ps’(s|o).)

Page 15: Information Extraction

- Old graphical model: st-1 → st and st → ot, with P(s|s’) and P(o|s)
- New graphical model: st-1 → st and ot → st, with P(s|o,s’)

Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.
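To illustrate how Viterbi carries over to the conditional model, here is a small sketch in which the HMM's product P(s|s’)P(o|s) is replaced by a single lookup of P(s|o,s’). The function `toy_next_state_p`, its probabilities, and the `"head"` start state are hypothetical stand-ins for a trained model.

```python
def viterbi_memm(observations, states, start_state, next_state_p):
    """Most likely state sequence under a conditional model P(s | o, s').

    next_state_p(prev_state, observation) returns a dict {state: probability}.
    Plain probabilities are multiplied here for readability; a real
    implementation would use log probabilities for long sequences.
    """
    # delta[s] = probability of the best path that ends in state s so far.
    delta = dict(next_state_p(start_state, observations[0]))
    back = []
    for o in observations[1:]:
        new_delta, pointers = {}, {}
        for p in states:
            dist = next_state_p(p, o)
            for s in states:
                score = delta[p] * dist[s]
                if s not in new_delta or score > new_delta[s]:
                    new_delta[s], pointers[s] = score, p
        delta = new_delta
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def toy_next_state_p(prev_state, line):
    # Hypothetical hand-set distributions, not learned from data.
    if line.rstrip().endswith("?"):
        return {"question": 0.9, "answer": 0.1}
    if prev_state == "question":
        return {"question": 0.05, "answer": 0.95}
    return {"question": 0.2, "answer": 0.8}

lines = ["2.6) What configuration of serial cable should I use?",
         "Here follows a diagram of the necessary connections",
         "Pins 1, 4, and 8 must be connected together"]
print(viterbi_memm(lines, ["question", "answer"], "head", toy_next_state_p))
```

Because the observation sits on the conditioning side, `next_state_p` is free to inspect any property of the whole line rather than a single emitted token.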

Page 16: Information Extraction

State Transition Probabilities based on Overlapping Features

Model Ps’(s|o) in terms of multiple arbitrary overlapping (binary) features.

Example observation feature tests:
 - o is the word “apple”
 - o is capitalized
 - o is on a left-justified line

Actual feature, f, depends on both a binary observation feature test, b, and a destination state, s.

$$
f_{\langle b,s \rangle}(o_t, s_t) =
\begin{cases}
1 & \text{if } b(o_t) \text{ is true and } s = s_t \\
0 & \text{otherwise}
\end{cases}
$$
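The pairing of an observation test b with a destination state s can be sketched as a small factory function. The names `make_feature` and `is_capitalized`, and the state labels, are illustrative choices rather than anything defined in the talk.

```python
def make_feature(b, s):
    """Build the feature for test b and destination state s.

    The returned function gives 1 if b(o_t) is true and s_t equals s, else 0.
    """
    def f(o_t, s_t):
        return 1 if b(o_t) and s_t == s else 0
    return f

def is_capitalized(o):
    # One of the example observation tests: "o is capitalized".
    return o[:1].isupper()

f = make_feature(is_capitalized, "speaker")
print(f("Gibbons", "speaker"), f("Gibbons", "location"), f("gibbons", "speaker"))
# prints: 1 0 0
```

With B observation tests and |S| states this yields up to B x |S| features, each of which gets its own weight in the exponential model.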

Page 17: Information Extraction

Maximum Entropy Constraints

Maximum entropy is based on the principle that the best model for the data is the one that is consistent with certain constraints derived from the training data, but otherwise makes the fewest possible assumptions.

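Constraints:

$$
\frac{1}{m_{s'}} \sum_{k=1}^{m_{s'}} f_{\langle b,s \rangle}(o_{t_k}, s_{t_k})
=
\frac{1}{m_{s'}} \sum_{k=1}^{m_{s'}} \sum_{s \in S} P_{s'}(s \mid o_{t_k}) \, f_{\langle b,s \rangle}(o_{t_k}, s)
$$

where $t_1, \ldots, t_{m_{s'}}$ are the time steps with previous state $s'$ (that is, $s_{t_k - 1} = s'$).

The left-hand side is the data average; the right-hand side is the model expectation.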
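As a rough illustration of what one constraint says, the sketch below computes both sides for a single feature and a single previous state s’: the average of the feature over the training pairs, and its expectation under a candidate model. The training pairs, the feature, and the distribution `toy_p_next` are invented for the example.

```python
def data_average(feature, pairs):
    """Data average: mean of f over the observed (observation, next state) pairs."""
    return sum(feature(o, s) for o, s in pairs) / len(pairs)

def model_expectation(feature, pairs, states, p_next):
    """Model expectation: mean over the same observations of sum_s P(s|o) f(o, s)."""
    total = 0.0
    for o, _ in pairs:
        dist = p_next(o)
        total += sum(dist[s] * feature(o, s) for s in states)
    return total / len(pairs)

states = ["question", "answer"]

def f(o, s):
    # Feature: the line ends with "?" and the destination state is "question".
    return 1 if o.endswith("?") and s == "question" else 0

# Made-up (observation, next state) pairs that all follow the same previous state.
pairs = [
    ("How can I get unfiltered sound?", "question"),
    ("Is that normal?", "answer"),
    ("Pins 1, 4, and 8 must be connected", "answer"),
    ("To bypass the filter, hook into the port", "answer"),
]

def toy_p_next(o):
    # Hypothetical candidate model for this previous state.
    if o.endswith("?"):
        return {"question": 0.5, "answer": 0.5}
    return {"question": 0.1, "answer": 0.9}

lhs = data_average(f, pairs)
rhs = model_expectation(f, pairs, states, toy_p_next)
print(lhs, rhs)
```

Here the two values agree, so this toy model satisfies the constraint for this one feature; training looks for weights that make every feature's two sides match.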

Page 18: Information Extraction

Maximum Entropy while Satisfying Constraints

When constraints are imposed in this way, the constraint-satisfying probability distribution that has maximum entropy is guaranteed to be:

(1) unique
(2) the same as the maximum likelihood solution for this model
(3) in exponential form:

$$
P_{s'}(s \mid o) = \frac{1}{Z(o, s')} \exp\left( \sum_{\langle b,s \rangle} \lambda_{\langle b,s \rangle} \, f_{\langle b,s \rangle}(o, s) \right)
$$

[Della Pietra, Della Pietra, Lafferty, ‘97]

Learn parameters by iterative procedure: Generalized Iterative Scaling (GIS)
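A compact sketch of GIS for one previous state s’ is shown below. It assumes at least one feature fires somewhere in the data, adds the usual slack feature so that feature values sum to a constant C, and, as a simplification, skips features whose data average is zero. The training pairs and feature tests are made up for illustration.

```python
import math

def p_next(weights, features, states, o):
    """Exponential-form distribution over next states for observation o."""
    scores = {s: math.exp(sum(w * f(o, s) for w, f in zip(weights, features)))
              for s in states}
    z = sum(scores.values())  # normalizer Z(o, s')
    return {s: v / z for s, v in scores.items()}

def gis(pairs, features, states, iterations=50):
    """Fit weights from the (observation, next state) pairs of one previous state."""
    # GIS needs feature values to sum to a constant C for every (o, s);
    # the slack feature makes up the difference.
    c = max(sum(f(o, s) for f in features) for o, _ in pairs for s in states)
    slack = lambda o, s: c - sum(f(o, s) for f in features)
    feats = list(features) + [slack]
    n = len(pairs)
    data = [sum(f(o, s) for o, s in pairs) / n for f in feats]
    weights = [0.0] * len(feats)
    for _ in range(iterations):
        # All expectations in one iteration use the same (old) weights.
        dists = [p_next(weights, feats, states, o) for o, _ in pairs]
        for i, f in enumerate(feats):
            model = sum(d[s] * f(o, s)
                        for (o, _), d in zip(pairs, dists) for s in states) / n
            if data[i] > 0 and model > 0:  # simplification: skip unseen features
                weights[i] += math.log(data[i] / model) / c
    return weights, feats

def ends_q(o):
    return o.rstrip().endswith("?")

states = ["question", "answer"]
features = [lambda o, s: int(ends_q(o) and s == "question"),
            lambda o, s: int(not ends_q(o) and s == "answer")]
pairs = [("What cable should I use?", "question"),
         ("How can I get unfiltered sound?", "question"),
         ("Is that normal?", "answer"),
         ("Read this part first", "question"),
         ("Pins 1, 4, and 8 must be connected", "answer"),
         ("Here follows a diagram", "answer")]

weights, feats = gis(pairs, features, states)
dist = p_next(weights, feats, states, "Is it?")
print(round(dist["question"], 3))
```

In this toy data two of the three lines ending in "?" are labeled question, so the fitted model's P(question | line ends with "?") should settle near 2/3, which is what matching the data average to the model expectation requires.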

Page 19: Information Extraction

Experimental Data

38 files belonging to 7 UseNet FAQs

Example:

<head> X-NNTP-Poster: NewsHound v1.33
<head> Archive-name: acorn/faq/part2
<head> Frequency: monthly
<head>
<question> 2.6) What configuration of serial cable should I use?
<answer>
<answer> Here follows a diagram of the necessary connection
<answer> programs to work properly. They are as far as I know
<answer> agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The

Procedure: For each FAQ, train on one file and test on the others; average the results.
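Labeled lines like those in the example can be read into (label, text) pairs with a few lines of code. This is only a sketch: the assumption that every line starts with a single `<label>` tag followed by an optional space is inferred from the example, and `parse_labeled_line` is a hypothetical helper name.

```python
import re

# Assumed format: "<label>" at the start of the line, then optional text.
LABELED_LINE = re.compile(r"<(\w+)>\s?(.*)")

def parse_labeled_line(raw):
    """Split one labeled line into (label, text); text may be empty."""
    m = LABELED_LINE.match(raw)
    if m is None:
        raise ValueError(f"line has no leading label: {raw!r}")
    return m.group(1), m.group(2)

sample = [
    "<head> Archive-name: acorn/faq/part2",
    "<head>",
    "<question> 2.6) What configuration of serial cable should I use?",
    "<answer>",
]
for raw in sample:
    print(parse_labeled_line(raw))
```

The label sequence is what the sequence models are trained to predict, and the text is what the line features are computed from.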

Page 20: Information Extraction

Features in Experiments
- begins-with-number
- begins-with-ordinal
- begins-with-punctuation
- begins-with-question-word
- begins-with-subject
- blank
- contains-alphanum
- contains-bracketed-number
- contains-http
- contains-non-space
- contains-number
- contains-pipe
- contains-question-mark
- contains-question-word
- ends-with-question-mark
- first-alpha-is-capitalized
- indented
- indented-1-to-4
- indented-5-to-10
- more-than-one-third-space
- only-punctuation
- prev-is-blank
- prev-begins-with-ordinal
- shorter-than-30
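A few of these line features can be sketched as simple string tests. The talk gives only the feature names, so the definitions below (for example, counting leading spaces for the indentation features, or treating the first line of a file as having a blank predecessor) are guesses made for illustration.

```python
import re

def line_features(line, prev_line=""):
    """Return the names of the features (a subset of the list) that fire on `line`.

    The feature definitions are assumptions based on the feature names.
    """
    stripped = line.strip()
    indent = len(line) - len(line.lstrip(" "))  # number of leading spaces
    tests = {
        "begins-with-number": bool(re.match(r"\d", stripped)),
        "blank": stripped == "",
        "contains-http": "http" in line,
        "contains-question-mark": "?" in line,
        "ends-with-question-mark": stripped.endswith("?"),
        "indented": indent > 0,
        "indented-1-to-4": 1 <= indent <= 4,
        "indented-5-to-10": 5 <= indent <= 10,
        "prev-is-blank": prev_line.strip() == "",
        "shorter-than-30": len(line) < 30,
    }
    return {name for name, fired in tests.items() if fired}

print(sorted(line_features("2.6) What configuration of serial cable should I use?")))
print(sorted(line_features("    Pins 1, 4", prev_line="x")))
```

Several features can fire on the same line and they are clearly not independent (`indented-1-to-4` implies `indented`), which is the kind of overlap the conditional model is meant to tolerate.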

Page 21: Information Extraction

Models Tested
- ME-Stateless: A single maximum entropy classifier applied to each line independently.
- TokenHMM: A fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters).
- FeatureHMM: Identical to TokenHMM, except that the lines in a document are first converted to sequences of features.
- MEMM: The maximum entropy Markov model described in this talk.

Page 22: Information Extraction

Results

Learner        Segmentation precision   Segmentation recall
ME-Stateless   0.038                    0.362
TokenHMM       0.276                    0.140
FeatureHMM     0.413                    0.529
MEMM           0.867                    0.681

Page 23: Information Extraction

Conclusions
- Presented a new probabilistic sequence model based on maximum entropy.
  - Handles arbitrary overlapping features
  - Conditional model
- Shown positive experimental results on FAQ segmentation.
- Shown variations for factored state, reduced complexity model, and reinforcement learning.