Information Extraction PengBo Dec 2, 2010
Page 1: Information Extraction PengBo Dec 2, 2010. Topics of today IE: Information Extraction Techniques Wrapper Induction Sliding Windows From FST to HMM.

Information Extraction

PengBo, Dec 2, 2010

Page 2:

Topics of today

IE: Information Extraction Techniques
Wrapper Induction
Sliding Windows
From FST to HMM

Page 3:

What is IE?

Page 4:

Example: The Problem

Martin Baker, a person

Genomics job

Employer's job posting form

Page 5:

Example: A Solution

Page 6:

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Page 7:

Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.

Page 8:

Data Mining the Extracted Job Information

Page 9:

Two ways to manage information

[slide figure: on the left, documents of free text (blocks of "Xxx xxxx ..."), managed by retrieval: Query -> Answer; on the right, structured facts such as advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,"William"), fn(vc,"Vitor"), managed by inference, e.g. X: advisor(wc,Y) & affil(X,lti) ? {X=em; X=vc}; IE is the bridge between the two]

Pages 10-14:

What is Information Extraction?

Recovering structured data from formatted text
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication

Today: focus mostly on field identification, and a little on record association.

Page 15:

Applications

Page 16:

IE from Research Papers

Page 17:

IE from Chinese Documents regarding Weather

Chinese Academy of Sciences

200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries


Wrapper Induction

Page 21:

“Wrappers”

If we think of things from the database point of view:
We want to be able to issue database-style queries,
but we have data in some horrid textual form / content management system that doesn't allow such querying.

We need to "wrap" the data in a component that understands database-style querying.

Hence the term "wrappers".

Page 23:

Wrappers: Simple Extraction Patterns

Specify an item to extract for a slot using a regular expression pattern.
  Price pattern: "\b\$\d+(\.\d{2})?\b"

May require a preceding (pre-filler) pattern and a succeeding (post-filler) pattern to identify the end of the filler.
  Amazon list price:
  Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
  Filler pattern: "\b\$\d+(\.\d{2})?\b"
  Post-filler pattern: "</span>"
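Run with any regex engine, such a rule is a pre-filler/filler/post-filler sandwich. Below is a minimal Python sketch; the name extract_price is illustrative, and the slide's leading \b is dropped because \b matches only at a word/non-word boundary, while the $ here is preceded by a space or >, both non-word characters.

```python
import re

# Pre-filler and post-filler anchor the slot; the filler is captured.
PRE_FILLER = r"<b>List Price:</b> <span class=listprice>"
FILLER = r"\$\d+(?:\.\d{2})?"
POST_FILLER = r"</span>"

PRICE_RULE = re.compile(PRE_FILLER + "(" + FILLER + ")" + POST_FILLER)

def extract_price(html: str):
    """Return the price filler if the surrounding patterns match, else None."""
    m = PRICE_RULE.search(html)
    return m.group(1) if m else None

print(extract_price("<b>List Price:</b> <span class=listprice>$24.95</span>"))
# $24.95
```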

Page 24:

Wrapper tool-kits

Specialized programming environments for writing & debugging wrappers by hand.

Some resources: Wrapper Development Tools, LAPIS

Page 25:

Wrapper Induction

Problem description:
  Task: learn extraction rules based on labeled examples.
  Hand-writing rules is tedious, error-prone, and time-consuming.
  Learning wrappers is "wrapper induction".

Page 26:

Induction Learning

Rule induction: formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.

INPUT:
  Labeled examples: training & testing data
  Admissible rules (hypotheses space)
  Search strategy

Desired output:
  A rule that performs well both on training and testing data

Page 27:

Wrapper induction

Highly regular source documents
Relatively simple extraction patterns
Efficient learning algorithm

Build a training set of documents paired with human-produced filled extraction templates.
Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Page 28:

Goal: learn from a human teacher how to extract certain database records from a particular web site.

Page 29:

Learner

User gives the first K positive examples, and thus many implicit negative examples.


Kushmerick’s WIEN system

Earliest wrapper-learning system (published at IJCAI '97).

Special things about WIEN:
  Treats the document as a string of characters
  Learns to extract a relation directly, rather than extracting fields and then associating them together in some way
  Each example is a completely labeled page

Page 33:

WIEN system: a sample wrapper

Page 34:

A wrapper is a vector of delimiter strings: l1, r1, ..., lK, rK

Example: find 4 strings (<B>, </B>, <I>, </I>) = (l1, r1, l2, r2)

labeled pages -> wrapper:

<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

Learning LR wrappers

Page 35:

LR wrapper

Left delimiters L1 = "<B>", L2 = "<I>"; right delimiters R1 = "</B>", R2 = "</I>"
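Executing an LR wrapper is a simple left-to-right scan with these delimiters. The function below is a minimal sketch (lr_extract is an illustrative name, and the page is a shortened version of the country-codes example):

```python
# Minimal LR-wrapper executor: delimiters = [(l1, r1), ..., (lK, rK)].
# Repeatedly extract one K-tuple per pass until a delimiter is missing.
def lr_extract(page, delimiters):
    tuples, pos = [], 0
    while True:
        row = []
        for l, r in delimiters:
            start = page.find(l, pos)
            if start < 0:
                return tuples          # no more tuples on the page
            start += len(l)
            end = page.find(r, start)
            if end < 0:
                return tuples
            row.append(page[start:end])  # text between l and r
            pos = end + len(r)
        tuples.append(tuple(row))

page = ('<HTML><HEAD>Some Country Codes</HEAD>'
        '<B>Congo</B> <I>242</I><BR>'
        '<B>Egypt</B> <I>20</I><BR></BODY></HTML>')
print(lr_extract(page, [('<B>', '</B>'), ('<I>', '</I>')]))
# [('Congo', '242'), ('Egypt', '20')]
```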

Page 36:

LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r1 can be any prefix of the text following each extracted item, e.g. </B>

Page 37:

LR: Finding l1, l2 and r2

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r2 can be any prefix of the text following each item, e.g. </I>
l2 can be any suffix of the text preceding each item, e.g. <I>
l1 can be any suffix of the text preceding each item, e.g. <B>

Page 38:

WIEN system

Assumes items are always in a fixed, known order:
  ... Name: J. Doe; Address: 1 Main; Phone: 111-1111. <p>
  Name: E. Poe; Address: 10 Pico; Phone: 777-1111. <p> ...

Introduces several types of wrappers
  LR

Page 39:

Learning LR extraction rules

Admissible rules: prefixes & suffixes of items of interest
Search strategy: start with the shortest prefix & suffix, and expand until correct
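One minimal way to realize this search, sketched below under simplifying assumptions: take the longest common prefix of the text following each labeled item, then return the shortest piece of it that never occurs inside an item. On the country-codes page this returns "<", which is shorter than the </B> shown earlier but satisfies the same constraint; WIEN's actual candidate generation and validity tests are richer than this sketch.

```python
# Toy search for a right delimiter, shortest-first.
def common_prefix(strings):
    """Longest prefix shared by all strings."""
    p = strings[0]
    for s in strings[1:]:
        while not s.startswith(p):
            p = p[:-1]
    return p

def learn_right_delimiter(afters, items):
    """afters: text following each labeled item; items: the item values."""
    cand = common_prefix(afters)
    for n in range(1, len(cand) + 1):      # shortest prefix first
        d = cand[:n]
        if not any(d in item for item in items):
            return d                        # d never occurs inside an item
    return cand

# Contexts taken from the labeled country-codes page.
afters = ['</B> <I>242', '</B> <I>20', '</B> <I>501']
items = ['Congo', 'Egypt', 'Belize']
print(learn_right_delimiter(afters, items))  # <
```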

Page 40:

Summary of WIEN

Advantages:
  Fast to learn & extract

Drawbacks:
  Cannot handle permutations and missing items
  Must label entire page
  Requires a large number of examples

Page 41:

Sliding Windows

Page 42:

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for the seminar location


Page 46:

A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

... 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun ...
    w_{t-m} ... w_{t-1} | w_t ... w_{t+n} | w_{t+n+1} ... w_{t+n+m}
        (prefix)        |   (contents)    |       (suffix)

If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Estimate Pr(LOCATION | window) using Bayes rule
Try all "reasonable" windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr("Place" in prefix | LOCATION)

Page 47:

A “Naïve Bayes” Sliding Window Model

1. Create a dataset of examples like these:
   + (prefix00, ..., prefixColon, contentWean, contentHall, ..., suffixSpeaker, ...)
   - (prefixColon, ..., prefixWean, contentHall, ..., contentSpeaker, suffixColon, ...)
2. Train a Naive Bayes classifier
3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict the content window is a location.

To think about: what if the extracted entities aren't consistent, e.g. if the location overlaps with the speaker?

[Freitag 1997]

... 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun ...
    w_{t-m} ... w_{t-1} | w_t ... w_{t+n} | w_{t+n+1} ... w_{t+n+m}
        (prefix)        |   (contents)    |       (suffix)
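The recipe above can be sketched end to end. Everything below is illustrative: the probability tables and the length prior are invented stand-ins for quantities a real system would estimate from labeled announcements, and the names log_p and score are mine, not Freitag's.

```python
import math

# Toy Naive Bayes sliding window. Invented log-probability tables stand in
# for counts estimated from labeled seminar announcements.
log_p = {
    ('prefix', 'Place'): math.log(0.2), ('prefix', ':'): math.log(0.3),
    ('content', 'Wean'): math.log(0.4), ('content', 'Hall'): math.log(0.4),
    ('suffix', 'Speaker'): math.log(0.2),
}
LEN_PRIOR = {1: math.log(0.2), 2: math.log(0.6), 3: math.log(0.2)}
DEFAULT = math.log(0.001)        # smoothed probability for unseen words

def score(tokens, start, length, context=2):
    """Log-probability that tokens[start:start+length] is a LOCATION."""
    s = LEN_PRIOR[length]
    for w in tokens[max(0, start - context):start]:            # prefix words
        s += log_p.get(('prefix', w), DEFAULT)
    for w in tokens[start:start + length]:                     # content words
        s += log_p.get(('content', w), DEFAULT)
    for w in tokens[start + length:start + length + context]:  # suffix words
        s += log_p.get(('suffix', w), DEFAULT)
    return s

tokens = 'Place : Wean Hall Speaker : Sebastian Thrun'.split()
windows = [(i, n) for i in range(len(tokens))
           for n in (1, 2, 3) if i + n <= len(tokens)]
i, n = max(windows, key=lambda w: score(tokens, *w))   # best-scoring window
print(tokens[i:i + n])  # ['Wean', 'Hall']
```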

Page 48:

“Naïve Bayes” Sliding Window Results

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Domain: CMU UseNet Seminar Announcements

Field        F1
Person Name: 30%
Location:    61%
Start Time:  98%

Page 49:

Finite State Transducers

Page 50:

Finite State Transducers for IE

Basic method for extracting relevant information.

IE systems generally use a collection of specialized FSTs:
  Company Name detection
  Person Name detection
  Relationship detection

Page 51:

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

Text Analyzer:

Frodo – Proper Name

Baggins – Proper Name

works – Verb

for – Prep

Hobbit – UnknownCap

Factory – NounCap

Inc – CompAbbr

Page 52:

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

Some regular expression for finding company names: "some capitalized words, maybe a comma, then a company abbreviation indicator"

CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr

Page 53:

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

[FSA diagram over states 1-4, with transitions labeled word, (CAP | PN), comma, and CAB; CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string]

Company Name Detection FSA

Page 54:

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

[FST diagram: the same automaton with outputs, transitions labeled word:word, (CAP | PN):CN, comma:CN, and CAB:CN; CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName]

Company Name Detection FST

Page 55:

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

[the same Company Name Detection FST as above]

Non-deterministic!!!
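The pattern CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr can be simulated with a small scan over tagged tokens. The toy lexicon and the function name find_company_names below are assumptions for illustration; a real system would take the text analyzer's tags as input.

```python
# Toy tags following the slide: PN = ProperName, CAP = SomeCap,
# CAB = CompAbbr; everything else is a plain word.
LEXICON = {'Frodo': 'PN', 'Baggins': 'PN', 'works': 'Verb', 'for': 'Prep',
           'Hobbit': 'CAP', 'Factory': 'CAP', ',': 'Comma', 'Inc': 'CAB'}

def find_company_names(tokens):
    """Match (PN | CAP)+ Comma? CAB greedily, left to right."""
    tags = [LEXICON.get(t, 'word') for t in tokens]
    names, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tags[j] in ('PN', 'CAP'):
            j += 1                                   # (PN | CAP)+
        k = j + 1 if j < len(tokens) and tags[j] == 'Comma' else j  # Comma?
        if j > i and k < len(tokens) and tags[k] == 'CAB':          # CAB
            names.append(' '.join(tokens[i:k + 1]))
            i = k + 1
        else:
            i += 1
    return names

tokens = 'Frodo Baggins works for Hobbit Factory , Inc'.split()
print(find_company_names(tokens))
# ['Hobbit Factory , Inc']
```

Note that "Frodo Baggins" is correctly rejected: the capitalized run is not followed by a company abbreviation, which is exactly why the automaton is non-deterministic until it sees (or fails to see) the CAB token.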

Page 56:

Finite State Transducers for IE

Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)

FSTs are often compiled from regular expressions

Probabilistic (weighted) FSTs

Page 57:

Finite State Transducers for IE

FSTs mean different things to different researchers in IE:
  Based on lexical items (words)
  Based on statistical language models
  Based on deep syntactic/semantic analysis

Page 58:

Example: FASTUS

Finite State Automaton Text Understanding System (SRI International)

Cascading FSTs:
  Recognize names
  Recognize noun groups, verb groups, etc.
  Construct complex noun/verb groups
  Identify patterns of interest
  Identify and merge event structures

Page 59:

Hidden Markov Models

Page 60:

Hidden Markov Models formalism

An HMM consists of:
  states s1, s2, ... (special start state s1, special end state sn)
  token alphabet a1, a2, ...
  state transition probabilities P(si | sj)
  token emission probabilities P(ai | sj)

Widely used in many language processing tasks, e.g., speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al., 1998].

HMM = probabilistic FSA

Page 61:

Applying HMMs to IE

The document is generated by a stochastic process modelled by an HMM:
  Token <-> word
  State <-> "reason/explanation" for a given token
    'Background' state emits tokens like 'the', 'said', ...
    'Money' state emits tokens like 'million', 'euro', ...
    'Organization' state emits tokens like 'university', 'company', ...

Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.

Page 62:

HMM for research papers: transitions [Seymore et al., 99]

Page 63:

HMM for research papers: emissions [Seymore et al., 99]

Trained on 2 million words of BibTeX data from the Web.

[figure: per-state sample emissions for states such as author, title, institution, and note, e.g. "ICML 1997 ... submission to ... to appear in ..."; "stochastic optimization ... reinforcement learning ... model building mobile robot ..."; "carnegie mellon university ... university of california ... dartmouth college"; "supported in part ... copyright ..."]

Page 64:

What is an HMM?

Graphical model representation: variables by time; circles indicate states; arrows indicate probabilistic dependencies between states.

Page 65:

What is an HMM?

Green circles are hidden states, dependent only on the previous state (Markov process): "The past is independent of the future given the present."

Page 66:

What is an HMM?

Purple nodes are observed states, dependent only on their corresponding hidden state.

Page 67:

HMM Formalism

{S, K, Π, A, B}
  S : {s1 ... sN} are the values for the hidden states
  K : {k1 ... kM} are the values for the observations

[diagram: a chain of hidden states S, each emitting an observation K]

Page 68:

HMM Formalism

{S, K, Π, A, B}
  Π = {πi} are the initial state probabilities
  A = {aij} are the state transition probabilities
  B = {bik} are the observation (emission) probabilities

[diagram: the same chain of hidden states S and observations K, with A labelling state-to-state arrows and B labelling state-to-observation arrows]

Page 69:

Need to provide the structure of the HMM & the vocabulary.
Training the model: Baum-Welch algorithm.

Efficient dynamic programming algorithms exist for:
  Finding Pr(K)
  The highest-probability path S that maximizes Pr(K, S) (Viterbi)

[figure: an example bibliographic HMM with states Title, Author, Journal, Year; transition probabilities such as 0.9, 0.8, 0.5, 0.5, 0.2, 0.1; and per-state emission probabilities, e.g. one state emits A 0.6, B 0.3, C 0.1; another X 0.4, B 0.2, Z 0.4; another Y 0.1, A 0.1, C 0.8; and the Year state emits dddd 0.8, dd 0.2]

Page 70:

Using the HMM to segment

Find the highest-probability path through the HMM.

Viterbi: a quadratic dynamic programming algorithm.

[figure: a trellis over states House, Road, City, Pin for the observation sequence "115 Grant street Mumbai 400070", with one column of states per token ot]

Page 71:

Most Likely Path for a Given Sequence

The probability that the path π0 ... πN is taken and the sequence x1 ... xL is generated:

  Pr(x1 ... xL, π0 ... πN) = a_{0 π1} ∏_{i=1}^{L} b_{πi}(xi) a_{πi πi+1}

where the a's are transition probabilities and the b's are emission probabilities.

Page 72:

Example

[figure: an HMM with begin and end states (states 0 and 5) and four emitting states; transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8; per-state emission tables over {A, C, G, T}: (A 0.1, C 0.4, G 0.4, T 0.1), (A 0.4, C 0.1, G 0.1, T 0.4), (A 0.4, C 0.1, G 0.2, T 0.3), (A 0.2, C 0.3, G 0.3, T 0.2)]

  Pr(AAC, π) = a_{01} b_1(A) a_{11} b_1(A) a_{13} b_3(C) a_{35}
             = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
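Multiplying the factors out gives the value of the slide's product (a quick check, with the transition and emission probabilities read off the figure):

```python
# Factors of Pr(AAC, pi): a01, b1(A), a11, b1(A), a13, b3(C), a35.
factors = [0.5, 0.4, 0.2, 0.4, 0.8, 0.3, 0.6]
p = 1.0
for f in factors:
    p *= f
print(round(p, 6))  # 0.002304
```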

Page 73:

Finding the most probable path

Find the state sequence that best explains the observations o1 ... oT.

Viterbi algorithm (1967):

  X̂ = argmax_X P(X | O)

Page 74:

Viterbi Algorithm

Define

  δj(t) = max_{x1 ... xt-1} P(x1 ... xt-1, o1 ... ot-1, xt = j, ot)

the probability of the state sequence which maximizes the probability of seeing the observations up to time t-1, landing in state j, and seeing the observation at time t.

Page 75:

Viterbi Algorithm

Recursive computation:

  δj(t+1) = max_i δi(t) a_{ij} b_{j o_{t+1}}
  ψj(t+1) = argmax_i δi(t) a_{ij} b_{j o_{t+1}}

Page 76:

Viterbi : Dynamic Programming

  δj(t+1) = max_i δi(t) a_{ij} b_{j o_{t+1}}

[figure: the House/Road/City/Pin trellis for "No 115 Grant street Mumbai 400070", showing δi(t) at each node, a transition a_{ij}, and the emission b_{j o_{t+1}} used to fill in δj(t+1)]

Page 77:

Viterbi Algorithm

Compute the most likely state sequence by working backwards:

  X̂_T = argmax_i δi(T)
  X̂_t = ψ_{X̂_{t+1}}(t+1)

Page 78:

Hidden Markov Models Summary

A popular technique to detect and classify a linear sequence of information in text.

The main disadvantage is the need for large amounts of training data.

Related work:
  System for extraction of gene names and locations from scientific abstracts (Leek, 1997)
  NERC (Bikel et al., 1997)
  McCallum et al. (1999): extracted document segments that occur in a fixed or partially fixed order (title, author, journal)
  Ray and Craven (2001): extraction of proteins, locations, genes and disorders and their relationships

Page 79:

IE Technique Landscape

Page 80:

IE with Symbolic Techniques

Conceptual Dependency Theory (Schank, 1972; Schank, 1975)
  mainly aimed to extract semantic information about individual events from sentences at a conceptual level (i.e., the actor and an action)

Frame Theory (Minsky, 1975)
  a frame stores the properties or characteristics of an entity, action or event
  it typically consists of a number of slots to refer to the properties named by a frame

Berkeley FrameNet project (Baker, 1998; Fillmore and Baker, 2001)
  online lexical resource for English, based on frame semantics and supported by corpus evidence

FASTUS (Finite State Automaton Text Understanding System) (Hobbs, 1996)
  uses a cascade of FSAs in a frame-based information extraction approach

Page 81:

IE with Machine Learning Techniques

Training data: documents marked up with ground truth.

In contrast to text classification, local features are crucial. Features of:
- Contents
- Text just before the item
- Text just after the item
- Begin/end boundaries

Page 82:

Good Features for Information Extraction

Example word features:
- identity of word
- is in all caps
- ends in "-ski"
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet or Cyc
- is in bold font
- is in hyperlink anchor
- features of past & future: last person name was female; next two words are "and Associates"

Other example features:
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

Creativity and Domain Knowledge Required!
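Several of the surface features in the list above are simple predicates over a line of text. The sketch below implements a subset of them; the feature names follow the slide, but the exact definitions (e.g. the indentation thresholds) are my reading of them, not a specification.

```python
import re

def line_features(line):
    """Boolean surface features of one text line, in the spirit of the list above."""
    stripped = line.strip()
    indent = len(line) - len(line.lstrip(" "))
    return {
        "begins-with-number": bool(re.match(r"\d", stripped)),
        "begins-with-punctuation": bool(re.match(r"[^\w\s]", stripped)),
        "blank": stripped == "",
        "contains-alphanum": bool(re.search(r"[A-Za-z0-9]", stripped)),
        "contains-http": "http" in stripped,
        "contains-number": bool(re.search(r"\d", stripped)),
        "contains-pipe": "|" in stripped,
        "contains-question-mark": "?" in stripped,
        "ends-with-question-mark": stripped.endswith("?"),
        # first alphabetic character, if any, is uppercase
        "first-alpha-is-capitalized": next((c.isupper() for c in stripped if c.isalpha()), False),
        "indented": indent > 0,
        "indented-1-to-4": 1 <= indent <= 4,
        "indented-5-to-10": 5 <= indent <= 10,
        "only-punctuation": stripped != "" and not re.search(r"[\w\s]", stripped),
        "shorter-than-30": len(stripped) < 30,
    }
```

A learner would consume these booleans as a feature vector for each line, alongside the word-level features listed above.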

Page 83:

Good Features for Information Extraction

Word Features:
- Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial
- Punctuation: Period, Comma, Apostrophe, Dash
- Preceded by HTML tag
- Character n-gram classifier says string is a person name (80% accurate)
- In stopword list (the, of, their, etc.)
- In honorific list (Mr, Mrs, Dr, Sen, etc.)
- In person suffix list (Jr, Sr, PhD, etc.)
- In name particle list (de, la, van, der, etc.)
- In Census lastname list; segmented by P(name)
- In Census firstname list; segmented by P(name)
- In locations lists (states, cities, countries)
- In company name list ("J. C. Penny")
- In list of company suffixes (Inc, & Associates, Foundation)
- Lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases

HTML/Formatting Features:
- {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
- {begin, end} of line

Creativity and Domain Knowledge Required!
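The lexicon-lookup features above reduce to set-membership tests. In this sketch, the tiny word lists are illustrative stand-ins for the real stopword, honorific, and name-particle lexicons named on the slide.

```python
# Illustrative stand-ins for the full lexicons named on the slide.
STOPWORDS = {"the", "of", "their", "and", "in"}
HONORIFICS = {"mr", "mrs", "dr", "sen"}
PERSON_SUFFIXES = {"jr", "sr", "phd"}
NAME_PARTICLES = {"de", "la", "van", "der"}

def lexicon_features(token):
    """Orthographic and lexicon-membership features for one token."""
    w = token.strip(".,").lower()  # normalize: drop trailing punctuation, lowercase
    return {
        "is-capitalized": token[:1].isupper(),
        "is-all-caps": token.isupper(),
        "contains-digit": any(c.isdigit() for c in token),
        "in-stopword-list": w in STOPWORDS,
        "in-honorific-list": w in HONORIFICS,
        "in-person-suffix-list": w in PERSON_SUFFIXES,
        "in-name-particle-list": w in NAME_PARTICLES,
    }
```

In a real system the Census name lists and company-suffix lists would be loaded from files the same way, as large sets keyed by normalized token.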

Page 84:

Landscape of ML Techniques for IE:

Any of these models can be used to capture words, formatting, or both.

- Classify Candidates: a classifier assigns a class ("which class?") to each pre-generated candidate phrase in "Abraham Lincoln was born in Kentucky."
- Sliding Window: a classifier labels each window of tokens, trying alternate window sizes.
- Boundary Models: classifiers predict BEGIN and END boundaries, which are then paired to delimit fields.
- Finite State Machines: find the most likely state sequence for "Abraham Lincoln was born in Kentucky."
- Wrapper Induction: learn and apply a pattern for a website; in "<b><i>Abraham Lincoln</i></b> was born in Kentucky.", the <b><i> context marks a PersonName.
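The sliding-window approach above can be sketched as: enumerate all short token spans, score each with a classifier, and keep the best. The scoring function here is a deliberately crude stand-in (it just favors capitalized spans), not a trained model.

```python
def candidate_windows(tokens, max_len=4):
    """Enumerate every contiguous span of 1..max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, tokens[start:end]

def score_person(span):
    """Stand-in classifier: fraction of capitalized alphabetic tokens."""
    return sum(t[:1].isupper() and t.isalpha() for t in span) / len(span)

def best_window(tokens, max_len=4):
    """Sliding-window extraction: return the highest-scoring span,
    preferring longer spans on ties."""
    return max(candidate_windows(tokens, max_len),
               key=lambda w: (score_person(w[2]), len(w[2])))

tokens = "Abraham Lincoln was born in Kentucky .".split()
start, end, span = best_window(tokens)  # span == ['Abraham', 'Lincoln']
```

A real system would replace score_person with a learned model over the contextual features discussed earlier, and would typically apply a threshold so that sentences with no entity yield no extraction.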

Page 85:

IE History

Pre-Web:
- Mostly news articles
- De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
- Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]
- Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs
- But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]

Web:
- AAAI '94 Spring Symposium on "Software Agents": much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
- Tom Mitchell's WebKB, '96: build KBs from the Web
- Wrapper Induction: initially hand-built, then ML: [Soderland '96], [Kushmerick '97], ...

Page 86:

Summary

- Information Extraction
- Sliding Window
- From FST (Finite State Transducer) to HMM
- Wrapper Induction: wrapper toolkits, LR Wrapper

[Recap figures: Finite State Machines (most likely state sequence?) and Sliding Window (classifier, trying alternate window sizes), both on "Abraham Lincoln was born in Kentucky."]
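An LR wrapper, as listed in the summary, extracts each field by its left and right delimiter strings. This sketch assumes a page where every record wraps one field in <b>…</b> and another in <i>…</i>; the sample page content is invented for illustration.

```python
def lr_extract(page, left, right):
    """LR wrapper: extract every string occurring between delimiters left and right."""
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            return results
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            return results
        results.append(page[start:end])
        pos = end + len(right)

# Invented sample page; the delimiters are the learned (left, right) pair per field.
page = "<b>Congo</b> <i>242</i> <b>Egypt</b> <i>20</i>"
names = lr_extract(page, "<b>", "</b>")  # -> ['Congo', 'Egypt']
codes = lr_extract(page, "<i>", "</i>")  # -> ['242', '20']
```

Wrapper induction learns the (left, right) delimiter pairs from a few labeled example pages; applying the wrapper is then just this kind of linear scan, which is why wrappers work well on machine-generated sites but break when the page template changes.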

Page 87:

Readings

[1] I. Muslea, S. Minton, and C. A. Knoblock, "A hierarchical approach to wrapper induction," in Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States: ACM, 1999.

Page 88:

Thank You!

Q&A