
Introduction Extractionmccallum/courses/inlp2004/lect18-confidence.… · Extraction Trausti Kristjansson, Aron Culotta, Paul Viola, Andrew McCallum IBM Research Introduction In USA,

Oct 19, 2020


Interactive Information Extraction

Trausti Kristjansson, Aron Culotta, Paul Viola, Andrew McCallum

IBM Research

Introduction

In the USA, 70 million workers complete forms on a regular basis.

The goal of this work is to reduce the burden on the user to the largest extent possible, while ensuring the integrity of the data.

Main Points

• Synergy of User Interface and Information Extraction Algorithm
• CRFs for information extraction
• Correction Propagation in CRFs
• Confidence Estimation in CRFs
• Expected Number of User Actions

Add Contacts to Address Book

• Email
• Web
• Text document
• Word document
• Excel

Demo: Contact Assistant

Input is automatically parsed and assigned to fields

Data Integrity – Fast Verification

Color-coded correspondence; the user can quickly spot errors


Correction Propagation

• Show live demo

Interactive Information Extraction

• UI shows automatic field assignment results and allows for fast verification and fast correction
• IE algorithm takes corrections into account and propagates correction to other fields
• IE algorithm calculates confidence scores
• UI uses confidence scores to alert user to possible errors

Constrained Conditional Random Fields and Confidence Estimation

Classes – Database Fields

• Classes
– First Name
– Last Name
– Title
– Suffix
– Company Name
– Phone – Business
– Phone – Home
– Phone – Mobile
– FAX
– Address Line
– City
– State
– Postal Code
– Country
– Email address
– Webpage URL

Token Features

• Features $f_k(\mathbf{y}, t)$
– Capitalized
– All Caps
– In First Name Lexicon
– In Last Name Lexicon
– 1st Word on line
– 2nd Word on line
– 3rd Word on line
– Previous Token in First Name lexicon
– Contains Digits
– Contains 5 Digits
– Contains Hyphen
– Enclosed in Brackets
– … and 20000 more
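As a rough illustration of how binary token features of this kind might be computed, here is a minimal Python sketch. The function name, the feature identifiers and the two tiny lexicons are assumptions made for the example; the slides do not show the actual feature extraction code.

```python
import re

# Hypothetical toy lexicons; a real system would load much larger lists.
FIRST_NAMES = {"charles", "paul", "andrew"}
LAST_NAMES = {"stanley", "viola", "mccallum"}


def token_features(line_tokens, i):
    """Return the set of binary features that fire for token i of one line."""
    tok = line_tokens[i]
    feats = set()
    if tok[:1].isupper():
        feats.add("Capitalized")
    if tok.isalpha() and tok.isupper():
        feats.add("AllCaps")
    if tok.lower() in FIRST_NAMES:
        feats.add("InFirstNameLexicon")
    if tok.lower() in LAST_NAMES:
        feats.add("InLastNameLexicon")
    if i < 3:
        feats.add(("1st", "2nd", "3rd")[i] + "WordOnLine")
    if i > 0 and line_tokens[i - 1].lower() in FIRST_NAMES:
        feats.add("PrevTokenInFirstNameLexicon")
    if re.search(r"\d", tok):
        feats.add("ContainsDigits")
    if sum(c.isdigit() for c in tok) == 5:
        feats.add("Contains5Digits")
    if "-" in tok:
        feats.add("ContainsHyphen")
    if tok.startswith("(") and tok.endswith(")"):
        feats.add("EnclosedInBrackets")
    return feats


tokens = "Charles Stanley 100 Charles Street".split()
for i, tok in enumerate(tokens):
    print(tok, sorted(token_features(tokens, i)))
```

Each token is thus described by a sparse set of active features, which corresponds to the "vector of binary values" view of the observed variables.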

Conditional Random Fields

• Conditional Random Fields are globally normalized probability models, where hidden variables are conditioned on observed variables.
• They do not model the distribution over the observed variables, as generative models (e.g. HMMs) do.
• The advantage over generative models is that independence of the observations is not necessary for tractability.

[Figure: chain of hidden states $x_{t-2}, \dots, x_{t+2}$ and observed variables $y_{t-2}, \dots, y_{t+2}$; the model defines $p(\mathbf{x} \mid \mathbf{y})$]


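Conditional Random Fields

[Figure: chain of hidden states $x_{t-2}, \dots, x_{t+2}$ and observed variables $y_{t-2}, \dots, y_{t+2}$, with the example observation "Charles Stanley 100 Charles Street"]

$$p(\mathbf{x} \mid \mathbf{y}) = \frac{1}{Z(\mathbf{y})} \exp\left\{ \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(\mathbf{x}, \mathbf{y}, t) + \sum_{t=1}^{T} \sum_{k} \lambda_k g_k(\mathbf{x}, t) \right\}$$

• $f_k(\mathbf{x}, \mathbf{y}, t)$ relate hidden state variables to observed variables.
  $f_k(x_t = \text{postal code},\ y_t\ \text{contains 5 digits}) \Rightarrow \lambda_k$ large
• $g_k(\mathbf{x}, t)$ relate adjacent hidden state variables.
  $g_k(x_t = \text{first name},\ x_{t+1} = \text{postal code}) \Rightarrow \lambda_k$ small
  $g_k(x_t = \text{first name},\ x_{t+1} = \text{last name}) \Rightarrow \lambda_k$ large
• $Z(\mathbf{y})$ is the normalizing factor, i.e. the sum over all state sequences for the given observation.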
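To make the role of the normalizing factor concrete, the sketch below scores every label sequence for a five-token observation and divides by the sum over all sequences. It follows the slide's notation (x for hidden labels, y for observations), but the three-label set, the feature names and all weights are invented for illustration, and brute-force enumeration is only practical for very short inputs.

```python
import itertools
import math

LABELS = ["FirstName", "LastName", "AddressLine"]

# Hypothetical weights lambda_k, chosen by hand for illustration only.
F_WEIGHTS = {  # f-type features: (label at t, observed feature at t)
    ("FirstName", "in_first_lexicon"): 2.0,
    ("LastName", "in_last_lexicon"): 2.0,
    ("AddressLine", "has_digits"): 3.0,
    ("AddressLine", "is_street_word"): 2.0,
}
G_WEIGHTS = {  # g-type features: (label at t, label at t+1)
    ("FirstName", "LastName"): 2.0,
    ("LastName", "AddressLine"): 1.0,
    ("AddressLine", "AddressLine"): 2.0,
    ("FirstName", "AddressLine"): -2.0,
}


def score(x, y):
    """Weighted sum of the f and g features that fire for labels x given y."""
    total = 0.0
    for t, label in enumerate(x):
        total += sum(F_WEIGHTS.get((label, feat), 0.0) for feat in y[t])
        if t + 1 < len(x):
            total += G_WEIGHTS.get((label, x[t + 1]), 0.0)
    return total


def log_z(y):
    """log Z(y): log of the sum of exp(score) over every label sequence."""
    scores = [score(x, y) for x in itertools.product(LABELS, repeat=len(y))]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def prob(x, y):
    """p(x | y) = exp(score(x, y)) / Z(y)."""
    return math.exp(score(x, y) - log_z(y))


# Observed features for "Charles Stanley 100 Charles Street" (assumed).
Y = [
    {"in_first_lexicon"},
    {"in_last_lexicon"},
    {"has_digits"},
    {"in_first_lexicon"},
    {"is_street_word"},
]
guess = ("FirstName", "LastName", "AddressLine", "AddressLine", "AddressLine")
print(prob(guess, Y))
```

Because the probabilities of all sequences share the same Z(y), they sum to one by construction; efficient implementations replace the enumeration with dynamic programming.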

Finding the best state assignment

[Figure: lattice for the observation "Charles Stanley 100 Charles Street", with state (class) variables $x_{t-2}, \dots, x_{t+2}$ and observed feature variables $y_{t-2}, \dots, y_{t+2}$ (vectors of binary values)]

• Classes: First Name, Last Name, Title, Suffix, Company Name, Phone – Business, Phone – Home, Phone – Mobile, FAX, Address Line, City, State, Postal Code, Country, Email address, Webpage URL
• Features: Capitalized, All Caps, In First Name Lexicon, In Last Name Lexicon, 1st Word on line, 2nd Word on line, 3rd Word on line, Previous Token in First Name lexicon, Contains Digits, Contains 5 Digits, Contains Hyphen, Enclosed in Brackets

Probabilities shown on the slide for different state assignments:

$p(\mathbf{x} \mid \mathbf{y}) = 0.19 \times 10^{-50}$

$p(\mathbf{x} \mid \mathbf{y}) = 0.93$

$p(\mathbf{x} \mid \mathbf{y}) = 0.59 \times 10^{-36}$

Viterbi used to find best sequence

• Viterbi algorithm may return the sequence of states shown below

[Figure: lattice with candidate states First Name, Last Name and Address Line at positions $x_{t-2}, \dots, x_{t+2}$ over the tokens "Charles Stanley 100 Charles Street" ($y_{t-2}, \dots, y_{t+2}$)]
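A minimal Viterbi sketch for a linear-chain model of this shape is given below. The labels match the figure, but the feature names and weights are assumptions, chosen deliberately so that the best-scoring sequence contains errors (both name tokens end up as Address Line), which is the situation the later correction slides start from.

```python
LABELS = ["FirstName", "LastName", "AddressLine"]

# Hypothetical weights, hand-picked so that the decoder makes a mistake.
F_WEIGHTS = {  # (label at t, observed feature at t)
    ("FirstName", "in_first_lexicon"): 1.5,
    ("AddressLine", "unknown_word"): 1.0,
    ("AddressLine", "has_digits"): 3.0,
    ("AddressLine", "is_street_word"): 3.0,
}
G_WEIGHTS = {  # (label at t, label at t+1)
    ("FirstName", "LastName"): 2.0,
    ("LastName", "AddressLine"): 1.0,
    ("AddressLine", "AddressLine"): 2.0,
    ("FirstName", "AddressLine"): -2.0,
    ("AddressLine", "LastName"): -2.0,
}


def node_score(label, feats):
    """Sum of the f-type weights that fire for this label and token."""
    return sum(F_WEIGHTS.get((label, f), 0.0) for f in feats)


def viterbi(y):
    """Return the highest-scoring label sequence for observed features y."""
    # best[t][label]: best score of any path that ends in `label` at t.
    best = [{lab: node_score(lab, y[0]) for lab in LABELS}]
    back = []  # back[t - 1][label]: best predecessor of `label` at t
    for t in range(1, len(y)):
        scores, pointers = {}, {}
        for lab in LABELS:
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + G_WEIGHTS.get((p, lab), 0.0),
            )
            scores[lab] = (
                best[-1][prev]
                + G_WEIGHTS.get((prev, lab), 0.0)
                + node_score(lab, y[t])
            )
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=best[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Observed features for "Charles Stanley 100 Charles Street" (assumed).
Y = [
    {"in_first_lexicon"},
    {"unknown_word"},
    {"has_digits"},
    {"in_first_lexicon"},
    {"is_street_word"},
]
print(viterbi(Y))
```

The dynamic program keeps one best score per label and position, so its cost grows linearly with the number of tokens rather than exponentially.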

Correction Propagation – User Correction

• User corrects a field, e.g. dragging Stanley to the Last Name field

[Figure: the same lattice of First Name, Last Name and Address Line states over "Charles Stanley 100 Charles Street"]


Remove Paths

• User corrects a field, e.g. dragging Stanley to the Last Name field

[Figure: the same lattice over "Charles Stanley 100 Charles Street", with the paths that conflict with the correction removed]

Constrained Viterbi

• Viterbi algorithm is constrained to pass through the designated state.

[Figure: the same lattice over "Charles Stanley 100 Charles Street", with the best path forced through the corrected state]
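One way to sketch constrained Viterbi is to give every label that conflicts with a user correction a score of negative infinity, which removes all paths through it. The code below is an illustrative sketch under that assumption, with hand-picked hypothetical weights; it is not taken from the authors' implementation.

```python
NEG_INF = float("-inf")
LABELS = ["FirstName", "LastName", "AddressLine"]

# Hypothetical weights, hand-picked so that the first guess is wrong.
F_WEIGHTS = {  # (label at t, observed feature at t)
    ("FirstName", "in_first_lexicon"): 1.5,
    ("AddressLine", "unknown_word"): 1.0,
    ("AddressLine", "has_digits"): 3.0,
    ("AddressLine", "is_street_word"): 3.0,
}
G_WEIGHTS = {  # (label at t, label at t+1)
    ("FirstName", "LastName"): 2.0,
    ("LastName", "AddressLine"): 1.0,
    ("AddressLine", "AddressLine"): 2.0,
    ("FirstName", "AddressLine"): -2.0,
    ("AddressLine", "LastName"): -2.0,
}


def constrained_viterbi(y, constraints=None):
    """Best label sequence that passes through every {position: label}."""
    constraints = constraints or {}

    def node(t, lab):
        # Labels that contradict a user correction are ruled out.
        if t in constraints and constraints[t] != lab:
            return NEG_INF
        return sum(F_WEIGHTS.get((lab, f), 0.0) for f in y[t])

    best = [{lab: node(0, lab) for lab in LABELS}]
    back = []
    for t in range(1, len(y)):
        scores, pointers = {}, {}
        for lab in LABELS:
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + G_WEIGHTS.get((p, lab), 0.0),
            )
            scores[lab] = (
                best[-1][prev] + G_WEIGHTS.get((prev, lab), 0.0) + node(t, lab)
            )
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(LABELS, key=best[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Observed features for "Charles Stanley 100 Charles Street" (assumed).
Y = [
    {"in_first_lexicon"},
    {"unknown_word"},
    {"has_digits"},
    {"in_first_lexicon"},
    {"is_street_word"},
]
before = constrained_viterbi(Y)
after = constrained_viterbi(Y, {1: "LastName"})  # user fixes "Stanley"
print(before)
print(after)
```

In this toy model, fixing only the label of "Stanley" also changes the label of the neighbouring "Charles", because the transition weights now favour a First Name in front of a Last Name. That is the correction propagation effect in miniature.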

Adjacent field changed: Correction Propagation

Indicate Low Confidence

Confidence Estimation

• Confidence in a classification
• Constrained Forward algorithm used to calculate the sums over the subsets of paths that “agree” and “disagree” with a classification

$$CE = \frac{P(\text{Classification})}{P(\text{Any classification})} = \frac{\text{Sum of all paths that agree with classification}}{\text{Sum of all paths}}$$

Sum of “agreeing” state sequences

• Paths that “agree” with classification

[Figure: lattice of First Name, Last Name and Address Line states over "Charles Stanley 100 Charles Street", showing only the paths that pass through the classification in question]

Sum of all state sequences

• All paths

[Figure: the same lattice, showing all paths]
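The ratio of the two path sums can be sketched with a forward pass run twice, once with the classification imposed as a constraint and once unconstrained. The code below works in log space for numerical stability; the weights and feature names are hypothetical, and the constraint format ({position: label}) is an assumption made for the example.

```python
import math

NEG_INF = float("-inf")
LABELS = ["FirstName", "LastName", "AddressLine"]

# Hypothetical weights for illustration only.
F_WEIGHTS = {  # (label at t, observed feature at t)
    ("FirstName", "in_first_lexicon"): 1.5,
    ("AddressLine", "unknown_word"): 1.0,
    ("AddressLine", "has_digits"): 3.0,
    ("AddressLine", "is_street_word"): 3.0,
}
G_WEIGHTS = {  # (label at t, label at t+1)
    ("FirstName", "LastName"): 2.0,
    ("LastName", "AddressLine"): 1.0,
    ("AddressLine", "AddressLine"): 2.0,
    ("FirstName", "AddressLine"): -2.0,
    ("AddressLine", "LastName"): -2.0,
}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_forward(y, constraints=None):
    """log of the summed exp(score) over all paths obeying `constraints`."""
    constraints = constraints or {}

    def node(t, lab):
        if t in constraints and constraints[t] != lab:
            return NEG_INF
        return sum(F_WEIGHTS.get((lab, f), 0.0) for f in y[t])

    alpha = {lab: node(0, lab) for lab in LABELS}
    for t in range(1, len(y)):
        alpha = {
            lab: node(t, lab)
            + logsumexp(
                [alpha[p] + G_WEIGHTS.get((p, lab), 0.0) for p in LABELS]
            )
            for lab in LABELS
        }
    return logsumexp(list(alpha.values()))


def confidence(y, constraints):
    """Sum over agreeing paths divided by the sum over all paths."""
    return math.exp(log_forward(y, constraints) - log_forward(y))


# Observed features for "Charles Stanley 100 Charles Street" (assumed).
Y = [
    {"in_first_lexicon"},
    {"unknown_word"},
    {"has_digits"},
    {"in_first_lexicon"},
    {"is_street_word"},
]
for lab in LABELS:
    print(lab, confidence(Y, {1: lab}))
```

A field that spans several tokens would be handled by constraining every position in the field, for example `{2: "AddressLine", 3: "AddressLine", 4: "AddressLine"}`.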


Evaluation

Standard Metrics

• Standard information retrieval metrics:

          Token Acc.   F1      Precision   Recall
CRF       89.73        87.23   88.24       86.24
MaxEnt    88.43        84.84   85.05       84.95

• These metrics don’t relate well to the stated goals, e.g. how much does the system speed up data acquisition.

Expected Number of User Actions

• UI designers often use the “Number of Clicks” as an objective metric.
• We would like a similar metric for measuring the effectiveness of Correction Propagation.
• We can calculate the Expected Number of User Actions (ENUA) from statistics of the number of erroneous fields in each record processed by the system.

$$ENUA_{\text{manual}} = \frac{\text{Total fields}}{\text{Total records}} = 6.31$$

Number of Incorrect Fields

[Figure: histogram of Number of Records (0 to 600) against Number of Incorrect Fields in Record (0 to 6) for the CRF]

ENUA = 0.73

1. Fields automatically assigned
2. User corrects remaining errors

Correct one field

[Figure: histogram of Number of Records (0 to 600) against Number of Incorrect Fields in Record (0 to 6), shown for the CRF and for CRF + one correction]

Run correction propagation

[Figure: histogram of Number of Records (0 to 600) against Number of Incorrect Fields in Record (0 to 6), comparing CRF + one correction with CRF + one correction + Correction Propagation]

ENUA = 0.63

1. Fill in fields automatically
2. User corrects a field
3. Correction Propagation
4. User corrects remaining errors
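The two interaction sequences suggest one plausible way to compute ENUA from error histograms, sketched below. The formulas are an assumption inferred from the step lists (one action per incorrect field, plus one action for the initial correction when propagation is used), not a definition given on the slides. The histogram counts are invented and are not the data behind the plots.

```python
def enua_without_propagation(hist):
    """Mean number of incorrect fields per record.

    hist[n] is the number of records with n incorrect fields after
    automatic extraction; each incorrect field costs one user action.
    """
    total = sum(hist.values())
    return sum(n * count for n, count in hist.items()) / total


def enua_with_propagation(hist_before, hist_after):
    """One correction for records with errors, plus remaining errors.

    hist_after[n] is the number of records with n incorrect fields left
    after one user correction followed by correction propagation.
    """
    total = sum(hist_before.values())
    p_first_action = 1 - hist_before.get(0, 0) / total
    remaining = sum(n * c for n, c in hist_after.items()) / sum(
        hist_after.values()
    )
    return p_first_action + remaining


# Hypothetical histograms over 100 records (not the slide data).
before = {0: 60, 1: 25, 2: 10, 3: 5}
after = {0: 90, 1: 8, 2: 2}
print(enua_without_propagation(before))
print(enua_with_propagation(before, after))
```

Under these assumptions, propagation pays off whenever the first correction removes, on average, more than one error per corrected record.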

Expected Number of User Actions

Model/UI-Model    ENUA   Change
CRF – UIM1        0.73   -88.4%
CCRF – UIM2       0.63   -93.1%
Manual – UIMm     6.31   Baseline

Slide annotations: 8.5x (Manual to CRF), 10x (Manual to CCRF), -13.9% (CRF to CCRF)

Confidence Estimation

• 276 records had one or more errors.
• If the least confident field is highlighted in a record with one or more errors, an error will be identified 81.9% of the time.
• If a field is chosen at random, an error will be identified 29.0% of the time.
• This illustrates the potential for using confidence to direct the user's attention to an incorrect field.
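The evaluation described in these bullets can be sketched as follows: for each record that contains an error, highlight the field with the lowest confidence and check whether that field is actually wrong. The record layout, field names and scores below are hypothetical and only illustrate the procedure.

```python
def least_confident_field(confidences):
    """Return the name of the field with the lowest confidence score."""
    return min(confidences, key=confidences.get)


def hit_rate(records):
    """Fraction of error-containing records whose least confident field
    is one of the fields that is actually wrong."""
    with_errors = [r for r in records if r["wrong_fields"]]
    hits = sum(
        least_confident_field(r["confidences"]) in r["wrong_fields"]
        for r in with_errors
    )
    return hits / len(with_errors)


# Hypothetical records; confidences and error sets are made up.
records = [
    {
        "confidences": {"First Name": 0.97, "Last Name": 0.41, "City": 0.88},
        "wrong_fields": {"Last Name"},
    },
    {
        "confidences": {"First Name": 0.60, "Last Name": 0.92, "City": 0.75},
        "wrong_fields": {"City"},
    },
    {
        "confidences": {"First Name": 0.99, "Last Name": 0.98, "City": 0.95},
        "wrong_fields": set(),
    },
]
print(least_confident_field(records[0]["confidences"]))
print(hit_rate(records))
```

Records without errors are excluded from the denominator, matching the way the slide conditions on records with one or more errors.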

Summary

• Synergy of User Interface and Information Extraction Algorithm, ensuring the integrity of the data

• Over 88% reduction of User Actions by Information Extraction alone
• Additional 13% reduction in User Actions due to Correction Propagation
• Confidence Scores effective at identifying incorrect fields.

• IIE in Microsoft Office 2007 ???


End