Page 1:

June 12, 2008 © 2008 IBM Corporation

An Algebraic Approach to Information Extraction

June 12, 2008

The Avatar Group

(Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu)

IBM Almaden Research Center

Page 2:

Information Extraction (IE)

Distill structured data from unstructured and semi-structured text

Exploit the extracted data in your applications

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Annotations

Name              Title     Organization
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  Founder   Free Soft..

(from Cohen’s IE tutorial, 2003)

Page 3:

The Avatar Group at IBM Almaden

Working on information extraction (IE) since 2003

Main goals:

– Extract structured information from text

– Build a system that can scale IE to real enterprise apps

– Build new enterprise applications that leverage IE

Page 4:

Extracting Entities in Notes 8.01 Live Text

Names, addresses, phone numbers…

Leverages the technologies discussed here

Ships with Lotus Notes 8.01

Page 5:

IOPES: Extracting Relationships and Composite Entities

IOPES = IBM OmniFind Personal Email Search

Associations like name ↔ phone number

Complex entities like conference schedules, directions, signature blocks

Page 6:

Road Map

An Algebraic Approach to Information Extraction

System T and the AQL Language

Annotators built with AQL

Page 7:

Evolution of the Avatar Project

[Timeline figure, 2004 to 2008]

Systems:

– RAP (CPSL-style cascading grammar system)

– RAP++ (RAP + extensions outside the scope of grammars)

– System T (algebraic information extraction system)

Evolutionary Triggers:

– Custom Code

– Diverse data sets, Complex extraction tasks

– Performance, Expressivity

– Large number of annotators

Page 8:

Historical Perspective: Information Extraction

MUC (Message Understanding Conference) – 1987 to 1997

– Competition-style conferences organized by DARPA

Many different systems from this community

– FRUMP [DeJong82], CIRCUS/AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]

Recent interest from database/search community

– [Agichtein03] [Ipeirotis06] [Ramakrishnan06] [Shen07]

Page 9:

An Aside: Rule-Based vs. Machine Learning

Two dominant approaches to information extraction (IE)

– Rule-Based: Define a set of extraction rules

– Machine Learning Based: Learn a parametric model

Focus of our work: Rule-based IE

Page 10:

Cascading Finite-state Grammars

Most rule-based IE systems share a common formalism

– Input text viewed as a sequence of tokens

– Rules expressed as regular expression patterns over these tokens

Several levels of processing → Cascading Grammars

Page 11:

Cascading Grammars By Example

Level 0 (Tokenize)

… arcu augue rutrum velit, sed John Smith at 555-1212 hendrerit faucibus pede mi ipsum …

Level 1

Token[~ “John | Smith| …”]+ → Name
Token[~ “[1-9]\d{2}-\d{4}”] → Phone

… arcu augue rutrum velit, sed <Name> at <Phone> hendrerit faucibus pede mi ipsum …

Level 2

Name Token[~ “at”] Phone → PersonPhone

… arcu augue rutrum velit, sed <PersonPhone>, hendrerit faucibus pede mi sed ipsum …
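The three levels of this example can be sketched in a few lines of Python. This is only an illustration of the cascading idea, not the CPSL or RAP implementation: the whitespace tokenizer, the two-word name dictionary, and the function names (`tokenize`, `level1`, `level2`, `run_cascade`) are all assumptions made for the sketch.

```python
import re

NAME_WORDS = {"John", "Smith"}              # stand-in for the name dictionary
PHONE_RE = re.compile(r"[1-9]\d{2}-\d{4}")  # the Phone pattern from the slide


def tokenize(text):
    """Level 0: split on whitespace; each token is a (type, text) pair."""
    return [("Token", word) for word in text.split()]


def level1(tokens):
    """Level 1: Token[~name]+ becomes Name, Token[~phone] becomes Phone."""
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j][1] in NAME_WORDS:
            j += 1
        if j > i:
            out.append(("Name", " ".join(t[1] for t in tokens[i:j])))
            i = j
        elif PHONE_RE.fullmatch(tokens[i][1]):
            out.append(("Phone", tokens[i][1]))
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


def level2(tokens):
    """Level 2: Name Token[~"at"] Phone becomes PersonPhone."""
    out, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if (len(window) == 3 and window[0][0] == "Name"
                and window[1] == ("Token", "at") and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(t[1] for t in window)))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def run_cascade(text):
    """Each level makes a complete pass over the output of the previous one."""
    return level2(level1(tokenize(text)))
```

Note that each level consumes the single token sequence produced by the level below it, so every rule pays for a full pass and only one annotation per token survives.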

Page 12:

Common Pattern Specification Language (CPSL)

[Figure: the cascading grammar example (Levels 0 to 2), repeated from the previous slide]

CPSL

– A standard language for specifying cascading grammars

– Created in 1998

Several known implementations

– TextPro: reference implementation of CPSL by Doug Appelt

– JAPE (Java Annotation Pattern Engine)

• Part of the GATE NLP framework

• Under active consideration for commercial use by several companies

Page 13:

Experiences with Cascading Grammars

Benefits

– Big step forward from custom code

– Can express many simple concepts

Drawbacks

– Expressiveness

• Multiple tokenizations

• Dealing with overlap

• Building complex structures

– Performance

Page 14:

Example Task: Finding informal reviews in blogs

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas turpis. Proin nam ac ligula a lectus suscipit porttitor. Fusce non tellus sed urna pulvinar tincidunt.

Etiam in enim. In blandit mi sit amet lectus. Nullam adipiscing fringilla odio. In hac habitasse platea dictumst. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Ut elementum quam eget justo. In arcu leo,

We went to a OTIS concert last Thursday. Suspendisse malesuada est vel risus. Aenean sed ante fermentum dolor placerat rutrum. John Pipe plays guitar, id pellentesque pede felis a erat. Felis Marco Benevento on the Hammond organ. Curabitur sollicitudin porta velit. Donec scelerisque. Donec a magna sed sem accumsan sodales. It was SO MUCH FUN! Hes accumsan sed, aliquam eget, ornare et, metus. Integer eleifend tellus dictum nisi.

Page 15:

Overlapping Annotations Example: Band Review Annotator

Rules:

(pipe | guitar | hammond organ |…) → Instrument

1-2 capitalized words → Person

Person 0-5 tokens Instrument → PersonPlaysInstrument

Input: John Pipe plays the guitar

Possible annotations of the same text:

John [Person] Pipe [Person] plays [Token] the [Token] guitar [Instrument]

John Pipe [Person] plays [Token] the [Token] guitar [Instrument]

John [Person] Pipe [Instrument] plays [Token] the [Token] guitar [Instrument]

Page 16:

Complex Structures Example: Signature Annotator

Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs

Annotations in the block: Person, Organization, Phone, URL

Constraints:

– Start with Person

– Within 50 tokens

– At least 1 Phone

– At least 2 of {Phone, Organization, URL, Email, Address}

– End with one of these.

Page 17:

Performance

[Figure: the cascading grammar example (Levels 0 to 2), repeated from the earlier slide]

Page 18:

Performance: Existing Solutions

Performance issues

– Complete pass through tokens for each rule

– Many of these passes are wasted work

Dominant approach: Make each pass go faster

– Faster finite state machines

– Batch processing

– Parallel processing

Doesn’t solve root problem!

Page 19:

The Algebraic Approach

A different way of thinking

Identify the most basic operations

Create an operator for each basic operation

Compose operators to build complex annotators

Page 20:

Example: Regular Expression Extraction Operator

Regex operator: \d{3}-\d{4}

Input Tuple: (Document)

Document: You can reach me at 555-1212 or 358-1237.

Output Tuple 1: (Document, Span 1), where Span 1 covers 555-1212

Output Tuple 2: (Document, Span 2), where Span 2 covers 358-1237
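As a rough illustration of the tuple-in, tuples-out behaviour of this operator, here is a minimal Python sketch. The `Span` type and the `regex_operator` name are assumptions for the example; the slide does not show System T's actual operator interface.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class Span(NamedTuple):
    """Character offsets [begin, end) into a document."""
    begin: int
    end: int


def regex_operator(pattern: str,
                   doc_tuples: Iterable[Tuple[str]]) -> Iterator[Tuple[str, Span]]:
    """For each input tuple (document,), emit one (document, span) per match."""
    compiled = re.compile(pattern)
    for (doc,) in doc_tuples:
        for match in compiled.finditer(doc):
            yield (doc, Span(match.start(), match.end()))
```

One input tuple can therefore produce zero, one, or many output tuples, each carrying the original document plus a new span column.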

Page 21:

Some Example Operators

Regex

– Find all matches of a character-based regular expression

Dictionary

– Find all matches of an exhaustive dictionary of terms

Join

– Find pairs of sub-annotations that match a predicate

Block

– Identify contiguous blocks of lower-level matches
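The four operators listed above can be approximated with small Python functions over character spans. This is a hedged sketch rather than the System T algebra: the signatures, the `max_gap` and `min_size` parameters of `block`, and the `follows` predicate helper are assumptions introduced for illustration.

```python
import re
from typing import Callable, List, NamedTuple, Tuple


class Span(NamedTuple):
    begin: int
    end: int


def regex(doc: str, pattern: str) -> List[Span]:
    """Regex: all matches of a character-based regular expression."""
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, doc)]


def dictionary(doc: str, terms: List[str]) -> List[Span]:
    """Dictionary: all whole-word matches of any term in the list."""
    spans: List[Span] = []
    for term in terms:
        spans.extend(regex(doc, r"\b" + re.escape(term) + r"\b"))
    return sorted(spans)


def block(spans: List[Span], max_gap: int = 1, min_size: int = 2) -> List[Span]:
    """Block: merge runs of at least min_size spans at most max_gap chars apart.

    Assumes the input spans do not overlap each other.
    """
    blocks: List[Span] = []
    run: List[Span] = []
    for span in sorted(spans):
        if run and span.begin - run[-1].end > max_gap:
            if len(run) >= min_size:
                blocks.append(Span(run[0].begin, run[-1].end))
            run = []
        run.append(span)
    if len(run) >= min_size:
        blocks.append(Span(run[0].begin, run[-1].end))
    return blocks


def follows(min_chars: int, max_chars: int) -> Callable[[Span, Span], bool]:
    """Predicate: the second span starts min_chars..max_chars after the first ends."""
    def predicate(a: Span, b: Span) -> bool:
        return min_chars <= b.begin - a.end <= max_chars
    return predicate


def join(left: List[Span], right: List[Span],
         predicate: Callable[[Span, Span], bool]) -> List[Tuple[Span, Span]]:
    """Join: all (left, right) pairs that satisfy the predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]
```

Composing them as `join(block(dictionary(...)), regex(...), follows(0, 30))` gives a person-phone annotator built purely from operators, with no token-level passes.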

Page 22:

Comparison with Cascading Grammars

Grammar

…John Smith at 555-1212…

Apply Name Rule, Apply Phone Rule

…<Name> at <Phone>…

Apply PersonPhone

…<PersonPhone>…

Algebra

…John Smith at 555-1212…

Dictionary: John, Smith

Block: John Smith

Regex: 555-1212

Join: John Smith at 555-1212

Page 23:

Overlapping Annotations Example: Band Review Annotator


Page 24:

Overlapping Annotations

[Figure: algebra plan over the text “John Pipe plays the guitar”, built from the Dictionary, Regex, Block, Join, and Consolidate operators, with the Person and Instrument matches shown at each step]

Retain overlapping matches by default

Explicitly remove overlap with Consolidate operator
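A minimal sketch of what an explicit overlap-removal step could look like follows. The containment policy used here (drop any span that lies entirely inside another) is an assumption chosen for illustration; the slide does not say which policies the Consolidate operator supports.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


def consolidate(spans: List[Span]) -> List[Span]:
    """Drop every span that is fully contained in a different span.

    Partially overlapping spans are both kept under this policy.
    """
    unique = sorted(set(spans))

    def contained(inner: Span, outer: Span) -> bool:
        return (inner != outer
                and outer.begin <= inner.begin
                and inner.end <= outer.end)

    return [s for s in unique if not any(contained(s, o) for o in unique)]
```

For “John Pipe plays the guitar”, the Person candidates John (0, 4), Pipe (5, 9), and John Pipe (0, 9) would reduce to the single span (0, 9). Until this step runs, all three candidates remain available to downstream operators.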

Page 25:

Complex Structures Example: Signature Annotator

Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs

Annotations in the block: Person, Organization, Phone, URL

Constraints:

– Start with Person

– Within 250 characters

– At least 1 Phone

– At least 2 of {Phone, Organization, URL, Email, Address}

– End with one of these.

Page 26:

Complex Structures Example: Signature Annotator

[Figure: algebra plan in which Organization, Phone, and URL matches are combined with Union, grouped with Block, and joined with Person matches to produce Signature]

Find blocks of two or more “contact info” patterns

Join predicates enforce additional constraints
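The signature constraints can be sketched as a union of contact annotations, a block over them, and a join against Person. This is an illustrative reading only: the `Ann` type, the `find_signatures` name, and the interpretation of “at least 2” as two annotations (rather than two distinct kinds) are assumptions.

```python
from typing import List, NamedTuple, Tuple


class Ann(NamedTuple):
    kind: str   # e.g. "Person", "Phone", "Organization", "URL"
    begin: int
    end: int


def find_signatures(persons: List[Ann], contacts: List[Ann],
                    max_chars: int = 250) -> List[Tuple[int, int]]:
    """Return (begin, end) offsets of candidate signature blocks.

    contacts is the Union of Phone, Organization, URL, Email, Address matches.
    """
    signatures = []
    for person in persons:
        # Block: contact annotations after the person, inside the window.
        block = [c for c in contacts
                 if c.begin >= person.end and c.end - person.begin <= max_chars]
        kinds = {c.kind for c in block}
        # Join predicates: at least two contact patterns, at least one Phone.
        if len(block) >= 2 and "Phone" in kinds:
            signatures.append((person.begin, max(c.end for c in block)))
    return signatures
```

The block starts at the Person span and ends at the last contact annotation inside the window, which mirrors the “start with Person” and “end with one of these” constraints.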

Page 27:

Performance

Performance issues with grammars

– Complete pass through tokens for each rule

– Many of these passes are wasted work

Dominant approach: Make each pass go faster

– Doesn’t solve root problem!

Algebraic approach: Build a query optimizer!

Page 28:

An Aside: Relational Query Optimization

Central concept in relational databases

– User specifies what she is looking for

– System decides how to find it

– Greatly reduces development and maintenance costs

Basic approach

– Enumerate many equivalent relational algebra expressions

– Estimate the cost of each one

– Choose the fastest
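The enumerate, estimate, choose loop can be shown with a toy join-order optimizer. Everything here is a simplification assumed for the example: plans are left-deep join orders, the cost is the sum of intermediate result sizes, and a single fixed selectivity (1 in `selectivity_denominator`) applies to every join.

```python
from itertools import permutations
from typing import Dict, Tuple


def estimate_cost(order: Tuple[str, ...], cardinality: Dict[str, int],
                  selectivity_denominator: int = 100) -> int:
    """Estimated cost of a left-deep join order: sum of intermediate sizes."""
    rows = cardinality[order[0]]
    cost = 0
    for name in order[1:]:
        rows = rows * cardinality[name] // selectivity_denominator
        cost += rows
    return cost


def choose_plan(cardinality: Dict[str, int],
                selectivity_denominator: int = 100) -> Tuple[str, ...]:
    """Enumerate every join order and return one with the lowest estimate."""
    candidates = permutations(sorted(cardinality))
    return min(candidates,
               key=lambda order: estimate_cost(order, cardinality,
                                               selectivity_denominator))
```

With inputs of 1000, 10, and 100 rows, the orders that join the two small inputs first are estimated at 110, against 1100 for an order that starts with the two largest. The user's query is the same in every case; only the plan changes.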

Page 29:

Optimizations

Query optimization is a familiar topic in databases

What’s different in text?

– Operations over sequences and spans

– Document boundaries

– Costs concentrated in extraction operators (dictionary, regular expression)

Can leverage these characteristics

– Text-specific optimizations

– Significant performance improvements

Page 30:

Example: Restricted Span Evaluation (RSE)

Leverage the sequential nature of text

– Join predicates on character or token distance

Only evaluate the inner on the relevant portions of the document

Limited applicability

– Need to guarantee exact same results

[Figure: plan over “…John Smith at 555-1212…” in which Regex finds 555-1212, Dictionary finds John Smith, and RSEJoin combines them into “John Smith at 555-1212”]

Only look for dictionary matches in the vicinity of a phone number.
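The idea can be sketched by comparing a plain join with one that evaluates the dictionary only in a window before each phone match. The function names and the 30-character distance are assumptions, and the sketch assumes dictionary terms that cannot overlap each other, which is part of what makes the two versions return identical results.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def naive_join(doc: str, names: List[str], phone_pattern: str,
               max_chars: int = 30) -> List[Tuple[Span, Span]]:
    """Baseline: run the dictionary over the whole document, then join."""
    name_re = re.compile("|".join(re.escape(n) for n in names))
    name_spans = [m.span() for m in name_re.finditer(doc)]
    pairs = []
    for phone in re.finditer(phone_pattern, doc):
        for begin, end in name_spans:
            if 0 <= phone.start() - end <= max_chars:
                pairs.append(((begin, end), phone.span()))
    return pairs


def rse_join(doc: str, names: List[str], phone_pattern: str,
             max_chars: int = 30) -> List[Tuple[Span, Span]]:
    """Restricted span evaluation: scan for names only near each phone match."""
    name_re = re.compile("|".join(re.escape(n) for n in names))
    longest = max(len(n) for n in names)
    pairs = []
    for phone in re.finditer(phone_pattern, doc):
        # A qualifying name must start inside this window.
        window_start = max(0, phone.start() - max_chars - longest)
        for m in name_re.finditer(doc, window_start, phone.start()):
            if phone.start() - m.end() <= max_chars:
                pairs.append((m.span(), phone.span()))
    return pairs
```

When phone numbers are rare, `rse_join` touches only a few short windows, while `naive_join` pays for a dictionary scan of the entire document.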

Page 31:

Experimental Results (Band Review Annotator)

[Figure: bar chart “Annotator Running Time”; y-axis: Running Time (sec), 0 to 30000; bars: GRAMMAR, ALGEBRA (Baseline), ALGEBRA (Optimized); callouts: “Classical query optimization” and “Text-specific optimizations”]

Page 32:

Road Map

An Algebraic Approach to Information Extraction

System T and the AQL Language

Annotators built with AQL

Page 33:

System T

Next-generation information extraction system

Makes developing annotators like developing other enterprise software

– AQL rule language

• Declarative language for building annotators

– Development environment

• Provides support for building complex annotators

– Runtime environment

• Deploy to corporate PCs or server farms

Page 34:

System T Block Diagram

[Figure: block diagram with the components Development Environment, User Interface, Rules (AQL), Sample Documents, Optimizer, Plan (Algebra), Runtime Environment, Execution Engine, Input Document Stream, and Annotated Document Stream]

Page 35:

AQL

Declarative language for defining annotators

– Compiles into our algebra

Main features

– Separates semantics from performance

– Familiar syntax

– Full expressive power of algebra

Page 36:

AQL By Example

Pattern: <Person> followed by <PhoneNum>

– 0-30 chars apart

– Within a single sentence

– Text in between contains “phone” or “at”

create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where Follows(P.name, N.number, 0, 30)
  and Contains(S.sentence, P.name)
  and Contains(S.sentence, N.number)
  and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, N.number));
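To make the semantics of the PersonPhone view concrete, here is a hedged Python rendering of its four predicates, evaluated by brute force. The Python function names mirror the AQL predicate names but are assumptions of this sketch; an AQL compiler would produce an optimized algebra plan rather than nested loops.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    begin: int
    end: int


def follows(a: Span, b: Span, lo: int, hi: int) -> bool:
    """b starts between lo and hi characters after a ends."""
    return lo <= b.begin - a.end <= hi


def contains(outer: Span, inner: Span) -> bool:
    """inner lies entirely within outer."""
    return outer.begin <= inner.begin and inner.end <= outer.end


def span_between(a: Span, b: Span) -> Span:
    """The gap between the end of a and the start of b."""
    return Span(a.end, b.begin)


def contains_regex(pattern: str, doc: str, span: Span) -> bool:
    """The text covered by span contains a match of pattern."""
    return re.search(pattern, doc[span.begin:span.end]) is not None


def person_phone(doc: str, persons: List[Span], numbers: List[Span],
                 sentences: List[Span]) -> List[Tuple[Span, Span]]:
    """Brute-force evaluation of the where clause of the view."""
    return [(p, n)
            for p in persons for n in numbers for s in sentences
            if follows(p, n, 0, 30)
            and contains(s, p)
            and contains(s, n)
            and contains_regex(r"\b(phone|at)\b", doc, span_between(p, n))]
```

The view states only what a PersonPhone pair is. The order in which the predicates are checked, and which inputs are scanned first, is left to the optimizer.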

Page 37:

AQL: Status

Compiler and optimizer implemented in 2007

– First generation: Heuristic optimizer

– Second generation: Basic cost-based optimizer

– Third generation in progress

Transitioning to several IBM products

– Used in Lotus Notes 8.01 (GA in March 2008)

– Next release of IOPES will be AQL-based (Notes 8.5, Q4 2008)

– Several other products in development

Page 38:

System T Development Environment

Create and edit AQL annotators

Manage dictionaries and document collections

Test annotators and view results

Downloadable demo!

– (IBM internal only)

Page 39:

Ongoing Work: Pattern Discovery

The Problem:

– Building dictionaries and other basic building blocks is a major part of the development process

• 80% or more of the work

Solution:

– Providing tools to analyze annotations and their context to discover useful low-level patterns

Page 40:

Example: Building a Phone Number Annotator

Initial “rough” regular expression: [\d().\-]{7,15}

Run over sample documents, then cluster results

Examples to help improve original pattern:

(123)4568909
1-800-124-2456
123-890-8990
789.890.8980
345-678-9012
123.345.7890
1-890-890-0890
(408)123-7898
123.456.789.189
10.50-100.00
10.10.2008

Page 41:

Example: Building a Phone Number Annotator

Rough regular expression: [\d().\-]{7,15}

Run over sample documents, then cluster results

Left Context        Match
Phone #:            (123)4568909
Phone #:            1-800-124-2456
Telephone #:        123-890-8990
Tel #:              789.890.8980
phone number is     345-678-9012
cell number is      123.345.7890
call me at          1-890-890-0890
call my office at   (408)123-7898
IP address is       123.456.789.189
Price range:        10.50-100.00
Open on             10.10.2008

Cluster the text to the left (or right) of the matches

Identify contextual “clues” that can improve confidence…

…or indicate false positives
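A small Python sketch of this workflow: run a rough pattern over sample documents and count the words immediately to the left of each match. The pattern is the slide's rough expression written with Python's `{7,15}` quantifier syntax; grouping by identical lower-cased context is a deliberately crude stand-in for clustering, and the names `ROUGH_PHONE` and `left_contexts` are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, Pattern

# Rough pattern: 7 to 15 digits, parentheses, dots, or hyphens.
ROUGH_PHONE = re.compile(r"[\d().\-]{7,15}")


def left_contexts(docs: Iterable[str], pattern: Pattern[str] = ROUGH_PHONE,
                  width: int = 3) -> Counter:
    """Count the `width` words found immediately to the left of each match."""
    counts: Counter = Counter()
    for doc in docs:
        for match in pattern.finditer(doc):
            words = doc[:match.start()].split()
            counts[" ".join(words[-width:]).lower()] += 1
    return counts
```

Frequent contexts such as “phone #:” point to clues worth adding to the annotator, while contexts such as “ip address is” or “open on” flag likely false positives.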

Page 42:

Ongoing Work: Interface for Building Custom Annotators

Problem:

– Customers need to build custom annotators

– AQL is too powerful

Solution:

– Simpler language with compact syntax

– GUI annotator builder

Page 43:

Road Map

An Algebraic Approach to Information Extraction

System T and the AQL Language

Annotators built with AQL

Page 44:

Named Entity Annotators

Developed using System T and AQL

Shipping with Lotus Notes 8.01

Will ship with IOPES, other IBM products

Statistics:

– 8 types of entities

– 327 AQL statements

– Throughput: 800+ kb/sec/core (on my laptop)

Page 45:

Entities Currently Extracted

Complex entities

– Person

– Address

– Organization

“Simple” entities

– Phone Number

– Email address

– URL

– Time

– Date

Page 46:

Languages Supported

Already supported:

– English

– German

Can support with straightforward extensions:

– Spanish

– French

– other Indo-European languages

Extensions needed (ongoing work):

– Japanese (with Tokyo Research Lab)

– Hebrew (with Haifa?)

– Chinese

– Korean

Page 47:

High-Level Dataflow Diagram

[Diagram: the Person, Organization, and Address annotators each pass through four stages]

Stage 1: Extract basic features
Stage 2: Find composite patterns
Stage 3: Filter false positives, identify lists
Stage 4: Handle overlap


Quality

Entity          Precision   Recall
Person          >90%        90%
Address         >95%        90%
Organization    >90%        90%
Phone Number    >95%        >95%


Performance: Laptop (Intel Core 2 Duo 2.33 GHz)

[Two charts, “Just Person and Organization” and “All Named Entities”: throughput (kb/sec) versus number of threads (1 to 2)]


Performance: Server (4×quad-core AMD Opteron)

[Two charts, “Just Person and Organization” and “All Named Entities”: throughput (kb/sec) versus number of threads (1 to 16)]


Thank you!

For more information…

– Read our ICDE 2008 paper (“An Algebraic Approach to Rule-Based Information Extraction”)

– Try out IOPES
  • http://www.alphaworks.ibm.com/tech/emailsearch

– Avatar Project home page
  • http://almaden.ibm.com/cs/projects/avatar/

– Download System T (IBM only)
  • http://fisher.almaden.ibm.com:8080/systemt

– Contact me
  • [email protected]


Backup Slides


Road Map

An Algebraic Approach to Information Extraction

System T and the AQL Language

Annotators built with AQL


Extracting Information with Custom Code

“It’s just pattern matching”

– Use scripts and regular expressions

Then reality sets in…

– Dozens of rules, even for simple concepts

– Many special cases

– Convoluted logic

– Painfully slow code


Operators in the Algebra

Currently 44 operators

Categories:

– Relational: Selection, Cross product, Join, Union, …

– Span extraction: Regular expression, Dictionary, Sentence, Part of Speech…

– Span aggregation: Consolidation, Block

– Specialized: Detag HTML

– Input/Output: Document Scan, Annotation Scan, ToHTML, ToAOM, …


Multiple Tokenizations Example

Extraction Task             Ideal Tokenization
Identify company names      I.B.M.
Find sentence boundaries    I.B.M.
Identify abbreviations      I.B.M.

[Figure: the token boundaries drawn inside “I.B.M.” differ for each task]


Tokenization on Demand

…J.T. Smith works at I.B.M.…

[Diagram: an operator plan over this text. Dictionary operators with an embedded tokenizer match the company name “I.B.M.” and the name “Smith”. Regex operators with no tokenization match the first and middle initials “J.T.” and punctuation. A Join tokenizes only between “J.T.” and “Smith” to produce “J.T. Smith”.]


Overlapping Annotations Example: Band Review Annotator

Rules:
– (pipe | guitar | hammond organ | …) → Instrument
– 1-2 capitalized words → Person
– Person <0-5 tokens> Instrument → PersonPlaysInstrument

Example texts:
– John Pipe plays the guitar
– Marco Benevento on the Hammond organ

[Figure: annotations over the two texts. “John Pipe” is a Person while “Pipe” is also an Instrument; “Hammond organ” is an Instrument while “Hammond” is also a Person. The remaining words are plain Tokens.]
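The three rules can be sketched in a few lines of Python to show how the overlapping candidates arise. This is only an illustration under assumed semantics: the function names, the word-boundary dictionary matching, and the whitespace token count are assumptions made here, not System T behavior.

```python
import re

# Illustrative dictionary taken from the rule on the slide.
INSTRUMENTS = ["pipe", "guitar", "hammond organ"]


def find_instruments(text):
    """Rule 1: case-insensitive dictionary match on word boundaries."""
    spans = []
    for entry in INSTRUMENTS:
        pattern = r"\b" + re.escape(entry) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            spans.append((m.start(), m.end()))
    return sorted(spans)


def find_persons(text):
    """Rule 2: every run of one or two adjacent capitalized words."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]
    spans = list(words)
    for (b1, e1), (b2, e2) in zip(words, words[1:]):
        if text[e1:b2] == " ":  # the two words are adjacent
            spans.append((b1, e2))
    return sorted(spans)


def person_plays_instrument(text, max_tokens=5):
    """Rule 3: Person, then 0-5 whitespace-separated tokens, then Instrument."""
    results = []
    for pb, pe in find_persons(text):
        for ib, ie in find_instruments(text):
            if ib >= pe and len(text[pe:ib].split()) <= max_tokens:
                results.append((text[pb:pe], text[ib:ie]))
    return results


for pair in person_plays_instrument("John Pipe plays the guitar"):
    print(pair)
```

Without any overlap handling, the sketch reports the intended pair (“John Pipe”, “guitar”) alongside spurious ones such as (“John”, “Pipe”).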


Overlapping Annotations: Existing Solutions

Explicit rule priority

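– Higher-priority rules in a level dominate lower-priority ones
– Complex interactions between rules
– Not enough information available in low-level rules

[Figure: the two example texts annotated under each priority order]

Person dominates Instrument:
– John Pipe plays the guitar: “John Pipe” is a Person, “guitar” is an Instrument
– Marco Benevento on the Hammond organ: “Marco Benevento” is a Person, “Hammond” is a Person

Instrument dominates Person:
– John Pipe plays the guitar: “Pipe” is an Instrument, “guitar” is an Instrument
– Marco Benevento on the Hammond organ: “Marco Benevento” is a Person, “Hammond organ” is an Instrument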


Overlapping Annotations

Person <0-5 tokens> Instrument

[Diagram: an algebra plan over the two example texts, “John Pipe plays the guitar” and “Marco Benevento on the Hammond organ”. A Regex operator (CapitalizedWord) produces ProperNoun spans: John Pipe, Marco Benevento, Hammond. A Dictionary operator produces Instrument spans: Pipe, guitar, Hammond organ. A Join combines them into (John Pipe, guitar) and (Marco Benevento, Hammond organ).]


Overlapping Annotations Example: Band Review Annotator

When John Pipe plays the guitar, the crowd

– (pipe | guitar | hammond organ | …) → Instrument
– 1-2 capitalized words → Person
– Person <0-5 tokens> Instrument → PersonPlaysInstrument

Which ones to retain?

CPSL standard


Consolidation

Operator that removes overlap

Several different policies

– Exact match
– Longest match
– Left-to-right longest
– …

Consolidate only when enough information is available

[Diagram: Set of spans → Consolidate (Policy) → Non-overlapping subset]
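A minimal Python sketch of a consolidation operator follows. The slide names the policies but does not define them, so the definitions here are assumptions: “exact_match” only drops duplicate spans, and “left_to_right_longest” scans from the left, keeps the longest span at the earliest start, and skips anything that overlaps a span already kept. Spans are hypothetical (begin, end) character offsets.

```python
def consolidate(spans, policy="left_to_right_longest"):
    """Reduce a set of (begin, end) spans according to an assumed policy."""
    unique = sorted(set(spans))
    if policy == "exact_match":
        # Assumed meaning: collapse spans with identical offsets.
        return unique
    if policy == "left_to_right_longest":
        kept = []
        last_end = 0
        # Earliest begin first; among equal begins, longest span first.
        for begin, end in sorted(unique, key=lambda s: (s[0], -s[1])):
            if begin >= last_end:
                kept.append((begin, end))
                last_end = end
        return kept
    raise ValueError("unknown policy: " + policy)


text = "When John Pipe plays the guitar"
candidates = [(5, 9), (10, 14), (5, 14), (5, 9)]  # John, Pipe, John Pipe, John
print([text[b:e] for b, e in consolidate(candidates)])
```

Under these assumptions the overlapping candidates “John”, “Pipe”, and “John Pipe” reduce to the single span “John Pipe”.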


Second “John Pipe” Example

When John Pipe plays the guitar, the crowd…

[Diagram: operator pipeline with intermediate results]

1. Regex (find capitalized words): When, John, Pipe
2. Block: When, John, Pipe, When John, John Pipe
3. Select (filter out stop-words): John, Pipe, John Pipe
4. Consolidate (remove overlap): John Pipe


Complex Structures: Existing Solutions

Approximate using regular expressions

Example: Signature

– Rule:
  (Person Token{,25} Phone (Token{,25} Contact)+) |
  (Person (Token{,25} Contact)+ Token{,25} Phone (Token{,25} Contact)*)

– Problems:
  • Need to enumerate all possible orders of sub-annotations
    – What if you want at least one phone and one email?
  • Does not restrict total token count


Performance: Existing Solutions

Performance issues

– Complete pass through tokens for each rule

– Many of these passes are wasted work

Dominant approach: Make each pass go faster

– Faster finite state machines

– Batch processing

– Parallel processing

Doesn’t solve root problem!


Types of Operator

Select, project, join…

Extraction operators

– Identify basic pattern matches in text
– Several subtypes: Regex, Dictionary, Sentence…

Block

– Group together simpler annotations to produce complex ones

Consolidation

– Decide between overlapping matches


Multiple Tokenizations: Existing Solutions

Use a “lowest common denominator” tokenizer

– Makes rules much more complicated

Use a configurable tokenizer

– Can still need two different tokenizations

– Need to keep tokenization(s) in sync with rules

Use character-based regular expressions

– Rules need to deal with whitespace, punctuation


Shared Dictionary Matching (SDM)

Dictionary matching has 3 steps:

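– Tokenize text
– Hash each token
– Generate matches based on hash table entry

Can share the first two steps among many dictionaries

[Diagram: a plan with two separate Dict operators (dictionaries D1 and D2), each feeding its own subplan, is rewritten to use one SDM Dictionary Operator that evaluates D1 and D2 together and feeds both subplans]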
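The sharing idea can be sketched in Python as follows. This is an illustration, not the System T operator: it assumes single-token dictionary entries and a simple `\w+` tokenizer, and the function names are made up for the example. One pass tokenizes the text, one hash lookup is made per token, and the hash table entry says which dictionaries the token belongs to.

```python
import re
from collections import defaultdict


def build_shared_table(dictionaries):
    """Merge all dictionaries into one hash table: token -> dictionary names."""
    table = defaultdict(set)
    for name, entries in dictionaries.items():
        for entry in entries:
            table[entry.lower()].add(name)
    return table


def shared_dictionary_match(text, table):
    """Tokenize once, hash each token once, emit matches per dictionary."""
    matches = defaultdict(list)
    for m in re.finditer(r"\w+", text):                 # step 1: tokenize
        owners = table.get(m.group().lower(), ())       # step 2: one hash lookup
        for name in sorted(owners):                     # step 3: generate matches
            matches[name].append((m.start(), m.end()))
    return dict(matches)


table = build_shared_table({
    "first_name": ["john", "marco"],
    "instrument": ["pipe", "guitar"],
})
print(shared_dictionary_match("John Pipe plays the guitar", table))
```

Adding a third dictionary adds entries to the table but no extra pass over the text, which is the saving the slide describes.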


Conditional Evaluation (CE)

Leverage document-at-a-time processing

Don’t evaluate the inner operand of a join if the outer has no results

Costing plans is challenging

…John Smith at 555-1212…

[Diagram: a CEJoin combines Dictionary matches (“John Smith”) with Regex matches (“555-1212”) to produce “John Smith at 555-1212”. Callout on the Regex input: don’t evaluate this Regex when there are no dictionary matches.]
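Conditional evaluation can be sketched in Python as below. The name dictionary, the phone pattern, and the 30-character join window are assumptions for the example; the point is only that the inner (Regex) operand is skipped for any document where the outer (Dictionary) operand returns nothing.

```python
import re

NAMES = ("John Smith",)                  # illustrative name dictionary
PHONE = re.compile(r"\b\d{3}-\d{4}\b")   # illustrative phone pattern


def dictionary_matches(text):
    """Outer operand: spans of dictionary entries found in the document."""
    return [(m.start(), m.end())
            for name in NAMES
            for m in re.finditer(re.escape(name), text)]


def regex_matches(text, stats):
    """Inner operand: counts how often it is actually evaluated."""
    stats["regex_runs"] += 1
    return [(m.start(), m.end()) for m in PHONE.finditer(text)]


def ce_join(text, stats, max_gap=30):
    """Join name and phone spans, skipping the regex when no name matched."""
    outer = dictionary_matches(text)
    if not outer:
        return []  # conditional evaluation: the inner operand never runs
    inner = regex_matches(text, stats)
    return [(text[nb:ne], text[pb:pe])
            for nb, ne in outer
            for pb, pe in inner
            if 0 <= pb - ne <= max_gap]


stats = {"regex_runs": 0}
docs = ["John Smith at 555-1212", "Call 555-9999 for details"]
print([ce_join(d, stats) for d in docs], stats)
```

The shortcut is valid because processing is document-at-a-time: an empty outer input for one document makes the join result for that document empty regardless of the inner input.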


Implementing Restricted Span Evaluation (RSE)

Pass join bindings down to the inner of a join

Requires special physical operators at edges of plan

[Diagram: an RSE join operator evaluates the predicate p(s1, s2) between an outer input R1 (producing s1) and an inner Dict(D, s2). For each s1 binding, the join passes the binding down to an RSE extraction operator (here an RSE Dictionary Operator over dictionary D), which returns only the s2’s that satisfy p(binding, s2).]


RSE Dictionary Operator

RSE version of an operator must produce the exact same answer

– Ongoing work: RSE Regular Expression operator

[Figure: sample text with two highlighted ranges. To find dictionary matches that end in a given range, the operator needs to examine that range extended backward by the length of the longest dictionary entry.]
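The range argument can be checked with a small Python sketch. Everything here is an assumption made for illustration (substring matching instead of token-based matching, the function names, the inclusive treatment of the end offsets); it only demonstrates that scanning the window widened by the longest entry gives the same answer as scanning the whole text and filtering.

```python
def all_matches(text, entries):
    """Unrestricted dictionary scan: every (begin, end) of every entry."""
    found = []
    for entry in entries:
        start = text.find(entry)
        while start != -1:
            found.append((start, start + len(entry)))
            start = text.find(entry, start + 1)
    return sorted(found)


def rse_matches(text, entries, lo, hi):
    """Matches whose end offset lies in [lo, hi], scanning only a window."""
    longest = max(len(entry) for entry in entries)
    window_start = max(0, lo - longest)   # a match ending at >= lo starts no earlier
    window = text[window_start:hi]
    shifted = [(b + window_start, e + window_start)
               for b, e in all_matches(window, entries)]
    return [(b, e) for b, e in shifted if lo <= e <= hi]


text = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
entries = ["dolor", "sit amet", "elit"]
print(rse_matches(text, entries, 15, 30))
```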


Separating Performance from Semantics

[Diagram: three layers]

– AQL Language: specify annotator semantics declaratively
– Optimizer: choose an efficient execution plan that implements semantics
– Operator Runtime

Historical Perspective: Information Extraction

MUC (Message Understanding Conference) – 1987 to 1997

– Competition-style conferences organized by DARPA

– Shared data sets and performance metrics
  • News articles, Radio transcripts, Military telegraphic messages

Classical IE Tasks

– Entity and Relationship/Link extraction

– Entity resolution/matching

– Event detection (Identify a complex event such as a merger or meeting involving multiple entities)

Several IE systems were built by this community

– FRUMP [DeJong82], CIRCUS /AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]


Common Pattern Specification Language (CPSL)

CPSL

– A standard language for specifying cascading grammars

– Created in 1998

[Figure: a cascade applied to filler text containing “John Smith at 555-1212”]

Level 0: input text
  … John Smith at 555-1212 …

Level 1: rules over tokens
  Token[~ “John | Smith | …”]+ → Name
  Token[~ “[1-9]\d{2}-\d{4}”] → Phone
  … <Name> at <Phone> …

Level 2: rule over Level 1 annotations
  Name Token[~ “at”] Phone → PersonPhone
  … <PersonPhone> …


Execution Time Breakdown

[Chart: stacked bars from 0% to 100% comparing the Naïve Plan with the Optimized plan, broken down into Dictionary, Regular Expression, Join, and Other]


AQL Syntax

    select CombineSpans(name.match, instrument.match) as annot,
           name.match as name,
           instrument.match as instr
    from Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
         Dictionary("instr.dict", DocScan.text) instrument
    where Follows(0, 30, name.match, instrument.match);

<ProperNoun> <within 30 characters> <Instrument>
– <ProperNoun>: Regular Expression Match
– <Instrument>: Dictionary Match


Road Map

An Algebraic Approach to Information Extraction

System T and the AQL Language

Annotators built with AQL


Annotation Development Cycle

[Diagram: a cycle with the steps Define, Develop, Test, and Identify Problems, plus a Deploy step]


Annotation Development Cycle

[Diagram: the Define, Develop, Test, and Identify Problems steps sit in the Annotator Development Environment; Deploy leads to the Runtime Environment]

Ease of development and maintenance


System T Block Diagram

[Diagram: the Development Environment works with Rules (AQL) and Representative Documents. The Optimizer compiles the rules into a Plan (Algebra). The Runtime Environment executes the plan, turning an Input Document Stream into an Annotated Document Stream.]

UIMA

Where does UIMA fit in all of this?

– UIMA is a software framework for NLP
  • Allows complex annotators to be composed as a pipeline of smaller building blocks

– What UIMA is not…
  • Does not specify how an annotator performs its extraction task
  • Does not provide a rule language nor a rule-matching engine

– Orthogonal to the focus of this talk

However

– The AQL runtime can be embedded inside a UIMA annotator.

[Diagram: a UIMA pipeline of Annotators A, B, and C. Some annotators wrap Java code; one embeds the AQL Runtime, which runs a Plan (Algebra) produced by the Optimizer from Rules (AQL).]


The AQL Rule Language

Pattern: <Person> <PhoneNum>
– Within a single sentence
– 0-30 chars between the two
– Text in between contains “phone” or “at”

AQL

    create view PersonPhone as
    select P.name as person, N.number as phone
    from Person P, PhoneNumber N, Sentence S
    where Follows(P.name, N.number, 0, 30)
      and Contains(S.sentence, P.name)
      and Contains(S.sentence, N.number)
      and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, N.number));

Custom Code

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Pattern;

    // Span, Pair, and the extract* helpers are assumed to be defined elsewhere.
    Set<Pair<Span>> extractPersonPhoneCandidate(String text) {
        Set<Span> Person = extractPersons(text);
        Set<Span> PhoneNum = extractPhoneNumber(text);
        Set<Span> Sentence = extractSentence(text);

        // Pair each person with a phone number that follows within 30 characters
        // and has "phone" or "at" in between.
        Set<Pair<Span>> PersonPhoneCandidate = new HashSet<Pair<Span>>();
        Pattern R = Pattern.compile("\\b(phone|at)\\b");
        for (Span P : Person) {
            for (Span N : PhoneNum) {
                if (Follows(P, N, 0, 30)) {
                    String textBetween = text.substring(P.end, N.begin);
                    if (R.matcher(textBetween).find()) {
                        PersonPhoneCandidate.add(new Pair<Span>(P, N));
                    }
                }
            }
        }

        // Keep only candidates that lie within a single sentence.
        Set<Pair<Span>> PersonPhone = new HashSet<Pair<Span>>();
        for (Pair<Span> C : PersonPhoneCandidate) {
            for (Span S : Sentence) {
                if (S.contains(C)) {
                    PersonPhone.add(C);
                }
            }
        }

        return PersonPhone;
    }

    boolean Follows(Span first, Span second, int min, int max) {
        int distance = second.begin - first.end;
        return distance >= min && distance <= max;
    }

Points of comparison:
– Development costs
– Maintenance costs
– Performance
– Correctness


Example: Building a Phone Number Annotator

1. Start with a broad pattern: [\d().\-]{7,15}
2. Run over sample documents
3. Cluster results

Sample matches:
– (123)4568909
– 1-800-124-2456
– 123-890-8990
– 789.890.8980
– 345-678-9012
– 123.345.7890
– 1-890-890-0890
– (408)123-7898
– 123.456.789.189
– 10.50-100.00
– 10.10.2008
– …

Right Context of the matches:
– Ext 12345
– x1235
– ext-1230
– \n or $
– 10:00am
– …

Additional patterns for Phone Number: Extension Numbers
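The run-and-cluster step can be sketched in Python. The slide does not say how results are clustered, so grouping matches by their “shape” (every digit replaced with `d`) is an assumption chosen for illustration, as are the function names. The pattern is the broad one from the slide, written with a valid `{7,15}` quantifier.

```python
import re
from collections import Counter

# Broad candidate pattern: 7 to 15 digits, parentheses, dots, or hyphens.
BROAD_PHONE = re.compile(r"[\d().\-]{7,15}")

SAMPLES = [
    "(123)4568909", "1-800-124-2456", "123-890-8990", "789.890.8980",
    "345-678-9012", "123.345.7890", "1-890-890-0890", "(408)123-7898",
    "123.456.789.189", "10.50-100.00", "10.10.2008",
]


def shape(match_text):
    """Abstract a match to its layout, e.g. '123-890-8990' -> 'ddd-ddd-dddd'."""
    return re.sub(r"\d", "d", match_text)


def cluster_by_shape(texts):
    """Run the broad pattern over each text and count matches per shape."""
    counts = Counter()
    for text in texts:
        for m in BROAD_PHONE.finditer(text):
            counts[shape(m.group())] += 1
    return counts


clusters = cluster_by_shape(SAMPLES)
for layout, count in clusters.most_common():
    print(count, layout)
```

Shapes such as `dd.dd.dddd` (a date) or `ddd.ddd.ddd.ddd` stand out as clusters to exclude, while recurring shapes suggest the narrower patterns worth writing.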


Road Map

An Algebraic Approach to Information Extraction

System T and the AQL Language

Annotators built with AQL


Person Annotator

Names appear in widely varying contexts
– Mr. Dabrowski received a Bachelor degree…
– Dr. Jean L. Rouleau Dean of Medicine University…
– …met Peter and Katie Lawton who have…
– …lives in Riverdale, NY, with his wife Marie-Jeanne. He has two married sons, James and Michael.
– The Honorable Carol Boyd Hallett - Of Counsel…
– Kimberly Purdy Lloyd received a Bachelor of Science degree from the University of Texas…

Additional Challenges
– Avoiding person names that fall inside or overlap with other entities
  • Organization, Address
– List of person names
  • Attendees Ida White, Bridget McBean, Volker Hauck

Currently supports names from > 8 countries, including Israel


Person Annotator Outline

Stage 1: Identify individual features
  • <FirstName>, <LastName>, <Salutation>, <CapsPerson>, <Initial> …
  – Dictionaries, Regular expressions

Stage 2: Identify candidate persons based on strong patterns
  • <FirstName>(<CapsPerson>|<Initial>)?<LastName>
  • <Salutation>(<CapsPerson>|<Initial>)?<CapsPerson>
  • <LastName>, <FirstName>
  • …
  – Joins, Selection predicates, Block

Stage 3: Eliminate weaker matches, handle lists
  • Delete annotations generated by lower priority rules
  – Consolidation, Minus, Selection predicates

Stage 4: Remove matches within other entities
  – Consolidation, Minus, Selection predicates
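Stages 1 and 2 can be sketched in Python as follows. The dictionaries, the tokenizer, and the function names are made up for the example, and only two simplified patterns are covered (<FirstName><LastName> and <Salutation><CapsPerson>, without the optional middle element); the shipped annotator is written in AQL and is far more complete.

```python
import re

# Stage 1 resources: tiny illustrative dictionaries.
FIRST_NAMES = {"Peter", "Katie", "Carol"}
LAST_NAMES = {"Lawton", "Hallett"}
SALUTATIONS = {"Mr.", "Dr."}


def tokenize(text):
    """Words with an optional trailing period, with character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+\.?", text)]


def is_caps_person(token):
    """Stage 1 feature <CapsPerson>: a capitalized word."""
    return re.fullmatch(r"[A-Z][a-z]+", token) is not None


def candidate_persons(text):
    """Stage 2: apply two strong patterns to adjacent token pairs."""
    tokens = tokenize(text)
    candidates = []
    for (a, a_begin, a_end), (b, b_begin, b_end) in zip(tokens, tokens[1:]):
        if not text[a_end:b_begin].isspace():
            continue  # only whitespace may separate the two tokens
        first_last = a in FIRST_NAMES and b in LAST_NAMES
        salutation_caps = a in SALUTATIONS and is_caps_person(b)
        if first_last or salutation_caps:
            candidates.append(text[a_begin:b_end])
    return candidates


print(candidate_persons("Mr. Dabrowski met Katie Lawton yesterday"))
```

Stages 3 and 4 would then consolidate overlapping candidates and subtract those that fall inside Organization or Address matches.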


Address Annotator

USAddress has well-defined pattern
– <StreetAddress> <SecondaryUnit>? <City> <State> <Zipcode>?
– 1515 Pioneer Drive Harrison, AR 72601
– 3607 Church Street, Suite 300 · Cincinnati, Ohio 45244
– 101 S. Webster Street . PO Box 7921 . Madison, Wisconsin 53707-7921

Challenges
– Multiple parts to the Address
– Some parts are optional (e.g., Secondary Unit, Zipcode)
– <City> cannot be identified using Dictionary due to resource restrictions
– Handling ambiguous abbreviations
  • Ms, MA: in state names
  • Dr., Row: street suffixes

Currently supports U.S. and German addresses


Address Annotator Outline

Stage 1: Primary features identified
  • <StreetAddress>, <SecondaryUnit>, <State>, <Zipcode>
  – Regular Expressions, Dictionaries, Joins

Stage 2: Complete StreetAddress identified
  • <StreetAddress> <SecondaryUnit>?
  – Join, Union

Stage 3: StreetAddress combined with State information
  • <StreetAddress><SecondaryUnit>?<City><State>
  – Join, Union, Selection Predicates

Stage 4: Combining with Zipcode
  • <StreetAddress><SecondaryUnit>?<City><State><Zipcode>?
  – Join, Union, Selection Predicates


Organization Annotator

Organization names appear in a wide range of contexts
– is a graduate of Hofstra University
– He joined Interactive Data in 2003
– President of Foley & Lardnear LLP
– Received her B.S in English from University of Wisconsin
– The bill at the Savoy Hotel

Additional Challenges
– Long organization names (Q: where do they begin and end?)
  • The Chartered Institute of Public Finance and Accountancy
– May contain list of person names
  • Squar, Milner, Peterson, Miranda & Williamson, LLP
  • John Ortiz, James & James Ltd
– Adjacent organization names
  • University of Michigan Ross School of Business
– Multiple representations for the same organization and its subdivisions
  • Enron, Enron Corp., Enron Corporation, Enron Metals & Commodity Corp.


Organization Annotator Outline

Stage 1: Identify individual features
  • CommonOrganization, Suffix, Prefix, IndustryType, CapsOrg, PrepOrg
  – Dictionaries, Regular expressions

Stage 2: Identify candidate organization based on strong patterns
  • (<The><CapsOrg>{1,3}<Conj>)?<CapsOrg>{1,3}<Suffix>|<IndustryType>
  • <CapsOrg>{1,3}<Prefix><PrepOrg><CapsOrg>{1,2}(<Conj><CapsOrg>{1,2})?
  • <CommonOrganization>(<CapsOrg>{1,3}(<Suffix>|<IndustryType>))?
  • …
  – Joins, Selection predicates, Block

Stage 3: Eliminate weaker matches, handle lists
  • Delete annotations generated by lower priority rules
  – Consolidation, Minus, Selection predicates

Stage 4: Remove matches within other entities
  – Consolidation, Minus, Selection predicates