June 12, 2008 © 2008 IBM Corporation
An Algebraic Approach to Information Extraction
June 12, 2008
The Avatar Group
(Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu)
IBM Almaden Research Center
Information Extraction (IE)
Distill structured data from unstructured and semi-structured text
Exploit the extracted data in your applications
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | Founder | Free Soft..
(from Cohen’s IE tutorial, 2003)
Annotations
The Avatar Group at IBM Almaden
Working on information extraction (IE) since 2003
Main goals:
– Extract structured information from text
– Build a system that can scale IE to real enterprise apps
– Build new enterprise applications that leverage IE
Extracting Entities in Notes 8.01 Live Text
Names, addresses, phone numbers…
Leverages the technologies discussed here
Ships with Lotus Notes 8.01
IOPES: Extracting Relationships and Composite Entities
IOPES = IBM Omnifind Personal Email Search
Associations like name ↔ phone number
Complex entities like conference schedules, directions, signature blocks
Road Map
An Algebraic Approach to Information Extraction
System T and the AQL Language
Annotators built with AQL
Evolution of the Avatar Project
Timeline: 2004 to 2008
– RAP (CPSL-style cascading grammar system)
– RAP++ (RAP + extensions outside the scope of grammars)
– System T (algebraic information extraction system)
Evolutionary Triggers
– Diverse data sets, complex extraction tasks
– Custom code
– Performance, expressivity
– Large number of annotators
Historical Perspective: Information Extraction
MUC (Message Understanding Conference) – 1987 to 1997
– Competition-style conferences organized by DARPA
Many different systems from this community
– FRUMP [DeJong82], CIRCUS/AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
Recent interest from database/search community
– [Agichtein03] [Ipeirotis06] [Ramakrishnan06] [Shen07]
An Aside: Rule-Based vs. Machine Learning
Two dominant approaches to information extraction (IE)
– Rule-Based: Define a set of extraction rules
– Machine Learning Based: Learn a parametric model
Focus of our work: Rule-based IE
Cascading Finite-state Grammars
Most rule-based IE systems share a common formalism
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over these tokens
Several levels of processing → Cascading Grammars
Cascading Grammars By Example
Level 0 (Tokenize)
– …sed John Smith at 555-1212 hendrerit…
Level 1
– Token[~ “John | Smith | …”]+ → Name
– Token[~ “[1-9]\d{2}-\d{4}”] → Phone
– …sed <Name> at <Phone> hendrerit…
Level 2
– Name Token[~ “at”] Phone → PersonPhone
– …sed <PersonPhone>, hendrerit…
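The three levels can be sketched in a few lines of Python. This is only an illustration of the cascading idea, not CPSL or any real grammar engine: the tokenizer, the two-word name list, and the function names are assumptions made for the example.

```python
import re

# Level 0: split the text into tokens (phone-like strings, words, punctuation).
def tokenize(text):
    return re.findall(r"\d{3}-\d{4}|\w+|[^\w\s]", text)

NAME_WORDS = {"John", "Smith"}            # hypothetical stand-in for the name list
PHONE = re.compile(r"[1-9]\d{2}-\d{4}")   # phone pattern from the slide

# Level 1: label each token; merge runs of name tokens into one Name.
def level1(tokens):
    out = []
    for tok in tokens:
        if tok in NAME_WORDS:
            if out and out[-1][0] == "Name":
                out[-1] = ("Name", out[-1][1] + " " + tok)
            else:
                out.append(("Name", tok))
        elif PHONE.fullmatch(tok):
            out.append(("Phone", tok))
        else:
            out.append(("Token", tok))
    return out

# Level 2: Name Token[~ "at"] Phone -> PersonPhone
def level2(annots):
    out, i = [], 0
    while i < len(annots):
        window = annots[i:i + 3]
        labels = [a[0] for a in window]
        if labels == ["Name", "Token", "Phone"] and window[1][1] == "at":
            out.append(("PersonPhone", " ".join(a[1] for a in window)))
            i += 3
        else:
            out.append(annots[i])
            i += 1
    return out

result = level2(level1(tokenize("sed John Smith at 555-1212 hendrerit")))
```

Each level consumes the output of the level below it and makes its own pass over the whole sequence.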
Common Pattern Specification Language (CPSL)
Figure: the cascading grammar example (Level 0 tokens, Level 1 Name and Phone rules, Level 2 PersonPhone rule).
CPSL
– A standard language for specifying cascading grammars
– Created in 1998
Several known implementations
– TextPro: reference implementation of CPSL by Doug Appelt
– JAPE (Java Annotation Pattern Engine)
• Part of the GATE NLP framework
• Under active consideration for commercial use by several companies
Experiences with Cascading Grammars
Benefits
– Big step forward from custom code
– Can express many simple concepts
Drawbacks
– Expressiveness
• Multiple tokenizations
• Dealing with overlap
• Building complex structures
– Performance
Example Task: Finding informal reviews in blogs
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas turpis. Proin nam ac ligula a lectus suscipit porttitor. Fusce non tellus sed urna pulvinar tincidunt.
Etiam in enim. In blandit mi sit amet lectus. Nullam adipiscing fringilla odio. In hac habitasse platea dictumst. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Ut elementum quam eget justo. In arcu leo,
We went to a OTIS concert last Thursday. Suspendisse malesuada est vel risus. Aenean sed ante fermentum dolor placerat rutrum. John Pipe plays guitar, id pellentesque pede felis a erat. Felis Marco Benevento on the Hammond organ. Curabitur sollicitudin porta velit. Donec scelerisque. Donec a magna sed sem accumsan sodales. It was SO MUCH FUN! Hes accumsan sed, aliquam eget, ornare et, metus. Integer eleifend tellus dictum nisi.
Overlapping Annotations Example: Band Review Annotator
Rules:
– (pipe | guitar | hammond organ | …) → Instrument
– 1-2 capitalized words → Person
– Person 0-5 tokens Instrument → PersonPlaysInstrument
Input: John Pipe plays the guitar
Overlapping candidate annotations:
– [John]Person [Pipe]Person [plays]Token [the]Token [guitar]Instrument
– [John Pipe]Person [plays]Token [the]Token [guitar]Instrument
– [John]Person [Pipe]Instrument [plays]Token [the]Token [guitar]Instrument
Complex Structures Example: Signature Annotator
Laura Haas, PhD [Person]
Distinguished Engineer and Director, Computer Science
Almaden Research Center [Organization]
408-927-1700 [Phone]
http://www.almaden.ibm.com/cs [URL]
Signature constraints:
– Start with Person
– Within 50 tokens
– At least 1 Phone
– At least 2 of {Phone, Organization, URL, Email, Address}
– End with one of these
Performance
Figure: the cascading grammar example, Levels 0 to 2.
Performance: Existing Solutions
Performance issues
– Complete pass through tokens for each rule
– Many of these passes are wasted work
Dominant approach: Make each pass go faster
– Faster finite state machines
– Batch processing
– Parallel processing
Doesn’t solve root problem!
The Algebraic Approach
A different way of thinking
Identify the most basic operations
Create an operator for each basic operation
Compose operators to build complex annotators
Example: Regular Expression Extraction Operator
Regex operator, pattern \d{3}-\d{4}
Input Tuple: (Document)
– Document: …You can reach me at 555-1212 or 358-1237.…
Output Tuple 1: (Document, Span 1)
Output Tuple 2: (Document, Span 2)
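A minimal Python sketch of such an operator, assuming tuples are plain Python tuples and a span is a (begin, end) character offset pair. The names Span and regex_operator are illustrative, not System T's API.

```python
import re
from typing import NamedTuple

class Span(NamedTuple):
    begin: int   # offset of the first matched character
    end: int     # offset one past the last matched character

def regex_operator(pattern, tuples):
    """For every input tuple (document,), emit one output tuple per regex match."""
    rx = re.compile(pattern)
    for (doc,) in tuples:
        for m in rx.finditer(doc):
            yield (doc, Span(m.start(), m.end()))

doc = "You can reach me at 555-1212 or 358-1237."
out = list(regex_operator(r"\d{3}-\d{4}", [(doc,)]))
matched = [doc[s.begin:s.end] for _, s in out]
```

One input tuple produces two output tuples here, each carrying the document plus one span.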
Some Example Operators
Regex
– Find all matches of a character-based regular expression
Dictionary
– Find all matches of an exhaustive dictionary of terms
Join
– Find pairs of sub-annotations that match a predicate
Block
– Identify contiguous blocks of lower-level matches
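The operators compose like ordinary functions. The sketch below is a simplified illustration: spans are (begin, end) offsets, Block merges spans separated by at most max_gap characters, and the join predicate is a character-distance check. All of these choices are assumptions made for the example.

```python
import re

def dictionary(doc, terms):
    """Spans where a dictionary term occurs as a whole word."""
    spans = []
    for term in terms:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", doc):
            spans.append((m.start(), m.end()))
    return sorted(spans)

def regex(doc, pattern):
    """Spans of all matches of a character-based regular expression."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc)]

def block(spans, max_gap):
    """Merge runs of spans whose gaps are at most max_gap characters."""
    blocks = []
    for begin, end in sorted(spans):
        if blocks and begin - blocks[-1][1] <= max_gap:
            blocks[-1] = (blocks[-1][0], max(end, blocks[-1][1]))
        else:
            blocks.append((begin, end))
    return blocks

def join(left, right, predicate):
    """Pairs of spans that satisfy the predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]

doc = "call John Smith at 555-1212 today"
names = block(dictionary(doc, ["John", "Smith"]), max_gap=1)
phones = regex(doc, r"\d{3}-\d{4}")
pairs = join(names, phones, lambda n, p: 0 <= p[0] - n[1] <= 10)
person_phone = [doc[n[0]:p[1]] for n, p in pairs]
```

Because each operator only maps sets of spans to sets of spans, the same pieces can be rearranged into different plans.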
Comparison with Cascading Grammars
Grammar
– …John Smith at 555-1212…
– Apply Name Rule, Apply Phone Rule: …<Name> at <Phone>…
– Apply PersonPhone: …<PersonPhone>…
Algebra
– …John Smith at 555-1212…
– Dictionary: John, Smith
– Block: John Smith
– Regex: 555-1212
– Join: John Smith at 555-1212
Overlapping Annotations Example: Band Review Annotator
Rules:
– (pipe | guitar | hammond organ | …) → Instrument
– 1-2 capitalized words → Person
– Person 0-5 tokens Instrument → PersonPlaysInstrument
Input: John Pipe plays the guitar
Overlapping candidate annotations:
– [John]Person [Pipe]Person [plays]Token [the]Token [guitar]Instrument
– [John Pipe]Person [plays]Token [the]Token [guitar]Instrument
– [John]Person [Pipe]Instrument [plays]Token [the]Token [guitar]Instrument
Overlapping Annotations
Figure: algebra plan over “John Pipe plays the guitar” using the Regex, Dictionary, Block, Consolidate, and Join operators; candidate spans include John, Pipe, John Pipe, and guitar.
– Retain overlapping matches by default
– Explicitly remove overlap with Consolidate operator
Complex Structures Example: Signature Annotator
Laura Haas, PhD [Person]
Distinguished Engineer and Director, Computer Science
Almaden Research Center [Organization]
408-927-1700 [Phone]
http://www.almaden.ibm.com/cs [URL]
Signature constraints:
– Start with Person
– Within 250 characters
– At least 1 Phone
– At least 2 of {Phone, Organization, URL, Email, Address}
– End with one of these
Complex Structures Example: Signature Annotator
Figure: algebra plan that produces Signature.
– Union of Organization, Phone, and URL matches
– Block: Find blocks of two or more “contact info” patterns
– Join with Person: Join predicates enforce additional constraints
Performance
Performance issues with grammars
– Complete pass through tokens for each rule
– Many of these passes are wasted work
Dominant approach: Make each pass go faster
– Doesn’t solve root problem!
Algebraic approach: Build a query optimizer!
An Aside: Relational Query Optimization
Central concept in relational databases
– User specifies what she is looking for
– System decides how to find it
– Greatly reduces development and maintenance costs
Basic approach
– Enumerate many equivalent relational algebra expressions
– Estimate the cost of each one
– Choose the fastest
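The enumerate / estimate / choose loop can be sketched as follows. This is a toy model, not the optimizer described in these slides: plans are left-deep join orders, the cost is the total estimated size of intermediate results, and the cardinalities and uniform selectivity are made-up numbers.

```python
from itertools import permutations

def plan_cost(order, cardinality, selectivity):
    """Estimated cost of a left-deep join order: sum of intermediate result sizes."""
    rows = cardinality[order[0]]
    cost = 0.0
    for name in order[1:]:
        rows = rows * cardinality[name] * selectivity
        cost += rows
    return cost

def choose_plan(cardinality, selectivity):
    plans = permutations(cardinality)   # enumerate equivalent join orders
    # estimate the cost of each plan and keep the cheapest
    return min(plans, key=lambda p: plan_cost(p, cardinality, selectivity))

# Hypothetical input sizes, chosen only for illustration.
cardinality = {"Person": 1000, "Phone": 10, "Sentence": 200}
best = choose_plan(cardinality, selectivity=0.01)
best_cost = plan_cost(best, cardinality, 0.01)
```

With these numbers the cheapest orders join the two small inputs first and bring in the large Person input last.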
Optimizations
Query optimization is a familiar topic in databases
What’s different in text?
– Operations over sequences and spans
– Document boundaries
– Costs concentrated in extraction operators (dictionary, regular expression)
Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements
Example: Restricted Span Evaluation (RSE)
Leverage the sequential nature of text
– Join predicates on character or token distance
Only evaluate the inner on the relevant portions of the document
Limited applicability
– Need to guarantee exact same results
Example: …John Smith at 555-1212…
– Regex: 555-1212
– Dictionary: John Smith
– RSEJoin: John Smith at 555-1212
Only look for dictionary matches in the vicinity of a phone number.
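A sketch of the idea in Python, assuming substring-based dictionary matching and a character-distance join predicate; the function names are illustrative. The restricted version scans only a window to the left of each phone number, widened by the longest dictionary entry so that it returns exactly what the unrestricted join returns.

```python
import re

def regex_spans(doc, pattern):
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc)]

def dictionary_spans(doc, terms, begin=0, end=None):
    """Dictionary matches lying entirely inside doc[begin:end]."""
    end = len(doc) if end is None else end
    spans = []
    for term in terms:
        pos = doc.find(term, begin, end)
        while pos != -1:
            spans.append((pos, pos + len(term)))
            pos = doc.find(term, pos + 1, end)
    return sorted(spans)

def follows(n, p, max_dist):
    return 0 <= p[0] - n[1] <= max_dist

def plain_join(doc, terms, pattern, max_dist):
    names = dictionary_spans(doc, terms)          # scans the whole document
    return sorted((n, p) for p in regex_spans(doc, pattern)
                  for n in names if follows(n, p, max_dist))

def rse_join(doc, terms, pattern, max_dist):
    longest = max(len(t) for t in terms)
    out = []
    for p in regex_spans(doc, pattern):
        lo = max(0, p[0] - max_dist - longest)    # only the relevant window
        for n in dictionary_spans(doc, terms, lo, p[0]):
            if follows(n, p, max_dist):
                out.append((n, p))
    return sorted(out)

doc = "John Smith wrote a long note. " * 3 + "Call John Smith at 555-1212."
terms = ["John Smith"]
restricted = rse_join(doc, terms, r"\d{3}-\d{4}", 5)
```

The first three mentions of the name are never inspected by the restricted join, yet the output is identical.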
Experimental Results (Band Review Annotator)
Figure: Annotator Running Time. Bar chart of Running Time (sec), axis 0 to 30000, for GRAMMAR, ALGEBRA (Baseline), and ALGEBRA (Optimized). Callouts: Classical query optimization; Text-specific optimizations.
Road Map
An Algebraic Approach to Information Extraction
System T and the AQL Language
Annotators built with AQL
System T
Next-generation information extraction system
Makes developing annotators like developing other enterprise software
– AQL rule language
• Declarative language for building annotators
– Development environment
• Provides support for building complex annotators
– Runtime environment
• Deploy to corporate PCs or server farms
System T Block Diagram
Development Environment
– User Interface
– Rules (AQL)
– Sample Documents
– Optimizer
– Plan (Algebra)
Runtime Environment
– Execution Engine
– Input Document Stream
– Annotated Document Stream
AQL
Declarative language for defining annotators
– Compiles into our algebra
Main features
– Separates semantics from performance
– Familiar syntax
– Full expressive power of algebra
AQL By Example
Target pattern: <Person> followed by <PhoneNum>
– 0-30 chars apart
– Within a single sentence
– Text in between contains “phone” or “at”
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where Follows(P.name, N.number, 0, 30)
  and Contains(S.sentence, P.name)
  and Contains(S.sentence, N.number)
  and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, N.number));
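To make the semantics of the view concrete, here is a naive Python rendering of the same four predicates. It is an assumption-laden sketch: spans are (begin, end) offsets, the Person, PhoneNumber, and Sentence inputs are computed by hand, and the nested loops say nothing about how an optimizer would actually evaluate the query.

```python
import re

def follows(a, b, min_chars, max_chars):
    """b starts between min_chars and max_chars after a ends."""
    return min_chars <= b[0] - a[1] <= max_chars

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def person_phone(doc, persons, phones, sentences):
    rows = []
    for p in persons:
        for n in phones:
            for s in sentences:
                if (follows(p, n, 0, 30)
                        and contains(s, p)
                        and contains(s, n)
                        and re.search(r"\b(phone|at)\b", doc[p[1]:n[0]])):
                    rows.append({"person": doc[p[0]:p[1]], "phone": doc[n[0]:n[1]]})
    return rows

def span(doc, text):
    i = doc.index(text)
    return (i, i + len(text))

doc = "Reach John Smith at 555-1212. Mary Jones left. 555-9999 is old."
persons = [span(doc, "John Smith"), span(doc, "Mary Jones")]
phones = [span(doc, "555-1212"), span(doc, "555-9999")]
sentences = [(m.start(), m.end()) for m in re.finditer(r"[^.]+\.", doc)]
rows = person_phone(doc, persons, phones, sentences)
```

Mary Jones and 555-9999 are close enough for Follows but sit in different sentences, so only one row comes out.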
AQL: Status
Compiler and optimizer implemented in 2007
– First generation: Heuristic optimizer
– Second generation: Basic cost-based optimizer
– Third generation in progress
Transitioning to several IBM products
– Used in Lotus Notes 8.01 (GA in March 2008)
– Next release of IOPES will be AQL-based (Notes 8.5, Q4 2008)
– Several other products in development
System T Development Environment
Create and edit AQL annotators
Manage dictionaries and document collections
Test annotators and view results
Downloadable demo!
– (IBM internal only)
Ongoing Work: Pattern Discovery
The Problem:
– Building dictionaries and other basic building blocks is a major part of the development process
• 80% or more of the work
Solution:
– Provide tools to analyze annotations and their context to discover useful low-level patterns
Example: Building a Phone Number Annotator
(123)4568909
1-800-124-2456
123-890-8990
789.890.8980
345-678-9012
123.345.7890
1-890-890-0890
(408)123-7898
123.456.789.189
10.50-100.00
10.10.2008
Initial “rough” regular expression: [\d()\-.]{7,15}
– Run over sample documents
– Cluster results
– Examples to help improve original pattern
Example: Building a Phone Number Annotator
Rough pattern: [\d()\-.]{7,15}
Run over sample documents, then cluster results.
Left Context | Match
Phone #: | (123)4568909
Phone #: | 1-800-124-2456
Telephone #: | 123-890-8990
Tel #: | 789.890.8980
phone number is | 345-678-9012
cell number is | 123.345.7890
call me at | 1-890-890-0890
call my office at | (408)123-7898
IP address is | 123.456.789.189
Price range: | 10.50-100.00
Open on | 10.10.2008
– Cluster the text to the left (or right) of the matches
– Identify contextual “clues” that can improve confidence…
– …or indicate false positives
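A rough sketch of the left-context step, assuming that "clustering" can be approximated by counting the last two words before each match. The real tooling is not described in these slides, so the window width, the normalization, and the function name are all illustrative.

```python
import re
from collections import Counter

ROUGH = re.compile(r"[\d()\-.]{7,15}")   # the rough phone pattern, with a valid {7,15} quantifier

def left_contexts(docs, width=20):
    """Count the normalized text immediately to the left of each rough match."""
    counts = Counter()
    for doc in docs:
        for m in ROUGH.finditer(doc):
            left = doc[max(0, m.start() - width):m.start()]
            words = re.findall(r"[A-Za-z#]+", left.lower())
            counts[" ".join(words[-2:])] += 1   # last two words act as the cluster key
    return counts

docs = [
    "Phone #: (123)4568909",
    "My phone #: 123-890-8990",
    "IP address is 123.456.789.189",
    "Price range: 10.50-100.00",
]
counts = left_contexts(docs)
```

Frequent keys such as "phone #" suggest positive clues, while keys such as "address is" or "price range" point at false positives.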
Ongoing Work: Interface for Building Custom Annotators
Problem:
– Customers need to build custom annotators
– AQL is too powerful
Solution:
– Simpler language with compact syntax
– GUI annotator builder
Road Map
An Algebraic Approach to Information Extraction
System T and the AQL Language
Annotators built with AQL
Named Entity Annotators
Developed using System T and AQL
Shipping with Lotus Notes 8.01
Will ship with IOPES, other IBM products
Statistics:
– 8 types of entities
– 327 AQL statements
– Throughput: 800+ kb/sec/core (on my laptop)
Entities Currently Extracted
Complex entities
– Person
– Address
– Organization
“Simple” entities
– Phone Number
– Email address
– URL
– Time
– Date
Languages Supported
Already supported:
– English
– German
Can support with straightforward extensions:
– Spanish
– French
– other Indo-European languages
Extensions needed (ongoing work):
– Japanese (with Tokyo Research Lab)
– Hebrew (with Haifa?)
– Chinese
– Korean
High-Level Dataflow Diagram
– Stage 1: Extract basic features
– Stage 2: Find composite patterns
– Stage 3: Filter false positives, identify lists
– Stage 4: Handle overlap
Outputs: Address, Organization, Person, …
Quality
Entity | Precision | Recall
Person | >90% | 90%
Address | >95% | 90%
Organization | >90% | 90%
Phone Number | >95% | >95%
Performance: Laptop (Intel Core 2 Duo 2.33 GHz)
Figure: two charts of Throughput (kb/sec) versus Number of Threads (1 and 2).
– Just Person and Organization: throughput axis 0 to 2500
– All Named Entities: throughput axis 0 to 2000
Performance: Server (4×quad-core AMD Opteron)
Figure: two charts of Throughput (kb/sec) versus Number of Threads (1 to 16).
– Just Person and Organization: throughput axis 0 to 9000
– All Named Entities: throughput axis 0 to 7000
Thank you!
For more information…
– Read our ICDE 2008 paper (“An Algebraic Approach to Rule-Based Information Extraction”)
– Try out IOPES
• http://www.alphaworks.ibm.com/tech/emailsearch
– Avatar Project home page
• http://almaden.ibm.com/cs/projects/avatar/
– Download System T (IBM only)
• http://fisher.almaden.ibm.com:8080/systemt
– Contact me
• [email protected]
BACKUP
Road Map
An Algebraic Approach to Information Extraction
System T and the AQL Language
Annotators built with AQL
Extracting Information with Custom Code
“It’s just pattern matching”
– Use scripts and regular expressions
Then reality sets in…
– Dozens of rules, even for simple concepts
– Many special cases
– Convoluted logic
– Painfully slow code
Operators in the Algebra
Currently 44 operators
Categories:
– Relational: Selection, Cross product, Join, Union, …
– Span extraction: Regular expression, Dictionary, Sentence, Part of Speech…
– Span aggregation: Consolidation, Block
– Specialized: Detag HTML
– Input/Output: Document Scan, Annotation Scan, ToHTML, ToAOM, …
Multiple Tokenizations Example
Example text: I.B.M.
Extraction Task | Ideal Tokenization
Identify company names | I.B.M.
Find sentence boundaries | I.B.M.
Identify abbreviations | I.B.M.
Token boundaries differ per task (drawn graphically on the slide).
Tokenization on Demand
Input: …J.T. Smith works at I.B.M.…
Figure: algebra plan with per-operator tokenization.
– Dictionary (Company Names), Embedded Tokenizer: I.B.M.
– Dictionary, Embedded Tokenizer: Smith
– Regex (First and Middle Initials), No Tokenization: J.T.
– Regex (Punctuation), No Tokenization: .
– Join: J.T. Smith (Tokenize Between “J.T.” and “Smith”)
Overlapping Annotations Example: Band Review Annotator
Rules:
– (pipe | guitar | hammond organ | …) → Instrument
– 1-2 capitalized words → Person
– Person 0-5 tokens Instrument → PersonPlaysInstrument
Examples:
– John Pipe plays the guitar: Person and Instrument candidates overlap on “Pipe”
– Marco Benevento on the Hammond organ: Person and Instrument candidates overlap on “Hammond”
Overlapping Annotations: Existing Solutions
Explicit rule priority
– Higher-priority rules in a level dominate lower-priority ones
– Complex interactions between rules
– Not enough information available in low-level rules
Figure: annotations of “John Pipe plays the guitar” and “Marco Benevento on the Hammond organ” under two priority orders: Person dominates Instrument, and Instrument dominates Person.
Overlapping Annotations
Figure: algebra plan over “John Pipe plays the guitar” and “Marco Benevento on the Hammond organ”.
– Regex (CapitalizedWord, ProperNoun): John Pipe, Marco Benevento, Hammond
– Dictionary (Instrument): Pipe, guitar, Hammond organ
– Join: Person <0-5 tokens> Instrument
Overlapping Annotations Example: Band Review Annotator
When John Pipe plays the guitar, the crowd
– (pipe | guitar | hammond organ | …) → Instrument
– 1-2 capitalized words → Person
– Person 0-5 tokens Instrument → PersonPlaysInstrument
Which ones to retain?
CPSL standard
Consolidation
Operator that removes overlap
Several different policies
– Exact match
– Longest match
– Left-to-right longest
– …
Consolidate only when enough information is available
Set of spans → Consolidate (Policy) → Non-overlapping subset
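The slides name the policies but do not define them, so the sketch below encodes one plausible reading of each: "exact" drops duplicate spans, "longest" drops spans contained in another span, and "left_to_right" scans from the left and keeps the longest span at each position. Treat these definitions, and the function name, as assumptions.

```python
def consolidate(spans, policy="longest"):
    """Reduce a set of (begin, end) spans according to an overlap policy."""
    spans = sorted(set(spans))                       # exact duplicates always collapse
    if policy == "exact":
        return spans
    if policy == "longest":
        # Drop every span that is contained in a different span.
        return [s for s in spans
                if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)]
    if policy == "left_to_right":
        out, pos = [], 0
        # Earliest start first; among equal starts, longest first.
        for s in sorted(spans, key=lambda s: (s[0], -s[1])):
            if s[0] >= pos:
                out.append(s)
                pos = s[1]
        return out
    raise ValueError("unknown policy: " + policy)

# Spans over "When John Pipe plays": John, Pipe, and John Pipe.
candidates = [(5, 9), (10, 14), (5, 14)]
longest = consolidate(candidates, "longest")
left_to_right = consolidate(candidates, "left_to_right")
```

Note that under this reading only the left-to-right policy guarantees a fully non-overlapping result; containment-based removal can leave partially overlapping spans in place.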
Second “John Pipe” Example
Input: When John Pipe plays the guitar, the crowd…
– Regex (Find Capitalized Words): When, John, Pipe
– Block: When, John, Pipe, When John, John Pipe
– Select (Filter out Stop-Words): John, Pipe, John Pipe
– Consolidate (Remove Overlap): John Pipe
Complex Structures: Existing Solutions
Approximate using regular expressions
Example: Signature
– Rule: (Person Token{,25} Phone (Token{,25} Contact)+) | (Person (Token{,25} Contact)+ Token{,25} Phone (Token{,25} Contact)*)
– Problems:
• Need to enumerate all possible orders of sub-annotations
• What if you want at least one phone and one email?
• Does not restrict total token count
Types of Operator
Select, project, join…
Extraction operators
– Identify basic pattern matches in text
– Several subtypes: Regex, Dictionary, Sentence…
Block
– Group together simpler annotations to produce complex ones
Consolidation
– Decide between overlapping matches
Multiple Tokenizations: Existing Solutions
Use a “lowest common denominator” tokenizer
– Makes rules much more complicated
Use a configurable tokenizer
– Can still need two different tokenizations
– Need to keep tokenization(s) in sync with rules
Use character-based regular expressions
– Rules need to deal with whitespace, punctuation
Shared Dictionary Matching (SDM)
Dictionary matching has 3 steps:
– Tokenize text
– Hash each token
– Generate matches based on hash table entry
Can share the first two steps among many dictionaries
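Figure: two plans. Without SDM, separate Dict operators for D1 and D2 each feed a subplan. With SDM, one SDM Dictionary Operator evaluates D1 and D2 and feeds both subplans.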
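A simplified sketch of the sharing, restricted to single-token dictionary entries and case-insensitive matching (both assumptions made to keep the example short). The document is tokenized once and each token is looked up once, no matter how many dictionaries are loaded.

```python
import re
from collections import defaultdict

def build_shared_index(dictionaries):
    """Map each lowercased entry to the names of the dictionaries that contain it."""
    index = defaultdict(set)
    for name, terms in dictionaries.items():
        for term in terms:
            index[term.lower()].add(name)
    return index

def shared_dictionary_match(doc, index):
    matches = defaultdict(list)
    for m in re.finditer(r"\w+", doc):              # step 1: tokenize once
        owners = index.get(m.group().lower())       # step 2: one hash lookup per token
        for name in sorted(owners or ()):           # step 3: emit matches per dictionary
            matches[name].append((m.start(), m.end()))
    return dict(matches)

# Hypothetical dictionaries D1 and D2.
dictionaries = {
    "first_names": ["John", "Marco"],
    "instruments": ["guitar", "pipe"],
}
index = build_shared_index(dictionaries)
found = shared_dictionary_match("John Pipe plays the guitar", index)
```

Adding a third dictionary only adds entries to the shared index; the per-document tokenize and hash work stays the same.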
Conditional Evaluation (CE)
Leverage document-at-a-time processing
Don’t evaluate the inner operand of a join if the outer has no results
Costing plans is challenging
Example: …John Smith at 555-1212…
– Dictionary: John Smith
– Regex: 555-1212
– CEJoin: John Smith at 555-1212
Don’t evaluate this Regex when there are no dictionary matches.
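In code, conditional evaluation is an early exit per document. The sketch assumes substring dictionary matching and a character-distance predicate; the call counter exists only to show that the inner Regex is skipped.

```python
import re

stats = {"regex_calls": 0}   # instrumentation for the example only

def dictionary_spans(doc, terms):
    spans = []
    for term in terms:
        pos = doc.find(term)
        while pos != -1:
            spans.append((pos, pos + len(term)))
            pos = doc.find(term, pos + 1)
    return sorted(spans)

def regex_spans(doc, pattern):
    stats["regex_calls"] += 1
    return [(m.start(), m.end()) for m in re.finditer(pattern, doc)]

def ce_join(doc, terms, pattern, max_dist):
    outer = dictionary_spans(doc, terms)
    if not outer:
        return []                        # outer is empty: never evaluate the inner
    inner = regex_spans(doc, pattern)
    return [(n, p) for n in outer for p in inner if 0 <= p[0] - n[1] <= max_dist]

docs = ["Call John Smith at 555-1212.", "Order 555-0000 shipped."]
results = [ce_join(d, ["John Smith"], r"\d{3}-\d{4}", 10) for d in docs]
```

The second document contains a phone-like string but no dictionary match, so the regular expression never runs on it.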
Implementing Restricted Span Evaluation (RSE)
RSE join operator
RSE extraction operator
Pass join bindings down to the inner of a join
Requires special physical operators at edges of plan
Figure: RSEJoin with predicate p(s1, s2) over outer input R1 and inner Dict(D, s2).
– R1 supplies s1 bindings, one binding at a time
– The RSE Dictionary Operator (RSEDict) over dictionary D returns the s2’s that satisfy p(binding, s2)
RSE Dictionary Operator
RSE version of an operator must produce the exact same answer
– Ongoing work: RSE Regular Expression operator
To find dictionary matches that end in a given range of the text, the operator needs to examine a wider range, extended by the length of the longest dictionary entry.
Separating Performance from Semantics
AQL Language
Optimizer
OperatorRuntime
Specify annotator semantics declaratively
Specify annotator semantics declaratively
Choose an efficient execution plan that implements semantics
Choose an efficient execution plan that implements semantics
BACKUP
© 2008 IBM Corporation73 June 12, 2008
Historical Perspective: Information Extraction
MUC (Message Understanding Conference) – 1987 to 1997
– Competition-style conferences organized by DARPA
– Shared data sets and performance metrics
• News articles, radio transcripts, military telegraphic messages
Classical IE Tasks
– Entity and Relationship/Link extraction
– Entity resolution/matching
– Event detection (identify a complex event, such as a merger or meeting, involving multiple entities)
Several IE systems were built by this community
– FRUMP [DeJong82], CIRCUS/AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]
Common Pattern Specification Language (CPSL)
[Figure: a cascading grammar applied in levels to sample text containing "John Smith at 555-1212".]
Level 0 (input text): … sed John Smith at 555-1212 hendrerit …
Level 1 rules:
– Token[~ "John | Smith | …"]+ → Name
– Token[~ "[1-9]\d{2}-\d{4}"] → Phone
Level 1 result: … sed <Name> at <Phone> hendrerit …
Level 2 rule:
– Name Token[~ "at"] Phone → PersonPhone
Level 2 result: … sed <PersonPhone>, hendrerit …
CPSL
– A standard language for specifying cascading grammars
– Created in 1998
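A cascading grammar of this kind can be imitated in Python to show how each level consumes the annotations of the level below. This is only a sketch of the cascade idea, not CPSL syntax or any CPSL engine: whitespace tokenization, the two-word name list, and the functions `level1` and `level2` are assumptions.

```python
import re

NAME_WORDS = {"John", "Smith"}                 # stands in for "John | Smith | …"
PHONE_RE = re.compile(r"[1-9]\d{2}-\d{4}")

def level1(tokens):
    """Runs of name tokens become Name; phone-shaped tokens become Phone."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NAME_WORDS:
            j = i
            while j < len(tokens) and tokens[j] in NAME_WORDS:
                j += 1
            out.append(("Name", " ".join(tokens[i:j])))
            i = j
        elif PHONE_RE.fullmatch(tokens[i]):
            out.append(("Phone", tokens[i]))
            i += 1
        else:
            out.append(("Token", tokens[i]))
            i += 1
    return out

def level2(annots):
    """Name, the token "at", then Phone become one PersonPhone."""
    out, i = [], 0
    while i < len(annots):
        window = annots[i:i + 3]
        if (len(window) == 3 and window[0][0] == "Name"
                and window[1] == ("Token", "at") and window[2][0] == "Phone"):
            out.append(("PersonPhone", " ".join(a[1] for a in window)))
            i += 3
        else:
            out.append(annots[i])
            i += 1
    return out

tokens = "sed John Smith at 555-1212 hendrerit".split()
l1 = level1(tokens)
l2 = level2(l1)
```

The order of evaluation is fixed by the levels: `level2` can only see what `level1` produced.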
Execution Time Breakdown
[Figure: bar chart of execution time (0% to 100%) for the Naïve Plan and the Optimized plan, broken down into Dictionary, Regular Expression, Join, and Other.]
AQL Syntax
    select CombineSpans(name.match, instrument.match) as annot,
           name.match as name,
           instrument.match as instr
    from Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
         Dictionary("instr.dict", DocScan.text) instrument
    where Follows(0, 30, name.match, instrument.match);
Pattern: <ProperNoun> (Regular Expression Match) <within 30 characters> <Instrument> (Dictionary Match)
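To make the semantics of the query concrete, here is a rough Python equivalent. It is an approximation under stated assumptions, not the AQL runtime: spans are (begin, end) character offsets, `INSTRUMENTS` stands in for the unknown contents of instr.dict, and `regex_op`, `dictionary_op`, `follows`, and `combine_spans` are hypothetical stand-ins for the AQL built-ins of similar names.

```python
import re

PROPER_NOUN = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")
INSTRUMENTS = ["piano", "violin"]    # hypothetical contents of instr.dict

def regex_op(text):
    """Spans matched by the proper-noun regular expression."""
    return [m.span() for m in PROPER_NOUN.finditer(text)]

def dictionary_op(text):
    """Spans of whole-word, case-insensitive dictionary matches."""
    spans = []
    for entry in INSTRUMENTS:
        pattern = r"\b" + re.escape(entry) + r"\b"
        spans.extend(m.span() for m in re.finditer(pattern, text, re.IGNORECASE))
    return sorted(spans)

def follows(min_gap, max_gap, first, second):
    """True if second begins min_gap..max_gap characters after first ends."""
    return min_gap <= second[0] - first[1] <= max_gap

def combine_spans(first, second):
    """Smallest span covering both inputs."""
    return (first[0], second[1])

def query(text):
    rows = []
    for name in regex_op(text):
        for instr in dictionary_op(text):
            if follows(0, 30, name, instr):
                a = combine_spans(name, instr)
                rows.append({"annot": text[a[0]:a[1]],
                             "name": text[name[0]:name[1]],
                             "instr": text[instr[0]:instr[1]]})
    return rows

rows = query("we heard Glenn Gould play the piano yesterday")
```

The nested loops are the most naive possible plan. The declarative query leaves the system free to pick a better one.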
Road Map
An Algebraic Approach to Information Extraction
System T and the AQL Language
Annotators built with AQL
Annotation Development Cycle
[Figure: cycle diagram with the stages Define, Develop, Test, Identify Problems, and Deploy.]
Annotation Development Cycle
[Figure: the cycle diagram (Define, Develop, Test, Identify Problems, Deploy) annotated with two environments: the Annotator Development Environment and the Runtime Environment.]
Ease of development and maintenance
System T Block Diagram
[Figure: block diagram. Components shown: Development Environment, Runtime Environment, Rules (AQL), Optimizer, Plan (Algebra), Representative Documents, Input Document Stream, Annotated Document Stream.]
UIMA
Where does UIMA fit in all of this?
– UIMA is a software framework for NLP
• Allows complex annotators to be composed as a pipeline of smaller building blocks
– What UIMA is not…
• Does not specify how an annotator performs its extraction task
• Does not provide a rule language or a rule-matching engine
– Orthogonal to the focus of this talk
However
– The AQL runtime can be embedded inside a UIMA annotator.
[Figure: a pipeline of UIMA Annotators A, B, and C. Some annotators wrap Java code; one wraps the AQL Runtime, which executes a Plan (Algebra) that the Optimizer produces from Rules (AQL).]
The AQL Rule Language
Rule: <Person> followed by <PhoneNum> within 0-30 chars, within a single sentence, where the text between them contains "phone" or "at"
AQL:
    create view PersonPhone as
    select P.name as person, N.number as phone
    from Person P, PhoneNumber N, Sentence S
    where Follows(P.name, N.number, 0, 30)
      and Contains(S.sentence, P.name)
      and Contains(S.sentence, N.number)
      and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, N.number));
Custom Code:
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Pattern;

    // Span, Pair, and the extract* helpers are assumed to be defined elsewhere.
    Set<Pair<Span>> extractPersonPhoneCandidate(String text) {
        Set<Span> Person = extractPersons(text);
        Set<Span> PhoneNum = extractPhoneNumber(text);
        Set<Span> Sentence = extractSentence(text);

        // Compile the pattern once, outside the loops.
        Pattern R = Pattern.compile("\\b(phone|at)\\b");

        // Pair each person with a phone number that follows within 0-30 characters.
        Set<Pair<Span>> PersonPhoneCandidate = new HashSet<Pair<Span>>();
        for (Span P : Person) {
            for (Span N : PhoneNum) {
                if (Follows(P, N, 0, 30)) {
                    String textBetween = text.substring(P.end, N.begin);
                    if (R.matcher(textBetween).find()) {
                        PersonPhoneCandidate.add(new Pair<Span>(P, N));
                    }
                }
            }
        }

        // Keep only the candidates that lie within a single sentence.
        Set<Pair<Span>> PersonPhone = new HashSet<Pair<Span>>();
        for (Pair<Span> C : PersonPhoneCandidate) {
            for (Span S : Sentence) {
                if (S.contains(C)) {
                    PersonPhone.add(C);
                }
            }
        }
        return PersonPhone;
    }

    // True if second begins between min and max characters after first ends.
    boolean Follows(Span first, Span second, int min, int max) {
        int distance = second.begin - first.end;
        return distance >= min && distance <= max;
    }
Points of comparison: development costs, maintenance costs, performance, correctness
Example: Building a Phone Number Annotator
Initial pattern: [\d()-\.]{7,15}
Run over sample documents. Sample matches:
– (123)4568909
– 1-800-124-2456
– 123-890-8990
– 789.890.8980
– 345-678-9012
– 123.345.7890
– 1-890-890-0890
– (408)123-7898
– 123.456.789.189
– 10.50-100.00
– 10.10.2008
Cluster results
Right Context:
– Ext 12345
– x1235
– ext-1230
– …
– \n or $
– 10:00am
Additional patterns for Phone Number: Extension Numbers
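The "run, then cluster" step can be sketched in Python. The sketch reads the slide's length bound as 7 to 15 characters and groups matches by digit shape. That grouping is an assumption about what "cluster" could mean here, and the `shape` and `cluster` helpers and the sample sentence are made up for illustration.

```python
import re
from collections import defaultdict

# Deliberately broad first-pass pattern: digits, parentheses, dots, hyphens.
CANDIDATE = re.compile(r"[\d().-]{7,15}")

def shape(match_text):
    """Replace every digit with 9 so that similar formats group together."""
    return re.sub(r"\d", "9", match_text)

def cluster(text):
    """Group candidate matches by their digit shape."""
    clusters = defaultdict(list)
    for m in CANDIDATE.finditer(text):
        clusters[shape(m.group())].append(m.group())
    return dict(clusters)

sample = "Call 123-890-8990 or 345-678-9012 before 10.10.2008, fax 789.890.8980 today"
clusters = cluster(sample)
```

Grouping by shape makes it easy to see which formats are genuine phone numbers and which, like the date-shaped `99.99.9999` cluster, need an extra rule to filter them out.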
Road Map
An Algebraic Approach to Information Extraction
System T and the AQL Language
Annotators built with AQL
Person Annotator
Names appear in widely varying contexts
– Mr. Dabrowski received a Bachelor degree…
– Dr. Jean L. Rouleau Dean of Medicine University…
– …met Peter and Katie Lawton who have…
– …lives in Riverdale, NY, with his wife Marie-Jeanne. He has two married sons, James and Michael.
– The Honorable Carol Boyd Hallett - Of Counsel…
– Kimberly Purdy Lloyd received a Bachelor of Science degree from the University of Texas…
Additional Challenges
– Avoiding person names inside or overlapping with other entities
• Organization, Address
– Lists of person names
• Attendees Ida White, Bridget McBean, Volker Hauck
Currently supports names from more than 8 countries, including Israel
Person Annotator Outline
Stage 1: Identify individual features
• <FirstName>, <LastName>, <Salutation>, <CapsPerson>, <Initial> …
– Dictionaries, Regular expressions
Stage 2: Identify candidate persons based on strong patterns
• <FirstName>(<CapsPerson>|<Initial>)?<LastName>
• <Salutation>(<CapsPerson>|<Initial>)?<CapsPerson>
• <LastName>, <FirstName>
• …
– Joins, Selection predicates, Block
Stage 3: Eliminate weaker matches, handle lists
• Delete annotations generated by lower priority rules
– Consolidation, Minus, Selection predicates
Stage 4: Remove matches within other entities
– Consolidation, Minus, Selection predicates
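Stages 3 and 4 can be illustrated with a small Python sketch. It assumes one simple consolidation policy (drop any span contained in another candidate) and a containment-based filter against other entities. The policies actually used by the annotator are not stated on the slide, and the function names and spans are hypothetical.

```python
def contained_within(inner, outer):
    """True if inner lies inside outer and is not the same span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner != outer

def consolidate(spans):
    """Stage 3 sketch: keep only spans not contained in another candidate."""
    unique = sorted(set(spans))
    return [s for s in unique
            if not any(contained_within(s, o) for o in unique)]

def remove_within(spans, other_entities):
    """Stage 4 sketch: drop candidates that fall inside another entity."""
    return [s for s in spans
            if not any(e[0] <= s[0] and s[1] <= e[1] for e in other_entities)]

text = "Kimberly Purdy Lloyd joined James & James Ltd"
# Hypothetical Stage 2 output: overlapping person candidates.
candidates = [(0, 14), (9, 20), (0, 20), (28, 33), (36, 41)]
# Hypothetical Organization annotation covering "James & James Ltd".
organizations = [(28, 45)]

stage3 = consolidate(candidates)
stage4 = remove_within(stage3, organizations)
persons = [text[b:e] for b, e in stage4]
```

The partial names "Kimberly Purdy" and "Purdy Lloyd" are absorbed by the longer match, and the two "James" candidates are removed because they sit inside an organization name.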
Address Annotator
USAddress has a well-defined pattern
– <StreetAddress> <SecondaryUnit>? <City> <State> <Zipcode>?
– 1515 Pioneer Drive Harrison, AR 72601
– 3607 Church Street, Suite 300 · Cincinnati, Ohio 45244
– 101 S. Webster Street . PO Box 7921 . Madison, Wisconsin 53707-7921
Challenges
– Multiple parts to the Address
– Some parts are optional (e.g., Secondary Unit, Zipcode)
– <City> cannot be identified using Dictionary due to resource restrictions
– Handling ambiguous abbreviations
• Ms MA In state names
• Dr. Row Street suffixes
Currently supports U.S. and German addresses
Address Annotator Outline
Stage 1: Primary features identified
• <StreetAddress>, <SecondaryUnit>, <State>, <Zipcode>
– Regular Expressions, Dictionaries, Joins
Stage 2: Complete StreetAddress identified
• <StreetAddress> <SecondaryUnit>?
– Join, Union
Stage 3: StreetAddress combined with State information
• <StreetAddress><SecondaryUnit>?<City><State>
– Join, Union, Selection Predicates
Stage 4: Combining with Zipcode
• <StreetAddress><SecondaryUnit>?<City><State><Zipcode>?
– Join, Union, Selection Predicates
Organization Annotator
Organization names appear in a wide range of contexts
– is a graduate of Hofstra University
– He joined Interactive Data in 2003
– President of Foley & Lardnear LLP
– Received her B.S in English from University of Wisconsin
– The bill at the Savoy Hotel
Additional Challenges
– Long organization names (Q: where are the beginning and end?)
• The Chartered Institute of Public Finance and Accountancy
– May contain a list of person names
• Squar, Milner, Peterson, Miranda & Williamson, LLP
• John Ortiz, James & James Ltd
– Adjacent organization names
• University of Michigan Ross School of Business
– Multiple representations for the same organization and its subdivisions
• Enron, Enron Corp., Enron Corporation, Enron Metals & Commodity Corp.
Organization Annotator Outline
Stage 1: Identify individual features
• CommonOrganization, Suffix, Prefix, IndustryType, CapsOrg, PrepOrg
– Dictionaries, Regular expressions
Stage 2: Identify candidate organizations based on strong patterns
• (<The><CapsOrg>{1,3}<Conj>)?<CapsOrg>{1,3}<Suffix>|<IndustryType>
• <CapsOrg>{1,3}<Prefix><PrepOrg><CapsOrg>{1,2}(<Conj><CapsOrg>{1,2})?
• <CommonOrganization>(<CapsOrg>{1,3}(<Suffix>|<IndustryType>))?
• …
– Joins, Selection predicates, Block
Stage 3: Eliminate weaker matches, handle lists
• Delete annotations generated by lower priority rules
– Consolidation, Minus, Selection predicates
Stage 4: Remove matches within other entities
– Consolidation, Minus, Selection predicates