2007. 11. 14
Introduction
Information Extraction (IE): a limited form of "complete text comprehension"
Extracts entities and relationships from documents
Relationship => fact, event
  Fact: static / Event: dynamic
Document => entity-relationship model or frame (a structured object)
Schematic view of IE
Information Extraction
Simple IE system: term extraction
Complex IE system: frame generation
Data Elements of IE
Entities: the basic building blocks
  Ex) people, locations, genes, and drugs
Attributes: features of the extracted entities
Facts: relations that hold between entities
  Ex) an employment relationship between a person and a company, or phosphorylation between two proteins
Events: an activity or occurrence of interest in which entities participate
  Ex) a terrorist act, a merger between two companies, a birthday, and so on
MUC IE Tasks
MUC: Message Understanding Conference
  Sponsored by DARPA (Defense Advanced Research Projects Agency)
MUC tasks:
  Named Entity Recognition (NE)
  Template Element (TE) Task
  Template Relationship (TR) Task
  Scenario Template (ST) Task
  Coreference (CO) Task
Named Entity Recognition
NER: identify all mentions of proper names and quantities in the text
  People names, geographic locations, and organizations
  Dates and times
  Monetary amounts and percentages
Tests with MUC corpora:
  Proper names: 70% (organizations: 45~50%, locations: 12~32%, people: 23~39%)
  Dates and times: 25%
  Monetary amounts and percentages: 5%
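The quantity-style entity classes above (dates, money, percentages) can often be matched with surface patterns. A minimal sketch, assuming simple US-format English text; real MUC systems used far richer grammars and gazetteers, and the pattern names here are illustrative only:

```python
import re

# Toy patterns for three quantity-style entity classes (assumptions,
# covering only a few common surface forms).
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "DATE": re.compile(r"(?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December)\s\d{1,2},\s\d{4}"),
}

def tag_quantities(text):
    """Return (label, surface string) pairs for each pattern match."""
    mentions = []
    for label, pat in PATTERNS.items():
        for m in pat.finditer(text):
            mentions.append((label, m.group()))
    return mentions

print(tag_quantities("On October 25, 2007 the firm paid $4.5 million, up 12%."))
```

Names of people, organizations, and locations are far harder, which is why they are usually handled with gazetteers and learned models rather than hand-written patterns.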
Template Element Task
TE: a generic object and its attributes
  Person
  Organization
  Location (airport, city, country, province, region, water, etc.)
  Artifact
Template Relationship (TR) Task
TR: find the relationships that exist between the template elements extracted from the text
  Ex) persons and companies can be related by an employee_of relation
Employee_of (Fletcher Maddox, UCSD Business School)
Employee_of (Fletcher Maddox, La Jolla Genomatics)
Product_of (Geninfo, La Jolla Genomatics)
Location_of (La Jolla, La Jolla Genomatics)
Location_of (CA, La Jolla Genomatics)
Scenario Template
ST: expresses domain- and task-specific entities and relations
Coreference Task (CO)
CO: captures information on coreferring expressions (e.g., pronouns or any other mentions of a given entity)
Ex) David came home from school, and saw his mother, Rachel. She told him that his father would be late.
Identified coreference chains: (David, his, him, his), (mother, Rachel, She)
IE Examples
Architecture of IE Systems
Architecture of IE Systems
Tokenization module
  Splits an input document into its basic building blocks: words, sentences, and paragraphs
Morphological and lexical analysis
  Assigns POS tags to the words of the document, creates basic phrases (such as noun phrases and verb phrases), and disambiguates the senses of ambiguous words and phrases
Syntactic analysis
  Establishes the connections between the different parts of each sentence by full parsing or shallow parsing
Domain analysis
  Combines all the information collected by the previous components and creates complete frames that describe the relationships between entities
  Can include 'anaphora resolution'
Information Flow in an IE System
Processing initial lexical content: tokenization and lexical analysis
Proper name identification
Shallow parsing
Building relations
Inferencing
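The flow of stages above can be sketched as a small pipeline. Every stage implementation here is a placeholder assumption (a real system uses POS taggers, name lists, and chunkers); the point is only how the output of each stage feeds the next:

```python
import re

def tokenize(doc):                      # 1. tokenization
    return re.findall(r"\w+|\$[\d.]+", doc)

def find_proper_names(tokens):          # 2. proper name identification (toy)
    return [t for t in tokens if t[:1].isupper()]

def shallow_parse(tokens):              # 3. shallow parsing: naive NP chunks
    chunks, current = [], []
    for t in tokens:
        if t[:1].isupper():
            current.append(t)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_relations(chunks):            # 4. building relations (toy pattern:
    # adjacent chunk pairs stand in some domain relation)
    return list(zip(chunks, chunks[1:]))

doc = "Fletcher Maddox founded La Jolla Genomatics"
print(build_relations(shallow_parse(tokenize(doc))))
# [('Fletcher Maddox', 'La Jolla Genomatics')]
```

Inferencing, the final stage, would then complete the relation instances produced here with values derived from other facts.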
Information Flow in an IE System
Building relations: using domain-specific patterns
  Ex) Company [Temporal] @Announce Connector Person PersonDetail @Appoint Position
Inferencing: infer missing values to complete the extracted frames
Ex) John Edgar was reported to live with Nancy Leroy. His address is 101 Forest Rd., Bethlehem, PA.
  person(John Edgar), person(Nancy Leroy)
  livetogether(John Edgar, Nancy Leroy)
  address(John Edgar, 101 Forest Rd., Bethlehem, PA)
  Rule: address(P2, A) :- person(P1), person(P2), livetogether(P1, P2), address(P1, A)
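The inference rule above can be executed directly: if P1 and P2 are persons who live together and P1's address is known, conclude that P2 shares that address. A minimal sketch (the function name and data layout are assumptions for illustration):

```python
def infer_addresses(persons, livetogether, address):
    """Apply: address(P2, A) :- person(P1), person(P2),
    livetogether(P1, P2), address(P1, A)."""
    inferred = dict(address)
    for p1, p2 in livetogether:
        if p1 in persons and p2 in persons:
            # livetogether is symmetric: try both directions
            for a, b in ((p1, p2), (p2, p1)):
                if a in inferred and b not in inferred:
                    inferred[b] = inferred[a]
    return inferred

facts = infer_addresses(
    persons={"John Edgar", "Nancy Leroy"},
    livetogether=[("John Edgar", "Nancy Leroy")],
    address={"John Edgar": "101 Forest Rd., Bethlehem, PA"},
)
print(facts["Nancy Leroy"])
# 101 Forest Rd., Bethlehem, PA
```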
Anaphora Resolution
Anaphora (coreference) resolution: the process of matching pairs of natural language expressions that refer to the same real-world entity
Two main approaches:
  Knowledge-based approach: linguistic analysis of sentences
  Machine learning-based approach: needs an annotated corpus
Anaphora Resolution
Pronominal anaphora: reflexive/personal/possessive pronouns
Proper name coreference
Apposition
Predicate nominative
Identical sets
Function-value coreference
Ordinal anaphora
One-anaphora
Part-whole coreference
Approaches to Anaphora Resolution
Focus on pronominal resolution
Hobbs Algorithm: also called the 'naive algorithm'
Constraints:
  For two candidate antecedents a and b, if a is encountered before b in the search space, then a is preferred over b.
  No two antecedents have the same salience.
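The ordering constraint can be illustrated with a drastically simplified resolver: candidates are ranked purely by the order in which they are encountered when searching backwards from the pronoun. This is an assumption-laden sketch, not Hobbs's actual parse-tree walk; a simple gender check stands in for the syntactic constraints:

```python
# Toy pronoun resolver: the first compatible candidate encountered in
# the search space wins ("a before b => a preferred over b").
def resolve_pronoun(pronoun, prior_mentions, gender_of):
    wanted = {"he": "m", "him": "m", "his": "m",
              "she": "f", "her": "f"}[pronoun]
    # search backwards from the pronoun: most recent mention first
    for candidate in reversed(prior_mentions):
        if gender_of.get(candidate) == wanted:
            return candidate   # encountered first => preferred
    return None

mentions = ["David", "Rachel"]
genders = {"David": "m", "Rachel": "f"}
print(resolve_pronoun("she", mentions, genders))
# Rachel
```

In the real algorithm the "search space" is an ordered traversal of the sentence's syntax tree and the preceding sentences' trees, so the same encountered-first principle applies to syntactic positions rather than a flat list.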
Approaches to Anaphora Resolution
CogNIAC: six ordered rules
Kennedy and Boguraev: salience algorithm
Mitkov: scoring algorithm based on antecedent indicators:
  Definiteness, givenness, indicating verbs, lexical reiteration, section heading preference, "non-prepositional" noun phrases, collocation pattern preference, immediate reference, referential distance, domain terminology preference
Approaches to Anaphora Resolution
Machine Learning Approaches
Markables: NLP elements such as nouns, noun phrases, or pronouns
Features for markables:
  Sentence distance, pronouns, exact match, definite noun phrase, number agreement, semantic agreement, gender agreement, proper name alias
Machine Learning Approaches
Generating training examples
  Positive examples: given the chain {M1, M2, M3, M4} (mentions of the same real-world entity), the positives are {M1, M2}, {M2, M3}, {M3, M4}
  Negative examples: assume that markables a, b, c appear between M1 and M2; the negatives are {a, M2}, {b, M2}, {c, M2}
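The pair-generation scheme above can be sketched directly (this follows the common Soon et al.-style scheme, an assumption about the slide's intent): positives link adjacent mentions in a chain, and negatives pair each intervening markable with the anaphor.

```python
def make_training_pairs(chain, all_markables):
    """chain: coreferent mentions in document order;
    all_markables: every markable in document order."""
    positives = list(zip(chain, chain[1:]))
    negatives = []
    for antecedent, anaphor in positives:
        i = all_markables.index(antecedent)
        j = all_markables.index(anaphor)
        for between in all_markables[i + 1:j]:
            negatives.append((between, anaphor))
    return positives, negatives

markables = ["M1", "a", "b", "c", "M2", "M3", "M4"]
pos, neg = make_training_pairs(["M1", "M2", "M3", "M4"], markables)
print(pos)  # [('M1', 'M2'), ('M2', 'M3'), ('M3', 'M4')]
print(neg)  # [('a', 'M2'), ('b', 'M2'), ('c', 'M2')]
```

A classifier trained on these pairs then predicts, for a new (candidate, anaphor) pair, whether the two markables corefer.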
Machine Learning Approaches: WHISK
Supervised learning algorithm that uses hand-tagged examples to learn information extraction rules expressed as regular expressions
  Ex) Pattern: * (Digit) 'BR' * '$' (Number)
      Output: Rental {Bedrooms $1} {Price $2}
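The example WHISK rule maps directly onto an ordinary regular expression: "*" is a non-greedy skip, the parenthesized elements are capture groups, and the output template fills a Rental frame with Bedrooms = $1 and Price = $2. A sketch of applying that one rule (the function and frame layout are illustrative):

```python
import re

# The WHISK pattern  * (Digit) 'BR' * '$' (Number)  as a Python regex.
rule = re.compile(r"(\d+)\s*BR.*?\$\s*(\d+)")

def apply_rule(text):
    """Fill the output template  Rental {Bedrooms $1} {Price $2}."""
    m = rule.search(text)
    if m:
        return {"Rental": {"Bedrooms": m.group(1), "Price": m.group(2)}}
    return None

print(apply_rule("Sunny apartment, 2 BR, hardwood floors, $ 995 per month"))
# {'Rental': {'Bedrooms': '2', 'Price': '995'}}
```

WHISK's contribution is learning such rules automatically from tagged examples, growing a pattern term by term until it covers the positive instances without matching the negatives.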
Machine Learning Approaches: BWI (Boosted Wrapper Induction)
"Boundary detectors" are pairs of token sequences <p, s>
  A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary
  Detectors can contain wildcards, e.g. "capitalized word", "number", etc.
  Example: <Date:, [CapitalizedWord]> matches the beginning of "Date: Thursday, October 25"
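A boundary detector <p, s> can be sketched over a token list: a boundary at position i matches iff the prefix pattern p matches the tokens just before i and the suffix pattern s matches the tokens just after it. In this sketch (representation assumed, not BWI's actual one), wildcards are predicates on a token:

```python
def is_cap(tok):
    """Wildcard [CapitalizedWord]."""
    return tok[:1].isupper()

def matches(pattern, toks):
    # a pattern element is either a literal token or a wildcard predicate
    return len(toks) == len(pattern) and all(
        p(t) if callable(p) else p == t for p, t in zip(pattern, toks))

def find_boundaries(tokens, prefix, suffix):
    """Positions i where <prefix, suffix> matches the boundary before token i."""
    hits = []
    for i in range(len(tokens) + 1):
        before = tokens[max(0, i - len(prefix)):i]
        after = tokens[i:i + len(suffix)]
        if matches(prefix, before) and matches(suffix, after):
            hits.append(i)
    return hits

toks = ["Date:", "Thursday", ",", "October", "25"]
# the detector <Date:, [CapitalizedWord]> from the slide
print(find_boundaries(toks, prefix=["Date:"], suffix=[is_cap]))
# [1]
```

BWI then boosts many such weak detectors, combining separately learned start-boundary and end-boundary detectors to extract complete fields.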
Machine Learning Approaches: (LP)2 Algorithm
Induces two sets of rules:
  Tagging rules
    Ex) stime (start time of a seminar)
  Correction rules
    Ex) "at <stime> 4 </stime> pm" => "at <stime> 4 pm </stime>"
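The correction rule in the example shifts a misplaced closing tag: when </stime> is immediately followed by "pm" or "am", the tag should move past it. A sketch of that single rule (the regex form is an assumption; (LP)2 learns its rules from contextual conditions on tokens, not hand-written regexes):

```python
import re

def correct_stime(text):
    """Shift a misplaced </stime> tag past a following am/pm token."""
    return re.sub(r"</stime>\s*(pm|am)", r"\1 </stime>", text)

print(correct_stime("at <stime> 4 </stime> pm"))
# at <stime> 4 pm </stime>
```

The two-phase design lets the tagging rules stay high-recall while the correction rules repair their systematic boundary errors.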
Evaluation of IE Systems

Slot         BWI      HMM      (LP)2    WHISK
Speaker      67.7%    76.6%    77.6%    18.3%
Location     76.7%    78.6%    75.0%    66.4%
Start Time   99.6%    98.5%    99.0%    92.6%
End Time     93.9%    62.1%    95.5%    86.0%
Structural IE
Introduction
Considers the structural or visual characteristics of the text
  E.g., font type, size, location
A complement to conventional IE (text mining)
Also called 'Visual Information Extraction (VIE)'
Structural IE
VIE procedure
  Group the primitive elements into meaningful objects (e.g., lines, paragraphs, etc.)
  Establish the hierarchical structure among these objects
  Compare the structure of the query document with the structure of the training document to find the objects corresponding to the target fields
Object Tree
Object Tree Generation
Fit(Y, X): a measure of how well Y fits as an additional member of X (e.g., a line Y joining a paragraph X)
Computing Similarity in O-tree
Finding the target fields
Templates
Browsing
Topic Distribution Browsing
  Ex) USA, UK => acq 42/19.09%
Browsing and filtering associations
Browsing associations
Taxonomy (Topic Hierarchy) Management
Taxonomy Editor
Clustering Display using Concept Hierarchy
Query Construction