2007. 11. 14
Introduction
Information Extraction (IE): a limited form of "complete text comprehension"
Extracts entities and relationships from documents
Relationship => fact, event
  Fact: static / Event: dynamic
Document => entity-relationship model or frame (a structured object)
Schematic view of IE
Information Extraction
Simple IE system: term extraction
Complex IE system: frame generation
Data Elements of IE
Entities: the basic building blocks
  Ex) people, locations, genes, and drugs
Attributes: features of the extracted entities
Facts: relations that hold between entities
  Ex) an employment relationship between a person and a company, or phosphorylation between two proteins
Events: an activity or occurrence of interest in which entities participate
  Ex) a terrorist act, a merger between two companies, a birthday, and so on
MUC IE Tasks
MUC: Message Understanding Conference
  Sponsored by DARPA (Defense Advanced Research Projects Agency)
MUC tasks:
  Named Entity Recognition (NE)
  Template Element (TE) Task
  Template Relationship (TR) Task
  Scenario Template (ST) Task
  Coreference (CO) Task
Named Entity Recognition
NER: identify all mentions of proper names and quantities in the text
  People names, geographic locations, and organizations
  Dates and times
  Monetary amounts and percentages
Tests with MUC corpora:
  Proper names: 70% (organizations: 45~50%, locations: 12~32%, people: 23~39%)
  Dates and times: 25%
  Monetary amounts and percentages: 5%
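The quantity-style entity classes above (dates, money, percentages) can often be matched with surface patterns. A minimal sketch, assuming simple US-format English text; real MUC systems used far richer grammars and gazetteers, and the pattern names here are illustrative only:

```python
import re

# Toy patterns for three quantity-style entity classes (assumptions,
# covering only a few common surface forms).
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
    "DATE": re.compile(r"(?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December)\s\d{1,2},\s\d{4}"),
}

def tag_quantities(text):
    """Return (label, surface string) pairs for each pattern match."""
    mentions = []
    for label, pat in PATTERNS.items():
        for m in pat.finditer(text):
            mentions.append((label, m.group()))
    return mentions

print(tag_quantities("On October 25, 2007 the firm paid $4.5 million, up 12%."))
```

Names of people, organizations, and locations are far harder, which is why they are usually handled with gazetteers and learned models rather than hand-written patterns.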
Template Element Task
TE: a generic object and its attributes
  Person
  Organization
  Location (airport, city, country, province, region, water, etc.)
  Artifact
Template Relationship (TR) Task
TR: find the relationships that exist between the template elements extracted from the text
  Ex) persons and companies can be related by an employee_of relation
Employee_of (Fletcher Maddox, UCSD Business School)
Employee_of (Fletcher Maddox, La Jolla Genomatics)
Product_of (Geninfo, La Jolla Genomatics)
Location_of (La Jolla, La Jolla Genomatics)
Location_of (CA, La Jolla Genomatics)
Scenario Template
ST: expresses domain- and task-specific entities and relations
Coreference Task (CO)
CO: captures information on coreferring expressions (e.g., pronouns or any other mentions of a given entity)
Ex) David came home from school, and saw his mother, Rachel. She told him that his father would be late.
Identified coreference chains: (David, his, him, his), (mother, Rachel, She)
IE Examples
Architecture of IE Systems
Architecture of IE Systems
Tokenization module
  Splits an input document into its basic building blocks: words, sentences, and paragraphs
Morphological and lexical analysis
  Assigns POS tags to the words of the document, creates basic phrases (such as noun phrases and verb phrases), and disambiguates the senses of ambiguous words and phrases
Syntactic analysis
  Establishes the connections between the different parts of each sentence by full parsing or shallow parsing
Domain analysis
  Combines all the information collected by the previous components and creates complete frames that describe the relationships between entities
  Can include 'anaphora resolution'
Information Flow in an IE System
Processing initial lexical content: tokenization and lexical analysis
Proper name identification
Shallow parsing
Building relations
Inferencing
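The flow of stages above can be sketched as a small pipeline. Every stage implementation here is a placeholder assumption (a real system uses POS taggers, name lists, and chunkers); the point is only how the output of each stage feeds the next:

```python
import re

def tokenize(doc):                      # 1. tokenization
    return re.findall(r"\w+|\$[\d.]+", doc)

def find_proper_names(tokens):          # 2. proper name identification (toy)
    return [t for t in tokens if t[:1].isupper()]

def shallow_parse(tokens):              # 3. shallow parsing: naive NP chunks
    chunks, current = [], []
    for t in tokens:
        if t[:1].isupper():
            current.append(t)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

def build_relations(chunks):            # 4. building relations (toy pattern:
    # adjacent chunk pairs stand in some domain relation)
    return list(zip(chunks, chunks[1:]))

doc = "Fletcher Maddox founded La Jolla Genomatics"
print(build_relations(shallow_parse(tokenize(doc))))
# [('Fletcher Maddox', 'La Jolla Genomatics')]
```

Inferencing, the final stage, would then complete the relation instances produced here with values derived from other facts.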
Information Flow in an IE System
Building relations: using domain-specific patterns
  Ex) Company [Temporal] @Announce Connector Person PersonDetail @Appoint Position
Inferencing: infer missing values to complete the extracted frames
Ex) John Edgar was reported to live with Nancy Leroy. His address is 101 Forest Rd., Bethlehem, PA.
  person(John Edgar), person(Nancy Leroy)
  livetogether(John Edgar, Nancy Leroy)
  address(John Edgar, 101 Forest Rd., Bethlehem, PA)
  Rule: address(P2, A) :- person(P1), person(P2), livetogether(P1, P2), address(P1, A)
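The inference rule above can be executed directly: if P1 and P2 are persons who live together and P1's address is known, conclude that P2 shares that address. A minimal sketch (the function name and data layout are assumptions for illustration):

```python
def infer_addresses(persons, livetogether, address):
    """Apply: address(P2, A) :- person(P1), person(P2),
    livetogether(P1, P2), address(P1, A)."""
    inferred = dict(address)
    for p1, p2 in livetogether:
        if p1 in persons and p2 in persons:
            # livetogether is symmetric: try both directions
            for a, b in ((p1, p2), (p2, p1)):
                if a in inferred and b not in inferred:
                    inferred[b] = inferred[a]
    return inferred

facts = infer_addresses(
    persons={"John Edgar", "Nancy Leroy"},
    livetogether=[("John Edgar", "Nancy Leroy")],
    address={"John Edgar": "101 Forest Rd., Bethlehem, PA"},
)
print(facts["Nancy Leroy"])
# 101 Forest Rd., Bethlehem, PA
```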
Anaphora Resolution
Anaphora (coreference) resolution: the process of matching pairs of natural language expressions that refer to the same real-world entity
Two main approaches:
  Knowledge-based approach: linguistic analysis of sentences
  Machine learning-based approach: needs an annotated corpus
Anaphora Resolution
Pronominal anaphora: reflexive/personal/possessive pronouns
Proper name coreference
Apposition
Predicate nominative
Identical sets
Function-value coreference
Ordinal anaphora
One-anaphora
Part-whole coreference
Approaches to Anaphora Resolution
Focus on pronominal resolution
Hobbs Algorithm: also called the 'naive algorithm'
Constraints:
  For two candidate antecedents a and b, if a is encountered before b in the search space, then a is preferred over b.
  No two antecedents have the same salience.
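The ordering constraint can be illustrated with a drastically simplified resolver: candidates are ranked purely by the order in which they are encountered when searching backwards from the pronoun. This is an assumption-laden sketch, not Hobbs's actual parse-tree walk; a simple gender check stands in for the syntactic constraints:

```python
# Toy pronoun resolver: the first compatible candidate encountered in
# the search space wins ("a before b => a preferred over b").
def resolve_pronoun(pronoun, prior_mentions, gender_of):
    wanted = {"he": "m", "him": "m", "his": "m",
              "she": "f", "her": "f"}[pronoun]
    # search backwards from the pronoun: most recent mention first
    for candidate in reversed(prior_mentions):
        if gender_of.get(candidate) == wanted:
            return candidate   # encountered first => preferred
    return None

mentions = ["David", "Rachel"]
genders = {"David": "m", "Rachel": "f"}
print(resolve_pronoun("she", mentions, genders))
# Rachel
```

In the real algorithm the "search space" is an ordered traversal of the sentence's syntax tree and the preceding sentences' trees, so the same encountered-first principle applies to syntactic positions rather than a flat list.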
Approaches to Anaphora Resolution
CogNIAC: six ordered rules
Kennedy and Boguraev: salience algorithm
Mitkov: scoring algorithm based on antecedent indicators:
  Definiteness, givenness, indicating verbs, lexical reiteration, section heading preference, "non-prepositional" noun phrases, collocation pattern preference, immediate reference, referential distance, domain terminology preference
Approaches to Anaphora Resolution
Machine Learning Approaches
Markables: NLP elements such as nouns, noun phrases, or pronouns
Features for markables:
  Sentence distance, pronouns, exact match, definite noun phrase, number agreement, semantic agreement, gender agreement, proper name alias
Machine Learning Approaches
Generating training examples
  Positive examples: given the chain {M1, M2, M3, M4} (mentions of the same real-world entity), the positives are {M1, M2}, {M2, M3}, {M3, M4}
  Negative examples: assume that markables a, b, c appear between M1 and M2; the negatives are {a, M2}, {b, M2}, {c, M2}
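The pair-generation scheme above can be sketched directly (this follows the common Soon et al.-style scheme, an assumption about the slide's intent): positives link adjacent mentions in a chain, and negatives pair each intervening markable with the anaphor.

```python
def make_training_pairs(chain, all_markables):
    """chain: coreferent mentions in document order;
    all_markables: every markable in document order."""
    positives = list(zip(chain, chain[1:]))
    negatives = []
    for antecedent, anaphor in positives:
        i = all_markables.index(antecedent)
        j = all_markables.index(anaphor)
        for between in all_markables[i + 1:j]:
            negatives.append((between, anaphor))
    return positives, negatives

markables = ["M1", "a", "b", "c", "M2", "M3", "M4"]
pos, neg = make_training_pairs(["M1", "M2", "M3", "M4"], markables)
print(pos)  # [('M1', 'M2'), ('M2', 'M3'), ('M3', 'M4')]
print(neg)  # [('a', 'M2'), ('b', 'M2'), ('c', 'M2')]
```

A classifier trained on these pairs then predicts, for a new (candidate, anaphor) pair, whether the two markables corefer.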
Machine Learning Approaches: WHISK
Supervised learning algorithm that uses hand-tagged examples to learn information extraction rules expressed as regular expressions
  Ex) Pattern: * (Digit) 'BR' * '$' (Number)
      Output: Rental {Bedrooms $1} {Price $2}
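The example WHISK rule maps directly onto an ordinary regular expression: "*" is a non-greedy skip, the parenthesized elements are capture groups, and the output template fills a Rental frame with Bedrooms = $1 and Price = $2. A sketch of applying that one rule (the function and frame layout are illustrative):

```python
import re

# The WHISK pattern  * (Digit) 'BR' * '$' (Number)  as a Python regex.
rule = re.compile(r"(\d+)\s*BR.*?\$\s*(\d+)")

def apply_rule(text):
    """Fill the output template  Rental {Bedrooms $1} {Price $2}."""
    m = rule.search(text)
    if m:
        return {"Rental": {"Bedrooms": m.group(1), "Price": m.group(2)}}
    return None

print(apply_rule("Sunny apartment, 2 BR, hardwood floors, $ 995 per month"))
# {'Rental': {'Bedrooms': '2', 'Price': '995'}}
```

WHISK's contribution is learning such rules automatically from tagged examples, growing a pattern term by term until it covers the positive instances without matching the negatives.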
Machine Learning Approaches: BWI (Boosted Wrapper Induction)
"Boundary detectors" are pairs of token sequences <p, s>
  A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary
  Detectors can contain wildcards, e.g. "capitalized word", "number", etc.
  Example: <Date:, [CapitalizedWord]> matches the beginning of "Date: Thursday, October 25"
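A boundary detector <p, s> can be sketched over a token list: a boundary at position i matches iff the prefix pattern p matches the tokens just before i and the suffix pattern s matches the tokens just after it. In this sketch (representation assumed, not BWI's actual one), wildcards are predicates on a token:

```python
def is_cap(tok):
    """Wildcard [CapitalizedWord]."""
    return tok[:1].isupper()

def matches(pattern, toks):
    # a pattern element is either a literal token or a wildcard predicate
    return len(toks) == len(pattern) and all(
        p(t) if callable(p) else p == t for p, t in zip(pattern, toks))

def find_boundaries(tokens, prefix, suffix):
    """Positions i where <prefix, suffix> matches the boundary before token i."""
    hits = []
    for i in range(len(tokens) + 1):
        before = tokens[max(0, i - len(prefix)):i]
        after = tokens[i:i + len(suffix)]
        if matches(prefix, before) and matches(suffix, after):
            hits.append(i)
    return hits

toks = ["Date:", "Thursday", ",", "October", "25"]
# the detector <Date:, [CapitalizedWord]> from the slide
print(find_boundaries(toks, prefix=["Date:"], suffix=[is_cap]))
# [1]
```

BWI then boosts many such weak detectors, combining separately learned start-boundary and end-boundary detectors to extract complete fields.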
Machine Learning Approaches: (LP)2 Algorithm
Induces two sets of rules:
  Tagging rules
    Ex) stime (start time of a seminar)
  Correction rules
    Ex) "at <stime> 4 </stime> pm" => "at <stime> 4 pm </stime>"
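The correction rule in the example shifts a misplaced closing tag: when </stime> is immediately followed by "pm" or "am", the tag should move past it. A sketch of that single rule (the regex form is an assumption; (LP)2 learns its rules from contextual conditions on tokens, not hand-written regexes):

```python
import re

def correct_stime(text):
    """Shift a misplaced </stime> tag past a following am/pm token."""
    return re.sub(r"</stime>\s*(pm|am)", r"\1 </stime>", text)

print(correct_stime("at <stime> 4 </stime> pm"))
# at <stime> 4 pm </stime>
```

The two-phase design lets the tagging rules stay high-recall while the correction rules repair their systematic boundary errors.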
Evaluation of IE Systems

Slot         BWI      HMM      (LP)2    WHISK
Speaker      67.7%    76.6%    77.6%    18.3%
Location     76.7%    78.6%    75.0%    66.4%
Start Time   99.6%    98.5%    99.0%    92.6%
End Time     93.9%    62.1%    95.5%    86.0%
Structural IE
Introduction
Considers the structural or visual characteristics of the text
  E.g., font type, size, location
A complement to conventional IE (text mining)
Also called 'Visual Information Extraction (VIE)'
Structural IE
VIE procedure
  Group the primitive elements into meaningful objects (e.g., lines, paragraphs, etc.)
  Establish the hierarchical structure among these objects
  Compare the structure of the query document with the structure of the training document to find the objects corresponding to the target fields
Object Tree
Object Tree Generation
Fit(Y, X): a measure of how well Y fits as an additional member of X (e.g., a line Y joining a paragraph X)
Computing Similarity in O-tree
Finding the target fields
Templates
Browsing
Topic Distribution Browsing
  Ex) USA, UK => acq 42/19.09%
Browsing and filtering associations
Browsing associations
Taxonomy (Topic Hierarchy) Management
Taxonomy Editor
Clustering Display using Concept Hierarchy
Query Construction