
Adaptive Text Extraction and Mining

Nicholas Kushmerick
Department of Computer Science, University College Dublin
[email protected]  www.cs.ucd.ie/staff/nick

Fabio Ciravegna
Department of Computer Science, University of Sheffield
[email protected]  www.dcs.shef.ac.uk/~fabio

ECML-2003 Tutorial

What is IE
What can we extract from the Web and why?

• Introduction (20 minutes)
  • What is IE
  • What can we extract from the Web
  • Why?
• Algorithms and methodologies (100 min)
• IE in practice (30 min)
• Conclusion, future work (10 min)
• Discussion

The ‘canonical’ IE task

• Input:
  • Document: newspaper article, Web page, email message, …
  • Pre-defined “information need”: frame slots, template fillers, database tuples, …
• Output:
  • The specific substrings/fragments of the document, or labels, that satisfy the stated information need, possibly organised in a template
• DARPA’s ‘Message Understanding Conferences/Competitions’ since the late 1980s; most recent: MUC-7, 1998.
• Recent interest in the machine learning and Web communities.

IE Standard Tasks

• Preprocessing
  • Tokenization
  • Morphological analysis
  • Part-of-speech tagging
• Information identification
  • Named entity recognition
  • Template filling (from the MUC)
    • Template elements
    • Template relations
    • Scenario template

NE Recognition & Coreference

19:16 Moody's rates Province of Saskatchewan A3

Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for government purposes, mainly Saskatchewan Power Corp.

[Figure: the text annotated with entity labels: Organisation (Moody's Investors Service Inc), MNY (C$115 million), % (9.6 percent), Date (February 4, 2022), plus coreference links (e.g. Moody's = Moody's Investors Service Inc).]

Template Filling

19:16 Moody's rates Province of Saskatchewan A3

Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for government purposes, mainly Saskatchewan Power Corp.

  amount          C$115 million
  issuer          Province of Saskatchewan
  placement-date  today
  maturity        February 4, 2022
  rate            9.6 percent


The Big Picture

[Figure: the Web and an Intranet feed documents into an IE component which, guided by an ontology, populates a database accessed by a query processor.]

NYU Architecture: a MUC architecture

[Figure: pipeline. Local text analysis: Lexical Analysis, Name Recognition, Partial Parsing, Scenario Pattern Matching. Discourse analysis: Coreference Analysis, Inference. Finally: Template Generation.]

Semantic Web

• A brain for humankind
• From information-based to knowledge-based
• Processable knowledge means:
  • Better retrieval
  • Reasoning
• Where can IE contribute?

Building the SW

• Document annotation: manually associate documents (or parts) with ontological descriptions
  • Document classification for retrieval
    • Where can I buy a hamster?
    • Pet shop web page -> pet shop concept -> hamster
  • Knowledge annotation
    • Where can I find a hotel in Berlin where single rooms cost less than 400€?
    • “The Hotel is located in central Berlin and the cost for a single room is 300€”
• Editors are currently available for manual annotation of texts

IE for Annotating Documents

• Manual annotation is
  • Expensive
  • Error prone
• IE can be used for annotating documents
  • Automatically
  • Semi-automatically
  • As user support
• Advantages
  • Speed
  • Low cost
  • Consistency
  • Can provide automatic annotation different from the one provided by the author(!)

SW for Knowledge Management

• SW is important for everyday Internet users
• SW is necessary for large companies
  • Millions of documents where knowledge is interspersed
• Most documents are now
  • Web-based
  • Available over an intranet
• Companies are valued for their
  • Tangible assets (e.g. plants)
  • Intangible assets (e.g. knowledge)
• Knowledge is stored in
  • Minds of employees
  • Documentation
• Companies spend 7-10% of revenues on KM


Why Adaptive Systems?

• Writing IE systems by hand is difficult and error prone
  • Extraction languages can be quite complex
  • Tedious write-test-debug-rewrite cycle
• Adaptive systems learn from user annotations
  • The person tells the learning algorithm what to extract; the learner figures out how
• Advantages
  • Annotating text is simpler & faster than writing rules
  • Domain independent
  • Domain experts don't need to be linguists or programmers
  • Learning algorithms ensure full coverage of examples

Algorithms and Methodologies

A dip into the details of IE for the Web

• Introduction (20 minutes)
• Algorithms and methodologies (100 min)
  • Wrapper induction
  • Boosted wrapper induction
  • Hidden Markov models
  • Exploiting linguistic constraints
• IE in practice (30 min)
• Conclusion, future work (10 min)
• Discussion

Algorithms: Outline

• Wrappers
  • Hand-coded wrappers
  • Wrapper induction
  • Learning highly expressive wrappers
• Boosted wrapper induction
• Hidden Markov models
• Exploiting linguistic constraints

[Figure: the techniques arranged along a spectrum from structured data to natural text.]

Wrapper induction

Highly regular source documents
  ⇓ Relatively simple extraction patterns
  ⇓ Efficient learning algorithms

Wrappers: Example and Toolkits

⟨ (Congo, 242) (Egypt, 20) (Belize, 501) (Spain, 34) ⟩

• Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand
• Examples:
  • World Wide Web Wrapper Factory [db.cis.upenn.edu/W4F]
  • Java Extraction & Dissemination of Information [www.darmstadt.gmd.de/oasys/projects/jedi]

Wrappers: Delimiter-based extraction

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

Use <B>, </B>, <I>, </I> for extraction.


“Left-Right” wrappers  [Kushmerick et al, IJCAI-97; Kushmerick, AIJ-2000]

procedure ExtractCountryCodes:
  while there are more occurrences of <B>
    1. extract Country between <B> and </B>
    2. extract Code between <I> and </I>

In general:

procedure ExtractAttributes:
  while there are more occurrences of l1
    1. extract 1st attribute between l1 and r1
       …
    K. extract Kth attribute between lK and rK

A Left-Right wrapper is simply 2K strings ⟨l1, r1, …, lK, rK⟩: the left and right delimiters for each of the K attributes.
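To make the execution semantics concrete, here is a minimal Python sketch of an LR wrapper executor (our illustration, not code from the tutorial; the function name and scanning strategy are assumptions):

def extract_lr(page, delimiters):
    """Execute a Left-Right wrapper. delimiters is [(l1, r1), ..., (lK, rK)];
    scan for l1 repeatedly, taking the text between each lk and the
    following rk as the k-th attribute of the current record."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return records            # no further records
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

page = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>")
print(extract_lr(page, [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]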

Wrapper induction

Thai food is spicy. Vietnamese food is spicy. German food isn't spicy.
  ⇒ Asian food is spicy.

examples ⇒ hypothesis
labeled pages ⇒ wrapper

[Figure: several example pages with labeled fragments, from which a wrapper is induced.]

Learning LR wrappers

Task: find the 2K strings ⟨l1, r1, …, lK, rK⟩.
Example: find 4 strings ⟨l1, r1, l2, r2⟩ = ⟨<B>, </B>, <I>, </I>⟩ from labeled pages such as:

<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

LR: Finding r1

r1 can be any prefix of the text that immediately follows each labeled Country instance, e.g. </B>.

LR: Finding l1, l2 and r2

Similarly:
• l1 can be any suffix of the text preceding each Country, e.g. <B>
• l2 can be any suffix, e.g. <I>
• r2 can be any prefix, e.g. </I>
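The candidate generation step can be sketched in a few lines; this is a simplified illustration, not the exact IJCAI-97 algorithm (the function name, the 30-character context window, and the use of the first occurrence of each fragment are our assumptions; a real learner also validates each candidate against all training pages):

def longest_common_prefix(strings):
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def candidate_r(page, fragments):
    """Candidate right delimiters for one attribute: every prefix of the
    longest common prefix of the text following each labeled fragment."""
    followers = []
    for frag in fragments:
        end = page.index(frag) + len(frag)
        followers.append(page[end:end + 30])   # bounded context window
    lcp = longest_common_prefix(followers)
    return [lcp[:n] for n in range(1, len(lcp) + 1)]

page = "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
print(candidate_r(page, ["Congo", "Egypt"]))
# ['<', '</', '</B', '</B>', ...]  -- '</B>' is among the candidates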

A problem with LR wrappers

Distracting text in head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B>
</BODY></HTML>

The bold header and footer confuse the l1 = <B> delimiter.


One (of many) solutions: HLRT

Ignore the page's head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Some Country Codes</B><P>                 } head  (ends at "end of head")
<B>Congo</B> <I>242</I><BR> … <B>Spain</B> <I>34</I><BR>   } body
<HR><B>End</B></BODY></HTML>                 } tail  (begins at "start of tail")

Head-Left-Right-Tail wrappers: in addition to the delimiters, learn a string marking the end of the head and a string marking the start of the tail.
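Executionally, HLRT just adds a preprocessing step to LR; a sketch reusing the extract_lr function from above (h and t values in the comment are for the example page):

def extract_hlrt(page, h, t, delimiters):
    """HLRT wrapper: drop everything up to and including h (end of head),
    drop everything from t (start of tail) onward, then run LR extraction."""
    body = page[page.find(h) + len(h):]
    body = body[:body.find(t)]
    return extract_lr(body, delimiters)

# On the distracting page above, h = '<P>' and t = '<HR>' isolate the body:
# extract_hlrt(page, '<P>', '<HR>', [('<B>', '</B>'), ('<I>', '</I>')])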

Expressiveness

[Figure: Venn diagram comparing, within the set of all sites, the sites wrappable by LR and the sites wrappable by HLRT, with a theorem relating the two classes (statement in the original figure).]

Coverage

Fraction of randomly-selected “data-heavy” Web sites (search engines, retail, weather, news, finance, …) for which a wrapper in a given class was learned.

[Figure: coverage results per wrapper class.]

Sample complexity

• The key problem with machine learning: training data is expensive and tedious to generate
• In practice, active learning and specialized algorithms have reduced training requirements considerably
• But this isn't theoretically satisfying
• Computational learning theory:
  • Time complexity: time required by an algorithm to terminate, as a function of problem parameters
  • Sample complexity: training data required by a learning algorithm to converge to the correct hypothesis, as a function of problem parameters

A Model of Sample Complexity

Just like time/space complexity, where

  time[learn wrapper] = g(size of documents, number of documents, number of attributes per record),

analyze the wrapper learning task to derive

  P[correct wrapper] = f(size of documents, number of documents, number of attributes per record).

(Actually, we can compute only a bound on this probability.)

PAC results: LR wrappers

Theorem: Suppose we learn LR wrapper W from training set E, where the longest document has length R and each record contains K attributes. If |E| exceeds a bound (a function of R, K, ε and δ given in the original slide), then W is probably approximately correct:

  error(W) < ε with probability at least 1-δ.


More sophisticated wrappers

• LR & HLRT wrappers are extremely simple (though useful for ~2/3 of real Web sites!)
• Recent wrapper induction research has explored…
  • More expressive wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98; Thomas et al, JIS-00, …]
    • Disjunctive delimiters
    • Sequential/landmark-based delimiters
    • Multiple attribute orderings
    • Missing attributes
    • Multiple-valued attributes
    • Hierarchically nested data
  • Wrapper verification/maintenance [Kushmerick, AAAI-1999; Kushmerick, WWWJ-00; Cohen, AAAI-1999; Minton et al, AAAI-00]

One of my favorites

• RoadRunner [Valter Crescenzi et al; Univ Roma 3]
• Unsupervised wrapper induction
  • They research databases, not machine learning, so they didn't realize training data was needed :-)
• Intuition:
  • Pose two different queries
  • The common bits of the resulting documents come from the template and can be ignored
  • The bits that differ are the data we're looking for

RoadRunner: Example

• Common content = part of the template. Varying content = the data!
• Complications: dynamic but unwanted content, e.g. advertisements or timestamps
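The intuition is easy to demonstrate with a sequence alignment; here is a toy version (this is not the actual RoadRunner algorithm, which performs grammar inference over page structures; the function name and tokenization are our assumptions):

import difflib

def infer_template(page_a, page_b):
    """Align two pages from the same site: token runs shared by both pages
    are treated as template, differing runs as (candidate) data."""
    ta, tb = page_a.split(), page_b.split()
    template, data_a, data_b = [], [], []
    for op, a0, a1, b0, b1 in difflib.SequenceMatcher(a=ta, b=tb).get_opcodes():
        if op == "equal":
            template += ta[a0:a1]          # common content: the template
        else:
            data_a.append(" ".join(ta[a0:a1]))   # varying content: the data
            data_b.append(" ".join(tb[b0:b1]))
    return template, data_a, data_b

a = "<html> <b> Congo </b> <i> 242 </i> </html>"
b = "<html> <b> Egypt </b> <i> 20 </i> </html>"
print(infer_template(a, b))
# (['<html>', '<b>', '</b>', '<i>', '</i>', '</html>'], ['Congo', '242'], ['Egypt', '20'])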

Algorithms: Outline

✓ Wrappers
  ✓ Hand-coded wrappers
  ✓ Wrapper induction
  ✓ Learning highly expressive wrappers
• Boosted wrapper induction   (up next)
• Hidden Markov models
• Exploiting linguistic constraints

Boosted wrapper induction  [Freitag & Kushmerick, AAAI-00]

• Wrapper induction is suitable only for rigidly-structured machine-generated HTML…
• … or is it?!
• Can we use simple patterns to extract from natural language documents?

  … Name: Dr. Jeffrey D. Hermes …
  … Who: Professor Manfred Paul …
  … will be given by Dr. R. J. Pangborn …
  … Ms. Scott will be speaking …
  … Karen Shriver, Dept. of ...
  … Maria Klawe, University of ...

BWI: The basic idea

• Learn “wrapper-like” patterns for natural texts (pattern = exact token sequence)
• Learn many such “weak” patterns
• Combine them with boosting to build a “strong” ensemble pattern
• Of course, not all natural text is sufficiently regular!
• Demo: www.smi.ucd.ie/bwi


Covering Algorithms

• Generalization of the covering algorithm for learning disjunctive rules (see the sketch after the next slide)

[Figure, repeated over four slides: a scatter of positive (+) and negative (-) examples; at each step a new rule covers some of the remaining positives, which are then removed:]

  Learned Rule = rule
  Learned Rule = rule or rule
  Learned Rule = rule or rule or rule
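The covering loop pictured on these slides translates directly into code; a minimal Python sketch (learn_one_rule is an assumed black box that greedily builds a single rule from the current examples):

def covering(positives, negatives, learn_one_rule):
    """Learn a disjunction of rules: learn one rule, remove the positive
    examples it covers, and repeat until all positives are covered."""
    remaining = set(positives)
    rules = []
    while remaining:
        rule = learn_one_rule(remaining, negatives)   # returns a predicate
        covered = {x for x in remaining if rule(x)}
        if not covered:
            break                     # no progress; give up on the rest
        remaining -= covered          # covered positives are discarded
        rules.append(rule)
    return rules   # Learned Rule = rules[0] or rules[1] or ...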

Boosting = Generalized Covering

• When learning rules in iteration t, give less weight to (but don't entirely discard) training examples successfully handled in iterations 1, 2, …, t-1
• Equivalently: give more weight to training data that has not yet been covered

[Figure: the same scatter of + and - examples, now with covered positives down-weighted rather than removed.]

Boosting [Schapire & Singer, 1998]

  D_1(i) = uniform distribution over training examples

  for t = 1, ..., T:
    train:    use distribution D_t to learn a weak hypothesis h_t: X → R
    reweight: choose α_t, and modify the distribution to emphasize
              examples missed by h_t:
                D_{t+1}(i) = D_t(i) · exp(-α_t · y_i · h_t(x_i))

  return:  H(x) = sign( Σ_t α_t · h_t(x) )
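The loop above translates almost line for line into code. A self-contained sketch for {-1, +1}-valued weak hypotheses (a special case of the real-valued h_t on the slide; weak_learn is an assumed black box, and the α_t rule here is the standard AdaBoost choice):

import math

def adaboost(examples, labels, weak_learn, T):
    """Generic AdaBoost loop: labels y_i are in {-1, +1}; weak_learn(xs, ys, D)
    must return a function h with h(x) in {-1, +1}."""
    n = len(examples)
    D = [1.0 / n] * n                      # D_1: uniform distribution
    ensemble = []                          # list of (alpha_t, h_t)
    for _ in range(T):
        h = weak_learn(examples, labels, D)
        err = sum(d for d, x, y in zip(D, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard the log
        alpha = 0.5 * math.log((1 - err) / err)
        # reweight: emphasize examples h got wrong, then renormalize
        D = [d * math.exp(-alpha * y * h(x))
             for d, x, y in zip(D, examples, labels)]
        z = sum(D)
        D = [d / z for d in D]
        ensemble.append((alpha, h))
    def H(x):                              # strong hypothesis: weighted vote
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H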


Weak hypotheses: Boundary Detectors

  Boundary Detector:  [who :] [dr . <Capitalized>]
                      prefix   suffix

  matches (e.g.) “… Who: Dr. Richard Nixon …”

Weak learning algorithm:
• Greedy growth from the null detector
• Pick the best prefix/suffix extension at each step
• Stop when no further extension improves accuracy
• Weighting [Cohen & Singer, 1999]:  α_t = ½ ln[(W₊ + ε) / (W₋ + ε)]

Boosted Wrapper Induction

Training
  input:  labeled documents
  Fore    = AdaBoost'ed fore (start-boundary) detectors
  Aft     = AdaBoost'ed aft (end-boundary) detectors
  Lengths = histogram of field lengths
  output: Extractor = ⟨Fore, Aft, Lengths⟩

Execution
  input:  Document, Extractor, threshold t
  F = {⟨i, ci⟩ | token i matches Fore with confidence ci}
  A = {⟨j, cj⟩ | token j matches Aft with confidence cj}
  output: {⟨i, j⟩ | ⟨i, ci⟩ ∈ F, ⟨j, cj⟩ ∈ A, ci · cj · L(j - i) > t}
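In code, the execution step might look like the following sketch (fore, aft and the confidence combination follow the Execution box above; the function itself, and representing the ensembles as callables, are our illustration):

def bwi_extract(tokens, fore, aft, length_hist, t):
    """Field extraction per the Execution box: fore(tokens, i) and
    aft(tokens, j) stand for the boosted boundary ensembles (assumed given),
    returning the confidence that a field starts at token i / ends at token j.
    length_hist maps a field length to its empirical probability L(.)."""
    starts = [(i, fore(tokens, i)) for i in range(len(tokens))]
    ends = [(j, aft(tokens, j)) for j in range(len(tokens))]
    fields = []
    for i, ci in starts:
        for j, cj in ends:
            score = ci * cj * length_hist.get(j - i, 0.0)
            if j >= i and score > t:
                fields.append((i, j, score))
    return fields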

BWI execution example

  <[email protected]>
  Type: cmu.andrew.official.cmu-news
  Topic: Chem. Eng. Seminar
  Dates: 2-May-95
  Time: 10:45 AM
  PostedBy: Bruce Gerson on 26-Apr-95 at 09:31 from andrew.cmu.edu
  Abstract:

  The Chemical Engineering Department will offer a seminar entitled
  "Creating Value in the Chemical Industry," at 10:45 a.m., Tuesday, May 2
  in Doherty Hall 1112. The seminar will be given by Dr. R. J. (Bob) Pangborn,
  Director, Central Research and Development, The Dow Chemical Company.

  Fore detectors (weights):  1.3 [<Alph>][Dr . <Cap>]    0.8 [<Alph> by][<Cap>]
  Aft detectors (weights):   2.1 [<Cap>][( University]   0.7 [<Alph>][, Director]
  Length histogram: L(10) = 0.05 (neighbouring bins around 0.1)

  Confidence of "Dr. R. J. (Bob) Pangborn" = 2.1 · 0.7 · 0.05 = 0.074
  (fore confidence 2.1 = 1.3 + 0.8, the sum of the two matching fore detectors;
  aft confidence 0.7; length probability 0.05)

Samples of learned patterns

  [speaker :][<Alph>]  and  [speaker <Any>][<FName>]
      match: Speaker: Reid Simmons, School of …

  [<Cap>][<FName> <Any> <Punc> ibm]
      matches: Presentation Abstract Joe Cascio, IBM
               Set Constraints Alex Aiken (IBM, Almaden)

  [. <Any>][is <ANum> <Cap>]
      matches: John C. Akbari is a Masters student at
               Michael A. Cusumano is an Associate Professor of
               Lawrence C. Stewart is a Consultant Engineer at

Evaluation

• Wrappers are usually 100% accurate, but perfection is generally impossible with natural text
• The ML/IE community has a well-developed evaluation methodology
  • Cross-validation: repeat many times; randomly select 2/3 of the data for training, test on the remaining 1/3
  • Precision: fraction of extracted items that are correct
  • Recall: fraction of actual items extracted
  • F1 = 2 / (1/P + 1/R)
• 16 IE tasks from 8 document collections: seminar announcements, job listings, Reuters corporate acquisitions, CS department faculty lists, Zagat's restaurant reviews, LA Times restaurant reviews, Internet Address Finder, stock quote server
• Competitors: SRV, Rapier, HMM
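The three metrics are easy to pin down in code; a small sketch over sets of extracted vs. gold items (the toy items are invented):

def prf1(extracted, actual):
    """Precision, recall and F1 over sets of extracted vs. gold items."""
    extracted, actual = set(extracted), set(actual)
    tp = len(extracted & actual)                  # correctly extracted items
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(actual) if actual else 0.0
    f1 = 2 / (1 / p + 1 / r) if p and r else 0.0  # harmonic mean
    return p, r, f1

print(prf1({"4 pm", "Room 201"}, {"4 pm", "Room 201", "Monday"}))
# (1.0, 0.666..., 0.8)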

Results: 16 tasks x 4 algorithms

[Figure: head-to-head comparison across the 16 tasks and 4 algorithms; labels from the original chart: "21 cases" and "7 cases".]


Boosted Wrapper Induction: Controversial(?) Conclusion

• Is the great Web -vs- natural text chasm more apparent than real?
• IE is possible if the documents contain regularities that can be exploited
• But the “reason” (e.g. linguistic -vs- markup) for these regularities doesn't much matter
• See also Soderland's WHISK & Webfoot

Algorithms: Outline

✓ Wrappers
  ✓ Hand-coded wrappers
  ✓ Wrapper induction
  ✓ Learning highly expressive wrappers
✓ Boosted wrapper induction
• Hidden Markov models   (up next)
• Exploiting linguistic constraints

Hidden Markov models

• The previous discussion examined systems that use explicit extraction patterns/rules
• HMMs are a powerful alternative based on statistical token models rather than explicit extraction patterns

[Leek, UC San Diego, 1997; Bikel et al, ANLP-97, MLJ-99; Freitag & McCallum, AAAI-99 MLIE Workshop; Seymore, McCallum & Rosenfeld, AAAI-99 MLIE Workshop; Freitag & McCallum, AAAI-2000]

HMM formalism

An HMM consists of:
• states s1, s2, …, with a special start state s1 and a special end state sn
• a token alphabet a1, a2, …
• state transition probabilities P(si | sj)
• token emission probabilities P(ai | sj)

Widely used in many language-processing tasks, e.g. speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, 1998].

Applying HMMs to IE

• Document ⇒ generated by a stochastic process modelled by an HMM
• Token ⇒ word
• State ⇒ “reason/explanation” for a given token
  • ‘Background’ state emits tokens like ‘the’, ‘said’, …
  • ‘Money’ state emits tokens like ‘million’, ‘euro’, …
  • ‘Organization’ state emits tokens like ‘university’, ‘company’, …
• Extraction: the Viterbi algorithm is a dynamic-programming technique for efficiently computing the most likely sequence of states that generated a document (see the sketch below)
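As a concrete illustration of the extraction step, here is a log-space Viterbi sketch (a generic textbook implementation, not the tutorial's code; the toy states and probabilities below are invented, and 1e-9 stands in for a smoothed unknown-token probability):

import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a token sequence."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(tokens[t], 1e-9)))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])   # trace back from the best end
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["background", "money"]
start_p = {"background": 0.9, "money": 0.1}
trans_p = {"background": {"background": 0.8, "money": 0.2},
           "money": {"background": 0.5, "money": 0.5}}
emit_p = {"background": {"the": 0.3, "offering": 0.1, "raised": 0.1},
          "money": {"c$115": 0.3, "million": 0.3}}
print(viterbi("the offering raised c$115 million".split(),
              states, start_p, trans_p, emit_p))
# ['background', 'background', 'background', 'money', 'money']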

HMM for research papers [Seymore et al, 99]

[Figure: an HMM whose states correspond to the header fields of a research paper (title, author, abstract, note, etc.).]


Learning HMMs

• Good news:
  • If training-data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of the transition/emission probabilities. (Use smoothing to avoid zero probabilities for emissions/transitions absent from the training data; a sketch follows.)
• Great news:
  • The Baum-Welch algorithm trains an HMM using unlabelled training data!
• Bad news:
  • How many states should the HMM contain?
  • How are transitions constrained?
    • Insufficiently expressive ⇒ unable to model important distinctions
    • Overly expressive ⇒ sparse training data, overfitting
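The “good news” case is a few lines of counting; a sketch with add-alpha smoothing (the value of alpha and the treatment of unknown tokens are our choices, not the tutorial's):

from collections import Counter, defaultdict

def train_hmm(tagged_docs, alpha=0.1):
    """Supervised maximum-likelihood estimation with add-alpha smoothing.
    tagged_docs is a list of [(token, state), ...] sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for doc in tagged_docs:
        for tok, s in doc:
            emit[s][tok] += 1
        for (_, s1), (_, s2) in zip(doc, doc[1:]):
            trans[s1][s2] += 1
    states = list(emit)
    vocab = {tok for c in emit.values() for tok in c}
    trans_p = {s1: {s2: (trans[s1][s2] + alpha)
                        / (sum(trans[s1].values()) + alpha * len(states))
                    for s2 in states} for s1 in states}
    # the "+ 1" in the denominator reserves probability mass for unknown tokens
    emit_p = {s: {tok: (emit[s][tok] + alpha)
                       / (sum(emit[s].values()) + alpha * (len(vocab) + 1))
                  for tok in vocab} for s in states}
    return trans_p, emit_p

docs = [[("the", "bg"), ("cost", "bg"), ("is", "bg"),
         ("300", "money"), ("euro", "money")]]
trans_p, emit_p = train_hmm(docs)
print(round(trans_p["bg"]["money"], 3))   # smoothed frequency ratio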

HMM example

“Seminar announcements” task:

  <[email protected]>
  Type: cmu.andrew.assocs.UEA
  Topic: Re: entreprenuership speaker
  Dates: 17-Apr-95
  Time: 7:00 PM
  PostedBy: Colin S Osburn on 15-Apr-95 at 15:11 from CMU.EDU
  Abstract:

  hello again
  to reiterate
  there will be a speaker on the law and startup business
  this monday evening the 17th
  it will be at 7pm in room 261 of GSIA in the new building, ie upstairs.
  please attend if you have any interest in starting your own business or
  are even curious.
  Colin

HMM example, continued  [Freitag, 99]

Fixed topology that captures limited context: 4 “prefix” states before and 4 “suffix” states after the target state:

  background → pre1 → pre2 → pre3 → pre4 → speaker → suf1 → suf2 → suf3 → suf4 → background

[Figure: each state annotated with its 5 most-probable tokens. For example, the speaker state's are “dr”, “professor”, “michael”, unknown and “.”; the background state's are “\n”, “.”, “-”, “:” and unknown; the prefix states favour tokens such as “\n”, “seminar”, “robotics”, “who”, “speaker”, “:”, “with”; the suffix states favour “\n”, “department”, “the”, “of”, “will” and unknown.]

Evaluation

[Figure: results chart; labels from the original: “21 cases” and “7 cases (no learning!)”.]

Learning HMM structure [Seymore et al, 1999]

start with a maximally-specific HMM (one state per observed word);
repeat
  (a) merge adjacent identical states:
        note auth auth title  ⇒  note auth title
  (b) eliminate redundant fan-out/in:
        title → {auth, auth, auth}  ⇒  title → auth
until a good tradeoff between HMM accuracy and complexity is obtained

[Figure: example run, collapsing chains of start/note/title/auth/abst/end states into a compact HMM.]

Evaluation

[Figure: accuracy of a simple HMM vs. a hand-crafted HMM vs. the learned HMM (155 states).]


Algorithms: Outline

✓ Wrappers
  ✓ Hand-coded wrappers
  ✓ Wrapper induction
  ✓ Learning highly expressive wrappers
✓ Boosted wrapper induction
✓ Hidden Markov models
• Exploiting linguistic constraints   (up next)

Exploiting linguistic constraints

• IE research has its roots in the NLP community
  • Many extraction tasks require non-trivial linguistic processing
• Web document types can range from free texts to rigid HTML documents (e.g. tables)
  • Even a mixture of them!
• Is NLP robust enough to cope with such situations?

[Figure: a mock Web page mixing a free-text headline (“Showers in the NW by Wednesday”) with filler text and rigid layout.]

Current Approaches

• NLP approaches (MUC-like approaches)
  • Ineffective on most Web-related texts:
    • Web pages/emails
    • Stereotypical but ungrammatical texts
  • Extra-linguistic structures convey information: HTML tags, document formatting, regular stereotypical language
• Wrapper induction systems
  • Designed for rigidly structured HTML texts
  • Ineffective on unstructured texts
  • These approaches avoid generalization over the flat word sequence
    • Data sparseness on free texts

Lazy NLP-based Algorithm

• Learns the best level of language analysis for a specific IE task, mixing deep linguistic and shallow strategies:
  1. Initial rules: shallow wrapper-like rules
  2. Linguistic information (LI) is progressively added to rules
  3. Addition stops when LI becomes unreliable or ineffective
• Lazy NLP learns the best strategy for each information/context separately
• Example:
  • Using parsing to recognise the speaker in seminar announcements
  • Using shallow approaches to spot the seminar location

(LP)2  [Ciravegna 2001: IJCAI-01, ATEM-01]

• Covering algorithm based on Lazy NLP
• Single-tag learning (e.g. </speaker>)
• Tagging rules
  • Insert annotations in texts
• Correction rules
  • Correct imprecision in information identification by shifting tags to the correct position
• TBL-like, with some fundamental differences

Tagging and Correction Rules: examples

Initial rules = a window of conditions on words.

Tagging rule (condition on words → action: insert tag):

  words:   the  seminar  at  4  pm
  action:  insert <time> before “4”
  result:  the seminar at <time> 4 pm </time> will

Correction rule, applied to “The seminar at 4 </time> PM will be held in Room 201”:

  condition:   words “at 4 pm”, with the wrong tag </time> after “4”
  action:      shift the correct tag </time> to after “pm”
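A toy rendering of the two rule types in code (the rules here are hand-written purely to mirror the example above; (LP)2 of course learns them from annotated texts):

import re

def tag_time(tokens):
    """Apply one tagging rule (insert <time> before a digit that follows
    'at') and one correction rule (shift a misplaced </time> to after 'pm')."""
    out = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"\d+", tok) and i > 0 and tokens[i-1] == "at":
            out.append("<time>")               # tagging rule fires
        out.append(tok)
    for i in range(len(out) - 1):
        if out[i] == "</time>" and out[i+1].lower() == "pm":
            out[i], out[i+1] = out[i+1], out[i]   # correction rule fires
    return out

print(tag_time("the seminar at 4 </time> pm will".split()))
# ['the', 'seminar', 'at', '<time>', '4', 'pm', '</time>', 'will']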


Rule Generalisation

• Each instance is generalised by reducing its pattern in length
• Generalizations are tested on the training corpus
• The best k rules generated from each instance are retained, i.e. those with:
  • Smallest error rate (wrong/matches)
  • Greatest number of matches
  • Coverage of different examples
• Conditions on words are replaced by information from NLP modules:
  • Capitalisation
  • Morphological analysis (generalizes over gender/number)
  • POS tagging (generalizes over lexical categories)
  • User-defined dictionary or gazetteer
  • Named entity recognizer
• Implemented as a general-to-specific beam search with pruning (AQ-like)

Example of generalization: “the seminar at <time> 4 pm will”

Full initial rule (conditions on each word, plus additional NLP knowledge):

  Word     Lemma    LexCat  Case  SemCat  Tag (Action)
  the      the      det     low
  seminar  seminar  noun    low
  at       at       prep    low
  4                 digit   low           <time>
  pm       pm       noun    low   timeid
  will     will     verb    low

Generalized rule:

  Word     Lemma    LexCat  Case  SemCat  Tag (Action)
  at
                    digit                 <time>
                                  timeid

Details of the algorithm in [Ciravegna 2001, ATEM-01].

CMU: detailed results

            (LP)2   BWI   HMM   SRV   Rapier  Whisk
  speaker    77.6  67.7  76.6  56.3   53.0    18.3
  location   75.0  76.7  78.6  72.3   72.7    66.4
  stime      99.0  99.6  98.5  98.5   93.4    92.6
  etime      95.5  93.9  62.1  77.9   96.2    86.0
  All slots  86.0  83.9  82.0  77.1   77.3    64.9

1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%

Effect of Generalization (1): effectiveness and reduction in data sparseness

  Slot       (LP)2 G   (LP)2 NG
  speaker      72.1      14.5      ← most interesting
  location     74.1      58.2
  stime       100        97.4
  etime        96.4      87.1
  All slots    89.7      78.2

With comparable effectiveness on the training corpus!

[Figure: number of rules (0-540) vs. cases covered (0-10), with and without generalisation.]
  NLP-based generalisation: 14% of rules cover 1 case; 42% cover up to 2 cases
  Non-NLP: 50% of rules cover 1 case; 71% cover up to 2 cases

Best level of Generalization

• ITC seminar announcements (mixed Italian/English)
  • Date, time, location generally in Italian
  • Speaker, title and abstract generally in English
• English POS tagging was used also for the Italian part
• The NLP-based version outperforms the others:

             Words   POS    NE
  speaker     74.1   75.4   84.3
  title       62.8   62.4   62.8
  date        90.8   93.4   93.9
  time       100    100    100
  location    95.0   95.0   95.5

Linguistic constraints: Conclusions

• Linguistic phenomena can't be handled by simple wrapper-like extraction patterns
• Even shallow linguistic processing (e.g. POS tagging) can improve performance dramatically
  • NOTE: linguistic processing must be regular, not necessarily correct!
  • Example: the rule (LexCat:NNP + <BR> + <BR>) → <SPEAKER> (NER:<person>); none of the 32 covered examples actually starts with an NNP
• What about more sophisticated NLP techniques?
  • Extension to parsing and coreference resolution?


Putting IE into Practice

Enabling non-experts to port IE systems

• Introduction (20 minutes)
  • What is IE, what can we extract from the Web and why?
• Algorithms and methodologies (100 min)
• IE in practice (30 min)
  • The adaptation problem (20 min)
  • Web + IE: examples of systems (10 min)
• Conclusion, future work (10 min)
• Discussion

Motivation

• Impact on the Web community will come only if:
  • IE systems are portable by non-IE experts
  • Porting is low cost
• Non-experts
  • Need specific, easy-to-use tools to:
    • Design the application
    • Tune the application
    • Deliver the application
  • Need support during the whole IE application definition process

“In summarising the summary of the summary: people are a problem.”
(Douglas Adams, The Restaurant at the End of the Universe)

Application Development Cycle:
  User needs → Scenario design → Adapting the IE system → Result validation → Application delivery

Scenario design

• Task: mapping user wishes into templates
• Necessity: supporting users in
  • Relevant information identification
  • Scenario organization
• Relevant information identification covers different situations:
  • User with a developed scenario: the system takes no action, but…
  • User with a preliminary scenario to be refined: the system helps in refining it
  • User with no scenario: the system helps in
    • Identifying relevant information
    • Organising it into a scenario

Training

• Users can select unrepresentative corpora
  • Unbalanced w.r.t. genres
    • The system validates the corpus against a large reference corpus by comparing formal features
  • Unwanted regularities (use of keywords for selection)
    • The system looks for unusual regularities
  • Irrelevant texts (sensitive information)
    • No solution to stupidity

Tagging Corpora

• Problems: tagging texts can
  • Be difficult and boring
  • Take a long time
• Effects:
  • Mistakes in tagging
  • High cost
• System goal: reduce/eliminate the need for annotated data
  • Bootstrapping: from user-defined “seed examples” to system-retrieved similar examples
    • Helps in discovering new relations
  • Active learning: selection of examples to annotate from the unlabeled corpus
    • Helps in focusing on information of unusual shape


Result Validation

• How well does the system perform?
  • Solution: facilities for
    • Inspecting the tagged corpus
    • Showing details on correctness
    • Statistics on the corpus
    • Details on errors (highlight correct/incorrect/missing)
    • (e.g. the MUC scorer is an excellent tool)
• Influencing system behavior
  • Solution: an interface bridging the user's qualitative vision (“Try to be more accurate!”) and the system's numerical vision (“OK. Please modify an error threshold”)

Application Delivery

• Problem: incoming texts deviate from the training data
  • Training corpus not representative
  • Document features change over time
• Solution: monitor the application (a sketch follows)
  • Warn the user if incoming texts' features are statistically different from the training corpus:
    • Formal features: text length, distribution of nouns
    • Semantic features: distribution of template fillers
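As a sketch of such monitoring, one could compare a simple formal feature of incoming texts against the training corpus (the z-score test, the feature choice and the threshold are illustrative assumptions, not the tutorial's method):

import statistics

def drift_warning(train_lengths, incoming_lengths, z_threshold=3.0):
    """Warn when the mean length of incoming texts is many standard errors
    away from the training-corpus mean (one of the 'formal features' above)."""
    mu = statistics.mean(train_lengths)
    sd = statistics.stdev(train_lengths)
    m = statistics.mean(incoming_lengths)
    se = sd / (len(incoming_lengths) ** 0.5)    # standard error of the mean
    return abs(m - mu) / se > z_threshold

print(drift_warning([100, 120, 90, 110, 105], [300, 280, 310]))  # True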

Putting IE into Practice (2)

Some examples of adaptive user-driven IE for real-world applications

Learning Pinocchio  [Ciravegna 2001, IJCAI]

• Commercial tool for adaptive IE
• Based on the (LP)2 algorithm
• Adaptable to new scenarios/applications by:
  • Corpus tagging via SGML
  • A user with analyst's knowledge
• Applications
  • “Tombstone” data from resumes (Canadian company) (E)
  • IE from financial news (Kataweb) (I)
  • IE from classified ads (Kataweb) (I)
  • Information highlighting (intelligence)
  • (Many others I have lost track of…)
• A number of licenses released around the world for application development

http://tcc.itc.it/research/textec/tools-resources/learningpinocchio/

Application development time

Resumes:
• Scenario definition: 10 person-hours
• Tagging 250 texts: 14 person-hours
• Rule induction: 72 hours on a 450MHz computer
• Result validation: 4 hours

Contact: Alberto, [email protected]
http://tcc.itc.it/research/textec/tools-resources/learningpinocchio/

Amilcare: active annotation for the Semantic Web  [Ciravegna 2002, SIGIR]

Tool for adaptive IE from Web-related texts
• Based on (LP)2
• Uses Gate and Annie for preprocessing
• Effective on different text types
  • From free texts to rigid docs (XML, HTML, etc.)
• Integrated with
  • MnM (Open University), Ontomat (University of Karlsruhe), Gate (U Sheffield)
• Adapting Amilcare:
  • Define a scenario (ontology)
  • Define a corpus of documents
  • Annotate texts (via MnM, Gate, Ontomat)
  • Train the system
  • Tune the application (*)
  • Deliver the application

www.dcs.shef.ac.uk/~fabio/Amilcare.html


Non-Intrusive Active Learning

• Amilcare is specifically designed as a companion for text annotation
  • It can be inserted into the usual tagging environment
  • It works in the background
  • At some point it starts helping the user with tagging

Bootstrapping Annotation

Learning to annotate:
1. Bare text → the user annotates; Amilcare learns in the background
2. Bare text → the user and Amilcare annotate in parallel; the two annotations are compared, and missing cases and mistakes are used to trigger further learning

Active Annotation

• When Amilcare's rules reach a user-defined accuracy, the roles switch: Amilcare annotates the bare text, the user corrects, and the corrections are used to retrain
• WHY active annotation?
  • Focuses the slow and expensive user activity on uncovered cases
  • Avoids annotating covered cases
  • Validating extracted information is simpler & less error prone than annotating bare texts, speeding up the process of corpus annotation considerably

Is IE useful as Help for Tagging?

[Figure: four learning curves (Speaker, Stime, Etime, Location) plotting precision, recall and F-measure against the number of training examples (0-150).]

Conclusions on IE and Tagging

• Integration of IE (Amilcare + Gate) and ontology-based annotation tools (MnM and Ontomat)
• A first step towards a new generation of ontology editors
• Active learning can provide an interesting interaction modality
  • User friendly
  • Adaptable

  Tag       Texts needed for training   Prec   Rec
  stime              30                  91     78
  etime              20                  96     72
  location           30                  82     61
  speaker           100                  75     70

Summary and Conclusions

The summary of the summary: where do we go from now?


Summary

n Information extraction: n core enable technology for variety of next-generation information

servicesn Data integration agentsn Semantic Webn Knowledge Management

n Scalable IE systems must be adaptiven automatically learn extraction rules from examples

n Dozens of algorithms to choose fromn State of the art is 70-100% extraction accuracy (after hand-tuning!)

across numerous domains. n Is this good enough? Depends your application.

n Yeah, but does it really work?!n Several companies sell IE products.n SW ontology editors start including IE

Open issues, Future directions

• Knob-tuning will continue to deliver substantial incremental performance improvements
• A Grand Unified Theory of text “structuredness”, to automatically select the optimal IE algorithm for a given task

[Figure: dimensions of text structuredness: natural vs. machine-generated, formal vs. spontaneous, restricted vs. open topic.]

Open Issues, Future directions

• Resource discovery

[Figure: spidering heuristics and a form classifier crawl the Web, expanding example services into candidate and then discovered services for a given service category; a 3-level Bayesian network relates service categories, input types and terms (p[input|service], p[term|input]) over a service-category taxonomy and an input-data-type taxonomy.]

• Cross-document extraction

[Figure 3: calendar management as inter-document information extraction. A thread of emails between John, Mary and Alice (“Can we meet Tue at 3, and also Fri at noon?”; “Sorry, I'll be an hour late Tue.”; “Oops, I need to cancel on Fri.”; “John asked me to also come to your Tuesday meeting.”) drives a sequence of calendar updates: create the two meetings (28/08@15:00 and 01/09@12:00, who = {John, Mary}), shift the Tuesday meeting to 16:00, add Alice to its attendees, and delete the Friday meeting.]

Open issues, Future directions

• Adaptive only?
  • The systems mentioned are designed for non-experts
    • E.g. they do not require users to revise or contribute rules
  • Is this a limitation? What about experts, or even the whole spectrum of skills?
  • Future direction: making the best use of the user's knowledge
• Expressive enough?
  • What about filling templates?
    • Coreference (“ACME is producing parts for YMB Inc. The company will deliver…”)
    • Reasoning (if X retires then X leaves his/her company)