
Adaptive Text Extraction and Mining

Nicholas Kushmerick
Department of Computer Science, University College Dublin
[email protected]  www.cs.ucd.ie/staff/nick

Fabio Ciravegna
Department of Computer Science, University of Sheffield
[email protected]  www.dcs.shef.ac.uk/~fabio

ECML-2003 Tutorial

What is IE
What can we extract from the Web and why?

• Introduction (20 minutes)
  • What is IE
  • What can we extract from the Web
  • Why?
• Algorithms and methodologies (100 min)
• IE in practice (30 min)
• Conclusion, future work (10 min)
• Discussion

The ‘canonical’ IE task

• Input:
  • Document: newspaper article, Web page, email message, …
  • Pre-defined “information need”: frame slots, template fillers, database tuples, …
• Output:
  • The specific substrings/fragments of the document, or labels, that satisfy the stated information need, possibly organised in a template
• DARPA’s ‘Message Understanding Conferences/Competitions’ since the late 1980s; most recent: MUC-7, 1998.
• Recent interest in the machine learning and Web communities.

IE Standard Tasks

• Preprocessing
  • Tokenization
  • Morphological analysis
  • Part-of-speech tagging
• Information identification
  • Named entity recognition
  • Template filling (from the MUC)
    • Template elements
    • Template relations
    • Scenario template

NE Recognition & Coreference

19:16 Moody's rates Province of Saskatchewan A3

Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for government purposes, mainly Saskatchewan Power Corp.

[Figure: the text annotated with entity labels: Organisation (Moody's Investors Service Inc), MNY (C$115 million), % (9.6 percent), Date (February 4, 2022), plus coreference links (e.g. Moody's = Moody's Investors Service Inc).]

Template Filling

19:16 Moody's rates Province of Saskatchewan A3

Moody's Investors Service Inc said it assigned an A3 rating to the Province of Saskatchewan's C$115 million bond offering that was priced today. The sale is a reopening of the province's 9.6 percent bonds due February 4, 2022. Proceeds will be used for government purposes, mainly Saskatchewan Power Corp.

  amount          C$115 million
  issuer          Province of Saskatchewan
  placement-date  today
  maturity        February 4, 2022
  rate            9.6 percent


The Big Picture

[Figure: the Web and an Intranet feed documents into an IE component which, guided by an ontology, populates a database accessed by a query processor.]

NYU Architecture: a MUC architecture

[Figure: pipeline. Local text analysis: Lexical Analysis, Name Recognition, Partial Parsing, Scenario Pattern Matching. Discourse analysis: Coreference Analysis, Inference. Finally: Template Generation.]

Semantic Web

• A brain for humankind
• From information-based to knowledge-based
• Processable knowledge means:
  • Better retrieval
  • Reasoning
• Where can IE contribute?

Building the SW

• Document annotation: manually associate documents (or parts) with ontological descriptions
  • Document classification for retrieval
    • Where can I buy a hamster?
    • Pet shop web page -> pet shop concept -> hamster
  • Knowledge annotation
    • Where can I find a hotel in Berlin where single rooms cost less than 400€?
    • “The Hotel is located in central Berlin and the cost for a single room is 300€”
• Editors are currently available for manual annotation of texts

IE for Annotating Documents

• Manual annotation is
  • Expensive
  • Error prone
• IE can be used for annotating documents
  • Automatically
  • Semi-automatically
  • As user support
• Advantages
  • Speed
  • Low cost
  • Consistency
  • Can provide automatic annotation different from the one provided by the author(!)

SW for Knowledge Management

• SW is important for everyday Internet users
• SW is necessary for large companies
  • Millions of documents where knowledge is interspersed
• Most documents are now
  • Web-based
  • Available over an intranet
• Companies are valued for their
  • Tangible assets (e.g. plants)
  • Intangible assets (e.g. knowledge)
• Knowledge is stored in
  • Minds of employees
  • Documentation
• Companies spend 7-10% of revenues on KM


Why Adaptive Systems?

• Writing IE systems by hand is difficult and error prone
  • Extraction languages can be quite complex
  • Tedious write-test-debug-rewrite cycle
• Adaptive systems learn from user annotations
  • The person tells the learning algorithm what to extract; the learner figures out how
• Advantages
  • Annotating text is simpler & faster than writing rules
  • Domain independent
  • Domain experts don't need to be linguists or programmers
  • Learning algorithms ensure full coverage of examples

Algorithms and Methodologies

A dip into the details of IE for the Web

• Introduction (20 minutes)
• Algorithms and methodologies (100 min)
  • Wrapper induction
  • Boosted wrapper induction
  • Hidden Markov models
  • Exploiting linguistic constraints
• IE in practice (30 min)
• Conclusion, future work (10 min)
• Discussion

Algorithms: Outline

• Wrappers
  • Hand-coded wrappers
  • Wrapper induction
  • Learning highly expressive wrappers
• Boosted wrapper induction
• Hidden Markov models
• Exploiting linguistic constraints

[Figure: the techniques arranged along a spectrum from structured data to natural text.]

Wrapper induction

Highly regular source documents
  ⇓ Relatively simple extraction patterns
  ⇓ Efficient learning algorithms

Wrappers: Example and Toolkits

⟨ (Congo, 242) (Egypt, 20) (Belize, 501) (Spain, 34) ⟩

• Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand
• Examples:
  • World Wide Web Wrapper Factory [db.cis.upenn.edu/W4F]
  • Java Extraction & Dissemination of Information [www.darmstadt.gmd.de/oasys/projects/jedi]

Wrappers: Delimiter-based extraction

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

Use <B>, </B>, <I>, </I> for extraction.


“Left-Right” wrappers  [Kushmerick et al, IJCAI-97; Kushmerick, AIJ-2000]

procedure ExtractCountryCodes:
  while there are more occurrences of <B>
    1. extract Country between <B> and </B>
    2. extract Code between <I> and </I>

In general:

procedure ExtractAttributes:
  while there are more occurrences of l1
    1. extract 1st attribute between l1 and r1
       …
    K. extract Kth attribute between lK and rK

A Left-Right wrapper is simply 2K strings ⟨l1, r1, …, lK, rK⟩: the left and right delimiters for each of the K attributes.
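To make the execution semantics concrete, here is a minimal Python sketch of an LR wrapper executor (our illustration, not code from the tutorial; the function name and scanning strategy are assumptions):

def extract_lr(page, delimiters):
    """Execute a Left-Right wrapper. delimiters is [(l1, r1), ..., (lK, rK)];
    scan for l1 repeatedly, taking the text between each lk and the
    following rk as the k-th attribute of the current record."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return records            # no further records
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

page = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>")
print(extract_lr(page, [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]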

Wrapper induction

Thai food is spicy. Vietnamese food is spicy. German food isn't spicy.
  ⇒ Asian food is spicy.

examples ⇒ hypothesis
labeled pages ⇒ wrapper

[Figure: several example pages with labeled fragments, from which a wrapper is induced.]

Learning LR wrappers

Task: find the 2K strings ⟨l1, r1, …, lK, rK⟩.
Example: find 4 strings ⟨l1, r1, l2, r2⟩ = ⟨<B>, </B>, <I>, </I>⟩ from labeled pages such as:

<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

LR: Finding r1

r1 can be any prefix of the text that immediately follows each labeled Country instance, e.g. </B>.

LR: Finding l1, l2 and r2

Similarly:
• l1 can be any suffix of the text preceding each Country, e.g. <B>
• l2 can be any suffix, e.g. <I>
• r2 can be any prefix, e.g. </I>
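The candidate generation step can be sketched in a few lines; this is a simplified illustration, not the exact IJCAI-97 algorithm (the function name, the 30-character context window, and the use of the first occurrence of each fragment are our assumptions; a real learner also validates each candidate against all training pages):

def longest_common_prefix(strings):
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def candidate_r(page, fragments):
    """Candidate right delimiters for one attribute: every prefix of the
    longest common prefix of the text following each labeled fragment."""
    followers = []
    for frag in fragments:
        end = page.index(frag) + len(frag)
        followers.append(page[end:end + 30])   # bounded context window
    lcp = longest_common_prefix(followers)
    return [lcp[:n] for n in range(1, len(lcp) + 1)]

page = "<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>"
print(candidate_r(page, ["Congo", "Egypt"]))
# ['<', '</', '</B', '</B>', ...]  -- '</B>' is among the candidates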

A problem with LR wrappers

Distracting text in head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B>
</BODY></HTML>

The bold header and footer confuse the l1 = <B> delimiter.


One (of many) solutions: HLRT

Ignore the page's head and tail:

<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Some Country Codes</B><P>                 } head  (ends at "end of head")
<B>Congo</B> <I>242</I><BR> … <B>Spain</B> <I>34</I><BR>   } body
<HR><B>End</B></BODY></HTML>                 } tail  (begins at "start of tail")

Head-Left-Right-Tail wrappers: in addition to the delimiters, learn a string marking the end of the head and a string marking the start of the tail.
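Executionally, HLRT just adds a preprocessing step to LR; a sketch reusing the extract_lr function from above (h and t values in the comment are for the example page):

def extract_hlrt(page, h, t, delimiters):
    """HLRT wrapper: drop everything up to and including h (end of head),
    drop everything from t (start of tail) onward, then run LR extraction."""
    body = page[page.find(h) + len(h):]
    body = body[:body.find(t)]
    return extract_lr(body, delimiters)

# On the distracting page above, h = '<P>' and t = '<HR>' isolate the body:
# extract_hlrt(page, '<P>', '<HR>', [('<B>', '</B>'), ('<I>', '</I>')])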

Expressiveness

[Figure: Venn diagram comparing, within the set of all sites, the sites wrappable by LR and the sites wrappable by HLRT, with a theorem relating the two classes (statement in the original figure).]

Coverage

Fraction of randomly-selected “data-heavy” Web sites (search engines, retail, weather, news, finance, …) for which a wrapper in a given class was learned.

[Figure: coverage results per wrapper class.]

Sample complexity

• The key problem with machine learning: training data is expensive and tedious to generate
• In practice, active learning and specialized algorithms have reduced training requirements considerably
• But this isn't theoretically satisfying
• Computational learning theory:
  • Time complexity: time required by an algorithm to terminate, as a function of problem parameters
  • Sample complexity: training data required by a learning algorithm to converge to the correct hypothesis, as a function of problem parameters

A Model of Sample Complexity

Just like time/space complexity, where

  time[learn wrapper] = g(size of documents, number of documents, number of attributes per record),

analyze the wrapper learning task to derive

  P[correct wrapper] = f(size of documents, number of documents, number of attributes per record).

(Actually, we can compute only a bound on this probability.)

PAC results: LR wrappers

Theorem: Suppose we learn LR wrapper W from training set E, where the longest document has length R and each record contains K attributes. If |E| exceeds a bound (a function of R, K, ε and δ given in the original slide), then W is probably approximately correct:

  error(W) < ε with probability at least 1-δ.


More sophisticated wrappers

• LR & HLRT wrappers are extremely simple (though useful for ~2/3 of real Web sites!)
• Recent wrapper induction research has explored…
  • More expressive wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98; Thomas et al, JIS-00, …]
    • Disjunctive delimiters
    • Sequential/landmark-based delimiters
    • Multiple attribute orderings
    • Missing attributes
    • Multiple-valued attributes
    • Hierarchically nested data
  • Wrapper verification/maintenance [Kushmerick, AAAI-1999; Kushmerick, WWWJ-00; Cohen, AAAI-1999; Minton et al, AAAI-00]

One of my favorites

• RoadRunner [Valter Crescenzi et al; Univ Roma 3]
• Unsupervised wrapper induction
  • They research databases, not machine learning, so they didn't realize training data was needed :-)
• Intuition:
  • Pose two different queries
  • The common bits of the resulting documents come from the template and can be ignored
  • The bits that differ are the data we're looking for

RoadRunner: Example

• Common content = part of the template. Varying content = the data!
• Complications: dynamic but unwanted content, e.g. advertisements or timestamps
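The intuition is easy to demonstrate with a sequence alignment; here is a toy version (this is not the actual RoadRunner algorithm, which performs grammar inference over page structures; the function name and tokenization are our assumptions):

import difflib

def infer_template(page_a, page_b):
    """Align two pages from the same site: token runs shared by both pages
    are treated as template, differing runs as (candidate) data."""
    ta, tb = page_a.split(), page_b.split()
    template, data_a, data_b = [], [], []
    for op, a0, a1, b0, b1 in difflib.SequenceMatcher(a=ta, b=tb).get_opcodes():
        if op == "equal":
            template += ta[a0:a1]          # common content: the template
        else:
            data_a.append(" ".join(ta[a0:a1]))   # varying content: the data
            data_b.append(" ".join(tb[b0:b1]))
    return template, data_a, data_b

a = "<html> <b> Congo </b> <i> 242 </i> </html>"
b = "<html> <b> Egypt </b> <i> 20 </i> </html>"
print(infer_template(a, b))
# (['<html>', '<b>', '</b>', '<i>', '</i>', '</html>'], ['Congo', '242'], ['Egypt', '20'])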

Algorithms: Outline

✓ Wrappers
  ✓ Hand-coded wrappers
  ✓ Wrapper induction
  ✓ Learning highly expressive wrappers
• Boosted wrapper induction   (up next)
• Hidden Markov models
• Exploiting linguistic constraints

Boosted wrapper induction  [Freitag & Kushmerick, AAAI-00]

• Wrapper induction is suitable only for rigidly-structured machine-generated HTML…
• … or is it?!
• Can we use simple patterns to extract from natural language documents?

  … Name: Dr. Jeffrey D. Hermes …
  … Who: Professor Manfred Paul …
  … will be given by Dr. R. J. Pangborn …
  … Ms. Scott will be speaking …
  … Karen Shriver, Dept. of ...
  … Maria Klawe, University of ...

BWI: The basic idea

• Learn “wrapper-like” patterns for natural texts (pattern = exact token sequence)
• Learn many such “weak” patterns
• Combine them with boosting to build a “strong” ensemble pattern
• Of course, not all natural text is sufficiently regular!
• Demo: www.smi.ucd.ie/bwi


Covering Algorithms

• Generalization of the covering algorithm for learning disjunctive rules (see the sketch after the next slide)

[Figure, repeated over four slides: a scatter of positive (+) and negative (-) examples; at each step a new rule covers some of the remaining positives, which are then removed:]

  Learned Rule = rule
  Learned Rule = rule or rule
  Learned Rule = rule or rule or rule
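The covering loop pictured on these slides translates directly into code; a minimal Python sketch (learn_one_rule is an assumed black box that greedily builds a single rule from the current examples):

def covering(positives, negatives, learn_one_rule):
    """Learn a disjunction of rules: learn one rule, remove the positive
    examples it covers, and repeat until all positives are covered."""
    remaining = set(positives)
    rules = []
    while remaining:
        rule = learn_one_rule(remaining, negatives)   # returns a predicate
        covered = {x for x in remaining if rule(x)}
        if not covered:
            break                     # no progress; give up on the rest
        remaining -= covered          # covered positives are discarded
        rules.append(rule)
    return rules   # Learned Rule = rules[0] or rules[1] or ...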

Boosting = Generalized Covering

• When learning rules in iteration t, give less weight to (but don't entirely discard) training examples successfully handled in iterations 1, 2, …, t-1
• Equivalently: give more weight to training data that has not yet been covered

[Figure: the same scatter of + and - examples, now with covered positives down-weighted rather than removed.]

Boosting [Schapire & Singer, 1998]

  D_1(i) = uniform distribution over training examples

  for t = 1, ..., T:
    train:    use distribution D_t to learn a weak hypothesis h_t: X → R
    reweight: choose α_t, and modify the distribution to emphasize
              examples missed by h_t:
                D_{t+1}(i) = D_t(i) · exp(-α_t · y_i · h_t(x_i))

  return:  H(x) = sign( Σ_t α_t · h_t(x) )
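The loop above translates almost line for line into code. A self-contained sketch for {-1, +1}-valued weak hypotheses (a special case of the real-valued h_t on the slide; weak_learn is an assumed black box, and the α_t rule here is the standard AdaBoost choice):

import math

def adaboost(examples, labels, weak_learn, T):
    """Generic AdaBoost loop: labels y_i are in {-1, +1}; weak_learn(xs, ys, D)
    must return a function h with h(x) in {-1, +1}."""
    n = len(examples)
    D = [1.0 / n] * n                      # D_1: uniform distribution
    ensemble = []                          # list of (alpha_t, h_t)
    for _ in range(T):
        h = weak_learn(examples, labels, D)
        err = sum(d for d, x, y in zip(D, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard the log
        alpha = 0.5 * math.log((1 - err) / err)
        # reweight: emphasize examples h got wrong, then renormalize
        D = [d * math.exp(-alpha * y * h(x))
             for d, x, y in zip(D, examples, labels)]
        z = sum(D)
        D = [d / z for d in D]
        ensemble.append((alpha, h))
    def H(x):                              # strong hypothesis: weighted vote
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H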


Weak hypotheses: Boundary Detectors

  Boundary Detector:  [who :] [dr . <Capitalized>]
                      prefix   suffix

  matches (e.g.) “… Who: Dr. Richard Nixon …”

Weak learning algorithm:
• Greedy growth from the null detector
• Pick the best prefix/suffix extension at each step
• Stop when no further extension improves accuracy
• Weighting [Cohen & Singer, 1999]:  α_t = ½ ln[(W₊ + ε) / (W₋ + ε)]

Boosted Wrapper Induction

Training
  input:  labeled documents
  Fore    = AdaBoost'ed fore (start-boundary) detectors
  Aft     = AdaBoost'ed aft (end-boundary) detectors
  Lengths = histogram of field lengths
  output: Extractor = ⟨Fore, Aft, Lengths⟩

Execution
  input:  Document, Extractor, threshold t
  F = {⟨i, ci⟩ | token i matches Fore with confidence ci}
  A = {⟨j, cj⟩ | token j matches Aft with confidence cj}
  output: {⟨i, j⟩ | ⟨i, ci⟩ ∈ F, ⟨j, cj⟩ ∈ A, ci · cj · L(j - i) > t}
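In code, the execution step might look like the following sketch (fore, aft and the confidence combination follow the Execution box above; the function itself, and representing the ensembles as callables, are our illustration):

def bwi_extract(tokens, fore, aft, length_hist, t):
    """Field extraction per the Execution box: fore(tokens, i) and
    aft(tokens, j) stand for the boosted boundary ensembles (assumed given),
    returning the confidence that a field starts at token i / ends at token j.
    length_hist maps a field length to its empirical probability L(.)."""
    starts = [(i, fore(tokens, i)) for i in range(len(tokens))]
    ends = [(j, aft(tokens, j)) for j in range(len(tokens))]
    fields = []
    for i, ci in starts:
        for j, cj in ends:
            score = ci * cj * length_hist.get(j - i, 0.0)
            if j >= i and score > t:
                fields.append((i, j, score))
    return fields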

BWI execution example

  <[email protected]>
  Type: cmu.andrew.official.cmu-news
  Topic: Chem. Eng. Seminar
  Dates: 2-May-95
  Time: 10:45 AM
  PostedBy: Bruce Gerson on 26-Apr-95 at 09:31 from andrew.cmu.edu
  Abstract:

  The Chemical Engineering Department will offer a seminar entitled
  "Creating Value in the Chemical Industry," at 10:45 a.m., Tuesday, May 2
  in Doherty Hall 1112. The seminar will be given by Dr. R. J. (Bob) Pangborn,
  Director, Central Research and Development, The Dow Chemical Company.

  Fore detectors (weights):  1.3 [<Alph>][Dr . <Cap>]    0.8 [<Alph> by][<Cap>]
  Aft detectors (weights):   2.1 [<Cap>][( University]   0.7 [<Alph>][, Director]
  Length histogram: L(10) = 0.05 (neighbouring bins around 0.1)

  Confidence of "Dr. R. J. (Bob) Pangborn" = 2.1 · 0.7 · 0.05 = 0.074
  (fore confidence 2.1 = 1.3 + 0.8, the sum of the two matching fore detectors;
  aft confidence 0.7; length probability 0.05)

Samples of learned patterns

  [speaker :][<Alph>]  and  [speaker <Any>][<FName>]
      match: Speaker: Reid Simmons, School of …

  [<Cap>][<FName> <Any> <Punc> ibm]
      matches: Presentation Abstract Joe Cascio, IBM
               Set Constraints Alex Aiken (IBM, Almaden)

  [. <Any>][is <ANum> <Cap>]
      matches: John C. Akbari is a Masters student at
               Michael A. Cusumano is an Associate Professor of
               Lawrence C. Stewart is a Consultant Engineer at

Evaluation

• Wrappers are usually 100% accurate, but perfection is generally impossible with natural text
• The ML/IE community has a well-developed evaluation methodology
  • Cross-validation: repeat many times; randomly select 2/3 of the data for training, test on the remaining 1/3
  • Precision: fraction of extracted items that are correct
  • Recall: fraction of actual items extracted
  • F1 = 2 / (1/P + 1/R)
• 16 IE tasks from 8 document collections: seminar announcements, job listings, Reuters corporate acquisitions, CS department faculty lists, Zagat's restaurant reviews, LA Times restaurant reviews, Internet Address Finder, stock quote server
• Competitors: SRV, Rapier, HMM
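The three metrics are easy to pin down in code; a small sketch over sets of extracted vs. gold items (the toy items are invented):

def prf1(extracted, actual):
    """Precision, recall and F1 over sets of extracted vs. gold items."""
    extracted, actual = set(extracted), set(actual)
    tp = len(extracted & actual)                  # correctly extracted items
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(actual) if actual else 0.0
    f1 = 2 / (1 / p + 1 / r) if p and r else 0.0  # harmonic mean
    return p, r, f1

print(prf1({"4 pm", "Room 201"}, {"4 pm", "Room 201", "Monday"}))
# (1.0, 0.666..., 0.8)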

Results: 16 tasks x 4 algorithms

[Figure: head-to-head comparison across the 16 tasks and 4 algorithms; labels from the original chart: "21 cases" and "7 cases".]


Boosted Wrapper Induction: Controversial(?) Conclusion

• Is the great Web -vs- natural text chasm more apparent than real?
• IE is possible if the documents contain regularities that can be exploited
• But the “reason” (e.g. linguistic -vs- markup) for these regularities doesn't much matter
• See also Soderland's WHISK & Webfoot

Algorithms: Outline

✓ Wrappers
  ✓ Hand-coded wrappers
  ✓ Wrapper induction
  ✓ Learning highly expressive wrappers
✓ Boosted wrapper induction
• Hidden Markov models   (up next)
• Exploiting linguistic constraints

Hidden Markov models

• The previous discussion examined systems that use explicit extraction patterns/rules
• HMMs are a powerful alternative based on statistical token models rather than explicit extraction patterns

[Leek, UC San Diego, 1997; Bikel et al, ANLP-97, MLJ-99; Freitag & McCallum, AAAI-99 MLIE Workshop; Seymore, McCallum & Rosenfeld, AAAI-99 MLIE Workshop; Freitag & McCallum, AAAI-2000]

HMM formalism

An HMM consists of:
• states s1, s2, …, with a special start state s1 and a special end state sn
• a token alphabet a1, a2, …
• state transition probabilities P(si | sj)
• token emission probabilities P(ai | sj)

Widely used in many language-processing tasks, e.g. speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, 1998].

Applying HMMs to IE

• Document ⇒ generated by a stochastic process modelled by an HMM
• Token ⇒ word
• State ⇒ “reason/explanation” for a given token
  • ‘Background’ state emits tokens like ‘the’, ‘said’, …
  • ‘Money’ state emits tokens like ‘million’, ‘euro’, …
  • ‘Organization’ state emits tokens like ‘university’, ‘company’, …
• Extraction: the Viterbi algorithm is a dynamic-programming technique for efficiently computing the most likely sequence of states that generated a document (see the sketch below)
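As a concrete illustration of the extraction step, here is a log-space Viterbi sketch (a generic textbook implementation, not the tutorial's code; the toy states and probabilities below are invented, and 1e-9 stands in for a smoothed unknown-token probability):

import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a token sequence."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(tokens[t], 1e-9)))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])   # trace back from the best end
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["background", "money"]
start_p = {"background": 0.9, "money": 0.1}
trans_p = {"background": {"background": 0.8, "money": 0.2},
           "money": {"background": 0.5, "money": 0.5}}
emit_p = {"background": {"the": 0.3, "offering": 0.1, "raised": 0.1},
          "money": {"c$115": 0.3, "million": 0.3}}
print(viterbi("the offering raised c$115 million".split(),
              states, start_p, trans_p, emit_p))
# ['background', 'background', 'background', 'money', 'money']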

HMM for research papers [Seymore et al, 99]

[Figure: an HMM whose states correspond to the header fields of a research paper (title, author, abstract, note, etc.).]


Learning HMMs

• Good news:
  • If training-data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of the transition/emission probabilities. (Use smoothing to avoid zero probabilities for emissions/transitions absent from the training data; a sketch follows.)
• Great news:
  • The Baum-Welch algorithm trains an HMM using unlabelled training data!
• Bad news:
  • How many states should the HMM contain?
  • How are transitions constrained?
    • Insufficiently expressive ⇒ unable to model important distinctions
    • Overly expressive ⇒ sparse training data, overfitting
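The “good news” case is a few lines of counting; a sketch with add-alpha smoothing (the value of alpha and the treatment of unknown tokens are our choices, not the tutorial's):

from collections import Counter, defaultdict

def train_hmm(tagged_docs, alpha=0.1):
    """Supervised maximum-likelihood estimation with add-alpha smoothing.
    tagged_docs is a list of [(token, state), ...] sequences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for doc in tagged_docs:
        for tok, s in doc:
            emit[s][tok] += 1
        for (_, s1), (_, s2) in zip(doc, doc[1:]):
            trans[s1][s2] += 1
    states = list(emit)
    vocab = {tok for c in emit.values() for tok in c}
    trans_p = {s1: {s2: (trans[s1][s2] + alpha)
                        / (sum(trans[s1].values()) + alpha * len(states))
                    for s2 in states} for s1 in states}
    # the "+ 1" in the denominator reserves probability mass for unknown tokens
    emit_p = {s: {tok: (emit[s][tok] + alpha)
                       / (sum(emit[s].values()) + alpha * (len(vocab) + 1))
                  for tok in vocab} for s in states}
    return trans_p, emit_p

docs = [[("the", "bg"), ("cost", "bg"), ("is", "bg"),
         ("300", "money"), ("euro", "money")]]
trans_p, emit_p = train_hmm(docs)
print(round(trans_p["bg"]["money"], 3))   # smoothed frequency ratio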

HMM example

“Seminar announcements” task:

  <[email protected]>
  Type: cmu.andrew.assocs.UEA
  Topic: Re: entreprenuership speaker
  Dates: 17-Apr-95
  Time: 7:00 PM
  PostedBy: Colin S Osburn on 15-Apr-95 at 15:11 from CMU.EDU
  Abstract:

  hello again
  to reiterate
  there will be a speaker on the law and startup business
  this monday evening the 17th
  it will be at 7pm in room 261 of GSIA in the new building, ie upstairs.
  please attend if you have any interest in starting your own business or
  are even curious.
  Colin

HMM example, continued  [Freitag, 99]

Fixed topology that captures limited context: 4 “prefix” states before and 4 “suffix” states after the target state:

  background → pre1 → pre2 → pre3 → pre4 → speaker → suf1 → suf2 → suf3 → suf4 → background

[Figure: each state annotated with its 5 most-probable tokens. For example, the speaker state's are “dr”, “professor”, “michael”, unknown and “.”; the background state's are “\n”, “.”, “-”, “:” and unknown; the prefix states favour tokens such as “\n”, “seminar”, “robotics”, “who”, “speaker”, “:”, “with”; the suffix states favour “\n”, “department”, “the”, “of”, “will” and unknown.]

Evaluation

[Figure: results chart; labels from the original: “21 cases” and “7 cases (no learning!)”.]

Learning HMM structure [Seymore et al, 1999]

start with a maximally-specific HMM (one state per observed word);
repeat
  (a) merge adjacent identical states:
        note auth auth title  ⇒  note auth title
  (b) eliminate redundant fan-out/in:
        title → {auth, auth, auth}  ⇒  title → auth
until a good tradeoff between HMM accuracy and complexity is obtained

[Figure: example run, collapsing chains of start/note/title/auth/abst/end states into a compact HMM.]

Evaluation

[Figure: accuracy of a simple HMM vs. a hand-crafted HMM vs. the learned HMM (155 states).]


Algorithms: Outline

✓ Wrappers
  ✓ Hand-coded wrappers
  ✓ Wrapper induction
  ✓ Learning highly expressive wrappers
✓ Boosted wrapper induction
✓ Hidden Markov models
• Exploiting linguistic constraints   (up next)

Exploiting linguistic constraints

• IE research has its roots in the NLP community
  • Many extraction tasks require non-trivial linguistic processing
• Web document types can range from free texts to rigid HTML documents (e.g. tables)
  • Even a mixture of them!
• Is NLP robust enough to cope with such situations?

[Figure: a mock Web page mixing a free-text headline (“Showers in the NW by Wednesday”) with filler text and rigid layout.]

Current Approaches

• NLP approaches (MUC-like approaches)
  • Ineffective on most Web-related texts:
    • Web pages/emails
    • Stereotypical but ungrammatical texts
  • Extra-linguistic structures convey information: HTML tags, document formatting, regular stereotypical language
• Wrapper induction systems
  • Designed for rigidly structured HTML texts
  • Ineffective on unstructured texts
  • These approaches avoid generalization over the flat word sequence
    • Data sparseness on free texts

Lazy NLP-based Algorithm

• Learns the best level of language analysis for a specific IE task, mixing deep linguistic and shallow strategies:
  1. Initial rules: shallow wrapper-like rules
  2. Linguistic information (LI) is progressively added to rules
  3. Addition stops when LI becomes unreliable or ineffective
• Lazy NLP learns the best strategy for each information/context separately
• Example:
  • Using parsing to recognise the speaker in seminar announcements
  • Using shallow approaches to spot the seminar location

(LP)2  [Ciravegna 2001: IJCAI-01, ATEM-01]

• Covering algorithm based on Lazy NLP
• Single-tag learning (e.g. </speaker>)
• Tagging rules
  • Insert annotations in texts
• Correction rules
  • Correct imprecision in information identification by shifting tags to the correct position
• TBL-like, with some fundamental differences

Tagging and Correction Rules: examples

Initial rules = a window of conditions on words.

Tagging rule (condition on words → action: insert tag):

  words:   the  seminar  at  4  pm
  action:  insert <time> before “4”
  result:  the seminar at <time> 4 pm </time> will

Correction rule, applied to “The seminar at 4 </time> PM will be held in Room 201”:

  condition:   words “at 4 pm”, with the wrong tag </time> after “4”
  action:      shift the correct tag </time> to after “pm”
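A toy rendering of the two rule types in code (the rules here are hand-written purely to mirror the example above; (LP)2 of course learns them from annotated texts):

import re

def tag_time(tokens):
    """Apply one tagging rule (insert <time> before a digit that follows
    'at') and one correction rule (shift a misplaced </time> to after 'pm')."""
    out = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"\d+", tok) and i > 0 and tokens[i-1] == "at":
            out.append("<time>")               # tagging rule fires
        out.append(tok)
    for i in range(len(out) - 1):
        if out[i] == "</time>" and out[i+1].lower() == "pm":
            out[i], out[i+1] = out[i+1], out[i]   # correction rule fires
    return out

print(tag_time("the seminar at 4 </time> pm will".split()))
# ['the', 'seminar', 'at', '<time>', '4', 'pm', '</time>', 'will']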


Rule Generalisation

• Each instance is generalised by reducing its pattern in length
• Generalizations are tested on the training corpus
• The best k rules generated from each instance are retained, i.e. those with:
  • Smallest error rate (wrong/matches)
  • Greatest number of matches
  • Coverage of different examples
• Conditions on words are replaced by information from NLP modules:
  • Capitalisation
  • Morphological analysis (generalizes over gender/number)
  • POS tagging (generalizes over lexical categories)
  • User-defined dictionary or gazetteer
  • Named entity recognizer
• Implemented as a general-to-specific beam search with pruning (AQ-like)

Example of generalization: “the seminar at <time> 4 pm will”

Full initial rule (conditions on each word, plus additional NLP knowledge):

  Word     Lemma    LexCat  Case  SemCat  Tag (Action)
  the      the      det     low
  seminar  seminar  noun    low
  at       at       prep    low
  4                 digit   low           <time>
  pm       pm       noun    low   timeid
  will     will     verb    low

Generalized rule:

  Word     Lemma    LexCat  Case  SemCat  Tag (Action)
  at
                    digit                 <time>
                                  timeid

Details of the algorithm in [Ciravegna 2001, ATEM-01].

CMU: detailed results

            (LP)2   BWI   HMM   SRV   Rapier  Whisk
  speaker    77.6  67.7  76.6  56.3   53.0    18.3
  location   75.0  76.7  78.6  72.3   72.7    66.4
  stime      99.0  99.6  98.5  98.5   93.4    92.6
  etime      95.5  93.9  62.1  77.9   96.2    86.0
  All slots  86.0  83.9  82.0  77.1   77.3    64.9

1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%

Effect of Generalization (1): effectiveness and reduction in data sparseness

  Slot       (LP)2 G   (LP)2 NG
  speaker      72.1      14.5      ← most interesting
  location     74.1      58.2
  stime       100        97.4
  etime        96.4      87.1
  All slots    89.7      78.2

With comparable effectiveness on the training corpus!

[Figure: number of rules (0-540) vs. cases covered (0-10), with and without generalisation.]
  NLP-based generalisation: 14% of rules cover 1 case; 42% cover up to 2 cases
  Non-NLP: 50% of rules cover 1 case; 71% cover up to 2 cases

Best level of Generalization

• ITC seminar announcements (mixed Italian/English)
  • Date, time, location generally in Italian
  • Speaker, title and abstract generally in English
• English POS tagging was used also for the Italian part
• The NLP-based version outperforms the others:

             Words   POS    NE
  speaker     74.1   75.4   84.3
  title       62.8   62.4   62.8
  date        90.8   93.4   93.9
  time       100    100    100
  location    95.0   95.0   95.5

Linguistic constraints: Conclusions

• Linguistic phenomena can't be handled by simple wrapper-like extraction patterns
• Even shallow linguistic processing (e.g. POS tagging) can improve performance dramatically
  • NOTE: linguistic processing must be regular, not necessarily correct!
  • Example: the rule (LexCat:NNP + <BR> + <BR>) → <SPEAKER> (NER:<person>); none of the 32 covered examples actually starts with an NNP
• What about more sophisticated NLP techniques?
  • Extension to parsing and coreference resolution?


Putting IE into Practice

Enabling non-experts to port IE systems

• Introduction (20 minutes)
  • What is IE, what can we extract from the Web and why?
• Algorithms and methodologies (100 min)
• IE in practice (30 min)
  • The adaptation problem (20 min)
  • Web + IE: examples of systems (10 min)
• Conclusion, future work (10 min)
• Discussion

Motivation

• Impact on the Web community will come only if:
  • IE systems are portable by non-IE experts
  • Porting is low cost
• Non-experts
  • Need specific, easy-to-use tools to:
    • Design the application
    • Tune the application
    • Deliver the application
  • Need support during the whole IE application definition process

“In summarising the summary of the summary: people are a problem.”
(Douglas Adams, The Restaurant at the End of the Universe)

Application Development Cycle:
  User needs → Scenario design → Adapting the IE system → Result validation → Application delivery

Scenario design

• Task: mapping user wishes into templates
• Necessity: supporting users in
  • Relevant information identification
  • Scenario organization
• Relevant information identification covers different situations:
  • User with a developed scenario: the system takes no action, but…
  • User with a preliminary scenario to be refined: the system helps in refining it
  • User with no scenario: the system helps in
    • Identifying relevant information
    • Organising it into a scenario

Training

• Users can select unrepresentative corpora
  • Unbalanced w.r.t. genres
    • The system validates the corpus against a large reference corpus by comparing formal features
  • Unwanted regularities (use of keywords for selection)
    • The system looks for unusual regularities
  • Irrelevant texts (sensitive information)
    • No solution to stupidity

Tagging Corpora

• Problems: tagging texts can
  • Be difficult and boring
  • Take a long time
• Effects:
  • Mistakes in tagging
  • High cost
• System goal: reduce/eliminate the need for annotated data
  • Bootstrapping: from user-defined “seed examples” to system-retrieved similar examples
    • Helps in discovering new relations
  • Active learning: selection of examples to annotate from the unlabeled corpus
    • Helps in focusing on information of unusual shape


Result Validation

• How well does the system perform?
  • Solution: facilities for
    • Inspecting the tagged corpus
    • Showing details on correctness
    • Statistics on the corpus
    • Details on errors (highlight correct/incorrect/missing)
    • (e.g. the MUC scorer is an excellent tool)
• Influencing system behavior
  • Solution: an interface bridging the user's qualitative vision (“Try to be more accurate!”) and the system's numerical vision (“OK. Please modify an error threshold”)

Application Delivery

• Problem: incoming texts deviate from the training data
  • Training corpus not representative
  • Document features change over time
• Solution: monitor the application (a sketch follows)
  • Warn the user if incoming texts' features are statistically different from the training corpus:
    • Formal features: text length, distribution of nouns
    • Semantic features: distribution of template fillers
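As a sketch of such monitoring, one could compare a simple formal feature of incoming texts against the training corpus (the z-score test, the feature choice and the threshold are illustrative assumptions, not the tutorial's method):

import statistics

def drift_warning(train_lengths, incoming_lengths, z_threshold=3.0):
    """Warn when the mean length of incoming texts is many standard errors
    away from the training-corpus mean (one of the 'formal features' above)."""
    mu = statistics.mean(train_lengths)
    sd = statistics.stdev(train_lengths)
    m = statistics.mean(incoming_lengths)
    se = sd / (len(incoming_lengths) ** 0.5)    # standard error of the mean
    return abs(m - mu) / se > z_threshold

print(drift_warning([100, 120, 90, 110, 105], [300, 280, 310]))  # True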

Putting IE into Practice (2)

Some examples of adaptive user-driven IE for real-world applications

Learning Pinocchio  [Ciravegna 2001, IJCAI]

• Commercial tool for adaptive IE
• Based on the (LP)2 algorithm
• Adaptable to new scenarios/applications by:
  • Corpus tagging via SGML
  • A user with analyst's knowledge
• Applications
  • “Tombstone” data from resumes (Canadian company) (E)
  • IE from financial news (Kataweb) (I)
  • IE from classified ads (Kataweb) (I)
  • Information highlighting (intelligence)
  • (Many others I have lost track of…)
• A number of licenses released around the world for application development

http://tcc.itc.it/research/textec/tools-resources/learningpinocchio/

Application development time

Resumes:
• Scenario definition: 10 person-hours
• Tagging 250 texts: 14 person-hours
• Rule induction: 72 hours on a 450MHz computer
• Result validation: 4 hours

Contact: Alberto, [email protected]
http://tcc.itc.it/research/textec/tools-resources/learningpinocchio/

Amilcare: active annotation for the Semantic Web  [Ciravegna 2002, SIGIR]

Tool for adaptive IE from Web-related texts
• Based on (LP)2
• Uses Gate and Annie for preprocessing
• Effective on different text types
  • From free texts to rigid docs (XML, HTML, etc.)
• Integrated with
  • MnM (Open University), Ontomat (University of Karlsruhe), Gate (U Sheffield)
• Adapting Amilcare:
  • Define a scenario (ontology)
  • Define a corpus of documents
  • Annotate texts (via MnM, Gate, Ontomat)
  • Train the system
  • Tune the application (*)
  • Deliver the application

www.dcs.shef.ac.uk/~fabio/Amilcare.html


Non-Intrusive Active Learning

• Amilcare is specifically designed as a companion for text annotation
  • It can be inserted into the usual tagging environment
  • It works in the background
  • At some point it starts helping the user with tagging

Bootstrapping Annotation

Learning to annotate:
1. Bare text → the user annotates; Amilcare learns in the background
2. Bare text → the user and Amilcare annotate in parallel; the two annotations are compared, and missing cases and mistakes are used to trigger further learning

Active Annotation

• When Amilcare's rules reach a user-defined accuracy, the roles switch: Amilcare annotates the bare text, the user corrects, and the corrections are used to retrain
• WHY active annotation?
  • Focuses the slow and expensive user activity on uncovered cases
  • Avoids annotating covered cases
  • Validating extracted information is simpler & less error prone than annotating bare texts, speeding up the process of corpus annotation considerably

Is IE useful as Help for Tagging?

[Figure: four learning curves (Speaker, Stime, Etime, Location) plotting precision, recall and F-measure against the number of training examples (0-150).]

Conclusions on IE and Tagging

• Integration of IE (Amilcare + Gate) and ontology-based annotation tools (MnM and Ontomat)
• A first step towards a new generation of ontology editors
• Active learning can provide an interesting interaction modality
  • User friendly
  • Adaptable

  Tag       Texts needed for training   Prec   Rec
  stime              30                  91     78
  etime              20                  96     72
  location           30                  82     61
  speaker           100                  75     70

Summary and Conclusions

The summary of the summary: where do we go from now?


Summary

n Information extraction: n core enable technology for variety of next-generation information

servicesn Data integration agentsn Semantic Webn Knowledge Management

n Scalable IE systems must be adaptiven automatically learn extraction rules from examples

n Dozens of algorithms to choose fromn State of the art is 70-100% extraction accuracy (after hand-tuning!)

across numerous domains. n Is this good enough? Depends your application.

n Yeah, but does it really work?!n Several companies sell IE products.n SW ontology editors start including IE

Open issues, Future directions

• Knob-tuning will continue to deliver substantial incremental performance improvements
• A Grand Unified Theory of text “structuredness”, to automatically select the optimal IE algorithm for a given task

[Figure: dimensions of text structuredness: natural vs. machine-generated, formal vs. spontaneous, restricted vs. open topic.]

Open Issues, Future directions

• Resource discovery

[Figure: spidering heuristics and a form classifier crawl the Web, expanding example services into candidate and then discovered services for a given service category; a 3-level Bayesian network relates service categories, input types and terms (p[input|service], p[term|input]) over a service-category taxonomy and an input-data-type taxonomy.]

• Cross-document extraction

[Figure 3: calendar management as inter-document information extraction. A thread of emails between John, Mary and Alice (“Can we meet Tue at 3, and also Fri at noon?”; “Sorry, I'll be an hour late Tue.”; “Oops, I need to cancel on Fri.”; “John asked me to also come to your Tuesday meeting.”) drives a sequence of calendar updates: create the two meetings (28/08@15:00 and 01/09@12:00, who = {John, Mary}), shift the Tuesday meeting to 16:00, add Alice to its attendees, and delete the Friday meeting.]

Open issues, Future directions

• Adaptive only?
  • The systems mentioned are designed for non-experts
    • E.g. they do not require users to revise or contribute rules
  • Is this a limitation? What about experts, or even the whole spectrum of skills?
  • Future direction: making the best use of the user's knowledge
• Expressive enough?
  • What about filling templates?
    • Coreference (“ACME is producing parts for YMB Inc. The company will deliver…”)
    • Reasoning (if X retires then X leaves his/her company)