Methods for Domain-Independent Information Extraction from the Web

An Experimental Comparison

[Etzioni et al., 2004]

Outline

• Introduction
• Paper structure
• KnowItAll System
• Rule Learning (RL)
• Subclass Extraction (SE)
• List Extraction (LE)
• Experiments
• Conclusion



Introduction

• Information extraction from the web (~ web mining)
• A good prerequisite for this talk:

Information granularity, from fine to coarse

– 1 piece of information, 1 location (job posting)
– 10 pieces of information, 100 locations (HP digital camera)
– 1,000 pieces of information, 100,000 locations (cities of the world)



Paper structure

• Presentation of an existing web mining system
• Authors' intuition of a « recall problem »
• Proposal of three possible improvements
• Definition of a metric for the « quantification of success »
• Evaluation of the proposed improvements



KnowItAll System

• Autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web.

1. Focus (e.g.: city)

2. Pattern instantiation:
NP1 such as NP2 = « city such as »
Plural(NP1) such as NP2-List = « cities such as »

3. Search + passage retrieval:
… a city such as Sudbury, at north of the Great Lakes …
… cities such as Chicago, New York, Atlanta and Orlando …

4. Assessor: PMI-IR = Hits(Atlanta AND city) / Hits(Atlanta)
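The assessor ratio can be sketched in a few lines of Python. This is only an illustration of the formula on the slide: the `hits` callback and the hit counts in `FAKE_HITS` are made up for the example and stand in for a real search engine, which is not specified here.

```python
from typing import Callable

def pmi_ir(instance: str, discriminator: str, hits: Callable[[str], int]) -> float:
    """Return Hits(instance AND discriminator) / Hits(instance)."""
    instance_hits = hits(instance)
    if instance_hits == 0:
        # No evidence at all for the instance: avoid dividing by zero.
        return 0.0
    return hits(f"{instance} AND {discriminator}") / instance_hits

# Hypothetical hit counts, invented for this example only.
FAKE_HITS = {
    "Atlanta": 1000,
    "Atlanta AND city": 600,
    "Rice University": 500,
    "Rice University AND city": 120,
}

def fake_hits(query: str) -> int:
    """Stand-in for a search engine hit counter."""
    return FAKE_HITS.get(query, 0)

atlanta_score = pmi_ir("Atlanta", "city", fake_hits)
rice_score = pmi_ir("Rice University", "city", fake_hits)
```

With these invented counts, "Atlanta" scores higher than "Rice University", which is the behaviour the assessor relies on to rank candidate facts.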



Rule Learning (RL)

• Goal: increase the recall of KnowItAll

“city, such as Boston”
“mega-city such as Mexico”
“within a city, such as Rice University”

PMI(Boston, city) = 0.60
PMI(Mexico, city) = 0.56
PMI(Rice University, city) = 0.24

Patterns → Facts (with likelihood)


Rule Learning (RL)

Facts (most probable) → New patterns

of Boston College
the Boston Globe
a Boston Parking Space
headquartered in Boston
Crime in Mexico continues
Mexico City Hotels
headquartered in Mexico

headquartered in NP
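The step from seed contexts to a candidate pattern can be sketched as follows. This is a simplified illustration, not the system's actual learner: it only looks at the one or two words preceding a seed, and the function name `candidate_prefixes` and the `max_words` limit are assumptions made for the example.

```python
from collections import defaultdict

def candidate_prefixes(occurrences, max_words=2):
    """Map each prefix (the words right before a seed) to the seeds it was seen with."""
    seeds_by_prefix = defaultdict(set)
    for text, seed in occurrences:
        before, found, _ = text.partition(seed)
        if not found:
            continue
        words = before.lower().split()
        # Consider the last 1..max_words words in front of the seed.
        for size in range(1, min(max_words, len(words)) + 1):
            seeds_by_prefix[" ".join(words[-size:])].add(seed)
    return seeds_by_prefix

# The contexts from the slide, paired with the seed they contain.
occurrences = [
    ("of Boston College", "Boston"),
    ("the Boston Globe", "Boston"),
    ("a Boston Parking Space", "Boston"),
    ("headquartered in Boston", "Boston"),
    ("Crime in Mexico continues", "Mexico"),
    ("Mexico City Hotels", "Mexico"),
    ("headquartered in Mexico", "Mexico"),
]

# Keep only prefixes observed with more than one distinct seed.
shared = {
    prefix
    for prefix, seeds in candidate_prefixes(occurrences).items()
    if len(seeds) > 1
}
```

On this data the surviving prefixes are "in" and "headquartered in"; the longer one corresponds to the candidate rule « headquartered in NP ».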


Rule Learning (RL)

• Estimating rule quality

• Heuristic 1: discard every substring that appears for only a single seed.

• Heuristic 2: rule precision = (c + k) / (c + n + m)

– c is the number of times the rule matches a seed

– n is the number of times the rule matches a known negative example

– k / m is the prior estimate of the rule's precision (PMI tests)
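A short worked example of the smoothed precision estimate (c + k) / (c + n + m). The counts used below are invented for illustration; only the formula and the meaning of c, n, k and m come from the slide.

```python
def rule_precision(c: int, n: int, k: float, m: float) -> float:
    """Smoothed precision estimate: (c + k) / (c + n + m).

    c: matches on seeds, n: matches on known negative examples,
    k / m: prior precision estimate for the rule.
    """
    return (c + k) / (c + n + m)

# Invented counts: 8 seed matches, 2 negative matches, prior k / m = 1 / 2.
estimate = rule_precision(c=8, n=2, k=1, m=2)   # (8 + 1) / (8 + 2 + 2) = 0.75

# With no observations at all, the estimate falls back to the prior k / m.
prior_only = rule_precision(c=0, n=0, k=1, m=2)  # 1 / 2 = 0.5
```

The prior keeps a rule that matched a single seed and no negatives from being rated as perfectly precise.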



Subclass Extraction (SE)


Focus: scientist

Pattern: « scientist such as NP »

… scientist such as Arthur Noyes

… scientist such as Isaac Newton

… scientist such as Sandra Steingraber


Subclass Extraction (SE)

• Using found facts, apply the reverse pattern:

« N such as Arthur Noyes »

« chemist such as Arthur Noyes »

« biologist such as Sandra Steingraber »

• Assess candidate subclasses with the PMI test and a morphology test (e.g. the « -ist » suffix)
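The reverse-pattern step can be sketched as a regular-expression search over retrieved passages. This is a minimal sketch under stated assumptions: the passages are invented, the helper name `candidate_subclasses` is hypothetical, and only the morphology test is applied (the PMI test is left out).

```python
import re

def candidate_subclasses(passages, instances, suffix="ist"):
    """Find nouns N in « N such as <instance> » that pass the suffix test."""
    found = set()
    for instance in instances:
        pattern = re.compile(r"(\w+) such as " + re.escape(instance))
        for passage in passages:
            for match in pattern.finditer(passage):
                noun = match.group(1).lower()
                # Morphology test: keep only nouns ending in the given suffix.
                if noun.endswith(suffix):
                    found.add(noun)
    return found

# Invented passages for illustration.
passages = [
    "a chemist such as Arthur Noyes",
    "a biologist such as Sandra Steingraber",
    "a building such as Arthur Noyes House",
]
instances = ["Arthur Noyes", "Sandra Steingraber"]

subclasses = candidate_subclasses(passages, instances)
```

Here "building" matches the reverse pattern but is rejected by the suffix test, while "chemist" and "biologist" are kept as candidate subclasses of scientist.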



List Extraction (LE)


Find web pages containing a set (k = 4) of randomly chosen known facts.

« chicago AND boston AND mexico AND buenos aires »

repeat 5,000-10,000 times

In each document, try to find « a list »



Use a web page « wrapper », i.e. a classifier that identifies

positive nodes (elements of the list)

and negative nodes (all the remaining HTML markup)



• Quality of a new fact = number of lists in which it appears!
• PMI can also be used to assess the quality (LE+A)
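The query construction and the list-count score can be sketched as below. The wrapper itself is not shown: `extracted_lists` is assumed to be the output of that step, and its contents, like the function names, are invented for the example.

```python
import random
from collections import Counter

def build_query(known_facts, k=4, rng=None):
    """Join k randomly chosen known facts into one conjunctive query."""
    rng = rng or random.Random()
    return " AND ".join(rng.sample(sorted(known_facts), k))

def score_by_list_count(extracted_lists, known_facts):
    """Score each new item by the number of lists in which it appears."""
    known = {fact.lower() for fact in known_facts}
    counts = Counter()
    for items in extracted_lists:
        # Count an item at most once per list.
        for item in {entry.lower() for entry in items}:
            if item not in known:
                counts[item] += 1
    return counts

known_facts = ["chicago", "boston", "mexico", "buenos aires", "atlanta"]

# Hypothetical wrapper output: one list of items per retrieved page.
extracted_lists = [
    ["Chicago", "Boston", "Seattle", "Denver"],
    ["Boston", "Mexico", "Seattle"],
    ["Buenos Aires", "Seattle", "Denver", "Home"],
]

query = build_query(known_facts, rng=random.Random(0))
scores = score_by_list_count(extracted_lists, known_facts)
```

An item that shows up in many lists alongside known cities, like "seattle" here, gets a high score, while stray markup text such as "home" appears in few lists and scores low.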



Experiments

• How to calculate the recall improvement?
• Cannot calculate the true recall (unknown)
• Can use the size of the set of facts

• But how to make sure the set is pure?
– Sort facts by probability
– Use only high-quality facts (e.g.: prob > 0.9)
– Manually assess a sample
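The size-based metric can be sketched as a sort followed by a threshold cut. The facts and probabilities below are invented for illustration; only the 0.9 threshold comes from the slide, and the function name is hypothetical.

```python
def high_quality_facts(facts, threshold=0.9):
    """Sort (fact, probability) pairs by probability and keep those above the threshold."""
    ranked = sorted(facts, key=lambda pair: pair[1], reverse=True)
    return [fact for fact, prob in ranked if prob > threshold]

# Invented probabilities, for illustration only.
facts = [
    ("Boston", 0.98),
    ("Rice University", 0.35),
    ("Chicago", 0.95),
    ("Mexico", 0.91),
    ("Sudbury", 0.88),
]

kept = high_quality_facts(facts)
metric = len(kept)  # size of the high-quality set, used as a stand-in for recall
```

Comparing this count before and after adding RL, SE or LE gives a relative measure of recall improvement, provided a manually checked sample confirms that the kept set is clean.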




Conclusion

• KnowItAll is an information extraction system (coarse IE)
• The only input is a 1-word « focus » (city, scientist, movie, …)
• Pattern instantiation, passage retrieval, PMI-IR test

• RL, SE and LE improve extraction recall
• Overall, LE gives the greatest improvement
• SE was notably good on the « scientist » task


http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp
