Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison [Etzioni et al., 2004]

Jan 03, 2016

bridget-ireland

Page 1: Methods for Domain-Independent Information Extraction from the Web

Methods for Domain-Independent Information Extraction from the Web

An Experimental Comparison

[Etzioni et al., 2004]

Outline

• Introduction
• Paper structure
• KnowItAll System
• Rule Learning (RL)
• Subclass Extraction (SE)
• List Extraction (LE)
• Experiments
• Conclusion

Introduction

• Information extraction from the Web (~ Web mining)
• A good prerequisite for this talk:

Information granularity, from fine to coarse:

– 1 piece of information, 1 location (job posting)
– 10 pieces of information, 100 locations (HP digital camera)
– 1,000 pieces of information, 100,000 locations (cities of the world)

Paper structure

• Presentation of an existing Web mining system
• The authors' intuition of a « recall problem »
• Proposal of three possible improvements
• Definition of a metric for the « quantification of success »
• Evaluation of the proposed improvements

KnowItAll System

• Autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web.

1. Focus (e.g. city)

2. Pattern instantiation:
– NP1 such as NP2 = « city such as »
– Plural(NP1) such as NP2-List = « cities such as »

3. Search + passage retrieval:
– … a city such as Sudbury, at north of the Great Lakes …
– … cities such as Chicago, New York, Atlanta and Orlando …

4. Assessor (PMI-IR): Hits(Atlanta AND city) / Hits(Atlanta)
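
The pipeline on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not KnowItAll's implementation: the regular expression is a crude stand-in for a noun-phrase chunker, the function names are hypothetical, and the hit counts passed to `pmi_ir` are made-up numbers standing in for search engine results.

```python
import re

def instantiate_patterns(focus, plural):
    """Pattern instantiation: build search phrases for one focus class."""
    return [f"{focus} such as", f"{plural} such as"]

# Crude stand-in for a noun-phrase chunker: one or more capitalized words.
PROPER_NOUN = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def extract_candidates(passage, phrase):
    """Passage retrieval: return the noun phrases that follow the phrase."""
    list_regex = re.escape(phrase) + r" ((?:" + PROPER_NOUN + r"(?:, | and )?)+)"
    match = re.search(list_regex, passage)
    if not match:
        return []
    return re.findall(PROPER_NOUN, match.group(1))

def pmi_ir(hits_both, hits_instance):
    """Assessor: Hits(instance AND class) / Hits(instance)."""
    return hits_both / hits_instance if hits_instance else 0.0

phrases = instantiate_patterns("city", "cities")
passage = "... cities such as Chicago, New York, Atlanta and Orlando ..."
candidates = extract_candidates(passage, phrases[1])

# Hypothetical hit counts for "Atlanta AND city" and "Atlanta".
score = pmi_ir(hits_both=60, hits_instance=100)
```

A real system would obtain the passages and the hit counts from a search engine; here they are hard-coded so the sketch stays self-contained.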

Rule Learning (RL)

• Goal: increase the recall of KnowItAll

Patterns:

– “city, such as Boston”
– “mega-city such as Mexico”
– “within a city, such as Rice University”

Facts (with likelihood):

– PMI(Boston, city) = 0.60
– PMI(Mexico, city) = 0.56
– PMI(Rice University, city) = 0.24

Rule Learning (RL)

Facts (most probable) are searched for on the Web, which returns contexts such as:

– of Boston College
– the Boston Globe
– a Boston Parking Space
– headquartered in Boston
– Crime in Mexico continues
– Mexico City Hotels
– headquartered in Mexico

New pattern: « headquartered in NP »
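
As a rough illustration of how a new pattern can emerge from these contexts, the sketch below replaces each seed by NP, keeps the one or two words to its left, and retains only the patterns that occur with more than one seed. The function name, the two-word window and the restriction to left contexts are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict

def candidate_patterns(contexts, seeds, window=2):
    """Map each left-context pattern ('<words> NP') to the seeds it matched."""
    matches = defaultdict(set)
    for text in contexts:
        words = text.split()
        for seed in seeds:
            seed_words = seed.split()
            n = len(seed_words)
            for i in range(len(words) - n + 1):
                if words[i:i + n] != seed_words:
                    continue
                # Keep the 1..window words immediately before the seed.
                for k in range(1, window + 1):
                    if i - k >= 0:
                        prefix = " ".join(words[i - k:i]).lower()
                        matches[prefix + " NP"].add(seed)
    return matches

contexts = [
    "of Boston College",
    "the Boston Globe",
    "a Boston Parking Space",
    "headquartered in Boston",
    "Crime in Mexico continues",
    "Mexico City Hotels",
    "headquartered in Mexico",
]
patterns = candidate_patterns(contexts, seeds=["Boston", "Mexico"])

# Keep only patterns seen with more than one distinct seed.
shared = {p for p, matched in patterns.items() if len(matched) > 1}
```

On this toy input the surviving patterns are « headquartered in NP » and the much more general « in NP », which shows why learned rules still need a quality estimate.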

Rule Learning (RL)

• Estimating rule quality

• Heuristic 1: remove all substrings that appear with only a single seed.

• Heuristic 2: rule precision = $\frac{c + k}{c + n + m}$

– c is the number of times the rule matches a seed

– n is the number of times the rule matches a known negative example

– k / m is the prior estimate of the rule's precision (PMI tests)
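
The precision estimate of Heuristic 2 is easy to compute once the counts are known. The sketch below uses the symbols from the slide; the function name and the numbers in the example call are hypothetical and only show the arithmetic.

```python
def rule_precision(c, n, k, m):
    """Smoothed precision estimate (c + k) / (c + n + m).

    c: matches on seeds, n: matches on known negative examples,
    k / m: prior estimate of rule precision.
    """
    return (c + k) / (c + n + m)

# Hypothetical counts: 8 seed matches, 2 negative matches, prior k/m = 1/2.
estimate = rule_precision(c=8, n=2, k=1, m=2)

# With no evidence at all (c = n = 0) the estimate falls back to the prior k/m.
prior_only = rule_precision(c=0, n=0, k=1, m=2)
```

The prior term keeps a rule that matched only one or two seeds from receiving a precision of exactly 1.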

Subclass Extraction (SE)

• Goal: increase the recall of KnowItAll

Focus: scientist

Pattern: « scientist such as NP »

… scientist such as Arthur Noyes

… scientist such as Isaac Newton

… scientist such as Sandra Steingraber

Subclass Extraction (SE)

• Using the facts already found, apply the reverse pattern:

« N such as Arthur Noyes »

« chemist such as Arthur Noyes »

« biologist such as Sandra Steingraber »

• Assess candidate subclasses with the PMI test and a morphology test (e.g. the suffix « ist »)
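
A minimal sketch of the reverse pattern and the morphology test follows. The passages are invented for the example, the function name is hypothetical, and the PMI assessment is left out; only the « N such as <instance> » match and the suffix check are shown.

```python
import re

def candidate_subclasses(passages, instances, focus, suffix="ist"):
    """Find nouns N in 'N such as <instance>' that share the focus suffix."""
    found = set()
    for instance in instances:
        pattern = re.compile(r"\b([A-Za-z]+) such as " + re.escape(instance))
        for passage in passages:
            for noun in pattern.findall(passage):
                noun = noun.lower()
                # Morphology test: same suffix as the focus, but not the focus itself.
                if noun != focus and noun.endswith(suffix):
                    found.add(noun)
    return found

passages = [
    "a chemist such as Arthur Noyes",
    "a scientist such as Arthur Noyes",
    "people such as Arthur Noyes",
    "a biologist such as Sandra Steingraber",
]
subclasses = candidate_subclasses(
    passages,
    instances=["Arthur Noyes", "Sandra Steingraber"],
    focus="scientist",
)
```

The suffix test is obviously specific to classes such as « scientist »; other focus words would need a different morphological cue or none at all.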

List Extraction (LE)

• Goal: increase the recall of KnowItAll

Find web pages that contain a random set of k = 4 known facts:

« chicago AND boston AND mexico AND buenos aires »

Repeat 5,000-10,000 times.

In each retrieved document, try to find « a list ».
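
Building the seed queries can be sketched as below. The function name and the fact list are hypothetical, and the number of repetitions is kept tiny for the example; the slide mentions 5,000-10,000.

```python
import random

def seed_queries(facts, k=4, repetitions=5, rng=None):
    """Build `repetitions` conjunctive queries of k randomly chosen facts."""
    rng = rng or random.Random()
    return [" AND ".join(rng.sample(facts, k)) for _ in range(repetitions)]

known_facts = ["chicago", "boston", "mexico", "buenos aires", "atlanta", "orlando"]
queries = seed_queries(known_facts, k=4, repetitions=3, rng=random.Random(0))
```

Each query would then be sent to a search engine, and the returned documents passed on to the list-finding step.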

List Extraction (LE)

Use a web page « wrapper », i.e. a classifier that identifies:

– positive nodes (elements of the list)

– negative nodes (all the remaining HTML markup)
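
One very simple way to approximate such a wrapper is to treat the path of enclosing tags as the only feature: text nodes that share a tag path with at least two seeds are positive, everything else is negative. This is an assumption made for illustration, not the wrapper induction used in the paper, and the class and function names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))

def extract_list(html, seeds, min_seeds=2):
    """Return the text of all nodes whose tag path also holds >= min_seeds seeds."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    wanted = {s.lower() for s in seeds}
    seeds_per_path = defaultdict(set)
    for path, text in collector.nodes:
        if text.lower() in wanted:
            seeds_per_path[path].add(text.lower())
    positive_paths = {p for p, s in seeds_per_path.items() if len(s) >= min_seeds}
    return [text for path, text in collector.nodes if path in positive_paths]

page = (
    "<html><body><h1>Big cities</h1><ul><li>Chicago</li><li>Boston</li>"
    "<li>Atlanta</li><li>Orlando</li></ul><p>Contact us</p></body></html>"
)
items = extract_list(page, seeds=["chicago", "boston"])
```

Here the two seeds both sit under `html/body/ul/li`, so the other `li` items are returned as new candidates while the heading and the footer paragraph are ignored.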

List Extraction (LE)

• Quality of a new fact = number of lists in which it appears
• PMI can also be used to assess the quality (LE+A)
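
The list-count score amounts to a frequency count over the extracted lists. In the sketch below the lists are invented examples and the function name is hypothetical.

```python
from collections import Counter

def score_by_list_count(extracted_lists):
    """Score each candidate by the number of distinct lists it appears in."""
    counts = Counter()
    for items in extracted_lists:
        # A set ensures a candidate repeated within one list counts once.
        counts.update({item.lower() for item in items})
    return counts

extracted_lists = [
    ["Chicago", "Boston", "Atlanta"],
    ["Boston", "Atlanta", "Contact us"],
    ["Atlanta", "Atlanta", "Orlando"],
]
scores = score_by_list_count(extracted_lists)
```

Noise such as « Contact us » tends to show up in few lists and therefore receives a low score, which is the intuition behind this measure.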

Experiments

• How to measure the improvement in recall?
• The true recall cannot be calculated (it is unknown)
• The size of the set of extracted facts can be used instead

• But how to make sure the set is pure?
– Sort facts by probability
– Use only high-quality facts (e.g. prob > 0.9)
– Manually assess a sample
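
The evaluation procedure can be sketched as a filter followed by a random sample for manual checking. The probabilities below are invented for the example and the function names are hypothetical.

```python
import random

def high_quality_facts(facts_with_prob, threshold=0.9):
    """Sort facts by probability and keep those above the threshold."""
    ranked = sorted(facts_with_prob, key=lambda fp: fp[1], reverse=True)
    return [fact for fact, prob in ranked if prob > threshold]

def sample_for_manual_check(facts, size, rng=None):
    """Draw a random sample of facts to be assessed by hand."""
    rng = rng or random.Random()
    return rng.sample(facts, min(size, len(facts)))

# Hypothetical (fact, probability) pairs.
facts_with_prob = [
    ("Boston", 0.97),
    ("Rice University", 0.24),
    ("Chicago", 0.95),
    ("Mexico", 0.91),
]
kept = high_quality_facts(facts_with_prob)
recall_proxy = len(kept)  # size of the high-quality set
to_review = sample_for_manual_check(kept, size=2, rng=random.Random(0))
```

Comparing `recall_proxy` between the baseline and each extension then gives a relative measure of recall, provided the manual sample confirms that the kept set is mostly correct.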

Conclusion

• KnowItAll is an information extraction system (coarse IE)
• The only input is a one-word « focus » (city, scientist, movie, …)
• Pattern instantiation, passage retrieval, PMI-IR test

• RL, SE and LE improve extraction recall
• Overall, LE gives the greatest improvement
• SE was notably good on the « scientist » task

Conclusion

http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp