Page 1: Character-Level Analysis of Semi-Structured Documents for Set Expansion Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon.

Character-Level Analysis of Semi-Structured Documents for Set Expansion

Richard C. Wang and William W. Cohen

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Summary

We illustrated:
1. the construction of character-based wrappers used in SEAL
2. a method to extend SEAL to learn binary relational concepts

We showed that:
1. character-based wrappers perform better than HTML-based wrappers
2. binary SEAL has good performance

Background – SEAL

SEAL: Set Expander for Any Language (Wang & Cohen, ICDM 2007)

An example of set expansion:

Given an input query (seeds): { survivor, amazing race }

The output answer is: { american idol, big brother, ... }

Features

Independent of human and markup language:
  Supports seeds in English, Chinese, Japanese, ...
  Accepts documents in HTML, XML, SGML, TeX, …

Does not require pre-annotated training data:
  Utilizes a readily available corpus: the World Wide Web

Research contributions:
  Automatically construct wrappers for extracting candidate items
  Rank candidates using random walk

SEAL’s Architecture

Fetcher: Download web pages containing all seeds

Extractor: Learn and construct wrappers

Ranker: Rank candidate items using Random Walk

Example input: Canon, Nikon, Olympus

Example output: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …

Wrapper Learner

Current WL only learns unary relations
  e.g., x is a mayor
  A unary wrapper consists of a pair of left (L) and right (R) context strings
  Extracts all strings between L and R

Extended WL learns binary relations
  e.g., x is the mayor of city y
  A binary wrapper has an additional middle (M) context string
  Extracts string pairs between L and M, and between M and R
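The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `extract_unary` and `extract_binary`, the sample documents, and the choice to scan left to right and take the nearest following context are all hypothetical and are not taken from SEAL itself.

```python
def extract_unary(document, left, right):
    """Return every string that sits between a `left` and the next `right` context."""
    if not left or not right:
        raise ValueError("context strings must be non-empty")
    items, pos = [], 0
    while True:
        start = document.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = document.find(right, start)
        if end == -1:
            break
        items.append(document[start:end])
        # Resume at the right context so it may overlap the next left context.
        pos = end
    return items


def extract_binary(document, left, middle, right):
    """Return (x, y) pairs: x lies between L and M, y lies between M and R."""
    if not left or not middle or not right:
        raise ValueError("context strings must be non-empty")
    pairs, pos = [], 0
    while True:
        start = document.find(left, pos)
        if start == -1:
            break
        x_start = start + len(left)
        mid = document.find(middle, x_start)
        if mid == -1:
            break
        y_start = mid + len(middle)
        end = document.find(right, y_start)
        if end == -1:
            break
        pairs.append((document[x_start:mid], document[y_start:end]))
        pos = end
    return pairs


# Made-up documents for illustration only.
unary_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li>"
binary_doc = "<p>Alice : Springfield</p><p>Bob : Shelbyville</p>"

print(extract_unary(unary_doc, "<li>", "</li>"))
print(extract_binary(binary_doc, "<p>", " : ", "</p>"))
```

Because the wrapper is just a set of context strings, the same functions work on any markup or none at all, which is the point of a character-level approach.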

Unary Relation Wrapper Construction

Real Unary Wrappers

Given seeds: Ford, Nissan, Toyota

Examples of wrappers and extractions:

Mock Unary Example

Given seeds: Ford, Nissan, Toyota

Example document written in an unknown mark-up language:

Context tries for mock example:

Constructed unary wrappers:
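As a rough stand-in for the idea of wrapper construction, the sketch below finds left and right contexts shared by one occurrence of every seed. It is a brute-force simplification that compares every combination of seed occurrences directly; it does not build context tries, and it should not be read as SEAL's actual procedure. The function names and the sample document are hypothetical.

```python
from itertools import product


def occurrences(document, seed):
    """Yield the start index of every occurrence of `seed` in `document`."""
    pos = document.find(seed)
    while pos != -1:
        yield pos
        pos = document.find(seed, pos + 1)


def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = []
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return "".join(prefix)


def learn_unary_wrappers(document, seeds):
    """Return (left, right) pairs that bracket at least one occurrence of every seed."""
    spans = [[(i, i + len(s)) for i in occurrences(document, s)] for s in seeds]
    wrappers = set()
    # Pick one occurrence per seed and keep the longest contexts they all share.
    for combo in product(*spans):
        # Left contexts are compared right-to-left, so reverse them first.
        lefts = [document[:start][::-1] for start, _ in combo]
        rights = [document[end:] for _, end in combo]
        left = common_prefix(lefts)[::-1]
        right = common_prefix(rights)
        if left and right:
            wrappers.add((left, right))
    return wrappers


# Made-up document in an invented markup, for illustration only.
mock_doc = "<<Ford>> makes cars. <<Nissan>> too! <<Toyota>> also; <<Honda>> as well."
print(learn_unary_wrappers(mock_doc, ["Ford", "Nissan", "Toyota"]))
# {('<<', '>> ')}
```

The learned pair also brackets "Honda", which is not a seed; that is how a wrapper built from a few seeds yields new candidate items.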

Unary SEAL Evaluation

Metric: Mean Average Precision

Dataset: 36 datasets (Wang & Cohen, ICDM 2007)

Evaluated on 5 types of wrappers
  Type 1 is least strict (SEAL’s default)
  Type 5 is most strict, yet less strict than any HTML wrapper

Result: stricter wrappers perform worse
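For readers unfamiliar with the metric, here is a minimal sketch of mean average precision using the common textbook definition. The exact variant used in the evaluation is not stated on the slide, so the normalization by the number of correct items is an assumption, and the function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of correct items."""
    hits, total, seen = 0, 0.0, set()
    for rank, item in enumerate(ranked, start=1):
        # Count each correct item once, at the rank where it first appears.
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a ranking `["a", "x", "b"]` with correct items `{"a", "b"}` scores (1/1 + 2/3) / 2, about 0.83: the wrong item at rank 2 lowers the precision credited to the second hit.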

Binary Wrapper Construction

Keep track of all middle contexts:

In the unary code, replace Intersect with:
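One way to picture tracking middle contexts is sketched below. It is a hypothetical illustration, not the slide's code: it groups occurrences of each seed pair by the text between x and y, keeps only the middles shared by every pair, and then takes the longest common left and right contexts. The names, the `max_middle` cutoff, the forward-only (x before y) ordering, and the sample document are all assumptions.

```python
import re
from itertools import product


def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = []
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return "".join(prefix)


def learn_binary_wrappers(document, seed_pairs, max_middle=20):
    """Return (left, middle, right) triples shared by every (x, y) seed pair."""
    per_pair = []
    for x, y in seed_pairs:
        spans_by_middle = {}
        for match in re.finditer(re.escape(x), document):
            y_start = document.find(y, match.end())
            # Skip x occurrences with no y shortly after them.
            if y_start == -1 or not 0 < y_start - match.end() <= max_middle:
                continue
            middle = document[match.end():y_start]
            span = (match.start(), y_start + len(y))
            spans_by_middle.setdefault(middle, []).append(span)
        per_pair.append(spans_by_middle)
    if not per_pair:
        return set()
    # Only middle contexts seen for every seed pair can form a wrapper.
    shared = set(per_pair[0]).intersection(*per_pair[1:])
    wrappers = set()
    for middle in shared:
        for combo in product(*(spans[middle] for spans in per_pair)):
            left = common_prefix([document[:s][::-1] for s, _ in combo])[::-1]
            right = common_prefix([document[e:] for _, e in combo])
            if left and right:
                wrappers.add((left, middle, right))
    return wrappers


# Made-up document for illustration only.
pairs_doc = "<p>Alice : Springfield</p>,<p>Bob : Shelbyville</p>;<p>Carol : Ogdenville</p>."
seed_pairs = [("Alice", "Springfield"), ("Bob", "Shelbyville")]
print(learn_binary_wrappers(pairs_doc, seed_pairs))
# {('<p>', ' : ', '</p>')}
```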

Real Binary Wrappers

Binary SEAL Evaluation

Relational Datasets
  Surveyed more than a dozen
  Randomly selected five:

Bootstrap results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)

[Figure: Performance vs. Wrapper Types. x-axis: Wrapper Types 1 to 5 (1 is least strict); y-axis: Mean Average Precision (%), 50 to 95; series: "- Bootstrap" and "+ Bootstrap".]