Character-Level Analysis of Semi-Structured Documents for Set Expansion
Richard C. Wang and William W. Cohen
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Dec 21, 2015

Transcript
Page 1

Character-Level Analysis of Semi-Structured Documents for Set Expansion

Richard C. Wang and William W. Cohen

Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA

Page 2

Summary

We illustrated:
1. the construction of character-based wrappers used in SEAL
2. a method to extend SEAL to learn binary relational concepts

We showed that:
1. character-based wrappers perform better than HTML-based wrappers
2. binary SEAL has good performance

Page 3

Background – SEAL

Set Expander for Any Language (Wang & Cohen, ICDM 2007)

An example of set expansion:
Given an input query (seeds): { survivor, amazing race }
The output answer is: { american idol, big brother, ... }

Page 4

Features
  Independent of human and markup language
    Supports seeds in English, Chinese, Japanese, ...
    Accepts documents in HTML, XML, SGML, TeX, …
  Does not require pre-annotated training data
    Utilizes a readily available corpus: the World Wide Web

Research contributions
  Automatically construct wrappers for extracting candidate items
  Rank candidates using random walk

Page 5

SEAL’s Architecture

Fetcher: Download web pages containing all seeds
Extractor: Learn and construct wrappers
Ranker: Rank candidate items using Random Walk

Example input: Canon, Nikon, Olympus
Example output: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
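The Ranker step can be illustrated with a small random walk with restart. The slides do not give the graph layout or the walk parameters, so everything below is an assumption made for illustration: the graph shape (seeds linked to documents, documents to wrappers, wrappers to extracted candidates), the names `random_walk_scores`, `restart`, and `steps`, and the toy node labels such as `doc1`, `wrapper1`, and `menu`.

```python
def random_walk_scores(edges, seeds, restart=0.15, steps=50):
    """Score nodes by a random walk with restart over an undirected graph.

    Hypothetical helper: the graph shape and parameters are assumptions,
    not taken from the slides.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # The walker starts (and restarts) uniformly on the seed nodes.
    start = {s: 1.0 / len(seeds) for s in seeds}
    scores = dict(start)
    for _ in range(steps):
        # Restart mass goes back to the seeds; the rest spreads to neighbors.
        nxt = {node: restart * start.get(node, 0.0) for node in neighbors}
        for node, mass in scores.items():
            share = (1.0 - restart) * mass / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        scores = nxt
    return scores

# Toy graph: two seeds, two documents, one wrapper per document.
edges = [
    ("canon", "doc1"), ("nikon", "doc1"),
    ("canon", "doc2"), ("nikon", "doc2"),
    ("doc1", "wrapper1"), ("doc2", "wrapper2"),
    ("wrapper1", "pentax"), ("wrapper1", "sony"),
    ("wrapper2", "pentax"), ("wrapper2", "menu"),
]
scores = random_walk_scores(edges, seeds=["canon", "nikon"])
candidates = ["pentax", "sony", "menu"]
ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
# "pentax" ranks first because both wrappers extract it.
```

In this toy graph a candidate reached through more wrappers collects more probability mass, which is the intuition behind ranking candidates with a walk rather than a raw count.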

Page 6

Wrapper Learner
  Current WL only learns unary relations
    e.g., x is a mayor
    A unary wrapper consists of a pair of left (L) and right (R) context strings
    Extracts all strings between L and R
  Extended WL learns binary relations
    e.g., x is the mayor of city y
    A binary wrapper has an additional middle (M) context string
    Extracts string pairs between L and M, and between M and R
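As a minimal sketch of what applying such wrappers could look like, the snippet below treats L, M, and R as literal strings and extracts whatever sits between them. The function names `apply_unary` and `apply_binary` and the two mock documents are hypothetical; SEAL's actual extraction code is not shown in the slides.

```python
import re

def apply_unary(left, right, text):
    """Extract every string that appears between `left` and `right`."""
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, text)

def apply_binary(left, middle, right, text):
    """Extract (x, y) pairs: x sits between L and M, y between M and R."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return re.findall(pattern, text)

# Mock documents, made up for illustration.
unary_doc = "<li>Ford</li><li>Nissan</li><li>Toyota</li><li>Honda</li>"
cars = apply_unary("<li>", "</li>", unary_doc)
# cars == ['Ford', 'Nissan', 'Toyota', 'Honda']

binary_doc = "[Tokyo|Japan][Paris|France]"
pairs = apply_binary("[", "|", "]", binary_doc)
# pairs == [('Tokyo', 'Japan'), ('Paris', 'France')]
```

Because `re.findall` consumes the right context of each match, a wrapper whose right context overlaps the next item's left context would need a lookahead for R instead.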

Page 7

Unary Relation Wrapper Construction

Page 8

Real Unary Wrappers

Given seeds: Ford, Nissan, Toyota
Examples of wrappers and extractions:

Page 9

Mock Unary Example

Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:

Page 10

Context tries for mock example:

Constructed unary wrappers:
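The tries and the resulting wrappers on this slide were figures and are not preserved in the transcript. As a rough illustration of the idea, the sketch below builds unary wrappers by brute force rather than with tries: it picks one occurrence of each seed, takes the longest common suffix of their left contexts and the longest common prefix of their right contexts, and keeps the pair when both are non-empty. The function names and the mock document are assumptions, and this simplification is not the trie-based procedure used in SEAL.

```python
from itertools import product
from os.path import commonprefix

def occurrences(seed, text):
    """Yield (left_context, right_context) for each occurrence of `seed`."""
    start = text.find(seed)
    while start != -1:
        yield text[:start], text[start + len(seed):]
        start = text.find(seed, start + 1)

def common_suffix(strings):
    """Longest string that every input ends with."""
    return commonprefix([s[::-1] for s in strings])[::-1]

def build_unary_wrappers(seeds, text):
    """Brute-force sketch: one (L, R) pair per combination of occurrences."""
    per_seed = [list(occurrences(s, text)) for s in seeds]
    wrappers = set()
    for combo in product(*per_seed):
        left = common_suffix([l for l, _ in combo])
        right = commonprefix([r for _, r in combo])
        if left and right:  # both contexts must be non-empty
            wrappers.add((left, right))
    return wrappers

# Mock document in a made-up markup; "Ford" also appears once in plain text.
doc = "a1<<Ford>>x9 b2<<Nissan>>q4 c3<<Toyota>>z7 d4<<Honda>>k2 see Ford."
wrappers = build_unary_wrappers(["Ford", "Nissan", "Toyota"], doc)
# wrappers == {('<<', '>>')}
```

The stray "Ford" at the end of the mock document shares no left context with the other seeds, so the combinations that use it are discarded.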

Page 11

Unary SEAL Evaluation

Metric: Mean Average Precision
Dataset: 36 datasets (Wang & Cohen, ICDM 2007)
Evaluated on 5 types of wrappers
  Type 1 is least strict (SEAL’s default)
  Type 5 is most strict, but still less strict than any HTML wrapper
Result: stricter wrappers perform worse
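The slides name the metric without spelling it out. A minimal sketch under the usual definition: the average precision of one ranked list is the sum of the precision at each rank where a new correct item appears, divided by the number of correct items, and Mean Average Precision is the mean over all lists. The function names and the toy data below are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of correct items."""
    relevant = set(relevant)
    seen = set()
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        # Count each correct item only the first time it appears.
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy data, made up for illustration.
run_a = (["american idol", "cnn", "big brother"],
         {"american idol", "big brother"})
run_b = (["menu", "pentax"], {"pentax"})
ap_a = average_precision(*run_a)   # (1/1 + 2/3) / 2
ap_b = average_precision(*run_b)   # (1/2) / 1
map_score = mean_average_precision([run_a, run_b])
```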

Page 12

Binary Wrapper Construction
Keep track of all middle contexts:
In the unary code, replace Intersect with:
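The slide's pseudocode is not reproduced in this transcript. The following is only a minimal brute-force sketch of the idea of tracking middle contexts, not the trie-based `Intersect` replacement itself: for each seed pair it records the left, middle, and right contexts, and keeps a wrapper only when every seed pair shares the same non-empty middle. The function names, the `max_middle` limit, and the mock document are all assumptions.

```python
from itertools import product
from os.path import commonprefix

def pair_occurrences(x, y, text, max_middle=20):
    """Yield (left, middle, right) where x is followed closely by y.

    Simplification: only the first y after each x is considered.
    """
    i = text.find(x)
    while i != -1:
        after_x = i + len(x)
        j = text.find(y, after_x)
        if j != -1 and j - after_x <= max_middle:
            yield text[:i], text[after_x:j], text[j + len(y):]
        i = text.find(x, i + 1)

def build_binary_wrappers(seed_pairs, text):
    """Brute-force sketch: one (L, M, R) triple per combination."""
    per_pair = [list(pair_occurrences(x, y, text)) for x, y in seed_pairs]
    wrappers = set()
    for combo in product(*per_pair):
        middles = {m for _, m, _ in combo}
        # All seed pairs must share one non-empty middle context.
        if len(middles) != 1 or "" in middles:
            continue
        left = commonprefix([l[::-1] for l, _, _ in combo])[::-1]
        right = commonprefix([r for _, _, r in combo])
        if left and right:
            wrappers.add((left, middles.pop(), right))
    return wrappers

# Mock document in a made-up markup.
doc = "r1[Tokyo => Japan]p r2[Paris => France]q r3[Cairo => Egypt]s"
seed_pairs = [("Tokyo", "Japan"), ("Paris", "France")]
binary_wrappers = build_binary_wrappers(seed_pairs, doc)
# binary_wrappers == {('[', ' => ', ']')}
```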

Page 13

Real Binary Wrappers

Page 14
Page 15

Binary SEAL Evaluation

Relational Datasets
  Surveyed more than a dozen
  Randomly selected five:

Bootstrap results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)

Page 16

Figure: Performance vs. Wrapper Types. x-axis: Wrapper Types 1 to 5 (1 is least strict); y-axis: Mean Average Precision (%), ticks from 50 to 95; series: - Bootstrap, + Bootstrap.