Character-Level Analysis of Semi-Structured Documents for Set Expansion
Richard C. Wang and William W. Cohen
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Dec 21, 2015
Summary
We illustrated:
  1. the construction of character-based wrappers used in SEAL
  2. a method to extend SEAL to learn binary relational concepts
We showed that:
  1. character-based wrappers perform better than HTML-based wrappers
  2. binary SEAL has good performance
Background – SEAL
Set Expander for Any Language (Wang & Cohen, ICDM 2007)
An example of set expansion
Given an input query (seeds):
{ survivor, amazing race }
The output answer is: { american idol, big brother, ... }
Features
  Independent of human language and markup language
    Supports seeds in English, Chinese, Japanese, ...
    Accepts documents in HTML, XML, SGML, TeX, …
  Does not require pre-annotated training data
    Utilizes a readily available corpus: the World Wide Web
Research contributions
  Automatically construct wrappers for extracting candidate items
  Rank candidates using random walk
SEAL’s Architecture
Fetcher: Download web pages containing all seeds
Extractor: Learn and construct wrappers
Ranker: Rank candidate items using Random Walk
Example from the architecture figure:
  Seeds: Canon, Nikon, Olympus
  Expanded set: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …
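As a rough illustration of the Ranker step, the sketch below scores nodes with a random walk with restart over a small graph. The graph layout (seeds and candidates linked to the wrappers that extract them), the wrapper names `w1` and `w2`, the restart probability, and the function name are all assumptions made for illustration; the slides do not say how SEAL builds or walks its graph.

```python
from collections import defaultdict

def random_walk_with_restart(edges, seeds, restart=0.15, iterations=50):
    """Score nodes by a random walk that keeps restarting at the seed nodes."""
    # Build an undirected adjacency list from the edge pairs.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = list(neighbors)
    # The walk starts (and restarts) uniformly over the seeds.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # Each node passes its remaining mass evenly to its neighbors.
            share = (1.0 - restart) * score[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        score = nxt
    return score

# Hypothetical graph: wrapper w1 extracts five items, wrapper w2 extracts three.
edges = [("canon", "w1"), ("nikon", "w1"), ("olympus", "w1"),
         ("pentax", "w1"), ("sony", "w1"),
         ("canon", "w2"), ("nikon", "w2"), ("pentax", "w2")]
seeds = {"canon", "nikon", "olympus"}
scores = random_walk_with_restart(edges, seeds)
ranked = sorted(["sony", "pentax"], key=scores.get, reverse=True)
print(ranked)
```

In this toy graph a candidate extracted by two wrappers receives probability mass along two paths, so it ranks above a candidate extracted by only one.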
Wrapper Learner
Current WL learns only unary relations
  e.g., x is a mayor
  A unary wrapper consists of a pair of left (L) and right (R) context strings
  Extracts all strings between L and R
Extended WL learns binary relations
  e.g., x is the mayor of city y
  A binary wrapper has an additional middle (M) context string
  Extracts string pairs (x, y), with x between L and M and y between M and R
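The extraction behaviour of unary and binary wrappers can be sketched with regular expressions. This is a minimal sketch, not SEAL's implementation: the function names, the sample documents, and the context strings are invented, and treating the right context as a lookahead (so that it may overlap the next item's left context) is a design assumption.

```python
import re

def extract_unary(document, left, right):
    """Return every string that sits between the contexts `left` and `right`."""
    # The right context is a lookahead, so it is not consumed and may
    # overlap the left context of the following item.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return re.findall(pattern, document)

def extract_binary(document, left, middle, right):
    """Return (x, y) pairs: x between left and middle, y between middle and right."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)(?=" + re.escape(right) + r")")
    return re.findall(pattern, document)

# Invented documents and context strings, for illustration only.
unary_doc = "<li>Alice</li><li>Bob</li>"
print(extract_unary(unary_doc, "<li>", "</li>"))

binary_doc = "[Alice|Springfield][Bob|Shelbyville]"
print(extract_binary(binary_doc, "[", "|", "]"))
```

Because the contexts are plain character strings, nothing in the sketch depends on the document being HTML.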
Mock Unary Example
Given seeds: Ford, Nissan, Toyota
Example document written in an unknown mark-up language:
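The slide's own example document is not reproduced in this text. As a stand-in, the sketch below uses an invented document (`mock_doc`) and a simplified, brute-force way of finding character-level contexts: for one occurrence of each seed, take the longest common suffix of the text to the left and the longest common prefix of the text to the right. The function names and the procedure are assumptions for illustration, not the authors' algorithm.

```python
import re
from itertools import product

def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        i = 0
        while i < min(len(prefix), len(s)) and prefix[i] == s[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

def learn_unary_wrappers(document, seeds):
    """Return (left, right) context pairs that bracket one occurrence of every seed."""
    if not seeds:
        return set()
    # For each seed, list the text before and after each of its occurrences.
    contexts = []
    for seed in seeds:
        spans = [m.span() for m in re.finditer(re.escape(seed), document)]
        contexts.append([(document[:a], document[b:]) for a, b in spans])
    wrappers = set()
    # Try every way of picking one occurrence per seed (exponential; toy sizes only).
    for combo in product(*contexts):
        # Common suffix of the left texts = reversed common prefix of the reversed texts.
        left = common_prefix([l[::-1] for l, _ in combo])[::-1]
        right = common_prefix([r for _, r in combo])
        if left and right:
            wrappers.add((left, right))
    return wrappers

def apply_wrapper(document, left, right):
    """Extract every string between `left` and `right` (right context as lookahead)."""
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
    return re.findall(pattern, document)

# Invented mock document in a made-up markup; not the one shown on the slide.
mock_doc = "@row{Ford}@row{Honda}@row{Nissan}@row{Kia}@row{Toyota}@end"
seeds = ["Ford", "Nissan", "Toyota"]
for left, right in learn_unary_wrappers(mock_doc, seeds):
    print(repr(left), repr(right), apply_wrapper(mock_doc, left, right))
```

On this mock document the learned contexts are `@row{` and `}@`, and applying them also extracts the non-seed items Honda and Kia.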
Unary SEAL Evaluation
Metric – Mean Average Precision
Dataset – 36 datasets (Wang & Cohen, ICDM 2007)
Evaluated on 5 types of wrappers
  Type 1 is least strict – SEAL’s default
  Type 5 is most strict – less strict than any HTML wrapper
Result – stricter wrappers perform worse
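For reference, Mean Average Precision can be computed as in the sketch below. It follows the standard textbook definition and assumes each ranked list contains no duplicate items; the function names and the toy data are invented, and any dataset-specific handling used in the actual evaluation (for example, of synonymous items) is not modelled.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Normalise by the number of relevant items, so missed items lower the score.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy data, invented for illustration.
runs = [
    (["a", "x", "b"], {"a", "b"}),  # relevant items at ranks 1 and 3
    (["x", "a"], {"a"}),            # relevant item at rank 2
]
print(mean_average_precision(runs))
```

The first list scores (1/1 + 2/3) / 2 = 5/6 and the second scores 1/2, so the mean is 2/3.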
Binary Wrapper Construction
Keep track of all middle contexts:
In the unary code, replace Intersect with:
Binary SEAL Evaluation
Relational Datasets
  Surveyed more than a dozen
  Randomly selected five:
Bootstrap results ten times using iSEAL, an iterative version of SEAL (Wang & Cohen, ICDM 2008)