Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.


The Web is a vast repository of human knowledge

Basic premise

Diverse information spanning multiple verticals

• Wikipedia, Product, Business, People, …

Grand challenge

Mine the Web to build knowledge bases (KBs) of people, places, things, events,…

Name | Address | Phone
Chinese Mirch | 120 Lexington Ave (between 28th St & 29th St), New York, NY 10016 | (212) 532-3663

Camera | Aspect Ratio | Mega-pixels
Canon Powershot 600 | 4:3 | 0.5
Olympus D-300L | 4:3 | 0.8

Product Name | List Price | Sale Price
Apple iPod nano 8 GB Black (5th Generation) | $145.00 | $139.99

Name | Affiliation | # connections
Rajeev Rastogi | Yahoo! Labs Bangalore | 142

What did search look like in the past?

Search results of the future: Structured abstracts

yelp.com

babycenter

epicurious

answers.com

LinkedIn

webmd

New York Times

Gawker

Rank by price

Comparison shopping

Product near me

Topic entity pages

Celebrity Music Videos

Related Topics

Relevant multimedia content including music, videos, information from Wikipedia etc.

A topic based page automatically generated in real time

Up to the minute: Latest info using News feeds, blogs, Twitter, Flickr to stay up to date on Madonna

Noise

• Billions of pages with diverse structure, conflicting information, noise

Building KBs from the Web is a hard problem

yelp.com superpages.com

Page content/structure changes constantly

Old

New

• ~2% of sites change each day

KB creation pipeline

Acquire content from the Web

Extract structured data for entities from Web pages

Identify and integrate data for each entity

Roma Bistro Paris

Roma Bistro Paris

Content acquisition -> Information extraction -> Disambiguation & Integration

Reviews

IE example

Name, Address, Cuisine, Phone, Price

Name | Address | Phone
Chinese Mirch | 120 Lexington Ave, New York, NY 10016 | (212) 532-3663

Template-based Web pages

• From head/torso sites

• Pages have similar structure

• ~30% of crawled Web pages

• Information rich: 31% of search results

Hand-crafted pages

• Mainly from tail sites

• Pages have diverse structures

Browse pages

Similar-structured records

Unstructured text

Web extraction landscape

[Figure: Web extraction landscape]

Page types: Template-based pages; Hand-crafted, browse pages; Unstructured text

Signals: Structure (site structure, page structure); Content (content redundancy, content features); Context

Techniques: Wrapper; Record identification; Content matching; Machine learning models; Pattern-based

Cited work: [KWD 97], [MMK 99], RoadRunner [CMM 01], DEPTA [ZL 05], [GRST 10], HCRF [ZNWZM 06], MLN [YCWZZM 09], Snowball [AG 00]

Wrapper induction

• Technique for extraction from template-based pages

Learn: Website pages -> Cluster -> Sample pages -> Annotate pages -> Annotations -> Learn rules -> XPath rules

Extract: Website pages -> Apply rules -> Records

Monitor rules: detect site change

Clustering pages

• Group structurally similar pages using shingle signatures

Page shingle signature

Tags: html body @id textarea @id div /div /textarea … br/ /body /html

Tags -> Windows -> Hash -> Min

Example: window hashes 55, 5, 20, 30 give Shingle: 5

Page signature: Vector of shingles
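The Tags -> Windows -> Hash -> Min steps can be sketched as follows. This is a minimal illustration rather than the deployed system: the window size, the number of hash functions, the SHA-1 based hash, and all function names are assumptions chosen for the example.

```python
import hashlib

def window_hash(window, seed):
    """Deterministic 32-bit hash of one tag window under hash function number `seed`."""
    data = f"{seed}:{' '.join(window)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")

def shingle(tags, seed, k=4):
    """Slide a window of k tags over the sequence, hash each window, keep the minimum."""
    windows = [tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)] or [tuple(tags)]
    return min(window_hash(w, seed) for w in windows)

def page_signature(tags, num_hashes=8, k=4):
    """Page signature: a vector of shingles, one per hash function."""
    return tuple(shingle(tags, seed, k) for seed in range(num_hashes))

def signature_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two pages agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Tag sequences: the first is adapted from the slide, the others are made up.
page_a = "html body @id textarea @id div /div /textarea br/ /body /html".split()
page_b = list(page_a)
page_c = "html body table tr td /td /tr /table /body /html".split()

sig_a, sig_b, sig_c = (page_signature(p) for p in (page_a, page_b, page_c))
```

Pages whose signatures agree on most positions would be placed in the same cluster; the agreement threshold is a tuning choice that the slides do not specify.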

Rule learning

XPath generalization: /html/body/div/div/div/div/div/div/span[@class="tel"] -> //span[@class="tel"]

Learning robust XPaths

[Figure: candidate XPaths, from general to specialized (SPECIALIZE)]

//*

//h1, //span

//span[@class=tel], //*[@class=tel]

Most general XPath that matches all the annotated values and none of the un-annotated values

Use Apriori to generate candidate XPaths
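A small sketch of the selection criterion, using the limited XPath support in Python's standard `xml.etree.ElementTree`. The page, the candidate list, and the function name are made up for illustration, and the candidates are simply listed most general first rather than generated with Apriori.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for a template-based Web page.
PAGE = """<html><body>
<h1>Chinese Mirch</h1>
<div><span class="adr">120 Lexington Ave</span>
<span class="tel">(212) 532-3663</span></div>
</body></html>"""

def most_general_xpath(root, annotated, candidates):
    """Return the first candidate (ordered most general first) that selects
    all annotated nodes and no un-annotated node."""
    target = {id(node) for node in annotated}
    for xpath in candidates:
        if {id(node) for node in root.findall(xpath)} == target:
            return xpath
    return None

root = ET.fromstring(PAGE)
# The annotation: the node holding the phone number.
annotated = [n for n in root.iter("span") if n.text == "(212) 532-3663"]
# ElementTree paths are relative, hence the leading "."
candidates = [".//*", ".//span", ".//*[@class='tel']", ".//span[@class='tel']"]
best = most_general_xpath(root, annotated, candidates)
print(best)
```

On this page `.//*` and `.//span` also select un-annotated nodes, so the first candidate that qualifies is the attribute-based one.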

Detecting site changes

During Learn

For each cluster, store the page signature and extracted record for a small number of pages

Monitoring

Crawl the pages daily and compare page signatures and extracted records

[Timeline figure: Day 0, signature & record match, Day n, signature/record mismatch, Day m]
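The monitoring check can be sketched as a comparison of stored and freshly crawled snapshots. The data layout (a dict from URL to a signature and record pair) and the function name are assumptions for this example only.

```python
def detect_site_change(stored, crawled):
    """Compare snapshots saved during Learn with today's crawl.

    Both arguments map url -> (page_signature, extracted_record).
    Returns url -> reason for every monitored page that no longer matches.
    """
    mismatches = {}
    for url, (old_sig, old_rec) in stored.items():
        if url not in crawled:
            continue  # page not crawled today; nothing to compare
        new_sig, new_rec = crawled[url]
        if new_sig != old_sig:
            mismatches[url] = "signature mismatch"
        elif new_rec != old_rec:
            mismatches[url] = "record mismatch"
    return mismatches

# Hypothetical snapshots for two monitored pages of one cluster.
stored = {
    "site/p1": ((5, 17, 42), ("Chinese Mirch", "(212) 532-3663")),
    "site/p2": ((5, 17, 42), ("Tiffin Wallah", None)),
}
day_n = dict(stored)
day_m = {
    "site/p1": ((9, 17, 40), ("Chinese Mirch", "(212) 532-3663")),
    "site/p2": ((5, 17, 42), (None, None)),
}
```

A non-empty result on some day would mark the cluster's rules as candidates for relearning.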

Wrapper system deployed in Yahoo!

• 250M extractions from 200 sites (product, business)

• Avg num of clusters per site: 24

• Avg num of pages annotated per cluster: 1.6

[Chart: Average Precision / Recall (%), with Precision and Recall series; axis from 86 to 102]

Limitations of wrappers

• Won’t work across Web sites due to different page layouts

• Scaling to thousands of sites can be a challenge

– Need to learn a separate wrapper for each site

– Annotating example pages from thousands of sites can be time-consuming & expensive

Holy grail of IE research

• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site

• OK to annotate pages from a few sites initially to create training data




Key observation

yelp.com superpages.com

• Web sites contain redundant content (that is, pages for same entity)

Content matching approach

• Step 1: Populate seed database from few initial sites

Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St, New York, NY 10079

Seed DB

Wrappers


• Step 2: Match values in page with seed record values

[Figure labels: Seed DB (Name, Address), New site Web page]


• Step 3: Use matched values to extract records, expand seed database

New record:

Name | Address
21 Club | 21 W 52nd St, New York, NY 10019

[Figure labels: Seed DB, New site Web pages, Wrappers, New record]

Key challenge 1

• Diverse attribute value representations (impacts recall)

Name Address



Spelling error

Variant

Key challenge 2

• Noisy attribute value matches (impacts precision)

Name Address



Noisy match

Baseline similarity measure

• Use q-grams to handle spelling errors

Weak Similarity = Cosine-similarity between IDF-weighted q-grams

String | 3-grams
chinese mirch | {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch}
chinese mirrch | {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch}

• Weight of a q-gram (attribute-specific) = Sum of the IDFs of the words it appears in
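A minimal sketch of the weak similarity, assuming a particular IDF variant and a default weight of 1.0 for q-grams that span a word boundary or were never seen; neither choice comes from the slides. The function names are illustrative.

```python
import math
from collections import Counter

def qgrams(text, q=3):
    """Character q-grams of a lowercased string; spaces become '#', as on the slide."""
    s = "#".join(text.lower().split())
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_weights(values, q=3):
    """Weight of a q-gram = sum of the IDFs of the distinct words it appears in.
    `values` holds the values of one attribute (weights are attribute-specific)."""
    n = len(values)
    doc_freq = Counter(w for v in values for w in set(v.lower().split()))
    weights = Counter()
    for word, df in doc_freq.items():
        idf = math.log(1 + n / df)  # assumed IDF variant
        for gram in set(qgrams(word, q)):
            weights[gram] += idf
    return weights

def weak_similarity(a, b, weights=None, q=3, default=1.0):
    """Cosine similarity between the weighted q-gram vectors of two strings."""
    def vector(text):
        counts = Counter(qgrams(text, q))
        return {g: c * (weights.get(g, default) if weights else default)
                for g, c in counts.items()}
    va, vb = vector(a), vector(b)
    dot = sum(w * vb.get(g, 0.0) for g, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0

uniform = weak_similarity("chinese mirch", "chinese mirrch")
name_weights = qgram_weights(["chinese mirch", "tiffin wallah", "21 club"])
weighted = weak_similarity("chinese mirch", "chinese mirrch", name_weights)
```

With uniform weights the two example strings share 10 of their 11 and 12 q-grams, so the score is 10 / sqrt(11 * 12); the misspelling lowers the score without driving it to zero.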

Strong similarity

Address (Seed DB) | Address (Web site) | WS
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49

Strong similarity is defined between two sets of strings

1. Calculate the matching pattern between weakly similar pairs in the two sets

2. Pick matching patterns with sufficient “support”

3. Use only portions selected by the matching pattern in the final similarity calculation

Templatized content

Computing matching pattern

Seed DB value: 120 Lexington Avenue New York NY 10016

Web site value: 120 Lexington Ave (Between 28th And 29th St) New York NY 10016

[Figure: edges between matching words, weights 1 1 1 1 1 1]

1. Perform max-weight bipartite matching to find matching words

• Edge weight = Jaccard similarity over 3-grams

2. Form segments by grouping contiguous matching words

3. Assign each segment si a label

• 0 if non-matching

• j if matching segment s’j

Matching pattern: 103 103

Segments s1 s2 s3 have labels 1 0 3

Segments s’1 s’2 s’3 have labels 1 0 3
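Steps 1 and 2 can be sketched on the address example. Note the substitution: the slide calls for max-weight bipartite matching, while this sketch uses a greedy approximation (edges taken by decreasing weight) to stay within the standard library. Keeping words shorter than three characters whole and grouping only by matched status on the Web site string are further assumptions; label assignment (step 3) is not covered.

```python
from itertools import groupby

def grams(word, q=3):
    """Set of character q-grams; a word shorter than q is kept whole (assumption)."""
    return {word[i:i + q] for i in range(len(word) - q + 1)} or {word}

def jaccard(a, b):
    """Edge weight: Jaccard similarity over the 3-grams of two words."""
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def match_words(words_a, words_b):
    """Greedy stand-in for max-weight bipartite matching.
    Returns a dict: index in words_b -> matched index in words_a."""
    edges = sorted(
        ((jaccard(a, b), i, j)
         for i, a in enumerate(words_a) for j, b in enumerate(words_b)),
        key=lambda e: (-e[0], e[1], e[2]),
    )
    used_a, used_b, pairs = set(), set(), {}
    for weight, i, j in edges:
        if weight > 0 and i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs[j] = i
    return pairs

def segments(words_b, pairs):
    """Group contiguous words of the second string by matched / non-matched."""
    flagged = [(j in pairs, w) for j, w in enumerate(words_b)]
    return [(matched, " ".join(w for _, w in group))
            for matched, group in groupby(flagged, key=lambda t: t[0])]

seed = "120 lexington avenue new york ny 10016".split()
site = "120 lexington ave (between 28th and 29th st) new york ny 10016".split()
pairs = match_words(seed, site)
site_segments = segments(site, pairs)
```

Here "avenue" and "ave" share one of four distinct 3-grams, so their edge weight is 0.25, which is still enough to pair them; the parenthesized cross-street text forms the non-matching middle segment.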

Strong similarity score computation

Strong similarity: similarity between matching segments of values

Support of matching pattern: # distinct matching segments

Support(103 103) = 2

Strong similarity is computed only for matching patterns with sufficient support

Need for support of a matching pattern

Support(010 010) = 1

Hence Strong Similarity = Weak Similarity

Pruning noisy matches

• Match combinations of values in page

• Prune combinations that don’t match attribute values in any seed record

[Figure: page positions X1, X2, X3 with Name and Address matches marked ✓ ✓ ✗]

Apriori-style enumeration

Round 1:

<Name, X1> (sup=2)

<Addr, X2> (sup=2)

<Name, X3> (sup=2)

Round 2:

<Name, X1> <Addr, X2> (sup=2)

<Name, X3> <Addr, X2> (sup=0)

• Prune attribute position combinations with low support

– support = # pages in which values at positions match attribute values in a seed record
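The enumeration can be sketched as below. The pages and seed records are invented so that the supports reproduce the two rounds shown on the slide, and exact string equality stands in for the similarity measures discussed earlier; all names are illustrative.

```python
from itertools import combinations

def support(combo, pages, seed_records):
    """# pages in which a single seed record matches every (attribute, position) pair."""
    def page_matches(page):
        return any(
            all(pos in page and attr in rec and page[pos] == rec[attr]
                for attr, pos in combo)
            for rec in seed_records
        )
    return sum(page_matches(page) for page in pages)

def apriori_rounds(pages, seed_records, min_sup):
    """Return one dict per round mapping candidate combination -> support."""
    attrs = sorted({a for rec in seed_records for a in rec})
    positions = sorted({p for page in pages for p in page})
    candidates = [frozenset([(a, p)]) for a in attrs for p in positions]
    rounds, size = [], 1
    while candidates:
        scored = {c: support(c, pages, seed_records) for c in candidates}
        rounds.append(scored)
        frequent = {c for c, s in scored.items() if s >= min_sup}
        size += 1
        candidates, seen = [], set()
        for c1, c2 in combinations(sorted(frequent, key=sorted), 2):
            union = c1 | c2
            if len(union) != size or union in seen:
                continue
            # Each attribute and each position may appear only once.
            if len({a for a, _ in union}) < size or len({p for _, p in union}) < size:
                continue
            # Apriori pruning: every smaller sub-combination must be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, size - 1)):
                seen.add(union)
                candidates.append(union)
    return rounds

seed_records = [
    {"Name": "chinese mirch", "Addr": "120 lexington ave"},
    {"Name": "tiffin wallah", "Addr": "127 e 28th st"},
]
# X3 holds another restaurant's name (a noisy match), e.g. a "nearby" link.
pages = [
    {"X1": "chinese mirch", "X2": "120 lexington ave", "X3": "tiffin wallah"},
    {"X1": "tiffin wallah", "X2": "127 e 28th st", "X3": "chinese mirch"},
]
rounds = apriori_rounds(pages, seed_records, min_sup=2)
```

Position X3 survives round 1 because each of its values is the name of some seed record, but it is pruned in round 2 because no single seed record matches X3 and X2 together on the same page.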

Experimental results

Datasets

Attributes:

Restaurant | Bibliography
Name (core) | Title (core)
Address (core) | Author (core)
Phone | Source
Payment |
Cuisine |

Strong vs Weak similarity

• Extraction precision of WS and SS is comparable; precision increases with threshold

• Coverage of SS is steady with respect to threshold; coverage of WS drops at high thresholds

Strong similarity scores

SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs

String 1 | String 2 | WS | SS
980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1
1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 | 0.74

[Chart: Extraction Precision and Coverage vs. seed data size (Restaurant)]

Summary

• Web is a vast repository of human knowledge

• Building a (structured) knowledge base can improve search and help users find relevant information

• Key challenge: Unsupervised information extraction from Web pages

• Content redundancy on the Web can be used for unsupervised extraction with high precision

• Future work

– Handling numeric attributes, browse pages

– Detecting and integrating records for the same entity
