Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore

Mar 26, 2015

Page 1: Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.

Building Knowledge Bases from the Web

Rajeev Rastogi
Yahoo! Labs Bangalore

Page 2:

The Web is a vast repository of human knowledge

Basic premise

Page 3:

Diverse information spanning multiple verticals

• Wikipedia, Product, Business, People, …

Page 4:

Grand challenge

Mine the Web to build knowledge bases (KBs) of people, places, things, events,…

Name | Address | Phone
Chinese Mirch | 120 Lexington Ave (between 28th St & 29th St) New York, NY 10016 | (212) 532-3663

Camera | Aspect Ratio | Mega-pixels
Canon Powershot 600 | 4:3 | 0.5
Olympus D-300L | 4:3 | 0.8

Product Name | List Price | Sale Price
Apple iPod nano 8 GB Black (5th Generation) | $145.00 | $139.99

Name | Affiliation | # connections
Rajeev Rastogi | Yahoo! Labs Bangalore | 142

Page 5:

What did search look like in the past?

Page 6:

Search results of the future: Structured abstracts

yelp.com

babycenter

epicurious

answers.com

LinkedIn

webmd

New York Times

Gawker

Page 7:

Rank by price

Comparison shopping

Page 8:

Product near me

Page 9:

Topic entity pages

Celebrity Music Videos

Related Topics

Relevant multimedia content including music, videos, information from Wikipedia, etc.

A topic based page automatically generated in real time

Up to the minute: Latest info using News feeds, blogs, Twitter, Flickr to stay up to date on Madonna

Page 10:

Noise

• Billions of pages with diverse structure, conflicting information, noise

Building KBs from the Web is a hard problem

yelp.com superpages.com

Page 11:

Page content/structure changes constantly

Old

New

• ~2% of sites change each day

Page 12:

KB creation pipeline

Acquire content from the Web

Extract structured data for entities from Web pages

Identify and integrate data for each entity

Example entity: Roma Bistro Paris

Pipeline stages: Content acquisition, Information extraction, Disambiguation & Integration

Page 13:

IE example

Annotated fields on the page: Name, Address, Cuisine, Phone, Price, Reviews

Name | Address | Phone
Chinese Mirch | 120 Lexington Ave New York, NY 10016 | (212) 532-3663

Page 14:

Template-based Web pages

• From head/torso sites

• Pages have similar structure

• ~30% of crawled Web pages

• Information rich: 31% of search results

Page 15:

Hand-crafted pages

• Mainly from tail sites

• Pages have diverse structures

Page 16:

Browse pages

Similar-structured records

Page 17:

Unstructured text

Page 18:

Web extraction landscape

Features used, Structure: Site structure, Page structure
Features used, Content: Content Redundancy, Content Features, Context

Page types: Template-based pages; Hand-crafted, browse pages; Unstructured text

Techniques: Wrapper, Record Identification, Content Matching, Machine Learning Models, Pattern-based

Systems cited: Snowball [AG 00]; HCRF [ZNWZM 06]; MLN [YCWZZM 09]; RoadRunner [CMM 01]; DEPTA [ZL 05]; [KWD 97]; [MMK 99]; [GRST 10]

Page 19:

Web extraction landscape

Features used, Structure: Site structure, Page structure
Features used, Content: Content Redundancy, Content Features, Context

Page types: Template-based pages; Hand-crafted, browse pages; Unstructured text

Techniques: Wrapper, Record Identification, Content Matching, ML Models, Pattern-based

Page 20:

Wrapper induction

• Technique for extraction from template-based pages

Learn: Website pages, Cluster, Sample pages, Annotate Pages, Annotations, Learn Rules, XPath Rules

Extract: Website pages, Apply Rules, Records

Monitor Rules: Site change

Page 21:

Clustering pages

• Group structurally similar pages using shingle signatures

Page 22:

Page shingle signature

Tags: html body @id textarea @id div /div /textarea … br/ /body /html

Steps: Windows over the tag sequence, Hash each window, take the Min

Example: window hashes 55 5 20 30, Shingle: 5

Page signature: Vector of shingles
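
The shingle signature on this slide can be sketched roughly as follows. This is a minimal illustration, not the deployed system: the window size, the number of hash functions, the use of MD5 as the hash, and all function names are assumptions introduced here.

```python
import hashlib

def windows(tags, size=3):
    """Sliding windows of `size` consecutive tags (assumed window size)."""
    return [tuple(tags[i:i + size]) for i in range(max(len(tags) - size + 1, 1))]

def window_hash(window, seed):
    """Deterministic hash of one window; the seed selects the hash function."""
    data = (str(seed) + "|" + " ".join(window)).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")

def shingle_signature(tags, size=3, num_hashes=8):
    """One shingle per hash function: the minimum hash over all windows."""
    ws = windows(tags, size)
    return [min(window_hash(w, seed) for w in ws) for seed in range(num_hashes)]

def signature_overlap(sig_a, sig_b):
    """Fraction of signature positions on which two pages agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

page_tags = ["html", "body", "@id", "textarea", "@id", "div", "/div",
             "/textarea", "br/", "/body", "/html"]
signature = shingle_signature(page_tags)
```

Pages whose signatures agree on most positions would be placed in the same cluster; the agreement threshold is another parameter the slides do not specify.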

Page 23:

Rule learning

XPath Generalization

Specific: /html/body/div/div/div/div/div/div/span[@class="tel"]
Generalized: //span[@class="tel"]
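
A small sketch of why the generalized rule is more robust. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and no absolute paths, so both rules are written relative to the `html` root; the two toy pages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a layout change.
OLD_PAGE = ('<html><body><div><div>'
            '<span class="tel">(212) 532-3663</span>'
            '</div></div></body></html>')
NEW_PAGE = ('<html><body><section><div><div><p>'
            '<span class="tel">(212) 532-3663</span>'
            '</p></div></div></section></body></html>')

SPECIFIC = "./body/div/div/span[@class='tel']"   # full path from the root
GENERAL = ".//span[@class='tel']"                # any depth, keyed on the class

def extract(page, xpath):
    """Return the text of every element the rule selects."""
    return [e.text for e in ET.fromstring(page).findall(xpath)]

old_specific = extract(OLD_PAGE, SPECIFIC)
new_specific = extract(NEW_PAGE, SPECIFIC)
new_general = extract(NEW_PAGE, GENERAL)
```

The specific path stops matching once the nesting changes, while the generalized rule still finds the phone number.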

Page 24:

Learning robust XPaths

Candidate XPaths, from most general to most specialized (SPECIALIZE):

//*
//h1    //span
//span[@class=tel]    //*[@class=tel]

Most general XPath: the one that matches all the annotated values and none of the un-annotated values

Use Apriori to generate candidate XPaths
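
One way to picture the selection criterion is the sketch below. It enumerates candidates exhaustively for a single annotated element rather than using Apriori, and it relies on the XPath subset in Python's standard `xml.etree.ElementTree`; the toy page and all function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

def candidates(elem):
    """Candidate XPaths for an element, ordered from most to least general.

    Generality is approximated by the number of constraints (tag name plus
    attribute predicates); ties are broken alphabetically.
    """
    attrs = sorted(elem.attrib.items())
    out = []
    for tag in ("*", elem.tag):
        for r in range(len(attrs) + 1):
            for subset in combinations(attrs, r):
                preds = "".join(f"[@{k}='{v}']" for k, v in subset)
                cost = (tag != "*") + len(subset)
                out.append((cost, f".//{tag}{preds}"))
    return [xpath for _, xpath in sorted(out)]

def most_general_xpath(root, annotated):
    """First candidate that selects exactly the annotated elements."""
    target = {id(e) for e in annotated}
    for xpath in candidates(annotated[0]):
        if {id(e) for e in root.findall(xpath)} == target:
            return xpath
    return None

# Hypothetical page: the annotated value is the phone number span.
root = ET.fromstring(
    '<body><h1>Chinese Mirch</h1><div class="tel">Phone</div>'
    '<span class="addr">120 Lexington Ave</span>'
    '<span class="tel">(212) 532-3663</span></body>'
)
annotated = [e for e in root.iter("span") if e.get("class") == "tel"]
rule = most_general_xpath(root, annotated)
```

Here `//*`, `//span`, and `//*[@class=tel]` each select at least one un-annotated node, so the search settles on the `span` rule with the class predicate.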

Page 25:

Detecting site changes

During Learn

For each cluster, store the page signature and extracted record for a small number of pages

Monitoring

Crawl the pages daily and compare page signatures and extracted records

Day 0

Signature & Record Match

Day n

Signature / Record Mismatch

Day m
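
The monitoring comparison might look like the sketch below. The data layout (a dict from URL to a signature and record pair) and the exact-equality test are assumptions; a deployed monitor would likely tolerate small signature differences.

```python
def changed_pages(baseline, today):
    """URLs whose stored signature or extracted record no longer matches.

    Both arguments map url -> (page_signature, extracted_record).
    """
    changed = []
    for url, (signature, record) in baseline.items():
        current = today.get(url)
        if current is None or current[0] != signature or current[1] != record:
            changed.append(url)
    return changed

# Hypothetical snapshots taken during Learn (Day 0) and on a later crawl.
day_0 = {
    "site/page1": ((5, 17, 42), {"Name": "Chinese Mirch", "Phone": "(212) 532-3663"}),
    "site/page2": ((5, 17, 40), {"Name": "Tiffin Wallah", "Phone": None}),
}
day_m = {
    "site/page1": ((9, 17, 42), {"Name": "Chinese Mirch", "Phone": None}),
    "site/page2": ((5, 17, 40), {"Name": "Tiffin Wallah", "Phone": None}),
}
flagged = changed_pages(day_0, day_m)
```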

Page 26:

Wrapper system deployed in Yahoo!

• 250M extractions from 200 sites (product, business)
• Avg num of clusters per site: 24
• Avg num of pages annotated per cluster: 1.6

[Chart: Average Precision / Recall (%), series Precision and Recall, axis ticks from 86 to 102]

Page 27:

Limitations of wrappers

• Won’t work across Web sites due to different page layouts

• Scaling to thousands of sites can be a challenge
– Need to learn a separate wrapper for each site
– Annotating example pages from thousands of sites can be time-consuming & expensive

Page 28:

Holy grail of IE research

• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site

• OK to annotate pages from a few sites initially to create training data

Page 29:

Web extraction landscape

Features used, Structure: Site structure, Page structure
Features used, Content: Content Redundancy, Content Features, Context

Page types: Template-based pages; Hand-crafted, browse pages; Unstructured text

Techniques: Wrapper, Record Identification, Content Matching, ML Models, Pattern-based

Page 30:

Key observation

yelp.com superpages.com

• Web sites contain redundant content (that is, pages for same entity)

Page 31:

Content matching approach

• Step 1: Populate seed database from few initial sites

Seed DB:
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079

Wrappers

Page 32:

Content matching approach

• Step 2: Match values in page with seed record values

Seed DB:
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079

New site Web page

Page 33:

Content matching approach

• Step 3: Use matched values to extract records, expand seed database

Seed DB:
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079
21 Club | 21 W 52nd St New York, NY 10019 (New record)

New site Web pages

Wrappers

Page 34:

Key challenge 1

• Diverse attribute value representations (impacts recall)

Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079

Spelling error

Variant

Page 35:

Key challenge 2

• Noisy attribute value matches (impacts precision)

Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079

Noisy match

Page 36:

Baseline similarity measure

• Use q-grams to handle spelling errors

Weak Similarity = Cosine-similarity between IDF-weighted q-grams

String | 3-grams
chinese mirch | {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch}
chinese mirrch | {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch}

• Weight of a q-gram (attribute-specific) = Sum of the IDFs of the words it appears in
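
The baseline measure can be sketched as below. The q-gram extraction follows the slide (spaces become `#`, q = 3). The weighting is simplified: `weights` is a hypothetical dict of precomputed q-gram weights, and any q-gram missing from it gets weight 1.0, so calling the function without weights gives plain cosine similarity over q-gram sets.

```python
import math

def qgrams(s, q=3):
    """Set of character q-grams, with spaces replaced by '#'."""
    s = s.lower().replace(" ", "#")
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def weak_similarity(a, b, weights=None, q=3):
    """Cosine similarity between weighted q-gram vectors of two strings."""
    weights = weights or {}
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)

    def w(gram):
        return weights.get(gram, 1.0)

    dot = sum(w(g) ** 2 for g in grams_a & grams_b)
    norm = (math.sqrt(sum(w(g) ** 2 for g in grams_a))
            * math.sqrt(sum(w(g) ** 2 for g in grams_b)))
    return dot / norm if norm else 0.0

score = weak_similarity("chinese mirch", "chinese mirrch")
```

With uniform weights the two spellings share 10 of their 11 and 12 q-grams, so the score is 10 / sqrt(11 × 12), about 0.87. The sum-of-IDFs weights described on the slide would change this value.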

Page 37:

Strong similarity

Address (Seed DB) | Address (Web site) | WS
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49

Templatized content

Strong similarity is defined between two sets of strings
1. Calculate the matching pattern between weakly similar pairs in the two sets
2. Pick matching patterns with sufficient “support”
3. Use only portions selected by the matching pattern in the final similarity calculation

Page 38:

Computing matching pattern

120 Lexington Avenue New York NY 10016

120 Lexington Ave (Between 28th And 29th St) New York NY 10016

1 1 1 1 1 1

1. Perform max-weight bipartite matching to find matching words
• Edge weight = Jaccard similarity over 3-grams

2. Form segments by grouping contiguous matching words

3. Assign each segment si a label
• 0 if non-matching
• j if matching segment s’j

Segments: s1 s2 s3 with labels 1 0 3; s’1 s’2 s’3 with labels 1 0 3

Matching pattern: 103 103
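
The three steps can be sketched as follows. Two simplifications are assumptions of this sketch, not the slide's method: word matching is greedy (highest Jaccard score first) instead of a true max-weight bipartite matching, and a segment is any maximal run of matched or unmatched words. The 0.5 threshold and all function names are also made up for illustration.

```python
import re

def grams(word, q=3):
    """Character q-grams of a word; words shorter than q are kept whole."""
    return {word[i:i + q] for i in range(len(word) - q + 1)} or {word}

def jaccard(a, b):
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def tokens(s):
    return re.findall(r"[a-z0-9]+", s.lower())

def greedy_match(xs, ys, threshold=0.5):
    """Greedy stand-in for max-weight bipartite matching between words."""
    pairs = sorted(((jaccard(x, y), i, j)
                    for i, x in enumerate(xs) for j, y in enumerate(ys)),
                   key=lambda t: (-t[0], t[1], t[2]))
    match_x, match_y = {}, {}
    for weight, i, j in pairs:
        if weight < threshold:
            break
        if i not in match_x and j not in match_y:
            match_x[i], match_y[j] = j, i
    return match_x, match_y

def segment_ids(n, matched):
    """1-based segment number per word; a new segment starts whenever the
    matched/unmatched status flips."""
    ids, seg, prev = [], 0, None
    for i in range(n):
        cur = i in matched
        if cur != prev:
            seg += 1
        ids.append(seg)
        prev = cur
    return ids

def labels(ids, matched, other_ids):
    """Label per segment: 0 if non-matching, else the partner segment number."""
    out = []
    for seg in range(1, (ids[-1] if ids else 0) + 1):
        first = ids.index(seg)
        out.append(other_ids[matched[first]] if first in matched else 0)
    return tuple(out)

def matching_pattern(a, b):
    xs, ys = tokens(a), tokens(b)
    mx, my = greedy_match(xs, ys)
    ix, iy = segment_ids(len(xs), mx), segment_ids(len(ys), my)
    return labels(ix, mx, iy), labels(iy, my, ix)

pattern = matching_pattern(
    "120 Lexington Avenue New York NY 10016",
    "120 Lexington Ave (Between 28th And 29th St) New York NY 10016",
)
```

For the slide's address pair this yields the labels (1, 0, 3) on both sides, that is, the pattern 103 103: "Avenue" and "Ave (Between 28th And 29th St)" fall into the non-matching middle segments.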

Page 39:

Strong similarity score computation

Strong similarity: similarity between matching segments of values

Support of matching pattern: # distinct matching segments
Support(103 103) = 2

Strong similarity only computed for patterns with sufficient support

Page 40:

Need for support of a matching pattern

Support(010 010) = 1
Hence Strong Similarity = Weak Similarity

Page 41:

Pruning noisy matches

Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079

• Match combinations of values in page
• Prune combinations that don’t match attribute values in any seed record

Page 42:

Apriori-style enumeration

Page positions: X1, X2, X3

Round 1:
<Name, X1> (sup=2)
<Addr, X2> (sup=2)
<Name, X3> (sup=2)

Round 2:
<Name, X1> <Addr, X2> (sup=2)
<Name, X3> <Addr, X2> (sup=0)

• Prune attribute position combinations with low support
– support = # pages in which values at positions match attribute values in a seed record
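
A level-wise sketch of this enumeration is given below. It assumes exact string equality in place of the similarity measures from the earlier slides, and the two toy pages and seed records are hypothetical, chosen so that the supports reproduce the rounds shown above.

```python
from itertools import combinations

def support(combo, pages, seeds):
    """Count pages where one seed record matches every (attribute, position) pair."""
    def page_matches(page):
        return any(
            all(pos in page and attr in rec and page[pos] == rec[attr]
                for attr, pos in combo)
            for rec in seeds
        )
    return sum(page_matches(page) for page in pages)

def enumerate_combinations(pages, seeds, min_support=1):
    """Level-wise search: only combinations built from surviving ones are tried."""
    attrs = sorted({a for rec in seeds for a in rec})
    positions = sorted({p for page in pages for p in page})
    level = [((a, p),) for a in attrs for p in positions]  # round 1 candidates
    frequent = {}
    while level:
        kept = {}
        for combo in level:
            sup = support(combo, pages, seeds)
            if sup >= min_support:
                kept[combo] = sup
        frequent.update(kept)
        next_level = set()
        for c1, c2 in combinations(sorted(kept), 2):
            merged = tuple(sorted(set(c1) | set(c2)))
            one_larger = len(merged) == len(c1) + 1
            distinct_attrs = len({a for a, _ in merged}) == len(merged)
            distinct_positions = len({p for _, p in merged}) == len(merged)
            subsets_kept = all(sub in kept
                               for sub in combinations(merged, len(merged) - 1))
            if one_larger and distinct_attrs and distinct_positions and subsets_kept:
                next_level.add(merged)
        level = sorted(next_level)
    return frequent

# Hypothetical data: X3 holds the name of a different restaurant on each page.
seeds = [
    {"Name": "Chinese Mirch", "Addr": "120 Lexington Ave"},
    {"Name": "Tiffin Wallah", "Addr": "127 E 28th St"},
]
pages = [
    {"X1": "Chinese Mirch", "X2": "120 Lexington Ave", "X3": "Tiffin Wallah"},
    {"X1": "Tiffin Wallah", "X2": "127 E 28th St", "X3": "Chinese Mirch"},
]
result = enumerate_combinations(pages, seeds)
```

In this toy run, position X3 survives round 1 as a Name match but is pruned in round 2, because no single seed record pairs the name at X3 with the address at X2.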

Page 43:

Experimental results

Datasets and their attributes

Restaurant | Bibliography
Name (core) | Title (core)
Address (core) | Author (core)
Phone | Source
Payment |
Cuisine |

Page 44:

Strong vs Weak similarity

• Extraction precision of WS and SS are comparable, precision increases with threshold
• Coverage of SS is steady wrt threshold, coverage of WS drops at high thresholds

Page 45:

Strong similarity scores

SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs

String 1 | String 2 | WS | SS
980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1
1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 | 0.74

Page 46:

Extraction Precision

Page 47:

Coverage

Seed data size (Restaurant)

Page 48:

Summary

• Web is a vast repository of human knowledge
• Building (structured) knowledge base can improve search, help users find relevant information
• Key challenge: Unsupervised information extraction from Web pages
• Content redundancy on Web can be used for unsupervised extraction with high precision
• Future work
– Handling numeric attributes, browse pages
– Detecting and integrating records for the same entity