Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore
Mar 26, 2015
The Web is a vast repository of human knowledge
Basic premise
Diverse information spanning multiple verticals
• Wikipedia, Product, Business, People, …
Grand challenge
Mine the Web to build knowledge bases (KBs) of people, places, things, events,…
Name | Address | Phone
Chinese Mirch | 120 Lexington Ave (between 28th St & 29th St), New York, NY 10016 | (212) 532-3663

Camera | Aspect Ratio | Mega-pixels
Canon Powershot 600 | 4:3 | 0.5
Olympus D-300L | 4:3 | 0.8

Product Name | List Price | Sale Price
Apple iPod nano 8 GB Black (5th Generation) | $145.00 | $139.99

Name | Affiliation | # connections
Rajeev Rastogi | Yahoo! Labs Bangalore | 142
What did search look like in the past?
Search results of the future: Structured abstracts
yelp.com
babycenter
epicurious
answers.com
webmd
New York Times
Gawker
Rank by price
Comparison shopping
Product near me
Topic entity pages
Celebrity Music Videos
Related Topics
Relevant multimedia content including music, videos, information from Wikipedia etc.
A topic based page automatically generated in real time
Up to the minute: Latest info using News feeds, blogs, Twitter, Flickr to stay up to date on Madonna
Building KBs from the Web is a hard problem
• Billions of pages with diverse structure, conflicting information, noise
yelp.com superpages.com
Page content/structure changes constantly
Old
New
• ~2% of sites change each day
KB creation pipeline
Acquire content from the Web
Extract structured data for entities from Web pages
Identify and integrate data for each entity
Roma Bistro Paris
Content acquisition → Information extraction → Disambiguation & Integration
Reviews
IE example
Fields marked on the page: Name, Address, Cuisine, Phone, Price
Name | Address | Phone
Chinese Mirch | 120 Lexington Ave, New York, NY 10016 | (212) 532-3663
Template-based Web pages
• From head/torso sites
• Pages have similar structure
• ~30% of crawled Web pages
• Information rich: 31% of search results
Hand-crafted pages
• Mainly from tail sites
• Pages have diverse structures
Browse pages
Similar-structured records
Unstructured text
Web extraction landscape
[Chart labels recovered from the slide]
• Signals: Structure (site structure, page structure); Content (content redundancy, content features); Context
• Techniques: Pattern-based, Wrapper, Record identification, Content matching, Machine learning models
• Page types: Template-based pages; Hand-crafted, browse pages; Unstructured text
• Prior work: Snowball [AG 00]; HCRF [ZNWZM 06], MLN [YCWZZM 09]; RoadRunner [CMM 01], DEPTA [ZL 05]; [KWD 97], [MMK 99]; [GRST 10]
Wrapper induction
• Technique for extraction from template-based pages
[Diagram labels recovered from the slide]
Learn: Website pages → Cluster → Sample pages → Annotate pages → Annotations → Learn rules → XPath rules
Extract: Website pages → Apply rules → Records
Monitor: Monitor rules → Site change
Clustering pages
• Group structurally similar pages using shingle signatures
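Page shingle signature
Tags: html body @id textarea @id div /div /textarea … br/ /body /html
Windows over the tag sequence → Hash each window → Min of the hash values gives one shingle
Example: window hashes 55, 5, 20, 30 give shingle 5
Page signature: Vector of shingles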
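The tags, windows, hash, and min steps above can be sketched in Python. This is a minimal sketch under stated assumptions: the window width, the number of hash functions, the use of seeded SHA-1 as the hash family, and the names `tag_sequence`, `page_signature`, and `signature_similarity` are illustrative choices, not details given in the talk.

```python
import hashlib
import re


def tag_sequence(html):
    """Return opening and closing tag names in document order (text is ignored)."""
    return [m.group(1).lower()
            for m in re.finditer(r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)", html)]


def windows(tags, width):
    """Slide a fixed-width window over the tag sequence."""
    return [tuple(tags[i:i + width])
            for i in range(max(1, len(tags) - width + 1))]


def hash_window(window, seed):
    """Hash one window; the seed selects one member of the (assumed) hash family."""
    data = (str(seed) + "|" + " ".join(window)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def page_signature(html, width=4, num_hashes=8):
    """One shingle per hash function: the minimum hash over all windows."""
    wins = windows(tag_sequence(html), width)
    return [min(hash_window(w, seed) for w in wins)
            for seed in range(num_hashes)]


def signature_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two pages agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because only tags are hashed, two pages generated from the same template but carrying different text receive the same signature, which is what lets them fall into the same cluster.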
Rule learning
XPath generalization: /html/body/div/div/div/div/div/div/span[@class="tel"] → //span[@class="tel"]
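A small illustration of why the generalized rule is more robust, using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The sample pages, the helper names `wrap` and `extract`, and the assumption that pages are well-formed XML are hypothetical; real Web pages would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Paths are written relative to the root <html> element, as ElementTree requires.
SPECIFIC = "./body/div/div/div/div/div/div/span[@class='tel']"
GENERAL = ".//span[@class='tel']"

TEL = '<span class="tel">(212) 532-3663</span>'


def wrap(inner, depth):
    """Build a toy page with `depth` nested <div> wrappers around `inner`."""
    return ("<html><body>" + "<div>" * depth + inner
            + "</div>" * depth + "</body></html>")


def extract(page_xml, path):
    """Return the text of every node selected by `path`."""
    root = ET.fromstring(page_xml)
    return [node.text for node in root.findall(path)]


old_page = wrap(TEL, 6)   # layout the rule was learned on
new_page = wrap(TEL, 5)   # site dropped one wrapper <div>

print(extract(old_page, SPECIFIC))
print(extract(new_page, SPECIFIC))
print(extract(new_page, GENERAL))
```

The fully specified path stops matching as soon as one wrapper element is removed, while the generalized path keeps extracting the phone number.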
Learning robust XPaths
Candidate XPaths, from most general to most specific (each step is labeled SPECIALIZE in the figure):
//*
//h1, //span, //*[@class=tel]
//span[@class=tel]
Most general XPath that matches all the annotated values and none of the un-annotated values
Most general XPath
Use Apriori to generate candidate XPaths
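The search for the most general XPath can be sketched as a level-wise enumeration, trying candidates with fewer constraints first. This is a simplified stand-in for the Apriori-based candidate generation named on the slide, whose details the talk does not spell out. It assumes well-formed XML, considers only the tag name and attribute values as features, assumes attribute values contain no quotes, and uses hypothetical function names.

```python
import itertools
import xml.etree.ElementTree as ET


def to_xpath(features):
    """Turn a set of features into an ElementTree-style XPath."""
    tag, predicates = "*", ""
    for kind, value in features:
        if kind == "tag":
            tag = value
        else:
            name, attr_value = value
            predicates += "[@%s='%s']" % (name, attr_value)
    return ".//" + tag + predicates


def shared_features(annotated):
    """Features (tag, attribute values) that hold for every annotated node."""
    first = annotated[0]
    features = [("tag", first.tag)]
    features += [("attr", item) for item in sorted(first.attrib.items())]

    def holds(node, feature):
        kind, value = feature
        if kind == "tag":
            return node.tag == value
        return node.attrib.get(value[0]) == value[1]

    return [f for f in features if all(holds(n, f) for n in annotated)]


def most_general_xpath(root, annotated):
    """Return the least constrained XPath selecting exactly the annotated nodes."""
    wanted = {id(n) for n in annotated}
    features = shared_features(annotated)
    for size in range(len(features) + 1):          # level 0 is .//*
        for subset in itertools.combinations(features, size):
            xpath = to_xpath(subset)
            if {id(n) for n in root.findall(xpath)} == wanted:
                return xpath
    return None
```

Only features shared by all annotated nodes are ever combined, since any other constraint would miss an annotated value; a candidate is accepted once it also excludes every un-annotated node.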
Detecting site changes
During Learn
For each cluster, store the page signature and extracted record for a small number of pages
Monitoring
Crawl the pages daily and compare page signatures and extracted records
[Timeline recovered from the slide]
Day 0 to Day n: Signature & Record Match
Day m: Signature / Record Mismatch
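The daily comparison can be sketched as follows. The `Snapshot` type, the exact-equality test, and the function name `mismatches` are assumptions made for illustration; the talk does not say how close a signature or record has to be to count as a match.

```python
from typing import Dict, List, NamedTuple, Tuple


class Snapshot(NamedTuple):
    """What is stored per monitored page: its signature and extracted record."""
    signature: Tuple[int, ...]
    record: Dict[str, str]


def mismatches(baseline: Dict[str, Snapshot],
               today: Dict[str, Snapshot]) -> List[str]:
    """Return URLs whose signature or extracted record differs from the baseline."""
    changed = []
    for url, old in baseline.items():
        new = today.get(url)
        if new is None or new.signature != old.signature or new.record != old.record:
            changed.append(url)
    return sorted(changed)
```

A non-empty result for a cluster would be the signal to re-learn that cluster's rules.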
Wrapper system deployed in Yahoo!
• 250M extractions from 200 sites (product, business)
• Avg num of clusters per site: 24
• Avg num of pages annotated per cluster: 1.6
[Chart: Average Precision / Recall (%), y-axis from 86 to 102; series: Precision, Recall]
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive
Holy grail of IE research
• Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
• OK to annotate pages from a few sites initially to create training data
Key observation
yelp.com superpages.com
• Web sites contain redundant content (that is, pages for same entity)
Content matching approach
• Step 1: Populate seed database from few initial sites
Seed DB (populated by wrappers):
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St, New York, NY 10079
Content matching approach
• Step 2: Match values in page with seed record values
[Figure: values on a Web page from a new site are matched against the seed DB records for Chinese Mirrch and Tiffin Wallah]
Content matching approach
• Step 3: Use matched values to extract records, expand seed database
[Figure labels: New site Web pages → Wrappers → Seed DB with new record]
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St, New York, NY 10079
21 Club | 21 W 52nd St, New York, NY 10019 (new record)
Key challenge 1
• Diverse attribute value representations (impacts recall)
[Figure: seed DB records for Chinese Mirrch and Tiffin Wallah, with callouts "Spelling error" and "Variant"]
Key challenge 2
• Noisy attribute value matches (impacts precision)
[Figure: seed DB records for Chinese Mirrch and Tiffin Wallah, with callout "Noisy match"]
Baseline similarity measure
• Use q-grams to handle spelling errors
Weak Similarity = Cosine-similarity between IDF-weighted q-grams
String | 3-grams
chinese mirch | {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch}
chinese mirrch | {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch}
• Weight of a q-gram (attribute-specific) = Sum of the IDFs of the words it appears in
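A minimal sketch of the weak similarity measure in Python. The smoothed IDF formula, the default weight given to q-grams that lie in no single word (such as those spanning the `#` word separator), and all function names are assumptions; the talk only states that the weight of a q-gram is the sum of the IDFs of the words it appears in.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Set of q-grams of a string, with '#' standing in for spaces."""
    s = "#".join(text.lower().split())
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_weights(attribute_values, q=3):
    """Weights for ONE attribute: sum of the IDFs of the words containing the q-gram."""
    docs = [set(v.lower().split()) for v in attribute_values]
    doc_freq = Counter(word for doc in docs for word in doc)
    weights = Counter()
    for word, freq in doc_freq.items():
        idf = math.log(1 + len(docs) / freq)   # assumed smoothing, always positive
        for gram in {word[i:i + q] for i in range(len(word) - q + 1)}:
            weights[gram] += idf
    return weights


def weak_similarity(a, b, weights, default=1.0, q=3):
    """Cosine similarity between the weighted q-gram vectors of a and b."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)

    def weight(gram):
        return weights.get(gram, default)      # default weight is an assumption

    dot = sum(weight(g) ** 2 for g in grams_a & grams_b)
    norm = (math.sqrt(sum(weight(g) ** 2 for g in grams_a))
            * math.sqrt(sum(weight(g) ** 2 for g in grams_b)))
    return dot / norm if norm else 0.0
```

With this measure "chinese mirch" and "chinese mirrch" share ten of their 3-grams and score well above zero, while names with no 3-gram in common score exactly zero.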
Strong similarity
Address (Seed DB) | Address (Web site) | WS
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49
Strong similarity is defined between two sets of strings
1. Calculate the matching pattern between weakly similar pairs in the two sets
2. Pick matching patterns with sufficient “support”
3. Use only portions selected by the matching pattern in the final similarity calculation
Templatized content
Computing matching pattern
Seed DB value: 120 Lexington Avenue New York NY 10016
Web site value: 120 Lexington Ave (Between 28th And 29th St) New York NY 10016
Edge weights shown in the figure: 1 1 1 1 1 1
1. Perform max-weight bipartite matching to find matching words
   • Edge weight = Jaccard similarity over 3-grams
2. Form segments by grouping contiguous matching words
3. Assign each segment s_i a label
   • 0 if non-matching
   • j if matching segment s'_j
Matching pattern: 103 103 (segments s1 s2 s3 labeled 1 0 3; segments s'1 s'2 s'3 labeled 1 0 3)
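The word matching and segment formation steps can be sketched as below. Note the substitutions: a greedy matching is used as a simple stand-in for max-weight bipartite matching, the minimum edge weight of 0.2 is an assumed threshold, and segments are only marked as matching or non-matching rather than given the index labels shown on the slide. All function names are hypothetical.

```python
import re
from itertools import groupby


def words(text):
    """Lowercase word tokens, with punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def trigrams(word):
    """3-grams of a word; words shorter than 3 characters are kept whole."""
    return {word[i:i + 3] for i in range(len(word) - 2)} or {word}


def jaccard(a, b):
    """Edge weight: Jaccard similarity over 3-grams."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def greedy_match(seed_words, page_words, min_sim=0.2):
    """Greedy stand-in for max-weight bipartite matching: best edges first."""
    edges = sorted(
        ((jaccard(s, p), i, j)
         for i, s in enumerate(seed_words)
         for j, p in enumerate(page_words)),
        key=lambda e: (-e[0], e[1], e[2]),
    )
    seed_used, page_to_seed = set(), {}
    for sim, i, j in edges:
        if sim < min_sim:
            break
        if i not in seed_used and j not in page_to_seed:
            seed_used.add(i)
            page_to_seed[j] = i
    return page_to_seed


def segments(seed_value, page_value):
    """Group contiguous page words into matching / non-matching segments."""
    seed_words, page_words = words(seed_value), words(page_value)
    matched = greedy_match(seed_words, page_words)
    result = []
    for is_match, group in groupby(range(len(page_words)),
                                   key=lambda j: j in matched):
        result.append((" ".join(page_words[j] for j in group), is_match))
    return result
```

On the slide's example the Web site address splits into a matching street segment, a non-matching "between ..." segment, and a matching city segment, mirroring the matching, non-matching, matching shape of the pattern.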
Strong similarity score computation
Strong similarity: similarity between matching segments of values
Support of matching pattern: # distinct matching segments
Support(103 103) = 2
Strong similarity is only computed for patterns with sufficient support
Need for support of a matching pattern
Support(010 010) = 1, hence Strong Similarity = Weak Similarity
Pruning noisy matches
• Match combinations of values in page
• Prune combinations that don’t match attribute values in any seed record
[Figure: seed DB records for Chinese Mirrch and Tiffin Wallah matched against page value positions X1, X2, X3; two matches are kept (✓) and one is pruned (✗)]
Apriori-style enumeration
Round 1: <Name, X1> (sup=2), <Addr, X2> (sup=2), <Name, X3> (sup=2)
Round 2: <Name, X1> <Addr, X2> (sup=2), <Name, X3> <Addr, X2> (sup=0)
• Prune attribute position combinations with low support
  – support = # pages in which values at positions match attribute values in a seed record
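The two rounds above can be reproduced with a small level-wise enumeration. This is a sketch under assumptions: pages are modeled as dictionaries from position to value, matching is exact (case-insensitive) equality rather than the similarity measures described earlier, a combination counts on a page only if all its values match the same seed record, and the names `support`, `same`, and `enumerate_candidates` are hypothetical.

```python
from itertools import combinations


def same(a, b):
    """Toy matcher: case-insensitive equality (stand-in for weak/strong similarity)."""
    return a.strip().lower() == b.strip().lower()


def support(candidate, pages, seed_records, match=same):
    """# pages where every (attribute, position) pair matches one seed record."""
    def holds(page, record):
        return all(pos in page and attr in record
                   and match(page[pos], record[attr])
                   for attr, pos in candidate)
    return sum(1 for page in pages
               if any(holds(page, record) for record in seed_records))


def enumerate_candidates(pages, seed_records, min_support, match=same):
    """Level-wise enumeration of attribute-position combinations with pruning."""
    attributes = sorted({a for record in seed_records for a in record})
    positions = sorted({p for page in pages for p in page})
    level = {}
    for attr in attributes:                      # round 1: single pairs
        for pos in positions:
            cand = ((attr, pos),)
            sup = support(cand, pages, seed_records, match)
            if sup >= min_support:
                level[cand] = sup
    kept = dict(level)
    while level:                                 # later rounds: grow by one pair
        next_level = {}
        for a, b in combinations(sorted(level), 2):
            merged = tuple(sorted(set(a) | set(b)))
            attrs = [attr for attr, _ in merged]
            poss = [pos for _, pos in merged]
            if (len(merged) != len(a) + 1 or len(set(attrs)) != len(attrs)
                    or len(set(poss)) != len(poss) or merged in next_level):
                continue
            sup = support(merged, pages, seed_records, match)
            if sup >= min_support:
                next_level[merged] = sup
        kept.update(next_level)
        level = next_level
    return kept
```

In a toy setting where position X3 holds the name of a different restaurant (for example a "nearby" listing), `<Name, X3>` survives round 1 but is pruned in round 2 because it never co-occurs with the matching address of the same seed record.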
Experimental results
Datasets
Attributes:
Restaurant | Bibliography
Name (core) | Title (core)
Address (core) | Author (core)
Phone | Source
Payment |
Cuisine |
Strong vs Weak similarity
• Extraction precision of WS and SS are comparable, precision increases with threshold
• Coverage of SS is steady with respect to threshold, coverage of WS drops at high thresholds
Strong similarity scores
SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs
String 1 | String 2 | WS | SS
980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1
1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 | 0.74
[Chart: Extraction Precision and Coverage vs. Seed data size (Restaurant)]
Summary
• Web is a vast repository of human knowledge
• Building (structured) knowledge base can improve search, help users find relevant information
• Key challenge: Unsupervised information extraction from Web pages
• Content redundancy on Web can be used for unsupervised extraction with high precision
• Future work
  – Handling numeric attributes, browse pages
  – Detecting and integrating records for the same entity