
Related Entity Finding on the Web


Keynote at the Web of Linked Entities (WoLE) 2013 workshop.
Transcript
Page 1: Related Entity Finding on the Web

Related Entity Finding on the Web

Peter Mika

Senior Research Scientist

Yahoo! Research

Joint work with B. Barla Cambazoglu and Roi Blanco

Page 2: Related Entity Finding on the Web

- 2 -

Search is really fast, without necessarily being intelligent

Page 3: Related Entity Finding on the Web

- 3 -

Why Semantic Search? Part I

• Improvements in IR are harder and harder to come by

– Machine learning using hundreds of features

• Text-based features for matching

• Graph-based features provide authority

– Heavy investment in computational power, e.g. real-time indexing and instant search

• Remaining challenges are not computational, but in modeling user cognition

– Need a deeper understanding of the query, the content and/or the world at large

– Could Watson explain why the answer is Toronto?

Page 4: Related Entity Finding on the Web

- 4 -

What is it like to be a machine?

Roi Blanco

Page 5: Related Entity Finding on the Web

- 5 -

What is it like to be a machine?

✜Θ♬♬ţğ

✜Θ♬♬ţğ √∞ ®ÇĤĪ✜★♬☐✓✓ţğ★✜

✚✜✪ ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ≠=⅚ ©§ ✓★ ♪ΒΓΕ℠

✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ⏎⌥°¶§ ΥΦΦΦ ✗✕☐

Page 6: Related Entity Finding on the Web

- 7 -

Ambiguity

Page 7: Related Entity Finding on the Web

- 8 -

Why Semantic Search? Part II

• The Semantic Web is here

– Data

• Large amounts of RDF data

• Heterogeneous schemas

• Diverse quality

– End users

• Not skilled in writing complex queries (e.g. SPARQL)

• Not familiar with the data

• Novel applications

– Complementing document search

• Rich Snippets, related entities, direct answers

– Other novel search tasks

Page 8: Related Entity Finding on the Web

- 9 -

Semantic Web data

• Linked Data

– Data published as RDF documents linked to other RDF documents and/or using SPARQL end-points

– Community effort to re-publish large public datasets (e.g. DBpedia, open government data)

• RDFa and microdata

– Data embedded inside HTML pages

– Schema.org collaboration among Bing, Google, Yahoo and Yandex

– Facebook Open Graph Protocol (OGP)

Page 9: Related Entity Finding on the Web

- 10 -

Other novel applications

• Aggregation of search results

– e.g. price comparison across websites

• Analysis and prediction

– e.g. world temperature by 2020

• Semantic profiling

– Ontology-based modeling of user interests

• Semantic log analysis

– Linking query and navigation logs to ontologies

• Task completion

– e.g. booking a vacation using a combination of services

• Conversational search

– e.g. PARLANCE EU FP7 project

Web usage mining with Semantic Analysis, Fri 3pm

Page 10: Related Entity Finding on the Web

- 11 -

Interactive search and task completion

Page 11: Related Entity Finding on the Web

Semantic Search

Page 12: Related Entity Finding on the Web

- 13 -

Semantic Search: a definition

• Semantic search is a retrieval paradigm that

– Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content

– Exploits this understanding at some part of the search process

• Emerging field of research

– Exploiting Semantic Annotations in Information Retrieval (2008-2012)

– Semantic Search (SemSearch) workshop series (2008-2011)

– Entity-oriented search workshop (2010-2011)

– Joint Intl. Workshop on Semantic and Entity-oriented Search (2012)

– SIGIR 2012 tracks on Structured Data and Entities

• Related fields:

– XML retrieval, Keyword search in databases, NL retrieval

Page 13: Related Entity Finding on the Web

- 14 -

Search is required in the presence of ambiguity

(Diagram: query representations range from keywords, through natural-language questions and form-/facet-based inputs, to structured queries (SPARQL); data representations range from RDF data embedded in text (RDFa), through semi-structured and structured RDF data, to OWL ontologies with rich, formal semantics.)

Ambiguities throughout: interpretation, extraction errors, data quality, confidence/trust

Page 14: Related Entity Finding on the Web

- 15 -

Common tasks in Semantic Search

(Diagram relating common tasks (entity search, list search, list completion, related entity finding, semantic search) to evaluation campaigns: SemSearch 2010/11, SemSearch 2011, the TREC ELC task and the TREC REF-LOD task.)

Page 15: Related Entity Finding on the Web

Related entity ranking in web search

Page 16: Related Entity Finding on the Web

- 17 -

Motivation

• Some users are short on time

– Need for direct answers

– Query expansion, question-answering, information boxes, rich results…

• Other users have time on their hands

– Long term interests such as sports, celebrities, movies and music

– Long running tasks such as travel planning

Page 17: Related Entity Finding on the Web

- 18 -

Example user sessions

Page 18: Related Entity Finding on the Web

- 19 -

Spark: related entity recommendations in web search

• A search assistance tool for exploration

• Recommend related entities given the user’s current query

– Cf. Entity Search at SemSearch, TREC Entity Track

• Ranking explicit relations in a Knowledge Base

– Cf. TREC Related Entity Finding in LOD (REF-LOD) task

• A previous version of the system live since 2010

– van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961-970

Page 19: Related Entity Finding on the Web

- 20 -

Spark example I.

Page 20: Related Entity Finding on the Web

- 21 -

Spark example II.

Page 21: Related Entity Finding on the Web

- 22 -

How does it work?

Page 22: Related Entity Finding on the Web

- 23 -

High-Level Architecture View

Page 23: Related Entity Finding on the Web

- 24 -

Spark Architecture

Page 24: Related Entity Finding on the Web

- 25 -

Data Preprocessing

Page 25: Related Entity Finding on the Web

- 26 -

Entity graph

• 3.4 million entities, 160 million relations

– Locations: Internet Locality, Wikipedia, Yahoo! Travel

– Athletes, teams: Yahoo! Sports

– People, characters, movies, TV shows, albums: DBpedia

• Example entities (source, entity id, name, type)

– Dbpedia  Brad_Pitt  Brad Pitt  Movie_Actor

– Dbpedia  Brad_Pitt  Brad Pitt  Movie_Producer

– Dbpedia  Brad_Pitt  Brad Pitt  Person

– Dbpedia  Brad_Pitt  Brad Pitt  TV_Actor

– Dbpedia  Brad_Pitt_(boxer)  Brad Pitt  Person

• Example relations

– Dbpedia  Dbpedia  Brad_Pitt  Angelina_Jolie  Person_IsPartnerOf_Person

– Dbpedia  Dbpedia  Brad_Pitt  Angelina_Jolie  MovieActor_CoCastsWith_MovieActor

– Dbpedia  Dbpedia  Brad_Pitt  Angelina_Jolie  MovieProducer_ProducesMovieCastedBy_MovieActor
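
To make the shape of this data concrete, here is a minimal Python sketch (the structures and field names are illustrative, not the production format) that loads a few of the example entities and relations above into an adjacency index of candidate related entities:

```python
# Illustrative only: a tiny in-memory version of the entity graph.
from collections import defaultdict, namedtuple

Entity = namedtuple("Entity", ["source", "entity_id", "name", "entity_type"])
Relation = namedtuple("Relation", ["source", "subj", "obj", "relation_type"])

entities = [
    Entity("Dbpedia", "Brad_Pitt", "Brad Pitt", "Movie_Actor"),
    Entity("Dbpedia", "Brad_Pitt", "Brad Pitt", "Person"),
    Entity("Dbpedia", "Brad_Pitt_(boxer)", "Brad Pitt", "Person"),
]

relations = [
    Relation("Dbpedia", "Brad_Pitt", "Angelina_Jolie", "Person_IsPartnerOf_Person"),
    Relation("Dbpedia", "Brad_Pitt", "Angelina_Jolie", "MovieActor_CoCastsWith_MovieActor"),
]

# Adjacency index: candidate related entities for a given query entity.
neighbors = defaultdict(set)
for r in relations:
    neighbors[r.subj].add(r.obj)
    neighbors[r.obj].add(r.subj)

print(neighbors["Brad_Pitt"])  # {'Angelina_Jolie'}
```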

Page 26: Related Entity Finding on the Web

- 27 -

Entity graph challenges

• Coverage of the query volume

– New entities and entity types

– Additional inference

– International data

– Aliases, e.g. jlo, big apple, thomas cruise mapother iv

• Freshness

– People query for a movie long before it’s released

• Irrelevant entity and relation types

– E.g. voice actors who co-acted in a movie, cities in a continent

• Data quality

– United States Senate career of Barack Obama is not a person

– Andy Lau has never acted in Iron Man 3

Page 27: Related Entity Finding on the Web

- 28 -

Feature extraction

Page 28: Related Entity Finding on the Web

- 29 -

Feature extraction from text

• Text sources

– Query terms

– Query sessions

– Flickr tags

– Tweets

• Common representation

Input tweet:

Brad Pitt married to Angelina Jolie in Las Vegas

Output event:

Brad Pitt + Angelina Jolie

Brad Pitt + Las Vegas

Angelina Jolie + Las Vegas
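
A minimal sketch of this common representation, assuming entity tagging has already happened upstream (the function name and inputs are illustrative):

```python
# Illustrative only: emit pairwise co-occurrence events from one tagged text.
from itertools import combinations

def cooccurrence_events(tagged_entities):
    """Yield one event per unordered pair of distinct entities in the same text."""
    for pair in combinations(sorted(set(tagged_entities)), 2):
        yield pair

# Entities recognized in the example tweet (tagging assumed to happen upstream).
tagged = ["Brad Pitt", "Angelina Jolie", "Las Vegas"]
for a, b in cooccurrence_events(tagged):
    print(a, "+", b)
# Angelina Jolie + Brad Pitt
# Angelina Jolie + Las Vegas
# Brad Pitt + Las Vegas
```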

Page 29: Related Entity Finding on the Web

- 30 -

Features

• Unary

– Popularity features from text: probability, entropy, wiki id popularity …

– Graph features: PageRank on the entity graph, the Wikipedia graph, the web graph

– Type features: entity type

• Binary

– Co-occurrence features from text: conditional probability, joint probability …

– Graph features: common neighbors …

– Type features: relation type
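
As an illustration of the co-occurrence-based features named above, here is a hedged Python sketch computing joint probability, conditional probability and entropy from pair counts; the exact feature definitions used in Spark may differ:

```python
# Illustrative only: probability-style co-occurrence features from pair counts.
import math
from collections import Counter

pair_counts = Counter({
    ("brad pitt", "angelina jolie"): 900,   # made-up counts
    ("brad pitt", "fight club"): 600,
    ("angelina jolie", "las vegas"): 50,
})
entity_counts = Counter()
for (a, b), c in pair_counts.items():
    entity_counts[a] += c
    entity_counts[b] += c
total = sum(pair_counts.values())

def pair_count(a, b):
    return pair_counts.get((a, b), pair_counts.get((b, a), 0))

def joint_probability(a, b):
    return pair_count(a, b) / total

def conditional_probability(a, b):
    """P(b | a): share of a's co-occurrence events that involve b."""
    return pair_count(a, b) / entity_counts[a]

def entropy(a):
    """Entropy of the co-occurrence distribution around entity a."""
    probs = [conditional_probability(a, b) for b in entity_counts if b != a]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(joint_probability("brad pitt", "angelina jolie"))     # ~0.58
print(conditional_probability("fight club", "brad pitt"))   # 1.0
print(entropy("brad pitt"))                                 # ~0.97
```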

Page 30: Related Entity Finding on the Web

- 31 -

Feature extraction challenges

• Efficiency of text tagging

– Hadoop Map/Reduce

• More features are not always better

– Can lead to over-fitting without sufficient training data

Page 31: Related Entity Finding on the Web

- 32 -

Model Learning

Page 32: Related Entity Finding on the Web

- 33 -

Model Learning

• Training data created by editors (five grades)

– 400  Brandi  adriana lima  Brad Pitt  person  Embarrassing

– 1397  David H.  andy garcia  Brad Pitt  person  Mostly Related

– 3037  Jennifer  benicio del toro  Brad Pitt  person  Somewhat Related

– 4615  Sarah  burn after reading  Brad Pitt  person  Excellent

– 9853  Jennifer  fight club movie  Brad Pitt  person  Perfect

• Join between the editorial data and the feature file

• Trained a regression model using GBDT

– Gradient Boosted Decision Trees

• 10-fold cross-validation optimizing NDCG, tuning

– number of trees

– number of nodes per tree
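
A hedged sketch of this training setup, with scikit-learn's GradientBoostingRegressor standing in for the production GBDT implementation; the input file, column names and grade-to-gain mapping are assumptions for illustration:

```python
# Illustrative only: GBDT regression on editorial grades with grouped 10-fold
# cross-validation scored by per-query NDCG.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import ndcg_score
from sklearn.model_selection import GroupKFold

data = pd.read_csv("judged_relations_with_features.tsv", sep="\t")  # hypothetical file
grade_to_gain = {"Embarrassing": 0, "Somewhat Related": 1,
                 "Mostly Related": 2, "Excellent": 3, "Perfect": 4}
y = data["grade"].map(grade_to_gain).to_numpy()
groups = data["query_entity"].to_numpy()   # one ranked list per query entity
X = data.drop(columns=["grade", "query_entity"]).to_numpy()  # numeric features assumed

def mean_ndcg(model, X, y, groups):
    """Average NDCG over query entities with more than one judged relation."""
    preds = model.predict(X)
    scores = [ndcg_score([y[groups == g]], [preds[groups == g]])
              for g in np.unique(groups) if (groups == g).sum() > 1]
    return float(np.mean(scores))

best = None
for n_trees in (100, 300, 500):              # number of trees
    for leaves in (8, 16, 32):               # nodes per tree
        fold_scores = []
        for tr, te in GroupKFold(n_splits=10).split(X, y, groups):
            model = GradientBoostingRegressor(n_estimators=n_trees,
                                              max_leaf_nodes=leaves)
            model.fit(X[tr], y[tr])
            fold_scores.append(mean_ndcg(model, X[te], y[te], groups[te]))
        score = float(np.mean(fold_scores))
        if best is None or score > best[0]:
            best = (score, n_trees, leaves)

print("best mean NDCG %.3f with %d trees, %d leaf nodes" % best)
```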

Page 33: Related Entity Finding on the Web

- 34 -

Feature importance

RANK FEATURE IMPORTANCE

1 Relation type 100

2 PageRank (Related entity) 99.6075

3 Entropy – Flickr 94.7832

4 Probability – Flickr 82.6172

5 Probability – Query terms 78.9377

6 Shared connections 68.296

7 Cond. Probability – Flickr 68.0496

8 PageRank (Entity) 57.6078

9 KL divergence – Flickr 55.4604

10 KL divergence – Query terms 55.0662
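
The table above appears to scale importances so that the strongest feature scores 100. A small sketch of how such a table could be produced from a trained GBDT (placeholder model and feature names, not the original tooling):

```python
# Illustrative only: turn a trained model's feature importances into a
# rank / feature / importance table with the top feature scaled to 100.
import numpy as np

def importance_table(model, feature_names):
    raw = np.asarray(model.feature_importances_, dtype=float)
    scaled = 100.0 * raw / raw.max()
    for rank, (name, score) in enumerate(
            sorted(zip(feature_names, scaled), key=lambda x: -x[1]), start=1):
        print(f"{rank:>4}  {name:<32} {score:9.4f}")
```

It could be called, for instance, with the model and feature column names from the previous sketch.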

Page 34: Related Entity Finding on the Web

- 35 -

Impact of training data

(Chart: model performance as a function of the number of training instances, i.e. judged relations)

Page 35: Related Entity Finding on the Web

- 36 -

Performance by query-entity type

• High overall performance, but some types are more difficult

• Locations

– Editors downgrade popular entities such as businesses

(Chart: NDCG by type of the query entity)

Page 36: Related Entity Finding on the Web

- 37 -

Model Learning challenges

• Editorial preferences do not necessarily coincide with usage

– Users click a lot more on people than expected

– Image bias?

• Alternative: optimize for usage data

– Clicks turned into labels or preferences

– Size of the data is not a concern

– Gains are computed from normalized CTR/COEC

– See van Zwol et al. Ranking Entity Facets Based on User Click Feedback. ICSC 2010: 192-199.

Page 37: Related Entity Finding on the Web

- 38 -

Ranking and Disambiguation

Page 38: Related Entity Finding on the Web

- 39 -

Ranking and Disambiguation

• We apply the ranking function offline to the data

• Disambiguation – how many times was a given wiki id retrieved for queries containing the entity name?

Brad Pitt Brad_Pitt 21158

Brad Pitt Brad_Pitt_(boxer) 247

XXX XXX_(movie) 1775

XXX XXX_(Asia_album) 89

XXX XXX_(ZZ_Top_album) 87

XXX XXX_(Danny_Brown_album) 67

– PageRank for disambiguating locations (wiki ids are not available)

• Expansion to query patterns

– Entity name + context, e.g. brad pitt actor
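
A minimal sketch of the count-based disambiguation described above, using the example counts from the slides; the "too close to call" threshold is an illustrative choice, not the production rule:

```python
# Illustrative only: pick the dominant wiki id for an entity name based on how
# often it was retrieved for queries containing that name.
from collections import defaultdict

retrieval_counts = {            # example figures from the slides
    ("Brad Pitt", "Brad_Pitt"): 21158,
    ("Brad Pitt", "Brad_Pitt_(boxer)"): 247,
    ("Fargo", "Fargo_(film)"): 3969,
    ("Fargo", "Fargo,_North_Dakota"): 4578,
}

by_name = defaultdict(dict)
for (name, wiki_id), count in retrieval_counts.items():
    by_name[name][wiki_id] = count

def disambiguate(name, min_share=0.6):
    """Return the dominant wiki id, or None when it is too close to call.
    The 0.6 share threshold is illustrative, not the production rule."""
    candidates = by_name[name]
    best = max(candidates, key=candidates.get)
    if candidates[best] / sum(candidates.values()) >= min_share:
        return best
    return None

print(disambiguate("Brad Pitt"))  # Brad_Pitt
print(disambiguate("Fargo"))      # None (too close to call)
```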

Page 39: Related Entity Finding on the Web

- 40 -

Ranking and Disambiguation challenges

• Disambiguation cases that are too close to call

– Fargo Fargo_(film) 3969

– Fargo Fargo,_North_Dakota 4578

• Disambiguation across Wikipedia and other sources

Page 40: Related Entity Finding on the Web

- 41 -

Evaluation #2: Side-by-side testing

• Comparing two systems

– A/B comparison, e.g. current system under development and production system

– Scale: A is better, B is better

• Separate tests for relevance and image quality

– Image quality can significantly influence user perceptions

– Images can violate safe search rules

• Classification of errors

– Results: missing important results/contains irrelevant results, too few results, entities are not fresh, more/less diverse, should not have triggered

– Images: bad photo choice, blurry, group shots, nude/racy etc.

• Notes

– Borderline: set one's entities relate to the movie Psy, but the query is most likely about Gangnam Style

– Blondie and Mickey Gilley are '70s performers and do not belong on a list of '60s musicians.

– There is absolutely no relation between Finland and California.

Page 41: Related Entity Finding on the Web

- 42 -

Evaluation #3: Bucket testing

• Also called online evaluation

– Comparing against baseline version of the system

– Baseline does not change during the test

• Small % of search traffic redirected to test system, another small % to the baseline system

• Data collection over at least a week, looking for stat. significant differences that are also stable over time

• Metrics in web search

– Coverage and Click-through Rate (CTR)

– Searches per browser-cookie (SPBC)

– Other key metrics should not be impacted negatively, e.g. abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
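
As an illustration of how two buckets might be compared on CTR, here is a hedged sketch of a two-proportion z-test on clicks over impressions; the talk does not specify the actual statistical procedure used:

```python
# Illustrative only: two-proportion z-test comparing CTR in two buckets.
import math

def ctr_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Return (ctr_a, ctr_b, two-sided p-value) for the difference in CTR."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal CDF
    return p_a, p_b, p_value

# Made-up numbers: test bucket vs. baseline bucket over one week.
print(ctr_z_test(1200, 50000, 1100, 50000))
```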

Page 42: Related Entity Finding on the Web

- 43 -

Coverage before and after the new system

Before release:

Flat, lower

After release:

Flat, higher

Page 43: Related Entity Finding on the Web

- 44 -

Click-through rate (CTR) before and after the new system

Before release:

Gradually degrading performance due to lack of fresh data

After release:

Learning effect: users are starting to use the tool again

Page 44: Related Entity Finding on the Web

- 45 -

Summary

• Spark

– System for related entity recommendations

• Knowledge base

• Extraction of signals from query logs and other user-generated content

• Machine learned ranking

• Evaluation

• Other applications

– Recommendations on topic-entity pages

Page 45: Related Entity Finding on the Web

- 46 -

Future work

• New query types

– Queries with multiple entities

• adele skyfall

– Question-answering on keyword queries

• brad pitt movies

• brad pitt movies 2010

• Extending coverage

– Spark now live in CA, UK, AU, NZ, TW, HK, ES

• Even fresher data

– Stream processing of query log data

• Data quality improvements

• Online ranking with post-retrieval features

Page 46: Related Entity Finding on the Web

- 47 -

The End

• Many thanks to

– Barla Cambazoglu and Roi Blanco (Barcelona)

– Nicolas Torzec (US)

– Libby Lin (Product Manager, US)

– Search engineering (Taiwan)

• Contact

[email protected]

– @pmika