
Related Entity Finding on the Web


Keynote at the Web of Linked Entities (WoLE) 2013 workshop.
Transcript
Page 1: Related Entity Finding on the Web

Related Entity Finding on the Web

Peter Mika

Senior Research Scientist

Yahoo! Research

Joint work with B. Barla Cambazoglu and Roi Blanco

Page 2: Related Entity Finding on the Web

- 2 -

Search is really fast, without necessarily being intelligent

Page 3: Related Entity Finding on the Web

- 3 -

Why Semantic Search? Part I

• Improvements in IR are harder and harder to come by

– Machine learning using hundreds of features

• Text-based features for matching

• Graph-based features provide authority

– Heavy investment in computational power, e.g. real-time indexing and instant search

• Remaining challenges are not computational, but in modeling user cognition

– Need a deeper understanding of the query, the content and/or the world at large

– Could Watson explain why the answer is Toronto?

Page 4: Related Entity Finding on the Web

- 4 -

What is it like to be a machine?

Roi Blanco

Page 5: Related Entity Finding on the Web

- 5 -

What is it like to be a machine?

✜Θ♬♬ţğ

✜Θ♬♬ţğ √∞ ®ÇĤĪ✜★♬☐✓✓ţğ★✜

✚✜✪ ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ≠=⅚ ©§ ✓★ ♪ΒΓΕ℠

✖Γ♫⅜±⏎↵⏏☐ģğğğμλκσςτ⏎⌥°¶§ ΥΦΦΦ ✗✕☐

Page 6: Related Entity Finding on the Web

- 7 -

Ambiguity

Page 7: Related Entity Finding on the Web

- 8 -

Why Semantic Search? Part II

• The Semantic Web is here

– Data

• Large amounts of RDF data

• Heterogeneous schemas

• Diverse quality

– End users

• Not skilled in writing complex queries (e.g. SPARQL)

• Not familiar with the data

• Novel applications

– Complementing document search

• Rich Snippets, related entities, direct answers

– Other novel search tasks

Page 8: Related Entity Finding on the Web

- 9 -

Semantic Web data

• Linked Data

– Data published as RDF documents linked to other RDF documents and/or using SPARQL end-points

– Community effort to re-publish large public datasets (e.g. DBpedia, open government data)

• RDFa and microdata

– Data embedded inside HTML pages

– Schema.org collaboration among Bing, Google, Yahoo and Yandex

– Facebook Open Graph Protocol (OGP)

Page 9: Related Entity Finding on the Web

- 10 -

Other novel applications

• Aggregation of search results

– e.g. price comparison across websites

• Analysis and prediction

– e.g. world temperature by 2020

• Semantic profiling

– Ontology-based modeling of user interests

• Semantic log analysis

– Linking query and navigation logs to ontologies

• Task completion

– e.g. booking a vacation using a combination of services

• Conversational search

– e.g. PARLANCE EU FP7 project

Web usage mining with Semantic Analysis, Fri 3pm

Page 10: Related Entity Finding on the Web

- 11 -

Interactive search and task completion

Page 11: Related Entity Finding on the Web

Semantic Search

Page 12: Related Entity Finding on the Web

- 13 -

Semantic Search: a definition

• Semantic search is a retrieval paradigm that

– Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content

– Exploits this understanding at some part of the search process

• Emerging field of research

– Exploiting Semantic Annotations in Information Retrieval (2008-2012)

– Semantic Search (SemSearch) workshop series (2008-2011)

– Entity-oriented search workshop (2010-2011)

– Joint Intl. Workshop on Semantic and Entity-oriented Search (2012)

– SIGIR 2012 tracks on Structured Data and Entities

• Related fields:

– XML retrieval, Keyword search in databases, NL retrieval

Page 13: Related Entity Finding on the Web

- 14 -

Search is required in the presence of ambiguity

(Diagram: query representations range from keywords, through natural-language questions and form-/facet-based inputs, to structured queries (SPARQL); data representations range from RDF data embedded in text (RDFa), through semi-structured and structured RDF data, to OWL ontologies with rich, formal semantics.)

Ambiguities throughout: interpretation, extraction errors, data quality, confidence/trust

Page 14: Related Entity Finding on the Web

- 15 -

Common tasks in Semantic Search

(Diagram relating common tasks (entity search, list search, list completion, related entity finding, semantic search) to evaluation campaigns: SemSearch 2010/11, SemSearch 2011, the TREC ELC task and the TREC REF-LOD task.)

Page 15: Related Entity Finding on the Web

Related entity ranking in web search

Page 16: Related Entity Finding on the Web

- 17 -

Motivation

• Some users are short on time

– Need for direct answers

– Query expansion, question-answering, information boxes, rich results…

• Other users have time on their hands

– Long term interests such as sports, celebrities, movies and music

– Long running tasks such as travel planning

Page 17: Related Entity Finding on the Web

- 18 -

Example user sessions

Page 18: Related Entity Finding on the Web

- 19 -

Spark: related entity recommendations in web search

• A search assistance tool for exploration

• Recommend related entities given the user’s current query

– Cf. Entity Search at SemSearch, TREC Entity Track

• Ranking explicit relations in a Knowledge Base

– Cf. TREC Related Entity Finding in LOD (REF-LOD) task

• A previous version of the system live since 2010

– van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961-970

Page 19: Related Entity Finding on the Web

- 20 -

Spark example I.

Page 20: Related Entity Finding on the Web

- 21 -

Spark example II.

Page 21: Related Entity Finding on the Web

- 22 -

How does it work?

Page 22: Related Entity Finding on the Web

- 23 -

High-Level Architecture View

Page 23: Related Entity Finding on the Web

- 24 -

Spark Architecture

Page 24: Related Entity Finding on the Web

- 25 -

Data Preprocessing

Page 25: Related Entity Finding on the Web

- 26 -

Entity graph

• 3.4 million entities, 160 million relations

– Locations: Internet Locality, Wikipedia, Yahoo! Travel

– Athletes, teams: Yahoo! Sports

– People, characters, movies, TV shows, albums: DBpedia

• Example entities (source, entity id, name, type)

– Dbpedia  Brad_Pitt  Brad Pitt  Movie_Actor

– Dbpedia  Brad_Pitt  Brad Pitt  Movie_Producer

– Dbpedia  Brad_Pitt  Brad Pitt  Person

– Dbpedia  Brad_Pitt  Brad Pitt  TV_Actor

– Dbpedia  Brad_Pitt_(boxer)  Brad Pitt  Person

• Example relations

– Dbpedia  Dbpedia  Brad_Pitt  Angelina_Jolie  Person_IsPartnerOf_Person

– Dbpedia  Dbpedia  Brad_Pitt  Angelina_Jolie  MovieActor_CoCastsWith_MovieActor

– Dbpedia  Dbpedia  Brad_Pitt  Angelina_Jolie  MovieProducer_ProducesMovieCastedBy_MovieActor
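
To make the shape of this data concrete, here is a minimal Python sketch (the structures and field names are illustrative, not the production format) that loads a few of the example entities and relations above into an adjacency index of candidate related entities:

```python
# Illustrative only: a tiny in-memory version of the entity graph.
from collections import defaultdict, namedtuple

Entity = namedtuple("Entity", ["source", "entity_id", "name", "entity_type"])
Relation = namedtuple("Relation", ["source", "subj", "obj", "relation_type"])

entities = [
    Entity("Dbpedia", "Brad_Pitt", "Brad Pitt", "Movie_Actor"),
    Entity("Dbpedia", "Brad_Pitt", "Brad Pitt", "Person"),
    Entity("Dbpedia", "Brad_Pitt_(boxer)", "Brad Pitt", "Person"),
]

relations = [
    Relation("Dbpedia", "Brad_Pitt", "Angelina_Jolie", "Person_IsPartnerOf_Person"),
    Relation("Dbpedia", "Brad_Pitt", "Angelina_Jolie", "MovieActor_CoCastsWith_MovieActor"),
]

# Adjacency index: candidate related entities for a given query entity.
neighbors = defaultdict(set)
for r in relations:
    neighbors[r.subj].add(r.obj)
    neighbors[r.obj].add(r.subj)

print(neighbors["Brad_Pitt"])  # {'Angelina_Jolie'}
```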

Page 26: Related Entity Finding on the Web

- 27 -

Entity graph challenges

• Coverage of the query volume

– New entities and entity types

– Additional inference

– International data

– Aliases, e.g. jlo, big apple, thomas cruise mapother iv

• Freshness

– People query for a movie long before it’s released

• Irrelevant entity and relation types

– E.g. voice actors who co-acted in a movie, cities in a continent

• Data quality

– United States Senate career of Barack Obama is not a person

– Andy Lau has never acted in Iron Man 3

Page 27: Related Entity Finding on the Web

- 28 -

Feature extraction

Page 28: Related Entity Finding on the Web

- 29 -

Feature extraction from text

• Text sources

– Query terms

– Query sessions

– Flickr tags

– Tweets

• Common representation

Input tweet:

Brad Pitt married to Angelina Jolie in Las Vegas

Output event:

Brad Pitt + Angelina Jolie

Brad Pitt + Las Vegas

Angelina Jolie + Las Vegas
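
A minimal sketch of this common representation, assuming entity tagging has already happened upstream (the function name and inputs are illustrative):

```python
# Illustrative only: emit pairwise co-occurrence events from one tagged text.
from itertools import combinations

def cooccurrence_events(tagged_entities):
    """Yield one event per unordered pair of distinct entities in the same text."""
    for pair in combinations(sorted(set(tagged_entities)), 2):
        yield pair

# Entities recognized in the example tweet (tagging assumed to happen upstream).
tagged = ["Brad Pitt", "Angelina Jolie", "Las Vegas"]
for a, b in cooccurrence_events(tagged):
    print(a, "+", b)
# Angelina Jolie + Brad Pitt
# Angelina Jolie + Las Vegas
# Brad Pitt + Las Vegas
```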

Page 29: Related Entity Finding on the Web

- 30 -

Features

• Unary

– Popularity features from text: probability, entropy, wiki id popularity …

– Graph features: PageRank on the entity graph, the Wikipedia graph, the web graph

– Type features: entity type

• Binary

– Co-occurrence features from text: conditional probability, joint probability …

– Graph features: common neighbors …

– Type features: relation type
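
As an illustration of the co-occurrence-based features named above, here is a hedged Python sketch computing joint probability, conditional probability and entropy from pair counts; the exact feature definitions used in Spark may differ:

```python
# Illustrative only: probability-style co-occurrence features from pair counts.
import math
from collections import Counter

pair_counts = Counter({
    ("brad pitt", "angelina jolie"): 900,   # made-up counts
    ("brad pitt", "fight club"): 600,
    ("angelina jolie", "las vegas"): 50,
})
entity_counts = Counter()
for (a, b), c in pair_counts.items():
    entity_counts[a] += c
    entity_counts[b] += c
total = sum(pair_counts.values())

def pair_count(a, b):
    return pair_counts.get((a, b), pair_counts.get((b, a), 0))

def joint_probability(a, b):
    return pair_count(a, b) / total

def conditional_probability(a, b):
    """P(b | a): share of a's co-occurrence events that involve b."""
    return pair_count(a, b) / entity_counts[a]

def entropy(a):
    """Entropy of the co-occurrence distribution around entity a."""
    probs = [conditional_probability(a, b) for b in entity_counts if b != a]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(joint_probability("brad pitt", "angelina jolie"))     # ~0.58
print(conditional_probability("fight club", "brad pitt"))   # 1.0
print(entropy("brad pitt"))                                 # ~0.97
```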

Page 30: Related Entity Finding on the Web

- 31 -

Feature extraction challenges

• Efficiency of text tagging

– Hadoop Map/Reduce

• More features are not always better

– Can lead to over-fitting without sufficient training data

Page 31: Related Entity Finding on the Web

- 32 -

Model Learning

Page 32: Related Entity Finding on the Web

- 33 -

Model Learning

• Training data created by editors (five grades)

– 400  Brandi  adriana lima  Brad Pitt  person  Embarrassing

– 1397  David H.  andy garcia  Brad Pitt  person  Mostly Related

– 3037  Jennifer  benicio del toro  Brad Pitt  person  Somewhat Related

– 4615  Sarah  burn after reading  Brad Pitt  person  Excellent

– 9853  Jennifer  fight club movie  Brad Pitt  person  Perfect

• Join between the editorial data and the feature file

• Trained a regression model using GBDT

– Gradient Boosted Decision Trees

• 10-fold cross-validation optimizing NDCG, tuning

– number of trees

– number of nodes per tree
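
A hedged sketch of this training setup, with scikit-learn's GradientBoostingRegressor standing in for the production GBDT implementation; the input file, column names and grade-to-gain mapping are assumptions for illustration:

```python
# Illustrative only: GBDT regression on editorial grades with grouped 10-fold
# cross-validation scored by per-query NDCG.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import ndcg_score
from sklearn.model_selection import GroupKFold

data = pd.read_csv("judged_relations_with_features.tsv", sep="\t")  # hypothetical file
grade_to_gain = {"Embarrassing": 0, "Somewhat Related": 1,
                 "Mostly Related": 2, "Excellent": 3, "Perfect": 4}
y = data["grade"].map(grade_to_gain).to_numpy()
groups = data["query_entity"].to_numpy()   # one ranked list per query entity
X = data.drop(columns=["grade", "query_entity"]).to_numpy()  # numeric features assumed

def mean_ndcg(model, X, y, groups):
    """Average NDCG over query entities with more than one judged relation."""
    preds = model.predict(X)
    scores = [ndcg_score([y[groups == g]], [preds[groups == g]])
              for g in np.unique(groups) if (groups == g).sum() > 1]
    return float(np.mean(scores))

best = None
for n_trees in (100, 300, 500):              # number of trees
    for leaves in (8, 16, 32):               # nodes per tree
        fold_scores = []
        for tr, te in GroupKFold(n_splits=10).split(X, y, groups):
            model = GradientBoostingRegressor(n_estimators=n_trees,
                                              max_leaf_nodes=leaves)
            model.fit(X[tr], y[tr])
            fold_scores.append(mean_ndcg(model, X[te], y[te], groups[te]))
        score = float(np.mean(fold_scores))
        if best is None or score > best[0]:
            best = (score, n_trees, leaves)

print("best mean NDCG %.3f with %d trees, %d leaf nodes" % best)
```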

Page 33: Related Entity Finding on the Web

- 34 -

Feature importance

RANK FEATURE IMPORTANCE

1 Relation type 100

2 PageRank (Related entity) 99.6075

3 Entropy – Flickr 94.7832

4 Probability – Flickr 82.6172

5 Probability – Query terms 78.9377

6 Shared connections 68.296

7 Cond. Probability – Flickr 68.0496

8 PageRank (Entity) 57.6078

9 KL divergence – Flickr 55.4604

10 KL divergence – Query terms 55.0662
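
The table above appears to scale importances so that the strongest feature scores 100. A small sketch of how such a table could be produced from a trained GBDT (placeholder model and feature names, not the original tooling):

```python
# Illustrative only: turn a trained model's feature importances into a
# rank / feature / importance table with the top feature scaled to 100.
import numpy as np

def importance_table(model, feature_names):
    raw = np.asarray(model.feature_importances_, dtype=float)
    scaled = 100.0 * raw / raw.max()
    for rank, (name, score) in enumerate(
            sorted(zip(feature_names, scaled), key=lambda x: -x[1]), start=1):
        print(f"{rank:>4}  {name:<32} {score:9.4f}")
```

It could be called, for instance, with the model and feature column names from the previous sketch.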

Page 34: Related Entity Finding on the Web

- 35 -

Impact of training data

(Chart: model performance as a function of the number of training instances, i.e. judged relations)

Page 35: Related Entity Finding on the Web

- 36 -

Performance by query-entity type

• High overall performance, but some types are more difficult

• Locations

– Editors downgrade popular entities such as businesses

(Chart: NDCG by type of the query entity)

Page 36: Related Entity Finding on the Web

- 37 -

Model Learning challenges

• Editorial preferences do not necessarily coincide with usage

– Users click a lot more on people than expected

– Image bias?

• Alternative: optimize for usage data

– Clicks turned into labels or preferences

– Size of the data is not a concern

– Gains are computed from normalized CTR/COEC

– See van Zwol et al. Ranking Entity Facets Based on User Click Feedback. ICSC 2010: 192-199.

Page 37: Related Entity Finding on the Web

- 38 -

Ranking and Disambiguation

Page 38: Related Entity Finding on the Web

- 39 -

Ranking and Disambiguation

• We apply the ranking function offline to the data

• Disambiguation – how many times was a given wiki id retrieved for queries containing the entity name?

Brad Pitt Brad_Pitt 21158

Brad Pitt Brad_Pitt_(boxer) 247

XXX XXX_(movie) 1775

XXX XXX_(Asia_album) 89

XXX XXX_(ZZ_Top_album) 87

XXX XXX_(Danny_Brown_album) 67

– PageRank for disambiguating locations (wiki ids are not available)

• Expansion to query patterns

– Entity name + context, e.g. brad pitt actor
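
A minimal sketch of the count-based disambiguation described above, using the example counts from the slides; the "too close to call" threshold is an illustrative choice, not the production rule:

```python
# Illustrative only: pick the dominant wiki id for an entity name based on how
# often it was retrieved for queries containing that name.
from collections import defaultdict

retrieval_counts = {            # example figures from the slides
    ("Brad Pitt", "Brad_Pitt"): 21158,
    ("Brad Pitt", "Brad_Pitt_(boxer)"): 247,
    ("Fargo", "Fargo_(film)"): 3969,
    ("Fargo", "Fargo,_North_Dakota"): 4578,
}

by_name = defaultdict(dict)
for (name, wiki_id), count in retrieval_counts.items():
    by_name[name][wiki_id] = count

def disambiguate(name, min_share=0.6):
    """Return the dominant wiki id, or None when it is too close to call.
    The 0.6 share threshold is illustrative, not the production rule."""
    candidates = by_name[name]
    best = max(candidates, key=candidates.get)
    if candidates[best] / sum(candidates.values()) >= min_share:
        return best
    return None

print(disambiguate("Brad Pitt"))  # Brad_Pitt
print(disambiguate("Fargo"))      # None (too close to call)
```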

Page 39: Related Entity Finding on the Web

- 40 -

Ranking and Disambiguation challenges

• Disambiguation cases that are too close to call

– Fargo Fargo_(film) 3969

– Fargo Fargo,_North_Dakota 4578

• Disambiguation across Wikipedia and other sources

Page 40: Related Entity Finding on the Web

- 41 -

Evaluation #2: Side-by-side testing

• Comparing two systems

– A/B comparison, e.g. current system under development and production system

– Scale: A is better, B is better

• Separate tests for relevance and image quality

– Image quality can significantly influence user perceptions

– Images can violate safe search rules

• Classification of errors

– Results: missing important results/contains irrelevant results, too few results, entities are not fresh, more/less diverse, should not have triggered

– Images: bad photo choice, blurry, group shots, nude/racy etc.

• Notes

– Borderline: set one's entities relate to the movie Psy, but the query is most likely about Gangnam Style

– Blondie and Mickey Gilley are '70s performers and do not belong on a list of '60s musicians.

– There is absolutely no relation between Finland and California.

Page 41: Related Entity Finding on the Web

- 42 -

Evaluation #3: Bucket testing

• Also called online evaluation

– Comparing against baseline version of the system

– Baseline does not change during the test

• Small % of search traffic redirected to test system, another small % to the baseline system

• Data collection over at least a week, looking for stat. significant differences that are also stable over time

• Metrics in web search

– Coverage and Click-through Rate (CTR)

– Searches per browser-cookie (SPBC)

– Other key metrics should not be impacted negatively, e.g. abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
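
As an illustration of how two buckets might be compared on CTR, here is a hedged sketch of a two-proportion z-test on clicks over impressions; the talk does not specify the actual statistical procedure used:

```python
# Illustrative only: two-proportion z-test comparing CTR in two buckets.
import math

def ctr_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Return (ctr_a, ctr_b, two-sided p-value) for the difference in CTR."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal CDF
    return p_a, p_b, p_value

# Made-up numbers: test bucket vs. baseline bucket over one week.
print(ctr_z_test(1200, 50000, 1100, 50000))
```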

Page 42: Related Entity Finding on the Web

- 43 -

Coverage before and after the new system

Before release:

Flat, lower

After release:

Flat, higher

Page 43: Related Entity Finding on the Web

- 44 -

Click-through rate (CTR) before and after the new system

Before release:

Gradually degrading performance due to lack of fresh data

After release:

Learning effect: users are starting to use the tool again

Page 44: Related Entity Finding on the Web

- 45 -

Summary

• Spark

– System for related entity recommendations

• Knowledge base

• Extraction of signals from query logs and other user-generated content

• Machine learned ranking

• Evaluation

• Other applications

– Recommendations on topic-entity pages

Page 45: Related Entity Finding on the Web

- 46 -

Future work

• New query types

– Queries with multiple entities

• adele skyfall

– Question-answering on keyword queries

• brad pitt movies

• brad pitt movies 2010

• Extending coverage

– Spark now live in CA, UK, AU, NZ, TW, HK, ES

• Even fresher data

– Stream processing of query log data

• Data quality improvements

• Online ranking with post-retrieval features

Page 46: Related Entity Finding on the Web

- 47 -

The End

• Many thanks to

– Barla Cambazoglu and Roi Blanco (Barcelona)

– Nicolas Torzec (US)

– Libby Lin (Product Manager, US)

– Search engineering (Taiwan)

• Contact

[email protected]

– @pmika