Web-scale semantic search
Edgar Meij, Yahoo Labs | October 31, 2014


Web search engine users increasingly expect direct, contextually relevant answers to their information needs rather than mere links to documents. Arriving at such answers requires tackling several problems, including (but not limited to) entity linking, entity retrieval, entity reconciliation, intent classification, and personalization, all without losing sight of efficiency. In this talk I will give some background on how such an end-to-end pipeline for semantic search is being implemented and improved at Yahoo.
Transcript
Page 1: Web-scale semantic search

Web-scale semantic search

Edgar Meij, Yahoo Labs | October 31, 2014

Page 2: Web-scale semantic search

Introduction

2

▪ Web search queries are increasingly entity-oriented › fact finding › transactional › exploratory › …

▪ Users expect increasingly “smarter” results › geared towards a user’s personal profile and current context • device type/user agent, day/time, location, …

› with sub-second response time › fresh/up to date, buzzy › more than “just 10 blue links”

Page 3: Web-scale semantic search

Semantic search

3

▪ “Plain” IR works well for simple queries and unambiguous terms ▪ But it gets harder when there is additional context, (implicit) intent, … › “barack obama height” › “best philips tv” > “sony tv” > “panasonic” › “pizza” › “pizza near me”

▪ Semantic search deals with “meaning” › essential ingredient: linking text to unambiguous “concepts” (~entities) › add: • query understanding • ranking • result presentation

Page 4: Web-scale semantic search

Semantic search

4

▪ Query understanding › entities, relations, types, … › intent detection ~ answer type prediction • KG-oriented: entity/ies, property/ies, relations, summaries, … • web pages • news • POIs • …

Page 5: Web-scale semantic search

Semantic search

5

▪ Ranking › (relevance) › freshness › authoritativeness › buzziness › distance › personalization › …

▪ Result presentation › serendipity › targeted answers › …

Page 6: Web-scale semantic search

Semantic search

6

▪ “Improve search accuracy by understanding searcher intent and the contextual meaning of queries/terms/documents/results/…” › “eiffel tower height” › “brad pitt married” › “chinese food” › “obama net worth”

▪ Uncertainties in › the source(s) › the query › the user › the intent › …

Page 7: Web-scale semantic search

Semantic search

7

▪ Combines IR/NLP/DB/SW (semantic web) ▪ Tasks › information extraction › information reconciliation/tracking › entity linking › query understanding/intent classification › retrieving/ranking entities/attributes/relations › interleaving/federated search: dedupe, merge, and rank (at runtime) › UI/UX › personalization › …

Page 8: Web-scale semantic search

8

Knowledge Graphs

Page 9: Web-scale semantic search

Knowledge graphs

9

▪ The “backbone” of semantic search ▪ They define › entities › attributes › types › relations › (provenance, sometimes) › and more • external links, homepages, features, …

Page 10: Web-scale semantic search

Entities at the core

10

Page 11: Web-scale semantic search

Entities are often not directly observed

11

▪ They appear in different forms, in different types of data!

▪ Unstructured › queries, documents, tweets, web pages, snippets, …

▪ Semistructured › inline XML, RDFa, schema.org, …

▪ Structured › Relational DBs, RDF, …

[Diagram: an information need goes to a retrieval system, which queries data collection(s) (often organized around entities) and returns result(s).]

Page 12: Web-scale semantic search

KG vision @ Yahoo

12

▪ A unified knowledge graph for Yahoo › all entities and topics relevant to Yahoo (users) › rich information about entities: facts, relationships, features › identifiers, interlinking across data sources, and links to relevant services

▪ To power knowledge-based services at Yahoo › search: display, and search for, information about entities › discovery: relate entities, interconnect data sources, link to relevant services › understanding: recognize entities in queries and text

▪ Managed and served by a central knowledge team/platform

Page 13: Web-scale semantic search

Steps

13

Knowledge Acquisition: ongoing information extraction, from complementary sources.

Knowledge Integration: reconciliation into a unified knowledge repository.

Knowledge Consumption: enrichment and serving…

Page 14: Web-scale semantic search

14

Key Tasks

[Pipeline diagram: Knowledge Acquisition (Data Acquisition, Information Extraction) → Knowledge Integration (Schema Mapping, Blending, Entity Reconciliation, Enrichment, Editorial Curation) → Knowledge Consumption (Export, Serving), with Data Quality Monitoring and a Knowledge Repository (common ontology) spanning all stages.]

Page 15: Web-scale semantic search

15

Data Acquisition

[Same pipeline diagram as on slide 14; this slide focuses on the Data Acquisition stage.]

Page 16: Web-scale semantic search

Issues

16

▪ Multiple complementary data sources › combine and cross-validate data from authoritative sources › reference data sources such as Wikipedia and Freebase form our backbone › specialized data sources such as TMS and Musicbrainz add breadth/depth › optimize for relevance, comprehensiveness, correctness, freshness, consistency

▪ Ongoing acquisition of raw data › feed acquisition from open data sources and paid providers › web/targeted crawling, online fetching, ad hoc acquisition (e.g. Wikipedia monitoring) › deal w/ operational complexity: data size, bandwidth, update frequency, license, ©

Page 17: Web-scale semantic search

17

Information Extraction

[Same pipeline diagram as on slide 14; this slide focuses on the Information Extraction stage.]

Page 18: Web-scale semantic search

Information Extraction

18

▪ Extraction of entities, attributes, relationships, features › deal w/ scale, volatility, heterogeneity, inconsistency, schema complexity, breakage › expensive to build and maintain (e.g. declarative rules, expert knowledge, ML…) › being able to measure and monitor data quality is key

▪ Mixed approach › parsing of large data feeds and online data APIs › structured data extraction on the Web: markup, web scraping, wrapper induction › Wikipedia mining, web mining, news mining, open information extraction

Page 19: Web-scale semantic search

19

Key Tasks

[Same pipeline diagram as on slide 14.]

Page 20: Web-scale semantic search

20

Entity Reconciliation & Blending

▪ Disambiguate and merge entities across/within data sources (see the sketch after this list)
› Blocking: select the candidates most likely to refer to the same real-world entity (fast approximate similarity, hashing)
› Scoring: compute a similarity score between all pairs of candidates (ML classifier or heuristics)
› Clustering: decide which candidates refer to the same entity and interlink them (ML clustering or heuristics)
› Merging: build a unified object for each cluster and populate it with the best properties (ML selection or heuristics)

▪ Challenges › not trivial! › scale and adapt to new entity types, data sources, data sizes, update frequencies… › ongoing reconciliation/blending/evaluation; need for consistent entity IDs and provenance
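The four stages compose naturally into a pipeline. Below is a minimal sketch in Python, with a toy first-token blocking key, a Jaccard similarity heuristic, union-find clustering, and source-priority merging; none of this is Yahoo's production code, and every name and threshold is illustrative only.

```python
from itertools import combinations

def blocking(records):
    """Group records by a cheap key so we only compare likely duplicates."""
    blocks = {}
    for r in records:
        key = r["name"].lower().split()[0]  # crude first-token blocking key
        blocks.setdefault(key, []).append(r)
    return blocks.values()

def score(a, b):
    """Heuristic similarity: Jaccard overlap of name tokens."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster(block, threshold=0.5):
    """Union-find clustering of candidate pairs scoring above the threshold."""
    parent = {id(r): id(r) for r in block}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in combinations(block, 2):
        if score(a, b) >= threshold:
            parent[find(id(a))] = find(id(b))
    clusters = {}
    for r in block:
        clusters.setdefault(find(id(r)), []).append(r)
    return clusters.values()

def merge(cluster_records, source_priority=("wikipedia", "freebase")):
    """Build one unified object per cluster, trusted sources first."""
    ranked = sorted(
        cluster_records,
        key=lambda r: source_priority.index(r["source"])
        if r["source"] in source_priority else len(source_priority))
    merged = {}
    for r in ranked:  # earlier (more trusted) sources win on conflicts
        for k, v in r.items():
            merged.setdefault(k, v)
    return merged

records = [
    {"name": "Barack Obama", "source": "wikipedia", "birth": "1961-08-04"},
    {"name": "Barack H. Obama", "source": "freebase", "height": "1.85 m"},
]
for block in blocking(records):
    for c in cluster(block):
        print(merge(c))  # one reconciled entity with fields from both sources
```

In practice each stage is swappable (locality-sensitive hashing for blocking, a learned classifier for scoring, and so on), which is what lets the pipeline scale to new entity types, data sources, and update frequencies.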

Page 21: Web-scale semantic search

Knowledge graphs

21

Page 22: Web-scale semantic search

Knowledge graphs…

22

▪ … are not perfect ▪ Or: the importance of human editors

Page 23: Web-scale semantic search

Knowledge graphs…

23

▪ … are not perfect ▪ Or: the importance of human editors

Page 24: Web-scale semantic search

Knowledge graphs…

24

▪ … are not perfect ▪ Or: the importance of human editors

Page 25: Web-scale semantic search

Knowledge graphs…

25

▪ … are not perfect ▪ Or: the importance of human editors

Page 26: Web-scale semantic search

26

Entity Linking

Page 27: Web-scale semantic search

Knowledge graphs

27

Page 28: Web-scale semantic search

Entity linking

28

▪ Typical steps › mention detection – which part(s) should we link? › candidate ranking/selection – where should they be linked to? › disambiguation – maximize a “global” objective function

▪ Entity linking for web search queries › pre-retrieval • needs to be fast, space-efficient, and accurate

› queries are short and noisy • high level of ambiguity • limited context

Page 29: Web-scale semantic search

Entity linking for web search

29

▪ Approach › probabilistic model • unsupervised • large-scale set of aliases from Wikipedia and from click logs

› contextual relevance model based on neural language models › state-of-the-art hashing and bit encoding techniques

Page 30: Web-scale semantic search

Entity linking for web search

30

▪ Idea: jointly model mention detection (segmentation) and entity selection › compute probabilistic score for each segment-entity pair › optimize the score of the whole query

[Excerpt from the FEL paper shown on the slide:]

…implemented efficiently using dynamic programming in $O(k^2)$, where $k$ is the number of query terms.

2. MODELING ENTITY LINKING

For our entity linking model we establish a connection between entities and their aliases (which are their textual representations, also known as surface forms) by leveraging anchor text or user queries leading to a click into the Web page that represents the entity. In the context of this paper we focus on using Wikipedia as KB and therefore only consider anchor text within Wikipedia and clicks from web search results on Wikipedia results, although the model is general enough to accommodate other sources of information. The problem we address consists of automatically segmenting the query and simultaneously selecting the right entity for each segment. Our Fast Entity Linker (FEL) tackles this problem by computing a probabilistic score for each segment-entity pair and then optimizing the score of the whole query. Note that we do not employ any supervision and let the model and data operate in a parameterless fashion; it is however possible to add an additional layer that makes use of human-labeled training data in order to enhance the performance of the model. This is the approach followed in Alley-oop, where the ranking model described in this paper is used to perform a first-phase ranking, followed by a second-phase ranking using a supervised, machine-learned model.

To describe our model we use the following random variables, assuming as an event space $S \times E$, where $S$ is the set of all sequences and $E$ the set of all entities known to the system:

- $s$: a sequence of terms $s \in \mathbf{s}$ drawn from the set $S$, $s \sim \mathrm{Multinomial}(\theta_s)$
- $\mathbf{e}$: a set of entities $e \in \mathbf{e}$, where each $e$ is drawn from the set $E$, $e \sim \mathrm{Multinomial}(\theta_e)$
- $a_s \sim \mathrm{Bernoulli}(\theta_a^s)$: indicates if $s$ is an alias
- $a_{s,e} \sim \mathrm{Bernoulli}(\theta_a^{s,e})$: indicates if $s$ is an alias pointing (linking/clicked) to $e$
- $c$: indicates which collection acts as a source of information, query logs or Wikipedia ($c_q$ or $c_w$)
- $n(s, c)$: count of $s$ in $c$
- $n(e, c)$: count of $e$ in $c$

Let $q$ be the input query, which we represent with the set $S_q$ of all possible segmentations of its tokens $t_1 \cdots t_k$. The algorithm will return the set of entities $\mathbf{e}$, along with their scores, that maximizes:

$$\arg\max_{\mathbf{e} \in E} \log P(\mathbf{e} \mid q) = \arg\max_{\mathbf{e} \in E,\, \mathbf{s} \in S_q} \sum_{e \in \mathbf{e}} \log P(e \mid s) \qquad (1)$$

$$\text{s.t. } s \in \mathbf{s}, \quad \bigcup s \subseteq \mathbf{s}, \quad \bigcap s = \emptyset. \qquad (2)$$

In Eq. 1 we assume independence of the entities given a query segment, and in Eq. 2 we impose that the segmentations are disjoint. Each individual entity/segment probability is then estimated as:

$$P(e \mid s) = \sum_{c \in \{c_q, c_w\}} P(c \mid s)\, P(e \mid c, s) \qquad (3)$$

$$= \sum_{c \in \{c_q, c_w\}} P(c \mid s) \sum_{a_s \in \{0,1\}} P(a_s \mid c, s)\, P(e \mid a_s, c, s)$$

$$= \sum_{c \in \{c_q, c_w\}} P(c \mid s) \Big( P(a_s = 0 \mid c, s)\, P(e \mid a_s = 0, c, s) + P(a_s = 1 \mid c, s)\, P(e \mid a_s = 1, c, s) \Big). \qquad (4)$$

The maximum likelihood probabilities are (note that in this case $P(e \mid a_s = 0, c, s) = 0$ and therefore the right-hand side of the summation cancels out):

$$P(c \mid s) = \frac{n(s, c)}{\sum_{c'} n(s, c')} \qquad (5)$$

$$P(a_s = 1 \mid c, s) = \frac{\sum_{s : a_s = 1} n(s, c)}{n(s, c)} \qquad (6)$$

$$P(e \mid a_s = 1, c, s) = \frac{\sum_{s : a_{s,e} = 1} n(s, c)}{\sum_{s : a_s = 1} n(s, c)} \qquad (7)$$

Those maximum likelihood probabilities can be smoothed appropriately using an entity prior. Using Dirichlet smoothing, the probability results in:

$$P(e \mid c) = \frac{n(e, c)}{|E| + \sum_{e \in E} n(e, c)} \qquad (8)$$

$$P(e \mid a_s, c, s) = \frac{\sum_{s : a_{s,e} = 1} n(s, c) + \mu_c \cdot p(e \mid c)}{\mu_c + \sum_{s : a_s = 1} n(s, c)} \qquad (9)$$

In this case $P(e \mid c) = P(e \mid a_s = 0, c, s)$, and $P(a_s = 0 \mid c, s) = 1 - P(a_s = 1 \mid c, s)$. Similarly, we smooth $P(c \mid s)$ using Laplace smoothing. An alternative to Eq. 1 would be to select the segmentation that optimizes the score of the top-ranked entity:

$$\arg\max_{\mathbf{e} \in E,\, \mathbf{s} \in S_q} \max_{e \in \mathbf{e},\, s \in \mathbf{s}} P(e \mid s) \qquad (10)$$

Both Eq. 1 and Eq. 10 are instances of the same general segmentation problem, defined as follows. Given a sequence of terms $t = t_1 \cdots t_k$, denote any segment of the sequence with $[t_i t_{i+1} \ldots t_{i+j-1}]$, $\forall i, j \geq 0$. Let $\sigma(s)$ be any scoring function that maps segments to real numbers; the maximum score of a segmentation is then defined as follows:

$$m(t_1, t_2, \ldots, t_k) = \max\Big( \oplus\big(m(t_1), m(t_2, \ldots, t_k)\big),\ \oplus\big(\sigma([t_1 t_2]), m(t_3 \ldots t_k)\big),\ \ldots,\ \oplus\big(\sigma([t_1 \ldots t_{k-1}]), m(t_k)\big),\ \sigma([t_1 \ldots t_k]) \Big), \qquad (11)$$

where $m(t_1) = \sigma([t_1])$ and $\oplus(a, b)$ is an associative aggregation function, such as $\oplus(a, b) = a + b$ in the case of Eq. 1 and $\oplus(a, b) = \max(a, b)$ in the case of Eq. 10. Since the scoring function $\sigma(\cdot)$ only depends on the given segment and not on the others, the segmentation with maximum score can be computed in $O(k^2)$ time using dynamic programming.

We instantiate the above problem with the scoring function $\sigma(s) = \mathrm{highestscore}(s, q) = \max_{e \in E} \log P(e \mid s, q)$, that, […]

Page 31: Web-scale semantic search

Main idea

31

▪ Use dynamic programming to solve (see the sketch below)

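As a concrete illustration of Eq. 11, here is a minimal sketch of the segmentation dynamic program in Python, using $\oplus(a, b) = a + b$ as in Eq. 1. The alias-to-entity score table is a toy stand-in for $\log P(e|s)$, and the handling of non-alias words is simplified; this is not the production FEL implementation.

```python
# Toy scoring table standing in for log P(e|s): best entity per alias.
SEGMENT_SCORES = {
    "barack obama": ("Barack_Obama", -0.1),
    "obama": ("Barack_Obama", -0.7),
}

def sigma(segment):
    """sigma(s): best (entity, log-score) for a segment, or None if no alias."""
    return SEGMENT_SCORES.get(segment)

def link(terms):
    """Best disjoint segmentation of the query, computed right to left.

    best[i] holds (score, links) for the suffix terms[i:]; every segment
    [i:j) is tried once, so O(k^2) segments overall, as in the paper.
    """
    k = len(terms)
    best = [(0.0, [])] * (k + 1)  # the empty suffix scores 0
    for i in range(k - 1, -1, -1):
        candidates = []
        for j in range(i + 1, k + 1):
            seg = " ".join(terms[i:j])
            hit = sigma(seg)
            tail_score, tail_links = best[j]
            if hit is None:
                # Non-alias segment: allowed only for single terms, so the
                # DP can always skip a word at no cost (a simplification).
                if j == i + 1:
                    candidates.append((tail_score, tail_links))
            else:
                entity, score = hit
                candidates.append((score + tail_score,
                                   [(seg, entity)] + tail_links))
        best[i] = max(candidates, key=lambda c: c[0])
    return best[0]

print(link("barack obama height".split()))
# -> (-0.1, [('barack obama', 'Barack_Obama')])
```

The joint optimization is what resolves “barack obama” as one segment rather than linking “obama” alone: the longer segment's higher score wins under the additive aggregation.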

Page 32: Web-scale semantic search

Contextual relevance model

32

▪ Note: P(e|s) is independent of other s’s in each query › fast, but might be suboptimal › e.g., “hollywood lyrics”

Page 33: Web-scale semantic search

Contextual relevance model

33

▪ Note: P(e|s) is independent of other s’s in each query › fast, but might be suboptimal › e.g., “hollywood lyrics”

▪ Solution: contextual relevance model › add query “context” t ∈ q\s (i.e., the query remainder) into the model: P(e|s, q) • boils down to calculating ∏t P(t|e) and merging this back into the model

› naive implementation: LM- or NB-based › more advanced: use word embeddings from neural language models (see the sketch below)
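A minimal sketch of the centroid flavor of this idea: represent a candidate entity by the centroid of the word vectors of words associated with it, and compare it with the centroid of the remaining query terms. The vectors and entity-word lists below are toy values; a real system would use trained word2vec-style embeddings, as described on the following slides.

```python
import numpy as np

WORD_VECS = {  # toy stand-ins for trained word embeddings
    "lyrics": np.array([0.9, 0.1, 0.0]),
    "song":   np.array([0.8, 0.2, 0.1]),
    "film":   np.array([0.1, 0.9, 0.2]),
    "city":   np.array([0.0, 0.2, 0.9]),
}

ENTITY_WORDS = {  # words associated with each entity, e.g. from its page
    "Hollywood_(song)":     ["lyrics", "song"],
    "Hollywood_(district)": ["film", "city"],
}

def centroid(words):
    return np.mean([WORD_VECS[w] for w in words if w in WORD_VECS], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def context_score(entity, context_terms):
    """Similarity between the entity centroid and the query-context centroid."""
    return cosine(centroid(ENTITY_WORDS[entity]), centroid(context_terms))

# Query "hollywood lyrics", segment "hollywood": the remainder "lyrics"
# should prefer the song over the district.
for e in ENTITY_WORDS:
    print(e, round(context_score(e, ["lyrics"]), 3))
```

This context score is then merged back into $P(e|s, q)$ so that the query remainder disambiguates the segment.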

Page 34: Web-scale semantic search

Evaluation

34

▪ Webscope query-to-entities dataset (publicly available) › http://webscope.sandbox.yahoo.com/ › 2.6k queries with 6k editorially linked Wikipedia entities

▪ 4.6m candidate Wikipedia entities ▪ Entities and aliases › 13m aliases from Wikipedia hyperlinks (after filtering) › >100m aliases from click logs

▪ Baselines › commonness (most likely sense of an alias) › IR-based (LM) › Wikifier › Bing

Page 35: Web-scale semantic search

Results

35

[Excerpt from the FEL paper shown on the slide:]

…is a large set of entities to disambiguate and the size of this set can vary greatly from query to query. If we break down this average and deviation by alias size (Figure 2), we observe that it drops dramatically when the length of the alias query segment referring to the entity is greater than 1. Figure 3 plots the different number of aliases that are associated with the different entities, which is another heavy-tail distribution: the majority of the aliases (86%) point to a single entity.

3.2 Entity vectors training

We train the word vectors with the original word2vec code on a text collection of 3 billion words extracted from Wikipedia, using the standard vector dimensionality $D = 200$. Given the scarcity and sparsity of labeled query examples we seek an unsupervised way to tune the hyperparameters $\lambda$ and $\rho$ (the regularization parameter and the number of negative samples). To this aim we define an artificial task which we call the retrieval task, and optimize the parameters on it. We define the retrieval task as follows. We sample a set of entities $E_{\mathrm{train}}$ among those whose multiset $R_e$ has at least 50 words, and extract a subsample $E_{\mathrm{test}} \subset E_{\mathrm{train}}$. For each entity $e$ in $E_{\mathrm{train}}$ we hold out $k$ words from $R_e$ and train the entity vector on the remaining words. Then, for each entity $e$ in $E_{\mathrm{test}}$ we use the $k$ held-out words to score all the entities in $E_{\mathrm{train}}$ and compute the rank of $e$ in the induced ranking. Then, inspired by the standard discount gains in DCG metrics, we define the accuracy as the average logarithm of the ranks. We observe that as the number of negative samples $\rho$ increases the accuracy improves but the training time grows linearly; we find a satisfactory trade-off at $\rho = 20$, where the accuracy reaches a plateau. With respect to the regularization parameter, instead, we find a maximum at $\lambda = 10$.

Table 3: Accuracy for the retrieval task.

k    Centroid  LR
5    7.857     7.386
10   5.711     5.385
15   4.575     4.363
20   3.931     3.883

In Table 3 we report the accuracy numbers for Centroid and LR trained with the above hyperparameters, for different values of $k$, using 50K entities for $E_{\mathrm{train}}$ and 5,000 entities for $E_{\mathrm{test}}$. Note that LR outperforms Centroid in all cases.

3.3 Entity linking

We now compare the performance of our FEL models against several baselines: a state-of-the-art system (Wikifier), a retrieval approach based on language models, and commonness (cmns). This latter method is a popular unsupervised baseline and has been shown to perform quite well on various kinds of input texts (tweets, news items, etc.) [3, 6]. It is defined as

$$\mathrm{cmns}(e, s) = \frac{|L_{e,s}|}{\sum_{e'} |L_{e',s}|},$$

where $e$ is an entity, $s$ a sequence of terms, and $L_{e,s}$ the set of all links within Wikipedia with anchor text $s$ that target the page that represents $e$. In order to process the queries using this method, we split each query into every single term n-gram starting from the longest possible, i.e., the whole query. Then we try to match each n-gram with anchor texts as found in Wikipedia. In the case an anchor text is found we score the entities based on cmns and ignore the smaller, constituent n-grams. Otherwise we recurse and try to match the (n−1)-grams. Information retrieval-based approaches (denoted LM) make use of a Wikipedia index and can rank the pages using their content. We indexed each Wikipedia article using different fields (title, content, anchors, first paragraph, redirects, and categories) and implemented the method by Kim and Croft [5] for ranking the different pages. This method computes a per-field weighting term based on the query. We note the performance of this method on its own is quite poor. We therefore enhanced it using a query-independent feature and rerank the obtained search results by the amount of clicks on a Wikipedia page as document prior (LM-Click). The method denoted Bing is a reference baseline that issues the query through the Bing search API and returns the first Wikipedia page as a result. This is a strong baseline in that commercial Web search engines use supervised, machine-learned ranking models incorporating a large number of features, including features derived from the text of entity pages, the hyperlink graph, and user interactions (click logs).

Ratinov et al. [9] propose the use of a local approach (e.g., commonness) to generate a disambiguation context and then apply “global” machine learning for disambiguation. Wikifier (http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier) is an implementation of this state-of-the-art system which combines Wikipedia pages, gazetteers, and Wordnet. We denote our Fast Entity Linker as FEL, and the two methods that use contextual vectors are FEL+Centroid and FEL+LR to indicate the different ways the vectors are learned.

We evaluate the different systems using early-precision metrics, i.e., Precision at rank 1 (P@1), Mean Reciprocal Rank (MRR), R-Precision (R-Prec), and also Mean Average Precision (MAP). We look for statistical significance using a two-sided t-test.

Table 4: Entity linking efficacy.

              P@1     MRR     MAP     R-Prec
LM            0.0394  0.1386  0.1053  0.0365
LM-Click      0.4882  0.5799  0.4264  0.3835
Bing          0.6349  0.7018  0.5388  0.5223
Wikifier      0.2983  0.3201  0.2030  0.2086
Commonness    0.7336  0.7798  0.6418  0.6464
FEL           0.7669  0.8092  0.6528  0.6575
FEL+Centroid  0.8035  0.8366  0.6728  0.6765
FEL+LR        0.8352  0.8684  0.6912  0.6883

Table 4 shows the main entity linking results. With respect to the baselines, Wikifier was designed primarily for entity linking on longer texts and underperforms heavily with respect to the rest of the methods. Retrieval-based approaches are not competitive on their own. Retrieving entities using the Bing search API, which presumably uses a large amount of available signals, is the first competitive baseline, along with commonness. Somehow surprisingly, in our dataset this baseline outperformed every other baseline by a large margin. FEL significantly outperforms all the baselines, except for commonness. FEL+Centroid, however, significantly outperforms all the baselines as well as FEL, and FEL+LR is in turn significantly better than all other methods.

3.4 Runtime performance […]
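The commonness baseline described in the excerpt is simple enough to sketch directly. The anchor counts below are invented for illustration, and the handling of remaining unmatched terms is simplified (the paper recurses on them):

```python
ANCHOR_COUNTS = {  # alias -> {entity: number of Wikipedia links with that anchor}
    "new york": {"New_York_City": 900, "New_York_(state)": 300},
    "york": {"York": 120},
}

def cmns(alias):
    """cmns(e, s) = |L_{e,s}| / sum over e' of |L_{e',s}|, per matching entity."""
    counts = ANCHOR_COUNTS.get(alias, {})
    total = sum(counts.values())
    return {e: n / total for e, n in counts.items()}

def link_longest_first(terms):
    """Match the longest alias n-grams first; ignore constituent n-grams."""
    results = []
    for n in range(len(terms), 0, -1):  # longest n-grams first
        for i in range(len(terms) - n + 1):
            alias = " ".join(terms[i:i + n])
            if alias in ANCHOR_COUNTS:
                results.append((alias, cmns(alias)))
        if results:  # simplification: stop at the first level with a match
            return results
    return results

print(link_longest_first("new york pizza".split()))
# -> [('new york', {'New_York_City': 0.75, 'New_York_(state)': 0.25})]
```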

Page 36: Web-scale semantic search

Optimizations

36

▪ Early stopping ▪ Compressing word embedding vectors (see the sketch after this list) › Golomb coding + Elias-Fano monotone sequence data structure • allowing retrieval in constant time

› compression: • word vectors: 3.44 bits per entry • centroid vectors: 3.42 bits per entry • LR vectors: 3.83 bits per entry • overall: ~10 times smaller than 32-bit floating point

▪ Compressing counts ▪ Sub-millisecond response time
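To give a feel for where numbers like 3.4 bits per entry come from, here is a toy sketch that uniformly quantizes vector entries and Golomb-Rice codes them, then reports the bit cost. The step size and Rice parameter are arbitrary choices for illustration; the actual FEL structure combines Golomb coding with Elias-Fano indexing for constant-time retrieval, which this sketch does not attempt.

```python
import numpy as np

def rice_encode(value, k):
    """Golomb-Rice code of a non-negative int: unary quotient + k-bit remainder."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def zigzag(x):
    """Map signed ints to non-negative ones: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * x if x >= 0 else -2 * x - 1

rng = np.random.default_rng(0)
vec = rng.normal(scale=0.3, size=200)        # a toy 200-d word vector
quantized = np.round(vec / 0.1).astype(int)  # uniform quantization, step 0.1

bits = sum(len(rice_encode(zigzag(int(x)), k=2)) for x in quantized)
print(f"{bits / len(vec):.2f} bits per entry vs. 32 for float32")
```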

Page 37: Web-scale semantic search

37

Query Intents

Page 38: Web-scale semantic search

Query understanding

38

▪ Entails mapping queries to entities, relations, types, attributes, … ▪ Still in its infancy, especially for keyword queries › QA › query patterns/templates › direct displays › query interpretation • rank interpretations!

› context context context

<target id="4" text="James Dean">
  <qa><q id="4.1" type="FACTOID">When was James Dean born?</q></qa>
  <qa><q id="4.2" type="FACTOID">When did James Dean die?</q></qa>
  <qa><q id="4.3" type="FACTOID">How did he die?</q></qa>
  <qa><q id="4.4" type="LIST">What movies did he appear in?</q></qa>
  <qa><q id="4.5" type="FACTOID">Which was the first movie that he was in?</q></qa>
  <qa><q id="4.6" type="OTHER">Other</q></qa>
</target>

Page 39: Web-scale semantic search

Query understanding

39

▪ Entails mapping queries to entities, relations, types, attributes, … ▪ Still in its infancy, especially for keyword queries › QA › query patterns/templates › direct displays › query interpretation • rank interpretations!

› context context context

[Same TREC QA example as on the previous slide.]

Page 40: Web-scale semantic search

Query understanding

40

▪ Entails mapping queries to entities, relations, types, attributes, … ▪ Still in its infancy, especially for keyword queries › QA › query patterns/templates › direct displays › query interpretation • rank interpretations!

› context context context

Page 41: Web-scale semantic search

Query understanding

41

▪ Entails mapping queries to entities, relations, types, attributes, … ▪ Still in its infancy, especially for keyword queries › QA › query patterns/templates › direct displays › query interpretation • rank interpretations!

› context context context

Page 42: Web-scale semantic search

Intents?

42

▪ Intent = “need behind the query”, “objective”, “task”, etc. › used for triggering, reranking, selecting, disambiguation, … › detection + mapping

▪ Search-oriented intents › navigational, informational, transactional › domains/verticals • autos, local, product, recipe, travel, …

› entity/type-centered intents (“attribute intents”), tied to • attributes • specific return type(s) • actions • facets/refiners

Page 43: Web-scale semantic search

How to detect them?

43

▪ language models ▪ editorial ▪ rules/templates ▪ ML + neural LMs (a rule/template sketch follows below)
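As a flavor of the rules/templates option, here is a minimal sketch: entity linking first replaces query spans with their types, then simple patterns map the typed query to an attribute intent. The patterns, types, and intent labels are all invented for illustration; production systems combine such rules with ML and neural LMs.

```python
import re

PATTERNS = [  # (regex over the typed query, intent label)
    (re.compile(r"^<person> (height|net worth|age)$"), "person:attribute"),
    (re.compile(r"^<food> near me$"), "local:poi"),
    (re.compile(r"^best <brand> (tv|phone)$"), "product:comparison"),
]

ENTITY_TYPES = {  # toy entity-linker output: surface form -> type
    "barack obama": "<person>",
    "pizza": "<food>",
    "philips": "<brand>",
}

def detect_intent(query):
    typed = query.lower()
    for surface, etype in ENTITY_TYPES.items():
        typed = typed.replace(surface, etype)
    for pattern, intent in PATTERNS:
        if pattern.match(typed):
            return intent
    return "unknown"

for q in ["barack obama height", "pizza near me", "best philips tv"]:
    print(q, "->", detect_intent(q))
```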

Page 44: Web-scale semantic search

44

UI/UX

Page 45: Web-scale semantic search

Knowledge graphs

45

Page 46: Web-scale semantic search

Knowledge graphs

46

Page 47: Web-scale semantic search

Knowledge graphs

47

Page 48: Web-scale semantic search

Knowledge graphs

48

Page 49: Web-scale semantic search

Semantic search introduces new information access tasks

49

▪ Users come to expect increasingly advanced results › related entity finding › relationship explanation • between two “adjacent” entities • between more than two entities • for any path between entities in the KG

› relationship ranking › (contextual) type ranking › disambiguation

Page 50: Web-scale semantic search

Semantic search introduces new information access tasks

50

▪ Users come to expect increasingly advanced results › related entity finding › relationship explanation • between two “adjacent” entities • between more than two entities • for any path between entities in the KG

› relationship ranking › (contextual) type ranking › disambiguation

Page 51: Web-scale semantic search

Semantic search introduces new information access tasks

51

▪ Users come to expect increasingly advanced results › related entity finding › relationship explanation • between two “adjacent” entities • between more than two entities • for any path between entities in the KG

› relationship ranking › (contextual) type ranking › disambiguation

Page 52: Web-scale semantic search

Evaluation

52

▪ How to measure success (and train models)? › clicks › A/B testing, bucket testing › interleaving (sketched below) › dwell time › eye/mouse tracking › editorial assessments • costly • hard to generalize • relevance is subjective and contextual rather than objective
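Interleaving deserves a concrete example. Below is a minimal sketch of team-draft interleaving, one common variant: merge the rankings of systems A and B, remember which team contributed each result, and credit clicks to that team. This is a standard textbook formulation, not a description of Yahoo's internal tooling.

```python
import random

def team_draft(ranking_a, ranking_b, rng=random.Random(0)):
    """Interleave two rankings; returns (interleaved list, {doc: 'A'|'B'})."""
    interleaved, team = [], {}
    while True:
        remaining_a = [d for d in ranking_a if d not in team]
        remaining_b = [d for d in ranking_b if d not in team]
        if not remaining_a and not remaining_b:
            break
        count_a = sum(1 for t in team.values() if t == "A")
        count_b = sum(1 for t in team.values() if t == "B")
        # The team with fewer picks goes next; coin flip on ties.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        if a_turn and remaining_a:
            doc = remaining_a[0]
            team[doc] = "A"
        elif remaining_b:
            doc = remaining_b[0]
            team[doc] = "B"
        else:
            doc = remaining_a[0]
            team[doc] = "A"
        interleaved.append(doc)
    return interleaved, team

def credit(team, clicked_docs):
    """Credit each click to the team that placed the clicked result."""
    wins = {"A": 0, "B": 0}
    for d in clicked_docs:
        if d in team:
            wins[team[d]] += 1
    return wins

inter, team = team_draft(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(inter, credit(team, clicked_docs=["d2"]))  # clicks on d2 credit its team
```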

Page 53: Web-scale semantic search

Evaluating semantic search

53

▪ Semantic search aims to “answer” queries › show relevant entities › show the actual answer/fact/…

▪ How do you measure/observe/determine success? › feedback › human editors • how to generalize?

› abandonment › task-/location-/context-/user-specific notion of relevance

▪ Need adequate and reliable metrics

Page 54: Web-scale semantic search

Moving towards mobile

54

▪ Limited screen real estate ▪ Costly to scroll/click/back/etc. › people actually type longer queries on mobile devices!

▪ Rich context › hyperlocal • location (lat/lon, home/work/traveling/…) • time of day

› device type

▪ Move towards “discussion-style” interfaces, i.e., interactive “IR” › “QA”/Siri › dialog systems

Page 55: Web-scale semantic search

55

Page 56: Web-scale semantic search

UI/UX

56

▪ Mobile search, centered around entities

Page 57: Web-scale semantic search

UI/UX

57

▪ Mobile search, centered around entities

Page 58: Web-scale semantic search

UI/UX

58

▪ Mobile search, centered around entities

Page 59: Web-scale semantic search

Evaluation on mobile

59

▪ Even more tricky…

▪ How do you measure/observe/determine success? › clicks (“taps”) are not easily interpreted › neither are swipes, pinches, etc.

▪ Current approaches include › “field studies” › observing users (in the lab and in the wild) › ...

▪ Need adequate and reliable metrics

Page 60: Web-scale semantic search

60

Challenges

Page 61: Web-scale semantic search

Current challenges

61

▪ Combining › text • queries • documents • entity descriptions/inlinks

› structure • internal and external to documents • explicit (RDF), implicit (links from KG to web pages), inferred (information extraction)

› context • users, sessions, and tasks • hyperlocal, personal, and social • temporal/popularity (buzziness)

› in rich, complex user interactions • evaluation?

Page 62: Web-scale semantic search

Thanks!

62

▪ More info? › @edgarmeij › [email protected] › http://edgar.meij.pro