Information Retrieval: Search at LinkedIn
Shakti Sinha, Head, Search Relevance
Daniel Tunkelang, Head, Query Understanding
May 09, 2015
Why do 200M+ people use LinkedIn?
2
People use LinkedIn because of other people.
3
Search helps members find and be found.
4
Rich collection of professional content.
5
Every search is personalized.
6
Let’s talk a bit about how it all works.
§ Query Understanding
§ Search Spam
§ Unified Search
More at http://data.linkedin.com/search.
7
Query Understanding
8
People are semi-structured objects.
9
for i in [1..n]
    B[i] ← ∅
    s ← w1 w2 … wi
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj+1 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
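The segmentation pseudocode above can be turned into a small runnable sketch. This Python version is only an illustration: the table `PHRASE_PROB` is a hypothetical stand-in for Pc with made-up values, and the names `segment_query` and `beam_size` are not from the talk. It assumes each entry of B[j] covers the first j words, so the appended segment runs from word j+1 to word i.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase probabilities standing in for Pc(s); values are made up.
PHRASE_PROB: Dict[str, float] = {
    "software": 0.05,
    "engineer": 0.04,
    "software engineer": 0.03,
    "google": 0.06,
}

Segmentation = Tuple[float, List[str]]  # (probability, list of segments)


def segment_query(words: List[str], beam_size: int = 3) -> List[Segmentation]:
    """Return up to beam_size segmentations of words, most probable first."""
    n = len(words)
    # beams[i] holds the best segmentations of the first i words.
    beams: List[List[Segmentation]] = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        candidates: List[Segmentation] = []
        # Case 1: the first i words form a single segment.
        whole = " ".join(words[:i])
        p = PHRASE_PROB.get(whole, 0.0)
        if p > 0:
            candidates.append((p, [whole]))
        # Case 2: extend a segmentation of the first j words with words j+1..i.
        for j in range(1, i):
            tail = " ".join(words[j:i])
            p = PHRASE_PROB.get(tail, 0.0)
            if p > 0:
                for prob, segs in beams[j]:
                    candidates.append((prob * p, segs + [tail]))
        # Keep only the beam_size most probable partial segmentations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[i] = candidates[:beam_size]
    return beams[n]
```

With the toy table, `segment_query(["software", "engineer", "google"])` ranks the segmentation "software engineer" + "google" first, because the two-word phrase is more probable than the product of its parts.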
Word sense is contextual.
10
Understand queries as early as possible.
11
Query structure has many applications.
§ Boost results that match query interpretation.
§ Bucket search log analysis by query classes.
§ Query rewriting specific to query classes.
§ …
Query understanding focuses on set-level metrics.
Not just about best answer, but getting to best question.
12
Search Spam
13
Let’s look at a search spammer.
14
Summary is verbose but legitimate.
15
But then comes the keyword stuffing.
16
How we train our search spam classifier.
§ Find the queries targeted by spammers.
  – 10,000 most common non-name queries.
§ Look at top results for a generic user.
  – i.e., show unpersonalized search results.
§ Remove private profiles.
  – Members first! Can’t sacrifice privacy to fight spammers.
§ Label data by crowdsourcing.
  – Relevance is subjective, but spam is relatively objective.
17
ROC curve for spam thresholding.
18
[Figure: ROC curve, both axes running from 0 to 1, with two spam score thresholds a and b marked, where 0 < a < b < 1.]
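To make the thresholding on this slide concrete, here is a minimal sketch of how points on an ROC curve can be computed from labeled spam scores. The function name `roc_points` and the sample data in the usage note are illustrative assumptions, not LinkedIn's data or tooling.

```python
from typing import List, Tuple


def roc_points(
    scores: List[float], labels: List[int], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    labels use 1 for spam and 0 for legitimate; a profile is predicted
    to be spam when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one spam and one non-spam example")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points
```

For example, calling `roc_points([0.1, 0.4, 0.35, 0.8, 0.9, 0.6], [0, 0, 1, 1, 1, 0], [0.3, 0.7])` shows the usual trade-off: the lower threshold catches every labeled spammer but flags more legitimate profiles, while the higher one flags none but misses a spammer.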
Integrate spamminess into relevance score.
§ Spam model yields a probability between 0 and 1.
§ Use spam score as piecewise linear factor:
    if score < spammin:    # not a spammer
        relevance *= 1.0
    elif score > spammax:  # spammer
        relevance *= 0.0
    else:                  # linear function of spamminess
        relevance *= (spammax - score) / (spammax - spammin)
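The piecewise linear factor can be wrapped in a small function for experimentation. This is a sketch under assumptions: the function names and the default thresholds 0.3 and 0.8 are illustrative only, since the talk does not disclose the actual values.

```python
def spam_factor(score: float, spam_min: float, spam_max: float) -> float:
    """Multiplier in [0, 1] applied to relevance, given a spam probability."""
    if not 0.0 <= spam_min < spam_max <= 1.0:
        raise ValueError("expected 0 <= spam_min < spam_max <= 1")
    if score < spam_min:
        return 1.0  # not a spammer: relevance unchanged
    if score > spam_max:
        return 0.0  # spammer: relevance zeroed out
    # In between, the factor falls linearly from 1 at spam_min to 0 at spam_max.
    return (spam_max - score) / (spam_max - spam_min)


def adjusted_relevance(
    relevance: float, score: float, spam_min: float = 0.3, spam_max: float = 0.8
) -> float:
    """Relevance after applying the spam factor (thresholds are made up)."""
    return relevance * spam_factor(score, spam_min, spam_max)
```

With these illustrative thresholds, a profile with spam score 0.55 sits halfway between them, so its relevance is halved.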
19
Spam is an arms race.
§ We can’t reveal precisely which features we use for spam detection, or spammers will work around them.
§ Spammers will try to reverse-engineer us anyway.
§ Personalization benefits us and our legitimate users – it’s hard to spam your way to high personalized ranking.
§ Fighting spam is all about making the investment less profitable for the spammer.
20
Unified Search
21
Un-Unified Search
22
Introducing LinkedIn Unified Search!
Goal: make all of our content more discoverable.
Three new features:
§ Query Auto-Complete
§ Content Type Suggestions
§ Unified Search Result Page
23
Query Auto-Complete
24
Best completion not always the most popular.
§ In a heavy-tailed distribution, even the most popular queries account for a small fraction of the distribution.
§ We don’t want to suggest generic queries that would produce useless results.
  – e.g., c -> company, j -> jobs
§ The goal is not only to infer the user’s intent but also to suggest a search that yields relevant results across content types.
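One way to picture "best completion is not always the most popular" is to weight each candidate's popularity by how useful its results tend to be. The sketch below is an assumption for illustration, not the ranking LinkedIn uses: the `suggest` function, the `(frequency, success_rate)` statistics, and the simple product score are all hypothetical.

```python
from typing import Dict, List, Tuple


def suggest(
    prefix: str, query_stats: Dict[str, Tuple[int, float]], limit: int = 3
) -> List[str]:
    """Rank completions of prefix by frequency times success rate.

    query_stats maps a query to (frequency, success_rate); both numbers
    are hypothetical inputs that would come from search logs.
    """
    scored = [
        (freq * success, query)
        for query, (freq, success) in query_stats.items()
        if query.startswith(prefix)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [query for _, query in scored[:limit]]
```

With made-up statistics where "jobs" is typed often but rarely leads to a useful result, `suggest("j", ...)` can place a more specific query such as "java developer" ahead of it.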
25
Content Type Suggestions
26
How we compute content type suggestions.
§ Rank content types by likelihood of a successful search.
  – Consider click-through behavior as well as downstream actions.
§ Bootstrap using what we know from pre-unified search behavior.
  – Tricky part is compensating for findability bias.
§ Continuously evaluate and collect feedback through user behavior.
  – E.g., members using the left rail to select a particular vertical.
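The first step, ranking content types by likelihood of a successful search, can be sketched with smoothed success rates. Everything here is an assumption for illustration: the `rank_content_types` name, the `(searches, successes)` counts, and the additive smoothing are not from the talk, and the sketch does not attempt to correct for findability bias.

```python
from typing import Dict, List, Tuple


def rank_content_types(
    stats: Dict[str, Tuple[int, int]],
    prior_successes: float = 1.0,
    prior_searches: float = 2.0,
) -> List[Tuple[str, float]]:
    """Rank content types by estimated probability of a successful search.

    stats maps a content type to (searches, successful_searches) for one
    query, where "successful" could mean a click or a downstream action.
    The priors smooth the estimate for content types with few observations.
    """
    ranked = []
    for content_type, (searches, successes) in stats.items():
        rate = (successes + prior_successes) / (searches + prior_searches)
        ranked.append((content_type, rate))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

For instance, `rank_content_types({"people": (100, 60), "jobs": (50, 20), "companies": (10, 8)})` orders the types as companies, people, jobs under these made-up counts.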
27
Unified Search Result Page
28
Intent Detection and Page Construction
§ Relevance is now a two-part computation:
  P(Content Type | User, Query) × P(Document | User, Query, Content Type)
§ Intent detection comes first: inefficient to send all queries to all verticals.
§ Secondary components introduce diversity.
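The two-part computation can be sketched as follows. This is a toy illustration under stated assumptions: the probabilities are passed in as plain dictionaries, the `min_type_prob` cutoff is a hypothetical way to avoid querying unlikely verticals, and the diversity added by secondary components is not modeled.

```python
from typing import Dict, List, Tuple


def rank_unified(
    type_probs: Dict[str, float],
    doc_probs: Dict[str, Dict[str, float]],
    min_type_prob: float = 0.1,
) -> List[Tuple[str, str, float]]:
    """Score documents as P(type | user, query) * P(doc | user, query, type).

    type_probs holds the intent-detection output for one user and query;
    doc_probs holds per-vertical document scores. Verticals whose intent
    probability is below min_type_prob are skipped entirely.
    """
    results = []
    for content_type, p_type in type_probs.items():
        if p_type < min_type_prob:
            continue  # intent detection first: do not query this vertical
        for doc, p_doc in doc_probs.get(content_type, {}).items():
            results.append((content_type, doc, p_type * p_doc))
    results.sort(key=lambda r: r[2], reverse=True)
    return results
```

Skipping low-probability verticals before scoring documents reflects the point that intent detection comes first; in a real system the skipped verticals would never be called at all.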
29
Summary
§ Personalize every search and leverage structure.
§ Understand queries as early as possible.
§ Fight the spammers that be.
§ Unify and simplify the search experience.
Goal: help LinkedIn’s 200M+ members find and be found.
30
Thank you!
31
Want to learn more?
§ Check out http://data.linkedin.com/search.
§ Contact us:
  – Shakti: [email protected] http://linkedin.com/in/sdsinha
  – Daniel: [email protected] http://linkedin.com/in/dtunkelang
§ Did we mention that we’re hiring?
32