Enhancing relevancy through personalization & semantic search

Trey Grainger

Search Technology Development Manager

Dublin, IE 2013.11.07

My Background

Trey Grainger, Search Technology Development Manager @ CareerBuilder.com

Relevant Background

•  Search & Recommendations •  High-volume, Distributed Systems •  NLP, Relevancy Tuning, User Group Testing, & Machine Learning

Other Projects

•  Co-author: Solr in Action •  Founder and Chief Engineer @ .com

•  I. How we use Solr @ CareerBuilder •  II. Traditional Relevancy Scoring •  III. Advanced Relevancy through functions

–  Factors as a linear function –  Context-aware relevancy parameter weighting

•  IV. Personalization & Recommendations –  Profile and Behavior-based –  Solr as a recommendation engine –  Collaborative Filtering

•  V. Semantic Search –  Mining user-behavior for synonyms –  Uncovering meaning through clustering –  Latent Semantic Indexing overview –  Document-based searching –  Foreground vs. Background analysis

Roadmap

How we use Solr @ CareerBuilder

•  Over 2.5 million new jobs each month

•  Over 60 million actively searchable resumes

•  ~300 globally distributed search servers

•  Thousands of unique, dynamically generated indexes

•  Over 1 billion actively searchable documents

•  Over 1 million searches an hour

Search Scale @

Data Analytics

Data Analytics (market supply)

Data Analytics (market demand)

Data Analytics (labor pressure: supply/demand)

Data Analytics (hiring comparison per market)

Traditional Search

Recommendations

Traditional Relevancy Scoring

Default Lucene Relevancy Algorithm (DefaultSimilarity)

*Source: Solr in Action, chapter 3

$$\mathrm{Score}(q,d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big) \cdot \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q)$$

Where:

t = term; d = document; q = query; f = field

$$\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$

$$\mathrm{idf}(t) = 1 + \log\big(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\big)$$

$$\mathrm{coord}(q,d) = \mathrm{numTermsInDocumentFromQuery} / \mathrm{numTermsInQuery}$$

$$\mathrm{queryNorm}(q) = 1 / \mathrm{sumOfSquaredWeights}^{1/2}$$

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

$$\mathrm{norm}(t,d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()$$

•  Term Frequency: “How well a term describes a document?” –  Measure: how often a term occurs per document

•  Inverse Document Frequency: “How important is a term overall?” –  Measure: how rare the term is across all documents

TF * IDF
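The two measures can be sketched in a few lines of Python. This is a minimal illustration that assumes the tf and idf definitions from the DefaultSimilarity formula (square root of the term count, and 1 + log(numDocs / (docFreq + 1))). The toy corpus and the function names are made up for illustration and are not part of the Lucene API.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: square root of how often the term occurs in the document."""
    return math.sqrt(doc_tokens.count(term))

def idf(term, corpus):
    """Inverse document frequency: terms found in fewer documents score higher."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / (doc_freq + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized
corpus = [
    "the cow jumped over the moon".split(),
    "the cat in the hat".split(),
    "the brown cow said moo once".split(),
]

for term in ("the", "cow", "cat"):
    print(term, round(tf_idf(term, corpus[1], corpus), 3))
```

Even though "the" occurs twice in the second document, the rarer term "cat" ends up with the higher weight, which is the intuition behind the product.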

Boosting documents and fields

•  Certain fields may be more important than other fields: –  The Job Title and Skills may be more relevant than other aspects of the job: /select?qf=jobtitle^10 skills^5 jobrequirements^2 jobdescription^1

•  It’s possible to boost documents and fields at both index time and query time

•  If you need more fine-grained control (such as per-term index-time boosting), you can make use of payloads

Custom scoring with Payloads

•  In addition to boosting search terms and fields, content within fields can also be boosted differently using payloads (requires a custom scoring implementation):

design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten [3] / years [3] / experience [3] / careerbuilder [2] / design [2], …

jobtitle: bucket=[1] boost=10; company: bucket=[2] boost=4; jobdescription: bucket=[ ] weight=1; experience: bucket=[3] weight=1.5

We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket, e.g. …&bucketWeights=1:10;2:4;3:1.5;default:1;

•  This allows us to map many relevancy buckets to search terms at index time and adjust the weighting at query time without having to search across hundreds of fields.

•  By making all scoring parameters overridable at query time, we are able to do A / B testing to consistently improve our relevancy model
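The bucket idea can be sketched outside of Solr. The snippet below is only an illustration of the concept, not the custom scoring implementation the slides refer to: it parses a `bucketWeights`-style string and sums the weight of every indexed token that matches a query term. The function names and the simple additive score are assumptions.

```python
def parse_bucket_weights(param):
    """Parse a string like '1:10;2:4;3:1.5;default:1;' into {bucket: weight}."""
    weights = {}
    for pair in param.split(";"):
        if not pair:
            continue  # tolerate the trailing ';'
        bucket, weight = pair.split(":")
        weights[bucket] = float(weight)
    return weights

def score(query_terms, indexed_tokens, weights):
    """Sum the bucket weight of every indexed token that matches a query term."""
    default = weights.get("default", 1.0)
    total = 0.0
    for term, bucket in indexed_tokens:
        if term in query_terms:
            total += weights.get(bucket, default)
    return total

# (term, bucket) pairs as tagged at index time; None means no payload
tokens = [("design", "1"), ("engineer", "1"), ("really", None), ("great", None),
          ("job", None), ("ten", "3"), ("years", "3"), ("experience", "3"),
          ("careerbuilder", "2"), ("design", "2")]

weights = parse_bucket_weights("1:10;2:4;3:1.5;default:1;")
print(score({"design"}, tokens, weights))
print(score({"great", "experience"}, tokens, weights))
```

Because the weights arrive with the query, a different weight string re-ranks the same index without reindexing, which is what makes A/B testing of the weights practical.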

•  News search: popularity and freshness drive relevance •  Restaurant search: geographical proximity and price range are critical •  Ecommerce: likelihood of a purchase is key •  Movie search: More popular titles are generally more relevant •  Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?

Advanced Relevancy through Functions

Example of domain-specific relevancy calculation

News website:

/select?
  fq=$myQuery&
  q=_query_:"{!func}scale(query($myQuery),0,100)"
    AND _query_:"{!func}div(100,map(geodist(),0,1,1))"
    AND _query_:"{!func}recip(rord(publicationDate),0,100,100)"
    AND _query_:"{!func}scale(popularity,0,100)"&
  myQuery="street festival"&
  sfield=location&
  pt=33.748,-84.391

*Example from chapter 16 of Solr in Action

Fancy boosting functions

•  Separating “relevancy” and “filtering” from the query: q=_val_:"$keywords"&fq={!cache=false v=$keywords}&keywords=solr

•  Keywords (50%) + distance (25%) + category (25%)

q=_val_:"scale(mul(query($keywords),1),0,50)"
  AND _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,25)"
  AND _val_:"scale(mul(query($category),1),0,25)"
&keywords=solr
&radiusInKm=48.28
&distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
&category=jobtitle:"java developer"
&fq={!cache=false v=$keywords}
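The 50/25/25 weighting above amounts to scaling each factor into its own range and adding the results. The Python sketch below mimics that arithmetic on made-up per-document values; it is not how Solr evaluates function queries internally, and the handling of a flat input in `scale` is a choice made for this sketch.

```python
def scale(values, target_min, target_max):
    """Linearly map values so the smallest becomes target_min and the largest target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [float(target_min)] * len(values)  # sketch choice for a flat input
    span = target_max - target_min
    return [target_min + (v - lo) * span / (hi - lo) for v in values]

def combined_scores(keyword_scores, distances_km, category_scores, radius_km):
    keywords = scale(keyword_scores, 0, 50)
    # Closer documents get a larger value: radius minus distance
    proximity = scale([radius_km - d for d in distances_km], 0, 25)
    category = scale(category_scores, 0, 25)
    return [k + p + c for k, p, c in zip(keywords, proximity, category)]

# Hypothetical raw values for three documents
scores = combined_scores(
    keyword_scores=[2.0, 1.0, 0.0],
    distances_km=[0.0, 24.14, 48.28],
    category_scores=[0.0, 1.0, 1.0],
    radius_km=48.28,
)
print(scores)
```

Scaling each factor before summing keeps any single raw score (for example an unusually high keyword score) from swamping the others.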

Context aware relevancy

Example: Willingness to relocate for a job

[Chart: percentiles from 1% to 95% on the x-axis, values from 1,000 to 2,500 on the y-axis; series: Software engineers, Food service workers]

Willingness to relocate

Software engineers in Chicago want jobs in these locations:

Willingness to relocate

Food service workers in Chicago want jobs in these locations:

Personalization & Recommendations

•  John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

•  Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.

•  Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

•  Jane is a nurse educator in Boston seeking between $40K and $60K working in the healthcare industry

Beyond domain knowledge… consider per-user knowledge

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 )
    AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA" )
    AND _val_:"map(salary, 40000, 60000,10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K working in the healthcare industry
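A query like the one for Jane can be generated from a stored profile. The sketch below assumes a hypothetical profile dictionary and reuses the field names and boosts from the example query; it does no escaping of user input, which a real implementation would need.

```python
from urllib.parse import urlencode

def build_query(profile):
    """Turn a user profile into Solr request parameters (field names from the Jane example)."""
    title, city, state = profile["jobtitle"], profile["city"], profile["state"]
    low, high = profile["salary_range"]
    q = (
        f'(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
        f'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
        f'AND _val_:"map(salary,{low},{high},10,0)"'
    )
    return {"fl": "jobtitle,city,state,salary", "q": q}

# Hypothetical profile record for Jane
jane = {"jobtitle": "nurse educator", "city": "Boston", "state": "MA",
        "salary_range": (40000, 60000)}

params = build_query(jane)
print("http://localhost:8983/solr/jobs/select/?" + urlencode(params))
```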

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":"Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
    {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
    {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359},
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action/

Search Results for Jane

•  We built a recommendation engine!

•  What is a recommendation engine? –  A system that uses known information (or derived information from that known information) to automatically suggest relevant content

•  Our example was just an attribute-based recommendation… we’ll see that behavioral-based (i.e. collaborative filtering) is also possible.

What did we just do?

Redefining “Search Engine”

•  “Lucene is a high-performance, full-featured text search engine library…”

Yes, but really…

•  Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.

Redefining “Search Engine”

or, in machine learning speak:

•  A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities.

•  Think of each field as a matrix containing each term mapped to each document

The Lucene Inverted Index (traditional text example)

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …
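The mapping from documents to an inverted index can be reproduced with a short Python sketch using the five example documents. The tokenizer here (lowercase, split on non-word characters) is an assumption standing in for a real Lucene analyzer.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said “moo” once.",
}

def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

index = build_inverted_index(docs)
for term in ("a", "brown", "the"):
    print(term, index[term])
```

A lookup by term is now a single dictionary access, which is the "very fast lookup" property the sparse-matrix view relies on.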

Matching text queries to text fields

/solr/select/?q=jobcontent:"software engineer"

Job Content Field   Documents
…                   …
engineer            doc1, doc3, doc4, doc5
mechanical          doc2, doc4, doc6
…                   …
software            doc1, doc3, doc4, doc7, doc8
…                   …

[Venn diagram: the "software" and "engineer" sets overlap on doc1, doc3, doc4 (software engineer); doc7 and doc8 match "software" only]
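Matching both terms is a set intersection over the posting lists shown for the job content field. A minimal Python sketch follows; note that a true phrase query would also check term positions, which this sketch ignores.

```python
# Posting lists from the job content field example
postings = {
    "engineer": {"doc1", "doc3", "doc4", "doc5"},
    "mechanical": {"doc2", "doc4", "doc6"},
    "software": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}

def match_all(terms, postings):
    """Return the documents that contain every query term."""
    doc_sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(sorted(match_all(["software", "engineer"], postings)))
```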

Beyond Text Searching

•  Lucene/Solr is a search matching engine

•  When Lucene/Solr search text, they are matching tokens in the query with tokens in the index

•  Anything that can be searched upon can form the basis of matching and scoring: –  text, attributes, locations, results of functions, user behavior, classifications, etc.

•  Content-based
–  Attribute based: e.g. income level, hobbies, location, experience
–  Hierarchical: e.g. “medical//nursing//oncology”, “animal//dog//terrier”
–  Textual Similarity: e.g. Solr’s MoreLikeThis Request Handler & Search Handler
–  Concept Based: e.g. Solr => “software engineer”, “java”, “search”, “open source”

•  Collaborative Filtering: “Users who liked that also liked this…”

•  Hybrid Approaches

Approaches to Recommendations

Collaborative Filtering

What you SEND to Lucene/Solr:

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
user1   doc1, doc5
user2   doc2
user3   doc2
user4   doc1, doc3, doc4, doc5
user5   doc1, doc4
…       …

Step 1: Find similar users who like the same documents

q=documentid:("doc1" OR "doc4")

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

Matching rows: doc1 (user1, user4, user5) and doc4 (user4, user5)

Top-scoring results (most similar users):
1)  user4 (2 shared likes)
2)  user5 (2 shared likes)
3)  user1 (1 shared like)

Step 2: Search for docs “liked” by those similar users

Most similar users:
1)  user4 (2 shared likes)
2)  user5 (2 shared likes)
3)  user1 (1 shared like)

/solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)

Term    Documents
user1   doc1, doc5
user2   doc2
user3   doc2
user4   doc1, doc3, doc4, doc5
user5   doc1, doc4
…       …

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match

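Both collaborative filtering steps can be traced in plain Python using the example data. This is a sketch of the logic only: Solr would do the same work with two queries, and the additive scoring here is a simplification of Lucene's real scoring. Function names are illustrative.

```python
from collections import Counter

# "Users who bought this product" field, per document (from the example)
likes = {
    "doc1": ["user1", "user4", "user5"],
    "doc2": ["user2", "user3"],
    "doc3": ["user4"],
    "doc4": ["user4", "user5"],
    "doc5": ["user4", "user1"],
}

def similar_users(likes, seed_docs):
    """Step 1: count how many of the seed documents each user also liked."""
    shared = Counter()
    for doc in seed_docs:
        shared.update(likes.get(doc, []))
    return shared

def recommend(likes, user_weights):
    """Step 2: score each document by the summed weights of the users who liked it."""
    scores = Counter()
    for doc, users in likes.items():
        for user in users:
            if user in user_weights:
                scores[doc] += user_weights[user]
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

weights = similar_users(likes, ["doc1", "doc4"])
ranking = recommend(likes, weights)
print(dict(weights))
print(ranking)
```

As in the slides, the seed documents (doc1, doc4) still appear in the ranking; a production system would typically filter out items the user has already seen.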
Building up to personalization

•  Use what you have: –  User’s keywords, IP address, searches, clicks, “likes” (purchases, job applications, comments, etc.) –  Build up a dossier of information on your users –  If a user gives you a profile (resume, social profile, etc.), even better.

For full coverage of building a recommendation engine in Solr…

•  See my talk from Lucene Revolution 2012 (Boston):

Personalized Search

•  Why limit yourself to JUST explicit search or JUST automated recommendations?

•  By augmenting your user’s explicit queries with information you know about them, you can personalize their search results.

•  Examples: –  A known software engineer runs a blank job search in New York…

•  Why not show software engineering higher in the results?

–  A new user runs a keyword-only search for nurse •  Why not use the user’s IP address to boost documents geographically closer?
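One way to augment an explicit search with profile knowledge is to leave the user's query untouched and add a boost query. The sketch below assumes the eDisMax query parser and its `bq` parameter; the field name `jobtitle`, the boost value, and the profile dictionary are hypothetical.

```python
from urllib.parse import urlencode

def personalize(params, profile):
    """Return a copy of the search params with profile-based boost queries added."""
    boosted = dict(params)
    boosted["defType"] = "edismax"
    boosts = []
    if profile.get("jobtitle"):
        # Boost, rather than filter, so non-matching jobs still appear lower down
        boosts.append(f'jobtitle:"{profile["jobtitle"]}"^10')
    boosted["bq"] = boosts
    return boosted

# A known software engineer runs a blank job search in New York
search = {"q": "*:*", "fq": 'city:"New York"'}
params = personalize(search, {"jobtitle": "software engineer"})
print(urlencode(params, doseq=True))
```

Using a boost instead of a filter keeps the explicit search authoritative: the profile only reorders results the user would have received anyway.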

Semantic Search

Not going to talk about…

•  Using the SynonymFilter •  Automatic language detection •  Stemming/lemmatization/multi-lingual search •  Stopwords (For all of the above, see the Solr Wiki, Reference Guide, or read Solr in Action)

•  Instead, we’re going to cover: –  Mining user behavior to discover synonyms/related queries –  Discovering related concepts using document clustering in Solr –  Future work: Latent Semantic Indexing –  Document to Document searching using More Like This –  Foreground/Background corpus analysis

•  Our primary approach: Search Co-occurrences
•  Strategy: Map/Reduce job which computes similar searches run for the same user

John searched for “java developer” and “j2ee”. Jane searched for “registered nurse” and “r.n.” and “prn”. Zeke searched for “java developer” and “scala” and “jvm”.

•  By mining tens of millions of search terms per day, we get a list of top searches, with the corresponding top co-occurring searches.
•  We also tie each search term to the top category of jobs (e.g. java developer, truck driver, etc.), so that we know in what context people search for each term.

Automatic Synonym Discovery
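The co-occurrence idea can be shown on the three example users. The sketch below runs in memory rather than as a Map/Reduce job, and counts a pair once per user who ran both searches; that counting rule and the function name are assumptions.

```python
from collections import Counter, defaultdict
from itertools import permutations

searches_by_user = {
    "John": ["java developer", "j2ee"],
    "Jane": ["registered nurse", "r.n.", "prn"],
    "Zeke": ["java developer", "scala", "jvm"],
}

def co_occurrences(searches_by_user):
    """For each search term, count the other terms searched by the same users."""
    related = defaultdict(Counter)
    for searches in searches_by_user.values():
        for term, other in permutations(set(searches), 2):
            related[term][other] += 1
    return related

related = co_occurrences(searches_by_user)
print(related["java developer"].most_common())
```

With real traffic the counts separate strong relationships from noise, and the top entries per term become candidate synonyms or related searches.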

Example of “related search terms”

Example: “accounting”: accountant 8880, accounts payable 5235, finance 3675, accounting clerk 3651, bookkeeper 3225, controller 2898, staff accountant 2866, accounts receivable 2842

Example: “RN”: registered nurse 6588, rn registered nurse 4300, nurse 2492, nursing 912, lpn 707, healthcare 453, rn case manager 446, registered nurse rn 404, director of nursing 321, case manager 292

Latent Semantic Indexing

•  Concept: Build a matrix of all terms, perform singular value decomposition on that matrix to reduce the number of dimensions, and index the meaningful (i.e. blurred) terms on each document.

•  Why this matters: if done correctly, the search engine can automatically collapse terms by meaning, remove the useless and redundant ones, and form its own conceptual model of your domain space. This can be used to infuse more meaning into a document than just a keyword.

•  See blog posts and presentations by John Berryman and Doug Turnbull about their work on this. They’re leading the way on this right now (in the open-source community).

•  http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy

Future work on building conceptual links
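The decomposition step in Latent Semantic Indexing can be written out as follows. This is the standard truncated SVD formulation, given here as a sketch; the symbols (a term-by-document matrix X with m terms and n documents, a rank k, and a query vector q) are notation chosen for this note, not taken from the slides.

```latex
\begin{align}
X &= U \Sigma V^{T}, & X &\in \mathbb{R}^{m \times n} \\
X &\approx X_k = U_k \Sigma_k V_k^{T}, & k &\ll \min(m, n) \\
\hat{q} &= \Sigma_k^{-1} U_k^{T} q
\end{align}
```

Keeping only the k largest singular values collapses terms that occur in similar documents onto shared dimensions. Each document is then represented by its row of V_k, and a query is folded into the same k-dimensional space before comparison.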

Using Clustering to find semantic links

Setting up Clustering in solrconfig.xml

<searchComponent name="clustering" enable="true" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Clustering Query

/solr/clustering/?q=(solr or lucene)
  &rows=100
  &carrot.title=titlefield
  &carrot.snippet=titlefield
  &LingoClusteringAlgorithm.desiredClusterCountBase=25
// clustering & grouping don’t currently play nicely

Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results

Original Query: q=(solr or lucene)
// can be a user’s search, their job title, a list of skills, or any other keyword rich data source

Clustering Results

Clusters Identified: Developer (22) Java Developer (13) Software (10) Senior Java Developer (9) Architect (6) Software Engineer (6) Web Developer (5) Search (3) Software Developer (3) Systems (3) Administrator (2) Hadoop Engineer (2) Java J2EE (2) Search Development (2) Software Architect (2) Solutions Architect (2)

Stage 1: Identify Concepts

Stage 2: Use Semantic Links in your relevancy calculation

q=content:("Developer"^22 or "Java Developer"^13 or "Software"^10 or "Senior Java Developer"^9 or "Architect"^6 or "Software Engineer"^6 or "Web Developer"^5 or "Search"^3 or "Software Developer"^3 or "Systems"^3 or "Administrator"^2 or "Hadoop Engineer"^2 or "Java J2EE"^2 or "Search Development"^2 or "Software Architect"^2 or "Solutions Architect"^2)

// You can also add the user’s location or the original keywords to the recommendations search if it helps results quality for your use-case.
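Turning the identified clusters into the boosted query is mechanical. The sketch below assumes the cluster labels and counts have already been parsed out of the clustering response into (label, count) pairs, and uses the `content` field from the example; it writes the boolean operator as uppercase `OR`, which the standard Lucene query parser expects.

```python
def clusters_to_query(clusters, field="content"):
    """Build a boosted OR query where each cluster label is weighted by its size."""
    clauses = [f'"{label}"^{count}' for label, count in clusters]
    return f"{field}:(" + " OR ".join(clauses) + ")"

# First few clusters from the example results
clusters = [("Developer", 22), ("Java Developer", 13), ("Software", 10)]
print(clusters_to_query(clusters))
```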

Goal: use an entire document as your Solr Query, recommending other related documents.

Standard approach: More Like This Handler

Alternative Approach: Foreground vs. Background corpus analysis

Document to Document Searching

solrconfig.xml: <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />

Query:

/solr/jobs/mlt/?df=jobdescription&
  fl=id,jobtitle&
  rows=3&
  q=J2EE&                              // recommendations based on top scoring doc
  mlt.fl=jobtitle,jobdescription&      // inspect these fields for interesting terms
  mlt.interestingTerms=details&        // return the interesting terms
  mlt.boost=true
