Digging into Solr
Rails Usergroup Hamburg, 13 April 2011
Jun 27, 2015
What is Solr
[Architecture diagram. Labels: HTTP Request Servlet, Admin, Update Servlet, different request handlers, XML Update, Solr Core, Lucene, config, schema, caching, concurrency, replication.]
What is Solr
● Unstructured rows
● Denormalization of data
● Dynamic fields
● Schema → Tokenizer, Filters, etc.
● Tons of XML
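What is Solr
[Diagram. Indexing: strings pass through a tokenizer and filters and end up as tokens in the index. Querying: the query passes through a tokenizer and filters, is matched against the index, and produces results.]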
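The indexing and query paths from the diagram can be illustrated with a toy sketch. It is not how Solr or Lucene implement analysis; the names `tokenize`, `LOWERCASE`, `STOPWORDS`, `analyze`, `build_index` and `search` are made up for this example, and the stop word list is an arbitrary sample. The point is only that the same tokenizer and filter chain runs at index time and at query time, so both sides agree on the tokens.

```ruby
# Toy analysis chain: a tokenizer followed by filters, applied identically
# at index time and at query time. All names here are illustrative only.

# Tokenizer: split a string into runs of letters and digits.
def tokenize(text)
  text.scan(/[[:alnum:]]+/)
end

# Filters: each takes a list of tokens and returns a new list.
LOWERCASE = ->(tokens) { tokens.map(&:downcase) }
STOPWORDS = ->(tokens) { tokens.reject { |t| %w[in the a].include?(t) } }

def analyze(text, filters = [LOWERCASE, STOPWORDS])
  filters.reduce(tokenize(text)) { |tokens, filter| filter.call(tokens) }
end

# Indexing: map each token to the ids of the documents that contain it.
def build_index(docs)
  index = Hash.new { |hash, key| hash[key] = [] }
  docs.each do |id, text|
    analyze(text).uniq.each { |token| index[token] << id }
  end
  index
end

# Querying: analyze the query the same way, then intersect the id lists.
def search(index, query)
  analyze(query).map { |token| index.fetch(token, []) }.reduce(:&) || []
end

docs = { 1 => "Boutique in Galeria Kaufhof", 2 => "Galeria Kaufhof in Hamburg" }
index = build_index(docs)
p search(index, "GALERIA hamburg") # ids of documents containing both tokens
```

Because the query is lowercased by the same filter that processed the documents, 'GALERIA' still finds entries indexed as 'Galeria'.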
What is Solr
● GET requests
hl.fragsize=0&spellcheck=true&spellcheck.extendedResults=true&qf=everything_phonetic_wa^1+display_name_phonetic_wa^2+comment_en_wa^4+review_en_wa^8+everything_en_wa^16+everything_wa^32+display_name_en_wa^64+display_name_wa^128&spellcheck.collate=true&wt=ruby&hl=true&rows=100&fl=pk_i,score&start=0&q=chipotle+bbq&spellcheck.dictionary=spell_en&bf=linear(en_rating_points_i,100,0)&spellcheck.count=1&qt=dismax&fq=closed_b:false+AND+domain_id_s:uki*+AND+(type_s:Place)
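A request like the one above can be assembled from a plain hash instead of being concatenated by hand. The following is a minimal sketch using only Ruby's standard library. The host and path in `SOLR_SELECT`, the helper name `dismax_url` and the shortened field list are assumptions for illustration; the parameter names and field names are taken from the request above.

```ruby
require "uri"

# Hypothetical host and path; adjust to the actual Solr installation.
SOLR_SELECT = "http://localhost:8983/solr/select"

# Field boosts as in the request above, shortened to three fields.
QUERY_FIELDS = {
  "everything_wa"      => 32,
  "display_name_en_wa" => 64,
  "display_name_wa"    => 128
}.freeze

# Build the URL of a dismax GET request (helper name is made up).
def dismax_url(query, boosts: QUERY_FIELDS, rows: 100, start: 0)
  params = {
    "qt"    => "dismax",
    "q"     => query,
    "qf"    => boosts.map { |field, boost| "#{field}^#{boost}" }.join(" "),
    "fq"    => "closed_b:false AND domain_id_s:uki* AND (type_s:Place)",
    "fl"    => "pk_i,score",
    "rows"  => rows.to_s,
    "start" => start.to_s,
    "wt"    => "ruby"
  }
  # encode_www_form escapes each value; spaces become "+".
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

puts dismax_url("chipotle bbq")
```

Keeping the boosts in one hash makes it easier to see and tune the weighting between fields than editing a long `qf` string.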
Solr integration into Rails
● Sunspot
● acts_as_solr
● Qype → acts_as_solr
● Optimized queries for Solr
● Monkey patching
● Defined queries without dynamic fields
● Names of search fields differ from AR names
Solr integration into Rails
● Data consistency
● Synchronous
– AR stores in MySQL and Solr
– Longer response times
– Not really synchronous when replication is involved
● Asynchronous
– AR stores in MySQL
– Data import via MySQL requests by the Solr master
– Out of sync for some minutes
– Deletion by flag first, physical deletion later
– JavaScript preprocessing of data possible
Challenges - Spellchecking
● Pool of words for spellchecking
● Words from real data
● Beeeeeeer
● 9 languages
● New → Spellchecker for different kinds of data
● Suggestion → Locator → Facet → best match?
● Similar word → fuzzy search vs. spellchecking
?
Challenges - Spellchecking
CC BY-ND 2.0 - JM3
Challenges - Spellchecking
Chipotle BBQ
CC BY-ND 2.0 - Meindert Arnold Jacob
Chinese Baby
CC BY-ND 2.0 - joshDubya
CC BY-ND 2.0 - raybdbomb
shingles!
CC BY-ND 2.0 - michael clarke stuff
Challenges – Stemming
● Stemming vs. lemmatizing
● 9 languages
● Hafen – Hafer (Harbor – Oat)
● Performance
● Stemming → Solr SnowballPorterFilterFactory
● Polish → Lemmatizing → OpenOffice
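The Hafen/Hafer problem can be shown with a deliberately naive suffix stripper. This is not the Snowball algorithm; `SUFFIXES` is a made-up minimal list and `naive_stem` is a hypothetical helper. It only demonstrates how rule-based stemming can map two unrelated words to the same stem, which a dictionary-based lemmatizer would keep apart.

```ruby
# Naive suffix stripping, for illustration only. This is NOT Snowball;
# the suffix list is an invented minimal sample.
SUFFIXES = %w[en er e s].freeze

def naive_stem(word)
  w = word.downcase
  # Strip the first listed suffix that leaves at least three characters.
  suffix = SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0...-suffix.length] : w
end

%w[Hafen Hafer].each { |word| puts "#{word} -> #{naive_stem(word)}" }
```

Both words end up with the same stem, so a search for the harbor would also return the oat.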
Challenges – Synonyms
● 9 languages
● OpenOffice rules!
● Not all languages available → NL is missing
Challenges – NGrams
● Huge index
● Tee matches Steeb
● EdgeNGrams
● Bar → Sofabar → Barmbek
● The non-matched part of the string should itself be a word → performance
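The difference between n-grams and edge n-grams in the examples above can be sketched in a few lines. The helpers `ngrams` and `edge_ngrams` and the gram sizes 3 to 5 are assumptions for illustration, not the configuration used in the talk.

```ruby
# All substrings of length min..max, as an n-gram filter would emit them.
def ngrams(word, min: 3, max: 5)
  w = word.downcase
  (min..max).flat_map do |n|
    (0..w.length - n).map { |i| w[i, n] }
  end
end

# Only the prefixes of length min..max (edge n-grams).
def edge_ngrams(word, min: 3, max: 5)
  w = word.downcase
  (min..[max, w.length].min).map { |n| w[0, n] }
end

%w[Steeb Sofabar Barmbek].each do |name|
  puts "#{name}: #{ngrams(name).size} n-grams, #{edge_ngrams(name).size} edge n-grams"
end

puts ngrams("Steeb").include?("tee")        # inner match: Tee matches Steeb
puts edge_ngrams("Steeb").include?("tee")   # no prefix match
puts edge_ngrams("Barmbek").include?("bar") # prefix match
puts edge_ngrams("Sofabar").include?("bar") # 'bar' is not a prefix here
```

Plain n-grams produce many more terms per word, which is where the huge index comes from, and they match in the middle of words. Edge n-grams keep the index smaller and only match from the start of a word.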
Challenges – Phrases
● Boost matching of phrases → whole entry
● 'Europa Passage'
● Boost matching of phrases → left-sided
● 'Galeria Kaufhof in Hamburg'
● 'Boutique in Galeria Kaufhof'
● JavaScript preprocessing
● Boost matching of phrase somewhere in entry
● How to handle matches of some words in given phrase?
Challenges – Whitespace in index
● Index: 'Ping Pong'
● Search word: 'Pingpong'
● JavaScript preprocessing
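One way such preprocessing could work is to index adjacent words glued together in addition to the individual words. The slides do not show the actual JavaScript, so the following Ruby sketch is an assumption about the idea, not the implementation used; `with_joined_pairs` is a hypothetical helper.

```ruby
# Emit the individual words plus every pair of adjacent words glued
# together, so that 'Ping Pong' also yields the term 'pingpong'.
def with_joined_pairs(name)
  words = name.downcase.split
  joined = words.each_cons(2).map(&:join)
  words + joined
end

p with_joined_pairs("Ping Pong")
```

With the glued form in the index, the search word 'Pingpong' can match an entry stored as 'Ping Pong' without any fuzzy matching.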
CC BY-ND 2.0 - zimpenfish
CC BY-ND 2.0 - Ewan-M
Experiences – server setup
[Diagram: server setup for the Live, Staging and Dev environments. Labels: load balancer, Solr slaves, Solr master, DB slave, iMac (Solr & MySQL). Arrows: Solr queries, replication, import.]
Experiences – size of indices
● Staging system → Sunday evening
● Places in simple format: 712 MB
● Previews simple format: 5.519 GB
● Places Previews Comments extended: 3.5 GB
● Big spellchecker: 16 GB
● New combined index: 15 GB
– Index: 14 GB
– Spellchecker: 1 GB
Experiences – server setup
● Live servers
● 2 x 8 cores, 2 x 16 cores
● 32 GB RAM
● Max. CPU usage: up to 500%
● Solr loves RAM → 32 GB full with cache
Experiences – Solr loves RAM
● Dev → 1 GB
● Staging → 4.5 GB (no load)
● Import → 11 GB and more
● Production → 14 GB
Experiences – accesses
● More than ~60 requests per second is not recommended
● Up to 40 requests per second is OK
Experiences – Response times
● Spellchecking 'pizzt', big index (staging): 1502 / 48 / 47 / 48 / 31 ms
● Spellchecking 'pizzt', small index (staging): 603 / 12 / 8 / 9 / 9 ms
Experiences – Response times
● Facet for spellchecking:
facet=true&facet.mincount=0&facet.limit=1&wt=ruby&rows=0&fl=pk_i,score&facet.query=comment_de_wa:"pizza"+OR+review_de_wa:"pizza"+OR+everything_de_wa:"pizza"+OR+everything_wa:"pizza"+OR+display_name_de_wa:"pizza"+OR+display_name_wa:"pizza"+OR+display_name_ngram:"pizza"&facet.query=comment_de_wa:"pizze"+OR+review_de_wa:"pizze"+OR+everything_de_wa:"pizze"+OR+everything_wa:"pizze"+OR+display_name_de_wa:"pizze"+OR+display_name_wa:"pizze"+OR+display_name_ngram:"pizze"&facet.query=comment_de_wa:"pizz"+OR+review_de_wa:"pizz"+OR+everything_de_wa:"pizz"+OR+everything_wa:"pizz"+OR+display_name_de_wa:"pizz"+OR+display_name_wa:"pizz"+OR+display_name_ngram:"pizz"&facet.query=comment_de_wa:"pizzi"+OR+review_de_wa:"pizzi"+OR+everything_de_wa:"pizzi"+OR+everything_wa:"pizzi"+OR+display_name_de_wa:"pizzi"+OR+display_name_wa:"pizzi"+OR+display_name_ngram:"pizzi"&facet.query=comment_de_wa:"pizzs"+OR+review_de_wa:"pizzs"+OR+everything_de_wa:"pizzs"+OR+everything_wa:"pizzs"+OR+display_name_de_wa:"pizzs"+OR+display_name_wa:"pizzs"+OR+display_name_ngram:"pizzs"&facet.query=comment_de_wa:"pizzo"+OR+review_de_wa:"pizzo"+OR+everything_de_wa:"pizzo"+OR+everything_wa:"pizzo"+OR+display_name_de_wa:"pizzo"+OR+display_name_wa:"pizzo"+OR+display_name_ngram:"pizzo"&facet.query=comment_de_wa:"pizzy"+OR+review_de_wa:"pizzy"+OR+everything_de_wa:"pizzy"+OR+everything_wa:"pizzy"+OR+display_name_de_wa:"pizzy"+OR+display_name_wa:"pizzy"+OR+display_name_ngram:"pizzy"&facet.query=comment_de_wa:"pizzn"+OR+review_de_wa:"pizzn"+OR+everything_de_wa:"pizzn"+OR+everything_wa:"pizzn"+OR+display_name_de_wa:"pizzn"+OR+display_name_wa:"pizzn"+OR+display_name_ngram:"pizzn"&facet.query=comment_de_wa:"pezzt"+OR+review_de_wa:"pezzt"+OR+everything_de_wa:"pezzt"+OR+everything_wa:"pezzt"+OR+display_name_de_wa:"pezzt"+OR+display_name_wa:"pezzt"+OR+display_name_ngram:"pezzt"&facet.query=comment_de_wa:"pizzä"+OR+review_de_wa:"pizzä"+OR+everything_de_wa:"pizzä"+OR+everything_wa:"pizzä"+OR+display_name_de_wa:"pizzä"+OR+display_name_wa:"pizzä"+OR+display_name_ngram:"pizzä"&q=*:*&qt=standard&fq=closed_b:false+AND+domain_id_s:de600-hamburg*+AND+(type_s:Place)
● 10 facets: 231 / 5 / 4 / 22 / 3 (->xml) ms
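The facet request above repeats one `facet.query` per spelling candidate, each OR-ing the same set of fields. A minimal sketch of generating those parameters from a list of candidates follows. The field names are taken from the request; the helper names `facet_query` and `candidate_facet_params` are hypothetical, and the filter query is left out for brevity.

```ruby
require "uri"

# Fields searched for every candidate, as in the request above.
FACET_FIELDS = %w[
  comment_de_wa review_de_wa everything_de_wa everything_wa
  display_name_de_wa display_name_wa display_name_ngram
].freeze

# One facet query: the candidate word OR-ed across all fields.
def facet_query(word)
  FACET_FIELDS.map { |field| %(#{field}:"#{word}") }.join(" OR ")
end

# Query string with one facet.query parameter per candidate.
def candidate_facet_params(candidates)
  URI.encode_www_form(
    "q"              => "*:*",
    "qt"             => "standard",
    "rows"           => "0",
    "wt"             => "ruby",
    "facet"          => "true",
    "facet.mincount" => "0",
    "facet.limit"    => "1",
    # An array value is emitted as a repeated parameter.
    "facet.query"    => candidates.map { |word| facet_query(word) }
  )
end

puts candidate_facet_params(%w[pizza pizze pizz])
```

With `rows=0` no documents are returned, only one count per candidate, so the candidate with the most hits can be offered as the suggestion.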
Experiences – Response times
● Staging / index schema on prod
● Standard query 'pizza': 106 / 0 / 0 (9122)
● Fuzzy (pizza~0.3): 4440 / 663 / 0 (40149)
● Fuzzy (pizza~0.5): 822 / 0 / 0 (12129)
● Fuzzy (pizza~0.8): 34 / 1 / 0 (9122)
● Wildcard (rest*): 39 / 0 / 0 (41031)
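Why a lower fuzzy threshold is slower and matches more (the figures in parentheses are presumably hit counts) can be sketched with an edit-distance based similarity. The formula below, 1 minus the Levenshtein distance divided by the length of the shorter word, is an assumption modeled on how older Lucene versions are commonly described; it has not been checked against the Solr version used in the talk. The vocabulary is an invented sample.

```ruby
# Classic dynamic-programming Levenshtein distance, one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed similarity: 1 - distance / length of the shorter word.
def similarity(a, b)
  shorter = [a.length, b.length].min
  return 0.0 if shorter.zero?
  1.0 - levenshtein(a, b).to_f / shorter
end

def fuzzy_matches(term, vocabulary, min_similarity)
  vocabulary.select { |word| similarity(term, word) >= min_similarity }
end

vocabulary = %w[pizza pizze plaza pasta] # invented sample terms
[0.8, 0.5, 0.3].each do |threshold|
  puts "pizza~#{threshold}: #{fuzzy_matches('pizza', vocabulary, threshold).inspect}"
end
```

The lower the threshold, the more terms of the index qualify and have to be expanded into the query, which is consistent with pizza~0.3 being far slower than pizza~0.8 in the measurements above.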