Solr Performance & Key Innovations Yonik Seeley, Lucid Imagination [email protected], May 26 2011

Mar 30, 2016

Transcript
Page 1: Seeley Yonik - Solr Performance Key Innovations

Solr Performance & Key Innovations

Yonik Seeley, Lucid Imagination [email protected], May 26 2011

Page 2: Seeley Yonik - Solr Performance Key Innovations

Solr 3.1 Highlights

§  Numeric range facets (similar to date faceting).
§  New spatial search, including spatial filtering, boosting and sorting capabilities.
§  Example Velocity-driven search UI at http://localhost:8983/solr/browse
§  A new, faster termvector-based highlighter.
§  Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full Lucene syntax support.
§  Distributed search support for the Spell check and Terms components.

Page 3: Seeley Yonik - Solr Performance Key Innovations

Solr 3.1 Highlights (continued)

§  Suggester, a fast trie-based autocomplete component.
§  Sort results by any function query.
§  JSON document indexing.
§  CSV response format.
§  Apache UIMA integration for metadata extraction.
§  Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.

Page 4: Seeley Yonik - Solr Performance Key Innovations

What’s not in 3.1?

§  Result Grouping (AKA Field Collapsing)
§  Pivot Faceting
§  SolrCloud
§  Pseudo-fields
§  Pseudo-join
§  Relevancy function queries
§  Per-segment faceting
§  *Tons* of new Lucene performance/efficiency goodness

Page 5: Seeley Yonik - Solr Performance Key Innovations

Recent Lucene Performance

§  TieredMergePolicy – the new default
    •  Much better for incremental indexing / NRT
    •  Ignores segment order when selecting best merge
    •  Takes deletes into account
    •  Does not over-merge (no cascading merges)
§  Finite State Transducer (FST) based terms index

Page 6: Seeley Yonik - Solr Performance Key Innovations

DocumentsWriterPerThread (DWPT)

[Diagram: several indexing threads each feed their own in-memory DWPT inside the Index Writer. Each DWPT flushes its own segment to disk: _1_0.tiv, _1_0.prx, _1_0.frq, …; _2_0.tiv, _2_0.prx, _2_0.frq, …; _3_0.tiv, _3_0.prx, _3_0.frq, …]

§  Flushing a new segment is now concurrent with indexing
§  Use multiple indexing threads/connections
§  When max memory is hit, the biggest DWPT is flushed concurrently

Page 7: Seeley Yonik - Solr Performance Key Innovations

Solr Cloud

[Diagram: a request to http://.../solr/collection1?distrib=true fans out as load-balanced sub-requests to the shards. shard1 and shard2 each have three replicas (replica1, replica2, replica3). A ZooKeeper quorum of five ZK nodes holds the cluster state shown below.]

ZooKeeper layout:

/configs
    /myconf
        solrconfig.xml
        schema.xml
/livenodes
    server1:8983/solr
    server2:8983/solr
    server2:8983/solr
/collections
    /collection1
        configName=myconf
        /shards
            /shard1
                server1:8983/solr
                server2:8983/solr
            /shard2
                server3:8983/solr
                server4:8983/solr

Page 8: Seeley Yonik - Solr Performance Key Innovations

Solr Cloud: Getting Started

http://wiki.apache.org/solr/SolrCloud

java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

-Dbootstrap_confdir and -Dcollection.configName: upload /solr/conf to ZK and call it “myconf”
-DzkRun: run an internal ZK server

http://localhost:8983/solr/collection1/admin/zookeeper.jsp

Page 9: Seeley Yonik - Solr Performance Key Innovations

Distributed Requests

§  Explicitly specify node addresses to load-balance across
    shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
    •  A list of equivalent nodes is separated by “|”
    •  Different phases of the same distributed request use the same node
§  Specify logical shard ids to search across
    shards=NY_shard,NJ_shard
§  Query across all shards in the collection
    http://localhost:8983/solr/collection1/select?distrib=true
§  public CloudSolrServer(String zkHost)
    •  SolrJ Java client that load-balances across all nodes in cluster
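The shards syntax above ("|" between equivalent nodes, "," between shards) can be assembled mechanically. This is a minimal Python sketch, not SolrJ or any Solr client API: the helper names shards_param and select_url are made up for illustration, and q=ipod is an example query.

```python
from urllib.parse import urlencode


def shards_param(shards):
    """Join equivalent nodes with '|' and separate shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


def select_url(base, query, shards=None):
    """Build a /select URL; fall back to distrib=true when no shards are given."""
    params = {"q": query}
    if shards:
        params["shards"] = shards_param(shards)
    else:
        params["distrib"] = "true"
    return base + "/select?" + urlencode(params)


# Two shards, each with two equivalent nodes (addresses taken from the slide).
shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
]
explicit = shards_param(shards)
url = select_url("http://localhost:8983/solr/collection1", "ipod")
```

Note that urlencode percent-encodes "|" and "," when the shards value is placed in a URL; Solr decodes them on receipt.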

Page 10: Seeley Yonik - Solr Performance Key Innovations

Extended Dismax Parser

§  Superset of dismax
§  Designed to directly handle user queries w/o exceptions
    &defType=edismax&q=foo&qf=body
§  Fixes edge cases where dismax could still throw exceptions
    OR   AND   NOT   -   "
§  Full Lucene syntax support
    •  Tries Lucene syntax first
    •  Smart escaping is done if there are syntax errors
§  Optionally supports treating “and”/“or” as AND/OR in Lucene syntax
§  Fielded queries (e.g. myfield:foo) even in degraded mode
    •  uf parameter controls what field names may be directly specified in “q”
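Because edismax is designed to accept raw user input, a client only has to URL-encode the text and pass it through. This is a small Python sketch of building such a request; the helper name edismax_params and the sample user input are assumptions, while defType, q and qf come from the slide.

```python
from urllib.parse import urlencode, parse_qs


def edismax_params(user_query, qf):
    """Request parameters for an edismax query over the given query field(s)."""
    return {"defType": "edismax", "q": user_query, "qf": qf}


# Raw user text with an operator and an unbalanced quote; no client-side escaping
# of query syntax is attempted here, only URL encoding.
raw = 'foo AND "unbalanced'
query_string = urlencode(edismax_params(raw, "body"))
round_trip = parse_qs(query_string)["q"][0]
```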

Page 11: Seeley Yonik - Solr Performance Key Innovations

Extended Dismax Parser (continued)

§  boost parameter for multiplicative boost-by-function
§  Pure negative query clauses
    Example: solr OR (-solr)
§  Enhanced term proximity boosting
    •  pf2=myfield – results in term bigrams in sloppy phrase queries
       myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
§  Enhanced stopword handling
    •  stopwords omitted in main query, but added in optional proximity boosting part
       Example: q=solr is awesome & qf=myfield & pf2=myfield
       -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
    •  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
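The bigram and stopword rewrites above can be imitated in a few lines. This Python sketch only reproduces the shape of the rewritten query strings shown on the slide; it is not the parser's implementation, and the function names and the stopword set {"is"} are assumptions.

```python
def bigram_phrases(field, terms):
    """Turn a term list into adjacent-pair phrase clauses, as pf2 does."""
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


def rewrite(q, field, stopwords):
    """Main clause drops stopwords; the optional proximity part keeps them."""
    terms = q.split()
    main_terms = [t for t in terms if t not in stopwords]
    main = "+%s:(%s)" % (field, " ".join(main_terms))
    boost = "(%s)" % " ".join(bigram_phrases(field, terms))
    return main + " " + boost


bigrams = bigram_phrases("myfield", ["aa", "bb", "cc"])
rewritten = rewrite("solr is awesome", "myfield", {"is"})
```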

Page 12: Seeley Yonik - Solr Performance Key Innovations

Faceting Performance Improvements

§  For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
§  Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
§  Optimized deep facet paging – up to 10x faster with really large facet.offsets
§  Less memory consumed by field cache entries
§  Per-segment faceting with facet.method=fcs
    •  Only faster when re-opening index frequently (many times a second)
    •  Only works for single-valued fields

Page 13: Seeley Yonik - Solr Performance Key Innovations

Pivot Faceting

§  Other names that could have made sense:
    •  Grid Faceting, Cross-Product Faceting, Matrix Faceting
§  Syntax: facet.pivot=field1,field2,field3,…

facet.pivot=cat,inStock

                     | #docs | #docs w/ inStock:true | #docs w/ inStock:false
cat:electronics      | 14    | 10                    | 4
cat:memory           | 3     | 3                     | 0
cat:connector        | 2     | 0                     | 2
cat:graphics card    | 2     | 0                     | 2
cat:hard drive       | 2     | 2                     | 0

Page 14: Seeley Yonik - Solr Performance Key Innovations

Pivot Faceting

http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          (continued)
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

"count":14 under cat:electronics: 14 docs w/ cat==electronics
"count":5 under popularity:6: 5 docs w/ cat==electronics && popularity==6
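A nested pivot response like this is easiest to consume by walking it recursively. The Python sketch below flattens a trimmed copy of the response into (path, count) rows; the sample keeps only a few of the slide's entries, and flatten is a hypothetical helper, not part of any Solr client.

```python
import json

# Trimmed sample shaped like the facet_pivot response above.
response = json.loads("""
{"facet_counts": {"facet_pivot": {"cat,popularity": [
  {"field": "cat", "value": "electronics", "count": 14, "pivot": [
    {"field": "popularity", "value": "6", "count": 5},
    {"field": "popularity", "value": "7", "count": 4}]},
  {"field": "cat", "value": "memory", "count": 3, "pivot": []}
]}}}
""")


def flatten(pivots, path=()):
    """Yield (path, count) per node; path is a tuple of (field, value) pairs."""
    for node in pivots:
        here = path + ((node["field"], node["value"]),)
        yield here, node["count"]
        # Child pivots refine the parent constraint with one more field.
        yield from flatten(node.get("pivot", []), here)


rows = list(flatten(response["facet_counts"]["facet_pivot"]["cat,popularity"]))
```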

Page 15: Seeley Yonik - Solr Performance Key Innovations

Range Faceting

§  Like Date faceting, but more generic
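Range faceting counts documents in fixed-width buckets over a numeric field, each bucket being identified by its lower bound. This is a rough Python sketch of how a start, end and gap translate into bucket lower bounds. It assumes the range is an exact multiple of the gap and does not model how Solr treats a partial last bucket; the function name and sample values are illustrative only.

```python
def range_lower_bounds(start, end, gap):
    """Lower bound of each gap-wide bucket between start (inclusive) and end."""
    bounds = []
    lo = start
    while lo < end:
        bounds.append(lo)
        lo += gap
    return bounds


# A price range of 0 to 500 in steps of 50 gives ten buckets.
bounds = range_lower_bounds(0.0, 500.0, 50.0)
```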

http://...&facet=true
    &facet.range=price
    &facet.range.start=0
    &facet.range.end=500
    &facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}

Page 16: Seeley Yonik - Solr Performance Key Innovations

Spatial Search

Step 1: Index some locations!
    <field name="name">The Alpine Shop</field>
    <field name="store">44.013617,-73.168264</field>

Step 2: Decide where you are
    &pt=44.0153371,-73.16734
    &d=1
    &sfield=store

Step 3: Profit!
    Spatial Filter:          &fq={!geofilt}
    Bounding Box:            &fq={!bbox}
    Distance Function:       &sort=geodist() asc
    Returning the distance:  &fl=geodist()

Note: You can now sort by any arbitrary function query!

Pseudo-fields!
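To sanity-check the example, the great-circle distance between the indexed store and the query point can be computed by hand. This Python sketch uses the haversine formula with a mean Earth radius of 6371 km. It assumes that d and geodist() are expressed in kilometres, and it is not Solr's own code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)  # indexed "store" field value
pt = (44.0153371, -73.16734)     # the &pt= query point
distance_km = haversine_km(store[0], store[1], pt[0], pt[1])
within_d = distance_km <= 1      # mirrors &d=1 under the kilometre assumption
```

Under these assumptions the store sits a few hundred metres from the query point, so it falls inside a d=1 filter.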

Page 17: Seeley Yonik - Solr Performance Key Innovations

Pseudo-Fields

Returns other info along with document stored fields

§  Function queries
    fl=name,location,geodist(),add(myfield,10)
§  Fieldname globs
    fl=id,attr_*
§  Multiple “fl” (field list) values
    &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
§  Aliasing
    fl=id,location:loc,_dist_:geodist()
§  Future: inlined highlighting, “explain”, sort-values, group-value
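Repeating the fl parameter is straightforward from client code as long as the parameters are passed as a list of pairs rather than a dict. This is a short Python sketch using only the standard library; the q=*:* query is an assumption added to make the request complete.

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs keeps all three fl values; a dict would keep only one.
params = [
    ("q", "*:*"),
    ("fl", "id,attr_*"),
    ("fl", "geodist()"),
    ("fl", "termfreq(text,'solr')"),
]
query_string = urlencode(params)
field_lists = parse_qs(query_string)["fl"]
```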


Page 18: Seeley Yonik - Solr Performance Key Innovations

Result Grouping / Field Collapsing

§  Goal
    •  Limit the number of results per category
    •  “category” normally defined by unique values in a field
§  Uses
    •  Web Search – collapse by web site
    •  Email threads – collapse by thread id
    •  Ecommerce/retail
        •  Show the top 5 items for each store category (music, movies, etc)

Page 19: Seeley Yonik - Solr Performance Key Innovations

Field Collapsing by Site

Page 20: Seeley Yonik - Solr Performance Key Innovations

Field Collapse on Product Type Result Grouping by Category

Page 21: Seeley Yonik - Solr Performance Key Innovations

Group by Field

http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}
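On the client side, a grouped response is read by iterating over groups and taking each group's doclist. This Python sketch parses a copy of the response above with the standard json module; group_summary is a hypothetical helper name.

```python
import json

response = json.loads("""
{"grouped": {"manu_exact": {"matches": 3, "groups": [
  {"groupValue": "Belkin",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
  {"groupValue": "Apple Computer Inc.",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "MA147LL/A", "name": "Apple 60 GB iPod with Video Playback Black"}]}}
]}}}
""")


def group_summary(resp, field):
    """Return (group value, docs in group, ids actually returned) per group."""
    summary = []
    for group in resp["grouped"][field]["groups"]:
        doclist = group["doclist"]
        ids = [doc["id"] for doc in doclist["docs"]]
        summary.append((group["groupValue"], doclist["numFound"], ids))
    return summary


summary = group_summary(response, "manu_exact")
```

numFound counts every document in the group, while docs holds only as many as group.limit allows, which is why the Belkin group reports 2 but shows one document.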

Page 22: Seeley Yonik - Solr Performance Key Innovations

Group by Query

http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
          "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}

Page 23: Seeley Yonik - Solr Performance Key Innovations

Grouping Params

parameter | meaning | default
group.field=<field> | Like facet.field – group by unique field values |
group.query=<query> | Like facet.query – top docs that also match |
group.function=<function query> | Group by unique values produced by the function query |
group.limit=<n> | How many docs per group | 1
group.sort=<sort spec> | How to sort documents within a group | Same as sort
rows=<n> | How many groups to return | 10
sort=<sort spec> | How to sort the groups relative to each other (based on top doc) |
group.format=<format> | grouped/simple – if simple, a single flat list is used and rows units are “docs” | grouped
group.main=true/false | If true, the first field grouping command is used as main result set | false
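The interplay of rows, group.limit and the main sort is easier to see on a toy in-memory model. This Python sketch imitates the semantics described in the table and is not Solr's implementation. It assumes the input documents are already ordered by the main sort, so a group's rank is the rank of its top document; group_results and the sample documents are illustrative.

```python
def group_results(docs, field, group_limit=1, rows=10):
    """Group pre-sorted docs by a field; defaults mirror group.limit=1, rows=10."""
    groups = {}
    for doc in docs:
        # dicts keep insertion order, so groups stay ordered by their top doc
        groups.setdefault(doc[field], []).append(doc)
    result = []
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "numFound": len(members),        # every doc in the group
            "docs": members[:group_limit],   # only the top group_limit docs
        })
    return result


docs = [  # sample docs, assumed already sorted by relevance
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]
default_groups = group_results(docs, "manu_exact")
two_per_group = group_results(docs, "manu_exact", group_limit=2, rows=1)
```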

Page 24: Seeley Yonik - Solr Performance Key Innovations

Pseudo-Join

Blog documents:

    id: blog1
    name: Solr ‘n Stuff
    owner: Yonik Seeley
    started: 2007-10-26

    id: blog2
    name: lifehacker
    owner: Gawker Media
    started: 2005-1-31

Post documents:

    id: post1
    blog_id: blog1
    author: Yonik Seeley
    title: Solr relevancy function queries
    body: Lucene’s default ranking […]

    id: post2
    blog_id: blog1
    author: Yonik Seeley
    title: Solr result grouping
    body: Result Grouping, also called […]

    id: post3
    blog_id: blog2
    author: Whitson Gordon
    title: How to Install Netflix on Almost Any Android Device

Restrict to blogs mentioning netflix:

    fq={!join from=blog_id to=id}body:netflix

-  Finds all documents matching “netflix”
-  Maps to different docs by following blog_id to id
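The two steps of the join (match, then follow from-field values to to-field values) can be modelled on plain dictionaries. This Python sketch is an in-memory analogy, not how Solr executes joins; the body text of post3 is invented for the example since the slide does not show it, and join is a hypothetical helper.

```python
docs = [
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "blog2", "name": "lifehacker"},
    {"id": "post1", "blog_id": "blog1", "body": "Lucene's default ranking"},
    {"id": "post2", "blog_id": "blog1", "body": "Result Grouping, also called"},
    # body below is made up for illustration; the slide only shows the title
    {"id": "post3", "blog_id": "blog2", "body": "install netflix on android"},
]


def as_list(value):
    """Treat single-valued and multi-valued fields uniformly."""
    return value if isinstance(value, list) else [value]


def join(docs, from_field, to_field, matches):
    """Collect from_field values of matching docs, keep docs whose to_field hits."""
    keys = set()
    for doc in docs:
        if from_field in doc and matches(doc):
            keys.update(as_list(doc[from_field]))
    return [d for d in docs
            if to_field in d and keys.intersection(as_list(d[to_field]))]


# fq={!join from=blog_id to=id}body:netflix
blogs = join(docs, "blog_id", "id", lambda d: "netflix" in d.get("body", ""))
blog_ids = [d["id"] for d in blogs]
```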

Page 25: Seeley Yonik - Solr Performance Key Innovations

Pseudo-Join Examples

§  Only show posts from blogs started after 2010
    q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
§  If any post in a blog mentions “obama”, then search all posts in that blog for “bomb” (self-join)
    q=bomb&fq={!join from=blog_id to=blog_id}obama
§  If any blog post mentions “obama”, then search all websites with the same blog owner for “bomb”
    q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama

Page 26: Seeley Yonik - Solr Performance Key Innovations

Cross-Core Join

http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john

[Diagram: a single Solr server hosting two cores, collection1 and sec1.]

collection1:

    id: doc1
    security: managers
    title: doc for managers only
    body: …

    id: doc1
    security: managers, employees
    title: doc for everyone
    body: …

sec1:

    id: mary
    security_groups: managers, employees

    id: john
    security_groups: employees
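The security-filtering idea can be mimicked with two in-memory lists standing in for the two cores. This Python sketch is an analogy only. It looks users up by id (the slide's filter is user:john, while the sample documents carry the name in id), and it renames the second document to doc2 so the two can be told apart; both choices are assumptions of the sketch.

```python
# Stand-ins for the two cores on the slide.
sec1 = [
    {"id": "mary", "security_groups": ["managers", "employees"]},
    {"id": "john", "security_groups": ["employees"]},
]
collection1 = [
    {"id": "doc1", "security": ["managers"], "title": "doc for managers only"},
    # the slide labels this one doc1 as well; doc2 is used here for clarity
    {"id": "doc2", "security": ["managers", "employees"], "title": "doc for everyone"},
]


def visible_docs(user_id):
    """Join sec1.security_groups -> collection1.security for one user."""
    groups = set()
    for user in sec1:
        if user["id"] == user_id:
            groups.update(user["security_groups"])
    return [doc["id"] for doc in collection1 if groups.intersection(doc["security"])]


john_docs = visible_docs("john")
mary_docs = visible_docs("mary")
```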

Page 27: Seeley Yonik - Solr Performance Key Innovations

Pseudo-Join vs Grouping

Pseudo-Join | Result Grouping / Field Collapsing
O(n_terms_in_join_fields) | O(n_docs_in_result)
Single or multi-valued fields | Single-valued fields only
Filters only (no info currently passed from the “from” docs to the “to” docs). | Can order docs within a group and groups by top doc within that group using normal sort criteria.
Chainable (one join can be the input to another) | Not currently chainable – can only group one field deep
Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs) | Grouping does not currently affect the set of documents matching the query, so faceting is unaffected.

Page 28: Seeley Yonik - Solr Performance Key Innovations

Auto-Suggest

§  Many people previously used terms component
    •  Can be slow for a large corpus
§  New auto-suggest builds off SpellCheck component
    •  TST implementation: compact memory based trie
    •  FST implementation: slower to build, but smaller & faster lookup
    •  Based on a field in the main index, or on a dictionary file
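For readers unfamiliar with trie-based completion, the core idea is to walk down the tree along the typed prefix and then enumerate everything below that node. The Python sketch below uses a plain dictionary-based trie for brevity; it is not the TST or FST structure Solr uses, and the word list is made up.

```python
class Trie:
    """A minimal character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored words starting with prefix, in sorted order."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack and len(found) < limit:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            # push in reverse so the smallest character is popped first
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return found


trie = Trie()
for term in ["ultrasharp", "ultra", "usb", "ipod"]:
    trie.insert(term)
suggestions = trie.suggest("ult")
```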

http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}

Page 29: Seeley Yonik - Solr Performance Key Innovations

Index with JSON

$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
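The same update can be prepared from a program instead of curl. This Python sketch builds the request with the standard library and deliberately stops short of sending it, so it runs without a Solr server; passing the request to urllib.request.urlopen would perform the POST.

```python
import json
from urllib.request import Request

URL = "http://localhost:8983/solr/update/json"

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}

# The update handler takes a JSON array of documents, as in the curl example.
body = json.dumps([doc]).encode("utf-8")
request = Request(URL, data=body, headers={"Content-type": "application/json"})
# urllib.request.urlopen(request) would send it; not called in this sketch.
```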

Page 30: Seeley Yonik - Solr Performance Key Innovations

Query Results in CSV

http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

§  Can handle multi-valued fields (see “cat” field in example)
§  Completely compatible with the CSV update handler (can round-trip)
§  Results are streamed – good for dumping entire parts of the index
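Reading this output back is a job for any CSV parser. The Python sketch below uses the csv module and splits the multi-valued cat cell on commas, which is how the values appear in the sample above. It assumes no individual value contains a comma of its own; escaped values are not handled.

```python
import csv
import io

text = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''

rows = list(csv.DictReader(io.StringIO(text)))
for row in rows:
    # The quoted cell holds several values joined by commas.
    row["cat"] = row["cat"].split(",")
    row["price"] = float(row["price"])
    row["popularity"] = int(row["popularity"])
```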

Page 31: Seeley Yonik - Solr Performance Key Innovations

http://localhost:8983/solr/browse

Page 32: Seeley Yonik - Solr Performance Key Innovations

Q&A