Solr 1.5 and Beyond Yonik Seeley May 11, 2010

NYC Lucene/Solr Meetup

Lucid Imagination, Inc.

Agenda

Lucene/Solr merge

Relevancy (Extended Dismax Parser)

Scalability (Solr Cloud)

Spatial/Geo Search

Near Real Time

Field Collapsing

Q&A

Lucene-Solr Merge

Lucene/Solr voted to merge (March 2010)
• Were already separate sub-projects of the Lucene TLP
• High committer overlap
• Solr had stopped using Lucene trunk/development versions
• Much code duplication

What it means
• Single set of committers
• Single developer mailing list (dev@lucene.apache.org)
• Single Subversion trunk
• Keep separate downloads, user mailing lists

Lucene/Solr Development Changes

Nutch, Tika, Mahout spun off to their own TLPs
• Still may be considered part of the “Lucene Ecosystem”

Lucene/Solr development changes
• trunk is now always the next major release (currently 4.0)
• branch_3x will be the base for all 3.x releases
• No back-compat guarantees between major releases

Relevance

Extended Dismax Parser

Superset of dismax
• &defType=edismax&q=foo&qf=body

Fixes edge cases where dismax could still throw exceptions
• OR AND NOT - "

Full Lucene syntax support
• Tries Lucene syntax first
• Smart escaping is done if there are syntax errors
• Optionally supports treating “and”/“or” as AND/OR in Lucene syntax

Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls what field names may be directly specified in “q”
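
As a rough illustration of the request parameters above, the following sketch builds a select URL that sends a user query through edismax. The host, port, and `/solr/select` path are assumptions for a local test instance, and `edismax_url` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr instance; adjust for a real deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"


def edismax_url(user_query, query_fields, base=SOLR_SELECT):
    """Build a select URL that routes the raw user query through edismax."""
    params = {
        "defType": "edismax",           # use the extended dismax parser
        "q": user_query,                # raw user input, operators included
        "qf": " ".join(query_fields),   # fields to search
    }
    # urlencode escapes spaces, parentheses and quotes in the user query.
    return base + "?" + urlencode(params)


print(edismax_url("foo", ["body"]))
print(edismax_url("solr OR (-solr)", ["body"]))
```

Only the URL is built here; no request is sent, so the sketch runs without a Solr server.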

Extended Dismax Parser (continued)

boost parameter for multiplicative boost-by-function

Pure negative query clauses
• Example: solr OR (-solr)

Enhanced term proximity boosting
• pf2=myfield results in term bigrams in sloppy phrase queries
• myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"

Enhanced stopword handling
• Stopwords omitted in main query, but added in optional proximity boosting part
• Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
• Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
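
The bigram and stopword behaviour above can be mimicked in a few lines. This is only a string-level sketch of the rewrite shown in the example, not how the parser is implemented: `sketch_edismax` and `word_bigrams` are hypothetical names, and whitespace splitting stands in for real analysis.

```python
def word_bigrams(words):
    """Return adjacent word pairs: ['aa', 'bb', 'cc'] -> ['aa bb', 'bb cc']."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def sketch_edismax(q, field, stopwords):
    """Render the main query plus pf2-style phrase clauses as a string."""
    words = q.split()
    # Mandatory part: stopwords are dropped from the main query.
    main_terms = [w for w in words if w.lower() not in stopwords]
    main = f"+{field}:({' '.join(main_terms)})"
    # Optional part: phrase clauses over all words, stopwords included.
    phrases = " ".join(f'{field}:"{bg}"' for bg in word_bigrams(words))
    return f"{main} ({phrases})"


print(sketch_edismax("solr is awesome", "myfield", {"is"}))
# +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
```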

Scalability

SolrCloud

First steps toward simplifying cluster management

Integrates ZooKeeper
• Central configuration (schema.xml, solrconfig.xml, etc.)
• Tracks live nodes + shards of collections

Removes need for external load balancers
• shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr

Can specify logical shard ids
• shards=NY_shard,NJ_shard

Clients don’t need to know shards:
• http://localhost:8983/solr/collection1/select?distrib=true
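
To make the shards syntax concrete, here is a small sketch that reads a shards value the way the example suggests: commas separate shards, and `|` separates equivalent copies of one shard that a request can be balanced across. `parse_shards` and `pick_replicas` are hypothetical helpers written for illustration, not Solr code.

```python
import random


def parse_shards(value):
    """Split a shards value into a list of shards, each a list of replicas."""
    shards = []
    for shard in value.split(","):
        replicas = [r.strip() for r in shard.split("|") if r.strip()]
        if replicas:
            shards.append(replicas)
    return shards


def pick_replicas(shards, rng=random):
    """Choose one replica per shard, a naive stand-in for load balancing."""
    return [rng.choice(replicas) for replicas in shards]


spec = ("localhost:8983/solr|localhost:8900/solr,"
        "localhost:7574/solr|localhost:7500/solr")
print(parse_shards(spec))
print(pick_replicas(parse_shards(spec)))
```

A logical value such as `NY_shard,NJ_shard` parses to two single-entry shards; resolving those ids to addresses is the part the cluster state would supply.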

SolrCloud : The Future

Eliminate all single points of failure

Remove Master/Searcher distinction
• Enables near real-time search in a highly scalable environment

High Availability for Writes
• Eventual consistency model (like Amazon Dynamo, Cassandra)

Elastic
• Simply add/subtract servers, cluster will rebalance automatically

By default, Solr will handle document partitioning

Spatial Search

Spatial Search

PointType
• Generic improvement: polyField (single value -> multiple indexed fields)
• Compound values: 38.89,-77.03
• Range queries and exact matches supported
    • q=location:21.33,51.37
    • q=location:[10,20 TO 30,40]

Distance Functions
• Generic improvement: function queries can yield multiple values
• Haversine: hsin(3963.205, store, vector(10,20))
• Many possibilities, including boost by distance
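
The polyField idea and the range syntax can be sketched as follows. The `<name>_<i>` sub-field naming is invented for illustration and is not Solr's actual naming scheme, and the range check assumes that `[10,20 TO 30,40]` means each dimension must lie between the matching lower and upper coordinate.

```python
def parse_point(value):
    """Split a compound value such as '38.89,-77.03' into a tuple of floats."""
    return tuple(float(part) for part in value.split(","))


def to_subfields(name, value):
    """polyField idea: one external value becomes several indexed fields.

    The '<name>_<i>' naming is hypothetical.
    """
    return {f"{name}_{i}": coord for i, coord in enumerate(parse_point(value))}


def in_range(point, lower, upper):
    """True if every dimension of point lies within [lower, upper]."""
    return all(lo <= p <= hi for p, lo, hi in zip(point, lower, upper))


print(to_subfields("location", "38.89,-77.03"))
print(in_range(parse_point("15,25"), (10, 20), (30, 40)))
```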

Spatial Search (continued)

Sorting by function query
• sort=hsin(3963.205,store,vector(10,20)) asc

Distance Filtering (SOLR-1568)
• fq={!sfilt fl=store_tiles}&pt=45.17614,-93.87341&d=1
• Implementations: trie range queries, spatial tiles, geohash

Return sort values or function query values for each doc
• FunctionQuery results as pseudo-fields (SOLR-1298)
• fl=field1,field2,{!func key=dist}hsin(…) ???
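
For readers who want to see what sorting and filtering by distance amount to, here is a brute-force sketch outside Solr. It assumes 3963.205 is an Earth radius in miles and that points are (lat, lon) in degrees; the `hsin` function below is a plain haversine written for this example, and `nearby` and the sample documents are hypothetical. The tile and geohash implementations listed above exist precisely to avoid scanning every document like this.

```python
import math

EARTH_RADIUS_MILES = 3963.205  # radius constant from the hsin() examples


def hsin(radius, p1, p2):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


def nearby(docs, center, max_dist, radius=EARTH_RADIUS_MILES):
    """Filter docs within max_dist of center and sort ascending by distance.

    Each hit carries its distance in a 'dist' key, like a pseudo-field.
    """
    hits = []
    for doc in docs:
        dist = hsin(radius, doc["store"], center)
        if dist <= max_dist:
            hits.append({**doc, "dist": dist})
    return sorted(hits, key=lambda d: d["dist"])


docs = [
    {"id": "far", "store": (10.0, 20.0)},
    {"id": "close", "store": (45.2, -93.9)},
    {"id": "here", "store": (45.17614, -93.87341)},
]
for hit in nearby(docs, (45.17614, -93.87341), max_dist=5):
    print(hit["id"], round(hit["dist"], 2))
```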

Near Real Time

Near Real-Time Search

Shorter times until updates are searchable/visible

Lucene 2.9 first laid the groundwork with per-segment searching
• Per-segment FieldCache entries for sorting and FunctionQueries
• NRT IndexWriter.getReader()
    • Makes new segments available before merging is done in background
    • Doesn’t cause commit/fsync first

Solr still needs
• Per-segment faceting
• Per-segment caching
• Per-segment statistics (and anything else that uses FieldCache)

Existing single-valued faceting algorithm

[Figure: Lucene FieldCache Entry (StringIndex) for the “hero” field]

order (for each doc, an index into the lookup array): 5 3 5 1 4 5 2 1

lookup (the string values): (null), batman, flash, spiderman, superman, wolverine

Documents matching the base query “Juggernaut”: 0, 2, 7

accumulator: 0 1 0 0 0 2

For each matching document: lookup its entry in order, then increment that slot of the accumulator.

q=Juggernaut&facet=true&facet.field=hero
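
The lookup-and-increment loop in this figure is short enough to write out. The sketch below reads the figure's digit runs as order = 5 3 5 1 4 5 2 1 and matching documents 0, 2, 7; that reading is an interpretation of the slide, and `facet_counts` is a hypothetical name rather than Solr's code.

```python
# Data as read from the figure (an interpretation of the slide's digits).
order = [5, 3, 5, 1, 4, 5, 2, 1]   # per doc: index into lookup
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
matching_docs = [0, 2, 7]          # docs matching the base query


def facet_counts(order, lookup, doc_ids):
    """Count matching docs per term: one array lookup and one increment each."""
    accumulator = [0] * len(lookup)
    for doc in doc_ids:
        accumulator[order[doc]] += 1
    return accumulator


counts = facet_counts(order, lookup, matching_docs)
print(counts)  # [0, 1, 0, 0, 0, 2]

# Translate slots back to terms, skipping slot 0 (docs with no value).
facets = {lookup[i]: c for i, c in enumerate(counts) if i and c}
print(facets)  # {'batman': 1, 'wolverine': 2}
```

The cost per request is proportional to the number of matching documents plus the number of unique terms, which is why the array must cover the whole index and be rebuilt when the index changes.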

Per-segment single-valued faceting algorithm

[Figure: the Base DocSet is spread across four segments (Segment1 to Segment4), each with its own FieldCache Entry. Four threads (thread1 to thread4) each run the lookup and increment step into their own accumulator (accumulator1 to accumulator4). A FieldCache + accumulator merger (priority queue) combines the per-segment counts into the final result, e.g. batman, 3 and flash, 5.]
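
A compact sketch of the per-segment scheme in this figure: each segment is counted independently (here on a thread pool), and the sorted per-segment results are merged by term with a heap-based merge. The segment contents are invented so that the totals match the figure's "batman, 3" and "flash, 5"; all function names are hypothetical, and Python threads illustrate the structure rather than real parallel speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, doc_ids):
    """Count one segment; return (term, count) pairs in term order."""
    acc = [0] * len(segment["lookup"])
    for doc in doc_ids:
        acc[segment["order"][doc]] += 1
    # lookup is sorted, so the output is sorted too; slot 0 means "no value".
    return [(segment["lookup"][i], c) for i, c in enumerate(acc) if i and c]


def merge_counts(per_segment):
    """Merge sorted per-segment streams, summing counts for equal terms."""
    merged = []
    for term, count in heapq.merge(*per_segment):
        if merged and merged[-1][0] == term:
            merged[-1] = (term, merged[-1][1] + count)
        else:
            merged.append((term, count))
    return merged


def facet_per_segment(segments, docsets, threads=4):
    """Count every segment on a thread pool, then merge the results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(count_segment, segments, docsets))
    return merge_counts(per_segment)


# Hypothetical segments; doc ids are local to each segment.
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 1]},
    {"lookup": [None, "batman", "superman"], "order": [1, 1, 2]},
    {"lookup": [None, "flash"], "order": [1, 1, 1, 0]},
]
docsets = [[0, 1, 2], [0, 1], [0, 1, 2, 3]]
print(facet_per_segment(segments, docsets))  # [('batman', 3), ('flash', 5)]
```

The merge by term string is needed because each segment has its own lookup array, so the same slot number can mean different terms in different segments.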

Per-segment faceting

Enable with facet.method=fcs

Controllable multi-threading
• facet.field={!threads=4}myfield

Disadvantages
• Larger memory use (FieldCaches + accumulators)
• Slower (extra FieldCache merge step needed)

Advantages
• Rebuilds FieldCache entries only for new segments (NRT friendly)
• Multi-threaded
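
The "only for new segments" advantage can be shown with a toy cache keyed by segment. It relies on segments being immutable once written, so an entry built for a segment stays valid for as long as that segment exists. `PerSegmentCache` and its `builds` counter are hypothetical and only illustrate the bookkeeping.

```python
class PerSegmentCache:
    """Keeps one cache entry per segment and rebuilds only unseen segments."""

    def __init__(self, build_entry):
        self._build_entry = build_entry  # function: segment data -> cache entry
        self._entries = {}
        self.builds = 0                  # number of (expensive) entry builds

    def refresh(self, segments):
        """segments: dict of segment name -> data for the newly opened reader."""
        fresh = {}
        for name, data in segments.items():
            if name in self._entries:
                fresh[name] = self._entries[name]      # unchanged: reuse
            else:
                fresh[name] = self._build_entry(data)  # new segment: build
                self.builds += 1
        self._entries = fresh  # entries for merged-away segments are dropped
        return fresh


cache = PerSegmentCache(build_entry=sorted)
cache.refresh({"seg1": ["flash", "batman"], "seg2": ["superman"]})
cache.refresh({"seg1": ["flash", "batman"], "seg2": ["superman"],
               "seg3": ["wolverine"]})
print(cache.builds)  # 3: the reopen built an entry only for seg3
```

A whole-index cache would instead rebuild everything on each reopen, which is the cost visible in the "quickly changing index" rows of the comparison that follows.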

Per-segment faceting performance comparison

Test index: 10M documents, 18 segments, single valued field

A. Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

Time for request*         facet.method=fc    facet.method=fcs
static index              3 ms               244 ms
quickly changing index    1388 ms            267 ms

B. Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*         facet.method=fc    facet.method=fcs
static index              26 ms              34 ms
quickly changing index    741 ms             94 ms

*complete request time, measured externally

Field Collapsing

Field Collapsing

Field collapsing
• Limit the number of results per category
• “category” defined by unique values in a field

Uses
• Web search: collapse by web site
• Email threads: collapse by thread id
• Ecommerce/retail
    • Show the top 5 items for each store category (music, movies, etc.)
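
In its simplest form, collapsing is a single pass over an already ranked result list that keeps at most N documents per distinct field value. The sketch below shows that pass; `collapse` and the sample documents are hypothetical, and it says nothing about how Solr implements the feature internally.

```python
from collections import defaultdict


def collapse(ranked_docs, field, limit):
    """Keep at most `limit` docs per distinct value of `field`, in rank order."""
    seen = defaultdict(int)
    kept = []
    for doc in ranked_docs:
        key = doc.get(field)
        if seen[key] < limit:
            seen[key] += 1
            kept.append(doc)
    return kept


# Hypothetical web results, best match first.
results = [
    {"id": 1, "site": "apache.org"},
    {"id": 2, "site": "apache.org"},
    {"id": 3, "site": "example.com"},
    {"id": 4, "site": "apache.org"},
    {"id": 5, "site": "example.com"},
]
print([d["id"] for d in collapse(results, "site", limit=1)])  # [1, 3]
print([d["id"] for d in collapse(results, "site", limit=2)])  # [1, 2, 3, 5]
```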

Field Collapsing by Site

Field Collapse on Product Type

Q&A
