Page 1: Native Code & Off-Heap Data Structures for Solr: Presented by Yonik Seeley, Heliosearch

Yonik Seeley

Lucene/Solr Revolution 2014 Washington, D.C.

Native Code & Off-Heap Data Structures for Solr

Page 2

My Background
• Creator of Solr
• Heliosearch Founder
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford

Page 3

Heliosearch Project
• The Next Evolution of Solr
• Forked from Solr, developing on GitHub
  – Started Jan 2014
  – Well-aligned community
  – Open source, Apache licensed
• Bring back to Apache in the future?
• Currently a drop-in replacement for Solr at the HTTP-API level
  – A superset: we continually merge in upstream changes
  – The latest version of Heliosearch includes the latest Solr
• Current features: off-heap filters, off-heap FieldCache, facet-by-function, sub-facets, native code performance enhancements

Page 4

Garbage Collection

Page 5

Garbage Collection Basics

[Diagram: Eden Space, Survivor Space 1, Survivor Space 2, Tenured Space, Permanent Space; Thread]

• New objects allocated in Eden
• Find live objects by tracing from GC “roots” (threads, stack locals, etc.)
• Make a copy of live objects, leaving “garbage” behind
• Eden + Survivor Space copied together to other Survivor space
• Tenured from Survivor when old enough
• “Stop-the-world” needed when GC can’t keep up
• Out of memory when too much time spent in GC
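The copying step described on this slide can be sketched as a toy model. This is not how HotSpot is implemented: the object names, the dictionary-based reference graph, and the helper names trace_live and minor_collection are all invented for illustration, and tenuring is ignored.

```python
# Toy model of a copying young-generation collection.
# Object names and the heap layout are made up for illustration.

def trace_live(roots, references):
    """Return every object reachable from the GC roots."""
    live, stack = set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in live:
            live.add(obj)
            stack.extend(references.get(obj, ()))
    return live


def minor_collection(eden, survivor_from, roots, references):
    """Copy live objects out of Eden and one survivor space into the other.

    Anything that is not copied is garbage and is simply left behind.
    """
    live = trace_live(roots, references)
    return [obj for obj in eden + survivor_from if obj in live]


references = {"a": ["b"], "b": [], "c": ["d"], "d": [], "e": ["a"]}
eden = ["b", "c", "d"]
survivor = ["a", "e"]
roots = ["a"]  # e.g. a stack local that still points at "a"

survivors = minor_collection(eden, survivor, roots, references)
print(survivors)  # ['b', 'a']
```

Note that "c", "d" and "e" are never visited: the cost of the copy depends on what is live, not on how much garbage there is.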

Page 6

Java Memory Waste
-  Need to size for worst case scenario
-  OS needs free memory to cache index files
-  JVMs aren’t good at “sharing” with rest of the system
-  mmap allocations managed by OS, can be immediately reused on free

[Diagram: OS real memory holding two JVMs (each with heap in use and unused heap, up to max heap) and two C processes (each with C heap in use, unused heap, and mmap allocations). “Free” memory includes buffer cache, important to cache index files.]

Page 7

GC Impact
• GC reduces throughput
  – Time to copy all that memory around could be spent better!
• Stop-the-world pauses
  – Seconds to minutes long
  – Pause time proportional to heap size
  – Still exists in all HotSpot GCs… CMS, G1GC, etc.
  – Breaks application SLAs (request timeouts, etc.)
  – Can cause SolrCloud ZooKeeper session timeouts
• Reducing max pause size normally means reduced throughput
• Non-graceful degradation
  – If you don't size your heap big enough… BOOM!

Page 8

GC Tuning UseSerialGC UseParallelGC UseParallelOldGC UseParallelOldGCCompacting UseParallelDensePrefixUpdate HeapMaximumCompactionInterval HeapFirstMaximumCompactionCount UseMaximumCompactionOnSystemGC ParallelOldDeadWoodLimiterMean ParallelOldDeadWoodLimiterStdDev UseParallelOldGCDensePrefix ParallelGCThreads ParallelCMSThreads YoungPLABSize OldPLABSize GCTaskTimeStampEntries AlwaysTenure NeverTenure ScavengeBeforeFullGC UseConcMarkSweepGC ExplicitGCInvokesConcurrent UseCMSBestFit UseCMSCollectionPassing UseParNewGC ParallelGCVerbose ParallelGCBufferWastePct ParallelGCRetainPLAB TargetPLABWastePct PLABWeight ResizePLAB PrintPLAB ParGCArrayScanChunk ParGCDesiredObjsFromOverflowList CMSParPromoteBlocksToClaim AlwaysPreTouch CMSUseOldDefaults CMSYoungGenPerWorker CMSIncrementalMode CMSIncrementalDutyCycle CMSIncrementalPacing CMSIncrementalDutyCycleMin CMSIncrementalSafetyFactor CMSIncrementalOffset CMSExpAvgFactor CMS_FLSWeight CMS_FLSPadding FLSCoalescePolicy CMS_SweepWeight CMS_SweepPadding CMS_SweepTimerThresholdMillis CMSClassUnloadingEnabled CMSCompactWhenClearAllSoftRefs UseCMSCompactAtFullCollection CMSFullGCsBeforeCompaction CMSIndexedFreeListReplenish CMSLoopWarn CMSMarkStackSize CMSMarkStackSizeMax CMSMaxAbortablePrecleanLoops CMSMaxAbortablePrecleanTime CMSAbortablePrecleanMinWorkPerIteration CMSAbortablePrecleanWaitMillis CMSRescanMultiple CMSConcMarkMultiple CMSRevisitStackSize CMSAbortSemantics CMSParallelRemarkEnabled CMSParallelSurvivorRemarkEnabled CMSPLABRecordAlways CMSConcurrentMTEnabled CMSPermGenPrecleaningEnabled CMSPermGenSweepingEnabled

CMSPrecleaningEnabled CMSPrecleanIter CMSPrecleanNumerator CMSPrecleanDenominator CMSPrecleanRefLists1 CMSPrecleanRefLists2 CMSPrecleanSurvivors1 CMSPrecleanSurvivors2 CMSPrecleanThreshold CMSCleanOnEnter CMSRemarkVerifyVariant CMSScheduleRemarkEdenSizeThreshold CMSScheduleRemarkEdenPenetration CMSScheduleRemarkSamplingRatio CMSSamplingGrain CMSScavengeBeforeRemark CMSWorkQueueDrainThreshold CMSWaitDuration CMSYield CMSBitMapYieldQuantum UseGCLogFileRotation NumberOfGCLogFiles GCLogFileSize LargePageSizeInBytes LargePageHeapSizeThreshold PrintGCApplicationConcurrentTime PrintGCApplicationStoppedTime OnOutOfMemoryError ClassUnloading BlockOffsetArrayUseUnallocatedBlock RefDiscoveryPolicy ParallelRefProcEnabled CMSTriggerRatio CMSBootstrapOccupancy CMSInitiatingOccupancyFraction UseCMSInitiatingOccupancyOnly HandlePromotionFailure PreserveMarkStackSize ZeroTLAB PrintTLAB TLABStats AlwaysActAsServerClassMachine DefaultMaxRAM DefaultMaxRAMFraction DefaultInitialRAMFraction UseAutoGCSelectPolicy AutoGCSelectPauseMillis UseAdaptiveSizePolicy UsePSAdaptiveSurvivorSizePolicy UseAdaptiveGenerationSizePolicyAtMinorCollection UseAdaptiveGenerationSizePolicyAtMajorCollection UseAdaptiveSizePolicyWithSystemGC UseAdaptiveGCBoundary AdaptiveSizeThroughPutPolicy AdaptiveSizePausePolicy AdaptiveSizePolicyInitializingSteps AdaptiveSizePolicyOutputInterval UseAdaptiveSizePolicyFootprintGoal AdaptiveSizePolicyWeight AdaptiveTimeWeight PausePadding PromotedPadding SurvivorPadding AdaptivePermSizeWeight PermGenPadding ThresholdTolerance AdaptiveSizePolicyCollectionCostMargin YoungGenerationSizeIncrement YoungGenerationSizeSupplement YoungGenerationSizeSupplementDecay TenuredGenerationSizeIncrement TenuredGenerationSizeSupplement TenuredGenerationSizeSupplementDecay

MaxGCPauseMillis MaxGCMinorPauseMillis GCTimeRatio AdaptiveSizeDecrementScaleFactor UseAdaptiveSizeDecayMajorGCCost AdaptiveSizeMajorGCDecayTimeScale MinSurvivorRatio InitialSurvivorRatio BaseFootPrintEstimate UseGCOverheadLimit GCTimeLimit GCHeapFreeLimit PrintAdaptiveSizePolicy DisableExplicitGC CollectGen0First BindGCTaskThreadsToCPUs UseGCTaskAffinity ProcessDistributionStride CMSCoordinatorYieldSleepCount CMSYieldSleepCount PrintGCTaskTimeStamps TraceClassLoadingPreorder TraceGen0Time TraceGen1Time PrintTenuringDistribution PrintHeapAtSIGBREAK TraceParallelOldGCTasks PrintParallelOldGCPhaseTimes MaxHeapSize MaxNewSize PretenureSizeThreshold MinTLABSize TLABAllocationWeight TLABWasteTargetPercent TLABRefillWasteFraction TLABWasteIncrement MaxLiveObjectEvacuationRatio OldSize MinHeapFreeRatio MaxHeapFreeRatio SoftRefLRUPolicyMSPerMB MinHeapDeltaBytes MinPermHeapExpansion MaxPermHeapExpansion QueuedAllocationWarningCount MaxTenuringThreshold InitialTenuringThreshold TargetSurvivorRatio MarkSweepDeadRatio PermMarkSweepDeadRatio MarkSweepAlwaysCompactCount PrintCMSStatistics PrintCMSInitiationStatistics PrintFLSStatistics PrintFLSCensus DeferThrSuspendLoopCount DeferPollingPageLoopCount SafepointSpinBeforeYield UseDepthFirstScavengeOrder GCDrainStackTargetSize ThreadSafetyMargin CodeCacheMinimumFreeSpace MaxDirectMemorySize PerfDataMemorySize AggressiveHeap UseCompressedStrings UseStringCache HeapDumpOnOutOfMemoryError HeapDumpPath PrintGC PrintGCDetails PrintGCTimeStamps PG1HeapRegionSize G1ReservePercent G1ConfidencePercent PrintPromotionFailure PrintGCDateStamps

-XX:InitiatingHeapOccupancyPercent=n
-XX:MaxHeapFreeRatio=70
-XX:MaxGCPauseMillis=n
-XX:+ScavengeBeforeFullGC
-XX:ConcGCThreads=n
-XX:MaxTenuringThreshold=n

Page 9

GC Reduction
• Reuse objects – cause less garbage
• Move certain things off-heap (invisible to GC)
• Option 1: Direct ByteBuffers
  – Limited to “int” (2GB)
  – No way to directly “free” – still relies on GC
• Option 2: sun.misc.Unsafe
  – malloc() + free() + direct memory access
  – Supported on all major JVMs
  – Widely used: Java (nio, concurrent), JSR166, Google Guava, objenesis (which is used in Kryo, which is used in Twitter Storm), Apache DirectMemory, Lightning, Hazelcast, snappy, gson, …
  – Being considered for Java 9

Page 10

Off-Heap Filters
• 50M docs (3.8 GB index)
• 8GB RAM
• 20K requests
• 8 req threads
• 500 filters
• JVM Options: -Xmx4G (solr)

Page 11

Off-Heap Filters Test

Observed max process sizes
  Solr:        3.8GB – 4.3GB
  Heliosearch: 3.6GB – 3.7GB

Page 12

Off-Heap FieldCache

Normal (on-heap) FieldCache
• Typically the largest data structures kept on the heap
• Used for sorting, function query values, single-valued faceting, grouping
• Uses weak references

Heliosearch nCache (n is for “native”)
• Allocated off-heap
• First-class managed Solr cache
  – Configure size, warming policies
  – View statistics
• Per-segment (NRT friendly)
• No weak references

Page 13

Page 14

nCache admin stats

    item_id:{ "field":"id", "uses":8, "class":"StrTopValues",
      "refcount":2, "numSegments":7, "carriedOver":6, "size":612}
    item_popularity:{ "field":"popularity", "uses":5,
      "class":"IntTopValues", "refcount":2, "numSegments":7,
      "carriedOver":6, "size":106}
    item_price:{
      "field":"price",
      "uses":0,           -- the number of top-level uses for searcher
      "class":"FloatTopValues",
      "refcount":2,
      "numSegments":5,    -- number of segments populated
      "carriedOver":5,    -- number of segments carried over from last searcher
      "size":272          -- size in bytes for all populated segments
    }

Page 15

Off-Heap Integer Field
• 50M document index
• Sorting on 6 different integer fields (10,100,1000,10000,1M unique values)
• 4 request threads

Results
• 42% faster sorting
• 73% faster functions

Page 16

String Field Sorting
• 10M document index
• 10 different string fields, each field 80% populated
• Median latency

Page 17

String Field Sorting Throughput
• Concurrent throughput sorting on random fields in random order (asc/desc)
• ~50% performance gain

Page 18

Native Code

Page 19

Native Code
• The idea: create native accelerators for CPU hotspots
  – Faceting anyone?
• But… JNI sucks! (and it’s GC’s fault again)
  – GetArrayElements() – makes a *copy* of the array!
  – GetPrimitiveArrayCritical() – blocks garbage collection!
  – Tons of other restrictions… it’s a “critical section”
  – Defeats the purpose of going to native code in the first place
• But… our data is already off-heap, we’re good!

    /* May hand back a copy of the whole Java array. */
    jint *buf = (*env)->GetIntArrayElements(env, arr, 0);
    for (i=0; i<len; i++) {
        sum += buf[i];
    }
    /* Read-only use: release the buffer without copying it back. */
    (*env)->ReleaseIntArrayElements(env, arr, buf, JNI_ABORT);

Page 20

Native Single Valued String Faceting
• Top-level off-heap String cache
  – Improves sorting and faceting speed
  – Eliminates FieldCache “insanity”
• Native code
  – Written in C++, compiled with GCC 4.7, 4.8
  – Currently supports 64-bit Windows, OS X, Linux (x86)
  – Static compilation avoids JVM HotSpot warmup period, mis-compilation bugs, and variations between runs

Page 21

Native Faceting Performance

Page 22

Terms Query Optimization

Page 23

Page 24

New Facet Module

Page 25

Facet Module Goals
• Replace the aging “SimpleFacets”
• First-class JSON support
• Easier programmatic construction of complex nested facet commands
• Canonical response format that is easier for clients to parse
• First-class analytics support
• Cleaner distributed search support
• Fully pluggable
• Better base for integration of other search features

Heliosearch is a Solr superset, so you can still choose to use the old faceting or mix-n-match.

Page 26

API Comparison

Old Style

    &facet=true
    &facet.range={!key=age_ranges}age
    &f.age_ranges.facet.range.start=0
    &f.age_ranges.facet.range.end=100
    &f.age_ranges.facet.range.gap=10
    &facet.range={!key=price_ranges}price
    &f.price_ranges.facet.range.start=0
    &f.price_ranges.facet.range.end=1000
    &f.price_ranges.facet.range.gap=50

New JSON API

    {
      age_ranges: {          // facet name
        range: {             // facet type
          field : age,       // facet params
          start : 0,
          end : 100,
          gap : 10
        }
      },
      price_ranges: {
        range: {
          field : price,
          start : 0,
          end : 1000,
          gap : 50
        }
      }
    }
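One reason the nested form is easier to build programmatically is that it maps directly onto ordinary dictionaries. The following is a minimal client-side sketch, not part of Heliosearch: range_facet and old_style_params are hypothetical helper names, and the parameter spellings are copied from the two columns of this slide.

```python
import json


def range_facet(field, start, end, gap):
    """Build one range facet command in the nested JSON form."""
    return {"range": {"field": field, "start": start, "end": end, "gap": gap}}


def old_style_params(key, field, start, end, gap):
    """Spell out the same facet as old-style request parameters."""
    prefix = "f.%s.facet.range." % key
    return [
        ("facet.range", "{!key=%s}%s" % (key, field)),
        (prefix + "start", str(start)),
        (prefix + "end", str(end)),
        (prefix + "gap", str(gap)),
    ]


specs = [
    ("age_ranges", "age", 0, 100, 10),
    ("price_ranges", "price", 0, 1000, 50),
]

# New style: one nested structure, serialized once.
json_facet = json.dumps({key: range_facet(*rest) for key, *rest in specs})

# Old style: a flat list of parameters tied together by naming convention.
params = [("facet", "true")]
for spec in specs:
    params += old_style_params(*spec)

print(json_facet)
print(len(params), "old-style parameters")
```

Adding a sub-facet to the nested form is one more dictionary key, whereas the flat form has no natural place to put it.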

Page 27

Facet Functions
• Sort/report by things other than “count”
• Stats are calculated “per bucket”
• Buckets created by Query, Range, or Terms (field) facets

Aggregation Functions / Stats:
    count
    sum(function)
    avg(function)
    sumsq(function)
    min(function)
    max(function)
    unique(string_field)

any “function query” that yields a numeric value!

Example: sum(mul(num_units, unit_price))
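To make "per bucket" concrete, here is a small in-memory sketch of what a terms facet with facet functions computes. It only illustrates the semantics and says nothing about how Heliosearch implements them; the documents, the field names style and brand, and the helper names are invented.

```python
from collections import defaultdict


def terms_buckets(docs, field):
    """Group documents into one bucket per distinct field value."""
    buckets = defaultdict(list)
    for doc in docs:
        if field in doc:
            buckets[doc[field]].append(doc)
    return buckets


def bucket_stats(bucket):
    """Evaluate a few facet functions over the documents of one bucket."""
    # mul(num_units, unit_price), evaluated per document
    revenue = [d["num_units"] * d["unit_price"] for d in bucket]
    return {
        "count": len(bucket),
        "sum": sum(revenue),                  # sum(mul(num_units, unit_price))
        "avg": sum(revenue) / len(revenue),   # avg(mul(num_units, unit_price))
        "max": max(revenue),                  # max(mul(num_units, unit_price))
        "unique": len({d["brand"] for d in bucket}),  # unique(brand)
    }


docs = [
    {"style": "Hiking", "brand": "A", "num_units": 2, "unit_price": 50.0},
    {"style": "Hiking", "brand": "B", "num_units": 1, "unit_price": 80.0},
    {"style": "Running", "brand": "A", "num_units": 3, "unit_price": 40.0},
]

result = {val: bucket_stats(b) for val, b in terms_buckets(docs, "style").items()}
print(result["Hiking"])
```

Sorting by something other than count then amounts to ordering the buckets by one of these computed values.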

Page 28

Simple Request + Response

    $ curl http://localhost:8983/solr/query -d 'q=widgets&
    json.facet=
    {  // Comments can help with clarity
       /* traditional C-style comments are also supported */
      x : "avg(price)" ,     // Simple strings can occur unquoted
      y : 'unique(brand)'    // Strings can also use single quotes
    }
    '

    […]
    "facets" : {
      "count" : 314,
      "x" : 102.5,
      "y" : 28
    }

Note: "count" is the number of documents in the facet bucket.

Page 29

Terms Facet Example

    json.facet={
      shoes:{
        terms:{
          field: shoe_style,
          sort: {x : desc},
          facet:{
            x : "avg(price)",
            y : "unique(brand)"
          }
        }
      }
    }

Note: the nested facet block is executed per-bucket.

    "facets": {
      "count" : 472,
      "shoes": {
        "buckets" : [
          {
            "val" : "Hiking",
            "count" : 34,
            "x" : 135.25,
            "y" : 17
          },
          {
            "val" : "Running",
            "count" : 45,
            "x" : 110.75,
            "y" : 24
          },
          […]

Page 30

Sub-Facets
• Any facet that produces buckets can have sub-facets (terms/field, range, query)
• Sub-facets can have facet functions (stats) or their own sub-facets (no limit to nesting).
• A sub-facet can be any type (field, range, query)
• Multiple sub-facets can be added to any given facet
• Sub-facets are first-class facets - can be configured independently like any other facet.
  – Different offsets, limits, stats, sorts, etc.

Page 31

Sub-Facet Example

    json.facet={
      shoes:{
        terms:{
          field: shoe_style,
          sort: {x : desc},
          facet:{
            x : "avg(price)",
            y : "unique(brand)",
            colors :{terms:color}
          }
        }
      }
    }

Note on "colors": the short form for a terms facet simply specifies the field. Sorts buckets by count descending.

    "facets": {
      "count" : 472,
      "shoes": {
        "buckets" : [
          {
            "val" : "Hiking",
            "count" : 34,
            "x" : 135.25,
            "y" : 17,
            "colors" : {
              "buckets" : [
                { "val" : "brown",
                  "count" : 12 },
                { "val" : "black",
                  "count" : 10
                }, […]
              ]
            } // end of colors sub-facet
          }, // end of Hiking bucket
          {
            "val" : "Running",
            "count" : 45,
            "x" : 110.75,
            "y" : 24,
            "colors" : {
              "buckets" : […]
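Because every sub-facet has the same shape as its parent, a client can walk a response of any depth with one recursive function. The sketch below assumes the response has already been parsed into Python dictionaries; walk_buckets is a hypothetical helper, and the empty bucket list under "Running" is a placeholder for the part the slide elides.

```python
def walk_buckets(facet, path=()):
    """Yield (path, count) for every bucket, descending into sub-facets."""
    for key, value in facet.items():
        if isinstance(value, dict) and "buckets" in value:
            for bucket in value["buckets"]:
                bucket_path = path + (key, bucket["val"])
                yield bucket_path, bucket["count"]
                # A bucket is itself a facet context and may hold sub-facets.
                yield from walk_buckets(bucket, bucket_path)


response = {
    "count": 472,
    "shoes": {"buckets": [
        {"val": "Hiking", "count": 34, "x": 135.25, "y": 17,
         "colors": {"buckets": [
             {"val": "brown", "count": 12},
             {"val": "black", "count": 10},
         ]}},
        {"val": "Running", "count": 45, "x": 110.75, "y": 24,
         "colors": {"buckets": []}},  # elided on the slide
    ]},
}

rows = list(walk_buckets(response))
for path, count in rows:
    print(" > ".join(path), count)
```

Stats such as "x" and "y" are plain numbers next to "count", so the walker skips them and only follows keys that carry a "buckets" list.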

Page 32

Terms Facet
Terms facet creates buckets of docs with the same value in a field.
-  field – The field name to facet over.
-  offset – Used for paging, this skips the first N buckets. Defaults to 0.
-  limit – Limits the number of buckets returned. Defaults to 10.
-  mincount – Only return buckets with a count of at least this number. Defaults to 1.
-  sort – Specifies how to sort the buckets produced. “count” specifies document count, “index” sorts by the index (natural) order of the bucket value. One can also sort by any facet function / statistic that occurs in the bucket. The default is “count desc”. This parameter may also be specified in JSON like sort:{count:desc}. The sort order may be either “asc” or “desc”.
-  missing – A boolean that specifies if a special “missing” bucket should be returned that is defined by documents without a value in the field. Defaults to false.
-  numBuckets – A boolean. If true, adds “numBuckets” to the response, an integer representing the number of buckets for the facet (as opposed to the number of buckets returned). Defaults to false.
-  allBuckets – A boolean. If true, adds an “allBuckets” bucket to the response, representing the union of all of the buckets. For multi-valued fields, this is different than a bucket for all of the documents in the domain since a single document can belong to multiple buckets. Defaults to false.
-  prefix – Only produce buckets for terms starting with the specified prefix.
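How these parameters interact is easiest to see on a handful of documents. This is an in-memory sketch of the semantics as described above, not Heliosearch code. Several details are assumptions: ties in count are broken by bucket value, numBuckets is counted after mincount and prefix are applied, and allBuckets and sorting by a statistic are left out.

```python
from collections import Counter


def terms_facet(docs, field, offset=0, limit=10, mincount=1,
                sort="count desc", missing=False, num_buckets=False,
                prefix=None):
    counts = Counter()
    missing_count = 0
    for doc in docs:
        values = doc.get(field)
        if values is None:
            missing_count += 1
            continue
        if not isinstance(values, list):
            values = [values]
        for value in set(values):  # a document counts once per bucket
            if prefix is None or value.startswith(prefix):
                counts[value] += 1

    buckets = [(val, n) for val, n in counts.items() if n >= mincount]
    key, direction = sort.split()
    descending = direction == "desc"
    buckets.sort(key=lambda b: b[0])  # "index" order, also the tie-break
    if key == "count":
        buckets.sort(key=lambda b: b[1], reverse=descending)  # stable sort
    elif descending:
        buckets.reverse()

    result = {"buckets": [{"val": v, "count": n}
                          for v, n in buckets[offset:offset + limit]]}
    if num_buckets:
        result["numBuckets"] = len(buckets)  # all buckets, not just the page
    if missing:
        result["missing"] = {"count": missing_count}
    return result


docs = [
    {"color": "red"},
    {"color": "red"},
    {"color": "blue"},
    {"color": ["red", "green"]},  # multi-valued: lands in two buckets
    {"size": 9},                  # no "color" value: counted as missing
]

top = terms_facet(docs, "color", limit=2, missing=True, num_buckets=True)
paged = terms_facet(docs, "color", offset=1, limit=1)
by_prefix = terms_facet(docs, "color", sort="index asc", prefix="gr")
print(top)
```

The multi-valued document is why an allBuckets total can differ from the number of documents in the domain.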

Page 33

Query Facet
Query facet creates a single bucket of documents matching the query.

    {  // simple example
      highpop:{ query:{ q:"inStock:true AND popularity:[8 TO 10]" } }
    }

    {  // example with multiple sub-facets
      highpop:{ query:{
        q : "inStock:true AND popularity:[8 TO 10]",
        facet : {
          average_price : "avg(price)",
          available_colors : { terms : color },
          price_ranges : { range : {
            field:price, start:0, end:200, gap:10
          }}
        }
      }}
    }

Page 34

Range Facet
Creates buckets over ranges on a numeric or date field.
Parameter names/values are "in sync" with Solr range parameters:
-  field – The numeric field or date field to produce range buckets from
-  start – Lower bound of the ranges
-  end – Upper bound of the ranges
-  gap – Size of each range bucket produced
-  hardend – A boolean, which if true means that the last bucket will end at “end” even if it is less than “gap” wide. If false, the last bucket will be “gap” wide, which may extend past “end”.
-  other – This param indicates that in addition to the counts for each range constraint between facet.range.start and facet.range.end, counts should also be computed for…
   –  "before" all records with field values lower than the lower bound of the first range
   –  "after" all records with field values greater than the upper bound of the last range
   –  "between" all records with field values between the start and end bounds of all ranges
   –  "none" compute none of this information
   –  "all" shortcut for before, between, and after
-  include – By default, the ranges used to compute range faceting between facet.range.start and facet.range.end are inclusive of their lower bounds and exclusive of the upper bounds. The “before” range is exclusive and the “after” range is inclusive. This default, equivalent to "lower" below, will not result in double counting at the boundaries. This behavior can be modified by the facet.range.include param, which can be any combination of the following options…
   –  "lower" all gap based ranges include their lower bound
   –  "upper" all gap based ranges include their upper bound
   –  "edge" the first and last gap ranges include their edge bounds (i.e. lower for the first one, upper for the last one) even if the corresponding upper/lower option is not specified
   –  "outer" the “before” and “after” ranges will be inclusive of their bounds, even if the first or last ranges already include those boundaries.
   –  "all" shorthand for lower, upper, edge, outer
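The effect of hardend and of the default "lower" inclusion rule can be checked with a few integers. This sketch covers only that default case with numeric values. range_buckets and count_ranges are hypothetical helper names, and date gaps such as "+1MONTH" are not handled.

```python
def range_buckets(start, end, gap, hardend=False):
    """Return (low, high) bounds; each bucket includes low and excludes high."""
    buckets = []
    low = start
    while low < end:
        high = low + gap
        if hardend and high > end:
            high = end  # last bucket is cut off at "end"
        buckets.append((low, high))
        low = high
    return buckets


def count_ranges(values, start, end, gap, hardend=False):
    """Count values per bucket, plus the "before" and "after" ranges."""
    buckets = range_buckets(start, end, gap, hardend)
    counts = {bucket: 0 for bucket in buckets}
    before = after = 0
    for v in values:
        if v < start:  # "before" excludes the first lower bound
            before += 1
            continue
        for low, high in buckets:
            if low <= v < high:
                counts[(low, high)] += 1
                break
        else:          # "after" includes the last upper bound
            after += 1
    return before, counts, after


values = [-5, 0, 9, 10, 24, 25, 27, 40]

print(range_buckets(0, 25, 10))                # [(0, 10), (10, 20), (20, 30)]
print(range_buckets(0, 25, 10, hardend=True))  # [(0, 10), (10, 20), (20, 25)]
print(count_ranges(values, 0, 25, 10))
print(count_ranges(values, 0, 25, 10, hardend=True))
```

Every value lands in exactly one of before, a gap bucket, or after, which is the "no double counting" property the default is chosen for.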

Page 35

Sub-Facets + Facet-Functions =

Business Intelligence / Analytics

Page 36

Fantasy ($1045)
  Top Authors
    $423  George R.R. Martin
    $347  Brandon Sanderson
    $155  JK Rowling
  Top Books
    $252  A Game of Thrones
    $113  Emperor of Thorns
    $101  Nine Princes in Amber
    $82   Steel Heart

Sci-Fi ($898)
  Top Authors
    $321  Iain M Banks
    $218  Neal Asher
    $155  Neal Stephenson
  Top Books
    $113  Gridlinked
    $101  Use of Weapons
    $93   Snow Crash
    $82   The Skinner

Mystery ($645)
  Top Authors
    $191  James Patterson
    $145  Patricia Cornwell
    $126  John Grisham
  Top Books
    $85   One for the Money
    $77   Angels & Daemons
    $64   Shutter Island
    $35   The Firm

Filter By
  State
    $852  NJ (14 stores)
    $658  NY (11 stores)
    $421  CT (8 stores)
  Chain
    $984  Amazoon (14 stores)
    $734  Houses&Royalty (9 stores)
    $387  Books-r-us (7 stores)
  Store
    $108  Amazoon Branchburg
    $93   Books-r-us Bridgewater
    $87   H&R NYC

Number of Books
  Chain
    201K  Houses&Royalty
    183K  Amazoon
    98K   Books-r-us
  Store
    193K  H&R NYC
    77K   Books-r-us Bridgewater
    68K   Amazoon Branchburg

Page 37

Implementation

    date_breakout : { range: {
      field: sale_date,
      start : ...,
      end : ...,
      gap : "+1MONTH",
      facet : {
        top_genre : { terms : {
          field : genre,
          sort : "revenue desc",
          limit : 4,
          facet : {
            revenue : "sum(sales)"
          }
        }},
        by_chain: { terms : {
          field : chain,
          facet : {
            revenue : "sum(sales)"
          }
        }}
    […]

• date_breakout: creates a series of facet buckets based on date
• top_genre: for each date bucket, facet by genre, taking the top 4 by revenue
• revenue: for each genre bucket, report revenue
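The three steps called out on this slide (bucket by date, facet each date bucket by genre, keep the top genres by revenue) can be sketched over a few in-memory sales records. The records are invented and the helper names are hypothetical. Bucketing by the "YYYY-MM" prefix of an ISO date string is a simplification that stands in for a "+1MONTH" gap.

```python
from collections import defaultdict


def top_by_revenue(sales, field, limit):
    """Terms facet on `field`, sorted by sum(sales) descending, top `limit`."""
    revenue = defaultdict(float)
    for sale in sales:
        revenue[sale[field]] += sale["sales"]  # revenue : "sum(sales)"
    ranked = sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)
    return [{"val": val, "revenue": total} for val, total in ranked[:limit]]


def date_breakout(sales, limit=4):
    """One bucket per calendar month, each with its own top genres."""
    by_month = defaultdict(list)
    for sale in sales:
        by_month[sale["sale_date"][:7]].append(sale)  # "YYYY-MM"
    return {month: top_by_revenue(rows, "genre", limit)
            for month, rows in sorted(by_month.items())}


sales = [
    {"sale_date": "2014-01-03", "genre": "Fantasy", "sales": 20.0},
    {"sale_date": "2014-01-17", "genre": "Sci-Fi", "sales": 35.0},
    {"sale_date": "2014-01-20", "genre": "Fantasy", "sales": 25.0},
    {"sale_date": "2014-02-02", "genre": "Mystery", "sales": 15.0},
]

report = date_breakout(sales)
print(report["2014-01"])
```

Note that the ranking is recomputed inside every date bucket, so the top genres can differ from month to month.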

Page 38

[Panel repeated from Page 36: Fantasy, Sci-Fi and Mystery, each with Top Authors and Top Books]

    top_genres:{ terms:{
      field: genre,
      facet : {
        rev : "sum(sales)",
        top_authors:{ terms:{
          field : author,
          sort : "rev desc",
          limit : 3,
          facet : {
            rev : "sum(sales)"
          }
        }},
        top_books:{ terms:{
          field : title,
          sort : "rev desc",
          limit : 4,
          facet : {
            rev : "sum(sales)"
          }
        }}
    […]

Page 39

[Panel repeated from Page 36: Filter By State, Chain and Store]

    state_breakout:{ terms:{
      field: state,
      sort: "rev desc",
      facet : {
        rev : "sum(sales)",
        num_stores : "unique(store)"
      }
    }},
    chain_breakout:{ terms:{
      field: chain,
      sort: "rev desc",
      facet : {
        rev : "sum(sales)",
        num_stores : "unique(store)"
      }
    }},
    store_breakout:{ terms:{
      field: store,
      sort: "rev desc",
      facet : {
        rev : "sum(sales)"
      }
    }}

Page 40

Misc Features

Page 41

Parameter Substitution
• Parameters / macros substituted across whole request
• Happens before any parsing, so usable in any context

    q=price:[ ${low} TO ${high} ]
    &low=100
    &high=200

• Default values

    q=price:[ ${low:0} TO ${high:100} ]

• Nested

    q=${price_query}
    &price_query=${price_field}:[ ${low} TO ${high} ] AND inStock:true
    &price_field=specialPrice
    &low=50
    &high=100
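The three forms above (plain, default value, nested) can be mimicked with a small regular-expression expander. This is a sketch of the idea, not the Heliosearch implementation. The function name expand, the depth limit, and the choice to leave unknown macros untouched are assumptions, and macros inside a default value are not expanded.

```python
import re

# ${name} or ${name:default}
MACRO = re.compile(r"\$\{([^{}:]+)(?::([^{}]*))?\}")


def expand(value, params, depth=0):
    """Substitute request parameters into `value`, following nested macros."""
    if depth > 10:
        raise ValueError("macro expansion nested too deeply")

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in params:
            return expand(params[name], params, depth + 1)
        if default is not None:
            return default
        return match.group(0)  # unknown macro: leave as written (assumption)

    return MACRO.sub(replace, value)


params = {
    "price_query": "${price_field}:[ ${low} TO ${high} ] AND inStock:true",
    "price_field": "specialPrice",
    "low": "50",
    "high": "100",
}

print(expand("${price_query}", params))
print(expand("price:[ ${low:0} TO ${high:100} ]", {}))
```

Because the expansion runs on the raw parameter strings before any query parsing, the same mechanism works inside q, fq, or a json.facet body.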

Page 42

New Query Parser Features
• Filters in queries - just like “fq” parameters, but may appear anywhere in a query

    q=text:elephant -( filter(*:* -price:[ 0 TO 100 ]) OR filter(date:[0 TO 2013]) )

• Constant score queries

    q=color:(blue OR green)^=1 text:shoes

• Comments in queries (can nest)

    q=+text:elephant /* the main query */
      /* boosting part – WIP
         {!func}mul(pop,rank)^10 */
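Nesting is what separates these comments from a simple "skip to the first */" rule: the parser has to count depth. Below is a simplified sketch of that idea. strip_comments is a hypothetical helper, it ignores quoted strings and escapes, and it collapses runs of whitespace, which a real query parser would not do.

```python
def strip_comments(query):
    """Remove /* ... */ comments, allowing comments inside comments."""
    out, depth, i = [], 0, 0
    while i < len(query):
        pair = query[i:i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:  # keep text only when outside every comment
                out.append(query[i])
            i += 1
    return " ".join("".join(out).split())


query = ("+text:elephant /* the main query */ "
         "/* boosting part /* WIP */ {!func}mul(pop,rank)^10 */")

print(strip_comments(query))
```

With a non-nesting rule, the inner "*/" would end the comment early and the boost clause would leak back into the query.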

Page 43

Thank You

Help Develop the Next Generation of Solr!

Resources:
• http://heliosearch.org
• https://github.com/Heliosearch/heliosearch
• https://groups.google.com/forum/#!forum/heliosearch
• https://groups.google.com/forum/#!forum/heliosearch-dev

