
Recent Additions to Lucene Arsenal

May 10, 2015


Presented by Shai Erera, Researcher, IBM

Lucene's arsenal has recently expanded to include two new modules: Index Sorting and Replication. Index sorting lets you keep an index consistently sorted by some criterion (e.g. modification date). This allows searches to terminate early and also yields better index compression. Index replication lets you replicate a search index to achieve high availability and fault tolerance, as well as to take hot index backups. In this talk we will introduce these modules and discuss their implementation and design details, as well as best practices.
Transcript
Page 1: Recent Additions to Lucene Arsenal
Page 2: Recent Additions to Lucene Arsenal

Recent Additions to Lucene’s Arsenal

Shai Erera, Researcher, IBM

Adrien Grand, ElasticSearch

Page 3: Recent Additions to Lucene Arsenal

• Shai Erera
  – Working at IBM, Information Retrieval Research
  – Lucene/Solr committer and PMC member
  – http://shaierera.blogspot.com

• Adrien Grand
  – @jpountz
  – Lucene/Solr committer and PMC member
  – Software engineer at Elasticsearch

Who We Are

Page 4: Recent Additions to Lucene Arsenal

The Replicator

Page 5: Recent Additions to Lucene Arsenal

Load Balancing

[Diagram label: Load Balancer]

Page 6: Recent Additions to Lucene Arsenal

Failover

Page 7: Recent Additions to Lucene Arsenal

Index Backup

Page 8: Recent Additions to Lucene Arsenal

The Replicator

Primary

Backup

Backup

http://shaierera.blogspot.com/2013/05/the-replicator.html

[Diagram labels: Replicator, ReplicationClient, ReplicationClient]

Page 9: Recent Additions to Lucene Arsenal

• Replicator
  – Mediates between the client and server
  – Manages the published Revisions
  – Includes an implementation for replication over HTTP

• Revision
  – Describes a list of files and metadata
  – Responsible for ensuring the files remain available for as long as clients replicate it

• ReplicationClient
  – Performs the replication operation on the replica side
  – Copies delta files and invokes the ReplicationHandler upon successful copy
  – Always replicates the latest revision

• ReplicationHandler
  – Acts on the copied files

Replication Components
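The "copies delta files" step can be pictured with a small stand-alone sketch. It does not use the Lucene replicator API: the class DeltaFiles, the method filesToCopy and the file names are hypothetical, and it assumes a revision is simply a list of file names of which only those missing on the replica need to be copied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: illustrates the "delta" idea only.
final class DeltaFiles {

    // Returns the revision's files that the replica does not have yet,
    // in the order in which the revision lists them.
    static List<String> filesToCopy(List<String> revisionFiles, Set<String> replicaFiles) {
        List<String> missing = new ArrayList<>();
        for (String file : revisionFiles) {
            if (!replicaFiles.contains(file)) {
                missing.add(file);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up file names for illustration.
        List<String> revision = Arrays.asList("_0.cfs", "_1.cfs", "_2.cfs", "segments_3");
        Set<String> replica = new HashSet<>(Arrays.asList("_0.cfs", "_1.cfs", "segments_2"));
        System.out.println(filesToCopy(revision, replica)); // prints [_2.cfs, segments_3]
    }
}
```

Because segment files are immutable, a file that is already present on the replica never needs to be copied again, which is what makes this kind of set difference sufficient in the sketch.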

Page 10: Recent Additions to Lucene Arsenal

• IndexRevision
  – Obtains a snapshot of the last commit through SnapshotDeletionPolicy
  – The snapshot is released when the revision is released by the Replicator

• IndexReplicationHandler
  – Copies the files to the index directory and fsyncs them
  – Aborts (rolls back) on any error
  – Upon successful completion, invokes a callback (e.g. SearcherManager.maybeRefresh())

• Similar extensions for faceted index replication
  – IndexAndTaxonomyRevision: obtains snapshots on both the search and taxonomy indexes
  – IndexAndTaxonomyReplicationHandler: copies the files to the respective directories, keeping both in sync

Index Replication

Page 11: Recent Additions to Lucene Arsenal

Sample Code

// Server-side: publish a new Revision
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));

// Client-side: replicate a Revision
Replicator replicator; // either LocalReplicator or HttpReplicator

// refresh SearcherManager after index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  @Override
  public Boolean call() throws Exception {
    // index was updated, refresh manager
    searcherManager.maybeRefresh();
    return true;
  }
};

ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
// check for updates every 30 seconds; the second argument names the update thread
client.startUpdateThread(30000, "replication-client");

Page 12: Recent Additions to Lucene Arsenal

• Resume
  – Session level: don’t copy files that were already successfully copied
  – File level: don’t copy file parts that were already successfully copied

• Parallel Replication
  – Copy revision files in parallel

• Other replication strategies
  – Peer-to-peer

Future Work

Page 13: Recent Additions to Lucene Arsenal

Index Sorting

How to trade index speed for search speed

Page 14: Recent Additions to Lucene Arsenal

Index = collection of immutable segments

Segments store documents sequentially on disk

Add data = create a new segment

Segments get eventually merged together

Order of segments / documents in segments doesn’t matter
– the following segments are equivalent

Anatomy of a Lucene index

Price   9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
Id      1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

Price  13 10 10  9  8  8  7  6  4  4  3  2  2  1  1  1  0
Id     12  9 31  1  4 11 10 30 15 18  8  7 20 42 99  5  3

Page 15: Recent Additions to Lucene Arsenal

ordinal of a doc in a segment = doc id

used in the inverted index to refer to docs

Anatomy of a Lucene index

doc id  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Price   9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
Id      1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

Postings for the term "shoe" (doc ids): 1, 3, 5, 8, 11, 13, 15

Page 16: Recent Additions to Lucene Arsenal

Get top N=2 results:
– Create a priority queue of size N
– Accumulate matching docs

Top hits

Price   9  0  7  8  2  2  1  8 10  3  4  4  6 10  1  1 13
Id      1  3 10  4  7 20 42 11  9  8 15 18 30 31 99  5 12

Priority queue (Ids) after each matching doc:
() → (3) → (3,4) → (4,20) → (4,9) → (4,9) → (9,31) → (9,31)

– Create an empty priority queue
– On overflow, the priority queue automatically removes the least entry
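The queue trace above can be reproduced with a short stand-alone sketch. It uses java.util.PriorityQueue rather than Lucene's own collectors; the class name TopHits is hypothetical, and the tie-break (the lower doc id wins on equal price) is an assumption made here to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Stand-alone illustration of top-N collection; does not use Lucene classes.
final class TopHits {

    // Per-document values from the slide, indexed by doc id.
    static final int[] PRICE = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
    static final int[] ID = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};

    // Returns the Ids of the n highest-priced matching docs, best first.
    static List<Integer> topN(int[] matchingDocs, int n) {
        // Min-heap: the head is always the least competitive hit.
        // Assumption: on equal price, the higher doc id is less competitive.
        PriorityQueue<Integer> pq = new PriorityQueue<>((a, b) -> {
            int byPrice = Integer.compare(PRICE[a], PRICE[b]);
            return byPrice != 0 ? byPrice : Integer.compare(b, a);
        });
        for (int doc : matchingDocs) {
            pq.add(doc);
            if (pq.size() > n) {
                pq.poll(); // overflow: drop the least one
            }
        }
        List<Integer> ids = new ArrayList<>();
        while (!pq.isEmpty()) {
            ids.add(ID[pq.poll()]); // least competitive first
        }
        Collections.reverse(ids);
        return ids;
    }

    public static void main(String[] args) {
        int[] shoe = {1, 3, 5, 8, 11, 13, 15}; // postings for "shoe"
        System.out.println(topN(shoe, 2)); // prints [9, 31]
    }
}
```

Note that every one of the seven matching docs has to be visited, because a later doc may always displace an entry in the queue.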

Page 17: Recent Additions to Lucene Arsenal

Let’s do the same on a sorted index

Early termination

Price  13 10 10  9  8  8  7  6  4  4  3  2  2  1  1  1  0
Id     12  9 31  1  4 11 10 30 15 18  8  7 20 42 99  5  3

Priority queue (Ids) after each matching doc:
() → (9) → (9,31) → (9,31) → (9,31) → (9,31) → (9,31) → (9,31)

Priority queue never changes after the second match (Id 31)
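On the sorted index the same query can stop as soon as N matches have been seen. The following is a minimal stand-alone sketch of that idea, not Lucene code; the class name EarlyTermination is hypothetical, and the postings array was derived by hand from the sorted table above (the positions at which the seven "shoe" documents end up after sorting).

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration of early termination on a sorted index.
final class EarlyTermination {

    // Ids in sorted order (descending price), indexed by the new doc id.
    static final int[] SORTED_ID = {12, 9, 31, 1, 4, 11, 10, 30, 15, 18, 8, 7, 20, 42, 99, 5, 3};

    // Collects the first n matches and stops: since docs are already in rank
    // order, later matches cannot be better than the ones already collected.
    static List<Integer> firstN(int[] matchingDocs, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int doc : matchingDocs) {
            if (ids.size() >= n) {
                break; // early termination
            }
            ids.add(SORTED_ID[doc]);
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] shoe = {1, 2, 4, 9, 12, 15, 16}; // "shoe" postings in the sorted index
        System.out.println(firstN(shoe, 2)); // prints [9, 31]
    }
}
```

Only two of the seven postings are read here, compared with all seven on the unsorted index.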

Page 18: Recent Additions to Lucene Arsenal

Pros
– makes finding the top hits much faster
– file-system cache-friendly

Cons
– only works for static ranks
  – not if the sort order depends on the query
– requires the index to be sorted
– doesn’t work for tasks that require visiting every doc:
  – total number of matches
  – faceting

Early termination

Page 19: Recent Additions to Lucene Arsenal

Not uncommon!

Graph-based ranks
– Google’s PageRank

Facebook social search / Unicorn
– https://www.facebook.com/publications/219621248185635

Many more...

Doesn’t need to be the exact sort order
– heuristics when score is only a function of the static rank

Static ranks

Page 20: Recent Additions to Lucene Arsenal

A live index can’t be kept sorted
– would require inserting docs between existing docs!
– segments are immutable

Offline sorting to the rescue:
– index as usual
– sort into a new index
– search!

Pros/cons
– super fast to search, the whole index is fully sorted
– but only works for static content

Offline sorting

Page 21: Recent Additions to Lucene Arsenal

Offline Sorting

// open a reader on the unsorted index and create a sorted (but slow) view
DirectoryReader reader = DirectoryReader.open(in);
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
AtomicReader sortedReader = SortingAtomicReader.wrap(
    SlowCompositeReaderWrapper.wrap(reader), sorter);

// copy the content of the sorted reader to the new dir
IndexWriter writer = new IndexWriter(out, iwConf);
writer.addIndexes(sortedReader);
writer.close();
reader.close();
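Conceptually, a sorted view boils down to a permutation from new doc ids to old doc ids. The sketch below is a stand-alone illustration with a hypothetical class SortPermutation; it is not how Lucene's Sorter is implemented, and it assumes ties keep their original relative order (a stable sort), which reproduces the sorted segment shown on the earlier slide.

```java
import java.util.Arrays;

// Stand-alone illustration: computes the doc id permutation for a
// descending sort on a per-document value. Not Lucene's implementation.
final class SortPermutation {

    // result[newDocId] = oldDocId, sorted by descending price.
    static int[] newToOld(final int[] price) {
        Integer[] docs = new Integer[price.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        // Arrays.sort on objects is stable, so equal prices keep index order.
        Arrays.sort(docs, (a, b) -> Integer.compare(price[b], price[a]));
        int[] result = new int[docs.length];
        for (int i = 0; i < docs.length; i++) {
            result[i] = docs[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] price = {9, 0, 7, 8, 2, 2, 1, 8, 10, 3, 4, 4, 6, 10, 1, 1, 13};
        int[] id = {1, 3, 10, 4, 7, 20, 42, 11, 9, 8, 15, 18, 30, 31, 99, 5, 12};
        int[] map = newToOld(price);
        int[] sortedIds = new int[map.length];
        for (int i = 0; i < map.length; i++) {
            sortedIds[i] = id[map[i]];
        }
        // prints [12, 9, 31, 1, 4, 11, 10, 30, 15, 18, 8, 7, 20, 42, 99, 5, 3]
        System.out.println(Arrays.toString(sortedIds));
    }
}
```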

Page 22: Recent Additions to Lucene Arsenal

Sort segments independently
– wouldn’t require inserting data into existing segments
– collection could still be early-terminated on a per-segment basis

But segments are immutable
– must be sorted before starting writing them

Online sorting?

Page 23: Recent Additions to Lucene Arsenal

2 sources of segments
– flush
– merge

flushed segments can’t be sorted
– Lucene writes stored fields to disk on the fly
– could be buffered but this would require a lot of memory

merged segments can be sorted
– create a sorted view over the segments to merge
– pass this view to SegmentMerger instead of the original segments

not a bad trade-off
– flushed segments are usually small & fast to collect

Online sorting?

Page 24: Recent Additions to Lucene Arsenal

Online sorting?

Flushed segments
– NRT reopens
– RAM buffer size limit hit

Merged segments

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Merged segments can easily take 99+% of the size of the index

Page 25: Recent Additions to Lucene Arsenal

Online Sorting

IndexWriterConfig iwConf = new IndexWriterConfig(...);

// original MergePolicy finds the segments to merge
MergePolicy origMP = iwConf.getMergePolicy();

// SortingMergePolicy wraps the segments with a sorted view
boolean ascending = false;
Sorter sorter = new NumericDocValuesSorter("price", ascending);
MergePolicy sortingMP = new SortingMergePolicy(origMP, sorter);

// setup IndexWriter to use SortingMergePolicy
iwConf.setMergePolicy(sortingMP);
IndexWriter writer = new IndexWriter(dir, iwConf);

// index as usual

Page 26: Recent Additions to Lucene Arsenal

Collect top N matches

Offline sorting
– index sorted globally
– early terminate after N matches have been collected
– no priority queue needed!

Online sorting
– no early termination on flushed segments
– early termination on merged segments
  – if N matches have been collected
  – or if current match is less than the top of the PQ

Early termination
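The online-sorting rules above can be sketched without Lucene as follows. The class PerSegmentCollector and its nested Segment type are hypothetical, each segment is reduced to the prices of its matching docs in doc id order, and the priority queue is a plain min-heap of prices. This is a simplified model of the idea, not the collector used on the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Stand-alone model of per-segment early termination; not Lucene code.
final class PerSegmentCollector {

    // A segment reduced to the prices of its matching docs, in doc id order.
    static final class Segment {
        final int[] prices;
        final boolean sortedDescending; // true for merged (sorted) segments

        Segment(int[] prices, boolean sortedDescending) {
            this.prices = prices;
            this.sortedDescending = sortedDescending;
        }
    }

    private final int n;
    private final PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap
    int visited; // number of docs actually collected, for illustration

    PerSegmentCollector(int n) {
        this.n = n;
    }

    void collect(Segment segment) {
        int collected = 0;
        for (int price : segment.prices) {
            // Only sorted segments may stop early: either n docs were already
            // collected here, or the queue is full and this doc cannot beat it.
            if (segment.sortedDescending
                    && (collected >= n || (pq.size() == n && price <= pq.peek()))) {
                break;
            }
            visited++;
            collected++;
            pq.add(price);
            if (pq.size() > n) {
                pq.poll(); // drop the least competitive price
            }
        }
    }

    // Top prices collected so far, best first.
    List<Integer> topPrices() {
        List<Integer> out = new ArrayList<>(pq);
        Collections.sort(out, Collections.reverseOrder());
        return out;
    }

    public static void main(String[] args) {
        PerSegmentCollector collector = new PerSegmentCollector(2);
        collector.collect(new Segment(new int[]{2, 10, 4}, false));          // flushed
        collector.collect(new Segment(new int[]{13, 10, 9, 8, 3, 1}, true)); // merged
        System.out.println(collector.topPrices()); // prints [13, 10]
        System.out.println(collector.visited);     // prints 4
    }
}
```

In this run the flushed segment is scanned in full (3 docs), while the merged segment is abandoned after a single collected doc, because its second doc (price 10) no longer beats the least entry in the queue.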

Page 27: Recent Additions to Lucene Arsenal

Early Termination

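class MyCollector extends Collector {

  // sorter, maxDocsToCollect, pq and curVal are assumed to be defined elsewhere
  private boolean readerIsSorted;
  private int collected;

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // only segments written by SortingMergePolicy are sorted
    readerIsSorted = SortingMergePolicy.isSorted(context.reader(), sorter);
    collected = 0;
  }

  @Override
  public void collect(int doc) throws IOException {
    // terminate once maxDocsToCollect hits were collected in this segment,
    // or when the current value can no longer compete with the top of the PQ
    if (readerIsSorted && (++collected > maxDocsToCollect || curVal <= pq.top())) {
      // Special exception that tells IndexSearcher to terminate
      // collection of the current segment
      throw new CollectionTerminatedException();
    } else {
      // collect hit
    }
  }
}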

Page 28: Recent Additions to Lucene Arsenal

Questions?