
What is in a Lucene index?

Nov 11, 2014

Presented by Adrien Grand, Software Engineer, Elasticsearch

Although people usually come to Lucene and related solutions in order to make data searchable, they often realize that it can do much more for them. Indeed, its ability to handle high loads of complex queries makes Lucene a perfect fit for analytics applications and, for some use-cases, even a credible replacement for a primary data-store. It is important to understand the design decisions behind Lucene in order to better understand the problems it can solve and the problems it cannot solve. This talk will explain the design decisions behind Lucene, and give insights into how Lucene stores data on disk and how it differs from traditional databases. Finally, there will be highlights of recent and future changes in Lucene index file formats.
Transcript
Page 1: What is in a Lucene index?
Page 2: What is in a Lucene index?

WHAT IS IN A LUCENE INDEX

Adrien Grand
Software engineer at Elasticsearch
@jpountz

Page 3: What is in a Lucene index?

About me

• Lucene/Solr committer
• Software engineer at Elasticsearch
• I like changing the index file formats!
  – stored fields
  – term vectors
  – doc values
  – ...

Page 4: What is in a Lucene index?

Why should I learn about Lucene internals?

Page 5: What is in a Lucene index?

Why should I learn about Lucene internals?

• Know the cost of the APIs
  – to build blazing fast search applications
  – don’t commit all the time
  – when to use stored fields vs. doc values
  – maybe Lucene is not the right tool
• Understand index size
  – oh, term vectors are 1/2 of the index size!
  – I removed 20% of my documents and index size hasn’t changed
• This is a lot of fun!

Page 6: What is in a Lucene index?

Indexing

• Make data fast to search
  – duplicate data if it helps
  – decide on how to index based on the queries
• Trade update speed for search speed
  – Grep vs full-text indexing
  – Prefix queries vs edge n-grams
  – Phrase queries vs shingles
• Indexing is fast
  – 220 GB/hour for 4K docs!
  – http://people.apache.org/~mikemccand/lucenebench/indexing.html

Page 7: What is in a Lucene index?

Let’s create an index

• Tree structure
  – sorted for range queries
  – O(log(n)) search

[Figure: a tree-shaped index over the terms data, index, Lucene, sql and term, pointing to the documents "Lucene in action" and "Databases".]

Page 8: What is in a Lucene index?

Lucene doesn’t work this way

Page 9: What is in a Lucene index?

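Another index

• Store terms and documents in arrays
  – binary search

[Figure: two documents (doc id 0 "Lucene in action", doc id 1 "Databases"), an array of terms (data, index, Lucene, sql, term) addressed by ordinal, and one list of doc ids per term.]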

Page 10: What is in a Lucene index?

Another index

• Store terms and documents in arrays
  – binary search

[Figure: a segment. Documents (doc id 0 "Lucene in action", doc id 1 "Databases") are addressed by doc id; the terms dict (data, index, Lucene, sql, term) is addressed by term ordinal; each term has a postings list of doc ids.]
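The array-based layout on this slide can be sketched in a few lines of Python. This is only an illustration, not Lucene code: `build_segment` and `lookup` are hypothetical names, the document texts are assumed from the terms shown in the figure, and ordinals here are simply positions in the sorted terms array.

```python
from bisect import bisect_left

def build_segment(docs):
    """Build a toy segment: a sorted terms array plus one postings list per term."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            postings.setdefault(term, []).append(doc_id)
    terms = sorted(postings)                     # terms dict: ordinal = array index
    return terms, [postings[t] for t in terms]   # postings lists aligned by ordinal

def lookup(segment, term):
    """Binary-search the terms array and return the matching postings list."""
    terms, postings = segment
    ord_ = bisect_left(terms, term)
    if ord_ < len(terms) and terms[ord_] == term:
        return postings[ord_]
    return []

# Assumed document contents, chosen to match the terms in the figure.
segment = build_segment(["data index lucene term", "data index sql"])
```

With this toy segment, `lookup(segment, "data")` returns both doc ids and `lookup(segment, "sql")` returns only the second document.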

Page 11: What is in a Lucene index?

Insertions?

• Insertion = write a new segment
• Merge segments when there are too many of them
  – concatenate docs, merge terms dicts and postings lists (merge sort!)

[Figure: a two-document segment (doc id 0 "Lucene in action", doc id 1 "Databases") shown next to two single-document segments, one per document, each with its own terms dict and postings lists and each numbering its document 0.]

Page 12: What is in a Lucene index?

Insertions?

• Insertion = write a new segment
• Merge segments when there are too many of them
  – concatenate docs, merge terms dicts and postings lists (merge sort!)

[Figure: the same segments as on the previous slide, with "Databases" now numbered doc id 1 in its postings lists, as it is in the two-document segment.]
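As a rough sketch of the merge step described above (not Lucene's actual merge code), two segments can be merged with a classic two-pointer merge over their sorted terms. The function name `merge_segments` and the list-of-pairs representation are assumptions made for this example; doc ids of the second segment are shifted by the number of docs in the first, which is the renumbering the figure shows.

```python
def merge_segments(a, b, num_docs_a):
    """Merge two segments given as lists of (term, postings) sorted by term.

    Doc ids from b are shifted by num_docs_a because docs are concatenated.
    """
    merged, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i][0] < b[j][0]):
            merged.append((a[i][0], list(a[i][1])))      # term only in a
            i += 1
        elif i == len(a) or b[j][0] < a[i][0]:
            merged.append((b[j][0], [d + num_docs_a for d in b[j][1]]))  # only in b
            j += 1
        else:                                            # term in both segments
            shifted = [d + num_docs_a for d in b[j][1]]
            merged.append((a[i][0], a[i][1] + shifted))
            i += 1
            j += 1
    return merged

seg_lucene = [("data", [0]), ("index", [0]), ("lucene", [0]), ("term", [0])]
seg_databases = [("data", [0]), ("index", [0]), ("sql", [0])]
merged = merge_segments(seg_lucene, seg_databases, num_docs_a=1)
```

Because both inputs are already sorted, the merge reads each of them once, front to back.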

Page 13: What is in a Lucene index?

Deletions?

• Deletion = turn a bit off
• Ignore deleted documents when searching and merging (reclaims space)
• Merge policies favor segments with many deletions

[Figure: the two-document segment with a "live docs" bit per document (1 = live, 0 = deleted); one document has bit 1 and the other has bit 0.]
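A minimal sketch of the live-docs idea, assuming a plain Python list stands in for the bitset; `delete` and `filter_live` are hypothetical helper names, not Lucene APIs.

```python
def delete(live_docs, doc_id):
    """Deleting a document only turns its bit off; the segment data is untouched."""
    live_docs[doc_id] = 0

def filter_live(postings, live_docs):
    """Searching skips doc ids whose live bit is 0."""
    return [doc_id for doc_id in postings if live_docs[doc_id]]

live_docs = [1, 1]      # 1 = live, 0 = deleted
delete(live_docs, 1)
visible = filter_live([0, 1], live_docs)
```

The postings list still contains doc id 1 after the deletion; the space is only reclaimed when the segment is merged.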

Page 14: What is in a Lucene index?

Pros/cons

• Updates require writing a new segment
  – single-doc updates are costly, bulk updates preferred
  – writes are sequential
• Segments are never modified in place
  – filesystem-cache-friendly
  – lock-free!
• Terms are deduplicated
  – saves space for high-freq terms
• Docs are uniquely identified by an ord
  – useful for cross-API communication
  – Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
  – important for sorting: compare longs, not strings
  – important for faceting (more on this later)

Page 15: What is in a Lucene index?

Lucene can use several indexes

Many databases can’t

Page 16: What is in a Lucene index?

Index intersection

red:  1, 2, 10, 11, 20, 30, 50, 100
shoe: 2, 20, 21, 22, 30, 40, 100

[Figure: arrows numbered 1 to 9 show the order in which entries of the two lists are visited.]

Many databases just pick the most selective index and ignore the other ones

Lucene’s postings lists support skipping that can be used to “leap-frog”
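The leap-frog intersection can be sketched as follows. This is an illustration only: a binary search over a Python list stands in for the skip data that real postings lists carry, and `advance` and `intersect` are names chosen for this example.

```python
from bisect import bisect_left

def advance(postings, start, target):
    """Index of the first doc id >= target at or after position start."""
    return bisect_left(postings, target, start)

def intersect(a, b):
    """Leap-frog: each list jumps forward to the other list's current doc id."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, i, b[j])
        else:
            j = advance(b, j, a[i])
    return result

red = [1, 2, 10, 11, 20, 30, 50, 100]
shoe = [2, 20, 21, 22, 30, 40, 100]
both = intersect(red, shoe)   # doc ids matching both terms
```

Neither list has to be read in full: whole runs such as 10, 11 or 21, 22 are skipped without being compared one by one.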

Page 17: What is in a Lucene index?

What else?

• We just covered search
• Lucene does more
  – term vectors
  – norms
  – numeric doc values
  – binary doc values
  – sorted doc values
  – sorted set doc values

Page 18: What is in a Lucene index?

Term vectors

• Per-document inverted index
• Useful for more-like-this
• Sometimes used for highlighting

[Figure: each document ("Lucene in action", "Databases") with its own small terms dict and postings lists, shown next to the segment-wide inverted index over both documents.]

Page 19: What is in a Lucene index?

Numeric/binary doc values

• Per doc and per field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values

doc id  document          field_a  field_b
0       Lucene in action  42       afc
1       Databases         1        gce
2       Solr in action    3        ppy
3       Java              10       ccn

Page 20: What is in a Lucene index?

Sorted (set) doc values

• Ordinal-enabled per-doc and per-field values
  – sorted: single-valued, useful for sorting
  – sorted set: multi-valued, useful for faceting

doc id  document          ordinals
0       Lucene in action  1,2
1       Databases         0
2       Solr in action    0,1,2
3       Java              1

Terms dictionary for this dv field:

ordinal  term
0        distributed
1        Java
2        search

Page 21: What is in a Lucene index?

Faceting

• Compute value counts for docs that match a query
  – e.g. category counts on an ecommerce website
• Naive solution
  – hash table: value to count
  – O(#docs) ordinal lookups
  – O(#docs) value lookups
• 2nd solution
  – hash table: ord to count (since ordinals are dense, this can be a simple array)
  – resolve values in the end
  – O(#docs) ordinal lookups
  – O(#values) value lookups
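The second solution can be sketched like this, using the ordinals and terms from the sorted set doc values slide as sample data (the mapping of ordinals to documents is read off the figure and may not be exact). `facet_counts` is a name made up for this example.

```python
def facet_counts(matching_docs, doc_ords, terms):
    """Count values per ordinal in a dense array, then resolve terms once at the end."""
    counts = [0] * len(terms)              # ordinals are dense: an array is enough
    for doc_id in matching_docs:           # O(#docs) ordinal lookups
        for ord_ in doc_ords[doc_id]:
            counts[ord_] += 1
    # O(#values) value lookups, done only for ordinals that were seen
    return {terms[ord_]: n for ord_, n in enumerate(counts) if n}

terms = ["distributed", "Java", "search"]        # terms dict of the dv field
doc_ords = [[1, 2], [0], [0, 1, 2], [1]]         # ordinals per doc id
counts = facet_counts([0, 2, 3], doc_ords, terms)
```

The inner loop only touches small integers; strings are looked up once per distinct value instead of once per document.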

Page 22: What is in a Lucene index?

How can I use these APIs?

• These are the low-level Lucene APIs, everything is built on top of these APIs: searching, faceting, scoring, highlighting, etc.

API                 Useful for                           Method
Inverted index      Term -> doc ids, positions, offsets  AtomicReader.fields
Stored fields       Summaries of search results          IndexReader.document
Live docs           Ignoring deleted docs                AtomicReader.liveDocs
Term vectors        More like this                       IndexReader.termVectors
Doc values / Norms  Sorting/faceting/scoring             AtomicReader.get*Values

Page 23: What is in a Lucene index?

Wrap up

• Data duplicated up to 4 times
  – not a waste of space!
  – easy to manage thanks to immutability
• Stored fields vs doc values
  – Optimized for different access patterns
  – get many field values for a few docs: stored fields
  – get a few field values for many docs: doc values

Stored fields, laid out doc by doc (doc id, field): 0,A 0,B 0,C 1,A 1,B 1,C 2,A 2,B 2,C
  At most 1 seek per doc

Doc values, laid out field by field (doc id, field): 0,A 1,A 2,A then 0,B 1,B 2,B ...
  At most 1 seek per doc per field, BUT more disk / file-system cache-friendly

Page 24: What is in a Lucene index?

File formats

Page 25: What is in a Lucene index?

Important rules

• Save file handles
  – don’t use one file per field or per doc
• Avoid disk seeks whenever possible
  – disk seek on spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
  – random access in small files is fine
• Light compression helps
  – less I/O
  – smaller indexes
  – filesystem-cache-friendly

Page 26: What is in a Lucene index?

Codecs

• File formats are codec-dependent
• Default codec tries to get the best speed for little memory
  – To trade memory for speed, don’t use RAMDirectory; use MemoryPostingsFormat, MemoryDocValuesFormat, etc. instead
• Detailed file formats available in javadocs
  – http://lucene.apache.org/core/4_5_1/core/org/apache/lucene/codecs/package-summary.html

Page 27: What is in a Lucene index?

Compression techniques

• Bit packing / vInt encoding
  – postings lists
  – numeric doc values
• LZ4
  – code.google.com/p/lz4
  – lightweight compression algorithm
  – stored fields, term vectors
• FSTs
  – conceptually a Map<String, ?>
  – keys share prefixes and suffixes
  – terms index
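For readers unfamiliar with vInt encoding, here is a small sketch of a variable-length integer format of this kind: 7 payload bits per byte, low-order bits first, with the high bit set on every byte except the last. The helper names `write_vint` and `read_vint` are chosen for this example, and only non-negative integers are handled.

```python
def write_vint(value):
    """Encode a non-negative int using 7 bits per byte, low-order bits first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)   # high bit set: more bytes follow
        value >>= 7
    out.append(value)                        # last byte has the high bit clear
    return bytes(out)

def read_vint(data, pos=0):
    """Decode one vInt starting at pos; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7

small = write_vint(1)      # fits in one byte
larger = write_vint(300)   # needs two bytes
```

Small numbers, which are common once doc ids are delta-encoded, take a single byte.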

Page 28: What is in a Lucene index?

What happens when I run a TermQuery?

Page 29: What is in a Lucene index?

1. Terms index

• Lookup the term in the terms index
  – In-memory FST storing terms prefixes
  – Gives the offset to look at in the terms dictionary
  – Can fast-fail if no terms have this prefix

[Figure: an FST whose arcs are labelled b/2, r, a/1, l/4, u, c, y/3, r; summing the outputs along a path gives br = 2, brac = 3, luc = 4, lyr = 7.]
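To make the "sum the outputs along the path" idea concrete, here is a toy lookup over the four prefixes in the figure. It is a plain trie with outputs on arcs, not a real FST (no suffix sharing or minimization), and the exact arc layout is an assumption based on the figure; `node` and `lookup` are hypothetical names.

```python
def node(final=False, **arcs):
    """A trie node; arcs map one character to (output, next node)."""
    return {"final": final, "arcs": arcs}

# Arc outputs follow the figure: b/2, a/1, l/4, y/3, all others 0.
n_brac = node(final=True)
n_bra = node(c=(0, n_brac))
n_br = node(final=True, a=(1, n_bra))
n_b = node(r=(0, n_br))
n_luc = node(final=True)
n_lu = node(c=(0, n_luc))
n_lyr = node(final=True)
n_ly = node(r=(0, n_lyr))
n_l = node(u=(0, n_lu), y=(3, n_ly))
root = node(b=(2, n_b), l=(4, n_l))

def lookup(prefix):
    """Sum arc outputs along the path; None means no term has this prefix."""
    total, current = 0, root
    for ch in prefix:
        arc = current["arcs"].get(ch)
        if arc is None:
            return None                  # fast-fail: no need to touch the disk
        output, current = arc
        total += output
    return total if current["final"] else None
```

A miss such as `lookup("lx")` is answered entirely in memory, which is the fast-fail case mentioned on the slide.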

Page 30: What is in a Lucene index?

2. Terms dictionary

• Jump to the given offset in the terms dictionary
  – compressed based on shared prefixes, similarly to a burst trie
  – called the “BlockTree terms dict”
• Read sequentially until the term is found

[prefix=luc]                      <- Jump here
a, freq=1, offset=101             <- Not found
as, freq=1, offset=149            <- Not found
ene, freq=9, offset=205           <- Found
ky, freq=7, offset=260
rative, freq=5, offset=323
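The sequential scan inside one block can be sketched as below, using the block from the slide as data. The dictionary layout and the name `find_in_block` are assumptions for illustration; the real on-disk block format is more compact.

```python
block = {
    "prefix": "luc",
    # (suffix, freq, offset into the postings lists), sorted by suffix
    "entries": [
        ("a", 1, 101),
        ("as", 1, 149),
        ("ene", 9, 205),
        ("ky", 7, 260),
        ("rative", 5, 323),
    ],
}

def find_in_block(block, term):
    """Scan the block's suffixes in order; return (freq, offset) or None."""
    prefix = block["prefix"]
    if not term.startswith(prefix):
        return None
    suffix = term[len(prefix):]
    for entry_suffix, freq, offset in block["entries"]:
        if entry_suffix == suffix:
            return freq, offset
        if entry_suffix > suffix:
            break                 # entries are sorted: the term cannot come later
    return None

hit = find_in_block(block, "lucene")
```

Only the suffixes are stored per entry, which is where the shared-prefix compression comes from.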

Page 31: What is in a Lucene index?

3. Postings lists

• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
  – 1. delta-encode
  – 2. split into blocks of N=128 values
  – 3. bit packing per block
  – 4. if remaining docs, encode with vInt

Example with N=4:

doc ids: 1,3,4,6,8,20,22,26,30,31
deltas:  1,2,1,2,2,12,2,4,4,1
blocks:  [1,2,1,2] (2 bits per value)  [2,12,2,4] (4 bits per value)  4, 1 (vInt-encoded)
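The four steps can be followed in a short sketch. It is not Lucene's encoder: blocks are packed into Python integers rather than byte buffers, the vInt step is left as a plain list of leftover deltas, and the names `encode_postings`, `pack` and `unpack` are made up for this example.

```python
def pack(block, bits):
    """Pack each value of the block into `bits` bits of one integer."""
    packed = 0
    for i, value in enumerate(block):
        packed |= value << (i * bits)
    return packed

def unpack(packed, bits, count):
    """Inverse of pack: extract `count` values of `bits` bits each."""
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]

def encode_postings(doc_ids, block_size=128):
    """Return ([(bits, packed block), ...], leftover deltas for vInt encoding)."""
    deltas = [doc - prev for doc, prev in zip(doc_ids, [0] + doc_ids[:-1])]
    full = len(deltas) - len(deltas) % block_size
    blocks = []
    for start in range(0, full, block_size):
        block = deltas[start:start + block_size]
        bits = max(block).bit_length()      # widest value decides the block width
        blocks.append((bits, pack(block, bits)))
    return blocks, deltas[full:]

blocks, tail = encode_postings([1, 3, 4, 6, 8, 20, 22, 26, 30, 31], block_size=4)
```

On the slide's example this gives one block at 2 bits per value, one at 4 bits per value, and two leftover deltas, so a single large gap (12) only costs extra bits within its own block.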

Page 32: What is in a Lucene index?

4. Stored fields

• In-memory index for a subset of the doc ids
  – memory-efficient thanks to monotonic compression
  – searched using binary search
• Stored fields
  – stored sequentially
  – compressed (LZ4) in 16+KB blocks

[Figure: docs 0 to 6 stored in three 16KB blocks; the in-memory index holds docId=0 offset=42, docId=3 offset=127, docId=4 offset=199.]
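The lookup in the in-memory index amounts to a binary search for the last block whose first doc id is not greater than the requested one. A sketch with the three entries from the figure, where `block_offset_for` is a hypothetical helper name:

```python
from bisect import bisect_right

# One entry per compressed block: first doc id in the block, and its file offset.
block_first_doc = [0, 3, 4]
block_offset = [42, 127, 199]

def block_offset_for(doc_id):
    """Return the file offset of the block that contains doc_id."""
    if doc_id < 0:
        raise ValueError("doc ids are non-negative")
    i = bisect_right(block_first_doc, doc_id) - 1   # last block starting <= doc_id
    return block_offset[i]

offset = block_offset_for(2)   # doc 2 lives in the block that starts at doc 0
```

After the seek to that offset, the block is decompressed and read sequentially until the requested document is reached.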

Page 33: What is in a Lucene index?

Query execution

• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fit into the file-system cache
• “Pulse” optimization
  – For unique terms (freq=1), postings are inlined in the terms dict
  – Only 1 disk seek
  – Will always be used for your primary keys

Page 34: What is in a Lucene index?

Quiz

Page 35: What is in a Lucene index?

What is happening here?

[Plot: qps (y axis) against #docs in the index (x axis), with two points marked 1 and 2.]

Page 36: What is in a Lucene index?

What is happening here?

[Plot: qps (y axis) against #docs in the index (x axis), with two points marked 1 and 2.]

1. Index grows larger than the filesystem cache: stored fields not fully in the cache anymore

Page 37: What is in a Lucene index?

What is happening here?

[Plot: qps (y axis) against #docs in the index (x axis), with two points marked 1 and 2.]

1. Index grows larger than the filesystem cache: stored fields not fully in the cache anymore
2. Terms dict/Postings lists not fully in the cache

Page 38: What is in a Lucene index?

Thank you!