Page 1: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Lucene Roadmap

Steve Rowe

LucidWorks

Page 2: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Some Lucene (& Solr) History & Stats

• 1997: Doug Cutting creates Lucene

• 2000-2001: SourceForge hosts Lucene

• 2001-present: Lucene @ Apache Software Foundation

• 2006: Flexible indexing planning starts

• 2007: Solr graduates from the Apache Incubator to join the Lucene PMC as a sub-project

• 2008: Flexible indexing implementation begins

• 2010: Lucene and Solr development merge

• 2011: Lucene and Solr 3.1 and all further releases coordinated (13 joint releases so far)

• 2012: Lucene/Solr 4.0 released

Page 3: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Lucene 4.0 Highlights

• Flexible indexing: pluggable codecs: index format suites

• Flexible scoring: more index stats & similarities that use them

• Faster multithreaded indexing via concurrent flushing: DWPT

• Doc Values: typed single-valued fields: flexible sorting, scoring

• Norms are now doc values: you can have more than one byte!

• More RAM efficient data structures, e.g. terms dict/idx & fieldcache

• Faster search filtering

• Merge I/O can be rate-limited, to reduce I/O contention

• IndexReader is now per-segment

• Completely reworked spatial search
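
One way to picture the doc values bullet is as a column-stride store: one typed value per document, addressed by document ID, so sorting never touches stored fields. The sketch below is a toy in-memory model and not Lucene's API; the class name, function name, and the "price" column are hypothetical.

```python
from array import array


class NumericDocValues:
    """Toy column: exactly one integer per document, indexed by doc ID."""

    def __init__(self, values):
        # 'q' = signed 64-bit; a typed array keeps the column compact in RAM.
        self._column = array("q", values)

    def get(self, doc_id):
        return self._column[doc_id]


def sort_hits(hits, doc_values, reverse=False):
    """Order matching doc IDs by their per-document value."""
    return sorted(hits, key=doc_values.get, reverse=reverse)


# Hypothetical "price" column for documents 0..4.
price = NumericDocValues([30, 10, 50, 20, 40])
```

Calling sort_hits([0, 2, 3], price) orders the three hits by price using only the column.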

Page 4: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Lucene 4.1 & 4.2 Highlights

• Seeks on writing out index files eliminated

• Compressed stored fields and term vectors

• AnalyzingSuggester and FuzzySuggester

• Lucene facet module improvements: speedups, NRT support, DrillSideways

• PostingsHighlighter: uses postings offsets

• CommonTermsQuery: speeds up queries with very high-frequency terms

• Doc Values API and performance improvements

• The FST package supports FSTs over 2GB in size

• LiveFieldValues: real-time get for Lucene

• New classification module
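
The idea behind CommonTermsQuery can be sketched as follows: split the query terms by document frequency, let only the rarer terms select candidate documents, and let the very frequent terms contribute to scoring alone. This is a rough Python illustration under assumptions, not Lucene's implementation: the function names, the 0.5 cutoff, the dict-of-sets postings, and the overlap-count "score" are all hypothetical.

```python
def split_by_frequency(terms, doc_freq, num_docs, max_term_frequency=0.5):
    """Partition query terms into (low_freq, high_freq) by document frequency."""
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / num_docs
        (high if ratio > max_term_frequency else low).append(term)
    return low, high


def common_terms_search(terms, postings, num_docs, max_term_frequency=0.5):
    """Return {doc_id: score}; postings maps term -> set of doc IDs."""
    doc_freq = {t: len(postings.get(t, ())) for t in terms}
    low, high = split_by_frequency(terms, doc_freq, num_docs, max_term_frequency)
    # Only low-frequency terms select candidates; if every term is common,
    # fall back to the high-frequency terms so the query still matches.
    selecting = low if low else high
    candidates = set()
    for term in selecting:
        candidates |= set(postings.get(term, ()))
    # Toy score: how many query terms (common or not) the document contains.
    return {
        doc: sum(1 for t in terms if doc in postings.get(t, ()))
        for doc in candidates
    }


# Hypothetical four-document index.
postings = {"the": {0, 1, 2, 3}, "lucene": {1, 3}}
```

With this index, a query for "the lucene" only visits the two documents containing "lucene", instead of walking the full postings list for "the".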

Page 5: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Lucene 4.3 Highlights

• minShouldMatch BooleanQuery major performance improvement

• SortingAtomicReader and SortingMergePolicy

• DocIdSetIterator and Scorer now have a cost API

• Analyzing/FuzzySuggester now enable recording an arbitrary byte[] as a payload

• Spatial module: support for query relations Within, Contains, and Disjoint

• Facet module: new method computes facet counts using SortedSetDocValuesField, without a separate taxonomy index
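
For readers unfamiliar with minShouldMatch, the semantics can be sketched as: a document matches when it satisfies at least N of the optional clauses. The Python below shows only that rule over plain lists of doc IDs; it says nothing about how the optimized 4.3 scorer works, and the function name is hypothetical.

```python
from collections import Counter


def min_should_match(optional_postings, minimum):
    """Return sorted doc IDs that appear in at least `minimum` optional clauses."""
    counts = Counter()
    for postings in optional_postings:
        # set() so a clause counts at most once per document.
        counts.update(set(postings))
    return sorted(doc for doc, n in counts.items() if n >= minimum)


# Three hypothetical optional clauses, each given as its matching doc IDs.
clauses = [[1, 2, 3], [2, 3, 4], [3, 5]]
```

Raising the minimum from 2 to 3 narrows the result from documents 2 and 3 to document 3 alone.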

Page 6: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

On the horizon

• More efficient positional queries

• Incremental field updates

• Korean Analyzer

Page 7: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Solr Dev/User Survey Results

Page 8: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Solr Developer/User survey, April 2013

• Survey invitation emailed to 4,136 people:

– LucidWorks training class attendees

– Revolution attendees

– LucidWorks webinar registrants

• 177 have responded so far

Page 9: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Please rank the following features by priority

Answered: 165 Skipped: 12

Page 15: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

More questions

1. How many attendees are Eclipse developers?

2. How many attendees are running Solr Cloud in production?

Page 16: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Solr: Past, Present & Future

Yonik Seeley LucidWorks

Page 17: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Origins of Solr

• CNET driven to find alternatives to discontinued commercial enterprise search product

• Plan A: ATOMICS (Apache TO MySQL In CNET Search)

– Standalone server speaking XML over HTTP

– Meet majority of “search” needs

– http://conferences.oreillynet.com/cs/mysqluc2005/view/e_sess/7066

• Plan B: “Something based on Lucene”

– Started Summer 2004

– First prototype called “Fusion”, later renamed SOLAR (Search On Lucene And Resin)

Page 18: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Origins of the first Solr admin UI

Page 19: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

New admin UI

Page 20: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Timeline (up to 1.4)

[Figure: timeline of Solr milestones, releases, and features]

• Milestones: initial prototype; CNET production; CNET contributes Solr to ASF; Solr graduates from Incubator

• Releases marked: 1.0, 1.1, 1.2, 1.3, 1.4, 3.1, 4.0

• Features shown: simple faceting; replication; highlighting, dismax; spellchecking, CSV, Luke; MLT, Update Request Processors; QParsers, Search Components; multi-core; distributed search; Data Import Handler; JMX; Statistics Component; Java replication; Terms and TermVector Components; multi-select faceting; dynamic clustering

Page 21: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Solr 4

• Solr Cloud

– Distributed Indexing

– No single points of failure

– Near Real Time friendly (push replication)

• NoSQL feature set

– Update Durability

– Real-time get

– Atomic Updates

– Optimistic Concurrency

• Pseudo-join, Pivot Faceting, Pseudo-fields, etc
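
Atomic updates and optimistic concurrency fit together roughly like this: a client sends only the changed fields, optionally with the version it last saw, and the update is rejected if the document has changed since. The sketch below is an in-memory Python model, not Solr code. The "set" and "inc" modifiers and the _version_ field follow Solr's naming, while the TinyStore class, its counter-based versions, and the exception type are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class TinyStore:
    def __init__(self):
        self._docs = {}
        self._clock = 0  # stand-in for a real version generator

    def get(self, doc_id):
        # Real-time get: return the latest state of the document.
        return dict(self._docs[doc_id])

    def update(self, doc_id, changes, expected_version=None):
        current = self._docs.get(doc_id, {"id": doc_id, "_version_": 0})
        # Optimistic concurrency: refuse the write if someone else got there first.
        if expected_version is not None and expected_version != current["_version_"]:
            raise VersionConflict(
                f"expected {expected_version}, found {current['_version_']}"
            )
        merged = dict(current)
        # Atomic update: apply per-field modifiers instead of replacing the document.
        for field, op in changes.items():
            if "set" in op:
                merged[field] = op["set"]
            elif "inc" in op:
                merged[field] = merged.get(field, 0) + op["inc"]
            else:
                raise ValueError(f"unsupported modifier for {field!r}: {op}")
        self._clock += 1
        merged["_version_"] = self._clock
        self._docs[doc_id] = merged
        return merged["_version_"]
```

A typical flow is to read a document, remember its _version_, and pass that value back as expected_version; a VersionConflict tells the client to re-read and retry.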

Page 22: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

What search solution/version are you currently using?

Page 23: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Recent Enhancements

Page 24: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Document Routing

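[Diagram: a collection created with numShards=4 and router=compositeId. shard1 through shard4 each own one quarter of the 32-bit hash space: 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff.]

• Indexing: id = BigCo!doc5 is hashed (MurmurHash3) to 1f27 3c71, which the slide resolves to shard1

• Querying: q=my_query with shard.keys=BigCo! covers hashes 1f27 0000 to 1f27 ffff, so only shard1 needs to be searched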
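
The routing shown on this slide can be sketched in a few lines: the top 16 bits of the hash come from the part of the ID before "!", the bottom 16 bits from the rest, and a shard.keys prefix therefore maps to one contiguous 64K-wide hash range. The Python below is an illustration under assumptions, not Solr's code: it hashes UTF-8 bytes with seed 0 and treats hashes as unsigned, so it is not guaranteed to reproduce the 1f27 3c71 value from the slide. All function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit over raw bytes; returns an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593

    def rotl(x, r):
        return ((x << r) | (x >> (32 - r))) & MASK

    def mix(k):
        return (rotl((k * c1) & MASK, 15) * c2) & MASK

    h = seed & MASK
    rounded = len(data) & ~3
    for i in range(0, rounded, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = (rotl(h, 13) * 5 + 0xE6546B64) & MASK
    tail = data[rounded:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK
    h ^= h >> 16
    return h


def composite_hash(doc_id: str) -> int:
    """Top 16 bits from the prefix before '!', bottom 16 bits from the rest."""
    prefix, sep, rest = doc_id.partition("!")
    if not sep:
        return murmur3_32(doc_id.encode("utf-8"))
    upper = murmur3_32(prefix.encode("utf-8")) & 0xFFFF0000
    lower = murmur3_32(rest.encode("utf-8")) & 0x0000FFFF
    return upper | lower


def prefix_range(shard_key: str):
    """Hash range covered by every ID that starts with e.g. 'BigCo!'."""
    upper = murmur3_32(shard_key.rstrip("!").encode("utf-8")) & 0xFFFF0000
    return upper, upper | 0xFFFF


# The four ranges listed on the slide (numShards=4).
RANGES = [
    (0x00000000, 0x3FFFFFFF),
    (0x40000000, 0x7FFFFFFF),
    (0x80000000, 0xBFFFFFFF),
    (0xC0000000, 0xFFFFFFFF),
]


def range_for(h: int, ranges=RANGES):
    """Find the shard range that owns a hash value."""
    for lo, hi in ranges:
        if lo <= h <= hi:
            return lo, hi
    raise ValueError(f"no range covers {h:08x}")
```

Because every BigCo! document shares the same top 16 bits, range_for(composite_hash(...)) returns the same range for all of them, which is what lets a shard.keys query skip the other shards.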

Page 25: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Seamless Online Shard Splitting

[Diagram: Shard1, Shard2, and Shard3, each with a leader and a replica. An incoming update reaches the Shard2 leader while Shard2 is being split into sub-shards Shard2_0 and Shard2_1.]

1. New sub-shards created in “construction” state

2. Leader starts forwarding applicable updates, which are buffered by the sub-shards

3. Leader index is split and installed on the sub-shards

4. Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
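
The four steps above can be walked through with a small simulation. This is a toy Python model and not Solr's implementation: documents are keyed directly by hash value, the parent range is split into two equal halves, and the class and function names are hypothetical.

```python
def split_range(lo, hi):
    """Split an inclusive hash range into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


class SubShard:
    def __init__(self, name, hash_range):
        self.name = name
        self.range = hash_range
        self.state = "construction"
        self.buffer = []  # updates forwarded while under construction
        self.index = {}   # doc hash -> document

    def covers(self, h):
        return self.range[0] <= h <= self.range[1]


def split_shard(parent_index, parent_range, updates_during_split):
    """Return two active sub-shards built from the parent's index and live updates."""
    # 1. Create the sub-shards in the "construction" state.
    subs = [
        SubShard(f"sub_{i}", r)
        for i, r in enumerate(split_range(*parent_range))
    ]
    # 2. The leader forwards each applicable update; sub-shards only buffer it.
    for h, doc in updates_during_split:
        for sub in subs:
            if sub.covers(h):
                sub.buffer.append((h, doc))
    # 3. Split the leader's index by hash range and install the pieces.
    for h, doc in parent_index.items():
        for sub in subs:
            if sub.covers(h):
                sub.index[h] = doc
    # 4. Replay buffered updates on top of the installed index, then go active.
    for sub in subs:
        for h, doc in sub.buffer:
            sub.index[h] = doc
        sub.buffer.clear()
        sub.state = "active"
    return subs
```

Buffering in step 2 is what makes the split "online": updates that arrive before the index pieces are installed are not lost, because they are replayed in step 4.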

Page 27: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Schema REST API

• Restlet is now integrated with Solr

• Get a specific field

curl http://localhost:8983/solr/schema/fields/price

{"field":{
"name":"price",
"type":"float",
"indexed":true,
"stored":true }}

• Get all fields

curl http://localhost:8983/solr/schema/fields

• Get Entire Schema!

curl http://localhost:8983/solr/schema
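
The same three read endpoints can be addressed from a script. The Python sketch below only builds the URLs and parses a response body shaped like the one on the slide; it does not contact a server. The helper names are hypothetical, and the base URL assumes the host, port, and path shown in the curl commands.

```python
import json
from urllib.parse import quote

BASE = "http://localhost:8983/solr"  # host and port as shown on the slide


def schema_url(field=None, all_fields=False, base=BASE):
    """URL for one field, all fields, or the entire schema."""
    if field is not None:
        return f"{base}/schema/fields/{quote(field, safe='')}"
    if all_fields:
        return f"{base}/schema/fields"
    return f"{base}/schema"


def parse_field(body):
    """Extract the field definition from a single-field response body."""
    return json.loads(body)["field"]


# Response body copied from the slide's example.
SAMPLE = '{"field":{"name":"price","type":"float","indexed":true,"stored":true}}'
```

Against a running Solr instance, the string returned by schema_url("price") could be passed to urllib.request.urlopen and the response text handed to parse_field.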

Page 28: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Dynamic Schema

• Add a new field (Solr 4.4)

curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '{"type":"float", "indexed":"true"}'

• Works in distributed (cloud) mode too!

• Future: More schemaless

– Reality: there is no such thing for Lucene-based systems

– Type guessing for fields we haven’t seen before
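
Type guessing combined with the add-field call might look something like the sketch below. It is an illustration only: the guessing heuristic is invented here, the type names "boolean", "long", "float", and "string" are assumed to exist in the target schema, the Content-Type header is an assumption, and the request is built but never sent. Note that it sends "indexed" as a JSON boolean, whereas the slide's curl example sends the string "true".

```python
import json
from urllib.request import Request

BASE = "http://localhost:8983/solr"  # host and port as shown on the slide


def guess_type(value):
    """Hypothetical heuristic: pick a field type name for a previously unseen value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    text = str(value).strip()
    if text.lower() in ("true", "false"):
        return "boolean"
    for caster, name in ((int, "long"), (float, "float")):
        try:
            caster(text)
            return name
        except ValueError:
            pass
    return "string"


def add_field_request(name, field_type, indexed=True, base=BASE):
    """Build (but do not send) the PUT request that adds a field to the schema."""
    body = json.dumps({"type": field_type, "indexed": indexed}).encode("utf-8")
    return Request(
        f"{base}/schema/fields/{name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumed header
    )
```

For example, add_field_request("strength", guess_type("3.5")) produces a PUT to the same URL as the curl command above, with "float" as the guessed type.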

Page 29: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Future

• Greater scalability

• More “NoSQL”

– More ways to update & manipulate documents

• Analytics

– More powerful faceting, functions, statistics

• Improved Relational queries

• More dynamic (settings & configuration)

• Continued focus on ease of use

Page 30: Keynote   Yonik Seeley & Steve Rowe lucene solr roadmap

Thank You!