Boston elasticsearch meetup October 2012

Elasticsearch in production

Igor

twitter: @imotov

github: imotov


Sonian Inc.

• Cloud-based email archiving
• Founded in 2007
• Headquarters: Newton, MA

Small team of about 15 developers distributed from Campinas, Brazil to Vancouver, Canada

Using elasticsearch since June 2010, v0.8.0

We have about 6 billion records indexed in elasticsearch

100,000 Netflix DVD titles
3,000,000 pages in en.wikipedia.org
22,000,000 books in Library of Congress catalog
150,000,000 LinkedIn profiles
3,000,000,000 estimated bing.com index size
6,000,000,000 Sonian Inc. index size
50,000,000,000 estimated google.com index size

Infrastructure

http://www.sonian.com/awssonian-technical-diagram/

Ingestion (safe): Clojure
Search Engine: elasticsearch
Web App: Ruby on Rails
Deployment: Chef
Monitoring: Sensu

10 clusters

6 AWS Regions

2-17 nodes in each cluster

Custom version of elasticsearch based on 0.19.9 with several plugins

jetty plugin

• jetty-based http transport
• SSL support
• Authentication
• Request logging (json, plain)

Request logs are also indexed in elasticsearch

Open source: https://github.com/sonian/elasticsearch-jetty

Zookeeper plugin

Zookeeper-based discovery

Replacement for zen discovery

Experimental!

Open source: https://github.com/sonian/elasticsearch-zookeeper

Valve plugin

• Custom jetty plugin filter
• Rejects bulk indexing requests if cluster is overloaded
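The valve idea can be sketched as a small admission check placed in front of the bulk endpoint. This is a minimal Python illustration, not the plugin's actual code: the class name `BulkValve`, the `max_in_flight` limit, the `handle_bulk` helper, and the choice of "too many bulk requests in flight" as the overload signal are all assumptions made for the example.

```python
import threading


class BulkValve:
    """Hypothetical admission filter: rejects bulk requests past a limit."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True if the request may proceed, False if it is rejected."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_bulk(valve, do_index):
    """Run do_index() if admitted; otherwise answer 503 so the client retries."""
    if not valve.try_acquire():
        return 503  # assumed status code for "overloaded, try again later"
    try:
        do_index()
        return 200
    finally:
        valve.release()
```

Rejecting early keeps the rejected request cheap: nothing is parsed or queued, so an overloaded cluster sheds load instead of accumulating it.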

Lessons learned in the last two years

or

Proper Care and Feeding of Elasticsearch Nodes

Rule 1: Give nodes plenty of space

Running out of disk space or memory is the simplest way to corrupt your index.

Make sure elasticsearch doesn’t swap

Swapping reduces performance and causes nodes to leave the cluster.

elasticsearch.yml

bootstrap.mlockall: true

Increase the number of open file descriptors to 64k.
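A quick way to confirm the descriptor limit is to read it from a process started under the same user and shell settings as elasticsearch. The sketch below assumes a Unix-like system (the `resource` module is not available on Windows) and takes "64k" to mean 65536. The helper names are made up for illustration, and the check reports the limit of the Python process itself, not of a running elasticsearch node.

```python
import resource

REQUIRED_NOFILE = 65536  # "64k" from the slide, read as 64 * 1024


def nofile_ok(soft_limit, required=REQUIRED_NOFILE):
    """True if the soft limit is unlimited or at least `required`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


def check_current_process():
    """Return (soft, hard, ok) for this process's open-file limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard, nofile_ok(soft)


if __name__ == "__main__":
    soft, hard, ok = check_current_process()
    print("soft=%s hard=%s sufficient=%s" % (soft, hard, ok))
```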

Rule 2: Distributed but well connected

All nodes should be able to talk to each other all the time

Otherwise your cluster might get split-brain syndrome

Consider setting

discovery.zen.minimum_master_nodes
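The slide does not say which value to use. A commonly recommended choice is a strict majority of the master-eligible nodes, so that two disconnected halves of a cluster can never both elect a master. The helper below is a small sketch of that arithmetic; the function name is made up, and it assumes you know how many nodes in the cluster are master-eligible.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Smallest strict majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


if __name__ == "__main__":
    # Cluster sizes from 2 to 17 nodes, assuming every node is master-eligible.
    for nodes in range(2, 18):
        print(nodes, "nodes ->", minimum_master_nodes(nodes))
```

For a 17-node cluster where every node is master-eligible this gives 9. With only two master-eligible nodes it gives 2, which means losing either node leaves the cluster without a master.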

Rule 3: Throttle the bulk indexing load

The asynchronous architecture makes elasticsearch scalable and fast, but susceptible to running out of memory under excessive bulk indexing load.
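On the client side, throttling can be as simple as backing off whenever the cluster refuses a batch. The sketch below is one possible shape for that, not Sonian's ingestion code: `send_bulk` is a hypothetical callable that returns True on success and False when the batch was rejected as overloaded, and the retry count and delays are arbitrary example values.

```python
import time


def send_with_backoff(send_bulk, batch, max_retries=5, base_delay=0.5,
                      sleep=time.sleep):
    """Send one bulk batch, waiting longer after each rejection.

    Returns the number of retries that were needed. Raises RuntimeError
    if the batch is still rejected after max_retries retries.
    """
    for attempt in range(max_retries + 1):
        if send_bulk(batch):
            return attempt
        if attempt < max_retries:
            # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("bulk batch still rejected after %d retries" % max_retries)
```

Sending batches one at a time through a function like this bounds the amount of indexing work in flight per client, which is the property that matters for keeping node memory under control.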

Rule 4: Try to make all shards approximately the same size

Elasticsearch allocates shards to nodes based on shard count alone. It doesn’t consider shard sizes or available disk space.
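A toy model shows why this matters. The code below is not elasticsearch's allocator. It is a deliberately simplified stand-in that places each shard on the node currently holding the fewest shards, which is enough to show how equal shard counts can hide very unequal disk usage. The function names and shard sizes are invented for the example.

```python
def allocate_by_count(shard_sizes_gb, node_count):
    """Place each shard on the node currently holding the fewest shards."""
    nodes = [[] for _ in range(node_count)]
    for size in shard_sizes_gb:
        target = min(nodes, key=len)  # ties go to the lowest-numbered node
        target.append(size)
    return nodes


def disk_usage(nodes):
    """Total size in GB of the shards on each node."""
    return [sum(shards) for shards in nodes]


if __name__ == "__main__":
    uneven = allocate_by_count([100, 1, 100, 1, 100, 1], node_count=2)
    print([len(n) for n in uneven], disk_usage(uneven))  # [3, 3] [300, 3]

    even = allocate_by_count([50, 51, 50, 51, 50, 51], node_count=2)
    print([len(n) for n in even], disk_usage(even))      # [3, 3] [150, 153]
```

Both layouts look balanced by shard count, but only the one with similar shard sizes is balanced on disk.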

4 rules for happy elasticsearch

1. Give nodes plenty of space

2. Distributed but well connected

3. Throttle the load

4. Make all shards the same size

Questions?

More Information

Latest stable release: 0.19.10

Web Site: http://www.elasticsearch.org/

Follow @elasticsearch on twitter

IRC: #elasticsearch on irc.freenode.net

GitHub: https://github.com/elasticsearch/elasticsearch

Mailing list: elasticsearch on http://groups.google.com/

Stackoverflow tag: elasticsearch

