Page 1: Elasticsearch quick Intro (English)

Elasticsearch

[email protected] - CTO

Federico Panini

CTO @ fazland.com
email : [email protected]

LinkedIn : https://uk.linkedin.com/in/federicopanini
slides : http://www.slideshare.net/FedericoPanini

Page 2: Elasticsearch quick Intro (English)

What is Elasticsearch?

full-text search engine

“A search engine is an automated system that, upon request, searches a set of data and returns an index of the matching content, ranked by relevance to the search key using mathematical and statistical algorithms.”

Page 3: Elasticsearch quick Intro (English)

What’s Elasticsearch ?

full-text search engine

Page 4: Elasticsearch quick Intro (English)

What’s Elasticsearch ?

full-text search engine

“It’s a distributed, scalable, and highly available real-time search and analytics software.”

Page 5: Elasticsearch quick Intro (English)

What’s Elasticsearch ?

features

- Real-time data
- Real-time data analysis
- Distributed system
- High availability
- Full-text searches
- Document-oriented DB
- Schemaless DB
- RESTful API
- Per-operation persistence
- Open source
- Based on Apache Lucene
- Optimistic version control

Page 6: Elasticsearch quick Intro (English)

Apache Lucene #1

It’s the heart of Elasticsearch

Lucene is the search engine of Elasticsearch

Page 7: Elasticsearch quick Intro (English)

Apache Lucene #1

It’s written in Java

It’s an Apache Software Foundation project, so it’s open source!

Page 8: Elasticsearch quick Intro (English)

What Elasticsearch adds on top of Lucene

full-text searches

- horizontal scaling
- high availability
- easy to use
- near real time

Page 9: Elasticsearch quick Intro (English)

Architecture

requirements - CPU

Elasticsearch doesn’t need a lot of CPU.

The advice is to use the latest CPU model available.

In general, it is good practice to use machines with 2 to 8 cores.

Page 10: Elasticsearch quick Intro (English)

Architecture

requirements - Disk

Disk I/O performance is really important for every cluster.

Use SSDs.

Page 11: Elasticsearch quick Intro (English)

Architecture

requirements - HD - bonus slide …

It is very important to pay attention to where data is stored and, above all, how. The word to remember is scheduler. On *nix systems the I/O scheduler decides when data is actually written to disk and with which priority. Most Unix-like setups use cfq as the default scheduler, which is optimised for rotating disks. The advice is to use SSDs and to set up the OS to use “noop” or “deadline”, which are schedulers better suited to SSDs.

With the right scheduler, write throughput can improve by up to 500x!
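To check which scheduler a machine is using, Linux exposes it per block device under /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. The sketch below is a minimal example in Python, assuming a Linux host; the function names are illustrative and not part of any Elasticsearch tooling.

```python
import glob
import re


def active_scheduler(sysfs_line):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_line)
    return match.group(1) if match else None


def report_schedulers(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device to its active I/O scheduler (empty dict off Linux)."""
    result = {}
    for path in glob.glob(pattern):
        device = path.split("/")[3]  # /sys/block/<device>/queue/scheduler
        with open(path) as handle:
            result[device] = active_scheduler(handle.read().strip())
    return result


# A line as the kernel would print it when cfq is the active scheduler.
print(active_scheduler("noop deadline [cfq]"))
```

Calling report_schedulers() on the Elasticsearch host lists every disk with its current scheduler, which makes it easy to spot a data disk still running cfq.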

Page 12: Elasticsearch quick Intro (English)

Architecture

Operating Systems

Elasticsearch is written in Java, so it’s a multiplatform solution. Use the latest JDK available.

Page 13: Elasticsearch quick Intro (English)

Architecture

requirements - RAM

Elasticsearch is hungry for RAM!

https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html

Page 14: Elasticsearch quick Intro (English)

Architecture

memory !?!?

- A machine with 64GB of RAM is fine; more is not needed.
- Give the Java heap no more than 32GB of RAM.
- Use more than one machine for Elasticsearch in order to set up the cluster correctly.
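The sizing advice can be written as a small calculation. This is only a sketch: the 32GB ceiling comes from the slide, while the "half of the RAM" rule is an assumption taken from the heap-sizing guide linked on the RAM slide (the rest of the memory is left to the operating system's file cache, which Lucene relies on). The function name is hypothetical.

```python
def recommended_heap_gb(total_ram_gb, heap_cap_gb=32):
    """Half of the machine's RAM, but never more than heap_cap_gb."""
    return min(total_ram_gb / 2, heap_cap_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", recommended_heap_gb(ram), "GB heap")
```

On a 128GB machine the heap still stops at the cap, which is why adding more nodes is preferred over adding more memory to a single node.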

Page 15: Elasticsearch quick Intro (English)

Architecture

Installation

curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip
unzip elasticsearch-$VERSION.zip
cd elasticsearch-$VERSION

Packages are available for many distributions, such as Debian or RPM, along with Puppet and Chef modules.

Page 16: Elasticsearch quick Intro (English)

Java based

elasticsearch

Elasticsearch has been developed in Java

- Robust
- Scalable
- Multiplatform

Page 17: Elasticsearch quick Intro (English)

Talking to Elasticsearch

clients Java #1

There are 2 clients available in Java:

Node client : the client joins the cluster as a non-data node, which means it knows exactly which data lives on which node of the cluster.

Page 18: Elasticsearch quick Intro (English)

Talking to Elasticsearch

clients Java #2

There are 2 clients available in Java:

Transport client : a lightweight client used to communicate with a remote cluster. It does not join the cluster itself.

Page 19: Elasticsearch quick Intro (English)

Talking to Elasticsearch

clients Java #3

Both Java clients talk to the cluster over port 9300, the same port the nodes of the cluster use to communicate with each other.

Page 20: Elasticsearch quick Intro (English)

Talking to Elasticsearch

client API RESTful

All programming languages other than Java can talk to the Elasticsearch cluster through its RESTful API, available on port 9200.

There are official clients for many programming languages: Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby.
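Because the API is plain HTTP plus JSON on port 9200, any HTTP library is enough to talk to it. The sketch below uses only the Python standard library rather than one of the official clients. It builds a request without sending it, so it runs without a cluster. The host, the `blogs` index, the `post` type and the document body are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_request(method, path, body=None, host="http://localhost:9200"):
    """Build (but do not send) an HTTP request for the REST API on port 9200."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(host + path, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def send(request):
    """Send the request and decode the JSON response (needs a reachable node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("PUT", "/blogs/post/1", {"title": "hello"})
print(request.get_method(), request.full_url)
```

Passing the request to send() performs the call against a running node.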

Page 21: Elasticsearch quick Intro (English)

Elastic

Document oriented

NoSQL

Elasticsearch is a document-oriented database, and it is also schema-less: documents can be indexed without defining their structure up front.

Documents are indexed as soon as they are inserted into Elasticsearch and become searchable in near real time.

Page 22: Elasticsearch quick Intro (English)

Document oriented

JSON

Elasticsearch uses JSON as the interchange format between the server and the API clients.

Page 23: Elasticsearch quick Intro (English)

glossary

- cluster
- nodes
- indexes
- shards
- replica
- segments
- in-memory buffers
- translog

Page 24: Elasticsearch quick Intro (English)

cluster

A cluster is a set of one or more nodes that share the same cluster.name property. The cluster spreads data and load across its nodes. When a node is added or removed, the cluster reorganises itself to redistribute the data.

Page 25: Elasticsearch quick Intro (English)

cluster

Inside a cluster one node is elected as master. The master node is responsible for managing cluster-wide operations such as creating or removing indexes and adding or removing nodes. By default, any node can be elected as master.

Page 26: Elasticsearch quick Intro (English)

nodes

A node is a single running instance of Elasticsearch, the smallest unit that makes up a working cluster.

Page 27: Elasticsearch quick Intro (English)

Index

RDBMS       Elasticsearch
DATABASE    INDEX

Page 28: Elasticsearch quick Intro (English)

Type

RDBMS       Elasticsearch
TABLE       TYPE

Page 29: Elasticsearch quick Intro (English)

Document

RDBMS       Elasticsearch
ROW         DOCUMENT

Page 30: Elasticsearch quick Intro (English)

Fields

RDBMS       Elasticsearch
COLUMNS     FIELDS

Page 31: Elasticsearch quick Intro (English)

shards

To start indexing data in Elasticsearch we first need to create an index. An index is only a logical namespace that points to one or more physical elements called SHARDS.

Page 32: Elasticsearch quick Intro (English)

shards

A shard is the low-level unit of Elasticsearch and contains a subset of all the data in an index.

Each shard is in fact a single instance of Apache Lucene.

Page 33: Elasticsearch quick Intro (English)

Replica shards

Replica shards are copies of primary shards, used to protect our data from hardware failures. Like primary shards, they can also serve read requests such as searches and document retrieval.

Page 34: Elasticsearch quick Intro (English)

shards immutability

The number of primary shards for an index is defined at index creation time and is IMMUTABLE.
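Why the number is fixed becomes clearer from how a document finds its shard. Routing is described in the Definitive Guide as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The sketch below illustrates that formula only: `zlib.crc32` is a stand-in, not the hash Elasticsearch actually uses, and the document ids are made up.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [str(i) for i in range(1, 11)]
with_3 = {doc_id: shard_for(doc_id, 3) for doc_id in doc_ids}
with_4 = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}

# Documents whose computed shard changes when the shard count goes from 3 to 4.
moved = [doc_id for doc_id in doc_ids if with_3[doc_id] != with_4[doc_id]]
print("looked up on a different shard after the change:", moved)
```

If the shard count could change, documents already stored would typically be looked up on the wrong shard, so the only safe way to change it is to reindex into a new index.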

Page 35: Elasticsearch quick Intro (English)

shards immutability

curl -XPUT http://localhost:9200/blogs -d '{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'

Page 36: Elasticsearch quick Intro (English)

shards immutability

curl http://localhost:9200/_cluster/health

{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 3,
  "active_shards": 3,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3
}

Page 37: Elasticsearch quick Intro (English)

shards immutability

Replica shards are useless on a single-node instance: a replica is never allocated on the same node as its primary, so the replicas stay unassigned and the cluster status is yellow. To make replica shards useful we need at least 2 nodes, which gives us data redundancy.
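The health response for the single-node `blogs` example can be read mechanically. The sketch below is a hypothetical helper, not part of any client library. It relies only on the usual meaning of the three statuses: green means every primary and replica is allocated, yellow means every primary is allocated but some replicas are not, and red means at least one primary is missing.

```python
import json

health = json.loads("""{
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 3,
  "active_shards": 3,
  "unassigned_shards": 3
}""")


def explain(health):
    """Summarise why a cluster is green, yellow or red from its health fields."""
    replicas_active = health["active_shards"] - health["active_primary_shards"]
    if health["status"] == "green":
        return "all primary and replica shards are allocated"
    if health["status"] == "yellow":
        return ("all primaries are allocated, but %d replica shard(s) are unassigned "
                "(%d active replicas on %d node(s))"
                % (health["unassigned_shards"], replicas_active,
                   health["number_of_nodes"]))
    return "at least one primary shard is not allocated"


print(explain(health))
```

With 3 primaries, 1 replica each and a single node, the 3 replicas have nowhere to go, which matches the 3 unassigned shards in the response.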

Page 38: Elasticsearch quick Intro (English)

BONUS : manage data conflicts #1

Page 39: Elasticsearch quick Intro (English)

BONUS : manage conflicts #2 : Pessimistic Concurrency Control

Used in standard RDBMS

This approach assumes that conflicts are likely to happen frequently, so the RDBMS locks the resource to avoid them.

A process locks the row before reading it; this way the RDBMS is sure that only the process holding the lock can access the row and subsequently modify it. At the end of its operation (update/delete) the process releases the LOCK.

Page 40: Elasticsearch quick Intro (English)

BONUS : manage conflicts #3 : Optimistic Concurrency Control

Elasticsearch uses OCC

This approach assumes that conflicts are infrequent. The database doesn’t lock the resource when it is accessed.

The responsibility is given to the application: if the data is modified between a read and a write, the update fails. In that case you need to fetch the fresh data again and retry the update.

Page 41: Elasticsearch quick Intro (English)

BONUS : manage conflicts #4 : Optimistic Concurrency Control

Elasticsearch is a distributed, concurrent and asynchronous system. When a document is created / updated / deleted, the change must be replicated to the other nodes of the cluster.

Replication requests are sent to the nodes in parallel, so they may arrive out of order: a change can reach its destination node when it is already stale.

Page 42: Elasticsearch quick Intro (English)

BONUS : manage conflicts #5 : Optimistic Concurrency Control

We need a way to detect that the document we’re trying to update has already been updated by another process.

Page 43: Elasticsearch quick Intro (English)

BONUS : manage conflicts #6 : Optimistic Concurrency Control

VERSIONING

Page 44: Elasticsearch quick Intro (English)

BONUS : manage conflicts #7 : Optimistic Concurrency Control

In Elasticsearch every document has a field named:

_version

This system field is incremented every time an operation (update / delete) is applied to a document. In this way an update based on _version:3 will never be applied to a document whose _version is already 4.

Page 45: Elasticsearch quick Intro (English)

BONUS : manage conflicts #8 : Optimistic Concurrency Control

This approach moves all the responsibility from the database to the application! So WE are responsible for not creating conflicts on a document or an index. If we want to be sure not to lose data, we need to implement writes that use versioning!
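The read, check version, write, retry loop that the application has to implement can be sketched without a cluster. The class below is an in-memory stand-in written for illustration and is NOT the Elasticsearch API; `VersionedStore`, `VersionConflict` and `update_with_retry` are hypothetical names. Against a real cluster the expected version would be sent along with the write request instead.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    """In-memory stand-in for an index; NOT the Elasticsearch API."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected version %d, found %d" % (expected_version, current_version))
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


def update_with_retry(store, doc_id, change, max_attempts=3):
    """Read, modify, write; on conflict re-read the fresh document and try again."""
    for _ in range(max_attempts):
        version, source = store.get(doc_id)
        try:
            return store.put(doc_id, change(dict(source)), expected_version=version)
        except VersionConflict:
            continue
    raise VersionConflict("gave up after %d attempts" % max_attempts)


store = VersionedStore()
store.put("1", {"stock": 10})                      # creates version 1
version, source = store.get("1")                   # process A reads version 1
store.put("1", {"stock": 9}, expected_version=1)   # process B writes: version 2
try:
    store.put("1", {"stock": 9}, expected_version=version)  # A's stale write
except VersionConflict as error:
    print("conflict:", error)
print(update_with_retry(store, "1", lambda doc: {**doc, "stock": doc["stock"] - 1}))
```

Process A's stale write is rejected instead of silently overwriting B's change. The retry helper then re-reads the document and applies the decrement on top of the fresh data.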

Page 46: Elasticsearch quick Intro (English)

BONUS : manage conflicts #9 : Optimistic Concurrency Control

- http://www.jillesvangurp.com/2014/12/03/optimistic-locking-for-updates-in-elasticsearch/
- https://aphyr.com/posts/317-call-me-maybe-elasticsearch
- https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html

Page 47: Elasticsearch quick Intro (English)

Simple searches #1

- Create Index
- API Rest
- GET
- DELETE
- POST
- SEARCH

Page 48: Elasticsearch quick Intro (English)

Simple searches - CREATE AN INDEX

curl -XPUT http://fazlab.fazland.com:9200/fazlab -d '{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'

Page 49: Elasticsearch quick Intro (English)

Simple searches - INDEX A DOCUMENT

curl -XPUT 'http://fazlab.fazland.com:9200/fazlab/categories/1?pretty' -d '{
  "nome": "Federico"
}'

Page 50: Elasticsearch quick Intro (English)

Simple searches - GET A DOCUMENT

curl 'http://fazlab.fazland.com:9200/fazlab/categories/1?pretty'

Page 51: Elasticsearch quick Intro (English)

Simple searches - DELETE A DOCUMENT

curl -XDELETE 'http://fazlab.fazland.com:9200/fazlab/categories/2?pretty'

Page 52: Elasticsearch quick Intro (English)

Simple searches #1

DEMO SEARCHES!

Page 53: Elasticsearch quick Intro (English)

mapping and analysis

EXACT MATCH vs FULL TEXT

Page 54: Elasticsearch quick Intro (English)

mapping and analysis

EXACT MATCH vs FULL TEXT

Exact match

where name = ‘Federico’
and user_id = 2
and date > “2014-09-15”

Full Text

“Frank has been to South beach”

Frank / FRANK / frank

Page 55: Elasticsearch quick Intro (English)

mapping and analysis

EXACT MATCH vs FULL TEXT

Exact match
binary : does the document contain these values ?

Full Text
How relevant is the document to the terms used in the query ?

Page 56: Elasticsearch quick Intro (English)

mapping and analysis

To support full-text search, Elasticsearch analyses the text and uses the result to build an inverted index.

- Inverted Index
- Analyzer

Page 57: Elasticsearch quick Intro (English)

Inverted Index

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer

Page 58: Elasticsearch quick Intro (English)

Inverted Index

If we search for the words “quick” and “brown”, we pick only the documents that contain both words.

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer

Page 59: Elasticsearch quick Intro (English)

Inverted Index

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer
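An inverted index for the two sentences can be sketched in a few lines: each term maps to the set of documents that contain it, and a search intersects those sets. This is a simplified illustration, not Lucene's data structure. Lowercasing and splitting on whitespace are assumptions made here to keep the example short.

```python
from collections import defaultdict

documents = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, terms):
    """Return the ids of the documents that contain every one of the terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(documents)
print(sorted(search_all(index, ["quick", "brown"])))  # [1, 2]
print(sorted(search_all(index, ["fox"])))             # [1]
```

The second search shows the limit of this naive index: "fox" does not find document 2, which contains "foxes". Closing that gap is the job of the analysis step.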

Page 60: Elasticsearch quick Intro (English)

ANALYZERS

An analyzer is made up of 3 functions, applied in this order:

1. Character filters

2. Tokenizer

3. Token filters

Page 61: Elasticsearch quick Intro (English)

ANALYZERS - Character Filters

The first step of an analyzer is to pass every string through the character filters, which clean up / reorganise the string before tokenisation.

During this phase, for example, HTML markup can be stripped or & converted to “and”.

Page 62: Elasticsearch quick Intro (English)

ANALYZERS - Tokenizer

The second step of an analyzer is tokenisation, which splits the text into individual terms.

Page 63: Elasticsearch quick Intro (English)

ANALYZERS - Token Filters

After the strings have been tokenised into single terms, the selected filters are applied in sequence. For example :

- lowercase the whole text
- remove stop words
- add synonyms
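The three stages chain together as character filters, then tokenizer, then token filters. The sketch below is a toy analyzer in plain Python, not Elasticsearch's implementation. The stop-word list and the synonym table are tiny, made-up examples.

```python
import re

STOP_WORDS = {"a", "and", "the", "to", "by"}   # tiny illustrative list
SYNONYMS = {"quick": ["fast"]}                  # hypothetical synonym table


def char_filter(text):
    """Strip HTML tags and spell out '&' before tokenisation."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the text into terms on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filters(tokens):
    """Lowercase, drop stop words, then add synonyms, in that order."""
    result = []
    for token in (t.lower() for t in tokens):
        if token in STOP_WORDS:
            continue
        result.append(token)
        result.extend(SYNONYMS.get(token, []))
    return result


def analyze(text):
    """Run the three stages in order."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<p>The QUICK fox & the lazy dog</p>"))
# ['quick', 'fast', 'fox', 'lazy', 'dog']
```

The order matters: the "&" is spelled out as "and" by the character filter and then removed again as a stop word by the token filters.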

Page 64: Elasticsearch quick Intro (English)

Standard Analyzer

“Set the shape to semi-transparent by calling set_trans(5)”

The standard analyzer is the default analyzer of Elasticsearch. It splits the text into single words, removes most punctuation and lowercases the terms.

“set, the, shape, to, semi, transparent, by, calling, set_trans, 5”

Page 65: Elasticsearch quick Intro (English)

Simple Analyzer

“Set the shape to semi-transparent by calling set_trans(5)”

The simple analyzer splits the text on anything that isn’t a letter and lowercases the terms.

“set, the, shape, to, semi, transparent, by, calling, set, trans”

Page 66: Elasticsearch quick Intro (English)

Whitespace Analyzer

“Set the shape to semi-transparent by calling set_trans(5)”

The whitespace analyzer splits the text on whitespace only. It doesn’t lowercase the terms.

“Set, the, shape, to, semi-transparent, by, calling, set_trans(5)”
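The difference between the simple and the whitespace analyzer is easy to reproduce. The two functions below are rough approximations written for illustration, not the Lucene implementations, and they follow the documented behaviour of each analyzer: the simple analyzer splits on anything that is not a letter and lowercases, while the whitespace analyzer splits on whitespace and changes nothing else.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase the terms."""
    return [term.lower() for term in re.findall(r"[A-Za-z]+", text)]


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


print(simple_analyzer(TEXT))
print(whitespace_analyzer(TEXT))
```

The whitespace version keeps "semi-transparent" and "set_trans(5)" as single terms and preserves the capital "S" in "Set", so it only matches queries written in exactly the same form.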

Page 67: Elasticsearch quick Intro (English)

Language Analyzer

“Set the shape to semi-transparent by calling set_trans(5)”

Language analyzers take the peculiarities of a specific language into account, for example by removing its stop words and stemming its words. The english analyzer produces:

“set, shape, semi, transpar, call, set_tran, 5”

Page 68: Elasticsearch quick Intro (English)

Language Analyzer

arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

Page 69: Elasticsearch quick Intro (English)

Pre-built Analyzers

- Standard Analyzer
- Simple Analyzer
- Whitespace Analyzer
- Stop Analyzer
- Keyword Analyzer
- Pattern Analyzer
- Language Analyzers
- Snowball Analyzer
- Custom Analyzer

Page 70: Elasticsearch quick Intro (English)

Tokenizer

- Standard Tokenizer
- Edge NGram Tokenizer
- Keyword Tokenizer
- Letter Tokenizer
- Lowercase Tokenizer
- NGram Tokenizer
- Whitespace Tokenizer
- Pattern Tokenizer
- UAX Email URL Tokenizer
- Path Hierarchy Tokenizer

Page 71: Elasticsearch quick Intro (English)

Token Filters

- Standard Token Filter
- ASCII Folding Token Filter
- Length Token Filter
- Lowercase Token Filter
- NGram Token Filter
- Edge NGram Token Filter
- Porter Stem Token Filter
- Shingle Token Filter
- Stop Token Filter
- … more than 32 filters

Page 72: Elasticsearch quick Intro (English)

Token Filters

THE END.

Page 73: Elasticsearch quick Intro (English)

References

- Elasticsearch : The Definitive Guide
- https://en.wikipedia.org/wiki/Full_text_search
- https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
- https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
- https://mtalavera.wordpress.com/2015/02/16/monitoring-with-collectd-and-kibana/
- Fuzzy search : https://www.found.no/foundation/fuzzy-search/
- Phonetic-plugin : https://github.com/elastic/elasticsearch-analysis-phonetic