Elasticsearch for Data Analytics
Felipe Almeida http://queirozf.com
Introduction, examples and tips
Rio de Janeiro Elastic Meetup, November 2016
Structure
● Introduction
● Aggregations
● Mappings
● General tips
● Note: all examples are based on Elasticsearch version 2.x
Introduction
● Elasticsearch was born to be a cluster-first search engine, built on top of Apache Lucene
  ○ Unlike Solr, which had cluster capabilities added only in later versions
● It’s generally used as an index for another database
  ○ I.e. the actual data is stored somewhere else; it’s only pointed to by the index
● Although its main use is as a search engine for textual documents, it is also very useful for storing and querying other types of documents
● For example, Elasticsearch is a good place to store
  ○ arbitrary key-value dictionaries (JSON objects)
  ○ time series data
● You can get very interesting information by running aggregations on such data
Aggregations
● The idea is that you obtain aggregate information about your data
● Elasticsearch Aggregations are somewhat similar to GROUP BY clauses in regular SQL
SQL         ELASTICSEARCH
select      query
group by    aggregations
rows        JSON objects
● An Elasticsearch search request is typically composed of two parts, the query and the aggregations:
{
"query":{
// matchers and filters
},
"aggregations":{
// aggregations
}
}
We’ll use the following sample database for the examples:
{ "name":"john", "city": "ny", "age": 40}
{ "name":"john", "city": "sf", "age": 45}
{ "name":"mary", "city": "ny", "age": 22}
{ "name":"pam", "city": "dc", "age": 41}
{ "name":"mary", "city": "london", "age": 20}
{ "name":"pete", "city": "ny", "age": 31}
Aggregations - terms
● One of the most useful aggregations is the terms aggregation
● It tells you how many entries there are for each value a given attribute can take.
● For questions like:
  ○ How many documents are there per city?
● Query:
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "per_city": {
      "terms": {
        "field": "city"
      }
    }
  }
}
● Results:
[
  { "key": "ny", "doc_count": 3 },
  { "key": "dc", "doc_count": 1 },
  { "key": "london", "doc_count": 1 },
  { "key": "sf", "doc_count": 1 }
]
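To make the semantics concrete, here is a minimal Python sketch that emulates locally what the terms aggregation computes over the sample documents. It does not talk to Elasticsearch, and the function name terms_aggregation is made up for this illustration.

```python
from collections import Counter

# Sample documents from the slides.
docs = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_aggregation(documents, field):
    """Hypothetical helper: count documents per distinct value of `field`."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    # Buckets are listed from the most frequent value to the least frequent.
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


per_city = terms_aggregation(docs, "city")
for bucket in per_city:
    print(bucket)
```

The order of buckets with equal counts is an artifact of this sketch and may differ from what Elasticsearch returns.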
Aggregations - min, max, avg, sum
● These aggregations calculate statistics for numeric fields
● For questions like:
  ○ What is the maximum age over all query results?
● Query:
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "max_age": {
      "max": {
        "field": "age"
      }
    }
  }
}
● Result:
{
  "max_age": {
    "value": 45
  }
}
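As a quick sanity check, the four statistics can be reproduced locally from the ages in the sample documents. This is a plain Python sketch, not an Elasticsearch call; the variable names are chosen for this illustration only.

```python
# Ages taken from the sample documents in the slides.
ages = [40, 45, 22, 41, 20, 31]

# Local equivalents of the min, max, sum and avg aggregations.
stats = {
    "min": min(ages),
    "max": max(ages),
    "sum": sum(ages),
    "avg": sum(ages) / len(ages),
}

for name, value in stats.items():
    print(name, value)
```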
Aggregations - cardinality
● This aggregation calculates the number of distinct values for a given attribute.
● For questions like:
  ○ How many different cities are there in the database?
● Query:
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "cities": {
      "cardinality": {
        "field": "city"
      }
    }
  }
}
● Result:
{
  "cities": {
    "value": 4
  }
}
Aggregations - histogram
● This aggregation gives you information about the distribution of values for a given numeric attribute
● For questions like:
  ○ How are people’s ages distributed?
  ○ How many are in their teens, how many are in their twenties, and so on?
● Query:
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "distrib": {
      "histogram": {
        "field": "age",
        "interval": 10
      }
    }
  }
}
● Result:
[
  { "key": 20, "doc_count": 2 },
  { "key": 30, "doc_count": 1 },
  { "key": 40, "doc_count": 3 }
]
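The bucketing rule behind these keys can be sketched in a few lines of Python: each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. This is a local emulation under that assumption (no offset, integer values); histogram_aggregation is a hypothetical helper name.

```python
from collections import Counter

# Ages taken from the sample documents in the slides.
ages = [40, 45, 22, 41, 20, 31]


def histogram_aggregation(values, interval):
    """Hypothetical helper: bucket values by rounding down to a multiple of `interval`."""
    counts = Counter((value // interval) * interval for value in values)
    # Buckets are listed in ascending key order; empty buckets are not generated here.
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


distrib = histogram_aggregation(ages, 10)
for bucket in distrib:
    print(bucket)
```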
Aggregations - date_histogram
● This aggregation is similar to the previous one (histogram), but it works on date fields and lets you specify intervals and bounds using date macros
● This is one of the most useful aggregations if your data is a time series
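The slides give no date_histogram query, so here is a hedged sketch of what a request body might look like for Elasticsearch 2.x, built as a Python dictionary. The field name created_at, the aggregation name over_time and the helper date_histogram_body are assumptions made for this example; the sample documents in the slides have no date field.

```python
import json


def date_histogram_body(field, interval):
    """Hypothetical helper: build a request body with one date_histogram aggregation."""
    return {
        "query": {"match_all": {}},
        "aggregations": {
            # "over_time" is an arbitrary name for the aggregation.
            "over_time": {
                "date_histogram": {
                    "field": field,        # must be mapped as a date
                    "interval": interval,  # e.g. "hour", "day", "month"
                }
            }
        },
    }


body = date_histogram_body("created_at", "day")
print(json.dumps(body, indent=2))
```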
Nested aggregations
● Bucket aggregations (such as terms and histogram) allow extra aggregations to be performed on each of their buckets.
● For instance, you can perform a histogram aggregation inside each bucket of a terms aggregation (or the other way around)
● For questions like:
  ○ For each city, how many people are there in each age group?
● Query:
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "cities": {
      "terms": {
        "field": "city"
      },
      "aggregations": {
        "distrib": {
          "histogram": {
            "field": "age",
            "interval": 10
          }
        }
      }
    }
  }
}
● Result (partial):
[
  {
    "key": "ny",
    "doc_count": 3,
    "distrib": {
      "buckets": [
        { "key": 20, "doc_count": 1 },
        { "key": 30, "doc_count": 1 },
        { "key": 40, "doc_count": 1 }
      ]
    }
  },
  {
    "key": "dc",
    "doc_count": 1,
    "distrib": {
      "buckets": [
        { "key": 40, "doc_count": 1 }
      ]
    }
  },
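The nesting can be emulated locally to show how the result is shaped: documents are first grouped by city, then each group gets its own age histogram. This is a Python sketch over the sample documents, not an Elasticsearch call; terms_then_histogram is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Sample documents from the slides.
docs = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_then_histogram(documents, terms_field, hist_field, interval):
    """Hypothetical helper: a histogram computed inside each terms bucket."""
    per_term = defaultdict(Counter)
    for doc in documents:
        bucket_key = (doc[hist_field] // interval) * interval
        per_term[doc[terms_field]][bucket_key] += 1

    results = []
    # Outer buckets: most frequent term first.
    for term, hist in sorted(per_term.items(), key=lambda item: -sum(item[1].values())):
        results.append({
            "key": term,
            "doc_count": sum(hist.values()),
            "distrib": {
                "buckets": [{"key": k, "doc_count": hist[k]} for k in sorted(hist)]
            },
        })
    return results


cities = terms_then_histogram(docs, "city", "age", 10)
for city_bucket in cities:
    print(city_bucket)
```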
Mappings
● Elasticsearch is schemaless, which means you can add fields to your documents as you wish
● However, some aggregations behave differently depending on how attributes are indexed. In particular:
  ○ not_analyzed strings
  ○ dates
● So you may want to control how your documents are indexed using appropriate mappings
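As an illustration of such a mapping, the sketch below builds an index-creation body for Elasticsearch 2.x in which chosen string fields are not_analyzed and chosen fields are dates. The type name person, the field created_at and the helper explicit_mapping are assumptions made for this example.

```python
import json


def explicit_mapping(doc_type, string_fields, date_fields):
    """Hypothetical helper: build an index-creation body with explicit field mappings."""
    properties = {}
    for field in string_fields:
        # not_analyzed keeps the whole string as a single term.
        properties[field] = {"type": "string", "index": "not_analyzed"}
    for field in date_fields:
        properties[field] = {"type": "date"}
    return {"mappings": {doc_type: {"properties": properties}}}


body = explicit_mapping("person", ["name", "city"], ["created_at"])
print(json.dumps(body, indent=2))
```

A body like this would be sent when creating the index, before any documents are indexed.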
Dynamic Templates
● Dynamic templates enable you to predefine how attributes or indices will be mapped and stored.
● They can help you if:
  ○ You want to disable the analyzer for every new string field you add to your documents
  ○ You want to index numeric attributes whose names end in "timestamp" as dates
  ○ You want to refuse to allow extra attributes to be indexed (i.e. force a hard schema)
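The three cases above might be expressed as follows for Elasticsearch 2.x. This is a sketch built as Python dictionaries: the type name person, the template names and the variable names are made up, and the second template assumes the timestamps arrive as long integers (epoch milliseconds).

```python
import json

# Cases 1 and 2: dynamic templates applied to newly seen fields.
templates_body = {
    "mappings": {
        "person": {
            "dynamic_templates": [
                {
                    # Every new string field is stored without analysis.
                    "strings_not_analyzed": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                },
                {
                    # New long fields whose name ends in "timestamp" become dates.
                    "timestamps_as_dates": {
                        "match": "*timestamp",
                        "match_mapping_type": "long",
                        "mapping": {"type": "date"},
                    }
                },
            ]
        }
    }
}

# Case 3: a strict mapping rejects documents that contain unmapped fields.
strict_body = {
    "mappings": {
        "person": {
            "dynamic": "strict",
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"},
                "age": {"type": "integer"},
            },
        }
    }
}

print(json.dumps(templates_body, indent=2))
print(json.dumps(strict_body, indent=2))
```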
General tips
● All string fields are analyzed (split into tokens) by default.
  ○ To aggregate on whole string values, you need to change their mapping to not_analyzed
● Use filters whenever you can. Filters and aggregation results are aggressively cached by Elasticsearch
● Any filters in the query section also affect the output of aggregations
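To illustrate the last point, the sketch below builds a request body in which a filter restricts the documents before the aggregation runs, so the buckets only reflect matching documents. It assumes the bool query with a filter clause available in Elasticsearch 2.x; the helper name filtered_terms_body and the aggregation naming scheme are made up.

```python
import json


def filtered_terms_body(filter_field, filter_value, agg_field):
    """Hypothetical helper: terms aggregation computed only over filtered documents."""
    return {
        "query": {
            "bool": {
                # Filter clauses do not score documents and can be cached.
                "filter": {"term": {filter_field: filter_value}}
            }
        },
        "aggregations": {
            "per_" + agg_field: {"terms": {"field": agg_field}}
        },
    }


# Names per city "ny" only: other cities never reach the aggregation.
body = filtered_terms_body("city", "ny", "name")
print(json.dumps(body, indent=2))
```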
● Prefer bulk inserts over individual inserts to save bandwidth
● Use TTL (time to live) to expire documents that don’t need to be kept for long
  ○ Changes in the TTL settings only affect new documents. Documents that are already indexed are not affected.
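For the bulk tip, the payload is newline-delimited JSON: one action line followed by one document line per document, ending with a newline. The sketch below only builds that payload as a string; the index name people, the type name person and the helper bulk_payload are assumptions, and sending the payload to the _bulk endpoint is left out.

```python
import json


def bulk_payload(index, doc_type, documents):
    """Hypothetical helper: build a newline-delimited bulk body of index actions."""
    lines = []
    for doc in documents:
        # Action line: index this document into the given index and type.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


payload = bulk_payload("people", "person", [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "mary", "city": "ny", "age": 22},
])
print(payload)
```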
● By default, ranges and histograms do not return buckets with zero documents. Use min_doc_count and, optionally, extended_bounds to include them.
● Dynamic templates can be created at the index level (for attributes that match some criteria) or at the cluster level (for indices that match some criteria)
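The tip about empty buckets can be sketched as a request body: min_doc_count set to 0 asks for empty buckets, and extended_bounds stretches the histogram beyond the range covered by the data. The bounds 0 and 90 and the helper name are arbitrary choices for this illustration.

```python
import json


def histogram_with_empty_buckets(field, interval, lower, upper):
    """Hypothetical helper: histogram body that also returns zero-count buckets."""
    return {
        "query": {"match_all": {}},
        "aggregations": {
            "distrib": {
                "histogram": {
                    "field": field,
                    "interval": interval,
                    "min_doc_count": 0,
                    "extended_bounds": {"min": lower, "max": upper},
                }
            }
        },
    }


body = histogram_with_empty_buckets("age", 10, 0, 90)
print(json.dumps(body, indent=2))
```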
● In general, date-based aggregations are like others but they accept date macros (e.g. now-1h) when defining aggregation options
● You can reference nested attributes when defining an aggregation
● Scripts are useful but many Elasticsearch setups (e.g. AWS) do not support them due to security concerns
● Use "size":0 to suppress regular query results and return only aggregation results, thus saving processing and bandwidth
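The size tip amounts to one extra key at the top level of the request body. A small sketch, with a made-up helper name, that adds it to an existing body without modifying the original dictionary:

```python
import json


def aggregations_only(body):
    """Hypothetical helper: copy a request body and suppress the regular hits."""
    result = dict(body)  # shallow copy, the caller's dictionary is left untouched
    result["size"] = 0
    return result


original = {
    "query": {"match_all": {}},
    "aggregations": {"per_city": {"terms": {"field": "city"}}},
}
body = aggregations_only(original)
print(json.dumps(body, indent=2))
```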
● The value_count aggregation does not count unique values for the attributes! The cardinality aggregation does that!
● By default, the terms aggregation does not return all possible values for the selected field. Tune the size attribute to control the result size (use 0 to return all results)
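The difference between value_count and cardinality is easy to see on the city field of the sample documents. The sketch below emulates both locally in Python; note that the real cardinality aggregation is approximate on large datasets, whereas this emulation is exact.

```python
# City values taken from the sample documents in the slides.
cities = ["ny", "sf", "ny", "dc", "london", "ny"]

# value_count: how many values were seen, duplicates included.
value_count = len(cities)

# cardinality: how many distinct values were seen.
cardinality = len(set(cities))

print("value_count:", value_count)
print("cardinality:", cardinality)
```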