Elasticsearch for Data Analytics
Felipe Almeida
http://queirozf.com
Introduction, examples and tips
Rio de Janeiro Elastic Meetup
November 2016

Elasticsearch for Data Analytics

Apr 16, 2017

Felipe Almeida
Transcript
Page 2: Elasticsearch for Data Analytics

Structure

● Introduction
● Aggregations
● Mappings
● General tips

● Note: All examples are based on Elasticsearch version 2.x

Page 6: Elasticsearch for Data Analytics

Introduction

● Elasticsearch was born to be a cluster-first search engine, built on top of Apache Lucene
  ○ Unlike Solr, which only gained cluster capabilities in later versions

● It’s generally used as an index for another database
  ○ I.e. the actual data is stored somewhere else; the index only points to it

Page 10: Elasticsearch for Data Analytics

Introduction

● Although its main use is as a search engine for textual documents, it is also very useful for storing and querying other types of documents

● For example, Elasticsearch is a good place to store
  ○ arbitrary key-value dictionaries (JSON objects)
  ○ time series data

Page 11: Elasticsearch for Data Analytics

Introduction

● You can get very interesting information by running aggregations on such data


Page 14: Elasticsearch for Data Analytics

Aggregations

● The idea is that you obtain aggregate information about your data

● Elasticsearch Aggregations are somewhat similar to GROUP BY clauses in regular SQL

SQL         ELASTICSEARCH
select      query
group by    aggregations
rows        JSON objects

Page 16: Elasticsearch for Data Analytics

Aggregations

● An Elasticsearch search request is typically composed of two parts, a query and aggregations:

{
  "query": {
    // query part: matchers and filters
  },
  "aggregations": {
    // aggregation part: aggregations
  }
}

Page 17: Elasticsearch for Data Analytics

Aggregations

We’ll use the following sample database for the examples:

{ "name": "john", "city": "ny", "age": 40 }
{ "name": "john", "city": "sf", "age": 45 }
{ "name": "mary", "city": "ny", "age": 22 }
{ "name": "pam", "city": "dc", "age": 41 }
{ "name": "mary", "city": "london", "age": 20 }
{ "name": "pete", "city": "ny", "age": 31 }

Page 20: Elasticsearch for Data Analytics

Aggregations - terms

● One of the most useful aggregations is the terms aggregation

● It tells you how many entries there are for each value a given attribute can take.

● For questions like:
  ○ How many documents are there per city?

Page 22: Elasticsearch for Data Analytics

Aggregations - terms

● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "per_city": {
      "terms": {
        "field": "city"
      }
    }
  }
}

● Results:

[
  { "key": "ny", "doc_count": 3 },
  { "key": "dc", "doc_count": 1 },
  { "key": "london", "doc_count": 1 },
  { "key": "sf", "doc_count": 1 }
]
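To make the counting concrete, here is a minimal pure-Python sketch of what the terms aggregation computes over the sample documents. It does not talk to Elasticsearch; the helper name `terms_buckets` and the tie-breaking order (by key) are assumptions made for illustration only.

```python
from collections import Counter

# Sample documents from the deck.
docs = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_buckets(documents, field):
    """Count documents per distinct value of `field` (hypothetical helper)."""
    counts = Counter(doc[field] for doc in documents)
    # Largest bucket first; ties are ordered by key so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


per_city = terms_buckets(docs, "city")
print(per_city)
```

Each dictionary in `per_city` plays the role of one bucket in the results shown on the slide.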

Page 24: Elasticsearch for Data Analytics

Aggregations - min, max, avg, sum

● These aggregations calculate statistics for numeric fields

● For questions like:
  ○ What is the maximum age over all query results?

Page 26: Elasticsearch for Data Analytics

Aggregations - min, max, avg, sum

● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "max_age": {
      "max": {
        "field": "age"
      }
    }
  }
}

● Result:

{
  "max_age": {
    "value": 45
  }
}
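As a rough cross-check, the four metric aggregations can be emulated in plain Python over the sample ages. This is only a sketch of the arithmetic, not of how Elasticsearch executes it; the dictionary name `stats` is an assumption for illustration.

```python
# Ages of the six sample documents from the deck.
ages = [40, 45, 22, 41, 20, 31]

# One entry per metric aggregation named on the slide.
stats = {
    "min": min(ages),
    "max": max(ages),
    "avg": sum(ages) / len(ages),
    "sum": sum(ages),
}

print(stats)
```

The `max` entry corresponds to the `max_age` value returned by the query above.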

Page 28: Elasticsearch for Data Analytics

Aggregations - cardinality

● This aggregation calculates the number of distinct values for a given attribute.

● For questions like:
  ○ How many different cities are there in the database?

Page 30: Elasticsearch for Data Analytics

Aggregations - cardinality

● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "cities": {
      "cardinality": {
        "field": "city"
      }
    }
  }
}

● Result:

{
  "cities": {
    "value": 4
  }
}

Page 33: Elasticsearch for Data Analytics

Aggregations - histogram

● This aggregation gives you information about the distribution of values for a given numeric attribute

● For questions like:
  ○ How are people’s ages distributed?
  ○ How many are in their teens, how many are in their twenties, and so on?

Page 35: Elasticsearch for Data Analytics

Aggregations - histogram

● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "distrib": {
      "histogram": {
        "field": "age",
        "interval": 10
      }
    }
  }
}

● Result:

[
  { "key": 20, "doc_count": 2 },
  { "key": 30, "doc_count": 1 },
  { "key": 40, "doc_count": 3 }
]
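The bucketing rule behind these numbers can be sketched in a few lines of Python: each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. The helper name `histogram_buckets` is hypothetical, and the sketch only lists buckets that contain at least one document.

```python
from collections import Counter

# Ages of the six sample documents from the deck.
ages = [40, 45, 22, 41, 20, 31]


def histogram_buckets(values, interval):
    """Group values into fixed-width buckets keyed by their lower bound."""
    # Floor each value to a multiple of the interval, e.g. 45 -> 40 for interval 10.
    counts = Counter((value // interval) * interval for value in values)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]


distrib = histogram_buckets(ages, 10)
print(distrib)
```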

Page 37: Elasticsearch for Data Analytics

Aggregations - date_histogram

● This aggregation is similar to the previous one (histogram), but it works on date fields and lets you specify intervals and bounds using date macros

● This is one of the most useful aggregations if your data follow a time series
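The deck gives no date_histogram example, so the following is a hedged sketch of what such a request body could look like in the 2.x syntax used elsewhere in these slides, written as a Python dictionary and serialized to JSON. The field name `created_at` and the aggregation name `per_day` are assumptions; the sample documents in this deck have no date field.

```python
import json

# Request body for a per-day document count (2.x-style "interval" option).
request_body = {
    "query": {"match_all": {}},
    "aggregations": {
        "per_day": {  # hypothetical aggregation name
            "date_histogram": {
                "field": "created_at",  # hypothetical date field
                "interval": "day",
            }
        }
    },
}

# The JSON text is what would be sent as the body of a search request.
print(json.dumps(request_body, indent=2))
```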

Page 40: Elasticsearch for Data Analytics

Nested aggregations

● Some aggregations allow extra aggregations to be performed on their results.

● For instance, you can run a histogram aggregation inside each bucket of a terms aggregation

● For questions like:
  ○ For each city, how many people are there in each age group?

Page 42: Elasticsearch for Data Analytics

Nested aggregations

● Query:

{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "cities": {
      "terms": {
        "field": "city"
      },
      "aggregations": {
        "distrib": {
          "histogram": {
            "field": "age",
            "interval": 10
          }
        }
      }
    }
  }
}

● Result (partial):

[
  {
    "key": "ny",
    "doc_count": 3,
    "distrib": {
      "buckets": [
        { "key": 20, "doc_count": 1 },
        { "key": 30, "doc_count": 1 },
        { "key": 40, "doc_count": 1 }
      ]
    }
  },
  {
    "key": "dc",
    "doc_count": 1,
    "distrib": {
      "buckets": [
        { "key": 40, "doc_count": 1 }
      ]
    }
  },
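The nesting can be pictured as "group first, then aggregate inside each group". Below is a minimal pure-Python sketch of that idea over the sample documents; the function name `terms_then_histogram` and the ordering of ties are assumptions for illustration, and nothing here calls Elasticsearch.

```python
from collections import Counter, defaultdict

# Sample documents from the deck.
docs = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]


def terms_then_histogram(documents, terms_field, hist_field, interval):
    """Bucket by `terms_field`, then build a histogram inside each bucket."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[terms_field]].append(doc)

    result = []
    # Largest group first; ties ordered by key for a deterministic output.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        members = groups[key]
        counts = Counter((d[hist_field] // interval) * interval for d in members)
        result.append({
            "key": key,
            "doc_count": len(members),
            "distrib": {
                "buckets": [{"key": b, "doc_count": counts[b]} for b in sorted(counts)]
            },
        })
    return result


cities = terms_then_histogram(docs, "city", "age", 10)
print(cities[0])
```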

Page 47: Elasticsearch for Data Analytics

Mappings

● Elasticsearch is schemaless, which means you can add fields to your documents as you wish

● However, some aggregations behave differently depending upon the way some attributes are indexed. In particular:
  ○ not_analyzed strings
  ○ dates

● So you may want to control how your documents are indexed using appropriate mappings
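As an illustration, an explicit mapping for the sample documents might look like the sketch below, using the 2.x `string` type with `"index": "not_analyzed"`. The type name `person` and the `created_at` date field are assumptions; the deck itself does not show a mapping.

```python
import json

# Body that could be sent when creating an index (Elasticsearch 2.x syntax).
index_body = {
    "mappings": {
        "person": {  # hypothetical type name
            "properties": {
                # Keep whole values as single terms so aggregations see "ny", not tokens.
                "name": {"type": "string", "index": "not_analyzed"},
                "city": {"type": "string", "index": "not_analyzed"},
                "age": {"type": "integer"},
                # Hypothetical field, indexed as a date so date aggregations apply.
                "created_at": {"type": "date"},
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```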

Page 52: Elasticsearch for Data Analytics

Dynamic Templates

● Dynamic templates let you predefine the mappings that control how new attributes or indices will be stored.

● They can help you if:
  ○ You want to disable the analyzer for every new string field you add to your documents
  ○ You want to index numeric attributes whose names end in "timestamp" as dates
  ○ You want to refuse to allow extra attributes to be indexed (i.e. force a hard schema)
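The three use cases above could be sketched as follows, again as Python dictionaries in 2.x mapping syntax. The type name `person` and the template names are hypothetical, and the hard-schema case is shown with the separate `"dynamic": "strict"` mapping setting rather than with a template, which is an assumption about how one might implement that bullet.

```python
import json

# Use cases 1 and 2: templates applied to fields that are added dynamically.
templated_body = {
    "mappings": {
        "person": {  # hypothetical type name
            "dynamic_templates": [
                {"strings_not_analyzed": {  # hypothetical template name
                    "match_mapping_type": "string",
                    "mapping": {"type": "string", "index": "not_analyzed"},
                }},
                {"timestamps_as_dates": {  # hypothetical template name
                    "match": "*timestamp",
                    "match_mapping_type": "long",
                    "mapping": {"type": "date"},
                }},
            ]
        }
    }
}

# Use case 3: reject documents that contain fields missing from the mapping.
strict_body = {
    "mappings": {
        "person": {
            "dynamic": "strict",
            "properties": {"city": {"type": "string", "index": "not_analyzed"}},
        }
    }
}

print(json.dumps(templated_body, indent=2))
print(json.dumps(strict_body, indent=2))
```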

Page 55: Elasticsearch for Data Analytics

General tips

● All text fields are analyzed (split into tokens) by default.
  ○ To aggregate on whole string values rather than on individual tokens, you need to alter their mapping to not_analyzed

● Use filters whenever you can. Filters and aggregation results are aggressively cached by Elasticsearch

● Any filters in the query area also affect the output of aggregations
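To illustrate the last point, the sketch below pairs a request body that filters on age with a pure-Python emulation of its effect: the aggregation only sees documents that pass the query. The `bool`/`filter` form is the 2.x syntax; the threshold of 30 and the helper `per_city_counts` are assumptions chosen for the example.

```python
from collections import Counter

# Sample documents from the deck.
docs = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "john", "city": "sf", "age": 45},
    {"name": "mary", "city": "ny", "age": 22},
    {"name": "pam", "city": "dc", "age": 41},
    {"name": "mary", "city": "london", "age": 20},
    {"name": "pete", "city": "ny", "age": 31},
]

# Request body: the filter lives in the query part, the terms aggregation is unchanged.
request_body = {
    "query": {"bool": {"filter": {"range": {"age": {"gte": 30}}}}},
    "aggregations": {"per_city": {"terms": {"field": "city"}}},
}


def per_city_counts(documents):
    """Emulate the per_city terms aggregation as a plain dict (hypothetical helper)."""
    return dict(Counter(doc["city"] for doc in documents))


unfiltered = per_city_counts(docs)
# Emulate the range filter: only documents with age >= 30 reach the aggregation.
filtered = per_city_counts([doc for doc in docs if doc["age"] >= 30])

print(unfiltered)
print(filtered)
```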

Page 58: Elasticsearch for Data Analytics

General tips

● Always use bulk inserts rather than individual inserts to save bandwidth

● Use TTL (time to live) to expire documents that don’t need to be kept for long
  ○ Changes in the TTL settings only affect new documents. Documents that are already indexed are not affected.
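For the bulk tip, here is a small sketch that builds a request body in the newline-delimited format the bulk API expects: one action line followed by one source line per document, with a trailing newline. The index name `people`, the type name `person` and the helper `build_bulk_payload` are assumptions; sending the payload over HTTP is left out.

```python
import json

# Two of the sample documents from the deck.
docs = [
    {"name": "john", "city": "ny", "age": 40},
    {"name": "mary", "city": "london", "age": 20},
]


def build_bulk_payload(documents, index, doc_type):
    """Return a bulk body that indexes every document into index/doc_type."""
    lines = []
    for doc in documents:
        # Action line: what to do and where.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload(docs, "people", "person")
print(payload)
```

One request then carries all documents, instead of one HTTP round trip per document.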

Page 60: Elasticsearch for Data Analytics

General tips

● By default, ranges and histograms do not return buckets with zero documents. Use min_doc_count and, optionally, extended_bounds to include them.

● Dynamic templates can be created at the index level (for attributes that match some criteria) or at the cluster level, via index templates (for indices that match some criteria)
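The effect of min_doc_count and extended_bounds can be sketched in plain Python. The emulation below follows the behaviour described on this slide (empty buckets omitted unless min_doc_count is 0); the helper name, the default of 1 and the example ages are assumptions, and the real defaults may differ between Elasticsearch versions.

```python
from collections import Counter

# Matching request options, in the syntax used elsewhere in this deck.
histogram_options = {
    "field": "age",
    "interval": 10,
    "min_doc_count": 0,
    "extended_bounds": {"min": 10, "max": 60},
}


def histogram_buckets(values, interval, min_doc_count=1, extended_bounds=None):
    """Emulate histogram buckets, optionally including empty ones."""
    counts = Counter((value // interval) * interval for value in values)
    if min_doc_count > 0:
        # Only buckets with enough documents are returned.
        return [{"key": k, "doc_count": counts[k]}
                for k in sorted(counts) if counts[k] >= min_doc_count]
    keys = list(counts)
    if extended_bounds is not None:
        low, high = extended_bounds
        # Force the histogram to start and end at the requested bounds.
        keys += [(low // interval) * interval, (high // interval) * interval]
    if not keys:
        return []
    return [{"key": k, "doc_count": counts.get(k, 0)}
            for k in range(min(keys), max(keys) + interval, interval)]


ages = [20, 22, 45]  # hypothetical ages with a gap in the thirties
print(histogram_buckets(ages, 10))
print(histogram_buckets(ages, 10, min_doc_count=0))
print(histogram_buckets(ages, 10, min_doc_count=0, extended_bounds=(10, 60)))
```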

Page 64: Elasticsearch for Data Analytics

General tips

● In general, date-based aggregations are like others but they accept date macros (e.g. now-1h) when defining aggregation options

● You can reference nested attributes when defining an aggregation

● Scripts are useful but many Elasticsearch setups (e.g. AWS) do not support them due to security concerns

● Use "size":0 to suppress regular query results and return only aggregation results, thus saving processing and bandwidth
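A request that combines two of these tips might look like the sketch below: `"size": 0` drops the regular hits, and the aggregation references a nested attribute by its dotted path. The attribute `address.city` is hypothetical (the sample documents in this deck keep `city` at the top level), as is the aggregation name.

```python
import json

request_body = {
    "size": 0,  # return aggregation results only, no hits
    "query": {"match_all": {}},
    "aggregations": {
        "per_city": {  # hypothetical aggregation name
            # Dotted path to an attribute inside an "address" object (hypothetical).
            "terms": {"field": "address.city"}
        }
    },
}

print(json.dumps(request_body, indent=2))
```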


Page 66: Elasticsearch for Data Analytics

General tips

● The value_count aggregation does not count unique values for the attributes! The cardinality aggregation does that!

● By default, the terms aggregation does not return all possible values for the selected field. Tune the size attribute to control the result size (use 0 to bring all results)
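The difference between value_count and cardinality is easy to see on the sample data. The sketch below emulates both in plain Python; note that it counts distinct values exactly, whereas the cardinality aggregation in Elasticsearch returns an approximate count.

```python
# City of each of the six sample documents from the deck.
cities = ["ny", "sf", "ny", "dc", "london", "ny"]

# value_count: how many values were seen, duplicates included.
value_count = len(cities)

# cardinality: how many distinct values were seen.
cardinality = len(set(cities))

print(value_count, cardinality)
```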
