Perl and Elasticsearch

Perl & Elasticsearch: Jumping on the bandwagon.

Me Dean Hamstead


Primary Usage: Pretty graphs, generated live, from logs

In most cases, you will be asked to feed logs into an Elasticsearch database. Then make dashboards with charts and graphs.

At the heart of Elasticsearch is Apache Lucene

Elasticsearch uses Lucene as its text indexer.

What it adds is an ability to scale horizontally with relative ease.

It also adds a comprehensive RESTful JSON interface.

Should I use Elasticsearch?

De-normalized data?

Don’t need transactions?

Willing to fight with Java Runtime Environment?

Maybe.

Need lots of data types?

Join queries?

Referential integrity?

Only hundreds of GB of data?

Access control?

Probably not.

Postgres full-text search will probably work fine with fewer resources and less hassle.

Terminology
Roughly equivalent terms...

MySQL                      Elasticsearch
Database                   Index
Table                      Type
Row                        Document
Column                     Field
Schema                     Mapping/Templates
Index                      Everything is indexed
SQL                        Query DSL
SELECT * FROM table …      GET http://…
UPDATE table SET …         PUT http://…

The ELK Stack
A “Stack” with a memorable acronym? Management will love it!

Elasticsearch: The actual database software. It’s written in Java, which explains many of its quirks.

Logstash: A log tailer, also running on the JVM. Its performance is appalling. Don’t waste any time on it.

Kibana: The web frontend to Elasticsearch, from searches to graphs and dashboards. It’s node.js and js heavy.

Use Rsyslog instead of Logstash - IMO it’s pointless to write logs to file then slurp them back in.

Amazingly performant and flexible. Ostensibly much better than Syslog-ng.

Stay sane by using RainerScript for config, eliminating all legacy style syslog config.

Old versions OK on local machines, but “Syslog servers” should run the latest 8.x

If you’re looking for more of an “all in one” solution, you might find Graylog to be a good fit.

It can use Elasticsearch under the hood to power its searches.

Give it a go, let me know how things work out?

Elasticsearch Basic Cluster

Data nodes store your data.

(Eligible) Master nodes maintain a map of where data is.

Types of Elasticsearch nodes

Role               node.master    node.data
Eligible master    true           false
Data               false          true
Query              false          false
Dev-only           true           true

Also, Tribe nodes are a thing.

Comprehensive Elasticsearch Cluster

Interesting properties of Elasticsearch

A wildcard can be used in the index part of a query
This feature is a key part of using Elasticsearch effectively.

Aliases are used to reference one or more indexes
Multiple changes to aliases can (and should) be grouped into one REST command, which Elasticsearch executes atomically.

A template explicitly defines the mapping (schema) of data for yet-to-be-created indexes
When data is inserted referencing an index name which does not exist, the name is matched against each template's pattern and the index is created from the matching template.

Templates also include other index properties
Such as aliases that a new index should automatically be made a part of.

An index can be closed without deleting it
It becomes unusable until it is opened again. However, it is out of memory and sitting on disk, ready to go.
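As a rough sketch of what grouping alias changes into one command looks like, the snippet below builds a single request body that removes an alias from one index and adds it to another. The index and alias names are hypothetical, and the snippet only prints the JSON; the commented-out call at the end is how it might be handed to Search::Elasticsearch against a live cluster.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical index and alias names, for illustration only.
my $alias     = 'logs-current';
my $old_index = 'logs-2016.01.01';
my $new_index = 'logs-2016.01.02';

# Both actions travel in one request body, so a reader of the alias
# never sees a moment where it points at neither index.
my $body = {
    actions => [
        { remove => { index => $old_index, alias => $alias } },
        { add    => { index => $new_index, alias => $alias } },
    ],
};

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->pretty->encode($body);
print $json;

# With a live cluster, the same structure could be sent with:
#   $e->indices->update_aliases( body => $body );
```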

Schemaless, NoSQL?

Elasticsearch queries are made with JSON over RESTful HTTP/S. So it’s not SQL.

If no index exists, it will be created on data insertion. If no template is defined, Elasticsearch will guess at the mapping.

Turn this off; always define a template for every index.
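A minimal sketch of such a template, written as a Perl data structure and printed as JSON. The pattern, alias, type and field names are all made up for illustration. The top-level `template` key is assumed from the 2.x/5.x series (later releases renamed it), and `dynamic => 'strict'` is one way to stop Elasticsearch guessing at fields that the mapping does not list.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical template for date-stamped log indexes.
my $template = {
    template => 'logs-*',                 # index name pattern to match
    settings => {
        number_of_shards   => 1,
        number_of_replicas => 1,
    },
    aliases  => { 'logs-all' => {} },     # new indexes join this alias
    mappings => {
        syslog => {                       # hypothetical type name
            dynamic    => 'strict',       # reject fields not listed below
            properties => {
                '@timestamp' => { type => 'date' },
                message      => { type => 'text' },
            },
        },
    },
};

my $json = JSON::PP->new->canonical->pretty->encode($template);
print $json;

# With a live cluster, this could be stored with:
#   $e->indices->put_template( name => 'logs', body => $template );
```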

Tips for server hardware selection & OS configuration

● 30GB of RAM for each Elasticsearch instance (beyond this the JVM slows down)
● +25% RAM for OS. 48GB total is a good number
● Use RAID0 (striping) or no RAID on disks. Elasticsearch will ensure data is preserved via replication
● Spinning disks have yet to be a bottleneck for me. Scale out rather than up. YMMV
● Turn off Transparent Huge Pages - generally a good idea on any and all servers
● Configure Elasticsearch’s JVM to use Hugepages directly
● By default, Linux IO is tuned to run as poorly as possible (even set these on your laptop/desktop)
  ○ echo 1024 > /sys/block/sda/queue/nr_requests (maybe more, benchmark to taste)
  ○ blockdev --setra 16384 /dev/sda
  ○ Use XFS with mount options like: rw,nobarrier,logbufs=8,inode64,logbsize=256k (XFS rocks)
  ○ Don’t use partitions, just format the disk as is (mkfs -t xfs /dev/sdb). XFS will automatically pick the perfect block alignment
  ○ echo 0 > /sys/block/sda/queue/add_random (exclude the disk as a source of entropy)
● In iptables, it’s generally a good idea to disable connection tracking on the service ports (assuming you have no outbound rules). This saves on CPU time and avoids filling the connection state table
● Use the same JVM on all nodes. Either Oracle Java or OpenJDK is fine; pick one and don’t mix

Tips for Tuning Elasticsearch

● Elasticsearch default settings are for a read heavy load
● There are lots and lots of settings, and lots and lots of blogs talking about how people have tuned their clusters
● Blogs can be very helpful to find which combination of settings will be right for you
● Be careful with anything referencing Elasticsearch before 2.0, ignore anything before 1.0. Things have changed too much
● Above every setting in your config file, note a small blurb about what it does and why you have set it. This will help you remember “why on earth did I think that was a good setting??”
● The Elasticsearch official documentation is very, very good. Take the time to read what each setting does before you attempt to change it (and check that the setting still exists in the version you are running)
● Increase settings by small amounts and observe if performance improves
● Having a setting too high or too low can both reduce performance - you’re trying to find the sweet spot
● More replicas can help read heavy loads if you have more nodes for them to run on; more shards can too. However, the number of shards cannot be changed after an index is created, whereas replicas can be changed at any time
● More indexes plus more nodes can help write heavy loads
● Don’t run queries against data nodes

Elasticsearch lets you scale horizontally, so you have to actually scale your work load horizontally… but without overwhelming your cluster.

Achieving peak performance in Elasticsearch is a balancing game of server settings, indexing strategy and well conceived queries.

Different workloads will require retuning your cluster.

Degrading and Deleting Data

Elasticsearch is not intended to be a data warehouse.

Design a policy which degrades then eventually deletes your data

Degrades? Reduce the number of replicas, move data to nodes with slower disks, eventually close the index

Delete data? If you’re using date stamped index names, just drop the index. Records can also be created with a TTL

Degrading and Deleting Data (continued)

Your policy is implemented via cron tasks, only TTL expiry of records is inbuilt

Curator is the stock tool for this. es-daily-index-maintenance.pl from App::ElasticSearch::Utilities is better IMO

Put them all in a single file like /etc/cron.d/elasticsearch so you can keep track of them. Or maybe several cron.d files.

Aliases are also very helpful, as Elasticsearch will add indexes to them when created, if the template defines it. You can then use the cron job to remove older indexes etc.
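To make the cron-driven policy a little more concrete, here is a small sketch of the "which indexes are too old?" step, using only core Perl. It assumes index names end in a date stamp such as `logs-2016.01.31`; the function name, the naming scheme and the 30-day cutoff are illustrative assumptions, not something the tools above prescribe.

```perl
use strict;
use warnings;
use Time::Piece;

# Return the names of date-stamped indexes older than $max_age_days.
# Assumes names ending in "-YYYY.MM.DD"; anything else is skipped.
sub expired_indexes {
    my ( $today, $max_age_days, @names ) = @_;
    my @expired;
    for my $name (@names) {
        my ($stamp) = $name =~ /-(\d{4}\.\d{2}\.\d{2})\z/ or next;
        my $date     = Time::Piece->strptime( $stamp, '%Y.%m.%d' );
        my $age_days = ( $today->epoch - $date->epoch ) / 86_400;
        push @expired, $name if $age_days > $max_age_days;
    }
    return @expired;
}

# Fixed "today" so the example is reproducible.
my $today = Time::Piece->strptime( '2016.02.10', '%Y.%m.%d' );
my @drop  = expired_indexes( $today, 30,
    qw(logs-2016.01.01 logs-2016.02.09 .kibana) );

print "would drop: $_\n" for @drop;

# With a live cluster, each name could then be passed to:
#   $e->indices->delete( index => $name );
```

The same selection could feed a "degrade" step instead of a delete, for example by closing the index or lowering its replica count.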

Single Node Development Environment

A single node is a perfectly valid Elasticsearch cluster. Although it’s not really suitable for production, it’s perfectly fine for development use.

The node is configured to be a master node and a data node, with the number of expected masters also set to 1

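For all indexes, shards = 1, replicas = 0 (a replica is never allocated to the same node as its primary, so on a single node any replica stays unassigned)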

Use up to 30GB of RAM - you will probably be using less. Don’t worry too much about tuning, dedicated disks etc.

Elasticsearch is packaged for deb, rpm etc., and only a few settings need changing to get running. Or choose one of the many Vagrant or similar install methods available online.

Now about Perl

Just use Search::Elasticsearch;

Don’t be tempted to craft JSON and GET/POST yourself

JSON queries translate nicely into Perl data structures, which are much, much less annoying to write (trailing commas don’t matter)
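A quick illustration of that translation, using the core JSON::PP module purely to show the resulting JSON (Search::Elasticsearch does its own serialization). Note the trailing commas, which Perl happily accepts and JSON would reject.

```perl
use strict;
use warnings;
use JSON::PP;

# A query written as a plain Perl data structure.
my $query = {
    query => {
        match => { title => 'elasticsearch', },    # trailing comma is fine
    },
};

# canonical() sorts keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode($query);
print "$json\n";    # {"query":{"match":{"title":"elasticsearch"}}}
```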

Search::Elasticsearch takes care of connection pooling, proper serialization/deserialization, scrolling, and makes bulk requests very easy.

Search::Elasticsearch 2.03 includes support for 0.9, 1.0 and 2.0 series clusters.

Search::Elasticsearch 5.01 dropped support for pre-5.0 Elasticsearch from the main tarball.

The older clients are still available by installing their ::Client modules directly: Search::Elasticsearch::Client::0_90, Search::Elasticsearch::Client::1_0 or Search::Elasticsearch::Client::2_0

Connecting to Elasticsearch

Explicitly connect to a single server

Provide a number of servers, which the client will round-robin between (i.e. query nodes)

Provide a single hostname, and have the client sniff out the rest of the cluster, which it will then round-robin between.

Connecting to Elasticsearch (straight from the Pod)

    use Search::Elasticsearch;

    # Connect to localhost:9200:
    my $e = Search::Elasticsearch->new();

    # Round-robin between two nodes:
    my $e = Search::Elasticsearch->new(
        nodes => [ 'search1:9200', 'search2:9200' ]
    );

    # Connect to cluster at search1:9200, sniff all nodes and round-robin between them:
    my $e = Search::Elasticsearch->new(
        nodes    => 'search1:9200',
        cxn_pool => 'Sniff'
    );

Insert something, retrieve it again

Really basic stuff...

Some basics

    # Index a document:
    $e->index(
        index => 'my_app',
        type  => 'blog_post',
        id    => 1,
        body  => {
            title   => 'Elasticsearch clients',
            content => 'Interesting content...',
            date    => '2013-09-24'
        }
    );

    # Get the document:
    my $doc = $e->get(
        index => 'my_app',
        type  => 'blog_post',
        id    => 1
    );

Searching

Just a simple example to get started...

Searching

    # Search:
    my $results = $e->search(
        index => 'my_app',
        body  => {
            query => {
                match => { title => 'elasticsearch' }
            }
        }
    );

Cluster Status, Other stuff

Administrative type functions are also all available...

Cluster Status, Other stuff

    # Cluster status requests:
    my $info       = $e->info;
    my $health     = $e->cluster->health;
    my $node_stats = $e->nodes->stats;

    # Index admin. requests:
    $e->indices->create( index => 'my_index' );
    $e->indices->delete( index => 'my_index' );

Scrolled Search Results

Elasticsearch has a limit to how many results it will return (which is a setting you can change, but has side effects)

Like the cursor function in an SQL database, Scrolled Search has the client work with the server to return results in small chunks.

Search::Elasticsearch takes care of all the details and makes it almost transparent.

Scrolled Search (like a cursor in SQL)

    my $es = Search::Elasticsearch->new;

    my $scroll = $es->scroll_helper(
        index => 'my_index',
        body  => {
            query => {...},
            size  => 1000,      # chunk size
            sort  => '_doc'
        }
    );

    say "Total hits: " . $scroll->total;

    while ( my $doc = $scroll->next ) {
        # do something
    }

Bulk Functions

RESTful HTTP/S has a lot of overhead and adds a lot of latency. Inserting one record per HTTP request will almost certainly never keep up with your logs.

Bulk requests allow more than one action at a time for each HTTP request.

Search::Elasticsearch makes this very easy. You push actions into the $bulk object, and it will flush them based on your parameters or when explicitly asked. Callback hooks are also provided.

(Elasticsearch used to have a UDP data insert feature. It’s gone now)
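To show the batching idea in isolation, here is a toy buffer in plain Perl. It is not the real bulk helper and does not talk to Elasticsearch: the package name, the `max_count` and `on_flush` parameters and the flush behaviour are assumptions made for illustration. Actions accumulate until a threshold is reached, then go out together as one "request".

```perl
use strict;
use warnings;

# A toy buffer, NOT the real bulk helper: it only shows the batching idea.
package ToyBulkBuffer;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        max_count => $args{max_count} // 1000,    # flush threshold
        on_flush  => $args{on_flush},             # receives each batch
        buffer    => [],
    }, $class;
}

# Queue actions, flushing automatically whenever the buffer is full.
sub add {
    my ( $self, @actions ) = @_;
    for my $action (@actions) {
        push @{ $self->{buffer} }, $action;
        $self->flush if @{ $self->{buffer} } >= $self->{max_count};
    }
    return;
}

# Send whatever is buffered as a single batch.
sub flush {
    my ($self) = @_;
    return unless @{ $self->{buffer} };
    my @batch = splice @{ $self->{buffer} };      # empties the buffer
    $self->{on_flush}->( \@batch );
    return;
}

package main;

my @requests;    # records the size of each simulated HTTP request
my $bulk = ToyBulkBuffer->new(
    max_count => 3,
    on_flush  => sub { push @requests, scalar @{ $_[0] } },
);

$bulk->add( { foo => $_ } ) for 1 .. 7;
$bulk->flush;    # send the remainder explicitly

print "batch sizes: @requests\n";    # batch sizes: 3 3 1
```

Seven documents go out in three requests rather than seven, which is the whole point. The final explicit flush matters, otherwise the last partial batch would never be sent.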

Bulk Functions

    my $es   = Search::Elasticsearch->new;
    my $bulk = $es->bulk_helper(
        index => 'my_index',
        type  => 'my_type'
    );

    # Index docs:
    $bulk->index({ id => 1, source => { foo => 'bar' }});
    $bulk->add_action( index => { id => 1, source => { foo => 'bar' }});

    # Create docs:
    $bulk->create({ id => 1, source => { foo => 'bar' }});
    $bulk->add_action( create => { id => 1, source => { foo => 'bar' }});
    $bulk->create_docs({ foo => 'bar' });

Bulk Functions (continued)

    # on_success callback, called for every action that succeeds
    my $bulk = $es->bulk_helper(
        on_success => sub {
            my ($action, $response, $i) = @_;
            # do something
        },
    );

    # on_conflict callback, called for every conflict
    my $bulk = $es->bulk_helper(
        on_conflict => sub {
            my ($action, $response, $i, $version) = @_;
            # do something
        },
    );

    # on_error callback, called for every error
    my $bulk = $es->bulk_helper(
        on_error => sub {
            my ($action, $response, $i) = @_;
            # do something
        },
    );

Search::Elasticsearch takes care of connection pooling - so no load balancer is required.

It makes Scrolled Searches easy and almost transparent.

It makes Bulk functions amazingly easy.

It makes use of several HTTP clients, picking the “best” one available on the fly.

It’s awesome! Don’t bother with DIY

More Awesomes...

App::ElasticSearch::Utilities - very useful CLI/cron tools for managing Elasticsearch

Dancer2::Plugin::ElasticSearch - Dancer 2 plugin

Dancer::Plugin::ElasticSearch - Dancer plugin (uses older perl ElasticSearch library)

Catalyst::Model::Search::ElasticSearch - Catalyst Model

Note: CPAN has lots of ElasticSearch, but Elasticsearch is the correct capitalization

More on queries...

Non-search Query Parameters

All the things you might expect…

...plus many many more!

    my $res = $e->search(
        index => 'mydata-*',    # wildcards allowed
        body  => {
            query => { .. },    # search query
            from  => 0,         # first result to return
            size  => 10_000,    # no. of results to return
            sort  => [          # sort results by
                { '@timestamp' => { order => 'asc' } },
                'srcport',
                { ipv4 => 'desc' },
            ],
            # we don't want e/s to send us the raw original data
            _source => \0,      # \0 serializes as JSON false
            # which fields we want returned
            fields => [ 'ipv4', 'srcport', '@timestamp' ],
        },
    );

Note the single quotes around '@timestamp': in a double-quoted Perl string, @timestamp would be interpolated as an array.

More on Queries

Wildcard queries
What you would expect

    query => { wildcard => { user => "ki*y" } }

Regexp queries
Also, what you would expect

    query => { regexp => { "name.first" => "s.*y" } }

More on Queries

Range query
Used with numeric and date field types

    query => {
        range => {              # range query
            age => {            # field
                gte => 10,      # greater than or equal to
                lte => 20,      # less than or equal to
            }
        }
    }

    query => {
        range => {
            date => {                       # ranges for dates can be date math
                gte       => "now-1d/d",    # /d rounds to the day
                lt        => "now/d",
                time_zone => "+01:00",      # optional
            }
        }
    }

More on Queries

Exists query
Exists has literally the same meaning as in Perl

    query => { exists => { field => "user" } }

Bool query
There’s a lot to this, I will just touch on it

    query => {
        bool => {
            must => [    # basically AND
                { exists  => { field => 'ipv4' } },
                { exists  => { field => 'srcport' } },
                # opposite of exists (removed in Elasticsearch 5.0,
                # where must_not with exists does the same job)
                { missing => { field => 'natv4' } },
            ]
        }
    }
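Because queries are just nested Perl data, fragments can be built with ordinary subs and combined. The sketch below is one way to do that; the helper names `exists_q` and `range_q` are made up for this example, and `must_not` with `exists` is used here in place of `missing`. It only prints the resulting JSON rather than running the search.

```perl
use strict;
use warnings;
use JSON::PP;

# Small helpers (hypothetical names) that return query fragments.
sub exists_q {
    my ($field) = @_;
    return { exists => { field => $field } };
}

sub range_q {
    my ( $field, %bounds ) = @_;
    return { range => { $field => {%bounds} } };
}

# Documents from yesterday that have an ipv4 field but no natv4 field.
my $query = {
    bool => {
        must => [
            exists_q('ipv4'),
            range_q( '@timestamp', gte => 'now-1d/d', lt => 'now/d' ),
        ],
        must_not => [ exists_q('natv4') ],
    },
};

print JSON::PP->new->canonical->pretty->encode( { query => $query } );

# With a live cluster, this could be run with:
#   $e->search( index => 'mydata-*', body => { query => $query } );
```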

Effective queries rely on good mappings

A mapping is the schema

You can create an empty index with the mapping you define

Or, an index can be automatically created on insert, with a mapping based upon a matching template

The more you can break your data up into fields with a native datatype, the better Elasticsearch can serve results and the more you can make use of datatype-specific functionality (date math, for example)
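As a sketch of what "fields with a native datatype" might look like for a log record, here is a mapping expressed as a Perl data structure. The field names are invented for illustration, and the `text`/`keyword` split assumes a 5.x series cluster. The snippet prints each field with its type rather than creating an index.

```perl
use strict;
use warnings;

# Hypothetical mapping for a firewall-style log record.
my $mapping = {
    properties => {
        '@timestamp' => { type => 'date' },       # enables date math
        ipv4         => { type => 'ip' },         # enables IP-aware matching
        srcport      => { type => 'integer' },    # enables numeric ranges
        user         => { type => 'keyword' },    # exact-match string
        message      => { type => 'text' },       # analysed full text
    },
};

# List the fields and their types in a stable order.
for my $field ( sort keys %{ $mapping->{properties} } ) {
    printf "%-12s %s\n", $field, $mapping->{properties}{$field}{type};
}

# With a live cluster, an empty index could be created from it with:
#   $e->indices->create(
#       index => 'mydata-2016.02.10',
#       body  => { mappings => { my_type => $mapping } },
#   );
```

Storing the source port as an integer and the address as an ip, rather than leaving everything inside one message string, is what makes range and exists queries on those fields possible.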

Core Datatypes
The basics

String datatypes
● text and keyword

Numeric datatypes
● long, integer, short, byte, double, float

Date datatype
● date

Boolean datatype
● boolean

Binary datatype
● binary

Complex Datatypes
Objects and things

Array datatype
● (Array support does not require a dedicated type)

Object datatype
● object for single JSON objects

Nested datatype
● nested for arrays of JSON objects

Geo Datatypes
Fun with maps etc

Geo-point datatype
● geo_point for lat/lon points

Geo-Shape datatype
● geo_shape for complex shapes like polygons

Specialised Datatypes
You’ll need to read up on a lot of these.

IP datatype
● ip for IPv4 and IPv6 addresses

Completion datatype
● completion to provide auto-complete suggestions

Token count datatype
● token_count to count the number of tokens in a string

mapper-murmur3
● murmur3 to compute hashes of values at index-time and store them in the index

Attachment datatype
● See the mapper-attachments plugin which supports indexing attachments like Microsoft Office formats, Open Document formats, ePub, HTML, etc. into an attachment datatype.

Percolator type
● Accepts queries from the query-dsl

Summary

● Select sensible hardware (or VM) and tune your OS

● Know your workload and tune Elasticsearch to match

● Rsyslog is amazing, it can talk natively to Elasticsearch and is unbelievably scalable

● Search::Elasticsearch is always the way to go (except perhaps, for trivial shell scripts)

● Break your data up into as many fields as you can

● Use native datatypes and get maximum value using Elasticsearch’s query functions

● More shards and/or more replicas with more servers will increase query performance

● More indexes will increase write performance if you write across them

● Use Index names with date stamps and Aliases to manage data elegantly and efficiently

● Plan how you will degrade then drop data

Thank You!
