Introduction to Elasticsearch with basics of Lucene


May 2014 Meetup

Rahul Jain

@rahuldausa

@ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/

Who am I

• Software Engineer

• 7 years of software development experience

• Built a platform to search logs in near real time with a volume of 1 TB/day #

• Worked on Solr-search-based SEO/SEM software with 40 billion records/month (topic of next talk?)

• Areas of expertise/interest: high-traffic web applications, Java/J2EE, big data, NoSQL, information retrieval, machine learning

# http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr

Agenda

• IR Overview

• Basic Concepts

• Lucene

• Elasticsearch

• Logstash & Kibana - Short Introduction

• Q&A

Information Retrieval (IR)

“Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.”

- Wikipedia

Basic Concepts

• Term t : a noun or compound word used in a specific context

• tf (t in d) : term frequency in a document

– a measure of how often a term appears in the document

– the number of times term t appears in the currently scored document d

• idf (t) : inverse document frequency

– a measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index

– obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient

• boost (index) : boost of the field at index-time

• boost (query) : boost of the field at query-time

Basic Concepts - TF-IDF

TF - IDF = Term Frequency X Inverse Document Frequency

Credit: http://whatisgraphsearch.com/
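As a minimal sketch of the tf and idf definitions above, the following Python computes both values and their product for a toy corpus. The corpus, the function names, and the plain log(N / df) form are illustrative assumptions; Lucene's own scoring applies additional smoothing, normalization and boosts that are not reproduced here.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of already-tokenized terms.
docs = {
    "d1": ["quick", "brown", "fox"],
    "d2": ["lazy", "brown", "dog"],
    "d3": ["quick", "quick", "dog"],
}


def tf(term, doc_id):
    """Number of times `term` appears in document `doc_id`."""
    return Counter(docs[doc_id])[term]


def idf(term):
    """log(total documents / documents containing the term)."""
    df = sum(1 for terms in docs.values() if term in terms)
    if df == 0:
        return 0.0  # term never seen; avoid division by zero
    return math.log(len(docs) / df)


def tf_idf(term, doc_id):
    """Term frequency multiplied by inverse document frequency."""
    return tf(term, doc_id) * idf(term)


for doc_id in docs:
    print(doc_id, round(tf_idf("quick", doc_id), 3))
```

A term that appears in only one document ("fox") gets a higher idf than one that appears in two ("brown"), which is the "rare terms matter more" intuition behind the product.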

Apache Lucene

• Fast, high performance, scalable search/IR library

• Open source

• Initially developed by Doug Cutting (also author of Hadoop)

• Indexing and searching

• Inverted index of documents

• Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based matching

• http://lucene.apache.org/

Lucene Internals - Inverted Index

Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
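The idea of an inverted index can be sketched in a few lines of Python: each term maps to the set of documents that contain it, so a lookup starts from the term rather than scanning every document. The sample documents and the function names are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace; Lucene's on-disk structures are far more elaborate.

```python
from collections import defaultdict

# Hypothetical sample documents keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}


def build_inverted_index(documents):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return sorted ids of documents that contain every given term (AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_inverted_index(docs)
print(search(index, "quick", "brown"))
```

Note that "dog" does not match document 3 here, because "dogs" is a different term; closing that gap is the job of the analysis step (for example stemming).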

Lucene Internals (Contd.)

• Defines a document model

• An index contains documents.

• Each document consists of fields.

• Each field has attributes:

– What is the data type (FieldType)

– How to handle the content (Analyzers, Filters)

– Is it a stored field (stored="true") or an indexed field (indexed="true")

Indexing Pipeline

• Analyzer : creates tokens using a Tokenizer and/or by applying Filters (Token Filters)

• Each field can define an Analyzer for index time, for query time, or for both.

Credit : http://www.slideshare.net/otisg/lucene-introduction
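The tokenizer-plus-filters pipeline described above can be sketched as plain function composition. Everything here is a hypothetical stand-in written in Python, not Lucene's Java API: the tokenizer, the two filters, the stopword set, and the choice to use a different chain at index time and at query time are all assumptions made for illustration.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "over"}


def letter_tokenizer(text):
    """Split text into runs of letters, dropping digits and punctuation."""
    return re.findall(r"[^\W\d_]+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Token filter: drop tokens found in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]


def make_analyzer(tokenizer, *filters):
    """Build an analyzer: one tokenizer followed by filters applied in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# A field may use different chains at index time and at query time.
index_analyzer = make_analyzer(letter_tokenizer, lowercase_filter, stop_filter)
query_analyzer = make_analyzer(letter_tokenizer, lowercase_filter)

print(index_analyzer("The quick brown fox jumps over the lazy dog."))
print(query_analyzer("Lazy DOG"))
```

Filter order matters: the stop filter only sees lowercase tokens because it runs after the lowercase filter.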

Analysis Process - Tokenizer

WhitespaceAnalyzer

Simplest built-in analyzer

The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Tokens

Analysis Process - Tokenizer

SimpleAnalyzer

Lowercases, splits at non-letter boundaries

The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]

Tokens

Elasticsearch

Introduction

• Enterprise search platform built on Apache Lucene

• Open source

• Highly reliable, scalable, fault tolerant

• Supports distributed indexing, replication, and load-balanced querying

• http://www.elasticsearch.org/

Elasticsearch - Features

• Distributed RESTful search server

• Document oriented

• Domain Driven

• Schema-less

• RESTful

• Easy to scale horizontally

Elasticsearch - Features

• Highlighting

• Spelling suggestions

• Facets (group by)

• Query DSL

– based on JSON to define queries

• Automatic shard replication, routing

• Zen discovery

– Unicast

– Multicast

• Master election

– Re-election if the master node fails

• HTTP RESTful API

• Java API

• Clients

– perl, python, php, ruby, .net etc

• All APIs perform automatic node operation rerouting.
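The routing bullet above can be illustrated with a small sketch. Elasticsearch's documentation describes routing as hashing a routing value (by default the document id) and taking the result modulo the number of primary shards. The function name below is hypothetical, and zlib.crc32 is only a stand-in for the hash, which is an assumption and not the function Elasticsearch itself uses.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Choose a shard as hash(routing value) modulo the primary shard count.

    zlib.crc32 is a stand-in hash for illustration, not Elasticsearch's own.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


# The same id always lands on the same shard, so any node can forward
# a request for that document to the right place.
for doc_id in ("1", "2", "3", "42"):
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the shard count is part of the formula, changing the number of primary shards would send existing ids to different shards.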

How to start

It’s this easy.

Operations

INDEX CREATION

curl -XPUT "http://localhost:9200/movies/movie/1" -d '{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'

http://localhost:9200/<index>/<type>/[<id>]
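The same index request can be assembled from Python's standard library, which makes the <index>/<type>/<id> URL pattern explicit. This is a sketch under assumptions: the helper names are hypothetical, the Content-Type header is added as a precaution, and the request is only built here, not sent, so no running node is needed.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # local node address used on the slides


def document_url(index, doc_type, doc_id=None):
    """Build http://localhost:9200/<index>/<type>/[<id>]."""
    parts = [BASE_URL, index, doc_type]
    if doc_id is not None:
        parts.append(str(doc_id))
    return "/".join(parts)


def build_index_request(index, doc_type, doc_id, document):
    """Prepare (but do not send) a PUT request that indexes `document`."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        document_url(index, doc_type, doc_id),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


movie = {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
}
request = build_index_request("movies", "movie", 1, movie)
print(request.get_method(), request.full_url)
# To send it to a running node: urllib.request.urlopen(request)
```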

Credit: http://joelabrahamsson.com/elasticsearch-101/

INDEX CREATION RESPONSE

UPDATE

curl -XPUT "http://localhost:9200/movies/movie/1" -d '{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'

Updated Version

New field

GET

curl -XGET "http://localhost:9200/movies/movie/1"

DELETE

curl -XDELETE "http://localhost:9200/movies/movie/1"

SEARCH

Search across all indexes and all types: http://localhost:9200/_search

Search across all types in the movies index: http://localhost:9200/movies/_search

Search explicitly for documents of type movie within the movies index: http://localhost:9200/movies/movie/_search

curl -XPOST "http://localhost:9200/_search" -d '{ "query": { "query_string": { "query": "kill" } } }'

SEARCH RESPONSE
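The three search scopes and the query_string body can be generated programmatically. The sketch below only builds the URL and the JSON body; the helper names are hypothetical, and nothing is sent over the network.

```python
import json

BASE_URL = "http://localhost:9200"  # local node address used on the slides


def search_url(index=None, doc_type=None):
    """Build the _search endpoint for all indexes, one index, or one type."""
    parts = [BASE_URL]
    if index:
        parts.append(index)
        if doc_type:
            parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)


def query_string_body(text):
    """Query DSL body for a query_string search, serialized as JSON."""
    return json.dumps({"query": {"query_string": {"query": text}}})


print(search_url())                   # all indexes, all types
print(search_url("movies"))           # all types in the movies index
print(search_url("movies", "movie"))  # only type movie in the movies index
print(query_string_body("kill"))
```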

Updating existing Mapping

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d '{
  "movie": {
    "properties": {
      "director": {
        "type": "multi_field",
        "fields": {
          "director": {"type": "string"},
          "original": {"type": "string", "index": "not_analyzed"}
        }
      }
    }
  }
}'
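To see why the mapping keeps both an analyzed and a not_analyzed copy of director, the sketch below imitates what each copy would hold in the index. It is a simplification under assumptions: the analyzed field is approximated by lowercasing and splitting into words, the sub-field is referred to as director.original, and the helper names are hypothetical.

```python
import re


def analyzed_terms(value):
    """Stand-in for an analyzed string field: lowercase word tokens."""
    return re.findall(r"\w+", value.lower())


def not_analyzed_terms(value):
    """A not_analyzed field keeps the whole value as one exact term."""
    return [value]


director = "Francis Ford Coppola"
fields = {
    "director": analyzed_terms(director),
    "director.original": not_analyzed_terms(director),
}


def term_matches(field, term):
    """Exact term lookup against the terms stored for a field."""
    return term in fields[field]


print(fields["director"])
print(fields["director.original"])
print(term_matches("director", "ford"))
print(term_matches("director.original", "Francis Ford Coppola"))
```

The analyzed copy suits full-text search on single words, while the not_analyzed copy suits exact matching and grouping on the full name.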

Cluster Architecture

Source: http://www.slideshare.net/DmitriBabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup

Index Request

Search Request

Who is using it

• GitHub

• StumbleUpon

• SoundCloud

• Datadog

• Stack Overflow

• Many more…

– http://www.elasticsearch.com/case-studies/

Logstash

• Open source, Apache licensed

• Written in JRuby

• Part of the Elasticsearch family

• http://logstash.net/

• Current version: 1.4.0

• This talk uses 1.3.3

Logstash

• Multiple inputs / multiple outputs

• Centralize logs

• Collect

• Parse

• Forward/Store

Architecture

Source: http://www.infoq.com/articles/review-the-logstash-book

Logstash – life of an event

• Input → Filters → Output

• Filters are processed in order of config file

• Outputs are processed in order of config file

• Input: Input stream

– File input (tail)

– Log4j

– Redis

– Syslog

– and many more…

• http://logstash.net/docs/1.3.3/

Logstash – life of an event

• Codecs : decoding log messages

• Json

• Multiline

• Netflow

• and many more…

• Filters : processing messages

• Date – Date format

• Grok – Regular expression based extraction

• Mutate – Change data type

• Output : storing the structured message

• Elasticsearch

• Mongodb

• Email

• Nagios

http://logstash.net/docs/1.3.3/
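The filter stage can be pictured as a list of functions applied to each event in order. The Python below is a rough analogy, not Logstash configuration or code: the regular expression stands in for a grok pattern, the field names and the sample log line are hypothetical, and only the three filter ideas listed above (extract, parse date, change type) are imitated.

```python
import re
from datetime import datetime

# Named groups play the role of a grok pattern (field names are hypothetical).
LINE_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) [^"]*" '
    r'(?P<response>\d{3}) (?P<bytes>\d+)'
)


def grok(event):
    """Extract structured fields from the raw message with a regex."""
    match = LINE_PATTERN.match(event["message"])
    if match:
        event.update(match.groupdict())
    return event


def date(event):
    """Parse the extracted timestamp string into a datetime."""
    if "timestamp" in event:
        event["@timestamp"] = datetime.strptime(
            event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
        )
    return event


def mutate(event):
    """Convert numeric fields from strings to integers."""
    for field in ("response", "bytes"):
        if field in event:
            event[field] = int(event[field])
    return event


# Filters run in the order they are listed, as in the config file.
FILTERS = [grok, date, mutate]


def process(line):
    event = {"message": line}
    for apply_filter in FILTERS:
        event = apply_filter(event)
    return event


# Hypothetical access-log line used only for this example.
sample = '127.0.0.1 - - [10/May/2014:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
event = process(sample)
print(event["clientip"], event["verb"], event["response"], event["bytes"])
```

Order matters here too: date and mutate only have something to work on because grok ran first and created the fields.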

Quick Start

Version 1.3.3 and earlier:

java -jar logstash-1.3.3-flatjar.jar agent -f agent.conf -- web

Version 1.4:

bin/logstash agent -f agent.conf

bin/logstash –web

basic-agent.conf:

input {
  tcp {
    type => "apache"
    port => 3333
  }
}
output {
  stdout { debug => true }
  elasticsearch { embedded => true }
}

Kibana

Source: http://www.slideshare.net/AmazeeAG/2014-0422-loggingwithlogstashbastianwidmercampusbern

Analytics

Analytics source: Kibana.org, based on Elasticsearch and Logstash

Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8

Thanks!

@rahuldausa on Twitter and SlideShare

http://www.linkedin.com/in/rahuldausa

Find this interesting?

Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
