ElasticSearch - Search in the Age of Clouds
Post on 20-Aug-2015
Speaker: Christian Meder
‣ CTO@inovex
Background
‣ open source (free software)
‣ Linux
‣ Web
‣ Java
‣ Android
Speaker: Bernhard Pflugfelder
‣ Big Data Engineer@inovex
‣ bpflugfelder@inovex.de
Background
‣ Lucene
‣ Solr
‣ Text Mining Technologies, Information Retrieval
‣ Hadoop
‣ Java
‣ Can you think of other scenarios where search applications will also do a good job?
‣ Recall the key capabilities of search technologies:
‣ Persistency
‣ Flexible data model
‣ Unstructured data, but not only
‣ Extremely quick access to data
‣ Horizontal scalability
There are plenty of application scenarios in which search technologies should be considered!
Document store Search applications
Open source search technologies
http://lucene.apache.org
http://lucene.apache.org/solr/
http://www.elasticsearch.org
Lucene is an open source, pure Java API for information retrieval
‣ Originally developed by Doug Cutting in 1999; joined the Apache Software Foundation (Jakarta) in 2001 and became a top-level project (TLP) in 2005
‣ Licensed under the Apache License 2.0
‣ Pure Java library, with ports to other platforms: Lucene.NET (http://lucenenet.apache.org), PyLucene (http://lucene.apache.org/pylucene/), and more: http://wiki.apache.org/lucene-java/LuceneImplementations
‣ Large and very active developer community, well documented and supported (38 active committers!)
‣ Current stable release: 4.2.1
‣ Widely used and adopted in commercial and non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy
Overview
Solr is a standalone enterprise search server and document store based on Lucene
‣ Created by Yonik Seeley at CNET Networks in 2004
‣ Entered the Apache Incubator in 2006, became TLP in 2007
‣ Licensed under the Apache License 2.0
‣ Seeley and others founded Lucid Imagination (later renamed LucidWorks)
‣ Large and very active developer community, well documented and supported (strong ties to the Lucene community as well)
‣ Current stable release: 4.2.1
‣ Widely used and adopted in commercial and non-commercial projects: http://wiki.apache.org/solr/PublicServers
Overview
Elasticsearch is a “distributed-from-scratch” search server based on Lucene
Created by Shay Banon, who made the first version public in February 2010:
Elasticsearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…). I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )
[http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]
Motivation
‣ Current stable version 0.20.6 works with Lucene 3.6
‣ Available version 0.90 RC2 includes Lucene 4.2.1 integration
‣ Licensed under the Apache License 2.0
‣ Small but growing group of core developers
‣ Strong support from respected Lucene committers
‣ Company elasticsearch.com founded in 2012
‣ By the people behind elasticsearch.org
‣ www.elasticsearch.com
Overview
‣ Code search runs on a dedicated cluster
‣ 26 storage nodes holding the searchable data
‣ 8 client nodes coordinating query requests
‣ Each storage node has 2 TB of SSD-based storage
‣ 17 TB of indexed data is stored in the cluster
‣ sharded across the cluster with a replication factor of 1
‣ which makes 34 TB of indexed data overall
GitHub
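The storage figures above follow from simple arithmetic: a replication factor of 1 means one extra copy of every primary shard. A minimal sketch of that calculation, using only the numbers from the slide; the per-node average is derived here (it assumes perfectly even shard distribution) and is not a figure reported in the talk.

```python
# Figures taken from the slide; everything else is derived for illustration.
PRIMARY_DATA_TB = 17      # indexed data held in primary shards
REPLICATION_FACTOR = 1    # one replica copy per primary shard
STORAGE_NODES = 26


def total_indexed_tb(primary_tb, replicas):
    """Primary data plus one full copy per replica."""
    return primary_tb * (1 + replicas)


total_tb = total_indexed_tb(PRIMARY_DATA_TB, REPLICATION_FACTOR)

# Assumption: shards are spread perfectly evenly across the storage nodes.
avg_per_node_tb = total_tb / STORAGE_NODES

print(f"total indexed data: {total_tb} TB")
print(f"average per storage node: {avg_per_node_tb:.2f} TB")
```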
‣ Question-and-answer website
‣ aggregates questions and answers by topic
‣ Sources are the web in general and social media
‣ Goals for search:
‣ low latency for queries
‣ increased relevancy of results
‣ evaluated Elasticsearch against Solr and Sphinx
‣ “After much benchmarking with our data set, we discovered that ElasticSearch was clearly the fastest of the possible search platforms we were considering.”
Quora
http://www.quora.com/Full-Text-Search-on-Quora/What-technology-does-Quora-use-for-its-full-text-search-infrastructure/answer/Adrien-Lucas-Ecoffet?srid=pilt&share=1
SoundCloud
http://bed-con.org/2013/wp-content/uploads/2013/04/Wie_SoundCloud_skaliert.pdf
Huffington Post
http://blogs.vmware.com/vfabric/2013/03/scaling-real-time-comments-huffpost-live-with-rabbitmq.html
‣ Scalable, High-Performance Indexing
‣ over 95GB/hour on modern hardware
‣ small RAM requirements
‣ incremental indexing as fast as batch indexing
‣ index size roughly 20-30% the size of text indexed
‣ Powerful, Accurate and Efficient Search Algorithms
‣ ranked searching -- best results returned first
‣ many powerful query types
‣ fielded searching (e.g., title, author, contents)
‣ date-range searching
‣ sorting by any field
‣ multiple-index searching with merged results
‣ allows simultaneous update and searching
[From http://lucene.apache.org/core/features.html]
Highlights
‣ Pure Java application
‣ Powered by Lucene
‣ Document-oriented
‣ Schema-less
‣ HTTP API with JSON In & Out
‣ Indexing / Updating
‣ Searching
‣ Administration / Monitoring
‣ Extendable by plugins
‣ Distribution is a fundamental paradigm of Elasticsearch
Overview
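To make "HTTP API with JSON In & Out" concrete, here is a minimal sketch that builds, but deliberately does not send, an indexing request and a search request. The host and port, the index name "gutenberg", the type name "books" and the sample document are assumptions chosen for illustration; the helper `json_request` is hypothetical and not part of any Elasticsearch client.

```python
import json
from urllib.request import Request

# Assumption: a single local node listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def json_request(method, path, body=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Indexing: PUT a JSON document under /<index>/<type>/<id>.
index_req = json_request(
    "PUT",
    "/gutenberg/books/1",
    {"title": "Moby Dick", "author": "Herman Melville"},
)

# Searching: send a JSON query body to the _search endpoint.
search_req = json_request(
    "POST",
    "/gutenberg/books/_search",
    {"query": {"match": {"title": "moby"}}},
)

for req in (index_req, search_req):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Against a running node, each request could be sent with `urllib.request.urlopen(req)`; the response body is JSON as well.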
Architecture
[Figure: a cluster of one master node and two further nodes; primary and replica shards numbered 1 to 3 are distributed across the nodes]
‣ Index distribution by auto sharding
‣ Automatic replication and balancing
‣ Fault tolerant + high availability
‣ Cluster building & management
‣ node detection through zen discovery
‣ nodes communicate via unicast / multicast
‣ automatic master election
‣ master / data node roles can be influenced by configuration
‣ Master is responsible for
‣ routing the search request
‣ including new nodes into the cluster
‣ Index / query routing (automatic / individual)
Architecture
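The sharding and replication ideas above can be sketched in a few lines. Document routing follows the rule "hash of the routing value modulo the number of primary shards"; the hash used here (`zlib.crc32`) is only a stand-in and not the function Elasticsearch uses internally. The round-robin `allocate` function is a toy model of "a replica never shares a node with its primary", not the real allocation algorithm, and the node names are hypothetical.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Pick a primary shard: hash(routing value) modulo the number of primaries."""
    # zlib.crc32 is a stand-in hash, NOT the one Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def allocate(num_primary_shards, num_replicas, nodes):
    """Toy round-robin allocation: copies of one shard never share a node."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to separate all copies of a shard")
    layout = {node: [] for node in nodes}
    for shard in range(num_primary_shards):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
layout = allocate(num_primary_shards=3, num_replicas=1, nodes=nodes)
for node, shards in layout.items():
    print(node, shards)

# By default the document id serves as the routing value.
print("document 42 is routed to shard", shard_for("42", 3))
```

Because the shard is derived from the number of primary shards, that number is fixed when the index is created, while the number of replicas can be changed later.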
‣ Define a mapping for type books
‣ Retrieve the current mapping for type books
Schema-less, but
# echo '{ "books" : { "properties" : {
    "id" : { "type" : "string" }, "title" : { "type" : "string" },
    "author" : { "type" : "string" }, "subject" : { "type" : "string" },
    "view_count" : { "type" : "integer" },
    "created" : { "type" : "date", "format" : "dateOptionalTime" }
  } } }' > book.json
# curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json
# curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'
‣ Search on terms, numeric values, dates, numeric ranges, date/time ranges
‣ Lots of query types
‣ terms, phrases, fuzzy, wildcard, ranges
‣ faceting, filtering
‣ Geospatial search via the GeoShape query
‣ Configurable caching for
‣ Filter queries
‣ Field values
‣ Near-real-time (NRT) search with a separate API
‣ Sorting, Highlighting
‣ MoreLikeThis
‣ Multi Tenancy
Search highlights
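As an illustration of how several of these features combine in one request, the following sketch builds a search body with a full-text match, a numeric range, a date range, sorting and highlighting. The field names (title, view_count, created) are borrowed from the books mapping in the slides; the search term and the range values are made up for the example.

```python
import json

# A search body combining several of the listed features.
search_body = {
    "query": {
        "bool": {
            "must": [
                # full-text match on a single field
                {"match": {"title": "whale"}},
                # numeric range
                {"range": {"view_count": {"gte": 100}}},
                # date range
                {"range": {"created": {"gte": "2000-01-01", "lt": "2013-01-01"}}},
            ]
        }
    },
    # sorting by a field
    "sort": [{"view_count": {"order": "desc"}}],
    # highlight matches in the title field
    "highlight": {"fields": {"title": {}}},
    "size": 10,
}

print(json.dumps(search_body, indent=2))
```

The body would be sent to a `_search` endpoint, for example `/gutenberg/books/_search` for the index and type used in the mapping example.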
‣ Gateway module stores cluster metadata to:
‣ Local FS, Shared FS, Hadoop, Amazon S3
‣ River:
‣ Pluggable service that continuously pulls data
‣ Managed via a dedicated REST endpoint
‣ Implementations for CouchDB, MongoDB, JDBC, Solr, …
‣ Bulk indexing
‣ Default: single document indexing
‣ Bulk indexing via dedicated REST endpoints
‣ Lucene analyzers are specified in elasticsearch.yml or via the API
Some more features
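Bulk indexing differs from single-document indexing mainly in the request body: the `_bulk` endpoint expects newline-delimited JSON, with one action line followed by one source line per document. A minimal sketch of building such a body; the helper name `bulk_body`, the index and type names and the sample documents are assumptions for illustration.

```python
import json


def bulk_body(index, doc_type, docs):
    """Serialize (id, source) pairs into the newline-delimited bulk format."""
    lines = []
    for doc_id, source in docs:
        # Action line: what to do and where.
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        ))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    ("1", {"title": "Moby Dick", "author": "Herman Melville"}),
    ("2", {"title": "Dracula", "author": "Bram Stoker"}),
]
body = bulk_body("gutenberg", "books", docs)
print(body, end="")
```

The resulting string would be sent as the body of a POST request to `/_bulk`.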
‣ Query types such as term, terms, match, wildcard, fuzzy, range, …
‣ Multi Search
‣ Get
‣ Multi Get
‣ Filter
‣ Facets
‣ Highlighting
‣ Suggest
‣ MoreLikeThis
‣ Index boosting
‣ Explain
‣ Percolate
Search API
‣ Create, Delete, Exists, Open, Close, Optimize, Refresh, Flush, Settings
‣ Index templates (mappings + settings)
‣ Get, Put, Delete Mapping
‣ Get, update settings
‣ Snapshot
‣ Aliases
‣ Warmers
‣ Statistics, Status
Indices API
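Aliases are a good example of what the Indices API enables: an alias can be moved from one index to another in a single request, so clients that query the alias never notice a reindex. A minimal sketch of the request body for the `_aliases` endpoint; the helper `swap_alias` and the index names `books_v1` and `books_v2` are hypothetical.

```python
import json


def swap_alias(alias, old_index, new_index):
    """Build an _aliases request body that moves an alias in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical index names: the alias "books" moves from version 1 to version 2.
alias_body = swap_alias("books", "books_v1", "books_v2")
print(json.dumps(alias_body, indent=2))
```

The body would be sent as a POST request to `/_aliases`.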
‣ Live configuration of cluster settings
‣ minimum master nodes
‣ cache sizes
‣ routing
‣ allocation
‣ moving shards
‣ moving replicas
‣ Cluster health & status
‣ Nodes info & stats, Shutdown all / specific nodes
Cluster API
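The "minimum master nodes" setting is usually chosen as a quorum of the master-eligible nodes, so that a cluster split in two cannot elect two masters. A minimal sketch of that rule of thumb and of the settings body that would carry the value; the cluster size of three master-eligible nodes is an assumption for the example.

```python
import json


def minimum_master_nodes(master_eligible_nodes):
    """Quorum rule of thumb: more than half of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


# Assumption: a cluster with three master-eligible nodes.
settings_body = {
    "persistent": {
        "discovery.zen.minimum_master_nodes": minimum_master_nodes(3)
    }
}
print(json.dumps(settings_body, indent=2))
```

The body would be sent as a PUT request to `/_cluster/settings`; "persistent" settings survive a full cluster restart, whereas "transient" ones do not.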
+ Elasticsearch feels lightweight
+ Simple but effective architecture
+ Ease of use, even when using distributed search
+ High maturity, even though ES is young
+ High-performance search (at least based on current benchmarks seen)
+ Modern technologies used (HTTP, JSON, NoXML, Java, Guava)
- Still a small community and a small group of core developers
- Missing data connectors (e.g. dataimporthandler)
- Missing search features: grouping and search result clustering
- Fewer query types
- Fewer options for boosting (e.g. function queries)
- Fewer analyzers
Pros & Cons
‣ The world is becoming data-driven and user-driven
‣ large data volumes
‣ multiple sources
‣ many users need access
‣ Therefore search technologies such as Elasticsearch are becoming important:
‣ Easy aggregation of data from multiple sources
‣ Provide a unified access layer through search
‣ Scalable regarding data volume and users
‣ Highly configurable
‣ ElasticSearch is easy to use, distributed and scalable, and its search is fast
Wrap up