Discovering python search engine José Manuel Ortega

Discovering python search engines

Jan 22, 2018

Transcript
Page 1: Discovering python search engines

Discovering python search engines
José Manuel Ortega

Page 2: Discovering python search engines

● Introduction to search engines
● ElasticSearch, Whoosh, django-haystack
● ElasticSearch example
● Other solutions & tools
● Conclusions

Page 3: Discovering python search engines

Search engines

Page 4: Discovering python search engines

Search engines

● Document based
● A document is the unit of searching in a full-text search system.
● A document can be a JSON object or a Python dictionary

Page 5: Discovering python search engines

Core concepts

Page 6: Discovering python search engines

● Index: named collection of documents that have similar characteristics (like a database)
● Type: logical partition of an index that contains documents with common fields (like a table)
● Document: basic unit of information (like a row)
● Mapping: field properties (datatype, token extraction). Includes information about how fields are stored in the index

Page 7: Discovering python search engines

● Relevance: the algorithms used to rank the results based on the query
● Corpus: the collection of all documents in the index
● Segments: sharded data storing the inverted index. They allow searching the index efficiently

Page 8: Discovering python search engines

Inverted index
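An inverted index maps each term to the documents that contain it, so a query only has to look up its terms instead of scanning every document. The following is a minimal sketch of that idea, assuming a plain whitespace tokenizer and lowercase normalization (real engines such as Lucene add analyzers, term positions and compression). The function names and sample documents are illustrative only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One posting set per query term; a missing term yields an empty set.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents.
docs = {
    1: "Python search engines",
    2: "Elasticsearch is a search server",
    3: "Whoosh is a pure Python library",
}
index = build_inverted_index(docs)
```

Calling `search_all(index, "search python")` intersects the posting sets of both terms, which is the basic operation behind a multi-term AND query.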

Page 9: Discovering python search engines
Page 10: Discovering python search engines
Page 11: Discovering python search engines

ElasticSearch

Page 12: Discovering python search engines
Page 13: Discovering python search engines
Page 14: Discovering python search engines

● Open source search server based on Apache Lucene
● Written in Java
● Cross-platform
● Communication with the search server is done through an HTTP REST API
● curl -X<GET|POST|PUT|DELETE> http://localhost:9200/<index>/<type_document>/<id>

Page 15: Discovering python search engines

● You can add a document without creating an index

● ElasticSearch will create the index, mapping type and fields automatically

● ElasticSearch will infer the data types based on the document’s data
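As a rough illustration of adding a document without creating the index first, the sketch below builds the corresponding REST request with the Python standard library. It only constructs the request and does not send it. The index and type names follow the `myindex/mydocument` pattern used elsewhere in these slides, the document fields are hypothetical, and the `<index>/<type>/<id>` URL layout reflects the Elasticsearch versions current when the talk was given (releases from 7.0 on moved away from custom type names).

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        host="http://localhost:9200"):
    """Build (but do not send) the PUT request that stores one document."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical document; with dynamic mapping the server would infer
# a text field for "title" and a numeric field for "year".
request = build_index_request(
    "myindex", "mydocument", 1,
    {"title": "Discovering python search engines", "year": 2018},
)
```

With a server listening on localhost:9200, passing `request` to `urllib.request.urlopen` would send it.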

Page 16: Discovering python search engines
Page 17: Discovering python search engines

● TF-IDF (Term Frequency-Inverse Document Frequency)
● TF-IDF = TF * IDF
● TF = number of appearances of the term in the document
● IDF = log(N / DF)
● N = total_document_count
● DF = number of documents in which the term appears
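The formula can be sketched in a few lines of Python. This is a simplified illustration, assuming documents are already tokenized into lists of terms and that TF is counted within the document being scored (the usual definition). It is not the exact scoring function Elasticsearch applies, and the corpus below is made up.

```python
import math


def tf(term, doc):
    """Number of times the term appears in one tokenized document."""
    return doc.count(term)


def idf(term, corpus):
    """log(N / DF), where DF is the number of documents containing the term."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # term absent from the corpus: no contribution
    return math.log(len(corpus) / df)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Hypothetical tokenized corpus with N = 3 documents.
corpus = [
    ["python", "search"],
    ["python", "engine"],
    ["search", "engine", "engine"],
]
score = tf_idf("engine", corpus[2], corpus)  # TF = 2, DF = 2, N = 3
```

A term that appears in every document gets IDF = log(1) = 0, which is why very common words contribute little to the ranking.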

Page 18: Discovering python search engines
Page 19: Discovering python search engines

Creating an Index
curl -XPUT 'localhost:9200/myindex' -H 'Content-Type: application/json' -d '{
  "settings": {..},
  "mappings": {..}
}'

Page 20: Discovering python search engines
Page 21: Discovering python search engines
Page 22: Discovering python search engines
Page 23: Discovering python search engines

Searching a document
curl -XGET 'localhost:9200/myindex/mydocument/_search?q=elasticSearch'

Query DSL
curl -XGET 'localhost:9200/myindex/mydocument/_search?pretty' -d '{
  "query": {
    "match": {
      "_all": "elasticSearch"
    }
  }
}'

Page 24: Discovering python search engines
Page 25: Discovering python search engines

Searching a document
● Search can get much more complex
  ○ Multiple terms
  ○ Multi-match (match query on specific fields)
  ○ Bool (true, false)
  ○ Range
  ○ RegExp
  ○ GeoPoint, GeoShapes
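To give an idea of how these query types combine, the sketch below builds a Query DSL body as a plain Python dictionary: a `multi_match` clause inside a `bool` query, optionally narrowed by a `range` filter. The field names (`title`, `description`, `year`) and the helper name are assumptions made for the example.

```python
import json


def build_search_body(text, fields, year_from=None):
    """Combine a multi_match clause with an optional range filter."""
    must = [{"multi_match": {"query": text, "fields": fields}}]
    filters = []
    if year_from is not None:
        # Keep only documents whose (hypothetical) "year" field is >= year_from.
        filters.append({"range": {"year": {"gte": year_from}}})
    return {"query": {"bool": {"must": must, "filter": filters}}}


body = build_search_body("elasticSearch", ["title", "description"], year_from=2015)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body of a `_search` call.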

Page 26: Discovering python search engines

ElasticSearch python client
● The official low-level client is elasticsearch-py
  ○ pip install elasticsearch
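A minimal sketch of how the client might be used for indexing and searching follows. The helpers take the client as a parameter so they stay independent of a running server. The `doc_type` and `body` keyword arguments match the client releases from around the time of this talk; recent releases dropped `doc_type` and prefer `document=` and `query=`, so treat the exact keywords as an assumption tied to the installed version. The helper names are illustrative.

```python
def connect(url="http://localhost:9200"):
    """Create a client; imported lazily so the helpers work without the package."""
    from elasticsearch import Elasticsearch  # pip install elasticsearch
    return Elasticsearch([url])


def index_document(es, index, doc_type, doc_id, document):
    """Store one document under the given index, type and id."""
    return es.index(index=index, doc_type=doc_type, id=doc_id, body=document)


def search_text(es, index, field, text):
    """Run a match query and return the stored source of each hit."""
    body = {"query": {"match": {field: text}}}
    response = es.search(index=index, body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With a server running, `es = connect()` followed by `index_document(es, "myindex", "mydocument", 1, {...})` and `search_text(es, "myindex", "title", "python")` would index a document and query it back.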

Page 27: Discovering python search engines

ElasticSearch-py API

Page 28: Discovering python search engines

ElasticSearch-py API

Page 29: Discovering python search engines
Page 30: Discovering python search engines
Page 31: Discovering python search engines
Page 32: Discovering python search engines
Page 33: Discovering python search engines
Page 34: Discovering python search engines

Geo queries
● Elasticsearch supports two types of geo fields
  ○ geo_point (lat, lon)
  ○ geo_shape (points, lines, polygons)
● Perform geographical searches
  ○ Finding points of interest and GPS coordinates
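As an illustration of a geographical search, the sketch below builds a `geo_point` mapping fragment and a `geo_distance` filter as Python dictionaries. The field name `location`, the coordinates and the helper names are hypothetical.

```python
def geo_point_mapping(field):
    """Mapping fragment declaring a field as a geo_point."""
    return {"properties": {field: {"type": "geo_point"}}}


def geo_distance_query(field, lat, lon, distance):
    """Match documents whose geo_point lies within `distance` of (lat, lon)."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


# Hypothetical search: points of interest within 5 km of a coordinate.
mapping = geo_point_mapping("location")
query = geo_distance_query("location", 40.4168, -3.7038, "5km")
```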

Page 35: Discovering python search engines
Page 36: Discovering python search engines
Page 37: Discovering python search engines

https://github.com/jmortega/python_discover_search_engine

Page 38: Discovering python search engines
Page 39: Discovering python search engines
Page 40: Discovering python search engines
Page 41: Discovering python search engines
Page 42: Discovering python search engines
Page 43: Discovering python search engines
Page 44: Discovering python search engines
Page 45: Discovering python search engines
Page 46: Discovering python search engines
Page 47: Discovering python search engines
Page 48: Discovering python search engines

Whoosh

Page 49: Discovering python search engines

● Pure-python full-text indexing and searching library
● Library of classes and functions for indexing text and then searching the index.
● It allows you to develop custom search engines for your content.
● Mainly focused on index and search definition using schemas
● Python 2.5 and Python 3

Page 50: Discovering python search engines

Schema

Page 51: Discovering python search engines

Create index and insert document

Page 52: Discovering python search engines

Searching single field

Page 53: Discovering python search engines

Searching multiple field

Page 54: Discovering python search engines

Django-haystack

Page 55: Discovering python search engines
Page 56: Discovering python search engines

● Multiple backends (you have a Solr & a Whoosh index, or a master Solr & a slave Solr, etc.)
● An Elasticsearch backend
● Big query improvements
● Geospatial search (Solr & Elasticsearch only)
● The addition of Signal Processors for better control
● Input types for improved control over queries
● Rich Content Extraction in Solr

Page 57: Discovering python search engines
Page 58: Discovering python search engines
Page 59: Discovering python search engines
Page 60: Discovering python search engines
Page 61: Discovering python search engines
Page 62: Discovering python search engines

● Create the index
  ○ Run ./manage.py rebuild_index to create the new search index.
● Update the index
  ○ ./manage.py update_index will add new entries to the index.
  ○ ./manage.py rebuild_index will recreate the index from scratch.

Page 63: Discovering python search engines

Other solutions

Page 64: Discovering python search engines

Other solutions
● https://xapian.org
● https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/search/
● https://www.postgresql.org/docs/9.6/static/textsearch.html

Page 65: Discovering python search engines

pysolr

Page 66: Discovering python search engines
Page 67: Discovering python search engines

Conclusions

Page 68: Discovering python search engines

● Elasticsearch's Query DSL syntax is really flexible and makes it pretty easy to write complex queries; other solutions don't have an equivalent
● Elasticsearch is faster and more flexible than other solutions such as PostgreSQL full-text search or Solr
● Aggregations in ES, for example for searching by category, are another interesting feature that other solutions lack
● Solr requires more configuration than ES
● Whoosh is suitable for a small project. Limited scalability for search and indexing.
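As a rough illustration of the aggregation point, the sketch below builds a `terms` aggregation that counts documents per category and reads the buckets out of a response. The field name `category`, the aggregation name `by_category` and the sample response are assumptions for the example, not output from a real cluster.

```python
def category_counts_body(field="category", size=10):
    """Request body that returns no hits, only per-value document counts."""
    return {
        "size": 0,
        "aggs": {"by_category": {"terms": {"field": field, "size": size}}},
    }


def read_buckets(response):
    """Turn the aggregation buckets into a {value: count} dictionary."""
    buckets = response["aggregations"]["by_category"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hypothetical response fragment in the shape returned for a terms aggregation.
sample_response = {
    "aggregations": {
        "by_category": {
            "buckets": [
                {"key": "python", "doc_count": 12},
                {"key": "search", "doc_count": 7},
            ]
        }
    }
}
counts = read_buckets(sample_response)
```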

Page 69: Discovering python search engines

Other tools

Page 70: Discovering python search engines
Page 71: Discovering python search engines
Page 72: Discovering python search engines
Page 73: Discovering python search engines
Page 74: Discovering python search engines
Page 75: Discovering python search engines

References
● http://elasticsearch-py.readthedocs.io/en/master/
● https://whoosh.readthedocs.io/en/latest
● http://django-haystack.readthedocs.io/en/master/
● http://solr-vs-elasticsearch.com/
● https://wiki.apache.org/solr/SolPython
● https://github.com/django-haystack/pysolr

Page 76: Discovering python search engines
Page 77: Discovering python search engines
Page 78: Discovering python search engines

Thanks!
jmortega.github.io

@jmortegac