Real time fulltext search with Sphinx Adrian Nuta // Sphinxsearch // 2013
May 06, 2015

Adrian Nuta

My talk about real-time indexing and searching with Sphinx. It was given at the 2013 Froscon in Sankt Augustin, Germany.
Transcript
Page 1: Real time fulltext search with sphinx

Real time fulltext search

with Sphinx

Adrian Nuta // Sphinxsearch // 2013

Page 2: Real time fulltext search with sphinx

Quick intro

Sphinx search

• high performance fulltext search engine

• written in C++

• serving searches since 2001

• can work on any modern architecture

• distributed under GPL2 licence

Page 3: Real time fulltext search with sphinx

Why a search engine?

• performance
  o a search engine delivers searches faster and with fewer resources

• quality of search
  o built-in FTS in databases doesn't offer advanced search options

• independent FTS engines offer speed not only for fulltext searches, but also for other types, like geo or faceted searches

Page 4: Real time fulltext search with sphinx

Classic way of indexing in Sphinx

on-disk (classic) method:

• uses a data source which is indexed

• to update the index you need to reindex

• in addition to the main index, a secondary (delta) index can be used to reindex only the latest changes

• easy, because indexing doesn't require changes in the application, but:

• reindexing, even of a delta, can put pressure on the data source and the system

Page 5: Real time fulltext search with sphinx

Real time indexing in Sphinx

• index has no data source

• everything that needs to be indexed must be added to the index manually

• you can add/update/remove at any time

• compared to the classic method, RT requires changes in the application

• performance is the same or nearly the same as a classic index

• only specific requirement: workers = threads

Page 6: Real time fulltext search with sphinx

Structures

Page 7: Real time fulltext search with sphinx

RealTime index definition

index rt {
    type           = rt
    # a real RT index also needs a path directive (omitted on the slide)
    rt_field       = title
    rt_field       = content
    rt_attr_uint   = user_id
    rt_attr_string = title
    rt_attr_json   = metadata
}

Page 8: Real time fulltext search with sphinx

Schema - Fields

rt_field - fulltext field, raw text is not stored

Tokenization features: wildcarding (prefix or infix), morphology, custom charset definition, stopwords, synonyms, segmentation, HTML stripping, paragraph/sentence detection etc.

Page 9: Real time fulltext search with sphinx

Schema - Attributes

• rt_attr_uint & rt_attr_bigint

• rt_attr_bool

• rt_attr_float

• rt_attr_multi & rt_attr_multi_64 - integer sets (MVA)

• rt_attr_timestamp

• rt_attr_string - actual text stored, kept in memory, used only for display, sorting and grouping.

• rt_attr_json - full support for JSON documents

Page 10: Real time fulltext search with sphinx

Content manipulation

Page 11: Real time fulltext search with sphinx

Quick intro to SphinxQL

• our SQL dialect

• any MySQL client can be used to connect to Sphinx

• MySQL server is not required!

• full document updates are only possible with SphinxQL

• to enable it, add to the searchd section of the config:

listen = host:port:mysql41

Page 12: Real time fulltext search with sphinx

Content insert

mysql> INSERT INTO rt (id, title, content, user_id, metadata)
       VALUES (100, 'My title', 'Some long content to search', 10,
               '{"image_id":1,"props":[20,30,40]}');

Page 13: Real time fulltext search with sphinx

Full content replace

mysql> REPLACE INTO rt (id, title, content, user_id, metadata)
       VALUES (100, 'My title', 'Some long content to search', 10,
               '{"image_id":1,"props":[20,30,40]}');

• needed for text field, JSON and string attribute updates
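As an illustration of how an application might assemble such a full-document REPLACE, here is a minimal Python sketch. The helper names `quote` and `build_replace` are hypothetical, and the escaping is deliberately simplified (backslashes and single quotes only); a real application would more likely rely on its MySQL client library for quoting.

```python
import json

def quote(value):
    """Render a Python value as a SphinxQL literal (simplified escaping)."""
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, (dict, list)):
        # JSON attributes are sent as a compact JSON string
        value = json.dumps(value, separators=(",", ":"))
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"

def build_replace(index, doc):
    """Build a REPLACE statement from a dict of column -> value."""
    columns = ",".join(doc)
    values = ",".join(quote(v) for v in doc.values())
    return "REPLACE INTO {} ({}) VALUES ({})".format(index, columns, values)

# The same document as on the slide; every field is sent again,
# because REPLACE overwrites the whole document.
doc = {
    "id": 100,
    "title": "My title",
    "content": "Some long content to search",
    "user_id": 10,
    "metadata": {"image_id": 1, "props": [20, 30, 40]},
}
statement = build_replace("rt", doc)
print(statement)
```

The resulting string can be sent over any MySQL-protocol connection to searchd.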

Page 14: Real time fulltext search with sphinx

Updating numerics

• For numeric attributes, including MVA:

mysql> UPDATE rt SET user_id = 10 WHERE id = 100;

• For numeric JSON elements it's possible to do in-place updates:

mysql> UPDATE rt SET metadata.image_id = 1234 WHERE id = 100;

Page 15: Real time fulltext search with sphinx

Deleting

mysql> DELETE FROM rt WHERE id = 100;

mysql> DELETE FROM rt WHERE user_id > 100;

mysql> TRUNCATE RTINDEX rt;

• TRUNCATE RTINDEX empties the memory shard, deletes all disk shards and releases the index binlogs

Page 16: Real time fulltext search with sphinx

Adding new attributes

mysql> ALTER TABLE rt ADD COLUMN gid INTEGER;

• only for int/bigint/float/bool attributes for now

Page 17: Real time fulltext search with sphinx

Searching

Page 18: Real time fulltext search with sphinx

Searching

• no difference between searching an RT index and a classic index

• dict = keywords is required for wildcard search

Page 19: Real time fulltext search with sphinx

Relevancy ranking

• built-in rankers:
  o proximity_bm25 (default)
  o none, matchany, wordcount, fieldmask, bm25

• custom ranker - create your own ranking expression

example:

ranker = proximity_bm25

is the same as

ranker = expr('sum(lcs*user_weight)*1000+bm25')
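To make the expression concrete, the arithmetic behind `sum(lcs*user_weight)*1000+bm25` can be worked through in a few lines of Python. This is only a sketch of the formula, not of how Sphinx computes the factors; the per-field `lcs` values, the weights and the `bm25` value below are made-up inputs.

```python
def proximity_bm25(fields, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25.

    fields: iterable of (lcs, user_weight) pairs, one per matched field.
    bm25:   document-level BM25 value (hypothetical input here).
    """
    return sum(lcs * weight for lcs, weight in fields) * 1000 + bm25

# Hypothetical document: title matched with lcs=2 and weight 10,
# content matched with lcs=1 and weight 1.
fields = [(2, 10), (1, 1)]
weight = proximity_bm25(fields, bm25=650)
print(weight)  # (2*10 + 1*1) * 1000 + 650 = 21650
```

The multiplication by 1000 keeps the proximity part dominant, with bm25 acting as a tie-breaker between documents that have equal phrase proximity.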

Page 20: Real time fulltext search with sphinx

Tokenization settings example

index rt {
    charset_type  = utf-8
    dict          = keywords
    min_word_len  = 2
    min_infix_len = 3
    morphology    = stem_en
    enable_star   = 1
}

Page 21: Real time fulltext search with sphinx

Operators on fulltext fields

• Boolean: hello | world, hello !world

• phrasing: "hello world"

• proximity: "hello world"~10

• quorum: "world is a beautiful place"/3

• exact form: =cats and =dogs

• strict order: cats << and << dogs

• zone limit: ZONE:(h2,h4) cats and dogs

• SENTENCE: all SENTENCE words SENTENCE "in one sentence"

• PARAGRAPH: "this search" PARAGRAPH "is fast"

• selected fields only: @(title,body) hello world

• excluded fields: @!(title,body) hello world
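An application usually assembles these query strings from user input. The following Python sketch shows one way to compose a few of the operators above; the helper names are hypothetical, and the sketch does not escape special characters inside the words, which real code would have to do.

```python
def phrase(words):
    """Phrase operator: "hello world"."""
    return '"' + " ".join(words) + '"'

def proximity(words, distance):
    """Proximity operator: "hello world"~10."""
    return "{}~{}".format(phrase(words), distance)

def quorum(words, threshold):
    """Quorum operator: at least `threshold` of the words must match."""
    return "{}/{}".format(phrase(words), threshold)

def in_fields(fields, query, exclude=False):
    """Field limit: @(a,b) query, or @!(a,b) query to exclude fields."""
    prefix = "@!" if exclude else "@"
    return "{}({}) {}".format(prefix, ",".join(fields), query)

print(proximity(["hello", "world"], 10))
print(quorum("world is a beautiful place".split(), 3))
print(in_fields(["title", "body"], "hello world"))
print(in_fields(["title", "body"], "hello world", exclude=True))
```

The returned strings are what goes inside MATCH('...') in SphinxQL or into the query argument of the API.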

Page 22: Real time fulltext search with sphinx

Using API

<?php

require("sphinxapi.php");

$cl = new SphinxClient();

$res = $cl->Query('search me now','rt');

print_r($res);

Official: PHP, Python, Ruby, Java, C

Unofficial: JS (Node.js), Perl, C++, Haskell, .NET

Page 23: Real time fulltext search with sphinx

Using SphinxQL

mysql> SELECT * FROM rt
       WHERE MATCH('"search me fuzzy"~10') AND featured = 1
       LIMIT 0,20;

mysql> SELECT * FROM rt
       WHERE MATCH('"search me fuzzy"~10 @tag computers') AND featured = 1
       GROUP BY user_id ORDER BY title ASC LIMIT 30,60
       OPTION field_weights=(title=10,content=1),
       ranker=expr('sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25');

Page 24: Real time fulltext search with sphinx

Boolean filtering

mysql> SELECT *, views > 10 OR category = 4 AS cond
       FROM rt
       WHERE MATCH('"search me proximity"~10') AND featured = 1 AND cond = 1
       GROUP BY user_id ORDER BY title ASC
       LIMIT 30,60 OPTION ranker=sph04;

Page 25: Real time fulltext search with sphinx

Geo search

mysql> SELECT *, GEODIST(lat, long, 0.71147, -1.29153) AS distance
       FROM rt WHERE distance < 1000 ORDER BY distance ASC;

mysql> SELECT *, GEODIST(lat, long, 40.76439, -73.99976,
                         {in=degrees,out=miles,method=adaptive}) AS distance
       FROM rt WHERE distance < 10 ORDER BY distance ASC;
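For intuition about what the first query filters on, here is a small Python sketch of a great-circle distance using the haversine formula, with coordinates in radians and the result in meters. It is an illustration only: the Earth radius constant and the sample documents are assumptions, and Sphinx's own GEODIST implementation may use a different constant or method.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius, in meters

def geodist(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in meters; inputs in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within(docs, lat, lon, radius_m):
    """Return (distance, id) pairs inside the radius, nearest first."""
    hits = [(geodist(d["lat"], d["long"], lat, lon), d["id"]) for d in docs]
    return sorted(h for h in hits if h[0] < radius_m)

# Hypothetical documents around the anchor point used on the slide.
docs = [
    {"id": 1, "lat": 0.71147, "long": -1.29153},
    {"id": 2, "lat": 0.71147 + 0.0001, "long": -1.29153},
    {"id": 3, "lat": 0.71147 + 0.001, "long": -1.29153},
]
hits = within(docs, 0.71147, -1.29153, 1000)
print(hits)
```

With these inputs, documents 1 and 2 fall inside the 1000 m radius and document 3 (roughly 6.4 km away) is filtered out, mirroring `WHERE distance < 1000 ORDER BY distance ASC`.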

Page 26: Real time fulltext search with sphinx

Multi-queries

mysql> DELIMITER \\

mysql> SELECT *, COUNT(*) AS counter FROM rt WHERE MATCH('search me')
       GROUP BY property_one ORDER BY counter DESC;
       SELECT *, COUNT(*) AS counter FROM rt WHERE MATCH('search me')
       GROUP BY property_two ORDER BY counter DESC;
       SELECT *, COUNT(*) AS counter FROM rt WHERE MATCH('search me')
       GROUP BY property_three ORDER BY counter DESC;
       \\

• used for faceting
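A facet batch like the one above is repetitive, so an application would typically generate it. The Python sketch below builds the semicolon-separated batch from a list of attribute names; the function name `facet_batch` is hypothetical, and the MATCH text is inserted without escaping, which is only acceptable for trusted input.

```python
def facet_batch(index, match, properties):
    """Build one semicolon-separated batch with a GROUP BY query per facet."""
    template = ("SELECT *, COUNT(*) AS counter FROM {index} "
                "WHERE MATCH('{match}') GROUP BY {prop} ORDER BY counter DESC")
    return ";".join(
        template.format(index=index, match=match, prop=prop)
        for prop in properties
    )

batch = facet_batch(
    "rt", "search me", ["property_one", "property_two", "property_three"]
)
for query in batch.split(";"):
    print(query)
```

Because all queries share the same MATCH clause, sending them as one batch gives the engine the chance to treat them as a multi-query rather than three unrelated searches.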

Page 27: Real time fulltext search with sphinx

Internals

Page 28: Real time fulltext search with sphinx

Internal architecture

Each RT index is a sharded index consisting of:

• one memory shard for latest content

• one or more disk shards

Page 29: Real time fulltext search with sphinx

Internal shards management

rt_mem_limit = maximum size of the memory shard

When full, the memory shard is flushed to disk as a new disk shard.

• OPTIMIZE INDEX rt - merges all disk shards into one
  o merging too intensive? throttle it with rt_merge_iops and rt_merge_maxiosize
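The flush-and-merge life cycle described above can be modelled with a toy Python class. This is a conceptual sketch only: the class `ToyRTIndex` is hypothetical, the limit is counted in documents rather than bytes, and none of it reflects Sphinx's actual on-disk format.

```python
class ToyRTIndex:
    """Toy model: one memory shard plus a list of disk shards."""

    def __init__(self, mem_limit):
        self.mem_limit = mem_limit  # stands in for rt_mem_limit (in documents)
        self.memory = {}            # memory shard: doc id -> text
        self.disk = []              # disk shards, oldest first

    def replace(self, doc_id, text):
        self.memory[doc_id] = text
        if len(self.memory) >= self.mem_limit:
            self.disk.append(self.memory)  # flush as a new disk shard
            self.memory = {}

    def optimize(self):
        """Merge all disk shards into one, like OPTIMIZE INDEX."""
        merged = {}
        for shard in self.disk:            # newer shards override older copies
            merged.update(shard)
        self.disk = [merged] if merged else []

    def search(self, word):
        seen, hits = set(), []
        for shard in [self.memory] + self.disk[::-1]:  # newest shard first
            for doc_id, text in shard.items():
                if doc_id not in seen:     # only the newest version counts
                    seen.add(doc_id)
                    if word in text.split():
                        hits.append(doc_id)
        return sorted(hits)

index = ToyRTIndex(mem_limit=2)
for i in range(1, 6):
    index.replace(i, "doc number %d" % i)
shards_before = len(index.disk)            # two flushed shards, one doc in memory
index.replace(1, "updated text")           # fills memory again, third flush
index.optimize()
print(shards_before, len(index.disk), index.search("doc"))
```

Every search has to visit the memory shard and each disk shard, which is why merging many small disk shards into one can be worthwhile.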

Page 30: Real time fulltext search with sphinx

Binlog support

Sphinx supports binlogs, so the memory shard will not be lost in case of a disaster

• binlog_flush
  o like innodb_flush_log_at_trx_commit
  o 0 - flush and sync every second - fastest, up to 1 second of data can be lost
  o 1 - flush and sync every transaction - safest, but slowest
  o 2 - flush every transaction, sync every second - best balance, default mode

• binlog_path
  o binlog_path = # disable logging

Page 31: Real time fulltext search with sphinx

Fast RT setup using classic index

• Create a classic index to get the initial data

• Declare an RT index

• mysql> ATTACH INDEX classic TO RTINDEX rt

• transforms the classic index into an RT index

• the operation is almost instant
  o in essence it is a file rename: the classic index becomes an RT disk shard

Page 32: Real time fulltext search with sphinx

Sphinx uses 1 CPU core per index

More power?

Distribute!

Page 33: Real time fulltext search with sphinx

Distributed RT index

Update on each shard, search on everything

index distributed
{
    type  = distributed
    local = rtlocal_one
    local = rtlocal_two
    agent = some.ip:port:rtremote_one
}

don't forget about dist_threads = x
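"Update on each shard, search on everything" means the application decides which RT index receives a write, while reads go through the distributed index. A minimal Python sketch of that split follows. The modulo routing rule is an assumption for illustration (the slide does not prescribe one), and the in-memory dictionaries merely stand in for the three indexes.

```python
def shard_for(doc_id, shards):
    """Pick the shard that owns a document (assumed rule: id modulo count)."""
    return shards[doc_id % len(shards)]

def search_all(shards, data, word):
    """Query every shard and merge the hits, as a distributed index would."""
    hits = []
    for name in shards:
        hits.extend(i for i, text in data[name].items() if word in text.split())
    return sorted(hits)

shards = ["rtlocal_one", "rtlocal_two", "rtremote_one"]
data = {name: {} for name in shards}       # stand-in for the real indexes

documents = {100: "red apple", 101: "green apple", 102: "red car"}
for doc_id, text in documents.items():
    # writes go to exactly one shard
    data[shard_for(doc_id, shards)][doc_id] = text

print(shard_for(100, shards), search_all(shards, data, "apple"))
```

Whatever routing rule is used, it has to be stable, so that later REPLACE and DELETE statements for a document reach the shard that holds it.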

Page 34: Real time fulltext search with sphinx

Copy RT index from one server to another

• just simulate a daemon restart

• searchd --stopwait

• flushes the memory shard to disk

• copy all index files to the new server

• add the RT index to sphinx.conf on the new server

• start searchd on the new server

Page 35: Real time fulltext search with sphinx

Questions?

www.sphinxsearch.com

Docs: http://sphinxsearch.com/docs/

Wiki: http://sphinxsearch.com/wiki/

Official blog: http://sphinxsearch.com/blog/

SVN repository: https://code.google.com/p/sphinxsearch/