Page 1: Postgresql search demystified

PgConf EU 2014 presents

Javier Ramirez (@supercoco9)

in

PostgreSQL Full-text search demystified

https://teowaki.com

Page 2: Postgresql search demystified

The problem

Page 3: Postgresql search demystified

our architecture

Page 4: Postgresql search demystified
Page 5: Postgresql search demystified

One does not simply

SELECT * from stuff where

content ilike '%postgresql%'

Page 6: Postgresql search demystified
Page 7: Postgresql search demystified
Page 8: Postgresql search demystified

Basic search features

* stemmers (run, runner, running)

* unaccented (josé, jose)

* results highlighting

* rank results by relevance

Page 9: Postgresql search demystified

Nice to have features

* partial searches

* search operators (OR, AND...)

* synonyms (postgres, postgresql, pgsql)

* thesaurus (OS=Operating System)

* fast, and space-efficient

* debugging

Page 10: Postgresql search demystified

Good News:

PostgreSQL supports all

the requested features

Page 11: Postgresql search demystified

Bad News:

unless you already know about search

engines, the official docs are not obvious

Page 12: Postgresql search demystified

How a search engine works

* An indexing phase

* A search phase

Page 13: Postgresql search demystified

The indexing phase

Convert the input text to tokens

Page 14: Postgresql search demystified

The search phase

Match the search terms to

the indexed tokens
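
A minimal sketch of the two phases in Python, assuming a toy whitespace tokenizer and an in-memory dict as the index. The names `tokenize`, `build_index` and `search` are illustrative and do not come from the talk.

```python
from collections import defaultdict

def tokenize(text):
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Indexing phase: map each token to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    # Search phase: tokenize the query the same way, then intersect the matches.
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {1: "PostgreSQL full-text search", 2: "Redis search tricks"}
index = build_index(docs)
print(search(index, "search postgresql"))  # ids of documents containing both terms
```

The point is that both phases must run the same tokenizer, otherwise the query terms never line up with the indexed tokens.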

Page 15: Postgresql search demystified

indexing in depth

* choose an index format

* tokenize the words

* apply token analysis/filters

* discard unwanted tokens

Page 16: Postgresql search demystified

the index format

* r-tree (GIST in PostgreSQL)

* inverted indexes (GIN in PostgreSQL)

* dynamic/distributed indexes

Page 17: Postgresql search demystified

dynamic indexes: segmentation

* sometimes the token index is

segmented to allow faster updates

* consolidate segments to speed up

search and account for deletions
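
A rough sketch of segmentation in Python, assuming one small segment per added document and a set of "tombstones" for deletions. The class `SegmentedIndex` and its methods are hypothetical and only illustrate the idea; real engines differ in many details.

```python
class SegmentedIndex:
    def __init__(self):
        self.segments = []    # each segment maps token -> set of doc ids
        self.deleted = set()  # tombstones: ids deleted but not yet purged

    def add(self, doc_id, text):
        # Fast update: write a new small segment instead of rewriting the index.
        segment = {}
        for token in text.lower().split():
            segment.setdefault(token, set()).add(doc_id)
        self.segments.append(segment)

    def delete(self, doc_id):
        # Deletions are only recorded; the segments stay untouched.
        self.deleted.add(doc_id)

    def search(self, token):
        # Every segment has to be visited, so many segments mean slower search.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(token.lower(), set())
        return hits - self.deleted

    def consolidate(self):
        # Merge all segments into one and physically drop deleted documents.
        merged = {}
        for segment in self.segments:
            for token, ids in segment.items():
                live = ids - self.deleted
                if live:
                    merged.setdefault(token, set()).update(live)
        self.segments = [merged]
        self.deleted.clear()
```

After `consolidate()` a search touches a single segment again, which is the trade-off the slide describes: cheap writes now, a merge step later.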

Page 18: Postgresql search demystified

tokenizing

* parse/strip/convert format

* normalize terms (unaccent, ascii,

charsets, case folding, number precision..)
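
As an illustration, normalization can be sketched in Python with the standard library only. The function names are assumptions for this example; in PostgreSQL the equivalent work is done by the parser and dictionaries such as `unaccent`.

```python
import re
import unicodedata

def normalize(text):
    # Case folding: make comparisons case-insensitive.
    folded = text.casefold()
    # Unaccent: decompose characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text):
    # Split the normalized text into runs of word characters.
    return re.findall(r"\w+", normalize(text))

print(tokenize("José runs Búsquedas"))  # ['jose', 'runs', 'busquedas']
```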

Page 19: Postgresql search demystified

token analysis/filters

* find synonyms

* expand thesaurus

* stem (maybe in different languages)

Page 20: Postgresql search demystified

more token analysis/filters

* eliminate stopwords

* store word distance/frequency

* store the full contents of some fields

* store some fields as attributes/facets
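
A toy analysis pipeline in Python showing stopword removal, synonym replacement and stemming. The word lists and the suffix rules are made up for this example; the suffix stripper is deliberately naive and is not the Snowball algorithm.

```python
STOPWORDS = {"a", "an", "the", "is", "in", "of"}               # assumed list
SYNONYMS = {"postgres": "postgresql", "pgsql": "postgresql"}   # assumed list
SUFFIXES = ("ning", "ner", "ing", "s")  # toy rules, tried in this order

def stem(token):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyse(tokens):
    out = []
    for token in tokens:
        if token in STOPWORDS:
            continue                        # eliminate stopwords
        token = SYNONYMS.get(token, token)  # collapse synonyms to one term
        out.append(stem(token))             # reduce to a common stem
    return out

print(analyse(["the", "runner", "is", "running", "pgsql"]))
# ['run', 'run', 'postgresql']
```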

Page 21: Postgresql search demystified

“the index file” is really

* a token file, probably segmented/distributed

* some dictionary files: synonyms, thesaurus,

stopwords, stems/lexemes (in different languages)

* word distance/frequency info

* attributes/original field files

* optional geospatial index

* auxiliary files: word/sentence boundaries, meta-info,

parser definitions, datasource definitions...

Page 22: Postgresql search demystified

the hardest

part is now

over

Page 23: Postgresql search demystified

searching in depth

* tokenize/analyse

* prepare operators

* retrieve information

* rank the results

* highlight the matched parts

Page 24: Postgresql search demystified

searching in depth: tokenize

normalize, tokenize, and analyse

the original search term

the result would be a tokenized, stemmed,

“synonymised” term, without stopwords

Page 25: Postgresql search demystified

searching in depth: operators

* partial search

* logical/geospatial/range operators

* in-sentence/in-paragraph/word distance

* faceting/grouping
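
Logical operators and partial search reduce to set operations on the token index. A small Python sketch, assuming a hand-written index of token to document ids (the data and function names are invented for the example):

```python
# Hypothetical inverted index: token -> ids of documents containing it.
index = {
    "postgresql": {1, 3},
    "postgres": {2},
    "mysql": {3, 4},
    "search": {1, 2, 4},
}

def lookup(index, term):
    # Exact match on a single token.
    return set(index.get(term, set()))

def prefix(index, start):
    # Partial search: union of the postings of every token with this prefix.
    hits = set()
    for token, ids in index.items():
        if token.startswith(start):
            hits |= ids
    return hits

both = lookup(index, "postgresql") & lookup(index, "mysql")    # AND
either = lookup(index, "postgresql") | lookup(index, "mysql")  # OR
partial = prefix(index, "postgres")                            # postgres*
```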

Page 26: Postgresql search demystified

searching in depth: retrieval

Go through the token index files, use the

attributes and geospatial files if necessary

for operators and/or grouping

You might need to do this in a distributed way

Page 27: Postgresql search demystified

searching in depth: ranking

algorithm to sort the most relevant results:

* field weights

* word frequency/density

* geospatial or timestamp ranking

* ad-hoc ranking strategies
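
A minimal ranking sketch in Python combining field weights with word density. The weights, field names and scoring formula are assumptions chosen for illustration; they are not the formula used by any particular engine.

```python
FIELD_WEIGHTS = {"name": 1.0, "description": 0.4}  # assumed weights

def score(doc, query_tokens):
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, "").lower().split()
        if not tokens:
            continue
        matches = sum(tokens.count(t) for t in query_tokens)
        # Density: matches relative to field length, scaled by the field weight.
        total += weight * matches / len(tokens)
    return total

def rank(docs, query_tokens):
    # Return the ids of matching documents, most relevant first.
    scored = [(score(doc, query_tokens), doc["id"]) for doc in docs]
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for value, doc_id in scored if value > 0]

docs = [
    {"id": 1, "name": "redis tips", "description": "not about postgresql at all"},
    {"id": 2, "name": "postgresql search", "description": "full text"},
    {"id": 3, "name": "cooking", "description": "pasta"},
]
print(rank(docs, ["postgresql"]))  # [2, 1]
```

A match in the short, heavily weighted `name` field outranks a match buried in a longer `description`.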

Page 28: Postgresql search demystified

searching in depth: highlighting

Mark the matching parts of the results

It can be tricky/slow if you are not storing the full contents

in your indexes
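
A simple highlighter in Python, assuming the full original text is available and that matching is done on exact words. The function name and the default `<b>` markers are choices made for this sketch; a real engine would match on stems rather than literal words.

```python
import re

def highlight(text, terms, start="<b>", stop="</b>"):
    # Wrap every whole-word, case-insensitive occurrence of a term in markers.
    if not terms:
        return text
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: start + m.group(0) + stop, text)

print(highlight("Play the game", ["game", "play"]))
# <b>Play</b> the <b>game</b>
```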

Page 29: Postgresql search demystified

PostgreSQL as a

full-text

search engine

Page 30: Postgresql search demystified

search features

* index format configuration

* partial search

* word boundaries parser (not configurable)

* stemmers/synonyms/thesaurus/stopwords

* full-text logical operators

* attributes/geo/timestamp/range (using SQL)

* ranking strategies

* highlighting

* debugging/testing commands

Page 31: Postgresql search demystified

indexing in postgresql

you don't actually need an index to use full-text search in PostgreSQL

but unless your db is very small, you want to have one

Choose GiST or GIN (GIN: faster search, slower indexing,

larger index size)

CREATE INDEX pgweb_idx ON pgweb USING

gin(to_tsvector(config_name, body));

Page 32: Postgresql search demystified

Two new things

CREATE INDEX ... USING gin(to_tsvector (config_name, body));

* to_tsvector: postgresql way of saying “tokenize”

* config_name: tokenizing/analysis rule set

Page 33: Postgresql search demystified

Configuration

CREATE TEXT SEARCH CONFIGURATION

public.teowaki ( COPY = pg_catalog.english );

Page 34: Postgresql search demystified

Configuration

CREATE TEXT SEARCH DICTIONARY english_ispell (

TEMPLATE = ispell,

DictFile = en_us,

AffFile = en_us,

StopWords = spanglish

);

CREATE TEXT SEARCH DICTIONARY spanish_ispell (

TEMPLATE = ispell,

DictFile = es_any,

AffFile = es_any,

StopWords = spanish

);

Page 35: Postgresql search demystified

Configuration

CREATE TEXT SEARCH DICTIONARY english_stem (

TEMPLATE = snowball,

Language = english,

StopWords = english

);

CREATE TEXT SEARCH DICTIONARY spanish_stem (

TEMPLATE= snowball,

Language = spanish,

Stopwords = spanish

);

Page 36: Postgresql search demystified

Configuration

Parser.

Word boundaries

Page 37: Postgresql search demystified

Configuration

Assign dictionaries (in specific to generic order)

ALTER TEXT SEARCH CONFIGURATION teowaki

ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword,

hword_part

WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;

ALTER TEXT SEARCH CONFIGURATION teowaki

DROP MAPPING FOR email, url, url_path, sfloat, float;
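
Why the order matters can be sketched in Python. The model below assumes what the PostgreSQL documentation describes for a dictionary chain: each dictionary returns a list of lexemes if it knows the token, an empty list for a stop word, or nothing if it does not recognize the token, in which case the next dictionary is consulted. The dictionaries here are hypothetical stand-ins, and filtering dictionaries such as `unaccent` are not modeled.

```python
def make_dictionary(known, stopwords=frozenset()):
    # Stand-in for a specific dictionary (for example an ispell one).
    def lookup(token):
        if token in stopwords:
            return []            # stop word: recognized, produces no lexeme
        return known.get(token)  # list of lexemes, or None if not recognized
    return lookup

def generic_stem(token):
    # Stand-in for a generic stemmer: it recognizes every token it is given.
    return [token[:-1]] if token.endswith("s") and len(token) > 3 else [token]

def lexize(token, dictionaries):
    # Consult the dictionaries in order; the first one that answers wins.
    for lookup in dictionaries:
        result = lookup(token)
        if result is not None:
            return result
    return []  # no dictionary recognized the token: it is discarded

specific = make_dictionary({"searching": ["search"]}, stopwords={"i", "am"})

print(lexize("searching", [specific, generic_stem]))  # ['search']
print(lexize("searching", [generic_stem, specific]))  # ['searching']
```

With the generic stemmer first, the specific dictionary is never consulted, which is why the mapping lists dictionaries from specific to generic.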

Page 38: Postgresql search demystified

debugging

select * from ts_debug('teowaki', 'I am searching unas

búsquedas con postgresql database');

also ts_lexize and ts_parse

Page 39: Postgresql search demystified

tokenizing

tokens + position (stopwords are removed, tokens are folded)
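
The shape of that output can be imitated in a few lines of Python. This is only a sketch of the idea (folded tokens mapped to their positions, with stopwords dropped but still counted when numbering positions); it does no stemming and is not how PostgreSQL implements `to_tsvector`. The function name and stopword list are assumptions.

```python
def to_vector(text, stopwords):
    positions = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        if word in stopwords:
            continue  # dropped from the vector, but its position is still used up
        positions.setdefault(word, []).append(pos)
    return dict(sorted(positions.items()))

print(to_vector("A fat cat sat on a mat", {"a", "on"}))
# {'cat': [3], 'fat': [2], 'mat': [7], 'sat': [4]}
```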

Page 40: Postgresql search demystified

searching

SELECT guid, description from wakis where

to_tsvector('teowaki',description)

@@ to_tsquery('teowaki','postgres');

Page 41: Postgresql search demystified

searching

SELECT guid, description from wakis where

to_tsvector('teowaki',description)

@@ to_tsquery('teowaki','postgres:*');

Page 42: Postgresql search demystified

operators

SELECT guid, description from wakis where

to_tsvector('teowaki',description)

@@ to_tsquery('teowaki','postgres | mysql');
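
Application code usually has to turn free-form user input into the query string passed to `to_tsquery`. A hedged Python sketch of one way to do that: `build_tsquery` is a hypothetical helper, not part of PostgreSQL or of any driver. It keeps only word characters, so stray quotes or operators typed by the user cannot produce a malformed query.

```python
import re

def build_tsquery(user_input, operator="&", prefix=False):
    # Only the AND (&) and OR (|) operators shown in the slides are supported.
    if operator not in ("&", "|"):
        raise ValueError("operator must be '&' or '|'")
    # Keep runs of word characters; punctuation and quotes are discarded.
    terms = re.findall(r"\w+", user_input.lower())
    if prefix:
        terms = [term + ":*" for term in terms]  # partial (prefix) search
    return (" " + operator + " ").join(terms)

print(build_tsquery("postgres mysql", operator="|"))  # postgres | mysql
print(build_tsquery("postgres", prefix=True))         # postgres:*
```

The resulting string should still be sent as a bound parameter, for example as the second argument of `to_tsquery('teowaki', ...)`, and an empty result should be handled by the caller before querying.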

Page 43: Postgresql search demystified

ranking weights

SELECT setweight(to_tsvector(coalesce(name,'')),'A') ||

setweight(to_tsvector(coalesce(description,'')),'B')

from wakis limit 1;

Page 44: Postgresql search demystified

search by weight

Page 45: Postgresql search demystified

ranking

SELECT name, ts_rank(to_tsvector(name), query) rank

from wakis, to_tsquery('postgres | indexes') query

where to_tsvector(name) @@ query order by rank DESC;

also ts_rank_cd

Page 46: Postgresql search demystified

highlighting

SELECT ts_headline(name, query) from wakis,

to_tsquery('teowaki', 'game|play') query

where to_tsvector('teowaki', name) @@ query;

Page 47: Postgresql search demystified

USE POSTGRESQL

FOR EVERYTHING

Page 48: Postgresql search demystified

When PostgreSQL is not good

* You need to index files (PDF, Odx...)

* Your index is very big (slow reindex)

* You need a distributed index

* You need complex tokenizers

* You need advanced rankers

Page 49: Postgresql search demystified

When PostgreSQL is not good

* You want a REST API

* You want sentence/proximity/range/more complex operators

* You want search auto completion

* You want advanced features (alerts...)

Page 50: Postgresql search demystified

But it has been

perfect for us so far.

Our users don't care

which search engine

we use, as long as

it works.

Page 51: Postgresql search demystified

PgConf EU 2014 presents

Javier Ramirez (@supercoco9)

in

PostgreSQL Full-text search demystified

https://teowaki.com

