Steam Learn: Full text search with PostgreSQL

7th of August 2014

Full Text Search With PostgreSQL

by Vincent Desmares

Summary

1) What is a Full Text Search

2) A basic PostgreSQL example

3) FTS advanced features

4) FTS advanced configuration

What is a Full Text Search?

● Searching for documents

● Use the whole document

● Be able to set the precision

Why use a Full Text Search?

● Basic methods are = and ILIKE

○ No linguistic support in the basic text operators

■ “countries” should match “country”

○ Can’t rank matches by relevance

○ Basic search is too slow for complex queries

● Why PostgreSQL?

○ Native and fast enough

FTS basic usage

Search for:

● A document

● A business object

● A row

[Diagram: a library / database, split into relevant documents and other documents]

What is a search?

● A query to run

● On a parsed text

SELECT *

FROM document

WHERE to_tsvector(title || ' ' || content) @@ to_tsquery('Car')
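A minimal, self-contained sketch of the query above. The table definition and the sample row are assumptions made for illustration (the slides only name the table and its columns), and the 'english' configuration is assumed as well.

```sql
-- Hypothetical schema: only the names document, title and content come from the slides.
CREATE TABLE document (
    id      serial PRIMARY KEY,
    title   text NOT NULL,
    content text NOT NULL
);

INSERT INTO document (title, content)
VALUES ('Road trip', 'We rented two cars and drove across the country.');

-- With the assumed 'english' configuration, "cars" and "Car" are both
-- reduced to the lexeme 'car', so this row matches.
SELECT id, title
FROM document
WHERE to_tsvector('english', title || ' ' || content)
      @@ to_tsquery('english', 'Car');
```

The space between title and content keeps the last word of the title from being glued to the first word of the content.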

How does to_tsvector work?

● Text is split into tokens, which are normalized into lexemes (stop words are dropped)

# select * from to_tsvector('Hello my name is vincent. I am very happy to be vincent.')

"'happi':9 'hello':1 'name':3 'vincent':5,12"

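The result of to_tsvector depends on the text search configuration in use. The sketch below passes the configuration explicitly instead of relying on default_text_search_config; that the slides ran with an English default is an assumption, although the outputs shown are the ones that appear in this deck.

```sql
-- 'simple' only lowercases tokens: nothing is stemmed or removed.
SELECT to_tsvector('simple', 'Vincent is very very Happy');
-- 'happy':5 'is':2 'very':3,4 'vincent':1

-- 'english' also drops stop words and stems the remaining tokens.
SELECT to_tsvector('english', 'Hello my name is vincent. I am very happy to be vincent.');
-- 'happi':9 'hello':1 'name':3 'vincent':5,12
```

The numbers after each lexeme are word positions in the original text, which is why positions of dropped stop words leave gaps.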

How does to_tsquery work?

● Parses a formatted query

#select to_tsquery('Vincent is Happy')

ERROR: syntax error in tsquery: "Vincent is Happy"

#select to_tsquery('Vincent & is & Happy')

"'vincent' & 'happi'"

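When the search string comes straight from a user, the operator syntax required by to_tsquery is inconvenient. A short sketch of the alternatives, assuming the 'english' configuration: plainto_tsquery accepts unformatted text, while to_tsquery also understands | (OR) and ! (NOT).

```sql
-- plainto_tsquery accepts plain text and joins the remaining lexemes with &.
SELECT plainto_tsquery('english', 'Vincent is Happy');
-- 'vincent' & 'happi'

-- to_tsquery operators: & (AND), | (OR), ! (NOT), parentheses for grouping.
SELECT to_tsquery('english', 'Vincent | Happy');
SELECT to_tsquery('english', 'Vincent & !Happy');
```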

Result

The @@ operator is the FTS counterpart of =: it is true when the tsvector matches the tsquery

#select content from document where to_tsquery('Vincent & is & Happy') @@ to_tsvector(content) limit 1

'Hello my name is vincent. I am very happy to be vincent.'

And it’s faaaaaaaaaaast

# select count(*) FROM document;
count | 11909475

# select count(*) FROM document where content_vector @@ to_tsquery('countries');
count | 424813
Time: 454.709 ms

# select count(*) FROM document where [...] ILIKE '%countries%';
count | 116734
Time: 11672.649 ms

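Why is it faaaaaaaaaaaast?

● Indexed

● GIN (Generalized Inverted Index)

○ Longer to build, faster to search

● GiST (Generalized Search Tree)

○ Quicker to update, slower to search

-- An index expression needs the two-argument form of to_tsvector; 'english' is an assumed configuration
CREATE INDEX document_tsvector_idx ON document USING gin (to_tsvector('english', title || ' ' || content));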
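The timing example queries a content_vector column rather than an expression. The slides do not show how that column was built, so the following is only one possible sketch: a stored tsvector column, a GIN index on it, and the built-in tsvector_update_trigger to keep it current. The index and trigger names, and the 'english' configuration, are assumptions.

```sql
-- Assumes a table document(title text, content text) already exists.
ALTER TABLE document ADD COLUMN content_vector tsvector;

-- Backfill existing rows; coalesce guards against NULL columns.
UPDATE document
SET content_vector = to_tsvector('english',
        coalesce(title, '') || ' ' || coalesce(content, ''));

-- Index the stored column directly.
CREATE INDEX document_content_vector_idx
    ON document USING gin (content_vector);

-- Keep the column in sync on every insert or update.
CREATE TRIGGER document_content_vector_update
BEFORE INSERT OR UPDATE ON document
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(content_vector, 'pg_catalog.english', title, content);
```

Storing the vector avoids re-parsing the text on every query, at the cost of extra disk space and slightly slower writes.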

Advanced Features

Ranked results

# select content,
    ts_rank_cd(to_tsvector(content), to_tsquery('Happy'), 1|8) as rank
  from document
  where to_tsquery('Happy') @@ to_tsvector(content)
  ORDER BY rank DESC
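The third argument of ts_rank_cd is a normalization bit mask: 0 ignores document length, 1 divides the rank by 1 + the logarithm of the document length, and 8 divides it by the number of unique words, so 1|8 applies both. The sketch below compares the raw and normalized ranks side by side; the 'english' configuration, the column aliases and the LIMIT are assumptions added for illustration.

```sql
-- Assumes a table document(content text) already exists.
SELECT content,
       ts_rank_cd(to_tsvector('english', content), q, 0)   AS raw_rank,
       ts_rank_cd(to_tsvector('english', content), q, 1|8) AS normalized_rank
FROM document,
     to_tsquery('english', 'Happy') AS q   -- build the query once and reuse it
WHERE to_tsvector('english', content) @@ q
ORDER BY normalized_rank DESC
LIMIT 10;
```

Without normalization, long documents tend to rank higher simply because they contain more occurrences of the term.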

Google style results

# SELECT id, ts_headline(content, to_tsquery('Happy'),
    'StartSel=, StopSel=, MaxWords=5, MinWords=4, ShortWord=3, HighlightAll=FALSE, MaxFragments=0, FragmentDelimiter=" ... "'
  ) FROM document WHERE to_tsquery('Happy') @@ to_tsvector(content)

“Vincent is happy because … very happy to be with ...”

Advanced Configuration

Simplest workflow

Original content → tsvector

Vincent is very very Happy

'happy':5 'is':2 'very':3,4 'vincent':1

Useless words? Stop Words!

● Just a file with a list of words

● Must be in the PostgreSQL tsearch_data directory

CREATE TEXT SEARCH DICTIONARY documentIspell ( TEMPLATE = ispell, stopwords = 'my_file');

/usr/share/postgresql/9.3/tsearch_data/my_file.stop

With stop words

Remove useless words (with only “is” in the .stop file):

'happy':5 'very':3,4 'vincent':1
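The effect of a stop word file can be checked in isolation with ts_lexize. This is a sketch under assumptions: the dictionary name document_simple is hypothetical, it uses the built-in simple template rather than ispell so that no dictionary or affix file is needed, and it expects tsearch_data/my_file.stop to contain the single word "is".

```sql
-- Hypothetical dictionary; requires tsearch_data/my_file.stop to exist.
CREATE TEXT SEARCH DICTIONARY document_simple (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = my_file
);

-- A stop word yields an empty array: the token is recognized and discarded.
SELECT ts_lexize('document_simple', 'is');     -- {}

-- Any other word is lowercased and kept.
SELECT ts_lexize('document_simple', 'Happy');  -- {happy}
```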

Custom dictfile

● Just a file with a list of words

● Contains suffix/affix metadata (can be custom)

CREATE TEXT SEARCH DICTIONARY documentIspell ( [...] dictfile = 'my' );

# cat my.dict | grep fry
fry/NGDS

# cat en_us.affix | grep S
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]

With linguistic dictionaries

Reduce words to their roots (with a custom .affix and .dict):

'happi':5 'very':3,4 'vincent':1
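A dictionary only takes effect once a text search configuration maps token types to it. The sketch below shows one way to wire an ispell dictionary into a configuration; the names document_ispell and document_config are hypothetical, and the file names assume my.dict, en_us.affix and my_file.stop are present in tsearch_data. File parameters are given without their extension.

```sql
-- Hypothetical names; the three files must exist in tsearch_data.
CREATE TEXT SEARCH DICTIONARY document_ispell (
    TEMPLATE  = ispell,
    DictFile  = my,        -- my.dict
    AffFile   = en_us,     -- en_us.affix
    StopWords = my_file    -- my_file.stop
);

-- Start from the built-in English configuration.
CREATE TEXT SEARCH CONFIGURATION document_config (COPY = pg_catalog.english);

-- Try the ispell dictionary first; unknown words fall through to the stemmer.
ALTER TEXT SEARCH CONFIGURATION document_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH document_ispell, english_stem;

SELECT to_tsvector('document_config', 'Vincent is very very Happy');
```

Dictionaries in a mapping are consulted left to right, and the first one that recognizes the token decides the result.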

The thesaurus

● Link business terms

# cat /var/postgresql/9.3/tsearch_data/inovia_learn.ths
McDo : *McDonalds
Mc do : *McDonalds
very happy : *blessed

The final chain

Reduce words to their roots, then to their synonyms (with a custom .ths):

'blessed':4 'very':3 'vincent':1
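A thesaurus file is loaded through the thesaurus dictionary template and then placed first in the mapping, so that phrases are replaced before single words are stemmed. This is a sketch, not the configuration used in the talk: the names document_thesaurus and thesaurus_config are hypothetical, and the simple subdictionary is assumed because it keeps the word "very", which the third thesaurus entry relies on.

```sql
-- Hypothetical names; requires tsearch_data/inovia_learn.ths to exist.
CREATE TEXT SEARCH DICTIONARY document_thesaurus (
    TEMPLATE   = thesaurus,
    DictFile   = inovia_learn,       -- inovia_learn.ths
    Dictionary = pg_catalog.simple   -- subdictionary used to normalize the entries
);

CREATE TEXT SEARCH CONFIGURATION thesaurus_config (COPY = pg_catalog.english);

-- Thesaurus first, then the stemmer for everything it does not recognize.
ALTER TEXT SEARCH CONFIGURATION thesaurus_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH document_thesaurus, english_stem;

SELECT to_tsvector('thesaurus_config', 'I went to McDo');
```

Because the thesaurus is applied at indexing time, existing tsvector columns and indexes have to be rebuilt after the .ths file changes.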

How to debug?

# SELECT * FROM ts_debug('Vincent is very very happy')
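ts_debug returns one row per token, which makes it easy to see which dictionary handled each word. A small sketch that keeps only the most useful columns and names the configuration explicitly ('english' is an assumption here):

```sql
-- alias:        token type assigned by the parser
-- dictionaries: dictionaries mapped to that token type
-- dictionary:   the one that actually recognized the token
-- lexemes:      its output ({} means the token is a stop word)
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'Vincent is very very happy');
```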

The drawbacks (yes, last slide)

● Transformed words (lexemes) are indexed

○ Only full or prefix match available

Solution: autocomplete

● Business terms have custom meanings

Ex: fry (third-person singular simple present fries)

Solution: custom dictionary

● Indexes take a long time to build

http://en.wiktionary.org/wiki/fries#English
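Because the index stores whole lexemes, a query can match a complete lexeme or the beginning of one, but not an arbitrary substring the way ILIKE '%...%' does. A brief sketch of prefix matching with the :* marker, assuming the 'english' configuration:

```sql
-- 'vinc:*' matches any lexeme that starts with "vinc".
SELECT to_tsvector('english', 'Hello my name is vincent')
       @@ to_tsquery('english', 'vinc:*');   -- true

-- The end of a word alone is not enough: 'cent' is not a lexeme of this text.
SELECT to_tsvector('english', 'Hello my name is vincent')
       @@ to_tsquery('english', 'cent');     -- false
```

Prefix queries are a common basis for autocomplete, since each keystroke extends the prefix.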

Thank you!

Sources:

http://www.postgresql.org/docs/9.3/static/textsearch.html

http://en.wikipedia.org/wiki/Full_text_search

http://en.wikipedia.org/wiki/Precision_and_recall

For online questions, please leave a comment on the article.

Questions?

Join the community! (in Paris)

Social networks:

● Follow us on Twitter: https://twitter.com/steamlearn

● Like us on Facebook: https://www.facebook.com/steamlearn

SteamLearn is an Inovia initiative: inovia.fr

Want to be in the audience? Contact us at [email protected]
