7th of August 2014 Full Text Search With PostgreSQL by Vincent Desmares
Summary
1) What is a Full Text Search
2) A basic PostgreSQL example
3) FTS advanced features
4) FTS advanced configuration
What is a Full Text Search?
● Searching for documents
● Use the whole document
● Be able to set the precision
Why use a Full Text Search?
● Basic methods are = and ILIKE
○ No linguistic support for textual search operators
■ “countries” should be the same as “country”
○ Can’t rank matches by relevance
○ Basic search too slow for complex queries
● Why PostgreSQL?
○ Native and sufficiently performant
FTS basic usage
Search for:
● A document
● A business object
● A row
[Diagram: a library / database, with the relevant documents singled out from the other documents]
What is a search?
● A query to run
● On a parsed text
SELECT *
FROM document
WHERE to_tsvector(title || ' ' || content) @@ to_tsquery('Car');
How does to_tsvector work?
● Text is split into tokens, which are normalized to lexemes and stored with their positions; stop words are dropped
# select * from to_tsvector('Hello my name is vincent. I am very happy to be vincent.')
"'happi':9 'hello':1 'name':3 'vincent':5,12"
How does to_tsquery work?
● Parses a formatted query: terms must be joined by operators (&, |, !)
# select to_tsquery('Vincent is Happy')
ERROR: syntax error in tsquery: "Vincent is Happy"
# select to_tsquery('Vincent & is & Happy')
"'vincent' & 'happi'"
Result
The @@ operator is to FTS what = is to a plain comparison: it tests whether a tsvector matches a tsquery
# select content from document where to_tsquery('Vincent & is & Happy') @@ to_tsvector(content) limit 1
'Hello my name is vincent. I am very happy to be vincent.'
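Requiring & between terms is awkward for raw user input. One option is plainto_tsquery, which normalizes free text and inserts the operators itself. A minimal sketch, assuming the english configuration (the slides never name one):

```sql
-- plainto_tsquery drops stop words, stems what is left, and joins the
-- surviving terms with &, so the user does not type any operator syntax.
-- 'english' is an assumption; the slides rely on the server default.
SELECT plainto_tsquery('english', 'Vincent is Happy');
-- 'vincent' & 'happi'
```

The result is the same tsquery that to_tsquery('Vincent & is & Happy') produced earlier, so it can be used on either side of @@ in the same way.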
And it’s faaaaaaaaaaast
# select count(*) FROM document;
count | 11909475
# select count(*) FROM document where content_vector @@ to_tsquery('countries');
count | 424813
Time: 454.709 ms
# select count(*) FROM document where content ILIKE '%countries%';
count | 116734
Time: 11672.649 ms
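Why is it faaaaaaaaaaaast?
● Indexed
● GIN (Generalized Inverted Index)
○ Longer to build, faster to search
● GiST (Generalized Search Tree)
○ Quicker to update, slower to search
-- The indexed expression needs its own parentheses and the two-argument
-- to_tsvector: index expressions must be immutable, and the one-argument
-- form is not. 'english' is assumed here. A query must repeat this exact
-- expression to use the index.
CREATE INDEX document_tsvector_idx ON document USING gin (to_tsvector('english', title || ' ' || content));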
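The timing slide queries a content_vector column rather than an expression. The slides do not show how that column was built; one possible setup, sketched here under the assumption that document has text columns title and content and that the english configuration is wanted (index and trigger names are hypothetical):

```sql
-- Store the parsed document once instead of re-parsing it in every query.
ALTER TABLE document ADD COLUMN content_vector tsvector;

-- Backfill existing rows; coalesce keeps a NULL title or content from
-- turning the whole concatenation into NULL.
UPDATE document
SET content_vector = to_tsvector('english',
        coalesce(title, '') || ' ' || coalesce(content, ''));

-- Index the stored column directly.
CREATE INDEX document_content_vector_idx
    ON document USING gin (content_vector);

-- Keep the column current on writes with the built-in trigger function.
CREATE TRIGGER document_content_vector_update
    BEFORE INSERT OR UPDATE ON document
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(content_vector, 'pg_catalog.english', title, content);

-- Queries then match against the column, with no to_tsvector call.
SELECT count(*) FROM document
WHERE content_vector @@ to_tsquery('english', 'countries');
```

The trade-off is extra storage and slightly slower writes in exchange for cheaper reads, since rows that need rechecking or ranking do not have to be parsed again.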
Advanced Features
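Ranked results
● The third argument normalizes the rank: 1 divides it by 1 + log(document length), 8 divides it by the number of unique words
# select content, ts_rank_cd(to_tsvector(content), to_tsquery('Happy'), 1|8) as rank
from document
where to_tsquery('Happy') @@ to_tsvector(content)
ORDER BY rank DESC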
Google style results
# SELECT id, ts_headline(content, q,
'StartSel=<b>, StopSel=</b>, MaxWords=5, MinWords=4, ShortWord=3, HighlightAll=FALSE, MaxFragments=0, FragmentDelimiter=" ... "'
) FROM document, to_tsquery('Happy') AS q WHERE q @@ to_tsvector(content)
“<b>Vincent</b> is <b>happy</b> because … very <b>happy</b> to be with ...”
Advanced Configuration
Simplest workflow
Original content: Vincent is very very Happy
ts_vector: 'happy':5 'is':2 'very':3,4 'vincent':1
Useless words? Stop Words!
● Just a file with a list of words
● Must be in the PostgreSQL tsearch_data directory, named with a .stop extension
CREATE TEXT SEARCH DICTIONARY documentIspell (
TEMPLATE = ispell,
StopWords = 'my_file'
);
/usr/share/postgresql/9.3/tsearch_data/my_file.stop
With stop words
Original content: Vincent is very very Happy
Step 1: remove useless words (with only “is” in the .stop file)
ts_vector: 'happy':5 'very':3,4 'vincent':1
Custom dictfile
● Just a file with a list of words
● Contains suffix/affix metadata (can be custom)
CREATE TEXT SEARCH DICTIONARY documentIspell (
[...]
-- DictFile takes the base name; PostgreSQL appends .dict itself
DictFile = 'my'
);
# cat my.dict | grep fry
fry/NGDS
# cat en_us.affix | grep "SFX S"
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]
With linguistic dictionaries
Original content: Vincent is very very Happy
Step 1: remove useless words
Step 2: reduce words to their roots (with a custom .affix and .dict)
ts_vector: 'happi':5 'very':3,4 'vincent':1
The thesaurus
● Link business terms
# cat /var/postgresql/9.3/tsearch_data/inovia_learn.ths
McDo : *McDonalds
Mc do : *McDonalds
very happy : *blessed
The final chain
Original content: Vincent is very very Happy
Step 1: remove useless words
Step 2: reduce words to their roots
Step 3: replace words with their synonyms (with a custom .ths)
ts_vector: 'blessed':4 'very':3 'vincent':1
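The slides create dictionaries but never show how they are chained together. A dictionary only takes effect once a text search configuration maps token types to it. The following is a sketch of one way to wire the chain, not the author's actual setup: the names document_ispell, document_thesaurus and document_conf are hypothetical, the files my.dict, my.affix, my_file.stop and inovia_learn.ths are assumed to be in tsearch_data, and pg_catalog.simple is an assumed choice of subdictionary.

```sql
-- Ispell dictionary: stop words plus root reduction from .dict/.affix files.
CREATE TEXT SEARCH DICTIONARY document_ispell (
    TEMPLATE  = ispell,
    DictFile  = my,        -- resolves to my.dict
    AffFile   = my,        -- resolves to my.affix
    StopWords = my_file    -- resolves to my_file.stop
);

-- Thesaurus dictionary: phrase substitution from the .ths file.
-- The subdictionary normalizes the phrases before they are compared;
-- simple only lowercases, so it accepts every word in the file.
CREATE TEXT SEARCH DICTIONARY document_thesaurus (
    TEMPLATE   = thesaurus,
    DictFile   = inovia_learn,   -- resolves to inovia_learn.ths
    Dictionary = pg_catalog.simple
);

-- Start from the stock english configuration and override word handling.
CREATE TEXT SEARCH CONFIGURATION document_conf (COPY = pg_catalog.english);

-- Dictionaries are tried left to right; the first one that recognizes
-- the token wins, so the thesaurus goes first and the stemmer last.
ALTER TEXT SEARCH CONFIGURATION document_conf
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH document_thesaurus, document_ispell, english_stem;

-- Parse with the custom chain.
SELECT to_tsvector('document_conf', 'Vincent is very very Happy');
```

Order matters: a stemmer such as english_stem recognizes every word, so any dictionary placed after it would never be consulted.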
How to debug?
# select * FROM ts_debug('Vincent is very very happy')
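ts_debug returns one row per token, and the full output is wide. A minimal sketch that keeps the most useful columns and also tests a single dictionary in isolation, assuming the stock english configuration:

```sql
-- alias:        token type (asciiword, blank, ...)
-- dictionaries: the dictionaries mapped to that token type
-- dictionary:   the one that recognized the token (NULL if none did)
-- lexemes:      the result; empty for a stop word
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'Vincent is very very happy');

-- ts_lexize runs one word through one dictionary, bypassing the parser.
SELECT ts_lexize('english_stem', 'countries');
-- {countri}
```

Note that ts_lexize works on a single token and cannot exercise a thesaurus, which needs to see a sequence of tokens; ts_debug is the tool for phrases.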
The drawbacks (yes, last slide)
● Transformed words (lexemes) are indexed
○ Only full or prefix match available
○ Solution: autocomplete
● Business terms have custom meanings
○ Ex: fry (third-person singular simple present: fries)
○ Solution: custom dictionary
● Indexes are long to build
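Partial matching in a tsquery is written with the :* marker, which matches lexemes by prefix and can back a simple autocomplete. A minimal sketch, assuming the english configuration:

```sql
-- 'vinc':* matches any lexeme that starts with "vinc".
SELECT to_tsvector('english', 'Hello my name is vincent')
       @@ to_tsquery('english', 'vinc:*') AS matches;
-- matches | t
```

There is no equivalent for matching the end or the middle of a word; that kind of search needs a different tool, such as ILIKE or a trigram index.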
Thank you!
Sources:
http://www.postgresql.org/docs/9.3/static/textsearch.html
http://en.wikipedia.org/wiki/Full_text_search
http://en.wikipedia.org/wiki/Precision_and_recall
For online questions, please leave a comment on the article.
Questions?
Join the community! (in Paris)
Social networks:
● Follow us on Twitter: https://twitter.com/steamlearn
● Like us on Facebook: https://www.facebook.com/steamlearn
SteamLearn is an Inovia initiative: inovia.fr
Want to be in the audience? Contact us at [email protected]