
The Typed Index

May 08, 2015


Presented by Christoph Goller, Chief Scientist, IntraFind Software AG

If you want to search in a multilingual environment with high-quality, language-specific word normalization, handle mixed-language documents, add phonetic search for names, or run a semantic search that distinguishes the color "brown" from a person with the surname "Brown", you have to deal with different types of terms. I will show why it makes much more sense to attach types (prefixes) to Lucene terms than to rely on different fields or even different indexes for different kinds of terms. Furthermore, I will show what queries to such a typed index look like and why, for example, SpanQueries are needed to treat compound words and phrases correctly or to implement a reasonable phonetic search. The Analyzers and the QueryParser described are available as plugins for Lucene, Solr, and Elasticsearch.
Transcript
Page 1: The Typed Index
Page 2: The Typed Index

THE TYPED INDEX

Christoph Goller Chief Scientist at IntraFind Software AG [email protected]

Page 3: The Typed Index

Outline

• IntraFind Software AG

• Analyzers, Inverted File Index

• Different Types of Terms

• Why do we need them in one field?

• The Typed Index

• Multilingual Search / Mixed Language Documents

Page 4: The Typed Index

A few words about me and about IntraFind

Page 5: The Typed Index

IntraFind Software AG

• Specialist for Information Retrieval and Enterprise Search

• Founded: October 2000

• More than 850 customers, mainly in Germany, Austria, and Switzerland

• Employees: 30

• Lucene Committers: B. Messer, C. Goller

• Independent Software Vendor, entirely self-financed

• Products are a combination of Open Source components and in-house development

• Support (up to 24/7), Services, Training

• Focus on Quality / Text Analytics / SOA

– Linguistic Analyzers for most European Languages

– Semantic Search

– Named Entity Recognition

– Text Classification

– Clustering

Page 7: The Typed Index

Analyzers and the Inverted File Index

Page 8: The Typed Index

Analysis / Tokenization

Break the stream of characters into tokens / terms

• Normalization (e.g. case)

• Stop Words

• Stemming

• Lemmatizer / Decomposer

• Part of Speech Tagger

• Information Extraction

Page 9: The Typed Index

Inverted File Index
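
To make the two slides above concrete, here is a minimal sketch of an analyzer feeding a positional inverted index. It is a toy model in plain Python, not Lucene code; the function names `tokenize` and `build_index` and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'"

def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase (stand-in for a real analyzer)."""
    tokens = (raw.strip(PUNCTUATION).lower() for raw in text.split())
    return [tok for tok in tokens if tok]

def build_index(docs):
    """Inverted file index: term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample documents.
docs = {1: "The book is on the table.", 2: "She reads the book."}
index = build_index(docs)
print(index["book"])  # documents and positions that contain the term "book"
```

Storing positions, not just document ids, is what later makes phrase and span queries possible.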

Page 10: The Typed Index

Different Term Normalizations

Different Types of Terms

Page 11: The Typed Index

Morphological Analyzer vs. Stemming

• Lemmatizer: maps words to their base forms

English                        German
going -> go (Verb)             lief -> laufen (Verb)
bought -> buy (Verb)           rannte -> rennen (Verb)
bags -> bag (Noun)             Bücher -> Buch (Noun)
bacteria -> bacterium (Noun)   Taschen -> Tasche (Noun)

• Decomposer: decomposes compound words into their components

Kinderbuch (children's book) -> Kind (Noun) | Buch (Noun)

Versicherungsvertrag (insurance contract) -> Versicherung (Noun) | Vertrag (Noun)

• Stemmer: usually a simple algorithm (a large collection of stemmers is available in the Lucene contributions)

going -> go; decoder, decoding, decodes -> decod

Overstemming: Messer -> mess? king -> k? several, server -> server?

Understemming: spoke is not mapped to speak
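
The contrast between the two approaches can be sketched in a few lines. This is only an illustration under stated assumptions: `naive_stem` is a deliberately crude suffix stripper (not the Porter stemmer or any Lucene stemmer), and `LEXICON` is a hypothetical mini-lexicon built from the examples on this slide.

```python
# Hypothetical rule list; real stemmers use far more careful rules.
SUFFIXES = ("ing", "er", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, with no knowledge of the word itself."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix):
            return lowered[: -len(suffix)]
    return lowered

# Hypothetical mini-lexicon: full form -> (base form, part of speech).
LEXICON = {
    "going": ("go", "Verb"),
    "bought": ("buy", "Verb"),
    "bags": ("bag", "Noun"),
    "bacteria": ("bacterium", "Noun"),
    "spoke": ("speak", "Verb"),
    "king": ("king", "Noun"),
}

def lemmatize(word):
    """Look the word up in the lexicon; unknown words come back unchanged, untyped."""
    return LEXICON.get(word.lower(), (word, None))

for word in ("going", "king", "Messer", "spoke"):
    print(word, "stem:", naive_stem(word), "lemma:", lemmatize(word))
```

The stemmer overstems "king" and "Messer" and leaves "spoke" untouched, while the lexicon-based lookup handles the irregular form and also returns a part of speech.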

Page 12: The Typed Index

Bad Precision with Algorithmic Stemmer

Page 13: The Typed Index

High Recall and High Precision with Morphological Analyzers

Page 14: The Typed Index

High Recall and High Precision with Morphological Analyzers

Page 15: The Typed Index

Word Decomposition and Search

Federal Ministry for Family Affairs

Page 16: The Typed Index

Why do we need other Normalizations?

• Stemmers / Lemmatizers are language-specific

• MultiTermQueries: WildcardQuery, FuzzyQuery

– no stemming, no lemmatization

– should work on the original terms generated by the Tokenizer

– only very simple normalizations such as: Citroën -> Citroen

– in Solr: <analyzer type="multiterm">

• Case-Sensitive

– Stemmers / Lemmatizers map everything to lowercase

– sometimes case matters: MAN vs. man

• Phonetic Search (Double Metaphone):

– Mazlum -> MSLM; Muslim -> MSLM

– book -> PK; books -> PKS

– Kaouther Tabai -> K0R TP, Kouther Tapei -> K0R TP
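
As a rough illustration of the "very simple normalizations" mentioned above, the following sketch folds diacritics with Python's standard `unicodedata` module and keeps the case-sensitive full form next to the lowercased one. The function names and the returned dictionary keys are assumptions for this example; phonetic encoding (Double Metaphone) is not reimplemented here.

```python
import unicodedata

def fold_diacritics(term):
    """Decompose characters (NFD) and drop the combining marks: Citroën -> Citroen."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalizations(token):
    """Return several normal forms of one token (hypothetical key names)."""
    return {
        "fullform": token,                 # case preserved: MAN stays distinct from man
        "folded": fold_diacritics(token),  # diacritics removed, case preserved
        "lowercased": token.lower(),       # what a stemmer / lemmatizer would start from
    }

print(normalizations("Citroën"))
print(normalizations("MAN"), normalizations("man"))
```

Note that plain folding maps German umlauts to the bare vowel (ü -> u), which may or may not be the normalization one wants for a given language.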

Page 17: The Typed Index

Named Entity Recognition (NER)

Automated extraction of information from unstructured data

• People names

• Company names

• Brands from product lists

• Technical key figures from technical data (raw materials, product types, order IDs, process numbers, eClass categories)

• Names of streets and locations

• Currency and accounting values

• Dates

• Phone numbers, email addresses, hyperlinks

Page 18: The Typed Index

Why do we need these different types of terms in one field?

Page 19: The Typed Index

Why do we need them in one field?

• Query: “MAN sagt” PhraseQuery / NearQuery!

Matching Document: “MAN sagte”, not “man sagte”

• Query: “book of Kouther Tapei” PhraseQuery / NearQuery!

Matching Document: “books of Kaouther Tabai”

– For book to match books we need a stemmer or a lemmatizer

– For the names to match we need phonetics

• Query: Mazlum

– It leads to matches for the very frequent word Muslim

– Users want: give me phonetic matches for Mazlum, but not Muslim

– Mazlum=P AND NOT Muslim=E doesn’t do the job!

• No match for “Mazlum is a member of the Muslim society in Munich”

– spanNot(spanOr([body:V_mazlum, body:F_MSLM]), body:V_muslim)

– New Syntax: <Mazlum=P BUTNOT Muslim=E>

• Query: Persons near synonyms of founding and Microsoft

“E_Person found Microsoft” PhraseQuery / NearQuery
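
The difference between the document-level AND NOT and the position-level spanNot can be sketched with a toy model. This is not Lucene's SpanNotQuery; the phonetic codes are hardcoded from the slides rather than computed, the V_ terms are lowercased to mirror the query shown above, and all function names are assumptions.

```python
# Phonetic codes copied from the slides; a real system would compute them.
PHONETIC = {"mazlum": "MSLM", "muslim": "MSLM"}

def analyze(text):
    """Return one set of typed terms per token position (V_ full form, F_ phonetic)."""
    positions = []
    for token in text.split():
        lowered = token.lower()
        terms = {"V_" + lowered}
        if lowered in PHONETIC:
            terms.add("F_" + PHONETIC[lowered])
        positions.append(terms)
    return positions

def boolean_and_not(doc, include, exclude):
    """Document level: `include` occurs somewhere and `exclude` occurs nowhere."""
    has_include = any(include in terms for terms in doc)
    has_exclude = any(exclude in terms for terms in doc)
    return has_include and not has_exclude

def span_not(doc, include_any, exclude):
    """Position level: positions with any `include_any` term that do not carry `exclude`."""
    return [i for i, terms in enumerate(doc) if terms & include_any and exclude not in terms]

doc = analyze("Mazlum is a member of the Muslim society in Munich")
print(boolean_and_not(doc, "F_MSLM", "V_muslim"))        # document is rejected
print(span_not(doc, {"V_mazlum", "F_MSLM"}, "V_muslim"))  # only the Mazlum position survives
```

Because both kinds of terms live at the same positions of the same field, the exclusion can be applied per occurrence instead of per document.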

Page 20: The Typed Index

Semantic Search

Question: Wer hat Microsoft gegründet? (Who founded Microsoft?)

Page 21: The Typed Index

Semantic Search

Question: Wo liegen Werke von Audi? (Where are Audi's plants located?)

Page 22: The Typed Index

The Typed Index

Multilingual Search

Mixed Language Documents

Page 23: The Typed Index

The Typed Index

• We need different types of terms in one field

• Types are term properties: payloads are not a good option

• Use prefixes to distinguish them:

– V_ for full forms (case sensitive)

– N_ for diacritics normalizations

– F_ for phonetic normal forms

– E_ for entities

• E_Person, E_Location, E_Organization

• E_PersonName_Brown, E_Location_Munich

– B_ for base forms: B_Noun_book, B_Verb_fly, …

• Multilingual Search is handled in the same way:

B_EN_NOUN_book, B_DE_NOUN_buch
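
A minimal sketch of the prefix scheme above: every token yields several typed terms, and all of them are indexed at the same position in one field. The lexicons `BASEFORMS` and `ENTITIES` and the function names are hypothetical; the phonetic F_ type is left out because it would need a real phonetic encoder.

```python
import unicodedata

# Hypothetical mini-lexicons for illustration only.
BASEFORMS = {"books": ("Noun", "book"), "flies": ("Verb", "fly")}
ENTITIES = {"Munich": "Location"}

def fold_diacritics(term):
    """Drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def typed_terms(token):
    """All typed terms emitted for one token; they share the token's position."""
    terms = ["V_" + token, "N_" + fold_diacritics(token)]
    base = BASEFORMS.get(token.lower())
    if base:
        pos, lemma = base
        terms.append(f"B_{pos}_{lemma}")
    kind = ENTITIES.get(token)
    if kind:
        terms.append(f"E_{kind}")
        terms.append(f"E_{kind}_{token}")
    return terms

def index_field(text):
    """Single-field postings: typed term -> list of positions."""
    postings = {}
    for position, token in enumerate(text.split()):
        for term in typed_terms(token):
            postings.setdefault(term, []).append(position)
    return postings

postings = index_field("books about Munich")
print(postings)
```

A query can then pick the term type it needs, for example B_Noun_book for a linguistic match or E_Location for "any location", without switching fields.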

Page 24: The Typed Index

Multilingual Search: Standard Approach

Generate a language-specific copy of every content-field:

– configure language-specific analyzers for the language-specific fields

– Indexing: Adapt indexing chain to determine document language, generate new language-specific fields

– Search: Use MultiFieldQueryParser to expand query to every language-specific field

– Highlighting: depending on document-language call Highlighter for language-specific fields with the respective analyzer

– no solution for mixed-language documents

Page 25: The Typed Index

Multilingual Search and the Typed Index

Choose analyzer depending on language but do not use different fields:

– Analyzers generate terms typed with language: B_EN_NOUN_book, B_DE_NOUN_buch

– Indexing: choose analyzer in indexing chain based on language

– Search: Use a special MultiAnalyzerQueryParser to expand query to every language

– Highlighting: choose analyzer based on language and apply it to content-field

– Advantage: you could implement a multi-language analyzer for handling mixed-language documents, which switches language even within paragraphs.
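
The query-side expansion described above can be sketched as follows. This is an illustrative model only, not the MultiAnalyzerQueryParser plugin: `LEMMATIZERS` is a hypothetical pair of mini-lexicons, and `expand_query` simply runs a query token through every language analyzer and collects the language-typed terms, which a real parser would combine with OR.

```python
# Hypothetical mini-lexicons: lowercased full form -> (part of speech, base form).
LEMMATIZERS = {
    "EN": {"books": ("NOUN", "book"), "book": ("NOUN", "book"), "hand": ("NOUN", "hand")},
    "DE": {"bücher": ("NOUN", "buch"), "buch": ("NOUN", "buch"), "hand": ("NOUN", "hand")},
}

def analyze(token, lang):
    """Language-typed base form for a token, or an empty list if the lexicon does not know it."""
    entry = LEMMATIZERS[lang].get(token.lower())
    if entry is None:
        return []
    pos, lemma = entry
    return [f"B_{lang}_{pos}_{lemma}"]

def expand_query(token):
    """Run the token through every language analyzer; all results target the same field."""
    terms = []
    for lang in sorted(LEMMATIZERS):
        terms.extend(analyze(token, lang))
    return terms

print(expand_query("books"))
print(expand_query("Bücher"))
print(expand_query("Hand"))  # known in both languages, so two typed terms
```

All expanded terms are searched in the one content field, so no per-language field copies are needed.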

Page 26: The Typed Index

Summary: Advantages of the Typed Index over a Multi-Field Index

• Keep positions aligned in an easier way

• Only tokenize once: Performance!

• Reuse existing Queries like PhraseQueries, MultiPhraseQueries

• Treatment of mixed-language documents: use lemmatizer results to switch between languages
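
One possible reading of "use lemmatizer results to switch between languages" is sketched below: the analyzer stays in the current language until its lexicon no longer recognizes a token and another one does. The lexicons, the switching rule, the VERB tag, and the function names are all assumptions for illustration, not the behavior of the actual plugins.

```python
# Hypothetical mini-lexicons: lowercased full form -> (part of speech, base form).
LEXICONS = {
    "EN": {"read": ("VERB", "read"), "books": ("NOUN", "book")},
    "DE": {"lese": ("VERB", "lesen"), "bücher": ("NOUN", "buch")},
}

def analyze_mixed(text, default_lang="EN"):
    """One typed term per position; switch language when only another lexicon knows the token."""
    lang = default_lang
    terms = []
    for token in text.lower().split():
        if token not in LEXICONS[lang]:
            for other in sorted(LEXICONS):
                if token in LEXICONS[other]:
                    lang = other
                    break
        entry = LEXICONS[lang].get(token)
        if entry:
            pos, lemma = entry
            terms.append(f"B_{lang}_{pos}_{lemma}")
        else:
            terms.append("V_" + token)  # unknown everywhere: keep the full form
    return terms

def phrase_match(doc_terms, phrase_terms):
    """True if the phrase terms occur at consecutive positions in the document."""
    n = len(phrase_terms)
    return any(doc_terms[i:i + n] == phrase_terms for i in range(len(doc_terms) - n + 1))

doc_terms = analyze_mixed("read books lese Bücher")
print(doc_terms)
print(phrase_match(doc_terms, ["B_EN_NOUN_book", "B_DE_VERB_lesen"]))
```

Since the text is tokenized once and all terms share one position sequence, a phrase can match across the point where the language changes.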

Page 27: The Typed Index

Questions?

By the way: Our Analyzers are available as plugins for Lucene / Solr / Elasticsearch

Dr. Christoph Goller

Phone: +49 89 3090446-0

Fax: +49 89 3090446-29

Email: [email protected]

Web: www.intrafind.de

IntraFind Software AG

Landsberger Straße 368

80687 München

Germany

Thanks for listening