Optimizing Multilingual Search Principal Software Engineer, Basis Technology [email protected] David Troiano

Optimizing multilingual search in SOLR

Jul 16, 2015

Page 1: Optimizing multilingual search in SOLR

Optimizing Multilingual Search

Principal Software Engineer, Basis Technology

[email protected]

David Troiano

Page 2: Optimizing multilingual search in SOLR

Talk Overview

• The problem we’re trying to solve

• Natural language processing (NLP)

• Approaches to multilingual search in Solr

Page 3: Optimizing multilingual search in SOLR

A Multilingual Search Example

Page 4: Optimizing multilingual search in SOLR

The Goal

• Build a search engine where:

• Document corpus spans multiple languages

– Potentially mixed language documents

• Queries within a single language, or potentially spanning multiple languages

Page 5: Optimizing multilingual search in SOLR

NLP Meets Search (Querying)

query: “clinton speaking” → NLP pipeline → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...

Page 6: Optimizing multilingual search in SOLR

NLP Meets Search (Indexing)

Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...

Page 7: Optimizing multilingual search in SOLR

NLP Meets Search

Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about

query: “clinton speaking” → NLP pipeline → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
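The flow on this slide can be sketched in a few lines of Python. This is a toy illustration, not how Solr is implemented: the `LEMMAS` table, the whitespace tokenizer, and document 456 are assumptions made up for the example. It shows why the same pipeline has to run at index time and at query time for “clinton speaking” to match “Bill Clinton spoke about ...”.

```python
from collections import defaultdict

# Toy stand-in for the NLP pipeline (hypothetical data, not a real lemmatizer).
LEMMAS = {"spoke": "speak", "speaking": "speak", "speaks": "speak"}


def analyze(text):
    """Lowercase, split on whitespace, and normalize each token."""
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]


def build_index(docs):
    """Map each normalized term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


# Document 123 is from the slide; document 456 is invented for contrast.
docs = {123: "Bill Clinton spoke about", 456: "Recipes for dessert"}
index = build_index(docs)
hits = search(index, "clinton speaking")
```

Here `hits` contains only document 123, because both the document and the query are reduced to the terms `clinton` and `speak` before lookup.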

Page 8: Optimizing multilingual search in SOLR

The NLP Pipeline

• Language Detection

• Tokenization

• Decompounding

• Word Form Normalization

Page 9: Optimizing multilingual search in SOLR

Language Detection

• Often required when indexing

• Typically not used at query time

– Lower accuracy on short strings

– Sometimes unsolvable even to humans, e.g., named entities

– End user applications often know query language upstream of search engine

– No readily available plugin pattern in Solr

Page 10: Optimizing multilingual search in SOLR

Tokenization

• Breaking text into words

• Particularly difficult with CJK languages

– Find the words: 帰国後ハーバード大学に入学を認められていたもの

Page 11: Optimizing multilingual search in SOLR

Decompounding

• Breaking compound words into subcomponents

• Common in German, Dutch, Korean

– Samstagmorgen → Samstag, morgen
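A minimal sketch of dictionary-based decompounding, assuming a tiny hand-made lexicon. The `LEXICON` set and the longest-first strategy are illustrative assumptions; production decompounders rely on full dictionaries and language-specific rules.

```python
# Hypothetical mini-lexicon; a real decompounder uses a full dictionary
# and language-specific rules.
LEXICON = {"samstag", "morgen", "tag"}


def decompound(word, lexicon=LEXICON):
    """Split word into lexicon entries, trying the longest first part first.

    Returns [word] (lowercased) when no complete split exists.
    """
    word = word.lower()

    def split(rest):
        if not rest:
            return []
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no way to cover the remaining characters

    parts = split(word)
    return parts if parts else [word]


parts = decompound("Samstagmorgen")
```

With this lexicon, `parts` is `["samstag", "morgen"]`, so a query for either component can match a document containing the compound.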

Page 12: Optimizing multilingual search in SOLR

Word Form Normalization

• Reduce word form variations to a canonical representation

• Critical for recall

• Two approaches

– Stemming

– Lemmatization

Page 13: Optimizing multilingual search in SOLR

Normalization: Stemming

• Simple rules-based approach

• “Chop off the end”

– arsenal, arsenic → arsen
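The “chop off the end” idea can be sketched as naive suffix stripping. The suffix list below is an assumption chosen to reproduce the slide's example; it is not the rule set of any real stemmer.

```python
# Toy suffix list for illustration only; real stemmers apply ordered rule
# sets with conditions on the remaining stem.
SUFFIXES = ("ing", "al", "ic", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ("arsenal", "arsenic", "speaking", "spoke")]
```

Two unrelated words collapse to `arsen`, which hurts precision, while the irregular form `spoke` is left untouched and never meets `speak`, which hurts recall.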

Page 14: Optimizing multilingual search in SOLR

Normalization: Lemmatization

• Map words to their dictionary form via morphological analysis

• spoke, speaks, speaking → speak

• Higher precision and recall compared to stemming
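As a contrast with suffix stripping, lemmatization can be pictured as a lookup from surface form to dictionary form. The table below is a hypothetical stand-in; a real lemmatizer derives these mappings through morphological analysis.

```python
# Hypothetical form-to-lemma table standing in for morphological analysis.
LEMMA_TABLE = {
    "spoke": "speak",
    "speaks": "speak",
    "speaking": "speak",
}


def lemmatize(token):
    """Return the dictionary form if known, else the lowercased token."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)


lemmas = [lemmatize(w) for w in ("spoke", "speaks", "speaking", "arsenal", "arsenic")]
```

All three verb forms map to `speak`, and `arsenal` and `arsenic` stay distinct, which is the precision and recall advantage the slide refers to.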

Page 15: Optimizing multilingual search in SOLR

NLP Meets Search

Document 123: Bill Clinton spoke about ... → NLP pipeline → Terms: bill, clinton, speak, about

query: “clinton speaking” → NLP pipeline → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...

Solr

Page 16: Optimizing multilingual search in SOLR

NLP Within Solr

• Maximal precision / recall requires NLP pipeline per language

• NLP pipeline (mostly) specified within Solr field type

• Index / query strategies in Solr

– Field per language

– Core per language

– A new approach: Single multilingual field

Page 17: Optimizing multilingual search in SOLR

Field Per Language

schema.xml

<field name="content_cjk" type="text_cjk" indexed="true" stored="true" />
<field name="content_eng" type="text_eng" indexed="true" stored="true" />

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

query

http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng

Page 18: Optimizing multilingual search in SOLR

Field Per Language

http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng

q=serie%20a

Page 19: Optimizing multilingual search in SOLR

Field Per Language

http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng

defType=edismax

Page 20: Optimizing multilingual search in SOLR

Field Per Language

http://<solr url>/solr/articles/select?q=serie%20a&defType=edismax&qf=content_cjk%20content_eng

qf=content_cjk%20content_eng
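The three highlighted parameters can be assembled programmatically. The sketch below uses only the Python standard library; the base URL `http://localhost:8983` and the helper name are assumptions for illustration, while the parameter names and field names come from the slides.

```python
from urllib.parse import quote, urlencode


def field_per_language_url(base_url, collection, query, fields):
    """Build a select URL that searches one query across per-language fields."""
    params = {
        "q": query,               # the user's query text
        "defType": "edismax",     # query parser that accepts a list of fields
        "qf": " ".join(fields),   # space-separated fields to search
    }
    # quote_via=quote encodes spaces as %20, matching the slide's URL.
    return f"{base_url}/solr/{collection}/select?" + urlencode(params, quote_via=quote)


# Hypothetical host; collection and field names are taken from the slide.
url = field_per_language_url(
    "http://localhost:8983", "articles", "serie a", ["content_cjk", "content_eng"]
)
```

Adding a language then means adding one more field name to the list, which is the main appeal of the field-per-language approach.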

Page 21: Optimizing multilingual search in SOLR

Core Per Language

CJK core’s schema.xml

<field name="content" type="text_cjk" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">

<analyzer>

<tokenizer class="solr.StandardTokenizerFactory"/>

<filter class="solr.CJKWidthFilterFactory"/>

<filter class="solr.CJKBigramFilterFactory"/>

</analyzer>

</fieldType>

query

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng

Page 22: Optimizing multilingual search in SOLR

Core Per Language

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng

q=content:serie%20a

Page 23: Optimizing multilingual search in SOLR

Core Per Language

http://.../select?q=content:serie%20a&shards=<url>/articles_cjk,<url>/articles_eng

shards=<url>/articles_cjk,<url>/articles_eng
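With one core per language, the set of languages to search becomes the `shards` list. A minimal sketch follows, assuming hypothetical core locations in `CORES`; only the `q` and `shards` parameter names and the core names come from the slides.

```python
from urllib.parse import quote, urlencode

# Hypothetical core locations, keyed by language group.
CORES = {
    "cjk": "localhost:8983/solr/articles_cjk",
    "eng": "localhost:8983/solr/articles_eng",
}


def core_per_language_params(query, languages, field="content"):
    """Build the query string for a distributed search over per-language cores."""
    shards = ",".join(CORES[lang] for lang in languages)
    params = {"q": f"{field}:{query}", "shards": shards}
    # Keep ':', '/' and ',' readable; everything else is percent-encoded.
    return urlencode(params, quote_via=quote, safe=":/,")


query_string = core_per_language_params("serie a", ["cjk", "eng"])
```

Querying fewer languages only shortens the `shards` list, since every core uses the same field name `content`.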

Page 24: Optimizing multilingual search in SOLR

Approach Comparison

Criterion     Field Per Language     Core Per Language

Simplicity

Speed

Page 25: Optimizing multilingual search in SOLR

Approach Comparison: Query Latency

• Experimental Setup

• Corpus: Wikipedia across 9 languages (9 million articles)

• Queries: 1000 most frequently used terms for each language, randomized

• JMeter running 1 hour for each of 6 test runs

[Chart: average latency (ms), axis 0 to 160, by # languages queried (1, 4, 9); series: Field per lang, Core per lang]

Page 26: Optimizing multilingual search in SOLR

An Alternative Approach

• All languages in a single field

• Requires a custom meta field type that applies per-language concrete field type(s)

• Patch submitted to Solr

• cf. Solr In Action / Trey Grainger

• https://github.com/treygrainger/solr-in-action

Page 27: Optimizing multilingual search in SOLR

An Alternative Approach

query: “[en, es]clinton speaking” → Inspect [en, es], apply English and Spanish field types to “clinton speaking”, merge results → Terms: clinton, speak

Inverted Index

term       document IDs
...        ...
clinton    …, 123, ...
...        ...
speak      …, 123, ...
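The “inspect the prefix, apply each language's analysis, merge” step might look roughly like the following. This is a sketch of the idea only, not the submitted patch or the Solr in Action code: the regular expression, the `ANALYZERS` table, and both toy analyzers are assumptions.

```python
import re

# Matches a leading "[en, es]" style language list followed by the text.
PREFIX = re.compile(r"^\[([^\]]*)\](.*)$", re.DOTALL)


def analyze_en(text):
    """Toy English analyzer with a hypothetical two-entry lemma table."""
    lemmas = {"speaking": "speak", "spoke": "speak"}
    return [lemmas.get(tok, tok) for tok in text.lower().split()]


def analyze_es(text):
    """Toy Spanish analyzer: lowercase and split only."""
    return text.lower().split()


# Stand-ins for the per-language concrete field types.
ANALYZERS = {"en": analyze_en, "es": analyze_es}


def parse_prefix(text):
    """Return (language codes, remaining text); no prefix means no codes."""
    match = PREFIX.match(text)
    if not match:
        return [], text
    codes = [code.strip() for code in match.group(1).split(",") if code.strip()]
    return codes, match.group(2)


def analyze_multilingual(text):
    """Run every requested analyzer and merge terms, dropping duplicates."""
    langs, body = parse_prefix(text)
    terms = []
    for lang in langs:
        for term in ANALYZERS[lang](body):
            if term not in terms:
                terms.append(term)
    return terms


terms = analyze_multilingual("[en, es]clinton speaking")
```

In this toy version the merged list is `["clinton", "speak", "speaking"]`: the English analyzer contributes `speak`, and the Spanish stand-in leaves `speaking` unchanged.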

Page 28: Optimizing multilingual search in SOLR

An Alternative Approach

• Results scoring potentially worse than other approaches

• IDF thrown off with single field

– e.g., soy common in Spanish, relatively rare in English

– Consider a query for “soy dessert recipe” against a corpus of English and Spanish recipes

– Though IDF of named entity tokens perhaps better with a single field…
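The IDF effect can be made concrete with the textbook formula log(N / df). The document counts below are invented purely for illustration, and Lucene's similarity classes use smoothed variants of this formula, so treat the numbers as directional only.

```python
import math


def idf(num_docs, doc_freq):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


# Hypothetical counts: how many recipes in each language contain "soy".
english = {"docs": 1000, "soy_df": 20}
spanish = {"docs": 1000, "soy_df": 600}

idf_en = idf(english["docs"], english["soy_df"])   # rare term, high weight
idf_es = idf(spanish["docs"], spanish["soy_df"])   # common term, low weight
idf_single_field = idf(
    english["docs"] + spanish["docs"],
    english["soy_df"] + spanish["soy_df"],
)
```

With a single field, “soy” gets one blended weight that lies between the two per-language values, so it counts for less than it should in an English query such as “soy dessert recipe”.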

Page 29: Optimizing multilingual search in SOLR

Enhancing NLP Pipeline

• Limitations of NLP in Solr out of the box

• Poor precision / performance of CJK tokenization

• Poor precision / recall of stemmers (no lemmatizers)

• Poor recall due to lack of decompounding

Rosette to the rescue!

Page 30: Optimizing multilingual search in SOLR

CJK Tokenization

ケネディはマサチューセッツ

Rosette: ケネディ, は, マサチューセッツ

Bigrams: ケネ, ネデ, ディ, ィは, はマ, マサ, サチ, チュ, ュー, ーセ, セッ, ッツ

How does this impact precision, recall, index size, speed?
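A quick way to get a feel for the index-size part of that question is to count terms. The sketch below generates plain overlapping character bigrams, which is a simplification and not the exact behavior of Solr's CJK filters; the word segmentation is copied from the slide.

```python
def char_bigrams(text):
    """Return every overlapping two-character window of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


text = "ケネディはマサチューセッツ"
words = ["ケネディ", "は", "マサチューセッツ"]  # segmentation shown on the slide
bigrams = char_bigrams(text)

term_counts = {"words": len(words), "bigrams": len(bigrams)}
```

For this 13-character string the bigram approach indexes 12 terms against 3 for word segmentation, and most of those bigrams are fragments that can match unrelated words.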

Page 31: Optimizing multilingual search in SOLR

Rosette In Solr

<fieldType name="text_zho" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" rootDirectory="<rootDir>" language="zho" />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" rootDirectory="<rootDir>" language="zho" />
  </analyzer>
</fieldType>

cf. http://www.basistech.com/search-essentials/

Page 32: Optimizing multilingual search in SOLR

Wrapping Up

• Multilingual search is everywhere

• Solr as your multilingual search platform

• Search quality hinges on quality of NLP tools

Page 33: Optimizing multilingual search in SOLR

Optimizing Multilingual Search

• David Troiano

• Principal Software Engineer, Basis Technology

[email protected]