05/21/10 Apache Lucene EuroCon: Key topics when migrating from FAST to Solr. By Jan Høydahl, Cominvent AS
Key topics when migrating from FAST to Solr, EuroCon 2010

May 08, 2015

Cominvent AS

Presented during Lucene EuroCon 2010 in Prague. This presentation assumes no prior experience with FAST ESP, but some idea of what Solr/Lucene is. It gives you some hints on what to expect when migrating.
Transcript
Page 1: Key topics when migrating from FAST to Solr, EuroCon 2010

05/21/10Apache Lucene EuroCon

Key topics when migrating from FAST to Solr

By Jan Høydahl

Cominvent AS

Page 2: Key topics when migrating from FAST to Solr, EuroCon 2010

Agenda

About Cominvent & Jan Høydahl

Quick overview of FAST ESP

The migration step by step

Pain points

Q&A

Page 3: Key topics when migrating from FAST to Solr, EuroCon 2010

Jan Høydahl: BIO

● Enterprise search consultant since 2000

● Background in Telecom, Mobile services & software development

● Second FAST Global Services engineer

● Founder of Cominvent AS

● Lucid Imagination certified instructor & partner

● FAST Certified instructor

Logos represent projects I've been involved in; trademarks and copyrights belong to their respective companies

Page 4: Key topics when migrating from FAST to Solr, EuroCon 2010

Cominvent AS: Consulting

Vendor-independent search consulting

Page 5: Key topics when migrating from FAST to Solr, EuroCon 2010

Cominvent AS: Training

Certified Solr Training Partner with Lucid Imagination

Certified FAST ESP Training Partner

Photo: fluidpowerzone.com

Page 6: Key topics when migrating from FAST to Solr, EuroCon 2010

Solr training Oslo June 1-3

Page 7: Key topics when migrating from FAST to Solr, EuroCon 2010

Assumptions

The decision to migrate to Solr has already been made

This is not a "sales talk" for any particular technology

Basic knowledge of Solr

No or limited knowledge of FAST ESP

Migration to plain Solr or LucidWorks (LucidWorks Enterprise edition not considered)

Page 8: Key topics when migrating from FAST to Solr, EuroCon 2010

Introduction to...

...for Solr people

Page 9: Key topics when migrating from FAST to Solr, EuroCon 2010

Connectors

Security

Page 11: Key topics when migrating from FAST to Solr, EuroCon 2010

FAST ESP architecture

Source: www.microsoft.com

Page 12: Key topics when migrating from FAST to Solr, EuroCon 2010

[Figure: document processing pipeline. Labels: Format Conversion, Language Detection, Entities, Linguistic Normalization, Ontology, Custom Plug-in, Alert, Search, Taxonomy, Sentiment]

Sample input text: "PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes."

Very strong & scalable document processing framework

Page 13: Key topics when migrating from FAST to Solr, EuroCon 2010

FAST Document Processors (DP)

DPs transform documents prior to indexing

This is different from Solr's field-centric analysis

Examples of stages:

Encoding normalization, language identification

Text extraction (HTML, PDF, MS Office, etc.)

Tokenization, lemmatization, entity extraction

DPs are chained in pipelines

ESP ships with lots of useful DPs and pipelines

Written in Python, very easy to script new ones
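
To make the "document-centric, chained stages" idea concrete, here is a minimal generic sketch in Python. It does not use the FAST ESP processor API; the stage functions, the `run_pipeline` helper and the field names are all hypothetical and only illustrate how each stage sees the whole document and passes it on to the next.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    """Hypothetical stage: collapse runs of whitespace in the body field."""
    doc["body"] = " ".join(doc.get("body", "").split())
    return doc


def add_word_count(doc: Document) -> Document:
    """Hypothetical stage: derive a new field from another field of the same document."""
    doc["word_count"] = str(len(doc.get("body", "").split()))
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through every stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


processed = run_pipeline(
    {"id": "1", "body": "Venus  Williams raced\ninto the second round"},
    [normalize_whitespace, add_word_count],
)
print(processed)
```

The point of the sketch is that a stage can read and write any field of the document, which is what a Solr analyzer attached to a single field cannot do.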

Page 14: Key topics when migrating from FAST to Solr, EuroCon 2010

Terminology

Lucene/Solr          FAST
Replica              Search row
Shard                Column
Facet                Navigator
Spellcheck           Did you mean
Update processor     Document processor
Request Handler      Query Transformer (QT)
Response Writer      Result Processor (RP)/TWM

Page 15: Key topics when migrating from FAST to Solr, EuroCon 2010

Terminology

Lucene/Solr                              FAST
Schema                                   Index profile
Index segment                            Index partition
Lucene IndexWriter/IndexReader           indexer/fsearch (RTS)
~Multi core                              ~Multi cluster
(Documents receiving same processing)    Collection

Page 16: Key topics when migrating from FAST to Solr, EuroCon 2010

Important differences

Lucene/Solr                                     FAST
Most features query-time                        Most features index-time
Field centric analysis                          Document centric analysis
One language per field                          Multi lingual fields
One Update handler per input type (XML, CSV)    Format conversion in document pipeline
Slim disk & memory footprint                    Quite fat disk & memory footprint
One Java Web app                                15-20 processes

Page 17: Key topics when migrating from FAST to Solr, EuroCon 2010

Solr Architecture

Thanks to Christian Moen/ATILIKA for graphics

Page 19: Key topics when migrating from FAST to Solr, EuroCon 2010

Steps of the migration

Review current features & architecture

Keep all features? Add new?

Install Solr and do a quick iteration (1-2 days):

Draft schema.xml & solrconfig.xml

Dump & index some real data

Play around with queries – Solritas is nice here

Design spec covering all migration areas:

Schema, Content, Feeding & Analysis

Frontends, Querying & API

Admin & Operational

Implement :)

Page 20: Key topics when migrating from FAST to Solr, EuroCon 2010

Spreadsheet for planning the schema

Page 21: Key topics when migrating from FAST to Solr, EuroCon 2010

Migrating index-profile -> Solr schema

ESP index profile -> Solr schema.xml

FAST fields example:

Solr equivalent:

Example: A field with "tokenize=auto" in FAST → type="text"

Create new <fieldType>s as needed
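
The one mapping stated above (tokenize=auto → type="text") could be scripted roughly as follows. This is only a sketch: the input dictionary shape, the fallback to type="string" for everything else, and the indexed/stored defaults are assumptions made for illustration, not rules taken from the slides or from either product.

```python
from xml.etree.ElementTree import Element, tostring


def to_solr_field(fast_field: dict) -> str:
    """Render one Solr <field> element from a simplified FAST field description.

    Assumption: `fast_field` is a hand-made dict such as
    {"name": "title", "tokenize": "auto"}, not a parsed index profile.
    """
    # From the slide: tokenize=auto in FAST corresponds to type="text" in Solr.
    # Assumption: anything else falls back to type="string".
    solr_type = "text" if fast_field.get("tokenize") == "auto" else "string"
    element = Element(
        "field",
        {
            "name": fast_field["name"],
            "type": solr_type,
            "indexed": "true",  # assumed default
            "stored": "true",   # assumed default
        },
    )
    return tostring(element, encoding="unicode")


title_xml = to_solr_field({"name": "title", "tokenize": "auto"})
category_xml = to_solr_field({"name": "category"})
print(title_xml)
print(category_xml)
```

In a real migration each generated line would still need review against the planning spreadsheet, since FAST field options rarely map one-to-one.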

Page 22: Key topics when migrating from FAST to Solr, EuroCon 2010

Product facets & generic fields

With FAST you often use «generic1», «generic2» etc. to model product facets which may vary between product groups. Front ends need logic to convert.

Page 23: Key topics when migrating from FAST to Solr, EuroCon 2010

Product facets & generic fields

With Solr, using dynamic fields, each document can have as many facets as you like.

Makes it easy to e.g. introduce a new «color» facet for cars or a «MegaPixels» facet for digital cameras
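
A feeding client could build such documents along these lines. The sketch assumes the schema declares a dynamic field pattern like `*_facet`; that suffix, the function name and the sample products are hypothetical.

```python
def build_product_doc(product_id: str, name: str, facets: dict) -> dict:
    """Build a Solr input document where each product facet becomes its own field.

    Assumption: schema.xml has a dynamicField matching "*_facet".
    """
    doc = {"id": product_id, "name": name}
    for facet_name, value in facets.items():
        # e.g. "color" -> "color_facet"; no generic1/generic2 lookup needed
        doc[f"{facet_name}_facet"] = value
    return doc


car = build_product_doc("car-1", "Family car", {"color": "red"})
camera = build_product_doc("cam-7", "Compact camera", {"megapixels": "12"})
print(car)
print(camera)
```

Because the field name carries the facet name, the front end can request `facet.field=color_facet` directly instead of translating from a generic slot.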

Page 24: Key topics when migrating from FAST to Solr, EuroCon 2010

Composite fields -> DisMax ReqHandler

FAST uses composite fields to search across multiple fields, with weighting defined in Rank Profiles

FAST's composite fields & rank profiles can be modelled as Solr «DisMax» queries

Set suitable defaults in solrconfig.xml using named request handler instances.

In case of many fields & performance issues, use <copyField> to group similarly ranked fields!

Freshness boost, GEO boost etc. are handled through Function Queries

Page 25: Key topics when migrating from FAST to Solr, EuroCon 2010

Composite fields -> DisMax ReqHandler

Given a FAST composite field / Rank Profile

Page 26: Key topics when migrating from FAST to Solr, EuroCon 2010

Composite fields -> DisMax ReqHandler

This Solr query will do the same, configurable per query:

qt=dismax

q=oslo

qf=title^5.0 teaser^1.5 body^0.1

bf=recip(rord(last_modified),1,1000,1000)

...DisjunctionMaxQuery((teaser:foo^1.5 | title:foo^5.0 | body:foo^0.1)~0.01) DisjunctionMaxQuery((teaser:bar^1.5 | title:bar^5.0 | body:bar^0.1)~0.01) FunctionQuery(1000.0/(1.0*float(top(rord(last_modified)))...
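
From a client, the parameters on this slide could be assembled as in the sketch below. The function name and the `/select` path in the print statement are assumptions; the parameter values are the ones shown on the slide.

```python
from urllib.parse import parse_qs, urlencode


def dismax_params(user_query: str) -> str:
    """URL-encode the DisMax parameters from the slide for one user query."""
    params = {
        "qt": "dismax",
        "q": user_query,
        # field weights, playing the role of the FAST rank profile
        "qf": "title^5.0 teaser^1.5 body^0.1",
        # freshness boost expressed as a function query
        "bf": "recip(rord(last_modified),1,1000,1000)",
    }
    return urlencode(params)


query_string = dismax_params("oslo")
print("/select?" + query_string)  # "/select" is an assumed handler path
# Decoding the string again shows the parameters survive the round trip.
print(parse_qs(query_string))
```

Once the weights are stable, they would normally move into a named request handler in solrconfig.xml so that clients only send `q`.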

Page 27: Key topics when migrating from FAST to Solr, EuroCon 2010

Static document boosts

FAST uses the «hwboost» field to add a static Quality boost to each document.

In Solr, you have more flexibility:

Add a boost to each document

<doc boost="10.0">

Add a boost to each field

<field name="title" boost="10.0">

Include any numeric document field in a BoostFunction

bf=sum(sqrt(popularity)^100.0, staticboost^20.0)

Page 28: Key topics when migrating from FAST to Solr, EuroCon 2010

Navigator statistics

FAST navigators provide statistics metadata (min/max/avg/sum)

Solution: Use the StatsComponent
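
A request that asks the StatsComponent for statistics might be built like this. `stats=true` and `stats.field` are the component's request parameters; the helper name, the `price` field and the choice of `rows=0` are assumptions for the example.

```python
from urllib.parse import urlencode


def stats_params(query: str, numeric_fields: list) -> str:
    """Build a query string that requests statistics for the given numeric fields."""
    params = [("q", query), ("rows", "0"), ("stats", "true")]
    # stats.field may be repeated, once per field
    params += [("stats.field", field) for field in numeric_fields]
    return urlencode(params)


stats_query = stats_params("*:*", ["price"])
print(stats_query)
```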

Page 29: Key topics when migrating from FAST to Solr, EuroCon 2010

Navigator auto-buckets

FAST numeric navigators give auto-bucketing based on equal-frequency, equal-width, manual

Solution: Create a new field which is pre-computed

Example: Document A has price=200.000, add pricerange="150.000 – 1.299.999"

Or use facet queries (expensive)

Or implement auto-bucketing and contribute the patch :-)
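
The pre-computed field from the example could be filled at feeding time with something like the sketch below. The bucket boundaries and labels are made up; only the 150.000 to 1.299.999 bucket comes from the slide.

```python
from bisect import bisect_right

# Hypothetical manual buckets: each boundary is the lower edge of the next label.
BOUNDARIES = [50_000, 150_000, 1_300_000]
LABELS = [
    "0 – 49.999",
    "50.000 – 149.999",
    "150.000 – 1.299.999",
    "1.300.000+",
]


def price_range(price: int) -> str:
    """Return the label of the bucket that contains the given price."""
    return LABELS[bisect_right(BOUNDARIES, price)]


doc = {"id": "A", "price": 200_000}
doc["pricerange"] = price_range(doc["price"])
print(doc)
```

The new field is then faceted on like any other string field; changing the boundaries means re-feeding the affected documents.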

Page 30: Key topics when migrating from FAST to Solr, EuroCon 2010

XRANK

FAST has a feature to boost documents satisfying an "XRANK" sub-query with a certain static boost

In Solr, you can solve most XRANK use cases using FunctionQueries

Page 31: Key topics when migrating from FAST to Solr, EuroCon 2010

Scope search

FAST offers a field type which holds arbitrary XML

Search in XPath-style: xml:companies:company:and(revenue:>1000, employees:>=100)

Have not found a similar field type in Lucene.

Anyone?

Page 32: Key topics when migrating from FAST to Solr, EuroCon 2010

Migrating Connectors

FAST's connectors are many and mature

For simple use cases, consider Solr's DIH:

Supports DB, RSS, Web-services, Local filesystem

Additionally through Lucene Connectors Framework:

EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS

New connectors should be written for LCF and be submitted back to the community :)

Page 33: Key topics when migrating from FAST to Solr, EuroCon 2010

Migrating Web Crawler

FAST's crawler is mature, performing & scalable

Solr has no built-in web crawler

Prepare for a lot of extra work migrating the crawler

Alternatives:

The Apache Nutch crawler (steep learning curve)

Apache Droids

Heritrix + Solr (example in Solr 1.4 book)

OpenPipeline has a (very) simple crawler

Page 34: Key topics when migrating from FAST to Solr, EuroCon 2010

Migrating document processing

Solr lacks a sophisticated processing pipeline.

Alternatives:

Solr's UpdateProcessorChain for simple pipelines:

Write a Solr UpdateProcessor (in Java, Jython etc., see SOLR-1725)

OpenPipeline for more advanced requirements:

Check out FindWise's talk

Integrated with Solr

LingPipe NamedEntityExtractor plugin

Page 35: Key topics when migrating from FAST to Solr, EuroCon 2010

Document processing examples

Binary documents with metadata

Actual customer request: Enrich library records with PDF content

Use OpenPipeline with Apache Tika processor

Implement Tika as an UpdateRequestProcessor (SOLR-1763)

Custom XML using FAST's XMLMapper

DIH's built-in XPath support

XSLT to Solr input XML

Write a new XMLMapper Update Request Handler?

Page 36: Key topics when migrating from FAST to Solr, EuroCon 2010

Multi lingual

FAST is state of the art on linguistics

FAST is language aware, e.g. the title field is "analyzed" depending on detected language

Solr is not language aware

Each field type has one and only one language

Most common solution:

One field type per language: text_no, text_en, text_de

Dynamic fields: <dynamicField name="*_en" type="text_en"..../>

Implement language awareness in application layer (feeding + querying)
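
On the feeding side, the application-layer language awareness might look like this sketch. It assumes the document language is already known (detected elsewhere) and that the schema has `*_no`, `*_en` and `*_de` dynamic fields; the function names, the supported-language set and the fallback to English are hypothetical.

```python
SUPPORTED_LANGUAGES = {"no", "en", "de"}  # assumed to match the dynamic fields
DEFAULT_LANGUAGE = "en"                   # assumed fallback


def localized_field(base_name: str, language: str) -> str:
    """Return the language-specific field name, e.g. title + no -> title_no."""
    lang = language if language in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE
    return f"{base_name}_{lang}"


def build_doc(doc_id: str, title: str, body: str, language: str) -> dict:
    """Route title and body into the fields analyzed for the document's language."""
    return {
        "id": doc_id,
        "language": language,
        localized_field("title", language): title,
        localized_field("body", language): body,
    }


norwegian_doc = build_doc("1", "Søk i Oslo", "Tekst på norsk", "no")
print(norwegian_doc)
```

The query side needs the same mapping, so that a Norwegian query is sent against `title_no` and `body_no`.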

Page 37: Key topics when migrating from FAST to Solr, EuroCon 2010

Multi lingual – advanced

FAST ships with Lemmatization for most languages

Solr ships with Stemming – has limitations

Solutions for multi lingual needs:

KStem is tighter. Free with

License 3rd party linguistics

Example: BasisTech Rosette Linguistic Platform

Lemmatization, POS etc.

Page 38: Key topics when migrating from FAST to Solr, EuroCon 2010

Multi lingual – very advanced

FAST allows lemmatization by index expansion

This can be useful if your frontend does not know what languages are being queried, as all the word inflections are stored in the index.

There is no solution for this in Solr today.

Workaround: DisMax query spanning all languages:

q=eurocon&qf=text_en^2.0 text_no text_de text_it

Downside: This gets ugly and slow with increasing number of languages
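
The workaround query can be generated from a list of languages, as in this sketch. The `text_<lang>` naming follows the slide; the function name, the idea of boosting one preferred language, and sending `qt=dismax` explicitly are assumptions.

```python
def cross_language_params(user_query: str, languages: list,
                          preferred: str, boost: float = 2.0) -> dict:
    """Build DisMax parameters that search one text field per language."""
    fields = []
    for lang in languages:
        field = f"text_{lang}"
        # Only the preferred language gets an explicit boost.
        fields.append(f"{field}^{boost}" if lang == preferred else field)
    return {"qt": "dismax", "q": user_query, "qf": " ".join(fields)}


params = cross_language_params("eurocon", ["en", "no", "de", "it"], preferred="en")
print(params["qf"])
```

The length of `qf` grows with every language added, which is the downside mentioned above.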

Page 39: Key topics when migrating from FAST to Solr, EuroCon 2010

Migrating Front ends / Query

Using a search middleware with Solr support? Lucky you!

If not, consider introducing one now:

Using FAST Java/.NET APIs? Choose SolrJ or SolrNET/SolrSharp

Query language differences, e.g. &fq= instead of filter()

Solr facets do not require session/state as FAST's do

Page 40: Key topics when migrating from FAST to Solr, EuroCon 2010

Result views

FAST uses "result-view" and "search profile" to specify what fields to return.

Migrate FAST's «views» into named RequestHandler configs with all default presets

No need to define fields to return up-front! Use fl=a,b,c...

Page 41: Key topics when migrating from FAST to Solr, EuroCon 2010

Operations

Solr has no central admin-server (until "SolrCloud")

For GUI installer, use

Multiple cores – allows smooth schema upgrade etc.

No built-in query reporting, log analysis or monitoring. But have a look at:

Page 42: Key topics when migrating from FAST to Solr, EuroCon 2010

Summary

Many migrations are (quite) straightforward!

Warning flags:

Multi-lingual and advanced linguistics

Heavy use of Document Processing, including Entity Extraction

Scope search

Other enterprise complexities (security, connectors etc.)

Follow a structured process:

Quick prototyping

Design spec for each area

Don't forget to analyze logs and measure user satisfaction!

Page 43: Key topics when migrating from FAST to Solr, EuroCon 2010

Thank You

www.cominvent.com

www.twitter.com/cominvent

[email protected]

linkedin.com/in/janhoy

This presentation is licensed under a CC-BY-SA license. You must attribute Cominvent with name and link.