05/21/10 Apache Lucene EuroCon Key topics when Migrating from FAST to Solr By Jan Høydahl cominvent as
May 08, 2015
Key topics when
Migrating from FAST to Solr
By Jan Høydahl
cominvent as
Agenda
About Cominvent & Jan Høydahl
Quick overview of FAST ESP
The migration step by step
Pain points
Q&A
Jan Høydahl: BIO
● Enterprise search consultant since 2000
● Background in Telecom, Mobile services & software development
● Second FAST Global Services engineer
● Founder of Cominvent AS
● Lucid Imagination certified instructor & partner
● FAST Certified instructor
Logos represent projects I've been involved in, and ™ are © of respective companies
Cominvent AS: Consulting
Vendor independent search consulting
Cominvent AS: Training
Certified Solr Training Partner with Lucid Imagination
Certified FAST ESP Training Partner
Photo: fluidpowerzone.com
Solr training Oslo June 1-3
Assumptions
Decision to migrate to Solr is already done
This is not a "sales talk" for any particular technology
Basic knowledge of Solr
None or limited knowledge of FAST ESP
Migration to plain Solr or LucidWorks (LucidWorks Enterprise edition not considered)
Introduction to...
...for Solr people
Connectors
Security
FAST ESP architecture
Source: www.microsoft.com
[Figure: document processing pipeline; stage labels: Format Conversion, Language Detection, Linguistic Normalization, Entities, Taxonomy, Sentiment, Ontology, Custom Plug-in, Alert, Search; sample input: a Reuters news item about Venus Williams at the French Open]
Very strong & scalable document processing framework
FAST Document Processors (DP)
DPs transform documents prior to indexing
This is different from Solr's field-centric analysis
Examples of stages:
Encoding normalization, language identification
Text extraction (HTML, PDF, MS Office, etc.)
Tokenization, lemmatization, entity extraction
DPs are chained in pipelines
ESP ships with lots of useful DPs and pipelines
Written in Python, very easy to script new ones
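The idea of chaining document processors into a pipeline can be sketched in a few lines of Python. This is a generic illustration only, not the FAST ESP processor API: the stage functions, the dict-based document and the `run_pipeline` helper are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    """Hypothetical stage: collapse runs of whitespace in the body field."""
    doc["body"] = re.sub(r"\s+", " ", str(doc.get("body", ""))).strip()
    return doc


def extract_amounts(doc: Document) -> Document:
    """Hypothetical stage: very naive 'entity extraction' of dollar amounts."""
    doc["amounts"] = re.findall(r"\$[\d.,]*\d", str(doc.get("body", "")))
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the whole document through each stage, in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


pipeline = [normalize_whitespace, extract_amounts]
result = run_pipeline(
    {"body": "Venus Williams raced into the second round\nof the $11.25 million French Open"},
    pipeline,
)
```

Each stage sees the whole document and may add or change any field, which is the document-centric view that differs from Solr's per-field analysis chains.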
Terminology
Lucene/Solr | FAST
Replica | Search row
Shard | Column
Facet | Navigator
Spellcheck | Did you mean
Update processor | Document processor
Request Handler | Query Transformer (QT)
Response Writer | Result Processor (RP)/TWM
Terminology
Lucene/Solr | FAST
Schema | Index profile
Index segment | Index partition
Lucene IndexWriter/Rdr | indexer/fsearch (RTS)
~Multi core | ~Multi cluster
(Documents receiving same processing) | Collection
Important differences
Lucene/Solr | FAST
Most features query-time | Most features index-time
Field centric analysis | Document centric analysis
One language per field | Multi lingual fields
One Update handler per input type (XML, CSV) | Format conversion in document pipeline
Slim disk & memory footprint | Quite fat disk & memory footprint
One Java Web app | 15-20 processes
Solr Architecture
Thanks to Christian Moen/ATILIKA for graphics
The migration...
Steps of the migration
Review current features & architecture
Keep all features? Add new?
Install Solr and do a quick iteration (1-2 days):
Draft schema.xml & solrconfig.xml
Dump & index some real data
Play around with queries – Solritas is nice here
Design spec covering all migration areas:
Schema, Content, Feeding & Analysis
Frontends, Querying & API
Admin & Operational
Implement :)
Spreadsheet for planning the schema
Migrating index-profile -> Solr schema
ESP index profile -> Solr schema.xml
FAST fields example:
Solr equivalent:
Example: A field with "tokenize=auto" in FAST → type="text"
Create new <fieldType>s as needed
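As a rough sketch of how such a mapping could be scripted, the snippet below turns a simplified list of FAST field definitions into Solr `<field>` lines. Only the `tokenize=auto` → `type="text"` rule comes from the slide; the dict representation of the index profile, the `tokenize="no"` value, the fallback to `string` and the `indexed`/`stored` defaults are assumptions for illustration.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical, simplified view of fields taken from a FAST index profile.
fast_fields = [
    {"name": "title", "tokenize": "auto"},
    {"name": "isbn", "tokenize": "no"},
]


def solr_type_for(field):
    # From the slide: tokenize=auto maps to the Solr "text" type.
    # Mapping everything else to "string" is an assumption.
    return "text" if field.get("tokenize") == "auto" else "string"


def to_solr_field(field):
    """Render one schema.xml <field> line for a FAST field definition."""
    return '<field name=%s type=%s indexed="true" stored="true"/>' % (
        quoteattr(field["name"]),
        quoteattr(solr_type_for(field)),
    )


schema_lines = [to_solr_field(f) for f in fast_fields]
```

A real migration would still need a manual review of each field, since the planning spreadsheet usually captures decisions a script cannot infer.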
Product facets & generic fields
With FAST you often use «generic1», «generic2» etc. to model product facets which may vary between product groups. Front ends need logic to convert.
Product facets & generic fields
With Solr, using dynamic fields, each document can have as many facets as you like.
Makes it easy to e.g. introduce a new «color» facet for cars or a «MegaPixels» facet for digital cameras
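A minimal sketch of the feeding side, assuming schema.xml declares a dynamic field such as `<dynamicField name="*_facet" .../>`. The `_facet` suffix, the product dict layout and the `to_solr_doc` helper are hypothetical names chosen for this example.

```python
def to_solr_doc(product, facet_suffix="_facet"):
    """Flatten product attributes into dynamic-field names.

    Assumes a matching <dynamicField name="*_facet" .../> exists in
    schema.xml (hypothetical naming convention).
    """
    doc = {"id": product["id"], "name": product["name"]}
    for attr, value in product.get("attributes", {}).items():
        doc[attr.lower() + facet_suffix] = value
    return doc


car = to_solr_doc({"id": "1", "name": "Family car", "attributes": {"color": "red"}})
camera = to_solr_doc({"id": "2", "name": "Compact camera", "attributes": {"MegaPixels": 12}})
```

Because the field name carries the facet name, the front end no longer needs a lookup table telling it what «generic1» means for each product group.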
Composite fields -> DisMax ReqHandler
FAST uses composite fields to search across multiple fields, with weighting defined in Rank Profiles
FAST's composite fields & rank profiles can be modelled as Solr «DisMax» queries
Set suitable defaults in solrconfig.xml using named request handler instances.
In case of many fields & performance issues, use <copyField> to group similarly ranked fields!
Freshness boost, GEO boost etc. handled through Function Queries
Composite fields -> DisMax ReqHandler
Given a FAST composite field / Rank Profile
Composite fields -> DisMax ReqHandler
This Solr query will do the same, configurable per query:
qt=dismax
q=oslo
qf=title^5.0 teaser^1.5 body^0.1
bf=recip(rord(last_modified),1,1000,1000)
...DisjunctionMaxQuery((teaser:foo^1.5 | title:foo^5.0 | body:foo^0.1)~0.01) DisjunctionMaxQuery((teaser:bar^1.5 | title:bar^5.0 | body:bar^0.1)~0.01) FunctionQuery(1000.0/(1.0*float(top(rord(last_modified)))...
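The parameters above can be assembled into a request URL from application code. The sketch below uses only the Python standard library; the base URL is an assumption (a default local Solr install), and `debugQuery=true` is added so Solr returns the parsed query for inspection.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed location of a local Solr instance; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def dismax_url(user_query, base_url=SOLR_SELECT_URL):
    """Build a DisMax request using the field weights from the slide."""
    params = [
        ("qt", "dismax"),
        ("q", user_query),
        ("qf", "title^5.0 teaser^1.5 body^0.1"),
        ("bf", "recip(rord(last_modified),1,1000,1000)"),
        ("debugQuery", "true"),
    ]
    return base_url + "?" + urlencode(params)


url = dismax_url("oslo")
decoded = parse_qs(urlparse(url).query)
```

In practice the `qf` and `bf` values would normally live as defaults on a named request handler in solrconfig.xml, so the front end only sends `q`.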
Static document boosts
FAST uses the «hwboost» field to add a static Quality boost to each document.
In Solr, you have more flexibility:
Add a boost to each document: <doc boost="10.0">
Add a boost to each field: <field name="title" boost="10.0">
Include any numeric document field in a BoostFunction
bf=sum(sqrt(popularity)^100.0, staticboost^20.0)
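A small sketch of how a feeder could emit the document and field boosts in Solr's XML update format. The field names `id` and `title` and the `build_add_xml` helper are assumptions for the example; index-time boost attributes reflect the Solr versions of this talk's era and may not apply to later releases.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, title, doc_boost=None, title_boost=None):
    """Build an <add> update message with optional index-time boosts."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if doc_boost is not None:
        doc.set("boost", str(doc_boost))
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id
    title_field = ET.SubElement(doc, "field", name="title")
    if title_boost is not None:
        title_field.set("boost", str(title_boost))
    title_field.text = title
    return ET.tostring(add, encoding="unicode")


xml_message = build_add_xml(
    "doc1", "Migrating from FAST to Solr", doc_boost=10.0, title_boost=10.0
)
```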
Navigator statistics
FAST navigators provide statistics metadata (min/max/avg/sum)
Solution: Use the StatsComponent
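For illustration, a request that asks the StatsComponent for statistics on one numeric field could be built like this. The field name `price` is an assumption; `stats` and `stats.field` are the component's request parameters, and `rows=0` simply suppresses the document list.

```python
from urllib.parse import urlencode


def stats_query(field, q="*:*"):
    """Build a query string requesting statistics for one numeric field."""
    return urlencode(
        [("q", q), ("rows", "0"), ("stats", "true"), ("stats.field", field)]
    )


query_string = stats_query("price")
```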
Navigator auto-buckets
FAST numeric navigators give auto-bucketing based on equal-frequency, equal-width or manual buckets
Solution: Create a new field which is pre-computed
Example: Document A has price=200.000, add pricerange="150.000 – 1.299.999"
Or use facet queries (expensive)
Or implement auto-bucketing and contribute the patch :-)
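The pre-computed field approach can be sketched as a small feed-time helper. Both functions below are illustrative assumptions (integer prices, label format, bucket boundaries); they only show how an equal-width or manual bucket label might be derived before the document is sent to Solr.

```python
import bisect


def equal_width_label(value, width):
    """Label for the equal-width bucket that contains value (integers)."""
    low = (value // width) * width
    return "%d - %d" % (low, low + width - 1)


def manual_label(value, boundaries):
    """Label for manually chosen buckets.

    boundaries is a sorted list of bucket lower bounds (hypothetical values).
    """
    i = bisect.bisect_right(boundaries, value) - 1
    if i < 0:
        return "< %d" % boundaries[0]
    if i == len(boundaries) - 1:
        return ">= %d" % boundaries[-1]
    return "%d - %d" % (boundaries[i], boundaries[i + 1] - 1)


doc = {"id": "A", "price": 200000}
doc["pricerange"] = manual_label(doc["price"], [0, 150000, 1300000])
```

The drawback compared with FAST's navigators is that bucket boundaries are fixed at index time, so changing them means re-feeding the content.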
XRANK
FAST has a feature to boost documents satisfying an "XRANK" sub-query with a certain static boost
In Solr, you can solve most XRANK use cases using FunctionQueries
Scope search
FAST offers a field type which holds arbitrary XML
Search in XPath-style: xml:companies:company:and(revenue:>1000, employees:>=100)
Have not found a similar field type in Lucene.
Anyone?
Migrating Connectors
FAST's connectors are many and mature
For simple use cases, consider Solr's DIH: Supports DB, RSS, Web-services, local filesystem
Additionally, through Lucene Connectors Framework (LCF):
EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
New connectors should be written for LCF and be submitted back to the community :)
Migrating Web Crawler
FAST's crawler is mature, performing & scalable
Solr has no built-in web crawler
Prepare for a lot of extra work migrating the crawler
Alternatives:
The Apache Nutch crawler (steep learning curve)
Apache Droids
Heritrix + Solr (example in Solr 1.4 book)
OpenPipeline has a (very) simple crawler
Migrating document processing
Solr lacks a sophisticated processing pipeline.
Alternatives:
Solr's UpdateProcessorChain for simple pipelines:
Write a Solr UpdateProcessor (in Java, Jython etc., see SOLR-1725)
OpenPipeline for more advanced requirements:
Check out FindWise's talk
Integrated with Solr
LingPipe NamedEntityExtractor plugin
Document processing examples
Binary documents with metadata
Actual customer request: Enrich library records with PDF content
Use OpenPipeline with Apache Tika processor
Implement Tika as an UpdateRequestProcessor (SOLR-1763)
Custom XML using FAST's XMLMapper
DIH's built-in XPath support
XSLT to Solr input XML
Write a new XMLMapper Update Request Handler?
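As an alternative to XSLT, the same kind of transformation from custom XML to Solr input XML can be sketched in the feeding application. This is not the XMLMapper or DIH; the source XML layout, the `FIELD_MAP` element-to-field mapping and the `to_solr_add` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom source format.
CUSTOM_XML = """<records>
  <record id="42"><heading>Solr in libraries</heading><writer>J. Doe</writer></record>
</records>"""

# Hypothetical mapping from source element names to Solr field names.
FIELD_MAP = {"heading": "title", "writer": "author"}


def to_solr_add(custom_xml, field_map=FIELD_MAP):
    """Convert custom <record> elements into a Solr <add> message."""
    add = ET.Element("add")
    for record in ET.fromstring(custom_xml).iter("record"):
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="id").text = record.get("id")
        for child in record:
            # Elements without a mapping are silently skipped.
            if child.tag in field_map:
                field = ET.SubElement(doc, "field", name=field_map[child.tag])
                field.text = child.text
    return ET.tostring(add, encoding="unicode")


solr_xml = to_solr_add(CUSTOM_XML)
```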
Multi lingual
FAST is state of the art on linguistics
FAST is language aware, e.g. the title field is "analyzed" depending on detected language
Solr is not language aware
Each field type has one and only one language
Most common solution:
One field type per language: text_no, text_en, text_de
Dynamic fields: <dynamicField name="*_en" type="text_en"..../>
Implement language awareness in application layer (feeding + querying)
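On the feeding side, language awareness in the application layer could look roughly like the sketch below. It assumes the document's language is already known (detection is out of scope here) and that dynamic fields `*_no`, `*_en` and `*_de` exist in the schema; the fallback language and helper names are assumptions.

```python
SUPPORTED_LANGUAGES = {"no", "en", "de"}
DEFAULT_LANGUAGE = "en"  # assumed fallback for unsupported languages


def localized_field(base, language):
    """Return the per-language field name, e.g. title_no."""
    lang = language if language in SUPPORTED_LANGUAGES else DEFAULT_LANGUAGE
    return "%s_%s" % (base, lang)


def to_feed_doc(doc_id, title, body, language):
    """Route title and body into the fields analyzed for the given language."""
    return {
        "id": doc_id,
        "language": language,
        localized_field("title", language): title,
        localized_field("body", language): body,
    }


doc = to_feed_doc("1", "Velkommen til Oslo", "Norsk tekst", "no")
```

The query side needs the mirror image of this logic: pick the `qf` fields that match the language of the user or the site.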
Multi lingual – advanced
FAST ships with Lemmatization for most languages
Solr ships with Stemming – has limitations
Solutions for multi lingual needs:
KStem is tighter. Free with
License 3rd party linguistics
Example: BasisTech Rosette Linguistics Platform: Lemmatization, POS etc.
Multi lingual – very advanced
FAST allows lemmatization by index expansion
This can be useful if your frontend does not know what languages are being queried, as all the word inflections are stored in the index.
There is no solution for this in Solr today.
Workaround: DisMax query spanning all languages:
q=eurocon&qf=text_en^2.0 text_no text_de text_it
Downside: This gets ugly and slow with increasing number of languages
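A sketch of how the workaround's `qf` value might be generated from a list of languages, so that adding a language is a configuration change rather than a code change. The `text_<lang>` naming follows the example above; the helper name and the default boost are assumptions.

```python
from urllib.parse import urlencode


def multilingual_qf(languages, preferred=None, preferred_boost=2.0):
    """Build a DisMax qf value spanning one text field per language."""
    parts = []
    for lang in languages:
        field = "text_" + lang
        if lang == preferred:
            field += "^%s" % preferred_boost
        parts.append(field)
    return " ".join(parts)


qf = multilingual_qf(["en", "no", "de", "it"], preferred="en")
params = urlencode([("qt", "dismax"), ("q", "eurocon"), ("qf", qf)])
```

Every extra language adds one more clause per query term, which is where the performance cost mentioned above comes from.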
Migrating Front ends / Query
Using a search middleware with Solr support? Lucky you!
If not, consider introducing one now:
Using FAST Java/.NET APIs? Choose SolrJ or SolrNET/SolrSharp
Query language differences. &fq= instead of filter()
Solr facets do not require session/state, unlike FAST's navigators
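To illustrate the stateless model, a front end can resend the user's selected filters as repeated `fq` parameters on every request. The sketch below builds such a query string; the field names and filter values are made up for the example.

```python
from urllib.parse import urlencode


def search_params(user_query, filters=(), facet_fields=()):
    """Build a query string with one fq per selected filter."""
    params = [("q", user_query)]
    params += [("fq", f) for f in filters]
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", f) for f in facet_fields]
    return urlencode(params)


query_string = search_params(
    "oslo",
    filters=["category:hotel", "price:[0 TO 1000]"],
    facet_fields=["city"],
)
```

Drilling down into a facet then simply means appending one more `fq` and repeating the request; no server-side session is involved.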
Result views
FAST uses "result-view" and "search profile" to specify what fields to return.
Migrate FAST's «views» into named RequestHandler configs with all default presets
No need to define fields to return up-front! Use fl=a,b,c...
Operations
Solr has no central admin-server (until "SolrCloud")
For GUI installer, use
Multiple cores – allows smooth schema upgrade etc.
No built-in query reporting, log analysis or monitoring. But have a look at:
Summary
Many migrations are (quite) straight-forward!
Warning flags:
Multi-lingual and advanced linguistics
Heavy use of Document Processing, including Entity Extraction
Scope search
Other enterprise complexities (security, connectors etc.)
Follow a structured process:
Quick prototyping
Design spec for each area
Don't forget to analyze logs and measure user satisfaction!
Thank You
www.cominvent.com
www.twitter.com/cominvent
linkedin.com/in/janhoy
This presentation is licensed under a CC-by-sa license. You must attribute Cominvent with name and link.