Sören Schneider, Alkacon Software WORKSHOP TRACK Using the SOLR Collector 27.11.2014
Jul 13, 2015
1. Brief Introduction to Solr
2. Common Mistakes Using OpenCms & Solr
3. Using the Solr Collector (DEMO)
4. Spellchecking in OpenCms Using Solr
Agenda
● Solr is a very versatile and powerful search
engine that supports various features
● This functionality comes at the price of
increased complexity in handling Solr
● Many customizations available
● All fields composing a single document are typed
Brief Solr Introduction
● The data structures of Solr's documents are
defined in the file schema.xml
● Changes to this file require reindexing
● Dynamic fields cope with that limitation
● Fields matching a wildcard pattern can be used
without being explicitly defined in the schema
Defining Solr's Data Structure
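The wildcard matching behind dynamic fields can be illustrated with a small Python sketch. The patterns and types below (`*_s`, `*_dt`, `*_txt`) and the helper `resolve_field_type` are hypothetical and only model the idea; they are not taken from a particular schema.xml or from Solr's code. Preferring the longest matching pattern is a simplification chosen for this sketch.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field declarations: wildcard pattern -> field type.
DYNAMIC_FIELDS = {
    "*_s": "string",
    "*_dt": "date",
    "*_txt": "text",
}

# Hypothetical fields that are declared explicitly in the schema.
EXPLICIT_FIELDS = {"id": "string"}


def resolve_field_type(name):
    """Return the type for a field name, or None if nothing matches."""
    # Explicitly declared fields take precedence over wildcard patterns.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    matches = [p for p in DYNAMIC_FIELDS if fnmatchcase(name, p)]
    if not matches:
        return None
    # Simplification for this sketch: prefer the longest matching pattern.
    return DYNAMIC_FIELDS[max(matches, key=len)]
```

With these assumptions, a new field such as `city_s` can be indexed without touching the schema, while a name that matches no pattern is rejected.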
Solr: Indexing Content
[Diagram: a: date, b: text, c: string → Solr processing (through analyzers, filters and tokenizers) → a: date, b: string, c: string]
● "Direct" usage of OpenCms & Solr requires a
basic understanding of Solr
● Use data types that fit the individual
use case, and learn how filters work
● Know the query syntax (for the respective data types)
● Most common mistakes of OpenCms users
result from insufficient knowledge of Solr basics
OpenCms & Solr
1. Using improper types
● "text" vs "string"
● Formulating correct queries
2. Issues regarding the mapping OpenCms <-> Solr
3. (Encoding problems)
Common Mistakes Using Solr &
OpenCms
● String
● Stores its content as an exact string
● No tokenization or processing is performed
● Useful when searching for an exact value
● Text
● Tokenization and processing are performed
● Useful when a part of the content is searched for
"text" vs "string"
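The difference can be made concrete with a toy model in Python. This is a deliberately simplified sketch: real Solr analysis chains are configurable, and the whitespace split plus lowercasing used here is only an assumed stand-in for a "text" field's processing. All function names are hypothetical.

```python
def index_string(value):
    """A 'string' field keeps the value as one exact, unprocessed term."""
    return [value]


def index_text(value):
    """A 'text' field is tokenized; here: whitespace split plus lowercasing."""
    return [token.lower() for token in value.split()]


def matches(indexed_terms, query_term):
    """A term query hits only if it equals one of the indexed terms."""
    return query_term in indexed_terms


title = "Using the Solr Collector"
as_string = index_string(title)
as_text = index_text(title)

# The string field only answers to the complete, exact value.
exact_hit = matches(as_string, "Using the Solr Collector")
partial_miss = matches(as_string, "Solr")

# The text field answers to individual (normalized) words instead.
word_hit = matches(as_text, "solr")
```

In this model, searching the string field for a single word fails, which mirrors the typical complaint when a field that should be searched by parts of its content was indexed as "string".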
● OpenCms copies the entire XML content into
a single(!) locale-aware Solr field of type "text"
for each locale
● Particular information of a resource is made
searchable in OpenCms using two approaches
● Automatic mapping of properties to Solr fields
● Manual definition of mappings
Making Your Content Searchable
Indexing Content w/o Searchsettings
[Diagram labels: x: text; a: date, b: string, c: string; Solr processing (through analyzers, filters and tokenizers)]
Indexing Content with Searchsettings
[Diagram: a: date, b: text, c: string → Solr processing (through analyzers, filters and tokenizers) → a: date, b: string, c: string]
● Mapping happens in the schema of the
respective resource type
● Excerpt
Solr – OpenCms Interaction:
Mapping
<xsd:schema …>
  <xsd:annotation>
    <xsd:appinfo>
      <searchsettings>
        <searchsetting element="City" searchcontent="true">
          <solrfield targetfield="city" sourcefield="_s" />
        </searchsetting> …
Resource type
element name
Element Mapping Attributes

Attribute Name    Effect on the Solr Field
targetfield*      The resulting name
locale            Write content only for a specific locale
sourcefield       Defines the resulting type
copyfields        Copies the value to a different field
default           Sets a default value
boost             Sets a boost for the field
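To review which mappings a resource type defines, the searchsettings can be read with a few lines of Python. This is a sketch under assumptions: the schema string below is a hypothetical, completed version of the excerpt (only the searchsetting part comes from the slide), and `list_mappings` is an illustrative helper, not part of OpenCms.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal schema; the wrapper elements are completed by assumption.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:appinfo>
      <searchsettings>
        <searchsetting element="City" searchcontent="true">
          <solrfield targetfield="city" sourcefield="_s" />
        </searchsetting>
      </searchsettings>
    </xsd:appinfo>
  </xsd:annotation>
</xsd:schema>
"""


def list_mappings(schema_xml):
    """Collect one dict per <solrfield> with its owning element name."""
    root = ET.fromstring(schema_xml.strip())
    mappings = []
    # searchsetting and solrfield carry no namespace prefix in the excerpt.
    for setting in root.iter("searchsetting"):
        for field in setting.findall("solrfield"):
            mappings.append({
                "element": setting.get("element"),
                "targetfield": field.get("targetfield"),
                "sourcefield": field.get("sourcefield"),
                "locale": field.get("locale"),  # None when not set
            })
    return mappings
```

Listing the mappings this way makes it easy to spot an element that was mapped with an unintended sourcefield, one of the mistakes named earlier.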
● Users complain about problems regarding
certain characters, mostly German umlauts,
in Solr results
● In nearly all cases the sole problem lies in the
integration of Solr into the servlet container, which
is not set up to use UTF-8
● Extra note for Tomcat users: Please check
whether you added the required attributes to
all relevant "<Connector>" elements ;-)
Using UTF-8 in Solr
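The symptom is easy to reproduce outside of Solr. The following Python sketch shows what happens when a client sends a UTF-8 encoded query and the receiving side decodes it as ISO-8859-1; the sample word is arbitrary. For Tomcat, the attribute in question is commonly `URIEncoding="UTF-8"` on the `<Connector>` element; the slide does not name it, so treat that as an assumption.

```python
from urllib.parse import quote, unquote

query = "Müller"

# What a UTF-8 client puts on the wire: the umlaut becomes two bytes.
encoded = quote(query, encoding="utf-8")

# A container that decodes the request as UTF-8 recovers the original text.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# A container that assumes ISO-8859-1 turns the umlaut into two wrong characters.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)         # M%C3%BCller
print(decoded_utf8)    # Müller
print(decoded_latin1)  # MÃ¼ller
```

If search results or queries show sequences like "Ã¼" in place of "ü", this kind of mismatch between sender and container is the first thing to check.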
● Live Demo
WYSIWYG Spellchecker
● The spellchecker is implemented using Solr
● Solr already provides a flexible component named
"SpellCheckComponent"
● This component supports inline spellchecking of
Solr queries
● The source for suggestions can be specified as Solr
fields or text files
WYSIWYG Spellchecker
● The "SpellCheckComponent" is widely used to
implement the "Did you mean?" feature known
from popular search engines
● The component is
● Reliable and mature
● Fast
● Plus, Solr is already available in OpenCms
Why Use Solr as a Spellchecker?
● If both use cases use the same component,
how do the implementations actually differ?
● "Did you mean?" builds its source of suggested words
from the entire data set the search runs on.
Usually only a single hit is returned.
● The WYSIWYG spellchecker builds its source of
suggestions from data that contains only
the dictionary for a single language
Implementation Differences Between
the Use Cases
● Spellchecking is implemented using a separate Solr
core that resides in WEB-INF/spellcheck
● As the only purpose of this core is to contain spellcheck
information, the schema.xml file is as simple as it gets
● Why use a separate Solr core instead of the default core
that's used by OpenCms?
● Dictionaries are stored as one Solr index per
language
How to Model This Scenario Using
Solr?
● Sadly, the spellchecking interfaces of tinyMCE
and Solr are incompatible
Problems regarding tinyMCE and
Solr
Comparison of Spellcheck Responses

tinyMCE:
{
  "id": "c0",
  "result": {"hsoue": ["house", "has"]}
}

Solr:
"spellcheck": {"suggestions": [
  "hsoue", {"numFound": 5,
    "startOffset": 0, "endOffset": 4,
    "origFreq": 0,
    "suggestion": [{"word": "house", "freq": 53}, {"word": "has", "freq": 271},
      …
    ]}, "correctlySpelled", false,
  "collation", "hsue"
]},
● A new component had to be implemented in
OpenCms that basically
● Accepts spellcheck requests from tinyMCE
● Handles the communication between tinyMCE and Solr
and converts the messages
● Checks and (re-)builds spellcheck indices
● The corresponding code is found in
org.opencms.search.solr.spellcheck
Gluing the Pieces Together
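The message conversion step can be sketched in Python. This is not the code from org.opencms.search.solr.spellcheck; it is a minimal illustration based only on the two response shapes shown in the comparison, and the function name `solr_to_tinymce` is hypothetical. It assumes Solr returns the suggestions as a flat list that alternates names and values.

```python
def solr_to_tinymce(solr_response, request_id):
    """Convert a Solr spellcheck response into the tinyMCE result shape."""
    raw = solr_response.get("spellcheck", {}).get("suggestions", [])
    result = {}
    # The flat list alternates entry names and their values.
    for name, value in zip(raw[0::2], raw[1::2]):
        # Skip metadata such as "correctlySpelled" or "collation".
        if not (isinstance(value, dict) and "suggestion" in value):
            continue
        words = []
        for suggestion in value["suggestion"]:
            # Suggestions may be plain strings or {"word": ..., "freq": ...}.
            if isinstance(suggestion, dict):
                words.append(suggestion["word"])
            else:
                words.append(suggestion)
        result[name] = words
    return {"id": request_id, "result": result}


solr_response = {
    "spellcheck": {"suggestions": [
        "hsoue", {"numFound": 2, "startOffset": 0, "endOffset": 4,
                  "origFreq": 0,
                  "suggestion": [{"word": "house", "freq": 53},
                                 {"word": "has", "freq": 271}]},
        "correctlySpelled", False,
        "collation", "hsue",
    ]}
}

converted = solr_to_tinymce(solr_response, "c0")
```

The frequency information is dropped on purpose in this sketch, since the tinyMCE shape shown above only carries the suggested words per misspelled term.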
● Dictionaries can be edited easily in OpenCms
● The indices are automatically filled from flat text
files, one word per line
● Support for multiple languages
● To access the dictionaries, have a look at the directory
org.opencms.workplace.spellcheck/resources/
Spellchecker in OpenCms
● Adding a new language
1. Create new Solr field in schema.xml
2. Create new dictionary file inside VFS
3. Restart OpenCms
● Adding words to the custom dictionary
Extending the Spellchecker
● Any Questions?
Sören Schneider
Alkacon Software GmbH
http://www.alkacon.com
http://www.opencms.org
Thank you very much for your
attention!