OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content

Post on 14-May-2015

Category:

Technology

DESCRIPTION

OpenCms 8.5 integrates Apache Solr. And not only for full text search, but as a powerful query engine as well. Imagine you want to show a list of "all resources of type news, that have changed since yesterday, where property X has the value Y" on your web page. Sure, there are API methods in OpenCms to load resources based on the type, on the date of change, or on the value of a specific property. But for many common use case combinations, there is no single API call. This means if you create a collector, you often end up sorting out the results of the initial API query in code. In this session, Rüdiger will show how Apache Solr has been integrated in OpenCms 8.5. He will explain how to create improved front-end full text search functions with advanced options like faceting and spell check suggestions. And he will explain how to use Solr to directly read resources from the OpenCms VFS, allowing query combinations that combine resource attributes, properties and content in a powerful new way.

Transcript

Rüdiger Kurz, Alkacon Software

WORKSHOP TRACK

Using Apache Solr to retrieve content

25.09.2012

Project Collaboration

1. What is Solr?

2. Benefits

3. Searching

4. Indexing

5. Configuration

Agenda

● Apache Solr is hopefully not able to answer this question!

● But it will return the results in less than a second

Retrieving data fast

● Solr is an enterprise search platform from the Apache Lucene project

● Solr is highly scalable, providing distributed search and index replication

● Solr powers the search and navigation features

● Major features include

● Powerful full-text search

● Hit highlighting

● Faceted search

● Rich document (e.g., Word, PDF) handling

What is Apache Solr?

● Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely)

● Each facet displayed typically shows the number of hits that match that category

● Users can then “drill down” by applying specific constraints to the search results

● Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search

What is faceted search?
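
A minimal Java sketch of the idea on this slide, assuming an in-memory list of hits rather than a real Solr index: facet counts are a group-and-count over one field, and drilling down is a filter on the chosen value. The `Hit` type and both method names are hypothetical and exist only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Toy model of faceting over an in-memory result list (no Solr involved). */
final class FacetCountSketch {

    /** A search hit reduced to the two fields this sketch needs (hypothetical type). */
    static final class Hit {
        final String path;
        final String type;

        Hit(String path, String type) {
            this.path = path;
            this.type = type;
        }
    }

    /** Facet counts: number of hits per resource type, sorted by type name. */
    static Map<String, Long> facetCounts(List<Hit> hits) {
        return hits.stream()
            .collect(Collectors.groupingBy(h -> h.type, TreeMap::new, Collectors.counting()));
    }

    /** Drill-down: keep only the hits that match the selected facet value. */
    static List<Hit> drillDown(List<Hit> hits, String type) {
        return hits.stream()
            .filter(h -> h.type.equals(type))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("/flowers/rose.html", "v8flower"),
            new Hit("/flowers/tulip.html", "v8flower"),
            new Hit("/index.html", "containerpage"),
            new Hit("/text/intro.html", "v8textblock"));

        System.out.println(facetCounts(hits));
        System.out.println(drillDown(hits, "v8flower").size());
    }
}
```

Solr computes these counts inside the index, so the application never has to load the full result list as this sketch does.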

What is Faceted Search?

● “Resource types” is a facet, a way of categorizing the results

● containerpage, v8flower, v8textblock, … are constraints, or facet values

● The breadcrumb trail shows what constraints have already been applied and allows their removal

● The facet count shows how many results match each value

● The tag bar shows other facet values of the found document that can be applied

● Regular search results

Benefits

● DBs are proprietary

● Require elaborate infrastructures

● SQL queries are hard to formulate

● SQL on DB is slower than search queries

● A large number of SQL statements turns the DB into a bottleneck

● Even lower-traffic sites slow down when too many statements are executed on the DB layer

Overall performance starts to degrade

Database as bottleneck

● OpenCms stores the content in an RDBMS

● To access the values of an XML content you have to perform the following steps:

Content retrieval so far

1. Read the resource: Resource (dates, refs, attr)

2. Read binary content: Content (blob)

3. Un-marshal content: Marshaled XML

4. Access with getters: Java Access Bean
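
The four steps above can be sketched in plain Java. This is a toy model, not the OpenCms API: a `Map` stands in for the database, the JDK DOM parser stands in for the OpenCms un-marshalling, and all class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Toy model of the four retrieval steps; the map stands in for the database. */
final class LegacyRetrievalSketch {

    /** Steps 1 and 2: look up the resource by path and return its stored bytes. */
    static byte[] readBlob(Map<String, byte[]> store, String path) {
        return store.get(path);
    }

    /** Step 3: un-marshal the stored bytes into an XML document. */
    static Document unmarshal(byte[] blob) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(blob));
        } catch (Exception e) {
            throw new IllegalArgumentException("Content is not well-formed XML", e);
        }
    }

    /** Step 4: read a single value, or null if the element is missing. */
    static String value(Document xml, String elementName) {
        NodeList nodes = xml.getElementsByTagName(elementName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        byte[] blob = "<Flower><Title>Rose</Title></Flower>".getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> store = Map.of("/flowers/rose.html", blob);

        Document xml = unmarshal(readBlob(store, "/flowers/rose.html"));
        System.out.println(value(xml, "Title"));
    }
}
```

Every resource in a result list has to pass through all of these steps, which is the per-item cost the slides contrast with reading from an index.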

● “Read” the whole resource content with a single query

● Simpler data structure by storing documents

● New flexibility through the power of the Solr query syntax

● Best performance based on optimized index

● HTTP interface for external applications

● Secure, scalable and cost-effective access

● Reduced DB traffic and increased performance

The new way of content retrieval

OpenCms 8.5 Solr Integration

Searching

● Querying OpenCms content using the power of Solr’s query syntax

1. Send an HTTP request to the Solr request handler

2. Use the new Solr Collector

3. Call the Java API search method

Search with Solr in OpenCms

● The REST-like interface of Solr lets you access indexed documents over HTTP without any knowledge of CMS-specific syntax

● A permission check is performed by OpenCms, making sure no secured documents are returned

● Use Solr-based UI frameworks like “Ajax Solr” on your website without development costs

● Provides an open interface for external applications, e.g. mobile applications

OpenCms Solr handler
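
As a rough illustration of how an external application might assemble a request for the handler, here is a small Java sketch that URL-encodes the parameters. The handler URL is the one shown in this talk. `fq` and `rows` are standard Solr parameters, but that the OpenCms handler honours `rows` is an assumption. The class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Builds a request URL for a Solr select handler from a parameter map. */
final class SolrHandlerUrlSketch {

    /** Appends the URL-encoded parameters to the handler URL, in map order. */
    static String selectUrl(String handlerUrl, Map<String, String> params) {
        String query = params.entrySet().stream()
            .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
            .collect(Collectors.joining("&"));
        return handlerUrl + "?" + query;
    }

    /** Form-style URL encoding in UTF-8. */
    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("fq", "type:v8flower");
        params.put("rows", "10");

        System.out.println(
            selectUrl("http://localhost:8080/opencms/opencms/handleSolrSelect", params));
    }
}
```

Encoding matters as soon as a query contains spaces, brackets or quotes, which range queries and phrase queries do.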

Examples: REST / JAVA / Collector

1. REST:

http://localhost:8080/opencms/opencms/handleSolrSelect?fq=type:v8flower

2. Collector:

<cms:contentload collector="byQuery" param="type:v8flower">
    <cms:contentaccess var="content" />
    ${content.value.Title}
</cms:contentload>

3. Java:

// Look up the Solr index by its configured name and search as the current user
CmsObject cms = getCmsObject();
String query = "fq=type:v8flower";
CmsSearchManager manager = OpenCms.getSearchManager();
CmsSolrIndex index = manager.getIndexSolr("Solr Online");
CmsSolrResultList results = index.search(cms, query);

Live Demo

Indexing

● Data indexed by default (hard coded)

● Field configuration (opencms-search.xml)

● XSD field mapping (Content definition)

● Implement a custom field configuration (Java)

Indexed data

● The schema file contains all of the details about which fields your documents can contain

● OpenCms uses an adjusted version of the schema.xml that is contained in the Apache Solr standard distribution: WEB-INF/solr/conf/schema.xml

● If you want to add a new custom field or field type for documents you can modify this file

Solr schema

● Types are checked during the indexing process

● It enables easy range queries, even for dates, which makes a developer's life much easier

● Custom types can be added, e.g. key/value tuples or some special JSON fields

Advantages of field types
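
To make the range-query point concrete, here is a hedged Java sketch that formats a Solr range clause for a date field. Solr expects UTC timestamps in ISO-8601 form and accepts `*` as an open bound. The field names `lastmodified` and `created` come from the default field list in this talk, and the sketch assumes they are indexed as date fields. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Formats Solr range clauses for date fields. */
final class DateRangeQuerySketch {

    /** Inclusive range between two instants, e.g. field:[from TO to]. */
    static String between(String field, Instant from, Instant to) {
        return field + ":[" + format(from) + " TO " + format(to) + "]";
    }

    /** Everything from the given instant onwards, using * as the open upper bound. */
    static String since(String field, Instant from) {
        return field + ":[" + format(from) + " TO *]";
    }

    /** ISO-8601 UTC timestamp with second precision, e.g. 2012-09-25T00:00:00Z. */
    private static String format(Instant instant) {
        return instant.truncatedTo(ChronoUnit.SECONDS).toString();
    }

    public static void main(String[] args) {
        Instant from = Instant.parse("2012-09-24T00:00:00Z");
        Instant to = Instant.parse("2012-09-25T00:00:00Z");

        System.out.println(between("lastmodified", from, to));
        System.out.println(since("created", from));
    }
}
```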

● id - Structure id used as unique identifier for a document (the structure id of the resource)

● path - Full root path (the root path of the resource, e.g. /sites/default/flower_en/.content/article.html)

● path_hierarchy - The full path as a path-tokenized field (field type: text_path)

● parent-folders - Parent folders (multi-valued field containing an entry for each parent path)

● type - Type name (the resource type name)

● res_locales - Existing locale nodes for XML content and all available locales in case of binary files

● created - The creation date (the date when the resource itself was created)

● lastmodified - The date last modified (the last modification date of the resource itself)

● contentdate - The content date (the date when the resource's content has been modified)

● released - The release and expiration date of the resource

● content - A general content field that holds all extracted resource data (all languages, type text_general)

● contentblob - The serialized extraction result, used to improve the extraction performance while indexing

● category - All categories as general text

● category_exact - All categories as exact string for faceting

● text_<locale> - Extracted textual content optimized for language-specific search

● timestamp - The time when the document was last indexed

● *_prop - All properties of a resource as searchable and stored text (<Property_Definition_Name>_prop)

● *_exact - All properties of a resource as exact, not stored string (<Property_Definition_Name>_exact)

Default indexed data
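
With these default fields, the motivating example from the session description ("all resources of type news, that have changed since yesterday, where property X has the value Y") can be written as one query. The Java sketch below only assembles the query string. It assumes that `lastmodified` is indexed as a date field, so Solr date math such as `NOW-1DAY` applies, and that the `*_exact` field is the right one for an exact property match. The class and method names are hypothetical.

```java
import java.util.List;

/** Combines type, modification date and property constraints into one Solr query. */
final class CombinedFilterSketch {

    /** Wraps a value in double quotes, escaping backslashes and quotes inside it. */
    static String phrase(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** One clause per condition of the example. */
    static List<String> clauses(String type, String propertyName, String propertyValue) {
        return List.of(
            "type:" + type,
            "lastmodified:[NOW-1DAY TO NOW]",
            propertyName + "_exact:" + phrase(propertyValue));
    }

    /** Requires all clauses to match. */
    static String allOf(List<String> clauses) {
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(allOf(clauses("news", "X", "Y")));
    }
}
```

Passing each clause as its own `fq` parameter instead of joining them with `AND` would select the same documents.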

● Additional field mappings for XML contents can now be configured directly within the XSD schema

● Without modifying opencms-search.xml

● No restart of the servlet container required

XSD field mapping

<searchsetting element="DisplayDate" searchcontent="false">
    <solrfield targetfield="myDisplayDateField" sourcefield="*_dt" />
</searchsetting>

<searchsetting element="Teaser">
    <solrfield targetfield="ateaser">
        <mapping type="item" default="Homepage n.a.">Homepage</mapping>
        <mapping type="property-search">search.special</mapping>
        <mapping type="dynamic" class="my.DynamicMapping">special</mapping>
    </solrfield>
</searchsetting>

Configuration

● On a fresh installation of OpenCms 8.5, Solr is enabled by default; after updating an existing system to OpenCms 8.5, Solr is disabled

● To enable Solr after updating, you must create a Solr home directory in the WEB-INF folder of your OpenCms application

● Copy the solr/ folder from the OpenCms standard distribution as a starting point for your configuration

● All search configuration is done as usual in opencms-search.xml below WEB-INF/config

● Adding the following lines will enable the embedded server

Enable Solr in OpenCms

<opencms><search>

<solr enabled="true"/> […]

</search></opencms>

● You can add a custom Solr index with the known OpenCms search configuration syntax

● NOTE: class attributes are needed for the index and its field configuration

Search index configuration

<index class="org.opencms.search.solr.CmsSolrIndex">
    <name>Solr Online</name>
    <rebuild>auto</rebuild>
    <project>Online</project>
    <locale>all</locale>
    <configuration>solr_fields</configuration>
    <sources>
        <source>solr_source</source>
    </sources>
</index>

● To convert an existing field configuration:

1. Copy a <fieldconfiguration> node

2. Change / set the class attribute

3. Optionally add a type attribute for fields

Create field configuration (1/3)

<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
    <name>example</name>
    <description>Converted Lucene Index</description>
    <fields>
        <field name="meta" store="false" index="true" type="en">
            <mapping type="property">Title</mapping>
            <mapping type="property">Description</mapping>
        </field>
    </fields>
</fieldconfiguration>

● As the value of the type attribute of a field definition inside opencms-search.xml, you can use the name of any dynamic field defined in schema.xml

● For example:

Create field configuration (2/3)

i - type="int"

dt - type="date"

txt - type="text_general"

en - type="text_en"

es - type="text_es"

fr - type="text_fr"

● As previously said, the field names <solr_name> are defined in Solr's schema.xml; now we define additional fields <opencms_name> inside opencms-search.xml

● How does that work?

Create field configuration (3/3)

String fieldName = <opencms_name>_txt;

if (existsInSolrSchema(fieldName)) {

fieldName = <opencms_name>;

} else if (isTypeAttributeSet()) {

fieldName = <opencms_name>_<type>;

}
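
The pseudocode above can be turned into a runnable Java sketch under one reading of it: a field that is explicitly declared in the Solr schema keeps its OpenCms name, otherwise a configured type attribute selects the dynamic-field suffix, and `_txt` is the fallback. This precedence is an interpretation of the slide, not confirmed OpenCms behaviour. The class, the method and the `schemaFields` parameter are hypothetical.

```java
import java.util.Set;

/** Resolves the Solr field name for a field configured in opencms-search.xml. */
final class FieldNameSketch {

    /**
     * @param opencmsName  field name from the OpenCms field configuration
     * @param type         value of the optional type attribute, or null if not set
     * @param schemaFields names of the fields explicitly declared in schema.xml
     */
    static String resolve(String opencmsName, String type, Set<String> schemaFields) {
        if (schemaFields.contains(opencmsName)) {
            // The schema declares this field explicitly: use the name unchanged.
            return opencmsName;
        }
        if (type != null && !type.isEmpty()) {
            // A type attribute is set: map onto the matching dynamic field.
            return opencmsName + "_" + type;
        }
        // Fallback: general text dynamic field.
        return opencmsName + "_txt";
    }

    public static void main(String[] args) {
        Set<String> schemaFields = Set.of("content", "type");

        System.out.println(resolve("meta", "en", schemaFields));
        System.out.println(resolve("meta", null, schemaFields));
        System.out.println(resolve("content", "en", schemaFields));
    }
}
```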

Live Demo

● Having Solr and VIE integrated into OpenCms, we are well prepared to start using Apache Stanbol

● Stanbol is a top-level Apache project

● Stanbol guarantees a quality standard

● Stanbol opens the perspective of sustainability

● We are looking to integrate Stanbol into OpenCms 9

Future steps with IKS and Stanbol

Live Demo

● Permission checked search (secure)

● Solr Request handler (accessible)

● Solr Collector (integrated)

● Result highlighting (user-friendly)

● Configuration opportunities (flexible)

● Search field mapping (sensitive)

● Type based field schema (type-safe)

● Lucene conversion (compatible)

Integration Conclusion

Rüdiger Kurz

Alkacon Software GmbH

http://www.alkacon.com

http://www.opencms.org

http://www.iks-project.eu

http://stanbol.apache.org

Thank you very much for your attention!

Any Questions?
