Schemaless Solr and the Solr Schema REST API
Post on 08-May-2015
Steve Rowe
Senior Software Engineer, LucidWorks
Twitter: @steven_a_rowe

Who am I?
• LucidWorks employee
• Lucene/Solr committer since 2010
• JFlex committer since 2008
• Previously at the Center for Natural Language Processing at Syracuse University’s iSchool (School of Information)
• Twitter: @steven_a_rowe

Schemaless Solr
• As of version 4.4, Solr can operate in schemaless mode:
  – No need to pre-configure fields in the schema
  – As documents are indexed, previously unknown fields are automatically added to the schema
  – Field types are auto-detected from a limited set of basic types:
    • Long, Double, Boolean, Date, Text (default)
    • All are multi-valued
  – Works in standalone Solr and SolrCloud
• Solr features used to implement schemaless mode:
  – Managed schema
    • Required for runtime schema modification
  – Field value class guessing
    • Parsers attempt to detect the Java class of String-valued field content
  – Automatic schema field addition
    • Java class(es) mapped to schema field type

The slide about the nature and utility of schemalessness
• “Schemaless” does not mean that there is no schema
• Search applications need schemas to support non-trivial document models
  – No schema needed when there is only one field, or only one field type, i.e. all fields share:
    • Document & query processing, including analysis
    • Index features & format
    • Similarity implementation
    • (etc.)
  – Otherwise, search apps need to manage per-field processing configuration (i.e. a schema) to consistently index documents and effectively serve queries
• So what does “schemaless” mean for Solr?
  – No up-front schema configuration required
  – Schema discovery: document structure is either not fixed or not fully known

Dynamic fields
• Convention over configuration
• Glob-like patterns match field names with field types

    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
    <fieldType name="int" class="solr.TrieIntField"
               precisionStep="0" positionIncrementGap="0"/>

• Dynamic fields solve the problem of assigning field types to unknown fields by inferring a field’s type from its name
• By contrast, Solr’s schemaless mode infers an unknown field’s type from its value or values
• These two approaches are complementary
• The Solr schemaless example defines a number of dynamic fields, including the *_i → int mapping above
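
A minimal Python sketch of name-based type inference, added for illustration. It is not Solr's code: only the `*_i` → `int` mapping comes from the slide, the other two patterns are assumed, and the first-match rule is a simplification of however Solr resolves overlapping patterns.

```python
from typing import Optional

# Pattern table. Only "*_i" -> "int" is taken from the slide;
# the other two entries are assumptions added for illustration.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_t", "text_general"),
]


def matches(pattern: str, field_name: str) -> bool:
    """Glob-like match where '*' is only honoured as a prefix or a suffix."""
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return pattern == field_name


def dynamic_field_type(field_name: str, patterns=DYNAMIC_FIELDS) -> Optional[str]:
    """Return the type of the first matching pattern, or None if nothing matches."""
    for pattern, field_type in patterns:
        if matches(pattern, field_name):
            return field_type
    return None
```

A name such as `sequence_i` resolves from its suffix alone; a name such as `price` matches no pattern, which is the case schemaless mode handles by inspecting the value.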

Schemaless mode example
From example/example-schemaless/solr/collection1/conf/schema.xml:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="long" indexed="true" stored="true"/>

From example/exampledocs/books.csv:

    id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
    0441385532,book,Jhereg,7.95,false,Steven Brust,Vlad Taltos,1,fantasy
    ...

    $ cd example && java -Dsolr.solr.home=example-schemaless/solr -jar start.jar

    $ cd exampledocs && java -Dtype=text/csv -jar post.jar books.csv

    SimplePostTool version 1.5
    Posting files to base url http://localhost:8983/solr/update using content-type text/csv..
    POSTing file books.csv
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/update..
    Time spent: 0:00:00.147

Schemaless mode example

    $ curl http://localhost:8983/solr/schema/fields

    { "fields":[
        { "name":"_version_", "type":"long", "indexed":true, "stored":true },
        { "name":"author", "type":"text_general" },
        { "name":"cat", "type":"text_general" },
        { "name":"id", "type":"string", "multiValued":false, "indexed":true,
          "required":true, "stored":true,
          "uniqueKey":true },
        { "name":"inStock", "type":"booleans" },
        { "name":"name", "type":"text_general" },
        { "name":"price", "type":"tdoubles" }]}

From example/example-schemaless/solr/collection1/conf/schema.xml:

    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <fieldType name="tdoubles" class="solr.TrieDoubleField" precisionStep="8"
               positionIncrementGap="0" multiValued="true"/>

books.csv header and first row:

    id         | cat  | name   | price | inStock | author       | series_t    | sequence_i | genre_s
    0441385532 | book | Jhereg | 7.95  | false   | Steven Brust | Vlad Taltos | 1          | fantasy

Managed schema
• The schema resource is managed by Solr, rather than hand edited
• On first startup, Solr auto-converts schema.xml to managed-schema
• Managed schema format is currently XML, but may change in the future
• XML comments don’t survive the conversion.
• mutable=true enables runtime schema modification
  – Automatic schema field addition
  – Schema REST API

From example/example-schemaless/solr/collection1/conf/solrconfig.xml:

    <schemaFactory class="ManagedIndexSchemaFactory">
      <bool name="mutable">true</bool>
      <str name="managedSchemaResourceName">managed-schema</str>
    </schemaFactory>

conf/ before startup:

    currency.xml
    elevate.xml
    lang/
    protwords.txt
    schema.xml
    solrconfig.xml
    stopwords.txt
    synonyms.txt

conf/ after startup:

    currency.xml
    elevate.xml
    lang/
    managed-schema
    protwords.txt
    schema.xml.bak
    solrconfig.xml
    stopwords.txt
    synonyms.txt

Field value class guessing
• Unknown fields’ String-typed values are speculatively parsed
  – Cascading parsers attempt to recognize field values
  – On failure, the next one is tried
  – First successful parse wins
• Reconfigurable
  – Integer parser could be swapped in for the Long parser, etc.
  – Numeric parsers can take a locale for java.text.NumberFormat
  – Date parser, implemented using Joda-Time, can be configured with other patterns, a locale, and/or a default time zone

    <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
      <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseDateFieldUpdateProcessorFactory">
        <arr name="format">
          <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
          <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
          <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss</str>
          <str>yyyy-MM-dd'T'HH:mmZ</str>
          <str>yyyy-MM-dd'T'HH:mm</str>
          <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
          <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
          <str>yyyy-MM-dd HH:mm:ss.SSS</str>
          <str>yyyy-MM-dd HH:mm:ss,SSS</str>
          <str>yyyy-MM-dd HH:mm:ssZ</str>
          <str>yyyy-MM-dd HH:mm:ss</str>
          <str>yyyy-MM-dd HH:mmZ</str>
          <str>yyyy-MM-dd HH:mm</str>
          <str>yyyy-MM-dd</str>
        </arr>
      </processor>
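
The cascade can be sketched in Python roughly as follows. This is an illustration of the "try each parser in order, first success wins" idea only, not Solr's implementation: the function names are hypothetical, Python's `int`/`float` stand in for the Long and Double parsers, and the three `strptime` formats are a small assumed subset written in Python's notation rather than the Joda-Time patterns listed above.

```python
from datetime import datetime

# Assumed subset of date formats, in Python strptime notation.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def parse_boolean(text):
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_long(text):
    return int(text)      # raises ValueError for "7.95", "Jhereg", ...


def parse_double(text):
    return float(text)    # raises ValueError for non-numeric text


def parse_date(text):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"no date format matched: {text!r}")


# Order mirrors the processor chain: Boolean, Long, Double, Date.
PARSERS = [parse_boolean, parse_long, parse_double, parse_date]


def guess_value(text):
    """Return the first successful parse, or the original string if all fail."""
    for parser in PARSERS:
        try:
            return parser(text)
        except ValueError:
            continue
    return text
```

Applied to the books.csv row, `"false"` becomes a boolean, `"1"` an integer, `"7.95"` a float, and `"Jhereg"` stays a string.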

Automatic schema field addition
• Field value classes are mapped to field types
• First match wins
• If none of the typeMapping-s match, the default field type is assigned
• If a multi-valued field contains a mix of value classes, the first mapping that matches all values’ classes wins
• The new field is added to the schema with the mapped field type
• Reconfigurable

    <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">text_general</str>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Boolean</str>
        <str name="fieldType">booleans</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.util.Date</str>
        <str name="fieldType">tdates</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Long</str>
        <str name="valueClass">java.lang.Integer</str>
        <str name="fieldType">tlongs</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Number</str>
        <str name="fieldType">tdoubles</str>
      </lst>
    </processor>
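
A rough Python model of the mapping rules, for illustration. The mapping table mirrors the typeMapping entries in the configuration above; everything else is assumed: Java classes are represented by name, the superclass table is a hand-written stand-in for the real class hierarchy, and the helper names are hypothetical.

```python
# Hand-written stand-in for the relevant part of the Java class hierarchy.
SUPERCLASSES = {
    "java.lang.Long": ["java.lang.Number"],
    "java.lang.Integer": ["java.lang.Number"],
    "java.lang.Double": ["java.lang.Number"],
    "java.lang.Boolean": [],
    "java.util.Date": [],
    "java.lang.String": [],
}

# (valueClass list, fieldType) pairs, in configuration order.
TYPE_MAPPINGS = [
    (["java.lang.Boolean"], "booleans"),
    (["java.util.Date"], "tdates"),
    (["java.lang.Long", "java.lang.Integer"], "tlongs"),
    (["java.lang.Number"], "tdoubles"),
]
DEFAULT_FIELD_TYPE = "text_general"


def is_instance(value_class, target_class):
    """True if value_class is target_class or one of its listed subclasses."""
    return value_class == target_class or target_class in SUPERCLASSES.get(value_class, [])


def map_field_type(value_classes):
    """Return the field type of the first mapping that accepts every value class."""
    if not value_classes:
        return DEFAULT_FIELD_TYPE
    for target_classes, field_type in TYPE_MAPPINGS:
        if all(any(is_instance(vc, t) for t in target_classes) for vc in value_classes):
            return field_type
    return DEFAULT_FIELD_TYPE
```

A multi-valued field holding a Long and a Double skips the `tlongs` mapping (the Double does not fit) and lands on `tdoubles`, since both are Numbers; a Boolean mixed with a Long matches no mapping and falls back to the default.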

Schemaless mode limitations
• Automatically adding new schema fields in production may not be a good idea
  – Unwanted fields, e.g. field name typos, won’t trigger an error
• First instance wins: field type detection can’t know about the full range of a field’s values
• Wasted space: e.g. Longs are always used, when Integers might suffice
• Limited gamut of detectable field types
• Single analysis specification for text fields
• Single processing model for all fields
Schema REST API

Schema REST API: read-only
• Each element of the schema is individually readable via the Schema REST API
• Output format can be JSON or XML (wt request param)
• Read-only elements:
  – The entire schema
    • In addition to JSON and XML output formats, output can also be in schema.xml format (?wt=schema.xml)
  – All fields, or a specified set of them
  – All dynamic fields, or a specified set of them
  – All field types, or a specific one
  – All copy field directives
  – The schema name, version, uniqueKey, and default query operator
  – The global similarity
• Managed schema is not required to use the read-only schema REST API.

Schema REST API: read-only examples

    $ SOLR=http://localhost:8983/solr/collection1

    $ curl $SOLR/schema/dynamicfields/*_i

    {
      "responseHeader":{
        "status":0,
        "QTime":1},
      "dynamicField":{
        "name":"*_i",
        "type":"int",
        "indexed":true,
        "stored":true}}

    $ curl $SOLR/schema/uniquekey?wt=xml

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
    </lst>
    <str name="uniqueKey">id</str>
    </response>

• Schema REST API URLs employ the downcased form of all schema elements, but the responses use the same casing as schema.xml.
• For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
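
As a small illustration of the URL convention, the Python helper below builds read-only request URLs by downcasing the element name. The helper name and signature are hypothetical; the base URL and the two resulting URLs are the ones used in the curl examples above.

```python
from urllib.parse import quote, urlencode

SOLR = "http://localhost:8983/solr/collection1"


def schema_url(element, name=None, wt=None, base=SOLR):
    """Build a Schema REST API read URL; the element name is downcased."""
    url = f"{base}/schema/{element.lower()}"
    if name is not None:
        # Keep '*' literal so dynamic field patterns such as "*_i" stay readable.
        url += "/" + quote(name, safe="*")
    if wt is not None:
        url += "?" + urlencode({"wt": wt})
    return url
```

`schema_url("dynamicFields", "*_i")` and `schema_url("uniqueKey", wt="xml")` reproduce the two URLs requested with curl above.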

Schema REST API: runtime schema modification
• To enable schema modification via the schema REST API, the schema must be managed, and must be configured as mutable.
• Schema modifications possible as of Solr 4.4:
  – Fields may be added
    • Copy field directives may optionally be added at the same time
  – Copy field directives may be added
• Works under both standalone Solr and SolrCloud
  – Under SolrCloud, conflicting simultaneous requests are detected using a form of optimistic concurrency and automatically retried
• Core/collection reload not required for schema modifications that are compatible with previously indexed documents
  – Generally additions are not sources of schema incompatibility
• Schema incompatibility-inducing operations will require core/collection reload:
  – Modifying or removing (dynamic) fields or copy field directives
  – Modifying all other schema elements
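
The slide does not say how the optimistic concurrency is implemented, so the following Python sketch shows only the general pattern under assumed names: a versioned in-memory store rejects writes based on a stale version, and the writer re-reads and retries. `SchemaStore`, `VersionConflict`, `add_field`, and the `before_write` hook are all hypothetical and exist only to make the retry observable.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale schema version."""


class SchemaStore:
    """In-memory stand-in for a versioned schema resource."""

    def __init__(self):
        self.version = 0
        self.fields = {}

    def read(self):
        return self.version, dict(self.fields)

    def write(self, expected_version, fields):
        if expected_version != self.version:
            raise VersionConflict()
        self.fields = fields
        self.version += 1


def add_field(store, name, field_type, max_retries=5, before_write=None):
    """Add a field, retrying on conflict; returns the number of attempts used."""
    for attempt in range(max_retries):
        version, fields = store.read()
        if name in fields:
            raise ValueError(f"field {name!r} already exists")
        fields[name] = field_type
        if before_write is not None:
            before_write(attempt)       # test hook: lets another writer sneak in
        try:
            store.write(version, fields)
            return attempt + 1
        except VersionConflict:
            continue                    # someone else won the race; re-read and retry
    raise RuntimeError("gave up after repeated conflicts")
```

If a competing writer changes the schema between the read and the write, the first attempt fails the version check, and the second attempt re-reads the newer schema and succeeds without losing the competing change.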

Schema REST API: add field example

    $ SOLR=http://localhost:8983/solr/collection1

    $ curl $SOLR/schema/fields/claimid -X PUT -H 'Content-type: application/json' --data-binary '
    {
      "type":"string",
      "stored":true,
      "copyFields": [
        "claims",
        "all"
      ]
    }'

• The copyField destinations “claims” and “all” must already exist in the schema.
• For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
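
The same add-field request could be prepared from Python with the standard library, roughly as below. This is a sketch: the function name is hypothetical, the URL, method, header, and JSON body are copied from the curl example, and the request is only built, not sent, since sending requires a running Solr with a mutable managed schema.

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr/collection1"


def build_add_field_request(name, field_type, stored=True, copy_fields=(), base=SOLR):
    """Build (but do not send) a PUT request that adds one field to the schema."""
    body = {"type": field_type, "stored": stored}
    if copy_fields:
        body["copyFields"] = list(copy_fields)
    return urllib.request.Request(
        url=f"{base}/schema/fields/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


request = build_add_field_request("claimid", "string", copy_fields=["claims", "all"])
# To actually send it: urllib.request.urlopen(request)
```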

Schema REST API TODOs
• https://issues.apache.org/jira/browse/SOLR-4898 is the umbrella JIRA issue under which further schema REST API work will be done, including:
  – adding dynamic fields
  – adding field types
  – enabling wholesale replacement by PUTing a new schema.
  – modifying and removing fields, dynamic fields, field types, and copy field directives
  – modifying all remaining aspects of the schema: Name, Version, Unique Key, Global Similarity, and Default Query Operator

Proposal: Schema Annotations
• Add arbitrary metadata at the top level of the schema and at each leaf node
• Allow read/write access to that metadata via the REST API.
• Use cases:
  – Round-trippable documentation
    • Conversion to managed schema format drops all comments
  – Documentable tags
  – When modifying the schema via REST API, a "last-modified" annotation could be automatically added.
  – User-level arbitrary key/value metadata
• W3C XML Schema has a similar facility: http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-annotation

Schema Annotation example

    <schema name="example" version="1.5">
      <annotation>
        <description element="tag"
                     content="plain-numeric-field-types">
          Plain numeric field types store and index the
          text value verbatim.
        </description>
        <documentation element="copyField">
          copyField commands copy one field to another at
          the time a document is added to the index. It's
          used either to index the same field differently,
          or to add multiple fields to the same field for
          easier/faster searching.
        </documentation>
        <last-modified>2014-03-08T12:14:02Z</last-modified>
        …
      </annotation>
    …

    <fieldType name="pint" class="solr.IntField">
      <annotation>
        <tag>plain-numeric-field-types</tag>
      </annotation>
    </fieldType>
    <fieldType name="plong" class="solr.LongField">
      <annotation>
        <tag>plain-numeric-field-types</tag>
      </annotation>
    </fieldType>
    …
    <copyField source="cat" dest="text">
      <annotation>
        <todo>Copy to the catchall field?</todo>
      </annotation>
    </copyField>
    …
    <field name="text" type="text_general">
      <annotation>
        <description>catchall field</description>
        <visibility>public</visibility>
      </annotation>
    </field>

Summary
• Schemaless Solr mode enables quick prototyping with minimal setup
• Schema REST API provides programmatic read/write access to Solr’s schema
  – More elements writeable soon
• Schema annotations would enable round-trippable documentation, tagging, and arbitrary user-provided metadata