Top Banner
Retrieving Information from Solr JOSA Data Science Bootcamp
50

Retrieving Information From Solr

Feb 14, 2017

Download

Technology

Ramzi Alqrainy
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Retrieving Information From Solr

Retrieving Information from SolrJOSA Data Science Bootcamp

Page 2: Retrieving Information From Solr

● Head of Technology @ OpenSooq.com

● Technical Reviewer for “Scaling Apache Solr” and “Apache Solr Search Patterns” (Books)

● Contributor in Apache Solr● Built 10 search engines in the

last 2 years

Ramzi Alqrainy

Page 3: Retrieving Information From Solr

Topics to be covered ● Exploring Solr’s Query Form● Basic Queries and Parameters● Matching Multiple Terms● Fuzzy Matching● Range Searches● Sorting● Pseudo Fields● Geospatial Searches● Filter Queries● Faceting and Stats● Tuning Relevance

Page 4: Retrieving Information From Solr

Detailed Architectural Diagram

Page 5: Retrieving Information From Solr

Basic Queries and Parameters

Page 6: Retrieving Information From Solr

Exploring Solr’s Query Form

Page 7: Retrieving Information From Solr

Basic Queries and Parameters

Page 8: Retrieving Information From Solr

Matching Multiple Terms

Page 9: Retrieving Information From Solr

Boolean Queries

● Search for two different terms, new and house,requiring both to match

● Search for two different terms, new and house, requiring only one to match

● Default operator is OR, can be changed using the q.op query parameter.

Page 10: Retrieving Information From Solr

Negation

● Exclude documents containing specific terms

Page 11: Retrieving Information From Solr

Inverted Index—Revisited

● All terms in the index map to 1 or more documents.● Terms in inverted index are stored in ascending

lexicographical order● When searching for multiple terms/ expressions, Solr (and

Lucene) returns multiple document result sets corresponding to the various terms in the query and then does the specified binary operations on these result sets in order to generate the final result set.

● Scoring is performed on the result set o generate final result

Page 12: Retrieving Information From Solr

Grouped Expressions

● Represent arbitrarily complex queries

Page 13: Retrieving Information From Solr

Exact Phrase Queries

● Search for exact phrase “new house” ● Can Combine with Boolean Queries

Page 14: Retrieving Information From Solr

Proximity Searches● Represent arbitrarily complex queries● Solr/Lucene not only stores the documents that contain the

terms, but also their positions within a document (term positions), which is used to provide phrase and proximity search functionality

● The number of the “~” is called a slop factor and has a hard limit of 2, above which the number of permutations get too large to provide results within a reasonable time

Page 15: Retrieving Information From Solr

Fuzzy Matching

Page 16: Retrieving Information From Solr

Fuzzy Edit-Distance Searching

● Flexibility to handle misspellings and different spellings of a word

● Character variations based on Damerau-Levenshtein distances

● Accounts for 80% of human misspellings

Page 17: Retrieving Information From Solr

Wildcard Matching

● Robust functionality, but can be expensive if not properly used.

○ First all terms that match parts of the term before wildcard expression are extracted

○ Then all those terms are inspected to see if they match the entire wildcard expression

○ Expensive if your expression matches a large number of terms (for example the query e*)

Page 18: Retrieving Information From Solr

Range Searches

Page 19: Retrieving Information From Solr

Query on a Range

● Solr Date Time uses a format that is a restricted form of the canonical representation of dateTime in the XML Schema specification (inspired by ISO 8601). All times are assumed to be UTC (no timezone specification)

● Based on a lexicographically sorted order for the field being queried● Solr has Trie field types (tint, tdate, etc.) that should be used when you are

doing a large number of range queries● Various field types will be covered later in the course

Page 20: Retrieving Information From Solr

Solr Date Syntax

● Uses UTC and Restricted DateTime format● Allows rounding down by YEAR, MONTH, WEEK, DAY,

MINUTE, SECOND● NOW represents current time and using DateMath, we can

specify yesterday, tomorrow, last year, etc.

Page 21: Retrieving Information From Solr

Sorting

Page 22: Retrieving Information From Solr

Sorting

● Sort by score● Values of Fields ● Ascending or Descending ● Multiple Fields

Page 23: Retrieving Information From Solr

Pseudo Fields

● Dynamically added at query time and calculated from fields in the schema using in- built functions

● Through functions, you can manipulate the values of any field before it is returned

● Can also be used to modify the order of documents by sorting on the pseudo field

Page 24: Retrieving Information From Solr

Geospatial Searches

Page 25: Retrieving Information From Solr

Geospatial Searches● Solr provides location-based search● Define a “location” field that contains latitude and longitude● You can use a Query parser called “geofilt” to search on this

field, specifying the point and radius around it● Another query parser bbox uses a square around the point to

do faster but approximate calculations● Other types of searches (grids, polygons, etc. are possible

and covered in advanced course

Page 26: Retrieving Information From Solr

Returning Calculated Distances

● You can use a pseudo field (a field that is calculated at query time) to achieve this

Page 27: Retrieving Information From Solr

Filter Queries

Page 28: Retrieving Information From Solr

The fq and q Parameters● Indistinguishable at first glance: same query parameters passed to either

parameters will return same documents.● But,

○ fq serves a single purpose, to limit what is returned○ q limits what is returned AND supplies the relevancy algorithm with a set

of terms used for scoring● fq results are cached and can be reused between searches● Using fq we can avoid unnecessary relevancy calculations● You can use multiple fq’s in a request (each individually cached), but only one q

parameter

Page 29: Retrieving Information From Solr

Faceting and Stats

Page 30: Retrieving Information From Solr

Faceted Search● High-level breakdown of search

results based on one or more aspects (facets) of their documents

● Allows users to filter by (drill down into) specific components

● Can facet on values of fields, or facet by queries

Page 31: Retrieving Information From Solr

Types of Facet

● Field Facets

● Range Facets

● Pivot Facets

Page 32: Retrieving Information From Solr

Field Faceting

● Request back the unique values found in a particular field

● Most commonly used● Works for single- and multi-valued

fields● Values are based on the indexed

values of the field● Common practice is to facet on a

String field and search on a text field (to be discussed later). So, some schema preparation is required for faceting

Page 33: Retrieving Information From Solr

Range Facet

● Divide a range into equal size buckets

Page 34: Retrieving Information From Solr

Range Faceting

Page 35: Retrieving Information From Solr

Date Range Facets

● Recall Solr Date Syntax covered earlier in class● Uses UTC and Restricted DateTime format● Allows rounding down by YEAR, MONTH, WEEK, DAY,

MINUTE, SECOND● NOW represents current time and using DateMath, we can

specify yesterday, tomorrow, last year, etc.

Page 36: Retrieving Information From Solr

Stats and Facets

● Can get aggregations on various fields● From Solr 5.x onwards, stats on pivot facets is also available● See https://lucidworks.com/blog/you-got-stats-in-my-facets/ for

a great explanation of faceting

Page 37: Retrieving Information From Solr

Pivot Facets

● Functions like pivot tables in spreadsheet apps● Aggregate calculations that pivot on values from multiple fields● Example: give me a count of 3,4 and 5 star hotels in the top

three cities● Solr 5.x also allows you to stats calculations on pivots

Page 38: Retrieving Information From Solr

Facet by Query

● Sometimes, you need unequal ranges ● You can use the facet.query parameter ● Provides counts for subqueries

Page 39: Retrieving Information From Solr

Tuning Relevance

Page 40: Retrieving Information From Solr

Precision and recall

Precision and recall

Are the top results we show to users relevant?

Recall

Of the full set of documents found, have we found all of the relevant content in the index?

Page 41: Retrieving Information From Solr

RelevancyOur goal is to give users relevant results Relevance is a soft or fuzzy thing

● Depends upon the judgment of users

Scoring is our attempt to predict relevance

Similarity classes hold the implementations

● DefaultSimilarity ( TF-IDF )● BM25Similarity● DFRSimilarity● IBSimilarity● LMDirichletSimilarity● LMJelinekMercerSimilarity

Page 42: Retrieving Information From Solr

Lucene Scoring

Similarity scoring formula

• Used to rank results by measuring the similarity between a query and the documents that match the query

Page 43: Retrieving Information From Solr

Domain knowledgeExamples

● Cheaper● Newer or more recent● More popular or higher user clicks Higher average user ratings

Interesting combinations

● Value = average user ratings ÷ price ● Staying power = recent popularity ÷ age

Page 44: Retrieving Information From Solr

Boosting and biasing

Lucene uses a standardized scoring approach

Lucene does not know:

● Your data● Your users● Their queries Their preferences

Page 45: Retrieving Information From Solr

Domain knowledge

What do you know about your data?

● Any specific rules about your data that wouldn't be suitable in a generic IR scoring algorithm

● In many data domains, there are fundamental numeric properties that make some objects generally "better" than others

Page 46: Retrieving Information From Solr

Domain knowledgeMore subtle examples

● Novelty factor○ Quantity of user ratings × stdDev of ratings Profit margin

● Profit margin○ Retail price ‒ factory cost Scarcity

● Scarcity○ Quantity remaining

● Popularity by association or categorization○ Sweaters sell better then swimsuits in November

● Manual ranking○ New York Times bestseller list

Page 47: Retrieving Information From Solr

Request parameters

We are going to make substantial use of request parameters, so let's recap:

Page 48: Retrieving Information From Solr

How can you improve search results?

Using a sledge hammer

● Ignore score, sort on X

● Filter by X, retry if 0 results

Page 49: Retrieving Information From Solr

How can you improve search results?

● Boost functions and queries● Apply domain knowledge based on numeric properties by

multiplying functions directly into the score

Page 50: Retrieving Information From Solr

Retrieving Information from SolrJOSA Data Science Bootcamp