LUCENE / SOLR 4 SPATIAL DEEP DIVE David Smiley Software Systems Engineer, Lead
May 25, 2015
© 2013 The MITRE Corporation. All rights reserved.
LUCENE / SOLR 4 SPATIAL
DEEP-DIVE
2013 Lucene Revolution
Presented by David Smiley, MITRE
About David Smiley
• Working at MITRE for 13 years
• web development, Java, search
• 3 Solr apps, 1 Endeca
• Published 1st book on Solr; then 2nd edition (2009, 2011)
• Apache Lucene / Solr committer/PMC member (2012)
• Specializing in spatial
• Presented at Lucene Revolution (2010) & Basis O.S. Search Conference (2011, 2012)
• Taught Solr classes at MITRE (2010, 2011, 2012)
• Solr search consultant within MITRE and its sponsors, and privately
Agenda
• Background, overview
• Spatial4j
• Lucene spatial
• PrefixTree / Trie / Grid
• Solr spatial
• Demo
• Interesting use-cases
BACKGROUND & OVERVIEW
What is Spatial Search?
Popular features:
• Spatial filter query
• Spatial distance sorting
• Spatial distance relevancy (i.e. spatial query score)
Not “geocoding” (resolving “Boston” to its latitude and longitude)
Typical use-case:
1. Index a location for each Lucene document given a latitude & longitude
2. Then search for matching documents by a circle (point-radius) or bounding box
3. Then sort results by distance
History of Spatial for Lucene & Solr
• 2007: Local-Lucene
• by Patrick O’Leary (AOL)
• 2009-09: LL -> Lucene spatial contrib in Lucene 2.9.0
• Local-Lucene graduates to an official Lucene contrib module
• 2009-12: Spatial Search Plugin (SSP) for Solr
• by Chris Male (JTeam -> Orange11, ElasticSearch)
• 2010-10: SOLR-2155 a geohash prefix tree filter
• by David Smiley (MITRE)
• 2011-01: Lucene Spatial Playground (LSP)
• by Ryan McKinley (Voyager GIS), David, and Chris
• 2011-03: Solr 3.1 new spatial features
• by Grant Ingersoll and Yonik Seeley (LucidWorks)
• 2012-03: LSP -> Lucene 4 spatial module + Spatial4j + SSP
• replaces former Lucene spatial contrib module
Lucene Spatial Committers
• David Smiley
• Works for MITRE
• Boston area
• Ryan McKinley
• Works for Voyager GIS
• Silicon Valley
• Chris Male
• Formerly at ElasticSearch
• New Zealand
Spatial decomposed
• Spatial4j
• Shapes, WKT, Distance calculations, JTS adapter
• Lucene spatial
• Strategies: PrefixTree (TermQuery & Recursive impl.), BBox, PointVector
• Solr adapters
• Misc: Spatial Solr Sandbox
• LSE
• JtsGeoStrategy
• Spatial-Demo (web app)
Lines of Code for Spatial Components
Spatial4j 43%
Lucene spatial 35%
Solr adapters 6%
Misc 16%
Total: 4,781 Non-Comment Source Statements (without javadocs or tests)
as of 2012-09
CarrotSearch Labs’ RandomizedTesting
• http://labs.carrotsearch.com/randomizedtesting.html
• Provides plumbing for repeatable randomized JUnit tests
• All the spatial test code uses it extensively
Randomized testing more generally is a philosophy / approach to testing
• A typical hard-coded test will only catch some regressions
• A randomized test will catch just about anything eventually, especially nasty edge cases
• The trade-off: these tests are harder to read / write / maintain
• Randomized testing helped find bugs related to…
• Computing the bounding box of a circle
• Computing the relationship of a circle to a rectangle that has all 4 of its corners inside it
SPATIAL4J It’s all about the shapes
Spatial4j: It’s all about the shapes
https://github.com/spatial4j/spatial4j (spatial4j.com redirect)
• Shapes
• A “Shape” abstraction with multiple implementations
• Geodetic (sphere) & Cartesian/2D implementations
• Computes intersection relationship with other shapes
• Also…
• Distance and area math utilities, Geohash utilities
• Parsing Well Known Text (WKT) formatted shapes
• ASL licensed project independent of Apache on GitHub
• Requires JTS (LGPL licensed) for polygons & WKT*
• JTS is “JTS Topology Suite”
• * WKT parsing soon to be implemented directly by Spatial4j
• Ported to .NET as Spatial4n and used by RavenDB
• by Itamar Syn-Hershko
The case for Spatial4j’s existence
• Just for shapes? How much code could there be?
• You’d be surprised. Determining the relationship between a lat-lon rectangle and a geodetic circle (Within, Contains, Intersects, Disjoint) is non-trivial, and that’s just one shape.
• Lots of non-trivial test code goes with it.
• Why isn’t it a part of Lucene spatial?
• Parts of Spatial4j depend on JTS, an LGPL licensed library. The Lucene PMC voted not to introduce this compile-time dependency.
• Spatial4j is independently useful.
• Does this duplicate other open-source code that could be used instead?
• Spatial4j needs to be ASL licensed to be a dependency of Lucene.
• Still… I haven’t found existing code that does what Spatial4j does.
• Couldn’t just the JTS-dependent parts be external to Lucene?
The Shape interface
(may become an abstract class in the next version)
interface Shape {
  Point getCenter();
  Rectangle getBoundingBox();
  boolean hasArea();
  double getArea();
  SpatialRelation relate(Shape other); // must support Point & Rectangle
}
• enum SpatialRelation
• DISJOINT, INTERSECTS, WITHIN, CONTAINS
• Note: simpler set than the “DE-9IM” spatial standard
• no “equals” or “touches”
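To make the four relations concrete, here is a minimal sketch for the simplest case only: two Cartesian rectangles with no dateline wrap. The class RectRelateSketch and its Relation enum are hypothetical stand-ins written for this illustration; they are not Spatial4j's classes, and the geodetic cases discussed on these slides are considerably harder.

```java
/** Hypothetical, simplified Cartesian rectangle; NOT the Spatial4j Rectangle. */
final class RectRelateSketch {
    /** Mirrors the four relation names from the slide (illustrative enum). */
    enum Relation { DISJOINT, INTERSECTS, WITHIN, CONTAINS }

    final double minX, maxX, minY, maxY;

    // Argument order (minX, maxX, minY, maxY) is an assumption chosen for this sketch.
    RectRelateSketch(double minX, double maxX, double minY, double maxY) {
        this.minX = minX; this.maxX = maxX; this.minY = minY; this.maxY = maxY;
    }

    /** How this rectangle relates to the other one (flat plane, no dateline wrap). */
    Relation relate(RectRelateSketch o) {
        // No overlap on at least one axis means the rectangles are disjoint.
        if (o.minX > maxX || o.maxX < minX || o.minY > maxY || o.maxY < minY) {
            return Relation.DISJOINT;
        }
        // This rectangle lies entirely inside the other (equal rectangles land here).
        if (minX >= o.minX && maxX <= o.maxX && minY >= o.minY && maxY <= o.maxY) {
            return Relation.WITHIN;
        }
        // The other rectangle lies entirely inside this one.
        if (o.minX >= minX && o.maxX <= maxX && o.minY >= minY && o.maxY <= maxY) {
            return Relation.CONTAINS;
        }
        // Partial overlap.
        return Relation.INTERSECTS;
    }

    public static void main(String[] args) {
        RectRelateSketch big = new RectRelateSketch(0, 10, 0, 10);
        RectRelateSketch small = new RectRelateSketch(2, 4, 2, 4);
        System.out.println(big.relate(small));
        System.out.println(small.relate(big));
    }
}
```

Because there is no “equals” relation, this sketch reports two identical rectangles as WITHIN; that tie-break is a choice made here, not something the slides specify.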
Spatial4j shapes
Shape                           Cartesian   Cartesian with dateline wrap   Geodetic
Point                           Y           Y                              Y
Line & LineString (w/ buffer)   Y           N                              N
Rectangle                       Y           Y                              Y
Circle                          Y           N                              Y
ShapeCollection                 Y           Y                              Y
JTS Geometry (incl. polygons)   Y           Y                              N
• Cartesian (AKA Euclidean): a flat plane
• Dateline wrap assumes the plane circles back on itself
• Geodetic: a spherical mathematical model
Well Known Text (WKT)
(see Wikipedia)
• A popular standard for representing shapes as strings
• Currently requires JTS’s WKT parser; Spatial4j’s own parser is in progress
• Extensions are TBD for Rectangles and Circles
• Limited support for EMPTY and “Z” and “M” dimensions (future)
• Some Examples:
• POINT (3 -2)
• LINESTRING(30 10, 10 30, …
• POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))
• MULTIPOLYGON (((…
• …
• Deprecated (may move to Solr):
• -90, -180
• -180 -90 180 90
• CIRCLE(4.56,1.23 d=0.071)
• TBD / Pending:
• ENVELOPE(-180,180,90,-90)
• BOX2D(-180 -90, 180 90)
Spatial4j code sample
SpatialContext ctx = SpatialContext.GEO;
// makeRectangle(minX, maxX, minY, maxY); makeCircle(x, y, radius in degrees)
Rectangle r = ctx.makeRectangle(-71, -70, 42, 43);
Circle c = ctx.makeCircle(-72, 42, 1);
SpatialRelation rel = r.relate(c);
System.out.println(rel);
rel.intersects(); // boolean
// Switch to the JTS-backed context to parse a WKT polygon
ctx = JtsSpatialContext.GEO;
Shape s = ctx.readShape("POLYGON ((30 10, 10 20, 20 40, 40 40, 30 10))");
double distanceDegrees = ctx.getDistCalc().distance(
    ctx.makePoint(2, 2), ctx.makePoint(3, 3));
Note: distances (including circle radius) are in “Degrees”, not radians or KM
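Since distances are expressed in degrees, a kilometer value has to be converted before it is used as a radius. The sketch below shows the underlying arithmetic on a spherical model: degrees = toDegrees(km / earthRadius). The class DegreesSketch, its method names, and the radius constant are assumptions made for this illustration; they are not Spatial4j's API.

```java
/** Hypothetical helper showing km <-> degrees conversion on a sphere. */
final class DegreesSketch {
    // Assumed mean Earth radius in km for this sketch; use the library's own constant in real code.
    static final double EARTH_MEAN_RADIUS_KM = 6371.0087714;

    /** Arc length in km converted to the angle it subtends, in degrees. */
    static double kmToDegrees(double km) {
        return Math.toDegrees(km / EARTH_MEAN_RADIUS_KM);
    }

    /** Angle in degrees converted to arc length in km. */
    static double degreesToKm(double degrees) {
        return Math.toRadians(degrees) * EARTH_MEAN_RADIUS_KM;
    }

    public static void main(String[] args) {
        System.out.println("200 km in degrees: " + kmToDegrees(200));
        System.out.println("1 degree in km: " + degreesToKm(1));
    }
}
```

Under this model one degree is a little over 111 km, so a 200 km radius is roughly 1.8 degrees.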
Spatial4j Future
• Built-in WKT support (no JTS dependency)
• Extensible to user-defined shapes
• API improvements
• Shape argument validation via WKT but not via ctx.makeShape(…)
• ShapeCollection visitor design pattern
• Refactor to remove need for isGeo()
• LineString dateline & geodetic support
• Projection / Datum support
LUCENE SPATIAL Spatial index information retrieval
Lucene 4 Spatial Module
• There isn’t one best way to implement spatial indexing for all use-cases
• Index just points, or other shapes too? Which?
• Multiple shapes per field?
• Query by Intersection? Contains? Within? Equals? Disjoint? …
• Distance sorting? Query boost by distance?
• Or more exotic shape relevancy like overlap percentage?
• Tradeoff shape precision for speed?
• Multiple SpatialStrategy implementations:
• RecursivePrefixTreeStrategy and TermQueryPrefixTreeStrategy
• PointVectorStrategy
• BBoxStrategy (currently in trunk, not 4x)
• JtsGeoStrategy (in Spatial Solr Sandbox)
Strategy: PointVector
• Similar to Solr’s PointType / LatLonType
• X & Y trie double fields; caching via FieldCache
• Characteristics
• Indexes points (only)
• Single-valued field (no multi)
• Query by rectangle or circle (only)
• Circle uses FieldCache (requires memory)
• Circle does bbox pre-filter for performance
• Relations: Intersects, Within (only)
• Exact precision for x & y coordinates and query shape
• Distance sort
• Uses FieldCache (requires memory)
Strategy: BBox
• Implemented with 4 doubles & 1 boolean
• Ported from ESRI GeoPortal (Open Source)
• Characteristics:
• Indexes rectangles (only)
• Single-valued field (no multi)
• Query by rectangle (only)
• Supports all relations: Intersects, Within, Contains, …
• Distance sort from box center
• Uses FieldCache (requires memory)
• Area overlap sorting
• Sort results by percentage overlap between query and indexed boxes
• Uses FieldCache (requires memory)
• Note: FieldCache needs are somewhat high
Strategy: JtsGeoStrategy
• Stores a JTS geometry in Lucene 4’s DocValues
• Stores WKB (Well-Known Binary, the binary counterpart of WKT)
• Full vector geometry is retained for search
• DocValues is mostly a better FieldCache
• Faster loading into memory
• Can be disk resident or memory
• Multi-valued
• Characteristics:
• Indexes any shape, including Multi… varieties
• Query by any shape
• Uses DocValues (memory use optional)
• Supports all relations: intersect, within, contains, …
• Could easily also support JTS’s exotic DE-9IM based relations
• Exact precision to the vector geometry
• No sorting
• Experimental / immature status; more of a proof-of-concept for now
PREFIXTREE STRATEGY Spatial grid indexing
Strategy: RecursivePrefixTree
• Grid / Tile / Trie / Prefix-Tree based
• With recursive descent algorithms
• Or TermQueryPrefixTree alternative
• Choose Geohash (geo only) or Quad tree
• The most mature strategy to date
• Highly tested
• The current evolution of SOLR-2155
Strategy: RecursivePrefixTree
• Characteristics:
• Indexes all shapes
• Variable precision of shape edges
• Highly precise shapes other than Point won’t scale
• LineString possibly not precise enough for your needs
• Multi-valued field support
• Query by any shape
• Variable precision for query shape
• Highest precision usually scales
• All Relations: Intersects, Within, Contains, Disjoint
• Distance sort (w/ multi-value support)
• Warning: immature, won’t scale
• Uses significant amounts of memory
• Fast scalable spatial filtering; no caches needed
new in Lucene 4.3
How many search / NoSQL systems have these capabilities?
Geohashes
• What is a Geohash?
• A lat/lon geocode system
• Has a hierarchical spatial structure
• Gradual precision degradation
• In the public domain
http://en.wikipedia.org/wiki/Geohash
• Example: (Boston) DRT2Y
Demo
http://openlocation.org/geohash/geohash-js/
[Figures: map zooming in through geohash cells D, DR, DRT, DRT2, DRT2Y]
Geohash Grids
[Figures: internal coordinates of an odd length geohash (DRT2Y) and of an even length geohash (DRT2)]
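The hierarchical structure and gradual precision degradation follow directly from how a geohash is built: longitude and latitude ranges are halved alternately, and every 5 bits become one base-32 character. The sketch below is a minimal encoder written for this illustration (the class GeohashSketch is hypothetical, not a Lucene or Spatial4j class). It emits lowercase characters, whereas the slides display geohashes in uppercase.

```java
/** Hypothetical minimal geohash encoder, for illustration only. */
final class GeohashSketch {
    // Standard geohash base-32 alphabet (digits plus letters, omitting a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a latitude/longitude pair to a geohash of the given length. */
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder sb = new StringBuilder(length);
        boolean lonBit = true; // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (sb.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { value = (value << 1) | 1; minLon = mid; }
                else { value <<= 1; maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { value = (value << 1) | 1; minLat = mid; }
                else { value <<= 1; maxLat = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // every 5 bits yield one character
                sb.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A point in the Boston area; each shorter hash is a prefix of the longer ones.
        for (int len = 1; len <= 5; len++) {
            System.out.println(encode(42.35, -71.08, len));
        }
    }
}
```

Truncating a geohash just widens its cell, which is why a prefix query on the indexed terms acts as a coarse spatial filter.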
Demo
• Spatial Solr Playground
• Demo KML grid generation from geometries
• A sample point with quad tree indexes to these tokens:
• A, AD, ADB, ADBA
• A sample circle with quad tree indexes to these tokens:
• A, AB, ABA, ABAB+, ABAC+, ABAD+, ABB, ABBA+, ABBB+, ABBC+, ABBD+, ABC, ABCA+, ABCB+, ABCC+, ABCD+, ABD+, AD, ADA, ADAA+, ADAB+, ADAC+, ADAD+, ADB+, ADC, ADCA+, ADCB+, ADCD+, ADD, ADDA+, ADDB+, ADDC+, ADDD+, B, BA, BAA, BAAC+, BAAD+, BAC, BACA+, BACB+, BACC+, BACD+, BC, BCA, BCAA+, BCAB+, BCAC+, BCC, BCCA+, BCCC+, C, CB, CBB, CBBA+
• Tokens with a ‘+’ are actually indexed with and without the ‘+’
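The point tokens above can be reproduced with a few lines of code: at each level the current cell is split into four quadrants and the letter of the quadrant holding the point is appended. The sketch below is hypothetical (QuadTokenSketch is not a Lucene class), and the quadrant lettering (A = upper-left, B = upper-right, C = lower-left, D = lower-right) and the world bounds are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical quad tree tokenizer for a single point, for illustration only. */
final class QuadTokenSketch {
    /** Returns one token per level, each a prefix of the next. */
    static List<String> tokensForPoint(double x, double y, int levels) {
        // Assumed world bounds: longitude on x, latitude on y.
        double minX = -180, maxX = 180, minY = -90, maxY = 90;
        List<String> tokens = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        for (int level = 0; level < levels; level++) {
            double midX = (minX + maxX) / 2;
            double midY = (minY + maxY) / 2;
            boolean right = x >= midX;
            boolean top = y >= midY;
            // Assumed lettering: A upper-left, B upper-right, C lower-left, D lower-right.
            char quadrant = top ? (right ? 'B' : 'A') : (right ? 'D' : 'C');
            // Shrink the cell to the chosen quadrant.
            if (right) minX = midX; else maxX = midX;
            if (top) minY = midY; else maxY = midY;
            path.append(quadrant);
            tokens.add(path.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokensForPoint(-30, 40, 4));
    }
}
```

With these assumptions the point (x=-30, y=40) yields A, AD, ADB, ADBA. A non-point shape such as the circle needs a recursive walk that keeps every intersecting cell, which is why it produces so many more tokens.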
PrefixTreeStrategy Architecture
• Shape: calc rect relationship
• SpatialPrefixTree & Cell: byte string to/from Cell (rect)
• PrefixTreeStrategy: index & search algorithms
• Lucene: TermsEnum
• Filters: IntersectsPrefixTreeFilter, ContainsPrefixTreeFilter, WithinPrefixTreeFilter
Lucene Spatial example code
SpatialContext ctx = SpatialContext.GEO;
// Geohash grid with up to 11 levels of precision
SpatialStrategy strategy = new RecursivePrefixTreeStrategy(
    new GeohashPrefixTree(ctx, 11), "myGeoField");
… // make indexWriter and a Document
// Index: add the fields the strategy generates for the shape
for (Field f : strategy.createIndexableFields(shape))
    doc.add(f);
indexWriter.addDocument(doc);
…
// Search: keep documents intersecting a 200 km circle around x=-80, y=33
Filter filter = strategy.makeFilter(
    new SpatialArgs(SpatialOperation.Intersects,
        ctx.makeCircle(-80.0, 33.0,
            DistanceUtils.dist2Degrees(200,
                DistanceUtils.EARTH_MEAN_RADIUS_KM))));
indexSearcher.search(userKeywordQuery, filter, 10);
See SpatialExample.java in Lucene spatial tests for more
Future
• Possible de-emphasis of SpatialStrategy abstraction
• Better options for distance sorting of PrefixTree strategies
• Better PrefixTree encoding than both geohash & quad tree
• Google Summer of Code 2013 -- TBD
• Performance improvements to spatial Intersects RecursivePrefixTree Filter
• Remove the need to double-index leaf-nodes (with and without ‘+’)
• Exact geometry search by blending benefits of PrefixTree and JtsGeoStrategy
• A Single-dimensional PrefixTree (for numeric range index)
SOLR SPATIAL Adapters to Lucene 4 spatial
Solr 3 Spatial: LatLonType & friends
• Solr 3 was Solr’s first release to include spatial support
• Not based on Lucene’s old spatial contrib module
• Similar to TwoDoublesStrategy but more optimized
• Single-valued only, fast distance sorting, can choose floats (save memory)
• Fields:
• LatLonType (Geodetic)
• PointType (Cartesian)
• Query parsers (spatial filters):
• {!geofilt} (circle), with “pt”, “sfield”, and “d” params
• {!bbox} (bounding box of a circle)
• Distance function:
• geodist() and some esoteric others
• Note: NOT completely superseded by Solr 4 spatial fields
Solr 4 Spatial
• See http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
<fieldType name="location_rpt"
class="solr.SpatialRecursivePrefixTreeFieldType"
spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
distErrPct="0.025"
maxDistErr="0.000009"
units="degrees" />
• spatialContextFactory: if you don’t need JTS (polygons), don’t set this
• distErrPct: non-point shapes are approximated to the grid, up to 2.5% of radius
• maxDistErr: max precision (1m) as measured in degrees
Indexing
• Point: Latitude, Longitude (i.e. Y, X)
<field name="geo">43.17614, -90.57341</field>
• Point: X Y
<field name="geo">-90.57341 43.17614</field>
• Rect: minX minY maxX maxY
<field name="geo">-74.093 41.042 -69.347 44.558</field>
• Circle: point then d=radius (in degrees)
• will be deprecated
<field name="geo">Circle(4.56,1.23 d=0.0710)</field>
• WKT (preferred; it’s a standard)
<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
Filter (search)
• Using Solr 3’s bbox or geofilt query parsers
• Distance radius ‘d’ is interpreted as kilometers, just like LatLonType
• Limited to a circle (geofilt) and the bounding box of a circle (bbox)
fq={!geofilt}&sfield=geo&pt=45.15,-93.85&d=5
• Range query style (bounding box)
• Handles dateline wrap
fq=geo:[-90,-180 TO 90,180]
• Field query style
• Unique to Lucene 4 spatial; see SpatialArgsParser
fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))) distErrPct=0"
• Predicates: Intersects, IsDisjointTo, IsWithin, Contains, …
• distErrPct (& distErr) optional; override field type’s default
• See also SOLR-4242: a better spatial query parser
Distance Sort & Relevancy Boost
• geodist() is for Solr 3 LatLonType only
sort=geodist(lltField,45.15,-93.85) desc
• Solr 4 spatial queries can return the distance as the score
q={!geofilt sfield=geo pt=45.15,-93.85 d=5 score=distance}&sort=score asc&fl=*,score
• Without a filter
sort=query($sortsq) asc&sortsq={!geofilt filter=false score=distance sfield=geo pt=45.15,-93.85 d=0}
• Relevancy boost
defType=edismax&boost=query($mysq)&mysq={!geofilt filter=false score=recipDistance pt=45.15,-98.85 d=5}
Distance Faceting
• sfield=geo (the field)
• pt=45.15,-93.85 (point of reference)
• Within 10km
facet.query={!geofilt d=10}
• Within 50km
facet.query={!geofilt d=50}
• Within 100km
facet.query={!geofilt d=100}
Future
• A more Solr-friendly spatial query parser SOLR-4242
• Retrofit geodist() to support the SpatialStrategies?
• Expose more tunables
• A grid based heat-map faceting component
• Idea: a multi-strategy spatial field encompassing
• A PrefixTree field for points
• A PrefixTree field for non-points
• A TwoDoubles field for good distance sorting / relevancy
• Knows whether it’s single vs. multi-valued
• A FieldType for multi-value numeric ranges
DEMO
INTERESTING USE CASES
Heatmap / Grid faceting
1. Geohash each point to multiple lengths and index each length into its own field
• geohash_1:D, geohash_2:DR, geohash_3:DRT, geohash_4:DRT2
2. Search with a rectangle (bbox) filter, and…
3. Facet on the geohash field with the desired resolution
• facet.field=geohash_4&facet.limit=10000
• Lots of tuning / customization options
• Projected / quad tree
• facet.prefix may help
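Step 1 of the grid faceting recipe amounts to taking prefixes of one full-length geohash. A minimal sketch of deriving the per-length field values follows; the class GeohashFieldsSketch is hypothetical, and the geohash_N field naming simply follows the example on the slide. Actually sending the values to Solr is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that derives one field value per geohash length. */
final class GeohashFieldsSketch {
    /** Maps "geohash_N" to the first N characters of the geohash, for N = 1..maxLength. */
    static Map<String, String> prefixFields(String geohash, int maxLength) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps insertion order
        int n = Math.min(maxLength, geohash.length());
        for (int len = 1; len <= n; len++) {
            fields.put("geohash_" + len, geohash.substring(0, len));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(prefixFields("DRT2Y", 4));
    }
}
```

Faceting on geohash_4 then counts documents per 4-character cell, and faceting on a shorter field gives a coarser grid.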
Plotting many points on a map
• Why not ask Solr for rows=1000 ?
• It’s slow
• With a variable number of points per doc, the result could be anywhere from 1 distinct point to 1M
• Instead facet on a geohash with facet.limit=1000
• Fast
• Guaranteed <= 1000 points
• But might need lots of memory
• Or result-grouping on a geohash
But do you really want to plot 1000+ points on a map?
Filter by indexed distance constraints
• Imagine a dating site where both potential parties have a maximum distance they’re willing to travel
• Q: For the current user, who is not “too far” for you but is also not “too far” for them?
• A: Index each user’s location as a point in one field and as a circle in another. Query by the current user’s circle to the indexed point field as well as the current user’s point to the indexed circle field.
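The two filters in the answer reduce to one condition: the distance between the two users must be within both users' limits. The sketch below checks that in memory with a haversine distance. It illustrates the logic only, not how the index evaluates it; the class MutualDistanceSketch, its method names, and the Earth radius constant are assumptions made for this example.

```java
/** Hypothetical in-memory check of the two-way "not too far" condition. */
final class MutualDistanceSketch {
    static final double EARTH_RADIUS_KM = 6371.0087714; // assumed mean radius

    /** Great-circle distance in km between two lat/lon points (haversine formula). */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /**
     * True when the other user's point falls inside my circle (first filter)
     * and my point falls inside the other user's circle (second filter).
     */
    static boolean mutualMatch(double myLat, double myLon, double myMaxKm,
                               double otherLat, double otherLon, double otherMaxKm) {
        double d = haversineKm(myLat, myLon, otherLat, otherLon);
        return d <= myMaxKm && d <= otherMaxKm;
    }

    public static void main(String[] args) {
        // The two users are about 55.6 km apart (0.5 degrees of latitude).
        System.out.println(mutualMatch(42.36, -71.06, 100, 42.86, -71.06, 50));
        System.out.println(mutualMatch(42.36, -71.06, 100, 42.86, -71.06, 60));
    }
}
```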
Multi-valued durations
• What if your documents needed a variable number of time (or other numerical value) durations?
• This approach won’t work (start and end values are indexed independently, so a query can match the start of one duration and the end of another):
<field name="start" type="tdate" multiValued="true"/>
<field name="end" type="tdate" multiValued="true"/>
• Solr (without Solr 4 spatial fields) can’t do it!
• You need to think differently to solve this…
http://wiki.apache.org/solr/SpatialForTimeDurations
• Example use-cases
• Searching for hotel-room vacancies
• Searching for movie show-times
• (next slides) Each document is a person with a variable number of “shifts” that they are working…
… model durations as points
… queries become rectangles
… some config & search details
• Configuration
<fieldType name="days_of_year"
class="solr.SpatialRecursivePrefixTreeFieldType"
geo="false" units="degrees"
worldBounds="0 0 365 365"
distErrPct="0" maxDistErr="1"/>
• Sample search: find shifts that have any overlap with the 19th through 23rd day
daysOfYear:Intersects(0 18.5 23.5 365)
• Caveat: Won’t scale to the full precision of a Java long (timestamp)
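The trick is easier to see in plain code: a duration [start, end] becomes the point (x=start, y=end), and “overlaps the query range” becomes “the point lies in the rectangle x ≤ queryEnd, y ≥ queryStart”. The sketch below mirrors the sample search's rectangle (minX minY maxX maxY = 0 18.5 23.5 365). The class DurationPointSketch and its methods are hypothetical, written only to illustrate the geometry.

```java
/** Hypothetical illustration of durations as points and overlap queries as rectangles. */
final class DurationPointSketch {
    /** Builds the query rectangle {minX, minY, maxX, maxY} for "overlaps [queryStart, queryEnd]". */
    static double[] overlapQueryRect(double queryStart, double queryEnd, double worldMax) {
        // x is the duration start: anything that starts by queryEnd.
        // y is the duration end: anything that ends at or after queryStart.
        return new double[] {0, queryStart, queryEnd, worldMax};
    }

    /** True when the duration, seen as the point (start, end), lies inside the rectangle. */
    static boolean matches(double shiftStart, double shiftEnd, double[] rect) {
        double x = shiftStart;
        double y = shiftEnd;
        return x >= rect[0] && x <= rect[2] && y >= rect[1] && y <= rect[3];
    }

    public static void main(String[] args) {
        double[] rect = overlapQueryRect(18.5, 23.5, 365);
        System.out.println(matches(20, 25, rect)); // shift on days 20 to 25
        System.out.println(matches(1, 10, rect));  // shift on days 1 to 10
    }
}
```

Because each point carries its own start and end together, a document with many shifts matches only when a single shift satisfies both bounds, which is exactly what the two multi-valued date fields cannot express.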
Thank you!
• References
• Lucene 4 spatial javadocs
• https://builds.apache.org/job/Lucene-Artifacts-4.x/javadoc/spatial/
• Spatial4j at GitHub
• https://github.com/spatial4j/spatial4j ( spatial4j.com redirect)
• http://spatial4j.16575.n6.nabble.com -- [email protected]
• Solr
• http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
• Spatial Solr Sandbox
• https://github.com/ryantxu/spatial-solr-sandbox
• Contact me:
• David Smiley [email protected] [email protected]