GeoWave Geospatial Indexing on Accumulo Eric Robertson Rich Fecher

Accumulo Summit 2015: GeoWave: Geospatial and Geotemporal Data Storage and Retrieval in Accumulo [Geo]

Jul 15, 2015

Transcript
Page 1

GeoWave

Geospatial Indexing on Accumulo

Eric Robertson Rich Fecher

Page 2

• Geographic Information Systems (GIS)

• GeoWave Overview

– Features

– Components

– Data Types

• The Fundamentals

– How does GeoWave organize geospatial data?

• Set of problems and solutions with Accumulo

– Deduplication

– WFS-T Transaction Isolation

– Map Occlusion Culling

– Raster Data

– Statistics

OUTLINE

Page 3

• GIS Technology Explosion

– E.g. Smart Phone and GPS Applications

• Data Explosion

– Satellite Imagery, Ground Based Imagery, Aerial Photography

• Problems:

– Generate Maps: Create base image and add vector data (shapes):

• points of interest

• roads

• boundaries

– Find Features

“restaurants near you”

– Analysis

Density, Surface Analysis, Interpolation, Pattern Discovery

GIS: GEOGRAPHIC INFORMATION SYSTEM

Generated by OpenStreetMap.org

Page 4

• Leverage Accumulo offerings as distributed data store

– High-performance ingest

– Horizontally scalable

– Per-entry access constraints

• Fast geospatial retrieval

• Geo-temporal indexing

• Pre-calculated statistics:

– Counts per Data Type

– Bounding Region

– Time Range

– Numeric Range

– Histograms

FEATURES OF GEOWAVE

Page 5

INTEGRATED COMPONENTS

• Accumulo Data Store

• Hadoop Map-Reduce input/output formats

• GeoServer integration with GeoTools

• Vector and Raster Data

• Multi-Threaded Ingest Tools

• Administrative RESTful Services

– Layers and Data Stores

• Analytics

– Kernel Density

– K-means Clustering

– Sampling

– PDBScan coming soon with Apache Spark 1.2.1

Tested Versions

• Accumulo 1.5.1, 1.6.x

• Cloudera 2.0.0-cdh4.7.0, 2.5.0-cdh5.2

• Hortonworks HDP 2.1

• Apache 2.6

• GeoTools 11.4, 12.1, 12.2

• GeoServer 2.5.2, 2.6.1

Page 6

• Data Structures

– Simple Feature (ISO 19125) via GeoTools (http://www.geotools.org/).

– Raster Images

– Custom

• Provided Ingest Types

– Vector Data Sources (GeoTools)

• Examples: Shapefiles, GeoJSON, PostGIS, etc.

– Grid Formats (GeoTools)

• Examples: ArcGrid, GeoTIFF, etc.

– GeoLife GPS Trajectories (http://research.microsoft.com/en-us/projects/GeoLife/)

– GPX (http://www.topografix.com/gpx.asp)

– T-Drive (http://research.microsoft.com/en-us/projects/tdrive/)

– PDAL

DATA TYPES

Page 7

• Basic Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g. a bounding box).

• Accumulo: Each table is organized into blocks of sorted row identifiers.

• Revised Problem: Two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries.

MAIN PROBLEM: INDEX TWO DIMENSIONS IN A SINGLE-DIMENSION INDEX

Page 8

GENERALIZED PROBLEMS

Solve the general problem first. Then apply to Geospatial specific problems.

Multi-Dimension Index supporting efficient data retrieval given bounded set of constraints for each dimension.

Indexed data includes scalars and intervals per dimension.

For example, a range of time or a polygon.

Index over a mix of bounded and unbounded dimensions.

Page 9

Curves are constructed iteratively. Each iteration produces a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit.

Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.

Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube.

FUNDAMENTAL APPROACH:SPACE FILLING CURVES TRAVERSE N-DIMENSIONAL SPACE
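
The two-way mapping between a grid cell and its position on the curve can be sketched in a few lines. This is the widely published bit-manipulation form of the 2D Hilbert curve, not GeoWave's own implementation; the function names `xy2d`, `d2xy` and `rot` are chosen here for illustration, and `n` (cells per side) is assumed to be a power of two.

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curve has the right orientation."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def xy2d(n, x, y):
    """Map grid cell (x, y) on an n x n grid to its index along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d


def d2xy(n, d):
    """Inverse mapping: curve index d back to grid cell (x, y)."""
    t = d
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Consecutive curve values always land in neighbouring cells, which is the locality property the rest of the talk relies on.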

Page 10

Achieve optimal read performance through contiguous series of values across two or more dimensions.

Reading 11 records over a contiguous range 23->33 is faster than reading a non-contiguous set of ranges such as 15, 18, 34, 56-58, 83, 99, 101-102.

Consider: a query window defined by latitude and longitude corners (latA, lonA) -> (latB, lonB) should map to the fewest possible ranges on the space filling curve.

Haverkort and Walderveen[1] describe 3 metrics to help quantify this.

CURVE SELECTION : SEQUENTIAL IO OPTIMIZATION

• Worst Case Dilation: $\dfrac{(\text{distance between } p \text{ and } q)^2}{\text{area filled by curve between } p \text{ and } q}$

• Average Bounding Box and Worst Case Bounding Box: $\dfrac{\text{area of minimum bounding rectangle (blue)}}{\text{area filled by curve (green)}}$

Page 11

CURVE SELECTION : LOCALITY

Metric                     Z-Order  Hilbert  H-order  Peano  AR2W2  BΩ
Worst Case Dilation (L∞)   ∞        6        4        8      5.40   5.00
Worst Case Dilation (L2)   ∞        6        4        8      6.04   5.00
Worst Case Dilation (L1)   ∞        9        8        10.66  12.00  9.00
Worst Case Area            ∞        2.40     3.00     2.00   3.05   2.22
Average Box Area           2.86     1.41     1.69     1.42   1.47   1.40

[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2

Page 12

• Place a grid on the globe (dotted lines)

• Connect all the points on the grid with a Hilbert SFC.

• Curve provides linear ordering over two dimensional space.

• Bounding box is defined by the set of ranges covered by the Hilbert SFC.

HILBERT CURVE MAPPING IN 2D: THE GLOBE

Page 13

• Precision is determined by the ‘depth’ of the curve. In this example the globe is defined by a 16x16 grid.

• Resolution is 11.25 degrees latitude and 22.5 degrees longitude per cell (180/16 and 360/16).

• Each elbow (discrete point) in the Hilbert SFC maps to a grid cell.

• The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer-grained cells.

HILBERT CURVE PRECISION
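
As a quick check of how precision translates into cell size, the sketch below divides the full longitude span (360 degrees) and latitude span (180 degrees) by the number of cells per dimension. The function name and the assumption that both dimensions use the same number of bits are illustrative only.

```python
def cell_size_degrees(bits_per_dimension):
    """Return (longitude, latitude) degrees per cell for a 2^bits x 2^bits grid."""
    cells = 2 ** bits_per_dimension
    return 360.0 / cells, 180.0 / cells


# A 16x16 grid corresponds to 4 bits per dimension.
lon_deg, lat_deg = cell_size_degrees(4)
```

Each extra bit halves the cell size in both dimensions.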

Page 14

Recursively decompose the Hilbert region to find only those covered regions that overlap the query box.

The figure depicts a third-order (2^3 “buckets” per dimension) Hilbert curve in 2D.

Forms a quad-tree view over the data.

Each pair of bits, from most significant to least significant, represents a “quadrant.”

[Figure: third-order Hilbert curve; the quadrants at each level of the quad-tree are labeled 00, 01, 10, 11]

Hilbert Index (52) = 11 01 00

RECURSIVE DECOMPOSITION : TWO DIMENSION EXAMPLE
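
The quadrant reading of an index is plain bit slicing. A minimal sketch, with a hypothetical helper name, that splits a curve index into one two-bit quadrant label per level:

```python
def quadrant_path(index, order):
    """Split a Hilbert index into two-bit quadrant labels, most significant first."""
    return [
        format((index >> (2 * level)) & 0b11, "02b")
        for level in range(order - 1, -1, -1)
    ]


# Index 52 on a third-order curve is 110100 in binary.
path = quadrant_path(52, 3)  # ['11', '01', '00']
```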

Page 15

Bounding Box over grid cells (2,9) to (5,13) (lower left) to (upper right)

Decompose cells intersecting the bounding box, as shown in blue.

Range decomposes to three (color coded) ranges:

• 70 -> 75

• 92 -> 99

• 116 -> 121

Note: Bounding box from a geospatial query window does not necessarily “snap” perfectly to the grid cells. (e.g. 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells.

DECODE THE BOUNDING BOX: RANGE DECOMPOSITION
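
The expansion step in the note can be sketched as follows, assuming the query window is already expressed in fractional grid coordinates and that cell i covers the half-open interval [i, i+1). The function name is hypothetical.

```python
import math


def snap_to_cells(min_x, min_y, max_x, max_y):
    """Expand a fractional bounding box to inclusive whole-cell indices."""
    lo_x, lo_y = math.floor(min_x), math.floor(min_y)
    # A box edge that lies exactly on a cell boundary does not pull in the next cell.
    hi_x = max(math.ceil(max_x) - 1, lo_x)
    hi_y = max(math.ceil(max_y) - 1, lo_y)
    return lo_x, lo_y, hi_x, hi_y


# A window that does not line up with the grid still resolves to whole cells.
cells = snap_to_cells(2.3, 9.1, 5.6, 13.4)  # (2, 9, 5, 13)
```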

Page 16

Here we see the query range fully decomposed into the underlying “quadrants.”

Decomposition stops when the query window fully contains the quad. (See segment 3 and segment 8)

RANGE DECOMPOSITION OPTIMIZATION
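
A minimal sketch of the recursive decomposition with early termination, under stated assumptions: it uses the commonly published Hilbert mapping (`xy2d`) rather than GeoWave's code, the grid size `n` is a power of two, and the query is given as inclusive cell indices. It relies on the fact that an aligned quad of side `size` occupies one contiguous run of `size * size` curve values. Because curve orientation conventions differ, the ranges it produces need not match the numbers on the slides.

```python
def rot(n, x, y, rx, ry):
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y


def xy2d(n, x, y):
    """Hilbert index of cell (x, y) on an n x n grid."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d


def decompose(n, qx0, qy0, qx1, qy1):
    """Return merged inclusive index ranges for cells [qx0..qx1] x [qy0..qy1]."""
    ranges = []

    def visit(x0, y0, size):
        x1, y1 = x0 + size - 1, y0 + size - 1
        if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
            return  # quad lies outside the query window
        if qx0 <= x0 and x1 <= qx1 and qy0 <= y0 and y1 <= qy1:
            # Quad fully contained: stop recursing and emit one contiguous range.
            block = size * size
            start = (xy2d(n, x0, y0) // block) * block
            ranges.append((start, start + block - 1))
            return
        half = size // 2
        for dx in (0, half):
            for dy in (0, half):
                visit(x0 + dx, y0 + dy, half)

    visit(0, 0, n)
    ranges.sort()
    merged = []
    for lo, hi in ranges:
        if merged and lo == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], hi)  # glue adjacent runs together
        else:
            merged.append((lo, hi))
    return merged


# Query window over grid cells (2,9) to (5,13) on a 16x16 grid.
scan_ranges = decompose(16, 2, 9, 5, 13)
```

Each resulting range would become one range scan; fewer, longer ranges mean more sequential reads.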

Page 17

INTERVALS: POLYGONS AND MULTI-POLYGON

Duplicate entry for each intersecting hyper-rectangle over the interval.

Polygon covers 66 cells in the example

Remove duplicate data for each cell: 66 duplicates in this example.

De-duplication is applied in an Accumulo iterator as well as client-side.

Query is defined by a range per dimension (a bounding rectangle in 2D).
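
The client-side half of de-duplication can be sketched as a filter that remembers which feature IDs it has already emitted. The tuple layout `(row_id, feature_id, feature)` is an assumption made for this example, not GeoWave's actual result type.

```python
def deduplicate(scan_results):
    """Yield each feature once, even if it was stored under many row IDs."""
    seen = set()
    for row_id, feature_id, feature in scan_results:
        if feature_id in seen:
            continue  # same polygon, returned again from another cell
        seen.add(feature_id)
        yield feature_id, feature


# One polygon indexed under three cells, plus one point.
results = [
    (70, "poly-1", "POLYGON(...)"),
    (71, "poly-1", "POLYGON(...)"),
    (72, "pt-9", "POINT(...)"),
    (92, "poly-1", "POLYGON(...)"),
]
unique = list(deduplicate(results))
```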

Page 18

INTERVALS: POLYGONS AND MULTI-POLYGONS

High-resolution curves force an excessive number of duplicates for large intervals.

Consider a high-resolution 2D curve (2^31 x 2^31) and a large polygon such as the Pacific Ocean. The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries.

Solution: Tiered Indexing[8]

• Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each successive tier doubles the resolution in each dimension.

• Polygons are stored in the lowest tier possible that minimizes the number of duplicates.

• Example: Blue polygon indexed in tier 2; Red polygon indexed in tier 3.
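
The duplicate count quoted for the Pacific Ocean can be checked with rough arithmetic, taking the curve resolution as 2^31 cells per dimension and assuming one stored entry per covered cell:

```latex
2^{31} \times 2^{31} = 2^{62} \approx 4.6 \times 10^{18} \text{ cells},
\qquad
0.33 \times 4.6 \times 10^{18} \approx 1.5 \times 10^{18} \text{ entries}
```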

Page 19

TIERS: QUERY REGIONS WITH FALSE POSITIVES

Balance between an acceptable amount of duplicates and false positives due to lower granularity of higher tiers.

Consider a query region in orange. It does not intersect either polygon. However, it does intersect shared quadrants at the respective tiers for both shapes. Thus, more rows are filtered during range scan.

Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g. 2^10), hundreds of duplicates occur.

Page 20

TIERS: WORST CASE

Cap the amount of duplicates by choosing an appropriate tier.

Our analysis indicates that an optimal number of duplicates is represented by 2^d, where d is the number of dimensions (i.e. in 2 dimensions, cap at 4).

Consider the worst case, a small square polygon centered on the inner intersecting boundary (example polygon in red).

Regardless of size, there are always four duplicates at all tiers except at the 2^0 tier (the orange box, representing the entire world).
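
One way to picture tier selection is the sketch below. It is an assumed heuristic for illustration, not GeoWave's actual policy: count the cells a bounding box touches at each tier and pick the finest tier whose count stays within the 2^d cap. Coordinates are assumed to be normalized to [0, 1), and the function names are hypothetical.

```python
import math


def cells_covered(bbox, tier):
    """Number of cells a bbox touches on the 2^tier x 2^tier grid."""
    n = 2 ** tier
    min_x, min_y, max_x, max_y = bbox
    cols = min(n - 1, math.floor(max_x * n)) - math.floor(min_x * n) + 1
    rows = min(n - 1, math.floor(max_y * n)) - math.floor(min_y * n) + 1
    return cols * rows


def select_tier(bbox, max_tier, dimensions=2):
    """Finest tier at which the bbox needs at most 2^dimensions entries."""
    cap = 2 ** dimensions
    for tier in range(max_tier, -1, -1):
        if cells_covered(bbox, tier) <= cap:
            return tier
    return 0


# Worst case: a tiny square straddling the centre of the world.
tiny = (0.499999, 0.499999, 0.500001, 0.500001)
duplicates_per_tier = [cells_covered(tiny, t) for t in range(0, 11)]
```

For the tiny square, every tier from 1 to 10 needs four entries and only tier 0 needs one, which is why a cap below 2^d cannot be guaranteed.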

Page 21

UNBOUNDED DIMENSION: TIME

To normalize real-world values to fit on a space filling curve, the sample space must be bounded.

Solution: Binning

• A bin represents a period for EACH dimension. For example, a periodicity of a year can be used for time.

• Each bin covers its own Hilbert space.

• Entries that contain ranges may span multiple bins resulting in duplicates.

• The Bin ID is part of row identifier.

1997 1998 1999

A single bin for an unbounded dimension :

[min + (period * period duration), min + ((period+1) * period duration))
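
The bin formula above translates directly into code. In this sketch the function name and the returned tuple are illustrative; `minimum` and `period_duration` correspond to "min" and "period duration" in the formula, and the normalized offset is what would be placed on the bin's own curve.

```python
def bin_for(value, minimum, period_duration):
    """Return (period, (start, end), normalized offset) for an unbounded dimension."""
    period = int((value - minimum) // period_duration)
    start = minimum + period * period_duration
    end = minimum + (period + 1) * period_duration  # half-open: [start, end)
    normalized = (value - start) / period_duration
    return period, (start, end), normalized


# Yearly bins starting at 1997: mid-1998 falls halfway through bin 1.
period, bounds, offset = bin_for(1998.5, 1997, 1)
```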

Page 22

BIN: VARIABILITY OVER DIMENSIONS

Time

Elevation

Velocity

Each Bin is a hyper-rectangle representing ranges of data labeled by points on a Hilbert curve.

Bounded dimensions assume a single Bin. For example, Latitude and Longitude.

Page 23

THAT’S ENOUGH THEORY, LET’S APPLY IT

ACCUMULO TECHNIQUES YOU MIGHT FIND INTERESTING

Page 24

VECTOR DATA PERSISTENCE MODEL

[Diagram labels: SFC Curve Hierarchy; Feature Type; Feature ID; Hint to Dedupe Filter; Feature Attribute Name; Visibility (From Field Visibility Handlers)]

Column per feature identifier.

Column per each feature attribute.

Types include: Geometry, Integer, Double, BigDecimal, DateTime, String, Boolean, etc.

Page 25

MAP OCCLUSION CULLING

At a given zoom level, each pixel signifies a range in degrees.

When scanning the data, only one entry is needed within each pixel range. The rest of the entries can be skipped.

The block identified in red represents many data points, but is rendered by only 9 pixels.

Page 26

MAP OCCLUSION CULLING: ITERATORS

The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel.

[Figure: “Database Data” and “Displayed Pixels” panels with numbered steps 1 to 4. Labels: Scan to the first pixel; Seek to the beginning of the next pixel; The rendering engine received only these points; Points that were all skipped.]
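
The scan-then-seek pattern can be simulated on a sorted list. This is a one-dimensional simplification with hypothetical names: real keys would be SFC values and the seek would happen inside a server-side iterator, but the skipping logic is the same idea.

```python
import bisect
import math


def cull(sorted_keys, pixel_width):
    """Keep the first entry in each pixel, seeking past the rest."""
    kept = []
    i = 0
    while i < len(sorted_keys):
        key = sorted_keys[i]
        kept.append(key)  # first hit inside this pixel is enough to draw it
        next_pixel_start = (math.floor(key / pixel_width) + 1) * pixel_width
        # Seek: jump straight to the first key at or beyond the next pixel.
        i = bisect.bisect_left(sorted_keys, next_pixel_start, i + 1)
    return kept


points = [0.1, 0.2, 0.3, 1.5, 1.7, 3.2]
rendered = cull(points, 1.0)  # [0.1, 1.5, 3.2]
```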

Page 27

DISTRIBUTED RENDERING

[Diagram labels: Map Request; Map Response; GeoServer (GeoWave Plugin); Layer; Style; Accumulo (GeoWave Iterators); Rendered Map]

Each scan result is an image with the data in the range.

All resultant images are composited together.

Page 28

DISTRIBUTED RENDERING WITH OCCLUSION CULLING

Page 29

RASTER DATA PERSISTENCE MODEL

[Diagram labels: SFC Curve Hierarchy; Coverage Name; Image Data Buffer + Image Metadata]

SFC Value is effectively a Tile ID.

Tiles are unique, ignore duplication.

Unique name for global coverage.

Image Metadata is customizable. Default is to store “no data” values, but can be customized.

Page 30

RASTER DATA: GRID COVERAGE

Tiled, each “cell” fit to boundary.

“No Data” values must be maintained.

Multi-band, more than just RGB.

Page 31

RASTER DATA: ADVANCED OPTIONS

Histogram Equalization [10]

Image Pyramid [11]

Tile Merge Strategy: $f(f(t_1, t_2), t_3) = t_{final}$

[Diagram labels: t_n; Image Data Buffer; Coverage Name; -1 Coverage Name; Metadata; Value]

Custom data per tile, in scope for f(x)
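
The merge strategy is a left fold of a two-tile function over the overlapping tiles. A minimal sketch, assuming tiles are flat lists of pixel values and using an overlay rule (newer pixel wins unless it is "no data") chosen purely for illustration; the actual merge function is pluggable.

```python
from functools import reduce

NO_DATA = None  # stand-in for the stored "no data" value


def merge_tiles(lower, upper):
    """f(lower, upper): take the upper pixel unless it is no-data."""
    return [u if u is not NO_DATA else l for l, u in zip(lower, upper)]


def merge_all(tiles):
    """Apply f(f(t1, t2), t3)... across all tiles for one tile ID."""
    return reduce(merge_tiles, tiles)


t1 = [1, NO_DATA, 3]
t2 = [NO_DATA, 5, NO_DATA]
t3 = [7, NO_DATA, NO_DATA]
final = merge_all([t1, t2, t3])  # [7, 5, 3]
```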

Page 32

STATISTICS: STRUCTURE

Statistics infrastructure supports summary data.

Currently, each row ID includes an adapter ID and a statistics ID.

Current statistics types include population bounding boxes, counts and ranges.

[Diagram labels: Key (Row ID; Column Family, Qualifier, Visibility; Time); Value; Statistic ID: Attribute Name & Statistic Type; Adapter ID; Family: “STATS”; Visibility: Matches represented data]

Page 33

STATISTICS: COMBINER

Statistic ID  Value  Adapter ID  Family   Visibility  Time
“Count”       30     0xA43E      “STATS”  A&B         2
“Count”       60     0xA43E      “STATS”  A&C         4
“Count”       20     0xA43E      “STATS”  A&B         7
“Count”       50     0xA43E      “STATS”  A&B         9

MERGE

• BBOX: Grow Envelope to Minimum and Maximum corners.

• RANGE: Minimum and Maximum

• HISTOGRAM: Update bins from coverage over raster image
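
The merge rules can be sketched as small pure functions, the shape a combiner needs. The function names and the row layout `(statistic_id, adapter_id, visibility, value)` are assumptions for this example; summing counts is assumed from the "Count" rows rather than stated on the slide.

```python
def merge_count(a, b):
    return a + b


def merge_range(a, b):
    """Ranges are (minimum, maximum)."""
    return min(a[0], b[0]), max(a[1], b[1])


def merge_bbox(a, b):
    """Envelopes are (min_x, min_y, max_x, max_y); grow to cover both."""
    return min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])


def combine(rows, merge):
    """Merge values of rows that share statistic ID, adapter ID and visibility."""
    merged = {}
    for statistic_id, adapter_id, visibility, value in rows:
        key = (statistic_id, adapter_id, visibility)
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged


rows = [
    ("Count", "0xA43E", "A&B", 30),
    ("Count", "0xA43E", "A&C", 60),
    ("Count", "0xA43E", "A&B", 20),
]
totals = combine(rows, merge_count)
```

Rows with different visibility stay separate at this stage, since a combiner only merges entries that share the same key.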

Page 34

STATISTICS: TRANSFORMATION ITERATOR

Statistic ID  Value  Adapter ID  Family   Visibility  Time
“Count”       50     0xA43E      “STATS”  A&B         9
“Count”       60     0xA43E      “STATS”  A&C         4
“Count”       110    0xA43E      “STATS”  A&B&C       9

MERGE

Query authorization may authorize multiple rows.

Query with authorization A, B & C
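
A sketch of the query-time merge, under simplifying assumptions: visibility expressions are plain AND-lists such as "A&B", counts are summed, and the merged row carries the union of the contributing labels. The function names are hypothetical and this is not the iterator's real interface.

```python
def visible(visibility, authorizations):
    """An AND-expression is visible when every label is authorized."""
    return set(visibility.split("&")) <= set(authorizations)


def merge_authorized(rows, authorizations):
    """Merge all count rows the query is allowed to see, per statistic and adapter."""
    merged = {}
    for statistic_id, adapter_id, visibility, value in rows:
        if not visible(visibility, authorizations):
            continue
        key = (statistic_id, adapter_id)
        labels, total = merged.get(key, (set(), 0))
        merged[key] = (labels | set(visibility.split("&")), total + value)
    return [
        (stat, adapter, "&".join(sorted(labels)), total)
        for (stat, adapter), (labels, total) in merged.items()
    ]


rows = [
    ("Count", "0xA43E", "A&B", 50),
    ("Count", "0xA43E", "A&C", 60),
]
full_view = merge_authorized(rows, {"A", "B", "C"})
partial_view = merge_authorized(rows, {"A", "B"})
```

A user holding A, B and C sees a single merged count, while a user holding only A and B sees just the row they are cleared for.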

Page 35

WFS-T[12] TRANSACTIONS: ISOLATION

• Problem: Isolation of updates and new records until commit.

• Solution:

– Use a managed set of transaction identifiers as authorization tags. A single transaction places an authorization tag in all new entries.

– Upon commit, the authorization tag is removed using a transforming iterator.

Role1, role2, tx123

Role1, role2

Commit
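
The isolation scheme can be mimicked with an in-memory table. This is a toy model with hypothetical names: each entry carries a set of labels that must all be authorized, the transaction ID is one of those labels until commit, and the commit step stands in for the transforming iterator.

```python
class ToyTable:
    def __init__(self):
        self.entries = []  # each entry: {"id": ..., "labels": set of required labels}

    def insert(self, feature_id, roles, tx_id):
        """New entries are tagged with the transaction ID in addition to their roles."""
        self.entries.append({"id": feature_id, "labels": set(roles) | {tx_id}})

    def scan(self, authorizations):
        """Return IDs of entries whose labels are all covered by the authorizations."""
        auths = set(authorizations)
        return [e["id"] for e in self.entries if e["labels"] <= auths]

    def commit(self, tx_id):
        """Strip the transaction tag so the entries become generally visible."""
        for entry in self.entries:
            entry["labels"].discard(tx_id)


table = ToyTable()
table.insert("f1", ["role1", "role2"], "tx123")
before_others = table.scan(["role1", "role2"])           # other readers see nothing
before_owner = table.scan(["role1", "role2", "tx123"])   # the transaction sees its own write
table.commit("tx123")
after_others = table.scan(["role1", "role2"])
```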

Page 36

SO WHAT?

EYE-CANDY YOU’VE BEEN WAITING FOR

Page 37

Microsoft GeoLife

Microsoft Research has made available a trajectory data set that contains the GPS coordinates of 182 users over a period of more than five years (April 2007 to August 2012).

There are 17,621 trajectories in this data set with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours recorded by GPS loggers and GPS phones often sampling every 1-5 seconds or every 5-10 meters.

http://research.microsoft.com/jump/131675

Page 38

GeoLife – Just the tracks

Page 39

Let’s bring out some detail: Kernel Density Estimate (Gaussian Kernel)
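
For readers unfamiliar with the technique, a Gaussian kernel density estimate sums a bell-shaped bump around every point and evaluates the total on a grid. The sketch below is a plain, unoptimized version with assumed names and an assumed single `bandwidth` parameter; it is not the distributed analytic used to produce the images.

```python
import math


def gaussian_kde(points, grid_x, grid_y, bandwidth):
    """Return density[row][col] evaluated at every (grid_x[col], grid_y[row])."""
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(points))
    density = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * bandwidth ** 2))
                for px, py in points
            )
            row.append(norm * total)
        density.append(row)
    return density


# Three points clustered near the origin and one far away.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
surface = gaussian_kde(pts, [0.0, 5.0], [0.0, 5.0], bandwidth=0.5)
```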

Page 40

Let’s zoom in a bit

Page 41

Density estimate again

Page 42

OSM – Planet GPX dump

Every track ever uploaded to OpenStreetMap

Complete data attribution

2.9 Billion spatial entities (points)

https://blog.openstreetmap.org/2013/04/12/bulk-gpx-track-data/

Page 43

Level 0 Overview (all the points!)

Page 44

Let’s go deeper..

Page 45
Page 46

Let’s bring out some detail again: Kernel Density Estimate (Gaussian Kernel)

Page 47

Let’s zoom a bit – and try some different styling options

Page 48

Questions?

Page 49

BIBLIOGRAPHY

[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2

[2] Hamilton, Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. 2008. Information Processing Letters 105 (155-163)

[3] Hayes. Crinkly Curves. 2013. American Scientist 100-3 (178). DOI: 10.1511/2013.102.1

[4] Skilling. Programming the Hilbert Curve. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings. 2004. American Institute of Physics 0-7354-0182-9/04

[5] Wikipedia. Well-known_binary. http://en.Wikipedia.org/wiki/Well-known_binary 2013

[6] Wikipedia. Hilbert curve. http://en.wikipedia.org/wiki/Hilbert_curve 2013

[7] Aioanei. Uzaygezen: Compact Hilbert Index implementation in Java. http://code.google.com/p/uzaygezen/ 2008 Google Inc.

[8] Surratt, Boyd, Russelavage. Z-Value Curve Index Evaluation. 2012. Internal Presentation.

[9] Open Geospatial Consortium. Standard List. http://www.opengeospatial.org/standards/is

[10] Remote Sensed Image Processing on Grids for Training in Earth Observation. http://www.intechopen.com/source/html/6674/media/image3.jpeg

[11] OSGeo Wiki. http://wiki.osgeo.org/images/thumb/d/d0/Pyramid.jpg/286px-Pyramid.jpg

[12] WFS-T. http://www.opengeospatial.org/standards/wfs