HGrid: A Data Model for Large Geospatial Data Sets in HBase

Dan Han and Eleni Stroulia
University of Alberta
http://ssrg.cs.ualberta.ca

104/12/23 Cloud 2013


The General Research Problem
The Geospatial Problem Instance
▪ The Data Set
▪ HBase data-organization alternatives
▪ Performance analysis
Some Lessons Learned


Appropriate data models for time-series (MESOCA 2012) and geospatial (CLOUD 2013) applications

In progress: spatio-temporal applications


[1] built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.

[2] presented a novel key-formulation scheme, based on the R+-tree, for spatial indexing in HBase.

Both focus on row-key design; neither discusses column or version design.


[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26


Two Synthetic Datasets
▪ Uniform and Zipf distributions
▪ Based on the Bixi dataset; each object includes
  ▪ station ID
  ▪ latitude, longitude, station name, terminal name
  ▪ number of docks
  ▪ number of bikes
▪ 100 million objects (70 GB) in a 100 km × 100 km simulated space


Regular Grid Indexing
▪ Row key: grid row ID
▪ Column: grid column ID
▪ Version: counter of objects
▪ Value: one object in JSON format
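A minimal sketch of how a point could be mapped to its regular-grid row key and column name. The slides do not specify the origin, the cell size, or the ID encoding, so `cell_km`, `space_km`, `width`, and the zero-padded decimal IDs below are assumptions chosen for illustration.

```python
def grid_cell(x_km, y_km, cell_km=25.0, space_km=100.0, width=2):
    """Map a point to (row_id, column_id) strings in a regular grid.

    Assumptions (not from the slides): the origin is the lower-left
    corner, y selects the grid row, x selects the grid column, and IDs
    are zero-padded so that they sort correctly as byte strings.
    """
    if not (0 <= x_km <= space_km and 0 <= y_km <= space_km):
        raise ValueError("point lies outside the simulated space")
    cells_per_side = int(space_km // cell_km)
    # Clamp so that points on the upper boundary fall in the last cell.
    row = min(int(y_km // cell_km), cells_per_side - 1)
    col = min(int(x_km // cell_km), cells_per_side - 1)
    return f"{row:0{width}d}", f"{col:0{width}d}"


print(grid_cell(60.0, 30.0))  # ('01', '02')
```

With a 25 km cell, the point (60, 30) lands in grid row 01 and grid column 02, which matches the 00 to 03 indexes of a 4 × 4 grid.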

[Figure: regular-grid table layout, with row IDs 00 to 03, column IDs 00 to 03, and the object counter as the version dimension]

Trie-based Quad-Tree Indexing
▪ Z-value linearization
▪ Row key: Z-value
▪ Column: object ID
▪ Value: one object in JSON format
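Z-value linearization can be sketched as bit interleaving of a cell's row and column indexes. The bit order (row bits in the odd positions) and the binary-string encoding are assumptions; the slides do not state which convention is used.

```python
def z_value(row, col, depth):
    """Interleave the bits of row and col into a Z-order (Morton) code.

    `depth` is the quad-tree depth, so row and col must be < 2**depth.
    Returns the code as a zero-padded binary string, 2 bits per level.
    """
    if not (0 <= row < 2 ** depth and 0 <= col < 2 ** depth):
        raise ValueError("row/col out of range for this depth")
    code = 0
    for bit in range(depth):
        code |= ((col >> bit) & 1) << (2 * bit)      # col bits: even positions
        code |= ((row >> bit) & 1) << (2 * bit + 1)  # row bits: odd positions
    return format(code, f"0{2 * depth}b")


print(z_value(2, 3, 2))  # 1101
print(z_value(0, 1, 2))  # 0001
print(z_value(0, 2, 2))  # 0100
```

The last two calls show two horizontally adjacent cells whose codes (1 and 4) are not consecutive, which is one way Z-ordering can break data locality.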

[Figure: quad-tree table layout, with the Z-value as row key and the object ID as column]

Quad-Tree data model
▪ More rows with a deeper tree
▪ Z-ordering linearization (violates data locality)
▪ In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation

Regular Grid data model
▪ Very easy to locate a cell by row ID and column ID
▪ Cannot handle a large space with a fine-grained grid because in-memory indexes are subject to memory constraints


How much unrelated data is examined in a query matters a lot!

[Figure: HGrid table layout. Row key: QTId-RowId; column: ColumnId-ObjectId; third dimension: object attributes. The space is divided into quad-tree tiles 00, 01, 10, 11 (A to D), each subdivided by regular-grid indexes 01 to 03.]


The row key is the QT Z-value + the RG row index.

The column name is the RG column and the object ID.

The attributes of the data point are stored in the third dimension.
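The three-part layout above can be sketched as a function that builds the coordinates of one HGrid cell. The "-" separator follows the figure labels (QTId-RowId, ColumnId-ObjectId); the two-digit padding, the function name `hgrid_cell`, and the use of consecutive version numbers for attributes are assumptions made for illustration.

```python
def hgrid_cell(qt_z, rg_row, rg_col, obj_id, attributes):
    """Return (row_key, column_name, versions) for one data point.

    qt_z       -- Z-value of the quad-tree tile, already a string
    rg_row/col -- regular-grid indexes inside that tile
    attributes -- ordered list of attribute values of the object
    """
    row_key = f"{qt_z}-{rg_row:02d}"    # QT Z-value + RG row index
    column = f"{rg_col:02d}-{obj_id}"   # RG column + object ID
    # Third dimension: one version per attribute (assumed numbering 1..n).
    versions = {v: str(value) for v, value in enumerate(attributes, start=1)}
    return row_key, column, versions


row_key, column, versions = hgrid_cell("09", 4, 2, "s17", ["Station A", 15, 7])
print(row_key, column, versions)
```

Here a hypothetical station `s17` in tile 09, grid row 4, grid column 2 is stored under row key `09-04` and column `02-s17`, with its three attributes in versions 1 to 3.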

1. Compute the minimum bounding square from the query's input location and range
2. Compute the quad-tree tiles that overlap the bounding square (their Z-codes)
3. Compute the indexes of all regular-grid cells in these quad-tree tiles (the secondary index of rows and columns)
4. Issue one sub-query for each selected quad-tree tile; process it with user-level coprocessors on the HBase regions
5. Collect the results of the sub-queries on the client side
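Steps 1 to 3 can be sketched as a planning function that turns a range query into one cell window per quad-tree tile. This is an illustration under stated assumptions, not the authors' implementation: tiles are identified by (row, col) pairs rather than Z-codes, the space is square with its origin at (0, 0), and the names `plan_range_query`, `tile`, `cell`, and `space` are hypothetical.

```python
def plan_range_query(x, y, radius, tile=10.0, cell=1.0, space=100.0):
    """Return {(tile_row, tile_col): (row_lo, row_hi, col_lo, col_hi)}.

    Each value is the inclusive window of regular-grid cells, indexed
    relative to the tile, that the bounding square touches.
    """
    # Step 1: minimum bounding square, clipped to the space.
    x_lo, x_hi = max(0.0, x - radius), min(space, x + radius)
    y_lo, y_hi = max(0.0, y - radius), min(space, y + radius)

    def span(lo, hi, size, limit):
        """Inclusive index range of size-wide intervals touching [lo, hi]."""
        last = int(limit // size) - 1
        return min(int(lo // size), last), min(int(hi // size), last)

    # Step 2: quad-tree tiles overlapping the square.
    t_col_lo, t_col_hi = span(x_lo, x_hi, tile, space)
    t_row_lo, t_row_hi = span(y_lo, y_hi, tile, space)

    plan = {}
    for t_row in range(t_row_lo, t_row_hi + 1):
        for t_col in range(t_col_lo, t_col_hi + 1):
            # Step 3: clip the square to this tile, then index its cells
            # relative to the tile's own origin.
            ox, oy = t_col * tile, t_row * tile
            cx_lo, cx_hi = max(x_lo, ox) - ox, min(x_hi, ox + tile) - ox
            cy_lo, cy_hi = max(y_lo, oy) - oy, min(y_hi, oy + tile) - oy
            col_lo, col_hi = span(cx_lo, cx_hi, cell, tile)
            row_lo, row_hi = span(cy_lo, cy_hi, cell, tile)
            # Step 4 would issue one sub-query per entry of this plan.
            plan[(t_row, t_col)] = (row_lo, row_hi, col_lo, col_hi)
    return plan


print(plan_range_query(9.5, 5.5, 1.0))
```

A query centered at (9.5, 5.5) with radius 1 straddles two tiles, so the plan holds two entries and would lead to two sub-queries. Candidates returned by the sub-queries still need an exact distance check, since the square covers more area than the query circle.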

[Figure: query region over regular-grid cells indexed 00 to 06 on both axes; labels 09-00 and 09-04]

1. Estimate the search range (density-based range estimation)
2. Compute the indexes of rows and columns (steps 2 and 3 of the range query)
3. Issue a scan query to retrieve the relevant data points
4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3
5. Sort the result set by increasing distance from the input location
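The iterative procedure above can be sketched in memory as follows. The slides do not give the estimation formula, so the initial radius (the circle expected to hold K points at a given density) and the doubling rule for re-estimation are assumptions; the list comprehension stands in for the HBase scan of steps 2 and 3.

```python
import math


def knn(points, x, y, k, density, max_rounds=20):
    """Return the IDs of the k points nearest to (x, y).

    points  -- list of (obj_id, px, py) tuples
    density -- assumed average number of objects per unit area
    """
    if k <= 0 or not points:
        return []
    # Step 1: radius of the circle expected to contain k points.
    radius = math.sqrt(k / (math.pi * density))
    hits = []
    for _ in range(max_rounds):
        # Steps 2-3 (stand-in for the scan): keep points inside the circle.
        hits = [(math.hypot(px - x, py - y), oid) for oid, px, py in points]
        hits = [h for h in hits if h[0] <= radius]
        if len(hits) >= min(k, len(points)):
            break
        radius *= 2  # Step 4: enlarge the search range and retry.
    # Step 5: sort by increasing distance and keep the k nearest.
    hits.sort()
    return [oid for _, oid in hits[:k]]


stations = [("a", 0, 0), ("b", 3, 4), ("c", 10, 0), ("d", 1, 1)]
print(knn(stations, 0, 0, 2, density=10))  # ['a', 'd']
```

Filtering on the circle rather than on the bounding square matters: a point in a corner of the square may be farther away than a point just outside it, so only points within the radius are guaranteed to be the true nearest neighbors.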


Experiment Environment
▪ A four-node cluster on virtual machines with Ubuntu on OpenStack
▪ Hadoop 1.0.2 (replication factor is 2), HBase 0.94
▪ HBase configuration
  ▪ 5K caching size
  ▪ Block cache is true
  ▪ ROWCOL Bloom filter

Query-Processing Implementation
▪ Native Java API
▪ User-level coprocessor implementation

The granularity of the grid affects query-processing performance

Explore the “best” cell configuration of each model
▪ Quad-tree => (t = 1)
▪ RG => (t = 0.1)
▪ HGrid => (T = 10, t = 0.1)


HG:≈10:0.1 (fewer sub-queries, more false positives)
HG:≈1:0.1 (more sub-queries, fewer false positives)

HG:≈10:0.01 (more rows)
HG:≈10:0.1 (fewer rows)


Range query: given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location.


kNN query: given the coordinates of a location, return the K points nearest to the location.


Data Organization
▪ Keep row keys and column names short
▪ Better to have one column family and few columns
▪ Avoid storing a large amount of data in one row
▪ Row-key design should ease the pruning of unrelated data
▪ The third dimension can store data as well
▪ Bloom filters should be configured to prune rows and columns
▪ Compression can reduce the amount of data transmitted


Query Processing
▪ The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries
▪ “Scan” is better than “Get” for retrieving discontinuous keys, even though unrelated data is scanned as well
▪ Use “Scan” for small queries and coprocessors for large queries
▪ Better to split one large query into multiple sub-queries than to use one query with the row-filter mechanism
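The first lesson can be sketched as a helper that cuts a sorted list of row keys into sub-scans no larger than the scan cache. The function name `split_scan` and the inclusive (first, last) pairs are assumptions for illustration; an actual HBase scan takes an exclusive stop row, so the last key would need to be adjusted before issuing the scan.

```python
def split_scan(row_keys, cache_size):
    """Split sorted row keys into (first_key, last_key) sub-scan ranges.

    Each range covers at most `cache_size` rows, so one sub-scan never
    asks for more rows than the scanner cache can hold.
    """
    if cache_size <= 0:
        raise ValueError("cache_size must be positive")
    ranges = []
    for start in range(0, len(row_keys), cache_size):
        chunk = row_keys[start:start + cache_size]
        ranges.append((chunk[0], chunk[-1]))
    return ranges


keys = ["09-00", "09-01", "09-02", "09-03", "09-04"]
print(split_scan(keys, 2))
# [('09-00', '09-01'), ('09-02', '09-03'), ('09-04', '09-04')]
```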


HGrid
▪ Benefits from the good locality of the RG index; suffers from the poor locality of the Z-ordering QT linearization
  ▪ Performance could be improved with other linearization techniques
▪ Can be flexibly configured and extended
  ▪ The QT index can be replaced by the hash code of each sub-space
  ▪ The granularity in the second stage can be varied from sub-space to sub-space based on the various densities
▪ Is more suitable for homogeneously covered and discontinuous spaces


A data model for spatio-temporal datasets

Towards general, systematic guidance for column-family and other NoSQL databases

Applying the data model to cloud-based applications and big-data analytics systems
