Dan Han and Eleni Stroulia University of Alberta [email protected] http://ssrg.cs.ualberta.ca 1 06/26/22 Cloud 2013

HGrid A Data Model for Large Geospatial Data Sets in HBase

May 11, 2015

Page 1: HGrid A Data Model for Large Geospatial Data Sets in HBase

Page 3: HGrid A Data Model for Large Geospatial Data Sets in HBase

The General Research Problem
The Geospatial Problem Instance
▪ The Data Set
▪ HBase data-organization alternatives
▪ Performance analysis
Some Lessons Learned

Page 7: HGrid A Data Model for Large Geospatial Data Sets in HBase

Appropriate data models for
▪ time-series applications (MESOCA 2012)
▪ geospatial applications (CLOUD 2013)

In progress: spatio-temporal applications

Page 10: HGrid A Data Model for Large Geospatial Data Sets in HBase

[1] built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.

[2] presented a novel key-formulation scheme, based on the R+-tree, for spatial indexing in HBase.

Both focus on row-key design, with no discussion of column and version design.

[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26

Page 11: HGrid A Data Model for Large Geospatial Data Sets in HBase

Two synthetic datasets: uniform and Zipf distribution
Based on the Bixi dataset; each object includes
▪ station ID
▪ latitude, longitude, station name, terminal name
▪ number of docks
▪ number of bikes

100 million objects (70 GB) in a 100 km × 100 km simulated space

Page 12: HGrid A Data Model for Large Geospatial Data Sets in HBase

Regular Grid Indexing
▪ Row key: grid row ID
▪ Column: grid column ID
▪ Version: counter of objects
▪ Value: one object in JSON format

[Figure: regular-grid table with axes Row ID, Column ID, and Counter; row and column IDs 00 to 03]
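As a rough illustration of the regular-grid model, the sketch below maps a point to a grid row ID and column ID. The helper name `grid_cell`, the use of metres, the 100 m cell size, and the zero-padded ID width are all assumptions made for this example; the slides do not specify them.

```python
def grid_cell(x_m, y_m, cell_m=100, width=4):
    """Map a point (in metres) to zero-padded (row ID, column ID) strings.

    Hypothetical helper: cell_m and width are illustrative assumptions,
    not values taken from the slides.
    """
    if x_m < 0 or y_m < 0:
        raise ValueError("coordinates must be non-negative")
    row = int(y_m // cell_m)  # grid row index comes from the y coordinate
    col = int(x_m // cell_m)  # grid column index comes from the x coordinate
    return f"{row:0{width}d}", f"{col:0{width}d}"


row_id, col_id = grid_cell(250, 1030)
print(row_id, col_id)
```

Zero-padding keeps the IDs in the same order when they are compared as strings, which matters because HBase sorts row keys lexicographically.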

Page 13: HGrid A Data Model for Large Geospatial Data Sets in HBase

Trie-based quad-tree indexing with Z-value linearization
▪ Row key: Z-value
▪ Column: object ID
▪ Value: one object in JSON format

[Figure: table layout with Z-value row keys and object-ID columns]
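A minimal sketch of Z-value linearization: the bits of a tile's row and column indexes are interleaved into a single code, which can then serve as a row key. The bit order (row bit above column bit) and the binary-string key format are assumptions for illustration, not details confirmed by the slides.

```python
def z_value(row, col, bits):
    """Interleave the low `bits` bits of row and col into a Z-order code."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)      # column bit goes to the even position
        z |= ((row >> i) & 1) << (2 * i + 1)  # row bit goes to the odd position
    return z


def z_key(row, col, bits):
    """Render the Z-value as a fixed-width binary string (assumed key format)."""
    return format(z_value(row, col, bits), f"0{2 * bits}b")


# One quad-tree level splits the space into four tiles.
print([z_key(r, c, 1) for r in (0, 1) for c in (0, 1)])
```

With one level, the four tiles receive the keys 00, 01, 10, and 11; each extra level appends two more bits, so all descendants of a tile share its key as a prefix.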

Page 14: HGrid A Data Model for Large Geospatial Data Sets in HBase

Quad-Tree data model
▪ More rows with a deeper tree
▪ Z-ordering linearization (violates data locality)
▪ In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation

Regular Grid data model
▪ Very easy to locate a cell by row ID and column ID
▪ Cannot handle a large space with a fine-grained grid, because in-memory indexes are subject to memory constraints

How much unrelated data is examined in a query matters a lot!

Page 15: HGrid A Data Model for Large Geospatial Data Sets in HBase

[Figure: HGrid layout over the space. Axes: QTId-RowId, ColumnId-ObjectId, Object Attribute. Quad-tree tiles 00, 01, 10, 11; regular-grid indexes 01 to 03; objects labeled A to D.]

Page 16: HGrid A Data Model for Large Geospatial Data Sets in HBase

The row key is the QT Z-value + the RG row index.

The column name is the RG column and the object ID.

The attributes of the data point are stored in the third dimension.
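The addressing scheme described above can be sketched with a plain dictionary standing in for the HBase table. The hyphen separators, the two-digit padding, and the function names are assumptions for illustration; the list of attributes stands in for the third dimension.

```python
def hgrid_address(z_key, rg_row, rg_col, object_id, width=2):
    """Build the (row key, column name) pair for one object.

    Assumed format: "<QT Z-value>-<RG row>" and "<RG column>-<object ID>".
    """
    row_key = f"{z_key}-{rg_row:0{width}d}"
    column = f"{rg_col:0{width}d}-{object_id}"
    return row_key, column


def put_object(table, z_key, rg_row, rg_col, object_id, attributes):
    """Store an object's attributes under its HGrid address (in-memory stand-in)."""
    row_key, column = hgrid_address(z_key, rg_row, rg_col, object_id)
    # The attribute list plays the role of the third dimension of the cell.
    table.setdefault(row_key, {})[column] = list(attributes)
    return row_key, column


table = {}
put_object(table, "01", 3, 2, "s42", ["Station A", 15, 7])
print(table)
```

Putting the Z-value first groups all rows of one quad-tree tile together, while the regular-grid row index orders the rows inside that tile.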

Page 17: HGrid A Data Model for Large Geospatial Data Sets in HBase

1. Compute the minimum bounding square based on the query input location and the range

2. Compute the quad-tree tiles that overlap with the bounding square (Z-codes)

3. Compute the indexes of all regular-grid cells in these quad-tree tiles (the secondary index of rows and columns)

4. Issue one sub-query for each selected tile of the quad-tree; process with user-level coprocessors on the HBase regions

5. Collect the results of the sub-queries at the client side
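Steps 1 to 3 above can be sketched as a planning function that returns, for each overlapping quad-tree tile, the range of regular-grid rows and columns to read. The units (metres), the tile and cell sizes, the space size, and the Z-code bit order are assumptions for this example; steps 4 and 5 (issuing the sub-queries and merging the results) are left out.

```python
def z_key(row, col, bits):
    """Z-order code of a tile as a fixed-width binary string (assumed bit order)."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return format(z, f"0{2 * bits}b")


def plan_range_query(x, y, radius, tile=10_000, cell=100, space=100_000, bits=4):
    """Return {tile Z-code: (row_lo, row_hi, col_lo, col_hi)} for one range query."""
    # Step 1: minimum bounding square, clipped to the simulated space.
    x_lo, x_hi = max(x - radius, 0), min(x + radius, space)
    y_lo, y_hi = max(y - radius, 0), min(y + radius, space)
    last_tile = space // tile - 1
    last_cell = tile // cell - 1
    plan = {}
    # Step 2: every quad-tree tile that overlaps the bounding square.
    for t_row in range(int(y_lo // tile), min(int(y_hi // tile), last_tile) + 1):
        for t_col in range(int(x_lo // tile), min(int(x_hi // tile), last_tile) + 1):
            ox, oy = t_col * tile, t_row * tile  # origin of this tile
            # Step 3: regular-grid cells of this tile that overlap the square.
            col_lo = int(max(x_lo - ox, 0) // cell)
            col_hi = min(int((x_hi - ox) // cell), last_cell)
            row_lo = int(max(y_lo - oy, 0) // cell)
            row_hi = min(int((y_hi - oy) // cell), last_cell)
            plan[z_key(t_row, t_col, bits)] = (row_lo, row_hi, col_lo, col_hi)
    return plan


# A query near a tile border touches two tiles, so it yields two sub-queries.
print(plan_range_query(9_950, 500, 100))
```

Each entry of the returned plan corresponds to one sub-query; the points it returns still have to be checked against the actual radius, since the bounding square covers more area than the query circle.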

Page 19: HGrid A Data Model for Large Geospatial Data Sets in HBase

[Figure: grid with row and column indexes 00, 02, 04, 06]

Page 20: HGrid A Data Model for Large Geospatial Data Sets in HBase

[Figure: grid with row and column indexes 00, 02, 04, 06; labels 09-00 and 09-04]

Page 21: HGrid A Data Model for Large Geospatial Data Sets in HBase

1. Estimate the search range (density-based range estimation)

2. Compute indices of rows and columns (steps 2 and 3 of Range Query)

3. Issue a scan query to retrieve the relevant data points

4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3

5. Sort the result set by increasing distance from the input location
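The loop described in the steps above can be sketched in memory as follows. The slides do not give the estimation formula, so two assumptions are made here: the initial radius is that of a circle expected to contain K points at a given density, and re-estimation simply doubles the radius. The in-memory `scan_range` stands in for the HBase scan.

```python
import math


def estimate_radius(k, density):
    """Radius of a circle expected to hold k points (assumed estimation rule)."""
    if density <= 0:
        raise ValueError("density must be positive")
    return math.sqrt(k / (math.pi * density))


def scan_range(points, x, y, radius):
    """Stand-in for the scan query: all points within `radius` of (x, y)."""
    return [p for p in points if math.dist(p, (x, y)) <= radius]


def knn(points, x, y, k, density):
    """Return the k points nearest to (x, y), nearest first."""
    k = min(k, len(points))
    radius = estimate_radius(max(k, 1), density)  # step 1
    while True:
        hits = scan_range(points, x, y, radius)   # steps 2-3
        if len(hits) >= k:
            break
        radius *= 2                               # step 4: re-estimate (assumed doubling)
    hits.sort(key=lambda p: math.dist(p, (x, y)))  # step 5
    return hits[:k]


stations = [(0, 0), (1, 0), (5, 5), (10, 10)]
print(knn(stations, 0, 0, 3, density=0.04))
```

Because K is capped at the number of stored points, the loop always terminates once the radius covers enough of them.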

Page 22: HGrid A Data Model for Large Geospatial Data Sets in HBase

Experiment Environment
▪ A four-node cluster on virtual machines with Ubuntu on OpenStack
▪ Hadoop 1.0.2 (replication factor is 2), HBase 0.94
▪ HBase configuration
  ▪ 5K caching size
  ▪ Block cache is true
  ▪ ROWCOL bloom filter

Query-Processing Implementation
▪ Native Java API
▪ User-level coprocessor implementation

Page 23: HGrid A Data Model for Large Geospatial Data Sets in HBase

The granularity of the grid affects query-processing performance

Explore the “best” cell configuration of each model
▪ Quad-tree => (t=1)
▪ RG => (t=0.1)
▪ HGrid => (T=10, t=0.1)

HG:≈10:0.1 fewer sub-queries, more false positives

HG:≈1:0.1 more sub-queries, fewer false positives

HG:≈10:0.01 more rows

HG:≈10:0.1 fewer rows
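To make the granularity tradeoff concrete, the sketch below compares the area of the grid cells a range query must examine with the area of the query circle itself. It is a simplified model, assuming uniformly distributed data (so area is a proxy for data volume) and that every cell overlapping the query's bounding square is read; the function names are hypothetical.

```python
import math


def examined_area(x, y, radius, cell):
    """Total area of the grid cells that overlap the query's bounding square."""
    n_cols = math.floor((x + radius) / cell) - math.floor((x - radius) / cell) + 1
    n_rows = math.floor((y + radius) / cell) - math.floor((y - radius) / cell) + 1
    return n_cols * n_rows * cell * cell


def false_positive_share(x, y, radius, cell):
    """Share of the examined area that lies outside the query circle."""
    return 1.0 - (math.pi * radius ** 2) / examined_area(x, y, radius, cell)


coarse = false_positive_share(50, 50, 5, cell=10)
fine = false_positive_share(50, 50, 5, cell=1)
print(f"coarse cells: {coarse:.2f}, fine cells: {fine:.2f}")
```

Under these assumptions a coarser cell examines fewer cells but a larger share of unrelated data, while a finer cell does the opposite at the cost of more rows and columns to address.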

Page 24: HGrid A Data Model for Large Geospatial Data Sets in HBase

Given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location.

Page 25: HGrid A Data Model for Large Geospatial Data Sets in HBase

Given the coordinates of a location, return the K points nearest to that location.

Page 28: HGrid A Data Model for Large Geospatial Data Sets in HBase

Data Organization
▪ Keep row keys and column names short
▪ Better to have one column family and few columns
▪ Do not store a large amount of data in one row
▪ Row-key design should ease pruning of unrelated data
▪ The 3rd dimension can store data as well
▪ Bloom filters should be configured to prune rows and columns
▪ Compression can reduce the amount of data transmitted

Page 29: HGrid A Data Model for Large Geospatial Data Sets in HBase

Query Processing
▪ The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries
▪ “Scan” is better than “Get” for retrieving discontinuous keys, even though it also reads unrelated data
▪ Use “Scan” for small queries and coprocessors for large queries
▪ Better to split one large query into multiple sub-queries than to use one query with the row-filter mechanism
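The first lesson above, keeping each scan within the scan cache size, can be sketched as a helper that splits one row range into sub-ranges. Integer row indexes and the name `split_row_range` are assumptions for illustration; real row keys would be byte strings.

```python
def split_row_range(first_row, last_row, max_rows):
    """Split the inclusive range [first_row, last_row] into inclusive
    sub-ranges of at most max_rows rows each."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    chunks = []
    start = first_row
    while start <= last_row:
        stop = min(start + max_rows - 1, last_row)
        chunks.append((start, stop))
        start = stop + 1
    return chunks


# Twelve rows with a limit of five rows per scan give three sub-queries.
print(split_row_range(0, 11, 5))
```

Each returned pair would become the start and stop row of one sub-query, so no single scan has to return more rows than the cache holds.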

Page 30: HGrid A Data Model for Large Geospatial Data Sets in HBase

▪ Benefits from the good locality of the RG index; suffers from the poor locality of the Z-ordering QT linearization
  ▪ Performance could be improved with other linearization techniques
▪ Can be flexibly configured and extended
  ▪ The QT index can be replaced by the hash code of each sub-space
  ▪ The granularity in the second stage can be varied from sub-space to sub-space based on the various densities
▪ Is more suitable for homogeneously covered and discontinuous spaces

Page 31: HGrid A Data Model for Large Geospatial Data Sets in HBase

A data model for spatio-temporal datasets

Towards general, systematic guidance for column-family and other NoSQL databases

Apply the data model to cloud-based applications and big-data analytics systems