Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL
Jianting Zhang, Dept. of Computer Science, The City College of the City University of New York
Michael Gertz, Institute of Computer Science, University of Heidelberg
Le Gruenwald, School of Computer Science, The University of Oklahoma
Ecosystem hierarchy: Community – Ecosystem – Biomes – Biosphere
Related Work
• Quadtree indexing: one of the oldest and most extensively studied indexing and query processing approaches [Gaede and Günther 1998; Samet 2005]
– Research prototypes: QUILT (Shaffer et al 1990) and SAND (Esperanca and Samet 2002, Samet and Webber 2006); they lack full SQL support
– Window query in linear quadtrees (Aboulnaga and Aref 2001): key values are stored as Morton codes for B-tree indexing; implementation is non-trivial
– SP-GiST (Aref and Ilyas 2001, Eltabakh et al 2006): quadtree-based indexing of line segments in PostgreSQL
• Commercial products (Oracle/SQL Server) [Kothuri et al 2002, Fang et al 2008]
– Quadtree-based indexing of polygonal data
– Filtering mechanism to facilitate querying spatial relationships at the polygon level
– Not directly accessible to application developers
Related Work
• Pieces that contribute to the research
– Quadtree representation of geo-referenced data and linear quadtrees (Hunter et al 1979, Gargantini 1982, Samet et al 1983, Shaffer et al 1990, Esperanca and Samet 2002, Samet and Webber 2006)
– Efficient window query decomposition (Aref and Samet 1993, Aref and Samet 1997, Proietti 1999, Aboulnaga and Aref 2001, Tsai et al 2004)
– Microsoft SQL Server Spatial's implementation of quadtree indexing based on path query (Fang et al 2008)
– LTREE: tree path indexing module in PostgreSQL
• Open source products
– PostgreSQL: quadtree indexing for binary raster data is not available
– Rasdaman: supports storing and querying dense multi-dimensional real-valued arrays based on tiling (chunking) techniques
Overview of the Proposed Approach
[Diagram: workflow of the proposed approach]
Database Preparation
• Build individual quadtrees
• Combine quadtree leaf nodes and generate database tuples (tree_path, species_id)
• Store the database tuples and index on the tree_path column
Query Transformation
• Query window decomposition
• Convert decomposed cells into tree paths
• Query string formulation
Database Backend
Result Combination
• Eliminate duplicates
• Combine values that have the same key
Client applications
Links between Database Preparation and Query Transformation
1. Synchronization on raster tessellation
2. Coordination on tree path formulation
Database Preparation
• Polygons representing species distributions vary significantly in size and shape
• Distributions of a large number of species overlap greatly (cross-layer query)
• It is inefficient to create a quadtree for each polygon - quadtree paths will be duplicated
• Associate a quadtree node with a set of species identifiers instead of a single species identifier
• How to combine individual quadtrees for efficient query processing?
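As a rough illustration of the idea of attaching a set of species identifiers to each quadtree node, the sketch below builds a region quadtree per species from a small binary raster and groups the resulting nodes by tree path. It is a minimal sketch, not the classic or improved combination shown in the slides' figures: the quadrant numbering (0=NW, 1=NE, 2=SW, 3=SE), the function names, and the toy rasters are all assumptions made for illustration.

```python
from collections import defaultdict


def quadtree_leaves(raster, r0=0, c0=0, size=None, path=()):
    """Yield the tree paths of the all-ones blocks of a binary raster.

    Assumes a square raster whose side is a power of two and the
    hypothetical quadrant numbering 0=NW, 1=NE, 2=SW, 3=SE.
    """
    if size is None:
        size = len(raster)
    cells = [raster[r][c]
             for r in range(r0, r0 + size)
             for c in range(c0, c0 + size)]
    if all(cells):
        # Uniform block of 1s: store it as a single node.
        yield ".".join(map(str, path))
        return
    if not any(cells):
        return  # Empty block: nothing to store.
    half = size // 2
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for quadrant, (dr, dc) in enumerate(offsets):
        yield from quadtree_leaves(raster, r0 + dr, c0 + dc, half,
                                   path + (quadrant,))


def combine(species_rasters):
    """Group nodes across species into (tree_path, species_ids) tuples."""
    nodes = defaultdict(set)
    for species_id, raster in species_rasters.items():
        for tree_path in quadtree_leaves(raster):
            nodes[tree_path].add(species_id)
    return sorted((p, sorted(ids)) for p, ids in nodes.items())


# Toy 4x4 rasters for two made-up species identifiers.
SPECIES_RASTERS = {
    1: [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]],
    2: [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]],
}

tuples = combine(SPECIES_RASTERS)
```

In this toy case both species share the node at path 0, while species 2 is stored at path 3 and species 1 at its child 3.3. Nodes for the same area can therefore sit at different levels, which is why a query has to look at ancestors and descendants as well as exact matches.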
Database Preparation
[Figure: individual quadtrees for species distributions]
Database Preparation
[Figure: classic combination]
Database Preparation
[Figure: improved combination]
• Each quadtree node (leaf or non-leaf), together with its associated species identifiers, becomes a tuple in the PostgreSQL database
• Index the table on the paths of the quadtree nodes (done offline)
• Sample query:
– SELECT bk_id, sp_ids FROM TB WHERE bk_id <@ '3'
– Selects all the tuples whose paths are descendants of tree path '3'
– Results: A (16), B (4), C (4)
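To make the meaning of the sample query concrete, the following sketch mimics the inclusive "descendant of" test on dot-separated tree paths in plain Python. It is only an in-memory stand-in for what the database operator does; the rows in TB and their species identifiers are made up, and the function names are hypothetical.

```python
def is_descendant_or_self(path, ancestor):
    """Label-wise prefix test, mirroring the inclusive meaning of bk_id <@ '3'."""
    labels, prefix = path.split("."), ancestor.split(".")
    return labels[:len(prefix)] == prefix


# Hypothetical rows of (bk_id, sp_ids); the values are made up for illustration.
TB = [
    ("3", [101]),
    ("3.2", [102]),
    ("3.2.0", [103]),
    ("31", [104]),   # Not under '3': the first label is "31", not "3".
    ("1.3", [105]),  # Not under '3': the path starts at label "1".
]


def select_descendants(table, ancestor):
    """Return the rows whose path equals `ancestor` or lies below it."""
    return [(bk_id, sp_ids) for bk_id, sp_ids in table
            if is_descendant_or_self(bk_id, ancestor)]


rows = select_descendants(TB, "3")
```

The comparison is done label by label rather than on raw strings, so a path such as 31 is not mistaken for a descendant of 3. Quadtree labels only take the values 0 to 3, so that row is there purely to show the difference.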
Query Window Decomposition
• Transform a spatial query window into tree paths to match with the database tuples (decomposition)
• Exact matches + searching for ancestors
• Exact matches + searching for descendants
Condition to match a query window cell C with a database tuple R:
(C.ID is an ancestor of R.path) or (C.ID is a descendant of R.path)
[Figure: example quadtree with nodes labeled A, B, and C and tree paths 3, 3.2, 3.2.0, and 3.2.2]
Query Window Decomposition
Complexity analysis
• Decomposition algorithm: O(m) (Tsai et al 2004), where m is the larger of the width and height of a query window
• Converting the outputs of (Tsai et al 2004) to tree paths: O(l*d), where l is the number of decomposed cells and d is the depth of the quadtree for a raster tessellation
• l is proportional to one dimension of the query window, i.e., m (Tsai et al 2004)
• The overall complexity is O(m) + O(l*d) = O(m*d)
• d is a relatively small number, so the overall complexity is effectively O(m)
We adopt the approach reported in (Tsai et al 2004) for its efficiency and ease of implementation
Example: m = 8, d = 3, l = 13 (1 level-2 cell and 12 level-3 cells)
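The two steps analysed above, decomposing a window into quadtree cells and converting each cell to a tree path, can be sketched as follows. This is not the algorithm of Tsai et al 2004: plain recursive subdivision is swapped in because it is short, and it does not have the O(m) bound. The quadrant numbering (0=NW, 1=NE, 2=SW, 3=SE), the half-open window convention, and the function names are assumptions.

```python
def cell_to_path(row, col, level):
    """Tree path of the cell at (row, col) on the 2**level x 2**level grid.

    Takes one step per level, so converting l cells costs O(l*d).
    Assumed quadrant numbering: 0=NW, 1=NE, 2=SW, 3=SE.
    """
    labels = []
    for bit in range(level - 1, -1, -1):
        labels.append(2 * ((row >> bit) & 1) + ((col >> bit) & 1))
    return ".".join(map(str, labels))


def decompose(window, depth, r0=0, c0=0, level=0):
    """Split a window into maximal quadtree cells by recursive subdivision.

    `window` is (row_min, col_min, row_max, col_max), half-open, in
    finest-level cell units on a 2**depth x 2**depth raster. Returns
    (level, row, col) triples, with row/col indexing the grid of that level.
    """
    size = 1 << (depth - level)
    rmin, cmin, rmax, cmax = window
    # This node does not overlap the window at all.
    if r0 >= rmax or c0 >= cmax or r0 + size <= rmin or c0 + size <= cmin:
        return []
    # This node lies entirely inside the window: emit it as one cell.
    if rmin <= r0 and cmin <= c0 and r0 + size <= rmax and c0 + size <= cmax:
        return [(level, r0 // size, c0 // size)]
    half = size // 2
    cells = []
    for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
        cells += decompose(window, depth, r0 + dr, c0 + dc, level + 1)
    return cells


# A 4x4 window offset by one cell inside an 8x8 raster (depth 3).
cells = decompose((1, 1, 5, 5), depth=3)
paths = [cell_to_path(row, col, level) for level, row, col in cells]
```

For this window the sketch returns 13 cells: one level-2 cell (path 0.3) and twelve level-3 cells. A window that is badly aligned with the quadrant boundaries breaks into many small cells in the same way.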
Query Optimization
[Figure: example quadtree with nodes labeled A, B, and C and tree paths 3, 3.2, 3.2.0, and 3.2.2]
• For a large query window that does not align well with quadrant divisions, there can be thousands of decomposed cells or more, and most of them will be small cells that share the same ancestor nodes
• As the queries are sent to the server independently, duplicated tuples may be returned and need to be removed when combining query results
How can duplication be minimized while ensuring correctness?
Can we remove duplication and make combination as simple as addition?
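The duplicate problem described above can be illustrated with a small middleware-side merge. The sketch assumes that each independent query returns rows of (tree_path, species_ids) and that two rows with the same tree path are duplicates. That row layout, the function name, and the sample rows are assumptions; the slides do not spell out the exact result format.

```python
def combine_results(result_sets):
    """Merge per-query results so that each tree path appears only once.

    Rows sharing a tree_path are treated as duplicates returned by
    independent queries; their species ids are unioned under one key.
    """
    merged = {}
    for rows in result_sets:
        for tree_path, species_ids in rows:
            merged.setdefault(tree_path, set()).update(species_ids)
    return merged


# Two made-up result sets; the tuple at path 3.2 is returned by both queries.
query_a = [("3.2", [7, 9]), ("3.2.0", [4])]
query_b = [("3.2", [7, 9]), ("3", [1])]

merged = combine_results([query_a, query_b])
species = set().union(*merged.values())
```

Here the tuple at path 3.2 arrives twice but is kept once. If the queries could be arranged so that no tuple is ever returned twice, this merge would reduce to a plain concatenation of the result sets.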
Query Optimization
[Figure: decomposed query window cells and their Least Common Ancestor (LCA)]
• Retrieve all the tuples whose quadtree paths are descendants (inclusive) of the nodes shown in the figure
• Retrieve all the tuples whose quadtree paths match the cell identifiers exactly
• Retrieve all the tuples whose paths are ancestors of the root identifier
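For dot-separated tree paths, the least common ancestor of a set of cells is simply their longest common label prefix. The sketch below computes it and lists the ancestors of a path. How the slides' optimization uses the LCA in detail is not recoverable from this transcript, so the code only shows the path arithmetic; the function names are hypothetical.

```python
def least_common_ancestor(paths):
    """Longest common label prefix of dot-separated tree paths ('' = root)."""
    split = [p.split(".") for p in paths]
    common = []
    for labels in zip(*split):
        if len(set(labels)) != 1:
            break
        common.append(labels[0])
    return ".".join(common)


def proper_ancestors(path):
    """All strict ancestors of a tree path, from the top level downward."""
    labels = path.split(".") if path else []
    return [".".join(labels[:i]) for i in range(1, len(labels))]


# Three made-up decomposed cells that share the ancestor 3.2.
cell_paths = ["3.2.0", "3.2.2", "3.2.1.3"]
lca = least_common_ancestor(cell_paths)
shared_ancestors = proper_ancestors(lca)
```

Every ancestor of the LCA is also an ancestor of each decomposed cell. One plausible reading of the slide is that these shared ancestors are fetched once, rather than once per cell, which avoids returning the same tuples repeatedly.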
Proposed Approach: Discussions
• Summary
– Utilizes existing database storage and indexing functions: no need to define new data types, develop new indexing approaches, modify query syntax, or revise database query engines
– User queries are transformed into formats that are supported by the existing database backend, and the results are combined in the middleware to answer users' queries effectively and efficiently
• Advantages
– Use SQL query syntax instead of being forced to use APIs
– Use a variety of databases (as long as they support efficient path query)
– The underlying database systems are left untouched
• Reduces technical complexity
• Does not depend on the availability of source code
Experiment Setup
– Dell Precision T5400 workstation / PostgreSQL 8.3.5
– Species distribution data
– Quadtree
• Western Hemisphere, i.e., (-180, -90, 0, 90)
• Depth = 14 (2^14 = 16384)
• Spatial resolution is finer than 1 arc minute (180*60 = 10800)
– Four query window sizes: 0.1, 0.5, 1 and 5 degrees
– For each query window size: 100 queries with random centers
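The resolution claim follows from a short calculation, shown here as a small script. It assumes that both axes of the extent (-180, -90, 0, 90) span 180 degrees and are divided into 2^14 cells; the variable names are illustrative.

```python
depth = 14
cells_per_side = 2 ** depth                      # 16384 cells along each axis
extent_deg = 180.0                               # span of each axis in degrees
cell_arc_min = extent_deg * 60 / cells_per_side  # cell size in arc minutes

# Approximate width of each query window, measured in finest-level cells.
window_sizes_deg = [0.1, 0.5, 1, 5]
window_cells = {w: w * cells_per_side / extent_deg for w in window_sizes_deg}
```

Because 16384 cells exceed the 10800 arc minutes in 180 degrees, each cell is about 0.66 arc minutes wide. The four query windows span roughly 9 to 455 cells per side.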
Experiments on Database Preparation
• Rasterized bird species distributions (finest resolution)
– 46,139,247 cells
– 1,318,136,140 pairs of (cell, identifier) combinations
– 28.7 species per cell
• # of quadtree nodes (database tuples)
– Classic combination: 7,511,823 leaf nodes
– Proposed combination: 4,957,050 leaf and non-leaf nodes
– Proposed combination: 1:9.3 compression ratio
• # of species identifiers
– Classic combination: 831,903,250 (110.7 per node)
– Proposed combination: 23,865,343 (4.8 per node)
• 34.9 times fewer species identifiers in total
• 23 times fewer identifiers per node on average
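The derived figures on this slide can be re-computed from the reported totals. The sketch below does so, reading the 1:9.3 compression ratio as finest-resolution cells per stored node; that reading is an assumption, although these totals reproduce it.

```python
cells = 46_139_247
classic_nodes, classic_ids = 7_511_823, 831_903_250
proposed_nodes, proposed_ids = 4_957_050, 23_865_343

classic_per_node = classic_ids / classic_nodes      # identifiers per node, classic
proposed_per_node = proposed_ids / proposed_nodes   # identifiers per node, proposed
id_reduction = classic_ids / proposed_ids           # total identifier reduction
per_node_reduction = classic_per_node / proposed_per_node
cells_per_node = cells / proposed_nodes             # assumed meaning of "1:9.3"
```

Rounded to one decimal place these come out as 110.7, 4.8, 34.9, 23.0 and 9.3, matching the numbers on the slide.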
Experiments on Query Processing
[Figure: query response time (ms, 0 to 25000) versus # of species (0 to 1000) for the baseline (lt) and optimized (op) approaches]
Experiments on Query Processing
Average and maximum response times (in seconds) under four query window sizes for the baseline and optimized approaches

Window size    Baseline query      Optimized query
(degrees)      AVG      MAX        AVG      MAX
0.1            0.14     0.36       0.03     0.06
0.5            0.93     2.39       0.12     0.25
1              1.63     3.93       0.20     0.50
5              9.68     21.13      1.16     3.38
Conclusions and Future Work
• This research tackles the problem of storing, indexing and querying large-scale species distribution data in the form of binary rasters in a database environment
• The approach does not require any modifications to the database backend and is applicable to many database systems that support tree path matching
• A middleware approach has been adopted by utilizing existing PostgreSQL database support for tree paths and by transforming spatial window queries into tree path matching
• An end-to-end system to manage large-scale species distribution datasets has been developed, with efficiency demonstrated on 4000+ bird species distributions in the Western Hemisphere
Conclusions and Future Work
• Future work
– Further extend the solution to manage even larger-scale species distribution datasets. The ultimate goal is to support all known species at the million scale and reduce the average query response time to below one second for realistic query window sizes to support interactive customer applications
– Possible strategies
• New efficient data structures and algorithms (e.g. column store, bitmap)
• Combining pre-computation and on-demand processing (query window decomposition)
• Using main-memory database techniques (indexing/querying)
• Using GPU for faster query processing (Fermi, CUDA, Nexus)
• Using Map-Reduce/Hadoop for parallel/distributed query processing (HadoopDB)
Latest Progress
• Main-memory database and query processing algorithm
– Memory consumption: ~200 MB for 4000+ bird species
– Query response time: ~1/4 second for query windows as large as 5 by 5 degrees
– Can be used as a cache system for extremely large-scale species distribution data