Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science The City College of the City.

Efficiently Managing Large-Scale Raster Species

Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science

The City College of the City University of New YorkMichael Gertz, Institute of Computer Science

University of HeidelbergLe Gruenwald, School of Computer Science

The University of Oklahoma

http://images.google.com/imgres?imgurl=http://www.ccny.cuny.edu/public_safety/LOGO-NEW-2.gif&imgrefurl=http://www.ccny.cuny.edu/public_safety/&h=1518&w=1670&sz=419&hl=en&start=5&usg=__RzwNqxW2tlvxCgyaIMmAFcqnUOo=&tbnid=LQS2VxavYEbsuM:&tbnh=136&tbnw=150&prev=/images%3Fq%3Dccny%2Blogo%26gbv%3D2%26hl%3Den%26sa%3DX

Outline•Introduction•Background and Related Work

•Species Distribution Data

•Quadtree Indexing and Query Processing

•The Proposed Solution•Database Preparation

•Query Window Decomposition

•Query Optimization and Result Combination

•Experiments and Evaluation•Conclusion and Future Work

Introduction

EnvironmentSpecies

Taxonomic (Linnaean ranks) Kingdom Phylum Class Order Family Genus Species SubSpecies

Phylogenentic

Area

Water-Energy

Latitude

Altitude

Productivity

Environmental Gradient

Community – Ecosystem – Biome – Biosphere

Phylogeography

3

IntroductionGeographical View

Taxonomic View

Environmental View

Ecoregion View

Linked Environment for Exploratory Analysis of Large-Scale Species Distribution Data

ACMGIS’08

Introduction

Taxonomic

Geographical

Correlation

DistributionConfiguration

Environmental

Phylogenetic

Approximation

Distribution

Background: Species Distribution Data

MuseumCollections or Species checklists

Global Biodiversty Facility (GBIF)data portal (http://data.gbif.org/)

177,887,193 occurrence records from 294 data providers as of 07/25/2009

Species 2000 Annual Checklist (www.sp2000.org)

1,160,711 species as of 2009

SpeciesRangeMaps

NatureServe Birds of the WesternHemisphere http://www.natureserve.org

4253 birds species ~3 Gigabytes

Little Tree Specieshttp://esp.cr.usgs.gov/data/atlas/

little/

679 tree species, 137 Megabytes

Compiled Speciesdatabases

WWF Wildfinder database http://www.worldwildlife.org/

science/data/wildfinder.cfm

29112 species, 4815 genus, 445families, 69 orders, 350045 species-ecoregion records, ~80Megabytes

USDA Plant databaseshttp://plants.usda.gov/

89759 plant species in 3141 UScounties

Background: Species Distribution Data

Species A

Species B

Species C

Species Area

A 14

B 10

C 9

… …

Dynamic user defined query window

Cross-layer query result table

Taxonomic Tree

Phylogenetic Tree

Ecosystem hierarchy: Community – Ecosystem – Biomes – Biosphere

Related Work• Quadtree Indexing :One of the oldest and most extensively studied indexing and

query processing approach [Gaede and Günther 1998; Samet 2005]– Research prototypes: QUILT (Shaffer et al 1990) and SAND (Esperanca and

Samet 2002, Samet and Webber 2006 ) - Lacking of full SQL support– Window query in linear quadtree (Aboulnaga and Aref 2001) -Key values

are stored as Morton codes for B-Tree Indexing – Implementation is non-trivial

– SP-GIST (Aref and Ilyas 2001, Eltabakh et al 2006) - Quad-tree based indexing of line-segments in PostgreSQL

• Commercial products (Oracle/SQL Server) [Kothuri et al 2002, Fang et al 2008 ] Quadtree based indexing of polygonal data Filtering mechanism to facilitate querying spatial relationships at the

polygon level Not directly accessible to application developers

Related Work

• Pieces that contribute to the research– Quadtree representation of geo-referenced data and linear quadtrees

(Hunter et al 1979, Gargantini 1982, Samet et al 1983, Shaffer et al 1990, Esperanca and Samet 2002, Samet and Webber 2006)

– Efficient window query decomposition (Aref and Samet 1993, Aref and Samet 1997, Proietti 1999, Aboulnaga and Aref 2001, Tsai et al 2004)

– Microsoft SQL Server Spatial’s implementation of quadtree indexing based on path query (Fang et al 2008)

– LTREE Tree path indexing module in PostgreSQL

•Open source productsPostgreSQL: Quadtree indexing for binary raster data is not availableRasdaman: support storing and querying dense multi-dimensional real-valued arrays based on tiling (chunking) techniques

Overview of the Proposed Approach

Build individual quadtrees

Combine quadtree leaf nodes and generate database tuples (tree_path, species_id)

Store the database tuples and index on tree_path column

Query window decomposition

Convert decomposed cellsinto tree paths

Query string formulation

Database Preparation Query Transformation

Database Backend

Result Combination• Eliminate duplicates• Combine values that have

the same key

Client applications

1

1

2

2

Synchronization onraster tesselation

Coordination on treepath formulation

Database Preparation

• Polygons representing species distributions vary in sizes and shapes significantly

• Distributions among a large number of species overlap greatly – cross layer query

• It is inefficient to create a quadtree for each polygon - quadtree paths will be duplicated

• Associate a quadtree node with a set of species identifiers instead of a single species identifier

• How to combine individual quadtrees for efficient query processing?

Database PreparationIndividual Quadtrees for Species Distributions


Classic Combination


Improved Combination


• Each of quadtree nodes (leaf or non-leaf) and their associated species identifiers become a tuple in PostgreSQL database

• Index the table based on the paths of the quadtree nodes – offline

• Sample query: – Select bk_id, sp_ids from TB where bk_id <@ ‘3’– selecting all the tuples whose paths are decedents of tree path ‘3’ – Results: A (16), B(4), C(4)

Query Window Decomposition• Transform a spatial query window into tree paths to

match with the database tuples - Decomposition•Exact matches + searching for ancestors. •Exact matches + searching for descendents

Condition to match a query window cell C with a database tuple R

(C.ID is an ancestor of R.path) or (C.ID is a descendant of R.path)

A

B

C C

3

3.2

3.2.0 3.2.2

A

BC

Query Window Decomposition

Complexity analysis•Decomposition algorithm O(m) (Tsai et al 2004), m is the larger of the width and height of a query window •Converting the outputs of (Tsai et al 2004) to tree paths O(l*d). l is the number of decomposed cells and d is the depth of quadtree for a raster tessellation. •l is proportional to one dimension of the query window, i.e., m. (Tsai et al 2004) •The overall complexity O(m)+ O(l*d) O(m*d) •d is a relative small number O(m)

We adopt the approach reported in (Tsai et al 2004) -efficiency and easy of implementation

m=8d=3l=13 (1 level 2 and 12 level 3)

Query Optimization

A

B

C C

3

3.2

3.2.0 3.2.2

•For a large query window that does not align with quadrant divisions very well, the decomposed cells can be thousands or even more and most of them will be small cells with same ancestor nodes •As the queries are sent to the server independently, duplicated tuples may be returned and need to be removed when combining query results

How to minimize duplication while ensure correctness?

Can we remove duplication and make combination as simple additions?

Query Optimization

1 1 1 1 1 1 1

222

2 4

8

8

Least Common Ancestor (LCA)

Retrieve all the tuples whose quadtree paths are the descendents (inclusive) of the nodes below

Retrieve all the tuples whose quadtree paths match the cell identifiers exactly

Retrieve all the tuples whose paths are the ancestors of the root identifier

Proposed Approach -Discussions• Summary

– Utilizes existing database storage and indexing functions – no need to define new data types, develop new indexing approaches, modify query syntax and revise database query engines

– User queries are transformed into formats that are supported by existing database backend and the results are combined in the middleware to answer users’ query effectively and efficiently.

• Advantages:– Use SQL query syntax instead of being forced to use APIs – Use a variety of databases (as long as they support efficient path

query)

– The underlying database systems are left untouched • Reduce technical complexities• Does not depend on the availability of source code

Experiment Setups

– Dell Precision T5400 workstation/PostgreSQL 8.3.5– Species Distribution Data

• NatureServe: http://www.natureserve.org/getData/animalData.jsp • Mammals (1693 species), birds (4148 species) and amphibian (5816

species)

– Quadtree • West hemisphere, i.e., (-180,-90, 0, 90) • Depth=14 (214=16384) • Spatial resolution is finer than 1 arc minute (180*60=10800)

– Four query window sizes: 0.1, 0.5, 1 and 5 degrees – For each query window size: 100 queries with random centers

Experiments on Database Preparation • Rasterized bird species distributions (finest resolution)

– 46,139,247 cells – 1,318,136,140 pairs of (cell, identifier) combinations – 28.7 species per cell

• # of quadtree nodes (database tuples)– Classic combination: 7,511,823 leaf nodes– Proposed combination: 4,957,050 leaf and non-leaf nodes– Proposed combination: 1:9.3 compression ratio

• # of species identifiers– Classic combination: 831,903,250 (110.7 per node) – Proposed combination: 23,865,343 (4.8 per node)

• 34.9 times less with respect to the total # of species identifiers • 23 time less with respect to the average # of identifiers per node

Experiments on Query Processing

# of species

Query response time (ms): Baseline (lt) and Optimized (op) approaches

0

5000

10000

15000

20000

25000

0 200 400 600 800 1000

lt

op

Experiments on Query Processing

Window Size

Baseline Query Optimized Query

AVG MAX AVG MAX

0.1 0.14 0.36 0.03 0.06

0.5 0.93 2.39 0.12 0.25

1 1.63 3.93 0.20 0.50

5 9.68 21.13 1.16 3.38

Average and Maximum Response Times under Four Query Windows for the Three Approaches (in seconds)

Conclusions and Future Work•This research tackles the problem of storing, indexing and querying large-scale species distribution data in the form of binary rasters in a database environment

•The approach does not require any modifications on database backend and is applicable to many database systems that support tree path matching.

•A middleware approach has been adopted by utilizing existing PostgreSQL database support for tree paths and by transforming spatial window query into tree path matching.

•An end-to-end system to manage large-scale species distribution datasets has been developed with demonstrated efficiency based on 4000+ bird species distribution data in the West Hemisphere.

Conclusions and Future Work• Future work

– Further extend the solution to manage even larger scale of species distribution datasets. The ultimate goal is to support all known species at the million scale and reduce the average query response time to below one second for realistic query window sizes to support interactive customer applications

– Possible strategies• New efficient data structures and algorithms (e.g. column store, bitmap)

• Combing pre-computation and on-demand processing (query window decomposition)

• Using main-memory database techniques (indexing/querying)

• Using GPU for faster query processing (Fermi, CUDA, Nexus)• Using Map-Reduce/Hadoop for parallel/distributed query processing

(HadoopDB)

Latest Progresses

• Main-memory database and query processing algorithm– Memory Consumption: ~200M for 4000+ birds species

– Query response time: ~1/4 second for query window size as large as 5 by 5 degrees

– Can be used as a cache system for extremely large scale species distribution data

• Web-based tool: http://geoteci.engr.ccny.cuny.edu/geoteci/SPTestMap.html

Efficiently Managing Large-Scale Raster Species Distribution Data in PostgreSQL Jianting Zhang, Dept. of Computer Science The City College of the City.

Documents

birds species

plant species

samet et

speciesecoregion records

data providers

future work slide

related work quadtree

fang et