ICPP 2012: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets. Yu Su*, Gagan Agrawal*, Jonathan Woodring† (*The Ohio State University, †Los Alamos National Laboratory)

Feb 24, 2016

Page 1: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

ICPP 2012

Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Yu Su*, Gagan Agrawal*, Jonathan Woodring†

*The Ohio State University
†Los Alamos National Laboratory

Page 2: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Outline

• Motivation and Introduction
• Background
• System Overview and Optimization
• Experiment
• Conclusion

Page 3: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Motivation

• Science is becoming increasingly data driven
• Strong desire for efficient data visualization
• Challenges:
  – Fast data generation speed
  – Slow disk I/O and network speed
  – Worse performance during visualization
  – Different kinds of subsetting requests
• Difficult and unnecessary to visualize all the data

Page 4: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Data Subsetting in ParaView

• A widely used data analysis and visualization application
• Problem: Load + Filter mode
  – Loads the entire data set
  – Data filtering happens at the visualization level
    • Threshold Filter: based on values
    • Extract Subset Filter: based on dimension info
  – Grid transformation needed during filtering
    • Regular Structured Grid -> Unstructured Grid

Page 5: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

A Faster Solution

• Subset at the I/O level
  – User specifies the subset in one query for both dimension and value ranges
  – Reduced I/O time and memory footprint
• SQL queries in ParaView
  – Query over dimensions: API support
  – Query over values: indexing
• Bitmap indices and parallel bitmap indices
  – Efficient subsetting over values

Page 6: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Background: Bitmap Indexing

• FastBit: widely used in scientific data management
• Suitable for floating-point values, by binning small value ranges
• Run-length compression (WAH, BBC)
  – Compresses a bitvector based on consecutive runs of 0s or 1s
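
As a rough illustration of the two ideas on this slide, the sketch below bins floating-point values into one bitvector per bin and then run-length encodes a bitvector. It is a minimal sketch, not FastBit's implementation: the function names `build_bitmap_index` and `run_length_encode` are hypothetical, and the encoding is plain run-length encoding rather than the word-aligned (WAH) or byte-aligned (BBC) schemes named above.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitvector (list of 0/1) per bin.

    Bin i covers [bin_edges[i], bin_edges[i + 1]); values outside the
    edges are clamped into the first or last bin (an assumption made
    here to keep the sketch short).
    """
    n_bins = len(bin_edges) - 1
    bitmaps = [[0] * len(values) for _ in range(n_bins)]
    for pos, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        b = min(max(b, 0), n_bins - 1)
        bitmaps[b][pos] = 1
    return bitmaps


def run_length_encode(bits):
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [tuple(r) for r in runs]


values = [0.1, 0.15, 0.9, 0.95, 0.2, 0.5]
edges = [0.0, 0.25, 0.75, 1.0]
bitmaps = build_bitmap_index(values, edges)
print(bitmaps[2])                     # [0, 0, 1, 1, 0, 0]
print(run_length_encode(bitmaps[2]))  # [(0, 2), (1, 2), (0, 2)]
```

Neighbouring elements of climate data often fall into the same bin, which gives the long runs that run-length schemes compress well.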

Page 7: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Bitmap Index and Dim Subset

• Run-length compression (WAH, BBC)
  – Good: compression rate, fast bitwise operations
  – Bad: the ability to locate a dim subset is lost
• Two traditional methods:
  – With bitmap indices: post-filter on dim info
  – Without bitmap indices: post-filter on values
• Two-phase optimization:
  – Index generation: distributed indices over sub-blocks
  – Index retrieval: transform dim subsetting info into bitvectors, and support fast bitwise operations
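
The index retrieval idea above (turn the dimension subset into a bitvector, then combine it with the value bitvector using a bitwise operation instead of post-filtering) can be sketched as follows. This is only an illustration under stated assumptions: bitvectors are uncompressed Python integers, the array is 2-D in row-major order, the helper names are hypothetical, and `value_mask` scans the values directly as a stand-in for the bitvector a bitmap index would supply.

```python
def dim_subset_mask(shape, ranges):
    """Bitmask with bit i set when row-major element i lies inside
    every half-open [lo, hi) range. 2-D only, to keep the sketch short."""
    rows, cols = shape
    (r_lo, r_hi), (c_lo, c_hi) = ranges
    mask = 0
    for r in range(r_lo, r_hi):
        for c in range(c_lo, c_hi):
            mask |= 1 << (r * cols + c)
    return mask


def value_mask(values, lo, hi):
    """Stand-in for the value bitvector: bit i set when lo <= values[i] < hi."""
    mask = 0
    for i, v in enumerate(values):
        if lo <= v < hi:
            mask |= 1 << i
    return mask


def selected_positions(mask):
    """List the element positions whose bit is set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


shape = (3, 4)                            # 3 rows x 4 columns
values = [float(i) for i in range(12)]    # element i holds the value i

dims = dim_subset_mask(shape, ((1, 3), (0, 2)))  # rows 1-2, columns 0-1
vals = value_mask(values, 5.0, 9.0)              # 5 <= value < 9

# One bitwise AND replaces the per-element post-filter.
print(selected_positions(dims & vals))    # [5, 8]
```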

Page 8: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

System Overview

Parse the SQL expression

Parse the metadata file

Generate Query Request

Index generation if the index has not been generated yet; index retrieval after that.
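
The last step (generate the index once, retrieve it on later queries) follows a common build-or-load pattern. The sketch below is one hedged way to express it; the function name `load_or_build_index`, the on-disk pickle format, and the `build_fn` callback are all assumptions made for illustration and are not described in the slides.

```python
import os
import pickle


def load_or_build_index(index_path, build_fn):
    """Return the index stored at index_path.

    On first use the index is built with build_fn() and saved; later
    calls only read the saved file (the "index retrieval" path).
    """
    if os.path.exists(index_path):
        with open(index_path, "rb") as f:
            return pickle.load(f)
    index = build_fn()
    with open(index_path, "wb") as f:
        pickle.dump(index, f)
    return index
```

With this shape, the cost of index generation is paid by the first query against a data file and amortized over the queries that follow.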

Page 9: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Optimization 1: Distributed Index Generation

Study the relationship between queries and partitions.

Partition the data based on query preference.

Page 10: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Index Partition Strategy

• α rate: participation rate of data elements
  – Number of elements involved in indexing / total data size
  – Worst: all elements have to be involved
  – Ideal: the elements involved are exactly those in the dim subset
• Partition strategies:
  – Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions.
  – Strategy 2: In the general case, where subsetting over each dimension has a similar probability, the partition should have equal preference over each dim.
  – Strategy 3: If queries only include a subset of the dims, the partition should also be based on these dims.
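
To make the α rate concrete, the sketch below computes it for a regular grid of index partitions. It assumes one reading of the definition above: the elements "involved in indexing" are all elements of every partition block that overlaps the query's dimension ranges. The function names and the equal-size block layout are hypothetical.

```python
def covered_extent(size, parts, lo, hi):
    """Elements along one dimension that belong to blocks overlapping [lo, hi).

    The dimension is cut into `parts` blocks; block k covers
    [k * size // parts, (k + 1) * size // parts).
    """
    covered = 0
    for k in range(parts):
        b_lo = k * size // parts
        b_hi = (k + 1) * size // parts
        if b_lo < hi and lo < b_hi:
            covered += b_hi - b_lo
    return covered


def alpha_rate(shape, parts, query):
    """Fraction of all elements that sit in blocks touched by the query."""
    touched = 1
    total = 1
    for size, p, (lo, hi) in zip(shape, parts, query):
        touched *= covered_extent(size, p, lo, hi)
        total *= size
    return touched / total


shape = (100, 100)
query = ((0, 10), (0, 100))   # subset on the 1st dim only; ideal alpha is 0.10

print(alpha_rate(shape, (1, 4), query))  # 1.0  (partitioned on the unqueried dim)
print(alpha_rate(shape, (2, 2), query))  # 0.5  (equal preference over both dims)
print(alpha_rate(shape, (4, 1), query))  # 0.25 (partitioned on the queried dim)
```

All three layouts use four partitions, yet α ranges from 1.0 to 0.25, which is the point of Strategy 3: when queries subset only some dimensions, partitioning along those dimensions brings α closest to the ideal.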

Page 11: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Optimization 2: Index Retrieval

Post-filter?

Page 12: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Parallel Index Architecture

L1: data file

L2: variable

L3: data block

Page 13: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Experiment Setup

• Goals:
  – SQL subsetting vs. Load + Filter in ParaView
  – Scalability of the parallel indexing method
  – Indexing and partition strategy vs. FastQuery
• Dataset:
  – Parallel Ocean Program
  – Data size: 33.6 GB
  – Data format: NetCDF (array based)
• Environment:
  – IBM Xeon cluster, 8 cores, 2.53 GHz
  – 12 GB memory

Page 14: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Efficiency Comparison with Filtering in ParaView

• Data size: 5.6 GB
• Input: 400 queries
• Results depend on the subset percentage
• The general index method is better than filtering when the data subset is < 60%
• Two-phase optimization achieved a 0.71x to 11.17x speedup compared with the filtering method

Index m1: Bitmap indexing, no optimization
Index m2: Use bitwise operations instead of post-filtering
Index m3: Use both bitwise operations and index partitioning
Filter: load all data + filter

Page 15: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Memory Comparison with Filtering in ParaView

• Data size: 5.6 GB
• Input: 400 queries
• Results depend on the subset percentage
• The general index method has a much smaller memory cost than the filtering method
• Two-phase optimization adds only a small extra memory cost

Index m1: Bitmap indexing, no optimization
Index m2: Use bitwise operations instead of post-filtering
Index m3: Use both bitwise operations and index partitioning
Filter: load all data + filter

Page 16: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Scalability with Different Proc#

• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: time
• Each process takes care of one sub-block
• Good scalability as the number of processes increases

Page 17: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Alpha Rate with Different Proc#

• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: alpha rate
• More processes means more index partitions
• Good participation rate when selecting a smaller percentage of the data as the subset

Page 18: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Alpha Rate and IO Access Times Comparison with FastQuery

• FastQuery:
  – Builds a relational table view over the scientific dataset
  – Difference: doesn’t consider multi-dimensional data features
• Data size: 8.4 GB, 48 processes
• Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
• Input: 100 queries for each query type

Page 19: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Efficiency Comparison with FastQuery

• Data size: 8.4 GB
• Proc#: 48
• Input: 100 queries for each query type
• Achieved a 1.41x to 2.12x speedup compared with FastQuery

Page 20: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Page 21: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Conclusion

• Big data is an issue in data analysis and visualization
• Find the exact data subset at the I/O level with a SQL interface and bitmap indexing
• A good speedup compared with the filtering method
• Data partition strategy and parallel indexing
• A good speedup compared with FastQuery

Page 22: Indexing and Parallel Query Processing Support for Visualizing Climate Datasets

Thanks