Ohio State University, Department of Computer Science and Engineering
Supporting SQL-3 Aggregations on Grid-based Data Repositories
Li Weng, Gagan Agrawal, Umit Catalyurek, Joel Saltz
Scientific Data Analysis
Multi-dimensional datasets
Motivating Scientific Applications
Magnetic Resonance Imaging
Oil Reservoir Management
Cancer Studies using MRI
Telepathology with Digitized Slides
Satellite Data Processing
Virtual Microscope
…
Current Approaches
Good, but is it too heavyweight for read-mostly scientific data?
Manual implementation based on low-level datasets (HDF5, NetCDF, etc.)
Needs a detailed understanding of low-level formats
No single established standard (BinX, BFD, DFDL)
Machine-readable descriptions, but the application is dependent on a specific layout
Our Approach
Express the query & the computation declaratively on a virtual relational table view
A dataset in a complex, low-level layout can be abstracted as an SQL-3 table for scientists.
Support a basic SELECT query for specifying the subset of interest.
Data analysis on the subset of interest can be defined as an SQL-3 aggregate function on an SQL-3 relation.
Our Approach
Generate a data extraction service & a data aggregation service
A lightweight layer on top of the datasets
A runtime middleware, STORM, works in coordination with the generated services.
System Overview
[Diagram: query frontend, generated extraction service, STORM runtime]
Outline
Introduction
Motivation
Compiler analysis and code generation
Design a meta-data descriptor
Experimental results
Related work
Canonical Query Structure
FROM <Dataset Name>
<SQL statement list>
Oil Reservoir Management (IPARS)
FROM IPARS
WHERE REL IN (0, 5, 10) AND TIME >= 1000 AND TIME <= 1200
GROUP BY X, Y, Z HAVING ipars_bypass_sum(OIL) > 0;

CREATE AGGREGATE ipars_bypass_sum
SELECT CASE WHEN $2.soil > 0.7 AND
  |/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz) < 30.0
THEN $1 & 1
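The per-tuple test inside ipars_bypass_sum, and its accumulation into the status variable, can be sketched in Python (a minimal illustration, not the generated code; the attribute names and thresholds follow the query above, and treating the elided ELSE branch as clearing the status is an assumption):

```python
import math

def is_bypassed(cell):
    # Per-tuple condition from the CASE expression: high oil
    # saturation but low oil-velocity magnitude.
    speed = math.sqrt(cell["oilx"] ** 2 + cell["oily"] ** 2 + cell["oilz"] ** 2)
    return cell["soil"] > 0.7 and speed < 30.0

def ipars_bypass_sum(cells):
    # Status-variable update: "$1 & 1" keeps the status when the
    # condition holds; we assume the elided ELSE branch clears it,
    # so the result is 1 only if every tuple is bypassed.
    status = 1
    for c in cells:
        status &= 1 if is_bypassed(c) else 0
    return status
```

The HAVING ipars_bypass_sum(OIL) > 0 clause then keeps only the grid points whose status remains positive across the selected time steps.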
Compiler Analysis and Code Generation
Transform the canonical query into two pipelined sub-queries.
Data Extraction Service
Data Aggregation Service
SELECT <attribute list>, <AGG_name(Dataset Name)> FROM TempDataset GROUP BY <group-by attribute_list>;
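The two pipelined sub-queries can be pictured as ordinary functions, a hypothetical Python sketch of the pipeline shape rather than the generated services:

```python
def extract(dataset, predicate, attrs):
    # Data extraction service: apply the WHERE clause and keep only
    # the needed attributes, producing TempDataset.
    return [{a: rec[a] for a in attrs} for rec in dataset if predicate(rec)]

def aggregate(temp_dataset, group_by, agg):
    # Data aggregation service: GROUP BY over TempDataset, folding
    # each group's tuples through the aggregate function.
    groups = {}
    for rec in temp_dataset:
        key = tuple(rec[a] for a in group_by)
        groups[key] = agg(groups.get(key), rec)
    return groups
```

A canonical query then becomes aggregate(extract(data, where, attrs), group_by, agg_fn).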
Design a Meta-data Descriptor
The dataset comprises several simulations (realizations) on the same grid.
For each realization and each grid point, a number of attributes are stored.
The dataset is stored on a 4 node cluster.
Component I: Dataset Schema Description
[IPARS] //{* Dataset schema name *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
[IparsData] //{* Dataset name *}
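A schema section like the one above is simple to parse mechanically. Here is a hedged Python sketch of such a parser (the comment-stripping and section syntax are assumptions based on the fragment shown, not the tool's actual grammar):

```python
def parse_schema(text):
    # Parse a Component-I style section: a [Name] header followed by
    # "attribute = type" lines; "//" starts a comment.
    name, attrs = None, {}
    for line in text.splitlines():
        line = line.split("//")[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
        elif "=" in line:
            attr, typ = (part.strip() for part in line.split("=", 1))
            attrs[attr] = typ
    return name, attrs
```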
An Example
Oil Reservoir Management
Use the LOOP keyword to capture the repetitive structure within a file.
The grid has 4 partitions (0~3).
“IparsData” comprises “ipars1” and “ipars2”: “ipars1” describes the data files where the spatial coordinates are stored; “ipars2” specifies the data files where the other attributes are stored.
Component III: Dataset Layout Description
DATASET “IparsData” { //{* Name for Dataset *}
  DATATYPE { IPARS } //{* Schema for Dataset *}
  DATAINDEX { REL TIME }
  DATASET “ipars1” {
    X Y Z
  } //{* end of DATASET “ipars1” *}
  SOIL SGAS
  $DIRID = 0:3:1
}
Generate Data Extraction Service
Our tool parses the meta-data descriptor and generates function code.
At run time, the query provides parameters to invoke the generated functions, which create Aligned File Chunks.
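Since the coordinates and the other attributes live in separate files but follow the same grid ordering, chunk alignment amounts to pairing records positionally. A minimal sketch, assuming equal-length record streams (make_aligned_chunks is a hypothetical helper, not the generated code):

```python
def make_aligned_chunks(coord_records, attr_records, chunk_size):
    # Pair coordinate and attribute records that share the same grid
    # ordering, emitting fixed-size aligned chunks.
    assert len(coord_records) == len(attr_records)
    chunks = []
    for i in range(0, len(coord_records), chunk_size):
        chunks.append(list(zip(coord_records[i:i + chunk_size],
                               attr_records[i:i + chunk_size])))
    return chunks
```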
Generate Data Aggregation Service
1. Aggregate function analysis
Projection push-down extracts only the data needed for a particular query and its aggregation:
TempDataset = SELECT <useful attributes> FROM <Dataset Name> WHERE <Expression>;
For the IPARS application, only 7 of the 22 attributes are actually needed for the considered query; the data volume to be retrieved and communicated is reduced by 66%.
For the TITAN application, 5 of the 8 attributes are needed, and the reduction is 38%.
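Projection push-down itself is just attribute filtering before any data moves; a small illustrative sketch (note that the reported reductions are in data volume, which depends on attribute sizes, not only on attribute counts):

```python
def project(records, needed):
    # Keep only the attributes that the query and its aggregate
    # actually reference, before retrieval and communication.
    return [{a: rec[a] for a in needed} for rec in records]
```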
Generate Data Aggregation Service
2. Aggregate function decomposition
The first step involves computations applied to each tuple; the second step updates the aggregate status variable.
Replace the largest expression with TempAttr. For IPARS, the number of attributes is reduced further from 7 to 4.
CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
SELECT CASE WHEN $2.TempAttr
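The decomposition can be pictured as hoisting the tuple-level expression out of the aggregate: each tuple reaches the aggregation service carrying only the group-by attributes plus the precomputed TempAttr. A hedged Python sketch, with attribute names mirroring the IPARS example:

```python
def precompute_temp_attr(tuples):
    # Step 1: per-tuple computation. The largest expression in the
    # aggregate is replaced by a single boolean TempAttr, so only
    # X, Y, Z and TempAttr (4 of the original 7 attributes) travel on.
    out = []
    for t in tuples:
        speed = (t["OILX"] ** 2 + t["OILY"] ** 2 + t["OILZ"] ** 2) ** 0.5
        temp = t["SOIL"] > 0.7 and speed < 30.0
        out.append({"X": t["X"], "Y": t["Y"], "Z": t["Z"], "TempAttr": temp})
    return out

def update_status(status, temp_attr):
    # Step 2: update of the aggregate status variable.
    return status & (1 if temp_attr else 0)
```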
Generate Data Aggregation Service
Partition the subset of interest based on the values of the group-by attributes when more client nodes are provided as computing units.
Construct a hash table using the values of the group-by attributes as the hash key, and translate the aggregate function in SQL-3 into imperative C/C++ code.
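The generated imperative code boils down to a hash-table scan. A minimal Python sketch of the same shape (assuming the AND-style status update of ipars_bypass_sum; the real system emits C/C++):

```python
def hash_aggregate(temp_dataset, group_by=("X", "Y", "Z")):
    # Hash table keyed by the group-by attributes; each bucket holds
    # the aggregate status variable (initialized to 1 for the
    # AND-style accumulation of ipars_bypass_sum).
    table = {}
    for rec in temp_dataset:
        key = tuple(rec[a] for a in group_by)
        status = table.get(key, 1)
        table[key] = status & (1 if rec["TempAttr"] else 0)
    # The HAVING clause keeps only groups whose status stays positive.
    return [key for key, status in table.items() if status > 0]
```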
Experimental Setup & Design
A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
Scalability test: varying the number of nodes hosting the data and performing the computations.
Performance test: increasing the amount of data to be processed.
Comparison with hand-written code.
Experimental Results for IPARS
Scale the number of nodes hosting the data and the number of nodes performing the processing.
Extract a subset of interest of size 640 MB by scanning the 1.9 GB dataset.
The execution times scale almost linearly.
The performance difference versus hand-written code varies between 6% and 20%, with an average of 14%.
Aggregate decomposition reduces the difference to between 1% and 10%.
[Chart 1: execution time on 1, 2, 4, and 8 nodes for the hand-written (Hand) and compiler-generated (Comp) versions]
Experimental Results for IPARS
Evaluate the system’s ability to scale to larger datasets.
Use 8 data source nodes and 8 client nodes.
The execution time stays proportional to the amount of data to be retrieved and processed.
[Chart 2: execution time for dataset sizes of 1.9, 3.8, 5.7, and 7.6 GB, Hand vs. Comp]
Experimental Results for TITAN
Scale the number of nodes hosting the data and the number of nodes performing the processing.
Extract a subset of interest of size 228 MB by scanning the 456 MB dataset.
The execution times scale almost linearly.
The performance difference versus hand-written code is 17%.
Aggregate decomposition reduces the difference to 6%.
[Chart 3: execution time on 1, 2, 4, and 8 nodes, Hand vs. Comp]
Experimental Results for TITAN
Evaluate the system’s ability to scale to larger datasets.
Use 8 data source nodes and 8 client nodes.
The execution time stays proportional to the amount of data to be retrieved and processed.
[Chart 4: execution time for dataset sizes of 228, 456, 684, and 912 MB, Hand vs. Comp]
Related Work
Data cubes
Runtime strategies for supporting reductions in a distributed
environment
Conclusions
A compiler-based system for supporting SQL-3 aggregate functions and select queries with a group-by operator on flat-file scientific datasets.
Both the extraction of the subset of interest and the aggregate computation can be expressed declaratively.
Using a meta-data descriptor to represent the layout of the dataset, our compiler generates an efficient data extraction service.
The compiler analyzes the user-defined aggregate function and generates code for a parallel environment.
Processing Remotely Sensed Data (AVHRR)
… are gathered to form an instantaneous field of view (IFOV)
A single file of IFOVs