Visualization-Driven Data Aggregation
Presentation of the "M4" Research Paper [1]
Uwe Jugel, SAP SE
VLDB 2014, Hangzhou, China, Sep 4, 2014
[1] U. Jugel, Z. Jerzak, G. Hackenbroich, V. Markl. M4: A Visualization-Oriented Time Series Data Aggregation. Proceedings of the VLDB Endowment 7 (10), 797-808
Transcript
Motivation and Overview
Big Data causes Slow Visual Analytics
[Figure: big data -> data engine -> range query (visualization-related) -> raw data -> line chart of value over time, pixel width: 500 px]
Existing BI tools suffer from long transfer times caused by high-volume query results.
Consumption of raw data may cause slow rendering.
Visualization Scenario
time    value
1007    1234.12
1018    223.73
1026    197.01
1039    154.85
...     ...
High-Volume Sensor Data
- millions of records per day
- values present a continuous signal
- voltage, velocity, stock prices
- potentially large query results (100k+)
Line charts are most common and useful for continuous signals.
[Figure: line chart (focus), scatter plot, and bar chart, each showing value over time]
Potential Time Series Visualizations
Goals
1. Model data reduction as a query (SQL)
2. Preserve visual information: vis(Qreduce(data)) == vis(data)
Observation: Visualizations conduct an implicit data reduction by rendering data to pixels.

lineChart = (data) ->
  transform data to data_wh
  draw discrete line pixels for each two points in data_wh
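The implicit reduction in the observation above can be made concrete with a small sketch. It assumes a simple linear projection to a w x h pixel canvas and Bresenham's line algorithm as the rasterizer; the function names (`to_pixels`, `line_pixels`, `line_chart`) are illustrative and not taken from the slides. However many tuples go in, at most w*h pixels can come out.

```python
import math


def to_pixels(data, t1, t2, vmin, vmax, w, h):
    """Project (t, v) tuples to integer pixel coordinates on a w x h canvas."""
    pts = []
    for t, v in data:
        x = min(w - 1, int(w * (t - t1) / (t2 - t1)))
        y = min(h - 1, int(h * (v - vmin) / (vmax - vmin)))
        pts.append((x, y))
    return pts


def line_pixels(p, q):
    """Bresenham's line algorithm: the discrete pixels on the line from p to q."""
    (x0, y0), (x1, y1) = p, q
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    pixels = set()
    while True:
        pixels.add((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pixels


def line_chart(data, t1, t2, vmin, vmax, w, h):
    """Return the set of pixels lit by a line chart of data."""
    pts = to_pixels(data, t1, t2, vmin, vmax, w, h)
    lit = set()
    for p, q in zip(pts, pts[1:]):
        lit |= line_pixels(p, q)
    return lit


# 10,000 input tuples are rendered to a 20 x 10 canvas.
data = [(t, math.sin(t / 50.0)) for t in range(10000)]
lit = line_chart(data, 0, 9999, -1.0, 1.0, 20, 10)
print(len(data), "tuples ->", len(lit), "pixels")
```

The pixel set is the only thing the viewer sees, which is why a reduced data set that lights the same pixels loses no visual information.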
Solution: Visualization-Driven Data Aggregation
[Figure: data flow between Visualization Client, Query Rewriter, and RDBMS. Labels: selected time range; query + visualization parameters; reduction query; data; data reduction; data-reduced query result]
Solution Architecture
Research Task: Find a data aggregation model that simulates the rasterization process!
Existing approaches (averaging, sampling, line simplification, etc.) cannot reproduce the original visualization of the raw data.
[Figure: vis(data): original; vis(avg(data)): very lossy; vis(minmax(data)): lossy]
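Why averaging is lossy can be seen on a toy example. The sketch below groups tuples by pixel column and compares a per-column average with a per-column min/max. The helper names and the floor-based column key are assumptions made for illustration, not definitions from the slides.

```python
def column_key(t, t1, t2, w):
    """Map a timestamp to one of w pixel columns (floor-based, an assumption)."""
    return min(w - 1, int(w * (t - t1) / (t2 - t1)))


def group_by_column(data, t1, t2, w):
    """Collect the values of all tuples that fall into the same pixel column."""
    groups = {}
    for t, v in data:
        groups.setdefault(column_key(t, t1, t2, w), []).append(v)
    return groups


def avg_reduce(data, t1, t2, w):
    """One averaged value per pixel column."""
    return {k: sum(vs) / len(vs) for k, vs in group_by_column(data, t1, t2, w).items()}


def minmax_reduce(data, t1, t2, w):
    """The (min, max) value pair per pixel column."""
    return {k: (min(vs), max(vs)) for k, vs in group_by_column(data, t1, t2, w).items()}


# A flat signal with one positive and one negative spike, drawn into 2 columns.
data = [(0, 0.0), (1, 10.0), (2, 0.0), (3, 0.0),
        (4, -10.0), (5, 0.0), (6, 0.0), (7, 0.0)]
print(avg_reduce(data, 0, 8, 2))     # spikes are smoothed away
print(minmax_reduce(data, 0, 8, 2))  # vertical extent of each column is kept
```

Averaging replaces the spike of height 10 with a value of 2.5, so the rendered line never reaches the pixels the raw data would light. Min/max keeps the vertical extent of each column but, on its own, does not record where the line enters and leaves the column.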
M4 Principle
[Figure: a) vis(T), pixel columns 1 to 4; b) vis(MinMax(T)), with errors E1, E2, E3; c) vis(MinMaxFirstLast(T)), i.e. "M4"]
Analysis + Elimination of the Remaining Errors
M4: Data Aggregation for Perfect Line Charts
vis(M4(data)) == vis(data): the line chart of the M4 result (lossless) equals that of the original data.
Input: big time series data
Parameters: width, original query
Output: "perfect" data subset
Theorem 1. "vis(T) == vis(M4(T))"
Theorem 2. "error-free line chart from 4*w tuples"
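A minimal in-memory sketch of the aggregation named above: for every pixel column, keep the tuples with the minimum value, the maximum value, the first timestamp, and the last timestamp. The function name `m4`, the floor-based column key, and the choice of exactly one tuple per extreme are assumptions of this sketch; the SQL in the slides uses round() for the key and keeps every tuple that matches an extreme.

```python
def m4(data, t1, t2, w):
    """Select min, max, first, and last tuple for each of w pixel columns."""
    groups = {}
    for t, v in data:
        k = min(w - 1, int(w * (t - t1) / (t2 - t1)))  # pixel-column key
        groups.setdefault(k, []).append((t, v))

    selected = set()
    for rows in groups.values():
        selected.add(min(rows, key=lambda r: r[1]))  # tuple with minimum value
        selected.add(max(rows, key=lambda r: r[1]))  # tuple with maximum value
        selected.add(min(rows, key=lambda r: r[0]))  # first tuple in the column
        selected.add(max(rows, key=lambda r: r[0]))  # last tuple in the column
    return sorted(selected)


# 10,000 tuples reduced for a chart that is 50 pixels wide.
data = [(t, (t * 37) % 101) for t in range(10000)]
reduced = m4(data, 0, 9999, 50)
print(len(data), "tuples ->", len(reduced), "tuples")
```

Because each column contributes at most four tuples, the result never exceeds 4*w tuples, independent of the input size.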
Query Rewriting Template

WITH Q AS (SELECT t,v FROM sensors                 -- 1) original query Q
           WHERE id = 1 AND t >= $t1 AND t <= $t2),
     QC AS (SELECT count(*) c FROM Q)              -- 2) cardinality query QC
SELECT * FROM Q WHERE (SELECT c FROM QC) <= 800    -- 3a) use Q if low card.
UNION
SELECT * FROM (                                    -- 3b) use QD if high card.
  -- reduction query QD: compute aggregates for each pixel-column
  SELECT t,v FROM Q JOIN
    (SELECT round($w*(t-$t1)/($t2-$t1)) as k,      --define key
            min(v) as v_min, max(v) as v_max,      --get min,max
            min(t) as t_min, max(t) as t_max       --get 1st,last
     FROM Q GROUP BY k) as QA                      --group by k
  ON k = round($w*(t-$t1)/($t2-$t1))               --join on k
     AND (v = v_min OR v = v_max OR                --&(min|max|
          t = t_min OR t = t_max)                  -- 1st|last)
) AS QD WHERE (SELECT c FROM QC) > 800
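The template can be tried out with a small script. The sketch below fills the template for a given query, time range, and width, and runs it on an in-memory SQLite database. The `rewrite` helper, the `sensors` test data, and the use of SQLite are assumptions for illustration; the slides do not say how the rewriter is implemented. One adaptation is SQLite-specific: the key expression is multiplied by 1.0 so that integer timestamps are not subject to integer division.

```python
import sqlite3

# The slide's template with placeholders; column references are qualified
# and the key is forced to floating point for SQLite.
TEMPLATE = """
WITH Q AS ({q}),
     QC AS (SELECT count(*) AS c FROM Q)
SELECT t, v FROM Q WHERE (SELECT c FROM QC) <= {limit}
UNION
SELECT t, v FROM (
  SELECT Q.t AS t, Q.v AS v FROM Q JOIN
    (SELECT round({w} * 1.0 * (t - {t1}) / ({t2} - {t1})) AS k,
            min(v) AS v_min, max(v) AS v_max,
            min(t) AS t_min, max(t) AS t_max
     FROM Q GROUP BY k) AS QA
  ON QA.k = round({w} * 1.0 * (Q.t - {t1}) / ({t2} - {t1}))
     AND (Q.v = QA.v_min OR Q.v = QA.v_max OR
          Q.t = QA.t_min OR Q.t = QA.t_max)
) AS QD WHERE (SELECT c FROM QC) > {limit}
ORDER BY t
"""


def rewrite(query, t1, t2, w, limit=800):
    """Wrap an original query in the reduction template (hypothetical helper)."""
    return TEMPLATE.format(q=query, t1=int(t1), t2=int(t2),
                           w=int(w), limit=int(limit))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensors (id INTEGER, t INTEGER, v REAL)")
conn.executemany(
    "INSERT INTO sensors VALUES (1, ?, ?)",
    [(t, float((t * 7919) % 10007)) for t in range(5000)],  # synthetic values
)

original = "SELECT t, v FROM sensors WHERE id = 1 AND t >= 0 AND t <= 4999"
raw = conn.execute(original).fetchall()
reduced = conn.execute(rewrite(original, 0, 4999, 100)).fetchall()
print(len(raw), "rows ->", len(reduced), "rows")
```

Since round() maps timestamps to keys 0 through w, this variant produces up to w + 1 groups, and values that tie for a column's minimum or maximum are all returned. The default threshold of 800 is copied from the slide.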
Evaluation
Performance Measurements
[Figure: two bar charts comparing the queries base, paa, round, random, first, minmax, and M4]
Main Cost Factors
baseline query: DB-out network bandwidth (no query cost, high transfer cost)
data reduction queries: query execution time and in-DB memory bandwidth (low query cost, no additional transfer cost)
Performance with Increasing Data Volume
[Figure: total time (s), 0 to 80, over number of rows, 1,000,000 to 3,000,000; annotation: t < 5s]
near-interactive response times
Conclusion
Respect the application!
1. Take query + presentation parameters
2. Rewrite query (SQL)
3. Fast, in-DB processing
Faster Visual Analytics
No information loss
+ 10x speed
+ 100x bandwidth savings
Image source by cybaea, https://www.flickr.com/photos/cybaea/54679441/