29/1/2014 Efficient Updates for a Shared Nothing Analytics Platform Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris {katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr.

10/04/23

Efficient Updates for a Shared Nothing Analytics Platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
{katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr

Computing Systems Laboratory
National Technical University of Athens

Motivation

• Large volumes of data
  - Everyday life, science and business domain
• Time-series data
  - Temporally ordered, organized in hierarchies (Day < Month < Year)
    • E.g., date of a credit card purchase, time of a phone call
  - Important for monitoring a process of interest
• On-line processing
  - Fast retrieval: point, range, aggregate queries
  - Detection of real-time changes in trends
    • Intrusion or DoS detection, effects of a product's promotion
  - Online, cost-efficient updates

Up till now

• Data Warehouses
  - Centralized, off-line approaches
  - Distributed warehousing systems
    • Functionality remains centralized
• Distributed warehouse-like initiative: Brown Dwarf
  - Distribution of the centralized Dwarf
  - Deployed on shared-nothing, commodity hardware
    • Scalability, fault tolerance, performance
  - No special consideration for time-series data
  - Update procedure costly → unfit for frequent updates

Our Goals

• Cloud-based data-warehousing-like system
  - Targeted to time-series data
    • Arriving at high rate
  - Store, update, query data at various granularity levels
    • Multidimensional, hierarchical
  - Shared-nothing architecture
    • Commodity nodes
  - Without use of any proprietary tool
    • Java libraries, socket APIs

Our Contribution

• Complete system for multidimensional time-series data
  - Store with one pass
  - Update online
  - Query efficiently
    • Point, aggregate
    • Various levels of granularity
• Adaptive materialization
  - According to data recency
  - Accelerate cube creation/update
  - Minimize storage consumption

Dwarf

• Dwarf computes, stores, indexes and updates materialized cubes

• Eliminates prefix and suffix redundancies

• Any query (point or aggregate) is answered through traversal of structure
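
To make the traversal idea concrete, here is a minimal sketch of a fully materialized cube kept as nested dictionaries, with an ALL cell at every level. It is not the Dwarf structure itself: no suffix coalescing is performed, so it only illustrates prefix sharing and how a point or aggregate query follows a single path. The names `build_cube`, `query` and `ALL`, and the sample tuples, are assumptions made for illustration.

```python
ALL = "ALL"  # hypothetical marker for the aggregate cell of a dimension


def build_cube(facts, num_dims):
    """Build a nested-dict cube in one pass; leaves hold summed measures."""
    root = {}
    for *dims, measure in facts:
        _insert(root, dims, measure, num_dims)
    return root


def _insert(node, dims, measure, remaining):
    # Each tuple contributes to its own key and to the ALL cell of this level.
    for key in (dims[0], ALL):
        if remaining == 1:
            node[key] = node.get(key, 0) + measure
        else:
            _insert(node.setdefault(key, {}), dims[1:], measure, remaining - 1)


def query(cube, path):
    """Answer a point or aggregate query by following one path of d keys."""
    node = cube
    for key in path:
        node = node.get(key)
        if node is None:
            return 0  # no such cell: nothing was aggregated under this path
    return node


# Made-up fact table: (store, customer, product, measure)
facts = [
    ("S1", "C1", "P1", 10),
    ("S1", "C2", "P1", 20),
    ("S2", "C1", "P2", 5),
]
cube = build_cube(facts, num_dims=3)
print(query(cube, ("S1", "C1", "P1")))  # point query: 10
print(query(cube, ("S1", ALL, "P1")))   # aggregate over customers: 30
print(query(cube, (ALL, ALL, ALL)))     # grand total: 35
```

Without suffix coalescing the ALL sub-trees are stored redundantly, which is exactly the redundancy the real structure is described as eliminating.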

Brown Dwarf

• Dwarf nodes mapped to overlay nodes
  - UID for each node
  - Hint tables of the form (currAttr, child)
• Insertion
  - One pass over the fact table
  - Gradual structure of hint tables
• Queries
  - Overlay path of d hops
• Incremental updates
• Elasticity through adaptive mirroring
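
The hint-table lookup can be sketched as follows, assuming each overlay node is identified by a UID and holds a table mapping an attribute value (currAttr) to the UID of a child node, or to the aggregate value at the last level. The dictionary below stands in for the network, and the UIDs, values and the `route` helper are hypothetical.

```python
ALL = "ALL"

# Hypothetical overlay: UID -> hint table {currAttr: child UID or measure}.
# With a single store, "S1" and ALL share the same child (node 2).
hint_tables = {
    1: {"S1": 2, ALL: 2},
    2: {"C1": 3, "C2": 4, ALL: 5},
    3: {"P1": 10, ALL: 10},
    4: {"P1": 20, ALL: 20},
    5: {"P1": 30, ALL: 30},
}


def route(hint_tables, root_uid, path):
    """Resolve a query by visiting one node per dimension (d hops).

    Returns (answer, list of visited UIDs); answer is None if a key is missing.
    """
    visited = []
    current = root_uid
    for key in path:
        visited.append(current)
        current = hint_tables[current].get(key)
        if current is None:
            return None, visited
    return current, visited


print(route(hint_tables, 1, ("S1", "C2", "P1")))  # (20, [1, 2, 4])
print(route(hint_tables, 1, ("S1", ALL, "P1")))   # (30, [1, 2, 5])
```

In a real deployment each lookup would be a message to another machine, so the length of the visited list corresponds to the d hops named above.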

Advantages and Drawbacks

• Store even larger amounts of data!
  - Dwarf reduces but may also blow up data
    • High-dimensional, sparse: >1,000 times
• Handle many more requests
• Query the system online
• Accelerate creation (up to 5 times) and querying (up to 60 times)
  - Parallelization

• Update remains costly

Time Series Dwarf (TSD)

• A concept hierarchy characterizes time and any other dimension

• Updates are applied in temporal order

• Temporal granularity of queries relative to the time of querying
  - More detailed queries for recent events
  - More coarse-grained queries for past events

TSD Operations - Insertion

• Time first in order

• Lack of ALL cell in Time

• Aggregate created after completion of a level
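
The three insertion rules above can be sketched under simplifying assumptions: a two-level time hierarchy (year, season), one further dimension (product), and tuples arriving in year order. The class name `TimeSeriesCube` and its methods are hypothetical and only illustrate the idea, not the actual TSD implementation.

```python
ALL = "ALL"


class TimeSeriesCube:
    """Toy cube with Time as the first dimension and no top-level ALL cell."""

    def __init__(self):
        # year -> {season or ALL -> {product or ALL -> total}}
        self.years = {}
        self.current_year = None

    def insert(self, year, season, product, measure):
        if self.current_year is not None and year < self.current_year:
            raise ValueError("tuples must arrive in temporal order")
        if year != self.current_year:
            if self.current_year is not None:
                # The previous year is complete: create its aggregate now.
                self._close_year(self.current_year)
            self.current_year = year
        cells = self.years.setdefault(year, {}).setdefault(season, {})
        for key in (product, ALL):
            cells[key] = cells.get(key, 0) + measure

    def _close_year(self, year):
        """Create the <year, ALL, *> cells from the season-level cells."""
        seasons = self.years[year]
        rollup = {}
        for cells in seasons.values():
            for key, total in cells.items():
                rollup[key] = rollup.get(key, 0) + total
        seasons[ALL] = rollup


cube = TimeSeriesCube()
cube.insert("Y1", "S1", "P1", 10)
cube.insert("Y1", "S2", "P1", 5)
cube.insert("Y2", "S1", "P1", 7)   # closes Y1 and creates <Y1, ALL, *>
print(cube.years["Y1"][ALL]["P1"])  # 15
print(ALL in cube.years["Y2"])      # False: Y2 is still open
print(ALL in cube.years)            # False: no ALL cell over Time
```

Only the year boundary is checked here; a fuller sketch would apply the same closing step at every level of the time hierarchy.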

TSD Operations - Querying

• Follow path along the structure
• Roll-up query for aggregate already created
  - Within d hops (e.g., <Y1, ALL, P1>)
• Roll-up query for recent records
  - Initial query substituted by multiple lower-level queries (e.g., <Y2, S1, P1>)
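
A small sketch of the two roll-up cases, assuming a year/season hierarchy in which the completed year Y1 already has its ALL cell while the current year Y2 does not. The `tsd` contents and the `rollup_query` helper are made up for illustration.

```python
ALL = "ALL"

# Hypothetical TSD contents: Y1 is complete (its ALL cell exists), Y2 is open.
tsd = {
    "Y1": {"S1": {"P1": 10}, "S2": {"P1": 5}, ALL: {"P1": 15}},
    "Y2": {"S1": {"P1": 7}, "S2": {"P1": 3}},
}


def rollup_query(tsd, year, product):
    """Answer <year, ALL, product>; returns (answer, number of lookups issued)."""
    seasons = tsd.get(year, {})
    if ALL in seasons:
        # Aggregate already materialized: a single path answers the query.
        return seasons[ALL].get(product, 0), 1
    # Recent data: substitute one lower-level query per stored season.
    answers = [cells.get(product, 0) for cells in seasons.values()]
    return sum(answers), len(answers)


print(rollup_query(tsd, "Y1", "P1"))  # (15, 1)
print(rollup_query(tsd, "Y2", "P1"))  # (10, 2)
```

The second call issues one lookup per season, which is why roll-up queries on recent records cost more messages than those on completed periods.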

TSD Operations - Updating

• Insertion of a new tuple

• Longest common prefix with existing structure

• Underlying nodes recursively updated

• Lack of ALL cell for Time + temporal ordering = fewer existing cells affected

• Example: 3 TSD nodes vs. 12 Dwarf nodes affected
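
The longest-common-prefix step can be illustrated on a plain nested-dictionary trie. This is a simplified sketch: the helper name and the sample structure are assumptions, and it does not reproduce the 3 vs. 12 node counts of the example above.

```python
def longest_common_prefix(trie, tuple_dims):
    """Count how many leading attributes of the tuple already exist in the trie."""
    node, depth = trie, 0
    for key in tuple_dims:
        if not isinstance(node, dict) or key not in node:
            break
        node = node[key]
        depth += 1
    return depth


# Hypothetical existing structure holding the single tuple <Y1, S1, P1>.
trie = {"Y1": {"S1": {"P1": 10}}}

print(longest_common_prefix(trie, ("Y1", "S1", "P2")))  # 2
print(longest_common_prefix(trie, ("Y1", "S2", "P1")))  # 1
print(longest_common_prefix(trie, ("Y2", "S1", "P1")))  # 0
```

Everything below the shared prefix has to be created or updated, so with temporally ordered input a tuple for a new period shares little with older data and leaves the completed periods untouched.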

Adaptive Materialization

• A daemon process asynchronously creates roll-up views and deletes the corresponding drill-down ones

• The period of this process depends on the application

• Tradeoff: cube size vs. response accuracy
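
One pass of such a daemon might look like the sketch below, assuming a year/season hierarchy and a caller that decides which years count as old. The function name `adaptive_pass` and the sample data are hypothetical; a real daemon would run a pass like this periodically in the background.

```python
ALL = "ALL"


def adaptive_pass(tsd, old_years):
    """Roll up each old year into its ALL cell, then drop the season-level cells."""
    for year in old_years:
        rollup = {}
        for season, cells in tsd[year].items():
            if season == ALL:
                continue  # skip an aggregate left by an earlier pass
            for product, total in cells.items():
                rollup[product] = rollup.get(product, 0) + total
        if rollup:
            tsd[year] = {ALL: rollup}  # the drill-down views are deleted here


# Hypothetical contents before the pass.
tsd = {
    "Y1": {"S1": {"P1": 10}, "S2": {"P1": 5}},
    "Y2": {"S1": {"P1": 7}},
}
adaptive_pass(tsd, old_years=["Y1"])
print(tsd["Y1"])  # {'ALL': {'P1': 15}}
print(tsd["Y2"])  # {'S1': {'P1': 7}}
```

After the pass a season-level query on Y1 can only be served from the coarser yearly aggregate, which is where the size versus accuracy tradeoff comes from.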

Experimental Evaluation

• 25 LAN commodity nodes (dual core, 2.0 GHz, 4GB main memory)

• Synthetic and real datasets
  - APB-1 Benchmark generator
    • 4-d, 3 levels for Time, various densities
  - DARPA Intrusion Detection audit data
    • 1M tuples, 7-d, 3 levels for Time

• TSD: static mode

• TSDad: adaptive mode

Cube Construction

• Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)
  - Lack of the ALL cell in the first dimension
• Acceleration of cube creation up to 89% compared to Dwarf
  - Better use of resources through parallelization (BD)
  - Further reduction due to lack of ALL and selective materialization

                      Size (MB)                 Time (sec)
Dataset  #Tuples   Dwarf   BD  TSD  TSDad    Dwarf   BD  TSD  TSDad
APB-A    1.2M         56   59   53      9      485  101  100     57
APB-B    2.5M        102  115   93     24      957  220  198    123
APB-C    3.7M        163  182  146     32     1530  321  289    167
DARPA    1.1M        178  191  156    127      614  222  208    189
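
As a quick check of the headline percentages, the reductions of TSDad relative to Dwarf can be recomputed from the table. The variable and function names are illustrative; the numbers are copied from the table above.

```python
# dataset -> (Dwarf, TSDad), values taken from the table
size_mb = {"APB-A": (56, 9), "APB-B": (102, 24), "APB-C": (163, 32), "DARPA": (178, 127)}
time_sec = {"APB-A": (485, 57), "APB-B": (957, 123), "APB-C": (1530, 167), "DARPA": (614, 189)}


def reduction_pct(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100 * (baseline - value) / baseline


size_cut = {name: reduction_pct(*pair) for name, pair in size_mb.items()}
time_cut = {name: reduction_pct(*pair) for name, pair in time_sec.items()}

for name in size_mb:
    print(f"{name}: size -{size_cut[name]:.1f}%, time -{time_cut[name]:.1f}%")
```

The largest size reduction comes out at about 84% (APB-A) and the largest creation-time reduction at about 89% (APB-C), in line with the rounded figures quoted in the bullets.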

Updates

• 10k updates

• TSD up to 3 times faster than Dwarf and 30% faster than BD
  - Ordered updates: do not affect already created views
  - No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction)
• TSDad does not include roll-up view creation (asynchronous) → further acceleration ~20%

               Time (sec)               Msgs/update
Dataset   Dwarf   BD  TSD  TSDad      BD  TSD  TSDad
APB-A      1123  603  404    315      22    9      8
APB-B      1158  611  418    323      23   10      9
APB-C      1203  624  424    328      25   11      9
DARPA      1535  649  458    380      29   13      9
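
The speedups quoted in the bullets can be derived from the timing columns of the table. The dictionary layout and names below are assumptions for illustration; the numbers are copied from the table.

```python
# dataset -> seconds for 10k updates, values taken from the table
updates = {
    "APB-A": {"Dwarf": 1123, "BD": 603, "TSD": 404, "TSDad": 315},
    "APB-B": {"Dwarf": 1158, "BD": 611, "TSD": 418, "TSDad": 323},
    "APB-C": {"Dwarf": 1203, "BD": 624, "TSD": 424, "TSDad": 328},
    "DARPA": {"Dwarf": 1535, "BD": 649, "TSD": 458, "TSDad": 380},
}

# How many times faster TSD is than Dwarf.
speedup_vs_dwarf = {n: r["Dwarf"] / r["TSD"] for n, r in updates.items()}
# Percentage of BD's time saved by TSD, and of TSD's time saved by TSDad.
gain_vs_bd = {n: 100 * (r["BD"] - r["TSD"]) / r["BD"] for n, r in updates.items()}
gain_adaptive = {n: 100 * (r["TSD"] - r["TSDad"]) / r["TSD"] for n, r in updates.items()}

for name in updates:
    print(f"{name}: x{speedup_vs_dwarf[name]:.2f} vs Dwarf, "
          f"{gain_vs_bd[name]:.0f}% vs BD, {gain_adaptive[name]:.0f}% TSDad vs TSD")
```

On these figures TSD is between roughly 2.8 and 3.4 times faster than Dwarf, saves about 29% to 33% of BD's time, and TSDad saves a further 17% to 23%.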

Queries

• DARPA 10k datasets: 3 kinds of querysets, 50% aggregates
  - Q1: Ideal
  - Q2: Recent records are queried upon in more detail (Zipfian)
  - Q3: Random
• As queryset approximates uniform distribution
  - Message cost increases
  - Accuracy decreases

              Time (sec)          Msgs/query       %Inaccurate   %Resp.
Queryset    BD  TSD  TSDad     BD  TSD  TSDad      queries       Deviation
Q1           5    6      6      7    7      7            0             0
Q2           5    9      8      7    9      9           15            19
Q3           5   24     21      7   32     32           33            32

Questions
