29/1/2014 Efficient Updates for a Shared Nothing Analytics Platform Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris {katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr.

Post on 27-Mar-2015


10/04/23

Efficient Updates for a Shared Nothing Analytics Platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris
{katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr

Computing Systems Laboratory
National Technical University of Athens

Motivation

• Large volumes of data
  • Everyday life, science and business domain
• Time-series data
  • Temporally ordered, organized in hierarchies (Day < Month < Year)
    • E.g., date of a credit card purchase, time of a phone call
  • Important for monitoring a process of interest
• On-line processing
  • Fast retrieval: point, range, aggregate queries
  • Detection of real-time changes in trends
    • Intrusion or DoS detection, effects of a product's promotion
  • Online, cost-efficient updates

Up till now

• Data Warehouses
  • Centralized, off-line approaches
  • Distributed warehousing systems
    • Functionality remains centralized
• Distributed warehouse-like initiative: Brown Dwarf
  • Distribution of centralized Dwarf
  • Deployed on shared-nothing, commodity hardware
    • Scalability, fault tolerance, performance
  • No special consideration for time-series data
  • Update procedure costly → unfit for frequent updates

Our Goals

• Cloud-based, data-warehousing-like system
  • Targeted to time-series data
    • Arriving at high rate
  • Store, update, query data at various granularity levels
    • Multidimensional, hierarchical
  • Shared-nothing architecture
    • Commodity nodes
  • Without use of any proprietary tool
    • Java libraries, socket APIs

Our Contribution

• Complete system for multidimensional time-series data
  • Store with one pass
  • Update online
  • Query efficiently
    • Point, aggregate
    • Various levels of granularity
• Adaptive materialization
  • According to data recency
  • Accelerate cube creation/update
  • Minimize storage consumption

Dwarf

• Dwarf computes, stores, indexes and updates materialized cubes
• Eliminates prefix and suffix redundancies
• Any query (point or aggregate) is answered through traversal of the structure

Brown Dwarf

• Dwarf nodes mapped to overlay nodes
  • UID for each node
  • Hint tables of the form (currAttr, child)
• Insertion
  • One pass over the fact table
  • Gradual structuring of hint tables
• Queries
  • Overlay path of d hops
• Incremental updates
• Elasticity through adaptive mirroring
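The hint-table lookup behind a d-hop query can be sketched roughly as follows. This is a minimal single-process illustration, not the authors' implementation: the `OverlayNode` class, the `build` and `query` helpers, the `"ALL"` key for aggregate cells and the sample fact table are all assumptions made for this example. The sketch materializes every combination naively, so it models neither Dwarf's prefix/suffix coalescing nor real network messages between peers.

```python
from dataclasses import dataclass, field
from itertools import count, product

ALL = "ALL"        # assumed key for the aggregate cell of a dimension
_uids = count(1)   # hypothetical UID generator


@dataclass
class OverlayNode:
    """One Dwarf node placed on an overlay peer (illustrative only)."""
    uid: int = field(default_factory=lambda: next(_uids))
    # Hint table: currAttr -> child node, or the measure at the last level.
    hints: dict = field(default_factory=dict)


def build(facts):
    """One pass over the fact table; hint tables grow as tuples arrive."""
    root = OverlayNode()
    for *dims, measure in facts:
        # Each tuple contributes to every (value | ALL) combination.
        for path in product(*[(value, ALL) for value in dims]):
            node = root
            for attr in path[:-1]:
                if attr not in node.hints:
                    node.hints[attr] = OverlayNode()
                node = node.hints[attr]
            node.hints[path[-1]] = node.hints.get(path[-1], 0) + measure
    return root


def query(root, path):
    """Follow the hint tables; returns (aggregate, nodes contacted)."""
    node, hops = root, 1   # contacting the root counts as the first hop
    for attr in path[:-1]:
        node = node.hints[attr]
        hops += 1
    return node.hints[path[-1]], hops


facts = [("a1", "b1", "c1", 10), ("a1", "b2", "c1", 20), ("a2", "b1", "c2", 5)]
root = build(facts)
answer, hops = query(root, ("a1", ALL, "c1"))
```

With three dimensions, both point and aggregate queries contact exactly three nodes, which is the "d hops" property stated on the slide.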

Advantages and Drawbacks

• Store even larger amounts of data!
  • Dwarf reduces but may also blow up data
    • High-dimensional, sparse data: more than 1,000 times
• Handle many more requests
• Query the system online
• Accelerate creation (up to 5 times) and querying (up to 60 times)
  • Parallelization
• Update remains costly

Time Series Dwarf (TSD)

• A concept hierarchy characterizes time and any other dimension
• Updates are applied in temporal order
• Temporal granularity of queries is relative to the time of querying
  • More detailed queries for recent events
  • More coarse-grained queries for past events

TSD Operations - Insertion

• Time comes first in the dimension order
• No ALL cell in the Time dimension
• Aggregate created after completion of a level
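The insertion rules above can be illustrated with a small sketch. It assumes a two-level time hierarchy (sub-period within year) and a single further dimension; the `TimeSeriesCube` class, its field names and the sample tuples are hypothetical and only meant to show the idea that time is the leading key, that no aggregate over all of time is kept, and that a year-level aggregate appears only once that year is complete.

```python
from collections import defaultdict


class TimeSeriesCube:
    """Toy model of ordered insertion with deferred roll-ups (illustrative)."""

    def __init__(self):
        self.detail = defaultdict(int)  # (year, period, product) -> sum
        self.rollup = {}                # (year, product) -> sum, completed years only
        self.current_year = None

    def insert(self, year, period, product, amount):
        if self.current_year is not None and year < self.current_year:
            raise ValueError("updates must arrive in temporal order")
        if self.current_year is not None and year > self.current_year:
            # The previous year is complete: create its aggregate now.
            self._close_year(self.current_year)
        self.current_year = year
        self.detail[(year, period, product)] += amount

    def _close_year(self, year):
        for (y, _period, product), total in self.detail.items():
            if y == year:
                key = (year, product)
                self.rollup[key] = self.rollup.get(key, 0) + total


cube = TimeSeriesCube()
cube.insert(2012, 1, "P1", 10)
cube.insert(2012, 2, "P1", 5)
cube.insert(2013, 1, "P1", 7)   # first 2013 tuple closes 2012
```

After the third insert, `cube.rollup` holds an entry for 2012 only; the still-open year 2013 has detail cells but no aggregate yet.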

TSD Operations - Querying

• Follow a path along the structure
• Roll-up query for an aggregate already created
  • Within d hops (e.g., <Y1, ALL, P1>)
• Roll-up query for recent records
  • Initial query substituted by multiple lower-level queries (e.g., <Y2, S1, P1>)
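A rough sketch of the two query cases follows. It reads the slide's notation as <year, sub-period, product>, which is an assumption; the `rollup_query` function, the two dictionaries and the list of sub-periods are likewise hypothetical stand-ins for the distributed structure.

```python
def rollup_query(detail, rollups, periods, year, product):
    """Answer <year, ALL, product>; returns (answer, lookups issued)."""
    if (year, product) in rollups:
        # Aggregate already created: a single path through the structure.
        return rollups[(year, product)], 1
    # Recent records: substitute the query by one lower-level query per sub-period.
    total = 0
    for period in periods:
        total += detail.get((year, period, product), 0)
    return total, len(periods)


periods = ["S1", "S2"]
detail = {
    ("Y1", "S1", "P1"): 10,
    ("Y1", "S2", "P1"): 5,
    ("Y2", "S1", "P1"): 7,
}
rollups = {("Y1", "P1"): 15}   # Y1 is complete, Y2 is still in progress

past = rollup_query(detail, rollups, periods, "Y1", "P1")
recent = rollup_query(detail, rollups, periods, "Y2", "P1")
```

The query on the completed year needs one lookup, while the query on the in-progress year fans out into as many lookups as there are sub-periods.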

TSD Operations - Updating

• Insertion of a new tuple
  • Longest common prefix with the existing structure
  • Underlying nodes recursively updated
• Lack of ALL cell for Time + temporal ordering = fewer existing cells affected
  • Example: 3 TSD nodes vs. 12 Dwarf nodes affected
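The longest-common-prefix step can be sketched on a plain nested dictionary. This is a simplification under stated assumptions: `common_prefix_len` and `insert` are hypothetical helpers, the structure has no ALL cells at all, and the recursive update of underlying aggregate nodes mentioned above is not modeled.

```python
def common_prefix_len(root, dims):
    """Number of leading dimension values already present as a path."""
    node, depth = root, 0
    for value in dims:
        if not isinstance(node, dict) or value not in node:
            break
        node = node[value]
        depth += 1
    return depth


def insert(root, dims, measure):
    """Insert a tuple; returns how many cells had to be newly created."""
    shared = common_prefix_len(root, dims)
    node = root
    for value in dims[:-1]:
        node = node.setdefault(value, {})
    node[dims[-1]] = node.get(dims[-1], 0) + measure
    return len(dims) - shared


root = {}
created = [
    insert(root, ("Y1", "S1", "P1"), 10),  # empty structure: no shared prefix
    insert(root, ("Y1", "S1", "P2"), 4),   # shares <Y1, S1>
    insert(root, ("Y1", "S2", "P1"), 6),   # shares <Y1>
    insert(root, ("Y1", "S1", "P1"), 1),   # full path exists: update in place
]
```

Because updates arrive in temporal order, a new tuple tends to share a long time prefix with the most recent path, so few new cells are needed.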

Adaptive Materialization

• A daemon process asynchronously
  • creates roll-up views
  • deletes the corresponding drill-down ones
• The period of this process depends on the application
• Tradeoff: cube size vs. response accuracy
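One pass of such a background process might look like the sketch below. The function name `materialize_pass`, the `keep_recent` retention parameter and the dictionary layout are assumptions for illustration; the slides do not specify how the daemon selects which views to roll up, and a real deployment would run this periodically and asynchronously rather than as a direct call.

```python
def materialize_pass(detail, rollups, current_year, keep_recent=1):
    """Roll up years outside the retention window and drop their detail cells.

    detail:  (year, period, product) -> sum   (drill-down cells)
    rollups: (year, product) -> sum           (roll-up views)
    Returns the number of drill-down cells deleted.
    """
    cutoff = current_year - keep_recent   # years <= cutoff are considered old
    old_keys = [key for key in detail if key[0] <= cutoff]
    for year, period, product in old_keys:
        value = detail.pop((year, period, product))
        rollups[(year, product)] = rollups.get((year, product), 0) + value
    return len(old_keys)


detail = {
    (2012, "S1", "P1"): 10,
    (2012, "S2", "P1"): 5,
    (2013, "S1", "P1"): 7,
}
rollups = {}
deleted = materialize_pass(detail, rollups, current_year=2013)
```

After the pass, a query for a single sub-period of 2012 can no longer be answered exactly, which is the cube size vs. response accuracy tradeoff noted above.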

Experimental Evaluation

• 25 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory)
• Synthetic and real datasets
  • APB-1 Benchmark generator
    • 4-d, 3 levels for Time, various densities
  • DARPA Intrusion Detection audit data
    • 1M tuples, 7-d, 3 levels for Time
• TSD: static mode
• TSDad: adaptive mode

Cube Construction

• Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)
  • Lack of the ALL cell in the first dimension
• Acceleration of cube creation up to 89% compared to Dwarf
  • Better use of resources through parallelization (BD)
  • Further reduction due to lack of ALL and selective materialization

                        Size (MB)                 Time (sec)
Dataset  #Tuples   Dwarf   BD   TSD  TSDad   Dwarf   BD   TSD  TSDad
APB-A    1.2M         56   59    53      9     485  101   100     57
APB-B    2.5M        102  115    93     24     957  220   198    123
APB-C    3.7M        163  182   146     32    1530  321   289    167
DARPA    1.1M        178  191   156    127     614  222   208    189
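The quoted percentages can be checked against the table with a few lines of arithmetic. The slide does not state which baseline the 85% size figure refers to, so the sketch below computes the reduction against both Dwarf and BD; the helper name `reduction_pct` and the dictionary layout are illustrative only, while the numbers are copied from the table.

```python
def reduction_pct(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Cube size in MB and construction time in seconds, taken from the table.
size_mb = {
    "APB-A": {"Dwarf": 56, "BD": 59, "TSD": 53, "TSDad": 9},
    "APB-B": {"Dwarf": 102, "BD": 115, "TSD": 93, "TSDad": 24},
    "APB-C": {"Dwarf": 163, "BD": 182, "TSD": 146, "TSDad": 32},
    "DARPA": {"Dwarf": 178, "BD": 191, "TSD": 156, "TSDad": 127},
}
time_sec = {
    "APB-A": {"Dwarf": 485, "TSDad": 57},
    "APB-B": {"Dwarf": 957, "TSDad": 123},
    "APB-C": {"Dwarf": 1530, "TSDad": 167},
    "DARPA": {"Dwarf": 614, "TSDad": 189},
}

# Best size reduction of TSDad over the APB datasets, per baseline.
apb = ["APB-A", "APB-B", "APB-C"]
best_vs_dwarf = max(reduction_pct(size_mb[d]["Dwarf"], size_mb[d]["TSDad"]) for d in apb)
best_vs_bd = max(reduction_pct(size_mb[d]["BD"], size_mb[d]["TSDad"]) for d in apb)

# Best reduction in construction time of TSDad relative to Dwarf.
best_time = max(reduction_pct(t["Dwarf"], t["TSDad"]) for t in time_sec.values())

print(f"size vs Dwarf: {best_vs_dwarf:.1f}%, vs BD: {best_vs_bd:.1f}%, time: {best_time:.1f}%")
```

On these figures the size reduction comes out just under 85% against either baseline (APB-A), and the time reduction reaches about 89% on APB-C, consistent with the bullets above.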

Updates

• 10k updates
• TSD up to 3 times faster than Dwarf and 30% faster than BD
  • Ordered updates do not affect already created views
  • No recursive updates for the ALL cell of the first dimension → smaller communication overhead (3-fold reduction)
• TSDad does not include roll-up view creation (asynchronous) → further acceleration of ~20%

                Time (sec)              Msgs/update
Dataset  Dwarf   BD   TSD  TSDad    BD   TSD  TSDad
APB-A     1123  603   404    315    22     9      8
APB-B     1158  611   418    323    23    10      9
APB-C     1203  624   424    328    25    11      9
DARPA     1535  649   458    380    29    13      9
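As a sanity check, the ratios quoted in the bullets can be derived from the table. The sketch below is only arithmetic on the published numbers; the function names and the row layout are assumptions made for the example.

```python
# Rows: dataset -> (Dwarf, BD, TSD, TSDad) update times in seconds,
# followed by (BD, TSD, TSDad) messages per update, as in the table.
rows = {
    "APB-A": ((1123, 603, 404, 315), (22, 9, 8)),
    "APB-B": ((1158, 611, 418, 323), (23, 10, 9)),
    "APB-C": ((1203, 624, 424, 328), (25, 11, 9)),
    "DARPA": ((1535, 649, 458, 380), (29, 13, 9)),
}


def speedup_vs_dwarf(dataset):
    """How many times faster TSD is than Dwarf."""
    (dwarf, _bd, tsd, _tsdad), _msgs = rows[dataset]
    return dwarf / tsd


def gain_pct(slow, fast):
    """Relative time saved when going from `slow` to `fast`, in percent."""
    return 100.0 * (slow - fast) / slow


for name, ((dwarf, bd, tsd, tsdad), (m_bd, m_tsd, m_tsdad)) in rows.items():
    print(
        f"{name}: TSD {speedup_vs_dwarf(name):.2f}x vs Dwarf, "
        f"{gain_pct(bd, tsd):.0f}% vs BD, "
        f"TSDad {gain_pct(tsd, tsdad):.0f}% vs TSD, "
        f"msgs BD/TSDad {m_bd / m_tsdad:.1f}x"
    )
```

The per-dataset values cluster around the rounded figures on the slide: roughly 3 times faster than Dwarf, about 30% faster than BD, about 20% further gain for TSDad, and close to a 3-fold drop in messages per update.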

Queries

• DARPA 10k datasets: 3 kinds of querysets, 50% aggregates
  • Q1: Ideal
  • Q2: Recent records are queried upon in more detail (Zipfian)
  • Q3: Random
• As the queryset approximates a uniform distribution
  • Message cost increases
  • Accuracy decreases

              Time (sec)         Msgs/query     %Inaccurate   %Resp.
Queryset   BD   TSD  TSDad    BD   TSD  TSDad       queries  Deviation
Q1          5     6      6     7     7      7             0          0
Q2          5     9      8     7     9      9            15         19
Q3          5    24     21     7    32     32            33         32


Questions
