Samza memory capacity_2015_ieee_big_data_data_quality_workshop

A Memory Capacity Model for High Performing Data-filtering Applications in Samza Framework


Tao Feng, Zhenyun Zhuang, Yi Pan, Haricharan Ramachandra
LinkedIn Corp

Agenda

•  Introduction
•  Memory capacity model
•  Evaluation
•  Summary


INTRODUCTION


What Is Samza


[Figure: Samza architecture. Labels: Container; Task 1, Task 2, Task 3; Input Stream; Output Stream; Changelog Stream; Local state store; Checkpoint]

Samza-based Data Filtering Systems

•  Two main scenarios
   •  Data filtering by rules
   •  Data filtering by joining streams

MEMORY CAPACITY MODEL


Motivation

•  We need an accurate resource prediction model for better capacity planning

•  We could have more containers within a single node
   •  Higher density without SLA violation
   •  Lower business cost


Memory Capacity Model

•  L = T*P*E*(B + Bk + Bm)
   •  L: live data set size
   •  T: number of input topics
   •  P: number of partitions per topic
   •  E: number of unique entries per partition
   •  B: bytes per TreeMap entry
   •  Bk: bytes of key serialization
   •  Bm: bytes of value message serialization

•  Required heap size: 1H = 2*L
•  Details of the proof can be found in our paper
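The model above is simple enough to express as two small functions. The following is a minimal Python sketch, not code from the paper; the function names and the sample inputs are hypothetical and chosen only for illustration.

```python
def live_data_set_bytes(topics, partitions_per_topic, entries_per_partition,
                        entry_bytes, key_bytes, value_bytes):
    """L = T * P * E * (B + Bk + Bm), in bytes."""
    per_entry = entry_bytes + key_bytes + value_bytes
    return topics * partitions_per_topic * entries_per_partition * per_entry


def required_heap_bytes(live_bytes):
    """1H = 2 * L, the heap size given by the model."""
    return 2 * live_bytes


# Hypothetical inputs, used only to exercise the formula.
live = live_data_set_bytes(topics=2, partitions_per_topic=4,
                           entries_per_partition=1_000_000,
                           entry_bytes=40, key_bytes=16, value_bytes=64)
heap = required_heap_bytes(live)
print(f"L = {live / 1e9:.2f} GB, 1H = {heap / 1e9:.2f} GB")
```

Keeping L and 1H as separate functions makes it easy to reuse L when comparing other heap multiples.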


EVALUATION


Test Setup


[Figure: test setup. Labels: Kafka Clusters (brokers); Test System with containers 0, 1 … N]

•  Test system config
   •  24 cores
   •  1 Gbps NIC
   •  45 GB memory

•  JVM options
   •  UseG1GC
   •  G1HeapRegionSize = 4M
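For readers who want to reproduce the JVM settings, the two G1 options listed for the test system map onto standard HotSpot flags. The Python sketch below assembles them into a command-line string. The helper name is hypothetical, and setting the heap with `-Xmx` is an assumption, since the slides do not say how the heap size was passed to the containers.

```python
def g1_jvm_options(heap_gb, region_size_mb=4):
    """Build HotSpot flags for a G1 heap of heap_gb gigabytes."""
    return [
        f"-Xmx{heap_gb}g",                          # assumption: max heap set via -Xmx
        "-XX:+UseG1GC",                             # enable the G1 collector
        f"-XX:G1HeapRegionSize={region_size_mb}m",  # 4M regions, as in the test setup
    ]


# heap_gb=7 is a placeholder value for illustration.
opts = " ".join(g1_jvm_options(heap_gb=7))
print(opts)
```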

Evaluation Methodology

•  First, we derive the heap size from the model as 1H
   •  e.g., with T = 1, P = 8, E = 5 million, B = 40 bytes, Bk = 24 bytes, Bm = 24 bytes: 1H = 2*L = 2*T*P*E*(B + Bk + Bm) ≈ 7 GB

•  Second, we compare Samza job throughput and system performance metrics (GC time, CPU time) against the 2H and 3H cases
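Written out step by step, the 1H example works as follows. This only expands the arithmetic stated on the slide; the rounding to 7 GB assumes decimal gigabytes.

```latex
\begin{align*}
L &= T \cdot P \cdot E \cdot (B + B_k + B_m) \\
  &= 1 \cdot 8 \cdot (5 \times 10^{6}) \cdot (40 + 24 + 24)\ \text{bytes} \\
  &= 3.52 \times 10^{9}\ \text{bytes} \\
\text{1H} &= 2L = 7.04 \times 10^{9}\ \text{bytes} \approx 7\ \text{GB}
\end{align*}
```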


Performance Results


Performance Results (cont.)


Performance Results (cont.)


GC type          Metric            1H      2H      3H
Young GC of G1   Count             88      29      32
Young GC of G1   Total time (ms)   9850    5063    6144
Mixed GC of G1   Count             24      0       0
Mixed GC of G1   Total time (ms)   70166   0       0
Total            Count             112     29      31
Total            Total time (ms)   80117   5063    6144

•  No full GC involved in the 1H case
•  Expected higher CPU time and GC time for the 1H case
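To put the GC overhead of the 1H case in proportion, the total GC times from the table can be compared directly. The Python sketch below uses the table's reported totals as-is; the dictionary layout and function name are hypothetical.

```python
# GC totals (count, total time in ms) as reported in the table.
gc_totals = {
    "1H": {"count": 112, "time_ms": 80117},
    "2H": {"count": 29, "time_ms": 5063},
    "3H": {"count": 31, "time_ms": 6144},
}


def gc_time_ratio(baseline, other):
    """Ratio of total GC time between two heap configurations."""
    return gc_totals[baseline]["time_ms"] / gc_totals[other]["time_ms"]


for heap in ("2H", "3H"):
    print(f"1H spends {gc_time_ratio('1H', heap):.1f}x the GC time of {heap}")
```

With the reported totals, the 1H case spends roughly 15.8x the GC time of 2H and 13.0x that of 3H, almost all of it in mixed collections.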

Summary

•  The model accurately predicts Samza memory usage and lets Samza jobs meet their SLA without significant violations

•  With accurate memory estimation, it allows 2X denser Samza container deployments on the same node


Q & A

