
[ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Feb 07, 2017

Transcript
Page 1: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Implement a scalable statistical aggregation system using Akka

Scala by the Bay, 12 Nov 2016

Stanley Nguyen, Vu Ho

Email Security@Symantec Singapore

Page 2: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

The system

Provides a service to answer time-series analytical questions such as COUNT, TOPK, SET MEMBERSHIP, and CARDINALITY on a dynamic set of data streams, using a statistical approach.

Page 3: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Motivation

The system collects data from multiple sources in a streaming log format.

Some common questions in an Email Anti-Abuse system:

Most frequent items (IP, domain, sender, etc.)

Number of unique items

Have we seen an item before?

=> We need to be able to answer such questions in a timely manner.

Page 4: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Data statistics

6K email logs/second.

One email log is flattened out into sub-events:

IP, sender, sender domain, etc.

Time period (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)

Total: ~200K messages/second.

Page 5: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Challenges

Our system needs to be: responsive, space efficient, reactive, extensible, scalable, resilient.

Page 6: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Sketching data structures

How many times have we seen a certain IP? Count Min Sketch (CMS): Counting things + TopK

How many unique senders have we seen yesterday? HyperLogLog (HLL): Set cardinality

Did we see a certain IP last month? Bloom Filter (BF): Set membership

SPACE / SPEED trade-off

Page 7: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

What is available: the data structures for finding cardinality (i.e. counting things), set membership, and top-k elements – solved by using streamlib / Twitter Algebird.

What we try to solve: a dynamic, reactive, distributed system for answering cardinality, set membership, and top-k queries.

Page 8: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Sketching data structures

Initialize a streamlib count-min instance

Update the internal count-min instance
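
A minimal Scala sketch (not the speakers' actual slide code) of how such an instance could be initialised and updated, assuming stream-lib's CountMinSketch API; the eps/confidence values are illustrative.

// Count how many times each IP has been seen, approximately.
import com.clearspring.analytics.stream.frequency.CountMinSketch

class IpCounter {
  // Initialize a stream-lib count-min instance:
  // eps = relative error bound, confidence = probability the bound holds.
  private val cms = new CountMinSketch(0.0001, 0.99, /* seed = */ 1)

  // Update the internal count-min instance for every incoming sub-event.
  def update(ip: String, count: Long = 1L): Unit = cms.add(ip, count)

  // Answer "how many times have we seen this IP?" (approximate, never underestimates).
  def estimate(ip: String): Long = cms.estimateCount(ip)
}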

Page 9: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Responsive, space efficient, reactive, extensible, scalable, resilient

Page 10: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Akka Actor

[Diagram: Producers feed a MasterActor, which routes events to per-metric actors: CMS IpActor, BF DomainActor, HLL SenderActor. BACK PRESSURE?]

Page 11: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Akka Stream GraphDSL

[Diagram: a processing graph built from flow-shape nodes]

Page 12: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Using GraphDSL

Events are tuples of (msg-type, @timestamp, key, value).
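
A minimal sketch of the GraphDSL wiring (not the slide's exact code), assuming an Event case class for the (msg-type, @timestamp, key, value) tuple and Akka 2.4/2.5-era APIs: a fixed graph that fans events out to per-metric branches.

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}

object FixedGraph extends App {
  implicit val system: ActorSystem = ActorSystem("agg")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  final case class Event(msgType: String, timestamp: Long, key: String, value: Long)

  val events: Source[Event, NotUsed] =
    Source(List(Event("ip", 0L, "1.2.3.4", 1L), Event("sender", 0L, "a@b.com", 1L)))

  // Placeholder flows standing in for the CMS / BF / HLL update logic.
  def branch(t: String): Flow[Event, Event, NotUsed] = Flow[Event].filter(_.msgType == t)

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Event](3))
    events ~> bcast
    bcast ~> branch("ip")     ~> Sink.foreach[Event](e => println(s"CMS <- $e"))
    bcast ~> branch("sender") ~> Sink.foreach[Event](e => println(s"HLL <- $e"))
    bcast ~> branch("domain") ~> Sink.foreach[Event](e => println(s"BF  <- $e"))
    ClosedShape
  })

  graph.run()
}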

Page 13: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

GraphDSL - Limitations

The graph topology is FIXED at materialization time; producers and consumers cannot be added or removed while the stream is running.

Page 14: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Our design – Dynamic stream

Page 15: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Merge Hub

Provided by Akka Stream: allows a dynamic set of TCP producers.
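
A minimal sketch of the MergeHub pattern (names and the String payload are illustrative, not the talk's code): the consumer side is run once, and its materialized Sink can then be attached to by any number of producers at run time, e.g. one per incoming TCP connection.

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{MergeHub, Sink, Source}

object DynamicProducers extends App {
  implicit val system: ActorSystem = ActorSystem("agg")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Run the consumer once; the materialized value is a Sink that new
  // producers can be attached to dynamically.
  val toAggregator: Sink[String, NotUsed] =
    MergeHub.source[String](perProducerBufferSize = 16)
      .to(Sink.foreach[String](println))
      .run()

  // Each new producer just runs into that Sink.
  Source(List("ip,1.2.3.4", "sender,a@b.com")).runWith(toAggregator)
}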

Page 16: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Splitter Hub

Splits the stream, based on event type, to a dynamic set of downstream consumers.

Consumers are actors which implement the CMS, BF, HLL, etc. logic.

Not available in akka-stream.

Page 17: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Splitter Hub API

Similar to the built-in akka-stream BroadcastHub, but with a different back-pressure implementation.

[[SplitterHub]].source can be supplied with a predicate/selector function to return a filtered subset of the data.

[Diagram: Splitter -> selector -> Consumer]
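
SplitterHub itself is the speakers' custom stage and its code is not shown here. As a rough stand-in built only from akka-stream primitives, BroadcastHub plus a per-consumer filter gives the same selector-style API, although, as noted above, the back-pressure behaviour differs (BroadcastHub delivers every element to every consumer and filters afterwards). All names and the fake event stream below are assumptions.

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{BroadcastHub, Keep, Sink, Source}
import scala.concurrent.duration._

object SelectorSources extends App {
  implicit val system: ActorSystem = ActorSystem("agg")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  final case class Event(msgType: String, timestamp: Long, key: String, value: Long)

  // A live stream of fake events (one every 100 ms, alternating message types).
  val upstream: Source[Event, _] =
    Source.tick(0.seconds, 100.millis, ()).zipWithIndex.map { case (_, i) =>
      if (i % 2 == 0) Event("ip", i, s"10.0.0.${i % 5}", 1L)
      else            Event("sender", i, s"user$i@example.com", 1L)
    }

  // Materialize the hub once; the resulting Source can be materialized any
  // number of times, one per consumer, even while the stream is running.
  val hubSource: Source[Event, NotUsed] =
    upstream.toMat(BroadcastHub.sink[Event](bufferSize = 256))(Keep.right).run()

  // Equivalent of SplitterHub.source(selector): each consumer only sees its subset.
  def source(selector: Event => Boolean): Source[Event, NotUsed] =
    hubSource.filter(selector)

  source(_.msgType == "ip").runWith(Sink.foreach(e => println(s"CMS consumer saw $e")))
}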

Page 18: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Splitter Hub’s Implementation

Page 19: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Splitter Hub

The [[Source]] can be materialized any number of times: each materialization creates a new consumer which is registered with the hub and then receives the items matching its selector function from the upstream.

Consumers can be added at run time.

Page 20: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Consumers

Can be either local or remote; managed by the coordination actor.

Each consumer implements a specific data structure (CMS/BF/HLL) for a particular event type over a specific time range.

Responsibilities:

Answer a specific query.

Regularly persist a serialization of the internal data structure (count-min table, etc.).

[Diagram: SplitterHub feeds CMSConsumer and HLLConsumer; the coordination actor forwards a COUNT-QUERY to the consumer's ref; consumers write DB snapshots.]
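
A minimal consumer sketch (the message protocol and names are assumptions, not the speakers' code): one actor owns a stream-lib CountMinSketch for a single event type and time range, updates it from the stream, and answers count queries forwarded by the coordination actor.

import akka.actor.{Actor, Props}
import com.clearspring.analytics.stream.frequency.CountMinSketch

object CmsConsumer {
  final case class Update(key: String, count: Long)
  final case class CountQuery(key: String)
  final case class CountAnswer(key: String, estimate: Long)
  def props(eps: Double = 0.0001, confidence: Double = 0.99): Props =
    Props(new CmsConsumer(eps, confidence))
}

class CmsConsumer(eps: Double, confidence: Double) extends Actor {
  import CmsConsumer._
  private val cms = new CountMinSketch(eps, confidence, 1)

  def receive: Receive = {
    case Update(key, count) => cms.add(key, count)
    case CountQuery(key)    => sender() ! CountAnswer(key, cms.estimateCount(key))
  }
}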

Page 21: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Responsive, space efficient, reactive, extensible, scalable, resilient

Page 22: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Scaling out: What if the data does not fit in one machine? What if a server crashes? How do we maintain back pressure end-to-end?

Page 23: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Scaling out

Page 24: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Akka Stream TCP

Handled by the kernel (back-pressure, reliable).

For each worker, we create a source for each message type it is responsible for, using the SplitterHub source() API.

Connect each source to a TCP connection and send it to the worker.

Backpressure is maintained across the network.

[Diagram: SplitterHub ~> IP / domain sources ~> TCP.bind() ~> workers]
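
A minimal sketch of this wiring (the worker address, port, and wire format are assumptions, and the transcript does not say which side opens the connection; this sketch connects out from the master side): one per-message-type source is encoded and pushed through Tcp().outgoingConnection, so the TCP window back-pressures the upstream across the network.

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Framing, Sink, Source, Tcp}
import akka.util.ByteString

object ShipToWorker extends App {
  implicit val system: ActorSystem = ActorSystem("agg")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  final case class Event(msgType: String, timestamp: Long, key: String, value: Long)

  // Stand-in for a SplitterHub source carrying only "ip" events.
  val ipSource: Source[Event, NotUsed] =
    Source(List(Event("ip", 0L, "1.2.3.4", 1L), Event("ip", 1L, "5.6.7.8", 1L)))

  // Newline-delimited wire format; the worker side would use the same framing.
  val encode =
    Flow[Event].map(e => ByteString(s"${e.msgType},${e.timestamp},${e.key},${e.value}\n"))

  val connection = Tcp().outgoingConnection("worker-1.internal", 9000) // assumed address

  ipSource
    .via(encode)
    .via(connection)                                // back-pressured by the TCP window
    .via(Framing.delimiter(ByteString("\n"), 1024)) // frame any replies from the worker
    .map(_.utf8String)
    .runWith(Sink.foreach(println))
}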

Page 25: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Master-Worker communication

Page 26: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Master Failover

The Coordinator is a single point of failure.

Run multiple Coordinator actors as a Cluster Singleton (only one is active at a time).

Workers communicate with the master (heartbeat) using Cluster Client.
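
A minimal wiring sketch for this setup (actor names, paths, and the heartbeat message are assumptions, not the speakers' code): the Coordinator registers with the Cluster Client receptionist and runs under a ClusterSingletonManager, while worker nodes reach whichever node currently hosts it through a ClusterClient.

import akka.actor.{Actor, ActorPath, ActorSystem, PoisonPill, Props}
import akka.cluster.client.{ClusterClient, ClusterClientReceptionist, ClusterClientSettings}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings}

// The Coordinator registers itself so external worker nodes can reach it by path.
class Coordinator extends Actor {
  ClusterClientReceptionist(context.system).registerService(self)
  def receive: Receive = {
    case "heartbeat" => // record that this worker is still alive (omitted)
  }
}

object MasterFailoverWiring {
  // On each master node: the manager guarantees a single live Coordinator cluster-wide.
  def startCoordinator(system: ActorSystem): Unit =
    system.actorOf(
      ClusterSingletonManager.props(
        singletonProps     = Props[Coordinator],
        terminationMessage = PoisonPill,
        settings           = ClusterSingletonManagerSettings(system)),
      name = "coordinator")

  // On each worker node: a ClusterClient pointed at the master receptionists,
  // used e.g. to send periodic heartbeats to the singleton.
  def startHeartbeat(system: ActorSystem, initialContacts: Set[ActorPath]): Unit = {
    val client = system.actorOf(
      ClusterClient.props(ClusterClientSettings(system).withInitialContacts(initialContacts)),
      name = "clusterClient")
    client ! ClusterClient.Send("/user/coordinator/singleton", "heartbeat", localAffinity = false)
  }
}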

Page 27: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Worker Failover

Workers persist all events to a DB journal + snapshot, using Akka Persistence with Redis as the journal + snapshot store.

When a worker goes down, its keys are re-distributed and the master redirects traffic to the other workers.

CMS actors are restored on the new worker from snapshot + journal.
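
A minimal Akka Persistence sketch of such a recoverable CMS actor (the event type, snapshot interval, and query message are assumptions; the Redis journal/snapshot store would be configured separately, e.g. via a community akka-persistence-redis plugin): events are journaled and the serialized count-min table is snapshotted regularly, so the actor can be restored on another worker from snapshot + journal.

import akka.persistence.{PersistentActor, SnapshotOffer}
import com.clearspring.analytics.stream.frequency.CountMinSketch

final case class Observed(key: String, count: Long) // journaled event (assumed shape)

class PersistentCmsActor(override val persistenceId: String) extends PersistentActor {
  private var cms = new CountMinSketch(0.0001, 0.99, 1)
  private var eventsSinceSnapshot = 0

  // Recovery: restore the sketch from the latest snapshot, then replay journaled events.
  override def receiveRecover: Receive = {
    case SnapshotOffer(_, bytes: Array[Byte]) => cms = CountMinSketch.deserialize(bytes)
    case Observed(key, count)                 => cms.add(key, count)
  }

  override def receiveCommand: Receive = {
    case obs: Observed =>
      persist(obs) { e =>
        cms.add(e.key, e.count)
        eventsSinceSnapshot += 1
        if (eventsSinceSnapshot >= 10000) {      // snapshot "regularly"
          saveSnapshot(CountMinSketch.serialize(cms))
          eventsSinceSnapshot = 0
        }
      }
    case ("estimate", key: String) => sender() ! cms.estimateCount(key)
  }
}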

Page 28: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Benchmark

Akka-stream on a single node: 100K+ msg/second (one msg-type)

Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)

Akka-stream on a remote node (remote TCP) with akka persistent journal: 2000+ msg/second (one msg-type)

Page 29: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Conclusion

Our system is: responsive, reactive, scalable, resilient.

Future work: make workers metric-agnostic; scale out the master; exactly-once delivery for workers; more flexible filtering using SplitterHub.

Page 30: [ScalaByTheBay2016] Implement a scalable statistical aggregation system using Akka

Q&A