Alex Poon VP of Engineering Storm @ Visual Revenue (an Outbrain Company)

Open analytics meetup alex poon (1)

Jul 07, 2015

Transcript
Page 1: Open analytics meetup   alex poon (1)

Alex Poon VP of Engineering

Storm @ Visual Revenue (an Outbrain Company)

Page 2: Open analytics meetup   alex poon (1)

Who are we?

Page 3: Open analytics meetup   alex poon (1)

What do we do?

[Architecture diagram; labels: Customer Traffic, Web Servers, Kafka, Storm, Data Transform/Aggregation, Databases, Dashboard, Algo, Automation]

•  14B page views per month

•  At peak, 8000-10000 per sec

•  Deployed Storm to production ~ 1 month ago

•  Storm cluster of ~50 instances on AWS
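As a rough sanity check on these figures, the monthly volume can be converted to an average per-second rate and compared with the quoted peak. This is a back-of-the-envelope sketch only: the 30-day month is an assumption that the slides do not state, and the function name is illustrative.

```python
# Back-of-the-envelope check; the 30-day month is an assumption.
PAGE_VIEWS_PER_MONTH = 14_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumed 30-day month


def average_rate_per_sec(views_per_month: int, seconds: int = SECONDS_PER_MONTH) -> float:
    """Average page views per second over the month."""
    return views_per_month / seconds


avg = average_rate_per_sec(PAGE_VIEWS_PER_MONTH)
peak_low, peak_high = 8_000, 10_000  # peak range quoted on the slide

print(f"average: {avg:.0f} per sec")
print(f"peak-to-average ratio: {peak_low / avg:.2f}x to {peak_high / avg:.2f}x")
```

Under that assumption the average works out to roughly 5,400 page views per second, so the quoted peak is somewhat under twice the average rate.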

Page 4: Open analytics meetup   alex poon (1)

Before Storm

•  Built our own distributed data processing system

    •  ZMQ

    •  Batch-based processing

    •  Processing partitioned by hashing on customer

•  Advantages

    •  Simple in-house system built from very basic components

    •  Well understood

•  Disadvantages

    •  Hard to scale; a constant battle to keep up with pings

    •  Machine management was clumsy

    •  Uneven distribution of traffic

    •  Multiple processes doing similar work, wasting resources
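To illustrate why hashing work by customer can lead to uneven traffic distribution, here is a minimal sketch. It is not the original in-house system: the function names, the use of MD5 as a stable hash, and the traffic numbers are all hypothetical.

```python
import hashlib
from collections import Counter


def worker_for(customer_id: str, num_workers: int) -> int:
    """Map a customer to a worker with a stable hash (MD5 used only for determinism)."""
    digest = hashlib.md5(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def load_per_worker(traffic: dict, num_workers: int) -> Counter:
    """Sum the pings each worker would receive given per-customer traffic."""
    load = Counter({worker: 0 for worker in range(num_workers)})
    for customer_id, pings in traffic.items():
        load[worker_for(customer_id, num_workers)] += pings
    return load


# Hypothetical traffic: one large customer dominates.
traffic = {"big-publisher": 9_000, "small-a": 300, "small-b": 400, "small-c": 300}
load = load_per_worker(traffic, num_workers=4)
print(dict(load))
```

Because every ping for a given customer lands on the same worker, whichever worker owns the largest customer carries at least that customer's full load, no matter how many workers are added.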

Page 5: Open analytics meetup   alex poon (1)

Why Kafka/Storm?

•  Kafka

    •  Open-source, distributed publish-subscribe messaging system

•  Storm

    •  Open-source, real-time system for continuous computation

•  They are awesome

    •  Distributed, highly scalable, and fault tolerant

    •  High throughput

    •  Reliable

    •  Real-time

    •  Great at in-memory analytics and real-time decision support
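The publish-subscribe model mentioned above can be sketched as an append-only log that each consumer group reads at its own offset. This is a toy in-memory model for illustration only; it is not the Kafka client API, and the class, method, and group names are hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """Append-only in-memory log with an independent read offset per consumer group."""

    def __init__(self) -> None:
        self._messages = []
        self._offsets = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its position in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group: str, max_messages: int = 10) -> list:
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for ping in ["view:/home", "view:/sports", "view:/home"]:
    log.publish(ping)

dashboard_batch = log.poll("dashboard")          # reads all three messages
algo_batch = log.poll("algo", max_messages=2)    # reads the first two only
algo_rest = log.poll("algo")                     # picks up where "algo" left off
```

Since each group tracks its own offset, several independent consumers can read the same stream without interfering with one another.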

Page 6: Open analytics meetup   alex poon (1)

Data Aggregation

[Topology diagram; node labels with their time labels: URL (15s), Aggregate (15s), Customer (15s), Front Page (15s), Position (5m), Arrangement (15s), Tweet (5m), Aggregate (15s); other labels: @HandleSpout, Bolt]
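The "15s" labels in this diagram suggest aggregation over fixed time intervals. A minimal sketch of that idea, assuming non-overlapping (tumbling) 15-second windows and per-URL page-view counts, follows. The windowing style, the event shape, and all names are assumptions; the slides do not show how the bolts were actually implemented.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 15  # matches the "15s" label; tumbling windows are an assumption


def aggregate_by_window(events, window_seconds=WINDOW_SECONDS):
    """Count page views per URL within each fixed window of (timestamp, url) events."""
    counts = defaultdict(Counter)
    for timestamp, url in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start][url] += 1
    return {start: dict(per_url) for start, per_url in counts.items()}


# Hypothetical events: (seconds since start, URL viewed).
events = [(0.5, "/home"), (3.0, "/sports"), (14.9, "/home"), (15.0, "/home"), (31.2, "/sports")]
windows = aggregate_by_window(events)
print(windows)
```

The same function with `window_seconds=300` would correspond to the "5m" labels.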

Page 7: Open analytics meetup   alex poon (1)

Learning / Ideas

1. Kafka + ZooKeeper is extremely scalable and easy to set up. Check out the Brod library if you are using Python

2. Use the Storm UI (Ganglia based) to monitor your cluster

3. Shell Bolts were inefficient and hard to debug (at least for us)

4. Upgrade to at least Storm version 0.8.2, which gives you capacity metrics on top of other goodies

5. Storm’s anchoring/replay capability is awesome but comes with a visible overhead

6. Use a good framework to manage your cluster; we use Salt Stack

7. Our unit tests are built in JUnit. Most built-in unit tests for Storm are only available in Clojure for now
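On item 5, part of the overhead comes from tracking every tuple derived from a spout tuple. Storm's documentation describes acker tasks that keep one XOR value per spout tuple: each tuple ID is XORed in when the tuple is emitted and again when it is acked, so the value returns to zero once the whole tree is processed. The sketch below is a simplified model of that idea, not Storm's code; the class name, method names, and the fixed tuple IDs are hypothetical (Storm uses random 64-bit IDs).

```python
class AckTracker:
    """Simplified model of XOR-based tuple-tree tracking for a single spout tuple."""

    def __init__(self) -> None:
        self.ack_val = 0

    def emit(self, tuple_id: int) -> None:
        # A newly anchored tuple joins the tree: XOR its ID in.
        self.ack_val ^= tuple_id

    def ack(self, tuple_id: int) -> None:
        # Acking XORs the same ID in again, which cancels it out.
        self.ack_val ^= tuple_id

    def fully_processed(self) -> bool:
        # Zero means every emitted tuple has also been acked.
        return self.ack_val == 0


# Fixed IDs keep the example deterministic.
root, children = 0x1, [0x2, 0x4, 0x8]

tracker = AckTracker()
tracker.emit(root)
for child in children:
    tracker.emit(child)   # a bolt emits three tuples anchored to the root

tracker.ack(root)
pending = not tracker.fully_processed()   # children are still outstanding

for child in children:
    tracker.ack(child)
done = tracker.fully_processed()
```

Every emit and ack in an anchored topology produces one of these updates, which gives a sense of where the extra messaging cost comes from.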

Page 8: Open analytics meetup   alex poon (1)

Thank You

Alex Poon

@alexpoon06 @Outbrain

Yes, it is true. We are Hiring!!

www.visualrevenue.com/jobs