Monal Daxini Engineering Manager Stream Processing, Real Time Data Infrastructure @monaldax @Netflix #keystone Beaming Flink to the Cloud @ Netflix Sep 12 2016

May 20, 2018

Page 1

Monal Daxini

Engineering Manager, Stream Processing, Real Time Data Infrastructure

@monaldax @Netflix #keystone

Beaming Flink to the Cloud @ Netflix

Sep 12 2016

Page 2

World’s Leading Internet Streaming Service (Global launch Jan 6, 2016)

Page 3

Netflix Service Scale

● 83+ Million Members, 190+ Countries

● 1000+ device types

● 35% of downstream Internet traffic

Page 4

Netflix Service Scale - Daily viewing hours

125,000,000+

Whoa!

Page 5

Events Processed / day

1,000,000,000,000+

1.4 PB

That’s a huge number!

Page 6

Event Scale

Peak

● 1T unique events ingested/day

● 16M / sec

● 43GB / sec

● 10MB / message

Daily Averages

● 1T+ events processed

● 600B unique events ingested

● 1.4 PB / day

● 4K / event
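
As a rough sanity check on these figures, the daily totals can be converted into average per-second rates. This is only a sketch that uses numbers from the slide; the function and variable names are illustrative, and 1 PB is assumed to mean 10**15 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(daily_total: float) -> float:
    """Convert a per-day total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


events_processed_per_day = 1e12  # "1T+ events processed" (lower bound)
bytes_per_day = 1.4e15           # "1.4 PB / day", assuming 1 PB = 10**15 bytes

avg_events_per_sec = per_second(events_processed_per_day)
avg_bytes_per_sec = per_second(bytes_per_day)

print(f"{avg_events_per_sec / 1e6:.1f}M events/sec on average")
print(f"{avg_bytes_per_sec / 1e9:.1f} GB/sec on average")
```

Both averages come out below the peak rates quoted on the slide (16M / sec and 43GB / sec), which is what one would expect from a daily average.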

Page 7

Keystone Scale

99.99%+ Availability (four 9s)

Page 8

Keystone Events Trend

● 1/2014: 80B / day

● 1/2015: 300B / day

● 1/2016: 1T+ / day

Page 9

Evolution

SPaaS

Page 10

Goal - Migrate 1.3 PB of event data to a new Pipeline in flight, while ensuring data diff < 0.1%
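
The slide does not say how the data diff was measured. One plausible reading is a relative difference between event counts seen by the old and new pipelines over the same window. The sketch below assumes that reading; `data_diff_percent` and `within_goal` are hypothetical names, not part of Keystone.

```python
def data_diff_percent(old_count: int, new_count: int) -> float:
    """Relative difference between two pipelines' event counts, in percent."""
    if old_count <= 0:
        raise ValueError("old_count must be positive")
    return abs(new_count - old_count) / old_count * 100.0


def within_goal(old_count: int, new_count: int,
                threshold_percent: float = 0.1) -> bool:
    """True when the diff is strictly below the threshold (0.1% on the slide)."""
    return data_diff_percent(old_count, new_count) < threshold_percent


# Hypothetical counts for one comparison window.
print(within_goal(1_000_000, 999_500))  # diff is about 0.05%
print(within_goal(1_000_000, 998_000))  # diff is about 0.2%
```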

Page 11

A few years ago (Season 1)

[Architecture diagram. Labels: Event Producers, EMR]

Page 12

A year ago (Season 2)

[Architecture diagram. Labels: Event Producer, Suro, Suro Proxy, Kafka, Suro Router, Consumer Kafka, Stream Consumers, EMR]

Page 13

Season 3

Page 14

Keystone

Keystone Stream Processing (SPaaS)

Keystone Management

Keystone Messaging

Schema Support

100% in AWS

Page 15

Our Philosophy

Create Duplo® Blocks: Let reusability drive new value

Page 16

Q4 2015

[Architecture diagram. Labels: Event Producers, SPaaS, Sinks]

Page 17

Keystone Management

Page 18

Per Stream Auto Dashboard

Page 19

Keystone SPaaS

[Architecture diagram. Labels: Event Producer, KS Proxy, Fronting Kafka, Router, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

● 100% in AWS

● 24 x 7

● Region failover

Page 20

Event flow

Keystone Pipeline As a Service

Page 21

Keystone

[Architecture diagram. Labels: Event Producer, KS Lib, KS Proxy, Fronting Kafka, Router, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

Page 22

Keystone

[Architecture diagram. Labels: Event Producer, Fronting Kafka, Router, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

Page 26

Routing

Page 27

Keystone

[Architecture diagram. Labels: Event Producer, Fronting Kafka, Samza Router, Checkpoint Cluster, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

Page 28

Details..

○ Massively parallel use-case

■ Per element processing - declarative filtering & projection

○ Stateless except Kafka offset checkpointing state
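
To make "declarative filtering & projection" concrete, here is a minimal sketch of a stateless per-element routing step. The rule format (`where` / `select`) and the `route` function are assumptions made for illustration; the slides do not show Keystone's actual rule syntax.

```python
from typing import Any, Dict, Optional

Event = Dict[str, Any]


def route(event: Event, rule: Dict[str, Any]) -> Optional[Event]:
    """Apply one routing rule to one event; no state is kept between calls."""
    # Filter: every field listed under "where" must match exactly.
    for field, expected in rule.get("where", {}).items():
        if event.get(field) != expected:
            return None  # event is dropped for this route
    # Projection: keep only the fields listed under "select" (all if absent).
    select = rule.get("select")
    if select is None:
        return dict(event)
    return {name: event[name] for name in select if name in event}


# Hypothetical rule and events.
rule = {"where": {"type": "playback"}, "select": ["type", "title_id"]}
kept = route({"type": "playback", "title_id": 7, "device": "tv"}, rule)
dropped = route({"type": "search", "query": "flink"}, rule)
print(kept)
print(dropped)
```

Because each call depends only on its own event and the rule, such a step can be run in parallel across partitions without coordination, which fits the "massively parallel" description above.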

Page 29

Routing Infrastructure

[Slide of technology logos joined by "+" signs. Surviving text: Checkpointing Cluster, 0.9.1, Go, C language]

Page 30

[Architecture diagram. Labels: EC2 Instances, ASG, Node Agent, Job (x3), Zookeeper (Instance Id assignment), Checkpointing Cluster, Immutable Job Config, Self-Service UI / Control Plane]

Page 31

Custom Go Executor

[Diagram. Labels: ./runJob (x3), Logs, Snapshots, Attach Volumes, Health Check, Zookeeper (Instance Id assignment)]

Reconcile Loop - 1 min

What’s running on the host?
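
The reconcile loop on this slide can be pictured as repeatedly comparing the jobs that should run on a host with the jobs that actually run there. The sketch below shows a single pass. It is written in Python for illustration only (the slide says the executor itself is written in Go), and `plan` and `reconcile_once` are hypothetical names.

```python
from typing import Callable, List, Set, Tuple


def plan(desired: Set[str], running: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (jobs to start, jobs to stop) that make `running` match `desired`."""
    return sorted(desired - running), sorted(running - desired)


def reconcile_once(desired: Set[str], running: Set[str],
                   start: Callable[[str], None],
                   stop: Callable[[str], None]) -> None:
    """One pass of the loop: stop unexpected jobs, then start missing ones."""
    to_start, to_stop = plan(desired, running)
    for job in to_stop:
        stop(job)
    for job in to_start:
        start(job)


# Example pass with in-memory stand-ins for "what's running on the host?".
running = {"job-a", "job-c"}
desired = {"job-a", "job-b"}
reconcile_once(desired, running, start=running.add, stop=running.discard)
print(sorted(running))
```

In a real agent this pass would be repeated on a timer (one minute, per the slide), with `start` and `stop` launching or killing processes rather than editing a set.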

Page 32

Yes! You inferred right!

No Mesos & No Yarn

Page 33

Keystone Salient Features

Page 34

Keystone Scale

● 4000+ Kafka brokers

● 14000+ Samza Jobs

○ Running in Docker containers

○ On 1600+ nodes

Page 35

Keystone Salient Features

● At-least-once delivery semantics*

● Multi-Tenant

● Self Service

● Scalable

● Fault Tolerant

100% in AWS
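
At-least-once delivery generally means an offset is checkpointed only after the messages before it have been delivered, so a crash can cause re-delivery but not loss. The following is a toy in-memory simulation of that idea, not Keystone or Kafka code; `run_consumer`, `commit_every`, and `crash_at` are illustrative names.

```python
from typing import List, Optional


def run_consumer(log: List[str], start_offset: int, sink: List[str],
                 commit_every: int, crash_at: Optional[int] = None) -> int:
    """Deliver log[start_offset:] to sink and return the last committed offset.

    Offsets are committed only after the messages before them were delivered,
    so a crash can cause re-delivery but never a skipped message.
    """
    committed = start_offset
    for position in range(start_offset, len(log)):
        if position == crash_at:
            return committed  # simulated crash: uncommitted progress is lost
        sink.append(log[position])  # deliver first ...
        if (position + 1 - start_offset) % commit_every == 0:
            committed = position + 1  # ... then checkpoint the offset
    return len(log)  # final commit once everything has been delivered


log = ["a", "b", "c", "d", "e"]
sink: List[str] = []
offset = run_consumer(log, 0, sink, commit_every=2, crash_at=3)  # crashes
offset = run_consumer(log, offset, sink, commit_every=2)         # restarts
print(sink)  # ['a', 'b', 'c', 'c', 'd', 'e']
```

Message "c" was delivered before the crash but after the last checkpoint, so it is delivered again on restart. Nothing is lost, which is why downstream consumers of an at-least-once pipeline need to tolerate duplicates.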

Page 36

Keystone Salient Features

● Event payload is immutable

● Inject event metadata

○ guid, timestamp, host, app

● Custom extensible wire protocol

● Kafka producer wrapper
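
One way to picture an immutable payload with injected metadata is an envelope that a producer wrapper builds around the caller's bytes. This is a sketch only: the `Envelope` type, its field names, and `wrap` are assumptions, and the actual Keystone wire protocol is not shown in the slides.

```python
import socket
import time
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Metadata injected by the wrapper plus the untouched payload."""
    guid: str
    timestamp_ms: int
    host: str
    app: str
    payload: bytes  # bytes objects are immutable in Python


def wrap(payload: bytes, app: str) -> Envelope:
    """Attach guid, timestamp, host and app without modifying the payload."""
    return Envelope(
        guid=str(uuid.uuid4()),
        timestamp_ms=int(time.time() * 1000),
        host=socket.gethostname(),
        app=app,
        payload=payload,
    )


envelope = wrap(b'{"title_id": 42}', app="demo-app")  # hypothetical event
print(envelope.guid, envelope.host, envelope.app)
```

Because the dataclass is frozen, assigning to `envelope.payload` after creation raises an error, which mirrors the "payload is immutable" rule at the type level.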

Page 37

Keystone Salient Features

● Kafka Cluster failover

○ Kafka Kong

● Routing regional / failover

● Scales based on historical traffic

○ Externally driven
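
The slides do not describe the scaling policy itself. As a sketch of what "scales based on historical traffic" could look like when computed outside the job, the function below sizes a fleet from a past peak rate plus headroom. The function name, the headroom factor, and the per-instance capacity are all hypothetical.

```python
import math
from typing import Sequence


def desired_instances(historical_rates: Sequence[float],
                      per_instance_capacity: float,
                      headroom: float = 0.25,
                      minimum: int = 1) -> int:
    """Size the fleet for the historical peak rate plus a safety margin."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    peak = max(historical_rates)
    needed = math.ceil(peak * (1 + headroom) / per_instance_capacity)
    return max(minimum, needed)


# Hypothetical message rates (msgs/sec) observed at the same hour in past weeks.
print(desired_instances([800, 1200, 1000], per_instance_capacity=500))
```

An external scheduler could run such a calculation ahead of the expected traffic and resize the fleet before the load arrives, instead of reacting to live metrics.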

Page 38

Season 4

Plot & Pilot...

Page 39

Stream Processing As a Service (SPaaS)

(more capable)

Page 40

SPaaS Vision (plot)

● Multi-tenant support for stateful stream processing apps

● Autoscaling managed infrastructure

● Support for schemas

● Self Service Tooling

Page 41

SPaaS Architecture (plot)

[Architecture diagram. Labels: SPaaS Manager; Titus Container Runtime; Framework Specific API or Common API (Beam); Dockerized Job; Runner: Flink / Spark / Mantis; Running Job; Tooling / Dashboard. Numbered steps: 1. Create, 2. Submit, 3. Launch; and 1. Submit Job DSL (SQL)]

Page 42

SPaaS - “Beam Me Up, Scotty!”

Why Apache Beam?

○ Portable API layer to build sophisticated data processing apps

■ Support multiple execution engines

○ Unified model API over bounded and unbounded data sources

○ MillWheel, FlumeJava, Dataflow model lineage
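
The idea of one model over bounded and unbounded sources can be illustrated without any framework: the same transform is written once and applied to a finite collection and to an endless stream. The sketch below uses plain Python iterables and is not the Apache Beam API; all names are illustrative.

```python
import itertools
from typing import Iterable, Iterator


def extract_lengths(events: Iterable[str]) -> Iterator[int]:
    """One transform definition: drop empty events, emit each event's length."""
    for event in events:
        if event:
            yield len(event)


def unbounded_source() -> Iterator[str]:
    """A never-ending stream of synthetic events."""
    for i in itertools.count():
        yield f"event-{i}"


bounded_source = ["play", "", "pause"]

# Same transform, bounded input: runs to completion.
batch_result = list(extract_lengths(bounded_source))

# Same transform, unbounded input: results are consumed incrementally.
first_three = list(itertools.islice(extract_lengths(unbounded_source()), 3))

print(batch_result, first_three)
```

A real engine adds windowing, triggers, and distributed execution on top of this, which is where a portable layer with multiple runners becomes valuable.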

Page 43

SPaaS - Pilot

Iterative build out:

● First - Flink on Titus in VPC, AWS

○ Titus is a cloud runtime platform for container-based jobs

● Next - Apache Beam and Flink runner

Page 44

Keystone SPaaS-Flink Pilot Use Cases

[Architecture diagram with use case 2 marked. Labels: Event Producer, Fronting Kafka, Router, Demux, Merge, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

Page 45

Keystone SPaaS-Flink Pilot Use Cases

[Architecture diagram with use cases 1, 2 and 3 marked. Labels: Event Producer, Fronting Kafka, Router, Demux, Merge, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

Page 46

Router - Massively parallel use case

[Diagram. Labels: Broker, Router]

Page 47

Flink (1.2) Program Deployment (prod shadow)

[Deployment diagram. Labels: Zookeeper; Job Manager (master); Job Manager (standby); Task Manager (x3); Titus Job (x2); Titus Host 1 through Titus Host 5; IP; AWS VPC ENI]

Page 48

Titus High Level Architecture

[Architecture diagram. Labels: Titus UI; (CI/CD); Rhea; Titus API; Titus Master (Job Management & Scheduler); Fenzo; Cassandra; Zookeeper; Mesos Master; EC2 Autoscaling API; S3; Docker Registry; Titus Agent: mesos agent, Titus executor, docker, metrics agent, logging agent, zfs; containers, including SPaaS-Flink]

Page 49

Titus UI

Page 50

CD

Page 51

Dashboard

Page 56

Keystone SPaaS-Flink Pilot Use Case - 2

Page 57

Keystone SPaaS-Flink Pilot Use Case - 2

Page 58

Flink Router perf test (YMMV)

○ Note

■ The tests were performed on a specific use case,

■ running in a specific environment, and

■ with one specific event stream and setup.

Page 59

Keystone

[Architecture diagram with use case 1 marked. Labels: Event Producer, Fronting Kafka, Samza Router, Consumer Kafka, Stream Consumers, EMR, Control Plane, Self Service UI]

Page 60

Details..

○ Different runtimes for Flink & Samza routers

○ Massively parallel use-case

■ Per element processing

○ Focused on net outcomes

Page 61

Flink (1.2) Router

[Deployment diagram. Labels: Zookeeper; Job Manager (master); Job Manager (standby); Task Manager (x3); Titus Job (x2); Titus Host 1 through Titus Host 5; IP; AWS VPC ENI; Backed state]

Page 62

Flink Router outcome (YMMV)

○ Cost ≅ 17% savings

○ Memory utilization ≅ 16% better

○ CPU utilization ≅ 40% better

○ Network utilization ≅ 10% better

○ Msg. throughput ≅ 1% (avg) - 4% (peak) better

Page 63

Awaiting Flink Features

Page 64

Fine Grained Recovery FLIP-1

[Deployment diagram. Labels: Zookeeper; Job Manager (master); Job Manager (standby); Task Manager (x3); Titus Job (x2); Titus Host 1 through Titus Host 5; IP; X (failure marker); Flink Job Restarts]

Page 65

Awaiting Flink Features (now)

● Checkpoints and savepoints

○ FLINK-4484 - Unify, Persistent checkpoints, periodic savepoints

○ Compatible across Flink version upgrades

○ Inspection tool for debugging

Page 66

Awaiting Flink Features (now)

● Atomic savepoint and stop (pause) the job

● Dynamic Scaling

○ Resume from savepoint with different parallelism

○ Job elasticity

● FLINK-4545 Adjust TaskManager network buffer when scaling

DONE

Page 67

Awaiting Flink Features (now)

● Cluster Runtime and Elasticity (FLIP-6)

● Extending Window Function Metadata (FLIP-2)

● Large State support

○ Incremental checkpointing

○ Hot standby

Page 68

Awaiting Flink Features

● Metric tags as key-value pairs - FLINK-4245

● FLIP-9 Window Trigger DSL

● FLIP-11 Table API

● Side Inputs / Side Outputs (handle late data)

DONE

Page 69

Ponder over

Could a stream processing engine enable building non-analytics applications?

Page 70

More brain food...

● Netflix Keystone Pipeline Evolution

● Netflix Kafka in Keystone Pipeline

● Samza Meetup Presentation

● Titus talk

● Netflix OSS