Resilient Predictive Data Pipelines
Sid Anand (@r39132), QCon London 2016


Jan 07, 2017


Sid Anand
Transcript
Page 1: Resilient Predictive Data Pipelines (QCon London 2016)

Resilient Predictive Data Pipelines
Sid Anand (@r39132), QCon London 2016

Page 2: Resilient Predictive Data Pipelines (QCon London 2016)

About Me

• Work[ed|s] @ [logos not preserved]
• Maintainer on airbnb/airflow
• Report to [logo not preserved]
• Co-Chair for [logo not preserved]

Page 3: Resilient Predictive Data Pipelines (QCon London 2016)

Motivation
Why is a Data Pipeline talk in this High Availability Track?

Page 4: Resilient Predictive Data Pipelines (QCon London 2016)

Different Types of Data Pipelines

ETL
• used for: loading data related to business health into a Data Warehouse
  • user-engagement stats (e.g. social networking)
  • product success stats (e.g. e-commerce)
• audience: Business, BizOps
• downtime?: 1-3 days

VS

Predictive
• used for:
  • building recommendation products (e.g. social networking, shopping)
  • updating fraud prevention endpoints (e.g. security, payments, e-commerce)
• audience: Customers
• downtime?: < 1 hour

Page 5: Resilient Predictive Data Pipelines (QCon London 2016)

Different Types of Data Pipelines

ETL
• used for: Data Warehouse -> Reports
  • user-engagement stats (e.g. social networking)
  • product success stats (e.g. e-commerce)
• audience: Business, BizOps
• downtime?: 1-3 days

VS

Predictive
• used for:
  • building recommendation products (e.g. social networking, shopping)
  • updating fraud prevention endpoints (e.g. security, payments, e-commerce)
• audience: Customers
• downtime?: < 1 hour

Page 6: Resilient Predictive Data Pipelines (QCon London 2016)

Why Do We Care About Resilience?

[Diagram: DB, Search Engines, Recommenders; data items d1 to d8]

Page 7: Resilient Predictive Data Pipelines (QCon London 2016)

Why Do We Care About Resilience?

[Diagram: DB, Search Engines, Recommenders; data items d1 to d8]

Page 8: Resilient Predictive Data Pipelines (QCon London 2016)

Why Do We Care About Resilience?

[Diagram: DB, Search Engines, Recommenders; data items d1 to d8]

Page 9: Resilient Predictive Data Pipelines (QCon London 2016)

Why Do We Care About Resilience?

[Diagram: DB, Search Engines, Recommenders; data items d1 to d8]

Page 10: Resilient Predictive Data Pipelines (QCon London 2016)

Any Take-aways?

[Diagram: DB, Search Engines, Recommenders; data items d1 to d8]

• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• The bugs can affect customers and a company’s profits & reputation!

Page 11: Resilient Predictive Data Pipelines (QCon London 2016)

Design Goals
Desirable Qualities of a Resilient Data Pipeline

Page 12: Resilient Predictive Data Pipelines (QCon London 2016)

Desirable Qualities of a Resilient Data Pipeline

• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable

Page 13: Resilient Predictive Data Pipelines (QCon London 2016)

Desirable Qualities of a Resilient Data Pipeline

• Scalable
  • Build your pipelines using [infinitely] scalable components
  • The scalability of your system is determined by its least-scalable component
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable

Page 14: Resilient Predictive Data Pipelines (QCon London 2016)

Desirable Qualities of a Resilient Data Pipeline

• Scalable
  • Build your pipelines using [infinitely] scalable components
  • The scalability of your system is determined by its least-scalable component
• Available
  • Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable

Page 15: Resilient Predictive Data Pipelines (QCon London 2016)

Instrumented

Instrumentation must reveal SLA metrics at each stage of the pipeline!

What SLA metrics do we care about? Correctness & Timeliness

• Correctness
  • No Data Loss
  • No Data Corruption
  • No Data Duplication
  • A Defined Acceptable Staleness of Intermediate Data
• Timeliness
  • A late result == a useless result
  • Delayed processing of now()’s data may delay the processing of future data

Page 16: Resilient Predictive Data Pipelines (QCon London 2016)

Instrumented, Monitored, & Alert-enabled

• Instrument
  • Instrument Correctness & Timeliness SLA metrics at each stage of the pipeline
• Monitor
  • Continuously monitor that SLA metrics fall within acceptable bounds (i.e. pre-defined SLAs)
• Alert
  • Alert when we miss SLAs
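
The instrument / monitor / alert loop above can be sketched in a few lines. This is only an illustration, not the talk's implementation: the `StageMetrics` fields, the `Sla` threshold, and the stage name are hypothetical, chosen to mirror the Correctness and Timeliness metrics listed on the slides.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    """Counters one pipeline stage might emit per run (hypothetical fields)."""
    stage: str
    records_in: int
    records_out: int      # distinct records written downstream
    duplicates: int       # records written more than once
    corrupt: int          # records that failed validation
    lag_seconds: float    # age of the oldest unprocessed record


@dataclass
class Sla:
    """Pre-defined acceptable bound (hypothetical threshold)."""
    max_lag_seconds: float


def sla_violations(m, sla):
    """Return a list of missed SLAs for one stage; an empty list means healthy."""
    problems = []
    # Correctness: no loss, no corruption, no duplication.
    if m.records_out < m.records_in:
        problems.append(f"{m.stage}: data loss ({m.records_in - m.records_out} records)")
    if m.corrupt > 0:
        problems.append(f"{m.stage}: data corruption ({m.corrupt} records)")
    if m.duplicates > 0:
        problems.append(f"{m.stage}: data duplication ({m.duplicates} records)")
    # Timeliness: a late result is a useless result.
    if m.lag_seconds > sla.max_lag_seconds:
        problems.append(f"{m.stage}: lag {m.lag_seconds}s exceeds {sla.max_lag_seconds}s")
    return problems


# A monitor would run this check continuously and alert on any non-empty result.
healthy = StageMetrics("importer", 100, 100, 0, 0, 30.0)
late_and_lossy = StageMetrics("importer", 100, 97, 2, 1, 4000.0)
for problem in sla_violations(late_and_lossy, Sla(max_lag_seconds=3600)):
    print("ALERT:", problem)
```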

Page 17: Resilient Predictive Data Pipelines (QCon London 2016)

Desirable Qualities of a Resilient Data Pipeline

• Scalable
  • Build your pipelines using [infinitely] scalable components
  • The scalability of your system is determined by its least-scalable component
• Available
  • Ditto
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable

Page 18: Resilient Predictive Data Pipelines (QCon London 2016)

Quickly Recoverable

• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR (mean time to recovery)

Page 19: Resilient Predictive Data Pipelines (QCon London 2016)

Implementation
Using AWS to meet Design Goals

Page 20: Resilient Predictive Data Pipelines (QCon London 2016)

SQS
Simple Queue Service

Page 21: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Overview

AWS’s low-latency, highly scalable, highly available message queue

• Infinitely Scalable Queue (though not FIFO)
• Low End-to-end latency (generally sub-second)
• Pull-based

How it Works!

• Producers publish messages, which can be batched, to an SQS queue
• Consumers
  • consume messages, which can be batched, from the queue
  • commit message contents to a data store
  • ACK the messages as a batch
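
The consumer steps above (consume a batch, commit, ACK as a batch) might look roughly like the sketch below. It assumes `sqs` behaves like a boto3 SQS client (`receive_message` and `delete_message_batch` are real boto3 calls); the `store` dict and the choice to key it by message id are assumptions made for illustration.

```python
def consume_batch(sqs, queue_url, store, max_messages=10):
    """Receive up to one batch, commit it to `store`, then ACK the batch.

    `sqs`: an object with the boto3 SQS client interface.
    `store`: hypothetical dict-like data store keyed by message id.
    Returns the number of messages processed.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=20,                # long polling
    )
    messages = resp.get("Messages", [])
    if not messages:
        return 0

    # Commit first. Keying by MessageId means a redelivered message
    # overwrites itself instead of creating a duplicate row.
    for msg in messages:
        store[msg["MessageId"]] = msg["Body"]

    # ACK (delete) only after the commit has succeeded.
    sqs.delete_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {"Id": str(i), "ReceiptHandle": msg["ReceiptHandle"]}
            for i, msg in enumerate(messages)
        ],
    )
    return len(messages)
```

With a real client this would be called in a loop, e.g. `consume_batch(boto3.client("sqs"), queue_url, store)`. If the commit raises, nothing is ACKed and the messages become visible again after the visibility timeout.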

Page 22: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Typical Operation Flow

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a visibility timer runs for m1]

Step 1: A consumer reads a message from SQS. This starts a visibility timer!

Page 23: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Typical Operation Flow

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a visibility timer runs for m1]

Step 2: Consumer persists message contents to DB

Page 24: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Typical Operation Flow

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a visibility timer runs for m1]

Step 3: Consumer ACKs message in SQS

Page 25: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Time Out Example

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a visibility timer runs for m1]

Step 1: A consumer reads a message from SQS

Page 26: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Time Out Example

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a visibility timer runs for m1]

Step 2: Consumer attempts to persist message contents to DB

Page 27: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Time Out Example

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; the visibility timer for m1 has timed out]

Step 3: A Visibility Timeout occurs & the message becomes visible again.

Page 28: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Time Out Example

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a new visibility timer runs for m1]

Step 4: Another consumer reads and persists the same message

Page 29: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Time Out Example

[Diagram: Producers -> SQS (m1 to m5) -> Consumers -> DB; a visibility timer runs for m1]

Step 5: Consumer ACKs message in SQS
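
The five time-out steps can be replayed with a toy in-memory queue. `ToyQueue` is a hypothetical stand-in written for this sketch, not SQS itself, and the explicit `now` argument replaces a real clock. The DB is keyed by message id, which is one assumed way to keep the redelivery from duplicating data.

```python
class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout (not SQS)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}         # message id -> body
        self.invisible_until = {}  # message id -> time it becomes visible again

    def send(self, msg_id, body):
        self.messages[msg_id] = body
        self.invisible_until[msg_id] = 0.0

    def receive(self, now):
        """Return one visible message and start its visibility timer."""
        for msg_id, body in self.messages.items():
            if self.invisible_until[msg_id] <= now:
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def ack(self, msg_id):
        """Delete the message so it is never delivered again."""
        self.messages.pop(msg_id, None)
        self.invisible_until.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("m1", "payload")
db = {}

first = queue.receive(now=0)     # Step 1: consumer A reads m1, timer starts
hidden = queue.receive(now=10)   # Step 2: A is stuck; m1 is invisible to others
second = queue.receive(now=31)   # Step 3: timeout, m1 is visible again
db[second[0]] = second[1]        # Step 4: consumer B reads and persists m1
db[first[0]] = first[1]          # a late write by A lands on the same key
queue.ack(second[0])             # Step 5: consumer B ACKs m1
```

Because both writes target the same key, the DB ends up with a single copy of m1 even though it was delivered twice.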

Page 30: Resilient Predictive Data Pipelines (QCon London 2016)

SQS - Dead Letter Queue

[Diagram: Producers -> SQS (m2 to m5) -> Consumers -> DB, plus a separate SQS DLQ; m1 is shown outside the main queue]

Redrive rule: 2x
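
As a rough illustration of the redrive rule, the sketch below builds the `RedrivePolicy` queue attribute that SQS uses to route repeatedly received messages to a dead-letter queue. Reading "2x" as `maxReceiveCount = 2` is an assumption, and the ARN is a placeholder.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=2):
    """Build the queue attribute that sends a message to the DLQ once it
    has been received `max_receive_count` times without being deleted."""
    return {
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": str(max_receive_count),
            }
        )
    }


# Placeholder ARN, for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:importer-dlq")

# With boto3 this dict would be applied to the source queue as:
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```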

Page 31: Resilient Predictive Data Pipelines (QCon London 2016)

SNS
Simple Notification Service

Page 32: Resilient Predictive Data Pipelines (QCon London 2016)

SNS - Overview

Highly Scalable, Highly Available, Push-based Topic Service

• Reliable Multi-Push
  • Whereas SQS ensures each message is seen by at least 1 consumer, SNS ensures that each message is seen by every consumer
• No Reliable Message Delivery
  • Whereas SQS is pull-based, SNS is push-based
  • There is no message retention & there is a finite retry count

Can we work around this limitation?

Page 33: Resilient Predictive Data Pipelines (QCon London 2016)

SNS + SQS Design Pattern

[Diagram: SNS T1 (m1, m2) fans out to SQS Q1 (m1, m2) and SQS Q2 (m1, m2)]

• Reliable Multi Push (SNS)
• Reliable Message Delivery (SQS)
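
Wiring up this pattern amounts to subscribing each queue to the topic. The sketch below assumes `sns` behaves like a boto3 SNS client (`subscribe` with `Protocol="sqs"` is a real boto3 call); the function name and the ARNs passed in are hypothetical.

```python
def fan_out(sns, topic_arn, queue_arns):
    """Subscribe each SQS queue to the SNS topic; return the subscription ARNs."""
    subscription_arns = []
    for queue_arn in queue_arns:
        resp = sns.subscribe(
            TopicArn=topic_arn,
            Protocol="sqs",
            Endpoint=queue_arn,  # for the sqs protocol, the endpoint is the queue ARN
        )
        subscription_arns.append(resp["SubscriptionArn"])
    return subscription_arns
```

Each queue also needs an access policy that allows the topic to send messages to it; that step is omitted here.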

Page 34: Resilient Predictive Data Pipelines (QCon London 2016)

SNS + SQS

[Diagram: Producers -> SNS T1 (m1, m2) -> SQS Q1 -> Consumers -> DB, and SNS T1 -> SQS Q2 -> Consumers -> ES]

Page 35: Resilient Predictive Data Pipelines (QCon London 2016)

S3 + SNS + SQS Design Pattern

[Diagram: S3 (d1, d2) -> SNS T1 (m1, m2) -> SQS Q1 (m1, m2) and SQS Q2 (m1, m2)]

• Reliable Multi Push
• Transactions
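
One way a consumer could handle this pattern is sketched below. It assumes that S3 event notifications are published to the SNS topic and that raw message delivery is off, so each SQS body is an SNS envelope whose `Message` field holds the S3 event as a JSON string. The slides do not spell out this wiring, and the function name is hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Extract (bucket, key) pairs from an SQS message body that wraps an
    SNS notification carrying an S3 event."""
    envelope = json.loads(body)              # SNS envelope
    event = json.loads(envelope["Message"])  # S3 event, JSON-encoded as a string
    return [
        # Object keys arrive URL-encoded in S3 event notifications.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]
```

The consumer then reads the data file (d1, d2 in the diagram) from S3 using the extracted bucket and key, so the queue carries only small pointer messages.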

Page 36: Resilient Predictive Data Pipelines (QCon London 2016)

Batch Pipeline Architecture
Putting the Pieces Together

Page 37: Resilient Predictive Data Pipelines (QCon London 2016)

Architecture

[Architecture diagram not preserved]

Page 38: Resilient Predictive Data Pipelines (QCon London 2016)

Architectural Elements

• A Schema-aware Data format for all data (Avro)
• The entire pipeline is built from Highly-Available/Highly-Scalable components
  • S3, SNS, SQS, ASG, EMR Spark (exception: DB)
• The pipeline is never blocked because we use a DLQ for messages we cannot process
• We use queue-based auto-scaling to get high on-demand ingest rates
• We manage everything with Airflow
• Every stage in the pipeline is idempotent
• Every stage in the pipeline is instrumented

Page 39: Resilient Predictive Data Pipelines (QCon London 2016)

ASG
Auto Scaling Group

Page 40: Resilient Predictive Data Pipelines (QCon London 2016)

ASG - Overview

What is it?

• A means to automatically scale out/in clusters to handle variable load/traffic
• A means to keep a cluster/service always up
• Fulfills AWS’s pay-per-use promise!

When to use it?

• Feed-processing, web traffic load balancing, zone outage, etc.

Page 41: Resilient Predictive Data Pipelines (QCon London 2016)

ASG - Data Pipeline

[Diagram: SQS -> Importer ASG (four importer instances, scale out / in) -> DB]

Page 42: Resilient Predictive Data Pipelines (QCon London 2016)

ASG : CPU-based

[Charts: messages Sent, CPU, messages ACKd/Recvd]

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

Page 43: Resilient Predictive Data Pipelines (QCon London 2016)

ASG : CPU-based

[Charts: messages Sent, CPU, messages Recv; annotation: Premature Scale-in]

Premature Scale-in: The CPU drops to noise levels before all messages are consumed. This causes scale-in to occur while the last few messages are still being committed, resulting in a long time-to-drain for the queue!

Page 44: Resilient Predictive Data Pipelines (QCon London 2016)

ASG - Queue-based

• Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
• Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
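
The two rules can be expressed as a small decision function. This sketch assumes `sqs` behaves like a boto3 SQS client (`get_queue_attributes` and both attribute names are real); the return labels and the choice to give scale-out precedence when both rules would fire are assumptions. In practice the same rules would more likely be wired up as alarms on the queue's metrics than polled like this.

```python
def scaling_decision(sqs, queue_url):
    """Return "scale-out", "scale-in", or "hold" from the queue's message counts."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # visible messages
            "ApproximateNumberOfMessagesNotVisible",  # in-flight messages
        ],
    )["Attributes"]
    visible = int(attrs["ApproximateNumberOfMessages"])
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])

    if visible > 0:
        return "scale-out"  # queue depth > 0: grow the ASG
    if in_flight == 0:
        return "scale-in"   # last in-flight message was ACK'd: shrink the ASG
    return "hold"           # nothing waiting, but work is still being committed
```

The "hold" case is what avoids the premature scale-in seen with CPU-based scaling: instances stay up until the last in-flight message is ACK'd.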

Page 45: Resilient Predictive Data Pipelines (QCon London 2016)

Architecture

[Architecture diagram not preserved]

Page 46: Resilient Predictive Data Pipelines (QCon London 2016)

Reliable Hourly Job Scheduling
Workflow Automation & Scheduling

Page 47: Resilient Predictive Data Pipelines (QCon London 2016)

Historical Context

Our first cut at the pipeline used cron to schedule hourly runs of Spark

Problem

We only knew if Spark succeeded. What if a downstream task failed?

Our Needs

We needed something smarter than cron that

• Reliably managed a graph of tasks (DAG - Directed Acyclic Graph)
• Orchestrated hourly runs of that DAG
• Retried failed tasks
• Tracked the performance of each run and its tasks
• Reported on failure/success of runs
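
To make these needs concrete, here is a toy DAG runner in plain Python (3.9+ for `graphlib`). It is not Airflow and not the talk's code; the task names and the `run_dag` helper are hypothetical. It only illustrates running a task graph in dependency order, retrying failures, and reporting per-task status.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(tasks, deps, max_retries=2):
    """Run `tasks` in dependency order, retrying failed tasks.

    tasks: name -> zero-argument callable
    deps:  name -> set of upstream task names (every task must have an entry)
    Returns name -> "success", "failed", or "upstream_failed".
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        # Skip a task if anything upstream of it did not succeed.
        if any(status[up] != "success" for up in deps.get(name, ())):
            status[name] = "upstream_failed"
            continue
        for _attempt in range(1 + max_retries):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"
    return status


attempts = {"import": 0}


def flaky_import():
    """Fails on the first attempt, succeeds on the retry."""
    attempts["import"] += 1
    if attempts["import"] < 2:
        raise RuntimeError("transient failure")


tasks = {"spark": lambda: None, "import": flaky_import, "report": lambda: None}
deps = {"spark": set(), "import": {"spark"}, "report": {"import"}}
result = run_dag(tasks, deps)
```

Unlike the cron setup, a failure in a downstream task shows up in the returned status, and tasks below a failed one are marked rather than silently run.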

Page 48: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow
Workflow Automation & Scheduling

Page 49: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - DAG Dashboard


Airflow: It’s easy to manage multiple DAGs

Page 50: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Authoring DAGs


Airflow: Visualizing a DAG

Page 51: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Authoring DAGs

Airflow: Author DAGs in Python! No need to bundle many config files!

Page 52: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Performance Insights


Airflow: Gantt chart view reveals the slowest tasks for a run!

Page 53: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Performance Insights

Airflow: …And we can easily see performance trends over time

Page 54: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Alerting


Airflow: …And easy to integrate with Ops tools!

Page 55: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Monitoring


Page 56: Resilient Predictive Data Pipelines (QCon London 2016)

Airflow - Join the Community

With >30 Companies, >100 Contributors, and >2500 Commits, Airflow is growing rapidly!

We are looking for more contributors to help support the community!

Disclaimer: I’m a maintainer on the project

Page 57: Resilient Predictive Data Pipelines (QCon London 2016)

Design Goal Scorecard
Are We Meeting Our Design Goals?

Page 58: Resilient Predictive Data Pipelines (QCon London 2016)

Desirable Qualities of a Resilient Data Pipeline

• Scalable
• Available
• Instrumented, Monitored, & Alert-enabled
• Quickly Recoverable

Page 59: Resilient Predictive Data Pipelines (QCon London 2016)

Desirable Qualities of a Resilient Data Pipeline

• Scalable
  • Build using scalable components from AWS
    • SQS, SNS, S3, ASG, EMR Spark
    • Exception = DB (WIP)
• Available
  • Build using available components from AWS
  • Airflow for reliable job scheduling
• Instrumented, Monitored, & Alert-enabled
  • Airflow
• Quickly Recoverable
  • Airflow, DLQs, ASGs, Spark & DB

Page 60: Resilient Predictive Data Pipelines (QCon London 2016)

Questions? (@r39132)
