Introducing Apache Airflow (Incubating), Sid Anand (@r39132), Data Day Seattle 2016

Introduction to Apache Airflow - Data Day Seattle 2016

Jan 06, 2017


Sid Anand
Transcript
Page 1: Introduction to Apache Airflow - Data Day Seattle 2016

Introducing Apache Airflow (Incubating)

Sid Anand (@r39132) Data Day Seattle 2016

1

Page 2: Introduction to Apache Airflow - Data Day Seattle 2016

About Me

2

Work [ed | s] @

Maintainer on

Reports to

Co-Chair for

Page 3: Introduction to Apache Airflow - Data Day Seattle 2016

Apache Airflow

3

What is it?

Page 4: Introduction to Apache Airflow - Data Day Seattle 2016

4

Apache Airflow : What is it?

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)

Page 5: Introduction to Apache Airflow - Data Day Seattle 2016

5

Apache Airflow : What is it?

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)

It ships with
• DAG Scheduler
• Web application (UI)
• Powerful CLI

Page 6: Introduction to Apache Airflow - Data Day Seattle 2016

6

Apache Airflow : What is it?

Page 7: Introduction to Apache Airflow - Data Day Seattle 2016

Airflow - Authoring DAGs

7

Airflow: Visualizing a DAG

Page 8: Introduction to Apache Airflow - Data Day Seattle 2016

8

Airflow: Author DAGs in Python! No need to bundle many XML files!

Airflow - Authoring DAGs

Page 9: Introduction to Apache Airflow - Data Day Seattle 2016

9

Airflow: The Tree View offers a view of DAG Runs over time!

Airflow - Authoring DAGs

Page 10: Introduction to Apache Airflow - Data Day Seattle 2016

Airflow - Performance Insights

10

Airflow: Gantt charts reveal the slowest tasks for a run!

Page 11: Introduction to Apache Airflow - Data Day Seattle 2016

11

Airflow: …And we can easily see performance trends over time

Airflow - Performance Insights

Page 12: Introduction to Apache Airflow - Data Day Seattle 2016

12

Apache Airflow : What is it?

When would you use a Workflow Scheduler like Airflow?

• ETL Pipelines

• Machine Learning Pipelines

• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…

• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment

Page 13: Introduction to Apache Airflow - Data Day Seattle 2016

13

Apache Airflow : What is it?

What should a Workflow Scheduler do well?

• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks

• Handle task failures

• Report / Alert on failures

• Monitor performance of tasks over time

• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met

• Scale
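The first requirement above, scheduling a graph of dependencies, can be sketched in plain Python. This is an illustration of dependency-ordered execution (Kahn's algorithm), not Airflow's scheduler; the function name `execution_order` and the task names in the example are hypothetical.

```python
from collections import deque


def execution_order(upstream):
    """Return task ids in an order that respects dependencies.

    `upstream` maps a task id to the task ids it depends on.
    Raises ValueError if the graph contains a cycle (so it is not a DAG).
    """
    # Collect every task, whether it appears as a key or only as a dependency.
    tasks = set(upstream)
    for deps in upstream.values():
        tasks.update(deps)

    deps_of = {t: set(upstream.get(t, ())) for t in tasks}
    downstream = {t: [] for t in tasks}
    for task, deps in deps_of.items():
        for dep in deps:
            downstream[dep].append(task)

    # Tasks with no unmet dependencies are ready to run.
    unmet = {t: len(deps) for t, deps in deps_of.items()}
    ready = deque(sorted(t for t, n in unmet.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            unmet[child] -= 1
            if unmet[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected; not a DAG")
    return order


# Hypothetical pipeline: each task lists the tasks it waits for.
pipeline = {
    "score_messages": {"wait_for_s3"},
    "import_to_db": {"score_messages"},
    "send_report": {"import_to_db"},
}
print(execution_order(pipeline))
```

A real scheduler also has to handle retries, failures and alerting, which are the remaining items on the list.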

Page 14: Introduction to Apache Airflow - Data Day Seattle 2016

14

Apache Airflow : What is it?

What Does Apache Airflow Add?

• Configuration-as-code

• Usability - Stunning UI / UX

• Centralized configuration

• Resource Pooling

• Extensibility

Page 15: Introduction to Apache Airflow - Data Day Seattle 2016

Apache Airflow

15

Incubating

Page 16: Introduction to Apache Airflow - Data Day Seattle 2016

16

Apache Airflow : Incubating

Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow -> Apache Incubator

Today
• 166+ Contributors
• 300+ Users
• 40+ companies officially using it!
• 9 Committers/Maintainers <- We’re growing here

Page 17: Introduction to Apache Airflow - Data Day Seattle 2016

Agari

17

What We Do!

Page 18: Introduction to Apache Airflow - Data Day Seattle 2016

Agari : What We Do

18

Page 19: Introduction to Apache Airflow - Data Day Seattle 2016

19

Agari : What We Do

Page 20: Introduction to Apache Airflow - Data Day Seattle 2016

20

Agari : What We Do

Page 21: Introduction to Apache Airflow - Data Day Seattle 2016

21

Agari : What We Do

Page 22: Introduction to Apache Airflow - Data Day Seattle 2016

22

Agari : What We Do

Page 23: Introduction to Apache Airflow - Data Day Seattle 2016

23

Enterprise Customers

email metadata

apply trust models

email md + trust score

Agari’s Current Product

Agari : What We Do

Page 24: Introduction to Apache Airflow - Data Day Seattle 2016

24

email metadata

apply trust models

email md+ trust score

Agari’s Future Product

Enterprise Customers

Agari : What We Do

Page 25: Introduction to Apache Airflow - Data Day Seattle 2016

Apache Airflow @ Agari

How Do We Use It?

25

Page 26: Introduction to Apache Airflow - Data Day Seattle 2016

Classes of Orchestration

26

apply trust models (message scoring)

build trust models

cron++ (general job scheduler)

New Product (Enterprise Protect)

Operational Automation

BI / ETL

N / A

Page 27: Introduction to Apache Airflow - Data Day Seattle 2016

Classes of Orchestration

27

apply trust models (message scoring)

build trust models

cron++ (general job scheduler)

New Product (Enterprise Protect)

Operational Automation

BI / ETL

N / A

This Talk

Page 28: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

Batch Pipeline Architecture

28

Page 29: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

29

enterprise A, enterprise B, enterprise C

S3

S3 uploads every 15 minutes

Page 30: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

30

enterprise A, enterprise B, enterprise C

S3

Airflow kicks off a Spark message scoring job every hour

Page 31: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

31

enterprise A, enterprise B, enterprise C

S3

Spark job writes scored messages and stats to another S3 bucket

S3

Page 32: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

32

enterprise A, enterprise B, enterprise C

S3

This triggers SNS/SQS message events

S3

SNS

SQS

Page 33: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

33

enterprise A, enterprise B, enterprise C

S3

An Autoscale Group (ASG) of Importers spins up when it detects SQS messages

S3

SNS

SQS

Importers

ASG

Page 34: Introduction to Apache Airflow - Data Day Seattle 2016

34

enterprise A, enterprise B, enterprise C

S3

The importers rapidly ingest scored messages and aggregate statistics into the DB

S3

SNS

SQS

Importers

ASG

DB

Use-Case : Message Scoring

Page 35: Introduction to Apache Airflow - Data Day Seattle 2016

35

enterprise A, enterprise B, enterprise C

S3

Users receive alerts of untrusted emails & can review them in the web app

S3

SNS

SQS

Importers

ASG

DB

Use-Case : Message Scoring

Page 36: Introduction to Apache Airflow - Data Day Seattle 2016

36

enterprise A, enterprise B, enterprise C

S3 S3

SNS

SQS

Importers

ASG

DB

Airflow manages the entire process

Use-Case : Message Scoring

Page 37: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Message Scoring

Airflow DAG

37

Page 38: Introduction to Apache Airflow - Data Day Seattle 2016

38

Airflow DAG

Page 39: Introduction to Apache Airflow - Data Day Seattle 2016

39

5 minute wait for S3 eventual consistency

Airflow DAG

Page 40: Introduction to Apache Airflow - Data Day Seattle 2016

40

1 hour a day, we also build new models

Airflow DAG

Page 41: Introduction to Apache Airflow - Data Day Seattle 2016

41

build models

Airflow DAG

Page 42: Introduction to Apache Airflow - Data Day Seattle 2016

42

dummy needed for branch operator

Airflow DAG

Page 43: Introduction to Apache Airflow - Data Day Seattle 2016

43

• trigger_rule: one_success

Airflow DAG
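The branching described on the last few slides (build models on one hourly run per day, a dummy task on the other branch, and a `one_success` trigger rule where the branches rejoin) can be sketched in plain Python. This only models the decision logic and does not use Airflow's operators. The hour constant and the task ids `build_models` and `skip_build` are assumptions; the slides do not name them.

```python
from datetime import datetime

# Assumption: the slides say "1 hour a day" but not which hour.
MODEL_BUILD_HOUR = 0


def choose_branch(execution_date):
    """Return the id of the task to follow for this hourly run.

    A branch callable picks exactly one downstream path; the other
    path is skipped for that run.
    """
    if execution_date.hour == MODEL_BUILD_HOUR:
        return "build_models"
    return "skip_build"  # the no-op ("dummy") branch


def one_success(upstream_states):
    """Model of the one_success rule: run if any upstream task succeeded."""
    return any(state == "success" for state in upstream_states)


# On a normal hour the build branch is skipped and the dummy succeeds,
# so the task that joins the two branches can still run.
print(choose_branch(datetime(2016, 6, 1, 13)))
print(one_success(["skipped", "success"]))
```

The dummy task gives the branch a second target to choose, and the relaxed trigger rule keeps the join task from being skipped just because one of its two parents was.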

Page 44: Introduction to Apache Airflow - Data Day Seattle 2016

44

Prep Spark Run

Airflow DAG

Page 45: Introduction to Apache Airflow - Data Day Seattle 2016

45

• Run Spark

• Verify a record is written to the DB

• Wait for the SQS queue to empty

Airflow DAG
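The "wait for the SQS queue to empty" step is a poll-with-timeout loop. A minimal sketch follows, assuming a caller-supplied `get_depth` function that returns the current queue depth (in practice this might read the queue's approximate message count). The helper name, the default timeout and the poll interval are all hypothetical.

```python
import time


def wait_until_empty(get_depth, timeout_s=3600.0, poll_s=30.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `get_depth()` until it returns 0 or the timeout expires.

    Returns True if the queue drained, False on timeout. `sleep` and
    `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if get_depth() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Example with a fake queue that drains over three polls.
depths = iter([3, 1, 0])
drained = wait_until_empty(lambda: next(depths), sleep=lambda s: None)
print(drained)
```

Returning False on timeout, rather than blocking forever, lets the calling task fail and raise an alert when the importers fall behind.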

Page 46: Introduction to Apache Airflow - Data Day Seattle 2016

46

• Compute discrepancies

• Send email report

• Update monitoring graphs

• Raise SLA (correctness) alerts

Airflow DAG

Page 47: Introduction to Apache Airflow - Data Day Seattle 2016

SLAs & Insights

Airflow

47

Page 48: Introduction to Apache Airflow - Data Day Seattle 2016

48

Desirable Qualities of a Resilient Data Pipeline

Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions

Timeliness
• All output within time-bound SLAs (e.g. 1 hour)

Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability

Scalable/Available
• ASGs, SQS, SNS, S3

Page 49: Introduction to Apache Airflow - Data Day Seattle 2016

49

Desirable Qualities of a Resilient Data Pipeline

Correctness (SLA)
• Data Integrity (no loss, etc…)
• Expected data distributions

Timeliness (SLA)
• All output within time-bound SLAs (e.g. 1 hour)

Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability

Scalable/Available
• ASGs, SQS, SNS, S3

Page 50: Introduction to Apache Airflow - Data Day Seattle 2016

50

Correctness : Email Reporting

orgs

Page 51: Introduction to Apache Airflow - Data Day Seattle 2016

51

Correctness : Email Reporting

For each org, we check for duplicate or missing data as a count & percentage

orgs
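The per-org check for duplicate or missing data, as a count and a percentage, can be sketched as follows. This assumes the comparison is between an expected and an actual message count per org, which is a simplification: a surplus is reported as duplicates and a shortfall as missing. The function name and the org ids are hypothetical.

```python
def discrepancy_report(expected, actual):
    """Compare per-org message counts.

    `expected` and `actual` map org id -> message count. For each org the
    result holds the missing count, the duplicate count, and the size of
    the discrepancy as a percentage of the expected count.
    """
    report = {}
    for org, want in expected.items():
        got = actual.get(org, 0)
        delta = got - want
        report[org] = {
            "missing": max(-delta, 0),
            "duplicate": max(delta, 0),
            "pct": 100 * abs(delta) / want if want else 0.0,
        }
    return report


# Hypothetical counts: org_a lost 10 messages, org_b gained 4 duplicates.
report = discrepancy_report(
    {"org_a": 1000, "org_b": 200},
    {"org_a": 990, "org_b": 204},
)
print(report)
```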

Page 52: Introduction to Apache Airflow - Data Day Seattle 2016

52

Correctness : Email Reporting

These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!

orgs
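Locating the stage where a discrepancy first appears amounts to comparing counts between consecutive stages. A minimal sketch, assuming one message count per stage; the stage names `uploaded`, `scored` and `imported` are guesses based on the architecture slides, not names taken from the report.

```python
def first_divergent_stage(stage_counts):
    """Return the first stage whose count differs from the stage before it.

    `stage_counts` is a list of (stage_name, count) pairs in pipeline
    order. Returns None when all stages agree.
    """
    for (_, prev_count), (name, count) in zip(stage_counts, stage_counts[1:]):
        if count != prev_count:
            return name
    return None


# Hypothetical counts: messages went missing between scoring and import.
stages = [("uploaded", 1000), ("scored", 1000), ("imported", 990)]
print(first_divergent_stage(stages))
```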

Page 53: Introduction to Apache Airflow - Data Day Seattle 2016

53

Correctness : Monitoring

Page 54: Introduction to Apache Airflow - Data Day Seattle 2016

54

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Page 55: Introduction to Apache Airflow - Data Day Seattle 2016

55

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

Page 56: Introduction to Apache Airflow - Data Day Seattle 2016

56

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)
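The slide does not show the body of `sla_alert_func`. A minimal sketch of what such a callback might look like follows. It assumes Airflow calls the function with five positional arguments (the DAG, a text list of tasks that missed their SLA, a text list of blocking tasks, and the underlying SLA-miss and task-instance objects), so check the signature against the Airflow version in use. The helper `format_sla_alert` and the stand-in DAG object are hypothetical, and the `print` stands in for a call to an Ops tool.

```python
from types import SimpleNamespace


def format_sla_alert(dag_id, task_list, blocking_task_list):
    """Build a short, human-readable alert body for an SLA miss."""
    lines = [
        "SLA miss in DAG: {}".format(dag_id),
        "Tasks that missed their SLA:",
        str(task_list),
    ]
    if blocking_task_list:
        lines += ["Blocking tasks:", str(blocking_task_list)]
    return "\n".join(lines)


def sla_alert_func(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Assumption: only dag.dag_id and the two text lists are needed here;
    # `slas` and `blocking_tis` are accepted but unused in this sketch.
    message = format_sla_alert(dag.dag_id, task_list, blocking_task_list)
    print(message)  # stand-in for paging or posting to an Ops tool
    return message


# Example with a stand-in DAG object instead of a real Airflow DAG.
fake_dag = SimpleNamespace(dag_id="message_scoring")
alert = sla_alert_func(fake_dag, "run_spark on 2016-06-01T00:00:00", "", [], [])
```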

Page 57: Introduction to Apache Airflow - Data Day Seattle 2016

57

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

Correctness SLA miss

Page 58: Introduction to Apache Airflow - Data Day Seattle 2016

58

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps

Page 59: Introduction to Apache Airflow - Data Day Seattle 2016

Use-Case : Model Building v2

For Both Batch & Near Realtime Scoring Pipelines

59

Page 60: Introduction to Apache Airflow - Data Day Seattle 2016

60

Airflow DAG

Page 61: Introduction to Apache Airflow - Data Day Seattle 2016

61

Model Building DAG

Launch an EMR cluster

Page 62: Introduction to Apache Airflow - Data Day Seattle 2016

62

Run model building as EMR steps

Model Building DAG

Page 63: Introduction to Apache Airflow - Data Day Seattle 2016

63

Validate models

Send email notification if tests fail

Model Building DAG

Page 64: Introduction to Apache Airflow - Data Day Seattle 2016

64

Terminate EMR cluster

Model Building DAG
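The four steps of the model building DAG (launch a cluster, run the steps, validate and notify, terminate) share one important property: the cluster should be terminated even if an earlier step fails. A plain-Python sketch of that control flow follows. The five callables are hypothetical placeholders for the real EMR and email operations, and in an Airflow DAG the same effect would come from task dependencies and trigger rules rather than `try`/`finally`.

```python
def run_model_build(launch_cluster, run_steps, validate_models,
                    notify, terminate_cluster):
    """Run the model-building lifecycle and always clean up the cluster.

    Returns True if the models validated, False otherwise. Exceptions
    from the steps propagate, but only after the cluster is terminated.
    """
    cluster_id = launch_cluster()
    try:
        run_steps(cluster_id)
        ok = validate_models()
        if not ok:
            notify("model validation failed")
        return ok
    finally:
        # Runs on success, on validation failure, and on exceptions.
        terminate_cluster(cluster_id)


# Example with fake operations that record what was called.
calls = []
result = run_model_build(
    launch_cluster=lambda: "cluster-1",
    run_steps=lambda cid: calls.append(("steps", cid)),
    validate_models=lambda: False,
    notify=lambda msg: calls.append(("notify", msg)),
    terminate_cluster=lambda cid: calls.append(("terminate", cid)),
)
print(result, calls)
```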

Page 65: Introduction to Apache Airflow - Data Day Seattle 2016

Apache Airflow Next Steps

65

Areas for Improvement

Page 66: Introduction to Apache Airflow - Data Day Seattle 2016

66

Apache Airflow Next Steps

Improvement Areas

• Security

• API (though we do have a CLI)

• Deployment / Versioning

• Execution Scale Out

• On-demand Execution

Page 67: Introduction to Apache Airflow - Data Day Seattle 2016

Acknowledgments

67

None of this work would be possible without the contributions of the strong team below

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones

• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle

Page 68: Introduction to Apache Airflow - Data Day Seattle 2016

Questions? (@r39132)

68