Automating Workflows for Analytics Pipelines Sadayuki Furuhashi Open Source Summit 2017

events17.linuxfoundation.jp/sites/events/files/slides/oss-summit...

Aug 31, 2019

Page 1:

Automating Workflows for Analytics Pipelines

Sadayuki Furuhashi

Open Source Summit 2017

Page 2:

Sadayuki Furuhashi

A founder of Treasure Data, Inc. located in Silicon Valley.

OSS projects I founded:

An open-source hacker.

Github: @frsyuki

Page 3:

What's a Workflow Engine?

• Automates your manual operations.

• Load data → Clean up → Analyze → Build reports
• Get customer list → Generate HTML → Send email
• Monitor server status → Restart on abnormal
• Backup database → Alert on failure
• Run test → Package it → Deploy (Continuous Delivery)

Page 4:

Challenge: Multiple Clouds & Regions

On-Premises

Different APIs, different tools, many scripts.

Page 5:

Challenge: Multiple DB technologies

Amazon S3

Amazon Redshift

Amazon EMR

Page 6:

Challenge: Multiple DB technologies

Amazon S3

Amazon Redshift

Amazon EMR

> Hi! > I'm a new technology!

Page 7:

Challenge: Modern complex data analytics

Ingest → Enrich → Model → Load → Utilize

Ingest
• Application logs
• User attribute data
• Ad impressions
• 3rd-party cookie data

Enrich
• Removing bot access
• Geo location from IP address
• Parsing User-Agent
• JOIN user attributes to event logs

Model
• A/B Testing
• Funnel analysis
• Segmentation analysis
• Machine learning

Load
• Creating indexes
• Data partitioning
• Data compression
• Statistics collection

Utilize
• Recommendation API
• Realtime ad bidding
• Visualize using BI applications

Page 8:

Traditional "false" solution

#!/bin/bash

./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh

# Run every Hive query on a temporary EMR cluster, then shut it down.
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive "$query"
done
./shutdown_emr_cluster.sh

./run_redshift_queries.sh
./call_finish_notification.sh

> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on too long run
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization

Page 9:

Solution: Multi-Cloud Workflow Engine

Solves

> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on too long run
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
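
Two items from the list above, retrying on errors and alerting on failure, can be sketched in a few lines of Python. This is only an illustration of the idea, not how Digdag implements it; `run_with_retry`, `run_pipeline`, the `alert` callback, and the step names are all hypothetical.

```python
import time
from typing import Callable, List, Tuple


def run_with_retry(step: Callable[[], None], retries: int = 2,
                   delay_sec: float = 0.0) -> int:
    """Run `step`, retrying up to `retries` extra times.

    Returns the number of attempts used; re-raises the last error
    once the retries are exhausted.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            step()
            return attempt
        except Exception:
            if attempt > retries:
                raise
            time.sleep(delay_sec)


def run_pipeline(steps: List[Tuple[str, Callable[[], None]]],
                 alert: Callable[[str], None],
                 retries: int = 2) -> List[str]:
    """Run named steps in order; on a final failure, alert and stop.

    Returns the names of the steps that completed.
    """
    done = []
    for name, step in steps:
        try:
            run_with_retry(step, retries=retries)
        except Exception as exc:
            alert(f"step {name!r} failed: {exc}")
            break
        done.append(name)
    return done


if __name__ == "__main__":
    # Hypothetical steps standing in for the shell scripts of the slide.
    steps = [("load", lambda: None), ("query", lambda: None)]
    print(run_pipeline(steps, alert=print))
```

The plain bash script has neither behavior: the first failing command either stops everything silently or is ignored.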

Page 10:

Example in our case

1. Dump data to BigQuery

2. Load all tables to Treasure Data

3. Run queries

4. Create reports on Tableau Server (on-premises)

5. Notify on Slack

Page 11:

Workflow constructs

Page 12:

Unite Engineering & Analytic Teams

+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql

Powerful for Engineers
> Comfortable for advanced users

Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows

Page 13:

Unite Engineering & Analytic Teams

Powerful for Engineers
> Comfortable for advanced users

Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows

+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql

+ is a task

> is an operator

${...} is a variable
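
The three syntax rules above can be mimicked with a small Python sketch. It is a simplification for illustration: `classify_key` and `substitute` are hypothetical helpers, and the substitution only looks up plain names, whereas Digdag's own `${...}` handling is richer.

```python
import re

# Matches ${name} placeholders.
_VAR = re.compile(r"\$\{([^}]+)\}")


def classify_key(key: str) -> str:
    """Classify a workflow key by the conventions on the slide."""
    if key.startswith("+"):
        return "task"        # e.g. +load_table
    if key.endswith(">"):
        return "operator"    # e.g. redshift>
    return "parameter"       # anything else, e.g. timezone


def substitute(text: str, params: dict) -> str:
    """Replace each ${name} with the matching value from `params`."""
    def lookup(match: re.Match) -> str:
        return str(params[match.group(1).strip()])
    return _VAR.sub(lookup, text)


if __name__ == "__main__":
    # The date value is an arbitrary example.
    print(classify_key("+wait_for_arrival"))
    print(substitute("bucket/www_${session_date}.csv",
                     {"session_date": "2017-05-31"}))
```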

Page 14:

Operator library

_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close

Standard libraries
redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats a task for result rows
mail>: sends an email

Open-source libraries
You can release & use open-source operator libraries.
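
One way to picture an operator library is a registry that maps the name before `>` to a handler, with exported parameters merged into each task. The Python sketch below assumes exactly that; `OPERATORS`, `operator`, `run_task`, and the stand-in handlers are hypothetical, and the nested `_export: td:` block is flattened to a plain dict for brevity.

```python
from typing import Callable, Dict

# Registry of operator handlers, keyed by the name written before ">".
OPERATORS: Dict[str, Callable[[str, dict], str]] = {}


def operator(name: str):
    """Decorator that registers a handler under an operator name."""
    def register(handler):
        OPERATORS[name] = handler
        return handler
    return register


@operator("td")
def run_td(command: str, params: dict) -> str:
    # Stand-in: a real handler would submit the query file to Treasure Data.
    return f"td: run {command} on {params['database']} into {params['create_table']}"


@operator("mail")
def send_mail(command: str, params: dict) -> str:
    # Stand-in: a real handler would send an email.
    return f"mail: send {command} to {params['to']}"


def run_task(task: dict, exported: dict) -> str:
    """Dispatch a task to its operator, merging exported parameters."""
    op_keys = [key for key in task if key.endswith(">")]
    if len(op_keys) != 1:
        raise ValueError("a task needs exactly one operator key")
    op_key = op_keys[0]
    name = op_key[:-1]
    if name not in OPERATORS:
        raise KeyError(f"unknown operator: {name}")
    # The task's own parameters override exported ones.
    own = {k: v for k, v in task.items() if k != op_key}
    return OPERATORS[name](task[op_key], {**exported, **own})


if __name__ == "__main__":
    exported = {"database": "workflow_temp"}
    print(run_task({"td>": "queries/open.sql", "create_table": "daily_open"}, exported))
```

Under this assumption, a third-party library only needs to register more handlers; the dispatcher itself does not change.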

Page 15:

Parallel execution

+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql

Parallel execution
Tasks under the same group run in parallel if the _parallel option is set to true.
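
The effect of the `_parallel` option can be approximated in Python with a thread pool. This is a sketch of the behavior only, not Digdag's executor; `run_group` is a hypothetical helper and the lambdas stand in for the two Redshift copy tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_group(subtasks: Dict[str, Callable[[], object]],
              parallel: bool = False) -> Dict[str, object]:
    """Run the subtasks of one group and collect results by task name.

    With parallel=False the subtasks run one after another, in order.
    With parallel=True they are all submitted to a thread pool at once.
    """
    if not parallel:
        return {name: task() for name, task in subtasks.items()}
    with ThreadPoolExecutor(max_workers=len(subtasks) or 1) as pool:
        futures = {name: pool.submit(task) for name, task in subtasks.items()}
        # result() re-raises any exception raised inside a subtask.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    group = {
        "+load_users": lambda: "users loaded",
        "+load_items": lambda: "items loaded",
    }
    print(run_group(group, parallel=True))
```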

Page 16:

Loops & Parameters

+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      mail>: template.txt
      to: ${td.for_each.addr}

Parameter
A task can propagate parameters to the following tasks.

Loop
Generates subtasks dynamically so that Digdag applies the same set of operators to different data sets.
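
The loop idea, one generated subtask per result row with the row's columns available as parameters, can be sketched in Python as follows. `for_each_row`, the subtask naming, and the example addresses are hypothetical; only the `td.for_each.` prefix is taken from the slide.

```python
import re
from typing import Dict, List, Tuple

# Matches ${name} placeholders.
_VAR = re.compile(r"\$\{([^}]+)\}")


def render(text: str, params: dict) -> str:
    """Replace each ${name} with the matching value from `params`."""
    return _VAR.sub(lambda m: str(params[m.group(1).strip()]), text)


def for_each_row(rows: List[Dict[str, object]],
                 do_template: Dict[str, object],
                 prefix: str = "td.for_each") -> List[Tuple[str, dict]]:
    """Generate one rendered subtask per result row."""
    subtasks = []
    for index, row in enumerate(rows):
        # Expose each column as <prefix>.<column>.
        params = {f"{prefix}.{column}": value for column, value in row.items()}
        rendered = {
            key: render(value, params) if isinstance(value, str) else value
            for key, value in do_template.items()
        }
        subtasks.append((f"+send_{index}", rendered))
    return subtasks


if __name__ == "__main__":
    rows = [{"addr": "a@example.com"}, {"addr": "b@example.com"}]
    template = {"mail>": "template.txt", "to": "${td.for_each.addr}"}
    for name, task in for_each_row(rows, template):
        print(name, task)
```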

Page 17:

Grouping workflows...

Ingest → Enrich → Model → Load → Utilize

[Diagram: many ungrouped +task nodes spread across the pipeline stages.]

Page 18:

Grouping workflows

Ingest → Enrich → Model → Load → Utilize

[Diagram: the +task nodes are organized into the groups +ingest, +enrich, +model, +basket_analysis, +learn, and +load.]
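
Grouping amounts to nesting tasks inside tasks. A minimal Python sketch of walking such a tree is shown below; the exact nesting, the leaf names, and the concatenated naming scheme are assumptions made for illustration, since the diagram does not survive in this transcript.

```python
from typing import List, Optional


def flatten(group: dict, prefix: str = "") -> List[str]:
    """Return the full name of every leaf task in a nested group.

    A dict value is treated as a nested group; anything else is a leaf.
    """
    names = []
    for name, child in group.items():
        full_name = prefix + name
        if isinstance(child, dict):
            names.extend(flatten(child, full_name))
        else:
            names.append(full_name)
    return names


if __name__ == "__main__":
    # Hypothetical tree: group names follow the slide, leaves are invented.
    workflow = {
        "+ingest": {"+fetch_logs": None},
        "+model": {
            "+basket_analysis": {"+count_pairs": None},
            "+learn": None,
        },
        "+load": {"+copy_tables": None},
    }
    for task_name in flatten(workflow):
        print(task_name)
```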

Page 19:

Pushing workflows to a server with Docker image

schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker

Digdag server
> Develop on laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor

Docker
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
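
What a daily schedule at 01:30:00 in Asia/Tokyo means, and what a backfill has to cover, can be sketched in Python. This is not Digdag's scheduler; `next_daily_run` and `backfill_dates` are hypothetical helpers, and a fixed UTC+9 offset stands in for Asia/Tokyo, which has no daylight saving time.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import List

# Asia/Tokyo as a fixed offset (UTC+9, no daylight saving time).
JST = timezone(timedelta(hours=9), "JST")


def next_daily_run(now: datetime, at: time = time(1, 30),
                   tz: timezone = JST) -> datetime:
    """Return the next run time strictly after `now` (timezone-aware)."""
    local_now = now.astimezone(tz)
    candidate = datetime.combine(local_now.date(), at, tzinfo=tz)
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate


def backfill_dates(first: date, last: date) -> List[date]:
    """List every session date from `first` to `last`, inclusive."""
    days = (last - first).days
    return [first + timedelta(days=offset) for offset in range(days + 1)]


if __name__ == "__main__":
    # Example timestamps are arbitrary.
    now = datetime(2017, 5, 31, 0, 0, tzinfo=timezone.utc)
    print(next_daily_run(now).isoformat())
    print(backfill_dates(date(2017, 5, 29), date(2017, 5, 31)))
```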

Page 20:

Demo

Page 21:

Digdag is production-ready

It's just like a web application.

[Diagram: Digdag client and Visual UI connect to a Digdag server (API & scheduler & executor); PostgreSQL holds all task state.]

Page 22:

Digdag is production-ready

Stateless servers + Replicated DB

[Diagram: Digdag client and Visual UI connect through an HTTP Load Balancer to multiple Digdag servers (API & scheduler & executor); PostgreSQL (HA) holds all task state.]
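
Why stateless servers plus a shared database work can be illustrated with a toy task table: any server may claim the next ready task, and the conditional update guarantees that only one of them wins. The Python sketch below is not Digdag's schema or code; the table layout and function names are hypothetical, and SQLite stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the shared store and create the task table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "name TEXT PRIMARY KEY, state TEXT NOT NULL, owner TEXT)"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, name: str) -> None:
    """Add a task in the 'ready' state."""
    conn.execute("INSERT INTO tasks (name, state) VALUES (?, 'ready')", (name,))
    conn.commit()


def claim_next(conn: sqlite3.Connection, server_id: str) -> Optional[str]:
    """Claim one ready task for `server_id`; return its name or None."""
    row = conn.execute(
        "SELECT name FROM tasks WHERE state = 'ready' ORDER BY name LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The state check makes the claim safe if another server got there first.
    cursor = conn.execute(
        "UPDATE tasks SET state = 'running', owner = ? "
        "WHERE name = ? AND state = 'ready'",
        (server_id, row[0]),
    )
    conn.commit()
    return row[0] if cursor.rowcount == 1 else None


if __name__ == "__main__":
    store = open_store()
    enqueue(store, "+task1")
    enqueue(store, "+task2")
    print(claim_next(store, "server-a"), claim_next(store, "server-b"))
```

Because the servers keep nothing in memory that matters, losing one of them loses no task state in this model.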

Page 23:

Digdag is production-ready

Isolating API and execution for reliability

[Diagram: Digdag client connects through an HTTP Load Balancer to Digdag servers; the API and the scheduler & executor run on separate Digdag servers; PostgreSQL (HA) holds all task state.]

Page 24:

Digdag at Treasure Data

3,600 workflows run every day

28,000 tasks run every day

850 active workflows

400,000 workflow executions in total
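
A quick back-of-the-envelope calculation on the figures above, written as Python. It assumes the 3,600 daily runs all come from the 850 active workflows, which the slide does not state explicitly.

```python
# Figures taken from the slide.
workflow_runs_per_day = 3_600
task_runs_per_day = 28_000
active_workflows = 850

# Average number of tasks in one workflow run.
tasks_per_run = task_runs_per_day / workflow_runs_per_day

# Average number of runs per active workflow per day (see assumption above).
runs_per_active_workflow = workflow_runs_per_day / active_workflows

if __name__ == "__main__":
    print(f"about {tasks_per_run:.1f} tasks per workflow run")
    print(f"about {runs_per_active_workflow:.1f} runs per active workflow per day")
```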

Page 25:

Digdag & Open Source

Page 26:

Learning from my OSS projects

• Make it pluggable!

700+ plugins in 6 years

200+ plugins in 3 years
input/output, parser/formatter, decoder/encoder, filter, and executor

input/output, and filter

70+ implementations in 8 years

Page 27:

Digdag also has a plugin architecture

32 operators
7 schedulers
2 command executors
1 error notification module

Page 28:

Sadayuki Furuhashi

Visit my website!
https://digdag.io