
Building Spark Streaming Pipelines with Cask Hydrator, by Gokul Gunasekaran, Cask

Jan 07, 2017

Cask Data, Inc

Building Spark Streaming Pipelines with Cask Hydrator

Gokul Gunasekaran, Software Engineer, Cask Data

Aug 31, 2016

Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.

cask.co

INGEST: any data from any source, in real-time and batch

BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop

EGRESS: any data to any destination, in real-time and batch

Data Pipeline: provides the ability to automate complex workflows that involve fetching data, performing non-trivial transformations, and deriving and serving insights from the data

Flight Data Analysis Use Case

Noise due to low flight paths is a common problem. We want to find the affected airports around the country using flight data sensors placed around airports.

Challenge:

✦ Hadoop ETL pipeline(s) stitched together using hard-to-maintain, brittle scripts

✦ Not many developers with expertise in Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka, Hive)

✦ Hard to debug and validate, resulting in frequent failures in the production environment

Demo

Fetch Flight sensor data from Kafka and find out the affected airport areas

• Sensors are pushing data into Kafka about flight altitude/velocity etc.

• Fetch data from Kafka and batch events every minute

• Group the data by airport code and compute the average altitudes

• Keep only the airport areas where the average altitude is less than a threshold

• Write the filtered airport codes to HDFS
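
The group-by, average, and filter steps above can be sketched in plain Python for a single one-minute batch. This is only an illustration of the logic, not how Hydrator runs it: in the demo these steps are pipeline plugins. The function name, the sample events, and the threshold of 350 are assumptions made for the example.

```python
from collections import defaultdict


def low_altitude_airports(events, threshold):
    """Return airport codes whose average altitude in this batch is below threshold.

    `events` is an iterable of (airport_code, altitude) pairs for one batch.
    """
    totals = defaultdict(lambda: [0.0, 0])  # airport -> [sum of altitudes, count]
    for airport, altitude in events:
        totals[airport][0] += altitude
        totals[airport][1] += 1
    averages = {airport: s / n for airport, (s, n) in totals.items()}
    return sorted(a for a, avg in averages.items() if avg < threshold)


# Hypothetical batch: SAN averages 300, SFO averages 400.
batch = [("SAN", 400), ("SFO", 300), ("SAN", 200), ("SFO", 500)]
affected = low_altitude_airports(batch, threshold=350)
print(affected)
```

With these made-up readings only SAN falls below the threshold, so it is the only code that would be written to the sink.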

Flight Data from sensor

1472669109, SAN, 400, 200
1472669109, SFO, 300, 400
….

Fields: Timestamp, Destination Airport Code, Altitude, Velocity
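
A minimal sketch of parsing one sensor record into the four fields listed above. The type names, the choice of integers for every numeric field, and the error handling are assumptions; the slide does not state units or types.

```python
from typing import NamedTuple


class FlightReading(NamedTuple):
    timestamp: int   # assumed to be Unix epoch seconds
    airport: str     # destination airport code
    altitude: int    # unit not given on the slide
    velocity: int    # unit not given on the slide


def parse_reading(line: str) -> FlightReading:
    """Parse a comma-separated sensor record into a FlightReading."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    ts, airport, altitude, velocity = parts
    return FlightReading(int(ts), airport, int(altitude), int(velocity))


reading = parse_reading("1472669109, SAN, 400, 200")
print(reading.airport, reading.altitude)
```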

Hydrator Studio

✦ Drag-and-drop GUI for visual Data Pipeline creation

✦ Rich library of pre-built sources, transforms, and sinks for data ingestion and ETL use cases

✦ Separation of pipeline creation from the execution framework (MapReduce, Spark, Spark Streaming, etc.)

✦ Hadoop-native and Hadoop-distribution agnostic

Hydrator Data Pipeline

✦ Captures metadata, audit, and lineage information, which can be visualized using Cask Tracker

✦ Pre- and post-run notifications, centralized metrics and log collection for ease of operability

✦ Simple Java API to build your own sources, transforms, and sinks with class loader isolation

✦ SparkML-based plugins and Python transforms for data scientists

Out of the box Integrations

✦ ElasticSearch, Cassandra, Kafka, SFTP, JMS and many more sources and sinks

✦ De-duplicate, Group By Aggregation, Row Denormalizer and other transforms

Custom Plugins

✦ Implement your own batch (or streaming) source, transform, and sink plugins using a simple Java API

Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360

Hydrator, Tracker

CASK DATA APP PLATFORM

Hadoop ecosystem, 50 different projects

Top 6 Hadoop distributions

Pipeline Implementation

Logical Pipeline → Planner → Physical Workflow → MR/Spark Executions (on CDAP)

✦ Planner converts the logical pipeline to a physical execution plan

✦ Optimizes and bundles functions into one or more MR/Spark jobs, or into a Spark Streaming job in the case of a realtime pipeline

✦ CDAP is the runtime environment where all the components of the data pipeline are executed

✦ CDAP provides centralized log and metrics collection, transaction, lineage and audit information
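
To make the idea of "bundling functions into jobs" concrete, here is a toy sketch. It is not Hydrator's planner. The bundling rule (a job holds at most one aggregation stage, so a second aggregation starts a new job), the stage names, and the stage kinds are all assumptions chosen only to show a logical pipeline being cut into physical jobs.

```python
def bundle_stages(stages):
    """Greedily bundle a linear list of (name, kind) stages into jobs.

    Illustrative rule only: a job may contain at most one 'aggregate'
    stage, so encountering a second one closes the current job.
    """
    jobs, current, has_aggregate = [], [], False
    for name, kind in stages:
        if kind == "aggregate" and has_aggregate:
            jobs.append(current)
            current, has_aggregate = [], False
        current.append(name)
        has_aggregate = has_aggregate or kind == "aggregate"
    if current:
        jobs.append(current)
    return jobs


# Hypothetical logical pipeline with two aggregation stages.
logical = [
    ("kafka_source", "source"),
    ("parse", "transform"),
    ("group_by_airport", "aggregate"),
    ("filter_low", "transform"),
    ("dedupe", "aggregate"),
    ("hdfs_sink", "sink"),
]
plan = bundle_stages(logical)
for i, job in enumerate(plan, start=1):
    print(f"job {i}: {' -> '.join(job)}")
```

Under this assumed rule the six logical stages become two jobs, with the cut placed just before the second aggregation.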

Hydrator Realtime Data Pipeline

✦ Generates micro-batches of data at regular intervals

✦ Supports sliding windows, aggregations, various transforms, joins and ML

✦ Checkpointing of pipeline state is coming soon

Streaming Source

✦ Uses Spark DStreams (Discretized Streams)

✦ Generates a new RDD every batch interval (a pipeline property, e.g. 10 sec)
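
The micro-batching idea can be sketched without Spark: events are bucketed by the batch interval they fall into, and each bucket plays the role of one RDD. This is a simplification under stated assumptions. It buckets by a timestamp carried on each event, whereas Spark Streaming forms batches by arrival time and also emits empty batches; the function name and sample events are made up for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into batches keyed by interval start."""
    batches = defaultdict(list)
    for ts, payload in events:
        start = ts - ts % batch_interval  # start of the interval containing ts
        batches[start].append(payload)
    return dict(sorted(batches.items()))


# Hypothetical events with timestamps in seconds and a 10 sec batch interval.
events = [(100, "a"), (104, "b"), (111, "c"), (125, "d")]
batches = micro_batches(events, batch_interval=10)
print(batches)
```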

Windowing

✦ A sliding window is defined by its size and slide interval, both of which are multiples of the batch interval

In this example, size = 3, slide = 2
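
A minimal sketch of the windowing described above, assuming size and slide are counted in batch intervals. Each list element stands in for one batch (one RDD), and a window is emitted every `slide` batches covering the most recent `size` batches. The function name and batch labels are hypothetical.

```python
def sliding_windows(batches, size, slide):
    """Return windows of the last `size` batches, one every `slide` batches."""
    windows = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - size)  # the first window may be shorter than `size`
        windows.append(batches[start:end])
    return windows


batches = ["b1", "b2", "b3", "b4", "b5", "b6"]
windows = sliding_windows(batches, size=3, slide=2)
for w in windows:
    print(w)
```

Because the slide (2) is smaller than the size (3), consecutive windows overlap by one batch, so that batch contributes to two window results.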

ODBC Connector for BI Tools

✦ Explore CDAP Streams and Datasets from popular BI tools through the CDAP ODBC connector

Upcoming capabilities

* Checkpointing capability in Spark Streaming (HYDRATOR-378)

* More ML and other plugins

Thank you!

[email protected]

@CaskData

github.com/caskdata/cdap

github.com/caskdata/hydrator-plugins

Questions?

Self-Service Data Ingestion and ETL for Data Lakes

Built for Production on CDAP

Rich Drag-and-Drop User Interface

Open Source & Highly Extensible