Cask Hydrator: Open Source, Code-Free Data Pipelines | Global Big Data Conf 2016

Hydrator: Code-free Data Pipelines for Hadoop, Spark, and HBase

Jonathan Gray, CEO @ Cask

Global Big Data Conference - August 30th, 2016

cask.co

Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.


About Me


Hadoop Enables New Apps and Patterns


ENTERPRISE DATA LAKES

• Batch and Realtime Data Ingestion: Any type of data from any type of source in any volume

• Batch and Streaming ETL: Code-free self-service creation and management of pipelines

• SQL Exploration and Data Science: All data is automatically accessible via SQL and client SDKs

• Data as a Service: Easily expose generic or custom REST APIs on any data

BIG DATA ANALYTICS

• 360° Customer View: Integrate data from any source and expose through queries and APIs

• Realtime Dashboards: Perform realtime OLAP aggregations and serve them through REST APIs

• Time Series Analysis: Store, process and serve massive volumes of time-series data

• Realtime Log Analytics: Ingestion and processing of high-throughput streaming log events

PRODUCTION DATA APPS

• Recommendation Engines: Build models in batch using historical data and serve them in realtime

• Anomaly Detection Systems: Process streaming events and predictably compare them in realtime to historical data

• NRT Event Monitoring: Reliably monitor large streams of data and perform defined actions within a specified time

• Internet of Things: Ingestion, storage and processing of events that is highly available, scalable and consistent

Web Analytics and Reporting Use Case

Transfer web log data from S3 to a Hadoop cluster every hour for backup, perform analytics on it, and enable realtime reporting of metrics such as the number of successful and failed responses, the most popular pages, etc.

The Challenges:

✦ Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts

✦ Not enough people with expertise across all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka), or a general lack of expertise

✦ Hard to debug and validate, resulting in frequent failures in the production environment

✦ Difficult to integrate into SQL / BI reporting solutions for business users

✦ As use cases advance into Data Science, Machine Learning, and Predictive Analytics, you need to include scientists and advanced ML programmers


The Many Faces of Hadoop


✦ Developer: Advanced Programming. Focused on App Logic

✦ Data Scientist: Basic Dev & Complex Analytics. Focused on Data & Algorithms

✦ IT Pro / Ops: Configuring & Monitoring. Focused on Infrastructure & SLAs

✦ LOB / Product: Decision Making & Driving Revenue. Focused on Apps & Insights

Challenge: The tools are missing to connect these users and take apps from prototype to production


Enter Cask

Key Customers and Partners

Named a Gartner Cool Vendor 2016

Founded in 2011 by early Hadoop engineers from Facebook and Yahoo!


Introducing the Data Application Platform


Deployment Models: On-premises, Hybrid, Cloud

Platform Layers: Governance, Operations, Pre-packaged Integrations, Orchestration/Automation/Workflows, Core Application and Data Integration

Role-based User Experience: Developer, Data Scientist, IT/Ops


Introducing the Cask Data App Platform


Open Source, Integrated Framework for Building and Running Data Applications on Hadoop and Spark

• Supports all major Hadoop distros

• Integrates the latest Big Data technologies

• 100% open source and highly extensible


What’s in CDAP?

Cask Hydrator: A self-service, re-configurable, code-free framework to build, run and operate real-time or batch data pipelines in the cloud or on premises.

Cask Tracker: A self-service tool for tracking the flow of data in and out of the data lake. Track, index and search technical, business and operational metadata of applications and pipelines.

CDAP: An integration platform that integrates and abstracts the underlying Hadoop technologies. Build data analytics solutions in the cloud or on premises.


A self-service, code-free framework to build, run and operate data pipelines on Apache Hadoop and Spark

✦ Built for Production on CDAP

✦ Rich Drag-and-Drop User Interface

✦ Open Source & Highly Extensible


INGEST: any data from any source, in real-time and batch

BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop

EGRESS: any data to any destination, in real-time and batch

Hydrator Data Pipelines provide the ability to automate complex workflows that involve fetching data, possibly from multiple data sources; combining it; performing non-trivial transformations and aggregations on it; writing it to one or more data sinks; and making it available for applications and analytics.
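The workflow described above (fetch from several sources, combine, transform, aggregate, write to one or more sinks) can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Hydrator's API: `run_pipeline`, the in-memory sources, the `normalize` transform, and the list-backed sinks are all hypothetical names made up for this example.

```python
from collections import defaultdict

def run_pipeline(sources, transforms, group_key, sinks):
    """Toy source -> transform -> aggregate -> sink pipeline (illustrative only)."""
    # Fetch and combine records from every source.
    records = [record for source in sources for record in source()]
    # Apply each transform in order; a transform drops a record by returning None.
    for transform in transforms:
        records = [out for out in map(transform, records) if out is not None]
    # Aggregate: count records per value of group_key.
    counts = defaultdict(int)
    for record in records:
        counts[record[group_key]] += 1
    result = [{group_key: key, "count": n} for key, n in sorted(counts.items())]
    # Write the aggregated records to every sink.
    for sink in sinks:
        sink(result)
    return result

# Hypothetical in-memory sources standing in for real systems.
def web_source():
    return [{"page": "/Home"}, {"page": "/docs"}]

def mobile_source():
    return [{"page": "/home"}, {"page": ""}]

def normalize(record):
    # Lower-case the page and drop records with an empty page.
    page = record["page"].lower()
    return {"page": page} if page else None

table, archive = [], []  # two list-backed "sinks"
result = run_pipeline([web_source, mobile_source], [normalize], "page",
                      [table.extend, archive.extend])
```

Each stage is just a callable, which mirrors the idea that sources, transforms, and sinks are interchangeable building blocks wired together by the pipeline definition rather than by hand-written glue scripts.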


Stack of Data Enablers


Hydrator Studio

✦ Drag-and-drop GUI for visual Data Pipeline creation

✦ Rich library of pre-built sources, transforms, and sinks for data ingestion and ETL use cases

✦ Separation of pipeline creation from the execution framework (MapReduce, Spark, Spark Streaming, etc.)

✦ Hadoop-native and Hadoop distro agnostic


Hydrator Data Pipeline

✦ Captures metadata, audit, and lineage information, which can be discovered and visualized using Cask Tracker

✦ Notifications, scheduling, and monitoring with centralized metrics and log collection for ease of operability

✦ Simple Java API to build your own sources, transforms, and sinks with class loading isolation

✦ JavaScript and Python transforms

✦ Include arbitrary Spark jobs


Out of the box Integrations

✦ Elastic, SFTP, Cassandra, Kafka, RDBMS, EDW and many more sources and sinks

✦ Parse/Encode/Hash, Distinct/Group By, Custom JavaScript/Python Transforms


Custom Plugins

✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API


Pipeline Implementation

[Diagram: Logical Pipeline → Planner → Physical Workflow → MR/Spark Executions, running on CDAP]

✦ Planner converts logical pipeline to a physical execution plan

✦ Optimizes and bundles functions into one or more MR/Spark jobs

✦ CDAP is the runtime environment where all the components of the data pipeline are executed

✦ CDAP provides centralized log and metrics collection, transaction, lineage and audit information
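To make the idea of bundling logical stages into physical jobs concrete, here is a minimal sketch. It is not Hydrator's planner: the `Stage` type, the `plan` function, and the bundling rule (an aggregate stage closes the current job, as a stand-in for a shuffle boundary) are assumptions chosen purely for illustration, and it handles only linear pipelines.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Stage:
    name: str
    kind: str  # "source", "transform", "aggregate", or "sink"

def plan(stages: List[Stage]) -> List[List[str]]:
    """Bundle a linear logical pipeline into an ordered list of jobs.

    Assumed rule (illustrative only): an aggregate stage closes the
    current job, so each job contains at most one aggregation.
    """
    jobs: List[List[str]] = []
    current: List[str] = []
    for stage in stages:
        current.append(stage.name)
        if stage.kind == "aggregate":
            jobs.append(current)
            current = []
    if current:  # trailing stages after the last aggregation
        jobs.append(current)
    return jobs

# Hypothetical logical pipeline.
logical = [
    Stage("s3_source", "source"),
    Stage("log_parser", "transform"),
    Stage("group_by_ip", "aggregate"),
    Stage("format_output", "transform"),
    Stage("table_sink", "sink"),
]
physical = plan(logical)
```

Under these assumptions the five logical stages collapse into two jobs, which shows why the person drawing the pipeline does not need to think in terms of individual MR/Spark jobs.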


New CDAP 3.5 - Latest Features

✦ Security - Authentication and Authorization: Support for fine-grained, role-based authorization of entities in CDAP. Integration with Sentry and Ranger

✦ Security - Impersonation and Encryption: Ability to run CDAP and CDAP Apps as specified users, and ability to encrypt/decrypt sensitive configuration

✦ Tracker - Data Usage Analytics: Learn how datasets are being used and which applications access them most

✦ Metadata Taxonomy: Support for annotating business metadata based on a business-specified taxonomy

✦ Hydrator - Spark Streaming: Build and run Hydrator real-time pipelines using Spark Streaming

✦ Hydrator - Preview Mode: Ability to preview pipelines with real or injected data before deploying (Standalone)

✦ Hydrator - Join & Action: Capability to join multiple streams (inner & outer), and ability to configure actions that run binaries on designated nodes

✦ Hydrator - Plugins: Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML


Demo Example: Load Log Files from S3 to HDFS and perform aggregations/analysis

• Start with web access logs stored in Amazon S3

• Store the raw logs in HDFS as Avro files

• Parse the access log lines into individual fields

• Calculate the total number of requests by IP and status code

• Find the IPs that received the most successful status codes and the most error codes

Sample web access log (Combined Log Format):

69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"

Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer, Client Info
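For readers who want to see the parse and aggregate steps of the demo outside Hydrator, the following is a small standalone Python sketch. It is not how the Hydrator plugins are implemented. The regular expression, the function names, and the definitions of "successful" (2xx) and "error" (4xx and 5xx) are assumptions, and every input line other than the sample from the slide is made up for illustration.

```python
import re
from collections import Counter

# Combined Log Format: ip ident user [timestamp] "method uri protocol" status size "referrer" "client"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one access log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record

def count_by_ip_and_status(lines):
    """Count requests per (ip, status) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["ip"], record["status"])] += 1
    return counts

def top_ip(counts, status_matches):
    """Return (ip, total) for the IP with the most requests whose status satisfies status_matches."""
    totals = Counter()
    for (ip, status), n in counts.items():
        if status_matches(status):
            totals[ip] += n
    return totals.most_common(1)[0] if totals else None

lines = [
    # Sample line from the slide.
    '69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 '
    '"http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit '
    'Chrome/38.0.2125.122 Safari/537.36"',
    # The remaining lines are invented test data.
    '69.181.160.120 - - [08/Feb/2015:04:36:45 +0000] "GET /index HTTP/1.1" 200 1024 "-" "curl/7.40.0"',
    '10.0.0.1 - - [08/Feb/2015:04:37:02 +0000] "GET /missing HTTP/1.1" 404 0 "-" "curl/7.40.0"',
    '10.0.0.1 - - [08/Feb/2015:04:37:05 +0000] "GET /missing HTTP/1.1" 404 0 "-" "curl/7.40.0"',
    'not a log line',
]

counts = count_by_ip_and_status(lines)
most_successful = top_ip(counts, lambda status: 200 <= status < 300)
most_errors = top_ip(counts, lambda status: status >= 400)
```

Keeping the counts keyed by (IP, status) first and deriving the "top" IPs from them afterwards follows the order of the demo steps, so the intermediate table can also be stored or queried on its own.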


Thanks!

Jonathan Gray @jgrayla

Download CDAP w/ Hydrator: http://cask.co/downloads/

