Transcript
Page 1:

Hari Shreedharan, Software Engineer @ Cloudera

Committer/PMC Member, Apache Flume

Committer, Apache Sqoop

Contributor, Apache Spark

Author, Using Flume (O’Reilly)

Real Time Data Processing using Spark Streaming

Page 2:

Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates

• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

Page 3:

Use Cases Across Industries

Credit: Identify fraudulent transactions as soon as they occur.

Transportation: Dynamic re-routing of traffic or vehicle fleets.

Retail:
• Dynamic inventory management
• Real-time in-store offers and recommendations

Consumer Internet & Mobile: Optimize user engagement based on the user’s current behavior.

Healthcare: Continuously monitor patient vital signs and proactively identify at-risk patients.

Manufacturing:
• Identify equipment failures and react instantly
• Perform proactive maintenance

Surveillance: Identify threats and intrusions in real time.

Digital Advertising & Marketing: Optimize and personalize content based on real-time information.

Page 4:

From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem evolves as well…

Past: Big-Data = Volume + Variety. Batch processing, with time to insight of hours.

Present: Big-Data = Volume + Variety + Velocity. Batch + stream processing, with time to insight of seconds.

Page 5:

Key Components of Streaming Architectures

• Data Ingestion & Transportation Service (Kafka, Flume)

• Real-Time Stream Processing Engine

• Real-Time Data Serving

• Data Management & Integration

• System Management

• Security

Page 6:

Canonical Stream Processing Architecture

Diagram: Data Sources → Data Ingest (Kafka, Flume) → Kafka → App 1, App 2, … → HDFS / HBase

Page 7:

Spark: Easy and Fast Big Data

• Easy to develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
  • 2-5× less code

• Fast to run
  • General execution graphs
  • In-memory storage
  • Up to 10× faster on disk, 100× in memory

Page 8:

Spark Architecture

Diagram: a Driver coordinates multiple Workers; each Worker caches Data in RAM.

Page 9:

RDDs

RDD = Resilient Distributed Dataset

• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization

Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
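
To make these properties concrete, here is a minimal, hypothetical sketch (not from the original slides): it builds an RDD from stable storage, derives new RDDs through transformations, caches one of them, and only materializes data when an action runs. The input path and app name are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

    // Created from data in stable storage; nothing is read yet (lazy materialization).
    val lines = sc.textFile("hdfs:///logs/app.log")   // illustrative path

    // Each transformation returns a new, immutable RDD; the recorded lineage
    // (textFile -> filter -> map) lets lost partitions be recomputed.
    val errors = lines.filter(_.contains("ERROR"))
    val lengths = errors.map(_.length)

    // cache() keeps the RDD in the distributed memory cache; partitions that
    // do not fit are simply recomputed from lineage. count() is the action
    // that actually triggers evaluation.
    lengths.cache()
    println(lengths.count())

    sc.stop()
  }
}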

Page 10:

Spark Streaming

Extension of Apache Spark’s Core API, for Stream Processing.

The framework provides:

• Fault tolerance

• Scalability

• High throughput

Page 11:

Spark Streaming

• Incoming data represented as Discretized Streams (DStreams)

• Stream is broken down into micro-batches

• Each micro-batch is an RDD – can share code between batch and streaming
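
As a minimal sketch of the micro-batch model (illustrative only, not from the slides): a StreamingContext with a 2-second batch interval turns every 2 seconds of input into one RDD, and ordinary RDD-style operations are applied per batch. The socket source, host, and port are assumptions for the example.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    // Every 2 seconds of received input becomes one micro-batch, i.e. one RDD.
    val ssc = new StreamingContext(new SparkConf().setAppName("dstream-sketch"), Seconds(2))

    // A simple built-in source, used here only for illustration.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same operations used on RDDs apply to each micro-batch of the DStream.
    val errors = lines.filter(_.contains("ERROR"))
    errors.print()

    ssc.start()
    ssc.awaitTermination()
  }
}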

Page 12:

val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

“Micro-batch” architecture: the stream is composed of small (1-10 s) batch computations. Each batch of the tweets DStream (batch @ t, t+1, t+2, …) is transformed by flatMap into the corresponding batch of the hashTags DStream, which is then saved.

Page 13:

Use DStreams for Windowing Functions
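
A minimal sketch of a windowed computation, assuming the hashTags DStream from the previous slide; the window and slide durations are illustrative and must be multiples of the batch interval.

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations on older Spark versions
import org.apache.spark.streaming.dstream.DStream

// Count each hashtag over a sliding 10-minute window, recomputed every 5 seconds.
def windowedTagCounts(hashTags: DStream[String]): DStream[(String, Int)] =
  hashTags
    .map(tag => (tag, 1))
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(5))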

Page 14:

Spark Streaming

• Runs as a Spark job

• YARN or standalone for scheduling

• YARN has KDC integration

• Use the same code for real-time Spark Streaming and for batch Spark jobs.

• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ…

• Easy to write “Receivers” for custom messaging systems.
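
For illustration, a minimal custom receiver sketch (not from the slides): it extends Spark’s Receiver API, reads lines from a TCP socket, and hands each record to Spark Streaming with store(). The class name, host, and port are hypothetical.

import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so that onStart() returns immediately.
    new Thread("line-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = { /* nothing to do; the socket is closed in receive() */ }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val lines = Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
      while (!isStopped() && lines.hasNext) {
        store(lines.next())            // hand each record to Spark Streaming
      }
      socket.close()
      restart("Trying to reconnect")   // ask the framework to restart the receiver
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

Such a receiver would be wired in with ssc.receiverStream(new LineReceiver("host", 9999)).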

Page 15:

Sharing Code between Batch and Streaming

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

A library function that filters “ERROR” records

• Streaming generates RDDs periodically

• Any code that operates on RDDs can therefore be used in streaming as well

Page 16:

Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.transform((rdd: RDD[String], time: Time) => {
  filterErrors(rdd)
})
filtered.saveAsTextFiles(…)

Page 17:

Reliability

• Received data automatically persisted to HDFS Write Ahead Log to prevent data loss

• Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark configuration (see the sketch after this list)

• When the Application Master (AM) dies, the application is restarted by YARN

• Received, ack-ed and unprocessed data replayed from WAL (data that made it into blocks)

• Reliable Receivers can replay data from the original source, if required

• Un-acked data replayed from source.

• Kafka, Flume receivers bundled with Spark are examples

• Reliable Receivers + WAL = No data loss on driver or receiver failure!
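
A hedged sketch of how these pieces fit together in the driver (the configuration key is from the slide; the app name, checkpoint path, and batch interval are illustrative): the WAL is enabled in the Spark configuration, and checkpoint-based recovery lets a restarted application, for example after the AM dies, pick up where it left off.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReliableStreamingApp {
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("reliable-streaming-app")
      // Persist received blocks to the write-ahead log on HDFS.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("hdfs:///user/spark/checkpoints/reliable-app")
    // ... define sources and output operations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, recover from the checkpoint if one exists; otherwise create a new context.
    val ssc = StreamingContext.getOrCreate(
      "hdfs:///user/spark/checkpoints/reliable-app", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}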

Page 18:

Kafka Connectors

• Reliable Kafka DStream

• Stores received data to Write Ahead Log on HDFS for replay

• No data loss

• Stable and supported!

• Direct Kafka DStream

• Uses low level API to pull data from Kafka

• Replays from Kafka on driver failure

• No data loss

• Experimental
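
A sketch of both connectors using the Spark-1.3-era KafkaUtils API; the ZooKeeper quorum, broker list, topic, and group names are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def kafkaStreams(ssc: StreamingContext) = {
  // Reliable (receiver-based) DStream: pair with the WAL for no data loss.
  val receiverStream = KafkaUtils.createStream(
    ssc, "zk1:2181", "demo-group", Map("events" -> 1))

  // Direct DStream: no receiver; offsets are tracked by the driver and data
  // is replayed from Kafka itself on failure.
  val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

  (receiverStream, directStream)
}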

Page 19:

Flume Connector

• Flume Polling DStream

• Add the Spark sink (available from Maven) to Flume’s plugin directory

• Flume Polling Receiver polls the sink to receive data

• Replays received data from WAL on HDFS

• No data loss

• Stable and Supported!
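
A minimal sketch of wiring the polling receiver, assuming a Flume agent is already running the Spark sink (org.apache.spark.streaming.flume.sink.SparkSink); the host and port below are placeholders that must match the sink’s configuration.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.flume.FlumeUtils

// Pull events from the Spark sink running inside the Flume agent.
def flumeEvents(ssc: StreamingContext) =
  FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9988)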

Page 20:

Spark Streaming Use-Cases

• Real-time dashboards

• Show approximate results in real-time

• Reconcile periodically with source-of-truth using Spark

• Joins of multiple streams (see the sketch after this list)

• Time-based or count-based “windows”

• Combine multiple sources of input to produce composite data

• Re-use RDDs created by Streaming in other Spark jobs.
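
As an illustration of the join and windowing use cases above, a hypothetical sketch (stream names, key types, and window sizes are assumptions): two keyed DStreams are windowed over the same one-minute interval and joined on their key.

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations on older Spark versions
import org.apache.spark.streaming.dstream.DStream

// Join per-user clicks with per-user ad impressions over matching one-minute windows.
def joinClicksWithImpressions(
    clicks: DStream[(String, Long)],        // (userId, clickTimestamp)
    impressions: DStream[(String, String)]  // (userId, adId)
): DStream[(String, (Long, String))] =
  clicks.window(Minutes(1), Seconds(10))
    .join(impressions.window(Minutes(1), Seconds(10)))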

Page 21:

What is coming?

• Run on Secure YARN for more than 7 days!

• Better Monitoring and alerting

• Batch-level and task-level monitoring

• SQL on Streaming

• Run SQL-like queries on top of Streaming (medium – long term)

• Python!

• Limited support coming in Spark 1.3

Page 22:

Current Spark project status

• 400+ contributors and 50+ companies contributing

• Includes Databricks, Cloudera, Intel, Yahoo!, etc.

• Dozens of production deployments

• Spark Streaming survived the Netflix Chaos Monkey – production ready!

• Included in CDH!

Page 23:

More Info..

• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html

• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/

• Apache Spark homepage: http://spark.apache.org/

• Github: https://github.com/apache/spark

Page 24:

Thank you

[email protected]

15% discount code for Cloudera Training: PNWCUG_15

university.cloudera.com