
Apache Spark: The Next Gen toolset for Big Data Processing

Jun 27, 2015

The Spark project from Apache (spark.apache.org) is a next-generation Big Data processing system. It uses a new architecture and in-memory processing to deliver orders-of-magnitude performance improvements. Some call it the successor to the Hadoop toolset: Hadoop is a batch-mode Big Data processor that depends on disk-based files, while Spark supports real-time and interactive processing in addition to batch processing.

Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An Overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLlib: Machine Learning
4. Performance characteristics of Spark
Transcript
Page 1: Apache Spark: The Next Gen toolset for Big Data Processing

Prajod Vettiyattil

Architect, Open source

Wipro

in.linkedin.com/in/prajod

@prajods

Apache Spark: The Next Gen toolset for Big Data Processing

Namitha M S

Architect, Advanced Technologies

Wipro

in.linkedin.com/in/namithams

Open Source India Nov 2014 Bangalore

Page 2

• Big Data

• Hadoop stack and its limitations

• Spark: An overview

• Streaming, GraphX and MLlib

• Performance characteristics of Spark

Agenda

Page 3

• Data too huge for normal systems

• 3 Vs: Volume, Variety, Velocity

• Storage challenge

• Analysis challenge

• Query results take hours, days or months

Big Data


Page 4

The Big Data Analysis Triad

• Batch

• Interactive

• Streaming

Page 5

The Hadoop stack

• Distributed data processing

• Fault tolerant

• Processes petabyte data sets

• Ecosystem tools

• Hive, HBase

• Pig

• Storm

• Hadoop

• Map

• Reduce

• Shuffle, partition, sort

• HDFS
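The map, shuffle/partition/sort, and reduce phases listed above can be sketched in plain Python, with word count as the classic example (a toy illustration, no Hadoop involved):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs from each input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs, num_reducers=2):
    # Shuffle: partition pairs by key hash for the target reducers,
    # then sort each partition by key
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions.values()]

def reduce_phase(partition):
    # Reduce: sum the values for each key
    counts = defaultdict(int)
    for key, value in partition:
        counts[key] += value
    return dict(counts)

lines = ["spark is fast", "hadoop is batch", "spark is in memory"]
result = {}
for part in shuffle_phase(map_phase(lines)):
    result.update(reduce_phase(part))
print(result["spark"])  # 2
print(result["is"])     # 3
```

In real Hadoop each phase runs on different nodes with the intermediate data spilled to disk, which is exactly the disk I/O the next slide highlights.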

Page 6

Hadoop: Data flow

On Map nodes:

• Read input data files

• Buffer map output in memory; partition it for the target reducers

• Sort each partition by key, with potential spills to disk

• Merge all partitions and write to disk

On Reduce nodes:

• HTTP fetch of map output from the map nodes

• Merge sort over multiple merge rounds (round 1, round 2, ... round N)

• Reduce, then write the output

High disk I/O on both map and reduce nodes.

Page 7

• Batch mode only

• Covers only the batch layer in the Lambda pattern

• No real-time processing

• No support for repetitive queries

• Poor fit for iterative algorithms

• Poor fit for interactive data querying

• Poor support for distributed memory

Limitations of Hadoop

Page 8

Spark: An overview

• “Over time, fewer projects will use MapReduce, and more will use Spark”

• Doug Cutting, creator of Hadoop

• New architecture: scale better and simplify

• In memory processing for Big Data

• Cached intermediate data sets

• Multi-step DAG based execution

• Resilient Distributed Datasets (RDD)

• The core innovation in Spark
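The cached intermediate data sets and deferred DAG execution can be illustrated with a toy class (a hypothetical MiniRDD, not the real Spark API): transformations only record how to compute the data, and an action triggers the actual work.

```python
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute   # lineage: how to (re)build this dataset
        self._cached = None

    def _materialize(self):
        # Use the in-memory copy if cached, otherwise recompute from lineage
        if self._cached is not None:
            return self._cached
        return self._compute()

    def map(self, f):
        # Transformation: returns a new MiniRDD, no work done yet
        return MiniRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Keep the intermediate data set in memory for reuse
        self._cached = self._compute()
        return self

    def collect(self):
        # Action: triggers actual computation
        return self._materialize()

nums = MiniRDD(lambda: list(range(10)))
evens = nums.filter(lambda x: x % 2 == 0).cache()  # cached intermediate set
squares = evens.map(lambda x: x * x)
print(squares.collect())  # [0, 4, 16, 36, 64]
```

Any later transformation chained off `evens` reuses the cached values instead of re-reading the source, which is what makes iterative workloads cheap in Spark.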

Page 9

Spark Ecosystem tools

Built on the Apache Spark core:

• Spark SQL

• Spark Streaming

• MLlib

• GraphX

• SparkR

• BlinkDB

• Shark

• Bagel

Page 10

DAG Execution Engine

Operations such as Map, Filter, Reduce, Sort and Collect form the nodes of the execution graph; the engine runs them in dependency order.

DAG = Directed Acyclic Graph
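A minimal sketch of what a DAG execution engine does: run each operation only after everything it depends on has run. Here Python's stdlib `graphlib` stands in for Spark's scheduler, with hypothetical stage names:

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages it depends on (its predecessors).
# This models a simple Map -> Filter -> Map -> Reduce -> Sort -> Collect chain.
dag = {
    "map1":    [],
    "filter":  ["map1"],
    "map2":    ["filter"],
    "reduce":  ["map2"],
    "sort":    ["reduce"],
    "collect": ["sort"],
}

# static_order() yields a valid execution order for the DAG
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['map1', 'filter', 'map2', 'reduce', 'sort', 'collect']
```

Because Spark sees the whole graph before executing, it can pipeline stages and skip unnecessary disk writes between them, unlike MapReduce's rigid two-phase model.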

Page 11

• Resilient Distributed Data sets

• Features

• Read only

• Fault tolerance without replication

• Uses data lineage for recovery

• Low network I/O

• Partitions/slices drive parallel tasks

RDD lineage example: Disk → Transform 1 → RDD 1 → Transform 2 → RDD 2, with each RDD split into data partitions.
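Fault tolerance without replication can be sketched as replaying the recorded lineage over the source data: if a partition of RDD 2 is lost, Spark recomputes just that partition instead of restoring a replicated copy (a toy illustration, not Spark internals; the transforms are invented for the example):

```python
source = [1, 2, 3, 4, 5, 6]

# The lineage records how each RDD was derived from its parent
lineage = [
    lambda xs: [x * 10 for x in xs],       # Transform 1
    lambda xs: [x for x in xs if x > 20],  # Transform 2
]

def rebuild(source_partition, lineage):
    # Recovery: replay the lineage over the surviving source partition
    data = source_partition
    for transform in lineage:
        data = transform(data)
    return data

# Suppose the node holding the second half of the data fails;
# only that partition's lineage is replayed:
recovered = rebuild(source[3:], lineage)
print(recovered)  # [40, 50, 60]
```

Only the source partition is re-read, so recovery costs recomputation rather than the network and storage overhead of keeping replicas, which is the low network I/O point above.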

Page 12

Lambda architecture pattern

• Used for Lambda architecture implementation

• Batch layer

• Speed layer

• Serving layer

Flow: input feeds both the Batch layer and the Speed layer; the Serving layer answers queries from data consumers.
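A toy sketch of what the serving layer does at query time: merge the precomputed batch view with the speed layer's recent increments (hypothetical page-count data, invented for the example):

```python
# Batch view: recomputed periodically over the full historical data set
batch_view = {"page_a": 1000, "page_b": 500}

# Speed view: real-time increments since the last batch recomputation
speed_view = {"page_a": 7, "page_c": 3}

def query(key):
    # Serving layer: combine both views to answer with fresh totals
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))  # 1007
print(query("page_c"))  # 3
```

With Spark, the same engine can implement both layers: Spark batch jobs produce the batch view and Spark Streaming maintains the speed view.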

Page 13

Spark Streaming

• For stream processing in Spark

• Real time data

• Like Twitter queries

• Discretized streams (DStreams)

• Micro batches

• Sequence of RDDs
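Discretization can be sketched as grouping timestamped events into x-second windows; each window becomes a micro-batch that is then processed like a small RDD (plain Python, not the DStream API):

```python
def discretize(events, batch_seconds=2):
    # events: (timestamp, value) pairs; group values by batch window
    batches = {}
    for ts, value in events:
        window = int(ts // batch_seconds)
        batches.setdefault(window, []).append(value)
    # Return micro-batches in time order, like a sequence of RDDs
    return [batches[w] for w in sorted(batches)]

events = [(0.5, "a"), (1.9, "b"), (2.1, "c"), (3.0, "d"), (4.2, "e")]
for batch in discretize(events):
    print(batch)  # ['a', 'b'] then ['c', 'd'] then ['e']
```

The batch interval trades latency for throughput: smaller windows mean fresher results but more per-batch scheduling overhead, which is why the latency floor discussed below exists.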

Page 14

Discretized Streams

Input → Spark Streaming (batches of x seconds) → Spark → Output

Page 15

Why Spark Streaming

• Near real time processing (0.5 – 2 sec latency)

• Parallel recovery of lost nodes and stragglers

• Implementation of Lambda architecture

• Single engine for batch and stream

• Not suited for very low latency requirements (e.g., under 100 ms)

Page 16

Apache Storm vs Spark Streaming

Feature | Spark Streaming | Storm

Processing model | Micro-batching | Event stream processing

Message delivery options | Inherently fault tolerant, exactly-once delivery | At least once, at most once, exactly once

Flexibility | Coarse-grained transformations | Fine-grained transformations

Implemented in | Scala | Clojure

Development cost | Common platform for both batch and stream | Stream only; separate setup for batch

Applicability | Machine learning, interactive analytics, near-real-time analytics | Near-real-time analytics, natural language processing

Page 17

GraphX & MLlib

• Data-parallel vs graph-parallel processing

• e.g., Wikipedia text search vs Facebook connection search, PageRank

• Spark MLlib implements high quality machine learning algorithms

• Iterative Algorithm Paradigm

• Leverage Spark’s in memory data sets

Iterative update: x(t+1) = f(x(t))
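The iterative paradigm, repeating x(t+1) = f(x(t)) over an in-memory data set, can be sketched with PageRank on a tiny hand-made three-page graph (plain Python, not GraphX or MLlib):

```python
# Link structure: page "a" links to "b" and "c", "b" to "c", "c" to "a"
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}  # x(0): uniform initial ranks
d = 0.85                               # damping factor

for _ in range(50):  # repeat x(t+1) = f(x(t)) until (approximate) convergence
    contribs = {page: 0.0 for page in links}
    for page, outs in links.items():
        for out in outs:
            # Each page splits its rank evenly among its outlinks
            contribs[out] += ranks[page] / len(outs)
    ranks = {p: (1 - d) + d * c for p, c in contribs.items()}

print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

In Spark the ranks and links would be cached RDDs, so each of the 50 passes reads them from memory; in MapReduce every iteration would be a fresh job re-reading the graph from HDFS.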

Page 18

Performance characteristics

Performance of Spark

• Up to 100x faster than Hadoop MapReduce in memory

• Up to 10x faster on disk

Graph courtesy: spark.apache.org

Page 19

Hadoop vs Spark

Metric | Hadoop | Spark (100 TB) | Spark World Record (1 PB)

Data size | 102.5 TB | 100 TB | 1000 TB

Elapsed time | 72 mins | 23 mins | 234 mins

Nodes | 2100 | 206 | 190

Cores | 50,400 | 6,592 | 6,080

Reducers | 10,000 | 29,000 | 250,000

Rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min

Rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min

Data courtesy: databricks.com

Page 20

1 TB performance test: data per sec

Page 21

1 TB performance test data rate vs RAM size

Page 22

Summary

Apache Spark

• New architecture: RDD, DAG

• In memory processing

• Map reduce and more: GraphX, MLlib, Spark Streaming

Ecosystem tools

• Spark R

• Blink DB

• Storm

Spark performance

• GBs per second

• RAM to data size

• Inflexion point

Page 23

Questions

Prajod Vettiyattil

Architect, Open source

Wipro

@prajods

in.linkedin.com/in/prajod

Namitha M S

Architect, Advanced Technologies

Wipro

in.linkedin.com/in/namithams