Apache Spark: Stream Programming and Distributed Data Processing
Habib Ahmed Bhutto, Senior Software Engineer, iConnect360
Getting started with Apache Spark

Apr 15, 2017

Page 1: Getting started with Apache Spark

Apache Spark: Stream Programming and Distributed Data Processing

Habib Ahmed Bhutto, Senior Software Engineer

iConnect360

Page 2: Getting started with Apache Spark

Outline

• What’s Spark
• Why Spark
• Fundamental concepts
• Cluster Deployment
• Spark Streaming
• Application Development
• Deployment
• Application Monitoring
• Debugging

Page 3: Getting started with Apache Spark

What’s Spark

• Fast
• General-purpose engine
• For large-scale data processing
• In-memory processing
• Built at AMPLab, University of California, Berkeley, as a research project that runs alongside Hadoop
• Now an Apache Software Foundation project

Page 4: Getting started with Apache Spark

Why Spark

• Speed
• Ease of use
• Generality
• Runs everywhere (Hadoop, Mesos, standalone or in cloud)
• Fault Tolerance
• Integration
• Deployment

Page 5: Getting started with Apache Spark

Fundamental Concepts

• What exactly it does

[Figures: Hadoop execution flow vs. Spark execution flow]

Page 6: Getting started with Apache Spark

Fundamental Concepts

• How exactly it does it

Page 7: Getting started with Apache Spark

Fundamental Concepts

• Resilient Distributed Dataset (RDD)
– Abstraction
– Immutable
– Partitioned collection
– Operated on in parallel

• RDD Operations
– Actions
– Transformations

• Spark Context
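
The RDD ideas on this slide (immutable, partitioned, transformations vs. actions) can be sketched in plain Python. This is a toy model only: `ToyRDD` is a hypothetical class, not the Spark API, although real RDDs do expose methods named `map`, `filter`, `collect` and `count`. It assumes transformations are lazy and only actions trigger computation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, partitions, lineage=()):
        # Partitions are stored as tuples so the source data cannot be mutated.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Lineage: ordered, pending per-partition transformation steps.
        self._lineage = tuple(lineage)

    # Transformations: return a NEW ToyRDD and compute nothing yet.
    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._partitions, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._partitions, self._lineage + (step,))

    # Actions: replay the lineage on every partition and return a result.
    def _compute(self):
        results = []
        for part in self._partitions:  # a real engine would process these in parallel
            data = list(part)
            for step in self._lineage:
                data = step(data)
            results.append(data)
        return results

    def collect(self):
        return [x for part in self._compute() for x in part]

    def count(self):
        return sum(len(part) for part in self._compute())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # one dataset, two partitions
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

result = evens_squared.collect()  # action: the work happens here
total = evens_squared.count()     # action: lineage is replayed
print(result, total)
```

Note that `numbers` is unchanged after the chain of transformations; each step produces a new dataset description rather than modifying the old one.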

Page 8: Getting started with Apache Spark

Fundamental Concepts

• Driver Program
• Cluster Manager
• Worker Node
• Executor
• Job
• Stage
• Task
• Application Jar
• Deploy Mode

Page 9: Getting started with Apache Spark

Cluster Deployment

• Standalone
• Amazon EC2
• Apache Mesos
• Hadoop YARN

Page 10: Getting started with Apache Spark

Cluster Deployment

• Master web page to monitor your cluster
– http://<server-url>:8080

Page 11: Getting started with Apache Spark

Spark Streaming

• How it works

Page 12: Getting started with Apache Spark

Spark Streaming

• How it works internally

Page 13: Getting started with Apache Spark

Spark Streaming

Page 14: Getting started with Apache Spark

Spark Streaming

• Discretized Streams (DStreams)
– Abstraction
– Continuous stream
– Input data / processed data
– Series of RDDs

Page 15: Getting started with Apache Spark

Spark Streaming

• Any operation applied on a DStream translates to operations on the underlying RDDs
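
A minimal sketch of that idea in plain Python, with no Spark involved: a stream is cut into batches by time interval, and a stream-level operation is just the same operation applied to every batch. The helper names `discretize` and `dstream_map` and the sample events are hypothetical, and each list stands in for one RDD.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches.

    Assumes non-negative timestamps in seconds. Intervals with no events
    still yield an (empty) batch, so the batch sequence stays continuous.
    """
    buckets = {}
    for ts, value in events:
        buckets.setdefault(int(ts // batch_interval), []).append(value)
    if not buckets:
        return []
    return [buckets.get(i, []) for i in range(max(buckets) + 1)]


def dstream_map(batches, fn):
    """A 'stream' operation: apply fn to the elements of every batch."""
    return [[fn(x) for x in batch] for batch in batches]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)
upper = dstream_map(batches, str.upper)
print(batches)
print(upper)
```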

Page 16: Getting started with Apache Spark

Spark Streaming

• Window Operations
• Output Operations
• DataFrame and SQL Operations
– Spark SQL provides the DataFrame abstraction and can act as a distributed SQL query engine.
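
Of these, window operations are the easiest to illustrate without a cluster. The sketch below is plain Python, not the Spark Streaming API: `window` is a hypothetical helper, lengths are counted in batches rather than seconds, and only full windows are emitted.

```python
def window(batches, window_length, slide_interval):
    """Merge the last `window_length` batches every `slide_interval` batches.

    Both parameters are measured in number of batches (an assumption of
    this sketch). Only complete windows are produced.
    """
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        merged = [x for batch in batches[end - window_length:end] for x in batch]
        windows.append(merged)
    return windows


batches = [[1, 2], [3], [4, 5], [6], [7]]
windows = window(batches, window_length=3, slide_interval=2)
window_sums = [sum(w) for w in windows]
print(windows)
print(window_sums)
```

With a window of 3 batches sliding by 2, the batch `[4, 5]` falls into both windows, which is exactly the overlap a sliding window is meant to produce.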

Page 17: Getting started with Apache Spark

Application Development

• Spark shell (spark-shell)
– Code in Scala with instant execution

Page 18: Getting started with Apache Spark

Application Development

• Self-Contained Applications
– Dependencies / linking libraries

Page 19: Getting started with Apache Spark

Application Development

• Self-Contained Applications
– A simple app

Page 20: Getting started with Apache Spark

Application Development

• Self-Contained Applications
– Packaging
– Don’t forget app dependencies

Page 21: Getting started with Apache Spark

Deployment

• That’s how you deploy
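
Deployment normally goes through the `spark-submit` script. As a hedged sketch, the Python helper below only assembles such a command line; it does not run it. `--class`, `--master`, `--deploy-mode` and `--conf` are spark-submit options, while the helper name, jar path, class name and master host are hypothetical placeholders.

```python
import shlex


def spark_submit_command(app_jar, main_class, master,
                         deploy_mode="client", conf=None, app_args=()):
    """Build the argument list for a spark-submit invocation."""
    cmd = ["spark-submit",
           "--class", main_class,
           "--master", master,
           "--deploy-mode", deploy_mode]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(app_jar)    # the application jar comes after all options
    cmd.extend(app_args)   # anything after the jar is passed to the app
    return cmd


cmd = spark_submit_command(
    app_jar="target/my-app-assembly.jar",  # hypothetical jar path
    main_class="com.example.MyApp",        # hypothetical main class
    master="spark://master-host:7077",     # hypothetical standalone master
    deploy_mode="cluster",
    conf={"spark.executor.memory": "2g"},
)
command_line = " ".join(shlex.quote(part) for part in cmd)
print(command_line)
```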

Page 22: Getting started with Apache Spark

Application Monitoring

• Monitor your app
– http://<driver-node>:4040

Page 23: Getting started with Apache Spark

Application Monitoring

• History Server
– Enable and start the History Server
– http://<server-url>:18080
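
Enabling the History Server mostly means turning on event logging and pointing the server at the same directory. The sketch below renders those settings in the `key value` layout used by `conf/spark-defaults.conf`. The three property names are Spark configuration properties; the helper name and the log directory are assumptions for illustration.

```python
def render_spark_defaults(settings):
    """Render settings as 'key value' lines for conf/spark-defaults.conf."""
    return "\n".join("{} {}".format(key, value) for key, value in settings.items())


event_log_dir = "file:///tmp/spark-events"  # assumed location; must exist

history_settings = {
    "spark.eventLog.enabled": "true",                # applications write event logs
    "spark.eventLog.dir": event_log_dir,             # where applications write them
    "spark.history.fs.logDirectory": event_log_dir,  # where the History Server reads them
}

conf_text = render_spark_defaults(history_settings)
print(conf_text)
```

With these lines in place, the server is started with `./sbin/start-history-server.sh` from the Spark installation directory.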

Page 24: Getting started with Apache Spark

Application Monitoring

• History Server
– Enable and start the History Server
– http://<server-url>:18080

Page 25: Getting started with Apache Spark

Debugging

• Remote debugging
– Enable remote debugging
– The application must be running with master local[*]
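
One common way to enable remote debugging is to start the driver JVM with the standard JDWP agent and attach an IDE to it. The sketch below only builds the option string and the matching command line. `--driver-java-options` is a spark-submit option and `-agentlib:jdwp` is a standard JVM flag; the helper name, port 5005, the jar path and the class name are assumptions.

```python
def jdwp_option(port=5005, suspend=True):
    """JVM option that makes the process listen for a remote debugger.

    With suspend=True the JVM waits for the debugger to attach before
    running any application code.
    """
    return "-agentlib:jdwp=transport=dt_socket,server=y,suspend={},address={}".format(
        "y" if suspend else "n", port)


debug_option = jdwp_option(port=5005, suspend=True)

debug_cmd = [
    "spark-submit",
    "--master", "local[*]",            # driver and tasks share one local JVM
    "--driver-java-options", debug_option,
    "--class", "com.example.MyApp",    # hypothetical main class
    "target/my-app-assembly.jar",      # hypothetical jar path
]
print(debug_option)
```

Running with `local[*]` keeps all the work in the JVM the debugger is attached to, so breakpoints inside transformations are actually hit.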

Page 26: Getting started with Apache Spark

Running on YARN

• Why run on YARN?
– Cluster resources
– Schedulers
– Security

Page 27: Getting started with Apache Spark

Running on YARN

• Standalone

Page 28: Getting started with Apache Spark

Running on YARN

• YARN Architecture
– Resource Manager
– Node Manager
– Application Master
– Container

Page 29: Getting started with Apache Spark

Running on YARN

• YARN Client Mode

Page 30: Getting started with Apache Spark

Running on YARN

• YARN Cluster Mode

Page 31: Getting started with Apache Spark

Running on YARN

• Standalone vs Spark on YARN


Page 33: Getting started with Apache Spark

A Big Thank You

Spark it up

You got questions?