ACADGILD presents Spark
Presented by: Sandy
Jan 22, 2018
© copyright ACADGILD
Introduction to ACADGILD
• You can also click on this link to view the video:
https://www.youtube.com/watch?v=7nipSdxv2Uo
Webinar on Spark 2
Introduction of Mentor
• The Mentor for this Webinar is Mr. Sandy and below are his qualifications:
• 15 years of experience in IT focusing on Big Data, Data Science and IoT solutions and implementations.
• Expert in the Apache Spark ecosystem, including Spark 1.6, Scala, Spark SQL, Spark Streaming, MLlib, SparkR and GraphX.
• Extensive experience in Hadoop framework solutions, including YARN/Mesos, HDFS, MapReduce, Pig Latin, Hive, HBase/MongoDB/Cassandra, Mahout, Flume, ZooKeeper, Oozie and Sqoop.
• Knowledge of Machine Learning for both Supervised and Unsupervised Learning Algorithms.
Agenda
Sl No. Agenda Title
1 What is Big data?
2 MapReduce Limitations
3 Introduction to Spark
4 Spark in Hadoop Ecosystem
5 Why In-memory Processing?
6 In-memory Caching
7 Resilient Distributed Dataset
8 Creating RDDs
9 Spark Unified Platform
10 Popular Use Cases
11 Apache Spark Case Studies
12 Get Your Feet Wet with Spark APIs
MapReduce Limitations
• MapReduce is based on disk-based computing.
• It is more suitable for single pass computations.
• It is not at all suitable for iterative computations.
• Disk intensive.
Programming Model limitations:
• Developing efficient MapReduce applications requires advanced programming skills and deep understanding of the system architecture.
• Every problem has to be broken down into Map and Reduce phases.
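To make the Map/Reduce breakdown concrete, here is a minimal sketch of a word count split into explicit phases. It uses plain Scala collections only and does not call Hadoop; the object and function names (MapReducePhases, mapPhase, shuffle, reducePhase) are hypothetical and chosen for illustration.

```scala
// Illustration only: plain Scala collections, no Hadoop or Spark dependency.
// All names here are hypothetical.
object MapReducePhases {
  // Map phase: emit a (word, 1) pair for every word in one input line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle: group the emitted pairs by key, as the framework does between phases.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: sum the counts collected for a single word.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // The whole job: map every line, shuffle, then reduce every key.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (w, cs) => reducePhase(w, cs) }
}
```

Even this small problem has to be expressed as a map function plus a reduce function. In real MapReduce the intermediate pairs are also written to disk between the two phases, which is where the disk-intensive behaviour comes from.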
Introduction to Spark
• Apache Spark is a fast and general-purpose cluster computing system.
• Spark is a framework for scheduling, monitoring and distributing applications.
• Spark is a general unified engine that can replace many specialized systems such as Mahout, Tez, GraphLab, Storm, etc.
Spark in Hadoop Ecosystem
[Diagram: Spark in the Hadoop ecosystem. Labels shown:]
• Libraries: SQL, GraphX, MLlib, Streaming
• APIs
• Resource Managers
• Data sources: RDBMS, Databases, File systems, Streaming sources
• Distributions: CDH, HDP, MapR, DSE
Why In-memory Processing?
• Drastic change in hardware
Earlier | Now
RAM was very costly. Comparatively, disk was cheap, so disk was the primary source of data. | The cost of RAM has dropped sharply while its performance has increased, so RAM is the primary source of data and disk is the fallback.
The network was costly, so data locality was important. | The network is faster.
Single-core machines were dominant. | Multi-core machines are commonplace.
Resilient Distributed Dataset
• A resilient distributed dataset (RDD) represents an immutable collection of objects, partitioned across a set of machines, that can be rebuilt if a partition is lost.
• It’s a distributed memory abstraction.
Features:
• Cache an RDD in memory across machines.
• Reuse it in multiple MapReduce-like parallel operations.
• Fault tolerant through lineage.
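The three features above can be sketched in a few lines. This is a minimal example, assuming Spark is on the classpath and a local master is acceptable; the object name RddCachingSketch, the function squareStats and the app name are hypothetical. It caches one RDD, reuses it in two actions, and prints the lineage Spark would use to rebuild a lost partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names throughout; a sketch, not a tuned application.
object RddCachingSketch {
  // Returns (sum of squares, number of even squares) for 1..n.
  def squareStats(sc: SparkContext, n: Int): (Long, Long) = {
    // cache() only marks the RDD; partitions are stored in memory when first computed.
    val squares = sc.parallelize(1 to n).map(i => i.toLong * i).cache()

    // First action: computes the partitions and keeps them in memory.
    val total = squares.reduce(_ + _)
    // Second action: reuses the cached partitions instead of recomputing the map.
    val evens = squares.filter(_ % 2 == 0).count()

    // Lineage: the chain of transformations used to rebuild a lost partition.
    println(squares.toDebugString)

    squares.unpersist()
    (total, evens)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all available cores.
    val conf = new SparkConf().setAppName("rdd-caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(squareStats(sc, 1000))
    finally sc.stop()
  }
}
```

Without cache(), the second action would recompute the map step from the source collection. That recomputation is what makes iterative jobs expensive when every pass starts from disk.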
Creating RDDs
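• Turn a collection into an RDD (sc is the SparkContext, predefined in spark-shell).
val nums = sc.parallelize(Array(1, 2, 3))
• Load a text file from the local FS, HDFS, or S3.
val a = sc.textFile("file.txt")                        // a single file
val b = sc.textFile("directory/*.txt")                 // every file matching the glob
val c = sc.textFile("hdfs://namenode:9000/path/file")  // a file on HDFS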
Spark Unified Platform
Spark Core Engine, with the following libraries built on it:
• Spark SQL (DataFrame)
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• SparkR (R on Spark)
Popular Use Cases
• Business Intelligence: 68%
• Data Warehousing: 52%
• Recommendation: 44%
• Log Processing: 40%
• User-Facing Services: 36%
• Fraud Detection/Security: 29%
Apache Spark Case Studies
Credit Card Fraud Detection
Network Security
Genomic Sequencing
Real-Time Ad Processing
Get Your Feet Wet with Spark APIs
Quick tour of the Scala, Python and Java APIs
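As a first taste of the Scala API, here is a minimal word-count sketch. It assumes Spark is on the classpath and runs with a local master; the object name WordCountSketch, the function countWords and the sample input are hypothetical. The RDD calls used (parallelize, flatMap, filter, map, reduceByKey, collect) are part of the standard RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names; a sketch for experimenting, not a production job.
object WordCountSketch {
  // Counts how often each lower-cased word occurs in the given lines.
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.toLowerCase.split("\\s+")) // split each line into words
      .filter(_.nonEmpty)                   // drop empty tokens
      .map(word => (word, 1))               // pair every word with a count of 1
      .reduceByKey(_ + _)                   // add up the counts per word
      .collect()                            // bring the result back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all available cores.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try countWords(sc, Seq("spark is fast", "Spark is general")).foreach(println)
    finally sc.stop()
  }
}
```

In spark-shell the SparkContext already exists as sc, so the chain inside countWords can be pasted in directly. Replacing sc.parallelize(lines) with sc.textFile(...) would read the lines from a file instead.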
Get in Touch with Us
Contact Info:
o Website: http://www.acadgild.com
o LinkedIn: https://www.linkedin.com/company/acadgild
o Facebook: https://www.facebook.com/acadgild
o Support: [email protected]