COSC 6339
Big Data Analytics
Introduction to Spark
Edgar Gabriel
Spring 2017
What is SPARK?
• In-Memory Cluster Computing for Big Data Applications
• Fixes the weaknesses of MapReduce
– Iterative applications
– Streaming data processing
– Keeping data in memory across different functions (see the sketch below)
• Spark works across many environments
– Standalone
– Hadoop
– Mesos
• Spark supports accessing data from diverse sources (HDFS, HBase, Cassandra, …)
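A minimal sketch of the in-memory idea, written for the Scala shell (sc is the pre-created SparkContext; the HDFS path is a hypothetical placeholder):

val lines = sc.textFile("hdfs:///user/gabriel/data.txt")  // hypothetical input file
val errors = lines.filter(line => line.contains("ERROR"))
errors.cache()                  // ask Spark to keep this RDD in memory
val n1 = errors.count()         // first action: reads from HDFS, then caches
val n2 = errors.filter(line => line.contains("disk")).count()  // reuses the cached data

In MapReduce, each of the two counts would be a separate job re-reading the input from disk; here the second pass works on the in-memory copy.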
What is SPARK? (II)
• Three modes of execution (invocation examples below)
– Spark shell
– Spark scripts
– Spark code
• API defined for multiple languages
– Scala
– Python
– Java
– R
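The three modes roughly correspond to the following invocations (a sketch; spark-shell, pyspark and spark-submit are the standard launcher commands, the file and class names are hypothetical):

gabriel@whale:> spark-shell                        # interactive shell (Scala)
gabriel@whale:> pyspark                            # interactive shell (Python)
gabriel@whale:> spark-submit myscript.py           # run a script
gabriel@whale:> spark-submit --class MyApp my.jar  # run compiled Spark code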
A couple of words on Scala
• Object-oriented language: everything is an object and
every operation is a method call.
• Scala is also a functional language
– Functions are first class values
– Can be passed as arguments to functions
– Functions should be free of side effects
– Can be defined inside other functions (see the sketch below)
• Scala runs on the JVM
– Java and Scala classes can be freely mixed
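A small Scala sketch of these points (hypothetical names; can be typed directly into the Scala REPL):

// a function is a first-class value and can be passed as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
val inc = (n: Int) => n + 1        // function literal bound to a value
applyTwice(inc, 3)                 // evaluates to 5

// a function defined inside another function
def sumOfSquares(a: Int, b: Int): Int = {
  def square(n: Int): Int = n * n  // visible only inside sumOfSquares
  square(a) + square(b)
}
sumOfSquares(2, 3)                 // evaluates to 13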
Spark Essentials
• A Spark program has to create a SparkContext object,
which tells Spark how to access a cluster
• In the Scala or Python shell this is done automatically: the context is accessible through the sc variable
• Standalone programs must use a constructor to instantiate a new SparkContext (see the sketch after the shell session below)
gabriel@whale:> pyspark
…
Using Python version 2.7.6 (default, Nov 21 2013 15:55:38)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x2609ed0>
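Outside the shell, the program builds the context itself. A minimal Scala sketch (application name and master value are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyFirstSparkApp")   // placeholder application name
  .setMaster("yarn")               // resource manager, see the master parameter below
val sc = new SparkContext(conf)    // entry point for all RDD operations
// ... RDD operations using sc ...
sc.stop()                          // release the cluster resources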
Spark Essentials (II)
• The master parameter for a SparkContext determines
which resources to use, e.g.
whale> pyspark --master yarn
Slide based on a talk from http://cdn.liber118.com/workshop/itas_workshop.pdf
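For orientation, some common values of the master parameter (a non-exhaustive sketch; host names and ports are placeholders):

local              – run Spark locally with a single worker thread
local[4]           – run locally with 4 worker threads
yarn               – connect to a YARN cluster, as in the example above
spark://host:7077  – connect to a standalone Spark cluster manager
mesos://host:5050  – connect to a Mesos cluster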