
Apache Spark Fundamentals


Introduction

• Spark is an open-source cluster-computing framework. Originally developed at UC Berkeley in 2009, it was later donated to the Apache Software Foundation.

• Spark provides

• Unified computing engine (Spark Core)

• Set of APIs for data analysis, usable with Scala, Java, Python, and R: Spark SQL (structured data), MLlib (machine learning), GraphX (graph analytics), Spark Streaming (streaming analytics). Spark is written in Scala.

• Spark runs on the Java Virtual Machine (JVM)



• Spark does not come with a storage system (unlike Hadoop) but can run on the Hadoop Distributed File System (HDFS) as well as on other systems (e.g., Hive or relational DBMSs).

• Spark’s features:

• Fault tolerance.

• In-memory caching, which enables efficient execution of multi-round algorithms, with a substantial performance improvement w.r.t. Hadoop.

• Spark can run:

• On a single machine, in the so-called local mode. This is what we do in Homeworks 1 and 2.

• On a cluster managed by a cluster manager such as Spark's Standalone manager, YARN, or Mesos. For Homework 3 we will use the YARN cluster manager on CloudVeneto.
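As an illustration only (not part of the original slides; the application name is a placeholder), here is a minimal PySpark sketch of how the execution mode is typically selected:

    from pyspark import SparkConf, SparkContext

    # Local mode: driver and executors run as threads on one machine,
    # with one worker thread per available core ("local[*]").
    conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
    sc = SparkContext(conf=conf)

On a cluster, the master is usually not hard-coded but supplied at submission time (e.g., spark-submit --master yarn my_app.py).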


Observations

• A single machine does not have enough power and resources to process big data efficiently.

• A cluster is a group of machines whose power and resources are pooled to provide one powerful execution platform.

• Spark can be viewed as a tool for managing and coordinating the execution of (big data) jobs on a cluster.

Without careful management and coordination, the potential added power coming from the combination of individual machines is likely to be wasted.


Spark Application

• The driver process (master process in MapReduce terminology) runs the main() function and sits on a node in the cluster. It is the heart of the application and is responsible for:

• maintaining information about the application;

• responding to a user’s program or input;

• analyzing, distributing, and scheduling work across the executors.

• The driver process is represented by an object called SparkContext, which can be regarded as a channel to access all Spark functionalities.

Obs.: starting from Spark 2.0.0, a SparkSession object has been introduced, which encapsulates a SparkContext object and provides a wider spectrum of functionalities.
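As a minimal sketch (not from the slides; the application name is a placeholder), a SparkSession is typically created as follows, and the SparkContext it encapsulates remains accessible:

    from pyspark.sql import SparkSession

    # Entry point since Spark 2.0.0: build (or reuse) a SparkSession.
    spark = SparkSession.builder.appName("MyApp").getOrCreate()

    # The encapsulated SparkContext is still available for RDD work.
    sc = spark.sparkContext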



• The executor processes (worker processes in MapReduce terminology) are responsible for actually executing the work that the driver assigns them. Each executor is responsible for:

• executing code assigned to it by the driver;

• reporting the state of its computation back to the driver.

• The cluster manager controls physical machines and allocates resources to applications.

• The driver and executors are simply processes, which can run on the same machine or on different machines. In local mode, both run (as threads) on one machine instead of a cluster.

• While executors, for the most part, run Scala code, the driver can be driven from different languages through Spark's APIs.


Resilient Distributed Dataset (RDD)

• Fundamental abstraction in Spark. An RDD is a collection of elements of the same type, partitioned and (possibly) distributed across several machines.

• An RDD provides an interface based on coarse-grained transformations.

• RDDs ensure fault-tolerance.
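For instance (a hypothetical sketch, assuming a SparkContext sc as created above), an RDD can be obtained by distributing a local collection:

    # Distribute a local collection; Spark splits it into partitions
    # that may reside on different machines.
    nums = sc.parallelize(range(1000))
    print(nums.getNumPartitions())  # how many chunks the RDD is split into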


• A key ingredient behind the efficiency of Spark is data partitioning: each RDD is broken into chunks called partitions, which are distributed among the available machines (one or more per machine).

• A program can specify the number of partitions for each RDD (if not, Spark will choose one) and decide whether to use the default HashPartitioner, based on objects' hash codes, or a custom partitioner.

• A typical number of partitions is 2x/3x the number of cores, which helps balance the work.

• Partitioning enables:

• Data reuse. In iterative (e.g., multi-round) applications, data are kept in the executors' main memories as much as possible, with the intent to avoid expensive accesses to secondary storage.

• Parallelism. Some data transformations are applied independently in each partition, thus exploiting the parallelism offered by the underlying platform without incurring a slowdown due to data movement.
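A small sketch of these options (hypothetical data, assuming a SparkContext sc as above):

    # Request 4 partitions explicitly when creating the RDD.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)], 4)

    # Repartition a pair RDD with the default hash-based scheme...
    by_hash = pairs.partitionBy(8)

    # ...or with a custom partitioning function applied to the keys.
    by_custom = pairs.partitionBy(8, lambda key: ord(key[0]) % 8)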



• RDDs are immutable (i.e., read-only) and can be created either from data in stable storage (e.g., HDFS) or from other RDDs, through transformations.

• RDDs need not be materialized at all times. Each RDD maintains enough information about the sequence of transformations that generated it (its lineage) to enable the recomputation of its partitions from data in stable storage. In other words, an RDD can always be reconstructed after a failure (unless the failure affects the stable storage).

• Programmers can control the actual saving of RDDs in memory through the methods persist or cache.
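A brief sketch of lineage and caching (the HDFS path is a placeholder):

    lines = sc.textFile("hdfs:///data/docs.txt")  # RDD from stable storage
    lengths = lines.map(len)                      # new RDD via a transformation
    lengths.cache()                               # keep it in executors' memory

    # If a partition of lengths is lost, Spark recomputes it from its
    # lineage (textFile -> map), re-reading only the affected input blocks.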


Operations on RDDs

The following types of operations can be performed on an RDD A:

• Transformations. A transformation generates a new RDD B starting from the data in A. We distinguish between:

• Narrow transformations. Each partition of A contributes to at most one partition of B. Hence, no shuffling of data across machines is needed (⇒ maximum parallelism). E.g., the map method, which transforms each element of A.

• Wide transformations. Each partition of A may contribute to many partitions of B. Hence, shuffling of data across machines may be required. E.g., the groupByKey method, in case A consists of key-value pairs, which, for every key k occurring in A, groups all elements of A with key k into a key-value pair (k, Lk) of B, where Lk is the set of values associated with k in A.

• Actions. An action launches a computation on the data in A which returns a value to the application (e.g., the count method, which returns the number of elements in the RDD). It is at this point that the RDD A is actually materialized (lazy evaluation).
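The distinction in a small sketch (hypothetical data, sc as above):

    words = sc.parallelize(["a", "bb", "a", "ccc"])
    pairs = words.map(lambda w: (w, 1))  # narrow: each partition processed in place
    groups = pairs.groupByKey()          # wide: data may shuffle across machines
    print(groups.count())                # action: only now is the work executed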


Implementing MapReduce algorithms in Spark

The homeworks will provide you with first-hand experience in implementing MapReduce algorithms using Spark.

We summarize here a few key points to keep in mind about this issue.

• Spark enables the implementation of MapReduce algorithms but offers a much richer set of processing methods.

• Spark (and MapReduce) uses a core idea of functional programming: functions can be arguments to other functions.

• A MapReduce round that transforms a set S of key-value pairs into a new set S′ of key-value pairs can be implemented in Spark as follows:


• Store S as an RDD.

• Map Phase: from the RDD, invoke one of several map methods (narrow transformations) offered by the Spark API. These methods take the map function as an argument.

• Reduce Phase: on the RDD resulting from the Map Phase, first group the key-value pairs by key (e.g., using the groupByKey method, a wide transformation), then on each group, seen as one key-value pair (one key, many values), apply a map method. Some methods (e.g., reduceByKey) allow you to do both steps at once in some cases.

Obs.: Global variables and data structures (e.g., the dataset size N), stored in the driver's memory, can be used by the transformations. These global data can be defined and updated through actions.
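Putting the scheme together, a word-count round sketched in PySpark (the input path is a placeholder, sc as above):

    docs = sc.textFile("hdfs:///data/docs.txt")

    # Map Phase: emit a (word, 1) pair for every word occurrence.
    pairs = docs.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

    # Reduce Phase, step by step: group by key, then map each group.
    counts = pairs.groupByKey().mapValues(sum)

    # Equivalent one-step variant using reduceByKey.
    counts_alt = pairs.reduceByKey(lambda x, y: x + y)

    print(counts.take(5))  # action: triggers the actual computation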


Summary

• Spark features

• Spark Application: driver process, executor processes, cluster manager.

• Resilient Distributed Dataset (RDD)

• Main characteristics
• Partitioning
• Operations: transformations, actions, persistence.

• Implementing MapReduce algorithms in Spark.

