Page 1: Data processing in Apache Spark

Data processing in Apache Spark

Pelle Jakovits

5 October, 2015, Tartu

Page 2: Outline

Outline

• Introduction to Spark

• Resilient Distributed Datasets (RDD)

– Data operations

– RDD transformations

– Examples

• Fault tolerance

• Frameworks powered by Spark

Page 3: Spark

Spark

• Directed acyclic graph (DAG) task execution engine

• Supports cyclic data flow and in-memory computing

• Spark works with Scala, Java, Python and R

• Integrated with Hadoop YARN and HDFS

• Extended with tools for SQL-like queries, stream processing and graph processing.

• Uses Resilient Distributed Datasets to abstract data that is to be processed

Page 4: Hadoop YARN

Hadoop YARN

Page 5: Performance vs Hadoop

Performance vs Hadoop

[Bar chart] Time per Iteration (s):

Logistic Regression – Hadoop: 110 s, Spark: 0.96 s

K-Means Clustering – Hadoop: 155 s, Spark: 4.1 s

Source: Introduction to Spark – Patrick Wendell, Databricks

Page 6: Resilient Distributed Datasets

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure
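These three properties can be illustrated with a minimal Java sketch (assuming an existing JavaSparkContext sc; the path is only a placeholder):

JavaRDD<String> lines = sc.textFile("hdfs://...");                 // partitions of the file spread across the cluster
JavaRDD<String> errors = lines.filter(s -> s.contains("ERROR"));   // built through a parallel transformation
errors.persist(StorageLevel.MEMORY_ONLY());                        // kept in RAM; lost partitions are rebuilt automatically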

Page 7: Working in Java

Working in Java

• Tuples

• Functions

– In Java 8 you can use lambda functions:

• JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

– But in older Java you have to use the predefined function interfaces:

• Function

• Function2, Function3

• FlatMapFunction

• PairFunction


Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b

Page 8: Java Spark Function types

Java Spark Function types

class GetLength implements Function<String, Integer> {

public Integer call(String s) {

return s.length();

}

}

class Sum implements Function2<Integer, Integer, Integer> {

public Integer call(Integer a, Integer b) {

return a + b;

}

}
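A minimal usage sketch (assuming a JavaRDD<String> named lines, e.g. loaded earlier with sc.textFile()):

JavaRDD<Integer> lineLengths = lines.map(new GetLength());  // apply GetLength to every element
int totalLength = lineLengths.reduce(new Sum());            // combine all lengths pairwise with Sum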

Page 9: Java Example - MapReduce

Java Example - MapReduce

JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

int count = dataSet.map(new Function<Integer, Integer>() {
    public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
    }
}).reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
    }
});
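Here count is the number of random points that fell inside the unit circle. Assuming n is the total number of samples (the size of the list l above), π can then be estimated as:

double pi = 4.0 * count / n;   // ratio of circle area to square area is π/4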

Page 10: Python example

Python example

• Word count in Spark's Python API

file = spark.textFile("hdfs://...")

file.flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

Page 11: Persisting data

Persisting data

• Spark is lazy – transformations are not executed until an action needs their result

• To force Spark to keep intermediate data in memory, we can use:

– lineLengths.persist(StorageLevel.MEMORY_ONLY());

– which causes the lineLengths RDD to be kept in memory after the first time it is computed.

• Should be used when we want to process the same RDD multiple times, as in the sketch below.
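A minimal sketch of the pattern (assuming an existing JavaSparkContext sc; the file name is only illustrative):

JavaRDD<String> lines = sc.textFile("file.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
lineLengths.persist(StorageLevel.MEMORY_ONLY());           // keep the computed RDD in memory
long longLines = lineLengths.filter(n -> n > 80).count();  // first action computes and caches lineLengths
int totalLength = lineLengths.reduce((a, b) -> a + b);     // second action reuses the cached data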

Page 12: Persistence levels

Persistence levels

• DISK_ONLY

• MEMORY_ONLY

• MEMORY_AND_DISK

• MEMORY_ONLY_SER

– More space-efficient (objects are stored in serialized form)

– Uses more CPU to serialize and deserialize

• MEMORY_ONLY_2

– Replicates the data on two executors

Page 13: RDD operations

RDD operations

• Actions

– Creating RDDs

– Storing RDDs

– Extracting data from an RDD on the fly

• Transformations

– Restructure or transform RDD data
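The difference in one short sketch (logData being any JavaRDD<String>, as on the Filter slide later):

JavaRDD<String> errors = logData.filter(s -> s.contains("ERROR"));  // transformation: lazy, returns a new RDD
long numErrors = errors.count();                                     // action: triggers the job, returns a value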

Page 14: Spark Actions

Spark Actions

Page 15: Loading Data

Loading Data

Local data:

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data, slices);

External data:

JavaRDD<String> input = sc.textFile("file.txt");
sc.textFile("directory/*.txt");
sc.textFile("hdfs://xxx:9000/path/file");

Page 16: Broadcast

Broadcast

• Broadcast a copy of the data to every node in the Spark cluster:

Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

int[] values = broadcastVar.value();
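A hypothetical use of the broadcast variable inside a transformation (numbers is assumed to be a JavaRDD<Integer>); each worker reads its local broadcast copy instead of the value being shipped with every task:

JavaRDD<Integer> scaled = numbers.map(x -> x * broadcastVar.value()[0]);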

Page 17: Storing data

Storing data

• counts.saveAsTextFile("hdfs://...");

• counts.saveAsObjectFile("hdfs://...");

• DataCube.saveAsHadoopFile("testfile.seq", LongWritable.class, LongWritable.class, SequenceFileOutputFormat.class);

Page 18: Other actions

Other actions

• reduce() – we already saw it in the example

• collect() – retrieve the whole RDD content

• count() – count the number of elements in the RDD

• first() – take the first element of the RDD

• take(n) – take the first n elements of the RDD

• countByKey() – count the values for each unique key
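A few of these actions in a short sketch (lines is assumed to be a JavaRDD<String>):

long total = lines.count();               // number of elements in the RDD
String head = lines.first();              // first element
List<String> firstTen = lines.take(10);   // first 10 elements as a local list
List<String> all = lines.collect();       // entire RDD as a local list – only safe for small results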

Page 19: Spark RDD Transformations

Spark RDD Transformations

Page 20: Map

Map

JavaPairRDD<String, Integer> ones = words.map(

new PairFunction<String, String, Integer>() {

public Tuple2<String, Integer> call(String s) {

return new Tuple2(s, 1);

}

}

);
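Note that newer Spark releases name this operation mapToPair in the Java API when it produces a JavaPairRDD; with a Java 8 lambda the whole slide condenses to a one-liner (a sketch, not from the original slides):

JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));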

Page 21: groupBy

groupBy

JavaPairRDD<Integer, List<Tuple2<Integer, Float>>> grouped = values.groupBy(new Partitioner(splits), splits);

public class Partitioner implements Function<Tuple2<Integer, Float>, Integer> {

    private final Random r = new Random();
    private final int partitions;

    public Partitioner(int partitions) { this.partitions = partitions; }

    public Integer call(Tuple2<Integer, Float> t) {
        return r.nextInt(partitions);   // assign the tuple to a random group
    }
}

Page 22: reduceByKey

reduceByKey

JavaPairRDD<String, Integer> counts = ones.reduceByKey(

new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) {

return i1 + i2;

}

}

);

Page 23: Filter

Filter

JavaRDD<String> logData = sc.textFile(logFile);

long ERRORS = logData.filter(new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("ERROR");
    }
}).count();

long INFOS = logData.filter(new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("INFO");
    }
}).count();

Page 24: Other transformations

Other transformations

• sample(withReplacement, fraction, seed)

• distinct([numTasks])

• union(otherDataset)

• flatMap(func)

• groupByKey()

• join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

• cogroup(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
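A small hypothetical join() sketch (assuming an existing JavaSparkContext sc; the data is made up only to show the (K, (V, W)) result shape):

JavaPairRDD<String, Integer> ages = sc.parallelizePairs(
    Arrays.asList(new Tuple2<>("alice", 30), new Tuple2<>("bob", 25)));
JavaPairRDD<String, String> depts = sc.parallelizePairs(
    Arrays.asList(new Tuple2<>("alice", "IT"), new Tuple2<>("bob", "HR")));
JavaPairRDD<String, Tuple2<Integer, String>> joined = ages.join(depts);
// joined contains ("alice", (30, "IT")) and ("bob", (25, "HR"))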

Page 25: Fault Tolerance

Fault Tolerance

Page 26: Lineage

Lineage

• Lineage is the history of an RDD

• RDDs keep track of all their partitions:

– What functions were applied to produce each partition

– Which input data partitions were involved

• Rebuild lost RDD partitions according to lineage, using the latest still available partitions.

• No performance cost if nothing fails (as opposed to checkpointing)
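For example, if one partition of the errors RDD (lines filtered for "ERROR") is lost, Spark re-reads only the corresponding input split and re-applies the filter to rebuild just that partition; partitions that are still available are not recomputed.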

Page 27: Frameworks powered by Spark

Frameworks powered by Spark

Page 28: Frameworks powered by Spark

Frameworks powered by Spark

• Spark SQL – seamlessly mix SQL queries with Spark programs.

– Similar to Pig and Hive

• MLlib - machine learning library

• GraphX - Spark's API for graphs and graph-parallel computation.

Page 29: Advantages of Spark

Advantages of Spark

• Much faster than solutions built on top of Hadoop when the data can fit into memory

– Except Impala maybe

• Hard to keep track of how (well) the data is distributed

• More flexible fault tolerance

• Spark has a lot of extensions and is constantly updated

Page 30: Disadvantages of Spark

Disadvantages of Spark

• What if data does not fit into the memory?

• Saving as text files can be very slow

• Java Spark is not as convenient to use as Pig for prototyping, but you can:

– Use Python Spark instead

– Use Spark Dataframes

– Use Spark SQL

Page 31: Conclusion

Conclusion

• RDDs offer a simple and efficient programming model for a broad range of applications

• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage

• Provides definite speedup when data fits into the collective memory

Page 32: That's All

That's All

• This week's practice session

– Processing data with Spark

• Next week's lecture is about higher-level Spark

– Scripting and Prototyping in Spark

• Spark SQL

• DataFrames

– Spark Streaming
