Lightning-fast cluster computing Apache Spark
Apache Spark - Aram Mkrtchyan

Aug 18, 2015

Page 1: Apache Spark - Aram Mkrtchyan

Lightning-fast cluster computing

Apache Spark

Page 2: Apache Spark - Aram Mkrtchyan

What is Apache Spark?

A cluster computing platform designed to be fast and general-purpose.

Fast

Universal

Highly Accessible

NOT A HADOOP REPLACEMENT

Page 3: Apache Spark - Aram Mkrtchyan

Unified stack

Page 4: Apache Spark - Aram Mkrtchyan

Comparison with MR

val textFile = spark.textFile("hdfs://...")

val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

Page 5: Apache Spark - Aram Mkrtchyan

Examples: Word Count

val sc = new SparkContext(...)

val textFile = sc.textFile("hdfs://...")

val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")
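The word count above can be tried without a cluster or HDFS. The following is a minimal sketch, assuming Spark core 1.3 or later is on the classpath; the object name `WordCountSketch`, the `local[*]` master, and the in-memory input are assumptions made for illustration and are not part of the original slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Counts words in the given lines using a local, in-process Spark context.
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    // Assumption: local mode stands in for a real cluster.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)                 // stand-in for sc.textFile("hdfs://...")
        .flatMap(line => line.split(" "))   // one element per word
        .map(word => (word, 1))             // pair each word with a count of 1
        .reduceByKey(_ + _)                 // sum the counts per word
        .collect()                          // bring the results back to the driver
        .toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    wordCounts(Seq("to be or not to be")).foreach(println)
}
```

Collecting to the driver is only reasonable for small results; for large outputs the slide's `saveAsTextFile` is the better fit.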

Page 6: Apache Spark - Aram Mkrtchyan

Examples: Log Mining

val sc = new SparkContext(...)

val inputRDD = sc.textFile("log.txt")

val errorsRDD = inputRDD.filter(line => line.contains("error"))

val warningsRDD = inputRDD.filter(line => line.contains("warning"))

val badLinesRDD = errorsRDD.union(warningsRDD)

badLinesRDD.persist() // keep the result in memory for reuse across actions

badLinesRDD.count() // first action: computes badLinesRDD and caches it

badLinesRDD.collect() // second action: served from the cached data
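The log-mining steps can also be run locally to see two behaviours that the slide leaves implicit: transformations are lazy until an action runs, and `union()` keeps duplicates. This is a sketch under stated assumptions: the object name `LogMiningSketch`, the `local[*]` master, and the in-memory log lines are illustrative and replace the slide's `log.txt`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMiningSketch {
  // Returns (number of bad lines, the bad lines themselves).
  def badLines(lines: Seq[String]): (Long, Seq[String]) = {
    val conf = new SparkConf().setAppName("log-mining-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(lines) // stand-in for sc.textFile("log.txt")

      // Transformations only describe the computation; nothing runs yet.
      val errorsRDD = inputRDD.filter(line => line.contains("error"))
      val warningsRDD = inputRDD.filter(line => line.contains("warning"))

      // union() does not deduplicate: a line containing both words appears twice.
      val badLinesRDD = errorsRDD.union(warningsRDD)
      badLinesRDD.persist() // marks the RDD for caching; still lazy

      val n = badLinesRDD.count()             // first action: computes and caches
      val all = badLinesRDD.collect().toSeq   // second action: can reuse the cache
      (n, all)
    } finally sc.stop()
  }
}
```

Without `persist()`, the second action would recompute both filters from the input.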

Page 7: Apache Spark - Aram Mkrtchyan

How does it work?

RDD

Resilient

Distributed

Dataset

Page 8: Apache Spark - Aram Mkrtchyan

Advanced: RDD as interface

Example: Hadoop RDD

partitions = one per HDFS block

dependencies = none

compute = read corresponding block

preferredLocations = HDFS block locations

partitioner = none
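The five properties listed for the Hadoop RDD can be read as an interface that every RDD implements. The following plain-Scala model is a simplified sketch and not Spark's actual `RDD` class: `SimpleRDD`, `Partition`, `BlockRDD`, and `FilteredRDD` are hypothetical names, and host names stand in for HDFS block locations.

```scala
// Hypothetical, simplified model of the RDD interface (not Spark's real API).
final case class Partition(index: Int)

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                     // how the data is split
  def dependencies: Seq[SimpleRDD[_]]                // parent RDDs, if any
  def compute(p: Partition): Iterator[T]             // produce one partition's data
  def preferredLocations(p: Partition): Seq[String]  // hosts where the data lives
  def partitioner: Option[Any => Int]                // key -> partition index, if known
}

// Stand-in for a Hadoop RDD: each block is (host holding it, lines in it).
final class BlockRDD(blocks: Vector[(String, Seq[String])]) extends SimpleRDD[String] {
  val partitions: Seq[Partition] = blocks.indices.map(Partition(_)) // one per block
  val dependencies: Seq[SimpleRDD[_]] = Nil                         // a source has no parents
  def compute(p: Partition): Iterator[String] = blocks(p.index)._2.iterator
  def preferredLocations(p: Partition): Seq[String] = Seq(blocks(p.index)._1)
  val partitioner: Option[Any => Int] = None
}

// A filter keeps the parent's partitions and locations and filters lazily.
final class FilteredRDD[T](parent: SimpleRDD[T], pred: T => Boolean) extends SimpleRDD[T] {
  def partitions: Seq[Partition] = parent.partitions
  val dependencies: Seq[SimpleRDD[_]] = Seq(parent)
  def compute(p: Partition): Iterator[T] = parent.compute(p).filter(pred)
  def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
  def partitioner: Option[Any => Int] = parent.partitioner
}
```

Because a derived RDD only records its parent and a function, a lost partition can be rebuilt by calling `compute` again, which is what makes the dataset resilient.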

Page 9: Apache Spark - Aram Mkrtchyan

Directed Acyclic Graph (DAG)

hadoopRDD --filter--> errorsRDD

hadoopRDD --filter--> warningsRDD

errorsRDD + warningsRDD --union--> badLinesRDD
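Spark can print the graph it has recorded for an RDD through `toDebugString`. The sketch below rebuilds the log-mining lineage on in-memory data; the object name `LineageSketch` and the `local[*]` master are assumptions, and because `parallelize` replaces `textFile`, the root of the printed graph is a parallel collection rather than a Hadoop RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Builds the filter/filter/union graph and returns Spark's textual lineage.
  def lineage(): String = {
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(Seq("error a", "warning b", "info c"))
      val errorsRDD = inputRDD.filter(line => line.contains("error"))
      val warningsRDD = inputRDD.filter(line => line.contains("warning"))
      val badLinesRDD = errorsRDD.union(warningsRDD)
      // No action has run; the lineage is known from the transformations alone.
      badLinesRDD.toDebugString
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(lineage())
}
```

The exact text of the output varies between Spark versions, so it is printed here rather than quoted.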

Page 10: Apache Spark - Aram Mkrtchyan

RDD Transformations

Function Name | Purpose | Example
map() | Apply a function to each element in the RDD and return an RDD of the result. | rdd.map(x => x + 1)
flatMap() | Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned. Often used to extract words. | rdd.flatMap(x => x.to(3))
filter() | Return an RDD consisting of only the elements that pass the condition passed to filter(). | rdd.filter(x => x != 1)
distinct() | Remove duplicates. | rdd.distinct()
union() | Produce an RDD containing elements from both RDDs. | rdd.union(other)
intersection() | Produce an RDD containing only elements found in both RDDs. | rdd.intersection(other)
join() | Perform an inner join between two pair RDDs on their keys. | rdd.join(other)
groupByKey() | Group values with the same key. | rdd.groupByKey()
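The transformations in the table can be compared side by side on small inputs. This is a minimal sketch assuming Spark core 1.3 or later in local mode; `TransformationsSketch`, `Results`, and the sample data are illustrative names and values that do not appear in the slides. Results that pass through a shuffle are sorted before comparison because their order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch {
  final case class Results(
    mapped: List[Int], flat: List[Int], filtered: List[Int], distinct: List[Int],
    union: List[Int], intersection: List[Int],
    joined: List[(String, (Int, String))], grouped: Map[String, List[Int]])

  def run(): Results = {
    val conf = new SparkConf().setAppName("transformations-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(1, 2, 3, 3))
      val other = sc.parallelize(Seq(3, 4, 5))
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val names = sc.parallelize(Seq(("a", "x"), ("c", "y")))
      Results(
        mapped = rdd.map(x => x + 1).collect().toList,                   // each element plus one
        flat = rdd.flatMap(x => x.to(3)).collect().toList,               // ranges x..3, flattened
        filtered = rdd.filter(x => x != 1).collect().toList,             // drops the 1
        distinct = rdd.distinct().collect().toList.sorted,               // duplicates removed
        union = rdd.union(other).collect().toList,                       // duplicates kept
        intersection = rdd.intersection(other).collect().toList.sorted,  // common elements only
        joined = pairs.join(names).collect().toList.sorted,              // inner join on key
        grouped = pairs.groupByKey().mapValues(_.toList.sorted).collect().toMap
      )
    } finally sc.stop()
  }
}
```

Note that `join()` and `groupByKey()` are only available on RDDs of key/value pairs.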

Page 11: Apache Spark - Aram Mkrtchyan

RDD Actions

Function Name | Purpose | Example
count() | Return the number of elements in the RDD. | rdd.count()
collect() | Return all elements from the RDD to the driver. | rdd.collect()
saveAsTextFile() | Save the RDD elements to an external storage system. | rdd.saveAsTextFile("hdfs://...")
take(num) | Return num elements from the RDD. | rdd.take(10)
reduce(func) | Combine the elements of the RDD together in parallel (e.g., sum). | rdd.reduce((x, y) => x + y)
takeOrdered(num)(ordering) | Return the first num elements according to the provided ordering. | rdd.takeOrdered(2)(myOrdering)
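Unlike transformations, each action triggers a job and returns a plain value to the driver. The sketch below exercises the actions from the table on a five-element RDD, assuming Spark core in local mode; `ActionsSketch`, `Results`, and the concrete `myOrdering` (descending integers) are assumptions chosen for illustration. `saveAsTextFile()` is left out because it needs a writable output path.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  final case class Results(
    count: Long, all: List[Int], firstTwo: List[Int], sum: Int, topTwo: List[Int])

  def run(): Results = {
    val conf = new SparkConf().setAppName("actions-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(5, 3, 1, 4, 2))
      // Hypothetical ordering: largest values first.
      val myOrdering: Ordering[Int] = Ordering[Int].reverse
      Results(
        count = rdd.count(),                              // number of elements
        all = rdd.collect().toList,                       // everything, in partition order
        firstTwo = rdd.take(2).toList,                    // first two elements
        sum = rdd.reduce((x, y) => x + y),                // parallel sum
        topTwo = rdd.takeOrdered(2)(myOrdering).toList    // two largest under myOrdering
      )
    } finally sc.stop()
  }
}
```

`collect()` copies the whole RDD into driver memory, so `take()` or `takeOrdered()` is usually the safer way to inspect a large dataset.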

Page 12: Apache Spark - Aram Mkrtchyan

RDD Caching

Level | Space Used | CPU Time | In Memory | On Disk | Comments
MEMORY_ONLY | High | Low | Y | N |
MEMORY_ONLY_SER | Low | High | Y | N |
MEMORY_AND_DISK | High | Medium | Some | Some | Spills to disk if there is too much data to fit in memory.
MEMORY_AND_DISK_SER | Low | High | Some | Some | Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory.
DISK_ONLY | Low | High | N | Y |
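A storage level from the table is selected by passing it to `persist()`; calling `persist()` with no argument uses `MEMORY_ONLY`. The following sketch assumes Spark core in local mode, and the object name `CachingSketch` and the sample data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Returns the storage level in effect and the results of two actions.
  def run(): (StorageLevel, Long, Long) = {
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val doubled = sc.parallelize(1 to 1000).map(_ * 2)
      // Serialized in memory, spilling to disk when memory runs out.
      doubled.persist(StorageLevel.MEMORY_AND_DISK_SER)
      val first = doubled.count()   // computes the RDD and stores its partitions
      val second = doubled.count()  // can be answered from the stored partitions
      val level = doubled.getStorageLevel
      doubled.unpersist()           // release the cached partitions when done
      (level, first, second)
    } finally sc.stop()
  }
}
```

The serialized levels trade CPU time for space, as the table indicates, so they tend to pay off when the data is large relative to executor memory.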

Page 13: Apache Spark - Aram Mkrtchyan

How does it work?

Driver: the main program, which controls the flow of the application

Executors: processes on worker nodes that run the tasks the driver schedules

Page 14: Apache Spark - Aram Mkrtchyan

How does it work?

DAG Scheduler

Coordination between RDDs, the driver, and nodes

Page 15: Apache Spark - Aram Mkrtchyan

What is a Spark Application?

Page 16: Apache Spark - Aram Mkrtchyan

Advanced Topics: Stages

Page 17: Apache Spark - Aram Mkrtchyan

Advanced Topics: Shuffling
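Stages and shuffling are connected: the scheduler starts a new stage wherever an RDD depends on its parent through a shuffle. One way to observe this is to inspect an RDD's `dependencies`. The sketch below assumes Spark core in local mode; `ShuffleSketch` and `needsShuffle` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleSketch {
  // True if the RDD depends on a parent through a shuffle (a wide dependency).
  def needsShuffle(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  // Returns (does map need a shuffle?, does reduceByKey need a shuffle?).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("shuffle-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq("a", "b", "a")).map(word => (word, 1))
      // map() works partition by partition: a narrow dependency, same stage.
      // reduceByKey() must bring equal keys together: a shuffle, hence a new stage.
      val counts = pairs.reduceByKey(_ + _)
      (needsShuffle(pairs), needsShuffle(counts))
    } finally sc.stop()
  }
}
```

In the word count example this means the `flatMap` and `map` steps run together in one stage, and the `reduceByKey` result is computed in a second stage after the shuffle.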

Page 18: Apache Spark - Aram Mkrtchyan

Spark Stack

SQL

Streaming

Machine Learning

GraphX

Page 19: Apache Spark - Aram Mkrtchyan

?

If not… DEMO, everyone?