Intro to Apache Spark: Fast cluster computing engine for Hadoop
Intro to Scala: Object-oriented and functional language for the Java Virtual Machine
ACM SIGKDD, 7/9/2014
Roger Huang, Lead System Architect
rohuang@visa.com
@BigDataWrangler
Intro to Apache Spark and Scala, Austin ACM SIGKDD, 7/9/2014
• scala> numbers.foldLeft(0){ (acc, b) => acc + b }
• res1: Int = 55
• scala>
27Intro to Spark: Intro to Scala | 7/9/2014
FP: foldLeft
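The way foldLeft threads its accumulator from left to right can be sketched with scanLeft, which returns every intermediate accumulator rather than only the final one. The definition of `numbers` is not part of this transcript; the range 1 to 10 below is an assumption, chosen because it sums to the 55 shown in the REPL output.

```scala
object FoldLeftTrace {
  // Assumption: the slide's `numbers` is not shown; 1 to 10 sums to 55.
  val numbers: List[Int] = (1 to 10).toList

  // foldLeft starts from the seed 0 and combines left to right: ((0 + 1) + 2) + ...
  val total: Int = numbers.foldLeft(0) { (acc, b) => acc + b }

  // scanLeft keeps every intermediate accumulator, including the seed.
  val steps: List[Int] = numbers.scanLeft(0) { (acc, b) => acc + b }
}
```

The last element of `steps` is the value foldLeft returns.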
FP: find the last item in an array
• scala> val ns = Array(20, 40, 60)
• ns: Array[Int] = Array(20, 40, 60)
• scala> ns.foldLeft(ns.head) {(acc, b) => b}
• res0: Int = 60
• scala>
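Seeding the fold with ns.head works for the array above but throws on an empty array. A possible variant, sketched here with a hypothetical helper named `lastOf`, seeds the fold with None and returns an Option instead.

```scala
object LastItem {
  // Seeding with None avoids calling head, so an empty array is safe.
  def lastOf[A](xs: Array[A]): Option[A] =
    xs.foldLeft(Option.empty[A]) { (_, b) => Some(b) }
}
```

For everyday code the standard library's lastOption gives the same result; the fold is shown only to stay with the slide's technique.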
FP: reverse an array w/ foldLeft
• scala> val ns = Array(20, 40, 60)
• ns: Array[Int] = Array(20, 40, 60)
• scala> ns.foldLeft( Array[Int]() ) { (acc, b) => b +: acc}
• res1: Array[Int] = Array(60, 40, 20)
• scala>
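Prepending to an Array with +: allocates a new array at every step, so the fold above does quadratic work on long inputs. The same idea on an immutable List, where prepending with :: is constant time, might look like this (the object and method names are illustrative):

```scala
object ReverseWithFold {
  // Prepending to an immutable List is O(1), so the whole fold is O(n).
  def reverse[A](xs: List[A]): List[A] =
    xs.foldLeft(List.empty[A]) { (acc, b) => b :: acc }
}
```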
Outline
• Spark
– Hadoop eco system
• Scala
– Background
• Why Scala?
– For the computer scientist
– For the Java / OO programmer
– For the Spark developer
– For the Big Data developer
– For the Big Data scientist / mathematician
– For the system architect
Scala for the Java / OO developer:
• Interoperable w/ Java
• Case classes
• Mixins with traits
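A minimal sketch of mixing in traits follows. The names Greeter, Shouter, and Person are hypothetical and do not come from the talk; the point is only that a trait can carry concrete behavior and be stacked onto a class, even at instantiation time.

```scala
object MixinDemo {
  trait Greeter {
    def name: String                      // abstract member supplied by the class
    def greet: String = s"Hello, $name"   // concrete method provided by the trait
  }

  trait Shouter extends Greeter {
    // Stackable modification: decorate whatever greet was defined before.
    override def greet: String = super.greet.toUpperCase + "!"
  }

  class Person(val name: String) extends Greeter

  val plain = new Person("Ada")
  val loud = new Person("Ada") with Shouter   // mixin at instantiation time
}
```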
Scala for the Java / OO developer:
• case class
– Implements equals(), hashCode(), toString()
– Can be used in Pattern Matching
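The two points above can be illustrated with a small sketch. The Shape hierarchy is a made-up example, not taken from the talk.

```scala
object CaseClassDemo {
  sealed trait Shape
  case class Circle(radius: Double) extends Shape
  case class Rect(width: Double, height: Double) extends Shape

  // Pattern matching destructures case classes by their constructor fields.
  def area(s: Shape): Double = s match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }
}
```

Two Rect values with the same fields compare equal and share a hash code, and toString prints the constructor form, all without hand-written boilerplate.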
Scala for the Java / OO developer:
• http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html
• map
– <R> Stream<R> map(Function<? super T,? extends R> mapper): Returns a stream consisting of the results of applying the given function to the elements of this stream. This is an intermediate operation.
• flatMap
– <R> Stream<R> flatMap(Function<? super T,? extends Stream<? extends R>> mapper): Returns a stream consisting of the results of replacing each element of this stream with the contents of a mapped stream produced by applying the provided mapping function to each element. Each mapped stream is closed after its contents have been placed into this stream. (If a mapped stream is null an empty stream is used, instead.) This is an intermediate operation.
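For comparison, Scala collections offer map and flatMap with the same meaning as the Java 8 Stream operations quoted above. A small sketch on a List, with made-up sample data:

```scala
object MapFlatMapDemo {
  val lines: List[String] = List("to be", "or not")

  // map: exactly one output element per input element.
  val lengths: List[Int] = lines.map(_.length)

  // flatMap: each element maps to a collection, and the results are concatenated.
  val words: List[String] = lines.flatMap(_.split(" ").toList)
}
```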
Scala for the Spark developer
• Resilient Distributed Dataset (RDD)
• A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist.
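The idea of an immutable, partitioned collection can be modeled in plain Scala. The sketch below is a toy and is not the Spark API: `ToyRdd` and `parallelize` are hypothetical names, the partitions are processed one after another rather than in parallel, and nothing is distributed or fault tolerant. It only shows that each transformation returns a new collection and that a reduce can be computed per partition and then combined.

```scala
object ToyRddDemo {
  // Hypothetical toy type, NOT Spark's RDD: an immutable vector of partitions.
  final case class ToyRdd[A](partitions: Vector[Vector[A]]) {
    // Each transformation returns a new ToyRdd; partitions are handled independently.
    def map[B](f: A => B): ToyRdd[B] = ToyRdd(partitions.map(_.map(f)))
    def filter(p: A => Boolean): ToyRdd[A] = ToyRdd(partitions.map(_.filter(p)))

    // Reduce inside each non-empty partition, then combine the partial results.
    // Like reduce on an empty collection, this throws if there are no elements.
    def reduce(op: (A, A) => A): A =
      partitions.filter(_.nonEmpty).map(_.reduce(op)).reduce(op)

    def collect: Vector[A] = partitions.flatten
  }

  // Split a sequence into at most numPartitions chunks of roughly equal size.
  def parallelize[A](xs: Seq[A], numPartitions: Int): ToyRdd[A] = {
    val size = math.max(1, math.ceil(xs.size.toDouble / numPartitions).toInt)
    ToyRdd(xs.grouped(size).map(_.toVector).toVector)
  }

  val evensSquared: ToyRdd[Int] =
    parallelize(1 to 10, numPartitions = 3).filter(_ % 2 == 0).map(n => n * n)
}
```

The per-partition reduce gives the right answer only when the operation is associative, which is the same requirement a real parallel reduce places on its function.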
• Monoid
– If you want to “attach” operations such as +, -, *, / or <= to data objects (e.g., Bloom filters), then you want to provide monoid forms of those data objects
– Consists of
• A set of objects
• A binary operation that satisfies the monoid axioms (associativity, with an identity element)
• Monad
– If you want to create a data processing pipeline that transforms the state of a data object
– composition
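Both ideas can be sketched in a few lines of plain Scala. Everything below is illustrative: the `Monoid` trait is hand-rolled rather than taken from a library, set union stands in for a mergeable structure such as a Bloom filter, and Option is used as one simple monad for a pipeline whose steps may fail (the slide may have had a different monad in mind).

```scala
object MonoidMonadDemo {
  // A monoid: an associative binary operation plus an identity element.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Assumption: set union stands in for merging sketches such as Bloom filters.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def empty: Set[A] = Set.empty[A]
    def combine(x: Set[A], y: Set[A]): Set[A] = x union y
  }

  // Any monoid can be folded; associativity lets partial results merge in any grouping.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A = xs.foldLeft(m.empty)(m.combine)

  // Monad-style pipeline: each step may fail, and flatMap chains the steps.
  def parse(s: String): Option[Int] = scala.util.Try(s.trim.toInt).toOption
  def reciprocal(n: Int): Option[Double] = if (n == 0) None else Some(1.0 / n)

  def pipeline(s: String): Option[Double] =
    for {
      n <- parse(s)        // desugars to parse(s).flatMap(...)
      r <- reciprocal(n)
    } yield r * 100
}
```

A failure at any step short-circuits the rest of the pipeline, so callers handle a single Option at the end instead of checking after every step.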
Outline
• Spark
– Hadoop eco system
• Scala
– Background
• Why Scala?
– For the computer scientist
– For the Java / OO programmer
– For the Hadoop/Spark developer
– For the Big Data developer
– For the Big Data scientist / mathematician
– For the system architect
Scala for the system architect
• Concurrency
• Problem:
– Threads
– Shared mutable state
– Locks
• Solution:
– message passing concurrency w/ Actors
– Future, Promise
• Abstractions
– Actor
• an object that processes a message
• encapsulates state (state not shared)
– ActorRef
– Message, usually sent asynchronously
– Mailbox
– ActorSystem
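Of the solutions listed above, Future and Promise ship with the Scala standard library, so they can be sketched without extra dependencies. Actors need the Akka library and are not shown here. The names `sumAsync` and `run` are hypothetical, and blocking with Await is used only to make the result visible in a short example.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FuturePromiseDemo {
  // A Future runs its body on a thread pool and shares no mutable state.
  def sumAsync(xs: Seq[Int]): Future[Int] = Future { xs.sum }

  def run(): (Int, String) = {
    // Futures compose with flatMap/map, so no locks are needed to combine results.
    val combined: Future[Int] = for {
      a <- sumAsync(1 to 5)
      b <- sumAsync(6 to 10)
    } yield a + b

    // A Promise is the write side of a Future: completed once, read through .future.
    val promise = Promise[String]()
    val reply: Future[String] = promise.future
    promise.success("pong")

    // Blocking with Await is for demonstration only.
    (Await.result(combined, 5.seconds), Await.result(reply, 5.seconds))
  }
}
```

The work is kept inside a method rather than in the object's initializer, since blocking on a Future while an object is still initializing can deadlock.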
Scala for the system architect: Akka
• Fault tolerance
– Supervision
– Strategies
• Resume, restart, stop, escalate, …
• Scale out: remote actors
– Via configuration
Scala for the system architect
• Parallel collections