
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Apr 21, 2017

Page 1: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

#datapopupseattle

Scala and Spark are Ideal for Big Data

John Nestor, Sr Architect, 47 Degrees

47deg

Page 2: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

UNSTRUCTURED Data Science POP-UP in Seattle

www.dominodatalab.com

Produced by Domino Data Lab

Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.

Page 3: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Scala and Spark are Ideal for Big Data

John Nestor, 47 Degrees

Seattle Unstructured Data Science Pop-Up, October 7, 2015

www.47deg.com

Page 4: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Why Scala?

• Strong typing

• Concise elegant syntax

• Runs on JVM (Java Virtual Machine)

• Supports both object-oriented and functional programming

• Scales from small, simple programs to large parallel distributed systems

• Easy to cleanly extend with new libraries and DSLs

• Ideal for parallel and distributed systems

Page 5: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Scala: Strong Typing and Concise Syntax

• Strong typing like Java.

• Compile time checks

• Better modularity via strongly typed interfaces

• Easier maintenance: types make code easier to understand

• Concise syntax like Python.

• Type inference. Compiler infers most types that had to be explicit in Java.

• Powerful syntax that avoids much of the boilerplate of Java code (see next slide).

• Best of both worlds: safety of strong typing with conciseness (like Python).
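
As a rough illustration of the type inference point above, the following sketch declares values without annotations while the compiler still checks every type. The object and value names are made up for this example and are not from the talk.

```scala
object TypeInferenceDemo {
  // The compiler infers Int, String, and List[Int]; no annotations are needed.
  val count = 3
  val label = "items"
  val sizes = List(1, 2, 3)

  // Parameter types are declared; the return type (Int) is inferred.
  def total(xs: List[Int]) = xs.sum

  // Explicit annotations remain available where they help readability.
  val summary: String = s"$count $label, total ${total(sizes)}"

  def main(args: Array[String]): Unit = {
    println(summary)
    // The next line would be rejected at compile time, not at run time:
    // val wrong: Int = label
  }
}
```

The commented-out assignment shows the safety side of the trade-off: the mismatch is caught by the compiler even though no types were written for `label`.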

Page 6: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Scala Case Class

• Java version

    class User {
        private String name;
        private int age;   // Java's primitive type is int, not Int

        public User(String name, int age) { this.name = name; this.age = age; }

        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }
    User joe = new User("Joe", 30);

• Scala version

    case class User(name: String, var age: Int)
    val joe = User("Joe", 30)
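
Beyond the constructor and accessors, a case class also generates `toString`, structural equality, `copy`, and a pattern-matching extractor. The sketch below reuses the `User` definition from the slide; the wrapper object, the `describe` helper, and the extra sample users are illustrative assumptions.

```scala
object CaseClassDemo {
  // Same definition as on the slide: name is immutable, age is a mutable field.
  case class User(name: String, var age: Int)

  val joe = User("Joe", 30)          // no `new`: a companion apply method is generated
  val sameJoe = User("Joe", 30)
  val olderJoe = joe.copy(age = 31)  // a new instance with one field changed

  // Pattern matching relies on the generated extractor.
  def describe(u: User): String = u match {
    case User(n, a) if a >= 18 => s"$n is an adult"
    case User(n, _)            => s"$n is a minor"
  }

  def main(args: Array[String]): Unit = {
    println(joe)                // generated toString
    println(joe == sameJoe)     // structural equality, not reference equality
    println(describe(olderJoe))

    val ann = User("Ann", 40)   // hypothetical second user
    ann.age = 41                // generated setter, available because age is a var
    println(ann.age)
  }
}
```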

Page 7: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Functional Scala

• Anonymous functions. (a:Int,b:Int) => a+b

• Functions that take and return other functions.

• Rarely need variables or loops

• Immutable collections: Seq[T], Map[K,V], …

• Works well with concurrent or distributed systems

• Natural for functional programming

• Functional collection operations (a small sample)

• map, flatMap, reduce, …

• filter, groupBy, sortBy, take, drop, …
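
The operations listed above can be tried on an ordinary immutable `Seq` using only the Scala standard library. This is a minimal sketch; the word list and all value names are invented for the example.

```scala
object FunctionalDemo {
  // An anonymous function, as on the slide.
  val add = (a: Int, b: Int) => a + b

  // A function that takes a function and returns a new function.
  def twice(f: Int => Int): Int => Int = x => f(f(x))
  val addFour = twice(_ + 2)

  // Immutable collection: every operation below returns a new collection.
  val words = Seq("spark", "scala", "akka", "sbt", "play")

  val lengths = words.map(_.length)                      // Seq(5, 5, 4, 3, 4)
  val totalLength = lengths.reduce(add)                  // 21
  val allLetters = words.flatMap(_.toList)               // all 21 characters, flattened
  val byInitial = words.groupBy(_.head)                  // keys: 's', 'a', 'p'
  val shortestTwo = words.sortBy(_.length).take(2)       // Seq("sbt", "akka")
  val longAfterFirst = words.filter(_.length > 4).drop(1) // Seq("scala")

  def main(args: Array[String]): Unit = {
    println(addFour(1))
    println(byInitial)
    println(shortestTwo)
  }
}
```

No variables are reassigned and no loops are written; each step is an expression over the previous result.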

Page 8: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Scala Availability and Support

• Open Source

• Typesafe provides support. Founded by Martin Odersky, who designed Scala.

• IDEs: IntelliJ IDEA and Eclipse

• Libraries: lots now and more every day

• ScalaNLP - Epic (natural language processing)

• Major Scala users: LinkedIn, Twitter, Goldman Sachs, Coursera, Angie’s List, Whitepages

• Major systems written in Scala: Spark, Kafka

Page 9: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Typesafe Scala Components

• Scala Compiler (includes REPL)

• Scala Standard Libraries

• SBT - Scala Build Tool

• Play - scalable web applications

• Scala.js - compiles Scala to JavaScript

• Akka - for parallel and distributed computation

• Spray - high performance asynchronous TCP/HTTP library

• Spark - Typesafe also supports Spark

• Slick - for SQL database access

• ConductR - Scala deployment/devops tool

• Reactive Monitoring (Beta)

Page 10: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Why Spark?

• Support for not only batch but also (near) real-time

• Fast - keeps data in memory as much as possible

• Often 10X to 100X Hadoop speed

• A clean easy-to-use API

• A richer set of functional operations than just map and reduce

• A foundation for a wide set of integrated data applications

• Can recover from failures - by recomputation or (optionally) replication

• Scales to very large data sets and reduces processing time

Page 11: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Spark RDDs

• RDD[T] - resilient distributed dataset

• typed (must be serializable)

• immutable

• ordered

• can be processed in parallel

• lazy evaluation - permits more global optimizations

• Rich set of functional operations (a small sample)

• map, flatMap, reduce, …

• filter, groupBy, sortBy, take, …
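
A minimal sketch of these RDD properties, assuming the `spark-core` library is on the classpath and a local master is acceptable. The object name, the `sumOfEvenSquares` helper, and the application name are illustrative and not from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  // Builds an RDD[Int], applies lazy transformations, then runs one action.
  // Assumes upTo >= 2, since reduce fails on an empty RDD.
  def sumOfEvenSquares(sc: SparkContext, upTo: Int): Int = {
    val numbers = sc.parallelize(1 to upTo)   // RDD[Int], partitioned across workers
    val evens = numbers.filter(_ % 2 == 0)    // transformation: nothing runs yet
    val squares = evens.map(n => n * n)       // transformation: still lazy
    squares.reduce(_ + _)                     // action: triggers the computation
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(sumOfEvenSquares(sc, 10))
    finally sc.stop()
  }
}
```

Each transformation returns a new immutable RDD; only the final `reduce` causes work to be scheduled, which is what gives Spark room to plan the whole pipeline at once.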

Page 12: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Spark Components

• Spark Core

• Scalable multi-node cluster

• Failure detection and recovery

• RDDs and functional operations

• MLlib - for machine learning

• linear regression, SVMs, clustering, collaborative filtering, dimension reduction

• more on the way!

• GraphX - for graph computation

• Streaming - for near real-time

• DataFrames - for SQL and JSON

Page 13: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Spark Availability and Support

• Open Source - top level Apache project

• Over 750 contributors from over 200 organizations

• Can process multiple petabytes on clusters of over 8000 nodes

• Databricks. Matei Zaharia, who wrote the original Spark, is a founder and CTO

• Packages (more every day)

• Zeppelin - Scala notebooks

• Cassandra, Kafka connectors

Page 14: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Clusters and Scalability

• Scala Akka clusters (process distribution, micro services)

• message passing

• remote Actors

• Spark clusters (data distribution)

• local

• Standalone (optionally with ZooKeeper)

• Apache Mesos

• Hadoop YARN

• all of the above can run on Amazon and Google clouds
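
In application code, the choice among these Spark cluster managers mostly comes down to the master URL handed to Spark. The sketch below models that choice in plain Scala; the `ClusterManager` types and the `masterUrl` helper are hypothetical names for illustration and are not part of Spark's API, and any host names passed in are placeholders. The URL formats follow the Spark documentation (older releases used `yarn-client` or `yarn-cluster` instead of `yarn`).

```scala
object MasterUrls {
  // Hypothetical model of the cluster managers listed on the slide.
  sealed trait ClusterManager
  case class Local(threads: Option[Int]) extends ClusterManager       // None means all cores
  case class Standalone(masters: Seq[(String, Int)]) extends ClusterManager
  case class Mesos(host: String, port: Int) extends ClusterManager
  case object Yarn extends ClusterManager

  // Builds the master URL string for the chosen cluster manager.
  def masterUrl(cm: ClusterManager): String = cm match {
    case Local(Some(n)) => s"local[$n]"
    case Local(None)    => "local[*]"
    // Several standalone masters (ZooKeeper high availability) are comma-separated.
    case Standalone(ms) => ms.map { case (h, p) => s"$h:$p" }.mkString("spark://", ",", "")
    case Mesos(h, p)    => s"mesos://$h:$p"
    case Yarn           => "yarn"
  }

  def main(args: Array[String]): Unit = {
    println(masterUrl(Local(None)))
    println(masterUrl(Standalone(Seq(("host1", 7077), ("host2", 7077)))))
  }
}
```

The resulting string is what would be passed to `SparkConf.setMaster` or to the `--master` option of `spark-submit`; the rest of the application is unchanged.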

Page 15: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Why Scala for Spark?

• Why not Python, R, or Java for Spark?

• Spark is written in Scala

• The Scala source code serves as important Spark documentation

• Spark is best extended in Scala

• The primary API for Spark is Scala

• The functional features of Scala and Spark are a natural fit and easiest to use in Scala

• If you want to build scalable, high-performance production code based on Spark: R by itself is too specialized, Python is too slow, and Java is tedious to write and maintain

Page 16: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Demo

Page 17: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Seattle Resources

• Seattle Meetups

• Scala at the Sea Meetup http://www.meetup.com/Seattle-Scala-User-Group/

• Seattle Spark Meetup http://www.meetup.com/Seattle-Spark-Meetup/

• Seattle Training: Spark and Typesafe Scala Classes http://www.47deg.com/events#training

• UW Scala Professional Certificate Program http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html

Page 18: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

@datapopup #datapopupseattle

Page 19: Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle

Thank You To Our Sponsors