Spark 1.0 and Beyond

Patrick Wendell, Databricks (spark.incubator.apache.org)

Transcript
Page 1:

Patrick Wendell

Databricks

Spark.incubator.apache.org

Spark 1.0 and Beyond

Page 2:

About Me

Committer and PMC member of Apache Spark

“Former” PhD student at Berkeley

Release manager for Spark 1.0

Background in networking and distributed systems

Page 3:

Today’s Talk

Spark background

About the Spark release process

The Spark 1.0 release

Looking forward to Spark 1.1

Page 4:

What is Spark?

Efficient:

General execution graphs

In-memory storage

Usable:

Rich APIs in Java, Scala, Python

Interactive shell

Fast and expressive cluster computing engine, compatible with Apache Hadoop

2-5× less code

Up to 10× faster on disk, 100× in memory
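
As a minimal sketch of that conciseness (in the Scala shell, with a placeholder input path), word count looks like this:

    val counts = sc.textFile("input.txt")      // placeholder input path
      .flatMap(_.split(" "))                   // split each line into words
      .map(word => (word, 1))                  // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word
    counts.cache()                             // keep the result in memory for reuse
    counts.take(10).foreach(println)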

Page 5:

Page 6:

30-Day Commit Activity

[Charts: patches, lines added, and lines removed over 30 days, comparing MapReduce, Storm, YARN, and Spark.]

Page 7:

Spark Philosophy

Make life easy and productive for data scientists

Well-documented, expressive APIs

Powerful domain specific libraries

Easy integration with storage systems

… and caching to avoid data movement

Predictable releases, stable APIs

Page 8:

Spark Release Process

Quarterly release cycle (3 months)

2 months of general development

1 month of polishing, QA and fixes

Spark 1.0: Feb 1 → April 8th → April 8th+

Spark 1.1: May 1 → July 8th → July 8th+

Page 9:

Spark 1.0: By the Numbers

- 3 months of development

- 639 patches

- 200+ JIRA issues

- 100+ contributors

Page 10:

API Stability in 1.X

APIs are stable for all non-alpha projects

Spark 1.1, 1.2, … will be compatible

@DeveloperApi

Internal API that is unstable

@Experimental

User-facing API that might stabilize later
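
As an illustrative sketch of how these markers appear in Spark's own source (the annotations are real, from org.apache.spark.annotation; the class and method here are hypothetical):

    import org.apache.spark.annotation.{DeveloperApi, Experimental}

    @DeveloperApi                    // internal: may change between releases
    class LowLevelHook { /* hypothetical internal extension point */ }

    object Api {
      @Experimental                  // user-facing, but not yet frozen
      def tryNewFeature(): Unit = () // hypothetical method
    }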

Page 11:

Today’s Talk

About the Spark release process

The Spark 1.0 release

Looking forward to Spark 1.1

Page 12:

Spark 1.0 Features

Core engine improvements

Spark Streaming

MLlib

Spark SQL

Page 13:

Spark Core

History server for Spark UI

Integration with YARN security model

Unified job submission tool

Java 8 support

Internal engine improvements

Page 14:

History Server

Configure with:

spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://XX

In Spark Standalone, history server is embedded in the master.

In YARN/Mesos, run history server as a daemon.

Page 15:

Job Submission Tool

Apps don't need to hard-code the master:

val conf = new SparkConf().setAppName("My App")
val sc = new SparkContext(conf)

./bin/spark-submit <app-jar> \
  --class my.main.Class \
  --name myAppName \
  --master local[4]        # or: --master spark://some-cluster

Page 16:

Java 8 Support

RDD operations can use lambda syntax.

Old:

class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}

JavaRDD<String> words = lines.flatMap(new Split());

New:

JavaRDD<String> words = lines
  .flatMap(s -> Arrays.asList(s.split(" ")));

Page 17:

Java 8 Support

NOTE: Minor API changes

(a) If you are extending Function classes, use implements rather than extends.

(b) Return-type-sensitive functions:

mapToPair, mapToDouble

Page 18:

Python API Coverage

RDD operators:

intersection(), take(), top(), takeOrdered()

Metadata:

name(), id(), getStorageLevel()

Runtime configuration:

setJobGroup(), setLocalProperty()

Page 19:

Integration with YARN Security

Supports Kerberos authentication in YARN environments:

spark.authenticate = true

ACL support for user interfaces:

spark.ui.acls.enable = true

spark.ui.view.acls = patrick, matei
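
The same settings can also be applied programmatically; a minimal sketch using SparkConf (the user names are the example values above):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.authenticate", "true")          // enable authentication
      .set("spark.ui.acls.enable", "true")        // enforce UI ACLs
      .set("spark.ui.view.acls", "patrick,matei") // users allowed to view the UI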

Page 20:

Engine Improvements

Job cancellation directly from the UI

Garbage collection of shuffle and RDD data

Page 21:

Documentation

Unified Scaladocs across modules

Expanded MLlib guide

Deployment and configuration specifics

Expanded API documentation

Page 22:

[Diagram: the Spark stack]

Spark: RDDs, transformations, and actions

Spark Streaming: real-time; DStreams are streams of RDDs

Spark SQL: SchemaRDDs

MLlib: machine learning; RDD-based matrices

Page 23:

Spark SQL

Page 24:

Turning an RDD into a Relation

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

Page 25:

Querying using SQL

// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support
// normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (à la LINQ).
val teenagers = people.where('age >= 10).where('age <= 19).select('name)

Page 26:

Import and Export

// Save SchemaRDDs directly to Parquet.
people.saveAsParquetFile("people.parquet")

// Load data stored in Hive.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")

Page 27:

In-Memory Columnar Storage

Spark SQL can cache tables using an in-memory columnar format:

- Scan only required columns
- Fewer allocated objects (less GC)
- Automatically selects best compression
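
A minimal sketch, assuming a SQLContext named sqlContext and the people table registered earlier:

    sqlContext.cacheTable("people")                     // cache using the columnar format
    sqlContext.sql("SELECT name FROM people").collect() // scans only the name column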

Page 28:

Spark Streaming

Web UI for streaming

Graceful shutdown

User-defined input streams

Support for creating them in Java

Refactored API
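
For context, a minimal Scala sketch of a streaming job ending in a graceful stop (the host and port are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations

    val conf = new SparkConf().setAppName("StreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batches

    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // ... later: stop gracefully, letting received data finish processing
    ssc.stop(stopSparkContext = true, stopGracefully = true)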

Page 29:

MLlib

Sparse vector support

Decision trees

Linear algebra

SVD and PCA

Evaluation support

3 contributors in the last 6 months
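
A minimal sketch of the new vector types (the values are arbitrary):

    import org.apache.spark.mllib.linalg.Vectors

    val sv = Vectors.sparse(4, Array(0, 3), Array(1.0, -2.0)) // size 4, nonzeros at indices 0 and 3
    val dv = Vectors.dense(1.0, 0.0, 0.0, -2.0)               // dense equivalent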

Page 30:

MLlib

Note: Minor API change

Old:

val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
val clusters = KMeans.train(parsedData, 4, 100)

New:

val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val clusters = KMeans.train(parsedData, 4, 100)

Page 31:

1.1 and Beyond

Data import/export leveraging Catalyst (HBase, Cassandra, etc.)

Shark-on-Catalyst

Performance optimizations:

External shuffle

Pluggable storage strategies

Streaming: Reliable input from Flume and Kafka

Page 32:

Unifying Experience

SchemaRDD represents a consistent integration point for data sources

spark-submit abstracts the environmental details (YARN, hosted cluster, etc.).

API stability across versions of Spark

Page 33:

Conclusion

Visit spark.apache.org for videos, tutorials, and hands-on exercises.

Help us test a release candidate!

Spark Summit on June 30th

spark-summit.org

Meetup group: meetup.com/spark-users