Apache Spark and the future of big data applications Eric Baldeschwieler

Dec 24, 2015

Transcript
Page 1: Apache Spark and the future of big data applications Eric Baldeschwieler.

Apache Spark and the future of big data applications

Eric Baldeschwieler

Page 2: Apache Spark and the future of big data applications Eric Baldeschwieler.

Who is Eric14?

• Big data veteran (since 1996)

• Databricks Tech Advisor

• Twitter handle: @jeric14

• Previously

• CTO/CEO of Hortonworks

• Yahoo - VP Hadoop Engineering

• Inktomi – Web Search

Page 3: Apache Spark and the future of big data applications Eric Baldeschwieler.

Last Spark Summit…

“IMO Apache Spark is the most exciting thing happening in big data today”

The potential:

• Spark - the lingua franca for data science

• Spark and Hadoop - great together

• So how are we doing?

Page 4: Apache Spark and the future of big data applications Eric Baldeschwieler.

Spark is now bundled with Hadoop

All major Hadoop distributions include Spark

Other big data solutions too!

Page 5: Apache Spark and the future of big data applications Eric Baldeschwieler.

Spark is in use for data science

Page 6: Apache Spark and the future of big data applications Eric Baldeschwieler.

Great progress!

Time to declare victory?

Nope, we’re only getting started.

Page 7: Apache Spark and the future of big data applications Eric Baldeschwieler.

Making Spark great for Data Science

Page 8: Apache Spark and the future of big data applications Eric Baldeschwieler.

Increase focus on ETL

• Science needs data, in the right place and format
• Teams are porting Hadoop ETL languages to Spark
• Better job scheduling tools
• ETL workloads are different – Scale and Throughput
  – Spark 1.0 a big step forward for these workloads
    • 1000 node Spark clusters
    • petabyte scale jobs
  – Let’s build benchmarks and iterate…

https://github.com/Cascading/CoPA/wiki

Page 9: Apache Spark and the future of big data applications Eric Baldeschwieler.

More stuff

• R bindings!!
• Add features to ease/accelerate code sharing
• SparkSQL needs to be extended to run against more data stores, including object stores
• Deep learning and other Algo support
  – Trade off completeness for speed
  – More communication primitives?
• Developer basics
  – Profiling & debugging, error reporting & logging…

Page 10: Apache Spark and the future of big data applications Eric Baldeschwieler.

Data science drives new applications

(some history)

Page 11: Apache Spark and the future of big data applications Eric Baldeschwieler.

Hadoop at Yahoo!

Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/

Page 12: Apache Spark and the future of big data applications Eric Baldeschwieler.

CASE STUDY: YAHOO! HOMEPAGE

• Serving Maps
  • Users - Interests
• Five Minute Production
• Weekly Categorization models

[Diagram labels: Science Hadoop cluster; Production Hadoop cluster; Serving systems; User behavior; Engaged users; Categorization models (weekly); Serving maps (every 5 minutes)]

» Identify user interests using Categorization models

» Machine learning to build ever better categorization models

Build customized home pages with latest data (thousands / second)

© Yahoo 2011

Page 13: Apache Spark and the future of big data applications Eric Baldeschwieler.

Big data application model

Interactive layer
• Web & App Servers (ApacheD, Tomcat…)
• Serving Store (Cassandra, MySQL…)

Streaming layer
• Message Bus (Kafka…)
• Streaming Engine (Spark, Storm…)

Batch layer
• Batch Compute Framework (Spark, MapReduce…)
• Batch Storage (HDFS, S3…)
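
To make the three layers concrete, here is a minimal sketch in plain Python (no Spark or Kafka involved). The page-view counting task, the function names `batch_view`, `stream_update` and `serve`, and the idea of summing a batch view with a streaming view at query time are all assumptions chosen for illustration; the slide only names the layers and their typical components.

```python
from collections import Counter

def batch_view(events):
    """Batch layer: recompute page-view counts from the full stored history."""
    return Counter(event["page"] for event in events)

def stream_update(realtime_view, event):
    """Streaming layer: fold one new event from the message bus into a running count."""
    realtime_view[event["page"]] += 1
    return realtime_view

def serve(batch, realtime, page):
    """Interactive layer: answer a query by combining the batch and streaming views."""
    return batch.get(page, 0) + realtime.get(page, 0)

# Hypothetical data: events already in batch storage, then events still arriving.
history = [{"page": "home"}, {"page": "news"}, {"page": "home"}]
batch = batch_view(history)

realtime = Counter()
for event in [{"page": "home"}, {"page": "sports"}]:
    stream_update(realtime, event)

print(serve(batch, realtime, "home"))
```

In a real deployment each function would be a separate system (for example a batch job over HDFS, a streaming job reading a message bus, and a serving store behind the web tier); the sketch only shows how responsibilities might be divided.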

Page 14: Apache Spark and the future of big data applications Eric Baldeschwieler.

Spark applications today

Page 15: Apache Spark and the future of big data applications Eric Baldeschwieler.

3rd party applications

Spark Distros…

Spark Apps …

Page 16: Apache Spark and the future of big data applications Eric Baldeschwieler.

Classes of applications

• Custom Solutions
  – Internal apps
• Enterprise data tooling
  – ETL, BI/Query
• Data science tooling
  – Analytics & ML
  – Collaboration & Reporting
• Vertical specific applications
  – financial, healthcare, IC
  – marketing, retail, gaming

[Diagram: big data application model with an interactive layer (Web & App Servers, Serving Store), a stream layer (Message Bus, Streaming Engine) and a batch layer (Batch Compute Framework, Batch Storage)]

Page 17: Apache Spark and the future of big data applications Eric Baldeschwieler.

Why so much activity?

• Spark’s speed allows compelling interactivity
• Interactive API eases development
• Spark runs well in many environments
  – Cloud, Hadoop/YARN, Cassandra, …
• Broad open source community support

Page 18: Apache Spark and the future of big data applications Eric Baldeschwieler.

Improving Spark for Applications

Page 19: Apache Spark and the future of big data applications Eric Baldeschwieler.

Build an Open Certification Suite

• Goals of certification for an application
  – Write an app once and run anywhere
  – Apps continue running after platform upgrades
• Value of an open suite to user community
  – Contributors can validate their specific requirements
  – Drives widest possible ecosystem for applications
• Open source community “does the right thing”
  – Backwards compatibility has been a challenge in the big data ecosystem…
  – Sharing code that defines correct compatibility a win
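
One way to picture "code that defines correct compatibility" is a shared check that runs the same application against several platform releases and compares the results. The sketch below is plain Python and entirely hypothetical: `PlatformV1`, `PlatformV2`, `word_count` and `certify` are invented stand-ins, not part of Spark or of any actual certification suite.

```python
def word_count(platform, lines):
    """Hypothetical application that uses only the platform's public API."""
    pairs = platform.flat_map(lines, lambda line: [(w, 1) for w in line.split()])
    return platform.reduce_by_key(pairs, lambda a, b: a + b)

class PlatformV1:
    """Stand-in for one platform release."""
    def flat_map(self, items, fn):
        out = []
        for item in items:
            out.extend(fn(item))
        return out

    def reduce_by_key(self, pairs, fn):
        result = {}
        for key, value in pairs:
            result[key] = fn(result[key], value) if key in result else value
        return result

class PlatformV2(PlatformV1):
    """Stand-in for an upgraded release with a different internal implementation."""
    def flat_map(self, items, fn):
        return [pair for item in items for pair in fn(item)]

def certify(app, platforms, sample, expected):
    """Return the names of platforms whose output differs from the expected result."""
    return [type(p).__name__ for p in platforms if app(p, sample) != expected]

sample = ["spark and hadoop", "spark everywhere"]
expected = {"spark": 2, "and": 1, "hadoop": 1, "everywhere": 1}
failures = certify(word_count, [PlatformV1(), PlatformV2()], sample, expected)
print(failures)
```

An empty list would mean the app behaves identically on every listed release, which is the "write once, keep running after upgrades" goal in miniature.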

Page 20: Apache Spark and the future of big data applications Eric Baldeschwieler.

Tachyon / More storage APIs

• Better multi-tenancy for Spark
  – Keep objects in RAM between Jobs / users
  – Have a system wide view of RAM requirements
• Provide storage portability
  – Spark can run against S3, HDFS, Cassandra…
  – Can we make this transparent to applications?
• Better use of tiered storage
  – RAM, SSD and Disk
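
The tiered storage idea can be illustrated with a toy store that keeps recently used objects in a small fast tier and pushes older ones down to slower tiers. This is a sketch under stated assumptions, not how Tachyon is implemented: the class name `TieredStore`, the slot-based capacities and the least-recently-used eviction policy are all choices made for the example.

```python
from collections import OrderedDict

class TieredStore:
    """Toy tiered store: small fast tiers in front of an unbounded slow tier."""

    def __init__(self, ram_slots, ssd_slots):
        # Each tier is (name, contents in least-recently-used order, capacity).
        self.tiers = [
            ("RAM", OrderedDict(), ram_slots),
            ("SSD", OrderedDict(), ssd_slots),
            ("Disk", OrderedDict(), None),  # None means unbounded
        ]

    def put(self, key, value):
        for _, data, _ in self.tiers:
            data.pop(key, None)  # drop any stale copy in another tier
        self._insert(key, value, 0)

    def _insert(self, key, value, level):
        _, data, capacity = self.tiers[level]
        data[key] = value
        if capacity is not None and len(data) > capacity:
            # Evict the least recently used entry to the next, slower tier.
            old_key, old_value = data.popitem(last=False)
            self._insert(old_key, old_value, level + 1)

    def get(self, key):
        """Return (value, tier name where it was found) and promote the key to RAM."""
        for name, data, _ in self.tiers:
            if key in data:
                value = data[key]
                self.put(key, value)
                return value, name
        raise KeyError(key)

store = TieredStore(ram_slots=1, ssd_slots=1)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(k, v)
```

With one slot each in RAM and SSD, the oldest object ends up on the slowest tier, and reading it moves it back to RAM while displacing the others downward.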

Page 21: Apache Spark and the future of big data applications Eric Baldeschwieler.

Enjoy the show! -@jeric14

“IMO @ApacheSpark is the most exciting thing happening in big data today”