Accelerating Real-Time Analytics with Spark
October 8, 2015
©2015 Talend Inc

Housekeeping

Audio – Streamed via media player, turn volume up

Submit questions for Q&A via Group Chat widget

Download slides and event materials

Hashtag: #stratahadoop

Your Speakers Today

Sean Owen Director of Data Science Cloudera, EMEA

Yann Delacourt Director, Big Data Product Management Talend

Agenda

•  Apache Spark, its architecture and benefits
•  Spark's architecture, deployment strategies and use cases
•  Spark's impact on data science, analytics and machine learning
•  How to move data scientists' work to IT production
•  Best practices for large Spark deployments
•  Mastering Spark's complexity

Accelerating Real-Time Analytics with Apache Spark
Sean Owen, Director of Data Science, Cloudera, EMEA

What is Apache Spark?

Spark is a general purpose computational framework with more flexibility than MapReduce
•  Leverages distributed memory
•  Full Directed Graph expressions for data parallel computations
•  Improved developer experience
•  Linear scalability, Data Locality
•  Fault-tolerance

The Spark Ecosystem & Hadoop

Spark components: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR

Processing engines: Spark, Impala, MR, Search, Others

RESOURCE MANAGEMENT: YARN

STORAGE: HDFS, HBase

Apache Spark
Flexible, in-memory data processing for Hadoop

Easy Development
•  Rich APIs for Scala, Java, and Python
•  Interactive shell

Flexible Extensible API
•  APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph

Fast Batch & Stream Processing
•  In-Memory processing and caching

Easy Development Use Interactively

•  Interactive exploration of data for data scientists
•  No need to develop “applications”
•  Developers can prototype applications on a live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...

scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>

Easy Development Expressive API

•  map

•  filter

•  groupBy

•  sort

•  union

•  join

•  leftOuterJoin

•  rightOuterJoin

•  sample

•  take

•  first

•  partitionBy

•  mapWith

•  pipe

•  save

•  …

•  reduce

•  count

•  fold

•  reduceByKey

•  groupByKey

•  cogroup

•  cross

•  zip
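To make a few of these operators concrete, here is a minimal sketch that imitates map, filter, reduceByKey, and join on ordinary Python lists. It is not the Spark API: `reduce_by_key` and `join` are hypothetical helper names written for this illustration, and everything runs on one machine rather than on partitions.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values that share a key using a binary function."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_index.get(key, [])]


words = ["spark", "hadoop", "spark", "yarn", "spark", "hadoop"]

# map: turn each word into a (word, 1) pair
pairs = list(map(lambda w: (w, 1), words))
# filter: keep only words with at least five characters
long_pairs = list(filter(lambda kv: len(kv[0]) >= 5, pairs))
# reduceByKey: sum the counts for each word
counts = reduce_by_key(long_pairs, lambda a, b: a + b)
# join: attach a label to each counted word
labels = [("spark", "engine"), ("hadoop", "platform")]
joined = join(sorted(counts.items()), labels)

print(counts)
print(joined)
```

In Spark the same chain would be expressed as method calls on an RDD and evaluated lazily across the cluster; the local version only shows what each step computes.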

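Example Logistic Regression

# readPoint, D, and iterations are defined elsewhere; spark is the SparkContext
import numpy
from math import exp

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    # Sum the per-point gradients of the logistic loss across the cluster
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient
print("Final w: %s" % w)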
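The same loop can be tried without a cluster. The sketch below is a plain-Python stand-in, assuming labels in {-1, +1} and a tiny hand-made data set; `dot`, `point_gradient`, and `train` are hypothetical names for this illustration. The list comprehension plays the role of `map` and the column-wise sum plays the role of `reduce`.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w, for y in {-1, +1}."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, dims, iterations, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        # "map": one gradient per point
        gradients = [point_gradient(w, x, y) for x, y in points]
        # "reduce": element-wise sum of all gradients
        total = [sum(column) for column in zip(*gradients)]
        # gradient step with a step size of 1, as in the slide
        w = [wi - gi for wi, gi in zip(w, total)]
    return w


# Toy, linearly separable data: (features, label)
points = [
    ([1.0, 2.0], 1),
    ([2.0, 1.5], 1),
    ([-1.0, -2.0], -1),
    ([-2.0, -0.5], -1),
]

w = train(points, dims=2, iterations=20)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(predictions)
```

In the Spark version the cached `data` is reused on every pass, so only the first iteration pays the cost of reading and parsing the input.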

Spark Takes Advantage of Memory

Resilient Distributed Datasets (RDD)
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Can fall back to disk when the dataset does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault-tolerance through the concept of lineage
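Lineage can be illustrated with a toy class that remembers how it was derived instead of replicating its data. This is a simplified sketch, not Spark's implementation: `ToyDataset` is a hypothetical name, it caches every result automatically (Spark caches only on request), and it ignores partitioning.

```python
class ToyDataset:
    """A toy stand-in for an RDD: it records how to rebuild itself (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # stable storage, for a root dataset
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's records
        self.cache = None           # in-memory copy, may be lost at any time
        self.computations = 0       # how many times this dataset was (re)built

    def map(self, func):
        return ToyDataset(parent=self, transform=lambda rows: [func(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self.cache is not None:
            return self.cache
        self.computations += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.transform(self.parent.collect())
        self.cache = rows
        return rows

    def lose_cache(self):
        """Simulate a node failure that drops the cached copy."""
        self.cache = None


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

first = evens_squared.collect()   # computed from stable storage, then cached
second = evens_squared.collect()  # served from the cache
evens_squared.lose_cache()        # the cached copy is lost
third = evens_squared.collect()   # rebuilt by replaying the lineage

print(first, evens_squared.computations)
```

The lost result is recovered by re-running the recorded transformations on the parent, which is why no second copy of the data has to be kept.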

Fast Processing Using RAM, Operator Graphs

In-Memory Caching
•  Data Partitions read from RAM instead of disk

Operator Graphs
•  Scheduling Optimizations
•  Fault Tolerance

[Figure: operator graph with filter and groupBy stages over RDDs C, D, and E; legend: cached partition, RDD]

Data Science Batteries Included

MLlib
•  Existing, mature Spark ML subproject
•  Covers the basics well
•  Decision trees, SVM, LR
•  ALS, SVD
•  K-means
•  … and more
•  Stand-alone implementations
•  Algorithms Only

ML “Pipelines”
•  Beta “MLlib 2.0”
•  Emulates scikit-learn APIs
•  Pipelines, not just algos
•  Feature engineering
•  Transformation
•  Ensembles
•  Unified architecture
•  Spark 1.4+

Faster Iterative ML Algorithms (Data Fits in Memory)

[Chart: running time (s), 0 to 4000, versus # of iterations (1, 5, 10, 20, 30)]

MapReduce: 110 s/iteration

Spark: first iteration = 80 s, further iterations 1 s due to caching
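The per-iteration figures quoted on this slide can be turned into cumulative running times. The sketch below only does that arithmetic, assuming the 110 s, 80 s, and 1 s values hold for every iteration count on the chart; the function names are hypothetical.

```python
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_CACHED_ITERATION_SECONDS = 1


def mapreduce_total(iterations):
    """Every iteration re-reads the data, so the cost is linear."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """The first iteration loads and caches the data; later ones read from RAM."""
    if iterations == 0:
        return 0
    return SPARK_FIRST_ITERATION_SECONDS + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, mapreduce_total(n), spark_total(n))
```

At 30 iterations this gives 3300 s against 109 s, which is the gap the chart shows.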

Cloudera Customer Use Cases

Core Spark

Financial Services
•  Portfolio Risk Analysis
•  ETL Pipeline Speed-Up
•  20+ years of stock data

Health
•  Identify disease-causing genes in the full human genome
•  Calculate Jaccard scores on health care data sets

•  Optical Character Recognition and Bill Classification

Data Services
•  Trend analysis
•  Document classification (LDA)
•  Fraud analytics

Spark Streaming

Financial Services
•  Online Fraud Detection

Health
•  Incident Prediction for Sepsis

Retail
•  Online Recommendation Systems
•  Real-Time Inventory Management

Ad Tech
•  Real-Time Ad Performance Analysis

Uniting Spark and Hadoop
The One Platform Initiative: Investment Areas

Management: Leverage Hadoop-native resource management.

Security: Full support for Hadoop security and beyond.

Scale: Enable 10k-node clusters.

Streaming: Support for 80% of common stream processing workloads.

Detailed Roadmap: One Platform Initiative
[Legend: completed work, planned future work]

Management
•  Spark on YARN Integration
•  HBase integration
•  Improved metrics for monitoring/troubleshooting
•  Dynamic Resource Allocation
•  Spark on YARN:
•  Container resizing
•  Dynamic Resource Allocation for Streaming
•  Simplified resource configuration
•  Improved WebUI for debugging
•  Improved metrics for visibility into resource utilization
•  Smart auto-tuning of job parameters

Security
•  Kerberos Integration
•  HDFS Sync (Sentry)
•  Secure data at rest
•  Secure data over the wire
•  Audit/Lineage (Navigator)
•  Spark PCI compliance
•  Integration with Intel’s advanced encryption libraries
•  Enable column and view level security

Scale
•  Revamp Scheduler handling of node failure
•  Sort based shuffle improvements
•  Task Scheduling based on HDFS data locality and caching
•  Scheduler improvements for performance at scale
•  Stress test at scale with mixed multi-tenant workloads
•  HDFS DDM Integration
•  Dynamic resource utilization & prioritization
•  Scale Spark History Server for 1000s of jobs

Streaming
•  Zero Data Loss with Spark Streaming Resilience
•  Flume integration
•  Kafka integration
•  SQL semantics for expressing streaming jobs (Business Users)
•  New streaming specific API extensions
•  Streaming application management (pause, update, redeploy) via CM
•  Optimized state updates: efficient point lookups and delta updates

The Bad News

Spark is a Developer Framework
• Spark means writing code
• And deploying it
• And monitoring it
• Workflow orchestration is hard
• Oozie? Luigi?
• Custom scripts

Data is Still Fickle
• Data Quality is still hard
• Spark still can’t automatically find and clean bad records
• Feature engineering = ETL
• Data Integration is still hard
• Read / write the right formats
• “Publish” to BI tools

Accelerating Real-Time Analytics with Spark
Yann Delacourt, Director of Big Data Product Management, Talend

APPLICATION INTEGRATION

CLOUD INTEGRATION

DATA INTEGRATION

BIG DATA INTEGRATION

MASTER DATA MANAGEMENT

A Modern Data Platform for All Your Integration Needs

INTEGRATE ANYTHING. OPERATE IN REAL-TIME. ACT WITH INSIGHT.

BIG DATA, CUSTOMERS & SUPPLIERS

ON-PREMISE APPS

CLOUD APPS | IOT SENSORS | CUSTOMERS | SUPPLIERS

DEVELOPER STUDIO Web UI

DATA FABRIC

Introducing Talend Real-Time Big Data
1st Data Integration Platform on Apache Spark

•  Visually develop jobs that run 100% on Spark
•  5X faster using independent benchmarks
•  10X developer productivity gained over hand-coding Spark
•  100X faster with in-memory processing
•  Over 100 new drag-n-drop Spark components
•  HDFS, RDBMS, NoSQL, Cloud Storage, Transformation, Messaging, In-memory analytics & machine learning recommendations, and much more
•  In-memory data caching & “windowed” computations
•  Click to enable Spark Streaming for real-time data processing
•  Convert Talend MapReduce jobs to Spark with the click of a button, future proofing your investment

Benefits: Make decisions faster. Tremendous developer productivity.

Benefits: Developer productivity. Business agility.

Enabling Intelligent Data Pipelining

Lambda Architecture: Batch, Real-time, Query

•  A single solution to address
•  Bulk/batch
•  Real-time
•  Streaming & IoT data
•  Machine Learning
•  Provides Fast Data access through NoSQL
•  One tool for Hadoop, Spark, traditional ETL/ELT and NoSQL integration

[Diagram: Lambda architecture. Sources: Web Logs, DBMS/EDW, Legacy. Batch Layer: All Data, Learning on past Data, Pre-computed views. Speed Layer: Incremental Data, Sliding Window Analytics, Apply Learning, Real-Time Views. Serving Layer: Query.]
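The three layers can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not Talend's or Spark's API: `BatchLayer`, `SpeedLayer`, and `query` are hypothetical names, events are `(timestamp, page)` tuples, and the speed layer is assumed to hold only events that the batch view has not absorbed yet.

```python
from collections import Counter, deque


class BatchLayer:
    """Recomputes a view over all data stored so far."""

    def __init__(self):
        self.master = []
        self.view = Counter()

    def append(self, events):
        self.master.extend(events)

    def recompute(self):
        # Pre-computed view: page -> number of visits in the full history
        self.view = Counter(page for _, page in self.master)


class SpeedLayer:
    """Keeps only the events that fall inside a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()

    def add(self, timestamp, page):
        self.events.append((timestamp, page))
        # Drop events that have slid out of the window
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def view(self):
        # Real-time view: page -> number of visits inside the window
        return Counter(page for _, page in self.events)


def query(batch, speed, page):
    """Serving layer: merge the pre-computed view with the real-time view."""
    return batch.view[page] + speed.view()[page]


batch = BatchLayer()
batch.append([(0, "home"), (5, "cart"), (9, "home")])
batch.recompute()

speed = SpeedLayer(window_seconds=10)
for ts, page in [(100, "home"), (104, "cart"), (112, "home")]:
    speed.add(ts, page)

print(query(batch, speed, "home"), query(batch, speed, "cart"))
```

The event at timestamp 100 has left the 10-second window by the time the event at 112 arrives, so the real-time view contributes one visit per page on top of the batch counts.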

Easily Convert MapReduce to Spark!

Your Job Now 5X Faster

MapReduce (runs on disk)

Spark (runs on disk and in-memory)

One Click

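Spark/Talend Enabled Use Cases - Examples

Digital Economy
•  Data Discovery (Interactive): Web Analytics
•  Better Decisions (Batch): Click-Stream Analysis
•  Real-Time Action (Streaming and Machine Learning): Real-Time Web Traffic Optimization (retargeting & …

Retail
•  Data Discovery (Interactive): SCM Analytics
•  Better Decisions (Batch): Find Purchase Correlation
•  Real-Time Action (Streaming and Machine Learning): Real-Time Promotion & Coupon Optimization

Financial Services
•  Data Discovery (Interactive): Fraud Detection
•  Better Decisions (Batch): Learning on Massive Data Volume
•  Real-Time Action (Streaming and Machine Learning): Highly Scalable Trading, Risk Management & Real-Time Fraud Detection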

Talend Success

Challenge:
•  Ever increasing Big Data velocity
•  Many last minute cart abandonments
•  Hard to optimize pricing

Why Talend:
•  Is the central integration tool within their Business Intelligence (BI) organization.
•  Integrates clickstreams from last 6 months

Value:
•  Leftover merchandise reduced by 20%
•  Can predict abandoned shopping carts in real-time with 90% accuracy
•  Optimize Pricing and Stock pricing

Challenge:
•  Needed to migrate 800 ETL jobs to an “Industrial Internet”
•  Improve service levels by providing data and analytics in the cloud

Industrial Internet

Solution:
•  Integrate big data, small data, and transactional data with high quality.
•  Talend Big Data, Data Quality, Master Data Management

Value:
•  Provide a collaborative, prescriptive, and predictive environment
•  Improved customer satisfaction, improved productivity per turbine
•  Predict failures & Reduce inventory
•  Arm sales with competitive intelligence

From Zero to Big Data in 10 Minutes Download free www.talend.com/download

•  Get up and running in minutes, not weeks, with a big data Sandbox and demos

•  Includes: Sentiment analysis, ETL Offload, Log file analysis, Recommendation engine

•  Start working with Talend, Hadoop & NoSQL today!

Now with

The conference for and by Data Scientists, from startup to enterprise wrangleconf.com

Public registration is now open!

•  Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
•  When: Thursday, October 22, 2015
•  Where: Broadway Studios, San Francisco
