Accelerating Real-Time Analytics with Spark
Post on 08-Oct-2020
1 ©2015 Talend Inc
Accelerating Real-Time Analytics with Spark October 8, 2015
Housekeeping
Audio – Streamed via media player, turn volume up
Submit questions for Q&A via Group Chat widget
Download slides and event materials
Hashtag: #stratahadoop
3
Your Speakers Today
Sean Owen Director of Data Science Cloudera, EMEA
Yann Delacourt Director, Big Data Product Management Talend
4
Agenda
• Apache Spark, its architecture and benefits
• Spark's architecture, deployment strategies and use cases
• Spark's impact on data science, analytics and machine learning
• How to move data scientists' work to IT production
• Best practices for large Spark deployments
• Mastering Spark's complexity
5 © Cloudera, Inc. All rights reserved.
Accelerating Real-Time Analytics with Apache Spark
Sean Owen, Director of Data Science, Cloudera, EMEA
What is Apache Spark?
Spark is a general-purpose computational framework with more flexibility than MapReduce
• Leverages distributed memory
• Full directed graph expressions for data-parallel computations
• Improved developer experience
• Linear scalability, data locality
• Fault tolerance
The Spark Ecosystem & Hadoop
Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
STORAGE HDFS, HBase
RESOURCE MANAGEMENT YARN
Spark Impala MR Others Search
Apache Spark: Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph
Fast Batch & Stream Processing
• In-memory processing and caching
Easy Development: Use Interactively
• Interactive exploration of data for data scientists
• No need to develop “applications”
• Developers can prototype applications on a live system
percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...
scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>
Easy Development: Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• …
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
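Example: Logistic Regression
import numpy
from math import exp

# Load the points once and cache them in memory for reuse on every iteration
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)  # random initial weight vector
for i in range(iterations):
    # Sum the per-point gradients of the logistic loss
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print("Final w: %s" % w)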
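The loop above needs a live Spark context and a readPoint helper that the slide does not show. As a rough, Spark-free sketch of the same batch gradient descent, the plain Python below replaces the map and reduce steps with an ordinary loop. The toy data set, the step size, and the names dot, point_gradient and train are assumptions made for illustration; the gradient is the standard one for the logistic loss log(1 + exp(-y * w·x)).

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, dims, iterations=200, step=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]  # random initial weights
    for _ in range(iterations):
        # "map": compute each point's gradient; "reduce": sum them element-wise
        total = [0.0] * dims
        for x, y in points:
            g = point_gradient(w, x, y)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi - step * ti for wi, ti in zip(w, total)]
    return w


# Hypothetical toy data: the label is +1 when the first feature is positive
points = [([1.0, 0.5], 1), ([2.0, -0.3], 1), ([-1.5, 0.2], -1), ([-0.7, -0.9], -1)]
w = train(points, dims=2)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print("Final w: %s" % w)
```

In the Spark version the inner loop is what gets distributed: each partition computes its share of the gradients and only the summed vector returns to the driver.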
Spark Takes Advantage of Memory
Resilient Distributed Datasets (RDD)
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Can fall back to disk when the data set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault tolerance through the concept of lineage
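To make "fault tolerance through lineage" more concrete, here is a toy, single-process sketch in plain Python. It is not how Spark implements RDDs; the ToyRDD class and its methods are hypothetical names used only to show the idea that a dataset remembers its parent and its transformation, so a lost cached copy can be recomputed rather than restored from a replica.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it records how to rebuild itself."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent      # lineage: the dataset this one derives from
        self.fn = fn              # transformation applied to the parent's rows
        self.source = source      # data in "stable storage" for a root dataset
        self._cache = None        # in-memory copy, which may be lost
        self.computations = 0     # how many times the rows were (re)built

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:
            # Cache miss: replay the lineage from the parent or stable storage
            if self.parent is None:
                rows = list(self.source)
            else:
                rows = self.fn(self.parent.collect())
            self._cache = rows
            self.computations += 1
        return list(self._cache)

    def evict(self):
        """Simulate losing the cached copy, for example after a node failure."""
        self._cache = None


base = ToyRDD(source=range(10))
evens_squared = base.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

first = evens_squared.collect()
evens_squared.evict()               # the cached result is lost
second = evens_squared.collect()    # rebuilt by replaying the lineage
print(first, second, evens_squared.computations)
```

The second collect returns the same rows as the first, at the cost of one recomputation.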
Fast Processing Using RAM, Operator Graphs
In-Memory Caching
• Data partitions read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
[Figure: operator graph of RDDs A through F connected by map, join, filter, groupBy and take operations; the legend distinguishes RDDs from cached partitions]
Data Science Batteries Included
MLlib
• Existing, mature Spark ML subproject
• Covers the basics well
  • Decision trees, SVM, LR
  • ALS, SVD
  • K-means
  • … and more
• Stand-alone implementations
• Algorithms only
ML “Pipelines”
• Beta “MLlib 2.0”
• Emulates scikit-learn APIs
• Pipelines, not just algos
  • Feature engineering
  • Transformation
  • Ensembles
• Unified architecture
• Spark 1.4+
Faster Iterative ML Algorithms (Data Fits in Memory)
[Figure: running time (s), axis from 0 to 4000, versus number of iterations (1, 5, 10, 20, 30) for MapReduce and Spark]
• MapReduce: 110 s/iteration
• Spark: first iteration = 80 s, further iterations 1 s due to caching
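The arithmetic behind the chart can be worked through in a few lines. The sketch below assumes the 110 s/iteration annotation belongs to the MapReduce series and the 80 s / 1 s figures to Spark, and that the per-iteration costs stay constant; the function names are made up for this example.

```python
MAPREDUCE_SECONDS_PER_ITERATION = 110   # assumed to annotate the MapReduce series
SPARK_FIRST_ITERATION_SECONDS = 80      # first pass reads the data and caches it
SPARK_CACHED_ITERATION_SECONDS = 1      # later passes read from memory


def mapreduce_total(iterations):
    """Total seconds when every iteration pays the full cost."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """Total seconds when only the first iteration pays the load cost."""
    if iterations <= 0:
        return 0
    return SPARK_FIRST_ITERATION_SECONDS + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1)


totals = {n: (mapreduce_total(n), spark_total(n)) for n in (1, 5, 10, 20, 30)}
for n, (mr, sp) in totals.items():
    print("%2d iterations: MapReduce %4d s, Spark %3d s" % (n, mr, sp))
```

Under these assumptions, 30 iterations take 3300 s on MapReduce against 109 s on Spark, which is consistent with a chart axis that tops out at 4000 s.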
Cloudera Customer Use Cases
Core Spark
Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data
Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets
ERP
• Optical Character Recognition and Bill Classification
Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics
Spark Streaming
Financial Services
• Online Fraud Detection
Health
• Incident Prediction for Sepsis
Retail
• Online Recommendation Systems
• Real-Time Inventory Management
Ad Tech
• Real-Time Ad Performance Analysis
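One of the Core Spark use cases above is calculating Jaccard scores on health care data sets. As a reminder of what that score is, the sketch below computes the Jaccard similarity of two sets in plain Python: the size of the intersection divided by the size of the union. The patient code sets are invented for illustration, and treating two empty sets as identical is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


# Hypothetical diagnosis-code sets for two patients
patient_a = {"E11", "I10", "J45"}
patient_b = {"I10", "J45", "K21", "M54"}
score = jaccard(patient_a, patient_b)
print(score)
```

Here the two sets share 2 of 5 distinct codes, so the score is 0.4. At cluster scale the expensive part is usually generating the candidate pairs, not the per-pair formula.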
Uniting Spark and Hadoop: The One Platform Initiative
Investment Areas
Management: Leverage Hadoop-native resource management.
Security: Full support for Hadoop security and beyond.
Scale: Enable 10k-node clusters.
Streaming: Support for 80% of common stream processing workloads.
Detailed Roadmap: One Platform Initiative
Management
• Spark on YARN Integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation
• Spark on YARN:
• Container resizing
• Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters
Security
• Kerberos Integration
• HDFS Sync (Sentry)
• Secure data at rest
• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel’s advanced encryption libraries
• Enable column and view level security
Scale
• Revamp Scheduler handling of node failure
• Sort based shuffle improvements
• Task Scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads
• HDFS DDM Integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs
Streaming
• Zero Data Loss with Spark Streaming Resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (Business Users)
• New streaming specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates
[Legend: items are color-coded as Completed Work or Planned Future Work]
The Bad News
Spark is a Developer Framework
• Spark means writing code
• And deploying it
• And monitoring it
• Workflow orchestration is hard
• Oozie? Luigi?
• Custom scripts
Data is Still Fickle
• Data quality is still hard
• Spark still can’t automatically find and clean bad records
• Feature engineering = ETL
• Data integration is still hard
• Read / write the right formats
• “Publish” to BI tools
Accelerating Real-Time Analytics with Spark
Yann Delacourt, Director of Big Data Product Management, Talend
21
APPLICATION INTEGRATION
CLOUD INTEGRATION
DATA INTEGRATION
BIG DATA INTEGRATION
MASTER DATA MANAGEMENT
A Modern Data Platform for All Your Integration Needs
INTEGRATE ANYTHING. OPERATE IN REAL-TIME. ACT WITH INSIGHT.
22
BIG DATA, CUSTOMERS & SUPPLIERS
ON-PREMISE APPS
CLOUD APPS | IOT SENSORS | CUSTOMERS | SUPPLIERS
DEVELOPER STUDIO Web UI
DATA FABRIC
1st Data Integration Platform on Apache Spark
23
Introducing Talend Real-time Big Data: 1st Data Integration Platform on Spark
• Visually develop jobs that run 100% on Spark
• 5X faster using independent benchmarks
• 10X developer productivity gained over hand-coding Spark
• 100X faster with in-memory processing
• Over 100 new drag-n-drop Spark components
• HDFS, RDBMS, NoSQL, Cloud Storage, Transformation, Messaging, In-memory analytics & machine learning recommendations, and much more
• In-memory data caching & “windowed” computations
• Click to enable Spark Streaming for real-time data processing
• Convert Talend MapReduce jobs to Spark with the click of a button, future-proofing your investment
Benefits: Make decisions faster. Tremendous developer productivity.
24
Enabling Intelligent Data Pipelining
Lambda Architecture: Batch, Real-time, Query
• A single solution to address
  • Bulk/batch
  • Real-time
  • Streaming & IoT data
  • Machine Learning
• Provides Fast Data access through NoSQL
• One tool for Hadoop, Spark, traditional ETL/ELT and NoSQL integration
[Figure: Lambda architecture diagram. Labels: sources (IoT, Web Logs, ERP, DBMS/EDW, Legacy); Speed Layer; Batch Layer; NoSQL; Real-Time Views; Pre-computed Views; Serving Layer; Query; Incremental Data; All Data; Sliding Window Analytics; Apply Learning; Learning on past Data]
Benefits: Developer productivity. Business agility.
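The "sliding window analytics" named in the diagram can be illustrated without any streaming engine. The sketch below is plain Python, not Spark Streaming or Talend code: it keeps per-key counts over the most recent window of events, which is the kind of incrementally updated real-time view a speed layer maintains. The class name, the 60-second window and the page-view events are assumptions for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, key) pairs, oldest first
        self.counts = {}

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events at or before (now - window), keeping the window (now - window, now]
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def snapshot(self):
        """Return a copy of the current per-key counts."""
        return dict(self.counts)


# Hypothetical click stream: (seconds since start, page)
window = SlidingWindowCounter(window_seconds=60)
for ts, page in [(0, "/home"), (10, "/cart"), (30, "/home"), (65, "/cart"), (95, "/home")]:
    window.add(ts, page)
view = window.snapshot()
print(view)
```

Each new event updates the view with a small delta instead of rescanning all data, which is the division of labor the Lambda architecture draws between the speed layer and the batch layer.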
25
Easily Convert MapReduce to Spark!
Your Job Now 5X Faster
MapReduce (runs on disk)
Spark (runs on disk and in-memory)
One Click
26
Spark/Talend Enabled Use Cases - Examples
Digital Economy
• Data Discovery (Interactive): Web Analytics
• Better Decisions (Batch): Click-Stream Analysis
• Real-Time Action (Streaming and Machine Learning): Real-Time Web Traffic Optimization (retargeting & reco)
Retail
• Data Discovery (Interactive): SCM Analytics
• Better Decisions (Batch): Find Purchase Correlation
• Real-Time Action (Streaming and Machine Learning): Real-Time Promotion & Coupon Optimization
Financial Services
• Data Discovery (Interactive): EDW
• Better Decisions (Batch): Fraud Detection Learning on Massive Data Volume
• Real-Time Action (Streaming and Machine Learning): High-Scalable Trading, Risk Management & Real-Time Fraud Detection
27
Talend Success
Challenge:
• Ever-increasing Big Data velocity
• Many last-minute cart abandonments
• Hard to optimize pricing
Why Talend:
• Is the central integration tool within their Business Intelligence (BI) organization.
• Integrates clickstreams from last 6 months
Value:
• Leftover merchandise reduced by 20%
• Can predict abandoned shopping carts in real time with 90% accuracy
• Optimize Pricing and Stock pricing
28
Industrial Internet
Challenge:
• Needed to migrate 800 ETL jobs to an “Industrial Internet”
• Improve service levels by providing data and analytics in the cloud
Solution:
• Integrate big data, small data, and transactional data with high quality.
• Talend Big Data, Data Quality, Master Data Management
Value:
• Provide a collaborative, prescriptive, and predictive environment
• Improved customer satisfaction, improved productivity per turbine
• Predict failures & reduce inventory
• Arm sales with competitive intelligence
29
From Zero to Big Data in 10 Minutes Download free www.talend.com/download
• Get up and running in minutes, not weeks, with a big data Sandbox and demos
• Includes: Sentiment analysis, ETL Offload, Log file analysis, Recommendation engine
• Start working with Talend, Hadoop & NoSQL today!
Now with
The conference for and by Data Scientists, from startup to enterprise wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco
Q&A