Real-time Analytics with Spark Maciej Dabrowski, Chief Data Scientist, Altocloud Galway Data Meetup, 2015-02-03

Real-time Analytics with Spark

Maciej Dabrowski, Chief Data Scientist, Altocloud
Galway Data Meetup, 2015-02-03

MEETS A SMALL STARTUP

source: https://media.licdn.com/mpr/mpr/p/1/005/0a0/167/2f98d60.jpg

Altocloud

‣ We built predictive communications software that uses analytics to improve customer interactions and experience

Monitoring live users

ANALYTICS

source: http://olap.com/

Use Case 1: Real-time analytics

‣ For us, real-time means a response in under 1-5s

‣ Q: How many customers are currently online?

‣ Q: How many chats/calls are taking place at the moment?

‣ Q: What is the utilisation of my customer support agents?

Use Case 2: Reporting

‣ Q: How many calls were offered in the last week?

‣ Q: What is the acceptance rate of my chat offers?

Use Case 3: Predictive Analytics

‣ Q: Which customers currently on my site should I engage?

Technical challenges

‣ Scalability

‣ Limited resources

‣ Various analytics use cases

Real-time analytics with Hadoop

source: http://barbarashdwallpapers.com/funny-elephant-wallpapers/

Altocloud Platform

[Architecture diagram. Layer labels: APIs, message queues, processing layer, storage layer, querying layer. Component labels: front-end APIs, back-end APIs, apps, Kafka, RabbitMQ, Spark, Spark Streaming, Cassandra, HDFS, MongoDB.]

Altocloud Data Platform

[Architecture diagram. Layer labels: data sources, message queues, processing layer, storage layer, querying layer. Component labels: front-end APIs, apps, MongoDB, MongoDB oplog, Kafka, RabbitMQ, Spark, Spark Streaming, Cassandra, HDFS.]

Why Spark

‣ One code base for streaming and batch processing

‣ Rich API in Scala/Python/Java

‣ Fast for iterative algorithms (important for ML)

‣ Growing community

‣ The concept of a micro-batch

‣ Nicely integrates with Kafka and Cassandra

‣ Fairly easy setup
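The micro-batch idea and the "one code base" point can be illustrated without a cluster. The sketch below is plain Python, not Spark code: it groups timestamped events into fixed-length batches and applies the same function to each micro-batch and to the whole data set. The event tuples and the function names are made up for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, payload) events into fixed-length batches.

    Returns a list of (batch_start, [payloads]) sorted by batch start.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        # An event belongs to the batch whose window contains its timestamp.
        batch_start = int(timestamp // batch_seconds) * batch_seconds
        batches[batch_start].append(payload)
    return sorted(batches.items())


def count_events(payloads):
    """Processing logic shared by the batch path and the streaming path."""
    return len(payloads)


# Hypothetical events: (seconds since start, event type).
events = [(0.2, "view"), (0.9, "view"), (1.1, "click"), (2.5, "view"), (2.7, "view")]

# "Streaming": apply the function to every 1-second micro-batch.
per_batch = [(start, count_events(p)) for start, p in micro_batches(events, 1)]

# "Batch": apply the very same function to all events at once.
total = count_events([payload for _, payload in events])
```

The point of the design is that `count_events` does not know whether it is handed one second of data or a whole week.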

Spark components


Word count in Spark

‣ Hadoop

‣ Spark
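The code listings from this slide are not part of the transcript. As a stand-in, here is a minimal sketch of the word-count dataflow in plain Python so that it runs without Spark. The helper names `flat_map` and `reduce_by_key` are hypothetical; they mimic the RDD operations `flatMap` and `reduceByKey`, which in PySpark would be chained roughly as `sc.textFile(path).flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(add)`.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def flat_map(func, items):
    """Apply func to every item and flatten the results into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


lines = ["to be or not to be", "to spark or not to spark"]

words = flat_map(str.split, lines)       # split every line into words
pairs = [(word, 1) for word in words]    # emit (word, 1) for each word
counts = reduce_by_key(add, pairs)       # sum the ones per word
```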

What about something more useful?

‣ Example: user event aggregation stored in Cassandra

‣ Still much better than Hadoop!

Counting users currently online

‣ User activity is the input (e.g. a page view)

‣ Users of multiple businesses are online

‣ Scale: 100s to 100,000s of activities per second

‣ Response time under 5s

‣ A perfect use case for Spark Streaming

Data source: Kafka

‣ Pub-sub message broker

‣ Fast: 100s of MB/s on a single broker

‣ Scalable: partitioned data streams

‣ Durable: messages persisted and replicated

‣ Distributed: strong durability and fault-tolerance guarantees

‣ Downside: requires ZooKeeper

see https://kafka.apache.org
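"Partitioned data streams" means that each message is routed to one partition of a topic, usually by its key, so that partitions can be consumed in parallel. The sketch below shows the idea in plain Python. It uses CRC32 as a stand-in hash, which is an assumption made for illustration; Kafka clients use their own partitioning function. The event keys are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition index (stand-in hash: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(events, num_partitions):
    """Distribute (key, value) events over partitions, keeping arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical events keyed by business.
events = [("business-a", "view"), ("business-b", "view"), ("business-a", "click")]
partitions = assign(events, 4)
```

Because the partition depends only on the key, all events of one business land in the same partition and keep their relative order there.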

Spark and Kafka

‣ Kafka with Spark: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

Count users online

‣ Simple count of unique events

‣ Count visit events for unique users
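The listings from this slide are missing from the transcript. A minimal sketch of the second bullet, counting visit events for unique users within one batch, could look as follows in plain Python. The event fields `business`, `user` and `type` are assumptions made for illustration, not the author's schema.

```python
from collections import defaultdict


def users_online(events):
    """Count distinct users with at least one visit event, per business."""
    seen = defaultdict(set)
    for event in events:
        if event["type"] == "visit":
            # A set keeps each user once, however many visits they generate.
            seen[event["business"]].add(event["user"])
    return {business: len(users) for business, users in seen.items()}


# One hypothetical batch of events.
batch = [
    {"business": "shop", "user": "u1", "type": "visit"},
    {"business": "shop", "user": "u1", "type": "visit"},
    {"business": "shop", "user": "u2", "type": "visit"},
    {"business": "bank", "user": "u3", "type": "visit"},
    {"business": "bank", "user": "u4", "type": "chat"},
]
online = users_online(batch)
```

Here every business keeps a full set of user ids in memory, which is exactly what becomes expensive at scale.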

Sets can take a lot of memory!

‣ Twitter Algebird to the rescue!

‣ HyperLogLog - a probabilistic data structure that saves a lot of memory!

‣ https://github.com/twitter/algebird
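To show why HyperLogLog saves memory, here is a minimal, simplified sketch in plain Python. It is not Algebird's API (Algebird is a Scala library) and it omits the refinements of production implementations. The class name, the SHA-1 based hash and the default precision of 12 bits are assumptions made for illustration; the bias constant used here is only valid for a precision of 7 or more.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog: estimates a distinct count from 2**precision registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")            # 64-bit hash value
        index = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)            # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1     # position of the leftmost 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Union of two sketches with the same precision: register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)            # bias correction, m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


visitors = HyperLogLog()
for i in range(20000):
    visitors.add(f"user-{i}")
    visitors.add(f"user-{i}")    # repeat visits do not change the registers
estimate = visitors.count()
```

The sketch keeps 4096 small registers no matter how many users it sees, and two sketches can be merged, which suits per-batch aggregation. The price is an approximate answer, with a typical relative error of about 1.04 / sqrt(4096), roughly 1.6%.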

Storing your results

‣ Easy to set up

‣ High availability - no master

‣ Great performance

‣ CQL - SQL-like querying

‣ Great support and bug-free drivers from DataStax

‣ Key: design your schema around your queries!

see https://cassandra.apache.org

Store data in Cassandra

‣ DataStax driver is very easy to use

‣ Save our results to Cassandra
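The driver code from this slide is not in the transcript. The sketch below only illustrates a common shape for saving results: open one session per partition of results and write the rows through it, on the workers rather than on the driver. `FakeSession`, the `users_online` table and its columns are hypothetical stand-ins so that the example runs without a database; in PySpark a function like `save_partition` would typically be passed to `rdd.foreachPartition`.

```python
class FakeSession:
    """Hypothetical stand-in for a database session; it only records writes."""

    opened = 0  # how many sessions were created in total

    def __init__(self, store):
        FakeSession.opened += 1
        self.store = store

    def execute(self, statement, params):
        self.store.append((statement, params))

    def close(self):
        pass


# Hypothetical CQL-like statement; table and column names are assumptions.
INSERT = "INSERT INTO users_online (business, window_start, users) VALUES (?, ?, ?)"


def save_partition(rows, store):
    """Write all rows of one partition through a single session."""
    session = FakeSession(store)   # one session per partition, not per row
    try:
        for business, window_start, users in rows:
            session.execute(INSERT, (business, window_start, users))
    finally:
        session.close()


# Two hypothetical partitions of (business, window_start, users) results.
partitions = [[("shop", 0, 2), ("bank", 0, 1)], [("shop", 5, 3)]]
store = []
for partition in partitions:       # on a cluster, each worker handles its own partitions
    save_partition(partition, store)
```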

source: http://top1walls.com

Spark Streaming

‣ A Spark Streaming job performs two major tasks:

• data processing
• data receiving

‣ A receiver always takes one core

‣ Technically, you need 2N cores to run N streaming jobs

‣ Not a big deal in production, but what about testing?
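The 2N figure follows from one core for the receiver plus at least one core for processing in every job. A small arithmetic sketch, assuming one receiver per job as on the slide (the function names and default values are made up for illustration):

```python
def cores_needed(num_jobs, receivers_per_job=1, processing_cores_per_job=1):
    """Minimum cores so that every job can both receive and process data."""
    return num_jobs * (receivers_per_job + processing_cores_per_job)


def can_run(num_jobs, available_cores):
    """True if the machine has enough cores for num_jobs streaming jobs."""
    return cores_needed(num_jobs) <= available_cores


two_jobs = cores_needed(2)       # 2 receivers + 2 processing cores
fits = can_run(2, 4)             # two jobs on a 4-core machine
too_many = can_run(3, 4)         # three jobs would need 6 cores
```

This is why a small test machine fills up quickly: a job whose only core is taken by the receiver has nothing left to process the data with.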

Docker

‣ Containerise your app including all its dependencies

‣ Distribute your app in this standard container

‣ Run it on any machine with Docker

‣ Very lightweight

Spark

‣ AWS example

[Diagram. Labels: c3.large: 2 cores; c3.xlarge: 4 cores (core 1 to core 4); Spark driver; Spark executor (shown twice).]

Spark on Docker

‣ AWS example

[Diagram. Labels: c3.large: 2 cores; c3.xlarge: 4 cores (core 1 to core 4); Spark driver; docker-1: 4 “cores” (C1 to C4) with a Spark executor; docker-2: 4 “cores” (C1 to C4) with a Spark executor; one further Spark executor label.]

Spark Streaming

‣ Spark Streaming is fast to deploy, but tuning is VERY important

‣ The lower the number of tasks, the better (in general)

‣ When reading from Kafka, make sure that you configure the block interval (spark.streaming.blockInterval)

‣ Optimize your jobs when possible: similar jobs can sometimes be merged

‣ Persist your data from the workers, NOT the driver
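The advice on task counts and on the block interval is connected: with receiver-based input, Spark Streaming cuts the received data into blocks, and each block of a batch becomes roughly one task. The arithmetic below is an approximation under that assumption; the function name and the example intervals are chosen for illustration, and `spark.streaming.blockInterval` is the Spark configuration key for the block interval.

```python
def tasks_per_batch(batch_interval_ms, block_interval_ms, num_receivers=1):
    """Approximate number of tasks per batch: one task per received block."""
    blocks_per_receiver = batch_interval_ms // block_interval_ms
    return blocks_per_receiver * num_receivers


# A 2-second batch with a 200 ms block interval.
many_small_tasks = tasks_per_batch(2000, 200)

# The same batch with a 500 ms block interval yields fewer, larger tasks.
fewer_tasks = tasks_per_batch(2000, 500)
```

Raising the block interval lowers the number of tasks per batch, but each task handles more data, so the right value depends on the number of cores available for processing.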

Where do we go from here?

‣ OLAP-type queries using Spark SQL

‣ More advanced performance testing

‣ Detailed unit testing

‣ More batch jobs