DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandra on the DataStax Enterprise Platform - A Hands on Workshop

Apr 11, 2017

Page 1: DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandra on the DataStax Enterprise Platform - A Hands on Workshop

Large Scale Data Analytics

Page 2:

Large Scale Data Analytics

Ryan Knight @Knight_Cloud

Solution Engineer - DataStax

Paco Nathan @pacoid

Evil Mad Scientist - O’Reilly Media

Page 3:

Demo of Streaming in the Real World - Spark At Scale Project

© 2015. All Rights Reserved.

•Based on Real World Use Cases

•Simulate a real world streaming use case

•Test throughput of Spark Streaming

•Best Practices for scaling

•https://github.com/retroryan/SparkAtScale

Page 4:

Spark At Scale Demo Application

DataStax Enterprise Platform

Page 5:

How do we Scale for Load and Traffic?

Page 6:

Data Modeling using Event Sourcing

•Append-Only Logging

•Database of Facts

•Snapshots or Roll-Ups

•Why Delete Data any more?

•Replay Events
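A minimal sketch of the event-sourcing idea above in plain Scala, with no Cassandra involved. The names (`RatingEvent`, `MovieStats`, `append`, `replay`) are hypothetical and chosen only for illustration; they are not taken from the workshop code. It shows an append-only log of facts, a roll-up derived from it, and replay from a snapshot.

```scala
// Hypothetical event-sourcing sketch; none of these names come from the workshop code.
object EventSourcingSketch {
  // Each rating is an immutable fact; nothing is updated or deleted in place.
  final case class RatingEvent(userId: Int, movieId: Int, rating: Double, timestamp: Long)

  // A roll-up (snapshot) derived from the log: count and sum of ratings per movie.
  final case class MovieStats(count: Long, sum: Double) {
    def average: Double = if (count == 0) 0.0 else sum / count
  }

  // Append-only logging: appending returns a new log and leaves the old one untouched.
  def append(log: Vector[RatingEvent], event: RatingEvent): Vector[RatingEvent] =
    log :+ event

  // Replay events: fold the log, optionally starting from a snapshot, into current state.
  def replay(
      events: Seq[RatingEvent],
      snapshot: Map[Int, MovieStats] = Map.empty
  ): Map[Int, MovieStats] =
    events.foldLeft(snapshot) { (state, e) =>
      val old = state.getOrElse(e.movieId, MovieStats(0L, 0.0))
      state.updated(e.movieId, MovieStats(old.count + 1, old.sum + e.rating))
    }
}
```

Replaying the full log and replaying only the tail on top of a snapshot give the same state, which is why old events never need to be deleted or rewritten.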

Page 7:

Scala for Large Scale Data Analytics

•Functional Paradigm is ideal for Data Analytics

•Strongly Typed - Enforce Schema at Every Layer

•Immutable by Default - Event Logging

•Declarative instead of Imperative - Focus on Transformation not Implementation
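A small sketch of the points above, assuming a comma-separated `userId,movieId,rating` input format (an assumption made for this example, not something the slides specify). The case class acts as the schema, values are immutable, and the pipeline states what to compute rather than how to loop.

```scala
// Hypothetical sketch; the input format and names are assumptions for illustration.
object DeclarativeSketch {
  // The case class is the schema: a malformed record cannot become a Rating.
  final case class Rating(userId: Int, movieId: Int, rating: Double)

  // Parse one comma-separated line; anything that does not fit the schema is dropped.
  def parse(line: String): Option[Rating] =
    line.split(",").map(_.trim) match {
      case Array(u, m, r) =>
        try Some(Rating(u.toInt, m.toInt, r.toDouble))
        catch { case _: NumberFormatException => None }
      case _ => None
    }

  // Declarative pipeline: average rating per movie, expressed as transformations.
  def averageByMovie(lines: Seq[String]): Map[Int, Double] =
    lines
      .flatMap(line => parse(line))
      .groupBy(_.movieId)
      .map { case (movieId, rs) => movieId -> (rs.map(_.rating).sum / rs.size) }
}
```

The same `flatMap` / `groupBy` / `map` shape carries over to Spark RDDs and DataFrames, which is one reason the functional style fits analytics work.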

Page 8:

Key to Scaling - Configuring Kafka Topics

•Number of Partitions per Topic — Degree of parallelism

•Directly Affects Spark Streaming Parallelism

•bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic ratings
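To make the link between partition count and parallelism concrete, here is a rough sketch of spreading keyed records over the 5 partitions created above. The hashing scheme is a stand-in chosen for this example: Kafka's own default partitioner hashes the serialized key with murmur2, and `partitionFor` and `load` are hypothetical helpers.

```scala
// Hypothetical sketch; a simplified stand-in for how keys spread over partitions.
object PartitionSketch {
  // Map a key to one of numPartitions partitions.
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    // floorMod keeps the result in [0, numPartitions) even when hashCode is negative.
    java.lang.Math.floorMod(key.hashCode, numPartitions)
  }

  // Count how many records each partition would receive for a set of keys.
  def load(keys: Seq[String], numPartitions: Int): Map[Int, Int] =
    keys.groupBy(k => partitionFor(k, numPartitions)).map { case (p, ks) => p -> ks.size }
}
```

With the direct approach each Kafka partition becomes one Spark partition per batch, so a topic with 5 partitions gives at most 5 parallel tasks for the initial read.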

Page 9:

Populating Kafka Topics

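val record = new ProducerRecord[String, String](feederExtension.kafkaTopic, partNum, key, nxtRating.toString)
val future = feederExtension.producer.send(record, new Callback {
  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = () // body is cut off on the slide; this no-op stands in for it
})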


Page 11:

Spark Streaming with Kafka Direct Approach

•Use Kafka Direct Approach (No Receivers)

•Queries Kafka Directly

•Automatically Parallelizes based on Kafka Partitions

•Exactly Once Processing - Only Move Offset after Processing

•Resiliency without copying data
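The "only move offset after processing" rule can be sketched without Spark or Kafka. This is a simulation under stated assumptions: `State` and `runBatch` are hypothetical names, the "log" is an in-memory vector, and the sink is keyed by offset so that a replayed batch overwrites rather than duplicates. End-to-end exactly-once results also depend on the output being idempotent or transactional in this way.

```scala
// Hypothetical simulation of offset handling; not the Spark or Kafka API.
object OffsetSketch {
  // The offset to resume from, plus the sink contents keyed by offset.
  final case class State(committedOffset: Int, sink: Map[Int, String])

  // Process records in [committedOffset, untilOffset) and advance the offset only on success.
  def runBatch(
      log: Vector[String],
      state: State,
      untilOffset: Int,
      process: String => String
  ): State = {
    val end = math.max(state.committedOffset, math.min(untilOffset, log.size))
    try {
      // Writes are keyed by offset, so replaying a batch overwrites instead of duplicating.
      val written = (state.committedOffset until end).foldLeft(state.sink) { (sink, offset) =>
        sink.updated(offset, process(log(offset)))
      }
      State(end, written) // success: move the offset past the processed range
    } catch {
      case _: Exception => state // failure: offset unchanged, the same range is retried
    }
  }
}
```

Because the data stays in Kafka and can be re-read by offset, a failed batch is simply read again, which is the "resiliency without copying data" point.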


Page 13:

Spark Streaming Monitoring

Processing Time > Batch Duration = Total Delay Grows

Out Of Memory Errors
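A back-of-the-envelope simulation of the rule above, assuming batches run one at a time and every batch takes the same processing time (both simplifications). `totalDelays` is a hypothetical helper, and "total delay" here means scheduling delay plus processing time.

```scala
// Hypothetical simulation; assumes constant processing time and one batch at a time.
object BatchDelaySketch {
  // For batches arriving every batchMs, each taking processingMs to run,
  // return the total delay (scheduling delay + processing time) of each batch.
  def totalDelays(batchMs: Long, processingMs: Long, batches: Int): Seq[Long] = {
    // Completion time of each batch: it starts when it has arrived and the previous one is done.
    val finishTimes = (0 until batches).scanLeft(0L) { (prevFinish, i) =>
      val arrival = i * batchMs
      math.max(prevFinish, arrival) + processingMs
    }.tail
    finishTimes.zipWithIndex.map { case (finish, i) => finish - i * batchMs }
  }
}
```

With a 5000 ms batch duration, 4000 ms of processing gives a flat delay of 4000 ms per batch, while 6000 ms of processing gives 6000, 7000, 8000, ... and the backlog keeps growing until something runs out of memory.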

Page 14:

© 2014 DataStax, All Rights Reserved.

Confidential

DataStax Enterprise Platform Workload Segregation w/out ETL


•Cassandra Mode: OLTP Database

•Analytics Mode: Streaming and Analytics

•Search Mode: All Data Searchable

[Diagram: cluster nodes labeled C*, C, S and A]

Page 15:

DataStax Enterprise Platform Workload Segregation w/out ETL

Page 16:

DataStax Enterprise Platform Integrated Spark Analytics

Page 17:

DataStax Analytics

•Simplified Deployment and Management

•HA Spark Master with automatic leader election

•Detects when Spark Master is down with gossip

•Uses Paxos to elect Spark Master

•Stores Spark Worker metadata in Cassandra

•No need to run Zookeeper

Page 18:

Spark Notebook

[Diagram: Notebooks, Spark Notebook Server, Cassandra Cluster with Spark Connector (nodes labeled C*, C and A)]

Page 19:

Apache Spark Notebook

•Reactive / Dynamic Graphs based on Scala, SQL and DataFrames

•Spark Streaming

•Example notebooks covering visualization, machine learning, streaming, graph analysis, genomics analysis

•SVG / Sliders - interactive graphs

•Tune and Configure Each Notebook Separately

•https://github.com/andypetrella/spark-notebook

Page 20:

databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html

Demo: Twitter Streaming Language Classifier

Page 21:

Demo: Twitter Streaming Language Classifier

[Pipeline diagram; box labels listed in extraction order]

•Streaming: collect tweets

•Twitter API

•HDFS: dataset

•Spark SQL: ETL, queries

•MLlib: train classifier

•Spark: featurize

•HDFS: model

•Streaming: score tweets

•language filter

•Cassandra (shown twice)

Page 22:

Demo: Twitter Streaming Language Classifier

From tweets to ML features, approximated as sparse vectors:

1. extract text from the tweet

https://twitter.com/andy_bf/status/16222269370011648

"Ceci n'est pas un tweet"

2. sequence text as bigrams

tweet.sliding(2).toSeq
("Ce", "ec", "ci", …)

3. convert bigrams into numbers

seq.map(_.hashCode())
(2178, 3230, 3174, …)

4. index into sparse tf vector

seq.map(_.hashCode() % 1000)
(178, 230, 174, …)

5. increment feature count

Vector.sparse(1000, …)
(1000, [102, 104, …], [0.0455, 0.0455, …])
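The five steps on this slide can be sketched in plain Scala without MLlib. `FeaturizeSketch` and its method names are hypothetical; the result is returned as a plain `(size, indices, values)` tuple rather than an MLlib vector, and each count is divided by the number of bigrams, which is an assumption that appears consistent with the 0.0455 values shown on the slide.

```scala
// Hypothetical sketch of the slide's featurization steps; not the sample project's code.
object FeaturizeSketch {
  val NumFeatures = 1000

  // Step 2: slide a window of two characters over the text.
  def bigrams(text: String): Seq[String] = text.sliding(2).toSeq

  // Steps 3-4: hash each bigram and fold it into [0, NumFeatures).
  // floorMod avoids negative indices for strings whose hashCode is negative.
  def index(bigram: String): Int = java.lang.Math.floorMod(bigram.hashCode, NumFeatures)

  // Step 5: count bigrams per index, normalize by the number of bigrams,
  // and return (size, sorted indices, values) in the shape of a sparse vector.
  def featurize(text: String): (Int, Seq[Int], Seq[Double]) = {
    val grams = bigrams(text)
    val weights = grams
      .groupBy(g => index(g))
      .map { case (i, gs) => i -> (gs.size.toDouble / grams.size) }
    val sorted = weights.toSeq.sortBy(_._1)
    (NumFeatures, sorted.map(_._1), sorted.map(_._2))
  }
}
```

For the example tweet text, the first three bigram hashes come out as 2178, 3230 and 3174, and the two smallest indices are 102 and 104 with weight 1/22 each, in line with the numbers on the slide.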

Page 23:

KMeans: Formal Definition (ignore this)

Page 24:

KMeans: How it really works…

Page 25:

KMeans: How it really works…
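The pictures for these slides did not survive extraction, so here is a rough plain-Scala sketch of the usual k-means loop (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. This is not MLlib's implementation; `KMeansSketch` and its methods are hypothetical names, and the initial centers are passed in explicitly to keep the run deterministic.

```scala
// Hypothetical sketch of Lloyd's algorithm; MLlib's KMeans is more elaborate.
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the nearest center.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  // Update step: each center moves to the mean of the points assigned to it.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy(p => closest(centers, p))
    centers.indices.map { i =>
      assigned.get(i) match {
        case Some(ps) => ps.transpose.map(_.sum / ps.size).toVector
        case None     => centers(i) // keep a center that attracted no points
      }
    }
  }

  // Repeat assignment and update for a fixed number of iterations.
  def run(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (0 until iterations).foldLeft(initial)((cs, _) => step(points, cs))
}
```

In the demo the points are the sparse bigram vectors of tweets, so each resulting cluster tends to collect tweets in one language, and `closest` plays the role of `model.predict`.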

Page 26:

Demo: Twitter Streaming Language Classifier

Sample Code + Output: https://github.com/retroryan/twitter_classifier

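import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Utils, modelFile and clust are defined in the sample project linked above.
val conf = new SparkConf()
val ssc = new StreamingContext(conf, Seconds(5))

// Stream of tweet texts from the Twitter API.
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)

// Load the cluster centers saved by the training step.
val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())

// Keep only the tweets assigned to the requested cluster.
val filteredTweets = statuses.filter(t => model.predict(Utils.featurize(t)) == clust)
filteredTweets.print()

ssc.start()
ssc.awaitTermination()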

CLUSTER 1:

TLあんまり見ないけど@くれたっらいつでもくっるよ٩(δωδ)۶

そういえばディスガイアも今日か

CLUSTER 4:

قالوا العروبه روحت بعد صدامواقول مع سلمان تحيى العروبه

RT @vip588: √ للمتواجدين االن √ زيادة متابعني √ فولو مي vip588

فولو باك √ رتويت للتغريدة √ فولو للي عمل رتويت √ اللي ما يلتزم ما √… بيستفيدن سورة

Page 27:

Thank you