STREAMING EARLY WARNING
Data Day Seattle, 6-27-2015
Chance Coble
Use Case Profile
➾ Telecommunications company
  Had business problems and pain: scalable analytics infrastructure was a problem
  Pushing their infrastructure to its limits
  Open to a proof-of-concept engagement with emerging technology
  Wanted to test on historical data
➾ We introduced Spark Streaming
  Technology would scale
  Could prove it enabled new analytic techniques (incident detection)
  Open to the Scala requirement
  Wanted to prove it was easy to deploy – EC2 helped
Organization Profile
➾ Telecommunications wholesale business
  Processes 90 million calls per day
  Scales up to 1,000 calls per second
  Nearly half a million calls in a 5-minute window
  Technology is loosely split into
    Operational Support Systems (OSS)
    Business Support Systems (BSS)
➾ Core technology is mature
  Analytics run on a LAMP stack
  The technology team is strongly skilled in that stack
Jargon
➾ Number
  Comprised of a country code (possibly), area code (NPA), exchange (NXX), and 4 other digits
  Area codes and exchanges are often geo-coded
  Example: 1 512 867 5309 (country code, NPA, NXX, line number)
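The number structure above can be sketched in plain Scala. This is an illustrative helper, not part of the client's system; the function name and the assumption of an 11- or 10-digit North American number are mine.

```scala
// Hypothetical sketch: split a North American number into the parts named
// above (country code, NPA, NXX, line number). Non-digits are stripped,
// and an optional leading "1" country code is peeled off.
def parseNumber(digits: String): Option[(String, String, String, String)] = {
  val d = digits.filter(_.isDigit)
  val national = if (d.length == 11 && d.head == '1') d.tail else d
  if (national.length == 10)
    Some(("1", national.substring(0, 3), national.substring(3, 6), national.substring(6)))
  else None
}
```

For example, `parseNumber("1 512 867 5309")` yields `Some(("1", "512", "867", "5309"))`, matching the NPA/NXX breakdown described above.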
Jargon
➾ Trunk Group
  A trunk is a line connecting transmissions between two points
  A trunk group shares some common property – in this case, being owned by the same entity
  Transmissions from ingress trunks are routed to egress trunks
➾ Route – in this case, the selection of a trunk group to terminate the call at its destination
➾ QoS – Quality of Service, governed by metrics
  Call duration – short calls are an indication of quality problems
  ASR – Average Seizure Rate; this company measures it as #connected calls / #calls attempted
➾ Real-time: within 5 minutes
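The two QoS metrics above can be sketched over plain Scala collections. `CallRecord` and its fields are illustrative assumptions, not the client's 180-field schema, and the 10-second "short call" threshold is mine.

```scala
// Hypothetical call record; only the two fields the metrics need.
case class CallRecord(connected: Boolean, durationSeconds: Int)

// ASR as this company defines it: #connected calls / #calls attempted.
def asr(calls: Seq[CallRecord]): Double =
  if (calls.isEmpty) 0.0
  else calls.count(_.connected).toDouble / calls.size

// Share of connected calls that were "short" -- a quality warning sign.
def shortCallRate(calls: Seq[CallRecord], thresholdSeconds: Int = 10): Double = {
  val connected = calls.filter(_.connected)
  if (connected.isEmpty) 0.0
  else connected.count(_.durationSeconds < thresholdSeconds).toDouble / connected.size
}
```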
The Problem
➾ A switch handles most of their routing
➾ A configuration table in the switch governs routing
  If-this-then-that style logic
➾ Proprietary technology handles adjustments to that table
  Manual intervention is also required

(Diagram: Call Logs, Business Rules Application, Database, Intranet Portal)
The Problem
➾ The backend system receives a log of calls from the switch
  A file is dumped every few minutes
  180 well-defined fields represent the features of a call event
  Supports downstream analytics once enriched with pricing, geo-coding, and account information
➾ Their job is to connect calls at the most efficient price without sacrificing quality
Why Spark?
➾ They were interested because
  A workbench can simplify operationalizing analytics
  They can skip a generation of clunky big data tools
  It works with their data structures
  It will "scale out" rather than up
  It can handle fault-tolerant in-memory updates
Spark Basics – Architecture

(Diagram: the Spark Driver hosts the SparkContext and coordinates with a Cluster Manager, which schedules work onto Executors; each Executor runs Tasks and maintains a Cache)
Spark Basics – Call Status Count Example

val cdrLogPath = "/cdrs/cdr20140731042210.ssv"
val conf = new SparkConf().setAppName("CDR Count")
val sc = new SparkContext(conf)
val cdrLines = sc.textFile(cdrLogPath)
val cdrDetails = cdrLines.map(_.split(";")).cache()
val successful = cdrDetails.filter(x => x(6) == "S").count()
val unsuccessful = cdrDetails.filter(x => x(6) == "U").count()
println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))
Spark Basics – RDDs
➾ Operations on data generate distributable tasks through a Directed Acyclic Graph (DAG) – functional programming FTW!
➾ Resilient
  Data is redundantly stored, and can be recomputed through the generated DAG
➾ Distributed
  The DAG can process each small task, as well as a subset of the data, through optimizations in the Spark planning engine
➾ Dataset
  This construct is native to Spark computation
Spark Basics – RDDs
➾ Lazy
➾ Transformations generate tasks over data slices; nothing runs until an action forces evaluation
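The laziness above can be illustrated without a cluster. This is a pure-Scala analogy using `view`, which defers work the same way a chain of RDD transformations does; it is not Spark code.

```scala
// Spark transformations (map, filter) only record work in the DAG;
// nothing executes until an action (count, collect) forces it.
// Scala's `view` defers work in an analogous way:
var evaluations = 0
val doubled = (1 to 10).view.map { x => evaluations += 1; x * 2 }

// Nothing has been computed yet -- just like an un-acted-upon RDD chain.
assert(evaluations == 0)

// Forcing the view plays the role of an RDD action such as count().
val total = doubled.sum
assert(evaluations == 10)
```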
Streaming Applications – Why try it?
➾ Streaming applications
  Site activity statistics
  Spam detection
  System monitoring
  Intrusion detection
  Telecommunications network data
Streaming Models
➾ Record-at-a-time
  Receive one record and process it
  Simple, low latency
➾ Micro-batch
  Receive records and periodically run a batch process over a window
  The process *must* run fast enough to handle all records collected
  Harder to reduce latency
  High throughput
  Easy reasoning
  Global state
  Fault tolerance
  Unified code
DStreams
➾ Stands for Discretized Streams
➾ A series of RDDs
➾ Spark already provided a computation model on RDDs
➾ Note that records are ordered as they are received
  They are also time-stamped for computation in that order
  Is that always the way you want to see your data?
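The discretization idea above can be sketched in plain Scala, with no Spark dependency: bucket time-stamped records into fixed-length batches, each of which Spark would treat as one RDD in the series. `Record` and the batch length are illustrative assumptions.

```scala
// Hypothetical time-stamped record, ordered as received.
case class Record(timestampMs: Long, value: String)

// Discretize a stream into fixed-length micro-batches, keyed by batch index.
// In Spark Streaming, each bucket would become one RDD of the DStream.
def discretize(records: Seq[Record], batchMs: Long): Map[Long, Seq[Record]] =
  records.groupBy(r => r.timestampMs / batchMs)
```

For example, with a 1-second batch interval, records stamped 100 ms and 900 ms land in batch 0, while one stamped 1500 ms lands in batch 1.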
Fault Tolerance – Parallel Recovery
➾ Failed nodes
➾ Stragglers!
Fault Tolerance – Recompute

Throughput vs. Latency
Anatomy of a Spark Streaming Program
val sparkConf = new SparkConf().setAppName("QueueStream")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val rddQueue = new SynchronizedQueue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
val mappedStream = inputStream.map(x => (x % 10, 1))
val reducedStream = mappedStream.reduceByKey(_ + _)
reducedStream.print()
ssc.start()
for (i <- 1 to 30) {
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  Thread.sleep(1000)
}
ssc.stop()

Utilities are also available for Twitter, Kafka, Flume, and file streams.
Windows

(Diagram: a sliding window over a DStream, showing the window length and the slide interval)
Streaming Call Analysis with Windows

// iteration, window, and slide are interval lengths (in seconds) defined elsewhere
val path = "/Users/chance/Documents/cdrdrop"
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("CDRIncidentDetection")
  .set("spark.executor.memory", "8g")
val ssc = new StreamingContext(conf, Seconds(iteration))
val callStream = ssc.textFileStream(path)
val cdr = callStream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
val cdrArr = cdr.filter(c => c.length > 136)
  .map(c => extractCallDetailRecord(c))
val result = detectIncidents(cdrArr)
result.foreach(rdd => rdd.take(10)
  .foreach { case (x, (d, high, low, res)) =>
    println(x + "," + high + "," + d + "," + low + "," + res) })
ssc.start()
ssc.awaitTermination()
Can we enable new analytics?
➾ Incident detection
  Chose a univariate technique [1] to detect behavior out of profile from recent events
  The technique identifies
    Out-of-profile events
    Dramatic shifts in the profile
  Easy to understand

(Figure: profile computed over the recent window)
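A minimal sketch of the univariate idea above: compare the newest observation to the mean and standard deviation of the recent window. The 3-sigma rule and function shape are illustrative assumptions, not necessarily the exact technique the engagement used.

```scala
// Flag the latest value as "out of profile" when it falls more than
// k standard deviations from the mean of the recent window.
def outOfProfile(recentWindow: Seq[Double], latest: Double, k: Double = 3.0): Boolean = {
  val n = recentWindow.size
  require(n > 1, "need a window of at least two observations")
  val mean = recentWindow.sum / n
  val variance = recentWindow.map(x => math.pow(x - mean, 2)).sum / n
  val stddev = math.sqrt(variance)
  // A zero-variance window cannot flag anything under this rule.
  stddev > 0 && math.abs(latest - mean) > k * stddev
}
```

In a streaming setting, the "recent window" would be the values collected over the current micro-batch window, and the same check could run per key (for example, per trunk group).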
Is it simple to deploy?
➾ EC2 helped
➾ The client had no Hadoop and little NoSQL expertise
➾ Develop and deploy
  Built with sbt, ran on the master
➾ The architecture involved
  Pushing new call detail logs to HDFS on EC2
  Streaming picks up new data and updates RDDs accordingly
  Results were explored in two ways
    Accessing results through data virtualization
    Writing (small) RDD results to a SQL database, then using a business intelligence tool to create report content

(Diagram: call logs stream into current processing on HDFS on EC2, feeding analysis and reporting dashboards with multiple delivery options)
Results

(Chart: throughput in MB, compared against published WordCount results)
Summary of Results
➾ Technology would scale
  Handled 5 minutes of data in under a second
➾ Proved new analytics are enabled
  Solved single-variable incident detection with small, simple code
➾ Made a case for Scala adoption
  The team is still skeptical about big data
➾ Wanted to prove it was easy to deploy – EC2 helped
  Got burned by the forward-slash bug in AWS secret tokens
Incident Visual
References
➾ [1] Zaharia et al.: Discretized Streams
➾ [2] Zaharia et al.: Discretized Streams: Fault-Tolerant Streaming
➾ [3] Das: Spark Streaming – Real-time Big-Data Processing
➾ [4] Spark Streaming Programming Guide
➾ [5] Running Spark on EC2
➾ [6] Spark on EMR
➾ [7] Ahelegby: Time Series Outliers
Contact Us
Email: chance at blacklightsolutions.com
Phone: 512.795.0855
Web: www.blacklightsolutions.com
Twitter: @chancecoble