STREAMING EARLY WARNING
Data Day Seattle, 6-27-2015
Chance Coble
Use Case Profile
➾ Telecommunications company
  Had business problems and pain: scalable analytics infrastructure was a problem
  Pushing their infrastructure to its limits
  Open to a proof-of-concept engagement with emerging technology
  Wanted to test on historical data
➾ We introduced Spark Streaming
  Technology would scale
  Could prove it enabled new analytic techniques (incident detection)
  Open to the Scala requirement
  Wanted to prove it was easy to deploy – EC2 helped
Organization Profile
➾ Telecommunications wholesale business
  Processes 90 million calls per day
  Scales up to 1,000 calls per second
  Nearly half a million calls in a 5-minute window
  Technology is loosely split into
    Operational Support Systems (OSS)
    Business Support Systems (BSS)
➾ Core technology is mature
  Analytics run on a LAMP stack
  The technology team is strongly skilled in that stack
Jargon
➾ Number
  Comprised of a country code (possibly), area code (NPA), exchange (NXX), and 4 other digits
  Area codes and exchanges are often geo-coded
  Example: 1 512 867 5309 (country code, NPA, NXX, line number)
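The number structure above can be sketched in plain Scala. This is an illustrative helper, not part of the client's system; the function name and the assumption of an 11- or 10-digit North American number are mine.

```scala
// Hypothetical sketch: split a North American number into the parts named
// above (country code, NPA, NXX, line number). Non-digits are stripped,
// and an optional leading "1" country code is peeled off.
def parseNumber(digits: String): Option[(String, String, String, String)] = {
  val d = digits.filter(_.isDigit)
  val national = if (d.length == 11 && d.head == '1') d.tail else d
  if (national.length == 10)
    Some(("1", national.substring(0, 3), national.substring(3, 6), national.substring(6)))
  else None
}
```

For example, `parseNumber("1 512 867 5309")` yields `Some(("1", "512", "867", "5309"))`, matching the NPA/NXX breakdown described above.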
Jargon
➾ Trunk Group
  A trunk is a line connecting transmissions between two points
  A trunk group shares some common property – in this case, being owned by the same entity
  Transmissions from ingress trunks are routed to egress trunks
➾ Route – in this case, the selection of a trunk group to terminate the call at its destination
➾ QoS – Quality of Service, governed by metrics
  Call duration – short calls are an indication of quality problems
  ASR – Average Seizure Rate; this company measures it as #connected calls / #calls attempted
➾ Real-time: within 5 minutes
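The two QoS metrics above can be sketched over plain Scala collections. `CallRecord` and its fields are illustrative assumptions, not the client's 180-field schema, and the 10-second "short call" threshold is mine.

```scala
// Hypothetical call record; only the two fields the metrics need.
case class CallRecord(connected: Boolean, durationSeconds: Int)

// ASR as this company defines it: #connected calls / #calls attempted.
def asr(calls: Seq[CallRecord]): Double =
  if (calls.isEmpty) 0.0
  else calls.count(_.connected).toDouble / calls.size

// Share of connected calls that were "short" -- a quality warning sign.
def shortCallRate(calls: Seq[CallRecord], thresholdSeconds: Int = 10): Double = {
  val connected = calls.filter(_.connected)
  if (connected.isEmpty) 0.0
  else connected.count(_.durationSeconds < thresholdSeconds).toDouble / connected.size
}
```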
The Problem
➾ A switch handles most of their routing
➾ A configuration table in the switch governs routing
  If-this-then-that style logic
➾ Proprietary technology handles adjustments to that table
  Manual intervention is also required

(Diagram: Call Logs, Business Rules Application, Database, Intranet Portal)
The Problem
➾ The backend system receives a log of calls from the switch
  A file is dumped every few minutes
  180 well-defined fields represent the features of a call event
  Supports downstream analytics once enriched with pricing, geo-coding, and account information
➾ Their job is to connect calls at the most efficient price without sacrificing quality
Why Spark?
➾ They were interested because
  A workbench can simplify operationalizing analytics
  They can skip a generation of clunky big data tools
  It works with their data structures
  It will "scale out" rather than up
  It can handle fault-tolerant in-memory updates
Spark Basics – Architecture

(Diagram: the Spark Driver hosts the SparkContext and coordinates with a Cluster Manager, which schedules work onto Executors; each Executor runs Tasks and maintains a Cache)
Spark Basics – Call Status Count Example

val cdrLogPath = "/cdrs/cdr20140731042210.ssv"
val conf = new SparkConf().setAppName("CDR Count")
val sc = new SparkContext(conf)
val cdrLines = sc.textFile(cdrLogPath)
val cdrDetails = cdrLines.map(_.split(";")).cache()
val successful = cdrDetails.filter(x => x(6) == "S").count()
val unsuccessful = cdrDetails.filter(x => x(6) == "U").count()
println("Successful: %s, Unsuccessful: %s".format(successful, unsuccessful))
Spark Basics – RDDs
➾ Operations on data generate distributable tasks through a Directed Acyclic Graph (DAG) – functional programming FTW!
➾ Resilient
  Data is redundantly stored, and can be recomputed through the generated DAG
➾ Distributed
  The DAG can process each small task, as well as a subset of the data, through optimizations in the Spark planning engine
➾ Dataset
  This construct is native to Spark computation
Spark Basics – RDDs
➾ Lazy
➾ Transformations generate tasks over data slices; nothing runs until an action forces evaluation
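The laziness above can be illustrated without a cluster. This is a pure-Scala analogy using `view`, which defers work the same way a chain of RDD transformations does; it is not Spark code.

```scala
// Spark transformations (map, filter) only record work in the DAG;
// nothing executes until an action (count, collect) forces it.
// Scala's `view` defers work in an analogous way:
var evaluations = 0
val doubled = (1 to 10).view.map { x => evaluations += 1; x * 2 }

// Nothing has been computed yet -- just like an un-acted-upon RDD chain.
assert(evaluations == 0)

// Forcing the view plays the role of an RDD action such as count().
val total = doubled.sum
assert(evaluations == 10)
```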
Streaming Applications – Why try it?
➾ Streaming applications
  Site activity statistics
  Spam detection
  System monitoring
  Intrusion detection
  Telecommunications network data
Streaming Models
➾ Record-at-a-time
  Receive one record and process it
  Simple, low latency
➾ Micro-batch
  Receive records and periodically run a batch process over a window
  The process *must* run fast enough to handle all records collected
  Harder to reduce latency
  High throughput
  Easy reasoning
  Global state
  Fault tolerance
  Unified code
DStreams
➾ Stands for Discretized Streams
➾ A series of RDDs
➾ Spark already provided a computation model on RDDs
➾ Note that records are ordered as they are received
  They are also time-stamped for computation in that order
  Is that always the way you want to see your data?
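The discretization idea above can be sketched in plain Scala, with no Spark dependency: bucket time-stamped records into fixed-length batches, each of which Spark would treat as one RDD in the series. `Record` and the batch length are illustrative assumptions.

```scala
// Hypothetical time-stamped record, ordered as received.
case class Record(timestampMs: Long, value: String)

// Discretize a stream into fixed-length micro-batches, keyed by batch index.
// In Spark Streaming, each bucket would become one RDD of the DStream.
def discretize(records: Seq[Record], batchMs: Long): Map[Long, Seq[Record]] =
  records.groupBy(r => r.timestampMs / batchMs)
```

For example, with a 1-second batch interval, records stamped 100 ms and 900 ms land in batch 0, while one stamped 1500 ms lands in batch 1.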
Fault Tolerance – Parallel Recovery
➾ Failed nodes
➾ Stragglers!
Fault Tolerance – Recompute

Throughput vs. Latency
Anatomy of a Spark Streaming Program
val sparkConf = new SparkConf().setAppName("QueueStream")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val rddQueue = new SynchronizedQueue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
val mappedStream = inputStream.map(x => (x % 10, 1))
val reducedStream = mappedStream.reduceByKey(_ + _)
reducedStream.print()
ssc.start()
for (i <- 1 to 30) {
  rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  Thread.sleep(1000)
}
ssc.stop()

Utilities are also available for Twitter, Kafka, Flume, and file streams.
Windows

(Diagram: a sliding window over a DStream, showing the window length and the slide interval)
Streaming Call Analysis with Windows

// iteration, window, and slide are interval lengths (in seconds) defined elsewhere
val path = "/Users/chance/Documents/cdrdrop"
val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("CDRIncidentDetection")
  .set("spark.executor.memory", "8g")
val ssc = new StreamingContext(conf, Seconds(iteration))
val callStream = ssc.textFileStream(path)
val cdr = callStream.window(Seconds(window), Seconds(slide)).map(_.split(";"))
val cdrArr = cdr.filter(c => c.length > 136)
  .map(c => extractCallDetailRecord(c))
val result = detectIncidents(cdrArr)
result.foreach(rdd => rdd.take(10)
  .foreach { case (x, (d, high, low, res)) =>
    println(x + "," + high + "," + d + "," + low + "," + res) })
ssc.start()
ssc.awaitTermination()
Can we enable new analytics?
➾ Incident detection
  Chose a univariate technique [1] to detect behavior out of profile from recent events
  The technique identifies
    Out-of-profile events
    Dramatic shifts in the profile
  Easy to understand

(Figure: profile computed over the recent window)
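A minimal sketch of the univariate idea above: compare the newest observation to the mean and standard deviation of the recent window. The 3-sigma rule and function shape are illustrative assumptions, not necessarily the exact technique the engagement used.

```scala
// Flag the latest value as "out of profile" when it falls more than
// k standard deviations from the mean of the recent window.
def outOfProfile(recentWindow: Seq[Double], latest: Double, k: Double = 3.0): Boolean = {
  val n = recentWindow.size
  require(n > 1, "need a window of at least two observations")
  val mean = recentWindow.sum / n
  val variance = recentWindow.map(x => math.pow(x - mean, 2)).sum / n
  val stddev = math.sqrt(variance)
  // A zero-variance window cannot flag anything under this rule.
  stddev > 0 && math.abs(latest - mean) > k * stddev
}
```

In a streaming setting, the "recent window" would be the values collected over the current micro-batch window, and the same check could run per key (for example, per trunk group).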
Is it simple to deploy?
➾ EC2 helped
➾ The client had no Hadoop and little NoSQL expertise
➾ Develop and deploy
  Built with sbt, ran on the master
➾ The architecture involved
  Pushing new call detail logs to HDFS on EC2
  Streaming picks up new data and updates RDDs accordingly
  Results were explored in two ways
    Accessing results through data virtualization
    Writing (small) RDD results to a SQL database, then using a business intelligence tool to create report content

(Diagram: call logs stream into current processing on HDFS on EC2, feeding analysis and reporting dashboards with multiple delivery options)
Results

(Chart: throughput in MB, compared against published WordCount results)
Summary of Results
➾ Technology would scale
  Handled 5 minutes of data in under a second
➾ Proved new analytics are enabled
  Solved single-variable incident detection with small, simple code
➾ Made a case for Scala adoption
  The team is still skeptical about big data
➾ Wanted to prove it was easy to deploy – EC2 helped
  Got burned by the forward-slash bug in AWS secret tokens
Incident Visual
References
➾ [1] Zaharia et al.: Discretized Streams
➾ [2] Zaharia et al.: Discretized Streams: Fault-Tolerant Streaming
➾ [3] Das: Spark Streaming – Real-time Big-Data Processing
➾ [4] Spark Streaming Programming Guide
➾ [5] Running Spark on EC2
➾ [6] Spark on EMR
➾ [7] Ahelegby: Time Series Outliers
Contact Us
Email: chance at blacklightsolutions.com
Phone: 512.795.0855
Web: www.blacklightsolutions.com
Twitter: @chancecoble