Real-time Analytics on High Velocity Streaming Data Guangyu Wu @CeADAR
Apr 16, 2017

John Mulhall
Transcript
Page 1: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Real-time Analytics on High Velocity Streaming Data

Guangyu Wu @CeADAR

Page 2: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

CeADAR

‣ Application development & proof of concept

‣ Business-value driven

‣ Market pull/need-driven

‣ Website: http://ceadar.ie/

[Figure labels: University, CeADAR, Enterprise]

Page 3: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

CeADAR

Visualisation & Analytic Interfaces

• 'Beyond the desktop'

• Ease of interaction

• Changing user behaviour

• Passive analytics

Data Management for Analytics

• Reduce data management effort for analytics

• Data validation

• Relevance of events to relationships

• Data curation (determining useful data)

• Adaptive ETL (Extract, Transform, Load)

Advanced Analytics

• Causation challenge

• Live topic monitoring

• Social trending and contextualisation

• Continuous analytics

• Social identity fingerprinting

Page 4: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Overview

‣ Introduce different frameworks:

‣ Spark, Storm, Trident

‣ Continuous Clustering project

‣ Continuous Metrics project

‣ Stream Converge project

Page 5: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Spark

‣ Spark is a platform for distributed batch data processing.

‣ Spark includes a number of extensions: Spark Streaming, Spark SQL, MLlib, GraphX.

‣ Spark runs batch jobs predominantly in memory.

‣ Spark Streaming integrates stream processing with batch processing by treating a data stream as a sequence of small batches of data points (micro-batches).

‣ Spark Streaming maintains computation state across micro-batches.
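
The micro-batch idea can be illustrated without Spark itself. The following is a minimal plain-Python sketch, not Spark Streaming code: the names `micro_batches`, `update_counts` and `run` are hypothetical, and batches are cut by item count to keep the example deterministic, whereas Spark Streaming cuts them by a time interval.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into consecutive small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def update_counts(state: Dict[str, int], batch: List[str]) -> Dict[str, int]:
    """Run one small 'batch job' and fold its result into the running state."""
    for word in batch:
        state[word] = state.get(word, 0) + 1
    return state


def run(stream: Iterable[str], batch_size: int = 3) -> Dict[str, int]:
    """Process the stream batch by batch, carrying state between batches."""
    state: Dict[str, int] = {}
    for batch in micro_batches(stream, batch_size):
        state = update_counts(state, batch)
    return state
```

Each batch is processed as an ordinary batch job, and only the `state` dictionary survives from one batch to the next. This is also why results can never be fresher than one batch interval.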

Page 6: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Storm

‣ A Storm topology is composed of spouts (stream sources) and bolts (processing steps).

‣ Storm operates over individual data points.

‣ Storm is designed purely for stream processing.
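
To make the spout/bolt vocabulary concrete, here is a rough plain-Python analogy of a topology that processes one data point at a time. It does not use the Storm API; `message_spout`, `SplitBolt`, `CountBolt` and `run_topology` are hypothetical names chosen for illustration, and everything runs in a single process.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def message_spout(messages: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the stream, emitting one tuple at a time."""
    for message in messages:
        yield message


class SplitBolt:
    """Bolt: turns one incoming message into zero or more word tuples."""

    def process(self, message: str) -> List[str]:
        return message.lower().split()


class CountBolt:
    """Bolt: keeps a running count per word and emits the updated count."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def process(self, word: str) -> Tuple[str, int]:
        self.counts[word] = self.counts.get(word, 0) + 1
        return word, self.counts[word]


def run_topology(messages: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> SplitBolt -> CountBolt and push tuples through one by one."""
    split_bolt, count_bolt = SplitBolt(), CountBolt()
    emitted: List[Tuple[str, int]] = []
    for message in message_spout(messages):
        for word in split_bolt.process(message):
            emitted.append(count_bolt.process(word))
    return emitted
```

Every message flows through the whole chain as soon as it arrives, with no batching step in between.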

Page 7: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Trident

‣ Trident is a high-level programming abstraction built on top of Storm.

‣ It provides a number of useful functions such as aggregations and filters.

‣ An application can be designed and implemented using these high-level abstractions, and Trident converts the logic into a standard Storm topology under the hood.

‣ Trident works over micro-batches of data.

‣ Trident also has built-in support for maintaining and querying processing state.

Page 8: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Methodology

‣ Large static batches of messages: Hadoop and off-line batch processing in Spark

‣ Single messages: Storm

‣ Micro-batches of messages (discretised streams): Spark Streaming, Trident

Page 9: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Continuous Clustering

‣ Use case: real-time SMS spam detection in mobile networks.

‣ Clustering SMS messages based on their content is a good way to identify spam.

‣ Many similar spam messages are sent out over a short period of time.

Page 10: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Continuous Clustering

‣ Problem: traditional clustering algorithms…

‣ work off-line over historical data

‣ require multiple passes over the data

‣ are not incrementally updatable

‣ are hard to scale to ‘big’ data

‣ CeADAR solution: we developed a novel single-pass, scalable data stream clustering algorithm implemented on Storm.
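
The slides do not describe the CeADAR algorithm itself, so the following is only a generic illustration of what "single pass" means: a simple leader-style clustering in plain Python, where each message is compared against existing cluster representatives exactly once. The Jaccard word-set similarity, the `LeaderClusterer` class and the default threshold of 0.75 are all assumptions made for this sketch.

```python
from typing import List, Set


def tokens(message: str) -> Set[str]:
    """Represent a message as its set of lower-cased words."""
    return set(message.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words the two sets have in common (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class LeaderClusterer:
    """Single pass: each message is seen once and assigned immediately."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold
        self.leaders: List[Set[str]] = []  # first message of each cluster
        self.sizes: List[int] = []         # number of messages per cluster

    def add(self, message: str) -> int:
        """Assign the message to the first similar cluster, or open a new one."""
        words = tokens(message)
        for cluster_id, leader in enumerate(self.leaders):
            if jaccard(words, leader) >= self.threshold:
                self.sizes[cluster_id] += 1
                return cluster_id
        self.leaders.append(words)
        self.sizes.append(1)
        return len(self.leaders) - 1
```

A cluster that grows quickly over a short period is a candidate spam campaign. The linear scan over all leaders is the obvious weakness of this sketch: a scalable version would need an index over the leaders and a way to expire old clusters.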

Page 12: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Deployment

‣ Our compute cluster is composed of 4 machines.

‣ Each machine:

‣ Intel Xeon CPU E5-2630 0 @ 2.30GHz with 24 cores

‣ 64 GB memory

‣ 1 TB disk

‣ Spark, Storm, Hadoop, Kafka, Redis

Page 13: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Continuous Clustering

‣ US tier 1 mobile operator

‣ ~500 messages/second average

‣ ~1,300 messages/second peak

‣ Near-exact matching: 35,913

‣ Matching threshold 75%: 8,160

Page 14: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Continuous Metrics

‣ Evaluate and compare Storm, Storm Trident and Spark Streaming on the task of computing a set of statistical metrics in real-time over a continuous stream of data.

‣ Criteria compared:

‣ Throughput: the volume and velocity of data that can be processed on different configurations and hardware.

‣ Latency: the time delay between a new data point being received and the updated metrics being computed.
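
As a small worked example of these two definitions, the sketch below computes both figures from a list of per-item timestamps. The `(arrival_time, done_time)` record layout and the function names are assumptions for illustration, and this is not the measurement harness used in the evaluation.

```python
from typing import List, Tuple

# One record per data point: (arrival_time, done_time), both in seconds.
Sample = Tuple[float, float]


def throughput(samples: List[Sample]) -> float:
    """Data points processed per second over the observed span."""
    first_arrival = min(arrival for arrival, _ in samples)
    last_done = max(done for _, done in samples)
    span = last_done - first_arrival
    if span <= 0:
        raise ValueError("samples must cover a positive time span")
    return len(samples) / span


def mean_latency(samples: List[Sample]) -> float:
    """Average delay between arrival and the updated result being ready."""
    return sum(done - arrival for arrival, done in samples) / len(samples)
```

For the samples `[(0.0, 0.5), (1.0, 1.5), (2.0, 4.0)]`, three points are processed over a 4-second span, giving a throughput of 0.75 points per second and a mean latency of 1.0 second.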


Page 15: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Sliding Windows

‣ By items

‣ By time
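
The two window types can be sketched in plain Python as follows. This is an illustrative, single-process sketch that computes a running mean: `CountWindow` and `TimeWindow` are hypothetical names, and the time-based version assumes timestamps arrive in non-decreasing order and a positive window width.

```python
from collections import deque
from typing import Deque, Tuple


class CountWindow:
    """Sliding window by items: keeps only the last `size` values."""

    def __init__(self, size: int) -> None:
        self.items: Deque[float] = deque(maxlen=size)

    def add(self, value: float) -> float:
        """Add a value and return the mean over the current window."""
        self.items.append(value)  # the oldest value drops out automatically
        return sum(self.items) / len(self.items)


class TimeWindow:
    """Sliding window by time: keeps values from the last `width_seconds`."""

    def __init__(self, width_seconds: float) -> None:
        self.width = width_seconds
        self.items: Deque[Tuple[float, float]] = deque()

    def add(self, timestamp: float, value: float) -> float:
        """Add a timestamped value and return the mean over the window."""
        self.items.append((timestamp, value))
        # Evict everything at or before the window's lower edge.
        while self.items[0][0] <= timestamp - self.width:
            self.items.popleft()
        return sum(v for _, v in self.items) / len(self.items)
```

An item-based window always holds a fixed number of values, while a time-based window holds a varying number depending on how fast data arrives.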

Page 16: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Continuous Metrics

‣ High-level results overview

‣ Spark Streaming achieves the highest throughput and Storm the lowest.

‣ However, Storm achieves the best latency by a considerable margin. Spark Streaming and Trident both exhibit considerably higher latency, due at least in part to their micro-batch processing approach.

‣ The evaluation produced many other insights, learnings and recommendations relating to these real-time platforms.

Page 17: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Stream Converge

‣ Current project: process and combine heterogeneous data streams from diverse sources using Spark Streaming.

Page 18: Real Time Analytics on High Velocity Streaming data by Guangyu Wu

Stream Converge

‣ Challenges:

‣ managing data streams of different frequencies.

‣ linking together events across different streams via complex key relationships.

‣ handling out-of-order arrival of data.

‣ ……
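
One common way to cope with out-of-order arrival is to buffer events briefly and release them in event-time order once a watermark has passed them. The sketch below shows that generic technique in plain Python; it is not the Stream Converge implementation, and `ReorderBuffer` and its `allowed_lateness` parameter are hypothetical names.

```python
import heapq
from typing import List, Tuple

Event = Tuple[float, str]  # (event_time, payload)


class ReorderBuffer:
    """Holds events until they are older than the watermark, then emits in order."""

    def __init__(self, allowed_lateness: float) -> None:
        self.allowed_lateness = allowed_lateness
        self.max_seen = float("-inf")  # largest event time observed so far
        self.heap: List[Event] = []    # min-heap ordered by event time

    def add(self, event: Event) -> List[Event]:
        """Buffer one event; return the events now safe to emit, oldest first."""
        heapq.heappush(self.heap, event)
        self.max_seen = max(self.max_seen, event[0])
        watermark = self.max_seen - self.allowed_lateness
        ready: List[Event] = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready

    def flush(self) -> List[Event]:
        """Emit everything still buffered, oldest first (end of stream)."""
        ready: List[Event] = []
        while self.heap:
            ready.append(heapq.heappop(self.heap))
        return ready
```

The trade-off is latency against completeness: a larger `allowed_lateness` tolerates more disorder but delays every result, and an event arriving later than the allowance is still emitted out of order in this sketch.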