BIG DATA ANALYTICS FOR REAL TIME SYSTEMS Kamalika Dutta Manasi Jayapal
Overview
Introduction
Big Data Analytics
Real Time Systems
Challenges of Real Time Analytics
Technologies
Tools
Use Cases
Future Work and Conclusion
Introduction
Where does Big Data come from?
Courtesy: http://goo.gl/JWswfj
What makes it Big Data?
Courtesy: Oracle
VARIABILITY
Evolution of Big Data
1960s 1967
Automatic Data Compression
1997
Information Explosion
Our Literature Survey!
Big Data Analytics
Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
Predictive Analysis
Text Analysis
Data Mining
Statistical Analysis
Courtesy: smartdatacollective.com
Sample Systems
Analytics & 3 Vs
Courtesy: watalon.com
Real Time Systems
A real-time system is one that must process information and produce a response within a specified time; missing that deadline risks severe consequences, sometimes including failure of the system.
Telecommunication Systems
Anti-Lock Brakes in a Car
Air Traffic Control System
Weather Forecasting System
Courtesy: yourdon.com
Real-Time Analytics of Big Data
[Figure: "What is Happening?" Data rates and volumes (kilobytes/sec, megabytes/sec, gigabytes, terabytes, petabytes, exabytes) set against response times (milliseconds, seconds, minutes, hours), locating the Big Data and Real Time regions.]
Courtesy: infochimps.com
Challenges of Real Time Analytics
Expensive
Complex Architecture, Batch Processing
Semi-structured and Unstructured Data: New sources are unpredictable, and relational databases cannot handle them
Market too Dynamic to Predict: Subscribers' preferences change, and competition accelerates the change
Scalability: Requires sub-second response times at a load greater than a single server can handle
Thinking Beyond Hadoop!
Manage and store huge volumes of any data: Hadoop File System, MapReduce
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Structure and control data: Data Warehousing
Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
Understand and navigate federated big data sources: Federated Discovery and Navigation
Courtesy: IBM
Our Solution
Do the impossible: Incorporate any kind of data
Scale Big: Scale without any complexity
Not Time Consuming: Seconds to Minutes
Real Time: Try to analyze data without expensive data warehouse loads
Powerful Analytics, In Place, In Real Time.
Courtesy: slideshare.com
Technologies
In-Memory Computing
In-memory computing keeps data in servers' RAM rather than on disk so that it can be processed faster. Middleware stores the data in RAM across a cluster of computers and processes it in parallel.
Courtesy: Stratecast
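The idea of holding partitioned data in RAM and processing the partitions in parallel can be sketched in a few lines. This is only an illustration under simplifying assumptions: a thread pool stands in for the cluster nodes, and the names partition, partial_sum and parallel_totals are hypothetical, not taken from any in-memory data grid product.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Spread (key, value) records across n_parts in-memory partitions by key hash."""
    parts = [[] for _ in range(n_parts)]
    for key, value in records:
        parts[hash(key) % n_parts].append((key, value))
    return parts


def partial_sum(part):
    """Sum values per key inside one partition (the work one node would do)."""
    totals = {}
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
    return totals


def parallel_totals(records, n_parts=4):
    """Process every partition in parallel, then merge the partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum, parts))
    merged = {}
    for totals in partials:
        for key, value in totals.items():
            merged[key] = merged.get(key, 0) + value
    return merged
```

Because every record with the same key lands in the same partition, the merge step only has to combine disjoint partial results.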
Stream Processing
Courtesy: EMC
Stream-processing systems operate on continuous data streams, e.g. click streams on web pages, user request/query streams, monitoring events and notifications.
Stream processing delivers real-time analytic processing on constantly changing data in motion.
Analyse first, store later!
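As a rough sketch of "analyse first, store later", the generator below computes a sliding-window count over a click stream as each event arrives, before anything is written to storage. The function name rolling_counts and the 60-second window are assumptions made for the example.

```python
from collections import deque


def rolling_counts(timestamps, window_seconds=60):
    """For each arriving event, yield (timestamp, events seen in the trailing window)."""
    window = deque()
    for ts in timestamps:
        window.append(ts)
        # Evict events that have fallen out of the trailing window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        yield ts, len(window)
```

Only the events inside the current window are held in memory, so the input can be an unbounded stream.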
Complex Event Processing
Complex Event Processing (CEP) processes multiple event streams generated within the enterprise to construct data abstractions and identify meaningful patterns among those streams.
Analytics across both real-time and historical data.
Real-time event capture, filtering, pattern detection, matching, and aggregation.
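A minimal sketch of pattern detection across several event streams, assuming each stream is already time-ordered and carries (timestamp, user, kind) tuples. The pattern chosen here (several failed logins followed by a success) and the name detect_suspicious_logins are illustrative assumptions, not part of any CEP product.

```python
import heapq
from collections import defaultdict, deque


def detect_suspicious_logins(*streams, failures=3, window=60):
    """Merge time-ordered streams and flag a successful login that follows
    at least `failures` failed logins by the same user within `window` seconds."""
    recent_failures = defaultdict(deque)
    alerts = []
    for ts, user, kind in heapq.merge(*streams):
        fails = recent_failures[user]
        # Forget failures that are older than the window.
        while fails and fails[0] < ts - window:
            fails.popleft()
        if kind == "fail":
            fails.append(ts)
        elif kind == "ok":
            if len(fails) >= failures:
                alerts.append((ts, user))
            fails.clear()
    return alerts
```

For example, failed logins for one user arriving on a web stream and a mobile stream are matched together, because the streams are merged before the pattern is evaluated.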
Tools for Real Time Analytics
Big Data is NOT new, the Tools ARE!
IBM InfoSphere Streams
Kafka
A high-performance distributed publish-subscribe messaging system.
Designed for processing real-time activity stream data.
Initially developed at LinkedIn, now an Apache project.
Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
Fast
Scalable
Durable
Fault-tolerant
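The publish-subscribe model can be illustrated with a toy, single-process log. This is not Kafka's client API; TopicLog and its methods are hypothetical names, and the sketch only shows the idea that messages stay in an append-only log while each consumer group tracks its own read offset.

```python
class TopicLog:
    """Toy append-only topic; each consumer group keeps its own read offset."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch
```

Reading does not remove messages, so a second consumer group (for example a dashboard and an audit job) can read the same stream independently.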
Storm
A distributed real-time computation system.
Created at BackType, which was acquired by Twitter; Twitter then open-sourced Storm.
Twitter claims over a million tuples processed per second per node.
Fast, Scalable, Reliable and Fault-tolerant.
Stream: Unbounded sequence of tuples
Primitives
Spouts: Pull messages
Bolts: Perform core functions of stream computing
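The spout and bolt primitives can be mimicked with plain Python generators. This is a conceptual sketch only and does not use Storm's API; sentence_spout, split_bolt, count_bolt and run_topology are hypothetical names for a word-count topology.

```python
def sentence_spout(sentences):
    """Spout: pull messages from a source and emit them as one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: split each sentence tuple into lower-cased word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split bolt -> count bolt and return the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest
```

Each tuple flows through the whole chain as soon as the spout emits it, which mirrors the tuple-at-a-time processing model described above.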
Spark Streaming
Developed in the AMPLab at UC Berkeley.
In-memory computing capabilities deliver speed.
Low latency
High throughput
Fault tolerant
New programming model: Discretized Streams (DStreams), built on Resilient Distributed Datasets (RDDs)
Spark Streaming uses micro-batching to support continuous stream processing. It is an extension of Spark, which is a batch-processing system.
Courtesy: Apache Spark
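Micro-batching can be sketched without Spark itself: the stream is cut into consecutive fixed-length intervals, and each interval's events are handed to ordinary batch code. The function micro_batches and its batch_interval parameter are assumptions for illustration and are not part of the Spark API.

```python
def micro_batches(events, batch_interval):
    """Group time-ordered (timestamp, value) events into consecutive batches,
    one batch per interval of length batch_interval."""
    batch, batch_end = [], None
    for ts, value in events:
        if batch_end is None:
            # Align the first batch boundary to a multiple of the interval.
            batch_end = (ts // batch_interval + 1) * batch_interval
        while ts >= batch_end:
            # Close the current interval; intervals with no events yield [].
            yield batch
            batch, batch_end = [], batch_end + batch_interval
        batch.append(value)
    if batch:
        yield batch
```

A batch job, such as a count or an aggregation, is then run once per yielded batch, which is why latency is bounded below by the batch interval.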
Spring XD (XD = eXtreme Data)
Spring XD is a unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
The Spring XD framework supports streams for the ingestion of event-driven data from a source to a sink, passing through any number of processors.
Courtesy: Infoq
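The source, processor and sink structure of a stream can be sketched as simple function composition. This is not Spring XD's stream definition language or API; run_stream, only_errors and uppercase are hypothetical names used to show data passing from a source through any number of processors to a sink.

```python
def run_stream(source, processors, sink):
    """Pass every item from the source through each processor in order,
    then deliver the result to the sink."""
    data = iter(source)
    for processor in processors:
        data = processor(data)
    for item in data:
        sink(item)


def only_errors(lines):
    """Processor: keep only log lines that contain ERROR."""
    return (line for line in lines if "ERROR" in line)


def uppercase(lines):
    """Processor: normalise lines to upper case."""
    return (line.upper() for line in lines)
```

For instance, run_stream(log_lines, [only_errors, uppercase], results.append) would collect the filtered, upper-cased lines in a list named results.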
Comparison of Tools (1)
Definition
  Spark Streaming: A fast and general-purpose cluster computing system.
  Apache Storm: A distributed real-time computation system.
  Spring XD: A unified, distributed, and extensible system for data ingestion, real-time analytics, batch processing, and data export.
Implemented in
  Spark Streaming: Scala
  Apache Storm: Clojure
  Spring XD: Java
Programming API
  Spark Streaming: Scala, Java, Python
  Apache Storm: Java API, and usable with any programming language.
  Spring XD: Java
Development
  Spark Streaming: A full top-level Apache project.
  Apache Storm: An Apache project undergoing incubation.
  Spring XD: Spring project by Pivotal.
Processing Model
  Spark Streaming: Batch processing framework that also does micro-batching.
  Apache Storm: Stream processing framework that processes and dispatches messages as soon as they arrive.
  Spring XD: Unified platform for stream processing.
Fault Tolerance
  Spark Streaming: Recovery of lost work and restart of workers via the resource manager.
  Apache Storm: Restart of workers and supervisors as if nothing had happened.
  Spring XD: Reassignment of work to a working container.
Comparison of Tools (2)
Data processing
  Spark Streaming: Messages are not lost and are delivered once (small-scale batching).
  Apache Storm: Keeps track of each and every record.
  Spring XD: Unacknowledged messages are retried until the container comes back.
Use Cases
  Spark Streaming: Combines batch and stream processing (Lambda Architecture). Machine learning: improves performance of iterative algorithms. Powers real-time dashboards.
  Apache Storm: Prevention of securities fraud, compliance violations, security breaches and network outages.
  Spring XD: Streams tweets to Hadoop for sentiment analysis. High-throughput distributed data ingestion into HDFS from a variety of input sources. Real-time analytics at ingestion time, e.g. gathering metrics and counting values.
Which tools are right for you?
Lambda Architecture
In 2013, Nathan Marz and James Warren proposed the Lambda Architecture, a methodology for building a Big Data system.
Such a system balances latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate pre-computed views, while simultaneously using real-time stream processing to provide dynamic views.
Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable real-time data systems. Manning Publications, 2013.
Courtesy: Trivadis
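The split between pre-computed batch views and dynamic real-time views can be sketched with a small counter. The class LambdaCounter and its method names are hypothetical; a real system would run the batch and speed layers on separate infrastructure, which this single-process sketch does not attempt.

```python
class LambdaCounter:
    """Toy Lambda Architecture: batch view plus speed view, merged at query time."""

    def __init__(self):
        self.master = []       # append-only master dataset
        self.batch_view = {}   # accurate view recomputed by the batch layer
        self.speed_view = {}   # incremental counts since the last batch run

    def ingest(self, key):
        """New data goes to the master dataset and to the speed layer."""
        self.master.append(key)
        self.speed_view[key] = self.speed_view.get(key, 0) + 1

    def run_batch(self):
        """Batch layer: recompute the view from all data, then reset the speed view."""
        view = {}
        for key in self.master:
            view[key] = view.get(key, 0) + 1
        self.batch_view = view
        self.speed_view = {}

    def query(self, key):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)
```

Queries stay correct between batch runs because the speed view covers exactly the data the last batch run has not yet seen.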
Lambda Architecture Example
Marz, Nathan, and James Warren. Big Data: Principles and best practices of scalable real-time data systems. Manning Publications, 2013.
Courtesy: Trivadis
Use Cases
Healthcare
Capture and analyze real-time data from medical monitors, alerting hospital staff to potential health problems before patients manifest clinical signs of infection or other issues.
Analyze privacy-protected streams of medical device data to detect early signs of disease and identify correlations among multiple patients.
Finance
Analyze ticks, tweets, satellite imagery, weather trends, and any other type of data to inform trading algorithms in real time.
Apply fraud insights to take action in real time. Use analytics on streaming data to confidently recognize legitimate actions while preventing or interrupting suspicious ones, and respond immediately to criminal patterns and activities.
Government
Identify social program fraud within seconds based on program history, citizen profile, and geospatial data.
Identify items or patterns for deeper investigation in cyber-security.
Transport
Traffic managers can now respond quickly and accurately to relevant insights from real-time analytics drawn from data feeds and reports.
Telematics can provide data-in-motion such as vehicle speed and data relating to the transmission control system, braking, air bags, tire pressure and wiper speed, as well as geospatial and current environmental conditions data. With this data, automotive companies can strengthen customer relationships.
Telecommunication
Improve customer profitability analysis, end-to-end visibility for new product rollouts, and real-time analysis to better serve network customers.
Perform capacity planning for mobile networks as new high-bandwidth services are introduced. Improve customer experience.
Retail
See a product recurring in abandoned shopping carts. Run a promotion to close more sales of that product.
Evaluate sales performance in real time. Take measures now to achieve sales quotas.
An electronic coupon delivery service sends e-mails to customers with recommendations matched to their interests, derived from their location information, membership information, and information on nearby stores.
Courtesy: SAP
Future Work
Increased Level of Merging
Application of Social and Digital Media
New Technologies
Further Development of Telemetric Data
Self Learning Systems
Complex Statistical Methods
Conclusion
Resources
Privacy Security
Time Cost
"Consumer data will be the biggest differentiator in the next two to three years. Whoever unlocks the reams of data and uses it strategically will win."
- Angela Ahrendts, CEO, Burberry
Questions?