12.12.2015
Azure Stream Analytics
Marco Parenzan (@marco_parenzan)
Thank you to our AWESOME sponsors!
@marco_parenzan
Microsoft MVP 2015 for Azure
I develop modern distributed and cloud solutions
Marco [dot] Parenzan [at] 1nn0va [dot] it
Passion for speaking and inspiring programmers, students, people
www.innovazionefvg.net
SQL SATs organization addicted!
I'm a developer!
Agenda
- Analytics in a modern world
- Why a developer talks about analytics
- Why cloud?
- Introduction to Azure Stream Analytics
- Azure Stream Analytics architecture
- Stream Analytics Query Language (SAQL)
- Handling time in Azure Stream Analytics
- Scaling Analytics
- Conclusions
ANALYTICS IN A MODERN WORLD
What is Analytics? (from Wikipedia)
Analytics is the discovery and communication of meaningful patterns in data.
Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.
Analytics often favors data visualization to communicate insight.
Traditional analytics
Everything around us produces data: from devices, sensors, infrastructures and applications.
Traditional Business Intelligence first collects data and analyzes it afterwards: typically with one day of latency, the day after.
But we live in a fast-paced world: social media, Internet of Things, just-in-time production.
Offline data is not enough: for many organizations, capturing and storing event data for later analysis is no longer sufficient.
Data at Rest
Analytics in a modern world
We work with streaming data. We want to monitor and analyze data in near real time, typically with a few seconds up to a few minutes of latency.
So we don't have the time to stop, copy the data and analyze it; we have to work with streams of data.
Data in motion
Event-based systems
Event = "something happened… somewhere… sometime!"
Events arrive at different times, i.e. they have unique timestamps.
Events arrive at different rates (events/sec): in any given period of time there may be 0, 1 or more events.
WHY A DEVELOPER TALKS ABOUT ANALYTICS
Analytics with IoT
Analytics with ASP.NET
API Apps, Logic Apps, world-wide distributed APIs (REST) are resource consuming (CPU, storage, network bandwidth).
Each request is logged, with Event Hub or in log files.
Evaluate how the API is doing with "real time" statistics.
E.g. ASP.NET apps log directly to Event Hub.
WHY CLOUD?
Why Analytics in the Cloud?
Not all data is local: event data is already in the cloud, and event data is globally distributed.
Bring the processing to the data, not the data to the processing.
Apply cloud principles
Focus on building solutions (PaaS or SaaS), without having to manage complex infrastructure and software:
- no hardware or other up-front costs, and no time-consuming installation or setup
- elastic scale, where resources are efficiently allocated and paid for as requested
- scale to any volume of data while still achieving high throughput, low latency and guaranteed resiliency
- up and running in minutes
INTRODUCTION TO AZURE STREAM ANALYTICS
What is Azure Stream Analytics?
Azure Stream Analytics is a cost-effective event processing engine. It is…
…described via a SQL-like syntax
…a stream processing engine that is integrated with a scalable event queuing system like Azure Event Hubs
…not alone: it is not the only one
Microsoft Azure IoT Services
- Devices: the things producing events
- Device Connectivity: Event Hubs, Service Bus, IoT Hub, external data sources
- Storage: SQL Database, Table/Blob Storage, DocumentDB
- Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory
- Presentation & Action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services
Events handled by Azure Event Hubs
Event producers → Azure Event Hub (partitions) → consumer group(s) → receivers (Event Processor Host / IEventProcessor)
- > 1M producers, > 1 GB/sec aggregate throughput
- Up to 32 partitions via portal, more on request
- Events reach a partition either directly or via a PartitionKey hash
- Throughput Units: 1 ≤ TUs ≤ partition count; 1 TU = 1 MB/s writes, 2 MB/s reads
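The PartitionKey-hash routing above can be sketched in a few lines. This is a minimal illustration of the idea, not the actual Event Hubs algorithm (which is internal to the service); the device names and partition count are made up.

```python
# Sketch: how an event-hub-style ingestor maps a partition key to a
# partition, so all events with the same key land in the same partition.
import hashlib

def partition_for(key: str, partition_count: int) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % partition_count

partitions = 4
events = [{"device": f"dev-{i % 3}", "reading": i} for i in range(9)]

placement = {}
for e in events:
    placement.setdefault(partition_for(e["device"], partitions), []).append(e)

# The same key always maps to the same partition
assert partition_for("dev-0", partitions) == partition_for("dev-0", partitions)
```

Per-key ordering is preserved because a key never changes partition, which is exactly why Event Hub can guarantee timestamp monotonicity per partition.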
Analytics by Azure Stream Analytics
Remember: analytics is the discovery and communication of meaningful patterns in data.
Azure Machine Learning does the same: where is the difference?
- Stream Analytics: Transform (stateless functions, GROUP BY), Enrich (SELECT), Correlate (JOIN)
- Machine Learning: Regression, Classification, Anomaly Detection
Real-time analytics
- Intake millions of events per second (up to 1 GB/s)
- Scale that accommodates variable loads
- Low processing latency, auto adaptive (sub-second to seconds)
- Transform, augment, correlate, temporal operations
- Correlate between different streams, or with reference data
- Find patterns, or lack of patterns, in data in real time
No challenges with scale
- Elasticity of the cloud for scale out: spin up any number of resources on demand
- Scale from small to large when required
- Distributed, scale-out architecture
Fully managed
- No hardware (PaaS offering): bypasses the need for deployment expertise
- No software provisioning and maintenance, no performance tuning
- Spin up any number of resources on demand
- Expand your business globally, leveraging Azure regions
Mission critical availability
- Guaranteed event delivery: no lost events or incorrect output, "once and only once" delivery of events, ability to replay events
- Guaranteed business continuity: guaranteed uptime (three nines of availability), auto-recovery from failures, built-in state management for fast recovery
- Effective audits: privacy and security properties of solutions are evident, Azure integration for monitoring and ops alerting
Lower costs
- Efficiently pay only for usage: architected for multi-tenancy, not paying for idle resources
- Typical cloud expense model: low startup costs, ability to incrementally add resources, reduce costs when business needs change
Rapid development
- SQL-like language: high-level (focus on the stream analytics solution), concise (less code to maintain)
- First-class support for event streams and reference data
- Built-in temporal semantics: built-in temporal windowing and joining, simple policy configuration to manage out-of-order events and late arrivals
AZURE STREAM ANALYTICS ARCHITECTURE
Canonical Stream Analytics Pattern
Event production → collection/ingestion → stream analysis → storage and batch analysis → presentation and action
- Event production: devices (IP-capable devices on Windows/Linux, low-power devices with an RTOS), applications, legacy IoT (custom protocols)
- Collection/ingestion: Event Hubs, cloud gateways (web APIs), field gateways
- Stream analysis: Stream Analytics, Machine Learning, and more to come…
- Storage and batch analysis: SQL DB, Storage Tables, Storage Blobs, Event Hubs
- Presentation and action: search and query, data analytics (Power BI), web/thick client dashboards, devices to take action
Stream Analytics implements a lambda architecture
A generic, scalable and fault-tolerant data processing architecture, based on Nathan Marz's experience working on distributed data processing systems: a robust system that is fault-tolerant, both against hardware failures and human mistakes.
http://lambda-architecture.net/
All data entering the system is dispatched to both the batch layer and the speed layer for processing. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data); (ii) pre-computing the batch views.
The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only. Any incoming query can be answered by merging results from batch views and real-time views.
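The dual dispatch described above can be sketched in a few lines. This is a toy model of the lambda architecture, not how Stream Analytics is implemented; the event shape and counting logic are made up for illustration.

```python
# Sketch of lambda-architecture dispatch: every event goes to both an
# append-only master dataset (batch layer) and a speed layer; queries
# merge the pre-computed batch view with the real-time view.
master_dataset = []   # immutable, append-only raw data
realtime_view = {}    # speed layer: counts for recent events
batch_view = {}       # pre-computed from the master dataset

def ingest(event):
    master_dataset.append(event)                       # batch layer
    k = event["key"]
    realtime_view[k] = realtime_view.get(k, 0) + 1     # speed layer

def recompute_batch_view():
    # The batch layer periodically recomputes views from all raw data;
    # the speed layer is then reset for the data already absorbed.
    global batch_view, realtime_view
    batch_view = {}
    for e in master_dataset:
        batch_view[e["key"]] = batch_view.get(e["key"], 0) + 1
    realtime_view = {}

def query(key):
    # Answer by merging batch and real-time views
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

for e in [{"key": "a"}, {"key": "b"}, {"key": "a"}]:
    ingest(e)
assert query("a") == 2     # answered from the speed layer only
recompute_batch_view()
assert query("a") == 2     # same answer, now from the batch view
```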
Azure Stream Analytics
Data source → Collect → Process → Deliver → Consume
- Collect, event inputs: Event Hub, IoT Hub, Azure Blob; reference data: Azure Blob (fed, for example, by Azure Data Factory)
- Process, transform: temporal joins, filter, aggregates, projections, windows, etc.; enrich and correlate
- Deliver, outputs: SQL Azure, Azure Blobs, Event Hub, Service Bus Queue, Service Bus Topics, Table storage, Power BI, DocumentDB
- Consume: BI dashboards, predictive analytics, Azure Storage
Built-in guarantees: temporal semantics, guaranteed delivery, guaranteed up time
Input sources for a Stream Analytics job
- Currently supported input data streams are Azure Event Hub, Azure IoT Hub and Azure Blob Storage. Multiple input data streams are supported.
- Advanced options let you configure how the job will read data from the input blob (which folders to read from, when a blob is ready to be read, etc.).
- Reference data is usually static or changes very slowly over time. It must be stored in Azure Blob Storage and is cached for performance.
Defining the event schema
- The serialization format and the encoding for the input data sources (both data streams and reference data) must be defined.
- Currently three formats are supported: CSV, JSON and Avro (binary JSON: https://avro.apache.org/docs/1.7.7/spec.html)
- For the CSV format a number of common delimiters are supported: comma (,), semicolon (;), colon (:), tab and space.
- For CSV and Avro you can optionally provide the schema for the input data.
Outputs for Stream Analytics jobs
Currently supported output data stores:
- Azure Blob storage: creates log files with temporal query results; ideal for archiving
- Azure Table storage: more structured than blob storage, easier to set up than SQL Database, and durable (in contrast to Event Hub)
- SQL Database: stores results in an Azure SQL Database table; ideal as a source for traditional reporting and analysis
- Event Hub: sends an event to an event hub; ideal to generate actionable events such as alerts or notifications
- Service Bus Queue: sends an event to a queue; ideal for sending events sequentially
- Service Bus Topics: sends an event to subscribers; ideal for sending events to many consumers
- PowerBI.com: ideal for near real-time reporting!
- DocumentDB: ideal if you work with JSON and object graphs
STREAM ANALYTICS QUERY LANGUAGE (SAQL)
SAQL – Language & Library
- DML: SELECT, FROM, WHERE, GROUP BY, HAVING, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC, DESC
- Scaling extensions: WITH, PARTITION BY, OVER
- Date and time functions: DateName, DatePart, Day, Month, Year, DateTimeFromParts, DateDiff, DateAdd
- Windowing extensions: TumblingWindow, HoppingWindow, SlidingWindow, Duration
- Aggregate functions: Sum, Count, Avg, Min, Max, StDev, StDevP, Var, VarP
- String functions: Len, Concat, CharIndex, Substring, PatIndex
- Temporal functions: Lag, IsFirst, CollectTop
Supported types
- bigint: integers in the range -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)
- float: floating point numbers in the range -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308
- nvarchar(max): text values, comprised of Unicode characters (note: a value other than max is not supported)
- datetime: defines a date that is combined with a time of day with fractional seconds, based on a 24-hour clock and relative to UTC (time zone offset 0)
Inputs will be cast into one of these types. We can control these types with a CREATE TABLE statement: this does not create a table, but just a data type mapping for the inputs.
INTO clause
Pipelining data from input to output:
- without an INTO clause we write to the destination named 'output'
- we can have multiple outputs: with the INTO clause we can choose the appropriate destination for every SELECT
- e.g. send events to blob storage for big data analysis, but send special events to Event Hub for alerting

SELECT UserName, TimeZone
INTO Output
FROM InputStream
WHERE Topic = 'XBox'
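The fan-out pattern above (everything to archival storage, special events also to an alerting output) can be sketched outside SAQL. The output names and the predicate below are made up for illustration:

```python
# Sketch of routing one input stream to several named outputs, as the
# INTO clause does: all events go to a blob archive, XBox events also
# go to an alerting event hub.
def route(events):
    outputs = {"blob_archive": [], "alert_hub": []}
    for e in events:
        outputs["blob_archive"].append(e)      # SELECT * INTO blob_archive
        if e.get("Topic") == "XBox":           # WHERE Topic = 'XBox'
            outputs["alert_hub"].append(e)     # SELECT ... INTO alert_hub
    return outputs

events = [{"UserName": "ada", "Topic": "XBox"},
          {"UserName": "bob", "Topic": "Surface"}]
routed = route(events)

assert len(routed["blob_archive"]) == 2
assert [e["UserName"] for e in routed["alert_hub"]] == ["ada"]
```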
WHERE clause
Specifies the conditions for the rows returned in the result set for a SELECT statement, query expression, or subquery. There is no limit to the number of predicates that can be included in a search condition.

SELECT UserName, TimeZone
FROM InputStream
WHERE Topic = 'XBox'
JOIN
We can combine multiple event streams, or an event stream with reference data, via a join (inner join) or a left outer join. In the join clause we can specify the time window in which we want the join to take place; we use a special version of DateDiff for this.
Reference Data
- Seamless correlation of event streams with reference data
- Static or slowly-changing data stored in blobs: CSV and JSON files in Azure Blobs, scanned for new snapshots on a settable cadence
- JOIN (INNER or LEFT OUTER) between streams and reference data sources
- Reference data appears like another input:

SELECT myRefData.Name, myStream.Value
FROM myStream
JOIN myRefData ON myStream.myKey = myRefData.myKey
Reference data tips
- Currently reference data cannot be refreshed automatically: you need to stop the job and specify a new snapshot with the reference data
- Reference data can only live in Blob storage
- In practice you use services like Azure Data Factory to move data from Azure data sources to Azure Blob Storage
- Have you followed Francesco Diaz's session?
UNION
Combines the results of two or more queries into a single result set that includes all the rows that belong to all the queries in the union.
- The number and order of the columns must be the same in all queries
- Data types must be compatible
- If ALL is not specified, duplicate rows are removed

SELECT TollId, ENTime AS Time, LicensePlate
FROM EntryStream TIMESTAMP BY ENTime
UNION
SELECT TollId, EXTime AS Time, LicensePlate
FROM ExitStream TIMESTAMP BY EXTime

EntryStream:
TollId | EntryTime               | LicensePlate | …
1      | 2014-09-10 12:01:00.000 | JNB 7001     | …
1      | 2014-09-10 12:02:00.000 | YXZ 1001     | …
3      | 2014-09-10 12:02:00.000 | ABC 1004     | …

ExitStream:
TollId | ExitTime                | LicensePlate
1      | 2009-06-25 12:03:00.000 | JNB 7001
1      | 2009-06-25 12:03:00.000 | YXZ 1001
3      | 2009-06-25 12:04:00.000 | ABC 1004

Result:
TollId | Time                    | LicensePlate
1      | 2014-09-10 12:01:00.000 | JNB 7001
1      | 2014-09-10 12:02:00.000 | YXZ 1001
3      | 2014-09-10 12:02:00.000 | ABC 1004
1      | 2009-06-25 12:03:00.000 | JNB 7001
1      | 2009-06-25 12:03:00.000 | YXZ 1001
3      | 2009-06-25 12:04:00.000 | ABC 1004
HANDLING TIME IN AZURE STREAM ANALYTICS
Traditional queries
Traditional querying assumes the data doesn't change while you are querying it: we query a fixed state. If the data is changing, snapshots and transactions 'freeze' the data while we query it. Since we query a finite state, our query should finish in a finite amount of time.

table → query → result table
A different kind of query
When analyzing a stream of data, we deal with a potentially infinite amount of data. As a consequence, our query would never end! To solve this problem, most queries will use time windows.

stream → temporal query → result stream
Arrival Time vs Application Time
Every event that flows through the system comes with a timestamp that can be accessed via System.Timestamp. This timestamp can either be the arrival time assigned by the input source, or an application time which the user can specify in the query. A record can have multiple timestamps associated with it.
The arrival time has different meanings based on the input source: for events from Azure Service Bus Event Hub, the arrival time is the timestamp given by the Event Hub; for Blob storage, it is the blob's last modified time.
If the user wants to use an application time, they can do so using the TIMESTAMP BY keyword. Data are sorted by the timestamp column.
Temporal Joins
Joins are used to combine events from two or more input sources. Joins are temporal in nature: each JOIN must provide limits on how far the matching rows can be separated in time. Time bounds are specified inside the ON clause using the DATEDIFF function. LEFT OUTER JOIN is supported to select rows from the left table that do not meet the join condition, which is useful for pattern detection.

SELECT Make
FROM EntryStream ES TIMESTAMP BY EntryTime
JOIN ExitStream EX TIMESTAMP BY ExitTime
ON ES.Make = EX.Make AND DATEDIFF(second, ES, EX) BETWEEN 0 AND 10

Example, with events on a 0–60 second time axis:
Toll Entry: {"Mazda",6} {"BMW",7} {"Honda",2} {"Volvo",3}
Toll Exit: {"Mazda",3} {"BMW",7} {"Honda",2} {"Volvo",3}
"Honda" is not in the result because the event in the Exit stream precedes the event in the Entry stream; "BMW" is not in the result because the Entry and Exit stream events are more than 10 seconds apart. Query result = [Mazda, Volvo]
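The DATEDIFF-bounded join semantics can be simulated directly. The timestamps below are illustrative, chosen so that Honda's exit precedes its entry and BMW's exit is more than 10 seconds away, reproducing the [Mazda, Volvo] outcome:

```python
# Sketch of a DATEDIFF-bounded stream join: match an entry event with an
# exit event for the same Make only when the exit happens 0..10 seconds
# after the entry. Times (seconds) are made up for illustration.
entry = {"Mazda": 6, "BMW": 7, "Honda": 12, "Volvo": 3}   # make -> entry time
exit_ = {"Mazda": 14, "BMW": 30, "Honda": 2, "Volvo": 9}  # make -> exit time

def temporal_join(entries, exits, low=0, high=10):
    # DATEDIFF(second, ES, EX) BETWEEN low AND high
    return sorted(
        make for make, t_in in entries.items()
        if make in exits and low <= exits[make] - t_in <= high
    )

# Mazda: 14-6=8 (match), Volvo: 9-3=6 (match),
# Honda: 2-12=-10 (exit precedes entry), BMW: 30-7=23 (too far apart)
assert temporal_join(entry, exit_) == ["Mazda", "Volvo"]
```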
Windowing Concepts
A common requirement is to perform some set-based operation (count, aggregation, etc.) over events that arrive within a specified period of time. GROUP BY returns data aggregated over a certain subset of the data. How do we define a subset in a stream? With windowing functions! Each GROUP BY requires a windowing function.
Three types of windows
- Every window operation outputs events at the end of the window
- The output of the window is a single event, based on the aggregate function used; the event has the timestamp of the window
- All windows have a fixed length

Tumbling window: aggregate per time interval
Hopping window: scheduled overlapping windows
Sliding window: window constantly re-evaluated
Tumbling Window
Tumbling windows repeat and are non-overlapping; an event can belong to only one tumbling window.
(Figure: a 20-second tumbling window over a 0–60 second time axis; events are partitioned into consecutive, non-overlapping 20-second buckets.)

Query: count the total number of vehicles entering each toll booth every interval of 20 seconds.

SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 20)
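Tumbling-window semantics can be simulated in a few lines. The event timestamps and toll ids below are made up for illustration:

```python
# Sketch of tumbling-window semantics: events fall into consecutive,
# non-overlapping buckets of fixed size, and each event belongs to
# exactly one window.
from collections import Counter

def tumbling_counts(events, size):
    counts = Counter()
    for t, toll_id in events:
        window_start = (t // size) * size   # the one bucket the event falls into
        counts[(window_start, toll_id)] += 1
    return dict(counts)

events = [(1, "A"), (5, "A"), (14, "B"), (26, "A"), (38, "A"), (41, "B")]
result = tumbling_counts(events, 20)

assert result[(0, "A")] == 2    # window [0, 20): events at t=1, t=5
assert result[(20, "A")] == 2   # window [20, 40): events at t=26, t=38
assert result[(40, "B")] == 1
```

Every event is counted exactly once, which is why the per-window counts always sum to the total number of events.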
Hopping Window
Hopping windows repeat, can overlap, and hop forward in time by a fixed period. A hopping window is the same as a tumbling window if the hop size equals the window size. Events can belong to more than one hopping window.
(Figure: a 20-second hopping window with a 10-second hop over a 0–60 second time axis; consecutive windows overlap by 10 seconds.)

Query: count the number of vehicles entering each toll booth every interval of 20 seconds; update results every 10 seconds.

SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 20, 10)
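The overlap is the key difference from a tumbling window, and it is easy to see in a simulation. The events and the horizon below are made up for illustration:

```python
# Sketch of hopping-window semantics: windows of a fixed size start
# every `hop` seconds, so one event can be counted in several windows.
from collections import Counter

def hopping_counts(events, size, hop, horizon):
    counts = Counter()
    for start in range(0, horizon, hop):
        for t, toll_id in events:
            if start <= t < start + size:
                counts[(start, toll_id)] += 1
    return dict(counts)

events = [(5, "A"), (15, "A"), (25, "A")]
result = hopping_counts(events, size=20, hop=10, horizon=40)

assert result[(0, "A")] == 2    # window [0, 20): t=5, t=15
assert result[(10, "A")] == 2   # window [10, 30): t=15, t=25 (t=15 again!)
assert result[(20, "A")] == 1   # window [20, 40): t=25
```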
Sliding Window
Sliding windows:
- continuously move forward by an ε (epsilon)
- produce an output only at the occurrence of an event
- every window has at least one event
Events can belong to more than one sliding window.
(Figure: a 20-second sliding window over a 0–60 second time axis; as events «1», «5», «8» and «9» enter and leave the window, its contents are re-evaluated.)

Query: find all the toll booths which have served more than 10 vehicles in the last 20 seconds.

SELECT TollId, Count(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 20)
HAVING Count(*) > 10
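The "output only when an event occurs" behavior is what distinguishes the sliding window, and a small simulation makes it concrete. The event times below are made up for illustration:

```python
# Sketch of sliding-window semantics: an output is produced only when an
# event arrives, and it aggregates every event in the preceding `size`
# seconds, so each event can contribute to several outputs.
def sliding_counts(events, size):
    outputs = []
    for t in sorted(t for t, _ in events):
        in_window = [e for e in events if t - size < e[0] <= t]
        outputs.append((t, len(in_window)))   # one output per event
    return outputs

events = [(5, "car1"), (12, "car2"), (18, "car3"), (40, "car4")]
result = sliding_counts(events, size=20)

# t=5: 1 event in (−15, 5]; t=12: 2; t=18: 3; t=40: only car4 remains
assert result == [(5, 1), (12, 2), (18, 3), (40, 1)]
```

A `HAVING Count(*) > 10` clause would then simply filter these per-event outputs.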
Demo: analyticsgames.azurewebsites.net
Mobile Controller (HTML) → Web API (MVC + Web API) → Event Hub → Stream Analytics → Service Bus (Queue) → Web Worker → Remote (HTML)
- The controller sends a JSON tap event to the Web API, which forwards it as JSON to the Event Hub input source
- Stream Analytics writes to the Service Bus output queue; the Web Worker reads the queue and pushes a SignalR message, as an HTTP notification, to the remote page
SCALING STREAM ANALYTICS
Streaming Units
A streaming unit is a measure of the computing resources available for processing a job. A streaming unit can process up to 1 MB/second. By default every job consists of 1 streaming unit. The total number of streaming units that can be used depends on the rate of incoming events and the complexity of the query.
Multiple steps, multiple outputs
- A query can have multiple steps, to enable pipelined execution
- A step is a sub-query defined using WITH (a "common table expression"); the only query outside of the WITH keyword is also counted as a step
- Steps can be used to develop complex queries more elegantly, by creating an intermediate named result
- Each step's output can be sent to multiple output targets using INTO
WITH Step1 AS (
    SELECT Count(*) AS CountTweets, Topic
    FROM TwitterStream PARTITION BY PartitionId
    GROUP BY TumblingWindow(second, 3), Topic, PartitionId
),
Step2 AS (
    SELECT Avg(CountTweets)
    FROM Step1
    GROUP BY TumblingWindow(minute, 3)
)
SELECT * INTO Output1 FROM Step1
SELECT * INTO Output2 FROM Step2
SELECT * INTO Output3 FROM Step2
Scaling Concepts – Partitions
- When a query is partitioned, input events are processed and aggregated in separate partition groups; output events are produced for each partition group
- To read from Event Hubs, ensure that the number of partitions matches, and the query within the step must have the PARTITION BY keyword
- If your input is a partitioned event hub, we can write partitioned queries and partitioned subqueries (WITH clause)
- A non-partitioned query with a 3-fold partitioned subquery gives (1 + 3) steps × 6 = 24 streaming units!

SELECT Count(*) AS Count, Topic
FROM TwitterStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), Topic, PartitionId

(Figure: a 3-partition Event Hub, PartitionId 1–3, feeding Stream Analytics; each partition produces its own query result.)
Out of order inputs
Event Hub guarantees monotonicity of the timestamp on each partition of the Event Hub, and all events from all partitions are merged by timestamp order: there will be no out-of-order events. When it's important for you to use the sender's timestamp, so a timestamp from the event payload is chosen using TIMESTAMP BY, several sources of disorder can be introduced:
- producers of the events have clock skews
- network delay from the producers sending the events to Event Hub
- clock skews between Event Hub partitions
Do we skip out-of-order events (drop) or do we pretend they happened just now (adjust)?
Handling out of order events
On the configuration tab you will find the defaults. Using 0 seconds as the out-of-order tolerance window means you assert all events are in order all the time.
To allow ASA to correct the disorder, you can specify a non-zero out-of-order tolerance window size. ASA will buffer events up to that window and reorder them, using the user-chosen timestamp, before applying the temporal transformation. Because of the buffering, the side effect is that the output is delayed by the same amount of time. As a result, you will need to tune the value to reduce the number of out-of-order events and keep the latency low.
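The buffer-and-reorder behavior can be sketched with a small priority queue. This is an illustrative model of the tolerance window, not ASA's actual implementation; the arrival sequence is made up:

```python
# Sketch of an out-of-order tolerance window: hold events in a buffer
# until a `tolerance` seconds newer timestamp has been seen, then emit
# in timestamp order. With tolerance 0 nothing is buffered, so late
# events pass through out of order.
import heapq

def reorder(arrivals, tolerance):
    """arrivals: event timestamps (seconds) in arrival order."""
    buffer, out = [], []
    max_seen = float("-inf")
    for ts in arrivals:
        max_seen = max(max_seen, ts)
        heapq.heappush(buffer, ts)
        # events older than (newest timestamp - tolerance) can no longer
        # be overtaken by a late arrival, so they are safe to emit
        while buffer and buffer[0] <= max_seen - tolerance:
            out.append(heapq.heappop(buffer))
    while buffer:                    # drain at end of stream
        out.append(heapq.heappop(buffer))
    return out

arrivals = [1, 4, 2, 6, 5, 9]                         # t=2 and t=5 arrive late
assert reorder(arrivals, tolerance=3) == [1, 2, 4, 5, 6, 9]   # reordered
assert reorder(arrivals, tolerance=0) == [1, 4, 2, 6, 5, 9]   # passed through
```

The larger the tolerance, the longer an event sits in the buffer before being emitted, which is exactly the output delay the slide describes.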
CONCLUSIONS
Summary
- Azure Stream Analytics is the PaaS solution for analytics on streaming data
- It is programmable with a SQL-like language
- Handling time is a special and central feature
- It scales with cloud principles: elastic, self-service, multitenant, pay per use
More questions: other solutions, pricing, what to do with that data, futures
Microsoft real-time stream processing options
- StreamInsight (complex event processing in SQL Server): on-premises or Azure IaaS; not open source; .NET/LINQ; Visual Studio tooling
- Azure Stream Analytics: Azure PaaS; not open source; SQL; web browser tooling; ease of development and operationalization
- Apache Storm (HDInsight): Azure PaaS; open source; SCP.NET, Java, Python; Visual Studio tooling; flexibility and customizability
Apache Storm (in HDInsight)
- Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution
- Storm was originally used by Twitter to process massive streams of data from the Twitter firehose
- Today, Storm is a top-level project of the Apache Software Foundation
- Typically, Storm is integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs
Stream Analytics vs Apache Storm
- Storm: data transformation; can handle more dynamic data (if you're willing to program); requires programming
- Stream Analytics: ease of setup; JSON and CSV formats only; can change queries within 4 minutes; only takes inputs from Event Hub and Blob Storage; only outputs to Azure Blob, Azure Tables, Azure SQL, PowerBI
Pricing
Pricing is based, per job, on the volume of data processed and the streaming units required to process the data stream.
- Volume of data processed: the volume of data processed by the streaming job (in GB): € 0.0009 per GB
- Streaming unit: a blended measure of CPU, memory and throughput: € 0.0262 per hour (about € 18.864 per month)
Azure Machine Learning
Understand the "sequence" of data in the history to predict the future: Azure can 'learn' which values preceded issues.
Power BI
Solutions to create real-time dashboards; a SaaS service, inside Office 365.
Futures
https://feedback.azure.com/forums/270577-azure-stream-analytics
- [started] Native integration with Azure Machine Learning
- (done this night!) Provide better ways to debug
- [planned] Call to a REST endpoint to invoke custom code
- [under review] Take input from DocumentDb; use SQL Azure as reference data
Thanks Marco Parenzan
http://twitter.com/marco_parenzan http://www.slideshare.net/marcoparenzan http://www.github.com/marcoparenzan