
SQLstream Structure 2012: Back to the future - dataflow comes of age

Nov 14, 2014


SQLstream, Inc.

Dataflow is a technique for parallel computing that emerged from research in the 1970s. It uses graph-based execution models in which data flows along the arcs of a graph and is processed at the nodes. It was decades ahead of its time: hardware was expensive, and mainstream applications did not yet need massively parallel, low-latency computing architectures. Dataflow has now found its place and time with the emergence of Big Data volumes, real-time low-latency requirements, commodity hardware and low-cost storage, and it is driving the architectures of today's real-time big data solutions.
Transcript
Page 1: SQLstream Structure 2012: Back to the future - dataflow comes of age

Copyright © 2012 – Proprietary and Confidential Information of SQLstream Inc.

Back to the Future: Dataflow Finally Comes of Age

Damian Black, CEO, SQLstream

Real-time Big Data with Relational Streaming Dataflow Technology

Page 2: SQLstream Structure 2012: Back to the future - dataflow comes of age


Brief History of Dataflow

What is Dataflow?
» Parallel processing model invented in the 70s
» Graph-based execution, without destructive updates
» Data flow along arcs to nodes, are combined, and flow along output arcs

What happened to Dataflow?
» A number of experimental parallel computers were designed and built
» Transputer and Occam were literally decades ahead of their time
» Due for a resurgence thanks to inexpensive multi-core servers & SQL

What is Relational Streaming?
» A dataflow paradigm for processing Streaming Big Data tuples
» Familiar relational expressions with automatic optimization
» Relational queries executed continuously on a massively parallel scale
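To make the dataflow model on this slide concrete, here is a minimal sketch in Python. It is an illustration only, not SQLstream's engine: the function `run_dataflow` and the graph layout are hypothetical names chosen for this example. A node fires as soon as all of its input arcs carry a value, and every value is written once and never overwritten (no destructive updates).

```python
from collections import deque

def run_dataflow(nodes, edges, inputs):
    """Evaluate a dataflow graph (hypothetical helper, for illustration).

    nodes:  {name: function}, one function per node
    edges:  list of (source, target) arcs; arguments arrive in arc order
    inputs: {name: value} for source tokens that have no incoming arcs
    """
    incoming = {name: [] for name in nodes}
    outgoing = {name: [] for name in list(nodes) + list(inputs)}
    for src, dst in edges:
        incoming[dst].append(src)
        outgoing[src].append(dst)

    values = dict(inputs)      # each value is written once, never updated
    pending = {name: len(srcs) for name, srcs in incoming.items()}
    ready = deque(inputs)      # names whose value is already available

    while ready:
        src = ready.popleft()
        for dst in outgoing[src]:
            pending[dst] -= 1
            if pending[dst] == 0:   # all operands arrived: the node fires
                args = [values[s] for s in incoming[dst]]
                values[dst] = nodes[dst](*args)
                ready.append(dst)
    return values

# (a + b) * (a - b): "sum" and "diff" do not depend on each other,
# so a parallel runtime would be free to evaluate them at the same time.
nodes = {
    "sum": lambda x, y: x + y,
    "diff": lambda x, y: x - y,
    "product": lambda x, y: x * y,
}
edges = [("a", "sum"), ("b", "sum"), ("a", "diff"), ("b", "diff"),
         ("sum", "product"), ("diff", "product")]
result = run_dataflow(nodes, edges, {"a": 5, "b": 3})
print(result["product"])  # 16
```

This sketch runs sequentially; the point is only that execution order is driven by data availability rather than by a program counter.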

Page 3: SQLstream Structure 2012: Back to the future - dataflow comes of age


Dataflow Graph: Pipelined and Superscalar Processing

Relational Streaming: DAGs of fine-grained dataflow.

Page 4: SQLstream Structure 2012: Back to the future - dataflow comes of age


Comparison of Techniques for Dataflow Scaling

Data Distribution
» Hadoop and HDFS: Fat File
» Relational Streaming: Fat Stream

Dataflow Enablement
» Hadoop and HDFS: Generate new tuples from old, leaving old tuples unaltered
» Relational Streaming: Generate new tuples from old, leaving old tuples unaltered

Page 5: SQLstream Structure 2012: Back to the future - dataflow comes of age


Dataflow: Hadoop versus Relational Streaming

Hadoop style: data chunking, coarse-grained dataflow.

Relational Streaming: DAGs of fine-grained dataflow.
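The difference in granularity can be sketched with a toy example in Python. This is not Hadoop or SQLstream code; `batch_counts`, `streaming_counts`, and the record layout are assumptions made for illustration. The batch version produces a result only once per chunk, while the streaming version emits an updated result for every arriving tuple.

```python
def batch_counts(records, chunk_size):
    """Coarse-grained: one result after each whole chunk is processed."""
    results = []
    total = 0
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        total += sum(1 for r in chunk if r["status"] >= 500)
        results.append((start + len(chunk), total))
    return results

def streaming_counts(records):
    """Fine-grained: an updated result for every arriving tuple."""
    total = 0
    for seen, r in enumerate(records, start=1):
        if r["status"] >= 500:
            total += 1
        yield (seen, total)

records = [{"status": s} for s in (200, 500, 200, 503, 200, 500)]

# Each pair is (records seen so far, server errors so far).
batch = batch_counts(records, chunk_size=3)
stream = list(streaming_counts(records))
print(batch)   # [(3, 1), (6, 3)]
print(stream)  # [(1, 0), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)]
```

Both versions reach the same final count; they differ in how soon each intermediate answer becomes available.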

Page 6: SQLstream Structure 2012: Back to the future - dataflow comes of age


» Hadoop Map Reduce Process

» Relational Streaming Approach:

» Continuous Parallel Dataflow Execution

» Real-time Answers Immediately

» Intelligently populate data store: Hadoop or Data Warehouse

[Figure: Parallel Dataflow Execution pipeline with the stages Collect, Clean, Aggregate, Analyze, Deliver; labeled "Low Latency".]
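The five stages named on this slide can be sketched as a chain of Python generators, where each tuple moves through every stage before the next one is read. This is a simplified, single-threaded illustration; the input format ("sensor,value" lines), the stage bodies, and the `threshold` parameter are all assumptions, not part of the original deck.

```python
def collect(lines):
    for line in lines:                 # Collect: read raw events as they arrive
        yield line

def clean(lines):
    for line in lines:                 # Clean: drop blanks, parse "sensor,value"
        line = line.strip()
        if not line:
            continue
        sensor, value = line.split(",")
        yield sensor, float(value)

def aggregate(readings):
    totals = {}                        # Aggregate: running count and sum per sensor
    for sensor, value in readings:
        count, total = totals.get(sensor, (0, 0.0))
        totals[sensor] = (count + 1, total + value)
        yield sensor, value, totals[sensor][1] / totals[sensor][0]

def analyze(rows, threshold):
    for sensor, value, mean in rows:   # Analyze: flag values far above the running mean
        if value > mean * threshold:
            yield sensor, value, mean

def deliver(alerts):
    # Deliver: format alerts for a downstream consumer
    return [f"{s}: {v} exceeds running mean {m:.1f}" for s, v, m in alerts]

raw = ["a,10", "", "a,10", "a,40", "b,5"]
alerts = deliver(analyze(aggregate(clean(collect(raw))), threshold=1.5))
print(alerts)  # ['a: 40.0 exceeds running mean 20.0']
```

Because the stages are lazy generators, no stage waits for the whole input; an alert can be delivered as soon as the tuple that triggers it arrives.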

Page 7: SQLstream Structure 2012: Back to the future - dataflow comes of age


Relational Streaming synergies with Hadoop

[Figure: a cluster of "Hadoop & Relational Streaming Server" nodes, each labeled with the operators Group, Agg, Join, Project, Select; Reduce, Combine, Map, Split; Sort; Order.]

» Relational Stream Processors co-located with Hadoop Servers

» Stream/re-stream into and from local data stores in parallel

» Combination performs Real-time and Historical processing:

» Querying the future – Continuous ETL and Analytics (parallel pipelines)

» Querying the past – Hadoop batch jobs on stored tuples (parallel batches)


Page 8: SQLstream Structure 2012: Back to the future - dataflow comes of age


Application Example – Google: “Youtube Mozilla Glow”

» Mozilla Firefox 4 – Real-time Download Monitor

» Continuous processing of download requests

» Real-time integration with Hadoop and HBase

Page 9: SQLstream Structure 2012: Back to the future - dataflow comes of age


Cloud Monitoring – Detecting Service Error Spikes

» Millions of records per second

» Real-time Bollinger Bands

» Amazon EC2

    -- Outer query: keep only rows whose error count is more than
    -- two standard deviations above the one-minute average for that url.
    SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
    FROM (
        -- Inner query: per-url sliding one-minute average and standard deviation.
        SELECT STREAM ROWTIME, url, "numErrorsLastMinute",
            AVG("numErrorsLastMinute") OVER (
                PARTITION BY url
                RANGE INTERVAL '1' MINUTE PRECEDING) AS "avgErrorsPerMinute",
            STDDEV("numErrorsLastMinute") OVER (
                PARTITION BY url
                RANGE INTERVAL '1' MINUTE PRECEDING) AS "stdDevErrorsPerMinute"
        FROM "ServiceRequestsPerMinute"
    ) AS S
    WHERE S."numErrorsLastMinute" >
          S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";
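For readers less familiar with windowed streaming SQL, the logic of the query on this slide can be approximated in plain Python. This is a sketch under stated assumptions, not a description of how SQLstream executes the query: the function name `error_spikes` and the tuple layout are hypothetical, timestamps are in seconds, the window includes the current row, and the population standard deviation (`statistics.pstdev`) is used because the slide does not say which variant `STDDEV` computes.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

def error_spikes(rows, window_seconds=60, k=2.0):
    """Yield rows whose error count exceeds mean + k * stddev of their window.

    rows: iterable of (rowtime_seconds, url, errors), ordered by rowtime.
    """
    windows = defaultdict(deque)       # one sliding window per url (PARTITION BY url)
    for rowtime, url, errors in rows:
        window = windows[url]
        window.append((rowtime, errors))
        # Evict rows older than the window; the current row always stays.
        while window[0][0] < rowtime - window_seconds:
            window.popleft()
        values = [e for _, e in window]
        if errors > mean(values) + k * pstdev(values):   # upper Bollinger band
            yield rowtime, url, errors

rows = [
    (0, "/api", 1), (10, "/api", 1), (20, "/api", 1),
    (30, "/api", 1), (40, "/api", 1),
    (50, "/api", 10),      # spike relative to the last minute
    (55, "/home", 3),      # different url, separate window
    (200, "/api", 1),      # older /api rows have left the window
]
spikes = list(error_spikes(rows))
print(spikes)  # [(50, '/api', 10)]
```

Note that because the current row is part of its own window, a single outlier can only cross the two-sigma band when the window holds enough rows (six or more with the population standard deviation).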

[Figure: a cluster of stream Server nodes.]

Page 10: SQLstream Structure 2012: Back to the future - dataflow comes of age


A New Streaming Data Management Quadrant

[Figure: quadrant chart. Quadrants: Data Warehouses, Relational Streaming, Hadoop Big Data, Messaging Middleware. Axis labels: "Historical analysis, periodic batches" versus "Continuous analysis, real-time processing"; "High-level declarative language & operation" versus "Low-level procedural language & operation". Region labels: Real-time Big Data, Batched Big Data.]

Page 11: SQLstream Structure 2012: Back to the future - dataflow comes of age


Benefits of Real-time “Big Dataflow” with Relational Streaming

1. Real-time Integration: continuous, real-time data integration

2. Real-time Analysis: process, analyze, and react – all in real-time

3. Real-time Parallel Processing: made easy, auto-optimized, massive scale

Dataflow finally comes of age. Relational Streaming. The Next Wave of Big Data.

Page 12: SQLstream Structure 2012: Back to the future - dataflow comes of age

Query the Future® The Future of Query.

Thanks! Any questions?