The Stanford Data Streams Research Project

Feb 25, 2016

Page 1: The Stanford Data Streams Research Project

The Stanford Data Streams Research Project

Profs. Rajeev Motwani & Jennifer Widom

And a cast of full- and part-time students: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma

stanfordstreamdatamanager

Page 2: The Stanford Data Streams Research Project

Data Streams
• Traditional DBMS -- data stored in finite, persistent data sets
• New applications -- data as multiple, continuous, rapid, time-varying data streams
  – Network monitoring and traffic engineering
  – Security applications
  – Telecom call records
  – Financial applications
  – Web logs and click-streams
  – Sensor networks
  – Manufacturing processes

Page 3: The Stanford Data Streams Research Project

Challenges
• Multiple, continuous, rapid, time-varying streams of data
• Queries may be continuous (not just one-time)
  – Evaluated continuously as stream data arrives
  – Answer updated over time
• Queries may be complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing

Page 4: The Stanford Data Streams Research Project

Using Traditional Database

[Diagram; labels: User/Application, Loader, Query, Result]

Page 6: The Stanford Data Streams Research Project

New Approach for Data Streams

[Diagram; labels: User/Application, Register Query, Stream Query Processor, Result, Scratch Space (Memory and/or Disk), Data Stream Management System (DSMS)]
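
The register-then-stream flow in this diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `MiniDSMS`, its `register` and `push` methods, and the packet fields `link` and `bytes` are hypothetical names invented here, not part of the STREAM system.

```python
from typing import Callable, Dict, List


class MiniDSMS:
    """Toy stream query processor (hypothetical; not the STREAM system's API)."""

    def __init__(self) -> None:
        self.queries: Dict[str, Callable[[dict], bool]] = {}
        self.results: Dict[str, List[dict]] = {}

    def register(self, name: str, predicate: Callable[[dict], bool]) -> None:
        # The query is registered once and then stays active.
        self.queries[name] = predicate
        self.results[name] = []

    def push(self, tup: dict) -> None:
        # Every arriving tuple is evaluated against all registered queries.
        for name, predicate in self.queries.items():
            if predicate(tup):
                self.results[name].append(tup)


dsms = MiniDSMS()
dsms.register("big_packets", lambda t: t["bytes"] > 1000)
for packet in [{"link": "a", "bytes": 1500}, {"link": "b", "bytes": 40}]:
    dsms.push(packet)
```

The point of the sketch is the inversion relative to a traditional database: the query is stored and the data flows past it, rather than the data being stored and queries flowing past it.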

Page 12: The Stanford Data Streams Research Project

DBMS versus DSMS

DBMS                                      DSMS
• Persistent relations                    • Transient streams (and persistent relations)
• One-time queries                        • Continuous queries
• Random access                           • Sequential access
• Access plan determined by query         • Unpredictable data arrival and
  processor and physical DB design          characteristics
• “Unbounded” disk store                  • Bounded main memory

Page 13: The Stanford Data Streams Research Project

Sample Applications
• Network management and traffic engineering (e.g., Sprint)
  – Streams of measurements and packet traces
  – Queries: detect anomalies, adjust routing
• Telecom call data (e.g., AT&T)
  – Streams of call records
  – Queries: fraud detection, customer call patterns, billing

Page 14: The Stanford Data Streams Research Project

Sample Applications (cont’d)
• Network security (e.g., iPolicy, NetForensics/Cisco, Netscreen)
  – Network packet streams, user session information
  – Queries: URL filtering, detecting intrusions & DOS attacks & viruses
• Financial applications (e.g., Traderbot)
  – Streams of trading data, stock tickers, news feeds
  – Queries: arbitrage opportunities, analytics, patterns

Page 15: The Stanford Data Streams Research Project

Sample Applications (cont’d)
• Web tracking and personalization (e.g., Yahoo, Google, Akamai)
  – Clickstreams, user query streams, log records
  – Queries: monitoring, analysis, personalization
• Truly massive databases (e.g., Astronomy Archives)
  – Stream the data by once (or over and over)
  – Queries do the best they can

Page 16: The Stanford Data Streams Research Project

Making Things Concrete
• Database = two streams of mobile call records
  – Outgoing(connectionID, caller, start, end)
  – Incoming(connectionID, callee, start, end)
• Query language = SQL
  FROM clauses can refer to streams and/or relations

Page 17: The Stanford Data Streams Research Project

Query Example 1
• Find all outgoing calls longer than 2 minutes (relational selection)
    SELECT O.connectionID, O.caller
    FROM Outgoing O
    WHERE O.end - O.start > 2
• Result requires unbounded storage
• Can provide result as data stream
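
As a rough illustration of why a selection can be delivered as a data stream, the sketch below evaluates the predicate tuple by tuple with a Python generator. The dictionary representation of `Outgoing` tuples and the function name `long_calls` are assumptions made for this example, and `start`/`end` are assumed to be in minutes.

```python
from typing import Dict, Iterable, Iterator, Tuple


def long_calls(outgoing: Iterable[Dict], threshold: float = 2) -> Iterator[Tuple[int, str]]:
    """Stateless selection: each tuple is tested on arrival, then emitted or dropped."""
    for o in outgoing:
        if o["end"] - o["start"] > threshold:
            yield (o["connectionID"], o["caller"])


# Hypothetical sample data standing in for the Outgoing stream.
outgoing = [
    {"connectionID": 1, "caller": "alice", "start": 0, "end": 5},
    {"connectionID": 2, "caller": "bob", "start": 3, "end": 4},
    {"connectionID": 3, "caller": "carol", "start": 10, "end": 13},
]
result = list(long_calls(outgoing))
```

The operator itself keeps no state between tuples; storage grows without bound only if the full result is materialized (as `list(...)` does here) instead of being passed downstream as a stream.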

Page 18: The Stanford Data Streams Research Project

Query Example 2
• Pair up callers and callees (relational join)
    SELECT O.caller, I.callee
    FROM Outgoing O, Incoming I
    WHERE O.connectionID = I.connectionID
• Can still provide result as data stream
• Requires unbounded temporary storage (without additional assumptions)
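
One way to see where the unbounded temporary storage comes from is a symmetric hash join, sketched below. This is a standard textbook technique swapped in for illustration, not necessarily how the project evaluates the query; the class and method names are hypothetical.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Joins Outgoing and Incoming on connectionID as tuples arrive on either side."""

    def __init__(self):
        self.outgoing = defaultdict(list)  # connectionID -> callers seen so far
        self.incoming = defaultdict(list)  # connectionID -> callees seen so far

    def on_outgoing(self, connection_id, caller):
        # Remember the tuple, then probe the other side's table for matches.
        self.outgoing[connection_id].append(caller)
        return [(caller, callee) for callee in self.incoming.get(connection_id, [])]

    def on_incoming(self, connection_id, callee):
        self.incoming[connection_id].append(callee)
        return [(caller, callee) for caller in self.outgoing.get(connection_id, [])]

    def state_size(self):
        # Nothing is ever discarded, so this grows with the streams.
        return sum(len(v) for v in self.outgoing.values()) + \
               sum(len(v) for v in self.incoming.values())


join = SymmetricHashJoin()
emitted = []
emitted += join.on_outgoing(1, "alice")
emitted += join.on_incoming(2, "dave")
emitted += join.on_incoming(1, "bob")
emitted += join.on_outgoing(2, "carol")
```

Output pairs are appended as soon as both halves of a connection have arrived, so the result is a stream. The stored state can only be trimmed under an additional assumption, for example that each `connectionID` appears at most once per stream.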

Page 19: The Stanford Data Streams Research Project

Query Example 3
• Find total connection time for each caller (relational grouping and aggregation)
    SELECT O.caller, sum(O.end - O.start)
    FROM Outgoing O
    GROUP BY O.caller
• Cannot provide result in (append-only) stream
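
The sketch below illustrates why an append-only output does not fit this query: a running aggregate has to revise a caller's total each time that caller places another call. The function name `running_totals` and the sample tuples are assumptions for illustration only.

```python
from collections import defaultdict


def running_totals(outgoing):
    """Emit (caller, total so far) after every tuple; later rows supersede earlier ones."""
    totals = defaultdict(int)
    for o in outgoing:
        totals[o["caller"]] += o["end"] - o["start"]
        yield (o["caller"], totals[o["caller"]])


# Hypothetical sample data standing in for the Outgoing stream.
calls = [
    {"caller": "alice", "start": 0, "end": 5},
    {"caller": "bob", "start": 3, "end": 4},
    {"caller": "alice", "start": 10, "end": 13},
]
updates = list(running_totals(calls))
```

Here the first row for `alice` is invalidated by the third, so a consumer must treat the output as a sequence of updates to a stored answer rather than as a growing set of final result tuples.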

Page 20: The Stanford Data Streams Research Project

Project Goal
Reconsider all aspects of data management and processing in presence of data streams

Page 21: The Stanford Data Streams Research Project

Remainder of Talk
• Data stream model
• Queries over data streams
  – Language, semantics, evaluation & optimization
• DSMS query processing architecture and system internals
• Results to date
• Ongoing work
• Related work

Page 22: The Stanford Data Streams Research Project

Data Model
• Database: relations + data streams
• Stream characteristics
  – Type of data (schema)
  – Data distribution
  – Flow rate
  – Stability of distribution and flow
  – Ordering and other constraints
  – Synchronization of multiple streams
  – Distributed streams

Page 23: The Stanford Data Streams Research Project

Data Stream Queries -- Basic Issues
• Answer availability
  – One-time
  – Multiple-time
  – Continuous (“standing”), stored or streamed
• Registration time
  – Predefined
  – Ad hoc
• Stream access
  – Arbitrary
  – Sliding window (special case: size = 1)
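
A sliding window restricts a query to the most recent part of a stream. The following minimal sketch assumes a count-based window (the last `size` tuples) and maintains a sum incrementally; the class name `SlidingWindowSum` is hypothetical, and time-based windows would need timestamps instead of a count.

```python
from collections import deque


class SlidingWindowSum:
    """Sum over the last `size` values of a stream, updated in O(1) per arrival."""

    def __init__(self, size: int) -> None:
        if size < 1:
            raise ValueError("window size must be at least 1")
        self.size = size
        self.window = deque()
        self.total = 0

    def add(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # The oldest value expires as the window slides forward.
            self.total -= self.window.popleft()
        return self.total


w = SlidingWindowSum(3)
sums = [w.add(v) for v in [4, 1, 7, 2, 5]]
```

With `size = 1` the window holds only the newest element, which corresponds to the element-at-a-time special case noted on the slide.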

Page 25: The Stanford Data Streams Research Project

Query Language & Semantics
• Specifying queries over streams
  – SQL-like versus dataflow network of operators
  – Sliding windows as first-class query construct
• Semantic issues
  – Blocking operators, e.g., aggregation, order-by
  – Streams as sets versus lists
  – Timestamping

Page 26: The Stanford Data Streams Research Project

Query Evaluation -- Approximation
• Why approximate?
  – Streams are coming too fast
  – Exact answer requires unbounded storage or significant computational resources
  – Ad hoc queries reference history
• Issues in approximation
  – Sliding windows, sampling, synopses, …
  – How is approximation controlled?
  – How is it understood by user?
• Accuracy-efficiency-storage tradeoff
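
Sampling is one of the approximation techniques listed above. As a hedged illustration, the sketch below uses reservoir sampling (the classic "Algorithm R"), which keeps a uniform random sample of fixed size `k` from a stream of unknown length. The slides do not say which sampling scheme the project uses; the function name and parameters are chosen here for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items using O(k) memory."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample


sample = reservoir_sample(range(10_000), k=100, rng=random.Random(0))
estimated_mean = sum(sample) / len(sample)  # approximate answer from the synopsis
```

The parameter `k` is the knob in the accuracy-storage tradeoff: a larger reservoir costs more memory and gives tighter estimates.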

Page 27: The Stanford Data Streams Research Project

Query Evaluation -- Adaptivity
• Why adaptivity?
  – Queries are long-running
  – Fluctuating stream arrival & data characteristics
  – Evolving query loads
• Issues in adaptivity
  – Adaptive resource allocation (memory, computation)
  – Adaptive query execution plans

Page 28: The Stanford Data Streams Research Project

Query Evaluation -- Multiple Queries
• Possibly large number of continuous queries
• Long-running
• Shared resources
• Multi-query optimization

Page 29: The Stanford Data Streams Research Project

Query Evaluation -- Distributed Streams
1. Many physical streams but one logical stream
  – E.g., maintain top 100 visited pages at Yahoo
2. Correlate streams at distributed servers
  – E.g., network monitoring
3. Many streams controlled by a few servers
  – E.g., sensor networks
• Issues
  – Move processing to streams, not streams to processor
  – Approximation-bandwidth tradeoff
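
The "many physical streams, one logical stream" case can be sketched by counting locally at each server and merging only the compact summaries. This is a simplified illustration with hypothetical function names and data; it ships exact per-server counts, whereas a bandwidth-limited system would presumably ship truncated or approximate summaries.

```python
from collections import Counter


def local_summary(page_visits):
    """Runs at each server: reduce the raw click stream to per-page counts."""
    return Counter(page_visits)


def global_top_k(summaries, k):
    """Runs at the coordinator: merge the summaries and rank pages."""
    merged = Counter()
    for summary in summaries:
        merged.update(summary)  # adds counts page by page
    return merged.most_common(k)


# Hypothetical click streams observed at two servers.
server_a = ["home", "news", "home", "mail"]
server_b = ["news", "home", "news", "news"]
top2 = global_top_k([local_summary(server_a), local_summary(server_b)], k=2)
```

Only the summaries cross the network, which is the sense in which processing moves to the streams rather than the streams to the processor.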

Page 30: The Stanford Data Streams Research Project

Query Processing Architecture

[Diagram; labels: Input Data Streams, Output Stream, Synopses, Query Plans, Waiting Op, Ready Op, Running Op. Annotations: Users issue continuous and ad-hoc queries. Applications register continuous queries. Administrator can monitor query execution and adjust run-time parameters.]

Page 31: The Stanford Data Streams Research Project

DSMS Internals
• Query plans: operators, synopses, queues
• Memory management
  – Dynamic allocation to buffers, queues, synopses
  – Accuracy vs. memory tradeoff
  – Operators adapt gracefully to memory reallocation
• Scheduler
  – Handles variable-rate input streams
  – Handles varying operator and query requirements

Page 32: The Stanford Data Streams Research Project

Some Results to Date
• Algorithms on data streams
  – Online clustering [FOCS 2000, ICDE 2002]
  – Online quantiles [SIGMOD 98, SIGMOD 99]
  – Statistics over sliding windows [SODA 2002]
  – Online frequency counting
• Theory of stream query processing
  – Memory requirements of stream queries [PODS 2002]
• System design
  – STREAM: Stanford Stream Data Manager
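
To give a feel for online frequency counting in bounded memory, the sketch below implements the Misra-Gries summary. This is a well-known algorithm used here purely as an illustration; the slide does not name the project's own algorithm, and this should not be read as it. With at most `k - 1` counters, every item occurring more than `n / k` times in a stream of length `n` is retained, and each stored count underestimates the true count by at most `n / k`.

```python
def misra_gries(stream, k):
    """Approximate frequent-item counts using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
counters = misra_gries(stream, k=3)
```

In this run `"a"` occurs 5 times out of 9, which exceeds `n / k = 3`, so it is guaranteed to survive in the summary.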

Page 33: The Stanford Data Streams Research Project

STREAM System Implementation
• Comprehensive DSMS query processor
• Broad suite of operators and synopses
• Sophisticated “developer’s workbench” interface
  – Submit queries in extended SQL or algebra
  – Submit or edit query plans in XML or GUI
  – Query plan execution visualizer
  – On-the-fly modification of memory allocation, scheduling policies, etc.

Page 34: The Stanford Data Streams Research Project

Ongoing Work
• Algebra for streams
• Synopses and algorithmic issues
• Memory management issues
• Exploiting constraints on streams
• Approximation in query processing
• Distributed stream processing
• System development

Page 36: The Stanford Data Streams Research Project

Ongoing Work -- Constraints
• Exploiting constraints on streams in query processing
  – Foreign-key joins, referential integrity, clustering, ordering
  – Need not be exact (e.g., k-clustered)
  – Reduce memory requirements
  – Unblock blocking operators
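
A small sketch of how a constraint can unblock a blocking operator: if the stream is assumed to be exactly clustered on `caller` (all tuples for a caller arrive contiguously), a per-caller total can be emitted as a final, append-only result as soon as the caller changes, using constant memory. The function name and sample data are hypothetical, and the approximate case (k-clustered) mentioned on the slide is not handled here.

```python
def grouped_totals_clustered(outgoing):
    """Group-by sum that assumes the input is clustered on caller."""
    current, total, started = None, 0, False
    for o in outgoing:
        if started and o["caller"] != current:
            # The constraint guarantees `current` never reappears: its total is final.
            yield (current, total)
            total = 0
        current, started = o["caller"], True
        total += o["end"] - o["start"]
    if started:
        yield (current, total)  # only reached if the stream ends


# Hypothetical sample data, clustered on caller.
calls = [
    {"caller": "alice", "start": 0, "end": 5},
    {"caller": "alice", "start": 10, "end": 13},
    {"caller": "bob", "start": 3, "end": 4},
    {"caller": "carol", "start": 1, "end": 3},
]
totals = list(grouped_totals_clustered(calls))
```

Without the clustering assumption the same query needs one counter per caller and can never declare any total final.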

Page 37: The Stanford Data Streams Research Project

Ongoing Work -- Approximation in Query Processing
• Understanding behavior of approximate operators when composed
• Memory allocation to operators in a plan, given per-operator memory-accuracy curve
• Best query plan, assuming best memory allocation
• Multiple (weighted) queries sharing resources

Page 38: The Stanford Data Streams Research Project

Related Work
• Triggers, alerters, materialized views, continuous queries on conventional DBs, pub/sub, sequence & temporal databases, …
• Telegraph project at UC Berkeley
• Niagara project at Wisconsin/OGI
• Amazon project at Cornell
• Aurora project at Brown/MIT
• And others

Page 39: The Stanford Data Streams Research Project

For Papers and General Info.

http://www-db.stanford.edu/stream