
Streaming Data Mining

PRESENTED BY Edo Liberty | April 11, 2014

Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission.

Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.


Single machine data mining

[Diagram labels: The World, Data, Computation, Result]


Distributed storage

[Diagram labels: The World, Data (×4), Computation, Result]


Distributed model (map/reduce, message passing, …)

[Diagram labels: The World, Data + Compute (×8), Computation, Result]


Distributed model (indexes, tables, databases, …)

[Diagram labels: The World, Data + Compute (×8), Computation (×2), Query, Result]

207 big-data infographics (meta infographic)




The streaming model

[Diagram labels: The World, Computation, Sketch, Query, Algorithm, Result]


The parallel streaming model

[Diagram labels: The World, Compute + Sketch (×4), Aggregate + Sketch, Query, Algorithm, Result]
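A minimal sketch of the parallel streaming idea, assuming a summary that can be merged: each partition is scanned once to produce a small sketch, and the sketches are then aggregated. The names `Summary`, `sketch`, and `aggregate` are hypothetical and chosen only for this illustration; a count-and-sum summary stands in for a real sketch.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    """A tiny mergeable summary: item count and item sum."""
    count: int = 0
    total: int = 0


def sketch(partition):
    """One pass over a single partition (a 'Compute + Sketch' box)."""
    count = total = 0
    for x in partition:
        count += 1
        total += x
    return Summary(count, total)


def aggregate(a, b):
    """Merge two summaries (the 'Aggregate + Sketch' box)."""
    return Summary(a.count + b.count, a.total + b.total)


# The stream 1 7 8 1 0 1 7 7 split across four workers (an assumed split).
partitions = [[1, 7], [8, 1], [0, 1], [7, 7]]
merged = reduce(aggregate, (sketch(p) for p in partitions), Summary())
```

The merged summary is the same as the one obtained by sketching the whole stream on one machine, which is the property that makes a sketch usable in the parallel model.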


The streaming model (more accurately)

[Diagram labels: stream 1 7 8 1 0 1 7 7, Iterator, Computation, Sketch, Result]

- O(n) items
- O(polylog(n)) space
- O(polylog(n)) computation per item
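To make the resource bounds concrete, here is a minimal sketch of a one-pass computation that reads items from an iterator and keeps only three numbers of state. It uses Welford's online update for the mean and variance, which is a technique chosen here for illustration and is not taken from the slides; the function name is hypothetical.

```python
def running_mean_variance(items):
    """Single pass over an iterator, constant work and constant state per item."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in items:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance


# The example stream from the slide, consumed through an iterator.
count, mean, variance = running_mean_variance(iter([1, 7, 8, 1, 0, 1, 7, 7]))
```

The items are never stored, so the memory used does not grow with the length of the stream.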


Communication complexity

[Diagram labels: stream 1 7 8 1 0 1 7 7, Iterator (×2), Sketch, Result]

Frequent items

Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.


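[Figure: illustration of the frequent items algorithm. Labels: $d$, $n$, $\ell$; true counts $f(\cdot) = 5$; approximate counts $f'(\cdot) = 0$ and $f'(\cdot) = 2$.]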


The proof (very short)

Assume we do this (the delete step) $t$ times.

First fact: $f'(x) \le f(x)$

Second fact: $f'(x) \ge f(x) - t$


The proof (very short)

Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$

Which gives $t \le n/\ell$. In words: we can only delete $\ell$ items $n/\ell$ times!

$|f'(x) - f(x)| \le n/\ell$ ∎
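The three facts can be checked on a small stream. The following is a minimal sketch of a Misra-Gries style counter, under the assumption that each delete step removes exactly $\ell$ items (one copy of each of $\ell$ distinct items), so that it matches the $t \cdot \ell$ term in the proof; other presentations of the algorithm differ slightly in when they delete. The names `misra_gries` and `ell` are illustrative.

```python
from collections import Counter


def misra_gries(stream, ell):
    """Return (approximate counts f', number of delete steps t)."""
    counters = {}
    t = 0
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) == ell:
            # ell distinct items are held: delete one copy of each of them.
            t += 1
            counters = {k: v - 1 for k, v in counters.items() if v > 1}
    return counters, t


stream = [1, 7, 8, 1, 0, 1, 7, 7]
ell = 3
approx, t = misra_gries(stream, ell)  # approx == {1: 1, 7: 1}, t == 2
exact = Counter(stream)
n = len(stream)

for x in exact:
    assert approx.get(x, 0) <= exact[x]      # first fact
    assert approx.get(x, 0) >= exact[x] - t  # second fact
assert sum(approx.values()) == n - t * ell   # third fact
assert t <= n / ell                          # hence |f'(x) - f(x)| <= n / ell
```

At most $\ell$ counters are held at any time, so the space used depends on $\ell$ and not on the length of the stream.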

Useful form…

Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$.

We get that $|p'(x) - p(x)| \le 1/\ell$.

This is very useful for keeping approximate distributions!
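Spelling out the step from counts to probabilities, using the bound $|f'(x) - f(x)| \le n/\ell$ from the proof and dividing both sides by $n$:

```latex
|p'(x) - p(x)| = \frac{|f'(x) - f(x)|}{n} \le \frac{n/\ell}{n} = \frac{1}{\ell}
```

For instance, with $\ell = 100$ counters every estimated probability is within $0.01$ of the true one, independent of the stream length $n$.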

Threading Machine Generated Email


Email threads

A simple email thread (that’s not very hard to do…)

Threading Machine Generated Email


Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013





What else can we do in the streaming model…


Items (words, IP addresses, events, clicks, ...):
- Item frequencies
- Counting distinct elements
- Moment and entropy estimation
- Approximate set operations

Vectors (text documents, images, example features, ...):
- Dimensionality reduction
- Clustering (k-means, k-median, …)
- Linear regression
- Machine learning (some of it at least)

Matrices (text corpora, user preferences, graphs, ...):
- Covariance matrix estimation
- Low rank approximation
- Sparsification
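As one illustration of the item-level tasks listed above, counting distinct elements can be approximated in a single pass with a small state. The sketch below uses the K-minimum-values (KMV) estimator, which is chosen here for illustration and is not described in these slides; the function names and the default `k=256` are assumptions.

```python
import hashlib
import heapq


def _hash01(item):
    """Map an item to a pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k=256):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []       # negated hash values, so heap[0] is the largest one kept
    kept = set()    # hash values currently in the heap
    for item in stream:
        h = _hash01(item)
        if h in kept:
            continue  # duplicate of an item already represented
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes seen: exact
    return (k - 1) / -heap[0]    # KMV estimate from the k-th smallest hash
```

The state is at most `k` hash values, and repeated items do not change the estimate because they hash to the same value.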

Thanks!


