Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
2 Yahoo Confidential & Proprietary
Single machine data mining
[Diagram labels: The World, Data, Computation, Result]
Distributed storage
[Diagram labels: The World, four Data nodes, Computation, Result]
Distributed model (map/reduce, message passing, …)
[Diagram labels: The World, eight Data + Compute nodes, Computation, Result]
Distributed model (indexes, tables, databases, …)
[Diagram labels: The World, eight Data + Compute nodes, Computation, Query, Result]
207 big-data infographics (meta infographic)
The streaming model
[Diagram labels: The World, Computation, Sketch, Query, Algorithm, Result]
The parallel streaming model
[Diagram labels: The World, four Compute + Sketch nodes, Aggregate + Sketch, Query, Algorithm, Result]
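As a rough illustration of the parallel streaming model, the sketch below builds one small summary per partition and then aggregates the summaries. The `CountSumSketch` class, its methods, and the partitioning are hypothetical names chosen for this example, not something taken from the slides. It is the simplest mergeable sketch one could pick: a count and a sum.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CountSumSketch:
    """Toy mergeable sketch (an assumption for illustration): O(1) state."""
    count: int = 0
    total: float = 0.0

    def update(self, item: float) -> None:
        # Constant work per stream item.
        self.count += 1
        self.total += item

    def merge(self, other: "CountSumSketch") -> "CountSumSketch":
        # Aggregation step: combine two sketches without revisiting the data.
        return CountSumSketch(self.count + other.count, self.total + other.total)

    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def sketch_partition(items: Iterable[float]) -> CountSumSketch:
    """Compute + Sketch: one pass over a single partition."""
    sketch = CountSumSketch()
    for item in items:
        sketch.update(item)
    return sketch


def aggregate(sketches: List[CountSumSketch]) -> CountSumSketch:
    """Aggregate + Sketch: merge the per-partition sketches."""
    result = CountSumSketch()
    for s in sketches:
        result = result.merge(s)
    return result


partitions = [[1, 7, 8], [1, 0], [1, 7, 7]]
merged = aggregate([sketch_partition(p) for p in partitions])
single = sketch_partition([x for p in partitions for x in p])
```

Here `merged` and `single` hold the same state, which is the property that lets each node work on its own share of the stream independently.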
The streaming model (more accurately)
[Diagram labels: Iterator over the stream 1 7 8 1 0 1 7 7, Computation, Sketch, Result]
O(n) Items
O(polylog(n)) Space
O(polylog(n)) Computation per item
Communication complexity
[Diagram labels: two Iterators over the stream 1 7 8 1 0 1 7 7, Sketch, Result]
Frequent items
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
[Figure: the items of the stream, with labels d and n. $f(x)$ is the number of times item $x$ appears in the stream, e.g. $f(x) = 5$ for the highlighted item.]
[Figures, animated over several slides: the sketch, labelled $\ell$, processes the stream. Its approximate counts are written $f'(x)$; in the final frame one item has $f'(x) = 0$ and another has $f'(x) = 2$.]
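The counting scheme animated on these slides is not recoverable from the text alone, so the following is only a minimal sketch of a Misra-Gries style frequent items algorithm, under stated assumptions. The function name `frequent_items` and the parameter `ell` are hypothetical. The variant is chosen so that each delete step removes exactly one occurrence of `ell` distinct items, which is what the counting argument in the proof relies on.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def frequent_items(
    stream: Iterable[Hashable], ell: int
) -> Tuple[Dict[Hashable, int], int]:
    """Return (approximate counts f', number of delete steps t).

    Assumed variant: count every arriving item; as soon as ell distinct
    items are tracked, subtract one from each and drop the zeros.
    """
    if ell < 1:
        raise ValueError("ell must be at least 1")
    counters: Dict[Hashable, int] = {}
    deletions = 0
    for item in stream:
        counters[item] = counters.get(item, 0) + 1
        if len(counters) == ell:
            # Delete step: removes exactly ell counts in total.
            counters = {x: c - 1 for x, c in counters.items() if c > 1}
            deletions += 1
    return counters, deletions


stream = [1, 7, 8, 1, 0, 1, 7, 7]
ell = 3
approx, t = frequent_items(stream, ell)
exact = Counter(stream)  # true counts f, kept only to check the sketch
n = len(stream)
```

On this stream the sketch never tracks more than `ell - 1` items between arrivals, and every approximate count falls short of its true count by no more than the number of delete steps `t`.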
The proof (very short)
Assume we perform the delete step $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$
Which gives $t \le n/\ell$. In words: we can delete items at most $n/\ell$ times! ∎
The proof (very short)
Combining the three facts, $0 \le f(x) - f'(x) \le t \le n/\ell$, so $|f'(x) - f(x)| \le n/\ell$
Useful form…
Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$.
We get that $|p'(x) - p(x)| \le 1/\ell$.
This is very useful for keeping approximate distributions!
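To make the bound concrete, here is a small worked example. The target error $\varepsilon = 0.01$ and the stream length $n = 10^{9}$ are hypothetical numbers chosen for illustration; they do not come from the slides.

```latex
\[
  |p'(x) - p(x)| \le \frac{1}{\ell} \le \varepsilon
  \quad\Longleftarrow\quad
  \ell \ge \frac{1}{\varepsilon}
\]
\[
  \varepsilon = 0.01 \;\Rightarrow\; \ell = 100,
  \qquad
  n = 10^{9} \;\Rightarrow\; |f'(x) - f(x)| \le \frac{n}{\ell} = 10^{7}.
\]
```

The sketch size depends only on the target error, not on the stream length. Since $p'(x) \ge p(x) - 1/\ell$, any item with $p(x) > 1/\ell$ is guaranteed to keep a nonzero count in the sketch.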
Threading Machine Generated Email
Email threads
A simple email thread (that’s not very hard to do…)
Items (words, IP addresses, events, clicks, ...):
- Item frequencies
- Counting distinct elements
- Moment and entropy estimation
- Approximate set operations
Vectors (text documents, images, example features, ...):
- Dimensionality reduction
- Clustering (k-means, k-median, …)
- Linear regression
- Machine learning (some of it at least)