Page 1

Dealing with MASSIVE Data

Feifei Li

lifeifei@cs.fsu.edu

Dept Computer Science, FSU

Sep 9, 2008

Page 2

Brief Bio

• B.A.S. in computer engineering from Nanyang Technological University in 2002
• Ph.D. in computer science from Boston University in 2007
• Research intern/visitor at AT&T Labs, IBM T. J. Watson Research Center, and Microsoft Research
• Now: Assistant Professor in the CS Department at FSU

Page 3

Research Areas

Algorithms and Data Structures:
• I/O-efficient algorithms
• streaming algorithms
• computational geometry
• misc.

Database Applications:
• spatial databases
• indexing and query processing
• data security and privacy

Cross-cutting topics:
• Geographic Information Systems
• data streams
• probabilistic data

Page 4

Massive Data

• Massive datasets are being collected everywhere
• Storage management software is a billion-$ industry

Examples (2002):
• Phone: AT&T 20TB phone call database, wireless tracking
• Consumer: WalMart 70TB database, buying patterns
• Web: web crawl of 200M pages and 2000M links, Google’s huge indexes
• Geography: NASA satellites generate 1.2TB per day

Page 5

Example: LIDAR Terrain Data

• Massive (irregular) point sets (1–10m resolution)
  – Becoming relatively cheap and easy to collect
• Appalachian Mountains: between 50GB and 5TB
• Exceeds the memory limit and needs to be stored on disk

Page 6

Example: Network Flow Data

• AT&T IP backbone generates 500 GB per day
• Gigascope: a data stream management system
  – Computes certain statistics over the stream
• Can we do the computation without storing the data?

Page 7

Traditional Random Access Machine (RAM) Model

• Standard theoretical model of computation:
  – Infinite memory (how nice!)
  – Uniform access cost
• This simple model was crucial for the success of the computer industry

[Figure: a program accessing one flat, uniform RAM]

Page 8

How to Deal with MASSIVE Data?

…when there is not enough memory

Page 9

Solution 1: Buy More Memory

• Expensive
• (Probably) not scalable
  – The growth rate of data is higher than the growth rate of memory

Page 10

Solution 2: Cheat! (by random sampling)

• Provides approximate solutions for some problems
  – average, frequency of an element, etc.
• What if we want the exact result?
• Many problems can’t be solved by sampling at all
  – maximum, and all of the problems mentioned later

Page 11

Solution 3: Using the Right Computation Model

• External Memory Model

• Streaming Model

• Probabilistic Model (brief)

Page 12

Computation Model for Massive Data (1): External Memory Model

• Internal memory is limited but fast
• External memory is unlimited but slow

Page 13

Memory Hierarchy

• Modern machines have a complicated memory hierarchy
  – Levels get larger and slower the further they are from the CPU
  – Block sizes and memory sizes differ between levels!
• There have been a few attempts to model the full hierarchy, but none has been successful
  – They are too complicated!

[Figure: CPU → L1 → L2 → RAM → disk]

Page 14

Slow I/O

• Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8–16 KB)
• Important to store/access data so as to take advantage of blocks (locality)
• Disk access is 10^6 times slower than main memory access

[Figure: disk platter – track, magnetic surface, read/write arm and head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Page 15

Puzzle #1: Majority Counting

• A huge file of characters stored on disk
• Question: is there a character that appears > 50% of the time?
• Solution 1: sort + scan
  – A few passes (O(log_{M/B} N)): we will come back to this
• Solution 2: divide-and-conquer
  – Load one chunk into memory at a time: N/M chunks
  – Count each chunk and return its majority
  – The overall majority must be the majority in > 50% of the chunks
  – Iterate until the remaining data is smaller than M
  – Very few passes (O(log_M N)), since the input shrinks geometrically
• Solution 3: O(1) memory, 2 passes (answer to be posted later; a plausible sketch follows)

  b a e c a d a a d a a e a b a a f a g b
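The slides defer the official answer, but one classical fit for “O(1) memory, 2 passes” is the Boyer–Moore majority vote. The sketch below is a plausible reconstruction, assuming the stream can be re-read for the verification pass; it is not the posted solution.

    def majority_two_pass(stream_factory):
        # Pass 1 (Boyer-Moore vote): find the only possible majority candidate.
        candidate, count = None, 0
        for x in stream_factory():
            if count == 0:
                candidate, count = x, 1
            elif x == candidate:
                count += 1
            else:
                count -= 1
        # Pass 2: verify the candidate really appears > 50% of the time.
        total = hits = 0
        for x in stream_factory():
            total += 1
            hits += (x == candidate)
        return candidate if 2 * hits > total else None

    data = "b a e c a d a a d a a e a b a a f a g b".split()
    print(majority_two_pass(lambda: iter(data)))
    # 'a' appears exactly 10 of 20 times here, not strictly > 50%,
    # so the verification pass rejects it and the function returns None.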

Page 16

External Memory Model [AV88]

• N = # of items in the problem instance
• B = # of items per disk block
• M = # of items that fit in main memory
• I/O: move one block between memory and disk
• Performance measure: # of I/Os performed by the algorithm
• We assume (for convenience) that M > B^2

[Figure: CPU (P) with main memory (M), connected to disk (D) by block I/O]

Page 17

Sorting in External Memory

• Break the N elements into N/M chunks of size M each
• Sort each chunk individually in memory
• Merge them together
• Can merge < M/B sorted lists (queues) at once: M/B blocks fit in main memory

Page 18

Sorting in External Memory

• Merge sort:
  – Create N/M memory-sized sorted lists
  – Repeatedly merge lists together, Θ(M/B) at a time
  – O(log_{M/B}(N/M)) phases, using O(N/B) I/Os each
  – Total: O((N/B) log_{M/B}(N/B)) I/Os (a code sketch follows)

[Figure: merge tree – N/M sorted runs merged M/B at a time: (N/M)/(M/B) runs, then (N/M)/(M/B)^2, …, down to 1]
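A minimal Python sketch of this two-phase structure, assuming newline-terminated records that sort lexicographically and a made-up memory budget M (the paths and M are illustrative, not from the talk):

    import heapq, itertools, os, tempfile

    def external_sort(input_path, output_path, M=100_000):
        # Phase 1: read M records at a time, sort each chunk in memory,
        # and write it out as a sorted run (N/M runs in total).
        runs = []
        with open(input_path) as f:
            while True:
                chunk = list(itertools.islice(f, M))
                if not chunk:
                    break
                chunk.sort()
                with tempfile.NamedTemporaryFile('w', delete=False) as run:
                    run.writelines(chunk)
                    runs.append(run.name)
        # Phase 2: one k-way merge of all runs via a heap. This assumes the
        # number of runs (N/M) fits within the M/B buffer limit, i.e., a
        # single merge phase; otherwise merge in rounds, M/B runs at a time.
        readers = [open(name) for name in runs]
        with open(output_path, 'w') as out:
            out.writelines(heapq.merge(*readers))
        for r in readers:
            r.close()
        for name in runs:
            os.remove(name)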

Page 19

External Searching: B-Tree

• Each node (except the root) has fan-out between B/2 and B
• Size: O(N/B) blocks on disk
• Search: O(log_B N) I/Os, following a root-to-leaf path
• Insertion and deletion: O(log_B N) I/Os
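To see what O(log_B N) buys concretely, a back-of-the-envelope check with illustrative numbers (not from the slides):

    # With N = 10**9 items and fan-out B = 1000, a root-to-leaf search
    # path touches ~3 nodes (blocks); a binary tree would touch ~30.
    N, B = 10**9, 1000
    height, n = 0, N
    while n > 1:
        n = -(-n // B)        # ceil(n / B): one level of B-way grouping
        height += 1
    print(height)             # -> 3 block reads per search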

Page 20

Fundamental Bounds

                            Internal       External
• Scanning:                 N              N/B
• Sorting:                  N log N        (N/B) log_{M/B}(N/B)
• Searching:                log_2 N        log_B N

More Results

                            Internal       External
• List ranking:             N              (N/B) log_{M/B}(N/B)
• Minimal spanning tree:    N log N        (N/B) log_{M/B}(N/B) · log log B
• Offline union-find:       N              (N/B) log_{M/B}(N/B)
• Interval searching:       log N + T      log_B N + T/B
• Rectangle enclosure:      log N + T      log N + T/B
• R-tree search:            –              √(N/B) + T/B

Page 21

Does All the Theory Matter?

• Programs developed in the RAM model still run even when there is not enough memory
  – They run on large datasets because the OS moves blocks as needed
• The OS utilizes paging and prefetching strategies
  – But if the program makes scattered accesses, even a good OS cannot take advantage of block access

[Figure: running time vs. data size – running time explodes (thrashing!) once the data outgrows main memory]

Page 22

Toy Experiment: Permuting

• Problem:
  – Input: N elements out of order: 6, 7, 1, 3, 2, 5, 10, 9, 4, 8
    * Each element knows its correct position
  – Output: store them on disk in the right order
• Internal memory solution:
  – Just scan the original sequence and move every element into its right place!
  – O(N) time, O(N) I/Os
• External memory solution:
  – Use sorting
  – O(N log N) time, O((N/B) log_{M/B}(N/B)) I/Os

Page 23

A Practical Example on Real Data

• Computing persistence on large terrain data

Page 24

Takeaways

• Be very careful when your program’s space usage exceeds the physical memory size
• If the program mostly makes highly localized accesses
  – Let the OS handle it automatically
• If the program makes many non-localized accesses
  – I/O-efficient techniques are needed
• Three common techniques (recall the majority counting puzzle):
  – Convert to sort + scan
  – Divide-and-conquer
  – Other tricks

Page 25

Want to know more about I/O-efficient algorithms?

A course on I/O-efficient algorithms is offered as CIS5930 (Advanced Topics in Data Management)

Page 26

Computation Model for Massive Data (2): Streaming Model

You get to look at each element only once!

You cannot / don’t want to / can’t wait to store the data and do further processing.

Page 27

Streaming Algorithms: Applications

[Figure: IP traffic from DSL/cable networks, enterprise networks, peers, and the PSTN streams through a Network Operations Center (NOC); a back-end data warehouse with a DBMS (Oracle, DB2) supports off-line analysis – slow, expensive]

Example queries over such streams:

• What are the top (most frequent) 1000 (source, dest) pairs seen over the last month? (SQL join query)

  SELECT COUNT (R1.source, R2.dest)
  FROM R1, R2
  WHERE R1.dest = R2.source

• How many distinct (source, dest) pairs have been seen? (set-expression query)

Other applications:
• Sensor networks
• Network security
• Financial applications
• Web logs and clickstreams

Page 28

Puzzle #2: Find the Missing Card

[Figure: a full set of Mahjong tiles with one tile missing]

• How do you find the missing tile by making one pass over everything?
  – Assuming you can’t memorize everything (of course)
• Assign a number to each type of tile: e.g., [tile] = 8, [tile] = 14, [tile] = 22
• Compute the sum of all remaining tiles
  – (1 + … + 9 + 11 + … + 19 + 21 + … + 29) * 4 − sum = the missing tile!
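The same trick in code, with made-up numeric tile codes standing in for the lost tile pictures (suits encoded as 1–9, 11–19, 21–29, four copies of each tile in a full set):

    full = [v for v in range(1, 30) if v % 10 != 0] * 4   # assumed encoding
    expected = sum(full)      # (1+...+9+11+...+19+21+...+29) * 4

    def find_missing(stream):
        seen = 0
        for tile in stream:   # one pass, O(1) extra memory
            seen += tile
        return expected - seen

    remaining = full.copy()
    remaining.remove(14)                  # hide one tile
    print(find_missing(iter(remaining)))  # -> 14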

Page 29

A Research Problem: Count # Distinct Elements

• Unfortunately, there is a lower bound saying you can’t do this exactly without using Ω(n) memory
• But if we allow some error, then we can approximate it well

  b a e c a d a a d a a e a b a a f a g b   (# distinct elements = 7)

Page 30

Solution: FM Sketch [FM85, AMS99]

• Take a (pseudo)random hash function h : {1, …, n} → {1, …, 2^d}, where 2^d > n
• For each incoming element x, compute h(x)
  – e.g., h(5) = 10101100010000
  – Count how many trailing zeroes it has
  – Remember the maximum number of trailing zeroes in any h(x)
• Let Y be the maximum number of trailing zeroes
  – Can show E[2^Y] = # distinct elements (a toy implementation follows)
    * With 2 distinct elements, “on average” there is one h(x) with 1 trailing zero
    * With 4 distinct elements, “on average” there is one h(x) with 2 trailing zeroes
    * With 8 distinct elements, “on average” there is one h(x) with 3 trailing zeroes
    * …
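A toy Python version of the sketch as just described. The multiplicative hash here is an assumption chosen for brevity (not the hash family from [FM85]), and Python’s per-process salting of hash() makes the estimate vary between runs:

    import random

    class FMSketch:
        def __init__(self, d=32, seed=0):
            rng = random.Random(seed)
            self.a = rng.randrange(1, 2**d, 2)   # odd multiplier (assumed hash)
            self.mask = 2**d - 1
            self.y = 0                           # max trailing zeroes seen

        def add(self, x):
            h = (hash(x) * self.a) & self.mask   # h(x) in {0, ..., 2^d - 1}
            tz = 0
            while h and (h & 1) == 0:            # count trailing zeroes of h(x)
                tz += 1
                h >>= 1
            self.y = max(self.y, tz)

        def estimate(self):
            return 2 ** self.y                   # the slide's 2^Y estimator

    sk = FMSketch()
    for x in "b a e c a d a a d a a e a b a a f a g b".split():
        sk.add(x)
    print(sk.estimate())   # a rough estimate of the 7 distinct elements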

Page 31

Counting Paintballs

• Imagine the following scenario:
  – A bag of n paintballs is emptied at the top of a long staircase.
  – At each step, each paintball either bursts and marks the step, or bounces to the next step. 50/50 chance either way.

• Looking only at the pattern of marked steps, what was n?

Page 32

Counting Paintballs (cont.)

• What does the distribution of paintball bursts look like?
  – The number of bursts at each step follows a binomial distribution: B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, …, B(n, 1/2^Y) at the Y-th
  – The expected number of bursts drops geometrically
  – Few bursts after log_2 n steps

Page 33

Solution: FM Sketch [FM85, AMS99]

• So 2^Y is an unbiased estimator for the # distinct elements
• However, it has a large variance
  – Use O(1/ε² · log(1/δ)) copies to get an estimator that, with probability 1 − δ, is within relative error ε
• Applications:
  – How many distinct IP addresses used a given link to send their traffic since the beginning of the day?
  – How many new IP addresses appeared today that didn’t appear before?

Page 34

Finding Heavy Hitters

• Which elements appeared in the stream more than 10% of the time?
• Applications:
  – Networking
    * Finding the IP addresses sending the most traffic
  – Databases
    * Iceberg queries
  – Data mining
    * Finding “hot” items (itemsets) in transaction data
• Solution:
  – The exact solution is difficult
  – If we allow an approximation error of ε, we can use O(1/ε) space and O(1) time per element in the stream (see the sketch below)
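The slide does not name the algorithm; one standard technique matching the O(1/ε)-space, O(1)-per-element claim is the Misra–Gries summary, sketched below (my choice of algorithm, with k = ceil(1/ε) counters):

    def misra_gries(stream, k):
        # One pass with at most k-1 counters. Every element occurring more
        # than N/k times is guaranteed to survive as a candidate; a second
        # pass can confirm the exact frequencies.
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                for key in list(counters):   # decrement all; drop zeroes
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    stream = "b a e c a d a a d a a e a b a a f a g b".split()
    print(misra_gries(iter(stream), k=10))   # candidates for the > 10% elements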

Page 35

Streaming in a Distributed World

• Large-scale querying/monitoring: inherently distributed!
  – Streams are physically distributed across remote sites, e.g., a stream of UDP packets through a subset of edge routers
• The challenge is “holistic” querying/monitoring
  – Queries over the union of the distributed streams: Q(S1 ∪ S2 ∪ …)
  – Streaming data is spread throughout the network

[Figure: local bit streams at sites S1–S6 feeding a query site at the Network Operations Center (NOC), which answers Q(S1 ∪ S2 ∪ …)]

Page 36

Streaming in a Distributed World

• Need timely, accurate, and efficient query answers
• Additional complexity over centralized data streaming!
• Need space/time- and communication-efficient solutions
  – Minimize network overhead
  – Maximize network lifetime (e.g., sensor battery life)
  – Cannot afford to “centralize” all streaming data

[Figure: the same distributed-streams picture as on the previous slide]

Page 37

Want to know more about streaming algorithms?

A graduate-level course on streaming algorithms will be approximately offered in the next next next semester, with an error guarantee of 5%!

Or, talk to me tomorrow!

Page 38

Top-k Queries

• Extremely useful in information retrieval
  – top-k sellers, popular movies, etc.
  – Google

  Unsorted relation:      Sorted by score:
    tuple  score            tuple  score
    t1     65               t3     100
    t2     30               t5     87
    t3     100              t4     80
    t4     80               t1     65
    t5     87               t2     30

  top-2 = {t3, t5}

(Related work: the Threshold Algorithm, RankSQL)
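As a quick sanity check of the example (not part of the talk), the top-2 over the materialized scores is a one-liner:

    import heapq
    scores = {"t1": 65, "t2": 30, "t3": 100, "t4": 80, "t5": 87}  # from the slide
    print(heapq.nlargest(2, scores, key=scores.get))              # -> ['t3', 't5']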

Page 39

Top-k Queries on Uncertain Data

  tuple  score  confidence
  t3     100    0.2
  t5     87     0.8
  t4     80     0.9
  t1     65     0.5
  t2     30     0.6

• (sensor reading, reliability)
• (page rank, how well it matches the query)

The top-k answer depends on the interplay between score and confidence.

Page 40

Top-k Definition: U-Topk

The k tuples with the maximum probability of being the top-k.

  tuple  score  confidence
  t3     100    0.2
  t5     87     0.8
  t4     80     0.9
  t1     65     0.5
  t2     30     0.6

  {t3, t5}: 0.2 * 0.8 = 0.16
  {t3, t4}: 0.2 * (1 − 0.8) * 0.9 = 0.036
  {t5, t4}: (1 − 0.2) * 0.8 * 0.9 = 0.576
  ...

Potential problem: the top-k could be very different from the top-(k+1). A brute-force sketch of this definition follows.
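A brute-force Python check of the U-Topk definition on the slide’s five tuples. It enumerates every candidate set, so it only serves to verify small examples; it is not one of the efficient algorithms the talk alludes to:

    from itertools import combinations

    # (tuple, confidence), already sorted by descending score.
    tuples = [("t3", 0.2), ("t5", 0.8), ("t4", 0.9), ("t1", 0.5), ("t2", 0.6)]

    def u_topk(tuples, k):
        best, best_p = None, -1.0
        for idx in combinations(range(len(tuples)), k):
            p = 1.0
            for i, (name, conf) in enumerate(tuples):
                if i in idx:
                    p *= conf            # every member of the set must appear
                elif i < max(idx):
                    p *= 1 - conf        # higher-scoring non-members must not
            if p > best_p:
                best, best_p = [tuples[i][0] for i in idx], p
        return best, best_p

    print(u_topk(tuples, 2))   # -> (['t5', 't4'], 0.576), matching the slide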

Page 41

Top-k Definition: U-kRanks

The i-th tuple is the one with the maximum probability of being at rank i, for i = 1, …, k.

  tuple  score  confidence
  t3     100    0.2
  t5     87     0.8
  t4     80     0.9
  t1     65     0.5
  t2     30     0.6

  Rank 1:
    t3: 0.2
    t5: (1 − 0.2) * 0.8 = 0.64
    t4: (1 − 0.2) * (1 − 0.8) * 0.9 = 0.144
    ...

  Rank 2:
    t3: 0
    t5: 0.2 * 0.8 = 0.16
    t4: 0.9 * (0.2 * (1 − 0.8) + (1 − 0.2) * 0.8) = 0.612

Potential problem: duplicated tuples in the top-k.

Page 42

Uncertain Data Models

• An uncertain data model represents a probability distribution over database instances (possible worlds)
• Basic model: mutual independence among all tuples
• Complete models: able to represent any distribution over possible worlds
  – Atomic independent random Boolean variables
  – Each tuple corresponds to a Boolean formula, and appears iff the formula evaluates to true
  – Exponential complexity

Page 43

Uncertain Data Model: x-relations

• Each x-tuple represents a discrete probability distribution of tuples; x-tuples may be single-alternative or multi-alternative
• x-tuples are mutually independent, and the alternatives within an x-tuple are disjoint

[The slide’s example x-relation is lost in this transcript; on it, U-Top2 = {t1, t2} and U-2Ranks = (t1, t3)]

Page 44

Want to know more about uncertain data management?

A graduate-level course on uncertain data management will (likely, probably) be offered in the next next next next next semester.

Or, talk to me tomorrow!

Page 45

Recap

• External memory model
  – Main memory is fast but limited
  – External memory is slow but unlimited
  – Aim to optimize I/O performance
• Streaming model
  – Main memory is fast but small
  – Can’t store, not willing to store, or can’t wait to store the data
  – Compute the desired answers in one pass
• Probabilistic data model
  – Can’t store and query the exponentially many possible instances (possible worlds)
  – Compute the desired answers on the succinct representation of the probabilistic data (efficiently!! possibly allowing some error)

Page 46

Thanks!

Questions?