Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012

Heavily based on slides by Lars Arge

I/O-Algorithms

Thomas Mølhave

Spring 2012

February 9, 2012

Massive Data

• Pervasive use of computers and sensors
• Increased ability to acquire/store/process data
→ Massive data collected everywhere
• Society increasingly “data driven”
→ Access/process data anywhere, any time

Nature/Science special issues
• 2/06, 9/08, 2/11
• Scientific data size growing exponentially, while quality and availability are improving
• Paradigm shift: science will be about mining data

Obviously not only in the sciences:
• Economist 02/10:
– From 150 billion gigabytes five years ago to 1200 billion today
– Managing the data deluge is difficult; doing so will transform business/public life

Example: LIDAR Terrain Data

• Massive (irregular) point sets (~1 m resolution)
– Becoming relatively cheap and easy to collect
• Sub-meter resolution using mobile mapping

Example: LIDAR Terrain Data

[Figure labels: ~1.2 km; ~280 km/h at 1500-2000 m; ~1.5 m between measurements]

Example: LIDAR Terrain Data

• ~2 million points at 30 meter (<1GB)
• ~18 billion points at 1 meter (>1TB)

Example: Detailed Data Essential

• Mandø with 2 meter sea-level rise

[Figure panels: 80 meter terrain model; 2 meter terrain model]

Random Access Machine Model

• Standard theoretical model of computation:
– Infinite memory
– Uniform access cost
• Simple model crucial for success of computer industry

[Figure label: RAM]

Hierarchical Memory

• Modern machines have complicated memory hierarchy
– Levels get larger and slower further away from CPU
– Data moved between levels using large blocks

[Figure labels: L1, L2, RAM]

Slow I/O

• Disk access is $10^6$ times slower than main memory access

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

– Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 Kbytes)
• Important to store/access data to take advantage of blocks (locality)

[Figure labels: track, magnetic surface, read/write arm, read/write head]

Scalability Problems

• Most programs developed in RAM-model
– Run on large datasets because OS moves blocks as needed
• Modern OSes use sophisticated paging and prefetching strategies
– But if a program makes scattered accesses, even a good OS cannot take advantage of block access

Scalability problems!

[Figure: plot of running time against data size]

External Memory Model

N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
T = # of items in output

I/O: Move block between memory and disk

[Figure labels: D, P, M, Block I/O]

Scalability Problems: Block Access Matters

• Example: Traversing linked list (List ranking)
– Array size N = 10 elements
– Disk block size B = 2 elements
– Main memory size M = 4 elements (2 blocks)

[Figure: two disk layouts of the list, 1 5 2 6 7 3 4 10 8 9 and 1 2 10 9 8 5 4 7 6 3. Algorithm 1: N = 10 I/Os; Algorithm 2: N/B = 5 I/Os]

• The difference between N and N/B is large since the block size is large
– Example: N = 256 x $10^6$, B = 8000, 1 ms disk access time
N I/Os take 256 x $10^3$ sec = 4266 min = 71 hr
N/B I/Os take 256/8 sec = 32 sec
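The effect of the layout on the I/O count can be checked with a small simulation. The Python sketch below is an illustration and not part of the original slides. It assumes an LRU replacement policy and elements numbered 1..N that are visited in increasing order. The names `count_ios` and `io_time_seconds` and the two layouts are invented for this example.

```python
from collections import OrderedDict


def count_ios(layout, block_size, memory_blocks):
    """Count block reads when visiting elements 1..N in increasing order.

    layout[i] is the element stored at disk position i. Memory holds at
    most `memory_blocks` blocks and evicts the least recently used one
    (an assumption; the slides do not fix a replacement policy).
    """
    position = {value: i for i, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> True, ordered from oldest to newest use
    ios = 0
    for value in range(1, len(layout) + 1):
        block = position[value] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one block I/O
            cache[block] = True
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return ios


def io_time_seconds(num_ios, access_ms=1):
    """Total time for num_ios block accesses at access_ms milliseconds each."""
    return num_ios * access_ms / 1000


# Hypothetical layouts for N = 10, B = 2, M = 2 blocks.
sequential = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # consecutive elements share a block
scattered = [1, 6, 2, 7, 3, 8, 4, 9, 5, 10]   # consecutive elements never share a block

print(count_ios(sequential, 2, 2), count_ios(scattered, 2, 2))
print(io_time_seconds(256 * 10**6), io_time_seconds(256 * 10**6 // 8000))
```

With these layouts the sequential order costs N/B block reads and the scattered order costs N, and the last line reproduces the 256 x 10^3 sec versus 32 sec comparison.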

Fundamental Bounds

                Internal         External
• Scanning:     $N$              $\frac{N}{B}$
• Sorting:      $N \log N$       $\frac{N}{B} \log_{M/B} \frac{N}{B}$
• Searching:    $\log_2 N$       $\log_B N$

• Note:
– Linear I/O: $O(N/B)$
– B factor VERY important: $\frac{N}{B} < \frac{N}{B} \log_{M/B} \frac{N}{B} \ll N$
– Cannot sort optimally with search tree
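To get a feel for how far apart the external-memory bounds for scanning, sorting and searching are, they can be evaluated for concrete parameters. The following Python sketch is an illustration under stated assumptions: constant factors hidden by the O-notation are ignored, and the function names and parameter values are made up for the example.

```python
import math


def scan_ios(n, b):
    """Scanning bound: N/B block transfers (rounded up)."""
    return math.ceil(n / b)


def sort_ios(n, m, b):
    """Sorting bound: (N/B) * log_{M/B}(N/B), constants ignored."""
    blocks = n / b
    return blocks * math.log2(blocks) / math.log2(m / b)


def search_ios(n, b):
    """Searching bound: log_B N, constants ignored."""
    return math.log2(n) / math.log2(b)


# Hypothetical parameters: 2^20 items, 2^15 items of memory, 2^10 items per block.
n, m, b = 2**20, 2**15, 2**10
print(scan_ios(n, b), sort_ios(n, m, b), search_ios(n, b), n)
```

For these values the sorting bound is only a factor of two above the scanning bound, while both are far below N, which is the point of the "B factor" note.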

External Search Trees

• Binary search tree:
– Standard method for search among N elements
– Search traces at least one root-leaf path
– If nodes stored arbitrarily on disk
⇒ Search in $O(\log_2 N)$ I/Os

[Figure: binary search tree of height $\Theta(\log_2 N)$]

External Search Trees

• BFS blocking:
– Block height $O(\log_2 N) / O(\log_2 B) = O(\log_B N)$
– Output elements blocked
Rangesearch in $\Theta(\log_B N + T/B)$ I/Os
• Optimal: $O(N/B)$ space and $\Theta(\log_B N + T/B)$ query

[Figure: tree partitioned into blocks of $\Theta(B)$ nodes, each of height $\Theta(\log_2 B)$]
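The gain from BFS blocking can be estimated by counting how many blocks a root-leaf path crosses. The Python sketch below is an illustration only. It assumes a perfectly balanced binary tree and blocks that each hold a complete subtree, and the function name `blocks_on_path` is invented for this example.

```python
import math


def blocks_on_path(num_nodes, block_size, bfs_blocked):
    """Blocks touched by one root-leaf path in a balanced binary tree.

    Without blocking, every node on the path may sit in its own block.
    With BFS blocking, a block holds a complete subtree of
    floor(log2(block_size + 1)) levels, so the path crosses one block
    per group of that many levels.
    """
    levels = math.ceil(math.log2(num_nodes + 1))
    if not bfs_blocked:
        return levels
    levels_per_block = int(math.log2(block_size + 1))
    return math.ceil(levels / levels_per_block)


nodes = 2**20 - 1  # a complete binary tree with 20 levels
print(blocks_on_path(nodes, 1023, bfs_blocked=False))  # one I/O per level
print(blocks_on_path(nodes, 1023, bfs_blocked=True))   # one I/O per 10 levels
```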

External Search Trees

• Maintaining BFS blocking during updates?
– Balance normally maintained in search trees using rotations
• Seems very difficult to maintain BFS blocking during rotation

[Figure: rotation of nodes x and y]

B-trees

• BFS-blocking naturally corresponds to tree with fan-out $\Theta(B)$
• B-trees balanced by allowing node degree to vary
– Rebalancing performed by splitting and merging nodes

(a,b)-tree

• T is an (a,b)-tree (a≥2 and b≥2a-1)
– All leaves on the same level and contain between a and b elements
– Except for the root, all nodes have degree between a and b
– Root has degree between 2 and b
• (a,b)-tree uses linear space and has height $O(\log_a N)$
Choosing a,b = $\Theta(B)$, each node/leaf is stored in one disk block
$O(N/B)$ space and $\Theta(\log_B N + T/B)$ query

[Figure: example (2,4)-tree]

(a,b)-Tree Insert

• Insert:
Search and insert element in leaf v
WHILE v has b+1 elements/children
    Split v:
        make nodes v’ and v’’ with $\lceil \frac{b+1}{2} \rceil \le b$ and $\lfloor \frac{b+1}{2} \rfloor \ge a$ elements
        insert element (ref) in parent(v)
        (make new root if necessary)
    v=parent(v)
• Insert touches $\Theta(\log_a N)$ nodes

[Figure: node v with b+1 elements split into v’ and v’’ with $\lceil \frac{b+1}{2} \rceil$ and $\lfloor \frac{b+1}{2} \rfloor$ elements]
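The insertion procedure can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not the implementation behind the slides: elements live in the leaves, internal nodes hold separator keys (the smallest key of the right subtree), and the class names `Node` and `ABTree` and the helper methods `items` and `check` are invented for this example.

```python
import bisect


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # leaf: sorted elements; internal: separator keys
        self.children = children  # None for a leaf

    @property
    def size(self):
        """Number of elements (leaf) or children (internal node)."""
        return len(self.keys) if self.children is None else len(self.children)


class ABTree:
    def __init__(self, a, b):
        assert a >= 2 and b >= 2 * a - 1
        self.a, self.b = a, b
        self.root = Node([])

    def _split(self, node):
        """Split a node of size b+1 into ceil((b+1)/2) and floor((b+1)/2)."""
        mid = (self.b + 2) // 2  # equals ceil((b + 1) / 2)
        if node.children is None:
            return Node(node.keys[:mid]), Node(node.keys[mid:]), node.keys[mid]
        left = Node(node.keys[:mid - 1], node.children[:mid])
        right = Node(node.keys[mid:], node.children[mid:])
        return left, right, node.keys[mid - 1]

    def insert(self, key):
        # Search: remember the root-leaf path and the child index taken.
        path, node = [], self.root
        while node.children is not None:
            i = bisect.bisect_right(node.keys, key)
            path.append((node, i))
            node = node.children[i]
        bisect.insort(node.keys, key)
        # Split upward while the current node has b+1 elements/children.
        while node.size > self.b:
            left, right, separator = self._split(node)
            if not path:  # the root was split: make a new root
                self.root = Node([separator], [left, right])
                break
            parent, i = path.pop()
            parent.children[i:i + 1] = [left, right]
            parent.keys.insert(i, separator)
            node = parent

    def items(self):
        """All elements in left-to-right leaf order."""
        def walk(node):
            if node.children is None:
                yield from node.keys
            else:
                for child in node.children:
                    yield from walk(child)
        return list(walk(self.root))

    def check(self):
        """Assert the degree and leaf-level invariants; return the leaf depth."""
        depths = set()

        def walk(node, depth, is_root):
            assert node.size <= self.b
            if not is_root:
                assert node.size >= self.a
            if node.children is None:
                depths.add(depth)
            else:
                for child in node.children:
                    walk(child, depth + 1, False)

        walk(self.root, 0, True)
        assert len(depths) == 1  # all leaves on the same level
        return depths.pop()


tree = ABTree(2, 4)
for key in [(i * 37) % 101 for i in range(101)]:  # a permutation of 0..100
    tree.insert(key)
print(tree.items()[:5], tree.check())
```

Each insert follows one root-leaf path and splits only nodes on that path, which matches the bound on the number of nodes touched.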

(2,4)-Tree Insert

(a,b)-Tree Delete

• Delete:
Search and delete element from leaf v
WHILE v has a-1 elements/children
    Fuse v with sibling v’:
        move children of v’ to v
        delete element (ref) from parent(v)
        (delete root if necessary)
    If v has >b (and ≤ a+b-1<2b) children split v
    v=parent(v)
• Delete touches $O(\log_a N)$ nodes

[Figure: node v with a-1 elements fused with a sibling into a node with 2a-1 elements]
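The size arithmetic behind the fuse-then-split step can be checked directly. The Python sketch below is an illustration rather than a full delete implementation. It only tracks node sizes, and the function name `fuse_then_split` is invented for this example.

```python
def fuse_then_split(a, b, sibling_size):
    """Sizes of the node(s) left after fusing an underfull node with a sibling.

    The underfull node has a-1 elements/children and the sibling has between
    a and b. The fused node has between 2a-1 and a+b-1; if that exceeds b
    it is split into two halves.
    """
    assert a >= 2 and b >= 2 * a - 1 and a <= sibling_size <= b
    fused = (a - 1) + sibling_size
    if fused <= b:
        return [fused]
    return [(fused + 1) // 2, fused // 2]  # ceil and floor of fused / 2


# In a (2,4)-tree: a minimal sibling gives one node, a full sibling forces a split.
print(fuse_then_split(2, 4, 2), fuse_then_split(2, 4, 4))
```

In every case the resulting sizes lie between a and b, so the only node that can become underfull afterwards is the parent, which lost one child reference when no split was needed.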

(2,4)-Tree Delete

Summary/Conclusion: B-tree

• B-trees: (a,b)-trees with a,b = $\Theta(B)$
– $O(N/B)$ space
– $O(\log_B N)$ query
– $O(\log_B N)$ update
• B-trees with elements in the leaves sometimes called B+-tree
• Construction in $O(\frac{N}{B} \log_{M/B} \frac{N}{B})$ I/Os
– Sort elements and construct leaves
– Build tree level-by-level bottom-up
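The bottom-up construction can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions: the input is already sorted, every node is filled to a fixed fan-out, and the last node on a level may end up with fewer entries than a real (a,b)-tree would allow. The function name `bulk_load` is invented for this example.

```python
def bulk_load(sorted_keys, fanout):
    """Build a tree level by level from sorted keys.

    Returns a list of levels: levels[0] holds the leaves (lists of keys),
    and each later level groups up to `fanout` nodes of the level below.
    The last level contains the single root.
    """
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        level = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        levels.append(level)
    return levels


levels = bulk_load(list(range(1000)), fanout=10)
print([len(level) for level in levels])
```

Each level is produced by one pass over the level below, so after the initial sort the remaining work is a sequence of scans of geometrically shrinking size.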

Merge Sort

• Merge sort:
– Create N/M memory sized sorted runs
– Merge runs together M/B at a time
$O(\log_{M/B} \frac{N}{M})$ phases using $O(N/B)$ I/Os each
• Distribution sort similar (but harder – partition elements)
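The two steps of external merge sort can be simulated in Python to count the merge phases. The sketch below is an illustration under stated assumptions: everything is held in ordinary lists rather than on disk, `heapq.merge` stands in for the M/B-way merge, and the function name `external_merge_sort` and the parameter values are invented for this example.

```python
import heapq


def external_merge_sort(items, memory_size, block_size):
    """Sort items and report the number of merge phases.

    Step 1 forms N/M sorted runs of memory_size items each.
    Step 2 repeatedly merges M/B runs at a time until one run remains.
    """
    runs = [sorted(items[i:i + memory_size])
            for i in range(0, len(items), memory_size)]
    fan_in = max(2, memory_size // block_size)  # merge M/B runs per step
    phases = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        phases += 1
    return (runs[0] if runs else []), phases


# Hypothetical input: N = 1000, M = 10, B = 2, so 100 runs merged 5 at a time.
data = [(i * 7919) % 1000 for i in range(1000)]  # a permutation of 0..999
result, phases = external_merge_sort(data, memory_size=10, block_size=2)
print(result[:5], phases)
```

Here the 100 initial runs shrink to 20, then 4, then 1, so three phases are needed, in line with the logarithm to base M/B of N/M rounded up.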