Database Systems(資料庫系統 )
November 8, 2004
Lecture #9
By Hao-hua Chu (朱浩華 )
2
Announcement
• Midterm exam: November 20 (Sat): 2:30 PM in CSIE 101/103
• Assignment #6 is available on the course homepage.
  – It is due on 11/24.
  – It is very difficult.
  – Suggest you do it before the midterm exam.
• Assignment #7 will be available on the course homepage later this afternoon.
  – It is due 11/16 (next Tuesday).
  – It is easy.
  – It will help you prepare for the midterm exam.
3
Cool Ubicomp Project: Counter Intelligence (MIT)
• Smart kitchen & kitchen wares
• Talking Spoon
  – Salty, sweet, hot?
• Talking Cutlery
  – Bacteria?
• Smart fridge & counters
  – RFID tags
  – Tracking food from fridge to your mouth
4
Hash-Based Indexing
Chapter 11
5
Introduction
• Recall that hash-based indexes are best for equality selections.
  – Cannot support range searches.
  – Equality selections are useful for join operations.
• Static and dynamic hashing techniques exist.
  – Trade-offs similar to ISAM vs. B+ trees.
  – One static hashing technique.
  – Two dynamic hashing techniques: Extendible Hashing and Linear Hashing.
6
Static Hashing
• # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
• h(k) mod N = bucket to which data entry with key k belongs. (N = # of buckets)
[Figure: h(key) mod N maps a key k to one of N primary bucket pages (0 .. N-1); each primary page may chain overflow pages.]
7
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on the search key field of record r.
  – Ideally distributes values uniformly over range 0 .. N-1.
  – h(key) = (a * key + b) usually works well.
  – a and b are constants; lots known about how to tune h.
• Cost for insertion/deletion/search: two/two/one disk page I/Os (no overflow chains).
• Long overflow chains can develop and degrade performance.
  – Why poor performance? Scan through overflow chains linearly.
  – Extendible and Linear Hashing: dynamic techniques to fix this problem.
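The static scheme above can be sketched in a few lines of Python. The class name, CAPACITY, and N are illustrative choices, not from the text; each bucket is a chain of pages, so search cost grows linearly with the overflow chain, as the slide warns.

```python
CAPACITY = 4   # data entries per page (illustrative)
N = 4          # number of primary buckets, fixed for the file's lifetime

class StaticHashIndex:
    def __init__(self, n=N):
        self.n = n
        # each bucket is a list of pages; pages[0] is the primary page,
        # pages[1:] are chained overflow pages
        self.buckets = [[[]] for _ in range(n)]

    def _bucket(self, key):
        # h(k) mod N picks the bucket, as in the slide
        return self.buckets[hash(key) % self.n]

    def insert(self, key):
        pages = self._bucket(key)
        if len(pages[-1]) == CAPACITY:   # last page full: chain a new overflow page
            pages.append([])
        pages[-1].append(key)

    def search(self, key):
        # cost is linear in the length of the overflow chain
        return any(key in page for page in self._bucket(key))

idx = StaticHashIndex()
for k in [20, 5, 13, 8, 24, 36]:
    idx.insert(k)
print(idx.search(13), idx.search(99))  # True False
```

Keys 20, 8, 24, 36 all land in bucket 0, so one more such key (e.g. 12) would force an overflow page there.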
8
Simple Solution
• Avoid creating overflow pages:
  – When a bucket (primary page) becomes full, double the # of buckets & re-organize the file.
• What's wrong with this simple solution?
  – High cost concern: reading and writing all pages is expensive!
9
Extendible Hashing
• The basic idea (another level of abstraction):
  – Use a directory of pointers to buckets (hash to the directory entry).
  – Double the # of buckets by doubling the directory.
  – Split just the bucket that overflowed!
• Directory much smaller than file, so doubling it is much cheaper.
• Only one page of data entries is split.
  – The page that overflows: rehash that page into two pages.
• Trick lies in how the hash function is adjusted!
  – Before doubling directory, h(r) -> 0 .. N-1 buckets.
  – After doubling directory, h(r) -> 0 .. 2N-1 buckets.
10
Example
• Directory is an array of size 4.
• To find the bucket for r, take the last `global depth' # of bits of h(r).
  – Example: if h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
• Global depth: # of bits used for hashing directory entries.
• Local depth of a bucket: # of bits used for hashing within that bucket.
• When can global depth be different from local depth?
[Figure: directory of size 4 (global depth 2), entries 00, 01, 10, 11 point to data pages:
  Bucket A (local depth 2): 4* 12* 32* 16*
  Bucket B (local depth 2): 1* 5* 21* 13*
  Bucket C (local depth 2): 10*
  Bucket D (local depth 2): 15* 7* 19*]
11
Insert h(r)=20 (Causes Doubling)
[Figure: inserting 20* into full Bucket A forces a split; since local depth = global depth, the directory doubles (global depth 2 -> 3, entries 000 .. 111):
  Bucket A  (local depth 3): 32* 16*
  Bucket B  (local depth 2): 1* 5* 21* 13*
  Bucket C  (local depth 2): 10*
  Bucket D  (local depth 2): 15* 7* 19*
  Bucket A2 (local depth 3, `split image' of Bucket A): 4* 12* 20*
Binary values: 4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100, 16 = 0001 0000, 32 = 0010 0000.]
12
Extendible Hashing Insert
• Check if the bucket is full.
  – If not, insert the entry; done!
• Otherwise, check if local depth = global depth.
  – If no: rehash the entries, distribute them into two buckets, and increment the local depth.
  – If yes: double the directory, then rehash the entries and distribute them into two buckets.
• Directory is doubled by copying it over and `fixing' the pointer to the split image page.
  – This works only when the least significant bits are used to index the directory.
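The insert algorithm above can be sketched as follows. This is a minimal illustration, not the book's implementation: the names (Bucket, ExtendibleHash), the use of Python's built-in hash(), and CAPACITY are all assumptions.

```python
CAPACITY = 4  # data entries per bucket page (illustrative)

class Bucket:
    def __init__(self, depth):
        self.local_depth = depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by least significant bits

    def _dir_index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._dir_index(key)].entries

    def insert(self, key):
        b = self.directory[self._dir_index(key)]
        if len(b.entries) < CAPACITY:
            b.entries.append(key)            # bucket not full: done
            return
        if b.local_depth == self.global_depth:
            self.directory += self.directory  # double the directory by copying it
            self.global_depth += 1
        # split the full bucket: one more bit of the hash tells the old
        # bucket and its split image apart
        b.local_depth += 1
        image = Bucket(b.local_depth)
        bit = 1 << (b.local_depth - 1)
        image.entries = [k for k in b.entries if hash(k) & bit]
        b.entries = [k for k in b.entries if not hash(k) & bit]
        for i in range(len(self.directory)):  # fix pointers to the split image
            if self.directory[i] is b and i & bit:
                self.directory[i] = image
        self.insert(key)  # retry (may split again if all entries collided)

eh = ExtendibleHash()
for k in [0, 2, 4, 6, 8]:   # all even keys: forces a directory doubling
    eh.insert(k)
print(eh.global_depth)       # 2
```

Note the pointer-fix loop: after copying, only the slots whose extra bit selects the split image are redirected, exactly the "copy then fix one pointer" trick the slide describes.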
13
Insert 9
[Figure: binary values: 1 = 0000 0001, 5 = 0000 0101, 21 = 0001 0101, 13 = 0000 1101, 9 = 0000 1001.
9 hashes to directory entry 001, i.e. Bucket B (1* 5* 21* 13*), which is full. Since Bucket B's local depth (2) < global depth (3), it splits without doubling the directory:
  Bucket A  (local depth 3): 32* 16*
  Bucket B  (local depth 3): 1* 9*
  Bucket C  (local depth 2): 10*
  Bucket D  (local depth 2): 15* 7* 19*
  Bucket A2 (local depth 3, `split image' of Bucket A): 4* 20* 12*
  Bucket B2 (local depth 3, `split image' of Bucket B): 5* 21* 13*]
14
Directory Doubling
Why use least significant bits in the directory? It allows doubling via copying!

[Figure: doubling a directory of size 4 to size 8, with 6* (6 = 110) as the example entry. Using least significant bits, the new directory is simply the old directory copied twice: entries 010 and 110 both point to 6*'s bucket, so no pointers move. Using most significant bits, the old entries would have to be interleaved across the new directory instead.]
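The "doubling via copying" claim can be checked directly: with least-significant-bit indexing, slots i and i + 2^d of the doubled directory must point to the same bucket, so the new directory is just two copies of the old one.

```python
# Directory of size 4 (global depth 2, slots 00..11); bucket names are
# illustrative placeholders.
old = ['A', 'B', 'C', 'D']
new = old + old              # global depth 3: slots 000..111, made by copying

for i, bucket in enumerate(new):
    assert bucket == old[i % len(old)]   # the low 2 bits still pick the bucket

# e.g. key 6 = 110: low bits 10 -> bucket 'C' in both directories
assert old[6 % 4] == new[6 % 8] == 'C'
```

With most-significant-bit indexing this property fails, which is why extendible hashing indexes the directory by the low-order bits.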
15
Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two.
  – 100 MB file, 100 bytes/record: you have 1M data entries.
  – A 4 KB page (a bucket) can contain 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory.
  – If the distribution of hash values is skewed (concentrated on a few buckets), the directory can grow large.
• Delete: if removing a data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
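The arithmetic behind these numbers can be checked directly (taking 4 KB as 4,000 bytes, as the slide's 40-entries-per-page figure implies):

```python
file_size = 100 * 10**6      # 100 MB file
record_size = 100            # bytes per data entry
page_size = 4 * 10**3        # 4 KB bucket page

entries = file_size // record_size           # data entries in the file
entries_per_page = page_size // record_size  # data entries per bucket page
buckets = entries // entries_per_page        # buckets = directory elements needed
print(entries, entries_per_page, buckets)    # 1000000 40 25000
```

25,000 directory elements, even at several bytes each, fit comfortably in main memory, which is the point of the slide.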
16
Linear Hashing (LH)
• This is another dynamic hashing scheme, an alternative to Extendible Hashing.
  – LH fixes the problem of long overflow chains (in static hashing) without using a directory (as in extendible hashing).
• Basic idea: use a family of hash functions h0, h1, h2, ...
  – Each function's range is twice that of its predecessor.
  – Pages are split when overflows occur, but not necessarily the overflowing page. (Splitting occurs in turn, in a round-robin fashion.)
  – Buckets are added gradually (one bucket at a time).
  – When all the pages at one level (the current hash function) have been split, a new level is applied.
  – Primary pages are allocated consecutively.
17
Levels of Linear Hashing
• Initial stage.
  – The initial level distributes entries into N0 buckets.
  – Call the hash function that performs this h0.
• Splitting buckets.
  – If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
  – Also, when a bucket overflows, some bucket is split.
    • The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
    • The next bucket to be split is the second bucket in the file ... and so on, until the N0-th has been split.
    • When buckets are split, their entries (including those in overflow pages) are distributed using h1.
  – To access split buckets, the next-level hash function (h1) is applied.
  – h1 maps entries to 2N0 (or N1) buckets.
18
Levels of Linear Hashing (Cnt)
• Level progression:
  – Once all Ni buckets of the current level (i) are split, the hash function hi is replaced by hi+1.
  – The splitting process starts again at the first bucket, and hi+2 is applied to find entries in split buckets.
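The level scheme above can be sketched with hi(key) = key mod (N0 * 2^i). This is a minimal illustration, not the book's code: CAPACITY, N0, and the split-on-any-overflow trigger are illustrative choices (textbook variants differ on exactly when a split is triggered).

```python
CAPACITY = 3   # entries per primary page (matches the example on the next slide)
N0 = 4         # initial number of buckets

class LinearHash:
    def __init__(self):
        self.level = 0
        self.next = 0                             # next bucket to split (round-robin)
        self.buckets = [[] for _ in range(N0)]    # entries past CAPACITY model overflow pages

    def _h(self, i, key):
        return key % (N0 * 2**i)                  # h_i: range doubles with each level

    def _bucket_of(self, key):
        b = self._h(self.level, key)
        if b < self.next:                         # already split this round:
            b = self._h(self.level + 1, key)      # use the next hash function
        return b

    def search(self, key):
        return key in self.buckets[self._bucket_of(key)]

    def insert(self, key):
        b = self._bucket_of(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > CAPACITY:       # overflow: split bucket `next`
            self._split()

    def _split(self):
        self.buckets.append([])                   # new bucket at the end of the file
        old, self.buckets[self.next] = self.buckets[self.next], []
        for k in old:                             # redistribute with h_{level+1}
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next += 1
        if self.next == N0 * 2**self.level:       # round finished: move to next level
            self.level += 1
            self.next = 0
```

Inserting the example keys from the next slide (64, 36, 1, 17, 5, 6, 31, 15, then 9) reproduces the pictured behavior: 9 overflows bucket 01, which splits bucket 0 (not the overflowing one) into buckets 000 and 100.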
19
Linear Hashing Example
• Initially, the level equals 0 and N0 equals 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets.
• h0 is used and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted (it will not fit in the second bucket).
• Note that `next' indicates which bucket is to split next (round-robin).
[Figure, all buckets use h0:
  next→ 00: 64 36
        01: 1 17 5
        10: 6
        11: 31 15]
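The bucket assignments in this figure can be checked by hand, assuming h0(key) = key mod 4 (i.e. the last two bits of the key, consistent with the two-bit bucket labels):

```python
keys = [64, 36, 1, 17, 5, 6, 31, 15]
for key in keys:
    print(f"{key:2d} -> bucket {key % 4:02b}")
# 64 and 36 -> 00; 1, 17, 5 -> 01; 6 -> 10; 31, 15 -> 11
```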
20
Linear Hashing Example 2
• An overflow page is chained to the primary page to contain the inserted value.
• If h0 maps a value from zero to next – 1 (just the first page in this case) h1 must be used to insert the new entry.
• Note how the new page falls naturally into the sequence as the fifth page.
[Figure: after inserting 9 and splitting the first bucket:
  h1       000: 64
  h0 next→ 01:  1 17 5 9 (9 in an overflow page)
  h0       10:  6
  h0       11:  31 15
  h1       100: 36]
• The page indicated by next is split (the first one).
• next is incremented.
21
Linear Hashing Example 3
• Assume inserts of 8, 7, 18, 14, 11¹, 32, 16², 10, 13, 23³ (superscripts mark the inserts that trigger the 1st, 2nd, and 3rd splits shown here).
• After the 3rd split the round is complete: the level becomes 1 (N1 = 8) and h1 is used everywhere.
• Subsequent splits will use h2 for inserts between the first bucket and next-1.

[Figure: final state after all inserts (next has wrapped around to bucket 000):
  h1 next→ 000: 64 8 32 16 (16 in an overflow page)
  h1       001: 1 17 9
  h1       010: 18 10
  h1       011: 11
  h1       100: 36
  h1       101: 5 13
  h1       110: 6 14
  h1       111: 31 15 7 23]
22
Linear Hashing vs. Extendible Hashing
• What is the similarity?
  – One round of round-robin splitting in LH is the same as a one-step doubling of the directory in EH.
• What are the differences?
  – Directory overhead vs. none.
  – Overflow pages vs. none.
  – Gradual splitting (of pages) vs. one-step doubling (of directory).
  – Pages are allocated in order vs. not in order.
  – Splitting non-overflowing pages vs. splitting overflowing pages.
23
Summary
• Hash-based indexes: best for equality searches; cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
  – Directory keeps track of buckets; it doubles periodically.
  – Can get large with skewed data; additional I/O if it does not fit in main memory.
  – A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
24
Summary (Contd.)
• Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
  – Overflow pages not likely to be long.
  – Space utilization could be lower than Extendible Hashing, since splits are not concentrated on `dense' data areas.
  – Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.