Database Systems(資料庫系統 )
November 8, 2004
Lecture #9
By Hao-hua Chu (朱浩華 )
2
Announcement
• Midterm exam: November 20 (Sat): 2:30 PM in CSIE 101/103
• Assignment #6 is available on the course homepage.
  – It is due on 11/24.
  – It is very difficult.
  – Suggest you do it before the midterm exam.
• Assignment #7 will be available on the course homepage later this afternoon.
  – It is due 11/16 (next Tuesday).
  – It is easy.
  – It will help you prepare for the midterm exam.
3
Cool Ubicomp Project: Counter Intelligence (MIT)
• Smart kitchen & kitchen wares
• Talking Spoon
  – Salty, sweet, hot?
• Talking Cutlery
  – Bacteria?
• Smart fridge & counters
  – RFID tags
  – Tracking food from fridge to your mouth
4
Hash-Based Indexing
Chapter 11
5
Introduction
• Recall that hash-based indexes are best for equality selections.
  – Cannot support range searches.
  – Equality selections are useful for join operations.
• Static and dynamic hashing techniques exist.
  – Trade-offs similar to ISAM vs. B+ trees.
  – One static hashing technique.
  – Two dynamic hashing techniques: Extendible Hashing and Linear Hashing.
6
Static Hashing
• # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.
• h(k) mod N = bucket to which data entry with key k belongs. (N = # of buckets)
[Figure: h(key) mod N maps a key k to one of N primary bucket pages (0 .. N-1); each primary page may chain overflow pages.]
7
Static Hashing (Contd.)
• Buckets contain data entries.
• Hash function works on the search key field of record r.
  – Ideally distributes values uniformly over range 0 .. N-1.
  – h(key) = (a * key + b) usually works well.
  – a and b are constants; lots known about how to tune h.
• Cost for insertion/deletion/search: two/two/one disk page I/Os (no overflow chains).
• Long overflow chains can develop and degrade performance.
  – Why poor performance? Scan through overflow chains linearly.
  – Extendible and Linear Hashing: dynamic techniques to fix this problem.
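The static scheme above can be sketched in a few lines of Python. The class name, CAPACITY, and N are illustrative choices, not from the text; each bucket is a chain of pages, so search cost grows linearly with the overflow chain, as the slide warns.

```python
CAPACITY = 4   # data entries per page (illustrative)
N = 4          # number of primary buckets, fixed for the file's lifetime

class StaticHashIndex:
    def __init__(self, n=N):
        self.n = n
        # each bucket is a list of pages; pages[0] is the primary page,
        # pages[1:] are chained overflow pages
        self.buckets = [[[]] for _ in range(n)]

    def _bucket(self, key):
        # h(k) mod N picks the bucket, as in the slide
        return self.buckets[hash(key) % self.n]

    def insert(self, key):
        pages = self._bucket(key)
        if len(pages[-1]) == CAPACITY:   # last page full: chain a new overflow page
            pages.append([])
        pages[-1].append(key)

    def search(self, key):
        # cost is linear in the length of the overflow chain
        return any(key in page for page in self._bucket(key))

idx = StaticHashIndex()
for k in [20, 5, 13, 8, 24, 36]:
    idx.insert(k)
print(idx.search(13), idx.search(99))  # True False
```

Keys 20, 8, 24, 36 all land in bucket 0, so one more such key (e.g. 12) would force an overflow page there.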
8
Simple Solution
• Avoid creating overflow pages:
  – When a bucket (primary page) becomes full, double the # of buckets & re-organize the file.
• What's wrong with this simple solution?
  – High cost concern: reading and writing all pages is expensive!
9
Extendible Hashing
• The basic idea (another level of abstraction):
  – Use a directory of pointers to buckets (hash to the directory entry).
  – Double the # of buckets by doubling the directory.
  – Split just the bucket that overflowed!
• Directory much smaller than file, so doubling it is much cheaper.
• Only one page of data entries is split.
  – The page that overflows: rehash that page into two pages.
• Trick lies in how the hash function is adjusted!
  – Before doubling directory, h(r) -> 0 .. N-1 buckets.
  – After doubling directory, h(r) -> 0 .. 2N-1 buckets.
10
Example
• Directory is an array of size 4.
• To find the bucket for r, take the last `global depth' # of bits of h(r).
  – Example: if h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
• Global depth: # of bits used for hashing directory entries.
• Local depth of a bucket: # of bits used for hashing within that bucket.
• When can global depth be different from local depth?
[Figure: directory of size 4 (global depth 2), entries 00, 01, 10, 11 point to data pages:
  Bucket A (local depth 2): 4* 12* 32* 16*
  Bucket B (local depth 2): 1* 5* 21* 13*
  Bucket C (local depth 2): 10*
  Bucket D (local depth 2): 15* 7* 19*]
11
Insert h(r)=20 (Causes Doubling)
[Figure: inserting 20* into full Bucket A forces a split; since local depth = global depth, the directory doubles (global depth 2 -> 3, entries 000 .. 111):
  Bucket A  (local depth 3): 32* 16*
  Bucket B  (local depth 2): 1* 5* 21* 13*
  Bucket C  (local depth 2): 10*
  Bucket D  (local depth 2): 15* 7* 19*
  Bucket A2 (local depth 3, `split image' of Bucket A): 4* 12* 20*
Binary values: 4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100, 16 = 0001 0000, 32 = 0010 0000.]
12
Extendible Hashing Insert
• Check if the bucket is full.
  – If not, insert the entry; done!
• Otherwise, check if local depth = global depth.
  – If no: rehash the entries, distribute them into two buckets, and increment the local depth.
  – If yes: double the directory, then rehash the entries and distribute them into two buckets.
• Directory is doubled by copying it over and `fixing' the pointer to the split image page.
  – This works only when the least significant bits are used to index the directory.
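The insert algorithm above can be sketched as follows. This is a minimal illustration, not the book's implementation: the names (Bucket, ExtendibleHash), the use of Python's built-in hash(), and CAPACITY are all assumptions.

```python
CAPACITY = 4  # data entries per bucket page (illustrative)

class Bucket:
    def __init__(self, depth):
        self.local_depth = depth
        self.entries = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]  # indexed by least significant bits

    def _dir_index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._dir_index(key)].entries

    def insert(self, key):
        b = self.directory[self._dir_index(key)]
        if len(b.entries) < CAPACITY:
            b.entries.append(key)            # bucket not full: done
            return
        if b.local_depth == self.global_depth:
            self.directory += self.directory  # double the directory by copying it
            self.global_depth += 1
        # split the full bucket: one more bit of the hash tells the old
        # bucket and its split image apart
        b.local_depth += 1
        image = Bucket(b.local_depth)
        bit = 1 << (b.local_depth - 1)
        image.entries = [k for k in b.entries if hash(k) & bit]
        b.entries = [k for k in b.entries if not hash(k) & bit]
        for i in range(len(self.directory)):  # fix pointers to the split image
            if self.directory[i] is b and i & bit:
                self.directory[i] = image
        self.insert(key)  # retry (may split again if all entries collided)

eh = ExtendibleHash()
for k in [0, 2, 4, 6, 8]:   # all even keys: forces a directory doubling
    eh.insert(k)
print(eh.global_depth)       # 2
```

Note the pointer-fix loop: after copying, only the slots whose extra bit selects the split image are redirected, exactly the "copy then fix one pointer" trick the slide describes.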
13
Insert 9
[Figure: binary values: 1 = 0000 0001, 5 = 0000 0101, 21 = 0001 0101, 13 = 0000 1101, 9 = 0000 1001.
9 hashes to directory entry 001, i.e. Bucket B (1* 5* 21* 13*), which is full. Since Bucket B's local depth (2) < global depth (3), it splits without doubling the directory:
  Bucket A  (local depth 3): 32* 16*
  Bucket B  (local depth 3): 1* 9*
  Bucket C  (local depth 2): 10*
  Bucket D  (local depth 2): 15* 7* 19*
  Bucket A2 (local depth 3, `split image' of Bucket A): 4* 20* 12*
  Bucket B2 (local depth 3, `split image' of Bucket B): 5* 21* 13*]
14
Directory Doubling
Why use least significant bits in the directory? It allows doubling via copying!

[Figure: doubling a directory of size 4 to size 8, with 6* (6 = 110) as the example entry. Using least significant bits, the new directory is simply the old directory copied twice: entries 010 and 110 both point to 6*'s bucket, so no pointers move. Using most significant bits, the old entries would have to be interleaved across the new directory instead.]
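The "doubling via copying" claim can be checked directly: with least-significant-bit indexing, slots i and i + 2^d of the doubled directory must point to the same bucket, so the new directory is just two copies of the old one.

```python
# Directory of size 4 (global depth 2, slots 00..11); bucket names are
# illustrative placeholders.
old = ['A', 'B', 'C', 'D']
new = old + old              # global depth 3: slots 000..111, made by copying

for i, bucket in enumerate(new):
    assert bucket == old[i % len(old)]   # the low 2 bits still pick the bucket

# e.g. key 6 = 110: low bits 10 -> bucket 'C' in both directories
assert old[6 % 4] == new[6 % 8] == 'C'
```

With most-significant-bit indexing this property fails, which is why extendible hashing indexes the directory by the low-order bits.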
15
Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access; else two.
  – 100 MB file, 100 bytes/record: you have 1M data entries.
  – A 4 KB page (a bucket) can contain 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory.
  – If the distribution of hash values is skewed (concentrated on a few buckets), the directory can grow large.
• Delete: if removing a data entry makes a bucket empty, it can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
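The arithmetic behind these numbers can be checked directly (taking 4 KB as 4,000 bytes, as the slide's 40-entries-per-page figure implies):

```python
file_size = 100 * 10**6      # 100 MB file
record_size = 100            # bytes per data entry
page_size = 4 * 10**3        # 4 KB bucket page

entries = file_size // record_size           # data entries in the file
entries_per_page = page_size // record_size  # data entries per bucket page
buckets = entries // entries_per_page        # buckets = directory elements needed
print(entries, entries_per_page, buckets)    # 1000000 40 25000
```

25,000 directory elements, even at several bytes each, fit comfortably in main memory, which is the point of the slide.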
16
Linear Hashing (LH)
• This is another dynamic hashing scheme, an alternative to Extendible Hashing.
  – LH fixes the problem of long overflow chains (in static hashing) without using a directory (as in extendible hashing).
• Basic idea: use a family of hash functions h0, h1, h2, ...
  – Each function's range is twice that of its predecessor.
  – Pages are split when overflows occur, but not necessarily the overflowing page. (Splitting occurs in turn, in a round-robin fashion.)
  – Buckets are added gradually (one bucket at a time).
  – When all the pages at one level (the current hash function) have been split, a new level is applied.
  – Primary pages are allocated consecutively.
17
Levels of Linear Hashing
• Initial stage.
  – The initial level distributes entries into N0 buckets.
  – Call the hash function that performs this h0.
• Splitting buckets.
  – If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing).
  – Also, when a bucket overflows, some bucket is split.
    • The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
    • The next bucket to be split is the second bucket in the file ... and so on, until the N0-th has been split.
    • When buckets are split, their entries (including those in overflow pages) are distributed using h1.
  – To access split buckets, the next-level hash function (h1) is applied.
  – h1 maps entries to 2N0 (or N1) buckets.
18
Levels of Linear Hashing (Cnt)
• Level progression:
  – Once all Ni buckets of the current level (i) are split, the hash function hi is replaced by hi+1.
  – The splitting process starts again at the first bucket, and hi+2 is applied to find entries in split buckets.
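The level scheme above can be sketched with hi(key) = key mod (N0 * 2^i). This is a minimal illustration, not the book's code: CAPACITY, N0, and the split-on-any-overflow trigger are illustrative choices (textbook variants differ on exactly when a split is triggered).

```python
CAPACITY = 3   # entries per primary page (matches the example on the next slide)
N0 = 4         # initial number of buckets

class LinearHash:
    def __init__(self):
        self.level = 0
        self.next = 0                             # next bucket to split (round-robin)
        self.buckets = [[] for _ in range(N0)]    # entries past CAPACITY model overflow pages

    def _h(self, i, key):
        return key % (N0 * 2**i)                  # h_i: range doubles with each level

    def _bucket_of(self, key):
        b = self._h(self.level, key)
        if b < self.next:                         # already split this round:
            b = self._h(self.level + 1, key)      # use the next hash function
        return b

    def search(self, key):
        return key in self.buckets[self._bucket_of(key)]

    def insert(self, key):
        b = self._bucket_of(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > CAPACITY:       # overflow: split bucket `next`
            self._split()

    def _split(self):
        self.buckets.append([])                   # new bucket at the end of the file
        old, self.buckets[self.next] = self.buckets[self.next], []
        for k in old:                             # redistribute with h_{level+1}
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next += 1
        if self.next == N0 * 2**self.level:       # round finished: move to next level
            self.level += 1
            self.next = 0
```

Inserting the example keys from the next slide (64, 36, 1, 17, 5, 6, 31, 15, then 9) reproduces the pictured behavior: 9 overflows bucket 01, which splits bucket 0 (not the overflowing one) into buckets 000 and 100.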
19
Linear Hashing Example
• Initially, the level equals 0 and N0 equals 4 (three entries fit on a page).
• h0 maps index entries to one of four buckets.
• h0 is used and no buckets have been split.
• Now consider what happens when 9 (1001) is inserted (it will not fit in the second bucket).
• Note that `next' indicates which bucket is to split next (round-robin).
[Figure, all buckets use h0:
  next→ 00: 64 36
        01: 1 17 5
        10: 6
        11: 31 15]
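The bucket assignments in this figure can be checked by hand, assuming h0(key) = key mod 4 (i.e. the last two bits of the key, consistent with the two-bit bucket labels):

```python
keys = [64, 36, 1, 17, 5, 6, 31, 15]
for key in keys:
    print(f"{key:2d} -> bucket {key % 4:02b}")
# 64 and 36 -> 00; 1, 17, 5 -> 01; 6 -> 10; 31, 15 -> 11
```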
20
Linear Hashing Example 2
• An overflow page is chained to the primary page to contain the inserted value.
• If h0 maps a value from zero to next – 1 (just the first page in this case) h1 must be used to insert the new entry.
• Note how the new page falls naturally into the sequence as the fifth page.
[Figure: after inserting 9 and splitting the first bucket:
  h1       000: 64
  h0 next→ 01:  1 17 5 9 (9 in an overflow page)
  h0       10:  6
  h0       11:  31 15
  h1       100: 36]
• The page indicated by next is split (the first one).
• next is incremented.
21
Linear Hashing Example 3
• Assume inserts of 8, 7, 18, 14, 11¹, 32, 16², 10, 13, 23³ (superscripts mark the inserts that trigger the 1st, 2nd, and 3rd splits shown here).
• After the 3rd split the round is complete: the level becomes 1 (N1 = 8) and h1 is used everywhere.
• Subsequent splits will use h2 for inserts between the first bucket and next-1.

[Figure: final state after all inserts (next has wrapped around to bucket 000):
  h1 next→ 000: 64 8 32 16 (16 in an overflow page)
  h1       001: 1 17 9
  h1       010: 18 10
  h1       011: 11
  h1       100: 36
  h1       101: 5 13
  h1       110: 6 14
  h1       111: 31 15 7 23]
22
Linear Hashing vs. Extendible Hashing
• What is the similarity?
  – One round of round-robin splitting in LH is the same as a one-step doubling of the directory in EH.
• What are the differences?
  – Directory overhead vs. none.
  – Overflow pages vs. none.
  – Gradual splitting (of pages) vs. one-step doubling (of directory).
  – Pages are allocated in order vs. not in order.
  – Splitting non-overflowing pages vs. splitting overflowing pages.
23
Summary
• Hash-based indexes: best for equality searches; cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
  – Directory keeps track of buckets; it doubles periodically.
  – Can get large with skewed data; additional I/O if it does not fit in main memory.
  – A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
24
Summary (Contd.)
• Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
  – Overflow pages not likely to be long.
  – Space utilization could be lower than Extendible Hashing, since splits are not concentrated on `dense' data areas.
  – Can tune the criterion for triggering splits to trade off slightly longer chains for better space utilization.