DBMS 2001 Notes 4.2: Hashing 1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom)

Dec 15, 2015

Page 1

Principles of Database Management Systems

4.2: Hashing Techniques

Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom)

Page 2

Hashing?

• Locating the storage block of a record by the hash value h(k) of its key k

• Normally really fast
  – records (often) located by a single disk access

Page 3

Hashing

[Figure: key → h(key) → one of the buckets (typically 1 disk block each); the entry <key> is stored in the selected bucket]

Page 4

Two alternatives

(1) Hash value determines the storage block directly

[Figure: key → h(key) → block of records]

• to implement a primary index

Page 5

Two alternatives

(2) Records located indirectly via index buckets

[Figure: key → h(key) → index bucket with an entry for key 1 → record]

• for a secondary index
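The two alternatives can be contrasted with a minimal sketch. Everything named here (`NUM_BUCKETS`, `h`, `record_file`, the dictionary-shaped records) is an assumption made for illustration, and in-memory lists stand in for disk blocks.

```python
NUM_BUCKETS = 4  # hypothetical bucket count


def h(key: str) -> int:
    """Toy hash function: byte sum modulo the number of buckets."""
    return sum(key.encode("utf-8")) % NUM_BUCKETS


# (1) Direct: bucket h(key) is the storage block that holds the records.
record_buckets = [[] for _ in range(NUM_BUCKETS)]


def insert_direct(record: dict) -> None:
    record_buckets[h(record["key"])].append(record)


def lookup_direct(key: str) -> list:
    return [rec for rec in record_buckets[h(key)] if rec["key"] == key]


# (2) Indirect: records live elsewhere; buckets hold (key, position) entries.
record_file = []
index_buckets = [[] for _ in range(NUM_BUCKETS)]


def insert_indirect(record: dict) -> None:
    record_file.append(record)
    position = len(record_file) - 1
    index_buckets[h(record["key"])].append((record["key"], position))


def lookup_indirect(key: str) -> list:
    return [record_file[pos] for k, pos in index_buckets[h(key)] if k == key]
```

In the indirect variant the record file keeps its own order, which is why it suits a secondary index.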

Page 6

Example hash function

• Key = ‘x1 x2 … xn’, an n-byte character string
• Have b buckets
• h = (x1 + x2 + … + xn) mod b, which gives a value in {0, 1, …, b-1}
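A minimal sketch of this byte-sum function, assuming the key is encoded as UTF-8 bytes (the function name `byte_sum_hash` is made up for illustration):

```python
def byte_sum_hash(key: str, b: int) -> int:
    """Sum the bytes of the key and reduce modulo the number of buckets b."""
    return sum(key.encode("utf-8")) % b


# "abc" has byte values 97 + 98 + 99 = 294, and 294 mod 4 = 2.
bucket = byte_sum_hash("abc", 4)
```

Because addition ignores the order of the bytes, permutations of the same characters such as "abc" and "cab" always land in the same bucket.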

Page 7

This may not be the best function …

Good hash function: the expected number of keys/bucket is the same for all buckets

Read Knuth Vol. 3 if you really need to select a good function.

Page 8

Next: example to illustrate inserts, overflows, deletes

[Figure: h(K)]

Page 9

EXAMPLE: 2 records/bucket

INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0

Bucket 0: d
Bucket 1: a, c
Bucket 2: b
Bucket 3: (empty)

h(e) = 1: bucket 1 is already full, so e goes into an overflow block chained to bucket 1
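The insert example can be replayed with a short sketch. The `Block` class and `insert` function are illustrative assumptions, and the hash values are taken from the example rather than computed.

```python
BLOCK_CAPACITY = 2  # 2 records/bucket, as in the example


class Block:
    """One block: up to BLOCK_CAPACITY keys plus a link to an overflow block."""

    def __init__(self):
        self.keys = []
        self.overflow = None


def insert(buckets, bucket_no, key):
    """Put key into bucket bucket_no, following or extending its overflow chain."""
    block = buckets[bucket_no]
    while len(block.keys) >= BLOCK_CAPACITY:
        if block.overflow is None:
            block.overflow = Block()
        block = block.overflow
    block.keys.append(key)


# Hash values as given in the example.
hash_values = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
buckets = [Block() for _ in range(4)]
for key, bucket_no in hash_values.items():
    insert(buckets, bucket_no, key)
```

After the loop, bucket 1 holds a and c, and its overflow block holds e.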

Page 10

EXAMPLE: deletion

[Figure: buckets 0 to 3 holding keys a to g, two keys per block, one bucket with an overflow block]

Delete: e, f

maybe move “g” up

Page 11

Rule of thumb:
• Try to keep space utilization between 50% and 80%

  Utilization = (# keys used) / (total # keys that fit)

• If < 50%, wasting space
• If > 80%, overflows significant
  (depends on how good the hash function is & on # keys/bucket)
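As a small worked example of the utilization formula, here is a sketch that assumes capacity is counted over the primary buckets only (overflow blocks ignored); the function names and the sample numbers are illustrative.

```python
def utilization(keys_used: int, num_buckets: int, keys_per_bucket: int) -> float:
    """Utilization = # keys used / total # keys that fit."""
    return keys_used / (num_buckets * keys_per_bucket)


def assess(u: float) -> str:
    """Apply the 50% to 80% rule of thumb."""
    if u < 0.5:
        return "wasting space"
    if u > 0.8:
        return "overflows significant"
    return "ok"


# 5 keys in 4 buckets of 2 keys each: 5 / 8 = 0.625, inside the target range.
u = utilization(5, 4, 2)
verdict = assess(u)
```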

Page 12

How do we cope with growth?

• Overflows and reorganizations
• Dynamic hashing: # of buckets may vary
  • Extensible
  • Linear
  • also others ...

Page 13

Extensible hashing: two ideas

(a) Use i of the b bits output by the hash function

[Figure: h(K) = 00110101, a b-bit value; the first i bits are used, and i grows over time]

For example, b = 32
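Idea (a) amounts to a bit shift. The sketch below assumes the i bits are the leading (high-order) ones, which is how the later example places its keys, and uses b = 8 to match the pictured value; `first_bits` is a made-up name.

```python
def first_bits(hash_value: int, i: int, b: int) -> int:
    """Return the leading i bits of a b-bit hash value."""
    return hash_value >> (b - i)


h_k = 0b00110101  # the 8-bit value shown in the figure
prefixes = [first_bits(h_k, i, 8) for i in (1, 2, 3, 4)]
```

As i grows, each prefix extends the previous one by one bit, so keys that shared a bucket are only ever separated, never regrouped.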

Page 14

(b) Use directory

[Figure: h(K)[i] selects a directory entry, which points to a bucket]

Directory contains 2^i pointers to buckets, and stores i.

Each bucket stores j, indicating the # of bits used for placing the records in this block (j ≤ i)

Page 15

Extensible Hashing: Insertion

• If there's room in bucket h(k)[i], place the record there; otherwise …
  • If j = i, set i = i+1 and double the directory (after this, j < i)
  • If j < i, split the block in two, distribute records among them now using j+1 bits of h(k) (repeat until some records end up in the new bucket), and update the pointers of the bucket array
• See the next example

Page 16

Example: h(k) is 4 bits; 2 keys/block

Directory (i = 1):
0 → bucket (j = 1): 0001
1 → bucket (j = 1): 1001, 1100

Insert 1010: bucket 1 is full and j = i, so the directory doubles and the bucket splits.

New directory (i = 2):
00, 01 → bucket (j = 1): 0001
10 → bucket (j = 2): 1001, 1010
11 → bucket (j = 2): 1100

Page 17

Example continued

Directory (i = 2):
00, 01 → bucket (j = 1): 0001
10 → bucket (j = 2): 1001, 1010
11 → bucket (j = 2): 1100

Insert: 0111, 0000

0111 fits into the bucket of 0001. 0000 then finds that bucket full with j < i, so the bucket splits without doubling the directory:
00 → bucket (j = 2): 0000, 0001
01 → bucket (j = 2): 0111

Page 18

Example continued

Directory (i = 2):
00 → bucket (j = 2): 0000, 0001
01 → bucket (j = 2): 0111
10 → bucket (j = 2): 1001, 1010
11 → bucket (j = 2): 1100

Insert: 1001

Bucket 10 is full and j = i, so the directory doubles and bucket 10 splits.

New directory (i = 3):
000, 001 → bucket (j = 2): 0000, 0001
010, 011 → bucket (j = 2): 0111
100 → bucket (j = 3): 1001, 1001
101 → bucket (j = 3): 1010
110, 111 → bucket (j = 2): 1100
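The insertion procedure and the example above can be replayed with a minimal in-memory sketch. The class and method names are assumptions made for illustration, Python lists stand in for disk blocks, and the case where all b bits are used up is simply reported as an error instead of being handled with overflow blocks.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # j: number of hash bits this bucket depends on
        self.keys = []


class ExtensibleHash:
    def __init__(self, hash_bits=4, capacity=2):
        self.b = hash_bits        # b: bits produced by the hash function
        self.capacity = capacity  # keys per block
        self.i = 1                # i: bits currently used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, h):
        """Directory slot = leading i bits of the b-bit hash value."""
        return h >> (self.b - self.i)

    def insert(self, h):
        while True:
            bucket = self.directory[self._index(h)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(h)
                return
            if bucket.depth == self.i:
                if self.i == self.b:
                    raise OverflowError("all hash bits in use")
                # j = i: double the directory; each pointer is duplicated.
                self.directory = [bk for bk in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)  # now j < i; retry the insert afterwards

    def _split(self, bucket):
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        shift = self.b - bucket.depth
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            # Bit number j+1 of the key decides which half it belongs to.
            (new_bucket if (k >> shift) & 1 else bucket).keys.append(k)
        for slot, bk in enumerate(self.directory):
            if bk is bucket and (slot >> (self.i - bucket.depth)) & 1:
                self.directory[slot] = new_bucket


# Replay the example: three initial keys, then inserts 1010, 0111, 0000, 1001.
table = ExtensibleHash(hash_bits=4, capacity=2)
for key in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    table.insert(key)
```

The retry loop in `insert` corresponds to the "repeat until some records end up in the new bucket" remark: a split that leaves the target bucket full simply triggers another round.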

Page 19

Extensible hashing: deletion

• Reverse insert procedure …

• Example: Walk through the insert example in reverse!

Page 20

Summary: Extensible hashing

+ Can handle growing files
  - without full reorganizations
+ Only one data block examined

- Indirection
  (Not bad if directory in memory)
- Directory doubles in size
  (First it fits in memory, then it does not: sudden performance degradation)

Page 21

Linear hashing: grow # of buckets by one

No bucket directory needed

Two ideas:
(a) Use i low order bits of hash

[Figure: h(K) = 01110101, a b-bit value; the last i bits are used, and i grows]

(b) File grows linearly

Page 22

Linear Hashing: Parameters

• n: number of buckets in use
  – buckets numbered 0…n-1
• i: number of bits of h(k) used to address buckets
  i = ⌈log2 n⌉
• r: number of records in hash table
  – ratio r/n limited to fit an avg bucket in a block
  – next example: r ≤ 1.7n, and block holds 2 records
    => AVG bucket occupancy is ≤ 1.7/2 = 0.85 of a block

Page 23

Example: 2 keys/block, b = 4 bits, n = 2, i = 1

Bucket 00: 0000, 1010
Bucket 01: 1111

Rule: If h(k)[i] = (a1 … ai)2 < n, then look at bucket h(k)[i]; else look at bucket h(k)[i] - 2^(i-1) = (0 a2 … ai)2

• insert 0101
  Bucket 01: 1111, 0101
• now r = 4 > 1.7n → get new bucket 10 and distribute keys btw buckets 00 and 10
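The addressing rule can be sketched as a single function, assuming h(k)[i] denotes the i low-order bits of the hash value (`bucket_address` is a made-up name):

```python
def bucket_address(hash_value: int, i: int, n: int) -> int:
    """Map a hash value to a bucket number in 0..n-1 using its last i bits."""
    m = hash_value & ((1 << i) - 1)  # (a1 ... ai) as a binary number
    if m < n:
        return m
    # Bucket m does not exist yet: clear the leading bit a1.
    return m - (1 << (i - 1))


# With n = 2, i = 1 the key 0101 goes to bucket 1.
# With n = 3, i = 2 a key ending in 11 is redirected from bucket 11 to bucket 01.
addresses = [bucket_address(0b0101, 1, 2), bucket_address(0b0111, 2, 3)]
```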

Page 24

n = 3, i = 2; distribute keys btw buckets 00 and 10:

Bucket 00: 0000
Bucket 01: 0101, 1111
Bucket 10: 1010

Page 25

n = 3, i = 2; insert 0001:

Bucket 00: 0000
Bucket 01: 0101, 1111 → overflow block: 0001
Bucket 10: 1010

• can have overflow chains!

Page 26

n = 3, i = 2

Bucket 00: 0000
Bucket 01: 0101, 1111 → overflow block: 0001
Bucket 10: 1010

• insert 0111
• bucket 11 not in use → redirect to 01

Bucket 01: 0101, 1111 → overflow block: 0001, 0111

• now r = 6 > 1.7n → get new bucket 11

Page 27

n = 4, i = 2; distribute keys btw 01 and 11

Bucket 00: 0000
Bucket 01: 0101, 0001
Bucket 10: 1010
Bucket 11: 1111, 0111
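The whole linear hashing example can be replayed with a minimal sketch. The class and its method names are assumptions for illustration; each bucket is a plain Python list, so keys beyond the block capacity of 2 simply model an overflow chain instead of being stored in separate blocks.

```python
class LinearHash:
    def __init__(self, max_ratio=1.7):
        self.max_ratio = max_ratio  # a new bucket is added when r > max_ratio * n
        self.n = 2                  # buckets in use, numbered 0..n-1
        self.i = 1                  # low-order hash bits used for addressing
        self.r = 0                  # records stored
        self.buckets = [[], []]

    def _address(self, h):
        m = h & ((1 << self.i) - 1)
        if m >= self.n:             # bucket not in use yet: drop the leading bit
            m -= 1 << (self.i - 1)
        return m

    def insert(self, h):
        self.buckets[self._address(h)].append(h)
        self.r += 1
        if self.r > self.max_ratio * self.n:
            self._grow()

    def _grow(self):
        new_index = self.n
        self.buckets.append([])
        self.n += 1
        if self.n > (1 << self.i):  # keeps i = ceil(log2 n)
            self.i += 1
        # The new bucket 1a2...ai takes its keys from bucket 0a2...ai.
        old_index = new_index - (1 << (self.i - 1))
        old_keys, self.buckets[old_index] = self.buckets[old_index], []
        for k in old_keys:
            self.buckets[self._address(k)].append(k)


# Replay the example: initial keys 0000, 1010, 1111, then 0101, 0001, 0111.
table = LinearHash()
for key in (0b0000, 0b1010, 0b1111, 0b0101, 0b0001, 0b0111):
    table.insert(key)
```

Note that the bucket that gets split is determined by the number of the new bucket, not by where the latest insert happened.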

Page 28

Example Continued: How to grow beyond this?

[Figure: with i = 2 and m = 11 (max used block), buckets 00, 01, 10, 11 are all in use. To grow further, i becomes 3: the existing buckets are addressed as 000, 001, 010, 011, and buckets 100, 101, 110, 111 are added one at a time as m advances to 100, 101, . . .; key 0101 moves to bucket 101 once that bucket exists.]

Page 29

Summary: Linear Hashing

+ Can handle growing files
  - without full reorganizations
+ No indirection directory of extensible hashing

- Can have overflow chains
  - but probability of long chains can be kept low by controlling the r/n fill ratio (?)

Page 30

Summary

Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Page 31

Next:

• Indexing vs Hashing
• Index definition in SQL

Page 32

Indexing vs Hashing

• Hashing good for probes given key
  e.g., SELECT …
        FROM R
        WHERE R.A = 5

Page 33

Indexing vs Hashing

• INDEXING (Including B-Trees) good for Range Searches:
  e.g., SELECT …
        FROM R
        WHERE R.A > 5
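The contrast between the two query types can be illustrated with a small sketch. It is only an analogy: a Python dict stands in for a hash index and a sorted list searched with `bisect` stands in for an ordered index such as a B-tree; the sample rows are made up.

```python
import bisect

rows = [{"A": 3}, {"A": 5}, {"A": 8}, {"A": 5}, {"A": 9}]  # hypothetical relation R

# Hash-style index: value of A -> positions of matching rows.
hash_index = {}
for pos, row in enumerate(rows):
    hash_index.setdefault(row["A"], []).append(pos)

# Ordered index: (value of A, position) pairs kept in sorted order.
sorted_index = sorted((row["A"], pos) for pos, row in enumerate(rows))


def probe(value):
    """WHERE R.A = value: one hash lookup."""
    return hash_index.get(value, [])


def range_greater_than(value):
    """WHERE R.A > value: binary search for the start, then scan in order."""
    start = bisect.bisect_right(sorted_index, (value, float("inf")))
    return [pos for _, pos in sorted_index[start:]]


equal_to_5 = probe(5)
greater_than_5 = range_greater_than(5)
```

The hash index offers no way to find "the next larger key", so a range query over it would have to examine every bucket.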

Page 34

Index definition in SQL

• CREATE INDEX name ON rel (attr)
• CREATE UNIQUE INDEX name ON rel (attr)
  defines candidate key
• DROP INDEX name
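These statements can be tried out with Python's built-in `sqlite3` module, as in the sketch below. SQLite is used only because it needs no setup; the table `rel`, its columns and the index names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rel (attr INTEGER, other TEXT)")

# Plain index and unique index.
conn.execute("CREATE INDEX attr_idx ON rel (attr)")
conn.execute("CREATE UNIQUE INDEX other_idx ON rel (other)")

conn.execute("INSERT INTO rel VALUES (5, 'x')")
try:
    # A second row with the same value in the uniquely indexed column is rejected.
    conn.execute("INSERT INTO rel VALUES (6, 'x')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX attr_idx")
remaining = [
    name
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```

The rejected insert shows how a unique index enforces the candidate-key property.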

Page 35

Note

CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …)
OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)

... at least in SQL …

Oracle and IBM DB2 UDB provide a PCTFREE clause to indicate the proportion of B-tree blocks initially left unfilled

Oracle: “Hash clusters” with built-in or DBA-specified hash function

Page 36

The BIG picture….

• Chapters 2 & 3: Storage, records, blocks...
• Chapter 4: Access Mechanisms
  - Indexes
  - B trees
  - Hashing
• Chapters 6 & 7: Query Processing (NEXT)