1 HASH TABLES The crucial disadvantage for avoiding arrays is that we need to allocate in advance the size of this structure We tend to overestimate its size and end up with a very sparse structure

Mar 31, 2015


Solomon Chavez
Transcript

HASH TABLES

The crucial disadvantage of arrays is that we need to allocate the size of the structure in advance.

We tend to overestimate its size and end up with a very sparse structure.


STORING BIG DATA

We tend to think that the actual number of keys to be stored is equal to the universe of possible existing keys


HASH TABLES

Often the number of keys to be stored is smaller than the number in the universe of keys.

In this case, a hash table may save us a lot of space.


HASH TABLES

How can you store all possible SSNs in an array?

Use an array with range 0 - 999,999,999: a billion possible locations!

This will give you O(1) access time, but considering there are approximately 308,000,000 people in the USA, you waste roughly 1,000,000,000 - 308,000,000 = 692,000,000 array entries!


PROBLEM - WASTED SPACE

Problem:

The range of key values we are mapping (0 - 999,999,999) is too large when compared to the number of actual keys (US citizens).


HASH TABLES

All search structures so far relied on a comparison operation.

Performance: O(n) or O(log n) for input of size n.

WE CAN DO BETTER WITH HASHING


Simplest case: assume we have keys with values in the range 1 .. M.

Use a hash method that computes an int from the key, and use that int to select the slot in a direct-access table in which to store the item.


HASH(KEY)

To search for an item with key k, look in slot hash(k), which produces an int that maps to an index in the array.

If there’s an item there, you’ve found it.

If the tag is 0, it’s missing.
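
A minimal Java sketch of the direct-access idea, under some assumptions that are not in the slides: keys are ints in 0 .. m-1, items are Strings, and an empty slot is marked with null rather than a 0 tag. The class and method names are hypothetical.

```java
// Direct-access table: the key itself selects the slot.
class DirectAccessTable {
    private final String[] slots;   // null marks an empty slot

    DirectAccessTable(int m) {
        slots = new String[m];      // one slot per possible key 0 .. m-1
    }

    void put(int key, String item) {
        slots[key] = item;          // O(1): no comparisons needed
    }

    String get(int key) {
        return slots[key];          // null means the key is missing
    }
}
```

Every possible key owns a slot, which is why this layout wastes space when few of the possible keys are actually stored.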


CONSTANT TIME SEARCH

This produces a Constant time search O(1)


EXAMPLE (IDEAL) HASH FUNCTION

Suppose we now have Strings and must hash them to an integer.

Our hash function maps the following values:

hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

[Figure: the resulting table, slots 0-9: 0 kiwi, 1 (empty), 2 banana, 3 watermelon, 4 (empty), 5 apple, 6 mango, 7 cantaloupe, 8 grapes, 9 strawberry]


WHY HASH TABLES?

We use key/value pairs to store an Entry in the table.

We use a hash function to map a key (e.g. “hawk”) to an integer.

The value column holds the data we are actually interested in.

[Figure: table slots 141-148 with a key column (robin, sparrow, hawk, seagull, bluejay, owl) and a value column (robin info, sparrow info, hawk info, seagull info, bluejay info, owl info)]


HASH FUNCTIONS

Hash tables normally provide O(1) time (constant time) to access an element

In a direct-access table, an element with key k (an integer value) is stored in slot k.

In a hash table, this element is stored in slot = hash(key).


HASH FUNCTIONS

hash(k) is a hash function.

It maps the universe U of keys into the slots of a hash table (smaller than the universe),

thus reducing the size of the space we need to use.


PICTORIAL VIEW OF HASH TABLES

[Figure: keys k1, k2, k3, and k4 from the universe of keys, each mapped to a slot of the table]

THE UNIVERSE OF VALUES IS MAPPED TO A SMALLER NUMBER OF SLOTS


HASHING

Assume I have a hash function where the key is a String

e.g. A label which represents a city in our HPAir project

hash( key ) → integer

i.e. the function maps the key to an integer

That is, a string (the city name) is mapped to an int, which is an index into the HashMap

What performance (Big-O) do I get?


HASH TABLES - CONSTRAINTS

Initial constraints: hash a key to an integer.

The hashcode of a key must be unique.

Keys must lie in a small range for storage efficiency.

Keys must be dense in the range. If they’re sparse (lots of gaps between values), a lot of space is used to obtain speed.


HASH TABLES

Hashing keys produces integers, therefore we need a hash function

hash( key ) → integer

i.e. one that maps (hashes) a key to an integer.

Applying this function to the key produces a unique address.


PROBLEMS WITH A UNIQUE ADDRESS FOR EACH KEY

If hash(key) maps each key to a unique integer in the range 0 .. m-1

then search is O(1).

BUT THIS IS HARD TO DO!!!!!


Example: using an n-character key, e.g. a String, where n = number of characters in the String.

Use a String class method to change the String to a character array.

Call a method with the array name and the number of chars in the String:

hash(char array, # of characters)


HASHING A STRING OF CHARACTERS

// n = number of chars in the String
int hash(char[] sarray, int n) {
    int sum = 0, i = 0;
    // sum the character codes (ASCII/Unicode values) of the characters
    while (n-- > 0)
        sum = sum + sarray[i++];   // a char widens to its int code
    return sum % 256;              // 256 = number of 8-bit (extended ASCII) codes
}

returns a value in 0 .. 255


EVALUATION

int hash(char[] sarray, int n) {
    int sum = 0, i = 0;
    while (n-- > 0)
        // get the character code of each character and sum them
        sum = sum + sarray[i++];
    return sum % 256;
}

returns a value in 0 .. 255

Computing the hash takes time proportional to the number of characters in the String. That number is fixed for each String and does not depend on how many items are in the table, so the hash function itself is treated as O(1).


HASH TABLES – PROBLEM - COLLISIONS

With this hash function

int hash(char[] s, int n) {
    int sum = 0, i = 0;
    while (n-- > 0)
        sum = sum + s[i++];
    return sum % 256;
}

hash("AB".toCharArray(), 2) and hash("BA".toCharArray(), 2) return the same value, because their ASCII (Unicode) values are summed and addition ignores order.

The Unicode value for A is 65 and for B is 66. Add them together in any order and they equal 131.

This is called a collision
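
The collision can be reproduced with a small Java sketch. It takes a String directly instead of a char array plus a length, which is a simplification of the slide's signature, and the class name is hypothetical.

```java
class SumHashDemo {
    // Sums the character codes of s and folds the sum into 0 .. 255.
    static int hash(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;                    // 'A' is 65, 'B' is 66
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        System.out.println(hash("AB"));  // 65 + 66 = 131
        System.out.println(hash("BA"));  // 66 + 65 = 131, the same slot
    }
}
```

Any two strings that are rearrangements of the same characters collide under this function, which is one reason a plain sum is a weak hash.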


COLLISIONS

Because we're mapping a larger universe into a smaller set of slots, collisions occur.

A variety of techniques are used for resolving collisions.

Therefore having a unique address for each key is HARD TO DO.


PICTORIAL VIEW OF COLLISION

[Figure: keys k1 to k5 mapped to table slots, with more than one key landing on the same slot]

Sometimes keys map to the same memory location: COLLISION


HASH TABLES – COLLISION SOLUTIONS I

We need to store the actual key with the item in the hash table.

We compute the address: index = hash( key )

Next, look at that index in the table.

If the location is occupied, we try the next entry until we find an open one.


COLLISION RESOLUTION & OPEN HASHING

The most common resolution mechanism for collisions is called chaining.

This is also called Open Hashing.

Being "open", the hash table stores a linked list of the entries whose keys hash to the same value.

Chaining incorporates the concepts of linked lists and direct access structures like arrays.

Each slot of the hash table is a pointer to a linked list.


CHAINING OR OPEN HASHING

When hashing a key, if a collision happens

the new key is stored in the linked list in that location

E.g., suppose that we're mapping the universe of integers to a hash table of size 10


OPEN HASH TABLE

[Figure: columns labeled KEYS, BUCKETS, and ENTRIES]

John Smith and Sandra map to the same location, so a linked list is started from John to Sandra.


HASH TABLES - LINKED LISTS

Collisions - Resolution

A linked list is attached to each primary table slot.

// Three entries map to the same location: h(k) == h(k1) == h(k2)

Searching for k1:
– Calculate hash(k1)
– Item doesn’t match
– Follow linked list to k1
– If NULL found, key isn’t in table


HASH TABLES - LINKED LISTS

If a search can be satisfied by any item with key k, performance is still O(1),

but if the key values are different we get O( 1 * max ),

where max is the largest number of duplicates, or the length of the longest chain (linked list).


TECHNIQUE TWO - USE AN OVERFLOW AREA

A linked list is constructed in a special area of the table called the OVERFLOW AREA.

If two keys map to the same location, hash(k) == hash(j), k is stored first.

Adding j, when hash(j) maps to hash(k):
– Find k
– Then go to the first slot in the overflow area
– Put j in it

Searching: same as linked list


HASHING(103)

hash(103) = 103 mod 10 = 3

Our hash function is based on the division method for creating hash functions:

hash(k) = k mod size
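
A minimal Java sketch of the division method. Using Math.floorMod instead of the % operator is an added assumption so that negative keys also land in 0 .. size-1. The class name is hypothetical.

```java
class DivisionHash {
    // Division method: hash(k) = k mod size.
    static int hash(int k, int size) {
        return Math.floorMod(k, size);   // result is always in 0 .. size-1
    }

    public static void main(String[] args) {
        System.out.println(hash(103, 10));  // 3
        System.out.println(hash(69, 10));   // 9
    }
}
```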


HASHING(103)

hash(103) = 103 mod 10 = 3

[Figure: slot 3 holds 103]


HASHING(69)

hash(69) = 69 mod 10 = 9

[Figure: slot 3 holds 103; slot 9 holds 69]


HASHING(20)

hash(20) = 20 mod 10 = 0

[Figure: slot 0 holds 20; slot 3 holds 103; slot 9 holds 69]


HASHING(13)

hash(13) = 13 mod 10 = 3

[Figure: slot 0 holds 20; slot 3 holds the chain 103 → 13; slot 9 holds 69]


HASHING(110)

hash(110) = 110 mod 10 = 0

[Figure: slot 0 holds the chain 20 → 110; slot 3 holds the chain 103 → 13; slot 9 holds 69]


HASHING(53)

hash(53) = 53 mod 10 = 3

[Figure: slot 0 holds the chain 20 → 110; slot 3 holds the chain 103 → 13 → 53; slot 9 holds 69]


FINAL HASH TABLE

[Figure: slot 0 holds the chain 20 → 110; slot 3 holds the chain 103 → 13 → 53; slot 9 holds 69]


SEARCHING FOR 53 USING CHAINING

[Figure sequence: hash(53) = 53 mod 10 = 3 selects slot 3; a temp pointer then follows the chain 103 → 13 → 53 until 53 is found]
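
The walk-through above can be sketched in Java. This is a sketch under assumptions, not the slides' own implementation: it uses java.util.LinkedList for the chains, hash(k) = k mod size, and appends new keys at the end of a chain to match the order in the figures. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Chained (open) hash table for int keys, using hash(k) = k mod size.
class ChainedIntTable {
    private final List<LinkedList<Integer>> slots = new ArrayList<>();

    ChainedIntTable(int size) {
        for (int i = 0; i < size; i++) {
            slots.add(new LinkedList<>());   // every slot starts with an empty chain
        }
    }

    private int hash(int k) {
        return Math.floorMod(k, slots.size());
    }

    // Appends k to the chain of its slot unless it is already there.
    void insert(int k) {
        LinkedList<Integer> chain = slots.get(hash(k));
        if (!chain.contains(k)) {
            chain.addLast(k);
        }
    }

    // Walks only the chain in slot hash(k).
    boolean contains(int k) {
        return slots.get(hash(k)).contains(k);
    }

    List<Integer> chain(int slot) {
        return slots.get(slot);
    }

    public static void main(String[] args) {
        ChainedIntTable t = new ChainedIntTable(10);
        for (int k : new int[] {103, 69, 20, 13, 110, 53}) {
            t.insert(k);
        }
        System.out.println(t.chain(3));       // [103, 13, 53]
        System.out.println(t.contains(53));   // true
    }
}
```

Searching for 53 touches only the three entries in slot 3, never the chains in slots 0 and 9.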


CLOSED HASHING - RE-HASH FUNCTIONS

Closed hashing is a method of collision resolution in hash tables.

With this method, a hash collision is resolved by probing, or searching through other locations in the array.


+1 SOLUTION - LINEAR PROBING

In one variation, the probing sequence is called (+1), or Linear Probing.

Continue probing adjacent locations until an unused array slot is found.

Then put the Entry in that location.


CLOSED HASHING - E.G. LINEAR PROBING

Closed Hashing keeps keys in the main table and uses a re-hash function, which has many variations.

Linear probing (the previous example) is the most common one.

Closed Hashing uses the main table, or flat area, to find another location.


REHASH FUNCTION - LINEAR PROBING

For Linear Probing, the rehash function hash’(x) is +1.

Keep going to the next slot until you find an empty one.


INSERTION, I

Suppose you want to add seagull to this hash table

Also suppose:
– hashCode(seagull) = 143
– table[143] is not empty
– table[143] != seagull
– table[144] is not empty
– table[144] != seagull
– table[145] is empty

Therefore, put seagull at location 145

[Figure: table slots 141-148 holding robin, sparrow, hawk, bluejay, and owl; seagull is added at slot 145]


SEARCHING, I

Suppose you want to look up seagull in this hash table

Also suppose:
– hashCode(seagull) = 143
– table[143] is not empty
– table[143] != seagull
– table[144] is not empty
– table[144] != seagull
– table[145] is not empty
– table[145] == seagull !

We found seagull at location 145

[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; seagull is in slot 145]


SEARCHING, II

Suppose you want to look up cow in this hash table

Also suppose:
– hashCode(cow) = 144
– table[144] is not empty
– table[144] != cow
– table[145] is not empty
– table[145] != cow
– table[146] is empty

If cow were in the table, we should have found it by now

Therefore, it isn’t here

[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; slot 146 is empty]


INSERTION, II

Suppose you want to add hawk to this hash table

Also suppose:
– hashCode(hawk) = 143
– table[143] is not empty
– table[143] != hawk
– table[144] is not empty
– table[144] == hawk

hawk is already in the table, so do nothing

[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; hawk is in slot 144]


INSERTION, III

Suppose:
– You want to add cardinal to this hash table
– hashCode(cardinal) = 147
– The last location is 148
– 147 and 148 are occupied

Solution:
– Treat the table as circular; after 148 comes 0
– Hence, cardinal goes in location 0 (or 1, or 2, or ...)

[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; slots 147 and 148 are occupied]


LINEAR PROBING – REVIEW:

Closed Hashing uses Linear Probing (among others)

Linear Probing: if position h(key) is occupied, do a linear search in the table until you find an empty slot.

The slots are searched in this order:

h(key), h(key)+1, h(key)+2, ..., h(key)+c
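
A minimal Java sketch of linear probing over a circular table. It assumes String keys, the built-in String.hashCode() reduced modulo the table length, and null as the empty-slot marker. The class and method names are hypothetical, and deletion is left out.

```java
// Closed hashing with linear (+1) probing over a circular table of Strings.
class LinearProbingTable {
    private final String[] slots;   // null marks an empty slot

    LinearProbingTable(int size) {
        slots = new String[size];
    }

    private int home(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    // Returns the slot holding key, or the first empty slot on its probe path,
    // or -1 if the table is full and key is absent.
    private int findSlot(String key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null || slots[i].equals(key)) {
                return i;
            }
            i = (i + 1) % slots.length;   // wrap around: after the last slot comes 0
        }
        return -1;
    }

    boolean insert(String key) {
        int i = findSlot(key);
        if (i < 0) return false;          // table is full
        slots[i] = key;                   // no change if key was already there
        return true;
    }

    boolean contains(String key) {
        int i = findSlot(key);
        return i >= 0 && slots[i] != null;
    }
}
```

The probe counter bounds the loop at one pass over the table, so a full table ends the search instead of looping forever.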


EXPANDING THE TABLE

If the table becomes full, an exception can be thrown or

we can expand the capacity.

This process is involved because if we double the size,

we risk a “sparse” structure that can impact the efficiency we seek.

One solution is to rehash the table using the new table size.
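
A sketch of rehashing into a larger table, assuming integer keys, hash(k) = k mod size, and linear probing in the new table. The class and method names are hypothetical.

```java
class RehashDemo {
    // Builds a table with newSize slots and re-inserts every key from old.
    // Keys must be re-hashed because k mod size changes when size changes.
    static Integer[] rehash(Integer[] old, int newSize) {
        if (newSize < old.length) {
            throw new IllegalArgumentException("new table must not be smaller");
        }
        Integer[] bigger = new Integer[newSize];
        for (Integer key : old) {
            if (key == null) continue;                 // skip empty slots
            int i = Math.floorMod(key.intValue(), newSize);
            while (bigger[i] != null) {
                i = (i + 1) % newSize;                 // linear probing in the new table
            }
            bigger[i] = key;
        }
        return bigger;
    }
}
```

Copying slot for slot would not work: a key that sat in slot 0 of a 5-slot table (such as 10) belongs in slot 10 of an 11-slot table.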


CLOSED HASHING - BUCKETS

One implementation for closed hashing groups hash table slots into buckets.

The M slots of the hash table are divided into B buckets, with each bucket consisting of M/B slots.

The hash function assigns each record to the first slot within one of the buckets.


BUCKET HASHING - USES MAIN TABLE

If this slot is already occupied,

then the bucket slots are searched sequentially until an open slot is found.


BUCKETS ON THE TABLE

If a bucket is entirely full,

then the record is stored in an overflow bucket of infinite capacity at the end of the table.

All buckets share the same overflow bucket. See this link for a fuller explanation:

http://research.cs.vt.edu/AVresearch/hashing/buckethash.php


SLOTS OR BUCKETS – 4 BUCKETS


BUCKET HASHING

To search, hash the key to determine which bucket should contain the record.

The records in this bucket are then searched.

How is this better than linear (+1) probing?


BUCKET HASHING

If the desired key value is not found and the bucket still has free slots, then the search is complete.

If the bucket is full, then the search goes to the overflow bucket.

If many records are in the overflow bucket, this will be an expensive process.
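
A sketch of bucket hashing in Java, assuming integer keys, a home bucket chosen by key mod B, and an ArrayList standing in for the overflow bucket of unlimited capacity. Deletion and duplicate checks are left out, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Bucket hashing: B buckets of equal size in one array, plus a shared overflow bucket.
class BucketTable {
    private final Integer[] slots;
    private final int buckets;
    private final int bucketSize;
    private final List<Integer> overflow = new ArrayList<>();

    BucketTable(int buckets, int bucketSize) {
        this.buckets = buckets;
        this.bucketSize = bucketSize;
        this.slots = new Integer[buckets * bucketSize];
    }

    private int firstSlot(int key) {
        return Math.floorMod(key, buckets) * bucketSize;   // first slot of the home bucket
    }

    void insert(int key) {
        int start = firstSlot(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) {
                slots[i] = key;
                return;
            }
        }
        overflow.add(key);                    // home bucket is entirely full
    }

    boolean contains(int key) {
        int start = firstSlot(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) return false;   // free slot: the search is complete
            if (slots[i] == key) return true;
        }
        return overflow.contains(key);        // bucket full: search the overflow bucket
    }
}
```

A search never leaves its home bucket unless that bucket is full, which is what makes the layout a good match for disk blocks.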


BUCKET HASHING ADVANTAGE

Bucket methods are good for implementing hash tables stored on disk, because the bucket size can be set to the size of a disk block.

Whenever search or insertion occurs, the entire bucket is read into memory.


USING BUCKETS

Because the entire bucket is then in memory,

processing an insert or search operation requires only one disk access, unless the bucket is full.

If the bucket is full, then the overflow bucket must be retrieved from disk as well.


CLUSTERING

Even with a good hash function, linear probing has its problems:

The position of the initial mapping of key k is called the home position of k.

When several insertions map to the same home position, they end up placed contiguously in the table.

This collection of keys with the same home position is called a cluster.


CLUSTERS

A cluster is a group of items not containing any open slots

Clusters cause efficiency to degrade


CLUSTERING

As clusters grow, the probability increases that a key will map to the middle of a cluster,

increasing the rate of the cluster’s growth.


CLUSTERS

This tendency of linear probing to place items together is known as primary clustering.

As these clusters grow, they merge with other clusters forming even bigger clusters which grow even faster.


OTHER COLLISION TECHNIQUES

We have looked at:
– Chaining (Linked Lists) (Open Hashing)
– Linear Probing (Closed Hashing)
– Bucket Hashing

Let us look at some other collision techniques


Other Closed Hashing techniques are:

Quadratic probing: a variant of the above where the term being added to the hash result is squared.

h(key) + c^2

Random probing: the term being added to the hash function is a random number.

h(key) + random()
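
A small Java sketch of the quadratic probe sequence, assuming the i-th probe (i = 0, 1, 2, ...) examines slot (h(key) + i^2) mod size. The wrap-around and the names are assumptions added for the example.

```java
class QuadraticProbe {
    // Slot examined on the i-th probe, starting from the home position.
    static int probe(int home, int i, int size) {
        return Math.floorMod(home + i * i, size);
    }

    public static void main(String[] args) {
        // Home position 3 in a table of 10 slots: offsets 0, 1, 4, 9.
        for (int i = 0; i < 4; i++) {
            System.out.print(probe(3, i, 10) + " ");   // prints 3 4 7 2
        }
        System.out.println();
    }
}
```

The growing step size is what spreads colliding keys away from the home position instead of packing them into adjacent slots.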


REHASH FUNCTIONS

Rehashing is a technique where a sequence of hashing functions is defined (h1, h2, ... hk).

If a collision occurs, the functions are used in this order.


hash’2(j) - second hash function

Use a second hash function (Re-Hashing):
– hash(k) == hash(j), and k is stored first
– Adding j: calculate hash(j) and find k there
– Calculate hash’2(j), where hash’2 is some other hash function
– Repeat until we find an empty slot
– Put j in it
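
One common way to realize a second hash function is double hashing, where the second function supplies the step between probes. The Java sketch below uses that technique as a stand-in; the slides do not say which second function they intend. The choice of h2, the integer keys, and all names are assumptions, and a prime table size of at least 2 is assumed so that every slot can be reached.

```java
class DoubleHashProbe {
    static int h1(int key, int size) {
        return Math.floorMod(key, size);
    }

    // Second hash function: never 0, so every probe moves to a different slot.
    static int h2(int key, int size) {
        return 1 + Math.floorMod(key, size - 1);
    }

    // Inserts key and returns the slot used, or -1 if no free slot was found.
    static int insert(Integer[] table, int key) {
        int size = table.length;
        int i = h1(key, size);
        int step = h2(key, size);
        for (int probes = 0; probes < size; probes++) {
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
            i = (i + step) % size;    // re-hash: jump by the second hash value
        }
        return -1;
    }
}
```

With a table of 7 slots, keys 10, 17, and 24 all have home slot 3, but their steps differ (17 steps by 6, 24 steps by 1), so they end up in slots 3, 2, and 4.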


HASH TABLES - RE-HASH FUNCTIONS

The re-hash function has many variations

Quadratic probing:
– h’(x) is squared
– Avoids primary clustering
– Secondary clustering occurs: all keys which collide on h(x) follow the same sequence
– First a = h(j), then a + c, a + 4c, a + 9c, a + 16c, ....


QUADRATIC PROBING

Some versions use:

p(K, i) = c1 i^2 + c2 i + c3 for some choice of constants c1, c2, and c3.

Secondary clustering generally less of a problem


SEARCHING IN A HASH TABLE

We have already seen how searching works with chaining. With Closed Hashing, we use the following steps:

Given a target, hash the target.

Take the value of the hash of the target and go to that slot.

If the target exists, it must be in this slot or further along its probe sequence.

Search from the current slot using a linear search.


LOOK UP A KEY

public lookup(key) {
    int i;
    i = find_slot(key);         // method to find key in table
    if slot[i] is occupied      // key is in table
        return slot[i].value;   // return value in slot
    else                        // key is not in table
        return not found;
}


LINEAR PROBING AND SINGLE-SLOT STEP

public find_slot(key) {
    int i;
    i = hash(key);   // use a hash method to hash the key
    // search until we either find the key, or find an empty slot
    while ((slot[i] is occupied) and (slot[i].key ≠ key)) {
        i = (i + 1) mod size;   // step one slot, wrapping at the end of the table
    }
    return i;
}


Deleting in a table – Closed Hashing

Suppose you want to look up cow in this hash table

Also suppose:
– hashCode(cow) = 144
– table[144] is not empty
– table[144] != cow
– table[145] is not empty
– table[145] != cow
– table[146] is empty

If cow were in the table, we should have found it by now

Therefore it is not there.

[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; slot 146 is empty]


DELETING FROM A TABLE

Problem: when an empty slot is reached, we assume the item we are searching for is not there.

Deletion leaves an empty slot.

When we next search for an item using linear probing, we assume the item is not there when we reach the empty slot.


TOMBSTONES

We assume the item is not there when we reached the empty slot.

When, in fact, the item could be AFTER the empty slot.


TOMBSTONES

Therefore, straight deletion of an item would not work.

Instead, the cell is marked (usually by use of a boolean variable) when an item is deleted.

The slot is often termed a “tombstone”.
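
A sketch of tombstone deletion with linear probing in Java. Instead of a boolean flag per cell, this version marks deleted cells with a single sentinel object, which is a swapped-in simplification. String keys and all names are assumptions.

```java
// Linear probing table where deleted cells become tombstones instead of empty slots.
class TombstoneTable {
    // Unique marker object; compared by identity, so it can never equal a real key.
    private static final String TOMBSTONE = new String("<deleted>");
    private final String[] slots;   // null marks a never-used slot

    TombstoneTable(int size) {
        slots = new String[size];
    }

    private int home(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    // Index of key, or -1. Tombstones are skipped; only a never-used slot ends the search.
    private int indexOf(String key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return -1;
            if (slots[i] != TOMBSTONE && slots[i].equals(key)) return i;
            i = (i + 1) % slots.length;
        }
        return -1;
    }

    boolean contains(String key) {
        return indexOf(key) >= 0;
    }

    boolean insert(String key) {
        if (contains(key)) return true;
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null || slots[i] == TOMBSTONE) {
                slots[i] = key;           // tombstones may be reused by insertions
                return true;
            }
            i = (i + 1) % slots.length;
        }
        return false;                     // table is full
    }

    void delete(String key) {
        int i = indexOf(key);
        if (i >= 0) slots[i] = TOMBSTONE; // mark the cell, do not empty it
    }
}
```

Searches walk past a tombstone, so an item stored after the deleted one stays reachable, while insertions may reuse the marked cell.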


HASH TABLES - SUMMARY SO FAR ...

Potential O(1) search time
– If a suitable function hash(key) → integer can be found

Space for speed trade-off
– “Full” hash tables don’t work (more later!)

Collisions
– Inevitable


Various resolution strategies looked at so far:

Linked lists

Overflow areas

Re-hash functions

Linear probing: h’ is +1

Quadratic probing: h’ is +i²

Any other hash function, or even a sequence of functions!
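As a small sketch of the two probing rules, the i-th probe position can be computed as below. The method names and the wrap-around with mod m are assumptions added for illustration.

```java
// Illustrative sketch of probe sequences; names are hypothetical.
class ProbeSequenceSketch {
    // Linear probing: step forward one slot per attempt.
    static int linearProbe(int home, int i, int m) {
        return (home + i) % m;
    }

    // Quadratic probing: the offset from the home slot is i squared.
    // Assumes i is small enough that i * i does not overflow an int.
    static int quadraticProbe(int home, int i, int m) {
        return (home + i * i) % m;
    }
}
```

Starting from slot 144 in a table of 200 slots, linear probing visits 144, 145, 146, 147 while quadratic probing visits 144, 145, 148, 153.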

COMPARISON OF COLLISION TECHNIQUES

[Figure: expected number of probes versus load factor (n/size) for linear probing, random probing, and chaining]

HASHING WITH CHAINING

What is the running time to insert/search/delete?

Insert: It takes O(1) time to compute the hash function and insert at head of linked list

Search: It is proportional to the length of the linked list in that bucket (at worst, the max linked list length)

Delete: Same as search
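The three operations can be sketched in Java as below. This is an illustration only: the class name, the use of java.util.LinkedList for the chains, and the lack of a duplicate check on insert are all assumptions.

```java
import java.util.LinkedList;

// Illustrative sketch of hashing with chaining; all names are hypothetical.
class ChainedTableSketch {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTableSketch(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private LinkedList<String> bucketFor(String key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    // O(1): compute the hash, then add at the head of the list (no duplicate check).
    void insert(String key) {
        bucketFor(key).addFirst(key);
    }

    // Proportional to the length of the one list that is searched.
    boolean contains(String key) {
        return bucketFor(key).contains(key);
    }

    // Same cost as search: the list is scanned to find the node to unlink.
    boolean remove(String key) {
        return bucketFor(key).remove(key);
    }
}
```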

EFFICIENCY OF CHAINING

Therefore, if we have a “bad” hash function, all n keys may hash to the same table index giving an O(n) run-time!

So how can we create a “good” hash function?

HASH TABLES - CHOOSING THE HASH FUNCTION

Some functions are definitely better than others!

Key criterion: minimum number of collisions

Keeps chains short

Maintains O(1) on average

WRITING YOUR OWN HASHCODE METHOD

A hashCode method must:

Return a value that is a legal array index (or, as with Java's hashCode(), an int that the table then reduces to one)

Always return the same value for the same input

It can’t use random numbers, or the time of day

Return the same value for equal inputs

Must be consistent with your equals method
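A minimal sketch of a key class that follows these rules might look like this. BirdKey, its fields, and the indexFor helper are hypothetical names chosen for illustration; Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.Objects;

// Illustrative key class; the name and fields are hypothetical.
final class BirdKey {
    private final String species;
    private final int bandNumber;

    BirdKey(String species, int bandNumber) {
        this.species = species;
        this.bandNumber = bandNumber;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BirdKey)) return false;
        BirdKey other = (BirdKey) o;
        return bandNumber == other.bandNumber && species.equals(other.species);
    }

    // Uses exactly the fields that equals compares, so equal keys hash alike.
    // No randomness and no clock, so the same input always gives the same value.
    @Override
    public int hashCode() {
        return Objects.hash(species, bandNumber);
    }

    // Reduces the int hash code to a legal index for a table of the given size.
    int indexFor(int tableSize) {
        return Math.floorMod(hashCode(), tableSize);
    }
}
```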

HASHCODE FUNCTION

It does not need to return different values for different inputs – some collisions are inevitable.

A good hashCode method should:

Be efficient to compute

Give a uniform distribution of array indices

so NO SPARSE ARRAYS!

OTHER CONSIDERATIONS

The hash table might fill up; we need to be prepared for that

Generally speaking, hash tables work best when the table size is a prime number

HASH TABLES IN JAVA

Java provides two classes, Hashtable and HashMap, which implement the Map interface

Both are maps: they associate keys with values

Hashtable is synchronized; it can be accessed safely from multiple threads

Hashtable uses an open hash, and has a rehash method to increase the size of the table

HASHMAP

HashMap is newer, faster, and usually better, but it is not synchronized

HashMap (default) uses a bucket hash (linked list) and has a remove method

HASH TABLE OPERATIONS

Both Hashtable and HashMap are in java.util

Both have no-argument constructors, as well as constructors that take an integer table size

Both have methods as listed in next slide

METHODS

// puts the entry in the table; returns the previous value for this key, or null
public V put(K key, V value)

// returns the value for this key, or null
public V get(Object key)

public void clear() // clears the table

public Set<K> keySet() // returns the keys in the table as a Set
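A short usage sketch of these methods with java.util.HashMap follows. The wrapper class and the bird counts are made up for illustration. Hashtable offers the same four methods.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative usage of the Map methods listed above; the data is made up.
class MapUsageSketch {
    static Map<String, Integer> birdCounts() {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("robin", 3);   // new key: put returns null
        counts.put("hawk", 1);
        counts.put("robin", 4);   // existing key: value replaced, put returns 3
        return counts;
    }
}
```

Here `birdCounts().get("robin")` gives 4, `get("cow")` gives null, and `keySet()` holds the two keys robin and hawk.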

HASH TABLES - REDUCING THE RANGE TO [ 0, M )

We’ve mapped the keys to a range of integers

0 ≤ key < r

where r is decided by the total number of possible keys. For social security numbers, keys run up to 999,999,999

Now we must reduce this range to [ 0, m ), that is, from 0 up to but not including m

where m is a reasonable size for the hash table

HASH TABLES – HASH FUNCTIONS

Some typical functions

Division: Use a mod function

hash(k) = abs( k mod m)

where m is table size

which yields a range between 0 and m-1
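In Java the division method can be sketched as below. The class and method names are illustrative. The second method is an alternative added here, not something from the slides: Math.floorMod also stays in range for negative keys but can pick a different slot than abs.

```java
// Illustrative sketch of the division method; names are hypothetical.
class DivisionHashSketch {
    // hash(k) = abs(k mod m), as on the slide. Java's % can be negative
    // for negative k, so abs brings the result back into 0..m-1.
    static int hash(int k, int m) {
        return Math.abs(k % m);
    }

    // Alternative (an assumption, not from the slides): floorMod is never negative.
    static int hashFloorMod(int k, int m) {
        return Math.floorMod(k, m);
    }
}
```

For k = -7 and m = 5, `hash` gives 2 and `hashFloorMod` gives 3. Both are legal indices; a table just has to use one of them consistently.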

Some typical functions

Choice of m? Powers of 2 are generally not good!

h(k) = k mod 2^n keeps only the low-order n bits of k

Prime numbers close to 2^n are good choices
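A small experiment makes the difference visible. The sketch below counts how many distinct slots a set of keys lands in under k mod m; the class name, helper methods, and the choice of keys (multiples of 16) are assumptions made for illustration.

```java
// Illustrative experiment comparing table sizes; names are hypothetical.
class TableSizeSketch {
    // Counts how many distinct slots the keys occupy under k mod m.
    static int distinctSlots(int[] keys, int m) {
        boolean[] used = new boolean[m];
        int count = 0;
        for (int k : keys) {
            int slot = Math.floorMod(k, m);
            if (!used[slot]) {
                used[slot] = true;
                count++;
            }
        }
        return count;
    }

    // Builds the keys 0, step, 2*step, ..., (n-1)*step.
    static int[] multiplesOf(int step, int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i * step;
        }
        return keys;
    }
}
```

With the 16 keys 0, 16, 32, ..., 240, a table of size 16 puts every key in slot 0, because all of them share the same low-order 4 bits. A table of size 17 spreads the same keys over 16 different slots.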

CHOOSING A VIABLE VALUE FOR M

Prime numbers close to 2^n are good choices

E.g., if you want a ~4000 entry table, choose m = 4093

Other methods in your text.

PERFORMANCE ANALYSIS

If n slots in a table of size m are occupied, the load factor α is defined as:

α = n / m

n = number of items

m = number of slots

α = 1 means the table is full, and α = 0 means the table is empty.

It is generally good to get a value < 1, near 0.8.

[Figure: Successful search. Average number of probes (y-axis, 2 to 20) versus load factor (x-axis, 0 to 1) for linear probing, separate chaining, and double hashing]

[Figure: Unsuccessful search. Average number of probes (y-axis, 2 to 20) versus load factor (x-axis, 0 to 1) for linear probing, double hashing, and separate chaining]
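Curves like these are usually drawn from standard textbook approximations for the expected number of probes at load factor α. The sketch below uses the commonly quoted formulas; that the slides' plots use exactly these is an assumption, and the chaining formulas in particular vary between textbooks depending on what is counted as a probe. All names are illustrative, and the formulas are meant for 0 < α < 1.

```java
// Illustrative sketch of commonly quoted probe-count approximations;
// names are hypothetical, and a is the load factor with 0 < a < 1.
class ExpectedProbesSketch {
    // Linear probing, successful search: (1/2)(1 + 1/(1 - a))
    static double linearSuccessful(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    // Linear probing, unsuccessful search: (1/2)(1 + 1/(1 - a)^2)
    static double linearUnsuccessful(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }

    // Double hashing, successful search: (1/a) ln(1/(1 - a))
    static double doubleSuccessful(double a) {
        return Math.log(1 / (1 - a)) / a;
    }

    // Double hashing, unsuccessful search: 1/(1 - a)
    static double doubleUnsuccessful(double a) {
        return 1 / (1 - a);
    }

    // Separate chaining, successful search: about 1 + a/2
    static double chainingSuccessful(double a) {
        return 1 + a / 2;
    }

    // Separate chaining, unsuccessful search: about 1 + a (slot access plus chain scan)
    static double chainingUnsuccessful(double a) {
        return 1 + a;
    }
}
```

At a load factor of 0.8, these formulas give about 3 probes for a successful linear-probing search and about 13 for an unsuccessful one, while chaining stays below 2 in both cases.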

HASH TABLES - COLLISION RESOLUTION SUMMARY

Chaining
+ Unlimited number of elements
+ Unlimited number of collisions
- Overhead of multiple linked lists

Re-hashing
+ Fast re-hashing
+ Fast access through use of main table space
- Maximum number of elements must be known
- Multiple collisions become probable - CLUSTERING!

Overflow area
+ Fast access
+ Collisions don't use primary table space

TERMS TO KNOW

• Open Addressing looks for another open position in the table other than the one to which the element is originally hashed. Requires that the load factor be < 1.

• Open Addressing using Linear Probing seeks the next available position. It creates clusters; alternative methods include quadratic probing, etc.

• Separate Chaining: if two keys map to the same address, separate chaining creates a linked list of keys that map to that address.

HASHCODE FUNCTION IN JAVA

Hash function - has two parts:

Map key k to an integer

There is a default hashCode() in Java - the method maps each object to an integer.

It returns a 32 bit integer, which may be where the object is in memory.

That default works poorly with Strings, as two strings could be in different locations in memory and contain the same data; this is why String overrides hashCode() to hash its contents.
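The point about Strings can be checked with a short sketch. The class and method names are illustrative. It relies on two facts about Java: String.hashCode() is computed from the characters, and Object.hashCode() is not overridden by a plain Object.

```java
// Illustrative check of content-based hashing for Strings; names are hypothetical.
class StringHashSketch {
    // Two distinct String objects with the same characters.
    static boolean sameContentSameHash() {
        String a = new String("cow");
        String b = new String("cow");
        // a != b: they are different objects, at different places in memory.
        // a.equals(b): they hold the same data.
        // Equal hash codes: String hashes its contents, not its location.
        return a != b && a.equals(b) && a.hashCode() == b.hashCode();
    }

    // Two plain Objects are never equal to each other, so the default
    // hashCode() is free to give them unrelated values.
    static boolean plainObjectsAreDistinct() {
        Object x = new Object();
        Object y = new Object();
        return !x.equals(y);
    }
}
```

System.identityHashCode(s) returns the value the default hashCode() would give for a string s, which is useful for seeing the difference in a debugger.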

HASH TABLES - REVIEW

• If you can meet the constraints of a hash function that gives O(1) access:

Hash Tables will generally give good performance

O(1) search

BUT:

Not advisable for unknown data

If the collection size is relatively static (few insertions and deletions), memory management is actually simpler

UNIVERSAL OR PERFECT HASHING

“Dynamic perfect hashing” involves using a second hash table as the data structure to store multiple values within a particular bucket.

How do we find the next location with this approach?

UNIVERSAL HASHING

What advantages does it have over linear probing?

What are possible problems with the approach?

Perfect hashing means that read access takes constant time even in the worst case.

UNIVERSAL OR PERFECT HASHING

For inserting, the time bounds are only true on average.

To make insertion fast enough, the second level hash table is very large for the number of keys k it holds (k² slots), large enough so that collisions become unlikely.

SECOND LEVEL HASH TABLES

This is not a problem with table size because the first level hash distributes keys evenly, so that on average second level hash tables are still relatively small.

The hash functions for the second level tables are chosen at random from a set of parameterized hash functions.
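One second-level table can be sketched as follows. The parameterized family used here, h(k) = ((a·k + b) mod p) mod size, is a common textbook choice and is an assumption; the slides do not say which family they mean. The class name, the prime p = 2^31 - 1, and the requirement that keys be distinct and non-negative are also assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of one second-level table; all names are hypothetical.
// Assumes the keys are distinct and lie in [0, P).
class SecondLevelTableSketch {
    static final long P = 2_147_483_647L;   // the prime 2^31 - 1
    final long a;
    final long b;
    final int[] slots;                      // -1 marks an empty slot

    SecondLevelTableSketch(int[] keys, Random rng) {
        int size = Math.max(1, keys.length * keys.length);   // k^2 slots for k keys
        long tryA;
        long tryB;
        int[] placed;
        do {
            tryA = 1 + (long) rng.nextInt((int) (P - 1));    // a in [1, P-1]
            tryB = rng.nextInt((int) P);                     // b in [0, P-1)
            placed = tryPlace(keys, tryA, tryB, size);
        } while (placed == null);                            // collision: draw new parameters
        a = tryA;
        b = tryB;
        slots = placed;
    }

    // ((a*k + b) mod P) mod size; a*k + b fits in a long for keys below 2^31.
    static int index(long a, long b, int key, int size) {
        return (int) (((a * key + b) % P) % size);
    }

    // Places every key, or returns null as soon as two keys collide.
    private static int[] tryPlace(int[] keys, long a, long b, int size) {
        int[] table = new int[size];
        Arrays.fill(table, -1);
        for (int k : keys) {
            int i = index(a, b, k, size);
            if (table[i] != -1) return null;
            table[i] = k;
        }
        return table;
    }

    // One probe, with no collisions to resolve.
    boolean contains(int key) {
        return slots[index(a, b, key, slots.length)] == key;
    }
}
```

Because the table has k² slots, a randomly drawn function is collision-free reasonably often, so the retry loop is expected to finish after only a few draws.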

UNIVERSAL HASHING

Perfect hashing is possible when you know exactly what set of keys you are going to be hashing when you design your hash function.

It's popular for hashing keywords for compilers.

Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.

CHAINED BUCKET

Note: when using chaining, each linked list attached to a slot is called a bucket - this is called “chained bucket hashing”

However, there is also “bucket hashing” done on the main table - just to make things real clear.