Mar 31, 2015
1
HASH TABLES
The crucial disadvantage of arrays is that we must allocate the size of the structure in advance.
We tend to overestimate that size and end up with a very sparse structure.
2
STORING BIG DATA
We tend to assume that the number of keys actually stored equals the size of the universe of possible keys.
3
HASH TABLES
Often the number of keys to be stored is smaller than the number in the universe of keys.
In this case, a hash table may save us a lot of space.
4
HASH TABLES
How can you store all possible SSNs in an array?
Use an array with index range 0 - 999,999,999: a billion possible locations!
This gives O(1) access time, but considering there are approximately
308,000,000 people in the USA, you waste about 1,000,000,000 - 308,000,000 = 692,000,000 array entries!
5
PROBLEM - WASTED SPACE
Problem:
The range of key values we are mapping (0 - 999,999,999) is too large
when compared to the number of actual keys (US citizens).
6
HASH TABLES
All search structures so far
relied on a comparison operation.
Performance is O(n) or O(log n) for input of
size n.
WE CAN DO BETTER WITH HASHING
7
Simplest case: Assume we have keys with values in the range 1 .. M.
Use a hash method to compute an int from the key, and use that int to select a slot in a direct access table in which to store the item.
8
HASH(KEY)
To search for an item with key k,
look in slot hash(k); the hash method produces an int that maps to an index in the array.
If there’s an item there, you’ve found it.
If the slot's tag is 0, it’s missing.
9
CONSTANT TIME SEARCH
This produces a Constant time search O(1)
10
EXAMPLE (IDEAL) HASH FUNCTION Suppose we now have Strings and
must hash them to an integer.
Our hash function maps the following values:
hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2
Slot  Key
0     kiwi
1     (empty)
2     banana
3     watermelon
4     (empty)
5     apple
6     mango
7     cantaloupe
8     grapes
9     strawberry
11
WHY HASH TABLES?
We use key/value pairs to store an Entry in the table.
We use a hash function to map a key, e.g. “hawk”, to an integer: hash("hawk").
The value column holds the data we are actually interested in.
[Figure: table slots 141-148 with a key column (robin, sparrow, hawk, seagull, bluejay, owl) and a value column (robin info, sparrow info, hawk info, seagull info, bluejay info, owl info)]
12
HASH FUNCTIONS
Hash tables normally provide O(1) (constant) time to access an element.
In a direct-access table, an element with key k (an integer) is stored in slot k.
In a hash table, this element is stored in slot hash(k).
13
HASH FUNCTIONS
hash(k) is a hash function.
It maps the universe U of keys into the slots of a hash table (smaller than the universe),
thus reducing the size of the space we need to use.
14
PICTORIAL VIEW OF HASH TABLES
[Figure: keys k1, k2, k3, and k4 from the universe of keys mapped to table slots]
THE UNIVERSE OF VALUES IS MAPPED TO A SMALLER NUMBER OF SLOTS
15
HASHING
Assume I have a hash function where the key is a String,
e.g. a label which represents a city in our HPAir project.
hash( key ) → integer
i.e. the function maps the key to an integer:
a String (the city name) to an int, which is an index into the HashMap.
What performance (Big-O) do I get?
16
HASH TABLES - CONSTRAINTS
Initial Constraints – hash a key to an integer
The hash code of a key must be unique.
Keys must lie in a small range for storage efficiency.
Keys must be dense in the range:
if they’re sparse (lots of gaps between values), a lot of space is used to obtain speed.
17
HASH TABLES -
Hashing keys produces integers; therefore
we need a hash function hash( key ) → integer,
i.e. one that maps (hashes) a key to an integer.
Ideally, applying this function to the key produces a unique address.
18
PROBLEMS WITH A UNIQUE ADDRESS FOR EACH KEY
If hash(key) maps each key to a unique integer in the range 0 .. m-1,
then search is O(1).
BUT THIS IS HARD TO DO!
19
Example: using an n-character key, e.g. a String,
where n = number of characters in the String.
Use the String class method toCharArray() to change the String to a character array.
Call a method with the array and the number of chars in the String:
hash(char array, number of characters)
20
HASHING A STRING OF CHARACTERS
// n = number of chars in the String
int hash( char[] sarray, int n ) {
    int sum = 0, i = 0;
    // sum the character codes (ASCII values) of the characters
    while ( n-- > 0 ) sum = sum + sarray[i++];
    return sum % 256;   // 256 = number of 8-bit (extended ASCII) character codes
}
// returns a value in 0 .. 255
21
EVALUATION
int hash( char[] sarray, int n ) {
    int sum = 0, i = 0;
    // get the character code of each character and sum them
    while ( n-- > 0 ) sum = sum + sarray[i++];
    return sum % 256;   // returns a value in 0 .. 255
}
Computing the hash takes time proportional to n, the number of characters in the String. Key length is bounded and does not grow with the number of items in the table, so the hash function is treated as O(1).
22
HASH TABLES - PROBLEM - COLLISIONS
With this hash function
int hash( char[] s, int n ) { int sum = 0, i = 0; while ( n-- > 0 ) sum = sum + s[i++]; return sum % 256; }
the calls hash( "AB".toCharArray(), 2 ) and
hash( "BA".toCharArray(), 2 ) return the same value!
The ASCII (Unicode) value of A is 65 and of B is 66. Added together in either order, they equal 131.
This is called a collision.
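The AB/BA collision can be reproduced with a small runnable version of the additive hash. This is only a sketch: the class name `AdditiveHash` is made up for illustration, and the method sums UTF-16 code units, which match the ASCII values for plain letters.

```java
class AdditiveHash {
    // Sum the character codes and reduce the sum to a slot in 0..255.
    static int hash(char[] s, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += s[i];   // a char widens to its numeric code, e.g. 'A' -> 65
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        System.out.println(hash("AB".toCharArray(), 2)); // 65 + 66 = 131
        System.out.println(hash("BA".toCharArray(), 2)); // 66 + 65 = 131
    }
}
```

Because addition ignores order, any two strings that are rearrangements of the same characters collide under this function.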
23
COLLISIONS
Because we're mapping a larger universe into a smaller set of slots, collisions occur.
A variety of techniques are used for resolving collisions
Therefore producing a unique hash value for every key is HARD TO DO.
24
PICTORIAL VIEW OF COLLISION
[Figure: keys k1 through k5 mapped to table slots, with more than one key landing in the same slot]
Sometimes keys map to the same memory location: COLLISION
25
HASH TABLES – COLLISION SOLUTIONS I
We need to store the actual key with the item in the hash table.
We compute the address: index = hash( key ).
Next, look at that index in the table.
If the location is occupied by a different key, we try the next entry until we find an open one.
26
COLLISION RESOLUTION & OPEN HASHING
The most common resolution mechanism for collisions
is called chaining.
This is also called Open Hashing.
Being "open", the hash table stores a linked list of entries whose keys hash to the same value.
Chaining combines the concepts of linked lists and direct access structures like arrays.
Each slot of the hash table is a pointer to a linked list.
27
CHAINING OR OPEN HASHING
When hashing a key, if a collision happens
the new key is stored in the linked list in that location
E.g., suppose that we're mapping the universe of integers to a hash table of size 10
28
OPEN HASH TABLE
[Figure: keys, buckets, and entries]
John Smith and Sandra map to the same location, so a linked list is started from John to Sandra.
29
HASH TABLES - LINKED LISTS
Collisions - Resolution
A linked list is attached to each primary table slot.
// Three entries map to the same location: h(k) == h(k1) == h(k2)
Searching for k1:
Calculate hash(k1).
If the first item doesn’t match, follow the linked list to k1.
If NULL is found, the key isn’t in the table.
30
HASH TABLES - LINKED LISTS
If a search can be satisfied by any item with key k, performance is still O(1),
but if the key values are different we get O(max),
where max is the largest number of duplicates, i.e. the length of the
longest chain (linked list).
31
TECHNIQUE TWO - USE AN OVERFLOW AREA
A linked list is constructed in a special area of the table
called the OVERFLOW AREA.
If two keys map to the same location, hash(k) == hash(j),
k is stored first.
Adding j: when hash(j) maps to the slot holding k, find k, THEN go to the first open slot in the overflow area and put j in it.
Searching - same as linked list.
32
HASHING(103)
hash(103) = 103 mod 10 = 3
Our hash function is based on the division method for creating hash functions:
hash(k) = k mod size
33
HASHING(103)
hash(103) = 103 mod 10 = 3
Slot 3: 103 -> null
34
HASHING(69)
hash(69) = 69 mod 10 = 9
Slot 3: 103 -> null
Slot 9: 69 -> null
35
HASHING(20)
hash(20) = 20 mod 10 = 0
Slot 0: 20 -> null
Slot 3: 103 -> null
Slot 9: 69 -> null
36
HASHING(13)
hash(13) = 13 mod 10 = 3
Slot 0: 20 -> null
Slot 3: 103 -> 13 -> null
Slot 9: 69 -> null
37
HASHING(110)
hash(110) = 110 mod 10 = 0
Slot 0: 20 -> 110 -> null
Slot 3: 103 -> 13 -> null
Slot 9: 69 -> null
38
HASHING(53)
hash(53) = 53 mod 10 = 3
Slot 0: 20 -> 110 -> null
Slot 3: 103 -> 13 -> 53 -> null
Slot 9: 69 -> null
39
FINAL HASH TABLE
Slot 0: 20 -> 110 -> null
Slot 3: 103 -> 13 -> 53 -> null
Slot 9: 69 -> null
All other slots are empty.
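The insertion sequence 103, 69, 20, 13, 110, 53 can be replayed with a small chained table. This is a minimal sketch, not a reference implementation: the class `ChainedTable` and its method names are invented for illustration, and it appends new keys to the end of each chain, as the diagrams appear to do.

```java
import java.util.LinkedList;

class ChainedTable {
    private final LinkedList<Integer>[] slots;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        slots = new LinkedList[size];
        for (int i = 0; i < size; i++) slots[i] = new LinkedList<>();
    }

    // Division method: hash(k) = k mod size (floorMod keeps negatives in range).
    int hash(int key) { return Math.floorMod(key, slots.length); }

    // Append the key to the chain for its slot unless it is already there.
    void insert(int key) {
        LinkedList<Integer> chain = slots[hash(key)];
        if (!chain.contains(key)) chain.addLast(key);
    }

    // Search only the one chain the key can be in.
    boolean contains(int key) { return slots[hash(key)].contains(key); }

    String chain(int slot) { return slots[slot].toString(); }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(10);
        for (int k : new int[] {103, 69, 20, 13, 110, 53}) t.insert(k);
        System.out.println(t.chain(0)); // [20, 110]
        System.out.println(t.chain(3)); // [103, 13, 53]
        System.out.println(t.chain(9)); // [69]
    }
}
```

Appending keeps the chains in insertion order; inserting at the head instead would make insertion O(1) without walking the chain, at the cost of skipping the duplicate check.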
40
SEARCHING FOR 53 USING CHAINING
hash(53) = 53 mod 10 = 3, so the search starts at slot 3.
[Figure, animated over several slides: a temp reference walks the chain in slot 3 (103 -> 13 -> 53) until it reaches 53.]
45
CLOSED HASHING - RE-HASH FUNCTIONS
Closed hashing (also called open addressing) is a method of collision resolution in hash tables.
With this method, a hash collision is resolved by probing, or
searching through other locations in the array.
46
+1 SOLUTION - LINEAR PROBING
In one variation, the probing sequence is called (+1), or Linear Probing.
Continue probing adjacent locations
until an unused array slot is found.
Then put the Entry in that location.
47
CLOSED HASHING - E.G. LINEAR PROBING
Closed Hashing keeps keys in the main table and uses a re-hash function, which has many variations.
Linear probing (the previous example) is the most common.
Closed Hashing uses the main table, or flat area, to find another location.
48
REHASH FUNCTION - LINEAR PROBING
The rehash function for Linear Probing is
hash’(x) = (x + 1) mod table size.
Keep going to the next slot until you find an empty one.
49
INSERTION, I
Suppose you want to add seagull to this hash table.
Also suppose:
hashCode(seagull) = 143
table[143] is not empty
table[143] != seagull
table[144] is not empty
table[144] != seagull
table[145] is empty
Therefore, put seagull at location 145.
[Figure: table slots 141-148 holding robin, sparrow, hawk, bluejay, and owl; seagull is placed in slot 145]
50
SEARCHING, I
Suppose you want to look up seagull in this hash table.
Also suppose:
hashCode(seagull) = 143
table[143] is not empty
table[143] != seagull
table[144] is not empty
table[144] != seagull
table[145] is not empty
table[145] == seagull !
We found seagull at location 145.
[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; seagull is in slot 145]
51
SEARCHING, II
Suppose you want to look up cow in this hash table.
Also suppose:
hashCode(cow) = 144
table[144] is not empty
table[144] != cow
table[145] is not empty
table[145] != cow
table[146] is empty
If cow were in the table, we should have found it by now.
Therefore, it isn’t here.
[Figure: the same table, slots 141-148; slots 144 and 145 are occupied and slot 146 is empty]
52
INSERTION, II
Suppose you want to add hawk to this hash table.
Also suppose:
hashCode(hawk) = 143
table[143] is not empty
table[143] != hawk
table[144] is not empty
table[144] == hawk
hawk is already in the table, so do nothing.
[Figure: the same table, slots 141-148; hawk is in slot 144]
53
INSERTION, III
Suppose:
You want to add cardinal to this hash table.
hashCode(cardinal) = 147
The last location is 148.
147 and 148 are occupied.
Solution:
Treat the table as circular; after 148 comes 0.
Hence, cardinal goes in location 0 (or 1, or 2, or ...).
[Figure: the same table, slots 141-148; slots 147 and 148 are occupied]
54
LINEAR PROBING – REVIEW:
Closed Hashing uses Linear Probing (among others).
Linear Probing: If position h(key) is occupied, do a linear search in the table until you find an empty slot.
The slots are searched in this order:
h(key), h(key)+1, h(key)+2, ..., h(key)+c (each taken mod the table size)
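Insertion and search with linear probing can be sketched in a few lines of Java. The class `LinearProbingTable` and its method names are hypothetical, and integer keys with home position key mod size are an assumption chosen to keep the example easy to trace by hand.

```java
class LinearProbingTable {
    private final Integer[] slots;   // null marks an empty slot

    LinearProbingTable(int size) { slots = new Integer[size]; }

    // Home position of a key (assumed here to be key mod table size).
    int home(int key) { return Math.floorMod(key, slots.length); }

    // Returns the slot holding key, or the first empty slot on its probe
    // path, or -1 if the table is full and the key is absent.
    int findSlot(int key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null || slots[i] == key) return i;
            i = (i + 1) % slots.length;   // +1 step, wrapping to slot 0
        }
        return -1;
    }

    boolean insert(int key) {
        int i = findSlot(key);
        if (i < 0) return false;          // table is full
        slots[i] = key;                   // no change if the key was already there
        return true;
    }

    boolean contains(int key) {
        int i = findSlot(key);
        return i >= 0 && slots[i] != null;
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(10);
        for (int k : new int[] {13, 23, 33, 9, 19}) t.insert(k);
        System.out.println(t.findSlot(33)); // 5: slots 3 and 4 were taken
        System.out.println(t.findSlot(19)); // 0: wrapped around past slot 9
        System.out.println(t.contains(43)); // false: probing stops at empty slot 6
    }
}
```

The probe counter bounds the loop, so a full table ends the search instead of cycling forever.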
55
EXPANDING THE TABLE
If the table becomes full, an exception can be thrown or
we can expand the capacity.
This process is involved because if we double the size,
we risk a “sparse” structure that can impact the efficiency we seek.
One solution is to rehash the table: allocate the new table and re-insert every key using the new table size.
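Rehashing into a larger table can be sketched as follows. The class `Rehash` and the sizes 5 and 11 are illustrative assumptions; the point is that each key must be re-inserted, because its slot depends on the table size.

```java
import java.util.Arrays;

class Rehash {
    // Insert with linear probing; assumes the table has at least one empty slot.
    static void insert(Integer[] table, int key) {
        int i = Math.floorMod(key, table.length);
        while (table[i] != null && table[i] != key) {
            i = (i + 1) % table.length;
        }
        table[i] = key;
    }

    // Build a larger table and re-insert every key from the old one.
    static Integer[] grow(Integer[] old, int newSize) {
        Integer[] bigger = new Integer[newSize];
        for (Integer key : old) {
            if (key != null) insert(bigger, key);
        }
        return bigger;
    }

    public static void main(String[] args) {
        Integer[] small = new Integer[5];
        for (int k : new int[] {3, 8, 13}) insert(small, k);   // all hash to slot 3
        System.out.println(Arrays.toString(small)); // [13, null, null, 3, 8]
        Integer[] big = grow(small, 11);
        System.out.println(big[2] + " " + big[3] + " " + big[8]); // 13 3 8
    }
}
```

In the small table the three keys form one cluster; after rehashing with size 11 each key sits in its own home position.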
56
CLOSED HASHING - BUCKETS
One implementation for closed hashing groups hash table slots into buckets.
The M slots of the hash table are divided into B buckets, with each bucket consisting of M/B slots.
The hash function assigns each record to the first slot within one of the buckets.
57
BUCKET HASHING - USES MAIN TABLE
If this slot is already occupied,
then the bucket slots are searched sequentially until an open slot is found.
58
BUCKETS ON THE TABLE
If a bucket is entirely full,
then the record is stored in an overflow bucket of infinite capacity at the end of the table.
All buckets share the same overflow bucket. See this link for a fuller explanation:
http://research.cs.vt.edu/AVresearch/hashing/buckethash.php
59
SLOTS OR BUCKETS – 4 BUCKETS
60
BUCKET HASHING
To search, hash the key to determine which bucket should contain the record.
The records in this bucket are then searched.
How is this better than linear probing? -- +1
61
BUCKET HASHING
If the desired key value is not found and the bucket still has free slots, then the search is complete.
If the bucket is full, then the search goes to the overflow bucket.
If many records are in the overflow bucket, this will be an expensive process.
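The bucket scheme described in the last few slides can be sketched like this. The class `BucketTable`, the use of key mod B to pick a bucket, and the list-based overflow area are assumptions made for illustration; the sketch also assumes keys are never deleted.

```java
import java.util.ArrayList;
import java.util.List;

class BucketTable {
    private final Integer[] slots;     // M slots, grouped into B buckets
    private final int buckets;         // B
    private final int bucketSize;      // M / B slots per bucket
    private final List<Integer> overflow = new ArrayList<>(); // shared overflow bucket

    BucketTable(int buckets, int bucketSize) {
        this.buckets = buckets;
        this.bucketSize = bucketSize;
        this.slots = new Integer[buckets * bucketSize];
    }

    // Index of the first slot of the key's bucket.
    private int bucketStart(int key) {
        return Math.floorMod(key, buckets) * bucketSize;
    }

    void insert(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) { slots[i] = key; return; }
            if (slots[i] == key) return;                  // already present
        }
        if (!overflow.contains(key)) overflow.add(key);   // bucket is full
    }

    boolean contains(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) return false;   // free slot: search is complete
            if (slots[i] == key) return true;
        }
        return overflow.contains(key);            // full bucket: check overflow
    }

    int overflowCount() { return overflow.size(); }
}
```

For example, with 4 buckets of 2 slots each, inserting 0, 4, and 8 fills bucket 0 with 0 and 4 and sends 8 to the overflow bucket; a later search for 8 scans the full bucket and then the overflow list.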
62
BUCKET HASHING ADVANTAGE
Bucket methods are good for implementing hash tables stored on disk, because the bucket size can be set to the size of a disk block.
Whenever search or insertion occurs, the entire bucket is read into memory.
63
USING BUCKETS
Because the entire bucket is then in memory,
processing an insert or search operation requires only one disk access, unless the bucket is full.
If the bucket is full, then the overflow bucket must be retrieved from disk as well.
64
CLUSTERING Even with a good hash function, linear probing has its
problems:
The position of the initial mapping of key k is called the home position of k.
When several insertions map to the same home position, they end up placed contiguously in the table.
This collection of keys with the same home position is called a cluster.
65
CLUSTERS
A cluster is a group of items not containing any open slots
Clusters cause efficiency to degrade
66
CLUSTERING
As clusters grow, the probability increases that a key will map to the middle of a cluster,
increasing the rate of the cluster’s growth.
67
CLUSTERS
This tendency of linear probing to place items together is known as primary clustering.
As these clusters grow, they merge with other clusters forming even bigger clusters which grow even faster.
68
OTHER COLLISION TECHNIQUES
We have looked at
chaining with linked lists (Open Hashing),
Linear Probing (Closed Hashing), and
Bucket Hashing.
Let us look at some other collision techniques.
69
Other Closed Hashing techniques are:
Quadratic probing: a variant of linear probing where the term being added to the hash result is squared.
h(key) + c^2, for c = 1, 2, 3, ...
Random probing: the term being added to the hash result is a pseudo-random number, generated so that a later search for the same key repeats the same sequence.
h(key) + random()
70
REHASH FUNCTIONS
Rehashing is a technique where a sequence of hash functions is defined (h1, h2, ..., hk).
If a collision occurs, the functions are tried in this order.
71
hash’2(j) - second hash function
Use a second hash function (re-hashing).
Suppose hash(k) == hash(j) and k is stored first.
Adding j: calculate hash(j) and find k there first.
Calculate hash’2(j), where hash’2 is some other hash function.
Repeat until we find an empty slot.
Put j in it.
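One common way to use a second hash function is double hashing, where the second hash gives the step size between probes. The slides only say "some other hash function", so the choice below, h2(k) = 1 + (k mod (size - 1)) with a prime table size, is an assumption for illustration, as are the class and method names.

```java
class DoubleHashingTable {
    private final Integer[] slots;   // null marks an empty slot

    // A prime size ensures every step size visits all slots.
    DoubleHashingTable(int primeSize) { slots = new Integer[primeSize]; }

    int h1(int key) { return Math.floorMod(key, slots.length); }

    // Assumed second hash: a step in 1..size-1, never zero.
    int h2(int key) { return 1 + Math.floorMod(key, slots.length - 1); }

    // Returns the slot used for the key, or -1 if no empty slot was found.
    int insert(int key) {
        int i = h1(key);
        int step = h2(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null || slots[i] == key) {
                slots[i] = key;
                return i;
            }
            i = (i + step) % slots.length;   // jump by the key's own step
        }
        return -1;
    }

    public static void main(String[] args) {
        DoubleHashingTable t = new DoubleHashingTable(11);
        System.out.println(t.insert(14)); // 3: home slot is free
        System.out.println(t.insert(25)); // 9: home 3 is taken, step is 6
        System.out.println(t.insert(36)); // 10: home 3 is taken, step is 7
    }
}
```

The three keys share home slot 3 but follow different probe sequences, which is what separates this approach from linear probing.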
72
HASH TABLES - RE-HASH FUNCTIONS
The re-hash function has many variations.
Quadratic probing:
h’(x) adds the square of the probe number. Avoids primary clustering.
Secondary clustering occurs: all keys which collide on h(x) follow the same sequence.
First a = h(j), then a + c, a + 4c, a + 9c, ....
73
QUADRATIC PROBING
Some versions use:
p(K, i) = c1 i^2 + c2 i + c3 for some choice of constants c1, c2, and c3.
Secondary clustering generally less of a problem
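A quadratic probe sequence is easy to compute directly. The sketch below assumes the usual quadratic form p(i) = c1*i^2 + c2*i + c3 for the i-th probe offset; the class name, the method signature, and the sample constants are illustrative choices.

```java
class QuadraticProbe {
    // Slot visited on the i-th probe for a key whose home slot is `home`,
    // using offset c1*i^2 + c2*i + c3 and wrapping around the table.
    static int probe(int home, int i, int c1, int c2, int c3, int size) {
        int offset = c1 * i * i + c2 * i + c3;
        return Math.floorMod(home + offset, size);
    }

    public static void main(String[] args) {
        // With c1 = 1, c2 = 0, c3 = 0 the offsets are 0, 1, 4, 9, ...
        for (int i = 0; i < 4; i++) {
            System.out.print(probe(3, i, 1, 0, 0, 11) + " "); // 3 4 7 1
        }
        System.out.println();
    }
}
```

Two keys with the same home slot still follow the same sequence, which is the secondary clustering mentioned above.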
74
SEARCHING IN A HASH TABLE
We have already seen how searching works with chaining. With Closed Hashing, we use the following steps:
Given a target, hash the target.
Take the value of the hash of the target and go to that slot.
If the target is not in this slot, probe the following slots in sequence.
Stop when the target is found (success) or an empty slot is reached (the target is not in the table).
75
LOOK UP A KEY
public lookup(key) {
    int i;
    i = find_slot(key);           // method to find key in table
    if slot[i] is occupied        // key is in table
        return slot[i].value;     // return value in slot
    else                          // key is not in table
        return not found;
}
76
LINEAR PROBING AND SINGLE-SLOT STEP
public find_slot(key) {
    int i;
    i = hash(key);   // use a hash method to hash the key
    // search until we either find the key, or find an empty slot
    while ( (slot[i] is occupied) and (slot[i].key ≠ key) ) {
        i = (i + 1) mod table_size;   // wrap around at the end of the table
    }
    return i;
}
77
Deleting in a table – Closed Hashing
Suppose you want to look up cow in this hash table.
Also suppose:
hashCode(cow) = 144
table[144] is not empty
table[144] != cow
table[145] is not empty
table[145] != cow
table[146] is empty
If cow were in the table, we should have found it by now.
Therefore it is not there.
[Figure: table slots 141-148 holding robin, sparrow, hawk, seagull, bluejay, and owl; slots 144 and 145 are occupied and slot 146 is empty]
78
DELETING FROM A TABLE
Problem: When an empty slot is reached, we assume the
item we are searching for is not there.
Deletion leaves an empty slot.
When we next search for an item using linear probing,
we assume the item is not there when we reach the empty slot.
79
TOMBSTONES
A search assumes the item is not there when it reaches the empty slot,
when, in fact, the item could be AFTER the empty slot.
80
TOMBSTONES
Therefore, straight deletion of an item would not work.
Instead, the cell is marked (usually by use of a boolean variable) when an item is deleted.
The slot is often termed a “tombstone”.
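Deletion with tombstones under linear probing can be sketched as below. The class `TombstoneTable`, the parallel boolean array, and integer keys with home position key mod size are assumptions made to keep the example small.

```java
class TombstoneTable {
    private final Integer[] keys;      // null means no key stored here
    private final boolean[] deleted;   // true marks a tombstone

    TombstoneTable(int size) {
        keys = new Integer[size];
        deleted = new boolean[size];
    }

    private int home(int key) { return Math.floorMod(key, keys.length); }

    // Returns the slot holding key, or -1. Tombstones do not stop the search.
    int find(int key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null && !deleted[i]) return -1;   // truly empty slot
            if (keys[i] != null && keys[i] == key) return i;
            i = (i + 1) % keys.length;
        }
        return -1;
    }

    void insert(int key) {
        if (find(key) >= 0) return;              // already present
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {               // empty slot or tombstone: reuse it
                keys[i] = key;
                deleted[i] = false;
                return;
            }
            i = (i + 1) % keys.length;
        }
        throw new IllegalStateException("table is full");
    }

    // Mark the slot instead of emptying it, so later searches keep probing.
    void delete(int key) {
        int i = find(key);
        if (i >= 0) { keys[i] = null; deleted[i] = true; }
    }
}
```

With a table of size 10, inserting 13, 23, and 33 fills slots 3, 4, and 5; after deleting 23, a search for 33 passes over the tombstone in slot 4 and still finds 33 in slot 5.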
81
HASH TABLES - SUMMARY SO FAR ...
Potential O(1) search time
If a suitable function hash(key) → integer can be found
Space for speed trade-off
“Full” hash tables don’t work (more later!)
Collisions: inevitable
82
Various resolution strategies looked at so far:
Linked lists
Overflow areas
Re-hash functions
Linear probing: h’ is +1
Quadratic probing: h’ is + i^2
Any other hash function, or even a sequence of functions!
83
COMPARISON OF COLLISION TECHNIQUES
[Figure: expected number of probes versus load factor (n/size) for Linear Probing, Random Probing, and Chaining]
84
HASHING WITH CHAINING
What is the running time to insert/search/delete?
Insert: It takes O(1) time to compute the hash function and insert at head of linked list
Search: It is proportional to max linked list length
Delete: Same as search
85
EFFICIENCY OF CHAINING
Therefore, if we have a “bad” hash function, all n keys may hash to the same table index giving an O(n) run-time!
So how can we create a “good” hash
function?
86
HASH TABLES - CHOOSING THE HASH FUNCTION
Some functions are definitely better than others!
Key criterion: minimum number of collisions
Keeps chains short
Maintains O(1) on average
87
WRITING YOUR OWN HASHCODE METHOD
A hashCode method must:
Return a value that is a legal array index
Always return the same value for the same input
It can’t use random numbers, or the time of day
Return the same value for equal inputs
Must be consistent with your equals method
88
HASHCODE FUNCTION
It does not need to return different values for different inputs – some collisions are inevitable.
A good hashCode method should:
Be efficient to compute
Give a uniform distribution of array indices
so NO SPARSE ARRAYS!
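A hashCode method that satisfies these rules can be sketched for a small value class. `City` is a hypothetical class invented for this example. Note that in Java, hashCode() may return any int, so the sketch reduces it to a legal array index in a separate step.

```java
import java.util.Objects;

final class City {
    private final String name;
    private final String state;

    City(String name, String state) {
        this.name = name;
        this.state = state;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof City)) return false;
        City c = (City) other;
        return name.equals(c.name) && state.equals(c.state);
    }

    // Uses exactly the fields that equals compares, so equal objects
    // always produce equal hash codes. No randomness, no clock.
    @Override
    public int hashCode() {
        return Objects.hash(name, state);
    }

    // Reduce the 32-bit hash code to a legal index for a table of the given size.
    int slot(int tableSize) {
        return Math.floorMod(hashCode(), tableSize);
    }
}
```

Math.floorMod is used rather than % because hashCode() can be negative, and % would then give a negative index.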
89
OTHER CONSIDERATIONS
The hash table might fill up; we need to be prepared for that
Generally speaking, hash tables work best when the table size is a prime number
90
HASH TABLES IN JAVA
Java provides two classes, Hashtable and
HashMap, which implement the Map interface.
Both are maps: they associate keys with values.
Hashtable is synchronized; it can be accessed safely from multiple threads.
Hashtable uses an open hash (chaining) and has a rehash method to increase the size of the table.
91
HASHMAP
HashMap is newer, faster, and usually better,
but it is not synchronized
HashMap (by default) uses chained buckets (linked lists) and has a remove method.
92
HASH TABLE OPERATIONS
Both Hashtable and HashMap are in java.util
Both have no-argument constructors, as well as constructors that take an integer table size
Both have methods as listed in next slide
93
METHODS
// put the entry in the table; returns the previous value for this key, or null
public V put(K key, V value)
// returns the value for this key, or null
public V get(Object key)
public void clear() // clears the table
public Set<K> keySet() // returns the keys in the table in a Set
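A short usage example of these methods with java.util.HashMap follows. The class name `BirdMapDemo` and the bird entries are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class BirdMapDemo {
    // Build a small map from bird names to their info strings.
    static Map<String, String> build() {
        Map<String, String> birds = new HashMap<>();
        birds.put("hawk", "hawk info");
        birds.put("owl", "owl info");
        return birds;
    }

    public static void main(String[] args) {
        Map<String, String> birds = build();
        System.out.println(birds.get("hawk"));      // hawk info
        System.out.println(birds.get("cow"));       // null: key not present
        Set<String> keys = birds.keySet();          // the keys, not the values
        System.out.println(keys.contains("owl"));   // true
        birds.clear();
        System.out.println(birds.isEmpty());        // true
    }
}
```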
94
HASH TABLES - REDUCING THE RANGE TO [ 0, M )
We’ve mapped the keys to a range of integers
0 ≤ key < r,
where r is the total number of possible keys. For social security numbers, r = 1,000,000,000 (keys 0 to 999,999,999).
Now we must reduce this range to [ 0, m ), i.e. from 0 to m - 1,
where m is a reasonable size for the hash table.
95
HASH TABLES – HASH FUNCTIONS Some typical functions
Division : Use a mod function
hash(k) = abs( k mod m)
where m is table size
which yields a range between 0 and m-1
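The division method translates directly to Java. The class name `DivisionHash` is an assumption; the example also shows why the absolute value is needed, since Java's % operator can return a negative remainder.

```java
class DivisionHash {
    // hash(k) = abs(k mod m). The absolute value is taken after the
    // remainder, so the result always lies in 0..m-1.
    static int hash(int k, int m) {
        return Math.abs(k % m);
    }

    public static void main(String[] args) {
        System.out.println(hash(103, 10));   // 3
        System.out.println(-13 % 10);        // -3: not a legal index on its own
        System.out.println(hash(-13, 10));   // 3
    }
}
```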
96
Some typical functions
Choice of m?
Powers of 2 are generally not good!
h(k) = k mod 2^n uses only the n low-order bits of k.
Prime numbers close to 2^n are good choices.
97
CHOOSING A VIABLE VALUE FOR M
Prime numbers close to 2^n are good choices.
E.g. if you want a ~4000-entry table, choose m = 4093.
Other methods in your text.
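Finding a prime just below a power of two is a simple search. The class `TableSize` and its method names are hypothetical; the second part of main illustrates how a power-of-two size ignores the high bits of the key.

```java
class TableSize {
    // Trial division; fine for table-size-sized numbers.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Largest prime strictly below limit, or -1 if there is none.
    static int largestPrimeBelow(int limit) {
        for (int n = limit - 1; n >= 2; n--) {
            if (isPrime(n)) return n;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(largestPrimeBelow(1 << 12)); // 4093
        // With m = 16, keys that differ only in their high bits collide:
        System.out.println(21 % 16 + " " + 37 % 16 + " " + 53 % 16); // 5 5 5
    }
}
```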
98
PERFORMANCE ANALYSIS
If n slots in a table of size m are occupied, the load factor α is defined as:
α = n / m
n = number of items
m = number of slots
α = 1 means the table is full, and α = 0 means the table is empty.
It is generally good to keep α < 1, near 0.8.
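The load factor is a one-line computation. The sketch below also includes the standard textbook approximations for the expected number of probes under linear probing; these formulas are not taken from the slides and are added only to show how quickly cost grows as α approaches 1. All names are illustrative.

```java
class LoadFactor {
    // alpha = n / m, with n items stored in m slots.
    static double alpha(int n, int m) {
        return (double) n / m;
    }

    // Textbook approximation (an addition, not from the slides) for the
    // expected probes in a successful linear-probing search.
    static double linearProbingHit(double a) {
        return 0.5 * (1 + 1 / (1 - a));
    }

    // Same, for an unsuccessful search.
    static double linearProbingMiss(double a) {
        return 0.5 * (1 + 1 / ((1 - a) * (1 - a)));
    }

    public static void main(String[] args) {
        double a = alpha(5, 10);
        System.out.println(a);                    // 0.5
        System.out.println(linearProbingHit(a));  // 1.5
        System.out.println(linearProbingMiss(a)); // 2.5
    }
}
```

At α = 0.8 the same approximations give about 3 probes for a hit and about 13 for a miss.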
99
[Figure: Successful search. Average number of probes (2 to 20) versus load factor (0 to 1) for Linear probing, Separate Chaining, and Double Hashing]
100
[Figure: Unsuccessful search. Average number of probes (2 to 20) versus load factor (0 to 1) for Linear probing, Double hashing, and Separate chaining]
101
HASH TABLES - COLLISION RESOLUTION SUMMARY
Chaining
+ Unlimited number of elements
+ Unlimited number of collisions
- Overhead of multiple linked lists
Re-hashing
+ Fast re-hashing
+ Fast access through use of main table space
- Maximum number of elements must be known
- Multiple collisions become probable - CLUSTERING!
Overflow area
+ Fast access
+ Collisions don't use primary table space
102
TERMS TO KNOW
• Open Addressing (another name for Closed Hashing) looks for another open position in the table other than the one to which the element is originally hashed. It requires that the load factor be < 1.
• Open Addressing using Linear Probing seeks the next available position. This creates clusters; alternative methods include quadratic probing.
• Separate Chaining: if two keys map to the same address, separate chaining creates a linked list of keys that map to that address.
103
HASHCODE FUNCTION IN JAVA
Hash function - has two parts:
Map key k to an integer
There is a default hashCode() in Java: the method maps each object to an integer.
It returns a 32-bit integer, which may be derived from where the object is in memory.
This default would work poorly with Strings, as two strings could be in different locations in memory and contain the same data; the String class therefore overrides hashCode() to compute the value from the characters.
104
HASH TABLES - REVIEW
• If you can meet the constraints of a hash function that gives O(1) performance:
Hash Tables will generally give good performance
O(1) search
105
BUT:
not advisable for unknown data
If collection size is relatively static – few insertions and deletions - memory management is actually simpler –
106
UNIVERSAL OR PERFECT HASHING
“Dynamic perfect hashing" involves using a second hash table as the data structure to store multiple values within a particular bucket.
How do we find the next location with this
approach?
107
UNIVERSAL HASHING
What advantages does it have over linear probing?
What are possible problems with the approach?
Perfect hashing means that read access takes constant time even in the worst case.
108
UNIVERSAL OR PERFECT HASHING
For inserting, the time bounds are only true on average.
To make insertion fast enough, the second level hash table is very large for
the number of keys (on the order of k^2 slots for k keys), large enough so that collisions become unlikely.
109
SECOND LEVEL HASH TABLES
This is not a problem with table size because the first level hash distributes keys evenly
so that on average second level hash tables are still relatively small.
The hash functions for the second level tables are chosen at random from a set of parameterized hash functions.
110
UNIVERSAL HASHING
It is possible when you know exactly what set of keys you are going to be hashing when you design your hash function.
It's popular for hashing keywords for compilers
Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.
111
CHAINED BUCKET
Note: when using chaining,
each linked list attached to a slot is called a bucket - this is called “chained bucket hashing”
However, there is also “bucket hashing” done on the main table - just to make things real clear.