COMP2402 Hash Tables Pat Morin
Outline
• Hashing with chaining
• Multiplicative hashing
• Hash table implementations of
  – Set
  – Map
• Designing a good hashCode() method
Hash: (noun) Food, especially meat and potatoes, chopped and mixed together; A confused mess;
Hash tables
• Hashing is one of the most widely used techniques in computer science
  – Data structures
  – Error detection
  – Security
• Hash tables are one of the most useful data structures
  – store integers
    • or data that can be converted to integers (via hashCode())
  – allow for exact search only
    • data is either there or it's not (e.g., Set, Map)
Hashing with chaining
• Data is stored in an array of lists (table)
• Data value x is stored in the list
  – table[hash(x)]

public class HashTable<T> extends AbstractCollection<T> {
    List<T>[] table; // data goes in these
    int n;           // total number of elements
    ...
}
[Figure: the array table, with slots numbered from 0, each holding a list of elements; here hash(x) = 1, so x is stored in the list table[1].]
Hash tables and hashCode()
• Hash tables are really designed to store distinct integers
• Java has lots of types that are not integers
• Every class has a method
  – public int hashCode()
  that converts an object into an integer
• Every class must guarantee that its equals() and hashCode() methods satisfy:
  – If x.equals(y) then x.hashCode() = y.hashCode()
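As an illustration of this contract, here is a minimal sketch of a class that overrides both methods consistently. The class GridPoint and its fields are hypothetical and do not come from the slides; java.util.Objects.hash is just one convenient way to combine the fields.

```java
import java.util.Objects;

// Hypothetical example class (not from the slides): a point with value equality.
final class GridPoint {
    final int x, y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) o;
        return x == p.x && y == p.y;
    }

    // Built from exactly the fields compared in equals(), so two equal
    // points always return the same hash code.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

Two separately constructed GridPoint(3, 4) objects are equal and therefore report the same hashCode(), which is what a hash table needs in order to look for both in the same list.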
The hashing process
[Figure: Java object → hashCode() → {-2^31,...,2^31-1} (32 bits) → hash() → {0,...,table.length-1}]
List size distribution
• For good performance we need a good hash function
• Universal hashing assumption:
  – if x.hashCode() ≠ y.hashCode() then Pr{hash(x) = hash(y)} < c/table.length
• The expected length of table[hash(x)] is
  – ≤ k + c(n-k)/table.length
  where k is the number of elements y such that x.hashCode() = y.hashCode()
Pr{•} means “the probability that •”
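The bound on the expected list length follows from linearity of expectation. The derivation below is a sketch; the symbols S (the set of n stored elements) and m (the number of lists in the table) are introduced here for brevity and are not notation from the slides.

```latex
\begin{align*}
\mathrm{E}\bigl[\text{length of } \mathtt{table[hash(x)]}\bigr]
  &= \sum_{y \in S} \Pr\{\mathrm{hash}(y) = \mathrm{hash}(x)\} \\
  &\le \underbrace{k \cdot 1}_{\text{same hashCode()}}
     + \underbrace{(n-k)\cdot\frac{c}{m}}_{\text{different hashCode()}} \\
  &= k + \frac{c\,(n-k)}{m}
\end{align*}
```

The k elements with the same hashCode() as x always land in the same list, so each contributes probability 1; each of the other n-k elements contributes less than c/m by the universal hashing assumption.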
List size distribution (cont'd)
• If all hashCode()s are unique then k=1
  – The expected length of the list table[hash(x)] is at most 1 + c(n-1)/table.length
• If we keep table.length > n, then
  – The expected length of table[hash(x)] is at most 1 + c
n/table.length is called the occupancy
Hash table insertion
• To add x to a hash table we store x in table[hash(x)]
  – Takes constant time, apart from the occasional call to grow()

public boolean add(T x) {
    if (n+1 > table.length) grow();
    table[hash(x)].add(x);
    n++;
    return true;
}
Hash table search
• To find elements equal to x, we search the list table[hash(x)]
  – Expected time is O(1 + occupancy)
Remember: all elements equal to x have the same hashCode()

public T find(Object x) {
    for (T y : table[hash(x)])
        if (y.equals(x)) return y;
    return null;
}
Hash table search
• We can also find all items equal to x

public List<T> findAll(Object x) {
    List<T> l = new LinkedList<T>();
    for (T y : table[hash(x)])
        if (y.equals(x)) l.add(y);
    return l;
}
Hash table removal
• Remove all elements equal to x
  – takes O(1 + k) time
k = # elements equal to x

public int removeAll(Object x) {
    int r = 0;
    Iterator<T> it = table[hash(x)].iterator();
    while (it.hasNext()) {
        T y = it.next();
        if (y.equals(x)) {
            it.remove();
            n--;
            r++;
        }
    }
    return r;
}
Hash table removal
• Or remove just one element

public T removeOne(Object x) {
    Iterator<T> it = table[hash(x)].iterator();
    while (it.hasNext()) {
        T y = it.next();
        if (y.equals(x)) {
            it.remove();
            n--;
            return y;
        }
    }
    return null;
}
Growing and shrinking
• Hash tables grow and shrink in the same manner as ArrayStacks and ArrayDeques
  – allocate new table
  – add elements into new table
• Cost is amortized over add/remove operations
  – constant amortized cost per operation
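The pieces above (chaining, add, find, removeOne, and resizing) can be put together in one self-contained sketch. This is not the course's HashTable class: the class name, the use of an ArrayList of lists instead of an array, the placeholder hash based on Math.floorMod, and the shrink threshold (3n < table length) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of hashing with chaining plus grow/shrink (illustrative names).
final class ChainedTableSketch<T> {
    private List<List<T>> table = newTable(2); // the "array" of lists
    private int n = 0;                          // total number of elements

    private static <E> List<List<E>> newTable(int length) {
        List<List<E>> t = new ArrayList<>(length);
        for (int i = 0; i < length; i++) t.add(new ArrayList<>());
        return t;
    }

    // Placeholder hash (an assumption): maps hashCode() into {0,...,length-1}.
    private int hash(Object x, int length) {
        return Math.floorMod(x.hashCode(), length);
    }

    boolean add(T x) {
        if (n + 1 > table.size()) resize(2 * table.size()); // grow
        table.get(hash(x, table.size())).add(x);
        n++;
        return true;
    }

    T find(Object x) {
        for (T y : table.get(hash(x, table.size())))
            if (y.equals(x)) return y;
        return null;
    }

    T removeOne(Object x) {
        List<T> list = table.get(hash(x, table.size()));
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i).equals(x)) {
                T y = list.remove(i);
                n--;
                // Shrink when the table is mostly empty (threshold is an assumption).
                if (table.size() > 2 && 3 * n < table.size()) resize(table.size() / 2);
                return y;
            }
        }
        return null;
    }

    // Allocate a new table and add every element into it.
    private void resize(int newLength) {
        List<List<T>> old = table;
        table = newTable(newLength);
        for (List<T> list : old)
            for (T y : list)
                table.get(hash(y, newLength)).add(y);
    }

    int size() { return n; }
    int tableLength() { return table.size(); }
}
```

Each resize costs time proportional to the number of stored elements, but because the table length doubles or halves, that cost is spread over the many add/remove operations between resizes.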
The hash(x) function
• There are many, many possible hash functions
• In multiplicative hashing we use
  – table.length = 2^d, a power of 2
  – hash(x) = ((x.hashCode() * z) mod 2^w) div 2^{w-d}
  where
    • w is the number of bits in an integer and
    • z is a randomly chosen odd integer in {1,...,2^w-1}
  – Equivalently (in Java w = 32):
    • hash(x) = (x.hashCode() * z) >>> (w-d)

protected final int hash(Object x) {
    return (x.hashCode() * z) >>> (w-d);
}
Example
• w=16, d=10

x                      = 0000000000010110
z                      = 1010110001101011
x*z                    = 00000000000011101101000100110010
x*z mod 2^16           = 1101000100110010
(x*z mod 2^16) div 2^6 = 1101000100

• Equivalently, with 16-bit arithmetic: (x*z) >>> 6 = 1101000100
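The worked example can be reproduced in Java. The sketch below is illustrative: the class name and the 16-bit helper hash16 are assumptions added here, since Java's int is 32 bits and the w = 16 arithmetic has to be simulated with a mask.

```java
final class MultiplicativeHashDemo {
    // 32-bit version: keep the top d bits of hashCode * z (int overflow gives mod 2^32).
    // Assumes 1 <= d <= 31, because Java's >>> only uses the low 5 bits of the shift count.
    static int hash32(int hashCode, int z, int d) {
        return (hashCode * z) >>> (32 - d);
    }

    // 16-bit variant, used only to reproduce the worked example (w = 16).
    static int hash16(int x, int z, int d) {
        int product = (x * z) & 0xFFFF; // (x*z) mod 2^16
        return product >>> (16 - d);    // div 2^(16-d)
    }

    public static void main(String[] args) {
        int x = 0b0000000000010110;
        int z = 0b1010110001101011;
        System.out.println(Integer.toBinaryString(hash16(x, z, 10))); // prints 1101000100
    }
}
```

In the 32-bit version no explicit "mod 2^w" is needed, because int multiplication in Java already discards everything above the low 32 bits.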
Multiplicative Hashing Theorem
• Theorem 1: With the multiplicative hash function
  – if x.hashCode() ≠ y.hashCode() then Pr{hash(x) = hash(y)} ≤ 2/table.length
• Proof sketch:
  – If x ≠ y, then there are at most 2^{w-d} odd values of z ∈ {1,...,2^w-1} such that hash(x) = hash(y)
  – we have 2^w/2 choices for z
  – Pr{hash(x) = hash(y)} ≤ 2^{w-d} / (2^w/2) = 2/2^d
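For small w the counting step of the proof sketch can be checked by brute force. The sketch below is only an empirical sanity check, not a proof; the class and method names are illustrative, and it assumes w is small enough (say w ≤ 10) that the triple loop finishes quickly and the products fit in an int.

```java
final class MultiplicativeBoundCheck {
    // hash(x) = ((x*z) mod 2^w) div 2^(w-d), for small w so int arithmetic is exact.
    static int hash(int x, int z, int w, int d) {
        return ((x * z) & ((1 << w) - 1)) >>> (w - d);
    }

    // Over all pairs x != y in {0,...,2^w-1}, the largest number of odd z
    // in {1,...,2^w-1} for which hash(x) == hash(y).
    static int maxCollidingZ(int w, int d) {
        int max = 0;
        for (int x = 0; x < (1 << w); x++) {
            for (int y = x + 1; y < (1 << w); y++) {
                int count = 0;
                for (int z = 1; z < (1 << w); z += 2)
                    if (hash(x, z, w, d) == hash(y, z, w, d)) count++;
                max = Math.max(max, count);
            }
        }
        return max;
    }
}
```

For example, with w = 8 and d = 3 the theorem predicts that maxCollidingZ(8, 3) is at most 2^{8-3} = 32 out of the 128 odd choices of z.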
Multiplicative hash table summary
• Theorem 2: With a multiplicative hash table
  – find(x) takes O(1) expected time
  – add(x) takes O(1) expected amortized time
  – remove(x) takes O(1) expected amortized time
  provided that the table stores elements with distinct hashCode()s
Hash tables and the Set interface
• The Set interface is easily implemented as a hash table

public class MultiplicativeHashSet<T> extends AbstractSet<T> {
    MultiplicativeHashTable<T> tab;
    ...

    // A set stores each element at most once
    public boolean add(T x) {
        if (tab.contains(x)) {
            return false;
        } else {
            return tab.add(x);
        }
    }

    public boolean remove(Object x) {
        return tab.remove(x);
    }

    public boolean contains(Object x) {
        return tab.find(x) != null;
    }
}
Hash tables and the Map interface
• The Map methods are easily implemented using a hash table
• Use a hash table that stores key/value Pairs
  – two Pairs are equal if their keys are equal
  – the hashCode() of a Pair is the hashCode() of its key
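Map Pairs

class Pair<V> {
    public Object key;
    public V value;
    ...
    // Two Pairs are equal if their keys are equal
    public boolean equals(Object o) {
        return ((o instanceof Pair) && key.equals(((Pair<?>)o).key));
    }
    // The hashCode() of a Pair is the hashCode() of its key
    public int hashCode() {
        return key.hashCode();
    }
}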
Hash maps

public V put(K key, V value) {
    Pair<V> p = new Pair<V>(key, value);
    Pair<V> r = tab.removeOne(p); // remove any old pair with this key
    tab.add(p);
    return (r == null) ? null : r.value;
}

public V get(Object key) {
    Pair<V> p = new Pair<V>(key, null);
    Pair<V> r = tab.find(p);
    return (r == null) ? null : r.value;
}

public V remove(Object key) {
    Pair<V> p = new Pair<V>(key, null);
    Pair<V> r = tab.removeOne(p);
    return (r == null) ? null : r.value;
}
Hash Maps and Sets
• Using multiplicative hashing:
  – Theorem: A MultiplicativeHashSet supports
    • contains(x) in O(1) expected time
    • add(x) and remove(x) in O(1) expected amortized time
  – Theorem: A MultiplicativeHashMap supports
    • get(k) in O(1) expected time
    • put(k,v) and remove(k) in O(1) expected amortized time
• Both theorems hold under the assumption that all stored objects have distinct hashCode()s
The hashCode() method
• Default hashCode() and equals() are based on object identity (think of it as the object's memory location)
  – a.equals(b) if and only if a and b refer to the same object
  – a.hashCode() is typically derived from the (integer) memory location of a
• Therefore distinct objects typically have distinct hashCode()s
• We run into problems when we override the equals() method
Designing a good hashCode()
• Recall:
  – x.equals(y) → x.hashCode() = y.hashCode()
• We would like:
  – x.hashCode() = y.hashCode() → x.equals(y)
• But we can't always have this
  – hashCode() returns a 32 (or 64) bit integer
    • only 2^32 (or 2^64) possible return values
  – Many objects can take on more values than this
    • e.g. there are 2^80 > 2^32 strings of ten 8-bit characters
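A concrete instance of this pigeonhole argument: two different Java strings with the same hashCode(). The pair "Aa" and "BB" is a well-known example and is not taken from the slides; it follows from the documented String.hashCode() formula s[0]*31^(n-1) + ... + s[n-1].

```java
final class StringCollisionDemo {
    public static void main(String[] args) {
        // 'A' = 65, 'a' = 97, 'B' = 66
        System.out.println("Aa".hashCode());   // 65*31 + 97 = 2112
        System.out.println("BB".hashCode());   // 66*31 + 66 = 2112
        System.out.println("Aa".equals("BB")); // false
    }
}
```

So equal hashCode()s do not imply equal objects, which is why find() still has to call equals() on every element of the list.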
Example of a bad hashCode()
• This code will be very slow to execute
  – The last for loop takes a very long time. Why?

int n = 100000;
Map<Integer,Integer> m = new HashMap<Integer,Integer>();
for (int i = 1; i <= n; i++) {
    m.put(i,i);
}
Set<Map.Entry<Integer,Integer>> s = new HashSet<Map.Entry<Integer,Integer>>();
for (Map.Entry<Integer,Integer> e : m.entrySet()) {
    s.add(e);
}
Answer
• From the Map.Entry documentation:
  – e.hashCode() = e.getKey().hashCode() ^ e.getValue().hashCode()

public int hashCode()
Returns the hash code value for this map entry. The hash code of a map entry e is defined to be:
    (e.getKey()==null ? 0 : e.getKey().hashCode()) ^ (e.getValue()==null ? 0 : e.getValue().hashCode())
...

^ is the bitwise exclusive-or (XOR) operation
Answer
• If the key and value are the same, then
  – e.getKey() = e.getValue() so
  – e.getKey().hashCode() = e.getValue().hashCode()
• The XOR of two equal values is always 0
  – a XOR a = 0
• So all 100,000 elements have the same hashCode()
  – The hash table degenerates into 1 linked list
  – contains(x) takes O(n) time!
  – Creating the Set takes O(n^2) time!
[Figure: all entries (n,n), ..., (4,4), (3,3), (2,2), (1,1) are chained in a single list of the table.]
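The collapse to a single hash code is easy to observe directly. The sketch below uses java.util.AbstractMap.SimpleEntry, whose hashCode() follows the same XOR definition from the Map.Entry documentation; the class and method names are illustrative.

```java
import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class EntryHashDemo {
    // Number of distinct hashCode() values among the entries (i, i) for i = 1..n.
    static int distinctEntryHashCodes(int n) {
        Set<Integer> codes = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            Map.Entry<Integer, Integer> e = new AbstractMap.SimpleEntry<>(i, i);
            codes.add(e.hashCode()); // i ^ i == 0 for every i
        }
        return codes.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctEntryHashCodes(100000)); // prints 1
    }
}
```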
Some bad ideas
• There are lots of bad ways to combine hashCode()s
  – XOR: x.hashCode() ^ y.hashCode()
    • always gives 0 if x = y
  – Commutative operators
    • addition, multiplication, bitwise operators
    • x.hashCode() + y.hashCode()
      – gives same value even if we swap x and y
      – e.g. (“Craig”, “James”) versus (“James”, “Craig”)
  – Lots of other bad examples
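The swap problem with commutative operators can be shown with the name pair from the slide. In this sketch the method names are illustrative, and the order-sensitive alternative (multiplying the first hash code by 37) is just one simple choice made for contrast.

```java
final class CombineDemo {
    // Bad: addition is commutative, so swapping the two fields changes nothing.
    static int sumHash(String first, String last) {
        return first.hashCode() + last.hashCode();
    }

    // Order-sensitive alternative (the multiplier 37 is an illustrative choice).
    static int orderedHash(String first, String last) {
        return 37 * first.hashCode() + last.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(sumHash("Craig", "James") == sumHash("James", "Craig"));         // true
        System.out.println(orderedHash("Craig", "James") == orderedHash("James", "Craig")); // false
    }
}
```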
A good hashCode() recipe
• Look at all the fields that are compared in the equals() method
  – these, and only these, should be used
  – Recursively compute the hashCode() for each field to get 32-bit values
    • a_1, a_2, ..., a_k
  – Output
    • (a_1 z_1 + a_2 z_2 + a_3 z_3 + ... + a_{k-1} z_{k-1} + a_k z_k) mod 2^32
    • z_1, ..., z_k are randomly chosen 32-bit integers

public int hashCode() {
    long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL}; // random
    long zz = 0xbea0107e5067d19dL;                      // random

    long h0 = x0.hashCode() & ((1L<<32)-1); // unsigned int to long
    long h1 = x1.hashCode() & ((1L<<32)-1);
    long h2 = x2.hashCode() & ((1L<<32)-1);

    return (int)(((z[0]*h0 + z[1]*h1 + z[2]*h2)*zz) >>> 32);
}
Good hashCode() theorem
• If (a_1,...,a_k) ≠ (b_1,...,b_k) then
  – Pr{a.hashCode() = b.hashCode()} ≤ 3/2^w
• In Java, this means that, using the previous recipe,
  – Pr{x.hashCode() = y.hashCode()} ≤ 3/2^32 = 3/4,294,967,296
Another recipe
• Random numbers z_1,...,z_k can be hard to get
  – If our object includes arrays then we might need a variable number of them
  – We could use the Java Random class
• In practice, powers of 37 are often used
  – a_1·37^{k-1} + a_2·37^{k-2} + a_3·37^{k-3} + ... + a_{k-1}·37 + a_k

public static int hashIt(Object[] a) {
    int h = 0;
    for (int i = 0; i < a.length; i++)
        h = (37 * h) + a[i].hashCode();
    return h;
}
The Prime Field Method
• Pick a large prime number p and a random z in {0,...,p-1}.
• h(a_1,...,a_k) = (a_1 z^0 + a_2 z^1 + ... + a_k z^{k-1}) mod p
• Theorem: If (a_1,...,a_k) ≠ (b_1,...,b_k) then
  – Pr{a.hashCode() = b.hashCode()} ≤ k/p
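A minimal sketch of the prime field method, evaluated with Horner's rule. The choice p = 2^31 - 1 (a prime), the class name, and the restriction that every input value lies in {0,...,p-1} are assumptions made here so that the arithmetic fits in a long; the slides do not fix a particular p.

```java
final class PrimeFieldHash {
    static final long P = 2147483647L; // 2^31 - 1, a prime (assumed choice of p)

    // h(a_1,...,a_k) = (a_1 z^0 + a_2 z^1 + ... + a_k z^(k-1)) mod P.
    // Assumes 0 <= z < P and every a[i] is in {0,...,P-1}.
    static long hash(long[] a, long z) {
        long h = 0;
        // Horner's rule, starting from the highest power of z.
        for (int i = a.length - 1; i >= 0; i--)
            h = (h * z + a[i]) % P; // h, z, a[i] < 2^31, so h*z + a[i] fits in a long
        return h;
    }

    public static void main(String[] args) {
        long z = 3; // fixed here for a reproducible example; normally chosen at random
        System.out.println(hash(new long[] {1, 2, 3}, z)); // 1 + 2*3 + 3*9 = 34
        System.out.println(hash(new long[] {3, 2, 1}, z)); // 3 + 2*3 + 1*9 = 18
    }
}
```

Unlike a commutative combination, reordering the inputs changes the result, because each position is weighted by a different power of z.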
Hash tables summary
• Hash tables allow for implementations of Set and Map where basic operations take constant expected time
  – Requires a good hash function
    • Multiplicative hashing is efficient and provably good
  – Requires a good hashCode() method
    • For x ≠ y, Pr{x.hashCode() = y.hashCode()} < c/2^w
• We can find a lot of bad implementations of hash(x) and hashCode() online
  – even in things like the Java Collections Framework!
Hash tables: some perspectives
• Hash tables and hashCode() use random numbers
  – Should we pick these in advance, or at run-time?
• In advance:
  – Can get real random numbers
    • from random.org for example
  – Doesn't protect us from an adversarial user
• At run-time:
  – Harder to get real random numbers
    • usually settle for pseudorandom numbers (java.util.Random)
  – Can help protect against adversarial users
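Choosing the multiplier at run-time might look like the sketch below. The class and method names are illustrative; using java.security.SecureRandom as the source is an assumption added here (the slides mention only java.util.Random), on the reasoning that an adversary has a harder time predicting its output.

```java
import java.security.SecureRandom;
import java.util.Random;

final class RandomMultiplier {
    // A random odd 32-bit multiplier: setting the lowest bit makes any int odd.
    static int randomOddZ(Random rng) {
        return rng.nextInt() | 1;
    }

    public static void main(String[] args) {
        int perRun = randomOddZ(new SecureRandom()); // chosen at run-time, differs between runs
        int seeded = randomOddZ(new Random(42));     // seeded, so the same on every run
        System.out.println(Integer.toHexString(perRun));
        System.out.println(Integer.toHexString(seeded));
    }
}
```

A seeded generator behaves like a value picked in advance: convenient for reproducible tests, but it gives no protection against a user who knows the seed.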