COMP2402 Hash Tables - cglab.cacglab.ca/~morin/teaching/2402/notes/hashtables.pdf · • Hash tables are really designed to store distinct integers • Java has lots of types that

COMP2402 Hash Tables

Pat Morin

Outline
• Hashing with chaining
• Multiplicative hashing
• Hash table implementations of
  – Set
  – Map
• Designing a good hashCode() method

Hash: (noun) Food, especially meat and potatoes, chopped and mixed together; A confused mess;


Hash tables
• Hashing is one of the most widely used techniques in computer science
  – Data structures
  – Error detection
  – Security
• Hash tables are one of the most useful data structures
  – store integers
    • or data that can be converted to integers (via hashCode())
  – allow for exact search only
    • data is either there or it's not (e.g., Set, Map)

Hashing with chaining
• Data is stored in an array of lists (table)
• Data value x is stored in the list
  – table[hash(x)]

public class HashTable<T> extends AbstractCollection<T> {
    List<T>[] table; // data goes in these
    int n;           // total number of elements
    ...
}

[Figure: the array table, indices 0 to 7, each slot holding a list of elements; labelled hash(x) = 1]

Hash tables and hashCode()
• Hash tables are really designed to store distinct integers
• Java has lots of types that are not integers
• Every class has a method
  – public int hashCode()
  that converts an object into an integer
• Every class must ensure that its equals() and hashCode() methods satisfy:
  – If x.equals(y) then x.hashCode() = y.hashCode()
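To make the equals()/hashCode() contract concrete, here is a minimal sketch of a class that satisfies it. The class GridPoint and its fields are hypothetical (they do not appear in these notes), and Objects.hash is used only for brevity, not because it is a particularly good combiner.

```java
import java.util.Objects;

/** Hypothetical example class: two points are equal when both coordinates match. */
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) o;
        return x == p.x && y == p.y;
    }

    // Built from exactly the fields that equals() compares,
    // so x.equals(y) implies x.hashCode() == y.hashCode().
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

With this in place, a HashSet that contains new GridPoint(1, 2) also reports that it contains a second, separately constructed new GridPoint(1, 2).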

The hashing process
Java object
  → hashCode() → {-2^31,...,2^31-1} (32 bits)
  → hash() → {0,...,table.length-1}

List size distribution
• For good performance we need a good hash function
• Universal hashing assumption: for some constant c,
  – if x.hashCode() ≠ y.hashCode() then Pr{hash(x) = hash(y)} < c/table.length
• The expected length of table[hash(x)] is
  – ≤ k + c(n-k)/table.length
  where k is the number of elements y such that x.hashCode() = y.hashCode()
• Pr{•} means “the probability that •”

List size distribution (cont'd)
• If all hashCode()s are unique then k = 1
  – The expected length of the list table[hash(x)] is at most 1 + c(n-1)/table.length
• If we keep table.length > n, then
  – The expected length of table[hash(x)] is at most 1 + c
• n/table.length is called the occupancy
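The bound on the expected list length follows from linearity of expectation. The derivation below is a sketch under the universal hashing assumption stated above; the sum ranges over the n stored elements y, and the k elements that share x's hashCode() are charged probability 1.

```latex
\begin{align*}
\mathrm{E}\bigl[\,\text{length of } \texttt{table[hash(x)]}\,\bigr]
  &= \sum_{y} \Pr\{\mathrm{hash}(y) = \mathrm{hash}(x)\} \\
  &\le \underbrace{k \cdot 1}_{\text{same hashCode()}}
     + \underbrace{(n-k)\cdot\frac{c}{\texttt{table.length}}}_{\text{different hashCode()}} \\
  &= k + \frac{c\,(n-k)}{\texttt{table.length}}
\end{align*}
```

With k = 1 and table.length > n this gives 1 + c(n-1)/table.length < 1 + c.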

Hash table insertion
• To add x to a hash table we store x in table[hash(x)]
  – Takes constant time (amortized, since grow() runs only occasionally)

public boolean add(T x) {
    if (n+1 > table.length) grow();
    table[hash(x)].add(x);
    n++;
    return true;
}

Hash table search
• To find elements equal to x, we search the list table[hash(x)]
  – Expected time is O(1 + occupancy)
• Remember: all elements equal to x have the same hashCode()

public T find(Object x) {
    for (T y : table[hash(x)])
        if (y.equals(x)) return y;
    return null;
}

Hash table search
• We can also find all items equal to x

public List<T> findAll(Object x) {
    List<T> l = new LinkedList<T>();
    for (T y : table[hash(x)])
        if (y.equals(x)) l.add(y);
    return l;
}

Hash table removal
• Remove all elements equal to x
  – takes O(1 + k) time, where k = # elements equal to x

public int removeAll(Object x) {
    int r = 0;
    Iterator<T> it = table[hash(x)].iterator();
    while (it.hasNext()) {
        T y = it.next();
        if (y.equals(x)) {
            it.remove();
            n--;
            r++;
        }
    }
    return r;
}

Hash table removal
• Or remove just one element

public T removeOne(Object x) {
    Iterator<T> it = table[hash(x)].iterator();
    while (it.hasNext()) {
        T y = it.next();
        if (y.equals(x)) {
            it.remove();
            n--;
            return y;
        }
    }
    return null;
}

Growing and shrinking
• Hash tables grow and shrink in the same manner as ArrayStacks and ArrayDeques
  – allocate new table
  – add elements into new table
• Cost is amortized over add/remove operations
  – constant amortized cost per operation
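The insertion, search and growing steps can be put together in one small class. This is a minimal sketch, not the course's HashTable: the class name ChainedTableSketch is hypothetical, it uses a list of lists instead of an array of lists to avoid generic-array casts, its hash simply reduces hashCode() modulo the table length as a placeholder, and it only grows (shrinking is omitted).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal chained hash table that doubles when it fills up. */
class ChainedTableSketch<T> {
    private List<List<T>> table = newTable(2);
    private int n = 0; // total number of elements

    private static <E> List<List<E>> newTable(int length) {
        List<List<E>> t = new ArrayList<>(length);
        for (int i = 0; i < length; i++) t.add(new ArrayList<>());
        return t;
    }

    // Placeholder hash (an assumption of this sketch): hashCode() mod table length.
    private static int hash(Object x, int length) {
        return Math.floorMod(x.hashCode(), length);
    }

    boolean add(T x) {
        if (n + 1 > table.size()) grow();
        table.get(hash(x, table.size())).add(x);
        n++;
        return true;
    }

    T find(Object x) {
        for (T y : table.get(hash(x, table.size())))
            if (y.equals(x)) return y;
        return null;
    }

    // Allocate a table twice as long and add every element into it.
    private void grow() {
        List<List<T>> bigger = newTable(2 * table.size());
        for (List<T> chain : table)
            for (T y : chain)
                bigger.get(hash(y, bigger.size())).add(y);
        table = bigger;
    }

    int size() { return n; }
    int tableLength() { return table.size(); }
}
```

Because the table doubles each time, the total work spent in grow() over any sequence of n insertions is O(n), which is where the constant amortized cost comes from.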

The hash(x) function
• There are many, many possible hash functions
• In multiplicative hashing we use
  – table.length = 2^d, a power of 2
  – hash(x) = ((x.hashCode() * z) mod 2^w) div 2^(w-d)
    where
    • w is the number of bits in an integer and
    • z is a randomly chosen odd integer in {1,...,2^w-1}
  – Equivalently (in Java w = 32):
    • hash(x) = (x.hashCode() * z) >>> (w-d)

protected final int hash(Object x) {
    return (x.hashCode() * z) >>> (w-d);
}

Example
• w=16, d=10

x                      = 0000000000010110
z                      = 1010110001101011
x*z                    = 00000000000011101101000100110010
x*z mod 2^16           = 1101000100110010
(x*z mod 2^16) div 2^6 = 1101000100

Equivalently, in 16-bit arithmetic: (x*z) >>> 6 = 1101000100
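The worked example can be checked mechanically. The sketch below uses a hypothetical helper class MultiplicativeHashDemo that takes w and d as parameters so that the 16-bit example can be reproduced; Java's own integers have w = 32, in which case the computation reduces to the one-line (x * z) >>> (w - d) form.

```java
/** Hypothetical helper: multiplicative hashing for any word size w <= 32. */
class MultiplicativeHashDemo {
    // Computes ((x * z) mod 2^w) div 2^(w-d).
    static int hash(int x, int z, int w, int d) {
        long mask = (1L << w) - 1;               // keeps the low w bits
        long product = ((long) x * z) & mask;    // (x * z) mod 2^w
        return (int) (product >>> (w - d));      // keep the top d of those w bits
    }

    public static void main(String[] args) {
        int h = hash(0b0000000000010110, 0b1010110001101011, 16, 10);
        System.out.println(Integer.toBinaryString(h)); // prints 1101000100
    }
}
```

Here x = 22 and z = 0xAC6B, so x*z = 0xED132; its low 16 bits are 0xD132, and shifting right by 6 leaves 836, which is 1101000100 in binary.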

Multiplicative Hashing Theorem
• Theorem 1: With the multiplicative hash function
  – if x.hashCode() ≠ y.hashCode() then Pr{hash(x) = hash(y)} ≤ 2/table.length
• Proof sketch:
  – If x.hashCode() ≠ y.hashCode(), then there are at most 2^(w-d) odd values of z ∈ {1,...,2^w-1} such that hash(x) = hash(y)
  – we have 2^w/2 choices for z
  – Pr{hash(x) = hash(y)} ≤ 2^(w-d) / (2^w/2) = 2/2^d
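For small word sizes the counting claim in the proof sketch can be checked by brute force. The following is an illustrative sketch only (the class UniversalityCheck and its method names are hypothetical): for every pair x ≠ y of w-bit values it counts the odd multipliers z under which the two collide and reports the worst case, which the theorem says is at most 2^(w-d).

```java
/** Hypothetical brute-force check of the collision count for small w. */
class UniversalityCheck {
    // Multiplicative hash on w-bit words; intended for small w (say w <= 12).
    static int hash(int x, int z, int w, int d) {
        return ((x * z) & ((1 << w) - 1)) >>> (w - d);
    }

    /** Largest number of odd z values under which some pair x != y collides. */
    static int worstCaseCollidingZ(int w, int d) {
        int size = 1 << w;
        int worst = 0;
        for (int x = 0; x < size; x++) {
            for (int y = x + 1; y < size; y++) {
                int count = 0;
                for (int z = 1; z < size; z += 2) {
                    if (hash(x, z, w, d) == hash(y, z, w, d)) count++;
                }
                worst = Math.max(worst, count);
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        int w = 8, d = 3;
        System.out.println(worstCaseCollidingZ(w, d) + " <= " + (1 << (w - d)));
    }
}
```

Dividing the worst-case count by the 2^(w-1) possible odd multipliers gives the collision probability bound 2/2^d.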

Multiplicative hash table summary
• Theorem 2: With a multiplicative hash table
  – find(x) takes O(1) expected time
  – add(x) takes O(1) expected amortized time
  – remove(x) takes O(1) expected amortized time
  provided that the table stores elements with distinct hashCode()s

Hash tables and the Set interface
• The Set interface is easily implemented as a hash table

public class MultiplicativeHashSet<T> extends AbstractSet<T> {
    MultiplicativeHashTable<T> tab;
    ...
}

public boolean add(T x) {
    if (tab.contains(x)) {
        return false;
    } else {
        return tab.add(x);
    }
}

public boolean remove(Object x) {
    return tab.remove(x);
}

public boolean contains(Object x) {
    return tab.find(x) != null;
}

Hash tables and the Map interface
• The Map methods are easily implemented using a hash table
• Use a hash table that stores key/value Pairs
  – two Pairs are equal if their keys are equal
  – the hashCode() of a Pair is the hashCode() of its key

Map Pairs

class Pair<V> {
    public Object key;
    public V value;
    ...
    public boolean equals(Object o) {
        return ((o instanceof Pair) && key.equals(((Pair)o).key));
    }
    public int hashCode() {
        return key.hashCode();
    }
}

Hash maps

public V put(K key, V value) {
    Pair<V> p = new Pair<V>(key, value);
    Pair<V> r = tab.removeOne(p);
    tab.add(p);
    return (r == null) ? null : r.value;
}

public V get(Object key) {
    Pair<V> p = new Pair<V>(key, null);
    Pair<V> r = tab.find(p);
    return (r == null) ? null : r.value;
}

public V remove(Object key) {
    Pair<V> p = new Pair<V>(key, null);
    Pair<V> r = tab.removeOne(p);
    return (r == null) ? null : r.value;
}

Hash Maps and Sets
• Using multiplicative hashing:
  – Theorem: A MultiplicativeHashSet supports
    • contains(x) in O(1) expected time
    • add(x) and remove(x) in O(1) expected amortized time
  – Theorem: A MultiplicativeHashMap supports
    • get(k) in O(1) expected time
    • put(k,v) and remove(k) in O(1) expected amortized time
• Both theorems hold under the assumption that all stored objects have distinct hashCode()s

The hashCode() method
• Default hashCode() and equals() use memory locations (object identity)
  – a.equals(b) if and only if a and b refer to the same memory location
  – a.hashCode() is an integer derived from the memory location of a (conceptually; the exact value is JVM-dependent)
• Therefore distinct objects (almost always) have distinct hashCode()s
• We run into problems when we override the equals() method

Designing a good hashCode()
• Recall:
  – x.equals(y) → x.hashCode() = y.hashCode()
• We would like:
  – x.hashCode() = y.hashCode() → x.equals(y)
• But we can't always have this
  – hashCode() returns a 32 (or 64) bit integer
    • only 2^32 (or 2^64) possible return values
  – Many objects can take on more values than this
    • e.g. there are 2^80 > 2^32 strings of 10 eight-bit (extended ASCII) characters
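A concrete instance of this pigeonhole effect is easy to find with Java's own String.hashCode(), which is specified as s[0]*31^(n-1) + ... + s[n-1]. The sketch below (the class name CollisionDemo is hypothetical) shows two different strings with the same hash code.

```java
/** Hypothetical demo: two distinct strings that share a hashCode(). */
class CollisionDemo {
    static boolean distinctButSameHash(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112
        System.out.println("Aa".hashCode()); // prints 2112
        System.out.println("BB".hashCode()); // prints 2112
        System.out.println(distinctButSameHash("Aa", "BB")); // prints true
    }
}
```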

Example of a bad hashCode()
• This code will be very slow to execute
  – The last for loop takes a loooong time - Why?

int n = 100000;
Map<Integer,Integer> m = new HashMap<Integer,Integer>();
for (int i = 1; i <= n; i++) {
    m.put(i,i);
}
Set<Map.Entry<Integer,Integer>> s = new HashSet<Map.Entry<Integer,Integer>>();
for (Map.Entry<Integer,Integer> e : m.entrySet()) {
    s.add(e);
}

Answer
• From the Map.Entry documentation:
  – e.hashCode() = e.getKey().hashCode() ^ e.getValue().hashCode()

  public int hashCode()
  Returns the hash code value for this map entry. The hash code of a map entry e is defined to be:
    (e.getKey()==null ? 0 : e.getKey().hashCode()) ^
    (e.getValue()==null ? 0 : e.getValue().hashCode())
  ...

• ^ is the bitwise exclusive-or (XOR) operation

Answer
• If the key and value are the same, then
  – e.getKey() = e.getValue() so
  – e.getKey().hashCode() = e.getValue().hashCode()
• The XOR of two equal values is always 0
  – a XOR a = 0
• So all 100,000 elements have the same hashCode()
  – The hash table degenerates into 1 linked list
  – contains(x) takes O(n) time!
  – Creating the Set takes O(n^2) time!

[Figure: the table with a single non-empty slot whose list holds all n entries (n,n), ..., (4,4), (3,3), (2,2), (1,1)]
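The degenerate case is easy to observe directly. This sketch uses java.util.AbstractMap.SimpleEntry, whose documented hashCode() is the same key-XOR-value formula as Map.Entry; the class EntryHashDemo and its method are hypothetical names introduced for illustration.

```java
import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical demo: entries (i, i) all share one hashCode(). */
class EntryHashDemo {
    // Counts how many distinct hash codes the entries (1,1), ..., (n,n) produce.
    static int distinctEntryHashCodes(int n) {
        Set<Integer> codes = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            Map.Entry<Integer, Integer> e = new AbstractMap.SimpleEntry<>(i, i);
            codes.add(e.hashCode()); // i ^ i == 0 for every i
        }
        return codes.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctEntryHashCodes(100000)); // prints 1
    }
}
```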

Some bad ideas
• There are lots of bad ways to combine hashCode()s
  – XOR: x.hashCode() ^ y.hashCode()
    • always gives 0 if x.hashCode() = y.hashCode() (in particular if x = y)
  – Commutative operators
    • addition, multiplication, bitwise AND/OR/XOR
    • x.hashCode() + y.hashCode()
      – gives the same value even if we swap x and y
      – e.g. (“Craig”, “James”) versus (“James”, “Craig”)
  – Lots of other bad examples
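The problem with commutative combiners shows up immediately on the name example. In this sketch (the class CombineDemo and both method names are hypothetical) the sum of the two hash codes cannot tell the two orderings apart, while an order-sensitive combination in the style of the powers-of-37 recipe can.

```java
/** Hypothetical demo: commutative versus order-sensitive combination. */
class CombineDemo {
    // Commutative: swapping a and b never changes the result.
    static int sumCombine(Object a, Object b) {
        return a.hashCode() + b.hashCode();
    }

    // Order-sensitive: the first field is weighted by 37.
    static int orderedCombine(Object a, Object b) {
        return 37 * a.hashCode() + b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(sumCombine("Craig", "James") == sumCombine("James", "Craig"));         // prints true
        System.out.println(orderedCombine("Craig", "James") == orderedCombine("James", "Craig")); // prints false
    }
}
```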

A good hashCode() recipe
• Look at all the fields that are compared in the equals() method
  – these, and only these, should be used
  – Recursively compute the hashCode() for each field to get 32-bit values
    • a1,a2,...,ak
  – Output
    • (a1*z1 + a2*z2 + a3*z3 + ... + a(k-1)*z(k-1) + ak*zk) mod 2^32
    • z1,...,zk are randomly chosen 32-bit integers

public int hashCode() {
    long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL}; // random
    long zz = 0xbea0107e5067d19dL;                       // random

    long h0 = x0.hashCode() & ((1L<<32)-1); // unsigned int to long
    long h1 = x1.hashCode() & ((1L<<32)-1);
    long h2 = x2.hashCode() & ((1L<<32)-1);

    return (int)(((z[0]*h0 + z[1]*h1 + z[2]*h2)*zz) >>> 32);
}

Good hashCode() theorem
• If (a1,...,ak) ≠ (b1,...,bk) then
  – Pr{a.hashCode() = b.hashCode()} ≤ 3/2^w
• In Java, this means that, using the previous recipe,
  – Pr{x.hashCode() = y.hashCode()} ≤ 3/2^32 = 3/4,294,967,296

Another recipe
• Random numbers z1,...,zk can be hard to get
  – If our object includes arrays then we might need a variable number of them
  – We could use the Java Random class
• In practice, powers of 37 are often used
  – a1*37^(k-1) + a2*37^(k-2) + a3*37^(k-3) + ... + a(k-1)*37 + ak

public static int hashIt(Object[] a) {
    int h = 0;
    for (int i = 0; i < a.length; i++)
        h = (37 * h) + a[i].hashCode();
    return h;
}

The Prime Field Method
• Pick a large prime number p and a random z in {0,...,p-1}.
• h(a1,...,ak) = (a1*z^0 + a2*z^1 + ... + ak*z^(k-1)) mod p
• Theorem: If (a1,...,ak) ≠ (b1,...,bk) then
  – Pr{a.hashCode() = b.hashCode()} ≤ k/p
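A minimal sketch of the prime field method follows. The choice p = 2^31 - 1 (a Mersenne prime) is an assumption made here so that every intermediate product fits in a long; the notes do not fix a particular p. Each a_i is first reduced into {0,...,p-1}, which means two 32-bit inputs that agree mod p are treated as the same value. The class name PrimeFieldHash is hypothetical.

```java
/** Hypothetical sketch of the prime field method with p = 2^31 - 1. */
class PrimeFieldHash {
    static final long P = 2147483647L; // 2^31 - 1, prime (an assumed choice of p)

    // Computes (a[0]*z^0 + a[1]*z^1 + ... + a[k-1]*z^(k-1)) mod P, for z in {0,...,P-1}.
    static int hash(int[] a, long z) {
        long h = 0;
        long zPow = 1; // z^i mod P
        for (int ai : a) {
            long term = Math.floorMod((long) ai, P) * zPow % P;
            h = (h + term) % P;
            zPow = zPow * z % P;
        }
        return (int) h;
    }

    public static void main(String[] args) {
        // 1*2^0 + 2*2^1 + 3*2^2 = 17
        System.out.println(hash(new int[] {1, 2, 3}, 2)); // prints 17
    }
}
```

In real use z would be drawn at random from {0,...,p-1}; the fixed z = 2 above is only there to make the arithmetic easy to follow.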

Hash tables summary
• Hash tables allow for implementations of Set and Map where basic operations take constant expected time
  – Requires a good hash function
    • Multiplicative hashing is efficient and provably good
  – Requires a good hashCode() method
    • For x ≠ y, Pr{x.hashCode() = y.hashCode()} < c/2^w
• We can find a lot of bad implementations of hash(x) and hashCode() online
  – even in things like the Java Collections Framework!

Hash tables: some perspectives
• Hash tables and hashCode() use random numbers
  – Should we pick these in advance, or at run-time?
• In advance:
  – Can get real random numbers
    • from random.org for example
  – Doesn't protect us from an adversarial user
• At run-time:
  – Harder to get real random numbers
    • usually settle for pseudorandom numbers (java.util.Random)
  – Can help protect against adversarial users
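Picking the multiplier at run-time takes one line. This is a sketch under the assumption that a 32-bit odd z is wanted for multiplicative hashing; the class RandomMultiplier and its method are hypothetical. Any java.util.Random works as the source, including java.security.SecureRandom, which is a subclass and is harder for an adversary to predict.

```java
import java.security.SecureRandom;
import java.util.Random;

/** Hypothetical helper: draw a random odd 32-bit multiplier at run-time. */
class RandomMultiplier {
    // Setting the lowest bit forces the result to be odd.
    static int randomOddInt(Random rng) {
        return rng.nextInt() | 1;
    }

    public static void main(String[] args) {
        int z = randomOddInt(new SecureRandom());
        System.out.println((z & 1) == 1); // prints true
    }
}
```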
