Maps and hashing

Dr. Alagoz

Maps
A map models a searchable collection of key-value entries.
Typically a key is a string with an associated value (e.g. salary info).
The main operations of a map are searching, inserting, and deleting items.
Multiple entries with the same key are not allowed.
Applications:
  address book
  student-record database
  salary information, etc.


The Map ADT
Map ADT methods:
  get(k): if the map M has an entry with key k, return its associated value; else, return null
  put(k, v): insert entry (k, v) into the map M; if key k is not already in M, return null; else, replace the old value with v and return the old value associated with k
  remove(k): if the map M has an entry with key k, remove it from M and return its associated value; else, return null
  size(), isEmpty()
  keys(): return an iterator of the keys in M
  values(): return an iterator of the values in M


Example
Operation   Output   Map                         Note
isEmpty()   true     Ø
put(5,A)    null     (5,A)                       returns null because 5 is not yet in M
put(7,B)    null     (5,A),(7,B)
put(2,C)    null     (5,A),(7,B),(2,C)
put(8,D)    null     (5,A),(7,B),(2,C),(8,D)
put(2,E)    C        (5,A),(7,B),(2,E),(8,D)     replaces C and returns the old value
get(7)      B        (5,A),(7,B),(2,E),(8,D)
get(4)      null     (5,A),(7,B),(2,E),(8,D)     because 4 is not in M
get(2)      E        (5,A),(7,B),(2,E),(8,D)
size()      4        (5,A),(7,B),(2,E),(8,D)
remove(5)   A        (7,B),(2,E),(8,D)           removes 5 and returns A
remove(2)   E        (7,B),(8,D)
get(2)      null     (7,B),(8,D)                 because 2 is no longer in M
isEmpty()   false    (7,B),(8,D)                 because M still has 2 entries


Comparison to java.util.Map
Map ADT methods   java.util.Map methods
size()            size()
isEmpty()         isEmpty()
get(k)            get(k)
put(k,v)          put(k,v)
remove(k)         remove(k)
All the same except:
keys()            keySet().iterator()
values()          values().iterator()
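As a quick check of the correspondence above, the operation trace from the earlier example can be replayed against java.util.HashMap. This is only a sketch: the class name MapTraceDemo and the choice of Integer keys with String values are assumptions made for illustration, not part of the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapTraceDemo {
    /** Replays the example trace and collects each operation's output as a string. */
    static List<String> replay() {
        Map<Integer, String> m = new HashMap<>();
        List<String> out = new ArrayList<>();
        out.add(String.valueOf(m.isEmpty()));    // map starts empty
        out.add(String.valueOf(m.put(5, "A")));  // new key: put returns null
        out.add(String.valueOf(m.put(7, "B")));
        out.add(String.valueOf(m.put(2, "C")));
        out.add(String.valueOf(m.put(8, "D")));
        out.add(String.valueOf(m.put(2, "E")));  // existing key: returns the old value
        out.add(String.valueOf(m.get(7)));
        out.add(String.valueOf(m.get(4)));       // absent key: null
        out.add(String.valueOf(m.get(2)));
        out.add(String.valueOf(m.size()));
        out.add(String.valueOf(m.remove(5)));    // returns the removed value
        out.add(String.valueOf(m.remove(2)));
        out.add(String.valueOf(m.get(2)));       // removed earlier, so null
        out.add(String.valueOf(m.isEmpty()));
        return out;
    }
}
```

The collected outputs should match the Output column of the example table, row by row.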


Performance of a List-Based Map

Performance:
  put takes O(1) time, since we can insert the new item at the beginning or at the end of the sequence
  get and remove take O(n) time, since in the worst case (the item is not found) we traverse the entire sequence to look for an item with the given key
The unsorted list implementation is effective only for maps of small size, or for maps in which puts are the most common operations while searches and removals are rarely performed (e.g., a historical record of logins to a workstation).


Hashing


Content
Idea: for the operations we would otherwise perform on binary search trees (insert, delete, find), we can use the hash table ADT.
Hashing: the implementation of hash tables for insert and find operations is called hashing.
Collision: when two keys hash to the same value.
Resolving techniques for collisions, using linked lists.
Rehashing: when a hash table gets full, operations take longer. Then we need to build a new, double-sized hash table. (HWLA: why double-sized?)
Extendible hashing: for data that is too large to fit in main memory.


Hashing as a Data Structure

Performs these operations in O(1) average time:
  Insert
  Delete
  Find
Is not suitable for:
  FindMin
  FindMax
  Sort or output as sorted


General Idea
Array of fixed size (TableSize).
Search is performed on some part of the item (the key).
Each key is mapped into some number between 0 and TableSize - 1.
This mapping is called a hash function.
Ideally, two distinct keys get different cells.
Problem: since there are a finite number of cells and a virtually inexhaustible supply of keys, we need a hash function that distributes the keys evenly among the cells.


Hash Functions and Hash Tables

A hash function h maps keys of a given type to integers in a fixed interval [0, N - 1].
Example:
  h(x) = x mod N
is a hash function for integer keys.
The integer h(x) is called the hash value of key x.
A hash table for a given key type consists of:
  a hash function h
  an array (called the table) of size N
When implementing a map with a hash table, the goal is to store item (k, o) at index i = h(k).


Example
We design a hash table for a map storing entries as (SSN, Name), where SSN (social security number) is a nine-digit positive integer.
Our hash table uses an array of size N = 10,000 and the hash function
  h(x) = last four digits of x

Cell   Entry
0
1      025-612-0001
2      981-101-0002
3
4      451-229-0004
…
9997
9998   200-751-9998
9999


Hash Functions

Easy to compute
Key is of type Integer:
  a reasonable strategy is to return Key mod TableSize, which lies between 0 and TableSize - 1
Key is of type String (mostly used in practice):
  the hash function needs to be chosen carefully!
  E.g. adding up the ASCII values of the characters in the string
Proper selection of the hash function is required.


Hash Function (Integer)
Simply return Key % TableSize.
Choose TableSize carefully: what happens if TableSize is 10 and all keys end in zero? (Every key then hashes to cell 0.)
To avoid such pitfalls, choose TableSize to be a prime number.
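The pitfall can be made concrete with a small experiment. The sketch below counts how many distinct cells a set of keys reaches under Key % TableSize; the class and method names are hypothetical, and the keys (multiples of 10) are chosen only to illustrate the slide's point.

```java
import java.util.HashSet;
import java.util.Set;

final class TableSizeDemo {
    /** Number of distinct cells reached by key % tableSize over the given keys. */
    static int distinctCells(int[] keys, int tableSize) {
        Set<Integer> cells = new HashSet<>();
        for (int key : keys) {
            cells.add(Math.floorMod(key, tableSize)); // floorMod keeps the cell non-negative
        }
        return cells.size();
    }
}
```

With the keys 10, 20, ..., 100, a table of size 10 uses a single cell, while the prime size 11 spreads the same keys over ten different cells.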


Hash Function I (String)
Adds up the ASCII values of the characters in the string.
Advantage: simple to implement and quick to compute.
Disadvantage: if TableSize is large (see Fig. 5.2 in your book), the function does not distribute the keys well.
  Example: keys are at most 8 characters long. With 8-bit characters the maximum sum is 8 * 255 = 2040, but TableSize is 10007, so only about 20 percent of the table can ever be filled.
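A minimal sketch of this additive hash, written in the style of the later hash(String, int) routine; the class name AdditiveHash is an assumption made for illustration.

```java
final class AdditiveHash {
    /** Sums the character codes of key and reduces the sum modulo tableSize. */
    static int hash(String key, int tableSize) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % tableSize;
    }
}
```

Because the sum ignores character order, anagrams such as "stop" and "pots" always collide, and short keys can only reach the low end of a large table.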


Hash Function II (String)
Assumption: the key has at least 3 characters.
Hash function (26 characters for the alphabet + blank = 27):
  key[0] + 27 * key[1] + 27^2 * key[2]
Advantage: distributes better than Hash Function I, and is easy to compute.
Disadvantage:
  There are 26^3 = 17,576 possible combinations of 3 characters.
  However, a dictionary check shows that English has only 2,851 different combinations. HWLA: Explain why. Read p. 157 for the answer.
  Similar to Hash Function I, it is not appropriate if the hash table is reasonably large.


Hash Function III (String)
Idea: compute a polynomial function of the key's characters
  P(Key with n+1 characters) = Key[0] + 37 Key[1] + 37^2 Key[2] + ... + 37^n Key[n]
If each power 37^i is computed separately and the terms are then summed, the complexity is O(n^2).
Using Horner's rule, the complexity drops to O(n):
  ((Key[n]*37 + Key[n-1])*37 + ... + Key[1])*37 + Key[0]
Very simple and reasonably fast method, but the computation becomes slow if the keys are very long.
The usual advice is to avoid using all of the characters of such keys. E.g. the keys could be complete street addresses; the hash function might then use a couple of characters from the street address, and maybe a couple of characters from the city name or zip code.
Think of other options.
Quiz: lately 31 has been proposed instead of 37. Why not 19?


Hash Function III (String)

    public static int hash( String key, int tableSize )
    {
        int hashVal = 0;

        for( int i = 0; i < key.length( ); i++ )
            hashVal = 37 * hashVal + key.charAt( i );

        hashVal %= tableSize;
        if( hashVal < 0 )          // integer overflow can make hashVal negative
            hashVal += tableSize;

        return hashVal;
    }


Collision
Collisions occur when different elements are mapped to the same cell.

Cell   Entries
0
1      025-612-0001
2
3
4      451-229-0004, 981-101-0004


Collision
When an element is inserted and it hashes to the same value as an already inserted element, we have a collision.
Example: with the hash function (Key % 10), 564 and 824 collide at cell 4.


Solving Collision
Separate Chaining: keep a list of all elements hashing to the same value, and traverse the list to find the element being searched for. Hint: the table should be large, and its size should be a prime number, to ensure a good distribution. (Limited use due to the extra space taken by the lists, and it needs linked lists.)
Open Addressing: at a collision, search for alternative cells until an empty cell is found.
  Linear Probing
  Quadratic Probing
  Double Hashing


Solving Collision
Define the load factor: Lambda = number of elements / TableSize
HWLA: Search the Internet for other techniques (is there any algorithm that does not use a linked list?)
  A) Binary search tree
  B) Using another hash table
  Explain why we do not use A and B.
  Solution: if the table size is large and a proper hash function is used, then the lists should be short. Therefore, it is not worth trying anything more complicated.


Separate Chaining

Separate chaining: let each cell in the table point to a linked list of the entries that map there.
Keep a list of all elements that hash to the same value.
Each element of the hash table is a linked list.
Separate chaining is simple, but requires additional memory outside the table.

Insert keys [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] into the hash table using h(x) = x^2 % TableSize
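The insertion exercise above can be traced with a short sketch. It assumes TableSize = 10 (the slide does not state the size) and inserts each key at the head of its list; the class name ChainingDemo and the use of java.util.LinkedList are illustrative choices, not the course's own list class.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainingDemo {
    /** Builds the chains for h(x) = x^2 % tableSize, inserting keys in the given order. */
    static List<LinkedList<Integer>> build(int[] keys, int tableSize) {
        List<LinkedList<Integer>> table = new ArrayList<>();
        for (int i = 0; i < tableSize; i++) {
            table.add(new LinkedList<>());
        }
        for (int x : keys) {
            int cell = (x * x) % tableSize;
            table.get(cell).addFirst(x); // new elements go to the head of the list
        }
        return table;
    }
}
```

Under these assumptions cells 1, 4, 6 and 9 each hold two keys, cells 0 and 5 hold one key, and cells 2, 3, 7 and 8 stay empty.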


Load Factor (Lambda)
Lambda: the number of elements in the hash table divided by the TableSize (e.g. Lambda = 1 for the table in the previous example).
The time to perform a search is the constant time required to evaluate the hash function plus the time to traverse the list.
  Unsuccessful search: Lambda nodes are examined, on average.
  Successful search: about 1 + Lambda/2 links are traversed.
  Note: with N elements stored in a table of M lists, the expected number of other nodes in the searched list is (N-1)/M = N/M - 1/M = Lambda - 1/M, which is about Lambda for large M. I.e., the TableSize itself is not important, but the load factor is.
So, in separate chaining, Lambda should be kept near 1, i.e. make the hash table about as large as the number of elements expected (to limit collisions).
Remember also that the TableSize should be prime to ensure a good distribution.


Separate Chaining

    /**
     * Construct the hash table.
     */
    public SeparateChainingHashTable( )
    {
        this( DEFAULT_TABLE_SIZE );
    }

    /**
     * Construct the hash table.
     * @param size approximate table size.
     */
    public SeparateChainingHashTable( int size )
    {
        theLists = new LinkedList[ nextPrime( size ) ];
        for( int i = 0; i < theLists.length; i++ )
            theLists[ i ] = new LinkedList( );
    }


Separate Chaining: Find
Use the hash function to determine which list to traverse.
Traverse the list to find the element.

    public Hashable find( Hashable x )
    {
        return (Hashable)theLists[ x.hash( theLists.length ) ].find( x ).retrieve( );
    }


Separate Chaining: Insert
Use the hash function to determine in which list to insert.
Insert the element at the head of the list (if it is not already present).

    public void insert( Hashable x )
    {
        LinkedList whichList = theLists[ x.hash( theLists.length ) ];
        LinkedListItr itr = whichList.find( x );

        if( itr.isPastEnd( ) )     // x is not in the list yet
            whichList.insert( x, whichList.zeroth( ) );
    }


Separate Chaining: Delete
Use the hash function to determine from which list to delete.
Search for the element in the list and delete it.

    public void remove( Hashable x )
    {
        theLists[ x.hash( theLists.length ) ].remove( x );
    }


Separate Chaining
Advantages:
  Solves the collision problem totally
  Elements can be inserted anywhere
Disadvantages:
  Needs linked lists
  All lists must be short to get O(1) time complexity; otherwise operations take too long


Separate chaining needs extra space!
Alternatives to using linked lists:
  Binary trees
  Hash tables
However, if the TableSize is large and a good hash function is used, all the lists are expected to be short already, i.e., there is no need to complicate things.
Instead of the above alternative techniques, we use open addressing.


Open Addressing
Resolves collisions without using any other data structure such as a linked list (needing linked lists is a major problem, especially in other languages).
Idea: if a collision occurs, alternative cells are tried until an empty cell is found. Cells h0(x), h1(x), ... are tried in succession, where
  hi(x) = (hash(x) + f(i)) % TableSize
The function f is the collision resolution strategy, with f(0) = 0.
Since all data go inside the table, open addressing requires a bigger table than separate chaining.
Lambda should be less than 0.5 (it was about 1 for separate chaining).


Open Addressing
Depending on the collision resolution strategy f, we have:
  Linear Probing: f(i) = i
  Quadratic Probing: f(i) = i^2
  Double Hashing: f(i) = i * hash2(x)


Linear Probing

f(i) = i: cells are tried sequentially (with wraparound) in search of an empty cell.
Advantages: easy to compute
Disadvantages:
  The table must be big enough to get a free cell
  The time to get a free cell may be quite large
  Primary clustering: any key that hashes into the cluster will require several attempts to resolve the collision


Example: Linear Probing
Open addressing: the colliding item is placed in a different cell of the table.
Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell.
Each table cell inspected is referred to as a "probe".
Colliding items lump together, so that future collisions cause longer sequences of probes.
Example:
  h(x) = x mod 13
  Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

Cell:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:           41          18  44  59  32  22  31  73
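The example can be reproduced with a few lines of Java. This is a sketch only: the class name LinearProbingDemo is hypothetical, keys are stored directly in an Integer array, and null marks an empty cell.

```java
final class LinearProbingDemo {
    /** Inserts keys with h(x) = x mod n and linear probing; null marks an empty cell. */
    static Integer[] insertAll(int[] keys, int n) {
        Integer[] table = new Integer[n];
        for (int key : keys) {
            int i = Math.floorMod(key, n);
            int probes = 0;
            while (table[i] != null) {           // occupied: try the next cell, circularly
                if (++probes == n) {
                    throw new IllegalStateException("table is full");
                }
                i = (i + 1) % n;
            }
            table[i] = key;
        }
        return table;
    }
}
```

For the keys 18, 41, 22, 44, 59, 32, 31, 73 and n = 13 this yields the table shown above; note how 31 has to probe cells 5 through 9 before landing in cell 10.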


Example in the book: Linear Probing

Insert keys [89, 18, 49, 58, 69] into a hash table using hi(x) = (hash(x) + i) % TableSize


The first collision occurs when 49 is inserted (it is put in the next available cell, i.e. cell 0).
58 collides with 18, 89, and then 49 before an empty cell is found three away.
The collision for 69 is handled in the same way.
Note: insertions and unsuccessful searches require the same number of probes.
Primary clustering:
  If the table is big enough, a free cell will always be found (even if it takes a long time).
  Even if the table is relatively empty (low Lambda), blocks of occupied cells start forming. A key that hashes into such a block may require several attempts to resolve the collision, and then it adds to the cluster.


Quadratic Probing
Eliminates the primary clustering problem.
Theorem: if quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.
Secondary clustering: elements that hash to the same position will probe the same alternative cells.


Quadratic Probing
Insert keys [89, 18, 49, 58, 69] into a hash table using hi(x) = (hash(x) + i^2) % TableSize
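A sketch of this insertion sequence follows. It assumes hash(x) = x mod TableSize with TableSize = 10, which the slide does not state explicitly; the class name QuadraticProbingDemo is hypothetical. The loop evaluates hi(x) directly for i = 0, 1, 2, ... rather than using the incremental update of the findPos routine shown later.

```java
final class QuadraticProbingDemo {
    /** Inserts keys using h_i(x) = (x mod tableSize + i^2) % tableSize; null marks an empty cell. */
    static Integer[] insertAll(int[] keys, int tableSize) {
        Integer[] table = new Integer[tableSize];
        for (int key : keys) {
            int home = Math.floorMod(key, tableSize);
            boolean placed = false;
            for (int i = 0; i < tableSize && !placed; i++) {
                int pos = (home + i * i) % tableSize;
                if (table[pos] == null) {
                    table[pos] = key;
                    placed = true;
                }
            }
            if (!placed) {
                // Can happen with quadratic probing when the table is more than half full
                throw new IllegalStateException("no free cell found for " + key);
            }
        }
        return table;
    }
}
```

Under these assumptions 49 lands in cell 0, 58 in cell 2 (after probing 8 and 9), and 69 in cell 3 (after probing 9 and 0).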


Quadratic Probing

    /**
     * Construct the hash table.
     */
    public QuadraticProbingHashTable( )
    {
        this( DEFAULT_TABLE_SIZE );
    }

    /**
     * Construct the hash table.
     * @param size the approximate initial size.
     */
    public QuadraticProbingHashTable( int size )
    {
        allocateArray( size );
        makeEmpty( );
    }


Quadratic Probing

    /**
     * Method that performs quadratic probing resolution.
     * @param x the item to search for.
     * @return the position where the search terminates.
     */
    private int findPos( Hashable x )
    {
        int collisionNum = 0;
        int currentPos = x.hash( array.length );

        while( array[ currentPos ] != null &&
                !array[ currentPos ].element.equals( x ) )
        {
            // i^2 - (i-1)^2 = 2i - 1, so this steps to the next quadratic probe
            currentPos += 2 * ++collisionNum - 1;
            if( currentPos >= array.length )
                currentPos -= array.length;
        }

        return currentPos;
    }


Double Hashing
Double hashing uses a secondary hash function d(k) and handles collisions by placing an item in the first available cell of the series
  (i + j d(k)) mod N, for j = 0, 1, … , N - 1
where i = h(k).
The secondary hash function d(k) cannot have zero values.
The table size N must be a prime to allow probing of all the cells.
Common choice of compression function for the secondary hash function:
  d2(k) = q - (k mod q)
where q < N and q is a prime.
The possible values for d2(k) are 1, 2, … , q.


Double Hashing
Popular choice: f(i) = i * hash2(x), i.e., apply a second hash function to x and probe at distances hash2(x), 2 hash2(x), 3 hash2(x), …
  hash2(x) = R - (x % R)
A poor choice of hash2(x) could be disastrous.
Observe: hash2(x) = x mod 9 would not work if 99 were inserted into the input in the previous example, since hash2(99) = 0.
R should also be a prime number smaller than TableSize.
If double hashing is correctly implemented, simulations imply that the expected number of probes is almost the same as for a random collision resolution strategy.


Example of Double Hashing
Consider a hash table storing integer keys that handles collisions with double hashing:
  N = 13
  h(k) = k mod 13
  d(k) = 7 - k mod 7
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

k    h(k)  d(k)  Probes
18   5     3     5
41   2     1     2
22   9     6     9
44   5     5     5, 10
59   7     4     7
32   6     3     6
31   5     4     5, 9, 0
73   8     4     8

Cell:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:   31      41          18  32  59  73  22  44
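The probe table can be checked with a short sketch of double hashing that uses h(k) = k mod N and d(k) = q - k mod q. The class name DoubleHashingDemo and the parameter names n and q are illustrative choices.

```java
final class DoubleHashingDemo {
    /** Inserts keys with h(k) = k mod n and step d(k) = q - (k mod q); null marks an empty cell. */
    static Integer[] insertAll(int[] keys, int n, int q) {
        Integer[] table = new Integer[n];
        for (int key : keys) {
            int home = Math.floorMod(key, n);
            int step = q - Math.floorMod(key, q);    // always in 1..q, never zero
            boolean placed = false;
            for (int j = 0; j < n && !placed; j++) {
                int pos = (home + j * step) % n;     // (i + j d(k)) mod N
                if (table[pos] == null) {
                    table[pos] = key;
                    placed = true;
                }
            }
            if (!placed) {
                throw new IllegalStateException("no free cell found for " + key);
            }
        }
        return table;
    }
}
```

Calling insertAll with the keys above, n = 13 and q = 7 reproduces the final table; with n = 10 and q = 7 the same routine can be used to trace the book's example with keys 89, 18, 49, 58, 69.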


Example in the book: Double Hashing

  hi(x) = (x + i (R - (x mod R))) % N, with R = 7, N = 10

The first collision occurs when 49 is inserted: hash2(49) = 7 - 0 = 7, thus 49 is inserted in position 6.
hash2(58) = 7 - 2 = 5, so 58 is inserted at location 3.
Finally, 69 collides and is inserted at a distance hash2(69) = 7 - 6 = 1 away.
Observe a bad scenario: if we had an input 60, then what happens?
  First, 60 collides with 69 in position 0.
  Since hash2(60) = 7 - 4 = 3, we would then try positions 3, 6, 9, and then 2 until an empty cell is found.


Rehashing
If the hash table gets too full, the operations start taking too long.
Insertions might fail for open addressing with quadratic probing.
Solution: rehashing, i.e. build another table that is about twice as big.
Rehashing is needed especially when there are too many removals intermixed with insertions.


Rehashing
Build another table that is about twice as big, e.g. if N = 11, then N' = 23.
Associate a new hash function.
Scan down the entire original hash table.
Compute the new hash value for each nondeleted element.
Insert it in the new table.


Rehashing
Very expensive operation: O(N)

Good news is that rehashing occurs very infrequently

If data structure is part of the program, effect is not noticeable.

If hashing is performed as part of an interactive system, then the unfortunate user whose insertion caused a rehash could observe a slowdown


Rehashing
When to apply rehashing?
  Strategy 1: as soon as the table is half full
  Strategy 2: only when an insertion fails
  Strategy 3: when the table reaches a certain load factor
There is no certain rule for the best strategy. Since the load factor is directly related to the performance of the system, the 3rd strategy may work better. Then what is the threshold?


Example: Rehashing

Suppose 13, 15, 23, 24, and 6 are inserted into a hash table of size 7.
Assume h(x) = x mod 7; using linear probing we get the table on the left.
But the table is 5/7 full, so rehashing is required. The new table size is 17, since 17 is the first prime number greater than 2 * 7.
The new hash function is h(x) = x mod 17. The old table is scanned from cell 0 upward, and its elements 6, 15, 23, 24, and 13 are inserted into the new table (as shown in the table on the right).
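The rehashing example can be traced with a small sketch. It stores bare integer keys with linear probing and uses null for empty cells; the class name RehashDemo and the helpers isPrime and nextPrime are written here for illustration and are not the book's versions.

```java
final class RehashDemo {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    /** Smallest prime greater than or equal to n. */
    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    /** Linear-probing insert with h(x) = x mod table.length; assumes a free cell exists. */
    static void insert(Integer[] table, int key) {
        int i = Math.floorMod(key, table.length);
        while (table[i] != null) {
            i = (i + 1) % table.length;
        }
        table[i] = key;
    }

    /** Builds a table about twice as big (prime size) and reinserts every element. */
    static Integer[] rehash(Integer[] old) {
        Integer[] bigger = new Integer[nextPrime(2 * old.length)];
        for (Integer key : old) {        // scan the old table from cell 0 upward
            if (key != null) insert(bigger, key);
        }
        return bigger;
    }
}
```

Starting from a table of size 7 filled with 13, 15, 23, 24 and 6, rehash allocates 17 cells; 6 and 23 both hash to cell 6 in the new table, so 23 moves to cell 7 and 24 is pushed on to cell 8.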


Rehashing

    private void allocateArray( int arraySize )
    {
        array = new HashEntry[ arraySize ];
    }

    private void rehash( )
    {
        HashEntry [ ] oldArray = array;

        // Create a new double-sized, empty table
        allocateArray( nextPrime( 2 * oldArray.length ) );
        currentSize = 0;

        // Copy table over
        for( int i = 0; i < oldArray.length; i++ )
            if( oldArray[ i ] != null && oldArray[ i ].isActive )
                insert( oldArray[ i ].element );

        return;
    }


Extendible Hashing (Why?)
The amount of data is too large to fit in main memory.

Main consideration is the number of disk accesses required to retrieve data

Assume: we have N records to store, and at most M records fit in one disk block. (assume M=4)


Extendible Hashing (Why?)
If open addressing or separate chaining is used, collisions could cause several disk blocks to be examined during a find, even for a well-distributed hash table.

When the table gets too full, rehashing requires O(N) disk accesses (very expensive!).

Instead, we use extendible hashing, which performs a find with only two disk accesses. Similarly, insertions require only a few disk accesses.


Extendible Hashing
Uses the idea of B-trees, which have depth O(log_{M/2} N). Choose M so large that the B-tree has a depth of 1.

Now a find needs one disk access, assuming that the root node can be stored in main memory. However, we have a problem here.

Problem: the branching factor is too high, so it takes too much time to determine which leaf the data is in.

This strategy works in practice only if the time to perform this step is reduced. This is exactly what we do with the extendible hashing strategy.


Example: Extendible Hashing
Assume our source data consists of several 6-bit integers.
The root of the "tree" contains four links determined by the leading two bits of the data.
Each leaf has at most M = 4 elements, based on the earlier assumption.
D = 2 denotes the number of bits used by the root, which is known as the directory. The number of entries in the directory is 2^D.
dL is the number of leading bits that all the elements of some leaf L have in common; dL <= D.


Extendible Hashing
Suppose we want to insert 100100. Since the leading two bits are 10, this would go to the 3rd leaf.
But the 3rd leaf is already full (due to M = 4), so we split this leaf into two leaves, which are now determined by the first three bits.
This requires increasing the directory size.
Note: although the entire directory is rewritten, none of the other leaves (1, 2, 4) is actually accessed.
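The directory lookup itself is just a shift that keeps the leading D bits of the key. The sketch below assumes 6-bit keys as in the example; the class name DirectoryIndex, the constant KEY_BITS and the 0-based entry numbering are choices made here for illustration.

```java
final class DirectoryIndex {
    static final int KEY_BITS = 6; // the example uses 6-bit integers

    /** Directory entry (0-based) selected by the leading d bits of a KEY_BITS-bit key. */
    static int index(int key, int d) {
        return key >>> (KEY_BITS - d);
    }
}
```

For the key 100100, D = 2 selects entry 2 (bits 10, the third leaf), and after the split D = 3 selects entry 4 (bits 100).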


Extendible Hashing
Suppose we want to insert 000000. Since the leading two bits are 00, this will split the 1st leaf. The only change in the directory is updating the entries for 000 and 001.

Therefore, this is a very good and fast strategy for insert and find operations on large databases.
However, READ pages 175-176 for scenarios in which this algorithm does not work, and how to avoid possible problems.


HWLAs
Problem 2 in the book: When rehashing, we choose a table size that is roughly twice as large and prime. In our case, the appropriate new table size is 19, with hash function h(x) = x (mod 19).
(a) Scanning down the separate chaining hash table, the new locations are 4371 in list 1, 1323 in list 12, 6173 in list 17, 4344 in list 12, 4199 in list 0, 9679 in list 8, and 1989 in list 13.
(b) The new locations are 9679 in bucket 8, 4371 in bucket 1, 1989 in bucket 13, 1323 in bucket 12, 6173 in bucket 17, 4344 in bucket 14 because both 12 and 13 are already occupied, and 4199 in bucket 0.
(c) The new locations are 9679 in bucket 8, 4371 in bucket 1, 1989 in bucket 13, 1323 in bucket 12, 6173 in bucket 17, 4344 in bucket 16 because both 12 and 13 are already occupied, and 4199 in bucket 0.
(d) The new locations are 9679 in bucket 8, 4371 in bucket 1, 1989 in bucket 13, 1323 in bucket 12, 6173 in bucket 17, 4344 in bucket 15 because 12 is already occupied, and 4199 in bucket 0.

Problems in CHP5: 1, 4, 5, 11, 16
Improved Merkle Cryptosystem


Java Example: hash table with linear probing (*)

    /** A hash table with linear probing and the MAD hash function */
    public class HashTable implements Map {
      protected static class HashEntry implements Entry {
        Object key, value;
        HashEntry () { /* default constructor */ }
        HashEntry(Object k, Object v) { key = k; value = v; }
        public Object key() { return key; }
        public Object value() { return value; }
        protected Object setValue(Object v) {
          // set a new value, returning old
          Object temp = value;
          value = v;
          return temp; // return old value
        }
      }
      /** Nested class for a default equality tester */
      protected static class DefaultEqualityTester implements EqualityTester {
        DefaultEqualityTester() { /* default constructor */ }
        /** Returns whether the two objects are equal. */
        public boolean isEqualTo(Object a, Object b) { return a.equals(b); }
      }
      protected static Entry AVAILABLE = new HashEntry(null, null); // empty marker
      protected int n = 0; // number of entries in the dictionary
      protected int N; // capacity of the bucket array
      protected Entry[] A; // bucket array
      protected EqualityTester T; // the equality tester
      protected int scale, shift; // the shift and scaling factors
      /** Creates a hash table with initial capacity 1023. */
      public HashTable() {
        N = 1023; // default capacity
        A = new Entry[N];
        T = new DefaultEqualityTester(); // use the default equality tester
        java.util.Random rand = new java.util.Random();
        scale = rand.nextInt(N-1) + 1;
        shift = rand.nextInt(N);
      }
      /** Creates a hash table with the given capacity and equality tester. */
      public HashTable(int bN, EqualityTester tester) {
        N = bN;
        A = new Entry[N];
        T = tester;
        java.util.Random rand = new java.util.Random();
        scale = rand.nextInt(N-1) + 1;
        shift = rand.nextInt(N);
      }


Java Example (cont. *)

      /** Determines whether a key is valid. */
      protected void checkKey(Object k) {
        if (k == null) throw new InvalidKeyException("Invalid key: null.");
      }
      /** Hash function applying MAD method to default hash code. */
      public int hashValue(Object key) {
        return Math.abs(key.hashCode()*scale + shift) % N;
      }
      /** Returns the number of entries in the hash table. */
      public int size() { return n; }
      /** Returns whether or not the table is empty. */
      public boolean isEmpty() { return (n == 0); }
      /** Helper search method - returns index of found key or -index-1,
       * where index is the index of an empty or available slot. */
      protected int findEntry(Object key) throws InvalidKeyException {
        int avail = 0;
        checkKey(key);
        int i = hashValue(key);
        int j = i;
        do {
          if (A[i] == null) return -i - 1; // entry is not found
          if (A[i] == AVAILABLE) { // bucket is deactivated
            avail = i; // remember that this slot is available
            i = (i + 1) % N; // keep looking
          }
          else if (T.isEqualTo(key,A[i].key())) // we have found our entry
            return i;
          else // this slot is occupied--we must keep looking
            i = (i + 1) % N;
        } while (i != j);
        return -avail - 1; // entry is not found
      }
      /** Returns the value associated with a key. */
      public Object get (Object key) throws InvalidKeyException {
        int i = findEntry(key); // helper method for finding a key
        if (i < 0) return null; // there is no value for this key
        return A[i].value(); // return the found value in this case
      }
      /** Put a key-value pair in the map, replacing previous one if it exists. */
      public Object put (Object key, Object value) throws InvalidKeyException {
        if (n >= N/2) rehash(); // rehash to keep the load factor <= 0.5
        int i = findEntry(key); // find the appropriate spot for this entry
        if (i < 0) { // this key does not already have a value
          A[-i-1] = new HashEntry(key, value); // convert to the proper index
          n++;
          return null; // there was no previous value
        }
        else // this key has a previous value
          return ((HashEntry) A[i]).setValue(value); // set new value & return old
      }
      /** Doubles the size of the hash table and rehashes all the entries. */
      protected void rehash() {
        N = 2*N;
        Entry[] B = A;
        A = new Entry[N]; // allocate a new version of A twice as big as before
        java.util.Random rand = new java.util.Random();
        scale = rand.nextInt(N-1) + 1; // new hash scaling factor
        shift = rand.nextInt(N); // new hash shifting factor
        for (int i=0; i<B.length; i++)
          if ((B[i] != null) && (B[i] != AVAILABLE)) { // if we have a valid entry
            int j = findEntry(B[i].key()); // find the appropriate spot
            A[-j-1] = B[i]; // copy into the new array
          }
      }
      /** Removes the key-value pair with a specified key. */
      public Object remove (Object key) throws InvalidKeyException {
        int i = findEntry(key); // find this key first
        if (i < 0) return null; // nothing to remove
        Object toReturn = A[i].value();
        A[i] = AVAILABLE; // mark this slot as deactivated
        n--;
        return toReturn;
      }
      /** Returns an iterator of keys. */
      public java.util.Iterator keys() {
        List keys = new NodeList();
        for (int i=0; i<N; i++)
          if ((A[i] != null) && (A[i] != AVAILABLE))
            keys.insertLast(A[i].key());
        return keys.elements();
      }
      // ... values() is similar to keys() and is omitted here ...
    }


Hash Functions (*)
A hash function is usually specified as the composition of two functions:
  Hash code: h1: keys → integers
  Compression function: h2: integers → [0, N - 1]
The hash code is applied first, and the compression function is applied next on the result, i.e.,
  h(x) = h2(h1(x))
The goal of the hash function is to "disperse" the keys in an apparently random way.


Performance of Hashing(*)

In the worst case, searches, insertions and removals on a hash table take O(n) time.
The worst case occurs when all the keys inserted into the map collide.
The load factor α = n/N affects the performance of a hash table.
Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is
  1 / (1 - α)
The expected running time of all the dictionary ADT operations in a hash table is O(1).
In practice, hashing is very fast provided the load factor is not close to 100%.
Applications of hash tables:
  small databases
  compilers
  browser caches


Hash Codes (*)
Memory address:
  We reinterpret the memory address of the key object as an integer (default hash code of all Java objects)
  Good in general, except for numeric and string keys
Integer cast:
  We reinterpret the bits of the key as an integer
  Suitable for keys of length less than or equal to the number of bits of the integer type (e.g., byte, short, int and float in Java)
Component sum:
  We partition the bits of the key into components of fixed length (e.g., 16 or 32 bits) and we sum the components (ignoring overflows)
  Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type (e.g., long and double in Java)


Hash Codes (* cont.)
Polynomial accumulation:
  We partition the bits of the key into a sequence of components of fixed length (e.g., 8, 16 or 32 bits)
    a_0 a_1 … a_{n-1}
  We evaluate the polynomial
    p(z) = a_0 + a_1 z + a_2 z^2 + … + a_{n-1} z^{n-1}
  at a fixed value z, ignoring overflows
  Especially suitable for strings (e.g., the choice z = 33 gives at most 6 collisions on a set of 50,000 English words)
Polynomial p(z) can be evaluated in O(n) time using Horner's rule:
  The following polynomials are successively computed, each from the previous one in O(1) time
    p_0(z) = a_{n-1}
    p_i(z) = a_{n-i-1} + z p_{i-1}(z)   (i = 1, 2, …, n - 1)
  We have p(z) = p_{n-1}(z)
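The recurrence can be written directly as a loop. The sketch below treats each char of a string as one component a_i and lets int arithmetic overflow silently, as the slide suggests; the class name PolynomialHash and the naive comparison method are added here only for illustration.

```java
final class PolynomialHash {
    /** Evaluates p(z) = a_0 + a_1 z + ... + a_{n-1} z^{n-1} by Horner's rule in O(n). */
    static int horner(String key, int z) {
        int p = 0;
        for (int i = key.length() - 1; i >= 0; i--) {
            p = key.charAt(i) + z * p;   // p_i(z) = a_{n-i-1} + z * p_{i-1}(z)
        }
        return p;
    }

    /** Same polynomial evaluated term by term, recomputing each power: O(n^2). */
    static int naive(String key, int z) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            int power = 1;
            for (int j = 0; j < i; j++) {
                power *= z;
            }
            sum += key.charAt(i) * power;
        }
        return sum;
    }
}
```

For the key "ab" with z = 33 both methods give 97 + 98 * 33 = 3331. Because int overflow wraps consistently, the two methods also agree on long keys.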


Compression Functions (*)

Division:
  h2(y) = y mod N
  The size N of the hash table is usually chosen to be a prime
  The reason has to do with number theory and is beyond the scope of this course
Multiply, Add and Divide (MAD):
  h2(y) = (ay + b) mod N
  a and b are nonnegative integers such that a mod N ≠ 0
  Otherwise, every integer would map to the same value b
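A minimal sketch of the MAD compression function follows. It widens to long before multiplying so that a * y cannot overflow, and uses Math.floorMod so that negative hash codes still map into [0, N - 1]; both of these details, and the class name MadCompression, are choices made for this sketch rather than something the slides prescribe.

```java
final class MadCompression {
    /** Returns (a*y + b) mod n, rejecting the degenerate case a mod n == 0. */
    static int compress(int y, int a, int b, int n) {
        if (Math.floorMod(a, n) == 0) {
            throw new IllegalArgumentException("a mod N must be nonzero");
        }
        return (int) Math.floorMod((long) a * y + b, (long) n);
    }
}
```

For example, compress(10, 3, 4, 7) evaluates (3 * 10 + 4) mod 7 = 34 mod 7 = 6.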


Map Methods with Separate Chaining used for Collisions (*)

Delegate operations to a list-based map at each cell:

Algorithm get(k):
Output: The value associated with the key k in the map, or null if there is no entry with key equal to k in the map
    return A[h(k)].get(k)    {delegate the get to the list-based map at A[h(k)]}

Algorithm put(k,v):
Output: If there is an existing entry in our map with key equal to k, then we return its value (replacing it with v); otherwise, we return null
    t = A[h(k)].put(k,v)    {delegate the put to the list-based map at A[h(k)]}
    if t = null then    {k is a new key}
        n = n + 1
    return t

Algorithm remove(k):
Output: The (removed) value associated with key k in the map, or null if there is no entry with key equal to k in the map
    t = A[h(k)].remove(k)    {delegate the remove to the list-based map at A[h(k)]}
    if t ≠ null then    {k was found}
        n = n - 1
    return t
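The three algorithms can be sketched in Java as follows. For brevity the per-cell list-based map is inlined as an ArrayList of entries instead of being a separate class, and h(k) is taken to be hashCode() mod the number of cells; the class name ChainedMap and these simplifications are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class ChainedMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<List<Entry<K, V>>> buckets = new ArrayList<>();
    private int n = 0; // number of entries in the map

    ChainedMap(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new ArrayList<>());
    }

    /** The list stored at cell h(k). */
    private List<Entry<K, V>> bucket(K key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    V get(K key) {
        for (Entry<K, V> e : bucket(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    V put(K key, V value) {
        List<Entry<K, V>> b = bucket(key);
        for (Entry<K, V> e : b) {
            if (e.key.equals(key)) {       // existing key: replace and return the old value
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        b.add(new Entry<>(key, value));    // k is a new key
        n++;
        return null;
    }

    V remove(K key) {
        List<Entry<K, V>> b = bucket(key);
        for (int i = 0; i < b.size(); i++) {
            if (b.get(i).key.equals(key)) {   // k was found
                n--;
                return b.remove(i).value;
            }
        }
        return null;
    }

    int size() { return n; }
}
```

With 13 cells, the Integer keys 5 and 18 share cell 5, so the second put simply appends to the same list and both keys remain retrievable.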


Search with Linear Probing (*)

Consider a hash table A that uses linear probing.
get(k):
  We start at cell h(k)
  We probe consecutive locations until one of the following occurs:
    An item with key k is found, or
    An empty cell is found, or
    N cells have been unsuccessfully probed

Algorithm get(k)
    i ← h(k)
    p ← 0
    repeat
        c ← A[i]
        if c = ∅
            return null
        else if c.key() = k
            return c.element()
        else
            i ← (i + 1) mod N
            p ← p + 1
    until p = N
    return null


Updates with Linear Probing(*)

To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements.
remove(k):
  We search for an entry with key k
  If such an entry (k, o) is found, we replace it with the special item AVAILABLE and we return element o
  Else, we return null
put(k, o):
  We throw an exception if the table is full
  We start at cell h(k)
  We probe consecutive cells until one of the following occurs:
    A cell i is found that is either empty or stores AVAILABLE, or
    N cells have been unsuccessfully probed
  We store entry (k, o) in cell i


A Simple List-Based Map (*)

We can efficiently implement a map using an unsorted list.
We store the items of the map in a list S (based on a doubly-linked list), in arbitrary order.

[Figure: doubly-linked list S with header and trailer sentinels; its nodes/positions hold the entries with keys 9, 6, 5, 8]


The get(k) Algorithm
Algorithm get(k):
    B = S.positions()    {B is an iterator of the positions in S}
    while B.hasNext() do
        p = B.next()    {the next position in B}
        if p.element().key() = k then
            return p.element().value()
    return null    {there is no entry with key equal to k}


The put(k,v) Algorithm
Algorithm put(k,v):
    B = S.positions()
    while B.hasNext() do
        p = B.next()
        if p.element().key() = k then
            t = p.element().value()
            B.replace(p,(k,v))
            return t    {return the old value}
    S.insertLast((k,v))
    n = n + 1    {increment variable storing number of entries}
    return null    {there was no previous entry with key equal to k}


The remove(k) Algorithm
Algorithm remove(k):
    B = S.positions()
    while B.hasNext() do
        p = B.next()
        if p.element().key() = k then
            t = p.element().value()
            S.remove(p)
            n = n - 1    {decrement number of entries}
            return t    {return the removed value}
    return null    {there is no entry with key equal to k}
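The get, put and remove algorithms of the list-based map translate almost line for line into Java. The sketch below uses java.util.LinkedList (a doubly-linked list) for S and java.util.AbstractMap.SimpleEntry for the entries; the class name UnsortedListMap is hypothetical, and the list's own size replaces the explicit counter n.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

final class UnsortedListMap<K, V> {
    private final LinkedList<Map.Entry<K, V>> s = new LinkedList<>();

    V get(K k) {
        for (Map.Entry<K, V> e : s) {
            if (e.getKey().equals(k)) return e.getValue();
        }
        return null; // there is no entry with key equal to k
    }

    V put(K k, V v) {
        for (Map.Entry<K, V> e : s) {
            if (e.getKey().equals(k)) return e.setValue(v); // setValue returns the old value
        }
        s.addLast(new SimpleEntry<>(k, v)); // O(1) insertion at the end
        return null; // there was no previous entry with key equal to k
    }

    V remove(K k) {
        Iterator<Map.Entry<K, V>> it = s.iterator();
        while (it.hasNext()) {
            Map.Entry<K, V> e = it.next();
            if (e.getKey().equals(k)) {
                it.remove();
                return e.getValue(); // return the removed value
            }
        }
        return null;
    }

    int size() { return s.size(); }
}
```

As in the pseudocode, put also scans the list, because it must check for an existing key before appending.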
