Hash Tables Briana B. Morrison Adapted from William Collins.

Hashing 2

averageTimeS(n), THE AVERAGE TIME

FOR A SUCCESSFUL SEARCH

averageTimeU(n), … UNSUCCESSFUL …

worstTimeS(n)

worstTimeU(n)

Hashing 3

LET’S START WITH A REVIEW OF EARLIER SEARCH TECHNIQUES:

Hashing 4

Sequential Search

Given a vector of integers:

v = {12, 15, 18, 3, 76, 9, 14, 33, 51, 44}

What is the best case for sequential search? O(1) when value is the first element

What is the worst case? O(n) when value is last element, or value is not in the list

What is the average case? O(1/2 * n) which is O(n)

Hashing 5

SEQUENTIAL SEARCH IN STL

// Postcondition: if there is an item in the range of iterators
//                from first (inclusive) through last (exclusive)
//                that is equal to value, the iterator returned is
//                the first iterator i in that range such that
//                *i == value. Otherwise, last is returned.
//                The worstTime(n) is O(n).
template <typename InputIterator, typename T>
InputIterator find(InputIterator first, InputIterator last, const T& value)
{
    while (first != last && *first != value)
        ++first;
    return first;
}

Hashing 6

THE worstTimeU(n) IS LINEAR IN n.

DITTO FOR worstTimeS(n), averageTimeU(n), AND averageTimeS(n).

Hashing 7

Binary Search

Given a vector of integers:

v = {3, 9, 12, 14, 15, 18, 33, 44, 51, 76}

What is the best case for binary search? O(1) when element is the middle element

What is the worst case? O(log n) when element is first, last, or not in list

What is the average case? O(log n)

Hashing 8

BINARY SEARCH OF A SORTED CONTAINER:

template <typename ForwardIterator, typename T>
inline bool binary_search(ForwardIterator first, ForwardIterator last, const T& value)

example:

if (binary_search(vector.begin(), vector.end(), value))

Hashing 9

Do you remember how binary search works?

Distance len = last - first;
Distance half;
RandomAccessIterator middle;
while (len > 0)
{
    half = len / 2;
    middle = first + half;
    if (*middle < value)
    {
        first = middle + 1;
        len = len - half - 1;
    }
    else
        len = half;
}
return first;
}

Hashing 10

THE worstTimeU(n) IS LOGARITHMIC IN n.

DITTO FOR worstTimeS(n), averageTimeU(n), AND averageTimeS(n).

Hashing 11

NOW LET’S FOCUS ON AN UNUSUAL BUT VERY EFFICIENT SEARCH TECHNIQUE:

HASHING

Hashing 12

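THE CLASS IN WHICH HASHING IS

IMPLEMENTED IS THE hash_map

CLASS. WHEN THESE SLIDES WERE WRITTEN, IT WAS NOT YET IN THE

STANDARD TEMPLATE LIBRARY. SINCE C++11, THE STANDARD LIBRARY

PROVIDES A SIMILAR HASHED CONTAINER, std::unordered_map.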

Hashing 13

TO A USER, THE hash_map CLASS

IS SIMILAR TO THE map CLASS,

EXCEPT hash_map HAS ONLY A FEW

METHODS, SUCH AS insert, erase, AND

find. AND THE TIMING ESTIMATES

FOR THOSE METHODS ARE LOWER THAN IN THE map CLASS.

Hashing 14

RECALL THAT EACH VALUE (THAT IS, ITEM) IN A MAP IS A PAIR WHOSE

FIRST COMPONENT IS OF TYPE Key

AND WHOSE SECOND COMPONENT IS

OF TYPE T. THE KEYS ARE UNIQUE, THAT IS, NO TWO DISTINCT VALUES HAVE THE SAME KEY.

Hashing 15

HERE ARE THE METHOD

INTERFACES FOR THE hash_map

CLASS:

Hashing 16

1. // Postcondition: this hash_map is empty.
   hash_map( );

2. // Postcondition: the number of items in this hash_map
   //                has been returned.
   int size( );

Hashing 17

3. // Postcondition: If an item with x's key had already been
   //                inserted into this hash_map, the pair
   //                returned consists of an iterator positioned
   //                at the previously inserted item, and false.
   //                Otherwise, the pair returned consists of
   //                an iterator positioned at the newly inserted
   //                item, and true. Timing estimates are
   //                discussed later.
   pair<iterator, bool> insert (const value_type<const key_type, T>& x);

Hashing 18

4. // Postcondition: if this hash_map already contains a value
   //                whose key part is key, a reference to that
   //                value's second component has been
   //                returned. Otherwise, a new value, <key,
   //                T( )>, is inserted into this hash_map and a
   //                reference to its second component has been
   //                returned. Timing estimates are discussed later.
   T& operator[ ] (const key_type& key);

Hashing 19

5. // Postcondition: If this hash_map contains a value whose
   //                first component equals key, an iterator
   //                positioned at that value has been returned.
   //                Otherwise, an iterator at the same
   //                position as end() has been returned.
   //                Timing estimates are discussed later.
   iterator find (const key_type& key);

6. // Precondition: itr is positioned at a value in this hash_map.
   // Postcondition: the value that itr is positioned at has been
   //                deleted from this hash_map. Timing
   //                estimates are discussed later in this chapter.
   void erase (iterator itr);

Hashing 20

7. // Postcondition: an iterator positioned at the beginning
   //                of this hash_map has been returned.
   //                Timing estimates are discussed later.
   iterator begin( );

8. // Postcondition: an iterator has been returned that can be
   //                used in comparisons to terminate iterating
   //                through this hash_map.
   iterator end( );

9. // Postcondition: the space for this hash_map object has
   //                been deallocated.
   ~hash_map( );

Hashing 21

Map vs. Hashmap

What are the differences between a map and a hashmap?
- Interface
- Efficiency
- Applications
- Implementation

Hashing 22

WE’LL STUDY THE TIME ESTIMATES

AFTER WE DEFINE THE METHODS.

BUT BASICALLY, FOR find, insert, AND

erase,

averageTime(n) IS CONSTANT!

Hashing 23

FIELDS IN THE hash_map CLASS

Hashing 24

CONTIGUOUS

array? vector? deque? heap?

LINKED

Linked? list? map?

BUT NONE OF THESE WILL GIVE

CONSTANT AVERAGE TIME FOR

SEARCHES, INSERTIONS AND

REMOVALS.

Hashing 25

HERE IS THE BASIC IDEA:

buckets // an array of values

count // the number of values in the hash_map

Hashing 26

LET’S SEE WHERE THAT LEADS.

SUPPOSE persons IS A HASH MAP THAT WILL HOLD UP TO 1000 VALUES. EACH VALUE CONSISTS OF A UNIQUE 3-DIGIT INTEGER (THE KEY), AND A NAME.

Hashing 27

[Figure: the array buckets, with indices 0, 1, 2, ..., 999, next to the field count.]

Hashing 28

persons [351] = "Prashant";

persons [108] = "Barrett";

persons [435] = "Lin";

WHERE SHOULD WE STORE THE VALUE WHOSE KEY IS 351?

Hashing 29

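[Figure: the array buckets and the field count. Each value is stored at the index equal to its key:]

index   value
108     108 Barrett
351     351 Prashant
435     435 Lin
(all other indices: ?)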

Hashing 30

NOW FOR SOMETHING SLIGHTLY

DIFFERENT: SUPPOSE persons IS A

HASH MAP THAT HOLDS UP TO 1000

VALUES. EACH VALUE CONSISTS OF

A 10-DIGIT TELEPHONE NUMBER

(THE KEY), AND A NAME.

Hashing 31

persons [9876543210] = “Prashant”;

persons [6103301256] = “Barrett”;

persons [6103309816] = “Lin”;

persons [4153576256] = “Sutey”;

WHERE SHOULD THESE VALUES

BE STORED?

Hashing 32

9876543210 6103301256

6103309816 4153576256

To make these values fit into the table, we need to mod by the table size; i.e., key % 1000.
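As a minimal sketch of the key % 1000 step, the four telephone numbers above can be reduced to bucket indices as follows. The helper name bucket_index is hypothetical and is not part of the hash_map class described in these slides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: reduce a 10-digit telephone number to an index
// in a table with table_size buckets by taking the remainder.
std::size_t bucket_index(unsigned long long key, std::size_t table_size)
{
    assert(table_size > 0);   // a remainder by zero is undefined
    return static_cast<std::size_t>(key % table_size);
}
```

With a table size of 1000, 9876543210 goes to index 210, 6103309816 goes to index 816, and both 6103301256 and 4153576256 go to index 256, so two different keys compete for the same bucket.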

Hashing 33

WHEN TWO DIFFERENT KEYS MAP TO THE SAME INDEX, THAT IS CALLED A COLLISION.

KEYS THAT MAP TO THE SAME INDEX ARE CALLED SYNONYMS.

Hashing 34

HASHING:

AN ALGORITHM THAT TRANSFORMS A KEY INTO AN ARRAY INDEX.

Hashing 35

THE ALGORITHM HAS TWO PARTS:

1. A HASH FUNCTION: AN EASILY COMPUTABLE OPERATION ON THE

KEY THAT RETURNS AN unsigned

long, WHICH IS THEN CONVERTED

INTO AN INDEX IN THE ARRAY

buckets;

2. A COLLISION HANDLER.

Hashing 36

Hash Codes
- Suppose we have a table of size N
- A hash code is:
  - A number in the range 0 to N-1
  - We compute the hash code from the key
  - You can think of this as a “default position” when inserting, or a “position hint” when looking up
- A hash function is a way of computing a hash code
- Desire: The set of keys should spread evenly over the N values
- When two keys have the same hash code: collision

Hashing 37

Hash Functions

A hash function should be quick and easy to compute.

A hash function should achieve an even distribution of the keys that actually occur across the range of indices for both random and non-random data.

Calculation should involve the entire search key.

Hashing 38

Examples of Hash Functions
- Usually involves taking the key, chopping it up, and mixing the pieces together in various ways
- Examples:
  - Truncation – ignore part of the key, use the remaining part as the index
  - Folding – partition the key into several parts and combine the parts in a convenient way (adding, etc.)
- After calculating the index, use modular arithmetic: divide by the size of the index range, and take the remainder as the result

Hashing 39

Example Hash Function

hf(22) = 22, and 22 % 7 = 1, so the key goes to tableEntry[1]

hf(4) = 4, and 4 % 7 = 4, so the key goes to tableEntry[4]

Hashing 40

Devising Hash Functions
- Simple functions often produce many collisions
- ... but complex functions may not be good either!
- It is often an empirical process
- Adding letter values in a string: same hash for strings with same letters in different order
- Better approach:

size_t hash = 0;
for (size_t i = 0; i < s.size(); ++i)
    hash = hash * 31 + s[i];

Hashing 41

Devising Hash Functions (2)
- The String hash is good in that:
  - Every letter affects the value
  - The order of the letters affects the value
  - The values tend to be spread well over the integers
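To make the contrast concrete, here is a small sketch that places the additive hash next to the multiply-by-31 hash from the previous slide. The function names additive_hash and polynomial_hash are chosen for illustration only.

```cpp
#include <cstddef>
#include <string>

// Adds the character codes: strings with the same letters in a
// different order always get the same hash.
std::size_t additive_hash(const std::string& s)
{
    std::size_t hash = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        hash += static_cast<unsigned char>(s[i]);
    return hash;
}

// Multiplies the running total by 31 before adding each character,
// so the position of every letter affects the result.
std::size_t polynomial_hash(const std::string& s)
{
    std::size_t hash = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        hash = hash * 31 + static_cast<unsigned char>(s[i]);
    return hash;
}
```

For "abc" and "cba" the additive hash is 294 in both cases, while the multiply-by-31 hash gives 96354 and 98274 respectively.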

Hashing 42

Devising Hash Functions (3)

Guidelines for good hash functions:

Spread values evenly: as if “random”

Cheap to compute

Generally, number of possible values much greater than table size

Hashing 43

Hash Code Maps

Memory address:
- We reinterpret the memory address of the key object as an integer
- Good in general, except for numeric and string keys

Integer cast:
- We reinterpret the bits of the key as an integer
- Suitable for keys of length less than or equal to the number of bits of the integer type (e.g., char, short, int and float on many machines)

Component sum:
- We partition the bits of the key into components of fixed length (e.g., 16 or 32 bits) and we sum the components (ignoring overflows)
- Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type (e.g., long and double on many machines)

Hashing 44

Hash Code Maps (cont.)

Polynomial accumulation:
- We partition the bits of the key into a sequence of components of fixed length (e.g., 8, 16 or 32 bits): $a_0, a_1, \ldots, a_{n-1}$
- We evaluate the polynomial $p(z) = a_0 + a_1 z + a_2 z^2 + \cdots + a_{n-1} z^{n-1}$ at a fixed value $z$, ignoring overflows
- Especially suitable for strings (e.g., the choice $z = 33$ gives at most 6 collisions on a set of 50,000 English words)

Polynomial $p(z)$ can be evaluated in O(n) time using Horner’s rule:
- The following polynomials are successively computed, each from the previous one in O(1) time:
  $p_0(z) = a_{n-1}$
  $p_i(z) = a_{n-i-1} + z\,p_{i-1}(z)$  $(i = 1, 2, \ldots, n-1)$
- We have $p(z) = p_{n-1}(z)$
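Horner's rule can be sketched in a few lines. In this illustration the components are passed as a vector whose element i holds the coefficient of z to the power i; the function name horner and the use of unsigned long are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Evaluates a[0] + a[1]*z + ... + a[n-1]*z^(n-1) with Horner's rule.
// Overflow is ignored: unsigned arithmetic simply wraps around.
unsigned long horner(const std::vector<unsigned long>& a, unsigned long z)
{
    unsigned long p = 0;
    // Start from the highest coefficient a[n-1] and work down to a[0];
    // each step costs one multiplication and one addition.
    for (std::size_t i = a.size(); i-- > 0; )
        p = a[i] + z * p;
    return p;
}
```

For the components 1, 2, 3 and z = 33 the result is 1 + 2*33 + 3*33*33 = 3334, obtained with two multiplications by z instead of computing each power separately.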

Hashing 45

HERE IS THE START OF THE

hash_map CLASS:

template<typename Key, typename T, typename HashFunc> class hash_map {

THE THIRD TEMPLATE PARAMETER

IS A FUNCTION CLASS: A CLASS IN

WHICH THE FUNCTION-CALL

OPERATOR, operator( ), IS

OVERLOADED. THIS IS THE HASH

FUNCTION CLASS.

Hashing 46

THE HEADING FOR operator( ) IS

unsigned long operator( ) (const key_type& key)

FOR EXAMPLE, WE CAN DEFINE A

SIMPLE HASH FUNCTION CLASS IF

EACH KEY IS AN int:

class hash_func
{
public:
    unsigned long operator( ) (const int& key)
    {
        return (unsigned long)key;
    } // overloaded operator( )
}; // class hash_func

Hashing 47

HERE IS A PROGRAM WITH A

hash_map CLASS IN WHICH EACH VALUE CONSISTS OF A TELEPHONE

EXTENSION AND THE PERSON AT THAT EXTENSION. THE ABOVE

hash_func IS USED.

Hashing 48

int main()
{
    typedef hash_map<int, string, hash_func> hash_class;

    hash_class extensions;
    hash_class::iterator itr;

    extensions [5520] = "Yvonne";
    extensions [5415] = "Jim";
    extensions [5416] = "Penny";
    extensions [5537] = "Chun Wai";
    extensions [5273] = "Jim";

    for (itr = extensions.begin(); itr != extensions.end(); itr++)
        cout << (*itr).first << " " << (*itr).second << endl;
    cout << "The number of items is " << extensions.size() << endl;

Hashing 49

    if (extensions.find (5537) != extensions.end())
    {
        cout << endl << "At extension " << 5537 << " is "
             << extensions [5537] << endl;
        extensions.erase (extensions.find (5537));
    } // if

    for (itr = extensions.begin( ); itr != extensions.end( ); itr++)
        cout << (*itr).first << " " << (*itr).second << endl;
    return 0;
} // main

Hashing 50

HERE IS THE OUTPUT:

5520 Yvonne
5537 Chun Wai
5415 Jim
5416 Penny
5273 Jim
The number of items is 5

At extension 5537 is Chun Wai
5520 Yvonne
5415 Jim
5416 Penny
5273 Jim

Hashing 51

THERE IS NO OBVIOUS ORDER OF THE KEYS. IF THE CONTAINER MUST

ALWAYS BE IN ORDER, USE A map

INSTEAD OF A hash_map.

Hashing 52

HERE IS ANOTHER hash_func CLASS, ONE IN WHICH THE KEY IS A STRING OF UP TO 20 CHARACTERS. BASICALLY, WE ADD UP THE ASCII VALUES OF THE KEY’S CHARACTERS. TO FURTHER SPREAD OUT THE RESULT, PARTIAL TOTALS ARE MULTIPLIED BY 13, AND THE FINAL TOTAL IS MULTIPLIED BY A BIG PRIME.

Hashing 53

class hash_func
{
public:
    unsigned long operator( ) (const string& key)
    {
        const unsigned long BIG_PRIME = 4294967291;
        unsigned long total = 0;

        for (unsigned i = 0; i < key.length(); i++)
            total += 13 * key [i];
        return total * BIG_PRIME;
    } // operator( )
}; // class hash_func

Hashing 54

THE hash_func CLASS IS SUPPLIED BY

THE USER / CLIENT PROGRAMMER.

THE hash_map CLASS CONVERTS THE

unsigned long RETURNED BY operator( )

INTO AN ARRAY INDEX BY TAKING

THE REMAINDER % CAPACITY OF

buckets.

Hashing 55

EXERCISE: SUPPOSE THE CAPACITY OF buckets IS 203, AND FOR key1, key2, AND key3,

THE unsigned long NUMBERS

RETURNED BY hash_func (const string&

key) ARE 202, 203, AND 204

RESPECTIVELY. AT WHAT

LOCATIONS WOULD THE VALUES

WITH KEYS key1, key2, AND key3 BE

STORED?

Hashing 56

AS YOU MIGHT HAVE GUESSED,

HASHING IS INEFFICIENT WHEN

THERE ARE A LOT OF COLLISIONS.

Hashing 57

USERS OF THE hash_map CLASS “HOPE” THAT THE KEYS ARE

SCATTERED RANDOMLY THROUGHOUT THE TABLE. THIS

HOPE IS FORMALLY STATED AS

FOLLOWS:

Hashing 58

THE UNIFORM HASHING ASSUMPTION

EACH KEY IS EQUALLY LIKELY TO HASH TO ANY ONE OF THE TABLE ADDRESSES, INDEPENDENTLY OF WHERE THE OTHER KEYS HAVE HASHED.

Hashing 59

EVEN IF THE UNIFORM HASHING ASSUMPTION HOLDS, THERE MAY STILL BE COLLISIONS.

Hashing 60

Collision Handlers

NOW WE’LL LOOK AT SPECIFIC COLLISION HANDLERS:

- Chaining
- Linear Probing (Open Addressing)
- Double Hashing
- Quadratic Hashing

Hashing 61

Collision Handling

Collisions occur when different elements are mapped to the same cell

Chaining: let each cell in the table point to a linked list of elements that map there

Chaining is simple, but requires additional memory outside the table

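[Figure: a table with cells 0 through 4. One cell points to a list holding 451-229-0004 and 981-101-0004; another cell points to a list holding 025-612-0001.]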

Hashing 62

CHAINING (ALSO CALLED CHAINED

HASHING): AT INDEX i IN buckets,

STORE THE LIST OF ALL VALUES

WHOSE KEYS HASH TO i. HERE ARE THE FIELDS FOR CHAINED

HASHING:

Hashing 63

list <value_type< const key_type, T> >* buckets;
    // at each index in the array buckets,
    // we will store the list of all
    // items whose keys hashed to that index

int count,     // number of items in this hash_map
    length;    // number of buckets in this hash_map
    // these two fields are used to calculate the load to
    // know when to increase the size of the table

hash_func hash;    // hash is a function object

Hashing 64

Chaining with Separate Lists Example

[Figure: a table whose buckets each hold a separate list. The lists shown are:]

89 (1) -> 45 (2)
14 (1)
35 (1)
54 (1) -> 76 (2)
94 (1)
77 (1)

Hashing 65

Chaining Picture

Two items hashed to bucket 3

Three items hashed to bucket 4

Hashing 66

INSERT VALUES WITH THESE KEYS:

2155551612
7178626358
6103309358
6103309000
7178621359
7178627451
2155554358
6103300451

ASSUME length = 1000. IGNORE 2ND COMPONENT

IN VALUE, IGNORE prev FIELD, USE ‘X’ AT END.

Hashing 67

buckets (count = 8)

index   list
0       6103309000 X
1
...
358     7178626358 -> 6103309358 -> 2155554358 X
359     7178621359 X
...
451     7178627451 -> 6103300451 X
...
612     2155551612 X
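The chained table for these eight keys can be reproduced with a short sketch. The class name ChainedSet and its members are hypothetical; it stores keys only (the second component is ignored, as in the exercise), takes key % length as the index, and appends new keys to the back of a list, which is an assumption about insertion order.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical chained hash table of keys (illustration only).
class ChainedSet
{
public:
    explicit ChainedSet(std::size_t length) : buckets(length), count(0)
    {
        assert(length > 0);
    }

    // Returns false if the key was already present.
    bool insert(unsigned long long key)
    {
        std::list<unsigned long long>& chain = buckets[key % buckets.size()];
        for (unsigned long long k : chain)
            if (k == key)
                return false;
        chain.push_back(key);
        ++count;
        return true;
    }

    // Searches only the list at the index the key hashes to.
    bool contains(unsigned long long key) const
    {
        for (unsigned long long k : buckets[key % buckets.size()])
            if (k == key)
                return true;
        return false;
    }

    std::size_t chain_length(std::size_t index) const
    {
        return buckets.at(index).size();
    }

    std::size_t size() const { return count; }

private:
    std::vector<std::list<unsigned long long> > buckets;  // one list per index
    std::size_t count;                                     // number of keys stored
};
```

With length = 1000, inserting the eight keys leaves three keys in the list at index 358, two at index 451, and one each at indices 0, 359 and 612.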

Hashing 68

FOR THE find METHOD,

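averageTimeS(n, m) ≈ n / (2m) ITERATIONS, WHERE n IS count AND m IS length.

BECAUSE THE LOAD FACTOR n / m IS KEPT AT OR BELOW 0.75, n / (2m) <= 0.75 / 2,

SO averageTimeS(n, m) <= A CONSTANT.

averageTimeS(n, m) IS CONSTANT.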

Hashing 69

EVEN IF THE UNIFORM HASHING

ASSUMPTION HOLDS, IT IS POSSIBLE

FOR EACH KEY TO HASH TO THE

SAME INDEX. TO SEARCH THE LIST

AT THAT INDEX TAKES LINEAR-IN-n TIME,

SO worstTimeS(n, m) IS LINEAR IN n.

Hashing 70

THE SAME RESULTS, CONSTANT

AVERAGE TIME AND LINEAR WORST

TIME, HOLD FOR insert AND erase.

Hashing 71

The next collision handler is Linear Probing (OPEN-ADDRESS HASHING). AT MOST ONE VALUE IS STORED AT

EACH INDEX IN buckets.

Hashing 72

HERE IS HOW THE unsigned long

RETURNED BY hash_func IS

CONVERTED INTO AN INDEX:

int index = hash (key) % length;

THIS IS DONE IN THE hash_map CLASS, BECAUSE ONLY THE hash_map CLASS KNOWS THE LENGTH OF THE ARRAY.

Hashing 73

WHEN COLLISION OCCURS: SEARCH THE TABLE UNTIL AN

“OPEN” SLOT IN buckets IS FOUND.

THIS IS ALSO KNOWN AS “OFFSET-

OF-1” COLLISION HANDLER.

Hashing 74

OFFSET-OF-1 COLLISION HANDLER:

IF buckets [index] ALREADY HAS

ANOTHER ELEMENT, TRY

buckets [index + 1], buckets [index + 2], …,

buckets [length – 1], buckets [0],

buckets [1], …, buckets [index – 1].
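The offset-of-1 rule can be sketched as a single insertion routine. The names Slot and probe_insert are hypothetical, keys are plain non-negative ints used directly as hash values, and duplicate keys are not checked, so this is an illustration of the probing order rather than the hash_map implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One table location: a key plus a flag saying whether it is in use.
struct Slot
{
    int key = 0;
    bool occupied = false;
};

// Stores key in the first open slot at or after key % length, wrapping
// around to index 0 after the last slot. Returns the index used, or
// table.size() if every slot is already occupied.
std::size_t probe_insert(std::vector<Slot>& table, int key)
{
    assert(!table.empty() && key >= 0);
    const std::size_t length = table.size();
    std::size_t index = static_cast<std::size_t>(key) % length;
    for (std::size_t probes = 0; probes < length; ++probes)
    {
        if (!table[index].occupied)
        {
            table[index].key = key;
            table[index].occupied = true;
            return index;
        }
        index = (index + 1) % length;   // offset of 1, circularly
    }
    return length;                      // table is full
}
```

With a table of 11 slots, inserting 54, 77, 94, 89 and 14 places them at indices 10, 0, 6, 1 and 3 without collisions. Then 45 collides at index 1 and lands at 2, 35 collides at 2 and 3 and lands at 4, and 76 collides at 10, 0, 1, 2, 3 and 4 before landing at 5.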

Hashing 75

Hash Table Using Open Probe Addressing Example

[Figure: four snapshots of the table: after inserting 54, 77, 94, 89, 14; after inserting 45; after inserting 35; after inserting 76.]

Insert 45

(mod by table size … % 11)

Hashing 76

[Figure: the same four table snapshots (insert 54, 77, 94, 89, 14; insert 45; insert 35; insert 76).]

Insert 35

Hashing 77

[Figure: the same four table snapshots (insert 54, 77, 94, 89, 14; insert 45; insert 35; insert 76).]

Insert 76

Hashing 78

[Figure: the same four table snapshots (insert 54, 77, 94, 89, 14; insert 45; insert 35; insert 76).]

Hashing 79

Linear Probing
- Open addressing: the colliding item is placed in a different cell of the table
- Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell
- Each table cell inspected is referred to as a “probe”
- Colliding items lump together, causing future collisions to cause a longer sequence of probes
- Example: h(x) = x mod 13. Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

index:  0   1   2   3   4   5   6   7   8   9   10  11  12
key:    -   -   41  -   -   18  44  59  32  22  31  73  -

Hashing 80

WE NEED TO KNOW WHEN A SLOT IS FULL

OR OCCUPIED.

INSTEAD OF JUST T() STORED IN THE BUCKETS (BECAUSE T() COULD BE A VALID VALUE), THE BUCKET WILL STORE AN INSTANCE OF THE VALUE_TYPE CLASS.

Hashing 81

TO INDICATE WHETHER A LOCATION

IS OCCUPIED, THE value_type CLASS

WILL HAVE bool occupied; IN ADDITION TO T key;

Hashing 82

index   key    occupied
...     ?      false
54      1069   true      (1069 % 203 = 54)
55      460    true      (460 % 203 = 54)
56      1070   true      (1070 % 203 = 55)
...     ?      false
109     312    true      (312 % 203 = 109)
...     ?      false
201     607    true      (607 % 203 = 201)
202     ?      false

Hashing 83

Retrieve

What about when we want to retrieve?

Consider the previous example….

Hashing 84

[Figure: the same four table snapshots (insert 54, 77, 94, 89, 14; insert 45; insert 35; insert 76).]

Find the value 35. (% 11)

Now find the value 76.

Hashing 85

[Figure: the same four table snapshots (insert 54, 77, 94, 89, 14; insert 45; insert 35; insert 76).]

Now delete 35. (% 11)

Hashing 86

Linear Probing
- Probe by incrementing the index
- If “fall off end”, wrap around to the beginning
  - Take care not to cycle forever!

1. Compute index as hash_fcn() % table.size()

2. if table[index] == NULL, item is not in the table

3. if table[index] matches item, found item (done)

4. Increment index circularly and go to 2

Why must we probe repeatedly?
- hashCode may produce collisions
- remainder by table.size may produce collisions

Hashing 87

Search Termination

Ways to obtain proper termination
- Stop when you come back to your starting point
- Stop after probing N slots, where N is table size
- Stop when you reach the bottom the second time
- Ensure table never full
  - Reallocate when occupancy exceeds threshold
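A search that follows steps 1 to 4 and stops after probing N slots might look like the sketch below. The names Slot and probe_find are hypothetical, and keys are non-negative ints used directly as hash values; deletion is not handled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One table location: a key plus a flag saying whether it is in use.
struct Slot
{
    int key = 0;
    bool occupied = false;
};

// Returns the index holding key, or table.size() if key is absent.
// The search ends at the first unoccupied slot, or after table.size()
// probes, so it cannot cycle forever even when the table is full.
std::size_t probe_find(const std::vector<Slot>& table, int key)
{
    assert(!table.empty() && key >= 0);
    const std::size_t length = table.size();
    std::size_t index = static_cast<std::size_t>(key) % length;   // step 1
    for (std::size_t probes = 0; probes < length; ++probes)
    {
        if (!table[index].occupied)       // step 2: open slot, not in table
            return length;
        if (table[index].key == key)      // step 3: found
            return index;
        index = (index + 1) % length;     // step 4: advance circularly
    }
    return length;                        // probed every slot without a match
}
```

In the 11-slot example table, searching for 35 probes indices 2, 3 and 4, and searching for 76 probes indices 10, 0, 1, 2, 3, 4 and 5 before it is found.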

Hashing 88

IN THE SECOND EXAMPLE, SUPPOSE itr IS POSITIONED AT INDEX 54 AND THE MESSAGE IS my_map.erase (itr);

Hashing 89

index   key    occupied
...     ?      false
54      1069   true      (1069 % 203 = 54)
55      460    true      (460 % 203 = 54)
56      1070   true      (1070 % 203 = 55)
...     ?      false
109     312    true      (312 % 203 = 109)
...     ?      false
201     607    true      (607 % 203 = 201)
202     ?      false

Erase value 1069.

Hashing 90

index   key    occupied
...     ?      false
54      1069   false     (1069 % 203 = 54)
55      460    true      (460 % 203 = 54)
56      1070   true      (1070 % 203 = 55)
...     ?      false
109     312    true      (312 % 203 = 109)
...     ?      false
201     607    true      (607 % 203 = 201)
202     ?      false

Now search for 460.

Hashing 91

NOW A SEARCH FOR 460 WOULD BE

UNSUCCESSFUL BECAUSE 460

INITIALLY HASHES TO 54, AN

UNOCCUPIED LOCATION.

Hashing 92

SOLUTION:

bool marked_for_removal;

THE CONSTRUCTOR FOR value_type SETS EACH bucket’s marked_for_removal FIELD TO false. insert SETS marked_for_removal TO false; erase SETS marked_for_removal TO true. SO AFTER THE INSERTIONS:

Hashing 93

index   key    occupied   marked_for_removal
...     ?      false
54      1069   true       false      (1069 % 203 = 54)
55      460    true       false      (460 % 203 = 54)
56      1070   true       false      (1070 % 203 = 55)
...     ?      false
109     312    true       false      (312 % 203 = 109)
...     ?      false
201     607    true       false      (607 % 203 = 201)
202     ?      false

Hashing 94

AFTER DELETING THE VALUE WITH

KEY 1069:

Hashing 95

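index   key    occupied   marked_for_removal
...     ?      false
54      1069   false      true       (1069 % 203 = 54)
55      460    true       false      (460 % 203 = 54)
56      1070   true       false      (1070 % 203 = 55)
...     ?      false
109     312    true       false      (312 % 203 = 109)
...     ?      false
201     607    true       false      (607 % 203 = 201)
202     ?      false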

Hashing 96

FOR find, AN UNSUCCESSFUL

SEARCH CANNOT STOP UNTIL IT REACHES AN UNOCCUPIED LOCATION WHERE

buckets [index].marked_for_removal == false.
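The interplay of occupied and marked_for_removal can be sketched as follows. The struct Bucket and the free functions find_slot, insert_key and erase_key are hypothetical names for this illustration; keys are unsigned long values used directly as hash values, with an offset of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Bucket
{
    unsigned long key = 0;
    bool occupied = false;
    bool marked_for_removal = false;
};

// Returns the index of key, or b.size() if it is absent. A slot that is
// unoccupied but marked_for_removal does not end the search.
std::size_t find_slot(const std::vector<Bucket>& b, unsigned long key)
{
    assert(!b.empty());
    const std::size_t length = b.size();
    std::size_t index = key % length;
    for (std::size_t probes = 0; probes < length; ++probes)
    {
        if (b[index].occupied && b[index].key == key)
            return index;
        if (!b[index].occupied && !b[index].marked_for_removal)
            break;                         // never used: key cannot be further on
        index = (index + 1) % length;
    }
    return length;
}

// Stores key in the first unoccupied slot; returns false if the key is
// already present or the table is full.
bool insert_key(std::vector<Bucket>& b, unsigned long key)
{
    if (find_slot(b, key) != b.size())
        return false;
    const std::size_t length = b.size();
    std::size_t index = key % length;
    for (std::size_t probes = 0; probes < length; ++probes)
    {
        if (!b[index].occupied)
        {
            b[index].key = key;
            b[index].occupied = true;
            b[index].marked_for_removal = false;
            return true;
        }
        index = (index + 1) % length;
    }
    return false;
}

// Marks the slot instead of simply emptying it, so later searches
// keep probing past it.
bool erase_key(std::vector<Bucket>& b, unsigned long key)
{
    const std::size_t index = find_slot(b, key);
    if (index == b.size())
        return false;
    b[index].occupied = false;
    b[index].marked_for_removal = true;
    return true;
}
```

With 203 buckets and the keys 1069, 460, 1070, 312 and 607, erasing 1069 marks index 54, and a later search for 460 still probes past index 54 and finds the key at index 55.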

Hashing 97

CLUSTER: A SEQUENCE OF NON-EMPTY LOCATIONS

KEYS THAT HASH TO 54 FOLLOW THE SAME COLLISION-PATH AS KEYS THAT HASH TO 55, …

Hashing 98

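index   key    occupied   marked_for_removal
...     ?      false
54      1069   true       false      (1069 % 203 = 54)
55      460    true       false      (460 % 203 = 54)
56      1070   true       false      (1070 % 203 = 55)
...     ?      false
109     312    true       false      (312 % 203 = 109)
...     ?      false
201     607    true       false      (607 % 203 = 201)
202     ?      false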

Hashing 99

PRIMARY CLUSTERING: THE

PHENOMENON THAT OCCURS WHEN

THE COLLISION HANDLER ALLOWS

THE GROWTH OF CLUSTERS TO

ACCUMULATE.

THIS WILL OCCUR WITH OFFSET-OF-

1 OR ANY CONSTANT OFFSET.

Hashing 100

SOLUTION 1: DOUBLE HASHING, THAT IS, OBTAIN BOTH INDICES AND OFFSETS BY HASHING:

unsigned long hash_int = hash (key);
int index = hash_int % length,
    offset = hash_int / length;

NOW THE OFFSET DEPENDS ON THE KEY, SO DIFFERENT KEYS WILL USUALLY HAVE DIFFERENT OFFSETS, SO NO MORE PRIMARY CLUSTERING!

Secondary hash function

Hashing 101

TO GET A NEW INDEX:

index = (index + offset) % length;

Notice that if a collision occurs, you rehash from the NEW index value.

Hashing 102

EXAMPLE: length = 11

key   index   offset
15    4       1

WHERE WOULD THESE KEYS GO IN buckets?

Hashing 103

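[Figure: the buckets array, indices 0 through 10, after the insertions; the keys 47 and 19 are among the entries shown.]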

Hashing 104

PROBLEM: WHAT IF OFFSET IS A MULTIPLE OF length?

EXAMPLE: length = 11

key   index   offset
15    4       1
246   4       22    // BUT 15 IS AT INDEX 4

// FOR KEY 246, NEW INDEX = (4 + 22) % 11 = 4. OOPS!

Hashing 105

SOLUTION:

if (offset % length == 0)

offset = 1;

ON AVERAGE, offset % length WILL

EQUAL 0 ONLY ONCE IN EVERY

length TIMES.
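Putting the index, the offset and the multiple-of-length fix together gives the following sketch. The names Probe, first_probe and next_index are hypothetical; the arithmetic is the one shown on the slides.

```cpp
#include <cassert>
#include <cstddef>

// The starting index for a key and the step used after each collision.
struct Probe
{
    std::size_t index;
    std::size_t offset;
};

// index is the remainder and offset is the quotient of the hash value
// divided by length; an offset that is a multiple of length (including 0)
// would never leave the starting index, so it is replaced by 1.
Probe first_probe(unsigned long hash_int, std::size_t length)
{
    assert(length > 0);
    Probe p;
    p.index = hash_int % length;
    p.offset = hash_int / length;
    if (p.offset % length == 0)
        p.offset = 1;
    return p;
}

// The rehash step: move offset slots on from the current index, circularly.
std::size_t next_index(std::size_t index, std::size_t offset, std::size_t length)
{
    return (index + offset) % length;
}
```

With length = 11, key 15 starts at index 4 with offset 1. Key 246 also starts at index 4 and its raw offset is 22; the fix changes that to 1, so its next probe is index 5 rather than index 4 again.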

Hashing 106

FINAL PROBLEM: WHAT IF length HAS SEVERAL FACTORS?

EXAMPLE: length = 20

key   index   offset
20    0       1
25    5       1
30    10      1
35    15      1
110   10      5    // BUT 30 IS AT INDEX 10

FOR KEY 110, NEW INDEX = (10 + 5) % 20 = 15, WHICH IS OCCUPIED, SO NEW INDEX = (15 + 5) % 20 = 0, WHICH IS OCCUPIED, SO NEW INDEX = ...

Hashing 107

SOLUTION: MAKE length A PRIME.
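A quick way to see why a prime length helps is to count how many distinct slots a probe sequence can reach. The function name slots_reached is hypothetical; the prime 23 is simply an example value close to 20.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the distinct indices visited by start, start + offset,
// start + 2 * offset, ... (all taken % length) before the sequence repeats.
std::size_t slots_reached(std::size_t start, std::size_t offset, std::size_t length)
{
    assert(length > 0);
    std::vector<bool> seen(length, false);
    std::size_t index = start % length;
    std::size_t reached = 0;
    while (!seen[index])
    {
        seen[index] = true;
        ++reached;
        index = (index + offset) % length;
    }
    return reached;
}
```

With length = 20 and offset 5, a key starting at index 10 can only ever reach the four slots 10, 15, 0 and 5. With the prime length 23, the same offset reaches all 23 slots, so an open slot is found whenever one exists.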

Hashing 108

Consider a hash table storing integer keys that handles collision with double hashing

N = 13
h(k) = k mod 13
d(k) = 7 − k mod 7

Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

Example of Double Hashing

k    h(k)  d(k)  Probes
18   5     3     5
41   2     1     2
22   9     6     9
44   5     5     5, 10
59   7     4     7
32   6     3     6
31   5     4     5, 9, 0
73   8     4     8

index:  0   1   2   3   4   5   6   7   8   9   10  11  12
key:    31  -   41  -   -   18  32  59  73  22  44  -   -
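The table above can be checked with a short sketch of insertion under double hashing. The function name double_hash_insert and the use of -1 as the empty marker are assumptions for this illustration; the secondary hash 7 − k mod 7 is the one from the example and is only meant for this 13-slot table.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;   // marks a free slot; keys are assumed non-negative

// Inserts k using h(k) = k mod N for the first probe and
// d(k) = 7 - k mod 7 as the step between probes.
// Returns the index used, or table.size() if no free slot was found.
std::size_t double_hash_insert(std::vector<int>& table, int k)
{
    assert(!table.empty() && k >= 0);
    const std::size_t n = table.size();
    std::size_t index = static_cast<std::size_t>(k) % n;
    const std::size_t step = static_cast<std::size_t>(7 - k % 7);   // always 1..7
    for (std::size_t probes = 0; probes < n; ++probes)
    {
        if (table[index] == EMPTY)
        {
            table[index] = k;
            return index;
        }
        index = (index + step) % n;
    }
    return n;
}
```

Starting from std::vector<int> table(13, EMPTY) and inserting the keys in the stated order, 44 is stored at index 10 after colliding at 5, and 31 is stored at index 0 after colliding at 5 and 9, matching the Probes column.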

Hashing 109

THIS VERSION OF OPEN-ADDRESS

HASHING IS FAST. IF THE UNIFORM

HASHING ASSUMPTION HOLDS,

averageTime(n, m) FOR SEARCHING,

INSERTING AND REMOVING IS

CONSTANT O(1).

Hashing 110

ANOTHER SOLUTION: QUADRATIC HASHING, THAT IS, ONCE COLLISION OCCURS AT h, GO TO LOCATION h + 1, THEN IF COLLISION OCCURS THERE GO TO LOCATION h + 4, THEN h + 9, THEN h + 16, ETC.

unsigned long hash_int = hash (key);
int index = hash_int % length,
    offset = i * i;    // i is the number of collisions so far for this key

Notice that h stays at the same location. No primary clustering.
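The probe sequence h, h + 1, h + 4, h + 9, ... can be written as one small function. The name quadratic_probe is hypothetical, and wrapping with % length is assumed, as in the other collision handlers.

```cpp
#include <cassert>
#include <cstddef>

// Location tried after i collisions: the offset i * i is always measured
// from the original location h, not from the previous probe.
std::size_t quadratic_probe(std::size_t h, std::size_t i, std::size_t length)
{
    assert(length > 0);
    return (h + i * i) % length;
}
```

For h = 4 and length = 11, the first five locations tried are 4, 5, 8, 2 and 9.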

Hashing 111

QUADRATIC REHASHING EXAMPLE: length = 11

key   index   offset
15
1, final place index = 6
35

Hashing 112

Performance

HOW DOES DOUBLE-HASHING COMPARE WITH CHAINED HASHING?

Hashing 113

Performance of Hash Tables
- Load factor = # filled cells / table size
  - Between 0 and 1
- Load factor has greatest effect on performance
- Lower load factor gives better performance
  - Reduces collisions in sparsely populated tables
- Knuth gives expected # probes p for open addressing, linear probing, load factor L: p = ½(1 + 1/(1 − L))
  - As L approaches 1, this zooms up
- For chaining, p = 1 + (L/2)
  - Note: Here L can be greater than 1!

Hashing 114

Performance of Hash Tables (2)

Number of Probes

L       Linear Probing   Chaining
0       1.00             1.00
0.25    1.17             1.13
0.5     1.50             1.25
0.75    2.50             1.38
0.83    3.38             1.43
0.9     5.50             1.45
0.95    10.50            1.48
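The two formulas for the expected number of probes are easy to evaluate directly, which makes it simple to try other load factors. The function names below are hypothetical.

```cpp
#include <cassert>

// Expected probes for open addressing with linear probing:
// p = 1/2 * (1 + 1 / (1 - L)), defined for 0 <= L < 1.
double linear_probing_probes(double load_factor)
{
    assert(load_factor >= 0.0 && load_factor < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - load_factor));
}

// Expected probes for chaining: p = 1 + L / 2. L may exceed 1.
double chaining_probes(double load_factor)
{
    assert(load_factor >= 0.0);
    return 1.0 + load_factor / 2.0;
}
```

At L = 0.75 the formulas give 2.5 probes for linear probing and 1.375 for chaining; at L = 0.95 linear probing needs about 10.5 probes while chaining needs about 1.475.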

Hashing 115

Performance of Hash Tables (3)
- Hash table:
  - Insert: average O(1)
  - Search: average O(1)
- Sorted array:
  - Insert: average O(n)
  - Search: average O(log n)
- Binary Search Tree:
  - Insert: average O(log n)
  - Search: average O(log n)
- But balanced trees can guarantee O(log n)

Hashing 116

We know that hashing becomes inefficient as the table fills up. What to do?

EXPAND!

Hashing 117

WHAT ABOUT THE SIZE OF buckets,

AND SHOULD THAT ARRAY EVER BE

RE-SIZED? RE-SIZE WHENEVER THE LOAD FACTOR, THE RATIO OF count TO length, EXCEEDS 0.75.

Hashing 118

TO RE-SIZE, WE WILL DOUBLE THE

OLD CAPACITY, PLUS 1. WHY +1?

ANOTHER OPTION… FIND NEXT PRIME NUMBER AFTER DOUBLING.

NOTE THAT WE RE-SIZE WHENEVER

THE LOAD FACTOR, THAT IS, THE

AVERAGE LIST SIZE, EXCEEDS 0.75.

Hashing 119

IN check_for_expansion, IF count >=

length * 0.75, CREATE A NEW ARRAY

OF DOUBLE THE OLD LENGTH (PLUS

1). FOR EACH VALUE IN THE OLD

ARRAY, ITERATE THROUGH

AND HASH EACH VALUE TO

THE NEW ARRAY. FINALLY, ERASE

THE OLD ARRAY.
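The expansion step can be sketched for a chained table of keys. The Buckets typedef and the signature of check_for_expansion used here are assumptions for illustration: the function takes the bucket vector and the current count as arguments, whereas the slides treat both as fields of hash_map.

```cpp
#include <cstddef>
#include <list>
#include <vector>

typedef std::vector<std::list<unsigned long> > Buckets;

// If count has reached 75% of the number of buckets, build a table of
// double the old length plus 1 and hash every key into it again.
void check_for_expansion(Buckets& buckets, std::size_t count)
{
    if (count < buckets.size() * 0.75)
        return;                                    // load factor still acceptable

    Buckets bigger(2 * buckets.size() + 1);
    for (std::size_t i = 0; i < buckets.size(); ++i)
        for (unsigned long key : buckets[i])
            bigger[key % bigger.size()].push_back(key);   // index uses the new length

    buckets.swap(bigger);   // the old array is released when bigger goes out of scope
}
```

Every key has to be rehashed because its index depends on the length: with 4 buckets the keys 1, 5 and 9 all share index 1, but after expansion to 9 buckets they occupy indices 1, 5 and 0.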

Hashing 120

GROUP EXERCISE: ASSUME THAT length = 13. INSERT THE FOLLOWING KEYS INTO A HASH TABLE USING 1) OPEN ADDRESS, 2) DOUBLE HASHING, AND 3) CHAINING:

20, 33, 49, 22, 26, 140, 38, 9, 7, 3, 0, 1

Hashing 121

Summary Slide 1

§- Hash Table

- simulates the fastest searching technique: knowing the index of the required value in a vector or array and applying the index to access the value. A hash function converts the data to an integer.

- After obtaining an index by dividing the value from the hash function by the table size and taking the remainder, access the table. Normally, the number of elements in the table is much smaller than the number of distinct data values, so collisions occur.

- To handle collisions, we must place a value that collides with an existing table element into the table in such a way that we can efficiently access it later.

Hashing 122

Summary Slide 2

§- Hash Table (Cont…)

- average running time for a search of a hash table is O(1)

- the worst case is O(n)

Hashing 123

Summary Slide 3

§- Collision Resolution

- Types:

1) linear open probe addressing

- the table is a vector or array of static size

- After using the hash function to compute a table index, look up the entry in the table.

- If the values match, perform an update if necessary.

- If the table entry is empty, insert the value in the table.

Hashing 124

Summary Slide 4

§- Collision Resolution (Cont…)

- Types:

1) linear open probe addressing

- Otherwise, probe forward circularly, looking for a match or an empty table slot.

- If the probe returns to the original starting point, the table is full.

- A search may examine table items that hashed to different table locations.

- Deleting an item is difficult.

Hashing 125

Summary Slide 5

§- Collision Resolution (Cont…)

2) chaining with separate lists.

- the hash table is a vector of list objects

- Each list is a sequence of colliding items.

- After applying the hash function to compute the table index, search the list for the data value.

- If it is found, update its value; otherwise, insert the value at the back of the list.

- you search only items that collided at the same table location

Hashing 126

Summary Slide 6

§- Collision Resolution (Cont…)

- there is no limitation on the number of values in the table, and deleting an item from the table involves only erasing it from its corresponding list
