CSCD 326 Data Structures I - Hashing

Jan 19, 2016



Hashing Background

Goal: provide a constant time complexity method of searching for stored data.

The best traditional searching time complexity available is O(log₂ n), for binary search.

Binary search requires that data be stored in sorted order.

Hashing approach to data storage and retrieval:

Items are not stored contiguously, and memory is sacrificed for speed.

Often used for symbol table management in compilers, assemblers, and linker/loaders.


Hashing - Basic Ideas

Data storage - hashing relies primarily on arrays for data storage but not on contiguous storage within the array

Data storage/retrieval method: use a math function which, when given the key or data value to be stored, returns an array index in which to store the value.

This is referred to as a "hash function."

The same function will be used to retrieve the value later on.


Simple Example of hashing

Employee data is to be stored using employee number as a key.

Employee numbers are unique and run from 10,000 to 19,999.

Storage: use an array of size 10,000.

Hash function: Emp. Number - 10000 provides a unique index into the array, and that array location is used to store/retrieve information for this employee.

Problem: key values (in other situations) are often not unique or do not fall into a range which allows a reasonable size array.
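The employee example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BASE`, `TABLE_SIZE`, `emp_hash`, `store`, and `retrieve`, and the sample record, are hypothetical and do not come from the slides.

```python
BASE = 10_000        # smallest employee number in the example
TABLE_SIZE = 10_000  # one array slot per possible employee number

table = [None] * TABLE_SIZE


def emp_hash(emp_number: int) -> int:
    """Map an employee number in 10000..19999 to an index in 0..9999."""
    if not BASE <= emp_number < BASE + TABLE_SIZE:
        raise ValueError("employee number out of range")
    return emp_number - BASE


def store(emp_number: int, record) -> None:
    """Store a record in the slot chosen by the hash function."""
    table[emp_hash(emp_number)] = record


def retrieve(emp_number: int):
    """Use the same hash function to find the record again."""
    return table[emp_hash(emp_number)]


store(12345, "Pat Example")  # hypothetical record
print(retrieve(12345))
```

Because every key maps to its own slot, no searching is needed; the cost is an array sized to the whole key range.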


Goals for Hashing Functions

The same key value (value used for insertion) should always return the same index.

If it does not - data can't be retrieved later.

As much as possible - different key values should not hash to the same index.

This is done by mixing things up with the hash function so that common patterns in key values do not hash to the same locations.

Collisions can never be entirely prevented, however, so collision handling becomes an issue.


Hash Function Construction Methods

Using numeric ASCII values of characters:

Example key: JUNK

Add ASCII values of characters (74 + 85 + 78 + 75) to produce a single integer (312).

This may suffice but the integer produced is not unique to "JUNK".
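A minimal Python sketch of the character-sum method, showing why the result is not unique to "JUNK". The function name `ascii_sum` and the second key "KNUJ" are assumptions chosen for illustration; any rearrangement of the same letters gives the same sum.

```python
def ascii_sum(key: str) -> int:
    """Add the numeric character codes of the key."""
    return sum(ord(ch) for ch in key)


print(ascii_sum("JUNK"))  # 74 + 85 + 78 + 75 = 312
print(ascii_sum("KNUJ"))  # same letters, same sum: 312
```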


Hash Function Construction Methods (2)

Concatenation of ASCII values:

Represent A - Z as integers 0-25 and concatenate these values as 5-bit fields.

So JUNK becomes: 9 20 13 10

01001 10100 01101 01010 = 315818

The 5-bit fields start at bit positions 2^15, 2^10, and 2^5:

32768 = 32^3, 1024 = 32^2, 32 = 32^1

and so the concatenation can be expressed as: 9 * 32^3 + 20 * 32^2 + 13 * 32^1 + 10 = 315818
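The concatenation can be computed one character at a time by multiplying the running value by 32 before adding the next letter code. The sketch below assumes keys contain only the uppercase letters A-Z; the function name `concat_hash` is hypothetical.

```python
def concat_hash(key: str) -> int:
    """Concatenate 5-bit letter codes (A=0 ... Z=25) into one integer."""
    value = 0
    for ch in key:
        code = ord(ch) - ord("A")
        if not 0 <= code <= 25:
            raise ValueError("only the letters A-Z are supported")
        # Multiplying by 32 shifts the earlier codes left by 5 bits.
        value = value * 32 + code
    return value


print(concat_hash("JUNK"))                 # 315818
print(format(concat_hash("JUNK"), "020b"))  # 01001101000110101010
```

Unlike the character sum, this value depends on the order of the letters, but it grows quickly with key length.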


Hash Function Construction Methods (3)

Using the mod operator:

Allows reduction of large values into the range of actual hash table indices.

In the example above, if the table is an array of size 10000: 315818 % 10000 = 5818.

Note here that the mod operator simply removes the first two digits, and this makes the hashed value less unique to the string used to generate it.


Hash Function Construction Methods (4)

Using the mod operator:

Problems with use of mod operator: choice of exact table size is very important. If the table size and the key values share a large number of common factors, many collisions can be generated.

e.g. table size 15, key values 10, 20, 30, 40, 50, 60, 70 - here 7 values hash to three indices: 30, 60 to 0; 20, 50 to 5; and 10, 40, 70 to 10.

Solution: use an array size which is prime - thus it can't have any common factors with key values, other than keys that are multiples of the table size itself.
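The effect of table size can be checked directly. This sketch groups the example keys by their index for table size 15 and for a prime size; the helper name `buckets` and the choice of 17 as the prime are assumptions for illustration.

```python
from collections import defaultdict


def buckets(keys, table_size):
    """Group keys by the index that key % table_size assigns them."""
    result = defaultdict(list)
    for key in keys:
        result[key % table_size].append(key)
    return dict(result)


keys = [10, 20, 30, 40, 50, 60, 70]

crowded = buckets(keys, 15)  # 15 shares the factor 5 with every key
spread = buckets(keys, 17)   # 17 is prime (hypothetical choice)

print(len(crowded), "indices used with table size 15")
print(len(spread), "indices used with table size 17")
```

With size 15 the seven keys land on three indices; with size 17 each key gets its own index.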


Hash Function Construction Methods (5)

Using pseudo-random number generators:

Given the same starting seed, pseudo-random number generators always produce the same sequence of values.

Here, use a number generated from the key string as a seed and use the first resulting pseudo-random sequence value to generate the hash table index.
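A rough sketch of this idea using Python's standard `random.Random` class. Deriving the seed by summing character codes is an assumption (the slides do not say how the number is generated from the key), and the function name `prng_hash` is hypothetical.

```python
import random


def prng_hash(key: str, table_size: int) -> int:
    """Seed a private generator from the key and take its first value."""
    seed = sum(ord(ch) for ch in key)        # assumed way to turn the key into a number
    generator = random.Random(seed)          # same seed -> same sequence
    return generator.randrange(table_size)   # first value, already in 0..table_size-1


index = prng_hash("JUNK", 101)
print(index == prng_hash("JUNK", 101))  # True: the same key gives the same index
```

A separate `random.Random` instance is used so that hashing does not disturb the module-level generator.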


Hash Function Construction Methods (6)

Folding: scrambles numeric values to remove the effects of recurring patterns - e.g. add the numeric values.

Boundary Folding: breaks numbers into segments and adds the segments. e.g. social security numbers: 534-65-9234 - breaks at dashes - hash value is 534 + 65 + 9234 = 9833

Fan Folding: like boundary folding, but reverses the digits in every other value.
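The two folding variants, as the slides describe them, might look like this in Python. Which segments get reversed in fan folding is an assumption here (the second, fourth, and so on); the function names are hypothetical.

```python
def boundary_fold(number: str) -> int:
    """Break the number at the dashes and add the segments."""
    return sum(int(segment) for segment in number.split("-"))


def fan_fold(number: str) -> int:
    """Like boundary_fold, but reverse the digits of every other segment."""
    total = 0
    for position, segment in enumerate(number.split("-")):
        if position % 2 == 1:        # assumed: reverse the 2nd, 4th, ... segment
            segment = segment[::-1]
        total += int(segment)
    return total


print(boundary_fold("534-65-9234"))  # 534 + 65 + 9234 = 9833
print(fan_fold("534-65-9234"))       # 534 + 56 + 9234 = 9824
```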


Hash Function Construction Methods (7)

Digit or character extraction

Another way to scramble similar patterns in multiple keys - can be used in two ways:

1) Simply remove characters likely to be similar in many keys (or use dissimilar characters).

2) Mid-Square technique

Represent key as a number.

Square the number.

Extract from the middle of the squared value enough bits to form an array index.
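The mid-square steps can be sketched as follows. The bit widths are assumptions chosen for illustration: keys are taken to fit in 16 bits, so the square fits in 32 bits, and 8 middle bits are extracted to index a 256-entry table. The function name `mid_square` is hypothetical.

```python
def mid_square(key: int, key_bits: int = 16, index_bits: int = 8) -> int:
    """Square the key and return index_bits bits from the middle of the square."""
    if not 0 <= key < (1 << key_bits):
        raise ValueError("key does not fit in key_bits bits")
    squared = key * key                        # fits in 2 * key_bits bits
    shift = (2 * key_bits - index_bits) // 2   # low-order bits to discard
    mask = (1 << index_bits) - 1               # keeps index_bits bits
    return (squared >> shift) & mask


print(mid_square(1234))  # 1234 * 1234 = 1522756; middle 8 bits give 115
```

The middle bits depend on every digit of the key, which is why they are preferred over the low or high end of the square.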


Linked Collision Processing

Linked method of collision overflow handling divides memory into two parts:

One part for primary storage (the hash table itself)

A separate secondary part for collision overflow (may be either dynamically allocated or a separate fixed allocation area).


Linked Collision Processing (2)

Linked collision overflow handling:

Assume the hash table is composed of an array of objects, each containing an instance variable which is a reference to an object of the same type.

On collision:

Dynamically allocate a new node and place data into it.

Link the new node through the reference.

Overflowed items are stored in a linked list off the original table item.
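A minimal sketch of linked overflow handling, assuming integer keys and a simple `key % size` hash function. The class names `Node` and `ChainedHashTable` and the default table size are hypothetical; the slides do not give an implementation.

```python
class Node:
    """A table item that can also link to an overflow item of the same type."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None  # reference to the next overflow node, if any


class ChainedHashTable:
    def __init__(self, size=11):
        self.slots = [None] * size  # primary storage: the hash table itself

    def _index(self, key):
        return key % len(self.slots)  # assumed hash function for integer keys

    def insert(self, key, value):
        index = self._index(key)
        node = self.slots[index]
        if node is None:                  # no collision: use the table slot
            self.slots[index] = Node(key, value)
            return
        while True:                       # collision: walk the overflow list
            if node.key == key:
                node.value = value        # key already present: update it
                return
            if node.next is None:
                break
            node = node.next
        node.next = Node(key, value)      # allocate a new node and link it

    def search(self, key):
        node = self.slots[self._index(key)]
        while node is not None:           # sequential search along the list
            if node.key == key:
                return node.value
            node = node.next
        return None


table = ChainedHashTable(11)
for key, value in [(3, "a"), (14, "b"), (25, "c")]:  # all hash to index 3
    table.insert(key, value)
print(table.search(25))
```

Python allocates every node dynamically, so the split between primary and secondary memory is only conceptual in this sketch.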


Linked Collision Processing (3)

[Figure: Primary Memory (Hash Table) and Secondary Memory (Overflow)]


Linked Collision Processing (4)

Search time with linked overflow

If there have been many collisions - the search is no longer constant time complexity since a sequential search must be done through the linked list.

Thus the time complexity becomes O(n), where n is the number of items that have collided at that index.


Linear Collision Processing

Also called Linear Probing - no primary and secondary memory - original array holds both.

When a collision occurs:

Start at hashed location (site of first collision).

Proceed sequentially through the array until available storage is found - store at this location.

The array must be treated circularly since a probe could reach the end and need to start again at the beginning.
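The probing steps above can be sketched as follows, assuming integer keys, a `key % size` hash function, and no deletions. The class name `LinearProbingTable` and its methods are hypothetical.

```python
class LinearProbingTable:
    def __init__(self, size=11):
        self.keys = [None] * size    # one array holds all items
        self.values = [None] * size

    def _probe(self, key):
        """Yield indices starting at the hashed location, wrapping circularly."""
        size = len(self.keys)
        start = key % size           # assumed hash function for integer keys
        for offset in range(size):
            yield (start + offset) % size

    def insert(self, key, value):
        for index in self._probe(key):
            if self.keys[index] is None or self.keys[index] == key:
                self.keys[index] = key
                self.values[index] = value
                return index
        raise RuntimeError("hash table is full")

    def search(self, key):
        for index in self._probe(key):
            if self.keys[index] is None:  # empty slot: the key was never stored
                return None
            if self.keys[index] == key:
                return self.values[index]
        return None


table = LinearProbingTable(11)
print(table.insert(10, "a"))  # hashes to index 10
print(table.insert(21, "b"))  # collides at 10, wraps around to index 0
print(table.insert(32, "c"))  # collides at 10 and 0, stored at index 1
```

Supporting deletion would need extra care (for example, marking removed slots), since an emptied slot would otherwise cut a probe sequence short.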


Linear Collision Processing (2)

Problem with linear probing: clustering.

If the hash function produces one value more often than others, parts of the table will quickly fill up while others are empty.

Clustering causes further collisions later.


Analysis of Linear Probing

Depends on the loading density of the hash table

D = Number of Records in Hash Table / Size of Hash Table Array --- D = 1 indicates maximum density

Average number of probes is approximately:

For a successful search: ½ (1 + 1/(1-D))

For an unsuccessful search: ½ (1 + 1/(1-D)^2)

for D = 0.1 --- 1.06 and 1.12

for D = 0.5 --- 1.50 and 2.50

for D = 0.8 --- 3.00 and 13.00

for D = 0.9 --- 5.50 and 50.50

This is why Linear Probing is referred to as a Density Dependent Search Technique.
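The two probe-count formulas are easy to evaluate for any loading density. In this sketch the function names are hypothetical, and D must be strictly less than 1 for the formulas to be defined.

```python
def successful_probes(d: float) -> float:
    """Approximate probes for a successful search at loading density d."""
    return 0.5 * (1 + 1 / (1 - d))


def unsuccessful_probes(d: float) -> float:
    """Approximate probes for an unsuccessful search at loading density d."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)


for d in (0.1, 0.5, 0.8, 0.9):
    print(f"D = {d}: {successful_probes(d):.2f} and {unsuccessful_probes(d):.2f}")
```

The unsuccessful-search cost grows with the square of 1/(1-D), which is why it rises so sharply as the table fills.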


Rehashing

Alternative to linear probing to avoid clustering.

After a collision occurs - apply a different hash function to get a new location altogether.

If the new location is taken, either resort to linear probing from there or apply a 3rd or 4th hash function.

Eventually some probing method must be used.


Quadratic Probing

Another alternative to linear probing: if a collision occurs at initial index k:

try to store in index k + 1

for all successive collisions (at k + 1, etc.), try to store in index k + r^2, where r is a count of how many collisions have occurred
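A small sketch of quadratic probing, assuming integer keys, a `key % size` hash function, and circular wrap-around of the index (the slide does not state how indices past the end of the array are handled). The function name `quadratic_insert` is hypothetical.

```python
def quadratic_insert(table, key):
    """Store key at index (k + r*r) % size, where r counts collisions so far."""
    size = len(table)
    k = key % size                   # initial index (assumed hash function)
    for r in range(size):            # r = 0 is the initial attempt at k
        index = (k + r * r) % size   # wrap-around is an assumption
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no free slot found along the probe sequence")


table = [None] * 11
print(quadratic_insert(table, 3))   # no collision: index 3
print(quadratic_insert(table, 14))  # 1 collision:  index 3 + 1 = 4
print(quadratic_insert(table, 25))  # 2 collisions: index 3 + 4 = 7
print(quadratic_insert(table, 36))  # 3 collisions: index (3 + 9) % 11 = 1
```

Unlike linear probing, this sequence does not visit every slot, so an insert can fail even when the table still has free entries.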

Variation on rehashing: double hashing

Use the second hash function to determine a fixed increment to move through the array.
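Double hashing might be sketched as below. Both hash functions are assumptions chosen for illustration: `key % size` for the starting index and `1 + key % (size - 1)` for the increment, with a prime table size so that the fixed increment eventually reaches every slot. The function name `double_hash_insert` is hypothetical.

```python
def double_hash_insert(table, key):
    """Probe from key % size in steps given by a second hash function."""
    size = len(table)              # assumed to be prime
    index = key % size             # first hash: starting location
    step = 1 + key % (size - 1)    # second hash: increment in 1..size-1, never 0
    for _ in range(size):
        if table[index] is None:
            table[index] = key
            return index
        index = (index + step) % size  # move by the fixed increment, circularly
    raise RuntimeError("hash table is full")


table = [None] * 11
print(double_hash_insert(table, 3))   # index 3
print(double_hash_insert(table, 14))  # collides at 3, step 5: index 8
print(double_hash_insert(table, 25))  # collides at 3, step 6: index 9
print(double_hash_insert(table, 36))  # collides at 3, step 7: index 10
```

Keys that collide at the same starting index usually get different increments, so they do not follow each other through the table the way they do with linear probing.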