Chapter 11 - Hashing |
Document last modified: |
OVERVIEW
Why hash?
Hashes are functions that transform or map an input to a value, often an index of a table.
Hash functions are often used to map a large range of inputs to a smaller range of table indices.
Using the modulus 10 function as a hash function:
19 mod 10 is 9
119 mod 10 is 9, etc.
Linear searches are Q(n) and binary are Q(lg n). Hashing can, with certain restrictions, perform search near Q(1).
The load factor can be a > 1 and still perform search in Q(n)
Why care?
Some recent uses are:
Secure communications over public networks to verify that the message sent is the message received.
A hash maps input (such as a message text) to some number.
For example, a modulus 256 hash could operate by:
Sum the byte values of the message, the sum modulus 256 will be a value between 0-255.
A B C D E F 65 +66 +67 +68 +69 +70
= 405 mod 256 = 149SHA-1 (Secure Hash Algorithm 1) developed by NSA processes 512-bit blocks to produce a 160-bit hash. Very difficult to change the message and produce the same hash.
Other common hashes are MD5 (a 32-bit hash by Ronald Rivest, one of the text's authors). That means there are 232 hash values.
Hashing this file you are reading might produce the hash value: 88659b92
Changing one 'a' to 'A' would produce a different hash value, perhaps: 1c807dfd
Verification process:
Sender computes hash of message and sends along with message, usually both are encrypted using a key common to sender and receiver.
Receiver recomputes the hash using the same algorithm.
If sender and receiver hash agree, the message has not been altered.
Often serves as part of secure transmissions, such as Secure Socket Layer (SSL), over public networks.
Question 11.5
Does knowing the hash value provide much useful information about the message?
Can one include the hash as part of the hashed message or must it be separate?
Routers
IBM recently announced the release into the public domain of a hashing algorithm that will significantly reduce the size of router tables on the Internet. Since router table size is a key factor in router performance, the larger the table the more resources required and the slower the routing, with the corresponding increased likelihood of dropped packets due to routing delays, this is a big deal for bandwidth carnivores.
But I liked direct addressing
The problem with direct addressing is if the universe U is large, storing a table of size |U| may be impractical or impossible.
Often, the set K of keys actually stored is small, compared to U, so that most of the space allocated for T is wasted.
• When K is much smaller than U, a hash table requires much less space than a direct-address table.
• Can reduce storage requirements to Q(|K|).
• Can still get O(1) search time, but in the average case, not the worst case.Idea: Instead of storing an element with key k in slot k, use a function h and store the element in slot h(k).
• We call h a hash function.
h : U " {0, 1, . . . ,m − 1}, so that h(k) is a legal slot number in T, the storage table.
We say that k hashes to slot h(k).• 19 mod 10 is 9
• 19 hashes to slot 9
Collisions: When two or more keys hash to the same slot.
• Can happen when there are more possible keys than slots (|U| > m).
• For a given set K of keys with |K| ≤ m, may or may not happen. Definitely happens if |K| > m.
• Therefore, must be prepared to handle collisions in all cases.
• Use two methods: chaining and open addressing.
• Chaining is usually better than open addressing. The text examines both.

When to use a Hash Implementation
|K| << |U|
When we don't know ahead of time which specific ki
Î
U will be in K (e.g. Social Security numbers of employees)
When we can create a hashing function
h: U ® {0, 1, ..., m - 1}
where h has the Simple Uniform Hashing property
Simple Uniform Hashing - any given key is equally likely to hash to any
of the m slots of the hash table T
Hashing
Two possible implementations (more later):
First transform key k to a natural number if it's not already, then
use division method (includes using modulus operator to
force output of h to be in the range of [0 .. m-1]
use the multiplication method
Example - Division
K = { 1055, 1492, 1776, 1812, 1918, 1945 }
h: U ® {0, 1, ..., 7 }
h( k ) = 5•k mod 8
h(1776) = 5•1776 mod 8 = 0
h(1492) = 5•1492 mod 8 = 4
h(1812) = 5•1812 mod 8 = 4
Collision
h(1492) = h(1812) hash to same T[ 4 ]
Collision resolution by chaining
Put all elements that hash to the same slot into a linked list.
Example
Collision: h(k2) = h(k5) = h(k7)
Insertion: Always add to front of list, worst case time of O(1).
Question 11.6 - Why O(1)?
Figure below shows singly linked lists. If we want to delete elements, it’s better to use doubly linked lists.
Question 11.7 - Why?
Slot j contains a pointer to the head of the list of all stored elements that hash to j (or to the sentinel if using a circular, doubly linked list with a sentinel)
If there are no such elements, slot j contains NIL.

Implementation (Doubly linked)
Initially, each location in the hash table is initialized to NIL
When a location in the table is being used, it holds an address to a two-way linked list of nodes, where each node contains key and the satellite data.
previous
NILsatellite
datanext
®
¬
previoussatellite
datanext
®
¬
previoussatellite
datanext
NILThis linked list is known as a chain, and this method is known as chaining and it handles the situation where more than one node hashes to the same array table location, i.e., same linked list, aka bucket.
Question 11.8 - What are the keys in one of the buckets of the hash table above?
Let Node_Ptr be the data type defined for the pointer to the node
Let Hash_Table be the data type for the table, each location in the table holds a Node_Ptr pointer, or NIL
A key k Î U
Example - Node_Ptr
previous key next
| Chained-Hash-Insert (T: Hash_Table, x: Node_Ptr) -- pre:
"y:
Node_Ptr, y Î T's nodes: x
¹
y, x.previous = NIL
|
x![]()
|
||||
|
Chained-Hash-Search (T: Hash_Table, k:U ) : Node_Ptr -- post: NIL is returned if k is not found in any
|
|||||
| Chained-Hash-Delete (T: Hash_Table, x: Node_Ptr) -- pre:
$y:
Node_Ptr, y Î T's nodes: x = y -- post: node referenced by x is deleted
|
Question 11.9 - Use h( k ) = k mod 2
Diagram a data structure for hashing to implement a dictionary of dates translating from historical dates to English words, for example:
1492 ® fourteen ninety two
Which is the key? satellite data? What does T hold?
Construct the hash table for keys: 2005, 1492, 1984, 1776, 2013
Insert a different node with key 2005. Any problems?
What about inserting the same node twice?
Search for 1492, 2002 and 2005. Any problems?
Delete 1492, 2005 and 2002. Any problems?
Give insert and search algorithms for singly-linked lists.
Suggest an algorithm for delete using singly-linked lists.
Performance - Doubly linked
Load factor - a = n/m, where n = # of elements to store, m = # of slots or size of T, and assumes Simple Uniform Hashing; a > 1 a = 1, or a < 1
Question 11.10 - What is implied by: a = 1, a < 1, a > 1?
Chained-Hash-Delete (T: Hash_Table, x: Node_Ptr):
Note that parameter x is the element in the linked list so can be accessed directly.
Q(1) deletion time for doubly-linked lists.
Question 11.11 - What is the problem with singly linked lists?
Chained-Hash-Insert:
Q(1) time to insert at head of list
Chained-Hash-Search:
worst case: Q(n)
A sequential search can occur when m=1, that is a = n/1; or when all n keys hash to the same slot; the list is length n.
Question 11.12 - Would using a binary search reduce search to Q(lg n)? Linked lists are preferable, why?
average: Q(1 + a)
Depends on uniform distribution of keys to slots by hashing function, assume that is the case.
For j = 0, 1, . . . ,m − 1, denote the length of list T[j] by nj. Then the sum of the length of all lists is:
n = n0 + n1 +· · · +nm-1
Average value of nj is E[ nj ] = a = n/m
Assume hash function is O(1) time so search time for element with key k depends on the length nh(k) of the list T[ h( k ) ]
Two cases:
Unsuccessful - contains no element with key k
Theorem - An unsuccessful search takes expected time Q(1 + a)
Proof Simple uniform hashing ® any key not already in the table is equally likely to hash to any of the m slots.
To search unsuccessfully for any key k, need to search to the end of the list T[ h( k ) ].
This list has expected length E [nh(k)] = a. Therefore, the expected number of elements examined in an unsuccessful search is a.
Adding in the time to compute the hash function, the total time required is Q(1 + a).
Question 11.13 - What is Q(1 + a) when:
n = 10, m = 100
n = m3
Successful - T contains element with key k
The expected time for a successful search is also Q(1 + a).
The circumstances are slightly different from an unsuccessful search.
The probability that each list is searched is proportional to the number of elements the list contains.
Theorem - A successful search takes expected time Q(1 + a)
The text gives a rigorous proof based on material we will not cover.
A simple, non-rigorous, explanation (Levitin, p263):
Proof Simple uniform hashing ® any key not already in the table is equally likely to hash to any of the m slots.
To search successfully for any key k, need to search on average 1/2 the elements of the list T[ h( k ) ].
This list has expected length E [nh(k)] = a. Therefore, the average expected number of elements examined in a successful search is a/2 + 1
Adding in the time to compute the hash function, the total time required is Q(1 + a/2) or just Q(1 + a) .
For m and n elements the average length of the linked list is a = n/m
An element hashes into a slot i, the average number of comparisons is a/2 + 1 to successfully locate the element in i.
For the following:
Ave. Comparisonssuccess » a/2 + 1
= (n/m)/2 + 1
= (6/8)/2+1
= 11/8
A total of 7 key comparisons would be done to locate each of the 6 elements once.
Question 11.14 - Explain.
In the worst case, all elements hash into the same slot and the average successful search takes about n/2 or Q(n) key comparisons; no better than a linear search of the list.
Hash functions
Truncation - Ignore part of the key, use the remaining part directly as an index.
Keys are 18-digit integers and m=1000; the 1st, 2nd and 5th digits might be the hash function: 21296876 maps to 976.
Question 11.15 Give an example where this might not satisfy simple uniform hashing.
Folding - Partition the key into several parts and combine the parts (addition, etc) to obtain index.
Divide 8-digit integer into groups of 3, 3, and 2-digits; add together and truncate to proper range of integers.
21296867 maps to 212+968+76=1256; truncate to 256.
Since all information in key affects index, folding often achieves better spread than truncation alone.
Modular arithmetic - Spread depends upon the the modulus, the size of the hash array m.
Modulus can ensure good spread and maps to index.
Modulus a power of a small integer, 2 or 10, many keys tend to map to same index.
Best choice of m is often, not always, a prime number.
Rather than choose hash table size 1000 choose 997 or 1009. 210 is also poor choice.
Caution
Simple Uniform Hashing doesn't just happen
care has to be taken in selecting keys
care has to be taken when constructing the hash function h (more on that next)
Question 11.16
What are some of the criteria for key selection?
Would a key based on the year of birth for first-graders be useful?
Would a hash function with m = 26 and a hash function that extracted the first vowel of a name satisfy the Simple Uniform Hashing assumption?
For example: h( k ) = (first vowel of k) mod 26
A hash that extracted the month of birth from the key?
For example: h( k ) = month of birth in k