Chapter 11 - Hashing


Document last modified: 

OVERVIEW

Why hash?

Hashes are functions that transform or map an input to a value, often an index of a table.

Hash functions are often used to map a large range of inputs to a smaller range of table indices.

Using the modulus 10 function as a hash function:

19 mod 10 is 9

119 mod 10 is 9, etc. 

Linear searches are Q(n) and binary are Q(lg n). Hashing can, with certain restrictions, perform search near Q(1).

The load factor can be a > 1 and still perform search in Q(n)

Why care?

Some recent uses are:

Secure communications over public networks to verify that the message sent is the message received.

A hash maps input (such as a message text) to some number.

For example, a modulus 256 hash could operate by:

Sum the byte values of the message, the sum modulus 256 will be a value between 0-255.

A B C D E F
65 +66 +67 +68 +69 +70
 

 
= 405 mod 256 = 149

SHA-1 (Secure Hash Algorithm 1) developed by NSA processes 512-bit blocks to produce a 160-bit hash. Very difficult to change the message and produce the same hash.

Other common hashes are MD5 (a 32-bit hash by Ronald Rivest, one of the text's authors). That means there are 232 hash values.

Hashing this file you are reading might produce the hash value: 88659b92

Changing one 'a' to 'A' would produce a different hash value, perhaps: 1c807dfd

Verification process:

Sender computes hash of message and sends along with message, usually both are encrypted using a key common to sender and receiver.

Receiver recomputes the hash using the same algorithm.

If sender and receiver hash agree, the message has not been altered.

Often serves as part of secure transmissions, such as Secure Socket Layer (SSL), over public networks.

Question 11.5

Does knowing the hash value provide much useful information about the message?

Can one include the hash as part of the hashed message or must it be separate?

Routers

IBM recently announced the release into the public domain of a hashing algorithm that will significantly reduce the size of router tables on the Internet. Since router table size is a key factor in router performance, the larger the table the  more resources required and the slower the routing, with the corresponding increased likelihood of dropped packets due to routing delays, this is a big deal for bandwidth carnivores.

 

But I liked direct addressing

The problem with direct addressing is if the universe U is large, storing a table of size |U| may be impractical or impossible.

Often, the set K of keys actually stored is small, compared to U, so that most of the space allocated for T is wasted.

• When K is much smaller than U, a hash table requires much less space than a direct-address table.

• Can reduce storage requirements to Q(|K|).

• Can still get O(1) search time, but in the average case, not the worst case.

Idea: Instead of storing an element with key k in slot k, use a function h and store the element in slot h(k).

• We call h a hash function.

h : U " {0, 1, . . . ,m − 1}, so that h(k) is a legal slot number in T, the storage table.

We say that k hashes to slot h(k).

• 19 mod 10 is 9

• 19 hashes to slot 9

Collisions: When two or more keys hash to the same slot.

• Can happen when there are more possible keys than slots (|U| > m).

• For a given set K of keys with |K| ≤ m, may or may not happen. Definitely happens if |K| > m.

• Therefore, must be prepared to handle collisions in all cases.

• Use two methods: chaining and open addressing.

• Chaining is usually better than open addressing. The text examines both.

When to use a Hash Implementation

Hashing

Collision resolution by chaining

Put all elements that hash to the same slot into a linked list.

Example

Collision: h(k2) = h(k5) = h(k7)

Insertion: Always add to front of list, worst case time of O(1).

Question 11.6 - Why O(1)?

Figure below shows singly linked lists. If we want to delete elements, it’s better to use doubly linked lists.

Question 11.7 - Why?

Implementation (Doubly linked)

Example - Node_Ptr

previous key next
Chained-Hash-Insert (T: Hash_Table, x: Node_Ptr)

   -- pre: "y: Node_Ptr, y Î T's nodes: x ¹ y, x.previous = NIL
   -- post: node referenced by x is
   --         inserted at the head of the list at T[ h( x.key ) ]

Node_Ptr head ¬ T[ h( x.key ) ]

x.next ¬ head

if head ¹ NIL  then head.previous ¬ x

T[ h( x.key ) ] ¬ x

 x
Before

insert

After

Chained-Hash-Search (T: Hash_Table, k:U ) : Node_Ptr

   -- post: NIL is returned if k is not found in any
   --         of the nodes in the list at T[ h(k) ]
   --         x is returned where x is the address of
   --         a node in T where x.key = k

Node_Ptr x ¬ T[ h(k) ];

while x ¹ NIL and x.key ¹ k do x ¬ x.next

return x

 
Chained-Hash-Delete (T: Hash_Table, x: Node_Ptr)

   -- pre: $y: Node_Ptr, y Î T's nodes: x = y
   --        Node_Ptr is doubly linked

   -- post: node referenced by x is deleted
   --          from the list at T[ h( key [x] ) ]

if x.previous = NIL                              -- head

    then T[ h( x.key ) ] ¬ x.next

            if x.next ¹ NIL then x.next.previous ¬ NIL

    else x.previous.next ¬ x.next

Question 11.9 - Use h( k ) = k mod 2

Diagram a data structure for hashing to implement a dictionary of dates translating from historical dates to English words, for example:

1492 ® fourteen ninety two

Which is the key? satellite data? What does T hold?

Construct the hash table for keys: 2005, 1492, 1984, 1776, 2013

Insert a different node with key 2005. Any problems?

What about inserting the same node twice?

Search for 1492, 2002 and 2005. Any problems?

Delete 1492, 2005 and 2002. Any problems?

Question

Give insert and search algorithms for singly-linked lists.

Suggest an algorithm for delete using singly-linked lists.

 

Performance - Doubly linked

Load factor - a = n/m, where n = # of elements to store, m = # of slots or size of T, and assumes Simple Uniform Hashing; a > 1 a = 1, or a < 1

Question 11.10 - What is implied by: a = 1, a < 1, a > 1?

 

Hash functions

Caution

Question 11.16

What are some of the criteria for key selection?

Would a key based on the year of birth for first-graders be useful?

Would a hash function with m = 26 and a hash function that extracted the first vowel of a name satisfy the Simple Uniform Hashing assumption?

For example: h( k ) = (first vowel of k) mod 26

A hash that extracted the month of birth from the key?

For example: h( k ) = month of birth in k