Chapter 11 - Hash functions
|
Document last modified: |
OVERVIEW
What makes a good hash function?
Ideally, the hash function satisfies the assumption of simple uniform hashing.
In practice, it’s not possible to satisfy this assumption, since we don’t know in advance the probability distribution that keys are drawn from, and the keys may not be drawn independently.
Often use heuristics, based on the domain of the keys, to create a hash function that performs well.
Keys as natural numbers
- Hash functions assume that the keys are natural numbers.
- When they’re not, have to interpret them as natural numbers.
Example
- Interpret a character string as an integer expressed in some radix notation. Radix 10 is the radix of decimal numbers.
- For "CLRS"
- ASCII values: C = 67, L = 76, R = 82, S = 83
- 128 basic ASCII values, use radix 128
- Interpret CLRS as (67 · 1283)+ (76 · 1282)+ (82 · 1281)+ (83 · 1280) = 141,764,947
Division method for h( k )
h(k) = k mod m
Example:
m = 20 and k = 91 ®
h(91) = 91 mod 20
= 11Advantage: Fast, since requires just one division operation.
Disadvantage: Have to avoid certain values of m
- Powers of 2 are bad. If m = 2p for integer p, then h(k) is just the least significant p bits of k.
128 = 27
CLRS as (67 · 1283)+ (76 · 1282)+ (82 · 1281)+ (83 · 1280) mod 27 = 83 = 010100112
ABCS as (65 · 1283)+ (66 · 1282)+ (67 · 1281)+ (83 · 1280) mod 27 = 83 = 010100112
Question 11.17 - Are powers of 10 for integer keys bad? Give an example.
- If k is a character string interpreted in radix 2p (as in CLRS example), then m = 2p − 1 is bad: permuting characters in a string does not change its hash value (Exercise 11.3-3).
- Means CLRS mod 27-1 = RLCS mod 27-1
27 = 128 radix
27-1 = 127 is m
- CLRS as (67 · 1283)+ (76 · 1282)+ (82 · 1281)+ (83 · 1280) mod 127 = 54
- SRLC as (83 · 1283)+ (82 · 1282)+ (76 · 1281)+ (67 · 1280) mod 127 = 54
Question 11.18 - Show that m=9 is a bad choice for integer keys (i.e. radix 101).
Try keys 123, 321, 231
Good choice for m:
A prime not too close to an exact power of 2.
Generally, a prime not too close to the radix.
Multiplication method for h( k )
- Choose constant A in the range 0 < A < 1.
- Multiply key k by A.
- Extract the fractional part of k·A
- Multiply the fractional part by number of slots, m.
- Take the floor of the result.
Mathematically
h( k ) = ëm · (k·A mod 1)û
where k·A mod 1 = k·A − ëk·Aû = fractional part of k·A
- Disadvantage: Slower than division method.
- Advantage: Value of m is not critical.
Example
m = 8 (implies m = 23, p = 3)
w = 5
k = 21
- Must have 0 < s < 25; choose s = 13 ® A = 13/32.
- h(k) = ëm · (k·A mod 1)û
h( 21 ) = ë8 · (21·13/32 mod 1)û = 4
k·A = 21 · 13/32 = 273/32 = 8 17/32 ® k·A mod 1 = 17/32
® m · ( k·A mod 1) = 8 · 17/32 = 17/4 = 4 1/4
® ëm · ( k·A mod 1)û = 4
so that h(21) = 4.
Question 11.19 - Find h( k ) for:
m = 16 (implies m = 24, p = 4)
w = 8
k = 21
s = 32 Must have 0 < s < 28; choose s = 32 ® A = s / 28
- A = ?
- k·A = ?
- k·A mod 1 = ?
- m · (k·A mod 1) = ?
- h( k ) = ? h(k) = ëm · (k·A mod 1)û
Implementation
Computing the hash key used four operations but can be implemented with fewer, mostly bit operations.
- Choose m = 2p for some integer p.
- Let the word size of the machine be w bits.
- Assume that k fits into a single word. (k takes w bits.)
- Let s be an integer in the range 0 < s < 2w. (s takes w bits.)
- Restrict A to be a fraction of the form s/2w
0 < A < 1
- Multiply k by s.
- Since we’re multiplying two w-bit words, the result is 2·w bits, r1·2w+r0, where r1 is the high-order word of the product and r0 is the low-order word.
- r1 holds the integer part of k·A (i.e. ëk·Aû ) and r0 holds the fractional part of k·A (k·A mod 1 = k·A − ëk·Aû ). Think of the “binary point” (analog of decimal point, but for binary representation) as being between r1 and r0. Since integer part of k·A is not used, forget r1 and just use r0.
- Since we want ëm·( k·A mod 1)û , we could get that value by shifting r0 to the left by p = lg m bits and then taking the p bits that were shifted to the left of the binary point.
- We don’t need to shift. The p bits that would have been shifted to the left of the binary point are the p most significant bits of r0. So we can just take these bits after having formed r0 by multiplying k by s.
Example
m = 8 (implies m = 23, p = 3)
w = 5
k = 21
s = 13
k · s = 21 · 13 = 273 = 8 · 25 + 17 = r1 . r0 r1
r0= 8 · 25
= 17 = 100012
- Written in w = 5 bits, r0 = 100012
- The p = 3 most significant bits of r0 is 1002 or 410, so h(21) = 4.
Question 11.20 - Find h( k ) for:
m = 4 (implies m = 22, p = 2)
w = 3
k = 12
s = 5 0 < s < 2w = 23 = 8
k · s = 12 · 5 = ? = ? · 23 + ? = r1 . r0 r1
r0= ? · 23
= ? = ?2
- Written in w = 3 bits, r0 = ?2
- The p = 2 most significant bits of r0 is ?2 or ?10, so h(12) = ?.
How to choose A:
- The multiplication method works with any legal value of A.
- But it works better with some values than with others, depending on the keys being hashed.
- Knuth suggests using A ≈ (√5 − 1)/2 = 1.236
Example:
h(k) = ëm * ((k * A) mod 1)û
- w - word size of the machine is w bits
- 0 < A < 1
- restrict A to be a fraction of the form s/2w
- where 0 < s < 2w
- Knuth recommends choosing A ≈ (50.5 - 1) / 2
- m = 2p - for some integer p
- (k * A) mod 1 = (k * A) - ëk * Aû
- k = 123456
- p = 14
- m = 2p = 214 = 16384
- w = 32 bits
Determine:
- s - where A = s/232 and A = Knuth's constant
- s = A · 232 = ((50.5 - 1) / 2 · 232 = 2654435769
Then:
- k · s
= 123456 * 2654435769
= 327706022297664 (this is a 64 bit value)
- r1 = (k · s)/232 which is the high-order 32 bits of (k · s)
= 327706022297664/232= 76300
- Note: this division of 64 bits can be implemented as a bit shift; shift 32 bits to the right
- 32770602229766410 = 10010101000001100000000010000110011000000010000002
- 7630010 = 100101010000011002
- r0 = (k · s) - (r1 · 232 )
= 327706022297664 - (76300 · 232)= 17612864= 10000110011000000010000002= 000000010000110011000000010000002 (in 32 bits have to add some leading 0's)
- Note: this multiplication (76300 * 232) can be implemented as a bit shift to the left
(7630010 * 232)
= 1001010100000110000000000000000000000000000000002
- h(k) = the p most significant bits of r0
- r0 = 000000010000110011000000010000002
- p = 14, so take the 14 most significant bits of r0
- h(123456) = 00000001000011
- h(123456) = 6710
- Note: taking the 14 most significant bits of 32 bit r0 is another bit operation.
Question 11.21 - What is the final bit operation needed to recover 14 bit h(123456) from 32 bit r0?
The Multiplication Method Algorithm for h(k)
- compute (k · s) - using multiplication resulting in a 2·w bit value of the form: r1 · 2w + r0
- compute r1 = (k · s)/232 by shifting bits 32 to the right
- compute r0 = (k · s) - (r1 * 232 ) by 1st shifting r1 to the left 32 bits, and then by doing a subtraction
- compute h(k) = the p most significant bits of r0 by shifting r0 w-p (e.g. 32-14) bits to the right or rotating p bits to the left.
Comparison
Division Method
When using the division method, we usually avoid certain values of m.
Unless it is known that all low-order bits patterns are
equally likely, it is better to make the hash function depend on all the
bits of the key.
Multiplication Method
An advantage of the multiplication method is that the value of m is not critical.
| Multiplication Method | Division Method | ||
| m | 1000 |
1000 |
|
| A | 0.618033988749895 | ||
| key | h(key) = ëm * (k * A) mod 1û | h(key) = mod(key, 1000) | |
| sample key from p. 264: | 123456 | 4 | 456 |
| 123459 | 858 | 459 | |
| These keys are a | 123496 | 725 | 496 |
| one digit modification | 123956 | 21 | 956 |
| of the sample key: | 129456 | 208 | 456 |
| 123456 | 193456 | 383 | 456 |
| 923456 | 195 | 456 |