Huffman

Document last modified: 
Huffman (C)
-- pre:  C is the characters that comprise the alphabet
-- post: return the root of the optimal encoding tree 
   n ← |C|
   Q ← C    -- insert all the elements of C into Q, using the frequency as the priority
   for i ← 1 to n-1 do
		z ←  a new tree node
		x ←  Extract-Minimum (Q) 
		y ←  Extract-Minimum (Q)
		left[ z ] ←  x
		right[ z ] ← y
		f[z] ←  f[x] + f[y]
		Insert (Q, z)
   return Extract-Minimum (Q)        -- the root of the encoding tree

Below are the contents of the min-heap Q after each step through the for loop:

  1. Initially, all nodes are leaf nodes. Insert all 8 in Q: 
    c q m p n s r t o
    f(c) 2 3 4 5 6 7 8 9
  2. Join the two nodes with least value; now there are 7 things in Q: 
    c p   n s r t o
    f(c) 4   5
     /  \
    q    m
    2    3
    5 6 7 8 9
  3. Then the next two with least value, Q has 6 elements: 
    c n s r t   o
    f(c) 5 6 7 8   9
     /  \
    p   5
    4  /  \
      q    m
      2    3
    9
  4. Then the next two with least value, Q has 5 elements: 
    c r t   o  
    f(c) 7 8   9
     /  \
    p   5
    4  /  \
      q    m
      2    3
    9   11
     /   \
    n    s
    5    6
  5. Q has 4 elements: 
    c   o    
    f(c)   9
     /  \
    p   5
    4  /  \
      q    m
      2    3
    9   11
     /   \
    n    s
    5    6
      15
     /   \
    r     t
    7    8

     

  6. Three items left: 
    c      
    f(c)   11
     /   \
    n    s
    5    6
      15
     /   \
    r     t
    7    8

     

      18
      /  \
    o    9
    9   /  \
       p   5
       4  /  \
         q    m
         2    3
  7. Two big trees left: 
    c    
    f(c)   18
      /  \
    o    9
    9   /  \
       p   5
       4  /  \
         q    m
         2    3
           26
         /     \
      11      15
     /   \    /   \
    n    s  r     t
    5    6 7    8
  8. Finally, we join the whole thing up: 
    c  
    f(c)         __44 ___
          /               \
      18                  26
      /  \               /     \
    o    9           11     15
    9   /  \        /   \    /   \
       p   5       n    s  r     t
       4  /  \     5    6  7    8
         q    m
         2    3
At each point, the heuristic chooses the joining that would force less frequent characters to be deeper in the tree. 
So an optimal prefix code is: 
00	o
010	p
100	n
101	s
110	r
111	t
0110	q
0111	m
And B(T) = 9*2+3*4+3*5+3*6+3*7+3*8+2*4+3*4 = 
              = 18+12+15+18+21+24+8+12 = 128 bits.
A savings of 64% in the size of the data over 352 bits as 8-bit encoding.