Huffman Encoding

See also: Compression | Data | JPEG

A method of Lossless Compression based on the recurrence of characters within a document. Huffman Encoding is used in JPEG as well as some other file compression techniques. Essentially each character within the document is assigned a unique bit series, with the most frequently occuring characters having the smallest series and the least frequently occuring characters having the largest series.

To establish the Huffman Code for a particular document, the entire document must be scanned and the number of occurences of each character recorded. A Huffman Tree is then constructed. A Huffamn Tree is a Binary Tree with a special construction, where only the leaf nodes contain data. The algorithm for constructing a Huffman Tree is as follows.

Create a pool of all values (nodes) and their number of occurences (count).

  1. Select the two nodes with the smallest number of occurences. If there is a tie, randomly select one.
  2. Join these two nodes together with a common parent node to create a subtree.
  3. Assign the parent node a count that is the sum of the counts of the child nodes.
  1. Place this subtree back in the pool.

When all nodes have been joined together, this is the final Huffman Tree. As with a normal binary tree, each branch is assigned either a 1 or a 0 (left or right). Thus, since the values with the smallest number of occurences will be closer to the top of the tree, they will have the smallest number of bits representing them.

The problem with Huffman Encoding is that the entire dataset must be *known* before anything can be compressed. This makes it difficult to use in streamign MPEG, since the data set is constantly growing.

TakeDown.NET -> “Huffman-Encoding