# Huffman Coding


The most popular entropy-based encoding technique is the Huffman code. It provides the smallest possible average number of information units (bits) per source symbol. This short article describes how it works.

The first step in the Huffman algorithm consists in creating a series of source reductions, by sorting the probabilities of each symbol and combining the (two) least probable symbols into a single symbol, which will then be used in the next source reduction stage.

Figure 1 shows an example of consecutive source reductions. The original source symbols appear on the left-hand side, sorted in decreasing order by their probability of occurrence. In the first reduction, the two least probable symbols (a3 with prob. = 0.06 and a5 with prob. = 0.04) are combined into a composite symbol, whose probability is 0.06 + 0.04 = 0.1. This composite symbol and its probability are copied onto the first source reduction column at the proper slot (so as to enforce the requirement that the probabilities are sorted in decreasing order at any reduction stage). The process continues until we reach a 2-symbol reduced source (which happens after four steps in this example).
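The reduction process above can be sketched in a few lines of Python. Note that only a3 = 0.06 and a5 = 0.04 are stated in the text; the remaining probabilities (0.4, 0.3, 0.1, 0.1) are assumed here for illustration, chosen so that the composite probabilities mentioned in the article (0.1 and, later, 0.6) come out as described:

```python
def reduce_source(probs):
    """One reduction stage: merge the two least probable symbols."""
    probs = sorted(probs, reverse=True)
    # combine the two smallest probabilities into a composite symbol
    merged = round(probs[-1] + probs[-2], 4)
    # re-insert it and keep the column sorted in decreasing order
    return sorted(probs[:-2] + [merged], reverse=True)

# assumed example source (only 0.06 and 0.04 appear in the text)
stage = [0.4, 0.3, 0.1, 0.1, 0.06, 0.04]
while len(stage) > 2:
    stage = reduce_source(stage)
    print(stage)
# four reductions lead to a 2-symbol source, as in the example
```

With these probabilities the loop performs exactly four reductions and ends at the 2-symbol source [0.6, 0.4], matching the composite symbol of probability 0.6 discussed below.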

The second step in the Huffman algorithm consists in encoding each reduced source, starting from the right-hand side and moving towards the original source. The most compact binary code for a binary source is, obviously, formed by the symbols 0 and 1. These values are assigned to the two rightmost symbols (in this example, following a convention that states that ‘the least probable symbol gets a bit 0’). Since the symbol whose probability is 0.6 was obtained as a combination of two other symbols during the source reduction steps, the 0 used to encode it is now assigned to both symbols from which it originated, appending a 0 or 1 to the right of each symbol (following the same convention) to tell them apart. The process is repeated for each reduced source until we return to the original source (Figure 2). The resulting code appears on the third column of Figure 2. The average codeword length can be calculated as:

L_avg = (0.4)(1) + (0.3)(2) + (0.1)(3) + (0.1)(4) + (0.06)(5) + (0.04)(5) = 2.2 bits/symbol
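The two-step procedure (reduce, then assign bits while walking back) is equivalent to the standard heap-based Huffman construction sketched below. The symbol names and the probabilities other than a3 = 0.06 and a5 = 0.04 are assumed for illustration; the exact codewords depend on how ties are broken, so they may differ from the figure, but the average codeword length of an optimal code is the same:

```python
import heapq

def huffman_code(probs):
    """Return {symbol: codeword} for a dict of symbol probabilities."""
    # each heap entry: (probability, tie-breaker, partial code table)
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # least probable group
        p2, _, c2 = heapq.heappop(heap)  # next least probable group
        # convention: the least probable group gets a 0 prepended
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# assumed example source; only a3 and a5 probabilities appear in the text
probs = {"a1": 0.1, "a2": 0.4, "a3": 0.06, "a4": 0.1, "a5": 0.04, "a6": 0.3}
code = huffman_code(probs)
L_avg = sum(p * len(code[s]) for s, p in probs.items())
print(code)
print(round(L_avg, 2))  # average codeword length: 2.2 bits/symbol
```

Any optimal code for this source achieves the same 2.2 bits/symbol average, even when individual codeword lengths differ between equally optimal trees.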

The source entropy (calculated using eq. (1) under the **Image compression and coding** entry) in this case is 2.14 bits/symbol. The efficiency of the Huffman code is, therefore, equal to 2.14 / 2.2 = 0.973.

The Huffman algorithm is optimal, in the sense that it generates the shortest possible average codeword length for a given source. It is often referred to as ‘instantaneous, uniquely decodable block coding’: each source symbol is encoded into a fixed sequence (block) of bits; each codeword can be decoded instantaneously, i.e., without the need to make reference to past or subsequent symbols; and there is only one way of decoding a given string of 0s and 1s (i.e., no codeword is the prefix of another valid codeword).
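The instantaneous-decoding property can be demonstrated with a short sketch. The code table below is one valid prefix-free code for the assumed six-symbol example source (the article's figure may assign different but equivalent codewords); because no codeword is a prefix of another, the decoder emits a symbol the moment its codeword is matched:

```python
# one valid prefix-free (Huffman) code for the assumed example source
code = {"a2": "1", "a6": "00", "a1": "011", "a4": "0100",
        "a3": "01010", "a5": "01011"}

def decode(bits, code):
    """Decode a bit string left to right, emitting symbols instantly."""
    inverse = {cw: s for s, cw in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # a complete codeword has been seen
            out.append(inverse[buf])
            buf = ""                # emit immediately and reset
    return out

encoded = code["a2"] + code["a6"] + code["a3"]  # "10001010"
print(decode(encoded, code))  # ['a2', 'a6', 'a3']
```

No lookahead or backtracking is ever needed: the prefix property guarantees that the first codeword matched is the only possible parse of those bits.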

When a large number of source symbols needs to be encoded, however, the original source reduction algorithm becomes impractical and computationally inefficient. This practical constraint has motivated the design of alternatives to the original Huffman coding algorithm in which the optimality of the code is partially sacrificed in the name of implementation efficiency. Many variants of the original Huffman coding algorithm have been proposed in the literature. They are usually referred to by names such as modified, truncated, or adaptive Huffman coding. Detailed descriptions of some of these variants, as well as many other relevant resources, are available in the literature.

The optimality of the Huffman coding algorithm was also challenged by the arrival of arithmetic coding, a technique that encodes an entire message into a single number, *n* (0.0 ≤ *n* < 1.0), overcoming the limitation of an integer number of bits per symbol inherent in the original Huffman coding scheme.
