30 Apr 2026

The simplest way to encode a string of English text is to translate each character separately into a binary code and string these together. A standard code for English is ASCII, which originally mapped each character to 7 bits; that raw 7-bit form is rarely used directly today. In the diagram above we use 8 bits per character for byte alignment. See how the short string on the left gets translated into binary.
Using the same code table, the binary can be translated back into English by looking up each code and emitting its letter. This approach is rather wasteful of space, though. For one, the Latin alphabet has only 26 letters, but 8 bits can represent 256 different characters. Furthermore, we would like variable-length codes, where common letters get a short binary representation and rare occurrences get a longer one. This brings us to Huffman codes.
Huffman codes are named for their inventor, David Huffman. They are a form of prefix-free variable-length encoding built with a tree-based algorithm. Prefix-free means that no character's binary code is a prefix of any other character's code; otherwise the encoding would be ambiguous. We will see more about this later on this page. The first step in Huffman coding is to count the occurrences of every character to be encoded. You can see this in action above.
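In Python, this counting step is a one-liner with the standard library (the example string is our own):

```python
from collections import Counter

# Count how often each character occurs in the text to be encoded.
counts = Counter("abracadabra")
# counts["a"] is 5, counts["b"] and counts["r"] are 2, the rest 1.
```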
The next step, once we have the counts, is to build a tree. We start with one leaf node per character, annotated with its count. The two nodes with the smallest counts are joined under a new internal parent node, which is annotated with the sum of their counts. Ties between equal counts can be resolved arbitrarily. Step by step, a binary tree is built in which every leaf represents a character. Each edge below an internal node is then labelled with 0 or 1, the two edges under a node always getting different labels, as you can see above.
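The merging procedure maps naturally onto a min-heap. Here is a minimal sketch under our own representation, where a leaf is just its character and an internal node is a (left, right) pair; the function name is an assumption, not from the page:

```python
import heapq
from collections import Counter

def build_tree(text):
    # One heap entry per character: (count, tie-breaker, node). The
    # tie-breaker resolves equal counts arbitrarily but deterministically.
    heap = [(n, i, ch) for i, (ch, n) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Join the two smallest nodes under a new internal parent whose
        # count is the sum of its children's counts.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id, (left, right)))
        next_id += 1
    return heap[0][2]
```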
To encode a string we convert the tree into a table mapping characters to binary codes, which can be seen above. For this we traverse the entire tree, keeping track of the bit labels on the edges. Whenever a leaf node is reached, the path from the root determines the binary representation of that character. Since leaves sit at different depths, the encodings can have different lengths.
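This traversal can be written as a short recursion, again assuming our nested-tuple tree representation with single-character leaves (names are ours):

```python
def make_table(tree, prefix=""):
    # A leaf is a single character: the accumulated edge labels are its code.
    if isinstance(tree, str):
        return {tree: prefix or "0"}  # degenerate one-character tree
    left, right = tree
    # Left edges are labelled 0, right edges 1.
    return {**make_table(left, prefix + "0"), **make_table(right, prefix + "1")}

# Example: in the tree (('a', 'b'), 'c'), 'a' and 'b' sit two levels deep
# while 'c' sits one level deep, so their codes have different lengths.
table = make_table((('a', 'b'), 'c'))
# table == {'a': '00', 'b': '01', 'c': '1'}
```

Encoding is then a simple lookup per character: `"".join(table[c] for c in text)`.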
The last step of encoding is to translate just as before with ASCII, but using our purpose-built Huffman table. Overall we use fewer bits for the string than before. In fact, among all prefix-free codes that assign one fixed code per character, Huffman coding is known to be optimal for the given character frequencies.
To decode the binary string back into text we use the Huffman tree directly. Starting at the root, we read one bit at a time: a 0 means go left, a 1 means go right. When we reach a leaf node, we have found a decoded character and output it. Then we return to the root and repeat. Because Huffman codes are prefix-free, the decoding is unambiguous — there is always exactly one way to parse the bit stream into characters.
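The decoding walk can be sketched directly, once more assuming the nested-tuple tree representation (left child at index 0, right child at index 1; names are our own):

```python
def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        # A 0 bit means go left, a 1 bit means go right.
        node = node[0] if b == "0" else node[1]
        if isinstance(node, str):
            # Reached a leaf: output its character and restart at the root.
            out.append(node)
            node = tree
    return "".join(out)

# With the tree (('a', 'b'), 'c'), the codes are a=00, b=01, c=1,
# so decode("00011", (('a', 'b'), 'c')) recovers "abc".
```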
What we have left out here is that the Huffman tree itself needs to be encoded too and prepended to the binary output; without it, correct decoding is impossible. There are further algorithms for this which we will not illustrate on this page. Because of this overhead, the total encoding length might not beat other codings, especially for short inputs.
The Claude Opus LLM helped create this page. There are more algorithms and data structures to be found on the main page.
Please consider sharing this site on social media!