Huffman Coding Algorithm

Huffman coding is a widely used algorithm for lossless data compression. It was developed by David A. Huffman in 1952 while he was a graduate student at MIT. Huffman coding is a variable-length prefix coding technique that assigns shorter codes to frequently occurring characters and longer codes to less frequent ones. This approach allows for efficient compression of data, where the most common symbols are represented by fewer bits than the rare symbols.

The Huffman coding algorithm works by building a binary tree called a Huffman tree or Huffman encoding tree. The tree is constructed based on the frequency of occurrence of each symbol in the input data. Here's an overview of the algorithm:

Frequency counting - First, the algorithm scans the input data and counts the frequency of each symbol or character.

Building the Huffman tree - The algorithm creates a binary tree where each leaf node represents a symbol and its frequency. Initially, each symbol is treated as a separate tree node.

Combining nodes - The algorithm repeatedly combines the two nodes with the lowest frequencies into a new internal node. The frequency of the new node is the sum of the frequencies of its children. This process continues until all the nodes are combined into a single tree.

Assigning codes - Starting from the root of the tree, traverse each path to a leaf node, assigning a "0" for a left branch and a "1" for a right branch. The resulting binary code assigned to each symbol is unique and depends on its position in the tree.

Encoding - Replace each symbol in the input data with its corresponding Huffman code. The encoded data is a compact representation of the original data.

Decoding - To decode the encoded data, start from the root of the Huffman tree and follow the encoded bits. Whenever a leaf node is reached, output the corresponding symbol and return to the root to continue decoding the remaining bits.

Huffman coding achieves compression by using shorter codes for frequently occurring symbols, resulting in overall space savings. However, it requires sharing the codebook (the tree structure) between the encoder and decoder to ensure proper decoding.

It's worth noting that the efficiency of Huffman coding depends on the distribution of symbols in the input data. If the symbols are evenly distributed, Huffman coding may not provide significant compression. However, for data with non-uniform symbol frequencies, Huffman coding can yield substantial compression gains.

Input and Output

Input:
A string with different characters, say “ACCEBFFFFAAXXBLKE”
Output:
Code for different characters:
Data: K, Frequency: 1, Code: 0000
Data: L, Frequency: 1, Code: 0001
Data: E, Frequency: 2, Code: 001
Data: F, Frequency: 4, Code: 01
Data: B, Frequency: 2, Code: 100
Data: C, Frequency: 2, Code: 101
Data: X, Frequency: 2, Code: 110
Data: A, Frequency: 3, Code: 111

Huffman Coding Algorithm

The Huffman coding algorithm, as already discussed, follows the greedy design approach to arrive at an optimal solution. It uses a Huffman tree to encode and decode the data. A Huffman tree is created using the following steps:

Create a leaf node for each character of the text.

Arrange all the nodes in the increasing order of their frequency.

Take the first two nodes, which have the minimum frequency:

- Create an internal node.

- The frequency of this internal node is the sum of the frequencies of the previous two nodes.

- Make the first and second nodes the left and right children, respectively, of the newly created node.

Repeat steps 2 and 3 until all the nodes form a tree. The tree thus obtained is a Huffman tree.

After the Huffman tree is created, each character is assigned a unique code. For decoding, the same Huffman tree is traversed.

Applications of Huffman Coding

We have thoroughly discussed the Huffman coding approach (a greedy design technique) and its algorithm. Now let us see the applications of the Huffman coding mechanism, which are as follows:

It is applied to data in which some characters occur much more frequently than others.

Used for transmitting data such as text and fax messages.

Used by conventional compression formats like GZIP, PKZIP, etc.

Multimedia formats like JPEG, PNG, and MP3 use Huffman encoding.

Example

#include <iostream>
#include <queue>
#include <string>
using namespace std;

struct node {
   int freq;
   char data;
   const node *child0, *child1;

   node(char d, int f = -1) { //assign values in the node
      data = d;
      freq = f;
      child0 = NULL;
      child1 = NULL;
   }

   node(const node *c0, const node *c1) { //internal node: combine two children
      data = 0;
      freq = c0->freq + c1->freq;
      child0 = c0;
      child1 = c1;
   }

   bool operator<( const node &a ) const { //reversed comparison turns the priority queue into a min-heap
      return freq > a.freq;
   }

   void traverse(string code = "") const {
      if(child0 != NULL) {
         child0->traverse(code + '0'); //add 0 to the code for the left child
         child1->traverse(code + '1'); //add 1 to the code for the right child
      } else {
         cout << "Data: " << data << ", Frequency: " << freq << ", Code: " << code << endl;
      }
   }
};

void huffmanCoding(string str) {
   priority_queue<node> qu;
   int frequency[256];

   for(int i = 0; i < 256; i++)
      frequency[i] = 0; //clear all frequencies

   for(int i = 0; i < (int)str.size(); i++)
      frequency[(unsigned char)str[i]]++; //count the frequency of each character

   for(int i = 0; i < 256; i++)
      if(frequency[i])
         qu.push(node(i, frequency[i])); //make a leaf node for each character that occurs

   while(qu.size() > 1) {
      node *c0 = new node(qu.top()); //get left child and remove from queue
      qu.pop();
      node *c1 = new node(qu.top()); //get right child and remove from queue
      qu.pop();
      qu.push(node(c0, c1)); //add freq of two children and add again in the queue
   }
   cout << "The Huffman Code: " << endl;
   qu.top().traverse(); //traverse the tree and print the code for each character
}

int main() {
   string str = "ACCEBFFFFAAXXBLKE";
   huffmanCoding(str);
}

Output

The Huffman Code:
Data: K, Frequency: 1, Code: 0000
Data: L, Frequency: 1, Code: 0001
Data: E, Frequency: 2, Code: 001
Data: F, Frequency: 4, Code: 01
Data: B, Frequency: 2, Code: 100
Data: C, Frequency: 2, Code: 101
Data: X, Frequency: 2, Code: 110
Data: A, Frequency: 3, Code: 111