Huffman coding with length constraint

Today, let’s take a look at how to construct optimal Huffman codes of symbols.

Input: count of each symbol’s occurrence output: code-length of each symbol

Note that once we have the code lengths of the symbols, we can easily reconstruct the Huffman codes using the canonical Huffman codes.

Unconstrained

The simplest case is without any constraint on how many bits each code can use. Finding the optimal code-lengths can be implemented using a binary heap, merging the symbols with lowest counts. Here is a good reference.

We first need to define Package which keeps track of the merged symbols and the total count. We define a function to merge two Packages. We also make it comparable by the total count.

use std::cmp::Ordering;
use std::ops::Add;

/// a package that constitutes one or more symbols
#[derive(PartialEq, Eq, Clone)]
pub struct Package {
    count: usize,        // sum of counts among all constituents
    symbols: Vec<usize>, // constituent symbols
}

impl Package {
    pub fn new(symbol: usize, count: usize) -> Self {
        Self {
            count,
            symbols: vec![symbol],
        }
    }

    pub fn symbols(&self) -> &[usize] {
        &self.symbols
    }
}

/// compare by count
impl Ord for Package {
    fn cmp(&self, other: &Self) -> Ordering {
        self.count.cmp(&other.count)
    }
}

impl PartialOrd for Package {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Add for Package {
    type Output = Self;

    /// merge two packages into one
    fn add(mut self, rhs: Self) -> Self::Output {
        self.count += rhs.count;
        self.symbols.extend_from_slice(&rhs.symbols);
        self
    }
}

In the main function, we initialize a min-heap of packages of non-zero count symbols. We also initialize code lengths of each non-zero count symbols as 1.

We merge two packages of least-count and increment code length of each symbol in the merged package. We repeat the process until we are left with two or fewer packages.

// huffman.rs
use std::cmp::Reverse;
use std::collections::BinaryHeap;

use crate::package::Package;

/// generate optimal huffman codes
/// iter: counts of each symbol
/// returns code-lengths of each symbol
pub fn generate_canonical_huffman_code(iter: impl Iterator<Item = usize>) -> Vec<usize> {
    let mut queue = BinaryHeap::new(); // max-heap
    let mut nsymbols = 0;
    for (symbol, count) in iter.enumerate() {
        if count > 0 {
            queue.push(Reverse(Package::new(symbol, count))); // min-heap by count
        }
        nsymbols += 1;
    }

    let mut code_lengths = vec![0; nsymbols];
    for x in queue.iter() {
        code_lengths[x.0.symbols()[0]] += 1;
    }

    while queue.len() > 2 {
        let x = queue.pop().unwrap();
        let y = queue.pop().unwrap();
        let merged = x.0 + y.0;
        for symbol in merged.symbols() {
            code_lengths[*symbol] += 1;
        }
        queue.push(Reverse(merged));
    }

    code_lengths
}

Let’s add some unit tests to see if our function works as expected.

#[cfg(test)]
mod test {
    use crate::huffman::generate_canonical_huffman_code;

    #[test]
    fn test0() {
        let freq = vec![0, 1];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [0, 1]);
    }

    #[test]
    fn test1() {
        let freq = vec![40, 35, 20, 5];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [1, 2, 3, 3]);
    }

    #[test]
    fn test2() {
        let freq = vec![1];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [1]);
    }

    #[test]
    fn test3() {
        let freq = vec![40, 35, 20, 0];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [1, 2, 2, 0]);
    }

    #[test]
    fn test4() {
        let freq = vec![7, 4, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5]);
    }

    #[test]
    fn test5() {
        let freq = vec![0, 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [0, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]);
    }

    #[test]
    fn test6() {
        let freq = vec![];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, []);
    }

    #[test]
    fn test7() {
        let freq = vec![0, 0, 0, 0, 0];
        let bitlen = generate_canonical_huffman_code(freq.into_iter());
        assert_eq!(bitlen, [0, 0, 0, 0, 0]);
    }
}

Length-Constrained

In practice, one may want to limit the maximum code length of the symbols. For example, .gz files are compressed with Huffman codes up to 15-bits. This makes it a bit more complicated to find the optimal code-lengths compared to the unconstrained case. Here is a good reference on how to perform this.

The main idea is to use package-merge algorithm, which employs the binary coin collector’s problem. Here is my summary of coin collector’s problem

Suppose we have a lot of rare coins in the collection. The face-value of a coin is fixed by its type (e.g., penny, nickel, dime, quarter, etc) but its collection-value is proportional to its rarity, regardless of its face-value. We are out of money and need to buy an item from the the coin collection. We want to find a set of coins whose face-value is N-1 with minimal collection-value.

Here, we are minimizing the collection-value of the spending under the constraint of fixed face-value. We can draw a parallel with the Huffman code lengths:

collection-value corresponds to the count of each symbol
N is the number of non-zero count symbols.
the different coin-types correspond to possible number of code-lengths. For maximum code length constraint L, each Huffman code can be of 1-, 2-, …, L-bits, so these correspond to L binary face-value coin types.

Now, let’s implement a function that finds the optimal Huffman code-lengths under the maximum code-length constraint. We will re-use the Package data type.

We start with an empty set. We extend our set with all non-zero count symbols. We continuously merge two least-count symbols into a package. One difference from the unconstrained case is that we collect these merged packages in a separate set.

We repeat the process L times. Now, we have a set of packages, each of which has face-value of 1. Now, we select N-1 of such packages of least count. Then, the code-length of each symbol corresponds to the number of occurrence of the symbols in the final selection.

// huffman_constrained.rs
use std::cmp::Reverse;

use crate::package::Package;

/// generate optimal huffman codes under code-length constraint
/// iter: counts of each symbol
/// returns code-lengths of each symbol
pub fn generate_canonical_huffman_code_with_constraint(
    iter: impl Iterator<Item = usize>,
    max_code_length: usize,
) -> Vec<usize> {
    let mut nsymbols = 0;
    let mut xs = Vec::new(); // nonzero count symbols
    for (symbol, count) in iter.enumerate() {
        if count > 0 {
            xs.push(Reverse(Package::new(symbol, count)));
        }
        nsymbols += 1;
    }

    assert!(1 <= max_code_length || xs.is_empty());
    assert!(xs.len() <= 1 << max_code_length);
    let mut code_lengths = vec![0; nsymbols];

    if xs.len() == 1 {
        // corner case
        let s = xs[0].0.symbols()[0];
        code_lengths[s] += 1;
        return code_lengths;
    }

    let mut result = Vec::new();
    for _ in 0..max_code_length {
        result.extend_from_slice(&xs);
        result = merge_pairs(result);
    }

    // choose N - 1 packages with least counts where N = xs.len()
    let idx = result.len() + 1 - xs.len();
    if idx >= result.len() {
        // corner case
        return code_lengths;
    }
    result.select_nth_unstable(idx); // last N-1 items correspond to N-1 packages with least counts

    for x in result.into_iter().skip(idx) {
        for s in x.0.symbols() {
            code_lengths[*s] += 1;
        }
    }

    code_lengths
}

/// given packages, merge two least-count packages into one
fn merge_pairs(mut src: Vec<Reverse<Package>>) -> Vec<Reverse<Package>> {
    src.sort(); // descending order by count
    let mut result = Vec::new();
    while src.len() >= 2 {
        let x = src.pop().unwrap();
        let y = src.pop().unwrap();
        let merged = x.0 + y.0;
        result.push(Reverse(merged));
    }
    result
}

Alright, hopefully it wasn’t too bad. Now, let’s add some unit tests to see if this works as expected.

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn test0() {
        let freq = vec![0, 1];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 1);
        assert_eq!(bitlen, [0, 1]);
    }

    #[test]
    fn test1() {
        let freq = vec![40, 35, 20, 5];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 3);
        assert_eq!(bitlen, [1, 2, 3, 3]);
    }

    #[test]
    fn test1a() {
        let freq = vec![40, 35, 20, 5];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 2);
        assert_eq!(bitlen, [2, 2, 2, 2]);
    }

    #[test]
    fn test2() {
        let freq = vec![1];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 1);
        assert_eq!(bitlen, [1]);
    }

    #[test]
    fn test3() {
        let freq = vec![40, 35, 20, 0];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 2);
        assert_eq!(bitlen, [1, 2, 2, 0]);
    }

    #[test]
    fn test4() {
        let freq = vec![7, 4, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 5);
        assert_eq!(bitlen, [3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5]);
    }

    #[test]
    fn test5() {
        let freq = vec![0, 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 10);
        assert_eq!(bitlen, [0, 10, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]);
    }

    #[test]
    fn test5a() {
        let freq = vec![0, 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 9);
        assert_eq!(bitlen, [0, 9, 9, 9, 9, 7, 6, 5, 4, 3, 2, 1]);
    }

    #[test]
    fn test5b() {
        let freq = vec![0, 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 8);
        assert_eq!(bitlen, [0, 8, 8, 8, 8, 7, 7, 5, 4, 3, 2, 1]);
    }

    #[test]
    fn test5c() {
        let freq = vec![0, 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 7);
        assert_eq!(bitlen, [0, 7, 7, 7, 7, 7, 7, 6, 4, 3, 2, 1]);
    }

    #[test]
    fn test5d() {
        let freq = vec![0, 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 6);
        assert_eq!(bitlen, [0, 6, 6, 6, 6, 6, 6, 5, 4, 4, 2, 1]);
    }

    #[test]
    fn test6() {
        let freq = vec![];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 0);
        assert_eq!(bitlen, []);
    }

    #[test]
    fn test7() {
        let freq = vec![0, 0, 0, 0, 0];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 0);
        assert_eq!(bitlen, [0, 0, 0, 0, 0]);
    }

    #[test]
    fn test8() {
        let freq = vec![0, 0, 0, 1, 10000];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 1);
        assert_eq!(bitlen, [0, 0, 0, 1, 1]);
    }

    #[test]
    fn test8b() {
        let freq = vec![0, 0, 0, 1, 10000];
        let bitlen = generate_canonical_huffman_code_with_constraint(freq.into_iter(), 2);
        assert_eq!(bitlen, [0, 0, 0, 1, 1]);
    }
}

You can find full source code here.

Huffman coding with length constraint

Unconstrained

Length-Constrained

References