Rust — write gunzip from scratch 5

In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below

part 1: main() function and skeletal structure
part 2: bitread module for reading bits from a byte stream
part 3: gzip header & footer for parsing the metadata and checksum
part 4: inflate for block type 0 (uncompressed) data
part 5: codebook and huffman_decoder modules for decoding Huffman codes
part 6: lz77 and sliding_window modules for decompressing LZ77-encoded data
part 7: inflate for block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes
part 8: checksum_write module for verifying the decompressed data
part 9: performance optimization
part 10: multithread support
part 11: streaming support
part 12: memory optimization
part 13: bug fix to reject non-compliant .gz file

Before we can write code in Inflate, we first need to have a good understanding of the deflate algorithm. There are great resources to this, so feel free to checkout these: 1 2 3 4 5. Here is my attempt to explain what it does.

Today, we will implement HuffmanDecoder and Codebook. It will be a bit more involved than previous articles, so buckle up! Before we go any further, let us define some terms we will use throughout the series

bits: continuous stream of bits
bitcode: a sequence of bits that make up a Huffman code. We say the length of the bitcode to be the number of bits of in the bitcode
symbol: what a Huffman bitcode actually represents. In deflate, a symbol is a numerical value, corresponding to 0~287 for literal/length code and 0~29 for distance code.
code: a short-hand name for LZ77 code

A Huffman decoder should be able to identify a bitcode within a continuous stream of bits. Once the bitcode is identified, it knows what symbol the bitcode represents. It also needs to return the length of the bitcode so that BitRead can consume the number of bits corresponding to the symbol.

Suppose we are provided with the Huffman code below. When input bits 011001 is provided, the Huffman decoder shall be able to chunk the bits into bitcodes 011|00|1 and map each symbols C|A|B and the lengths 3|2|1.

The Codebook is this actual mapping between Symbols and Bitcodes. In deflate algorithm, there are restrictions on the mapping

RFC1951

These rules enable efficient representation of the codebook because a list of bitcode lengths of each symbol is enough to unambiguously represent the mapping. The example mapping above will be re-coded following the rules:

This means that given the code lengths [2,1,3,3], we can reconstruct the mapping above, which is explained with code in RFC1951. So, that will be the job of our codebook.

Let’s implement Codebook following RFC1951. This should be fairly straightforward, as the source code is already provided in the specification. The only additional functionality we will need is to store the maximum length of the bitcode within the codebook, which we will need later in Huffman Decoder.

RFC1951

// lib/codebook.rs
use crate::error::{Error, Result}; // todo

pub const MAX_CODELENGTH: u32 = 15;
pub const MAX_LL_SYMBOL: u32 = 288;

pub struct CodeBook {
    tree: Vec<(u32, u32)>, // bitcode, length
    max_length: u32,
}

impl CodeBook {
    pub fn new(lengths: &[u32]) -> Result<Self> {
        let err = Err(Error::InvalidCodeLengths);

        // follow RFC1951 https://www.rfc-editor.org/rfc/rfc1951#ref-1
        if lengths.is_empty() || lengths.len() > MAX_LL_SYMBOL as usize + 1 {
            return err;
        }

        let mut tree = Vec::with_capacity(lengths.len());
        let mut max_len = 0; // max(lengths)

        // step 1
        // # of codes having bitcode length count
        let mut bl_count = [0 as u32; MAX_CODELENGTH as usize + 1];
        for l in lengths {
            bl_count[*l as usize] += 1;
            tree.push((0, *l));
            max_len = max_len.max(*l);
        }

        if max_len > MAX_CODELENGTH {
            return err;
        }

        // step 2
        let mut next_code = [0 as u32; MAX_CODELENGTH as usize + 1];
        let mut code = 0;
        bl_count[0] = 0; // this is a must!!
        for bits in 1..=max_len as usize {
            code = (code + bl_count[bits - 1]) << 1;
            next_code[bits] = code;
        }

        // step 3
        for pair in &mut tree {
            let len = pair.1 as usize;
            if len != 0 {
                pair.0 = next_code[len];
                next_code[len] += 1;
            }
        }

        Ok(Self {
            tree,
            max_length: max_len,
        })
    }

    /// maximum number of bits within the codebook
    pub fn max_length(&self) -> u32 {
        self.max_length
    }
}

Don’t forget to import the new module and add the error variant.

// append to src/lib.rs
pub mod codebook;

// replace in src/error.rs
pub enum Error {
    StdIoError(ErrorKind),
    EmptyInput,
    InvalidGzHeader,
    InvalidBlockType,
    BlockType0LenMismatch,
    InvalidCodeLengths,
}

We are not quite done with Codebook just yet. Block type 1 uses static Huffman codes, where the codebooks are fixed. Let us add two more constructors to Codebook to initialize with static Huffman codes

// append to Codebook in lib/codebook.rs
    pub fn default_ll() -> Self {
        let lengths = [
            8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
            8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
            8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
            8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
            8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9,
            9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
            9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
            9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
            9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7, 7, 7, 7, 7,
            7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8,
        ];

        Self::new(&lengths).unwrap()
    }

    pub fn default_dist() -> Self {
        let lengths = [5; 30];
        Self::new(&lengths).unwrap()
    }

These codebooks correspond to the literal/length codebook and a distance codebook as specified in RFC1951

RFC1951

Now, we are ready to implement HuffmanDecoder. A naive implementation would be to use a binary tree to represent a prefix code and traverse the tree by reading one bit at a time, as shown below.

RFC1951

This may be the easiest to implement, but also the least efficient one, as we need to read and process a bit by bit. A more efficient way is to read multiple bits at once and decode a symbol using a lookup table. For that, the decoder will take in at least 15-bits at a time (the longest bitcode) and return not only the Symbol but also the Length of the bitcode. The caller can then call consume() method of the BitRead.

So, the implementation comes down to creating a decoding look up table of size 2¹⁵ that maps every possible 15-bits input the corresponding to (Symbol, Length) tuple. This turns out to be much more efficient than the naive prefix-code approach, but still not the most efficient one. For starters, it is a lot of work to precompute decoding table for every 2¹⁵ bit combination, but also memory lookup speed will significantly slow down when the lookup table size grows larger than the CPU cache size. Note that 2¹⁵ by itself is already 32k, so storing just a byte alone in the table is already approaching L1 cache size of modern CPUs.

A better solution would be to decode in a hierarchical two step processes. Create two lookup tables. The first table maps at most N-bits of the input, and the second table maps the rest 15-N bits. The number N here would be a fixed hyperparameter that we could tune later for best performance. This two-step decoding is more efficient because more frequent symbols will be encoded with shorter bitcodes, while rarer symbols will be encoded with longer bitcodes. Hence, we can assume that the majority of the symbols shall only require a single table lookup.

There is one caveat when dealing with bits: bit ordering. In RFC1951, the Huffman codes are read from the most significant bit to the least significant bit, which is the opposite of how bits are read in our BitRead. For that, we will need a function that efficiently reverses the bits.

RFC1951

Let’s implement this function first. Below is one way to do it. Note that we are only reversing up to 16-bits , as the longest code length is 15-bits.

// src/huffman_decoder.rs
// https://stackoverflow.com/questions/2602823/in-c-c-whats-the-simplest-way-to-reverse-the-order-of-bits-in-a-byte
fn reverse_bits(mut bits: u32) -> u32 {
    bits = (bits & 0xFF00) >> 8 | (bits & 0x00FF) << 8;
    bits = (bits & 0xF0F0) >> 4 | (bits & 0x0F0F) << 4;
    bits = (bits & 0xCCCC) >> 2 | (bits & 0x3333) << 2;
    bits = (bits & 0xAAAA) >> 1 | (bits & 0x5555) << 1;
    bits
}

// append to src/lib.rs
pub mod huffman_decoder;

Below is implementation of HuffmanDecoder. I have documented the code to help understand what is going on. In a nutshell, we concatenate single-step lookup and second-step lookup tables into one big table. We first lookup the bits through the first-step lookup table. If the length is less or equal to NUM_BITS_FIRST_LOOKUP, we read the symbol and return. If not, we jump to the second table and do another lookup, from which we obtain the symbol.

// append to huffman_decoder.rs
use crate::error::{Error, Result};
use crate::codebook::CodeBook;

const NUM_BITS_FIRST_LOOKUP: u32 = 9;

pub struct HuffmanDecoder {
    /// lookup[bitcode] where bitcode is the bit-order reversed huffman code, right-aligned
    /// this is so that we can feed in multiple bits at once, from right to left
    lookup: Vec<(u32, u32)>, // symbol, length
    primary_mask: u32, // mask NUM_BITS_FIRST_LOOKUP bits
    secondary_mask: u32, // mask the rest of the bits
}

impl HuffmanDecoder {
    pub fn new(codebook: CodeBook) -> Self {
        let mut lookup = Vec::new();
        let max_nbits = codebook.max_length();
        let (nbits, secondary_mask) = if max_nbits > NUM_BITS_FIRST_LOOKUP {
            (
                NUM_BITS_FIRST_LOOKUP,
                (1 << (max_nbits - NUM_BITS_FIRST_LOOKUP)) - 1,
            )
        } else {
            // all codes can be decoded in a single step
            // secondary mask not needed
            (max_nbits, 0)
        };
        let primary_mask = (1 << nbits) - 1;

        lookup.resize(1 << nbits, (0, 0));
        for (symbol, (mut bitcode, len)) in codebook.into_iter().enumerate() {
            if len == 0 {
                continue;
            }

            // reverse bit-order of huffman
            bitcode = reverse_bits(bitcode);
            bitcode >>= 16 - len; // right-align
            if len <= nbits {
                // single-step lookup
                // populate table
                let delta = nbits - len;
                for idx in 0..(1 << delta) {
                    lookup[(bitcode | (idx << len)) as usize] = (symbol as u32, len);
                }
            } else {
                // two-step lookup
                // symbol will point to the base index for the second lookup step
                let base = (bitcode & primary_mask) as usize;
                let offset = if lookup[base].0 == 0 {
                    // first time the base has been seen
                    // will append the second-step lookup table
                    let offset = lookup.len() as u32;
                    // set up the index
                    lookup[base] = (offset, len);
                    // make room for the second-step lookup table
                    let new_len = lookup.len() + (1 << (max_nbits - nbits));
                    lookup.resize(new_len, (0, 0));
                    offset
                } else {
                    // second-step lookup already created
                    lookup[base].0
                };
                let secondary_len = len - nbits;
                let base = offset + ((bitcode >> nbits) & secondary_mask);
                // populate second-step lookup table
                for idx in 0..(1 << (max_nbits - len)) {
                    lookup[(base + (idx << secondary_len)) as usize] = (symbol as u32, len);
                }
            }
        }

        Self {
            lookup,
            primary_mask,
            secondary_mask
        }
    }

    /// Look up the code given at least max_length bits
    /// Returns symbol and its length upon match
    pub fn decode(&self, bits: u32) -> Result<(u32, u32)> {
        let (symbol, len) = self.lookup[(bits & self.primary_mask) as usize];
        if len == 0 {
            Err(Error::HuffmanDecoderCodeNotFound)
        } else if len <= NUM_BITS_FIRST_LOOKUP {
            Ok((symbol, len))
        } else {
            let base = symbol as usize;
            let idx = (bits >> NUM_BITS_FIRST_LOOKUP) & self.secondary_mask;
            Ok(self.lookup[base + idx as usize])
        }
    }
}

// replace in src/error.rs
pub enum Error {
    StdIoError(ErrorKind),
    EmptyInput,
    InvalidGzHeader,
    InvalidBlockType,
    BlockType0LenMismatch,
    InvalidCodeLengths,
    HuffmanDecoderCodeNotFound,
}

Note that NUM_BITS_FIRST_LOOKUP is arbitrarily set to 9 here, but one can choose any number from 1 to 15. Later when we complete implementation of gunzip, feel free to experiment and see much impact this parameter has.

Finally, we also need to implement IntoIter trait for Codebook so that we can efficiently read (Symbol, Length) tuples.

// append to src/codebook.rs
impl IntoIterator for CodeBook {
    type Item = (u32, u32);

    type IntoIter = std::vec::IntoIter<Self::Item>;

    fn into_iter(self) -> Self::IntoIter {
        self.tree.into_iter()
    }
}

With this, we should be able to compile successfully. We will implement LZ77 decoding algorithm in the next article, so stay tuned!

Previous in series: https://techhara.me/posts/2023-09-29_rust—write-gunzip-from-scratch-4-d20f9da26f86/

Next in series: https://techhara.me/posts/2023-10-04_rust—write-gunzip-from-scratch-6-7bc9e8b2c08e/