Rust — write gunzip from scratch 5
In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below
- part 1:
main()function and skeletal structure - part 2:
bitreadmodule for reading bits from a byte stream - part 3:
gzipheader & footer for parsing the metadata and checksum - part 4:
inflatefor block type 0 (uncompressed) data - part 5:
codebookandhuffman_decodermodules for decoding Huffman codes - part 6:
lz77andsliding_windowmodules for decompressing LZ77-encoded data - part 7:
inflatefor block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes - part 8:
checksum_writemodule for verifying the decompressed data - part 9: performance optimization
- part 10: multithread support
- part 11: streaming support
- part 12: memory optimization
- part 13: bug fix to reject non-compliant
.gzfile
Before we can write code in Inflate, we first need to have a good understanding of the deflate algorithm. There are great resources to this, so feel free to checkout these: 1 2 3 4 5. Here is my attempt to explain what it does.

Today, we will implement HuffmanDecoder and Codebook. It will be a bit more involved than previous articles, so buckle up! Before we go any further, let us define some terms we will use throughout the series
- bits: continuous stream of bits
- bitcode: a sequence of bits that make up a Huffman code. We say the length of the bitcode to be the number of bits of in the bitcode
- symbol: what a Huffman bitcode actually represents. In
deflate, a symbol is a numericalvalue, corresponding to0~287for literal/length code and0~29for distance code. - code: a short-hand name for LZ77 code
A Huffman decoder should be able to identify a bitcode within a continuous stream of bits. Once the bitcode is identified, it knows what symbol the bitcode represents. It also needs to return the length of the bitcode so that BitRead can consume the number of bits corresponding to the symbol.
Suppose we are provided with the Huffman code below. When input bits 011001 is provided, the Huffman decoder shall be able to chunk the bits into bitcodes 011|00|1 and map each symbols C|A|B and the lengths 3|2|1.

The Codebook is this actual mapping between Symbols and Bitcodes. In deflate algorithm, there are restrictions on the mapping
RFC1951
These rules enable efficient representation of the codebook because a list of bitcode lengths of each symbol is enough to unambiguously represent the mapping. The example mapping above will be re-coded following the rules:

This means that given the code lengths [2,1,3,3], we can reconstruct the mapping above, which is explained with code in RFC1951. So, that will be the job of our codebook.
Let’s implement Codebook following RFC1951. This should be fairly straightforward, as the source code is already provided in the specification. The only additional functionality we will need is to store the maximum length of the bitcode within the codebook, which we will need later in Huffman Decoder.
RFC1951
// lib/codebook.rs
use crate::error::{Error, Result}; // todo
pub const MAX_CODELENGTH: u32 = 15;
pub const MAX_LL_SYMBOL: u32 = 288;
pub struct CodeBook {
tree: Vec<(u32, u32)>, // bitcode, length
max_length: u32,
}
impl CodeBook {
pub fn new(lengths: &[u32]) -> Result<Self> {
let err = Err(Error::InvalidCodeLengths);
// follow RFC1951 https://www.rfc-editor.org/rfc/rfc1951#ref-1
if lengths.is_empty() || lengths.len() > MAX_LL_SYMBOL as usize + 1 {
return err;
}
let mut tree = Vec::with_capacity(lengths.len());
let mut max_len = 0; // max(lengths)
// step 1
// # of codes having bitcode length count
let mut bl_count = [0 as u32; MAX_CODELENGTH as usize + 1];
for l in lengths {
bl_count[*l as usize] += 1;
tree.push((0, *l));
max_len = max_len.max(*l);
}
if max_len > MAX_CODELENGTH {
return err;
}
// step 2
let mut next_code = [0 as u32; MAX_CODELENGTH as usize + 1];
let mut code = 0;
bl_count[0] = 0; // this is a must!!
for bits in 1..=max_len as usize {
code = (code + bl_count[bits - 1]) << 1;
next_code[bits] = code;
}
// step 3
for pair in &mut tree {
let len = pair.1 as usize;
if len != 0 {
pair.0 = next_code[len];
next_code[len] += 1;
}
}
Ok(Self {
tree,
max_length: max_len,
})
}
/// maximum number of bits within the codebook
pub fn max_length(&self) -> u32 {
self.max_length
}
}
Don’t forget to import the new module and add the error variant.
// append to src/lib.rs
pub mod codebook;
// replace in src/error.rs
pub enum Error {
StdIoError(ErrorKind),
EmptyInput,
InvalidGzHeader,
InvalidBlockType,
BlockType0LenMismatch,
InvalidCodeLengths,
}
We are not quite done with Codebook just yet. Block type 1 uses static Huffman codes, where the codebooks are fixed. Let us add two more constructors to Codebook to initialize with static Huffman codes
// append to Codebook in lib/codebook.rs
pub fn default_ll() -> Self {
let lengths = [
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8,
];
Self::new(&lengths).unwrap()
}
pub fn default_dist() -> Self {
let lengths = [5; 30];
Self::new(&lengths).unwrap()
}
These codebooks correspond to the literal/length codebook and a distance codebook as specified in RFC1951
RFC1951
RFC1951
Now, we are ready to implement HuffmanDecoder. A naive implementation would be to use a binary tree to represent a prefix code and traverse the tree by reading one bit at a time, as shown below.
RFC1951
This may be the easiest to implement, but also the least efficient one, as we need to read and process a bit by bit. A more efficient way is to read multiple bits at once and decode a symbol using a lookup table. For that, the decoder will take in at least 15-bits at a time (the longest bitcode) and return not only the Symbol but also the Length of the bitcode. The caller can then call consume() method of the BitRead.
So, the implementation comes down to creating a decoding look up table of size 2¹⁵ that maps every possible 15-bits input the corresponding to (Symbol, Length) tuple. This turns out to be much more efficient than the naive prefix-code approach, but still not the most efficient one. For starters, it is a lot of work to precompute decoding table for every 2¹⁵ bit combination, but also memory lookup speed will significantly slow down when the lookup table size grows larger than the CPU cache size. Note that 2¹⁵ by itself is already 32k, so storing just a byte alone in the table is already approaching L1 cache size of modern CPUs.
A better solution would be to decode in a hierarchical two step processes. Create two lookup tables. The first table maps at most N-bits of the input, and the second table maps the rest 15-N bits. The number N here would be a fixed hyperparameter that we could tune later for best performance. This two-step decoding is more efficient because more frequent symbols will be encoded with shorter bitcodes, while rarer symbols will be encoded with longer bitcodes. Hence, we can assume that the majority of the symbols shall only require a single table lookup.
There is one caveat when dealing with bits: bit ordering. In RFC1951, the Huffman codes are read from the most significant bit to the least significant bit, which is the opposite of how bits are read in our BitRead. For that, we will need a function that efficiently reverses the bits.
RFC1951
Let’s implement this function first. Below is one way to do it. Note that we are only reversing up to 16-bits , as the longest code length is 15-bits.
// src/huffman_decoder.rs
// https://stackoverflow.com/questions/2602823/in-c-c-whats-the-simplest-way-to-reverse-the-order-of-bits-in-a-byte
fn reverse_bits(mut bits: u32) -> u32 {
bits = (bits & 0xFF00) >> 8 | (bits & 0x00FF) << 8;
bits = (bits & 0xF0F0) >> 4 | (bits & 0x0F0F) << 4;
bits = (bits & 0xCCCC) >> 2 | (bits & 0x3333) << 2;
bits = (bits & 0xAAAA) >> 1 | (bits & 0x5555) << 1;
bits
}
// append to src/lib.rs
pub mod huffman_decoder;
Below is implementation of HuffmanDecoder. I have documented the code to help understand what is going on. In a nutshell, we concatenate single-step lookup and second-step lookup tables into one big table. We first lookup the bits through the first-step lookup table. If the length is less or equal to NUM_BITS_FIRST_LOOKUP, we read the symbol and return. If not, we jump to the second table and do another lookup, from which we obtain the symbol.
// append to huffman_decoder.rs
use crate::error::{Error, Result};
use crate::codebook::CodeBook;
const NUM_BITS_FIRST_LOOKUP: u32 = 9;
pub struct HuffmanDecoder {
/// lookup[bitcode] where bitcode is the bit-order reversed huffman code, right-aligned
/// this is so that we can feed in multiple bits at once, from right to left
lookup: Vec<(u32, u32)>, // symbol, length
primary_mask: u32, // mask NUM_BITS_FIRST_LOOKUP bits
secondary_mask: u32, // mask the rest of the bits
}
impl HuffmanDecoder {
pub fn new(codebook: CodeBook) -> Self {
let mut lookup = Vec::new();
let max_nbits = codebook.max_length();
let (nbits, secondary_mask) = if max_nbits > NUM_BITS_FIRST_LOOKUP {
(
NUM_BITS_FIRST_LOOKUP,
(1 << (max_nbits - NUM_BITS_FIRST_LOOKUP)) - 1,
)
} else {
// all codes can be decoded in a single step
// secondary mask not needed
(max_nbits, 0)
};
let primary_mask = (1 << nbits) - 1;
lookup.resize(1 << nbits, (0, 0));
for (symbol, (mut bitcode, len)) in codebook.into_iter().enumerate() {
if len == 0 {
continue;
}
// reverse bit-order of huffman
bitcode = reverse_bits(bitcode);
bitcode >>= 16 - len; // right-align
if len <= nbits {
// single-step lookup
// populate table
let delta = nbits - len;
for idx in 0..(1 << delta) {
lookup[(bitcode | (idx << len)) as usize] = (symbol as u32, len);
}
} else {
// two-step lookup
// symbol will point to the base index for the second lookup step
let base = (bitcode & primary_mask) as usize;
let offset = if lookup[base].0 == 0 {
// first time the base has been seen
// will append the second-step lookup table
let offset = lookup.len() as u32;
// set up the index
lookup[base] = (offset, len);
// make room for the second-step lookup table
let new_len = lookup.len() + (1 << (max_nbits - nbits));
lookup.resize(new_len, (0, 0));
offset
} else {
// second-step lookup already created
lookup[base].0
};
let secondary_len = len - nbits;
let base = offset + ((bitcode >> nbits) & secondary_mask);
// populate second-step lookup table
for idx in 0..(1 << (max_nbits - len)) {
lookup[(base + (idx << secondary_len)) as usize] = (symbol as u32, len);
}
}
}
Self {
lookup,
primary_mask,
secondary_mask
}
}
/// Look up the code given at least max_length bits
/// Returns symbol and its length upon match
pub fn decode(&self, bits: u32) -> Result<(u32, u32)> {
let (symbol, len) = self.lookup[(bits & self.primary_mask) as usize];
if len == 0 {
Err(Error::HuffmanDecoderCodeNotFound)
} else if len <= NUM_BITS_FIRST_LOOKUP {
Ok((symbol, len))
} else {
let base = symbol as usize;
let idx = (bits >> NUM_BITS_FIRST_LOOKUP) & self.secondary_mask;
Ok(self.lookup[base + idx as usize])
}
}
}
// replace in src/error.rs
pub enum Error {
StdIoError(ErrorKind),
EmptyInput,
InvalidGzHeader,
InvalidBlockType,
BlockType0LenMismatch,
InvalidCodeLengths,
HuffmanDecoderCodeNotFound,
}
Note that NUM_BITS_FIRST_LOOKUP is arbitrarily set to 9 here, but one can choose any number from 1 to 15. Later when we complete implementation of gunzip, feel free to experiment and see much impact this parameter has.
Finally, we also need to implement IntoIter trait for Codebook so that we can efficiently read (Symbol, Length) tuples.
// append to src/codebook.rs
impl IntoIterator for CodeBook {
type Item = (u32, u32);
type IntoIter = std::vec::IntoIter<Self::Item>;
fn into_iter(self) -> Self::IntoIter {
self.tree.into_iter()
}
}
With this, we should be able to compile successfully. We will implement LZ77 decoding algorithm in the next article, so stay tuned!
Previous in series: https://techhara.me/posts/2023-09-29_rust—write-gunzip-from-scratch-4-d20f9da26f86/
Next in series: https://techhara.me/posts/2023-10-04_rust—write-gunzip-from-scratch-6-7bc9e8b2c08e/