Rust — write gunzip from scratch 7 Link to heading
In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below
- part 1:
main()function and skeletal structure - part 2:
bitreadmodule for reading bits from a byte stream - part 3:
gzipheader & footer for parsing the metadata and checksum - part 4:
inflatefor block type 0 (uncompressed) data - part 5:
codebookandhuffman_decodermodules for decoding Huffman codes - part 6:
lz77andsliding_windowmodules for decompressing LZ77-encoded data - part 7:
inflatefor block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes - part 8:
checksum_writemodule for verifying the decompressed data - part 9: performance optimization
- part 10: multithread support
- part 11: streaming support
- part 12: memory optimization
- part 13: bug fix to reject non-compliant
.gzfile
Let’s complete inflate_block1() and inflate_block2() methods today. The only difference between the two blocks is whether custom (dynamic) codebooks are used or not. Therefore, it is a good idea to define a few more functions so that we can share as much code as possible
// append to src/inflate.rs
use crate::codebook::CodeBook;
use crate::huffman_decoder::HuffmanDecoder;
// replace in src/inflate.rs
fn inflate_block1(&mut self) -> Result<()> {
let ll_decoder = HuffmanDecoder::new(CodeBook::default_ll());
let dist_decoder = HuffmanDecoder::new(CodeBook::default_dist());
self.inflate(ll_decoder, dist_decoder)
}
fn inflate_block2(&mut self) -> Result<()> {
let (ll_decoder, dist_decoder) = self.read_dynamic_codebooks()?;
self.inflate(ll_decoder, dist_decoder)
}
// add methods to implementation of Inflate struct in src/inflate.rs
fn inflate(&mut self, ll_decoder: HuffmanDecoder, dist_decoder: HuffmanDecoder) -> Result<()> {
todo!()
}
fn read_dynamic_codebooks(&mut self) -> Result<(HuffmanDecoder, HuffmanDecoder)> {
todo!()
}
Let’s first implement read_dynamic_codebooks() method. In the block type2, code lengths for both literal/length codebook and distance codebook immediately follow the header bits. For greater compactness, they are encoded in Huffman code themselves

RFC1951 provides detailed specification, so we just need to implement it accordingly

// implement in src/inflate.rs
fn read_dynamic_codebooks(&mut self) -> Result<(HuffmanDecoder, HuffmanDecoder)> {
let hlit = self.reader.read_bits(5)? as usize + 257;
let hdist = self.reader.read_bits(5)? as usize + 1;
let hclen = self.reader.read_bits(4)? as usize + 4;
let mut cl_lengths = [0 as u32; 19];
for idx in [
16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15,
]
.into_iter()
.take(hclen)
{
cl_lengths[idx] = self.reader.read_bits(3)? as u32;
}
let cl_codes = CodeBook::new(&cl_lengths)?;
let cl_decoder = HuffmanDecoder::new(cl_codes);
// The code lengths contain LL codes and Distance codes as a single table
let num_codes = hlit + hdist;
let mut lengths = Vec::with_capacity(num_codes);
while lengths.len() < num_codes {
let (cl_code, len) = cl_decoder
.decode(self.reader.peek_bits()?)
.or(Err(Error::ReadDynamicCodebook))?;
self.reader.consume(len);
match cl_code {
0..=15 => {
lengths.push(cl_code as u32);
}
16 if !lengths.is_empty() => {
let length = 3 + self.reader.read_bits(2)? as usize;
let x = *lengths.last().unwrap();
lengths.resize(lengths.len() + length, x);
}
17 => {
let length = 3 + self.reader.read_bits(3)? as usize;
lengths.resize(lengths.len() + length, 0);
}
18 => {
let length = 11 + self.reader.read_bits(7)? as usize;
lengths.resize(lengths.len() + length, 0);
}
_ => {
unreachable!()
}
}
}
if lengths.len() != num_codes {
return Err(Error::ReadDynamicCodebook);
}
let ll_codes = CodeBook::new(&lengths[..hlit])?;
let dist_codes = CodeBook::new(&lengths[hlit..])?;
Ok((
HuffmanDecoder::new(ll_codes),
HuffmanDecoder::new(dist_codes),
))
}
// replace in lib/error.rs
pub enum Error {
StdIoError(ErrorKind),
EmptyInput,
InvalidGzHeader,
InvalidBlockType,
BlockType0LenMismatch,
InvalidCodeLengths,
HuffmanDecoderCodeNotFound,
DistanceTooMuch,
EndOfBlockNotFound,
ReadDynamicCodebook,
}
Note that we are reading a single code length table, which we will split into two, as stated in RFC1951

To implement inflate() method in Inflate struct, we first need to implement a code iterator which is used in decode() function in lz77 module. Let’s create a struct CodeIterator
// append to src/lz77.rs
use crate::bitread::BitRead;
use crate::huffman_decoder::HuffmanDecoder;
pub struct CodeIterator<B: BitRead> {
reader: B,
ll_decoder: HuffmanDecoder,
dist_decoder: HuffmanDecoder,
}
impl<B: BitRead> CodeIterator<B> {
pub fn new(reader: B, ll_decoder: HuffmanDecoder, dist_decoder: HuffmanDecoder) -> Self {
Self {
reader,
ll_decoder,
dist_decoder,
}
}
}
impl<B: BitRead> Iterator for CodeIterator<B> {
type Item = Result<Code>;
fn next(&mut self) -> Option<Self::Item> {
todo!()
}
}
The code iterator needs three components: a BitRead and literal/length and distance HuffmanDecoder. These will be arguments to decode() function.
Before we implement the code iterator, here is a recap of decoding algorithm from RFC1951

We can implement the iterator directly in next() method, but this will make the code verbose as we can’t easily propagate error with ? operator due to misalignment in the return value Option<_> vs Result<_>. Hence, it is easier to delegate it to a new method next_helper() that returns Result<_> as below
// append to src/lz77.rs
use std::cmp::Ordering::*;
// add to src/lz77.rs CodeIterator impl
fn next_helper(&mut self) -> Result<Code> {
let bitcode = self.reader.peek_bits()?;
let (symbol, len) = self.ll_decoder.decode(bitcode)?;
self.reader.consume(len);
match symbol.cmp(&END_OF_BLOCK) {
Less => Ok(Code::Literal(symbol as u8)),
Equal => Ok(Code::EndOfBlock),
Greater => {
let (bits, mut length) = SYMBOL2BITS_LENGTH[(symbol & 0xFF) as usize];
length += self.reader.read_bits(bits)?;
let bitcode = self.reader.peek_bits()?;
let (symbol, len) = self.dist_decoder.decode(bitcode)?;
self.reader.consume(len);
let (bits, mut distance) = SYMBOL2BITS_DISTANCE[symbol as usize];
distance += self.reader.read_bits(bits)?;
Ok(Code::Dictionary { distance: distance as u16, length: length as u16 })
}
}
}
// replace in src/lz77.rs Iterator impl
fn next(&mut self) -> Option<Self::Item> {
Some(self.next_helper())
}
// append to rc/lz77.rs
pub const SYMBOL2BITS_LENGTH: [(u32, u32); 30] = [
(0, 0),
(0, 3),
(0, 4),
(0, 5),
(0, 6),
(0, 7),
(0, 8),
(0, 9),
(0, 10),
(1, 11),
(1, 13),
(1, 15),
(1, 17),
(2, 19),
(2, 23),
(2, 27),
(2, 31),
(3, 35),
(3, 43),
(3, 51),
(3, 59),
(4, 67),
(4, 83),
(4, 99),
(4, 115),
(5, 131),
(5, 163),
(5, 195),
(5, 227),
(0, 258),
];
pub const SYMBOL2BITS_DISTANCE: [(u32, u32); 30] = [
(0, 1),
(0, 2),
(0, 3),
(0, 4),
(1, 5),
(1, 7),
(2, 9),
(2, 13),
(3, 17),
(3, 25),
(4, 33),
(4, 49),
(5, 65),
(5, 97),
(6, 129),
(6, 193),
(7, 257),
(7, 385),
(8, 513),
(8, 769),
(9, 1025),
(9, 1537),
(10, 2049),
(10, 3073),
(11, 4097),
(11, 6145),
(12, 8193),
(12, 12289),
(13, 16385),
(13, 24577),
];
The constant tables are directly from RFC1951


Finally, we are ready to implement inflate() method in src/inflate.rs module.
// append to src/inflate.rs
use crate::lz77::{CodeIterator, DecodeResult, decode};
// replace in src/inflate.rs
fn inflate(&mut self, ll_decoder: HuffmanDecoder, dist_decoder: HuffmanDecoder) -> Result<()> {
let mut iter = CodeIterator::new(&mut self.reader, ll_decoder, dist_decoder);
let mut done = false;
loop {
let boundary = self.window.boundary();
let n = match decode(self.window.buffer(), boundary, &mut iter)? {
DecodeResult::Done(n) => {
done = true;
n
}
DecodeResult::WindowIsFull(n) => n,
DecodeResult::Error(e) => {
return Err(e);
}
};
self.writer.write_all(&self.window.write_buffer()[..n])?;
self.window.slide(n);
if done {
break;
}
}
Ok(())
}
All it does is to provide glue between different components, i.e., decode() function, SlidingWindow struct, and Write trait.
As always, we should make sure the code compiles. We have completed virtually all the hard work. In the next article, we will finalize the implementation and run tests, so stay tuned!
Previous in series: https://medium.com/@techhara/rust-write-gunzip-from-scratch-6-7bc9e8b2c08e
Next in series: https://medium.com/@techhara/rust-write-gunzip-from-scratch-8-fa51fa98aa60