Rust — write gunzip from scratch 7 Link to heading

In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below

  • part 1: main() function and skeletal structure
  • part 2: bitread module for reading bits from a byte stream
  • part 3: gzip header & footer for parsing the metadata and checksum
  • part 4: inflate for block type 0 (uncompressed) data
  • part 5: codebook and huffman_decoder modules for decoding Huffman codes
  • part 6: lz77 and sliding_window modules for decompressing LZ77-encoded data
  • part 7: inflate for block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes
  • part 8: checksum_write module for verifying the decompressed data
  • part 9: performance optimization
  • part 10: multithread support
  • part 11: streaming support
  • part 12: memory optimization
  • part 13: bug fix to reject non-compliant .gz file

Image

Let’s complete inflate_block1() and inflate_block2() methods today. The only difference between the two blocks is whether custom (dynamic) codebooks are used or not. Therefore, it is a good idea to define a few more functions so that we can share as much code as possible

// append to src/inflate.rs
use crate::codebook::CodeBook;
use crate::huffman_decoder::HuffmanDecoder;

// replace in src/inflate.rs
    fn inflate_block1(&mut self) -> Result<()> {
        let ll_decoder = HuffmanDecoder::new(CodeBook::default_ll());
        let dist_decoder = HuffmanDecoder::new(CodeBook::default_dist());
        self.inflate(ll_decoder, dist_decoder)
    }

    fn inflate_block2(&mut self) -> Result<()> {
        let (ll_decoder, dist_decoder) = self.read_dynamic_codebooks()?;
        self.inflate(ll_decoder, dist_decoder)
    }

// add methods to implementation of Inflate struct in src/inflate.rs
    fn inflate(&mut self, ll_decoder: HuffmanDecoder, dist_decoder: HuffmanDecoder) -> Result<()> {
        todo!()
    }

    fn read_dynamic_codebooks(&mut self) -> Result<(HuffmanDecoder, HuffmanDecoder)> {
        todo!()
    }

Let’s first implement read_dynamic_codebooks() method. In the block type2, code lengths for both literal/length codebook and distance codebook immediately follow the header bits. For greater compactness, they are encoded in Huffman code themselves

RFC1951

RFC1951 provides detailed specification, so we just need to implement it accordingly

RFC1951

// implement in src/inflate.rs
    fn read_dynamic_codebooks(&mut self) -> Result<(HuffmanDecoder, HuffmanDecoder)> {
        let hlit = self.reader.read_bits(5)? as usize + 257;
        let hdist = self.reader.read_bits(5)? as usize + 1;
        let hclen = self.reader.read_bits(4)? as usize + 4;
        let mut cl_lengths = [0 as u32; 19];
        for idx in [
            16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15,
        ]
        .into_iter()
        .take(hclen)
        {
            cl_lengths[idx] = self.reader.read_bits(3)? as u32;
        }
        let cl_codes = CodeBook::new(&cl_lengths)?;
        let cl_decoder = HuffmanDecoder::new(cl_codes);

        // The code lengths contain LL codes and Distance codes as a single table
        let num_codes = hlit + hdist;
        let mut lengths = Vec::with_capacity(num_codes);
        while lengths.len() < num_codes {
            let (cl_code, len) = cl_decoder
                .decode(self.reader.peek_bits()?)
                .or(Err(Error::ReadDynamicCodebook))?;
            self.reader.consume(len);
            match cl_code {
                0..=15 => {
                    lengths.push(cl_code as u32);
                }
                16 if !lengths.is_empty() => {
                    let length = 3 + self.reader.read_bits(2)? as usize;
                    let x = *lengths.last().unwrap();
                    lengths.resize(lengths.len() + length, x);
                }
                17 => {
                    let length = 3 + self.reader.read_bits(3)? as usize;
                    lengths.resize(lengths.len() + length, 0);
                }
                18 => {
                    let length = 11 + self.reader.read_bits(7)? as usize;
                    lengths.resize(lengths.len() + length, 0);
                }
                _ => {
                    unreachable!()
                }
            }
        }

        if lengths.len() != num_codes {
            return Err(Error::ReadDynamicCodebook);
        }

        let ll_codes = CodeBook::new(&lengths[..hlit])?;
        let dist_codes = CodeBook::new(&lengths[hlit..])?;
        Ok((
            HuffmanDecoder::new(ll_codes),
            HuffmanDecoder::new(dist_codes),
        ))
    }

// replace in lib/error.rs
pub enum Error {
    StdIoError(ErrorKind),
    EmptyInput,
    InvalidGzHeader,
    InvalidBlockType,
    BlockType0LenMismatch,
    InvalidCodeLengths,
    HuffmanDecoderCodeNotFound,
    DistanceTooMuch,
    EndOfBlockNotFound,
    ReadDynamicCodebook,
}

Note that we are reading a single code length table, which we will split into two, as stated in RFC1951

RFC1951

To implement inflate() method in Inflate struct, we first need to implement a code iterator which is used in decode() function in lz77 module. Let’s create a struct CodeIterator

// append to src/lz77.rs
use crate::bitread::BitRead;
use crate::huffman_decoder::HuffmanDecoder;

pub struct CodeIterator<B: BitRead> {
    reader: B,
    ll_decoder: HuffmanDecoder,
    dist_decoder: HuffmanDecoder,
}

impl<B: BitRead> CodeIterator<B> {
    pub fn new(reader: B, ll_decoder: HuffmanDecoder, dist_decoder: HuffmanDecoder) -> Self {
        Self {
            reader,
            ll_decoder,
            dist_decoder,
        }
    }
}

impl<B: BitRead> Iterator for CodeIterator<B> {
    type Item = Result<Code>;

    fn next(&mut self) -> Option<Self::Item> {
        todo!()
    }
}

The code iterator needs three components: a BitRead and literal/length and distance HuffmanDecoder. These will be arguments to decode() function.

Before we implement the code iterator, here is a recap of decoding algorithm from RFC1951

RFC1951

We can implement the iterator directly in next() method, but this will make the code verbose as we can’t easily propagate error with ? operator due to misalignment in the return value Option<_> vs Result<_>. Hence, it is easier to delegate it to a new method next_helper() that returns Result<_> as below

// append to src/lz77.rs
use std::cmp::Ordering::*;

// add to src/lz77.rs CodeIterator impl
    fn next_helper(&mut self) -> Result<Code> {
        let bitcode = self.reader.peek_bits()?;
        let (symbol, len) = self.ll_decoder.decode(bitcode)?;
        self.reader.consume(len);
        match symbol.cmp(&END_OF_BLOCK) {
            Less => Ok(Code::Literal(symbol as u8)),
            Equal => Ok(Code::EndOfBlock),
            Greater => {
                let (bits, mut length) = SYMBOL2BITS_LENGTH[(symbol & 0xFF) as usize];
                length += self.reader.read_bits(bits)?;
                let bitcode = self.reader.peek_bits()?;
                let (symbol, len) = self.dist_decoder.decode(bitcode)?;
                self.reader.consume(len);
                let (bits, mut distance) = SYMBOL2BITS_DISTANCE[symbol as usize];
                distance += self.reader.read_bits(bits)?;
                Ok(Code::Dictionary { distance: distance as u16, length: length as u16 })
            }
        }
    }

// replace in src/lz77.rs Iterator impl
    fn next(&mut self) -> Option<Self::Item> {
        Some(self.next_helper())
    }


// append to rc/lz77.rs
pub const SYMBOL2BITS_LENGTH: [(u32, u32); 30] = [
    (0, 0),
    (0, 3),
    (0, 4),
    (0, 5),
    (0, 6),
    (0, 7),
    (0, 8),
    (0, 9),
    (0, 10),
    (1, 11),
    (1, 13),
    (1, 15),
    (1, 17),
    (2, 19),
    (2, 23),
    (2, 27),
    (2, 31),
    (3, 35),
    (3, 43),
    (3, 51),
    (3, 59),
    (4, 67),
    (4, 83),
    (4, 99),
    (4, 115),
    (5, 131),
    (5, 163),
    (5, 195),
    (5, 227),
    (0, 258),
];

pub const SYMBOL2BITS_DISTANCE: [(u32, u32); 30] = [
    (0, 1),
    (0, 2),
    (0, 3),
    (0, 4),
    (1, 5),
    (1, 7),
    (2, 9),
    (2, 13),
    (3, 17),
    (3, 25),
    (4, 33),
    (4, 49),
    (5, 65),
    (5, 97),
    (6, 129),
    (6, 193),
    (7, 257),
    (7, 385),
    (8, 513),
    (8, 769),
    (9, 1025),
    (9, 1537),
    (10, 2049),
    (10, 3073),
    (11, 4097),
    (11, 6145),
    (12, 8193),
    (12, 12289),
    (13, 16385),
    (13, 24577),
];

The constant tables are directly from RFC1951

RFC1951 length table

RFC1951 distance table

Finally, we are ready to implement inflate() method in src/inflate.rs module.

// append to src/inflate.rs
use crate::lz77::{CodeIterator, DecodeResult, decode};

// replace in src/inflate.rs
    fn inflate(&mut self, ll_decoder: HuffmanDecoder, dist_decoder: HuffmanDecoder) -> Result<()> {
        let mut iter = CodeIterator::new(&mut self.reader, ll_decoder, dist_decoder);

        let mut done = false;
        loop {
            let boundary = self.window.boundary();
            let n = match decode(self.window.buffer(), boundary, &mut iter)? {
                DecodeResult::Done(n) => {
                    done = true;
                    n
                }
                DecodeResult::WindowIsFull(n) => n,
                DecodeResult::Error(e) => {
                    return Err(e);
                }
            };
            self.writer.write_all(&self.window.write_buffer()[..n])?;
            self.window.slide(n);
            if done {
                break;
            }
        }
        Ok(())
    }

All it does is to provide glue between different components, i.e., decode() function, SlidingWindow struct, and Write trait.

As always, we should make sure the code compiles. We have completed virtually all the hard work. In the next article, we will finalize the implementation and run tests, so stay tuned!

Previous in series: https://medium.com/@techhara/rust-write-gunzip-from-scratch-6-7bc9e8b2c08e

Next in series: https://medium.com/@techhara/rust-write-gunzip-from-scratch-8-fa51fa98aa60