Rust — write gunzip from scratch 4 Link to heading

In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below

  • part 1: main() function and skeletal structure
  • part 2: bitread module for reading bits from a byte stream
  • part 3: gzip header & footer for parsing the metadata and checksum
  • part 4: inflate for block type 0 (uncompressed) data
  • part 5: codebook and huffman_decoder modules for decoding Huffman codes
  • part 6: lz77 and sliding_window modules for decompressing LZ77-encoded data
  • part 7: inflate for block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes
  • part 8: checksum_write module for verifying the decompressed data
  • part 9: performance optimization
  • part 10: multithread support
  • part 11: streaming support
  • part 12: memory optimization
  • part 13: bug fix to reject non-compliant .gz file

Image

Today, we will finally dive into the block section within each member. As a reminder, here is the how .gz file is structured.

Image

The compression algorithm used in .gz file is called Deflate, and thus it is conventional to call decompression of Deflate as Inflate. We will start by modifying gunzip() function to call Inflate to take care of processing block sections.

// append to src/lib.rs
pub mod inflate;

use inflate::Inflate; // todo

// replace in src/lib.rs
pub fn gunzip(read: impl Read, mut write: impl Write) -> Result<()> {    
    let mut reader = BitReader::new(read);
    let mut member_idx = 0;

    while reader.has_data_left()? {
        Header::read(&mut reader)?;
        member_idx += 1;
        
        // read one or more blocks
        Inflate::new(&mut reader, &mut write).run()?;

        let footer = Footer::read(&mut reader)?;
        // TODO: do something with footer
    }

    if member_idx == 0 {
        Err(Error::EmptyInput)
    } else {
        Ok(())
    }
}

The to-be-created module Inflate will take in a BitRead and Write for reading and writing, and has a function run() to start decompression. Let’s create src/inflate.rs file and generate the skeleton

// lib/inflate.rs
use crate::bitread::BitRead;
use crate::error::Result;
use std::io::Write;

pub struct Inflate<R: BitRead, W: Write> {
    reader: R,
    writer: W,
}

impl<R: BitRead, W: Write> Inflate<R, W> {
    pub fn new(reader: R, writer: W) -> Self {
        Self { reader, writer }
    }

    pub fn run(mut self) -> Result<()> {
        todo!()
    }
}

Note that the signature run(mut self) -> Result<()>. It takes mut self rather than &mut self to prevent accidentally calling run() more than once.

If we try to compile at this point, it will fail. Just like in the previous article, this is because we need to blanket-implement BitRead for any &mut BitRead, since we are passing &mut BitReader to Inflate rather than the owned object BitReader itself.

// append src/bitread.rs
impl<R: BitRead> BitRead for &mut R {
    fn peek_bits(&mut self) -> std::io::Result<u32> {
        (**self).peek_bits()
    }

    fn consume(&mut self, n: u32) {
        (**self).consume(n)
    }

    fn byte_align(&mut self) {
        (**self).byte_align()
    }

    fn has_data_left(&mut self) -> std::io::Result<bool> {
        (**self).has_data_left()
    }
}

Finally, let’s look into the block section within .gz file. According to RFC1951, every block starts with 3-bit header. The first bit for BFINAL, indicating whether this is the last block within the member. The subsequent two-bits are BTYPE, which is either 00, 01, or 10. If we read 11 for the block type, the program shall exit indicating an error.

Image

One or more blocks are concatenated, and only the final block must have the BFINAL bit set to 1. The main logic for inflate, given we have functions for processing each type of block, is straightforward. Let’s write this part of the code

// append to src/inflate.rs
use crate::error::Error;

// modify in src/inflate.rs
    pub fn run(mut self) -> Result<()> {
        loop {
            let header = self.reader.read_bits(3)?;
            let is_final = header & 1 == 1;
            match header & 0b110 {
                0b000 => self.inflate_block0()?,
                0b010 => self.inflate_block1()?,
                0b100 => self.inflate_block2()?,
                _ => return Err(Error::InvalidBlockType),
            }
            if is_final {
                break;
            }
        }
        Ok(())
    }

// add to Inflate in src/inflate.rs
    fn inflate_block0(&mut self) -> Result<()> {
        todo!()
    }

    fn inflate_block1(&mut self) -> Result<()> {
        todo!()
    }

    fn inflate_block2(&mut self) -> Result<()> {
        todo!()
    }

// modify src/error.rs
pub enum Error {
    StdIoError(ErrorKind),
    EmptyInput,
    InvalidGzHeader,
    InvalidBlockType
}

Now, our task is to write three functions corresponding to each block type. For today, let’s do inflate_blcok0() method. This block contains uncompressed data preceded by 2-bytes long LEN and NLEN values.

Image

Here is our implementation

// replace in lib/error.rs
pub enum Error {
    StdIoError(ErrorKind),
    EmptyInput,
    InvalidGzHeader,
    InvalidBlockType,
    BlockType0LenMismatch
}

// modify lib/inflate.rs
    fn inflate_block0(&mut self) -> Result<()> {
        self.reader.byte_align();
        let len = self.reader.read_bits(16)?;
        let nlen = self.reader.read_bits(16)?;
        if len ^ nlen != 0xFFFF {
            Err(Error::BlockType0LenMismatch)
        } else {
            let mut buf = vec![0; len as usize];
            self.reader.read_exact(&mut buf)?;
            self.writer.write_all(&buf)?;
            Ok(())
        }
    }

With this, our code should all compile. In the next article, we will finally dive into deflate algorithm, so stay tuned!

Previous in series: https://towardsdev.com/rust-write-gunzip-from-scratch-3-56f21c29a97b

Next in series: https://medium.com/@techhara/rust-write-gunzip-from-scratch-5-670beb29b11f