Rust — write gunzip from scratch 1 Link to heading

In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below

  • part 1: main() function and skeletal structure
  • part 2: bitread module for reading bits from a byte stream
  • part 3: gzip header & footer for parsing the metadata and checksum
  • part 4: inflate for block type 0 (uncompressed) data
  • part 5: codebook and huffman_decoder modules for decoding Huffman codes
  • part 6: lz77 and sliding_window modules for decompressing LZ77-encoded data
  • part 7: inflate for block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes
  • part 8: checksum_write module for verifying the decompressed data
  • part 9: performance optimization
  • part 10: multithread support
  • part 11: streaming support
  • part 12: memory optimization
  • part 13: bug fix to reject non-compliant .gz file

Let’s first understand how .gz format works. The best references are RFC1951 and RFC1952. RFC1951 specifies Deflate compression format, which is used in both zlib and .gz, and RFC1952 specifies how a .gz file is formatted, which consists of Deflate data. If you prefer a video format, this lecture is the best version of it.

Below is the main data pipeline. We read compressed file from Read trait as bytes. We will then need to implement a bit reader to be able to read as a bit stream. We then need a Huffman decoder to identify variable-length bit codes and obtain the LZ77 code. Finally, we convert to byte stream and write to the output through Write trait. Below shows overall data pipeline for decompression.

Rust gunzip pipeline

Now to the program. We will create a library crate

$ cargo new --lib gunzip
$ cd gunzip

For this series, we will write a minimal command line program that decompresses a .gz file. Specifically, it will read the input from stdin and write decompressed data to stdout. Let’s create the binary source first.

// src/bin/gunzip.rs
use gunzip::error::Result; // todo
use gunzip::gunzip; // todo

fn usage(program: &str) {
    eprintln!("Usage: {}", program);
    eprintln!("\tDecompresses .gz file read from stdin and outputs to stdout");
    eprintln!("Example: {} < input.gz > output", program);
}

fn main() -> Result<()> {
    let mut args = std::env::args();
    let program = args.next().unwrap();
    if args.next().is_some() {
        usage(&program);
        std::process::exit(-1);
    }

    let reader = std::io::stdin();
    let writer = std::io::stdout();
    gunzip(reader, writer)
}

The actual implementation is delegated to gunzip() function, which takes in a Read and Write trait for input and output. Also, this function should return Result<()> to indicate whether there was any error. For that, let’s define minimal custom Error and Result type.

// src/error.rs
use std::fmt::Display;
use std::io::ErrorKind;

#[derive(Debug)]
pub enum Error {
    StdIoError(ErrorKind),
}

impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Self::StdIoError(e.kind())
    }
}

impl std::error::Error for Error {}

impl Display for Error {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{:?}", self)
    }
}

pub type Result<R> = std::result::Result<R, Error>;

We will add more error variants later on, but for now we only have std::io::Error, which may arise from either Read or Write traits. Let’s write a placeholder for gunzip() method

// src/lib.rs
pub mod error;

use crate::error::Result;
use std::io::{Read, Write};

pub fn gunzip(read: impl Read, write: impl Write) -> Result<()> {
    todo!()
}

OK. We should at least be able to compile now.

$ cargo build

We will tackle BitRead stage of the data pipeline in the subsequent series, so stay tuned!

Next in series: https://towardsdev.com/rust-write-gunzip-from-scratch-2-31dcab43791