Rust — write gunzip from scratch 2 Link to heading

In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below

  • part 1: main() function and skeletal structure
  • part 2: bitread module for reading bits from a byte stream
  • part 3: gzip header & footer for parsing the metadata and checksum
  • part 4: inflate for block type 0 (uncompressed) data
  • part 5: codebook and huffman_decoder modules for decoding Huffman codes
  • part 6: lz77 and sliding_window modules for decompressing LZ77-encoded data
  • part 7: inflate for block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes
  • part 8: checksum_write module for verifying the decompressed data
  • part 9: performance optimization
  • part 10: multithread support
  • part 11: streaming support
  • part 12: memory optimization
  • part 13: bug fix to reject non-compliant .gz file

Rust - Write Gunzip From Scratch 2

Today, we will implement BitRead. As a reminder, here is the data pipeline.

Data Pipeline

Let’s first define the interface we want. We shall be able to read up to 15-bits at once, as that is the maximum code length as defined in RFC1951. However, the code length is not known a priori during the decoding process because, well we don’t know what code we are reading in the first place. The solution is a two-step process — we read at least 15-bits at a time first, pass it to the decoder so that the decoder can figure out the code and give back the code length, and then we consume that amount of bits.

Let’s define BitRead trait as follows, extending Read. Notice peek_bits() and consume() methods do exactly what I described above. For convenience, we also define read_bits() method that can be useful if we know the number of bits to read in advance. I also added a few more methods that will prove useful later on.

// src/bitread.rs
use std::io::Read;
use std::mem::size_of; // will need it

pub trait BitRead: Read {
    /// peak at least 24-bits without consuming
    /// returns error if less than 24-bits are remaining
    /// the caller is expected to apply the mask
    fn peek_bits(&mut self) -> std::io::Result<u32>;

    /// consume n-bits
    fn consume(&mut self, n: u32);

    /// consume 0 to 7 bits to byte-align
    fn byte_align(&mut self);

    /// read up to 24-bits and consume
    /// return error if less than n-bits are remaining
    fn read_bits(&mut self, n: u32) -> std::io::Result<u32> {
        debug_assert!(n <= 24);
        let bits = self.peek_bits()?;
        self.consume(n);
        Ok(bits & ((1 << n) - 1))
    }

    /// indicate whether there is more data, even a single bit left
    /// it may refill the buffer
    fn has_data_left(&mut self) -> std::io::Result<bool>;
}

You may be wondering why at least 24-bits here. This is arbitrary and reflects an implementation I have in mind. As long as we can read 15-bits or more, we should be good.

Bits will be refilled from an inner Read trait object. If there is any error from the inner Read trait object, we propagate it. While at it, let’s define a struct that implements BitRead trait.

// append to src/bitread.rs
const BUFFER_SIZE: usize = 16 << 10;

pub struct BitReader<R: Read> {
    read: R,
    nbits: u32, // # consumed bits within the first byte
    buf: Vec<u8>,
    begin: usize,
    cap: usize,
}

impl<R: Read> BitReader<R> {
    pub fn new(read: R) -> Self {
        Self {
            read,
            nbits: 0,
            buf: vec![0; BUFFER_SIZE],
            begin: 0,
            cap: 0,
        }
    }

    fn buffer(&self) -> &[u8] {
        &self.buf[self.begin..self.cap]
    }

    /// # bits remaining within the buffer
    fn bit_len(&self) -> usize {
        self.buffer().len() * 8 - self.nbits as usize
    }

    /// refill buffer
    /// returns # additional bytes added to buffer
    fn fill_buf(&mut self) -> std::io::Result<usize> {
        self.buf.copy_within(self.begin..self.cap, 0);
        self.cap -= self.begin;
        self.begin = 0;
        let n = self.read.read(&mut self.buf[self.cap..])?;
        self.cap += n;
        Ok(n)
    }
}

This will hold an inner Read trait object to read bytes from. It also has a large buffer that holds to-be-read data as specified by buffer() method. The first nbits of the first byte of the buffer is already consumed, so we will need to right-shift those before we return the bits. This is where minimum 24-bits come from in the API. Finally, fill_buf() method refills the buffer by first moving left-over data to the beginning and then refilling the rest from its inner Read trait.

Next, let’s implement Read trait, since we also need to be able to read bytes as well. Whenever this method is called, we shall discard left-over bits to byte-align. We first read from the buffer, and from the inner Read trait object.

// append to src/bitread.rs
impl<R: Read> Read for BitReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        self.byte_align();
        // read off from the buffer first
        let mut n = buf.len().min(self.buffer().len());
        buf[..n].copy_from_slice(&self.buffer()[..n]);
        self.begin += n;

        // read the rest directly from Read
        n += self.read.read(&mut buf[n..])?;
        Ok(n)
    }
}

Finally, we need to implement BitRead. Try it out yourself as an exercise. Here is my implementation.

// append to src/bitread.rs
impl<R: Read> BitRead for BitReader<R> {
    fn byte_align(&mut self) {
        if self.nbits > 0 {
            self.nbits = 0;
            self.begin += 1;
        }
    }

    fn consume(&mut self, n: u32) {
        debug_assert!(self.bit_len() >= n as usize);
        self.nbits += n;
        self.begin += (self.nbits / 8) as usize;
        self.nbits %= 8;
    }

    fn peek_bits(&mut self) -> std::io::Result<u32> {
        // fill_buf may not return enough bytes at once
        while self.buffer().len() < size_of::<u32>() {
            if self.fill_buf()? == 0 {
                return Err(std::io::Error::from(std::io::ErrorKind::UnexpectedEof));
            }
        }
        let bits = u32::from_le_bytes(
            self.buf[self.begin..self.begin + size_of::<u32>()]
                .try_into()
                .unwrap(),
        );
        Ok(bits >> self.nbits)
    }

    fn has_data_left(&mut self) -> std::io::Result<bool> {
        Ok(!self.buffer().is_empty() || self.fill_buf()? != 0)
    }
}

Mostly straightforward. In peek_bits() method, we first check to see if we have enough bits left. If not, we refill the data. Then, we obtain the first 4-bytes, remove already consumed bits, and return.

OK, so we have completed the second stage of the data pipeline, which is BitRead. Before we wrap up, don’t forget to include bitread module in lib.rs

// append to src/lib.rs
pub mod bitread;

and test if everything compiles

cargo build

Feel free to add unit tests for BitRead. Stay tuned for the next component!

Previous in series: https://towardsdev.com/rust-write-gunzip-from-scratch-1-a0100648b246

Next in series: https://towardsdev.com/rust-write-gunzip-from-scratch-3-56f21c29a97b