Rust — write gunzip from scratch 2 Link to heading
In this series, we will be writing gunzip decompressor from scratch in Rust. We want to write it ourselves not only to learn Rust but also to understand how it .gz compression works under the hood. For full source code, check out this Github repo. You can find all articles of the series below
- part 1:
main()function and skeletal structure - part 2:
bitreadmodule for reading bits from a byte stream - part 3:
gzipheader & footer for parsing the metadata and checksum - part 4:
inflatefor block type 0 (uncompressed) data - part 5:
codebookandhuffman_decodermodules for decoding Huffman codes - part 6:
lz77andsliding_windowmodules for decompressing LZ77-encoded data - part 7:
inflatefor block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes - part 8:
checksum_writemodule for verifying the decompressed data - part 9: performance optimization
- part 10: multithread support
- part 11: streaming support
- part 12: memory optimization
- part 13: bug fix to reject non-compliant
.gzfile
Today, we will implement BitRead. As a reminder, here is the data pipeline.

Let’s first define the interface we want. We shall be able to read up to 15-bits at once, as that is the maximum code length as defined in RFC1951. However, the code length is not known a priori during the decoding process because, well we don’t know what code we are reading in the first place. The solution is a two-step process — we read at least 15-bits at a time first, pass it to the decoder so that the decoder can figure out the code and give back the code length, and then we consume that amount of bits.
Let’s define BitRead trait as follows, extending Read. Notice peek_bits() and consume() methods do exactly what I described above. For convenience, we also define read_bits() method that can be useful if we know the number of bits to read in advance. I also added a few more methods that will prove useful later on.
// src/bitread.rs
use std::io::Read;
use std::mem::size_of; // will need it
pub trait BitRead: Read {
/// peak at least 24-bits without consuming
/// returns error if less than 24-bits are remaining
/// the caller is expected to apply the mask
fn peek_bits(&mut self) -> std::io::Result<u32>;
/// consume n-bits
fn consume(&mut self, n: u32);
/// consume 0 to 7 bits to byte-align
fn byte_align(&mut self);
/// read up to 24-bits and consume
/// return error if less than n-bits are remaining
fn read_bits(&mut self, n: u32) -> std::io::Result<u32> {
debug_assert!(n <= 24);
let bits = self.peek_bits()?;
self.consume(n);
Ok(bits & ((1 << n) - 1))
}
/// indicate whether there is more data, even a single bit left
/// it may refill the buffer
fn has_data_left(&mut self) -> std::io::Result<bool>;
}
You may be wondering why at least 24-bits here. This is arbitrary and reflects an implementation I have in mind. As long as we can read 15-bits or more, we should be good.
Bits will be refilled from an inner Read trait object. If there is any error from the inner Read trait object, we propagate it. While at it, let’s define a struct that implements BitRead trait.
// append to src/bitread.rs
const BUFFER_SIZE: usize = 16 << 10;
pub struct BitReader<R: Read> {
read: R,
nbits: u32, // # consumed bits within the first byte
buf: Vec<u8>,
begin: usize,
cap: usize,
}
impl<R: Read> BitReader<R> {
pub fn new(read: R) -> Self {
Self {
read,
nbits: 0,
buf: vec![0; BUFFER_SIZE],
begin: 0,
cap: 0,
}
}
fn buffer(&self) -> &[u8] {
&self.buf[self.begin..self.cap]
}
/// # bits remaining within the buffer
fn bit_len(&self) -> usize {
self.buffer().len() * 8 - self.nbits as usize
}
/// refill buffer
/// returns # additional bytes added to buffer
fn fill_buf(&mut self) -> std::io::Result<usize> {
self.buf.copy_within(self.begin..self.cap, 0);
self.cap -= self.begin;
self.begin = 0;
let n = self.read.read(&mut self.buf[self.cap..])?;
self.cap += n;
Ok(n)
}
}
This will hold an inner Read trait object to read bytes from. It also has a large buffer that holds to-be-read data as specified by buffer() method. The first nbits of the first byte of the buffer is already consumed, so we will need to right-shift those before we return the bits. This is where minimum 24-bits come from in the API. Finally, fill_buf() method refills the buffer by first moving left-over data to the beginning and then refilling the rest from its inner Read trait.
Next, let’s implement Read trait, since we also need to be able to read bytes as well. Whenever this method is called, we shall discard left-over bits to byte-align. We first read from the buffer, and from the inner Read trait object.
// append to src/bitread.rs
impl<R: Read> Read for BitReader<R> {
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
self.byte_align();
// read off from the buffer first
let mut n = buf.len().min(self.buffer().len());
buf[..n].copy_from_slice(&self.buffer()[..n]);
self.begin += n;
// read the rest directly from Read
n += self.read.read(&mut buf[n..])?;
Ok(n)
}
}
Finally, we need to implement BitRead. Try it out yourself as an exercise. Here is my implementation.
// append to src/bitread.rs
impl<R: Read> BitRead for BitReader<R> {
fn byte_align(&mut self) {
if self.nbits > 0 {
self.nbits = 0;
self.begin += 1;
}
}
fn consume(&mut self, n: u32) {
debug_assert!(self.bit_len() >= n as usize);
self.nbits += n;
self.begin += (self.nbits / 8) as usize;
self.nbits %= 8;
}
fn peek_bits(&mut self) -> std::io::Result<u32> {
// fill_buf may not return enough bytes at once
while self.buffer().len() < size_of::<u32>() {
if self.fill_buf()? == 0 {
return Err(std::io::Error::from(std::io::ErrorKind::UnexpectedEof));
}
}
let bits = u32::from_le_bytes(
self.buf[self.begin..self.begin + size_of::<u32>()]
.try_into()
.unwrap(),
);
Ok(bits >> self.nbits)
}
fn has_data_left(&mut self) -> std::io::Result<bool> {
Ok(!self.buffer().is_empty() || self.fill_buf()? != 0)
}
}
Mostly straightforward. In peek_bits() method, we first check to see if we have enough bits left. If not, we refill the data. Then, we obtain the first 4-bytes, remove already consumed bits, and return.
OK, so we have completed the second stage of the data pipeline, which is BitRead. Before we wrap up, don’t forget to include bitread module in lib.rs
// append to src/lib.rs
pub mod bitread;
and test if everything compiles
cargo build
Feel free to add unit tests for BitRead. Stay tuned for the next component!
Previous in series: https://towardsdev.com/rust-write-gunzip-from-scratch-1-a0100648b246
Next in series: https://towardsdev.com/rust-write-gunzip-from-scratch-3-56f21c29a97b