Data compression in 100 lines of code Link to heading
Today, we will write a very simple compression program and decompression program, about 100 lines in total, from scratch. The algorithm we will use is run-length encoding. The compression program will take any input, compress it, and write the result out; the decompression program will take any compressed input, decompress it, and write the result out. What is really important is that, for every file, applying compression and then decompression must restore the original file exactly.
Unfortunately, there is no guarantee that our program will be able to compress (i.e., reduce the size of) every single file. This is true not only for our program but for any compression algorithm out there; the pigeonhole principle (see the references) shows why no lossless compressor can shrink every input. However, we will guarantee that our compressed output is at most 1 byte longer than the original in the worst case. By the end of the article, we will have a compression program that shrinks a linux.tar file by 13%, and a decompression program that restores it exactly.
Run length encoding Link to heading
Run-length encoding (RLE) is a simple compression algorithm that represents a consecutively repeating byte as a pair of the byte and its repeat length. For example, we can compress the string thank you sooooooo much into thank you so7 much, where the 7-byte string ooooooo is replaced with the 2-byte string o7. This compression scheme would work for any text file as long as the text does not contain any ASCII digits: if the original text contains a digit, we cannot tell whether so7 is literal text or the compressed form of sooooooo.
RLE format for arbitrary file Link to heading
To apply RLE to an arbitrary file, we would need to be a bit more clever. Here is what we can do: every time we see a repeating byte of length 2 to 257 inclusive, we only keep the first two and replace the rest with a single byte for additional repeat count of 0 up to 255. For example, the string thank you sooooooo much would become thank you soo\x05 much where \x05 is a byte value 5, meaning we have 5 additional o’s after oo. To recover the original string, we simply look for any repeating bytes of length two, and interpret the subsequent byte as the additional repeat count.
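As a quick sanity check, here is the example above spelled out at the byte level (a minimal standalone sketch; the string literals are just the article's example):

```rust
fn main() {
    // "sooooooo" is 's' followed by a run of 7 'o's.
    let original = b"thank you sooooooo much";
    // Keep the first two 'o's, then one byte (\x05) for the 5 additional 'o's.
    let encoded = b"thank you soo\x05 much";
    assert_eq!(original.len(), 23);
    assert_eq!(encoded.len(), 19); // 4 bytes saved on this single run
    println!("{} -> {} bytes", original.len(), encoded.len());
}
```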
Let’s define our RLE compressed file format:
- the first byte is always either 0x00 or 0xFF
- if the first byte is 0x00, the rest is raw uncompressed data
- if the first byte is 0xFF, the rest represents the compressed data as discussed above
This way, we apply compression only if RLE encoding actually reduces the size by at least 2 bytes; otherwise, we use the raw uncompressed data with an extra 0x00 byte padded in front. Now we are ready to write simple programs for both compression and decompression.
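To make the framing concrete, here is a minimal sketch of both header cases (the header values are from the format above; the payload bytes are illustrative):

```rust
fn main() {
    // Case 1: RLE helped, so the compressed payload is prefixed with 0xFF.
    let mut framed = vec![0xFFu8];
    framed.extend_from_slice(b"thank you soo\x05 much");
    assert_eq!(framed[0], 0xFF);

    // Case 2: RLE did not help, so the raw bytes go behind a 0x00 header,
    // paying exactly 1 byte of overhead.
    let raw = b"abcdef";
    let mut framed_raw = vec![0x00u8];
    framed_raw.extend_from_slice(raw);
    assert_eq!(framed_raw.len(), raw.len() + 1);
    println!("ok");
}
```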
Compression Link to heading
Below is what I came up with. The RLE scheme is extremely simple, so feel free to write your own implementation in your favorite programming language.
// compress.rs
use std::io::{stdin, stdout, Read, Result, Write};

fn main() -> Result<()> {
    const UNCOMPRESSED: u8 = 0x00;
    const COMPRESSED: u8 = 0xFF;

    let mut uncompressed = Vec::new();
    stdin().read_to_end(&mut uncompressed)?;
    // empty input, empty output
    if uncompressed.is_empty() {
        return Ok(());
    }
    let compressed = run_length_encode(&uncompressed);
    if uncompressed.len() >= compressed.len() + 2 {
        // compression reduces the size by at least 2 bytes
        stdout().write_all(&[COMPRESSED])?;
        stdout().write_all(&compressed)?;
    } else {
        // compression does not reduce the size; use the uncompressed data as is
        stdout().write_all(&[UNCOMPRESSED])?;
        stdout().write_all(&uncompressed)?;
    }
    Ok(())
}

fn run_length_encode(uncompressed: &[u8]) -> Vec<u8> {
    let n = uncompressed.len();
    let mut compressed = Vec::new();
    let mut ix = 0;
    while ix < n {
        let x = uncompressed[ix];
        // measure the run starting at ix, capped at 257 (2 literals + count of 255)
        let mut count = 1;
        while ix + 1 < n && x == uncompressed[ix + 1] && count < 257 {
            count += 1;
            ix += 1;
        }
        compressed.push(x);
        if count >= 2 {
            // emit the byte twice, then the additional repeat count (0..=255)
            compressed.push(x);
            compressed.push((count - 2) as u8);
        }
        ix += 1;
    }
    compressed
}
Let’s put our compression program into action. As always, we will try to compress the linux.tar file.
# linux.tar original file size
wc -c linux.tar # 1489264640 bytes
# compress into linux.tar.rle
./compress < linux.tar > linux.tar.rle
# compressed file size
wc -c linux.tar.rle # 1291142402 bytes
On my M2 processor, it took only 2.3 s to compress the 1.4 GB file, reducing its size by 13%. Not the most impressive compression ratio, but not bad for just 50 lines of code.
Now, let’s also test whether our program really adds at most 1 byte compared to the original size when the input cannot be compressed further.
# extract first 1M random bytes, which is very difficult to compress
head -c 1000000 /dev/random > random
# compress into random.rle
./compress < random > random.rle
# compressed file size
wc -c random.rle # 1000001 bytes
As expected, it adds only 1 byte of overhead when it cannot compress the input.
Decompression Link to heading
Now, let’s write our decompression program that does the reverse. Below is what I came up with, but again please feel free to write your own implementation.
// decompress.rs
use std::io::{stdin, stdout, Read, Result, Write};

fn main() -> Result<()> {
    const UNCOMPRESSED: u8 = 0x00;
    const COMPRESSED: u8 = 0xFF;

    let mut input = Vec::new();
    stdin().read_to_end(&mut input)?;
    if input.is_empty() {
        return Ok(());
    }
    let mut output = Vec::new();
    match input[0] {
        UNCOMPRESSED => {
            output.extend_from_slice(&input[1..]);
        }
        COMPRESSED => {
            output = run_length_decode(&input[1..]);
        }
        _ => {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Invalid compression flag",
            ));
        }
    }
    stdout().write_all(&output)?;
    Ok(())
}

fn run_length_decode(compressed: &[u8]) -> Vec<u8> {
    let n = compressed.len();
    let mut decompressed = Vec::new();
    let mut ix = 0;
    while ix < n {
        let x = compressed[ix];
        // a repeated byte is always followed by its additional repeat count,
        // so the bounds check on ix + 2 only trips on malformed input
        if ix + 2 < n && compressed[ix + 1] == x {
            let count = compressed[ix + 2] as usize + 2;
            decompressed.extend(std::iter::repeat(x).take(count));
            ix += 3;
        } else {
            decompressed.push(x);
            ix += 1;
        }
    }
    decompressed
}
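Before testing on real files, we can check the round-trip property directly. The sketch below duplicates the two functions from this article so it compiles standalone, and asserts that decoding an encoding restores the input on a few edge cases: an empty input, a run longer than the 257-byte cap, and data with no runs at all.

```rust
// Round-trip sanity check: decode(encode(x)) must equal x for every input.
// Standalone sketch; the two functions mirror the article's implementations.

fn run_length_encode(uncompressed: &[u8]) -> Vec<u8> {
    let n = uncompressed.len();
    let mut compressed = Vec::new();
    let mut ix = 0;
    while ix < n {
        let x = uncompressed[ix];
        let mut count = 1;
        while ix + 1 < n && x == uncompressed[ix + 1] && count < 257 {
            count += 1;
            ix += 1;
        }
        compressed.push(x);
        if count >= 2 {
            compressed.push(x);
            compressed.push((count - 2) as u8);
        }
        ix += 1;
    }
    compressed
}

fn run_length_decode(compressed: &[u8]) -> Vec<u8> {
    let n = compressed.len();
    let mut decompressed = Vec::new();
    let mut ix = 0;
    while ix < n {
        let x = compressed[ix];
        if ix + 2 < n && compressed[ix + 1] == x {
            let count = compressed[ix + 2] as usize + 2;
            decompressed.extend(std::iter::repeat(x).take(count));
            ix += 3;
        } else {
            decompressed.push(x);
            ix += 1;
        }
    }
    decompressed
}

fn main() {
    let cases: Vec<Vec<u8>> = vec![
        b"thank you sooooooo much".to_vec(), // the article's example
        Vec::new(),                          // empty input
        vec![7u8; 1000],                     // run longer than the 257-byte cap
        (0u8..=255).collect(),               // 256 distinct bytes, no runs at all
    ];
    for case in &cases {
        let encoded = run_length_encode(case);
        assert_eq!(&run_length_decode(&encoded), case);
    }
    println!("all round trips ok");
}
```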
Let’s test our program and make sure we can correctly restore the original file.
# decompress and compare to linux.tar
cmp linux.tar <(./decompress < linux.tar.rle)
# decompress and compare to random
cmp random <(./decompress < random.rle)
Everything works as expected! We just wrote our first compression and decompression programs in only 100 lines of code. Stay tuned for the next compression algorithm implementation!
References Link to heading
Pigeonhole principle - Wikipedia
Run-length encoding - Wikipedia