Performance optimization—efficient cache programming 1

Last time, we examined a real example of the branchless programming technique, which resulted in a 7x speedup by reducing branch misses. Today, let’s explore how to speed up a program by employing efficient cache programming techniques, again by looking at real examples and running benchmarks. You can find the full source code here.

If you are not already familiar with this topic, I strongly recommend this classic video for an excellent introduction.

Accessing memory is extremely slow compared to the speed of a modern CPU. For this reason, modern CPUs are equipped with multiple levels of cache to speed up access to data in memory. A large cache makes a significant difference in computing power, especially for applications that work with a lot of memory, such as games.

Below are the L1/L2/L3 cache sizes of AMD’s Ryzen 9800X3D. Its very large L3 cache (96MB) is the reason why this CPU is the best gaming CPU as of this writing.

| Cache Level | Cache Size |
|-------------|------------|
| L1          | 640KB      |
| L2          | 8MB        |
| L3          | 96MB       |

Given the limited size of the cache, the best thing we can do is optimize the program to use the cache as efficiently as possible. This typically means reducing memory usage as much as possible. Let’s write the following program in different ways to see how to achieve the speedup. Try it yourself too!

Programming Task

Suppose we need to compute a hash value through a series of operations, either “add” or “multiply,” with operands between 1 and 32768 (inclusive). Each operation is serialized as a JSON object corresponding to this Rust type:

enum Operation {
    Multiply(u16), // 1..32768
    Add(u16),      // 1..32768
}

The process involves two main steps:

  1. Codebook Creation:

    • Read an integer n from the input as a string.
    • Read n lines of JSON objects to create the codebook.
  2. Operation Execution:

    • Initialize the hash value to 0u64.
    • Read a 4-byte u32 operation ID from the input.
    • Look up the operation in the codebook by its ID.
    • Perform the corresponding operation.
    • Repeat until end of file.

Here is a simple example input:

4
{"Add":32749}
{"Multiply":761}
{"Multiply":30965}
{"Add":5}
\x00000001\x00000000\x00000010\x00000011\x00000001

This input indicates:

  • A codebook with 4 operations
  • Execution order: operations 1, 0, 2, 3, 1

The expected output of the program when run on this input is ((0 * 761 + 32749) * 30965 + 5) * 761 = 771709393190.
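
Tracing the example by hand makes the expected output concrete. This is a sketch using the wrapping u64 arithmetic the implementation below relies on, with the operand values taken from the formula above:

```rust
fn main() {
    // Codebook: 0 → Add(32749), 1 → Multiply(761), 2 → Multiply(30965), 3 → Add(5)
    // Execution order from the binary stream: 1, 0, 2, 3, 1
    let mut acc: u64 = 0;
    acc = acc.wrapping_mul(761);   // op 1: 0
    acc = acc.wrapping_add(32749); // op 0: 32749
    acc = acc.wrapping_mul(30965); // op 2: 1014072785
    acc = acc.wrapping_add(5);     // op 3: 1014072790
    acc = acc.wrapping_mul(761);   // op 1: 771709393190
    println!("{}", acc); // 771709393190
}
```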

Before optimization

I’d implement the program as follows. The hotspot is the run() function, which looks up each operation in the codebook and executes it.

use serde::{Deserialize, Serialize};
use std::io::{self, stdin, BufRead, Read};

#[derive(Serialize, Deserialize)]
enum Operation {
    Multiply(u16), // 1..32768
    Add(u16),      // 1..32768
}

fn main() -> io::Result<()> {
    let mut handle = stdin().lock();
    let mut lines = handle.by_ref().lines();

    // Read n
    let n: u32 = lines.next().unwrap()?.parse().unwrap();

    // Read n Code elements
    let mut codebook = Vec::with_capacity(n as usize);
    for _ in 0..n {
        let line = lines.next().unwrap()?;
        let code: Operation = serde_json::from_str(&line).unwrap();
        codebook.push(code);
    }

    // Read code until EOF
    let mut buf = Vec::new();
    handle.read_to_end(&mut buf)?;
    let indices: &[u32] = from_u8(&buf);
    let result = run(&codebook, indices);
    println!("{}", result);

    Ok(())
}

fn run(codebook: &[Operation], indices: &[u32]) -> u64 {
    indices
        .iter()
        .fold(0, |acc, ix| match codebook[*ix as usize] {
            Operation::Multiply(x) => acc.wrapping_mul(x as u64),
            Operation::Add(x) => acc.wrapping_add(x as u64),
        })
}

fn from_u8<T>(arr: &[u8]) -> &[T] {
    let item_size = std::mem::size_of::<T>();
    assert_eq!(arr.len() % item_size, 0);
    let ptr = arr.as_ptr() as *const T;
    // Reinterpreting the bytes is only sound if the buffer is properly aligned for T.
    assert_eq!(ptr as usize % std::mem::align_of::<T>(), 0);
    let len = arr.len() / item_size;
    unsafe { std::slice::from_raw_parts(ptr, len) }
}

Running this on a codebook of size 1,000,000 and 200,000,000 operations, I get around 2s of runtime.

     4,463,720,733      instructions                                                         
       828,230,183      branches                                                             
         2,193,853      branch-misses                    #    0.26% of all branches           
       307,309,861      cache-references                                                     
       119,884,253      cache-misses                     #   39.01% of all cache refs        

       2.019265986 seconds time elapsed

       1.561133000 seconds user
       0.458039000 seconds sys

After optimization

Now, let’s try to improve our implementation. First, notice that the cache-miss rate above is a whopping 39%. This is definitely something we should reduce in order to achieve a performance gain. Whenever we are dealing with an array of elements, the first thing to try is to shrink the memory size of each element as much as possible. This reduces the memory footprint and hence the cache-miss rate.

In fn run(codebook: &[Operation], indices: &[u32]) -> u64, we can’t do much about indices: &[u32], since that is part of the specification. What we can modify is codebook: &[Operation]. Let’s see how many bytes an Operation occupies. From the Rust Playground, we can see it takes up 4 bytes. Can we do better?
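
To confirm the size, here is a quick check you can run yourself:

```rust
enum Operation {
    Multiply(u16), // 1..=32768
    Add(u16),      // 1..=32768
}

fn main() {
    // One byte of discriminant, padded to the u16 payload's 2-byte
    // alignment: 4 bytes total.
    println!("{}", std::mem::size_of::<Operation>()); // 4
}
```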

Looking at the ranges, the operands of the Add and Multiply variants go from 1 to 32768, i.e. 2¹⁵ possible values each. Hence, a single u16 can represent any Operation as follows:

#[derive(Copy, Clone)]
struct CompactOperation(u16);

impl CompactOperation {
    const MASK: u16 = 1 << 15;
    const MULTIPLY: u16 = 1 << 15;
    const ADD: u16 = 0 << 15;
}

impl From<Operation> for CompactOperation {
    fn from(value: Operation) -> Self {
        match value {
            Operation::Multiply(x) => CompactOperation((x - 1) | Self::MULTIPLY),
            Operation::Add(x) => CompactOperation((x - 1) | Self::ADD),
        }
    }
}
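
As a quick sanity check of this encoding, here is a sketch of the round trip: the low 15 bits plus 1 recover the operand, and the top bit selects the variant.

```rust
#[derive(Copy, Clone)]
struct CompactOperation(u16);

impl CompactOperation {
    const MASK: u16 = 1 << 15;
    const MULTIPLY: u16 = 1 << 15;
}

fn main() {
    // CompactOperation is half the size of the 4-byte Operation enum.
    assert_eq!(std::mem::size_of::<CompactOperation>(), 2);

    // Encode Multiply(32768): store operand - 1 in the low 15 bits,
    // and set the top bit to mark a Multiply.
    let m = CompactOperation((32768u16 - 1) | CompactOperation::MULTIPLY);

    // Decode: low 15 bits + 1 recover the operand; the top bit is the tag.
    assert_eq!((m.0 & !CompactOperation::MASK) as u64 + 1, 32768);
    assert_eq!(m.0 & CompactOperation::MASK, CompactOperation::MULTIPLY);
    println!("ok");
}
```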

We let the left-most bit indicate whether the operation is Add or Multiply, and the remaining 15 bits represent the operand minus 1 (so 0 to 32767 encodes 1 to 32768). We are essentially hand-rolling the enum, but more compactly. A CompactOperation is only a u16, i.e. 2 bytes, which halves the memory size of the codebook array. How much of a performance impact can this have? Below is the diff for the new implementation after the cache optimization:

--- before
+++ after
@@ -7,6 +7,24 @@
     Add(u16),      // 1..32768
 }
 
+#[derive(Copy, Clone)]
+struct CompactOperation(u16);
+
+impl CompactOperation {
+    const MASK: u16 = 1 << 15;
+    const MULTIPLY: u16 = 1 << 15;
+    const ADD: u16 = 0 << 15;
+}
+
+impl From<Operation> for CompactOperation {
+    fn from(value: Operation) -> Self {
+        match value {
+            Operation::Multiply(x) => CompactOperation((x - 1) | Self::MULTIPLY),
+            Operation::Add(x) => CompactOperation((x - 1) | Self::ADD),
+        }
+    }
+}
+
 fn main() -> io::Result<()> {
     let mut handle = stdin().lock();
     let mut lines = handle.by_ref().lines();
@@ -19,7 +37,7 @@
     for _ in 0..n {
         let line = lines.next().unwrap()?;
         let code: Operation = serde_json::from_str(&line).unwrap();
-        codebook.push(code);
+        codebook.push(CompactOperation::from(code));
     }
 
     // Read code until EOF
@@ -32,13 +50,16 @@
     Ok(())
 }
 
-fn run(codebook: &[Operation], indices: &[u32]) -> u64 {
-    indices
-        .iter()
-        .fold(0, |acc, ix| match codebook[*ix as usize] {
-            Operation::Multiply(x) => acc.wrapping_mul(x as u64),
-            Operation::Add(x) => acc.wrapping_add(x as u64),
-        })
+fn run(codebook: &[CompactOperation], indices: &[u32]) -> u64 {
+    indices.iter().fold(0, |acc, ix| {
+        let compact_code = codebook[*ix as usize];
+        let x = (compact_code.0 & !CompactOperation::MASK) as u64 + 1;
+        match compact_code.0 & CompactOperation::MASK {
+            CompactOperation::MULTIPLY => acc.wrapping_mul(x),
+            CompactOperation::ADD => acc.wrapping_add(x),
+            _ => unreachable!(),
+        }
+    })
 }
 
 fn from_u8<T>(arr: &[u8]) -> &[T] {

Again, not much has changed from the original version; we are just implementing the enum manually, but more compactly. Running this modified version on the same input, I get:

     5,459,382,101      instructions                                                         
       827,363,852      branches                                                             
         2,196,592      branch-misses                    #    0.27% of all branches           
       328,491,238      cache-references                                                     
        42,631,674      cache-misses                     #   12.98% of all cache refs        

       1.384393534 seconds time elapsed

       0.931324000 seconds user
       0.452671000 seconds sys

Note that benchmark results may vary greatly depending on the cache size and memory speed of the system used.

First, notice that the cache-miss rate has dropped from 39% to 13%. However, there is also a roughly 20% increase in instructions, from 4.4B to 5.5B, most likely because of the extra bit manipulation needed to decode each CompactOperation inside run(). Despite this, **performance improved by about 1.4x** just by reducing the memory footprint of the array!

perf stat comparison

Alright, so the lesson of this article is to reduce the memory footprint of an array. This implies that relying on high-level abstractions such as std::variant in C++ or enum in Rust isn’t always the most efficient choice. Obviously, this does not mean you should never use them; it just means there are times when it is better to re-implement them manually.

Another example is a struct with multiple fields, which can also be compacted into a smaller data type. Here is one such example from the real world (DEFLATE) that bit-packs the following 8-byte struct into 4 bytes.

// 8 bytes
struct Data {
  extra_bits: u8, // 0..=13
  value: u16,     // 0..=24577
  codelength: u8, // 0..=15
  symbol: u16,    // 0..=285
}

// 4 bytes
// extra_bits: 4 bits | value: 15 bits | codelength: 4 bits | symbol: 9 bits
struct PackedData(u32);
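
A sketch of how such packing might look, following the bit layout in the comment above. The field order and accessor names here are my assumption for illustration, not DEFLATE’s actual code:

```rust
// extra_bits: 4 bits | value: 15 bits | codelength: 4 bits | symbol: 9 bits
struct PackedData(u32);

impl PackedData {
    fn new(extra_bits: u8, value: u16, codelength: u8, symbol: u16) -> Self {
        // Shift each field into its slot, from the high bits down.
        PackedData(
            ((extra_bits as u32) << 28)
                | ((value as u32) << 13)
                | ((codelength as u32) << 9)
                | symbol as u32,
        )
    }

    fn extra_bits(&self) -> u8 {
        (self.0 >> 28) as u8
    }
    fn value(&self) -> u16 {
        ((self.0 >> 13) & 0x7FFF) as u16
    }
    fn codelength(&self) -> u8 {
        ((self.0 >> 9) & 0xF) as u8
    }
    fn symbol(&self) -> u16 {
        (self.0 & 0x1FF) as u16
    }
}

fn main() {
    // Round-trip the maximum value of every field.
    let p = PackedData::new(13, 24577, 15, 285);
    assert_eq!(p.extra_bits(), 13);
    assert_eq!(p.value(), 24577);
    assert_eq!(p.codelength(), 15);
    assert_eq!(p.symbol(), 285);
    println!("ok");
}
```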

We will not go through this example here, as it should yield a similar result. Suffice it to say that when you have a large array of objects to iterate over, make sure each object’s footprint is as compact as possible!