Performance optimization — branchless programming Link to heading
Everyone wants to write program that runs fast and efficient. Today, let’s have a look at one of the performance optimization techniques: branchless programming. We will go through an example where branchless optimization achieves whopping 7x faster performance! You can find the full source code here.
If you are not familiar with the topic yet, these two videos will serve as excellent introduction to the topic.
Branch misprediction overhead remains high in current processors due to deep pipelines and speculative execution, which cause significant time and resource waste when a misprediction occurs, despite advanced prediction techniques.
Branchful implementation Link to heading
Branching means there are multiple paths of instructions for the program to execute. A classic example is with if/else and switch statements. Let’s say we have a function that takes in two optional<int> arguments. The function is to
- if both arguments contain values: compute the product and return
- else: return a preset value
A typical implementation would involve a simple if/else as below
int branchful(std::optional<int> x, std::optional<int> y) {
const int NULL_VALUE = -1;
if (x.has_value() && y.has_value()) {
return *x * *y;
} else {
return NULL_VALUE;
}
}
How many branches do we have in the code? Well, though not obvious, the answer is 2. The not-so-obvious branch comes from the logical AND operator &&.

Branchful control flow
Let’s examine the assembly code to verify. Below is what gcc produces without any optimization (so that the function is not inlined but even with -O2 flag, gcc still implements with branch instructions)
branchful(std::optional<int>, std::optional<int>):
push rbp
mov rbp, rsp
push rbx
sub rsp, 24
mov QWORD PTR [rbp-24], rdi
mov QWORD PTR [rbp-32], rsi
lea rax, [rbp-24]
mov rdi, rax
call std::optional<int>::has_value() const
test al, al
je .L12
lea rax, [rbp-32]
mov rdi, rax
call std::optional<int>::has_value() const
test al, al
je .L12
mov eax, 1
jmp .L13
.L12:
mov eax, 0
.L13:
test al, al
je .L14
lea rax, [rbp-24]
mov rdi, rax
call std::optional<int>::operator*() &
mov ebx, DWORD PTR [rax]
lea rax, [rbp-32]
mov rdi, rax
call std::optional<int>::operator*() &
mov eax, DWORD PTR [rax]
imul eax, ebx
jmp .L15
.L14:
mov eax, -1
.L15:
mov rbx, QWORD PTR [rbp-8]
leave
ret
We can find the je operation right after std::optional<int>::has_value() call for each argument. With this implementation, the benchmark program took 4.7s to complete.
Branchless implementations 1 Link to heading
How could we implement it without branching? One trick is to use array indexing. That is, prepare a lookup table that contains all possible values. Then, select the correct one from the table using the conditional variable.
int lookup_table(std::optional<int> x, std::optional<int> y) {
const int NULL_VALUE = -1;
// construct a lookup table of all possible values
int options[2] = {NULL_VALUE, *x * *y};
// construct the index from the conditional variable
// note bitwise operation `&` and not logical operation `&&`
auto ix = x.has_value() & y.has_value();
// select and return
return options[ix];
}
Let’s verify that there is no branching instruction from the assembly code generated by gcc.
lookup(std::optional<int>, std::optional<int>):
push rbp
mov rbp, rsp
push rbx
sub rsp, 40
mov QWORD PTR [rbp-40], rdi
mov QWORD PTR [rbp-48], rsi
lea rax, [rbp-40]
mov rdi, rax
call std::optional<int>::has_value() const
movzx ebx, al
lea rax, [rbp-48]
mov rdi, rax
call std::optional<int>::has_value() const
movzx eax, al
and eax, ebx
mov DWORD PTR [rbp-20], eax
mov QWORD PTR [rbp-28], 0
mov DWORD PTR [rbp-28], -1
lea rax, [rbp-40]
mov rdi, rax
call std::optional<int>::operator*() &
mov ebx, DWORD PTR [rax]
lea rax, [rbp-48]
mov rdi, rax
call std::optional<int>::operator*() &
mov eax, DWORD PTR [rax]
imul eax, ebx
mov DWORD PTR [rbp-24], eax
mov eax, DWORD PTR [rbp-20]
cdqe
mov eax, DWORD PTR [rbp-28+rax*4]
mov rbx, QWORD PTR [rbp-8]
leave
ret
Viola! There is not a single branching instruction at all! The same benchmark program took 0.7s to complete—that is 6 times faster!
Branchless implementations 2 Link to heading
We are not done yet. There is another variant we could try. This time, we are going to magically compute the return value directly. This may well be faster than the lookup table approach because indexing operation from the lookup table approach is slow as it needs access to the memory.
Rather than creating the lookup table and indexing, we will use bit-mask approach. Given signed cond taking 0 or 1 boolean value, we could subtract 1 and get a conditional mask containing all-zero-bits vs all-one-bits. Then, we could perform bitwise operations to construct the value we want directly.
int bitmasking(std::optional<int> x, std::optional<int> y) {
const int NULL_VALUE = -1;
auto prod = *x * *y;
int cond = x.has_value() & y.has_value();
// either all-zero-bits or all-one-bits
int mask = cond - 1;
// bitmasking operations
return (mask & NULL_VALUE) | (~mask & prod);
}
Let’s verify its assembly code.
bitmasking(std::optional<int>, std::optional<int>):
push rbp
mov rbp, rsp
push rbx
sub rsp, 40
mov QWORD PTR [rbp-40], rdi
mov QWORD PTR [rbp-48], rsi
lea rax, [rbp-40]
mov rdi, rax
call std::optional<int>::operator*() &
mov ebx, DWORD PTR [rax]
lea rax, [rbp-48]
mov rdi, rax
call std::optional<int>::operator*() &
mov eax, DWORD PTR [rax]
imul eax, ebx
mov DWORD PTR [rbp-20], eax
lea rax, [rbp-40]
mov rdi, rax
call std::optional<int>::has_value() const
movzx ebx, al
lea rax, [rbp-48]
mov rdi, rax
call std::optional<int>::has_value() const
movzx eax, al
and eax, ebx
mov DWORD PTR [rbp-24], eax
mov eax, DWORD PTR [rbp-24]
sub eax, 1
mov DWORD PTR [rbp-28], eax
mov eax, DWORD PTR [rbp-28]
or eax, DWORD PTR [rbp-20]
mov rbx, QWORD PTR [rbp-8]
leave
ret
Again, completely free of branching! This implementation ran in just 0.6s, resulting in 7x faster performance compared to the branchful implementation!
Profiling Link to heading
Measuring the program’s runtime is one thing, but for more scientific analysis, we need to actually measure the branch-related instruction counts. For that, perf is the best tool.
perf stat -e instructions,branches,branch-misses command [args...]

perf stat comparison
First, notice that the number instructions are very similar among all three implementations. The branchless implementations (lookup and bitmasking) reduce the branch instructions by 3x. More importantly, the branch-misses are reduced by impressive 400x with branchless implementations. This explains why we are seeing 7x speed up. The remaining branches are from other parts of the program, and their branch-miss rate is only 0.2%. This confirms our hypothesis that branch mispredictions were the culprit in slowing down the program. By the way, this measures the program when compiled by gcc with -O2 optimization.
Discussions Link to heading
Note that we are actually doing more work with branchless implementations because we are performing the multiplication *x * *y regardless of the condition, as opposed to only when both arguments contain values in the branchful implementation. Typically, the branchless optimization works well if the branch misprediction cost is greater than the extra work we do. This implies that the branchless programming works well under these conditions
- data: close to random in which case branch prediction is very difficult
- workload: light extra work is required
- compiler: not so clever to do branchless optimization for us
For example, I obtained 7x performance gain when compiled with gcc but no gain when compiled with llvm , as llvm aggressively performs branchless optimization even for the branchful implementation.
Premature optimization is the root of all evil [Donald Knuth]
Please do not get the idea that we should avoid if/else as much possible. That is outright wrong conclusion. The truth is that it really depends on so many factors, such as compiler, optimization flags, programming language, operating system, CPU, and the code itself. Unless you are an expert in this area, you probably won’t be able to tell just from reading the code.
The best practice is to always profile and locate the hotspot first, and then measure and compare alternative implementations at the hotspot.