Compiler optimizations can be misleading Link to heading

User can specify compilation optimization levels with -O flag. Probably the most widely used options are the default -O0 for debugging purposes and -O2 or -O3 for release builds. As the optimization level increases, gcc throws in more aggressive optimizations, where we expect the code to run faster. For example, -O3 adds the following optimizations on top of -O2 option

-fgcse-after-reload
-fipa-cp-clone
-floop-interchange
-floop-unroll-and-jam
-fpeel-loops
-fpredictive-commoning
-fsplit-loops
-fsplit-paths
-ftree-loop-distribution
-ftree-partial-pre
-funswitch-loops
-fvect-cost-model=dynamic
-fversion-loops-for-strides

If you are interested, here is the official documentation.

Unfortunately, the actual performance and runtime speed of the app does not always correlate with optimization levels. Here is a simple example code that that runs significantly slower with -O3 optimization compared to -O2, on my x64 linux system:

#include <iostream>

int fibo(int n) {
    if (n <= 1) return n;
    else return fibo(n-2) + fibo(n-1);
}

int main() {
    std::cout << fibo(40) << "\n";
    return 0;
}

I compiled the code using gcc 11.3.0 with optimization levels 0 to 3 and ran benchmarks:

$ for o in 0 1 2 3; do g++ -O$o fibo.cc -o fibo$o; done
$ hyperfine ./fibo0 ./fibo1 ./fibo2 ./fibo3
Benchmark 1: ./fibo0
  Time (mean ± σ):     483.0 ms ±   1.3 ms    [User: 482.9 ms, System: 0.1 ms]
  Range (min … max):   481.5 ms … 485.5 ms    10 runs

Benchmark 2: ./fibo1
  Time (mean ± σ):     404.5 ms ±   0.7 ms    [User: 403.9 ms, System: 0.6 ms]
  Range (min … max):   403.8 ms … 405.9 ms    10 runs

Benchmark 3: ./fibo2
  Time (mean ± σ):     140.2 ms ±   1.4 ms    [User: 139.3 ms, System: 0.7 ms]
  Range (min … max):   138.6 ms … 144.7 ms    21 runs

Benchmark 4: ./fibo3
  Time (mean ± σ):     174.8 ms ±   1.0 ms    [User: 174.0 ms, System: 0.8 ms]
  Range (min … max):   174.1 ms … 178.0 ms    17 runs

Summary
  './fibo2' ran
    1.25 ± 0.01 times faster than './fibo3'
    2.89 ± 0.03 times faster than './fibo1'
    3.45 ± 0.04 times faster than './fibo0'

Compiler optimization benchmark results showing -O2 running fastest at 140.2ms compared to -O3 at 174.8ms

We can clearly see that -O2 ran fastest — 25% speed up compared to -O3 is quite significant. I have always assumed that -O3 should run at least as fast as -O2, and blindly compiled with -O3 flag for release builds.

I digged a bit deeper and compared the assembly code generated by Compiler Explorer, and here is the difference in the assembly code

--- -O2
+++ -O3
@@ -197,19 +197,51 @@
 .LC0:
         .string "\n";
 main:
+        push    r13
+        xor     r13d, r13d
+        push    r12
+        mov     r12d, 38
+        push    rbp
+        push    rbx
         sub     rsp, 8
-        mov     edi, 40
+.L71:
+        lea     ebx, [r12-2]
+        xor     ebp, ebp
+        cmp     r12d, 1
+        jbe     .L80
+.L75:
+        mov     edi, ebx
         call    fibo(int)
+        add     ebp, eax
+        sub     ebx, 1
+        jnb     .L75
+        sub     r12d, 1
+        lea     r13d, [r13+1+rbp]
+        xor     ebp, ebp
+        lea     ebx, [r12-2]
+        cmp     r12d, 1
+        ja      .L75
+.L80:
+        add     r13d, r12d
+        test    r12d, r12d
+        jne     .L81
+        lea     esi, [r13+1]
         mov     edi, OFFSET FLAT:_ZSt4cout
-        mov     esi, eax
         call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
         mov     edx, 1
         mov     esi, OFFSET FLAT:.LC0
         mov     rdi, rax
         call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
-        xor     eax, eax
         add     rsp, 8
+        xor     eax, eax
+        pop     rbx
+        pop     rbp
+        pop     r12
+        pop     r13
         ret
+.L81:
+        xor     r12d, r12d
+        jmp     .L71
 _GLOBAL__sub_I_fibo(int):
         sub     rsp, 8
         mov     edi, OFFSET FLAT:_ZStL8__ioinit

For whatever reason, the compiler seems to be adding unnecessary code with -O3, which may well be a bug or unintended behavior. I am no expert on this, but I can’t imagine any scenario where the -O3 can run faster than -O2 with the added instructions above.

Surely enough, profiling reveals that -O2 runs with ~24% fewer instructions compared to -O3, aligning perfectly with 25% boost in runtime.

$ perf stat ./fibo2
102334155

 Performance counter stats for './fibo2':

            140.37 msec task-clock                #    0.994 CPUs utilized
                 0      context-switches          #    0.000 /sec
                 0      cpu-migrations            #    0.000 /sec
               127      page-faults               #  904.752 /sec
         594547853      cycles                    #    4.236 GHz                      (82.91%)
            229667      stalled-cycles-frontend   #    0.04% frontend cycles idle     (82.90%)
             51811      stalled-cycles-backend    #    0.01% backend cycles idle      (82.90%)
        2132262074      instructions              #    3.59  insn per cycle
                                                  #    0.00  stalled cycles per insn  (82.90%)
         381545506      branches                  #    2.718 G/sec                    (83.15%)
           1285727      branch-misses             #    0.34% of all branches          (85.23%)

       0.141185168 seconds time elapsed

       0.141226000 seconds user
       0.000000000 seconds sys


$ perf stat ./fibo3
102334155

 Performance counter stats for './fibo3':

            174.57 msec task-clock                #    0.996 CPUs utilized
                 2      context-switches          #   11.457 /sec
                 0      cpu-migrations            #    0.000 /sec
               128      page-faults               #  733.219 /sec
         739885292      cycles                    #    4.238 GHz                      (81.68%)
            157639      stalled-cycles-frontend   #    0.02% frontend cycles idle     (83.24%)
              1846      stalled-cycles-backend    #    0.00% backend cycles idle      (83.96%)
        2617414601      instructions              #    3.54  insn per cycle
                                                  #    0.00  stalled cycles per insn  (83.98%)
         395343682      branches                  #    2.265 G/sec                    (83.96%)
           1280136      branch-misses             #    0.32% of all branches          (83.18%)

       0.175316380 seconds time elapsed

       0.175308000 seconds user
       0.000000000 seconds sys

So, the lesson I learned was

Do not trust compiler optimization. Always run benchmark with different optimization options and choose the one that runs the fastest.

Update: I found the culprit behind: -fipa-cp-clone. This flag seems to be causing all the trouble. I wonder if there is a bug that handles this flag.