Quantitative analysis of program’s efficiency — Part 1 Link to heading

Let’s say you want to compare efficiency of two different implementations. By efficiency, I mean how fast a program can run. One crude way is to use “time” command and measure the runtime. This is easy to do but does not provide much info other than, well, the run time difference itself.

Another approach is to use a profiling tool to get a better idea of why one is more efficient than another. Below, I will “perf” to analyze two sort implementations, one from GNU coreutils and another from uutils coreutils.

# sort using Linux's built-in implementation, i.e., GNU coreutils
$ perf stat -e instructions,branches,branch-misses,cache-references,cache-misses sort -o output1.tsv input.tsv

 Performance counter stats for 'sort -o output1.tsv input.tsv':

         356177606      instructions
          47026350      branches
           3296732      branch-misses             #    7.01% of all branches
         136292394      cache-references
           1363145      cache-misses              #    1.000 % of all cache refs

       0.692938577 seconds time elapsed

       0.224913000 seconds user
       0.301224000 seconds sys

# sort using uutils coreutils
$ perf stat -e instructions,branches,branch-misses,cache-references,cache-misses coreutils sort -o output2.tsv input.tsv

 Performance counter stats for 'coreutils sort -o output2.tsv input.tsv':

         249930880      instructions
          29638029      branches
           2153360      branch-misses             #    7.27% of all branches
          97090433      cache-references
            870114      cache-misses              #    0.896 % of all cache refs

       0.424018461 seconds time elapsed

       0.108133000 seconds user
       0.268331000 seconds sys

Starting off with instructions, uutils implementation runs 30% fewer instructions, which typically translates to more efficient code.

The uutils implementation also requires substantially fewer branches and branch misses. Note that branch misses have enourmous effect on efficiency of a program. For more details on this topic, I recommend this excellent talk on branchless programming by Fedor Pikus.

Finally, we also see much fewer cache references and cache misses with the uutils implementation. Cache misses again have huge impact on the efficiency of the program. For more details on this topic, I recommend another excellent talk on cpu caches and why you care by Scott Meyers.

So, given these statistics we would expect the uutils implementation to run much faster than GNU coreutils, and indeed we observe ~40% reduction in runtime.