Let’s say you want to compare efficiency of two different implementations. By efficiency, I mean how fast a program can run. One crude way is to use time command and measure the runtime. This is easy to do but does not provide much info other than, well, the run time difference itself.

Another approach is to use a profiling tool to get a better idea of why one is more efficient than another. Below, I will perf to analyze two sort implementations, one from GNU coreutils and another from uutils.

# sort using Linux's built-in implementation, i.e., GNU coreutils
perf stat -e instructions,branches,branch-misses,cache-references,cache-misses sort -o output1.tsv input.tsv

Performance counter stats for 'sort -o output1.tsv input.tsv':

        356177606      instructions
        47026350      branches
        3296732      branch-misses             #    7.01% of all branches
        136292394      cache-references
        1363145      cache-misses              #    1.000 % of all cache refs

    0.692938577 seconds time elapsed

    0.224913000 seconds user
    0.301224000 seconds sys

# sort using uutils coreutils
perf stat -e instructions,branches,branch-misses,cache-references,cache-misses coreutils sort -o output2.tsv input.tsv

Performance counter stats for 'coreutils sort -o output2.tsv input.tsv':

        249930880      instructions
        29638029      branches
        2153360      branch-misses             #    7.27% of all branches
        97090433      cache-references
        870114      cache-misses              #    0.896 % of all cache refs

    0.424018461 seconds time elapsed

    0.108133000 seconds user
    0.268331000 seconds sys

Starting off with instructions, uutils implementation runs 30% fewer instructions, which typically translates to more efficient code.

The uutils implementation also requires substantially fewer branches and branch misses. Note that branch misses have enourmous effect on efficiency of a program. For more details on this topic, I recommend this excellent talk on branchless programming by Fedor Pikus.

Finally, we also see much fewer cache references and cache misses with the uutils implementation. Cache misses again have huge impact on the efficiency of the program. For more details on this topic, I recommend another excellent talk on cpu caches and why you care by Scott Meyers.

So, given these statistics we would expect the uutils implementation to run much faster than GNU coreutils, and indeed we observe ~40% reduction in runtime.