Let’s say you want to compare efficiency of two different implementations. By efficiency, I mean how fast a program can run. One crude way is to use time
command and measure the runtime. This is easy to do but does not provide much info other than, well, the run time difference itself.
Another approach is to use a profiling tool to get a better idea of why one is more efficient than another. Below, I will perf
to analyze two sort
implementations, one from GNU coreutils and another from uutils.
# sort using Linux's built-in implementation, i.e., GNU coreutils
perf stat -e instructions,branches,branch-misses,cache-references,cache-misses sort -o output1.tsv input.tsv
Performance counter stats for 'sort -o output1.tsv input.tsv':
356177606 instructions
47026350 branches
3296732 branch-misses # 7.01% of all branches
136292394 cache-references
1363145 cache-misses # 1.000 % of all cache refs
0.692938577 seconds time elapsed
0.224913000 seconds user
0.301224000 seconds sys
# sort using uutils coreutils
perf stat -e instructions,branches,branch-misses,cache-references,cache-misses coreutils sort -o output2.tsv input.tsv
Performance counter stats for 'coreutils sort -o output2.tsv input.tsv':
249930880 instructions
29638029 branches
2153360 branch-misses # 7.27% of all branches
97090433 cache-references
870114 cache-misses # 0.896 % of all cache refs
0.424018461 seconds time elapsed
0.108133000 seconds user
0.268331000 seconds sys
Starting off with instructions, uutils
implementation runs 30% fewer instructions, which typically translates to more efficient code.
The uutils
implementation also requires substantially fewer branches and branch misses. Note that branch misses have enourmous effect on efficiency of a program. For more details on this topic, I recommend this excellent talk on branchless programming by Fedor Pikus.
Finally, we also see much fewer cache references and cache misses with the uutils
implementation. Cache misses again have huge impact on the efficiency of the program. For more details on this topic, I recommend another excellent talk on cpu caches and why you care by Scott Meyers.
So, given these statistics we would expect the uutils
implementation to run much faster than GNU coreutils
, and indeed we observe ~40% reduction in runtime.