Whichever language we write our program with, at the end of the day the executable is created by translating into the machine instructions. In general, fewer instructions require fewer CPU cycles and hence run faster. So I have been wondering how a simple program written in various languages compare in terms of # instructions, # cache references, and # branches.
Let’s start off with the most simple program, which simply reads input text file and prints out line by line. We will write this in C
, C++
, and Rust
.
// stdio.cc
#include <cstdio>
#include <string>
#include <cstdlib>
/**
* Copies input to output line by line
* Usage: io [INPUT [OUTPUT]]
* If INPUT/OUTPUT is omitted, stdin/stdout is assumed
*/
int main(int argc, const char** argv) {
std::string input = "-";
std::string output = "-";
if (argc >= 2) input = argv[1];
if (argc >= 3) output = argv[2];
if (input == "-") input = "/dev/stdin";
if (output == "-") output = "/dev/stdout";
FILE* ifs = fopen(input.data(), "r");
FILE* ofs = fopen(output.data(), "w");
char* line = nullptr;
size_t capacity;
while (auto n = getline(&line, &capacity, ifs)) {
if (n <= 0) break;
fwrite(line, sizeof(char), n, ofs);
}
free(line);
fclose(ifs);
fclose(ofs);
return 0;
}
Ok, this is technically C++
but it uses stdio
and stdlib
library from C
, so let’s just call it C
-implementation.
// iostream.cc
#include <iostream>
#include <fstream>
/**
* Copies input to output line by line
* Usage: io [INPUT [OUTPUT]]
* If INPUT/OUTPUT is omitted, stdin/stdout is assumed
*/
int main(int argc, const char** argv) {
std::ios::sync_with_stdio(false);
std::string input = "-";
std::string output = "-";
if (argc >= 2) input = argv[1];
if (argc >= 3) output = argv[2];
if (input == "-") input = "/dev/stdin";
if (output == "-") output = "/dev/stdout";
std::ifstream ifs{input};
std::ofstream ofs{output};
std::string line;
while (std::getline(ifs, line)) {
ofs << line << "\n";
}
return 0;
}
For our C++
program above, it is almost identical to C
-version, except it uses iostream
from the C++
standard library.
// rust_io.rs
use std::io::{BufReader, BufRead, BufWriter, Write};
use std::fs::File;
/**
* Copies input to output line by line
* Usage: io [INPUT [OUTPUT]]
* If INPUT/OUTPUT is omitted, stdin/stdout is assumed
*/
fn main() -> std::io::Result<()> {
let args: Vec<_> = std::env::args().collect();
let input_file = if args.len() >= 2 { args[1].clone() } else { "/dev/stdin".to_owned() };
let output_file = if args.len() >= 3 { args[2].clone() } else { "/dev/stdout".to_owned() };
let ifs = BufReader::new(File::open(input_file)?);
let mut ofs = BufWriter::new(File::create(output_file)?);
for line in ifs.lines() {
let line = line?;
writeln!(ofs, "{}", line)?;
}
Ok(())
}
Finally the rust
code.
So, any guess as to which program will run with the least instructions? cache references? branches?
To be fair, I will use LLVM/Clang to compile C
and C++
, since Rust
uses LLVM as the backend. On my arm64 architecture, here is the result:
clang++ -O3 -std=c++11 -o iostream iostream.cc
clang++ -O3 -std=c++11 -o stdio stdio.cc
rustc -C opt-level=3 rust_io.rs -o rust_io
perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./stdio input.tsv output1.tsv \
&& perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./iostream input.tsv output2.tsv \
&& perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./rust_io input.tsv output3.tsv
Performance counter stats for './stdio input.tsv output1.tsv':
175404790 instructions
21383967 branches
1848947 branch-misses # 8.65% of all branches
67305342 cache-references
864562 cache-misses # 1.285 % of all cache refs
0.414314617 seconds time elapsed
0.064524000 seconds user
0.201637000 seconds sys
Performance counter stats for './iostream input.tsv output2.tsv':
205537304 instructions
24900665 branches
1957023 branch-misses # 7.86% of all branches
85756002 cache-references
987124 cache-misses # 1.151 % of all cache refs
0.319406442 seconds time elapsed
0.145116000 seconds user
0.161240000 seconds sys
Performance counter stats for './rust_io input.tsv output3.tsv':
206013042 instructions
24794799 branches
2083127 branch-misses # 8.40% of all branches
82961063 cache-references
914823 cache-misses # 1.103 % of all cache refs
0.321221279 seconds time elapsed
0.084762000 seconds user
0.209888000 seconds sys
Disregard the elapsed time, as it was run on my Raspberry Pi Zero 2W, which does not seem to produce consistent runtime. We see that the C
-version produces the least instructions, while C++
and Rust
versions are comparable. More importantly, the number of branches (and misses) and cache references (and misses) are fewer on the C
-version, which will likely speed up the program.
So far, not too surprising. This is probably what we expect. The lower-level languages tend to produce smaller executable compared to high-level languages. Ok, then let’s write another program. This time, let’s sort the input lines before printing out.
// sort_c.cc
#include <cstdio>
#include <string>
#include <cstring>
#include <algorithm>
#include <vector>
int comp(char** a, char** b) {
return std::strcmp(*a, *b);
}
/**
* sort
* Usage: io [INPUT [OUTPUT]]
* If INPUT/OUTPUT is omitted, stdin/stdout is assumed
*/
int main(int argc, const char** argv) {
std::string input = "-";
std::string output = "-";
if (argc >= 2) input = argv[1];
if (argc >= 3) output = argv[2];
if (input == "-") input = "/dev/stdin";
if (output == "-") output = "/dev/stdout";
FILE* ifs = fopen(input.data(), "r");
FILE* ofs = fopen(output.data(), "w");
std::vector<char*> lines;
while (true) {
char* line = nullptr;
size_t capacity;
auto n = getline(&line, &capacity, ifs);
if (n <= 0) break;
lines.push_back(line);
}
qsort(lines.data(), lines.size(), sizeof(*lines.data()), reinterpret_cast<int (*)(const void *, const void *)>(comp));
for (auto &line : lines) {
fprintf(ofs, "%s", line);
free(line);
}
fclose(ifs);
fclose(ofs);
return 0;
}
Again, I am writing a hybrid C
/C++
version, since I don’t want to re-implement vector
in C
, but I am using qsort
from C
, which will presumably consume most of the instructions.
// sort_cc.cc
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
/**
* sort
* Usage: io [INPUT [OUTPUT]]
* If INPUT/OUTPUT is omitted, stdin/stdout is assumed
*/
int main(int argc, const char** argv) {
std::ios::sync_with_stdio(false);
std::string input = "-";
std::string output = "-";
if (argc >= 2) input = argv[1];
if (argc >= 3) output = argv[2];
if (input == "-") input = "/dev/stdin";
if (output == "-") output = "/dev/stdout";
std::ifstream ifs{input};
std::ofstream ofs{output};
std::vector<std::string> lines;
std::string line;
while (std::getline(ifs, line)) {
lines.push_back(std::move(line));
}
std::sort(std::begin(lines), std::end(lines));
for (auto &line : lines) {
ofs << line << "\n";
}
return 0;
}
For C++
, I am using std::sort
.
use std::io::{BufReader, BufRead, BufWriter, Write};
use std::fs::File;
/**
* sort
* Usage: io [INPUT [OUTPUT]]
* If INPUT/OUTPUT is omitted, stdin/stdout is assumed
*/
fn main() -> std::io::Result<()> {
let args: Vec<_> = std::env::args().collect();
let input_file = if args.len() >= 2 { args[1].clone() } else { "/dev/stdin".to_owned() };
let output_file = if args.len() >= 3 { args[2].clone() } else { "/dev/stdout".to_owned() };
let ifs = BufReader::new(File::open(input_file)?);
let mut ofs = BufWriter::new(File::create(output_file)?);
let mut lines = Vec::new();
for line in ifs.lines() {
let line = line?;
lines.push(line);
}
lines.sort_unstable();
for line in lines.into_iter() {
writeln!(ofs, "{}", line)?;
}
Ok(())
}
For Rust
, use sort_unstable to be consistent with the C
and C++
implementation based on quick sort.
Now, another guess on # instructions, cache references, and branches?
clang++ -O3 -std=c++11 -o sort_c sort_c.cc
clang++ -O3 -std=c++11 -o sort_cc sort_cc.cc
rustc -C opt-level=3 sort_rust.rs -o sort_rust
perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./sort_c input.tsv output1.tsv \
&& perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./sort_cc input.tsv output2.tsv \
&& perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./sort_rust input.tsv output3.tsv
Performance counter stats for './sort_c input.tsv output1.tsv':
390649760 instructions
45792242 branches
3850947 branch-misses # 8.41% of all branches
133946131 cache-references
3998514 cache-misses # 2.985 % of all cache refs
1.315745627 seconds time elapsed
0.424391000 seconds user
0.375422000 seconds sys
Performance counter stats for './sort_cc input.tsv output2.tsv':
415197504 instructions
52343350 branches
3141354 branch-misses # 6.00% of all branches
161482085 cache-references
3915068 cache-misses # 2.424 % of all cache refs
0.864817779 seconds time elapsed
0.458420000 seconds user
0.326873000 seconds sys
Performance counter stats for './sort_rust input.tsv output3.tsv':
352974111 instructions
42474938 branches
2562994 branch-misses # 6.03% of all branches
131108106 cache-references
3121866 cache-misses # 2.381 % of all cache refs
1.101268364 seconds time elapsed
0.366721000 seconds user
0.255110000 seconds sys
To me, the results were surprising because
- # instructions from the
Rust
-version is even fewer than theC
-version - % branch misses and cache misses on the
C
-version is much higher than theC++
andRust
versions
I guess the Rust
’s sort_unstable
implementation is much more efficiently written than qsort
or std::sort
. In any case, Rust
seems to be the clear winner here. Kudos to the Rust
-Language developers!
All the source code files are available at https://github.com/TechHara/iostream_benchmark.git