Whichever language we write our program with, at the end of the day the executable is created by translating into the machine instructions. In general, fewer instructions require fewer CPU cycles and hence run faster. So I have been wondering how a simple program written in various languages compare in terms of # instructions, # cache references, and # branches.

Let’s start off with the most simple program, which simply reads input text file and prints out line by line. We will write this in C, C++, and Rust.

// stdio.cc
#include <cstdio>
#include <string>
#include <cstdlib>

/**
 * Copies input to output line by line
 * Usage: io [INPUT [OUTPUT]]
 * If INPUT/OUTPUT is omitted, stdin/stdout is assumed
 */
int main(int argc, const char** argv) {
  std::string input = "-";
  std::string output = "-";

  if (argc >= 2) input = argv[1];
  if (argc >= 3) output = argv[2];
  if (input == "-") input = "/dev/stdin";
  if (output == "-") output = "/dev/stdout";

  FILE* ifs = fopen(input.data(), "r");
  FILE* ofs = fopen(output.data(), "w");

  char* line = nullptr;
  size_t capacity;
  while (auto n = getline(&line, &capacity, ifs)) {
    if (n <= 0) break;
    fwrite(line, sizeof(char), n, ofs);
  }

  free(line);
  fclose(ifs);
  fclose(ofs);

  return 0;
}

Ok, this is technically C++ but it uses stdio and stdlib library from C, so let’s just call it C-implementation.

// iostream.cc
#include <iostream>
#include <fstream>

/**
 * Copies input to output line by line
 * Usage: io [INPUT [OUTPUT]]
 * If INPUT/OUTPUT is omitted, stdin/stdout is assumed
 */
int main(int argc, const char** argv) {
  std::ios::sync_with_stdio(false);
  std::string input = "-";
  std::string output = "-";

  if (argc >= 2) input = argv[1];
  if (argc >= 3) output = argv[2];
  if (input == "-") input = "/dev/stdin";
  if (output == "-") output = "/dev/stdout";

  std::ifstream ifs{input};
  std::ofstream ofs{output};

  std::string line;
  while (std::getline(ifs, line)) {
    ofs << line << "\n";
  }

  return 0;
}

For our C++ program above, it is almost identical to C-version, except it uses iostream from the C++ standard library.

// rust_io.rs
use std::io::{BufReader, BufRead, BufWriter, Write}; 
use std::fs::File;

/**
 * Copies input to output line by line
 * Usage: io [INPUT [OUTPUT]]
 * If INPUT/OUTPUT is omitted, stdin/stdout is assumed
 */
fn main() -> std::io::Result<()> {
    let args: Vec<_> = std::env::args().collect();
    let input_file = if args.len() >= 2 { args[1].clone() } else { "/dev/stdin".to_owned() };
    let output_file = if args.len() >= 3 { args[2].clone() } else { "/dev/stdout".to_owned() };

    let ifs = BufReader::new(File::open(input_file)?);
    let mut ofs = BufWriter::new(File::create(output_file)?);

    for line in ifs.lines() {
        let line = line?;
        writeln!(ofs, "{}", line)?;
    }

    Ok(())
}

Finally the rust code.

So, any guess as to which program will run with the least instructions? cache references? branches?

To be fair, I will use LLVM/Clang to compile C and C++, since Rust uses LLVM as the backend. On my arm64 architecture, here is the result:

clang++ -O3 -std=c++11 -o iostream iostream.cc
clang++ -O3 -std=c++11 -o stdio stdio.cc
rustc -C opt-level=3 rust_io.rs -o rust_io

perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./stdio input.tsv output1.tsv \
 && perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./iostream input.tsv output2.tsv \
 && perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./rust_io input.tsv output3.tsv

Performance counter stats for './stdio input.tsv output1.tsv':

        175404790      instructions
        21383967      branches
        1848947      branch-misses             #    8.65% of all branches
        67305342      cache-references
        864562      cache-misses              #    1.285 % of all cache refs

    0.414314617 seconds time elapsed

    0.064524000 seconds user
    0.201637000 seconds sys



Performance counter stats for './iostream input.tsv output2.tsv':

        205537304      instructions
        24900665      branches
        1957023      branch-misses             #    7.86% of all branches
        85756002      cache-references
        987124      cache-misses              #    1.151 % of all cache refs

    0.319406442 seconds time elapsed

    0.145116000 seconds user
    0.161240000 seconds sys



Performance counter stats for './rust_io input.tsv output3.tsv':

        206013042      instructions
        24794799      branches
        2083127      branch-misses             #    8.40% of all branches
        82961063      cache-references
        914823      cache-misses              #    1.103 % of all cache refs

    0.321221279 seconds time elapsed

    0.084762000 seconds user
    0.209888000 seconds sys

Disregard the elapsed time, as it was run on my Raspberry Pi Zero 2W, which does not seem to produce consistent runtime. We see that the C-version produces the least instructions, while C++ and Rust versions are comparable. More importantly, the number of branches (and misses) and cache references (and misses) are fewer on the C-version, which will likely speed up the program.

So far, not too surprising. This is probably what we expect. The lower-level languages tend to produce smaller executable compared to high-level languages. Ok, then let’s write another program. This time, let’s sort the input lines before printing out.

// sort_c.cc
#include <cstdio>
#include <string>
#include <cstring>
#include <algorithm>
#include <vector>

int comp(char** a, char** b) {
  return std::strcmp(*a, *b);
}

/**
 * sort
 * Usage: io [INPUT [OUTPUT]]
 * If INPUT/OUTPUT is omitted, stdin/stdout is assumed
 */
int main(int argc, const char** argv) {
  std::string input = "-";
  std::string output = "-";

  if (argc >= 2) input = argv[1];
  if (argc >= 3) output = argv[2];
  if (input == "-") input = "/dev/stdin";
  if (output == "-") output = "/dev/stdout";

  FILE* ifs = fopen(input.data(), "r");
  FILE* ofs = fopen(output.data(), "w");

  std::vector<char*> lines;
  while (true) {
    char* line = nullptr;
    size_t capacity;
    auto n = getline(&line, &capacity, ifs);
    if (n <= 0) break;
    lines.push_back(line);
  }

  qsort(lines.data(), lines.size(), sizeof(*lines.data()), reinterpret_cast<int (*)(const void *, const void *)>(comp));

  for (auto &line : lines) {
    fprintf(ofs, "%s", line);
    free(line);
  }
  fclose(ifs);
  fclose(ofs);

  return 0;
}

Again, I am writing a hybrid C/C++ version, since I don’t want to re-implement vector in C, but I am using qsort from C, which will presumably consume most of the instructions.

// sort_cc.cc
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>

/**
 * sort
 * Usage: io [INPUT [OUTPUT]]
 * If INPUT/OUTPUT is omitted, stdin/stdout is assumed
 */
int main(int argc, const char** argv) {
  std::ios::sync_with_stdio(false);
  std::string input = "-";
  std::string output = "-";

  if (argc >= 2) input = argv[1];
  if (argc >= 3) output = argv[2];
  if (input == "-") input = "/dev/stdin";
  if (output == "-") output = "/dev/stdout";

  std::ifstream ifs{input};
  std::ofstream ofs{output};

  std::vector<std::string> lines;
  std::string line;
  while (std::getline(ifs, line)) {
    lines.push_back(std::move(line));
  }

  std::sort(std::begin(lines), std::end(lines));
  for (auto &line : lines) {
    ofs << line << "\n";
  }

  return 0;
}

For C++, I am using std::sort.

use std::io::{BufReader, BufRead, BufWriter, Write}; 
use std::fs::File;

/**
 * sort
 * Usage: io [INPUT [OUTPUT]]
 * If INPUT/OUTPUT is omitted, stdin/stdout is assumed
 */
fn main() -> std::io::Result<()> {
    let args: Vec<_> = std::env::args().collect();
    let input_file = if args.len() >= 2 { args[1].clone() } else { "/dev/stdin".to_owned() };
    let output_file = if args.len() >= 3 { args[2].clone() } else { "/dev/stdout".to_owned() };

    let ifs = BufReader::new(File::open(input_file)?);
    let mut ofs = BufWriter::new(File::create(output_file)?);

    let mut lines = Vec::new();
    for line in ifs.lines() {
        let line = line?;
        lines.push(line);
    }

    lines.sort_unstable();
    for line in lines.into_iter() {
        writeln!(ofs, "{}", line)?;
    }

    Ok(())
}

For Rust, use sort_unstable to be consistent with the C and C++ implementation based on quick sort.

Now, another guess on # instructions, cache references, and branches?

clang++ -O3 -std=c++11 -o sort_c sort_c.cc
clang++ -O3 -std=c++11 -o sort_cc sort_cc.cc
rustc -C opt-level=3 sort_rust.rs -o sort_rust

perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./sort_c input.tsv output1.tsv \
 && perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./sort_cc input.tsv output2.tsv \
 && perf stat -e instructions,branches,branch-misses,cache-references,cache-misses ./sort_rust input.tsv output3.tsv

Performance counter stats for './sort_c input.tsv output1.tsv':

        390649760      instructions
        45792242      branches
        3850947      branch-misses             #    8.41% of all branches
        133946131      cache-references
        3998514      cache-misses              #    2.985 % of all cache refs

    1.315745627 seconds time elapsed

    0.424391000 seconds user
    0.375422000 seconds sys



Performance counter stats for './sort_cc input.tsv output2.tsv':

        415197504      instructions
        52343350      branches
        3141354      branch-misses             #    6.00% of all branches
        161482085      cache-references
        3915068      cache-misses              #    2.424 % of all cache refs

    0.864817779 seconds time elapsed

    0.458420000 seconds user
    0.326873000 seconds sys



Performance counter stats for './sort_rust input.tsv output3.tsv':

        352974111      instructions
        42474938      branches
        2562994      branch-misses             #    6.03% of all branches
        131108106      cache-references
        3121866      cache-misses              #    2.381 % of all cache refs

    1.101268364 seconds time elapsed

    0.366721000 seconds user
    0.255110000 seconds sys

To me, the results were surprising because

  • # instructions from the Rust-version is even fewer than the C-version
  • % branch misses and cache misses on the C-version is much higher than the C++ and Rust versions

I guess the Rust’s sort_unstable implementation is much more efficiently written than qsort or std::sort. In any case, Rust seems to be the clear winner here. Kudos to the Rust-Language developers!

All the source code files are available at https://github.com/TechHara/iostream_benchmark.git