Speed up C++ — false sharing

This story is relevant not only to programs written in C++ but also to programs written in other languages.

Cache is an essential component of modern computer architecture that plays a crucial role in improving performance. A cache is a small, fast memory that stores frequently accessed data and instructions, allowing the processor to access them quickly without having to access slower, larger main memory.

The importance of cache in modern computer architecture can be understood by its impact on performance. By storing frequently used data and instructions in a cache, the processor can access them quickly, reducing the amount of time it takes to access data from the slower main memory. This results in faster execution of programs and improved overall system performance.

Now that I have briefly described what a cache does, let’s talk about false sharing. False sharing is a performance issue that occurs when multiple threads or processes access different variables that happen to be located in the same cache line. A cache line is the unit of data transferred between main memory and the cache: the hardware always transfers entire cache lines, so when one thread updates a variable in a cache line, the copy of that line in every other core’s cache is invalidated and must be reloaded, even if the other variables in that cache line were not modified.

Well, let’s take a look at a simple example. I will again use the π-estimator function described in my previous story.

#include <cstdint>
#include <numeric>
#include <vector>
#include <omp.h>

double estimate_pi_false_share(uint64_t num_steps) {
    auto max_threads = omp_get_max_threads();
    auto delta = 1.0 / static_cast<double>(num_steps);
    std::vector<double> partial_sum(max_threads, 0.0);
    int num_threads;
#pragma omp parallel default(none) shared(num_steps, delta, max_threads, num_threads, partial_sum)
    {
        if (omp_get_thread_num() == 0) num_threads = omp_get_num_threads();
        for (uint64_t step = omp_get_thread_num(); step < num_steps; step += omp_get_num_threads()) {
            auto x = delta * (static_cast<double>(step) + 0.5);
            partial_sum[omp_get_thread_num()] += 4.0 / (1 + x * x);
        }
    }
    return std::accumulate(std::begin(partial_sum), std::begin(partial_sum) + num_threads, 0.0) * delta;
}

You don’t need to understand what the function does. It suffices to say that it runs a parallel for-loop where each thread accumulates its partial sum into the element of the array corresponding to its thread id. In particular, the line below is what we need to pay attention to:

partial_sum[omp_get_thread_num()] += 4.0 / (1 + x * x);

Here, omp_get_thread_num() returns an id from 0 to N−1, where N is the number of concurrent threads running. Note that each thread writes only to its own element, so there is no race condition involved. However, as I mentioned above, the problem is that all the threads are writing to adjacent elements of the array. This leads to false sharing: as soon as one thread updates its element, the cache line holding the array is invalidated in every other core’s cache and must be reloaded.

Let’s run and measure the runtime of the single-threaded version vs. the multi-threaded version. Ideally, the multi-threaded implementation should reduce the runtime by a factor of the number of threads. However, what we see is that the multi-threaded version runs roughly 7x SLOWER!

1-thread: 1.16 sec
6-threads: 8.25 sec

False sharing can hurt performance significantly because it leads to unnecessary cache line invalidation and reloading. When a cache line is invalidated, the next access on that core stalls while the cache-coherence protocol fetches the up-to-date line from another core’s cache or from main memory, which is slow compared to a local cache hit. With many threads bouncing the same line back and forth, these stalls dominate the runtime.

So what is the fix? Threads must not repeatedly write to the same cache line. We can give each thread a local variable on its own stack and accumulate the partial sum there inside the loop. At the end, each thread writes to the shared array just once.

--- false sharing
+++ true sharing
@@ -1,4 +1,4 @@
-double estimate_pi_false_share(uint64_t num_steps) {
+double estimate_pi_true_share(uint64_t num_steps) {
     auto max_threads = omp_get_max_threads();
     auto delta = 1.0 / static_cast<double>(num_steps);
     std::vector<double> partial_sum(max_threads, 0.0);
@@ -6,10 +6,12 @@
 #pragma omp parallel default(none) shared(num_steps, delta, max_threads, num_threads, partial_sum)
     {
         if (omp_get_thread_num() == 0) num_threads = omp_get_num_threads();
+        auto local_sum = 0.0;
         for (uint64_t step = omp_get_thread_num(); step < num_steps; step += omp_get_num_threads()) {
             auto x = delta * (static_cast<double>(step) + 0.5);
-            partial_sum[omp_get_thread_num()] += 4.0 / (1 + x * x);
+            local_sum += 4.0 / (1 + x * x);
         }
+        partial_sum[omp_get_thread_num()] = local_sum;
     }
     return std::accumulate(std::begin(partial_sum), std::begin(partial_sum) + num_threads, 0.0) * delta;
 }

Here is the runtime:

1-thread: 1.16 sec
6-threads: 8.25 sec # false sharing
6-threads: 0.21 sec # true sharing

*Note that I compiled with clang 14 using -O3 optimization on an x86-64 processor. When I tested with gcc 11.3 using -O3, I did not see any performance hit from the false-sharing implementation. Apparently, gcc is clever enough to optimize it away, at least in this simple example.