A close encounter with false sharing

False sharing happens when multiple CPU cores modify different but closely located memory, unintentionally sharing a cache line as a result. Atomic operations on memory not shared with other threads are normally uncontended, and therefore almost as fast as non-atomic ones, but false sharing can cause them to execute orders of magnitude more slowly. The usual remedy for false sharing is to space the data farther apart, forcing it into distinct cache lines. While I’ve been aware of false sharing for years, it always seemed like a concern in the domain of experts, one of those things that don’t affect code written by mere mortals. This is a story of how I encountered it in my own code.
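To make the effect concrete, here is a minimal standalone sketch (the struct names and iteration counts are mine, not taken from the code discussed below): two threads each increment their own counter, once with the counters laid out next to each other and once with each counter over-aligned onto its own 128-byte region.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Instant;

// Two logically independent counters that will almost certainly share a cache line.
struct Adjacent {
    a: AtomicU64,
    b: AtomicU64,
}

// The same kind of counter, over-aligned so that each one gets its own 128-byte region.
#[repr(align(128))]
struct Padded(AtomicU64);

fn main() {
    const N: u64 = 50_000_000;

    let adjacent = Adjacent { a: AtomicU64::new(0), b: AtomicU64::new(0) };
    let start = Instant::now();
    thread::scope(|s| {
        // Each thread touches only "its" counter, yet both compete for one cache line.
        s.spawn(|| (0..N).for_each(|_| { adjacent.a.fetch_add(1, Ordering::Relaxed); }));
        s.spawn(|| (0..N).for_each(|_| { adjacent.b.fetch_add(1, Ordering::Relaxed); }));
    });
    println!("adjacent counters: {:?}", start.elapsed());

    let padded = [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];
    let start = Instant::now();
    thread::scope(|s| {
        // The same work, but the counters now live on distinct cache lines.
        s.spawn(|| (0..N).for_each(|_| { padded[0].0.fetch_add(1, Ordering::Relaxed); }));
        s.spawn(|| (0..N).for_each(|_| { padded[1].0.fetch_add(1, Ordering::Relaxed); }));
    });
    println!("padded counters:   {:?}", start.elapsed());
}

On a typical multi-core x86-64 machine the padded run should finish noticeably faster than the adjacent one, even though both do exactly the same amount of work.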

Atomic counter

I work on batch processing software that goes through a large number of events, anywhere from thousands to hundreds of billions. To speed things up, the program typically spawns as many threads as there are physical cores and processes events in parallel. In debugging and production alike it is very useful to log the progress of processing, both to gauge how long it will take to finish and to spot when a bug has slowed it down. As this is needed in many places, it's abstracted into a lightweight progress tracker/counter with a minimal API:

pub struct Progress { ... }

impl Progress {
    /// Construct Progress.
    pub fn new() -> Self { ... }
    /// Bump the counter, and occasionally print a log message.
    pub fn bump(&self) { ... }
    /// Consume the counter, and print a final log message.
    pub fn finish(self) { ... }
}

// example usage:
fn process(input: impl Iterator<Item = Data>) {
    let progress = Progress::new();
    for event in input {
        progress.bump(); // bump the counter, occasionally logging a message
        // ... do the work ...
    }
    progress.finish(); // log the total count of processed events
}

For real-world use it’s crucial for bump() to take &self rather than &mut self, and for Progress to be Sync. That allows reporting progress from multiple threads, typically by calling bump() from closures invoked by Rayon parallel iterators. Here is another example, one that we can actually run and measure:

use rayon::prelude::*;

fn main() {
    let progress = Progress::new();
    let data = vec![0u64..10_000_000; 128];
    let sum: f64 = data
        .into_par_iter()
        .map(|subiter| {
            let mut s = 0.;
            for n in subiter {
                progress.bump();
                s += (n as f64).sqrt().sin()
            }
            s
        })
        .sum();
    progress.finish();
    println!("final sum: {sum}");
}

Of course, the progress of this particular calculation could be reported in a number of better ways – but imagine it as a placeholder for a more complex calculation where the number of elements or the speed of execution is not known in advance.

With the progress calls commented out, the above program takes 1.8s to execute on my 12-core CPU. With RAYON_NUM_THREADS=1 it takes 9.83s, so the parallelization is far from perfect, but it still brings a massive speedup. Multiple runs exhibit only minor variations in timing.

Given the constraints above, the obvious implementation of Progress would use an AtomicU64 counter incremented on every bump(), and occasionally check whether to log:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

// The concrete fields behind the `Progress { ... }` sketch above.
pub struct Progress {
    count: AtomicU64,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: 0.into(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let cnt = self.count.fetch_add(1, Ordering::Relaxed);
        if cnt % 0x1000000 == 0 {
            self.maybe_log();
        }
    }

    #[cold]
    fn maybe_log(&self) {
        let Some(mut last_logged) = self.last_logged.try_lock().ok() else {
            // never allow this mutex to introduce waiting
            return;
        };
        if last_logged.elapsed().as_secs_f64() < 5. {
            return;
        }
        eprintln!("processed {} events", self.fetch_count());
        *last_logged = Instant::now();
    }

    fn fetch_count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }

    pub fn finish(self) {
        eprintln!("processed {} events", self.fetch_count());
    }
}

The code is intended to be efficient: on the hot path, it executes a single atomic increment and a fast power-of-two-modulo check to see if it needs to do more. Only once in several million invocations does it actually check the time to see if it needs to log the progress message. Compared to the much heavier floating-point work, one would expect the atomic increment and the subsequent check to be negligible.
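The cheapness of the check itself is easy to verify in isolation: because 0x1000000 is a power of two, the compiler can lower the modulo to a simple bit mask, so the branch costs no more than an AND and a compare. A tiny standalone sketch (not part of Progress) of the equivalence:

fn main() {
    // The hot-path check `cnt % 0x1000000 == 0` is a modulo by a power of two,
    // which reduces to a bit mask: the two forms below always agree.
    for cnt in [0u64, 1, 0xFF_FFFF, 0x100_0000, 0x100_0001] {
        assert_eq!(cnt % 0x100_0000 == 0, cnt & 0xFF_FFFF == 0);
    }
    println!("modulo check is just a mask");
}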

But running the code shows otherwise – this version takes a whopping 38s to execute. In other words, progress reporting has made the code 20x slower.

Since bump() only does two things – increment the atomic counter and check it against a power-of-two modulus – and we know the latter to be fast, the attention must be on the former. On x86-64, AtomicU64::fetch_add() compiles to a single LOCK XADD instruction. But executing that instruction on the same counter in parallel requires careful coordination among cores: the cache line holding the counter has to bounce from core to core, and if 12 cores are doing little other than incrementing this counter, they will spend most of their time in synchronization (the standalone sketch below isolates just that cost). Indeed, running with RAYON_NUM_THREADS=1 shortens the runtime to 11s, bringing it close to the original single-threaded version. So we have a serious case of contention on our hands – what can we do about it?
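Here is that sketch, a minimal micro-benchmark separate from Progress (the thread count and iteration counts are arbitrary choices of mine). It performs the same total number of relaxed increments on one shared counter, first from a single thread and then from 12 threads:

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

fn hammer(counter: &AtomicU64, threads: usize, increments_per_thread: u64) -> Duration {
    let start = Instant::now();
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..increments_per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    start.elapsed()
}

fn main() {
    const TOTAL: u64 = 120_000_000;
    let counter = AtomicU64::new(0);
    // All increments issued from one thread: the counter stays in a single core's cache.
    println!("1 thread:   {:?}", hammer(&counter, 1, TOTAL));
    // The same total spread over 12 threads: every fetch_add has to pull the cache
    // line away from whichever core touched it last.
    println!("12 threads: {:?}", hammer(&counter, 12, TOTAL / 12));
}

On a typical multi-core machine the 12-thread run is usually no faster than the single-threaded one, and often slower, even though each thread does only a twelfth of the work.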

Thread-local counter

The obvious way to mitigate contention is to eliminate or reduce sharing of mutable data. That’s non-trivial in this case because Progress is optimized for convenience and usage with Rayon, so it can’t offer the usual alternatives such as message passing or per-worker handles which cache some data. To eliminate sharing from Progress we must look in a different direction: thread-local data. If we could create a separate counter for each thread, there would be no contention. The standard library doesn’t make this easy because it only supports static thread-locals, but thanks to Amanieu (who also wrote parking_lot and hashbrown, among others) and his thread_local crate, we can easily embed thread-local data into Progress:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;
use thread_local::ThreadLocal;

pub struct Progress {
    count: ThreadLocal<AtomicU64>,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: ThreadLocal::new(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let local_cnt = self.count.get_or_default().fetch_add(1, Ordering::Relaxed);
        if local_cnt % 0x100000 == 0 {
            self.maybe_log();
        }
    }

    fn fetch_count(&self) -> u64 {
        self.count.iter().map(|x| x.load(Ordering::Relaxed)).sum()
    }

    // ...maybe_log() and finish() are exactly the same as in the previous version
}

This version of bump() appears to do more, but should still be much more efficient. self.count.get_or_default() obtains a reference to a thread-local atomic, which it then increments without contention. The time check triggers a bit more often (since each thread now observes only its own count, the threshold is lowered to 0x100000), but still rarely enough not to compromise performance. So how does this version fare?

It runs in 7.86s, 4.4x slower than the original. That’s… disappointing. On the one hand, it’s significantly better than our first attempt, so at least we were on to something with the contention hypothesis. But on the other hand, this is still bad enough to render Progress unusable for calculations that do little work per event.

What went wrong? Is ThreadLocal::get_or_default() just slow, or is AtomicU64::fetch_add() slow even when uncontended? Looking at the source of thread_local, it doesn't seem like it should introduce slowness – on the hot path it boils down to a relaxed atomic load and a pointer dereference, and it's carefully written, inline-friendly code by an expert author. What remains is the possibility that fetch_add() is slow even though each counter belongs to a single thread, and that can only be explained by contention. This is where false sharing comes into the picture.

False sharing

ThreadLocal allocates data on the heap, and it does so in buckets that hold multiple values. Also, when allocations are performed in quick succession at the beginning of processing, they might be assigned heap addresses close to each other. Such proximity would normally be considered cache-friendly, a feature even, but here we want the exact opposite. According to crossbeam, “Intel’s prefetcher is pulling pairs of 64-byte cache lines at a time”, so false sharing can occur for data within 128 bytes of each other. Even after accounting for heap overhead, it's not hard to imagine 3-4 AtomicU64s allocated within the same 128-byte region, causing massive contention.
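One way to sanity-check the proximity theory before reaching for padding would be a debug-only helper that prints where each per-thread counter ended up; dump_layout below is a hypothetical addition to the thread-local version of Progress, not part of the real code:

impl Progress {
    /// Hypothetical debug helper: print where each thread's counter ended up.
    /// Addresses that fall into the same 128-byte block are candidates for
    /// false sharing via the prefetched pair of cache lines.
    fn dump_layout(&self) {
        for counter in self.count.iter() {
            let addr = counter as *const AtomicU64 as usize;
            eprintln!("counter at {:#x} (128-byte block {:#x})", addr, addr / 128);
        }
    }
}

If several counters report the same 128-byte block, false sharing is the likely culprit.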

This theory is trivial to test, simply by adding some padding next to each counter. A convenient container that does just that is provided by crossbeam, so we only need to change the definition of the count field in the struct to:

use crossbeam::utils::CachePadded;

pub struct Progress {
    count: ThreadLocal<CachePadded<AtomicU64>>, // previously ThreadLocal<AtomicU64>
    last_logged: Mutex<Instant>,
}

With that single change, the code runs in 1.95s, only 1.08x slower than the original code, the difference being close to statistical noise. That confirms the false sharing hypothesis, and shows that false sharing is not something only observed by professionals writing code for bare metal. If 8% slowdown still sounds like a lot, do keep in mind that the above calculation is very close to the worst case for this kind of interface, because the calculation does no IO or allocation and is very fine-grained. Calculations that do more work between calls to Progress::bump() would experience no measurable slowdown from progress reporting.
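As an aside, if pulling in crossbeam just for this feels heavy, an over-aligned newtype along these lines would serve the same purpose (PaddedCounter is my name, not an established type; this is a sketch, not the crate's implementation):

use std::sync::atomic::AtomicU64;
use std::sync::Mutex;
use std::time::Instant;
use thread_local::ThreadLocal;

// Hypothetical stand-in for CachePadded: over-align each counter so that no two
// of them can end up within the same 128-byte region.
#[repr(align(128))]
#[derive(Default)]
struct PaddedCounter(AtomicU64);

pub struct Progress {
    count: ThreadLocal<PaddedCounter>,
    last_logged: Mutex<Instant>,
}

Unlike CachePadded, this wrapper doesn't implement Deref, so bump() and fetch_count() would have to go through .0, and the 128-byte figure is a guess aimed at current x86-64; choosing the right alignment per target is exactly the detail CachePadded takes care of.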
