Tech

Rust global variables, two years on

November 26, 2023 Hrvoje

In November 2021 I wrote a blog post that examined Rust’s curious relationship with global variables. It aimed to explain why this ubiquitous language feature required external crates, and ended with personal recommendations on the use of globals in new code. Two years have passed, and Rust has changed enough that it’s time to take a fresh look. The rest of this text assumes you’ve read the previous article or are familiar with the subject.

Const Mutex and RwLock constructors

The first change is that Mutex::new() is const as of Rust 1.63, so this example from the previous post now compiles and works as expected:

// didn't compile two years ago, compiles now
static LOG_FILE: Mutex<String> = Mutex::new(String::new());

The foundation for this improvement was laid in 1.62, which replaced Mutex, RwLock, and Condvar with lightweight, non-allocating implementations on Linux. Rust 1.63 built on that to provide const construction of those types on all platforms. The result is that, for simple types, mutex-protected globals “just work” without doing anything special.
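
As a minimal sketch of what this enables, here is a const-constructed global being written and read using only the standard library. The helpers set_log_file() and log_file() are hypothetical names chosen for illustration:

```rust
use std::sync::Mutex;

// A const-constructed global: no OnceCell or lazy_static wrapper needed (Rust 1.63+).
static LOG_FILE: Mutex<String> = Mutex::new(String::new());

/// Replace the globally stored log file path (hypothetical helper).
pub fn set_log_file(path: &str) {
    let mut guard = LOG_FILE.lock().unwrap();
    guard.clear();
    guard.push_str(path);
}

/// Return a copy of the stored path (hypothetical helper).
pub fn log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}
```

Every access takes the lock, which is fine for a value that really is mutated over time.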

Although we no longer have to encase every static Mutex in a OnceCell or equivalent, we still need a cell-like wrapper for scenarios where locked writing is only done on first use to initialize the value. In that case subsequent accesses to the global are read-only and shouldn’t require locking, only an atomic check. This is a very common use of global variables, a good example being a global holding a lazily compiled regex.

This brings us to the next and more important news.

Once cell is now in std

In Rust 1.70, once_cell::sync::OnceCell from the once_cell crate was integrated into the standard library as std::sync::OnceLock. For the first time in Rust’s existence, you don’t need to write unsafe code, or bring in external crates that encapsulate it, to create a global/static variable initialized on first use. Usage is essentially the same as with once_cell:

use std::sync::OnceLock;
use regex::Regex;

pub fn log_file_regex() -> &'static Regex {
    static LOG_FILE_REGEX: OnceLock<Regex> = OnceLock::new();
    LOG_FILE_REGEX.get_or_init(|| Regex::new(r#"^\d+-[[:xdigit:]]{8}$"#).unwrap())
}

// use log_file_regex().is_match(some_name) anywhere in your program

This addition might not seem like a big deal at first, given that once_cell has provided the same functionality for years. However, having it in the standard library greatly benefits the language in several ways. First, initialize-on-first-use globals are very widely used by both applications and libraries, and both can now phase out crates like once_cell and lazy_static from their dependencies. Second, global variables can now be created by macro-generated code without awkward reexports of once_cell and other logistical issues. Third, it makes it easier to teach the language, with teaching materials no longer needing to decide whether to cover once_cell or lazy_static, nor explain why external crates are needed for global variables to begin with. This excruciatingly long StackOverflow answer is a good example of the quagmire, as is my previous blog post on this topic. The whole stdlib/unsafe section of the latter is now obsolete, as the same can be achieved safely with OnceLock at no loss of performance.

The work is not yet complete, however. Note how the static variable is placed inside the function that contains the sole call to OnceLock::get_or_init(). This pattern ensures that every access to the static OnceLock goes through one place which also initializes it. once_cell makes this less verbose through once_cell::sync::Lazy, but the equivalent stdlib type is not yet stable, being stuck on some technical issues. The workaround of placing the global into a function isn’t a significant obstacle, but it’s worth mentioning. It’s particularly relevant when comparing the ease of use of OnceLock with that of lazy_static::lazy_static! or once_cell::sync::Lazy, both of which offer the convenience of initializing in a single location without additional effort.
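
To illustrate the convenience that a Lazy-style type adds, here is a rough sketch of such a wrapper built on top of OnceLock. It is not how once_cell or the standard library implement theirs; the Lazy type below, the GREETING global, and the INIT_CALLS counter are all made up for this example:

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Illustrative stand-in for once_cell::sync::Lazy; not the real implementation.
pub struct Lazy<T> {
    cell: OnceLock<T>,
    init: fn() -> T,
}

impl<T> Lazy<T> {
    pub const fn new(init: fn() -> T) -> Self {
        Lazy { cell: OnceLock::new(), init }
    }
}

impl<T> Deref for Lazy<T> {
    type Target = T;

    fn deref(&self) -> &T {
        // The single place where the value is both initialized and accessed.
        self.cell.get_or_init(self.init)
    }
}

// Counts initializer runs, only to demonstrate that it runs once.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// The initializer lives next to the declaration; no wrapper function needed.
static GREETING: Lazy<String> = Lazy::new(|| {
    INIT_CALLS.fetch_add(1, Ordering::Relaxed);
    "hello".to_owned()
});

/// Any use of GREETING goes through Deref and initializes it on demand.
pub fn greeting_len() -> usize {
    GREETING.len()
}
```

Storing the initializer as a plain fn pointer keeps the sketch short, at the cost of not supporting capturing closures.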

What to use in 2024

Two years ago the TL;DR of my recommendation was to “use once_cell or lazy_static, depending on which syntax you prefer”. Now it shifts to: use standard-library facilities like OnceLock or atomics in almost all situations, and once_cell when you require convenience not yet covered by std.

In particular:

As before, when the type you want to use in static supports thread-safe interior mutability and has a const constructor, you can declare it as static directly. (The compiler will check all that for you, just see if it compiles.) This used to only include atomics, but now also includes mutexes and rwlocks. So if something like static CURRENT_CONFIG: Mutex<Option<Config>> = Mutex::new(None) or static SHOULD_LOG: AtomicBool = AtomicBool::new(true) works for you, go for it.
When this doesn’t work, or you need to initialize on first use, use std::sync::OnceLock, preferably encapsulated in a function as shown above.
If you create a large number of globals and want to avoid the boilerplate encapsulating each in a function, use once_cell::sync::Lazy. That type is likely to be stabilized in some form, which makes it preferable over lazy_static. There are no good reasons to use lazy_static in new code.
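
The first two recommendations can be sketched together as follows. Config and the helper functions are hypothetical and exist only to make the example self-contained:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, OnceLock};

/// Hypothetical configuration type, for illustration only.
#[derive(Clone, Debug, PartialEq)]
pub struct Config {
    pub name: String,
}

// Const-constructible types with thread-safe interior mutability: plain statics.
static SHOULD_LOG: AtomicBool = AtomicBool::new(true);
static CURRENT_CONFIG: Mutex<Option<Config>> = Mutex::new(None);

// Initialize on first use: OnceLock encapsulated in a function.
pub fn default_config() -> &'static Config {
    static DEFAULT: OnceLock<Config> = OnceLock::new();
    DEFAULT.get_or_init(|| Config { name: "default".to_owned() })
}

/// Return the current config, falling back to the lazily built default.
pub fn current_config() -> Config {
    CURRENT_CONFIG
        .lock()
        .unwrap()
        .clone()
        .unwrap_or_else(|| default_config().clone())
}

/// Replace the current config.
pub fn set_config(config: Config) {
    *CURRENT_CONFIG.lock().unwrap() = Some(config);
}

/// Read the logging flag without any locking.
pub fn should_log() -> bool {
    SHOULD_LOG.load(Ordering::Relaxed)
}
```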

Note that existing code that uses once_cell or lazy_static doesn’t require immediate attention. Those crates will remain available indefinitely, and they generate nearly identical assembly to that of the standard library’s OnceLock. The above recommendations are meant to guide your decisions regarding new code, or regarding code you’re refactoring anyway.

A close encounter with false sharing

November 19, 2023 Hrvoje

False sharing happens when multiple CPU cores modify different but closely located memory, unintentionally sharing a cache line as a result. Atomic operations on memory not shared with other threads are normally uncontended, and therefore almost as fast as non-atomic ones, but false sharing can cause them to execute orders of magnitude more slowly. The usual remedy for false sharing is to space the data farther apart, forcing it into distinct cache lines. While I’ve been aware of false sharing for years, it always seemed like a concern in the domain of experts, one of those things that don’t affect code written by mere mortals. This is a story of how I encountered it in my own code.

Atomic counter

I work on batch processing software that goes through a large number of events, numbering from thousands to hundreds of billions. To speed things up the program typically spawns as many threads as there are physical cores, and processes events in parallel. In debugging and production alike it’s very useful to log the progress of processing, so that one can gauge how long it will take for it to finish, as well as spot if a bug has slowed it down. As this is needed in many places, it’s abstracted into a lightweight progress tracker/counter with a minimal API:

pub struct Progress { ... }

impl Progress {
    /// Construct Progress.
    pub fn new() -> Self { ... }
    /// Bump the counter, and occasionally print a log message.
    pub fn bump(&self) { ... }
    /// Consume the counter, and print a final log message.
    pub fn finish(self) { ... }
}

// example usage:
fn process(input: impl Iterator<Item = Data>) {
    let progress = Progress::new();
    for event in input {
        progress.bump(); // bump the counter, occasionally logging a message
        // ... do the work ...
    }
    progress.finish(); // log the total count of processed events
}

For real-world use it’s crucial for bump() to take &self rather than &mut self, and for Progress to be Sync. That allows reporting progress from multiple threads, typically by calling bump() from closures invoked by Rayon parallel iterators. Here is another example, one that we can actually run and measure:

use rayon::prelude::*;

fn main() {
    let progress = Progress::new();
    let data = vec![0u64..10_000_000; 128];
    let sum: f64 = data
        .into_par_iter()
        .map(|subiter| {
            let mut s = 0.;
            for n in subiter {
                progress.bump();
                s += (n as f64).sqrt().sin()
            }
            s
        })
        .sum();
    progress.finish();
    println!("final sum: {sum}");
}

Of course, the progress of this particular calculation could be reported in a number of better ways – but imagine this to be a placeholder for a more complex calculation where the number of elements or the speed of execution is not known in advance.

With the progress calls commented out, the above program takes 1.8s to execute on my 12-core CPU. With RAYON_NUM_THREADS=1 it takes 9.83s, so the parallelization is far from perfect, but it still brings massive speedup. Multiple runs exhibit only minor variations in timing.

Given the constraints above, the obvious implementation of Progress would use an AtomicU64 counter incremented on every bump(), and occasionally check whether to log:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

pub struct Progress {
    count: AtomicU64,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: 0.into(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let cnt = self.count.fetch_add(1, Ordering::Relaxed);
        if cnt % 0x1000000 == 0 {
            self.maybe_log();
        }
    }

    #[cold]
    fn maybe_log(&self) {
        let Ok(mut last_logged) = self.last_logged.try_lock() else {
            // never allow this mutex to introduce waiting
            return;
        };
        if last_logged.elapsed().as_secs_f64() < 5. {
            return;
        }
        eprintln!("processed {} events", self.fetch_count());
        *last_logged = Instant::now();
    }

    fn fetch_count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }

    pub fn finish(self) {
        eprintln!("processed {} events", self.fetch_count());
    }
}

The code is intended to be efficient: on the hot path, it executes a single atomic increment and a fast power-of-two-modulo check to see if it needs to do more. Only once in several million invocations does it actually check the time to see if it needs to log the progress message. Compared to the much heavier floating-point work, one would expect the atomic increment and the subsequent check to be negligible.
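
The power-of-two check is cheap because, for unsigned integers, the remainder by a power of two is the same as masking the low bits. A small sketch of that equivalence, with hypothetical function names:

```rust
/// Log interval used by the first version of bump(); a power of two.
pub const LOG_INTERVAL: u64 = 0x1000000; // 16_777_216

/// The check as written in bump().
pub fn due_by_modulo(cnt: u64) -> bool {
    cnt % LOG_INTERVAL == 0
}

/// The equivalent bit test: true when the low 24 bits are all zero.
pub fn due_by_mask(cnt: u64) -> bool {
    (cnt & (LOG_INTERVAL - 1)) == 0
}
```

Optimizing compilers typically perform this rewrite on their own, so the modulo form can stay in the source for readability.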

But running the code shows otherwise – this version takes a whopping 38s to execute. In other words, progress reporting has made the code roughly 20x slower.

Since bump() only does two things – increment the atomic counter and check it against a modulo-power-of-2 – and we know the latter to be fast, the attention must be on the former. On x86-64, AtomicU64::fetch_add() compiles to a single instruction: LOCK XADD when the previous value is used, as it is here (LOCK ADD or LOCK INC suffices only when the result is discarded). But execution of that instruction on the same counter in parallel requires careful coordination among cores. If 12 cores are doing little other than incrementing this counter, they will spend most of the time in synchronization. Indeed, running with RAYON_NUM_THREADS=1 shortens the runtime to 11s, bringing it close to the original single-threaded version. So we have a serious case of contention on our hands – what can we do about it?
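
The contended pattern can be reproduced in isolation with only the standard library. The following sketch assumes nothing about Progress or Rayon; count_shared() is a hypothetical helper that has several threads increment a single shared counter:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Increment one shared counter from `threads` threads, `per_thread` times each.
pub fn count_shared(threads: usize, per_thread: u64) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    // Every thread modifies the same cache line here.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All scoped threads have been joined, so the total is final.
    counter.load(Ordering::Relaxed)
}
```

Timing a call to it with std::time::Instant for one thread and then for one thread per core is a quick way to see how the cost grows on a given machine.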

Thread-local counter

The obvious way to mitigate contention is to eliminate or reduce sharing of mutable data. That’s non-trivial in this case because Progress is optimized for convenience and usage with Rayon, so it can’t offer the usual alternatives such as message passing or per-worker handles which cache some data. To eliminate sharing from Progress we must look in a different direction: thread-local data. If we could create a separate counter for each thread, there would be no contention. The standard library doesn’t make this easy because it only supports static thread-locals, but thanks to Amanieu (who also wrote parking_lot and hashbrown, among others) and his thread_local crate, we can easily embed thread-local data into Progress:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;
use thread_local::ThreadLocal;

pub struct Progress {
    count: ThreadLocal<AtomicU64>,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: ThreadLocal::new(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let local_cnt = self.count.get_or_default().fetch_add(1, Ordering::Relaxed);
        if local_cnt % 0x100000 == 0 {
            self.maybe_log();
        }
    }

    fn fetch_count(&self) -> u64 {
        self.count.iter().map(|x| x.load(Ordering::Relaxed)).sum()
    }

    // ...maybe_log() and finish() are exactly the same as in the previous version
}

This version of bump() appears to do more, but should still be much more efficient. self.count.get_or_default() obtains a reference to a thread-local atomic that it immediately increments without contention. It checks whether it needs to check time a bit more often (because it’s now observing only counts for its own thread), but still rarely enough not to compromise performance. So how does this version fare?
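
For contrast, this is roughly what the standard library’s static thread-locals mentioned earlier look like. It is only a sketch with hypothetical names. Each thread does get its own counter, but the counter is a single static rather than a field of Progress, and one thread cannot read another thread’s value, so a fetch_count() that sums all of them is not possible this way:

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // One independent counter per thread, but only as a static.
    static LOCAL_COUNT: Cell<u64> = Cell::new(0);
}

/// Bump the calling thread's counter and return its new value.
pub fn bump_local() -> u64 {
    LOCAL_COUNT.with(|c| {
        let n = c.get() + 1;
        c.set(n);
        n
    })
}

/// Bump `n` times on a fresh thread and return that thread's final count.
pub fn bump_on_new_thread(n: u64) -> u64 {
    thread::spawn(move || {
        let mut last = 0;
        for _ in 0..n {
            last = bump_local();
        }
        last
    })
    .join()
    .unwrap()
}
```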

It runs in 7.86s, 4.4x slower than the original. That’s… disappointing. On the one hand, it’s significantly better than our first attempt, so at least we were on to something with the contention hypothesis. But on the other hand, this is still bad enough to render Progress unusable for calculations that do little work per event.

What went wrong? Is ThreadLocal::get_or_default() just slow, or is AtomicU64::fetch_add() slow even when uncontended? Looking at the source of thread_local, it doesn’t seem like it should introduce slowness – in the hot path it boils down to a relaxed atomic load and a pointer dereference. Plus it’s carefully written using inline-friendly code by an expert author. What remains is the possibility that fetch_add() is slow, and that can only be explained by contention. This is where false sharing comes into the picture.

False sharing

ThreadLocal allocates data on the heap, but it allocates it in buckets that hold multiple values. Also, when allocations are performed in quick succession at the beginning of processing, they might be assigned heap addresses close to each other. Such proximity would be normally considered cache-friendly, a feature, but here we want the exact opposite. According to crossbeam, “Intel’s prefetcher is pulling pairs of 64-byte cache lines at a time”, so false sharing can occur for data within 128 bytes of each other. Even after accounting for heap overhead, it’s not hard to imagine 3-4 AtomicU64s getting allocated within the same 128-byte region, and causing massive contention.
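
The arithmetic behind this is easy to check. The sketch below looks at the densest case, counters stored back to back in an array. Heap-allocated counters are spaced farther apart, as noted above, but the principle is the same. The constant and function names are hypothetical:

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicU64;

/// Size of the region within which false sharing can occur, per the crossbeam note quoted above.
pub const SHARING_REGION: usize = 128;

/// How many unpadded counters fit in one such region when stored back to back.
pub fn counters_per_region() -> usize {
    SHARING_REGION / size_of::<AtomicU64>()
}

/// Byte distance between neighbouring elements of an array of counters.
pub fn neighbour_distance() -> usize {
    let counters = [AtomicU64::new(0), AtomicU64::new(0)];
    let a = &counters[0] as *const AtomicU64 as usize;
    let b = &counters[1] as *const AtomicU64 as usize;
    b - a
}
```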

This theory is trivial to test, simply by adding some padding next to each counter. A convenient container that does just that is provided by crossbeam, so we only need to change the definition of the count field in the struct to:

use crossbeam::utils::CachePadded;

pub struct Progress {
    count: ThreadLocal<CachePadded<AtomicU64>>, // previously ThreadLocal<AtomicU64>
    last_logged: Mutex<Instant>,
}

With that single change, the code runs in 1.95s, only 1.08x slower than the original code, the difference being close to statistical noise. That confirms the false sharing hypothesis, and shows that false sharing is not something only observed by professionals writing code for bare metal. If 8% slowdown still sounds like a lot, do keep in mind that the above calculation is very close to the worst case for this kind of interface, because the calculation does no IO or allocation and is very fine-grained. Calculations that do more work between calls to Progress::bump() would experience no measurable slowdown from progress reporting.
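
The effect of the padding can be approximated with the standard library alone. The sketch below is not crossbeam’s implementation. Padded is a hypothetical stand-in that assumes the 128-byte figure quoted earlier, and count_padded() gives each thread its own padded slot:

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Illustrative stand-in for crossbeam's CachePadded, assuming a 128-byte region.
#[repr(align(128))]
#[derive(Default)]
pub struct Padded(pub AtomicU64);

/// Give each thread its own padded slot, then sum the slots.
pub fn count_padded(threads: usize, per_thread: u64) -> u64 {
    let slots: Vec<Padded> = (0..threads).map(|_| Padded::default()).collect();
    thread::scope(|s| {
        for slot in &slots {
            s.spawn(move || {
                for _ in 0..per_thread {
                    // Each slot occupies its own 128-byte-aligned block.
                    slot.0.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All scoped threads have been joined, so the per-thread totals are final.
    slots.iter().map(|p| p.0.load(Ordering::Relaxed)).sum()
}
```

The alignment attribute also rounds the size of the struct up to 128 bytes, so neighbouring slots in the Vec never share a region.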
