Tag Archives: rust

Rust global variables, two years on

In November 2021 I wrote a blog post that examined Rust’s curious relationship with global variables. It aimed to explain why this ubiquitous language feature required external crates, and ended with personal recommendations on the use of globals in new code. Two years have passed, and Rust has changed enough that it’s time to take a fresh look. The rest of this text assumes you’ve read the previous article or are familiar with the subject.

Const Mutex and RwLock constructors

The first change is that Mutex::new() is const as of Rust 1.63, so this example from the previous post now compiles and works as expected:

// didn't compile two years ago, compiles now
static LOG_FILE: Mutex<String> = Mutex::new(String::new());

The foundation for this improvement was laid in 1.62, which replaced Mutex, RwLock, and Condvar with lightweight, non-allocating implementations on Linux, and which 1.63 extended to provide const construction of those types on all platforms. The result is that for simple types, mutex-protected globals “just work” without doing anything special.
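For example, a static RwLock now also works out of the box. A minimal sketch (the names are made up for illustration):

use std::sync::RwLock;

// RwLock::new() is const as of 1.63, and Vec::new() is const as well
static KNOWN_EXTENSIONS: RwLock<Vec<String>> = RwLock::new(Vec::new());

fn register_extension(ext: &str) {
    KNOWN_EXTENSIONS.write().unwrap().push(ext.to_string());
}

fn is_known(ext: &str) -> bool {
    KNOWN_EXTENSIONS.read().unwrap().iter().any(|e| e == ext)
}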

Although we no longer have to encase every static Mutex in a OnceCell or equivalent, we still need a cell-like wrapper for scenarios where the only locked write is the one that initializes the value on first use. In that case subsequent accesses to the global are read-only and shouldn’t require locking, only an atomic check. This is a very common use of global variables, a good example being a global holding a lazily compiled regex.

This brings us to the next and more important news.

Once cell is now in std

As of Rust 1.70, once_cell::sync::OnceCell from the once_cell crate has been integrated into the standard library as std::sync::OnceLock. For the first time in Rust’s existence, you don’t need to write unsafe code, or bring in external crates that encapsulate it, to create a global/static variable initialized on first use. Usage is essentially the same as with once_cell:

use std::sync::OnceLock;
use regex::Regex;

pub fn log_file_regex() -> &'static Regex {
    static LOG_FILE_REGEX: OnceLock<Regex> = OnceLock::new();
    LOG_FILE_REGEX.get_or_init(|| Regex::new(r#"^\d+-[[:xdigit:]]{8}$"#).unwrap())
}

// use log_file_regex().is_match(some_name) anywhere in your program

This addition might not seem like a big deal at first, given that once_cell has provided the same functionality for years. However, having it in the standard library greatly benefits the language in several ways. First, initialize-on-first-use globals are very widely used by both applications and libraries, and both can now phase out crates like once_cell and lazy_static from their dependencies. Second, global variables can now be created by macro-generated code without awkward reexports of once_cell and other logistical issues. Third, it makes the language easier to teach, with teaching materials no longer needing to decide whether to cover once_cell or lazy_static, nor explain why external crates are needed for global variables to begin with. This excruciatingly long StackOverflow answer is a good example of the quagmire, as is my previous blog post on this topic. The whole stdlib/unsafe section of the latter is now rendered obsolete, as the same can be achieved safely with OnceLock at no loss of performance.

The work is not yet complete, however. Note how the static variable is placed inside the function that contains the sole call to OnceLock::get_or_init(). This pattern ensures that every access to the static OnceLock goes through one place which also initializes it. once_cell makes this less verbose through once_cell::sync::Lazy, but the equivalent stdlib type (LazyLock) is not yet stable, being stuck on some technical issues. The workaround of placing the global into a function isn’t a significant obstacle, but it’s worth mentioning. It’s particularly relevant when comparing the ease of use of OnceLock with that of lazy_static::lazy_static! or once_cell::sync::Lazy, both of which offer the convenience of initializing in a single location without additional effort.
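For comparison, here is a sketch of the same regex global expressed with once_cell::sync::Lazy, which needs no wrapper function:

use once_cell::sync::Lazy;
use regex::Regex;

static LOG_FILE_REGEX: Lazy<Regex> =
    Lazy::new(|| Regex::new(r#"^\d+-[[:xdigit:]]{8}$"#).unwrap());

// use LOG_FILE_REGEX.is_match(some_name) anywhere in your program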

What to use in 2024

Two years ago the TL;DR of my recommendation was to “use once_cell or lazy_static, depending on which syntax you prefer”. Now it shifts to: use standard-library facilities like OnceLock or atomics in almost all situations, and once_cell when you require convenience not yet covered by std.

In particular:

  • As before, when the type you want to use in a static supports thread-safe interior mutability and has a const constructor, you can declare it as static directly. (The compiler will check all that for you, just see if it compiles.) This used to include only atomics, but now also includes mutexes and rwlocks. So if something like static CURRENT_CONFIG: Mutex<Option<Config>> = Mutex::new(None) or static SHOULD_LOG: AtomicBool = AtomicBool::new(true) works for you, go for it.

  • When this doesn’t work, or you need to initialize on first use, use std::sync::OnceLock, preferably encapsulated in a function as shown above.

  • If you create a large number of globals and want to avoid the boilerplate encapsulating each in a function, use once_cell::sync::Lazy. That type is likely to be stabilized in some form, which makes it preferable over lazy_static. There are no good reasons to use lazy_static in new code.

Note that existing code that uses once_cell or lazy_static doesn’t require immediate attention. Those crates will remain available indefinitely, and they generate nearly identical assembly to that of the standard library’s OnceLock. The above recommendations are meant to guide your decisions regarding new code, or regarding code you’re refactoring anyway.

A close encounter with false sharing

False sharing happens when multiple CPU cores modify different but closely located memory, unintentionally sharing a cache line as a result. Atomic operations on memory not shared with other threads are normally uncontended, and therefore almost as fast as non-atomic ones, but false sharing can cause them to execute orders of magnitude more slowly. The usual remedy for false sharing is to space the data farther apart, forcing it into distinct cache lines. While I’ve been aware of false sharing for years, it always seemed like a concern in the domain of experts, one of those things that don’t affect code written by mere mortals. This is a story of how I encountered it in my own code.

Atomic counter

I work on batch processing software that goes through a large number of events, numbering from thousands to hundreds of billions. To speed things up the program typically spawns as many threads as there are physical cores, and processes events in parallel. In debugging and production alike it’s very useful to log the progress of processing, so that one can gauge how long it will take to finish, as well as spot if a bug has slowed it down. As this is needed in many places, it’s abstracted into a lightweight progress tracker/counter with a minimal API:

pub struct Progress { ... }

impl Progress {
    /// Construct Progress.
    pub fn new() -> Self { ... }
    /// Bump the counter, and occasionally print a log message.
    pub fn bump(&self) { ... }
    /// Consume the counter, and print a final log message.
    pub fn finish(self) { ... }
}

// example usage:
fn process(input: impl Iterator<Item = Data>) {
    let progress = Progress::new();
    for event in input {
        progress.bump(); // bump the counter, occasionally logging a message
        // ... do the work ...
    }
    progress.finish(); // log the total count of processed events
}

For real-world use it’s crucial for bump() to take &self rather than &mut self, and for Progress to be Sync. That allows reporting progress from multiple threads, typically by calling bump() from closures invoked by Rayon parallel iterators. Here is another example, one that we can actually run and measure:

use rayon::prelude::*;

fn main() {
    let progress = Progress::new();
    let data = vec![0u64..10_000_000; 128];
    let sum: f64 = data
        .into_par_iter()
        .map(|subiter| {
            let mut s = 0.;
            for n in subiter {
                progress.bump();
                s += (n as f64).sqrt().sin()
            }
            s
        })
        .sum();
    progress.finish();
    println!("final sum: {sum}");
}

Of course, the progress of this particular calculation could be reported in a number of better ways – but imagine this to be a placeholder for a more complex calculation where the number of elements or the speed of execution is not known in advance.

With the progress calls commented out, the above program takes 1.8s to execute on my 12-core CPU. With RAYON_NUM_THREADS=1 it takes 9.83s, so the parallelization is far from perfect, but it still brings massive speedup. Multiple runs exhibit only minor variations in timing.

Given the constraints above, the obvious implementation of Progress would use an AtomicU64 counter incremented on every bump(), and occasionally check whether to log:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

pub struct Progress {
    count: AtomicU64,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: 0.into(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let cnt = self.count.fetch_add(1, Ordering::Relaxed);
        if cnt % 0x1000000 == 0 {
            self.maybe_log();
        }
    }

    #[cold]
    fn maybe_log(&self) {
        let Some(mut last_logged) = self.last_logged.try_lock().ok() else {
            // never allow this mutex to introduce waiting
            return;
        };
        if last_logged.elapsed().as_secs_f64() < 5. {
            return;
        }
        eprintln!("processed {} events", self.fetch_count());
        *last_logged = Instant::now();
    }

    fn fetch_count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }

    pub fn finish(self) {
        eprintln!("processed {} events", self.fetch_count());
    }
}

The code is intended to be efficient: on the hot path, it executes a single atomic increment and a fast power-of-two-modulo check to see if it needs to do more. Only once in several million invocations does it actually check the time to see if it needs to log the progress message. Compared to the much heavier floating-point work, one would expect the atomic increment and the subsequent check to be negligible.

But running the code shows otherwise – this version takes a whopping 38s to execute. In other words, progress reporting has made the code 20x slower.

Since bump() only does two things – increment the atomic counter and check it against a modulo-power-of-2 – and we know the latter to be fast, the attention must be on the former. AtomicU64::fetch_add() compiles down to a single locked instruction (lock xadd on x86-64). But executing that instruction on the same counter in parallel requires careful coordination among cores. If 12 cores are doing little other than incrementing this counter, they will spend most of the time in synchronization. Indeed, running with RAYON_NUM_THREADS=1 shortens the runtime to 11s, bringing it close to the original single-threaded version. So we have a serious case of contention on our hands – what can we do about it?

Thread-local counter

The obvious way to mitigate contention is to eliminate or reduce sharing of mutable data. That’s non-trivial in this case because Progress is optimized for convenience and usage with Rayon, so it can’t offer the usual alternatives such as message passing or per-worker handles which cache some data. To eliminate sharing from Progress we must look in a different direction: thread-local data. If we could create a separate counter for each thread, there would be no contention. The standard library doesn’t make this easy because it only supports static thread-locals, but thanks to Amanieu (who also wrote parking_lot and hashbrown, among others) and his thread_local crate, we can easily embed thread-local data into Progress:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;
use thread_local::ThreadLocal;

pub struct Progress {
    count: ThreadLocal<AtomicU64>,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: ThreadLocal::new(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let local_cnt = self.count.get_or_default().fetch_add(1, Ordering::Relaxed);
        if local_cnt % 0x100000 == 0 {
            self.maybe_log();
        }
    }

    fn fetch_count(&self) -> u64 {
        self.count.iter().map(|x| x.load(Ordering::Relaxed)).sum()
    }

    // ...maybe_log() and finish() are exactly the same as in the previous version
}

This version of bump() appears to do more, but should still be much more efficient. self.count.get_or_default() obtains a reference to a thread-local atomic that it immediately increments without contention. It calls maybe_log() a bit more often (because it now observes only the counts for its own thread), but still rarely enough not to compromise performance. So how does this version fare?

It runs in 7.86s, 4.4x slower than the original. That’s… disappointing. On the one hand, it’s significantly better than our first attempt, so at least we were on to something with the contention hypothesis. But on the other hand, this is still bad enough to render Progress unusable for calculations that do little work per event.

What went wrong? Is ThreadLocal::get_or_default() just slow, or is AtomicU64::fetch_add() slow even when uncontended? Looking at the source of thread_local, it doesn’t seem like it should introduce slowness – in the hot path it boils down to a relaxed atomic load and a pointer dereference. Plus it’s carefully written using inline-friendly code by an expert author. What remains is the possibility that fetch_add() is slow, and that can only be explained by contention. This is where false sharing comes into the picture.

False sharing

ThreadLocal allocates data on the heap, but it allocates it in buckets that hold multiple values. Also, when allocations are performed in quick succession at the beginning of processing, they might be assigned heap addresses close to each other. Such proximity would be normally considered cache-friendly, a feature, but here we want the exact opposite. According to crossbeam, “Intel’s prefetcher is pulling pairs of 64-byte cache lines at a time”, so false sharing can occur for data within 128 bytes of each other. Even after accounting for heap overhead, it’s not hard to imagine 3-4 AtomicU64s getting allocated within the same 128-byte region, and causing massive contention.

This theory is trivial to test, simply by adding some padding next to each counter. A convenient container that does just that is provided by crossbeam, so we only need to change the definition of the count field in the struct to:

use crossbeam::utils::CachePadded;

pub struct Progress {
    count: ThreadLocal<CachePadded<AtomicU64>>, // previously ThreadLocal<AtomicU64>
    last_logged: Mutex<Instant>,
}

With that single change, the code runs in 1.95s, only 1.08x slower than the original code, the difference being close to statistical noise. That confirms the false sharing hypothesis, and shows that false sharing is not something only observed by professionals writing code for bare metal. If 8% slowdown still sounds like a lot, do keep in mind that the above calculation is very close to the worst case for this kind of interface, because the calculation does no IO or allocation and is very fine-grained. Calculations that do more work between calls to Progress::bump() would experience no measurable slowdown from progress reporting.
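If you’d rather not pull in crossbeam just for this, a hand-rolled equivalent is easy to sketch. The names below are hypothetical, and the 128-byte alignment mirrors what CachePadded uses on x86-64; other targets may want a different value:

use std::sync::atomic::{AtomicU64, Ordering};
use thread_local::ThreadLocal;

// Force each per-thread counter onto its own pair of cache lines.
#[repr(align(128))]
#[derive(Default)]
struct PaddedCounter(AtomicU64);

struct Counters {
    count: ThreadLocal<PaddedCounter>,
}

impl Counters {
    fn bump(&self) -> u64 {
        // the extra .0 is the only difference from the CachePadded version
        self.count.get_or_default().0.fetch_add(1, Ordering::Relaxed)
    }
}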

Compile-time checks in generic functions work, and you can use them in your code

Some generic functions need to verify properties of the types they accept that are not easy to express using traits, such as that the type’s size or layout satisfies a condition. These constraints come from unsafe or FFI code and are generally hard to avoid. Here is a silly function that has two requirements of the value: that it’s Pod, and that it’s non-zero-sized:

/// Returns the first byte of the in-memory representation of `value`.
/// Requires value to have non-zero size.
fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    assert!(std::mem::size_of::<T>() != 0);
    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

The Pod requirement is expressed with a trait bound (provided by an external crate), but there is no trait bound to cover all zero-sized types, so the function asserts that at run-time. The assertion is what makes the function’s use of unsafe sound: not only does first_byte_of() not make sense with zero-sized types, but attempting to call it with one would cause undefined behavior if it weren’t for the check. Usage looks like you’d expect:

fn main() {
    // 258u16 is [2, 1] in memory
    assert_eq!(first_byte_of(258u16), 2);
    // 3.14f64 is [31, 133, 235, 81, 184, 30, 9, 64] in IEEE 754-ese
    assert_eq!(first_byte_of(3.14f64), 31);
    //first_byte_of(()); // panics at run-time
}

While the above works, it does introduce the possibility of a run-time panic. The obvious fix would be to change the return type to Option<u8>, returning None for zero-sized T. That would shift the burden of panicking to the caller, which would very likely immediately .unwrap() the returned value, at least in cases where it believes the type to be non-zero-sized and needs the value unconditionally. Keep in mind that the caller of first_byte_of() might itself be generic, so changing a type far away from the call to first_byte_of() could introduce the panic as long as the check is performed at run-time, and the panic might go unnoticed until production.

Thanks to monomorphization, the compiler knows the size of T when generating the code of first_byte_of<T>(), so it should in principle be possible to abort compilation when T is something like (). And indeed, beginning with Rust 1.57, the compiler supports compile-time assertions:

pub const FOO: usize = 42;
pub const BAR: usize = 42;
const _: () = assert!(FOO == BAR); // compiles only if FOO == BAR

The const _: () = assert!(...) syntax looks a bit weird, but it sort of makes sense – assignment to a constant makes sure that the assertion is executed at compile-time, and assert!() does technically return (), since it operates by side effect. The assertion which would normally panic at run-time now becomes a compilation failure, which is just what we need.

Applied to first_byte_of(), the check would look like this:

fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    // size_of() is const fn and can be invoked in const contexts
    const _: () = assert!(std::mem::size_of::<T>() != 0);
    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

But… this doesn’t compile! The message is “error[E0401]: can’t use generic parameters from outer function”, and the explanation doesn’t really help with our use case. Simple attempts to work around the error, such as by moving the assertion to a separate const fn, fail to work.

Some older discussions about this topic even go so far as to claim that rustc is actually justified in preventing post-monomorphization errors. They argue that it would be wrong for a generic function to compile with one type but not with another, at least in cases where both types satisfy the trait/lifetime bounds of the function. Fortunately this view was not shared by the compiler team, and Rust does allow you to verify properties of generics at compile-time. It just requires a bit of ceremony:

fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    struct Check<T>(T);
    impl<T> Check<T> {
        const NON_ZERO_SIZE: () = assert!(std::mem::size_of::<T>() != 0);
    }
    let _ = Check::<T>::NON_ZERO_SIZE;

    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

Before explaining what’s going on, let’s see how well this works. The main() function from above compiles as before, but uncommenting the first_byte_of(()) invocation results in this beautiful compile time error:

error[E0080]: evaluation of `first_byte_of::Check::<()>::NON_ZERO_SIZE` failed
 --> src/main.rs:4:35
  |
4 |         const NON_ZERO_SIZE: () = assert!(std::mem::size_of::<T>() != 0);
  |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the evaluated program panicked at 'assertion failed: std::mem::size_of::<T>() != 0', src/main.rs:4:35
  |
  = note: this error originates in the macro `assert` (in Nightly builds, run with -Z macro-backtrace for more info)

note: the above error was encountered while instantiating `fn first_byte_of::<()>`
  --> src/main.rs:16:5
   |
16 |     first_byte_of(());
   |     ^^^^^^^^^^^^^^^^^

Note that the error is shown only if you do cargo build (or cargo run etc), but not with cargo check, where compilation appears to succeed.

To work around “can’t use generic parameters from outer function”, the const in the function must not be initialized by an expression involving generic types. This limitation may be lifted in the future, but for now constants in functions must be fully resolved prior to monomorphization. That means that the const _: () = ... trick doesn’t work and we need to find another way to force the assertion to be evaluated at compile-time.

This is where associated constants come into play – the initialization expression of constants attached to generic types isn’t subject to the same restrictions as the initialization expression of constants defined inside generic functions. We introduce a Check type which is generic over T, and contains a T just to satisfy the compiler (we could use “phantom data” but we don’t bother because we never actually construct a Check value). As before, the NON_ZERO_SIZE constant serves only to execute the assert, its value never being really used and remaining (), as that’s what assert!() returns. But we do need to trigger its evaluation from first_byte_of(), which is accomplished with let _ = Check::<T>::NON_ZERO_SIZE;. The dummy let binding prevents an “unused value” warning we’d get if we just wrote Check::<T>::NON_ZERO_SIZE; at function top-level.

Finally, this pattern for enforcing compile-time assertions using associated constants can be extracted into a macro:

macro_rules! gen_assert {
    ($t:ident, $c:expr) => {{
        struct Check<$t>($t);
        impl<$t> Check<$t> {
            const CHECK: () = assert!($c);
        }
        let _ = Check::<$t>::CHECK;
    }}
}

With the boilerplate taken care of by the macro, first_byte_of() becomes simple again:

fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    gen_assert!(T, std::mem::size_of::<T>() != 0);
    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}
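The same macro can express other layout requirements. Here is a hypothetical example (store_in_slot and the 16-byte limit are invented for illustration) that refuses to compile when T wouldn’t fit in a fixed-size buffer:

fn store_in_slot<T: Copy>(value: T, slot: &mut [u8; 16]) {
    gen_assert!(T, std::mem::size_of::<T>() <= 16);
    // Safety: the assertion above guarantees that T fits into the 16-byte slot.
    unsafe {
        std::ptr::copy_nonoverlapping(
            (&value) as *const T as *const u8,
            slot.as_mut_ptr(),
            std::mem::size_of::<T>(),
        );
    }
}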

Contention on multi-threaded regex matching

Let’s say you need to match the same regex across a large number of strings – perhaps you’re applying a grep-like filter to data generated or received by your program. This toy example demonstrates it by matching a regex against a hundred million strings:

use lazy_static::lazy_static;
use regex::Regex;
use std::time::Instant;

lazy_static! {
    static ref IS_INTEGER: Regex = Regex::new("^[0-9]+$").unwrap();
}

fn main() {
    let strings: Vec<&str> = ["foo", "bar", "1234", "1234foo", ""]
        .into_iter()
        .cycle()
        .take(100_000_000)
        .collect();

    let start = Instant::now();
    let n_ints = strings.iter().filter(|s| IS_INTEGER.is_match(s)).count();
    let elapsed = start.elapsed().as_secs_f32();
    println!("{} {}s", n_ints, elapsed);
}

It’s not a scientific benchmark of regex performance, but it does show some interesting and unexpected effects also observed in real-world code. For starters, it takes 2.0s to execute the matches on my laptop.

This is good performance, but it’s a shame to do everything in one thread – let’s try to speed it up by using all cores. This is the kind of thing Rayon makes really easy, just change iter() to par_iter():

use rayon::prelude::*;
let n_ints = strings.par_iter().filter(|s| IS_INTEGER.is_match(s)).count();

Surprisingly, this takes 6-8s to execute on a system with 4 physical cores. In other words, instead of being 3-4x faster thanks to running on 4 cores, it’s 3-4 times slower. A very similar slowdown occurs on a system with 32 physical cores, where the time grows from 2.05s to 8.2s.

This result can’t be chalked up to an inefficiency in Rayon, as the same slowdown is observed when dividing the work among threads in other ways. Is it possible that the compiled regex itself is a point of contention when accessed from multiple threads?

When this was first suggested in discussions with coworkers, it seemed quite unlikely, as contention would imply that the compiled regex held a lock or other form of synchronization. This runs counter to the idea of a compiled regex, which one would expect to be fully constructed during compilation. Compiled regexes are often seen in lazy_statics, and shared by the whole program. But no matter how unlikely, the possibility of contention is easy to test, simply by switching from lazy_static! to thread_local!:

thread_local! {
    static IS_INTEGER: Regex = Regex::new("^[0-9]+$").unwrap();
}

The match now needs an additional closure to access the thread-local, but is still quite readable:

use rayon::prelude::*;
let n_ints = strings
    .par_iter()
    .filter(|s| IS_INTEGER.with(|is_integer| is_integer.is_match(s)))
    .count();

Continuing the surprise, this takes 0.66s to run, which is 3x faster than the single-threaded version – the kind of speedup one might realistically expect from a 4-core computer. On the 32-core server, it takes 0.086s, a 24x speedup.

So, regex matching does have a contention issue. The compiled regex type, Regex, wraps the internal Exec type, which holds ProgramCache values organized in a Pool that stores them inside a mutex. Accessed from a single thread or from multiple threads at different times, this mutex is cheap to acquire and is held for a very short time. But under strong contention it becomes a bottleneck, with attempts to acquire it falling back to OS-level waits, causing performance to degrade.

The regex crate’s PERFORMANCE.md dedicates a whole section to this issue. The text is nuanced, so instead of quoting its parts, I encourage you to go ahead and read it. It warns of the performance impact and shows how to eliminate it (using a slightly different approach than taken above), but it also says that it’s “supported and encouraged” to define regexes using lazy_static! and use them from multiple threads. It explains that, despite expectations, a compiled regex doesn’t contain everything needed to match it – some of the compiled state is built lazily while executing particular kinds of search, and is later reused for searches of the same kind. The mutex protects the “scratch space” used for those updates. Design options and tradeoffs are discussed in this issue in more detail.

In summary, for most use cases it’s perfectly fine to use the same compiled regex from multiple threads. But if you have code that does heavy regex matching from multiple threads, and does most or all of it on one regex, you’ll almost certainly want to give different instances of the compiled regex to different threads. As always, be sure to measure performance before actually making such a change to the code.

Rust global variables demystified

Rust has a reputation as a language unfriendly to global variables. While this reputation is not entirely undeserved, most of it stems from guarantees afforded by Rust, not from a desire to stifle the programmer’s creativity. In this article we’ll show how to use global variables, and how to overcome the limitations.

Note: this article has a followup written two years later.

Is it even ok to use global variables?

Global variables are a controversial topic in programming circles, with many educators taking the moral high ground and condemning them as a code smell, a shortcut, a crutch, a hallmark of throw-away code, and <insert favorite insult>. While there is good reason for the hostility, there is also an abundance of situations where global variables are either appropriate or actually the only way to proceed. Pre-compiled regular expressions or the state of a logger are examples where you’d probably want to use globals in preference to sending the state all the way from the top level to the bottom-most parts of your program. Many low-level system APIs, such as those of signal handlers or hardware-triggered interrupts, have callbacks that don’t receive a “state” argument, so communication between the callback and the rest of the system must go through globals. There are other examples, and in this article we’ll assume that you have good reasons for using globals that your CS professor would approve of. If that’s not the case, don’t worry, we won’t tell.

Declaring globals

A Rust global variable is declared much like any other variable, except it’s declared at top-level and uses static instead of let:

static LOG_LEVEL: u8 = 0;

Unlike with let, the type must be spelled out, as the compiler refuses to infer it even when it’s unambiguous – static LOG_LEVEL = 0u8 won’t compile.

Note that what we’ve declared is actually a static, which is not necessarily global. If the same declaration appeared in a function, the variable would be visible only inside the function, but its value would still be shared among all invocations of the function and last until the end of the program. This article talks about global variables, but most of the content equally applies to static variables defined inside functions, the only difference being that of visibility.
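A small sketch of that difference (fetch_id is a made-up function, and it jumps ahead slightly by using an atomic, a type we’ll discuss shortly): the static below is visible only inside the function, but all calls, from all threads, share the same counter for the lifetime of the program.

use std::sync::atomic::{AtomicU64, Ordering};

fn fetch_id() -> u64 {
    // shared by every invocation of fetch_id(), but not nameable outside it
    static NEXT_ID: AtomicU64 = AtomicU64::new(0);
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}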

The global works much like you’d expect: you can read it from any function in the module, you can import it (if declared pub) from another module and then read it from there. You can even borrow it and get a reference with a 'static lifetime – because the global variable lasts until the end of the program. Neat.

What you can’t do is assign to the global because it’s not declared as mut. This is where things get interesting.

Mutable globals – atomics and locks

Ideally we’d like to declare our global as mut and have a public API that manipulates it – say, a function that reads it and another that writes it:

static mut LOG_LEVEL: u8 = 0;

pub fn get_log_level() -> u8 {
    LOG_LEVEL
}

pub fn set_log_level(level: u8) {
    LOG_LEVEL = level;
}

The compiler rejects both the getter and the setter with a similar error message:

error[E0133]: use of mutable static is unsafe and requires unsafe function or block
 --> src/lib.rs:8:5
  |
8 |     LOG_LEVEL = level;
  |     ^^^^^^^^^^^^^^^^^ use of mutable static

The underlying problem is that a global variable is potentially visible from multiple threads. The above functions don’t synchronize their access to the global, so there’s nothing to prevent set_log_level() from being called in one thread while another thread calls get_log_level() or set_log_level(), either of which would constitute a data race. Rust requires an unsafe block to signal that such synchronization has been implemented by the code that surrounds access to the mutable static. Alternatively, the whole function can be unsafe to signal that the burden of synchronization is transferred to its callers. Since we don’t in fact have such synchronization (and it’s unclear how a caller of set_log_level() and get_log_level() would even ensure it), we won’t attempt to “fix” it by adding an unsafe to get the code to compile. We want to access globals without advanced reasoning about unsafe and undefined behavior.

Since we’re dealing with a potential data race, let’s address it with the mechanisms used elsewhere in Rust to avoid data races – locks and atomics. In case of a u8, we can simply replace u8 with AtomicU8:

use std::sync::atomic::{AtomicU8, Ordering};

static LOG_LEVEL: AtomicU8 = AtomicU8::new(0);

pub fn get_log_level() -> u8 {
    LOG_LEVEL.load(Ordering::Relaxed)
}

pub fn set_log_level(level: u8) {
    LOG_LEVEL.store(level, Ordering::Relaxed);
}

The global variable is no longer mut, so no unsafe is needed. The code is thread-safe and as performant as an unsafe version would be – on x86 the relaxed atomic load compiles into an ordinary load. If you need stricter ordering guarantees between LOG_LEVEL and other data in the program, you can use Ordering::SeqCst instead.

But what if we need something that won’t neatly fit into an atomic – say, a string? This compiles, but won’t allow us to modify the global:

static LOG_FILE: String = String::new();

pub fn get_log_file() -> &'static str {
    &LOG_FILE
}

Since there is no AtomicString, we need to use a proper lock:

use std::sync::Mutex;

// XXX - doesn't compile
static LOG_FILE: Mutex<String> = Mutex::new(String::new());

pub fn get_log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *LOG_FILE.lock().unwrap() = file;
}

Note that get_log_file() must return a fresh copy of the string. Returning a reference would require a lifetime, and there is no lifetime to associate with the global variable other than 'static, and 'static is incorrect (and wouldn’t compile) because set_log_file() can modify it at any time.

The above doesn’t compile for a different reason:

error[E0015]: calls in statics are limited to constant functions, tuple structs and tuple variants
 --> src/lib.rs:3:34
  |
3 | static LOG_FILE: Mutex<String> = Mutex::new(String::new());
  |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^

For more information about this error, try `rustc --explain E0015`.

What’s going on here? Why did String::new() compile, but Mutex::new(String::new()) didn’t?

The difference is that the globals we declared so far were just pieces of data whose initial values were available at compilation time. The compiler didn’t need to generate any initialization code for static LOG_LEVEL: u8 = 0 – it only reserved a byte in the executable’s data segment and ensured that it contained 0 at compile time. String::new() also works because it is a const fn, a function specifically marked as runnable at compile time. It can be marked like that because an empty string doesn’t allocate, so the string returned by String::new() can be represented in the executable by a triple of (0 [length], 0 [capacity], NonNull::dangling() [constant representing an unallocated pointer]). Nothing to do at run time. On the other hand, static LOG_FILE: String = String::from("foo") wouldn’t compile because String::from() requires a run-time allocation and is therefore not a const fn.
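The same rule applies to your own types: anything reachable through const fn calls can initialize a static. A small sketch with a made-up Config type:

struct Config {
    verbose: bool,
    level: u8,
}

impl Config {
    // marked const, so the compiler can evaluate it at compile time
    const fn default_config() -> Config {
        Config { verbose: false, level: 0 }
    }
}

// the initial value is computed during compilation, no run-time code needed
static CONFIG: Config = Config::default_config();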

std::sync::Mutex::new() is not const fn because it requires an allocation in order to keep the system mutex at a fixed address. And even if we used an allocation-free mutex (such as parking_lot::Mutex which supports a const fn constructor on nightly Rust), we’d face the same issue if we wanted to start off with a non-empty string, a data structure coming from a library we don’t control, or information only available at run-time, such as fresh randomness or the current time. In general, we don’t want to be constrained to const fn functions when initializing global variables.

As a side note, C++ supports initializing global variables with arbitrary code by simply allowing the compiler to generate code that runs before main() (or during dlopen() in the case of dynamic libraries). This approach is convenient for simple values, but when used in real programs it led to issues with initialization order, aptly named static initialization order fiasco. To avoid that problem, as well as issues with libraries that require explicit initialization, Rust doesn’t allow pre-main initialization, opting instead for the approach C++ calls the construct on first use idiom.

We will review three ways to initialize a global variable with arbitrary data, two of them based on external (but extremely well reviewed and widely used) crates, and one based on the standard library.

Once cell

The once_cell crate provides a OnceCell type that can be used to define global variables. Here is how one would use OnceCell for LOG_FILE:

use once_cell::sync::OnceCell;
use std::sync::Mutex;

static LOG_FILE: OnceCell<Mutex<String>> = OnceCell::new();

fn ensure_log_file() -> &'static Mutex<String> {
    LOG_FILE.get_or_init(|| Mutex::new(String::new()))
}

pub fn get_log_file() -> String {
    ensure_log_file().lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *ensure_log_file().lock().unwrap() = file;
}

Looking at the implementation of get_log_file() and set_log_file(), it is immediately apparent that they implement the “construct on first use” idiom – both functions call a method that ensures that the inner value is initialized (and that this is only done once), and retrieve a reference to the globally-stored value. This value can then be manipulated in the usual way through interior mutability.

OnceCell<T> is conceptually similar to a RefCell<Option<T>>. Like an Option, it has two states, one empty and another with a useful value. Like RefCell<Option<T>>, it uses interior mutability to allow setting the inner value using just a shared reference. But unlike Option, once set to a non-empty value, the stored value can never be set again. This allows a non-empty OnceCell to give out shared references to the inner data, which RefCell<Option> wouldn’t be allowed to do (it could at best return a Ref<T>) because the contents may change at any time. OnceCell is also thread-safe, so it would actually compare to Mutex<Option<T>>, except it uses an atomic to efficiently check whether the cell has been set.
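A quick sketch of that write-once behavior in isolation:

use once_cell::sync::OnceCell;

let cell: OnceCell<String> = OnceCell::new();
assert!(cell.get().is_none());                    // starts out empty
cell.set("first".to_owned()).unwrap();            // the first set succeeds
assert!(cell.set("second".to_owned()).is_err());  // later sets are rejected
// once set, shared references to the inner value are freely available
assert_eq!(cell.get().map(String::as_str), Some("first"));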

once_cell also provides a Lazy type that makes the above even simpler, removing the need for a separate ensure_log_file() function:

use std::sync::Mutex;
use once_cell::sync::Lazy;

static LOG_FILE: Lazy<Mutex<String>> = Lazy::new(|| Mutex::new(String::new()));

pub fn get_log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *LOG_FILE.lock().unwrap() = file;
}

Our globals now work pretty much the way we wanted them to. Lazy<T> performs a sleight of hand that even allows us to call methods like lock() directly on LOG_FILE. It achieves this by implementing Deref, the trait normally used to treat lightweight containers (typically smart pointers) like the values they contain. Deref is the mechanism that allows you to call methods of Foo on Box<Foo>, or methods of &str on String, and so on. Lazy<T> wraps a OnceCell<T> and implements a Deref<Target = T> that returns self.once_cell.get_or_init(|| self.init.take().unwrap()), where init is the closure passed to Lazy::new().
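To make the mechanism concrete, here is a simplified, hypothetical Lazy-like wrapper. It is not how once_cell actually implements Lazy (the real one stores an FnOnce and is more careful), but it shows how Deref on top of a once-cell yields the initialize-on-first-access behavior:

use once_cell::sync::OnceCell;
use std::ops::Deref;

struct MyLazy<T, F = fn() -> T> {
    cell: OnceCell<T>,
    init: F,
}

impl<T, F> MyLazy<T, F> {
    const fn new(init: F) -> MyLazy<T, F> {
        MyLazy { cell: OnceCell::new(), init }
    }
}

impl<T, F: Fn() -> T> Deref for MyLazy<T, F> {
    type Target = T;
    fn deref(&self) -> &T {
        // the first deref runs the closure; later derefs return the stored value
        self.cell.get_or_init(|| (self.init)())
    }
}

static GREETING: MyLazy<String> = MyLazy::new(|| "hello".to_owned());

fn demo() {
    // Deref lets us call String methods directly on GREETING
    assert_eq!(GREETING.len(), 5);
}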

The Lazy version still uses the construct on first use idiom, it’s just hidden behind the magic of Deref. In some cases this can yield surprising results because the actual type of LOG_FILE is not Mutex<String>, it’s Lazy<Mutex<String>>, so if you use it in a context that expects exactly &Mutex<String>, it will fail to compile. It’s not a big deal because you can always obtain the actual &Mutex<String> with &*LOG_FILE (equivalent to LOG_FILE.deref()), but it is something to be aware of.

The OnceCell and Lazy types are in the process of getting stabilized, so we can expect them to become part of the standard library in the near future.

Lazy static

Another popular library for creating global variables is the lazy_static crate, which defines a macro that hides even the lazy initialization, allowing you to write code that looks almost like an ordinary declaration:

use lazy_static::lazy_static;
use std::sync::Mutex;

lazy_static! {
    static ref LOG_FILE: Mutex<String> = Mutex::new(String::new());
}

// get_log_file() and set_log_file() defined as with once_cell::Lazy

pub fn get_log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *LOG_FILE.lock().unwrap() = file;
}

An invocation of lazy_static! is just syntax sugar for defining a Lazy value. Under the hood everything works exactly the same as in the example that used once_cell::Lazy (except lazy_static defines its own lazy_static::Lazy). Like with once_cell::Lazy, the actual type of LOG_FILE is not Mutex<String>, but a different type which uses Deref to give out &'static Mutex<String> on method calls. Some details differ, e.g. lazy_static constructs a dummy type also named LOG_FILE and implements Deref on that, while hiding the actual Lazy<T> value in a static variable defined in a function – but the end result is exactly the same.

If you’re curious, you can run cargo expand on code generated by lazy_static! {...} to learn exactly what it does.

Standard library – Once+unsafe

Note: OnceCell stabilized in Rust 1.70 as std::sync::OnceLock, so this section is useful only for educational/historical purposes.

Until OnceCell stabilizes, the standard library doesn’t offer a way to implement global variables initialized with non-const functions without unsafe code. In most cases this should be avoided because you can just use once_cell or lazy_static. But if you must only depend on the standard library, if you want tighter control, or if you just want to learn how it’s done, here is an example that uses std::sync::Once to implement a mutable global:

use std::mem::MaybeUninit;
use std::sync::{Mutex, Once};

fn ensure_log_file() -> &'static Mutex<String> {
    static mut LOG_FILE: MaybeUninit<Mutex<String>> = MaybeUninit::uninit();
    static LOG_FILE_ONCE: Once = Once::new();

    // Safety: initializing the variable is only done once, and reading is
    // possible only after initialization.
    unsafe {
        LOG_FILE_ONCE.call_once(|| {
            LOG_FILE.write(Mutex::new(String::new()));
        });
        // We've initialized it at this point, so it's safe to return the reference.
        LOG_FILE.assume_init_ref()
    }
}

// get_log_file() and set_log_file() defined as with once_cell::OnceCell

pub fn get_log_file() -> String {
    ensure_log_file().lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *ensure_log_file().lock().unwrap() = file;
}

Once ensures that the MaybeUninit is initialized only once, which OnceCell guaranteed in previous versions of the code. Once is also efficient, using atomics to optimize the fast path when the variable has already been initialized. The definition of the variable, now static rather than global, is placed inside the function to prevent code outside ensure_log_file() from accessing it directly. All accesses inside ensure_log_file() are synchronized through call_once(): the writer by running inside it, and the readers by waiting for it to complete. This makes the access data-race-free.

Once the initialization of LOG_FILE is complete, ensure_log_file() can proceed to return the reference to the inside of the MaybeUninit, using assume_init_ref().

Which option to choose

In most cases involving real-world types, you’ll want to use once_cell or lazy_static, depending on which syntax you prefer. They have the exact same run-time characteristics, and you won’t go wrong with choosing either. Of course, when once_cell stabilizes in the stdlib, that will become the obvious choice.

There are two exceptions:

  1. When your globals are const-initialized and you don’t need to modify them, you can just declare them as static or const. The difference between the two is that static guarantees the value is stored in only one place and const doesn’t (it inlines it where it’s used); see the sketch after this list.

  2. When you need to modify your globals, but their type is supported by std::sync::atomic: bool, u8 through u64, i8 through i64, usize, or isize. In that case you can declare the variable as static with the appropriate atomic type, and use the atomic API to read and modify it.
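Here is a small sketch of the static/const distinction from the first point (the names are made up):

// static: a single instance at a fixed address for the whole program
static LOOKUP_TABLE: [u8; 4] = [1, 2, 4, 8];

// const: a compile-time value that may be inlined at each use site,
// so taking its address in two places may yield two different addresses
const MAX_RETRIES: u32 = 5;

fn retries_left(attempt: u32) -> u32 {
    MAX_RETRIES.saturating_sub(attempt)
}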

A case not covered by this article is thread-local variables, which can also be global. Those are provided by the thread_local macro from the standard library, and allow the use of non-Sync types in globals.
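For example, a minimal sketch of a thread-local global holding a non-Sync type (the names are made up):

use std::cell::RefCell;

thread_local! {
    // RefCell is not Sync, which is fine because each thread gets its own copy
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn append_to_scratch(byte: u8) {
    SCRATCH.with(|buf| buf.borrow_mut().push(byte));
}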

Upgradable parking_lot::RwLock might not be what you expect

Let’s say we’re building a simple table indexed by integers starting with 0. Although the keys are contiguous, the table is loaded from key-value pairs that arrive in arbitrary order. The number of elements is not known in advance and can be anywhere from a handful up to a hundred million, but is expected to fit in working memory, given a reasonable representation. Due to these constraints the table loader naturally represents the data as Vec<Option<Payload>>. It starts out empty and sports a set() method that fills the payload at the given position, resizing the table as needed:

/// Called during loading, place `val` at `pos`
fn set(&mut self, pos: usize, val: Payload) {
    if pos >= self.data.len() {
        self.data.resize_with(pos + 1, Default::default);
    }
    self.data[pos] = Some(val);
}

The table payload is comparatively expensive to compute (it is deserialized from JSON), so that’s done from multiple worker threads, and set() must be thread-safe. The obvious approach is to place the whole table under a lock, using the fantastic parking_lot crate to provide the implementation. Under a mutex data would be a Mutex<Vec<Option<Payload>>>:

fn set(&self, pos: usize, val: Payload) {
    let mut data = self.data.lock();  // no unwrap() because parking_lot is awesome!
    if pos >= data.len() {
        data.resize_with(pos + 1, Default::default);
    }
    data[pos] = Some(val);
}

/// Called once loading is done to extract the data
fn into_data(self) -> Vec<Option<Payload>> {
    self.data.into_inner()
}

This simple solution has a serious problem – now the whole of set() operates with the big lock acquired. On a machine with 32 physical cores, there is a lot of contention around calls to set(), and the worker threads end up spending about 80% of their time idly waiting for the lock. The worst part is that the coarse lock is only really needed to resize the table, which occurs very rarely, because just one large key immediately removes the need for resizes for all smaller keys. The large majority of calls to set() just access non-overlapping slots in data and have no real need for locking.

Fortunately parking_lot’s Mutex is so incredibly well crafted that it only costs one (1) byte of space. Even if it ends up taking up more space due to padding, just compare that to a POSIX mutex that requires 40 bytes on x86-64 Linux, and must be heap-allocated because it’s UB to move it after initialization! With parking_lot fine-grained locking is quite feasible, even on tables with hundreds of millions of elements. Individual locks will be uncontended and therefore acquirable without system calls, and virtually free compared to the big lock. If we knew the table size in advance, we’d represent it as Vec<Mutex<Option<Payload>>> and be done with it.
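A sketch of what that fixed-size variant might look like (FixedTable is hypothetical, and Payload stands for the payload type used throughout this post):

use parking_lot::Mutex;

struct FixedTable {
    data: Vec<Mutex<Option<Payload>>>,
}

impl FixedTable {
    fn new(len: usize) -> FixedTable {
        FixedTable {
            data: std::iter::repeat_with(|| Mutex::new(None)).take(len).collect(),
        }
    }

    fn set(&self, pos: usize, val: Payload) {
        // each slot has its own fine-grained, uncontended lock
        *self.data[pos].lock() = Some(val);
    }
}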

However, we still need some kind of lock around the whole table as well, in order to be able to resize it when necessary. Since resizing only happens occasionally, we can use a RwLock, acquiring a shared read lock to access the table and its individual row (which has its own mutex for updating), and request an exclusive write lock only to resize the whole table. The type for data now becomes RwLock<Vec<Mutex<Option<Payload>>>>. (Don’t be scared by the piled-up generics, just read them in the order of appearance: read-write lock containing a vector of mutexes which protect optional payloads.)

To top it off, the parking_lot README advertises a feature of its RwLock that looks exactly like what we need:

RwLock supports atomically upgrading an “upgradable” read lock into a write lock.

Perfect – we acquire the read lock for the whole table, and only upgrade it to a write lock if a resize is needed. After the resize, we downgrade it back to a read lock and proceed. With resizing taken care of, we access the inner mutexes through the shared read lock, acquire the finer-grained mutex without any contention, and store the data. Here is an implementation of that idea:

// XXX - looks right, but doesn't actually help; see below
pub fn set(&self, pos: usize, val: Payload) {
    let data = self.data.upgradable_read();
    if data.len() <= pos {   // need to resize the table
        let mut wdata = RwLockUpgradableReadGuard::upgrade(data);
        wdata.resize_with(pos + 1, Default::default);
        let data = RwLockWriteGuard::downgrade(wdata);
        *data[pos].lock() = Some(val);
    } else {
        *data[pos].lock() = Some(val);
    }
}

fn into_data(self) -> Vec<Option<Payload>> {
    self.data.into_inner().into_iter().map(Mutex::into_inner).collect()
}

However, benchmarking this code in production shows no improvement whatsoever compared to the version with one big mutex. What’s going on?

After spending a lot of time trying to figure out where else in the program there could be a lock causing contention (including investigating the allocator, among other suspects turned innocent), I finally came back to this code and carefully reviewed the documentation of upgradable_read, which says the following:

Locks this rwlock with upgradable read access, blocking the current thread until it can be acquired. The calling thread will be blocked until there are no more writers or other upgradable reads which hold the lock. There may be other readers currently inside the lock when this method returns.

So, an upgradable read locks out other upgradable reads, even if they never exercise their option to upgrade the lock! Since every access to the table from set() is an upgradable read, this made it behave just like a mutex. I feel that this should be more prominently stressed in the docs, as it is easy to miss and makes upgradable reads useless for this scenario. The correct way to implement set() is by manually releasing the read lock and reacquiring a write lock:

pub fn set(&self, pos: usize, val: Payload) {
    let mut data = self.data.read();
    if data.len() <= pos {
        // "upgrade" the lock
        drop(data);
        let mut wdata = self.data.write();
        // check that someone else hasn't resized the table in the meantime
        if wdata.len() <= pos {
            wdata.resize_with(pos + 1, Default::default);
        }
        // now "downgrade" it back again
        drop(wdata);
        data = self.data.read();
    }
    *data[pos].lock() = Some(val);
}

The upgrade and the downgrade are of course no longer atomic, but we don’t really care, we just perform another check after the upgrade in case another thread beat us to the punch and resized the table. And most importantly, this version of the code utilizes all the available cores, bringing load times down from 7 minutes to 1m45s for parsing and loading ~70 GiB of JSON data. (Just going through all the data with cat takes 32s, so this is pretty good.)

The immediate moral of the story is: be careful when adopting non-standard features of fundamental APIs like the synchronization devices. The documentation stated the behavior clearly, and quite a bit of debugging would have been avoided by reading it in advance. Even an otherwise excellent library like parking_lot sometimes violates the principle of least surprise.

The deeper lesson is that fearless concurrency, and the more general “if it compiles, it works” principle have their limits. Rust has spoiled us by providing a nice experience with multi-threading, so much that it’s easy to be genuinely surprised when a program that successfully compiles ends up misbehaving in a non-obvious way.

Exploring lock-free Rust 3: Crossbeam

In the previous installment we showed that while atomic types provided by the Rust standard library can be used for lock-free access to shared values, memory reclamation must be ensured manually because Rust’s normal scoping rules do not cleanly map to lock-free concurrency.

Crossbeam

The problem of memory reclamation in lock-free data structures is not unique to Rust, it is shared by other languages without garbage collection, most notably C++. Different solutions have been proposed, sporting exotic names such as quiescent state based reclamation, epoch-based reclamation, and hazard pointers. See Tom Hart’s thesis for an extensive description of the memory reclamation strategies and analysis of their benefits and drawbacks.

In Rust the currently favored strategy is the epoch-based memory reclamation, a clever scheme that keeps track of objects marked for destruction in three thread-local bins. Each bin corresponds to an “epoch”, somewhat similar to a GC generation. When a thread becomes active, i.e. it is expected to start executing lock-free code, it locally stores the current epoch number (0-2) and uses it until deactivated. During this period, objects slated for destruction will be registered in the corresponding bin. A thread performing GC will first check whether all currently active threads belong to the current epoch. If that is the case, it means that there are no threads remaining from the previous epoch, and the epoch number is atomically bumped (incremented modulo 3). The thread that succeeds in incrementing the epoch proceeds to destroy objects from the bin of two epochs ago. For example, the thread that increments epoch from 1 to 2 can at that point safely destroy objects in bin 0. Objects in bin 1 cannot yet be destroyed because the epoch was just switched from 1 to 2, and there can still be active threads from epoch 1. But no new epoch 1 threads are being activated, and as soon as existing ones deactivate, all active threads will have been from epoch 2. At this point it will be safe to bump the epoch to 0 and drop objects from bin 1.

The nice thing about epoch-based memory reclamation is that it is a good fit for libraries, since it can be fully embedded inside the code that, say, implements a lock-free queue, without the rest of the application having to know anything about it. Rust’s implementation of epoch-based memory reclamation is provided by the Crossbeam crate. Aaron Turon’s original blog post is an excellent read on the topic, describing both Crossbeam and epoch-based memory reclamation in some detail using the classic Treiber’s stack as an example.

Here is a Crossbeam based implementation of LazyTransform:

extern crate crossbeam;

use std::sync::atomic::{AtomicBool, Ordering};
use crossbeam::epoch::{self, Atomic, Owned, Guard};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Atomic<S>,
    value: Atomic<T>,
    transform_lock: LightLock,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Atomic::null(),
            value: Atomic::null(),
            transform_lock: LightLock::new(),
        }
    }

    pub fn set_source(&self, source: S) {
        let guard = epoch::pin();
        let prev = self.source.swap(Some(Owned::new(source)),
                                  Ordering::AcqRel, &guard);
        if let Some(prev) = prev {
            unsafe { guard.unlinked(prev); }
        }
    }

    fn try_transform(&self, guard: &Guard) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source_maybe = self.source.swap(None, Ordering::AcqRel, &guard);
            let source = match source_maybe {
                Some(source) => source,
                None => return None,
            };
            let source_data = unsafe { ::std::ptr::read(source.as_raw()) };
            let newval = match (self.transform_fn)(source_data) {
                Some(newval) => newval,
                None => return None,
            };
            let prev = self.value.swap(Some(Owned::new(newval.clone())),
                                       Ordering::AcqRel, &guard);
            unsafe {
                if let Some(prev) = prev {
                    guard.unlinked(prev);
                }
                guard.unlinked(source);
            }
            return Some(newval);
        }
        None
    }

    pub fn get_transformed(&self) -> Option<T> {
        let guard = epoch::pin();
        let source = self.source.load(Ordering::Relaxed, &guard);
        if source.is_some() {
            let newval = self.try_transform(&guard);
            if newval.is_some() {
                return newval;
            }
        }
        self.value.load(Ordering::Acquire, &guard)
            .as_ref().map(|x| T::clone(&x))
    }
}

This version is very similar to the version from last article based on the imaginary AtomicCell, except it adapts to the requirements of Crossbeam. Let’s first cover the basics:

  • source and value are Atomic, Crossbeam’s equivalent of AtomicCell. An explicit Option is not needed because Crossbeam represents the absence of a value as a null pointer.
  • Before calling Atomic::swap and Atomic::load, the thread needs to be “pinned”, i.e. marked as active within the current epoch. The guard returned by epoch::pin serves as proof that the thread has been pinned, and automatically marks it as inactive when destroyed. A reference to this guard can be sent to helper methods such as try_transform.
  • Crossbeam’s Atomic::swap accepts Owned, an object similar to Box that guarantees that the value we’re storing is heap-allocated and owned by the caller (who just transferred that ownership to swap). This is similar to AtomicCell::swap from last post, except Crossbeam’s design allows reusing a previously extracted box.
  • Methods working with Atomic accept an Ordering argument, with the same meaning as in Rust’s atomic types. The initial test of source requests the least strict Relaxed ordering, which is safe because the source, if non-null, will be re-fetched using a stricter ordering once again in try_transform.

The key feature introduced by Crossbeam lies in the mostly-automatic memory management implemented on top of epoch reclamation. Atomic::swap returns a Shared guard which encapsulates the pointer obtained from AtomicPtr::swap and provides safe access to the underlying object, concurrently observable by other threads. The lifetime bound on the returned Shared ensures that it cannot outlive the guard returned by epoch::pin(), preventing the object from being collected while reachable through Shared. Once we are done with the object, we must manually mark it for collection. This is an unsafe operation and something Crossbeam cannot attempt automatically because it cannot prove that the retrieved pointer is not still used elsewhere in the data model, for example through a linked list node that still points to it. We know no such reference exists, so it’s safe to deallocate the object. Atomic::load is used exactly the same way, only without the final deallocation.

try_transform extracts the source value published by set_source by calling std::ptr::read, a function that moves the object from an arbitrary location and returns it by value. After the call to std::ptr::read, the memory where the object resided is treated as uninitialized, and it is left to Crossbeam to free it at a later epoch switch. std::ptr::read is marked unsafe because Rust cannot trace the pointer to prove that we own the object at that location. But since we pass it a location that was just swapped out and that set_source will never read again, we know calling std::ptr::read is safe. An unsafe block hiding an unsafe implementation behind a completely safe public API is the essence of unsafe Rust. A safe function is not only one that uses no unsafe code, but also one that can be called with any kind of argument without incurring undefined behavior.
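
As a standalone illustration of the std::ptr::read pattern (independent of Crossbeam), the following sketch moves a String out of a heap slot by value and then frees the slot without running the destructor a second time; the ManuallyDrop cast stands in for what the collector does when it later reclaims the memory:

use std::mem::ManuallyDrop;

fn main() {
    let raw: *mut String = Box::into_raw(Box::new(String::from("owned by the heap slot")));
    // move the String out; the slot now holds logically uninitialized memory
    let value = unsafe { std::ptr::read(raw) };
    // free the slot without dropping its (already moved-out) contents
    unsafe { drop(Box::from_raw(raw as *mut ManuallyDrop<String>)) };
    println!("{}", value); // `value` owns the String and drops it normally
}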

This version of LazyTransform satisfies the requirements of the exercise. It is not only lock-free, but also wait-free because it avoids compare-and-swap retry loops. The size of the LazyTransform object equals the size of two pointers and one bool, and the pointers only ever allocate the amount of memory needed to store S and T respectively. Given the requirements, that is as memory-efficient as it gets.

Coco

Having written and tested the above code, I had expected it to be the final version of the code. However, running some additional test code had a strange effect on my memory monitor – the program was leaking memory, and in large quantities! I had both expected and in prior runs observed the memory to fluctuate due to epoch-based memory reclamation, but this was different. What I observed here was memory consumption monotonically growing for as long as the program was running. Also, the leak could only be reproduced when using a value type that allocates, such as a String. It looked like Crossbeam was simply not dropping the unreachable objects.

A careful look at the code shows that it is inconsistent in its memory management of shared values. set_source simply forgets about the previous value, presumably expecting guard.unlinked(prev) to dispose of it. But try_transform() uses std::ptr::read() to move the source data out of the Crossbeam-managed allocation, and then also calls guard.unlinked on it. Both cannot be correct: either guard.unlinked doesn’t drop the underlying object and guard.unlinked(prev) in set_source leaks memory, or it does drop it and guard.unlinked(source) in try_transform results in a double free, because the underlying source_data was already moved into transform_fn and dropped there.

I posted a StackOverflow question and, again to my surprise, it turned out that not running destructors was a known limitation of the current Crossbeam. The description of Crossbeam does state that “the epoch reclamation scheme does not run destructors [emphasis in the original], but merely deallocates memory.” This means that Crossbeam’s guard.unlinked(prev) deletes the dynamically allocated storage for T internally created by Atomic<T>, but doesn’t drop the underlying T instance. That works for the lock-free collections supported by current Crossbeam, which automatically remove items “observed” by the collection user (no peeking is allowed) and take ownership of the object inside, similar to our AtomicCell::swap. Support for such semantics fits the needs of a queue or stack, but not e.g. a lock-free map, or even a simple container such as LazyTransform.

Maintainers of Crossbeam are aware of the issue and are working on a new version which will include many improvements, such as the support for full dropping of objects and an improved and tunable garbage collection. A preview of the new Crossbeam design is already available in the form of the Concurrent collections (Coco) crate, whose epoch-based reclamation implements the object dropping we need, and also optimizes epoch::pin.

Switching to Coco finally resolves the memory leak and leads to the following LazyTransform implementation:

extern crate coco;

use std::sync::atomic::{AtomicBool, Ordering};
use coco::epoch::{self, Atomic, Owned, Ptr, Scope};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Atomic<S>,
    value: Atomic<T>,
    transform_lock: LightLock,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Atomic::null(),
            value: Atomic::null(),
            transform_lock: LightLock::new(),
        }
    }

    pub fn set_source(&self, source: S) {
        epoch::pin(|scope| unsafe {
            let source_ptr = Owned::new(source).into_ptr(&scope);
            let prev = self.source.swap(source_ptr, Ordering::AcqRel, &scope);
            if !prev.is_null() {
                scope.defer_drop(prev);
            }
        });
    }

    fn try_transform(&self, scope: &Scope) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source = self.source.swap(Ptr::null(), Ordering::AcqRel, &scope);
            if source.is_null() {
                return None;
            }
            let source_data;
            unsafe {
                source_data = ::std::ptr::read(source.as_raw());
                scope.defer_free(source);
            }
            let newval = match (self.transform_fn)(source_data) {
                Some(newval) => newval,
                None => return None,
            };
            let prev = self.value.swap(Owned::new(newval.clone()).into_ptr(&scope),
                                       Ordering::AcqRel, &scope);
            unsafe {
                if !prev.is_null() {
                    scope.defer_drop(prev);
                }
            }
            return Some(newval);
        }
        None
    }

    pub fn get_transformed(&self) -> Option<T> {
        epoch::pin(|scope| {
            let source = self.source.load(Ordering::Relaxed, &scope);
            if !source.is_null() {
                let newval = self.try_transform(&scope);
                if newval.is_some() {
                    return newval;
                }
            }
            unsafe {
                self.value.load(Ordering::Acquire, &scope)
                    .as_ref().map(T::clone)
            }
        })
    }
}

Compared to Crossbeam, the differences are minor, and mostly to Coco’s advantage.

Where appropriate, defer_drop is used to drop the object in addition to freeing the memory it occupied. This eliminates the leak. The inconsistency regarding ptr::read is no longer present – when ptr::read is used to move the object out of the Coco-managed memory, defer_free is used in place of defer_drop.

epoch::pin no longer returns a guard; it now accepts a closure that will be run with the thread pinned to the current epoch (“active”). This makes no practical difference in our example, but it might reduce the readability of code ported from Crossbeam that embedded flow control constructs such as return or break inside a pinned block.

Finally, accessing the value through a shared reference now requires an unsafe block. This is unfortunate, as pinning was explicitly designed to guarantee safety of such access. The problem was that such access was really safe only when memory orderings were correctly specified. As this was impossible to enforce statically, unsafe was introduced to eliminate a serious soundness issue in current Crossbeam.

Performance

After taking the trouble to write the code, it makes sense to measure it and see how much of a performance benefit Rust brings to the table. The Java version is admittedly much shorter (although not necessarily easier to devise) because it can rely on a volatile variable to achieve atomic access to an object. Likewise, memory reclamation is a non-issue because it is transparently handled by the GC. But surely this comes at a cost? Even with the advantage GC brings to lock-free code, Rust is a statically typed ahead-of-time compiled language specifically targeted for systems and high-performance programming.

The benchmark simulates a busy single producer thread that occasionally publishes a randomly generated value, and then remains busy spending CPU for several microseconds. At the same time, 8 consumer threads are continuously reading the transformed and (most of the time) cached value, trivially inspecting it in order to prevent a very clever compiler from optimizing away the whole loop. The whole benchmark is run three times, allowing the JVM to warm up the JIT, and also to make it easier to spot anomalies between runs.
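
For orientation, here is a rough sketch of the benchmark’s shape, assuming the lock-free LazyTransform defined above can be shared between threads through an Arc; the iteration counts are placeholders, and the real, tunable code is in the downloadable source mentioned below:

use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn busy_work_us(us: u128) {
    let start = Instant::now();
    while start.elapsed().as_micros() < us {}
}

fn main() {
    let lt = Arc::new(LazyTransform::new(|n: u64| Some(n.to_string())));

    // 8 consumers keep reading the (mostly cached) transformed value
    let consumers: Vec<_> = (0..8)
        .map(|_| {
            let lt = Arc::clone(&lt);
            thread::spawn(move || {
                const READS: u32 = 1_000_000;
                let start = Instant::now();
                for _ in 0..READS {
                    if let Some(v) = lt.get_transformed() {
                        assert!(!v.is_empty()); // trivial inspection to keep the read alive
                    }
                }
                start.elapsed() / READS // average time per get_transformed call
            })
        })
        .collect();

    // a single producer occasionally publishes a new source, then stays busy for a while
    for n in 0..1_000u64 {
        lt.set_source(n);
        busy_work_us(5);
    }

    for consumer in consumers {
        println!("avg get_transformed: {:?}", consumer.join().unwrap());
    }
}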

To run the benchmark:

  • Download the source.
  • For the Rust version, build it with cargo build --release and run target/release/bench.
  • For the Java version, cd to src/java, byte-compile it with javac *.java and run it with java Benchmark.

Results

On my 2012 desktop workstation with 3GHz Xeon W3550, the Java benchmark reports an average of 7.3 ns per getTransformed invocation. The Rust benchmark reports 128 ns in get_transformed, a whopping 17 times slower execution. These timings are in stark contrast with the original Crossbeam article which documents the lock-free queue implemented in Crossbeam as not only competitive with, but consistently faster than java.util.concurrent.ConcurrentLinkedQueue. What could explain such a performance difference in this case?

Let’s consider the most common “happy case” for get_transformed, when it simply returns the cached value. The Java version performs the following:

  • an atomic load of source with the sequentially consistent ordering (the docs define get as having “the memory effects of reading a volatile variable”, which is sequentially consistent in Java.)
  • if non-null, as it will be in the happy case, an atomic load of transformed.

So we have two atomic loads, a check against null, and looping overhead. The Rust version also performs two loads, a relaxed load of self.source and an acquire load of self.value. However, behind the scenes it additionally does the following:

  • Pin the epoch
  • Check the garbage bins for dead objects
  • Clone the cached String value, which allocates
  • In the get_transformed caller, destroy the cloned String, again using the allocator

For a start, using a String value that Rust clones and Java only returns by pointer would appear to favor Java. Since a typical payload object is expected to be a complex object, it would surely be more efficient to make Payload an Arc<String>. “Cloning” the payload will only increment a reference count and string allocations will be eliminated. However, making this change not only fails to pay off, it makes the code even slower, with an average get_transformed invocation now taking 290 ns!

Breakdown

To make sense of this measurement, I decided to strip down get_transformed to its very basics, breaking its contract where necessary, just to see how much time each part takes. Here are the findings, now measuring only the “happy case” obtained with PRODUCE_ITERS reduced to 1. Repeating the benchmark showed some variation in numbers, but not significant enough to change their overall meaning. Keep in mind that the absolute figures are obtained on my old desktop; a modern computer would be significantly faster.

  • single unsafe relaxed load of u64 value: 3 ns
  • epoch::pin() + u64 payload: 26 ns
  • like the above, but with a payload that allocates, Box<u64>: 74 ns
  • Arc<u64> payload: 230 ns
  • String (3 chars): 95 ns
  • String (128 chars): 105 ns
  • String (1024 chars): 136 ns
  • Arc<String> (any string size): 231 ns
  • String (2048 chars): 280 ns

Pinning the epoch costs around 23 ns on my machine (the 26 ns measurement minus the 3 ns load and some loop overhead). This is consistent with the documentation, which cautions that pinning takes 10-15 ns on a modern computer. Pinning is likely the only work being done here, as no allocation is needed and the thread-local garbage bins are empty. The u64 payload we’re cloning is Copy, so its clone() just loads the primitive value. No garbage is generated in the “happy case” because neither the source nor the cached value is written to, only read.

One surprising finding is that atomic reference counting is expensive, especially so when there is high contention over access to the object. It is no wonder that Rust opted to implement a separate single-threaded reference-counted type – using atomics adds a large overhead to Arc::clone compared to Rc::clone. (Also confirmed by separately benchmarking that comparison only.) Compared to the cost of cloning an Arc, string allocation and copying are fantastically optimized. It takes strings of almost 2 kilobytes for String::clone to match the cost of contended Arc::clone. Most surprisingly, it turns out that a heap allocation is actually cheaper than increasing and decreasing an atomic reference count. Allocation time can be obtained by subtracting the u64 timing from the Box<u64> one, which pegs allocation at under 50 ns in an 8-thread scenario. jemalloc’s segmented locking seems to handle this workload very well.
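
The Arc versus Rc comparison can be sketched with a trivial single-threaded loop along the following lines. It is uncontended, so the gap is smaller than in the 8-thread benchmark, but the overhead of the atomic reference count should already be visible:

use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

fn time<F: FnMut()>(label: &str, iters: u32, mut f: F) {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    println!("{}: {:?} per clone", label, start.elapsed() / iters);
}

fn main() {
    let rc = Rc::new(String::from("payload"));
    let arc = Arc::new(String::from("payload"));
    // black_box keeps the optimizer from eliding the clone/drop pairs
    time("Rc::clone", 10_000_000, || {
        std::hint::black_box(Rc::clone(&rc));
    });
    time("Arc::clone", 10_000_000, || {
        std::hint::black_box(Arc::clone(&arc));
    });
}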

It would appear that this kind of micro-benchmark favors GC-backed languages, although it’s hard to tell by how much. It would be interesting to extend it to include some sort of processing and test whether the cumulative cost of garbage collection elsewhere in the program tips the scale.

Conclusion

Implementing the exercise was an interesting learning experience in both Rust and lock-free programming. The material presented here of course only scratches the surface of the topic. Jeff Preshing’s articles provide a much more in-depth treatment and further references.

Within Rust, Crossbeam and its successor Coco provide a convenient way to implement custom lock-free algorithms. This convenience does come at a cost – the mere cost of pinning the epoch would make Rust’s get_transformed fast path 3-4 times slower than the equivalent Java. The real challenge comes when sharing objects among threads. Timings show that lock-free Rust requires very careful allocation design, as the cost of memory management can easily dwarf that of the lock-free operations chosen for their efficiency in the first place. Specifically, Arc is not a panacea and can even add significant overhead to performance-sensitive designs. If access to a reasonably small object is needed, it may actually be more efficient to clone the object than to expose it through Arc. If some portion of a large object needs to be accessed, it may be more efficient to temporarily expose a reference to the object to a closure provided by the caller, which can then pick up the information it needs. Whatever solution is chosen, there does not appear to be a silver bullet that would fit all kinds of objects.

Both Crossbeam and Coco require unsafe in a couple of key places, so they are not as elegant as the statically checked lock-based design offered by the standard library. The libraries themselves are not to blame here – it is a hard problem and might require additional research and possibly even support from the language to resolve satisfactorily. This area is still under active research, especially in the wider C++ community, and it will be interesting to follow how it will develop.

Exploring lock-free Rust 2: Atomics

The previous article showed a mutex-based implementation of the LazyTransform value container. However, the exercise was explicit about implementing a “lock-free” container. What does that mean, exactly?

Limitations of locks

The interaction between LazyTransform methods was carefully designed to prevent deadlock, but in a busy system one could expect significant contention for the source mutex between get_transformed and set_source. This is not optimal; we would like set_source to just store the new data without any waiting at all, and get_transformed to keep returning the old cached value until a new one is observed.

Consider also the contention for the value read-write lock. After get_transformed performs the transformation, it must acquire the value lock in write mode, which means it must wait until all the readers release it. What is even worse is that once the value lock is finally acquired in write mode, other readers must wait for the writer to release it. Of course, the writer only holds the lock for as long as it takes to store new_value, which is an extremely efficient operation, but all readers are still blocked, and if the writing thread gets suspended by the OS scheduler, all readers will remain blocked until the writer thread resumes. The readers, on the other hand, hold the lock for as long as it takes to clone the value, which is not bounded in any way. What is common to all these cases is that progress of the entire system hinges on the behavior or scheduling of one thread, and this is what lock-free algorithms are designed to avoid.

In a lock-free system, the program will make progress as long as any of its threads are allowed to run. Being lock-free is not just about refraining from using the OS lock primitives, but about carefully designing the system so it cannot lock up, be it by deadlock or livelock or unfortunate thread scheduling patterns. As Jeff Preshing puts it, a lock-free application won’t lock up even with hypothetical thread scheduling decisions being made by your worst enemy.

Another reason to avoid mutexes is their impact on object size. On x86_64 Linux, a pthread_mutex_t, internally used by Rust’s Mutex, takes 40 bytes of memory, and a RwLock takes 56. To make Mutex movable (a requirement for all Rust types), the actual OS lock object is dynamically allocated. The smallest possible Rust mutex, a Mutex<()>, takes 16 bytes of memory for the object itself (8 for the pointer to the dynamically allocated OS mutex and 8 more for the poison flag). With each LazyTransform object requiring ~200 bytes of memory and three dynamic allocations just to store the locks, we would not want to have millions of such objects lying around. If we don’t really need the waiting functionality offered by a full-featured Mutex and RwLock, we would do well to eliminate them and significantly reduce the footprint of LazyTransform.
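
The sizes are easy to check on any given toolchain; the figures quoted above are for the pthread-based locks Rust used at the time, and newer standard library versions use lighter implementations, so current numbers will differ:

use std::mem::size_of;
use std::sync::{Mutex, RwLock};

fn main() {
    println!("Mutex<()>:  {} bytes", size_of::<Mutex<()>>());
    println!("RwLock<()>: {} bytes", size_of::<RwLock<()>>());
    println!("Mutex<u64>: {} bytes", size_of::<Mutex<u64>>());
}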

AtomicBool

For a start, let’s eliminate the transform_lock mutex, an easy target as it holds no data and is only ever acquired with try_lock. As used, this lock was already “lock-free” in the above sense, because owning it would never cause another thread to stall.

We could replace it with a bool, but then try_lock would be doomed to a race condition in the window between checking the previous lock state and changing it to “locked”. Not to mention that Rust would see this as a data race and the compiler would reject access to a bool without protecting it with a mutex. Fortunately, a mutex is not the only option for avoiding data races. Modern processors support a number of specialized instructions to read and update values atomically, including one that fetches the current machine word and atomically stores a new one. In Rust, these instructions are exposed as methods on types in the std::sync::atomic module. The stable channel provides AtomicBool, AtomicUsize, AtomicIsize, and AtomicPtr<T>. For transform_lock, we can use AtomicBool to store a flag indicating whether the lock is currently taken. The resulting try_lock would look like this:

if !self.transform_lock.swap(true, Ordering::Acquire) {
    // lock acquired
    ...
    // release the lock
    self.transform_lock.store(false, Ordering::Release);
} else {
    // failed to acquire the lock
}

swap atomically stores true into the lock and returns the previous value – the operation Java AtomicBoolean calls getAndSet. If the previous value was false, it means it was us that changed the value from false to true and thus successfully acquired the lock. Conversely, if the previous value was true, it means that the lock had been already taken by someone else, and our store of true was a no-op. The Ordering argument tells the compiler which memory order is requested for the operation. The memory order corresponds to guarantees that place constraints on how the compiler and the CPU can reorder data reads and writes, both ordinary and atomic ones, around one particular atomic read/write; see the nomicon for a detailed explanation. For transform_lock we use the acquire and release ordering, so named because of its usefulness in implementing locks. A store with the “release” ordering guarantees that all writes executed prior to the atomic store will be observable by any thread that observes the newly stored atomic value. It is used when releasing a lock by setting the “locked” flag to false with the “release” ordering, which signals to other threads that the lock is now available for the taking, but also ensures that the objects that have been updated while the lock was held are available for reading. Conversely, any code that observes the false value using the “acquire” ordering can be sure that its subsequent load of other locations will also observe the results of all writes executed before the “release” store.
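
A minimal, self-contained sketch of this acquire/release pairing, unrelated to LazyTransform, might look like this: the writer fills DATA and then stores the flag with “release”; a reader that observes the flag as true with “acquire” is guaranteed to also observe the earlier write to DATA:

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // payload write
        READY.store(true, Ordering::Release); // publish ("release") the payload
    });
    let reader = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {} // pairs with the release store
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    writer.join().unwrap();
    reader.join().unwrap();
}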

Note that the AtomicBool::swap method operates on the object by &self shared reference. This is another example of the conceptual magic like the one employed by mutexes and described in the previous post. The difference is that with AtomicBool no waiting takes place, so we are not trusted to ever get a mutable reference to the underlying bool. Instead, we tell AtomicBool how to change the object, and it executes the change through hardware mechanisms that are guaranteed to be exclusive, which allows it to declare the mutating methods to take &self.

Here is an implementation using AtomicBool for transform_lock:

use std::sync::{Mutex, RwLock};
use std::sync::atomic::{AtomicBool, Ordering};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Mutex<Option<S>>,
    value: RwLock<Option<T>>,
    transform_lock: AtomicBool,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Mutex::new(None),
            value: RwLock::new(None),
            transform_lock: AtomicBool::new(false),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut locked_source = self.source.lock().unwrap();
        *locked_source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        if !self.transform_lock.swap(true, Ordering::Acquire) {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    self.transform_lock.store(false, Ordering::Release);
                    return new_value;
                }
            }
            self.transform_lock.store(false, Ordering::Release);
        }
        self.value.read().unwrap().clone()
    }
}

Lock abstraction

One disadvantage of directly using an AtomicBool compared to a mutex is that we don’t get a fancy guard that automatically unlocks the lock when going out of scope. Instead, we must remember to unlock it ourselves, which can be difficult to guarantee during panic. A robust implementation would wrap the AtomicBool into a lock object that returns a guard that implements Drop:

struct LightLock(AtomicBool);

impl LightLock {
    pub fn new() -> LightLock {
        LightLock(AtomicBool::new(false))
    }

    pub fn try_lock<'a>(&'a self) -> Option<LightGuard<'a>> {
        let was_locked = self.0.swap(true, Ordering::Acquire);
        if was_locked {
            None
        } else {
            Some(LightGuard { lock: self })
        }
    }
}

struct LightGuard<'a> {
    lock: &'a LightLock,
}

impl<'a> Drop for LightGuard<'a> {
    fn drop(&mut self) {
        self.lock.0.store(false, Ordering::Release);
    }
}

In accordance with Rust conventions, the method that attempts the lock is named try_lock. Our version returns an Option that tells the caller whether taking the lock succeeded and, if so, contains the guard object to automatically unlock it on exit. Rust’s Mutex::try_lock has a more sophisticated return value that allows for other kinds of errors, but the logic of using it is essentially the same.

Using this type, transform_lock gets declared as LightLock and initialized as LightLock::new(). get_transformed automatically becomes panic-safe and also cleaner and more maintainable because it no longer needs to remember to release the lock before every exit point:

    pub fn get_transformed(&self) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    return new_value;
                }
            }
        }
        self.value.read().unwrap().clone()
    }

Getting rid of one mutex wasn’t that hard. Can the same logic apply to other locks? transform_lock was distinctive in that it never guarded concrete data, it was only used to serialize access to the transformation of source into value. This ensured that source and value locks, which are actually exclusive in nature, were held as briefly as possible. Once the lock was obtained, the actual protection of source and value from access by different threads was left to their respective locks. To safely access stored values without locks, we will need some way to access the values atomically like we did with the bool stored by the AtomicBool.

Atomic arbitrary values

Ideally we would have at our disposal a generic atomic type, let’s call it AtomicCell<X>, that provides a swap method to atomically modify the shared value and retrieve the old one, and load to only read the existing one. With the source and value fields declared as AtomicCell<Option<S>> and AtomicCell<Option<T>> respectively, LazyTransform could look like this:

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn set_source(&self, source: S) {
        self.source.swap(Some(source));
    }

    fn try_transform(&self) {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source_maybe = self.source.swap(None);
            let source = match source_maybe {
                Some(source) => source,
                None => return,
            };
            let newval = match (self.transform_fn)(source) {
                Some(newval) => newval,
                None => return,
            };
            self.value.swap(Some(newval));
        }
    }

    pub fn get_transformed(&self) -> Option<T> {
        let source = self.source.load();
        if source.is_some() {
            self.try_transform();
        }
        self.value.load().as_ref().map(T::clone)
    }
}

get_transformed is now split into two methods: get_transformed, which now only returns the cached value, and a non-public try_transform, which actually attempts the transformation. set_source atomically swaps in the new source, and drops the old one (by virtue of ignoring the return value of swap). try_transform does the exact same thing with the source, and also with value when it obtains a new one. When the old value needs to be read, load() is called to access a reference to the underlying value, which is cloned.

The problem with the above code is that AtomicCell doesn’t exist. Rust’s standard std::sync::atomic module is limited to atomic manipulation of pointer-sized values which correspond to the operands of the CPU’s atomic instructions. Thus the most promising building block for implementing an AtomicCell is the AtomicPtr type, which provides atomic access to a pointer. An AtomicCell<T> would be represented by an AtomicPtr<T> that references T stored dynamically. The dynamic allocation is unfortunate, but unavoidable if we ever want to support T larger than a pointer. Using a pointer, AtomicCell::swap could look like this:

use std::sync::atomic::{AtomicPtr, Ordering};

struct AtomicCell<T> {
    ptr: AtomicPtr<T>,
}

impl<T> AtomicCell<T> {
    fn new(initial: T) -> AtomicCell<T> {
        AtomicCell {
            ptr: AtomicPtr::new(Box::into_raw(Box::new(initial)))
        }
    }

    fn swap(&self, new: T) -> T {
        let new_ptr = Box::into_raw(Box::new(new));
        let old_ptr = self.ptr.swap(new_ptr, Ordering::AcqRel);
        unsafe {
            *Box::from_raw(old_ptr)
        }
    }
}

AtomicCell::new simply creates a cell whose ptr points to a valid heap-allocated object, which is AtomicCell’s invariant. The interesting part happens in swap.

In swap the new object, owned by the local variable received from the caller, is moved to the heap and swapped into AtomicPtr as a raw pointer. The old object is retrieved from the same call to AtomicPtr::swap, also as a raw pointer, immediately converted back to a Box whose contents are moved from it, and returned by value. The ownership of each object is clear at every point. Values stored in the AtomicPtr are (conceptually) owned by the AtomicPtr. When writing a value to the pointer, the ownership over the new value is relinquished to the AtomicPtr using Box::into_raw, and at the same time the ownership of the old value is reasserted with Box::from_raw.

This last part is also the first time we’ve had to use unsafe in the implementation. Box::into_raw is not unsafe because it only gives back a raw pointer, trusting the caller to arrange for the object to be eventually destroyed. Even if the caller fails to uphold their end of the bargain, the worst thing that can happen is a leak, already possible with std::mem::forget. Box::from_raw is unsafe because it reinterprets a raw pointer as a box without any proof that the raw pointer is unreachable from elsewhere or even valid. We, however, know both things to be true. The pointer must be valid because we are only ever passing it pointers created by Box::into_raw, as only such pointers are stored in the AtomicPtr. We also know that the object is not reachable from another thread because we have just obtained it from swap, meaning it has already been disowned by the previous owner. Thus our swap is sound and the unsafe block is a mere implementation detail. Taking ownership is not only safe, but necessary if we want to prevent the old value from leaking.
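
The same ownership round-trip can be observed in isolation, outside of AtomicCell, with a snippet along these lines:

fn main() {
    // into_raw relinquishes ownership of the heap value to the raw pointer
    let raw: *mut Vec<i32> = Box::into_raw(Box::new(vec![1, 2, 3]));
    // ... the raw pointer could now be stored in an AtomicPtr and later swapped out ...
    // from_raw reasserts ownership, so the vector is dropped normally, exactly once
    let recovered: Box<Vec<i32>> = unsafe { Box::from_raw(raw) };
    assert_eq!(*recovered, vec![1, 2, 3]);
}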

swap was easy. Now what about AtomicCell::load, which we also need in order to implement lock-free reading? Based on AtomicPtr::load, a load could be imagined like this:

// doesn't work - see below
fn load(&self) -> T {
    let old_ptr = self.ptr.load(Ordering::Acquire);
    unsafe {
        *Box::from_raw(old_ptr)
    }
}

This code is written in analogy to swap, only without modifying the existing pointer. Unfortunately, it is unsound, which means that it breaks Rust’s ownership rules and causes a crash. The problem is that load takes ownership of the object behind the atomic pointer without also changing the pointer to point to something else. This leads to a double free as soon as load is called again to observe the same pointer, or in any subsequent call to swap, which will attempt to take ownership of a freed value. And there is no easy way to fix it, either. For example, suppose we defined load() to return Option<&T>, thus avoiding taking ownership of the observed object. This would prevent the double free, but it wouldn’t work on the type level because there is no way to express the lifetime of the returned reference. While multiple loads would no longer be an issue, a swap() could still invalidate the value while a reference is live. Immediately cloning the observed object and returning a new value is also unsound because the reference can be invalidated by a swap before T::clone finishes. The atomic_cell crate resolves the issue using a lock around clone(). Locking is performed with a spinlock, whose lock looks like LightLock::try_lock above, except it retries taking the lock in a busy loop until it succeeds. Despite its efficiency when the lock is uncontended, a spinlock is still a lock and it is not acceptable in the middle of a lock-free algorithm.
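
For comparison, here is a rough sketch of what such a “lock around access” cell could look like; it stores the value inline rather than behind a pointer, which is enough to show the trade-off. This is not the atomic_cell crate’s actual implementation, and while the sketch is correct, the spin lock makes it a lock-based structure, not a lock-free one:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct SpinCell<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Sound because every access to `value` goes through the spin lock below.
unsafe impl<T: Send> Sync for SpinCell<T> {}

impl<T: Clone> SpinCell<T> {
    fn new(value: T) -> SpinCell<T> {
        SpinCell { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop(); // busy-wait until the flag flips back to false
        }
        let result = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        result
    }

    fn load(&self) -> T {
        // the clone happens under the lock, so swap cannot free the value mid-clone
        self.with_lock(|v| v.clone())
    }

    fn swap(&self, new: T) -> T {
        self.with_lock(|v| std::mem::replace(v, new))
    }
}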

Ideally load() would return a guard that provided a reference to the underlying value, much like the guard returned by Mutex::lock. The existence of the guard will prevent writers like swap() from destroying the object. Instead, swap will store the new raw pointer into AtomicPtr and mark the old one for deletion, deferring its actual destruction until all the load guards have been dropped. It is far from obvious how to implement this as efficiently as possible, and of course without heavy synchronization, lest we lose the very benefits lock-free code was supposed to provide.

Lock-free access to arbitrary values with correct memory management is the topic of the next article.

Exploring lock-free Rust 1: Locks

As a learning exercise I set out to implement a simple lock-free algorithm in Rust. It was inspired by a problem posed at job interviews at a company where a friend works. The problem is simple enough that it can be tackled by a beginner, but tricky enough to require some thought to get right – and Rust presents several new challenges compared to the original Java.

This series of articles presents the evolution of a simple Rust lock-free container starting from single-threaded, progressing to a multi-threaded variant with locks, and finally settling on a lock-free implementation, discussing the trade-offs at each step. A basic understanding of Rust and of multi-threaded programming is assumed, but the articles might be useful to adventurous beginners at both. Do note, however, that the author is not an expert at lock-free programming, so there might be errors – if you find some, please do leave a comment.

The Exercise

Implement a class providing access to a cached value. In more detail:

Write a LazyTransform class that stores a single value, allowing it to be updated as needed. The value is potentially expensive to compute, so the setter method, set_source receives a “source” that will be used to compute the final value using a transformation function received in the LazyTransform constructor. Transformation must not be attempted until requested by get_transformed. Once generated, the value is cached and returned by further invocations of get_transformed, until invalidated by a new call to set_source.

A single-threaded version can be summarized with the following Python:

class LazyTransform:
    def __init__(self, transform_fn):
        self.transform_fn = transform_fn
        self.source = None
        self.value = None

    def set_source(self, new_source):
        self.source = new_source

    def get_transformed(self):
        if self.source is not None:
            newval = self.transform_fn(self.source)
            if newval is not None:
                self.value = newval
                self.source = None
        return self.value

The class must support being called from multiple threads, with the following semantics:

  • set_source and get_transformed can and will be called on the same LazyTransform instance in parallel;
  • Once set_source completes, future invocations of get_transformed must eventually start returning the new value.
  • Read-heavy usage pattern is expected, so get_transformed must not block regardless of how many times set_source or get_transformed are called in other threads. The one exception is when a new source is detected – it is allowed for get_transformed to block until the transformation finishes before returning the transformed value (and caching it for future calls).
  • The code must be lock-free: neither set_source nor get_transformed should get stuck waiting on each other, even if they are called in quick succession or in parallel by many threads, or both.

Rust API tradeoffs

Before proceeding to parallelization, let’s review how the above interface would map to Rust’s type system. Ideally we’d want to place as few restrictions as possible on the type of values used for the source and the final objects; either could be as simple as a single u32 or a huge heap-allocated object. We know, for example, that both the source and the value type must be Send, because they need to be accessed from threads different from those that create them.

Another necessary restriction is that the final value type must be Clone. Why? Consider how the concept of “returning the cached value”, the return self.value line in the above Python, maps to Rust. In Python the semantics are clear because all of its objects are heap-allocated, and you always get the same instance shared. This is also what the original Java exercise specifies, as it returns an Object. But a correct Rust implementation needs to deal with an actual value stored in the cache, and has three options for returning it:

  1. move the object out of the container, typically by making the value field an Option and returning self.value.take();
  2. return a reference to the object, return &self.value;
  3. clone the object and return the cloned value to the caller.

The first option obviously doesn’t work because it would prevent get_transformed from returning the cached value more than once. The second option looks feasible until one considers that the returned reference cannot be allowed to outlive the stored value. Since the stored value can be invalidated by a call to set_source, which can happen literally at any time, it is clear that allowing a reference to be returned would be unsound. Indeed, all such attempts are promptly rejected by the borrow checker.

Although cloning at first appears like it would be inefficient for arbitrary objects, it actually provides the greatest flexibility for the user. Light values, such as numeric IDs which are Copy (and hence also Clone), or small strings which are cheap to clone, can be placed in the cache as-is. Heavy values, on the other hand, can be dynamically allocated and accessed as Arc<ActualData>, ensuring that their clone only increments a reference count, providing the semantics one would expect from equivalent Python or Java. If needed, one can even combine the two and store a tuple of a light object and a heavy one.
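
As a hypothetical example of the heavy case, a configuration-parsing transformation could cache Arc<Config> rather than Config itself, so that returning the cached value only bumps a reference count. Config and parse_config below are invented for the example, and the container used is the single-threaded one shown a little further down:

use std::sync::Arc;

struct Config {
    entries: Vec<(String, String)>,
}

fn parse_config(text: &str) -> Config {
    Config {
        entries: text
            .lines()
            .filter_map(|line| {
                let mut parts = line.splitn(2, '=');
                Some((parts.next()?.to_string(), parts.next()?.to_string()))
            })
            .collect(),
    }
}

fn main() {
    // T = Arc<Config> is cheap to clone; the parsed data itself is shared, not copied
    let mut lazy = LazyTransform::new(|text: String| Some(Arc::new(parse_config(&text))));
    lazy.set_source("retries=3\ntimeout=30".to_string());
    let config = lazy.get_transformed().unwrap();
    assert_eq!(config.entries.len(), 2);
}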

So, LazyTransform needs to be generic on the value type (T) and the source type (S). But let’s not forget the transformation function received by the constructor. Fixing its type to fn(S) -> T would limit it to stateless functions, and we would like the user to be able to provide an arbitrary closure for transformation. One option would be to accept a generic function object in the constructor and box it in a Box<Fn(S) -> T>, but that would impose a dynamic allocation on each LazyTransform instance, as well as an indirection when invoking the function. If the transformation function is known at compile time and carries no state, it should incur neither storage nor run-time indirection overhead. This is easily achieved by adding a third type parameter, that of the transformation function type.

As a demonstration of the API, here is a single-threaded implementation of the container:

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Option<S>,
    value: Option<T>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: None, value: None,
        }
    }

    pub fn set_source(&mut self, source: S) {
        self.source = Some(source);
    }

    pub fn get_transformed(&mut self) -> Option<T> {
        if let Some(source) = self.source.take() {
            let newval = (self.transform_fn)(source);
            if newval.is_some() {
                self.value = newval;
            }
        }
        self.value.clone()
    }
}

In spirit this is exactly the same thing as the original Python, except sprinkled with a healthy dose of static typing.

Compile-time thread safety

What happens if we try to share an instance of LazyTransform among threads? Rust will prevent that at compile time — invoking a &mut method from multiple threads would require creating multiple &mut references to the same object, which is prevented by the borrow checker. For example, the following doesn’t compile:

fn main() {
    let mut lt = LazyTransform::new(|x: u64| Some(x + 1));
    std::thread::spawn(move || {            // lt moved here
        for i in 0..10_000 {
            lt.set_source(i);
        }
    });
    while lt.get_transformed().is_none() {  // lt used after move
    }
    let val = lt.get_transformed().unwrap();
    assert!(val >= 1 && val <= 10_000);
}

lt gets moved into the closure executed by the new thread, but then it is no longer available for use by the main thread. Sending it by reference wouldn’t work because there can exist only one &mut reference to an object, so we wouldn’t be allowed to send the same reference to multiple threads. Allocating LazyTransform dynamically and using Arc to share it among threads wouldn’t help either because Arc only provides shared access to the data it owns.

In Rust, supporting parallel access to a container requires not only a change in implementation, but also in method signatures. This is an intentional design decision – while Python or Java single-threaded code will happily execute when called from multiple threads, providing incorrect results, the Rust version will refuse to compile when the thread-unsafe LazyTransform object is accessed from two threads.

The compiler uses simple rules to decide whether it is safe to share an object between threads:

  1. Methods invoked from more than one thread must accept &self rather than &mut self. This rule is enforced by the borrow checker for single-threaded code as well.
  2. The object must not contain values of types specifically blacklisted for multi-threaded access even through shared references. In Rust terms, its type must “be Sync”, meaning it implements the Sync marker trait, which most objects do. Examples of non-Sync types are Rc or Cell, and both have thread-safe equivalents.

At a glance, the first rule seems to rule out LazyTransform as a multi-threaded type. Both its public methods clearly modify the object, with set_source even doing that in a way that is observable from the outside. Changing the signatures to accept &self instead of &mut self fails to compile because both methods modify the data behind the &self shared reference, which is prohibited. Accessing an object inside &self will also result in further shared references that are read-only.

To modify data, we must find a way to obtain an exclusive mutable reference from the shared reference to self. This is not allowed for ordinary objects because the compiler would have no way to ensure that writes are exclusive, i.e. that while a thread holds a mutable reference to a value, no other thread can read it or write to it. However, if we could statically convince Rust that the reference’s ownership of the data will be exclusive, it would be within the rules to allow the conversion. This is where mutexes come in.

Mutexes

Rust’s Mutex type provides read-write access to the value it protects, using the appropriate operating system primitive to ensure that this can be done by only one thread at a time. Here is an implementation of LazyTransform updated to use a mutex:

use std::sync::Mutex;

struct LazyState<T, S> {
    source: Option<S>,
    value: Option<T>,
}

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    state: Mutex<LazyState<T, S>>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            state: Mutex::new(LazyState { source: None, value: None }),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut state = self.state.lock().unwrap();
        state.source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        let mut state = self.state.lock().unwrap();
        if let Some(new_source) = state.source.take() {
            let new_value = (self.transform_fn)(new_source);
            if new_value.is_some() {
                state.value = new_value;
            }
        }
        state.value.clone()
    }
}

Both methods now operate on &self, relying on the mutex to obtain write access to the data in self.state. As far as method signatures are concerned, this is the final version of the API – all future versions will only differ in implementation.

The storage is now split into transform_fn, which is itself immutable and can be invoked from a shared reference, and state, the mutable part of the object’s state moved to a separate struct and enclosed in a mutex. As can be seen here, Rust’s Mutex is a container that holds and owns the data it protects. While that coupling looks strange at first, it enables the mutex to safely grant read-write access to the data it owns.

Calling Mutex::lock() waits until an exclusive OS lock is acquired, then returns a “guard” object that both LazyTransform methods store in a local variable called state. The mutex will not be unlocked until the guard goes out of scope. The existence of a live guard is therefore proof that the mutex is locked, and it is the guard that provides read-write access to the underlying data.

In Rust’s twist on mutex semantics, the very meaning of the act of locking a mutex is obtaining temporary exclusive write access to its data through a temporary guard object. Despite self being a shared reference, a successful self.state.lock() grants access to &mut LazyState that may last for as long as the mutex is locked (guard exists) and no more. This is the crux of the way Rust prevents data races through static analysis.

Other than the curious mutex design, there is nothing really interesting about the code itself. Once the mutex is locked, both functions do exactly the same thing that their single-threaded counterparts did. While this code is thread-safe in the sense Rust promises, i.e. free from data races, it is still very far from being efficient when invoked in parallel, even ignoring the stringent lock-free requirements. In particular, get_transformed is extremely inefficient in a read-heavy scenario because each call blocks all other calls even when set_source isn’t called at all. When a transformation is in progress, all the other readers are blocked until it is finished.

Fine-grained locking

To minimize the amount of time spent waiting, we can take advantage of the following facts:

  • The methods are operating on two distinct pieces of data, source and value. set_source, for example, doesn’t access value at all. The two fields can be protected with different locks.
  • get_transformed has two distinct modes of operation: a fast one when it only returns the cached value, and the slow one when it detects that the source has changed and it needs to calculate the new value. The vast majority of calls to get_transformed can be expected to belong to the “fast” scenario.

Here is an implementation that uses finer-grained locking to ensure that readers don’t wait for either writers or other readers:

use std::sync::{Mutex, RwLock};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Mutex<Option<S>>,
    value: RwLock<Option<T>>,
    transform_lock: Mutex<()>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Mutex::new(None),
            value: RwLock::new(None),
            transform_lock: Mutex::new(()),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut locked_source = self.source.lock().unwrap();
        *locked_source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        if let Ok(_) = self.transform_lock.try_lock() {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    return new_value;
                }
            }
        }
        self.value.read().unwrap().clone()
    }
}

In this implementation there is no longer a “state” structure protected by a coarse mutex, we are back to individual fields. The source field is protected by its own mutex and the value field is protected by a separate RwLock, which is like a mutex, except it allows read access by multiple concurrent readers that don’t block each other. Finally, a new transform_lock field doesn’t protect any particular piece of data, it serves as something resembling a conventional mutex.

set_source locks the source mutex and replaces the source with the new value. It assigns to *locked_source because locked_source is just the variable holding the guard, and assigning Option<S> to it would be a type error. Since the guard provides automatic access to &mut Option<S>, *locked_source at the left-hand side of the assignment serves to both coerce the guard to &mut Option<S> (returned by guard’s implementation of DerefMut::deref_mut) and at the same time to dereference it, so that the value behind the reference is replaced with the new one.

get_transformed is more sophisticated. It first ensures that only a single call attempts to interact with the writer at one time. This is for two reasons: first, to avoid set_source being “attacked” by a potentially large number of readers in a read-heavy scenario. Second, we want to prevent more than one transformation happening in parallel, which would require the result of one expensive transformation to be thrown away. The synchronization is implemented using try_lock, which immediately returns if the lock could not be obtained. In case of failure to lock, get_transformed gives up and returns the cached value, which meets its requirements. If it acquires transform_lock, it proceeds to check whether a new source is available, again with a try_lock and a fallback to returning the cached value. This ensures that get_transformed gets out of the way of set_source as much as possible. If it acquires the source lock, it uses Option::take() to grab the new value, leaving None in its place. If the captured source is not None, meaning a new source was published since the last check, get_transformed performs the transformation, caches its result, and returns a copy.

get_transformed uses a RwLock to ensure that readers don’t wait for each other, while updates remain exclusive. RwLock nicely maps to Rust’s ownership system: RwLock::read returns a guard that provides a shared reference to the underlying data, and RwLock::write returns a guard that provides a mutable reference.

This implementation is about as good as it can get with the use of locks. The problem statement, however, requires get_transformed and set_source not to block each other regardless of how often they are invoked. The above implementation will attempt an exclusive lock of source just to check if a new source has appeared. When this lock succeeds, set_source will be blocked for the duration of the lock. In a read-heavy scenario, get_transformed will be called often and by many different threads, and it is easy to imagine it hogging the lock enough to slow down set_source, which must wait to acquire the lock (it cannot use try_lock) in order to do its job.

Changing this requires looking outside the constructs offered by safe Rust, as discussed in the next post.