All posts by Hrvoje

Contact: hniksic@gmail.com

Rust global variables, two years on

In November 2021 I wrote a blog post that examined Rust’s curious relationship with global variables. It aimed to explain why this ubiquitous language feature required external crates, and ended with personal recommendations on the use of globals in new code. Two years have passed, and Rust has changed enough that it’s time to take a fresh look. The rest of this text assumes you’ve read the previous article or are familiar with the subject.

Const Mutex and RwLock constructors

The first change is that Mutex::new() is const as of Rust 1.63, so this example from the previous post now compiles and works as expected:

// didn't compile two years ago, compiles now
static LOG_FILE: Mutex<String> = Mutex::new(String::new());

The foundation for this improvement was laid in 1.62, which replaced Mutex, RwLock, and Condvar with lightweight, non-allocating implementations on Linux; 1.63 built on that to provide const construction of those types on all platforms. The result is that for simple types, mutex-protected globals “just work” without any special effort.
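The same goes for RwLock. For example, a read-mostly global list (the KNOWN_USERS name is just for illustration) can now be declared directly:

use std::sync::RwLock;

// const-constructed as of 1.63; no lazy initialization needed
static KNOWN_USERS: RwLock<Vec<String>> = RwLock::new(Vec::new());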

Although we no longer have to encase every static Mutex in a OnceCell or equivalent, we still need a cell-like wrapper for scenarios where locked writing is only done on first use to initialize the value. In that case subsequent accesses to the global are read-only and shouldn’t require locking, only an atomic check. This is a very common use of global variables, a good example being a global holding a lazily compiled regex.

This brings us to the next and more important news.

Once cell is now in std

Since Rust 1.70, once_cell::sync::OnceCell from the once_cell crate is integrated into the standard library as std::sync::OnceLock. For the first time in Rust’s existence, you don’t need to write unsafe code, or bring in external crates that encapsulate it, to create a global/static variable initialized on first use. Usage is essentially the same as with once_cell:

use std::sync::OnceLock;
use regex::Regex;

pub fn log_file_regex() -> &'static Regex {
    static LOG_FILE_REGEX: OnceLock<Regex> = OnceLock::new();
    LOG_FILE_REGEX.get_or_init(|| Regex::new(r#"^\d+-[[:xdigit:]]{8}$"#).unwrap())
}

// use log_file_regex().is_match(some_name) anywhere in your program

This addition might not seem like a big deal at first, given that once_cell has provided the same functionality for years. However, having it in the standard library greatly benefits the language in several ways. First, initialize-on-first-use globals are very widely used by both applications and libraries, and both can now phase out crates like once_cell and lazy_static from their dependencies. Second, global variables can now be created by macro-generated code without awkward reexports of once_cell and other logistical issues. Third, it makes the language easier to teach, with teaching materials no longer needing to decide whether to cover once_cell or lazy_static, nor explain why external crates are needed for global variables in the first place. This excruciatingly long StackOverflow answer is a good example of the quagmire, as is my previous blog post on this topic. The whole stdlib/unsafe section of the latter is now simply obsolete, as the same can be achieved safely with OnceLock at no loss of performance.

The work is not yet complete, however. Note how the static variable is placed inside the function that contains the sole call to OnceLock::get_or_init(). This pattern ensures that every access to the static OnceLock goes through one place which also initializes it. once_cell makes this less verbose through once_cell::sync::Lazy, but the equivalent stdlib type is not yet stable, being stuck on some technical issues. The workaround of placing the global into a function isn’t a significant obstacle, but it’s worth mentioning. It’s particularly relevant when comparing the ease of use of OnceLock with that of lazy_static::lazy_static! or once_cell::sync::Lazy, both of which offer the convenience of initializing in a single location without additional effort.
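For comparison, here is the same global expressed with once_cell::sync::Lazy, where declaration and initialization live in one place (same regex as above, shown only to illustrate the difference in ergonomics):

use once_cell::sync::Lazy;
use regex::Regex;

// initialized on first dereference, usable directly from anywhere
static LOG_FILE_REGEX: Lazy<Regex> =
    Lazy::new(|| Regex::new(r#"^\d+-[[:xdigit:]]{8}$"#).unwrap());

// use LOG_FILE_REGEX.is_match(some_name) anywhere in your program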

What to use in 2024

Two years ago the TL;DR of my recommendation was to “use once_cell or lazy_static, depending on which syntax you prefer”. Now it shifts to: use standard-library facilities like OnceLock or atomics in almost all situations, and once_cell when you require convenience not yet covered by std.

In particular:

  • As before, when the type you want to use in a static supports thread-safe interior mutability and has a const constructor, you can declare it as static directly. (The compiler will check all that for you; just see if it compiles.) This used to include only atomics, but now also includes mutexes and rwlocks. So if something like static CURRENT_CONFIG: Mutex<Option<Config>> = Mutex::new(None) or static SHOULD_LOG: AtomicBool = AtomicBool::new(true) works for you, go for it.

  • When this doesn’t work, or you need to initialize on first use, use std::sync::OnceLock, preferably encapsulated in a function as shown above.

  • If you create a large number of globals and want to avoid the boilerplate encapsulating each in a function, use once_cell::sync::Lazy. That type is likely to be stabilized in some form, which makes it preferable over lazy_static. There are no good reasons to use lazy_static in new code.

Note that existing code that uses once_cell or lazy_static doesn’t require immediate attention. Those crates will remain available indefinitely, and they generate nearly identical assembly to that of the standard library’s OnceLock. The above recommendations are meant to guide your decisions regarding new code, or regarding code you’re refactoring anyway.

A close encounter with false sharing

False sharing happens when multiple CPU cores modify different but closely located memory, unintentionally sharing a cache line as a result. Atomic operations on memory not shared with other threads are normally uncontended, and therefore almost as fast as non-atomic ones, but false sharing can cause them to execute orders of magnitude more slowly. The usual remedy for false sharing is to space the data farther apart, forcing it into distinct cache lines. While I’ve been aware of false sharing for years, it always seemed like a concern in the domain of experts, one of those things that don’t affect code written by mere mortals. This is a story of how I encountered it in my own code.

Atomic counter

I work on batch processing software that goes through a large number of events, numbering from thousands to hundreds of billions. To speed things up, the program typically spawns as many threads as there are physical cores and processes events in parallel. In debugging and production alike it’s very useful to log the progress of processing, both to gauge how long it will take to finish and to spot whether a bug has slowed it down. As this is needed in many places, it’s abstracted into a lightweight progress tracker/counter with a minimal API:

pub struct Progress { ... }

impl Progress {
    /// Construct Progress.
    pub fn new() -> Self { ... }
    /// Bump the counter, and occasionally print a log message.
    pub fn bump(&self) { ... }
    /// Consume the counter, and print a final log message.
    pub fn finish(self) { ... }
}

// example usage:
fn process(input: impl Iterator<Item = Data>) {
    let progress = Progress::new();
    for event in input {
        progress.bump(); // bump the counter, occasionally logging a message
        // ... do the work ...
    }
    progress.finish(); // log the total count of processed events
}

For real-world use it’s crucial for bump() to take &self rather than &mut self, and for Progress to be Sync. That allows reporting progress from multiple threads, typically by calling bump() from closures invoked by Rayon parallel iterators. Here is another example, one that we can actually run and measure:

use rayon::prelude::*;

fn main() {
    let progress = Progress::new();
    let data = vec![0u64..10_000_000; 128];
    let sum: f64 = data
        .into_par_iter()
        .map(|subiter| {
            let mut s = 0.;
            for n in subiter {
                progress.bump();
                s += (n as f64).sqrt().sin()
            }
            s
        })
        .sum();
    progress.finish();
    println!("final sum: {sum}");
}

Of course, the progress of this particular calculation could be reported in a number of better ways – but imagine this to be a placeholder for a more complex calculation where the number of elements or the speed of execution is not known in advance.

With the progress calls commented out, the above program takes 1.8s to execute on my 12-core CPU. With RAYON_NUM_THREADS=1 it takes 9.83s, so the parallelization is far from perfect, but it still brings a massive speedup. Multiple runs exhibit only minor variations in timing.

Given the constraints above, the obvious implementation of Progress would use an AtomicU64 counter incremented on every bump(), and occasionally check whether to log:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

pub struct Progress {
    count: AtomicU64,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: 0.into(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let cnt = self.count.fetch_add(1, Ordering::Relaxed);
        if cnt % 0x1000000 == 0 {
            self.maybe_log();
        }
    }

    #[cold]
    fn maybe_log(&self) {
        let Some(mut last_logged) = self.last_logged.try_lock().ok() else {
            // never allow this mutex to introduce waiting
            return;
        };
        if last_logged.elapsed().as_secs_f64() < 5. {
            return;
        }
        eprintln!("processed {} events", self.fetch_count());
        *last_logged = Instant::now();
    }

    fn fetch_count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }

    pub fn finish(self) {
        eprintln!("processed {} events", self.fetch_count());
    }
}

The code is intended to be efficient: on the hot path, it executes a single atomic increment and a fast power-of-two-modulo check to see if it needs to do more. Only once in several million invocations does it actually check the time to see if it needs to log the progress message. Compared to the much heavier floating-point work, one would expect the atomic increment and the subsequent check to be negligible.

But running the code shows otherwise – this version takes a whopping 38s to execute. In other words, progress reporting has made the code 20x slower.

Since bump() only does two things – increment the atomic counter and check it against a modulo-power-of-2 – and we know the latter to be fast, the attention must be on the former. AtomicU64::fetch_add() compiles down to a single instruction (LOCK XADD on x86-64). But executing that instruction on the same counter in parallel requires careful coordination among cores. If 12 cores are doing little other than incrementing this counter, they will spend most of their time in synchronization. Indeed, running with RAYON_NUM_THREADS=1 shortens the runtime to 11s, bringing it close to the original single-threaded version. So we have a serious case of contention on our hands – what can we do about it?

Thread-local counter

The obvious way to mitigate contention is to eliminate or reduce sharing of mutable data. That’s non-trivial in this case because Progress is optimized for convenience and usage with Rayon, so it can’t offer the usual alternatives such as message passing or per-worker handles which cache some data. To eliminate sharing from Progress we must look in a different direction: thread-local data. If we could create a separate counter for each thread, there would be no contention. The standard library doesn’t make this easy because it only supports static thread-locals, but thanks to Amanieu (who also wrote parking_lot and hashbrown, among others) and his thread_local crate, we can easily embed thread-local data into Progress:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;
use thread_local::ThreadLocal;

pub struct Progress {
    count: ThreadLocal<AtomicU64>,
    last_logged: Mutex<Instant>,
}

impl Progress {
    pub fn new() -> Self {
        Progress {
            count: ThreadLocal::new(),
            last_logged: Mutex::new(Instant::now()),
        }
    }

    pub fn bump(&self) {
        let local_cnt = self.count.get_or_default().fetch_add(1, Ordering::Relaxed);
        if local_cnt % 0x100000 == 0 {
            self.maybe_log();
        }
    }

    fn fetch_count(&self) -> u64 {
        self.count.iter().map(|x| x.load(Ordering::Relaxed)).sum()
    }

    // ...maybe_log() and finish() are exactly the same as in the previous version
}

This version of bump() appears to do more, but should still be much more efficient. self.count.get_or_default() obtains a reference to a thread-local atomic, which is then incremented without contention. The modulo check triggers maybe_log() a bit more often (because it now observes only the count of its own thread, and the modulus is correspondingly smaller), but still rarely enough not to compromise performance. So how does this version fare?

It runs in 7.86s, 4.4x slower than the original. That’s… disappointing. On the one hand, it’s significantly better than our first attempt, so at least we were on to something with the contention hypothesis. But on the other hand, this is still bad enough to render Progress unusable for calculations that do little work per event.

What went wrong? Is ThreadLocal::get_or_default() just slow, or is AtomicU64::fetch_add() slow even when uncontended? Looking at the source of the thread_local crate, it doesn’t seem like it should introduce slowness – in the hot path it boils down to a relaxed atomic load and a pointer dereference. Plus it’s carefully written, inline-friendly code by an expert author. What remains is the possibility that fetch_add() is slow, and that can only be explained by contention. This is where false sharing enters the picture.

False sharing

ThreadLocal allocates data on the heap, but it allocates it in buckets that hold multiple values. Also, when allocations are performed in quick succession at the beginning of processing, they might be assigned heap addresses close to each other. Such proximity would normally be considered cache-friendly, a feature, but here we want the exact opposite. According to crossbeam, “Intel’s prefetcher is pulling pairs of 64-byte cache lines at a time”, so false sharing can occur for data within 128 bytes of each other. Even after accounting for heap overhead, it’s not hard to imagine 3-4 AtomicU64s getting allocated within the same 128-byte region, causing massive contention.

This theory is trivial to test, simply by adding some padding next to each counter. A convenient container that does just that is provided by crossbeam, so we only need to change the definition of the count field in the struct to:

use crossbeam::utils::CachePadded;

pub struct Progress {
    count: ThreadLocal<CachePadded<AtomicU64>>, // previously ThreadLocal<AtomicU64>
    last_logged: Mutex<Instant>,
}

With that single change, the code runs in 1.95s, only 1.08x slower than the original code, the difference being close to statistical noise. That confirms the false sharing hypothesis, and shows that false sharing is not something only observed by professionals writing code for bare metal. If 8% slowdown still sounds like a lot, do keep in mind that the above calculation is very close to the worst case for this kind of interface, because the calculation does no IO or allocation and is very fine-grained. Calculations that do more work between calls to Progress::bump() would experience no measurable slowdown from progress reporting.
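If you’d rather not depend on crossbeam for this, the same effect can be approximated by hand with an over-aligned wrapper type (PaddedCounter is a hypothetical name, not something the real code uses):

use std::sync::atomic::AtomicU64;

// Alignment of 128 forces each counter into its own 128-byte region, so
// counters belonging to different threads can't share (prefetched pairs of)
// cache lines.
#[repr(align(128))]
#[derive(Default)]
struct PaddedCounter(AtomicU64);

The count field would then be declared as ThreadLocal<PaddedCounter>, with bump() incrementing self.count.get_or_default().0.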

Compile-time checks in generic functions work, and you can use them in your code

Some generic functions need to verify properties of the types they accept that are not easy to express using traits, such as that the type’s size or layout satisfies a condition. These constraints come from unsafe or FFI code and are generally hard to avoid. Here is a silly function that has two requirements of the value: that it’s Pod, and that it’s non-zero-sized:

/// Returns the first byte of the in-memory representation of `value`.
/// Requires value to have non-zero size.
fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    assert!(std::mem::size_of::<T>() != 0);
    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

The Pod requirement is expressed with a trait bound (provided by an external crate), but there is no trait bound to cover all zero-sized types, so the function asserts that at run-time. The assertion is what makes the function’s use of unsafe sound: not only does first_byte_of() not make sense with zero-sized types, but attempting to call it with one would cause undefined behavior if it weren’t for the check. Usage looks like you’d expect:

fn main() {
    // 258u16 is [2, 1] in memory
    assert_eq!(first_byte_of(258u16), 2);
    // 3.14f64 is [31, 133, 235, 81, 184, 30, 9, 64] in IEEE 754-ese
    assert_eq!(first_byte_of(3.14f64), 31);
    //first_byte_of(()); // panics at run-time
}

While the above works, it does introduce the possibility of a run-time panic. The obvious fix would be to change the return type to Option<u8>, returning None for zero-sized T. That would shift the burden of panicking to the caller, which would very likely .unwrap() the returned value immediately, at least where it believes the type to be non-zero-sized and needs the value unconditionally. Keep in mind that the caller of first_byte_of() might itself be generic, so changing a type far away from the call site could introduce the panic, and as long as the check is performed at run-time, that panic might go unnoticed until production.
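For illustration, such a fallible variant might look like this (first_byte_of_checked is a hypothetical name, not part of the final design):

/// Returns the first byte of the in-memory representation of `value`,
/// or None if `T` is zero-sized.
fn first_byte_of_checked<T: bytemuck::Pod>(value: T) -> Option<u8> {
    if std::mem::size_of::<T>() == 0 {
        return None;
    }
    let addr = (&value) as *const T as *const u8;
    Some(unsafe { *addr })
}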

Thanks to monomorphization, the compiler knows the size of T when generating the code of first_byte_of<T>(), so it should in principle be possible to abort compilation when T is something like (). And indeed, beginning with Rust 1.57, the compiler supports compile-time assertions:

pub const FOO: usize = 42;
pub const BAR: usize = 42;
const _: () = assert!(FOO == BAR); // compiles only if FOO == BAR

The const _: () = assert!(...) syntax looks a bit weird, but it sort of makes sense – assignment to a constant makes sure that the assertion is executed at compile-time, and assert!() does technically return (), since it operates by side effect. The assertion which would normally panic at run-time now becomes a compilation failure, which is just what we need.

Applied to first_byte_of(), the check would look like this:

fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    // size_of() is const fn and can be invoked in const contexts
    const _: () = assert!(std::mem::size_of::<T>() != 0);
    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

But… this doesn’t compile! The message is “error[E0401]: can’t use generic parameters from outer function”, and the explanation doesn’t really help with our use case. Simple attempts to work around the error, such as by moving the assertion to a separate const fn, fail to work.

Some older discussions about this topic go even so far as to claim that rustc is actually justified in preventing post-monomorphization errors. They argue that it would be wrong for a generic function to compile with one type but not with another, at least in cases where both types satisfy the trait/lifetime bounds of the function. Fortunately this view was not shared by the compiler team, and Rust does allow you to verify properties of generics at compile-time. It just requires a bit of ceremony:

fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    struct Check<T>(T);
    impl<T> Check<T> {
        const NON_ZERO_SIZE: () = assert!(std::mem::size_of::<T>() != 0);
    }
    let _ = Check::<T>::NON_ZERO_SIZE;

    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

Before explaining what’s going on, let’s see how well this works. The main() function from above compiles as before, but uncommenting the first_byte_of(()) invocation results in this beautiful compile time error:

error[E0080]: evaluation of `first_byte_of::Check::<()>::NON_ZERO_SIZE` failed
 --> src/main.rs:4:35
  |
4 |         const NON_ZERO_SIZE: () = assert!(std::mem::size_of::<T>() != 0);
  |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the evaluated program panicked at 'assertion failed: std::mem::size_of::<T>() != 0', src/main.rs:4:35
  |
  = note: this error originates in the macro `assert` (in Nightly builds, run with -Z macro-backtrace for more info)

note: the above error was encountered while instantiating `fn first_byte_of::<()>`
  --> src/main.rs:16:5
   |
16 |     first_byte_of(());
   |     ^^^^^^^^^^^^^^^^^

Note that the error is shown only if you do cargo build (or cargo run etc), but not with cargo check, where compilation appears to succeed.

To work around “can’t use generic parameters from outer function”, the const in the function must not be initialized by an expression involving generic types. This limitation may be lifted in the future, but for now constants in functions must be fully resolved prior to monomorphization. That means that the const _: () = ... trick doesn’t work and we need to find another way to force the assertion to be evaluated at compile-time.
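A minimal example of the restriction (mine, not from the original code) shows which constants are and aren’t accepted inside a generic function:

fn demo<T>() {
    // fine: the initializer doesn't mention T, so the constant is fully
    // resolved before monomorphization
    const USIZE_BYTES: usize = std::mem::size_of::<usize>();

    // rejected with E0401 ("can't use generic parameters from outer function"):
    // const T_BYTES: usize = std::mem::size_of::<T>();

    let _ = USIZE_BYTES;
}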

This is where associated constants come into play – the initialization expression of constants attached to generic types isn’t subject to the same restrictions as the initialization expression of constants defined inside generic functions. We introduce a Check type which is generic over T, and contains a T just to satisfy the compiler (we could use “phantom data” but we don’t bother because we never actually construct a Check value). As before, the NON_ZERO_SIZE constant serves only to execute the assert, its value never being really used and remaining (), as that’s what assert!() returns. But we do need to trigger its evaluation from first_byte_of(), which is accomplished with let _ = Check::<T>::NON_ZERO_SIZE;. The dummy let binding prevents an “unused value” warning we’d get if we just wrote Check::<T>::NON_ZERO_SIZE; at function top-level.

Finally, this pattern for enforcing compile-time assertions using associated constants can be extracted into a macro:

macro_rules! gen_assert {
    ($t:ident, $c:expr) => {{
        struct Check<$t>($t);
        impl<$t> Check<$t> {
            const CHECK: () = assert!($c);
        }
        let _ = Check::<$t>::CHECK;
    }}
}

With the boilerplate taken care of by the macro, first_byte_of() becomes simple again:

fn first_byte_of<T: bytemuck::Pod>(value: T) -> u8 {
    gen_assert!(T, std::mem::size_of::<T>() != 0);
    let addr = (&value) as *const T as *const u8;
    unsafe { *addr }
}

Faster sorting with decorate-sort-undecorate

Decorate-sort-undecorate is a simple sorting pattern known for decades, which was additionally popularized by Randall Schwartz in the context of Perl, earning it the moniker Schwartzian transform. While I’ve been aware of the pattern for many years, I only recently realized just how relevant it still is for optimizing everyday sorts.

Consider this code that sorts JSON objects by a field:

/// Sorts a slice of JSON objects by the "age" field.
fn sort_by_age(v: &mut [serde_json::Value]) {
    v.sort_by_key(|o| o["age"].as_i64());
}

// usage:
fn main() {
    let mut v = vec![
        json!({"name": "John", "age": 25}),
        json!({"name": "Abel", "age": 18}),
    ];
    sort_by_age(&mut v);
    assert_eq!(v[0]["name"], "Abel");
    assert_eq!(v[1]["name"], "John");
}

The key function here is not doing anything big or controversial, it consists of a single field lookup followed by cheap extraction of integer out of the Value enum – the kind of thing you’d expect a reasonable key function to do. But running this on a larger slice shows that performance isn’t exactly stellar:

fn large_vec() -> Vec<serde_json::Value> {
    std::iter::repeat_with(|| json!({"name": "John", "age": rand::thread_rng().gen_range(1..=100)}))
        .take(1_000_000)
        .collect()
}

fn main() {
    let mut v = large_vec();
    let t0 = Instant::now();
    sort_by_age(&mut v);
    let n = t0.elapsed().as_secs_f64();
    println!("{}", n);
}

On my laptop, the above code takes 1.01s to sort a slice of 1 million JSON objects. While not exactly a catastrophe, it feels like we should be able to do better than that.

First, we can switch from stable to unstable sort by changing v.sort_by_key(...) to v.sort_unstable_by_key(...). Doing so improves the timing to 0.44s, a more than 2x speed boost. In the use case that inspired this article unstable sort was fine because the comparison function provided fallback ordering when the primary key compared equal, and the initial order was meaningless anyway. We’ll continue to use unstable sort in the rest of the article because that gives best performance, but it’s trivial to revert to stable sort.
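Concretely, the change amounts to swapping the method name:

/// Sorts a slice of JSON objects by the "age" field, without preserving the
/// relative order of objects whose ages compare equal.
fn sort_by_age(v: &mut [serde_json::Value]) {
    v.sort_unstable_by_key(|o| o["age"].as_i64());
}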

Looking at performance further, we notice that accessing a field of a JSON object must perform a hash table lookup. Since a sort has to do n log n comparisons on average, this means that we’re doing multiple lookups per element. That’s the kind of situation where decorate-sort-undecorate might help, as it caches the extracted keys, reducing the number of lookups to “only” n, at the expense of some memory:

fn sort_by_age(v: &mut [Value]) {
    // extract JSON objects into a new vector, where each object is decorated with age
    let mut decorated: Vec<_> = v
        .iter_mut()
        .map(|o| (o["age"].as_i64(), o.take()))
        .collect();
    // sort with a key function that trivially refers to the pre-fetched age
    decorated.sort_unstable_by_key(|&(age, _)| age);
    // undecorate, i.e. store just the sorted objects back into the slice
    for (dest, (_, o)) in v.iter_mut().zip(decorated) {
        *dest = o;
    }
}

With this modification, my laptop takes only 0.059s to sort the slice of 1 million objects! This is 7.5x faster than the previous version, and 17x faster than the original stable sort. This is a much more radical speedup than I ever expected, and it’s something to definitely consider when your key function does anything more than access a struct/tuple field.

What is the explanation for such a drastic improvement? Let’s modify the key function to keep track of how many times it gets called:

    let mut cnt = 0u32;
    decorated.sort_by_key(|&(age, _)| {cnt += 1; age});
    println!("{}", cnt);

This shows that the key function is invoked 39 million times to sort 1 million elements. This is explained by log2(1_000_000) being approximately 20, so n log n is around 20 million. And for each comparison the key function needs to be called twice – once for the left and once for the right-hand side of the comparison – which adds up to 40 million. So the decoration cuts down the number of hash lookups from 40 million to 1 million, and that certainly makes a difference. Another thing to take into account is that sorting the decorated slice sorts by comparing integers directly present in the array, which allows inner loops of the sorting algorithm to be nicely inlined. On the other hand, sorting by a hashmap lookup requires call into the complex hash table code, which doesn’t inline nearly as nicely.

One downside is that we had to replace an elegant call to sort_unstable_by_key() with a bit of manual twiddling. Fortunately it turns out that the standard library has a method that does all this for us – slice::sort_by_cached_key(). According to its documentation, it sorts the slice and calls the key function at most once per element, “using temporary storage to remember the results of key evaluation”. That is functionally equivalent to our decorate-sort-undecorate, differing only in some implementation details (it doesn’t extract the original values to decorate them, but uses indices instead). With it sort_by_age() again becomes a one-liner:

fn sort_by_age(v: &mut [Value]) {
    v.sort_by_cached_key(|o| o["age"].as_i64().unwrap());
}

On my laptop this takes 0.089s to execute, making it 1.5x slower than the hand-written decorate-sort-undecorate, but still 11x faster than the original naive implementation. The performance difference compared to the manual implementation seems to come from the fact that sort_by_cached_key() implements stable sorting. If we modify the manual decorate-sort-undecorate to use sort_by_key() instead of sort_unstable_by_key(), sort_by_cached_key() comes out 1.34x faster, which means it’s reasonably optimized.

In conclusion, if you’re sorting by a non-trivial key function, and you have memory to spare, consider replacing sort_by_key() with sort_by_cached_key() and measure the result. If that helps, and you want to squeeze as much performance as possible, you can further optimize by implementing a manual decorate-sort-undecorate as shown above.

Self-referential types for fun and profit

Some Rust questions are asked over and over, and are therefore met with short and well-rehearsed answers. “Compile in release mode when measuring performance.” “Don’t try to learn Rust with linked lists.” “Use scoped threads.” “That requires specialization.” But there is one response that is delivered in an almost checkmate-like fashion: “You are trying to construct a self-referential type.” This is swiftly explained to be impossible in current Rust, and the asker advised to pursue a different approach. If they absolutely insist on self-references, they’re referred to std::pin, which is unergonomic, hard to understand, and still requires unsafe. Alternatively, they’re given a list of crates that purport to help with creating self-referential structs, but come with disheartening caveats, and some with safety holes which can turn fatal.

Contrary to the usual answers, this article will show how to create self-referential types manually. A very important subset of such types is straightforward to reason about and, most importantly, can be manipulated through a safe interface. We will not rewrite the borrow checker, we will not use std::pin, but we will use a bit of unsafe. You’ve been warned.

A practical introduction

Just so you don’t think self-referential structs are a theoretical issue that only happen to people trying to implement specialized data structures, consider this task:

Use the zip crate to implement a function that accepts the name of a ZIP file and the name of a member stored inside it, and returns something that implements Read to stream the member.

Streaming the contents of a ZIP file sounds like a reasonable thing to want to do. We’d like to create a function like this:

fn zip_streamer(file_name: &str, member_name: &str) -> impl std::io::Read {
    todo!()
}

Looking at the crate docs, this seems trivial to implement. Open the file, call ZipArchive::new() to create a ZipArchive, then call ZipArchive::by_name() to get a ZipFile, which implements Read. Presto:

pub fn zip_streamer(file_name: &str, member_name: &str) -> impl std::io::Read {
    let file = File::open(file_name).unwrap();
    let mut archive = ZipArchive::new(file).unwrap();
    archive.by_name(member_name).unwrap()
}

But sadly, this fails to compile. The compiler complains that “borrowed value does not live long enough”:

error[E0597]: `archive` does not live long enough
 --> src/main.rs:8:5
  |
8 |     archive.by_name(member_name).unwrap()
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |     |
  |     borrowed value does not live long enough
  |     argument requires that `archive` is borrowed for `'static`
9 | }
  | - `archive` dropped here while still borrowed

The error message is a bit cryptic at first, it’s not clear how exactly “argument requires that archive is borrowed for static” (or what that even means), but looking at the signature of ZipArchive::by_name() reveals the problem. by_name() returns a ZipFile<'a> where 'a is the lifetime of the archive. The ZipFile we’re attempting to return references archive, a local variable. That would produce a dangling reference and obviously can’t be allowed to compile. The only way for that to work would be if archive were itself static, which is where the requirement “that archive is borrowed for ‘static” comes from.

If you’re an experienced Rustacean, at this point you’re developing a sinking feeling about where this is leading to. If you’re new to the language, or even intermediate, this can still look very much fixable: let’s just wrap both archive and its member into a common struct, return it from the function, and implement Read on the struct. That requires a bit of boilerplate, but should resolve the issue of referencing the local variable:

pub fn zip_streamer(file_name: &str, member_name: &str) -> impl std::io::Read {
    struct ZipStreamer<'a> {
        archive: ZipArchive<File>,
        reader: ZipFile<'a>,
    }

    impl Read for ZipStreamer<'_> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            self.reader.read(buf)
        }
    }

    let file = File::open(file_name).unwrap();
    let mut archive = ZipArchive::new(file).unwrap();
    let reader = archive.by_name(member_name).unwrap();

    ZipStreamer { archive, reader }
}

You guessed it, this doesn’t compile either, and in fact fails with a very similar error message about argument requiring archive to be borrowed for ‘static, with this addition:

error[E0505]: cannot move out of `archive` because it is borrowed
  --> src/main.rs:22:17
   |
21 |     let reader = archive.by_name(member_name).unwrap();
   |                  ----------------------------
   |                  |
   |                  borrow of `archive` occurs here
   |                  argument requires that `archive` is borrowed for `'static`
22 |     ZipStreamer { archive, reader }
   |                   ^^^^^^^ move out of `archive` occurs here

Now we have a new problem: in order to construct ZipStreamer, we must move our local variable into a field of the ZipStreamer. But this move immediately invalidates reader, because as far as the borrow checker is concerned, use-after-move creates a dangling reference just like use-after-free does.

This looks bad, but there might still be hope. Can’t we box the ZipArchive to heap-allocate it, or wrap it in an Rc or Arc to postpone its destruction? But try as you might, no approach is allowed to compile.

Without ever being aware of it, we’ve been trying to create the dreaded “self-referential struct”. ZipStreamer is self-referential because the reader field refers to the archive field of the same struct. This makes the 'a lifetime nonsensical, because a lifetime refers to some data in the caller, which just doesn’t apply to the reader field. If instead of impl Read we returned the concrete type ZipStreamer<'a>, we simply wouldn’t have a lifetime to use for 'a. While our function does accept two references, neither of those is useful as the lifetime of ZipStreamer is clearly tied to the data inside itself.

This is by no means specific to the zip crate, the same issue pops up in various scenarios: a function that returns a database connection and an open transaction, or a parser that holds the string it parses and the parsed pieces that refer to string data.

Why Rust rejects self-reference

Before showing how to fix this issue, let’s once more look at the reasons why Rust prevents us from creating a self-referential structure. What would go wrong if we could create one? For example:

// imaginary self-referential struct where `current_token` points into `data`
struct ParseState {
    data: String,               // the string we're parsing
    current_token: &'data str,  // current token pointing into the string
}

There are two problems with current_token pointing into data.

First, there is no way to name the lifetime of current_token. In the above example we used a lifetime named the same as the field, inspired by a Polonius talk. In current Rust the only way to provide a lifetime on a struct member is to either make it 'static or take a lifetime declared on the struct, as in the ZipStreamer<'a> attempt above. But that lifetime is not useful, as it always corresponds to some data outside the type.

Second, Rust must assume that any move of ParseState invalidates current_token. Imagine if ParseState were declared like this:

struct ParseState {
    data: [u8; 4096],            // the fixed-size buffer we're parsing
    current_token: &'data [u8],  // current token pointing into the buffer
}

Moving ParseState from one place to another will obviously invalidate the reference. This is not the case when the data is a String, or simply a Box, because those types heap-allocate their content, and moving the string or box has no effect on that content. But Rust doesn’t know that; it treats String the same as an array of u8, and assumes that moves are detrimental to borrows. This is the issue that prevents self-referential types from working in scenarios that would be considered perfectly safe and reasonable in C++.

This really bears repeating: there is nothing in the Rust abstract machine that prevents self-referential types from working, as long as you are willing to box them, because then you don’t run afoul of the move semantics. The remaining questions are how practical the implementation is, and whether one can expose a safe interface to it. Let’s investigate that.

Implementation of a self-referential struct

Armed with the above insight, we can sketch what we must do to make ZipStreamer work:

  • box the ZipArchive in ZipStreamer::archive, so we don’t need to worry about moves breaking the ZipFile reference
  • figure out a lifetime for the ZipFile stored in reader
  • ensure that the references in the ZipFile stored in reader remain valid for the whole lifetime of ZipStreamer.

Here is the result:

pub fn zip_streamer(file_name: &str, member_name: &str) -> impl std::io::Read {
    #[allow(dead_code)]
    struct ZipStreamer {
        // actually has lifetime of `archive`
        // declared first so it's dropped before `archive`
        reader: ZipFile<'static>,
        // safety: we must never move out of this box as long as reader is alive
        archive: AliasableBox<ZipArchive<File>>,
    }

    impl Read for ZipStreamer {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            self.reader.read(buf)
        }
    }

    let file = File::open(file_name).unwrap();
    let mut archive = AliasableBox::from_unique(Box::new(ZipArchive::new(file).unwrap()));
    let reader = archive.by_name(member_name).unwrap();
    let reader = unsafe { std::mem::transmute(reader) };

    ZipStreamer { archive, reader }
}

Wait, this is it?

Explanation

Let’s examine the above code in some detail.

How does it work?

It creates a self-referential type that is safe to move around because the referenced data is behind a box, so its memory location doesn’t change when the top-level type is moved. Since the fields of ZipStreamer are private, no external code can mutate archive, or move it out of its box. When the struct is dropped, the documented drop order ensures that the reader field is dropped before the archive field, so the reference never dangles, even when unused.

Is it really safe?

The above function is obviously not safe in the sense of “not using any unsafe”. But it exposes a fully safe external API to functionality that uses unsafe internally – much like many parts of the standard library. In other words, I believe that zip_streamer() is sound, but as always with unsafe code, caveat emptor.

What is the meaning of the ‘static lifetime on ZipFile?

The 'static lifetime is the only lifetime we can use in that position. Any other named lifetime would have to be introduced on the type, and would make it unusable. The use of 'static does mean that we are lying to the borrow checker, as the actual lifetime is the scope of the archive member in the box, which cannot be expressed in today’s Rust. This false lifetime is why we must use unsafe below. The important thing is that the lie cannot result in incorrect code being generated, as ZipFile<'a> is generic over any lifetime and cannot misuse 'static in any way (even specialization doesn’t allow specializing on lifetimes). Unlike generic types, which are monomorphized into separate implementations, lifetimes undergo no monomorphization; they are used to reject unsound code, not to change what code gets generated.

Some demonstrations of self-referential types store a plain reference, and get around the lifetime issue by storing a raw pointer instead. That trick wouldn’t work here, as we aren’t dealing with a naked reference, but with the ZipFile<'a> type, which contains references. Of course, the same approach would also work for a plain reference.

What is the purpose of transmute()?

The call to transmute() admittedly looks scary, as it’s, quoting actual documentation, incredibly unsafe.

But one of the uses of transmute() is to extend the lifetime of a reference, and that’s exactly what we’re doing here. Transmuting between two different types is fraught with pitfalls because their layouts must match precisely, which in some cases requires forcing a non-standard representation. Transmuting between the same type to extend lifetime is still unsafe, but much easier to reason about from the perspective of correct usage of transmute().

What’s the deal with AliasableBox, wouldn’t Box work as well?

When a Box is moved, Rust assumes (and the borrow checker verifies) that the Box has unique access to its contents. Any code along the lines of:

let b = Box::new(...);
let r = &b.some_field;
let b2 = b;
// use "r"

…will fail to compile because b is moved while there are outstanding references into it. But that’s the moral equivalent of what our code does when it creates archive, creates the reader that refers into it, and then moves archive into the ZipStreamer. It seems okay to do this because moving a box doesn’t change the memory address of its content. But it technically creates a dangling reference because moving the box is defined to invalidate any references into it. A reference surviving the move is something the optimizer is allowed to assume won’t happen and is therefore UB. Running the above example under MIRI reports UB due to an invalidated reference.

To fix the issue, instead of Box we can define a box-like type that internally uses a raw pointer. This has the same semantics as a Box, but prevents the optimizer from assuming that moving it invalidates references to its contents. We use AliasableBox from the aliasable crate for that purpose.
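For intuition, here is a minimal sketch of the idea behind such a type. This is not the actual code of the aliasable crate, just an illustration of why owning the data through a raw pointer sidesteps the uniqueness assumption:

use std::ops::{Deref, DerefMut};

/// A heap box that owns its contents through a raw pointer. Moving it makes
/// no claim of unique access, so references into the contents survive the move.
struct RawBox<T>(*mut T);

impl<T> RawBox<T> {
    fn new(value: T) -> Self {
        RawBox(Box::into_raw(Box::new(value)))
    }
}

impl<T> Deref for RawBox<T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*self.0 }
    }
}

impl<T> DerefMut for RawBox<T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.0 }
    }
}

impl<T> Drop for RawBox<T> {
    fn drop(&mut self) {
        // sound: the pointer came from Box::into_raw and is dropped exactly once
        unsafe { drop(Box::from_raw(self.0)) }
    }
}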

This issue was spotted by /u/Darksonn on reddit.

If this works, why doesn’t async Rust use it instead of pinning?

The approach shown here requires the implementation to adhere to a very specific unsafe contract, such as “you must not move out of this Box” or “reader must be declared before archive, otherwise you invoke UB.” Async Rust aims to provide a safer experience by limiting the unsafe portion to the creation of the pin, typically abstracted by the standard library or a crate, with the rest of the code remaining completely safe and sound.

There are ways to abstract this approach, but they come at a cost in ergonomics, and are not yet agreed to be sound in all cases. This is what the self-referential crates provide, as shown below.

Safe version with ouroboros

Ouroboros is a crate that aims to encapsulate the unsafe contract like the one in the above code into abstractions implemented with a mixture of proc macros and custom types. Here is what the above code looks like using it:

use ouroboros::self_referencing;

pub fn zip_streamer(file_name: &str, member_name: &str) -> impl std::io::Read {
    #[self_referencing]
    struct ZipStreamer {
        archive: ZipArchive<File>,
        #[borrows(mut archive)]
        #[not_covariant]
        reader: ZipFile<'this>,
    }

    impl Read for ZipStreamer {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            self.with_reader_mut(|reader| reader.read(buf))
        }
    }

    let file = File::open(file_name).unwrap();
    let archive = ZipArchive::new(file).unwrap();
    ZipStreamerBuilder {
        archive,
        reader_builder: |archive| archive.by_name(member_name).unwrap(),
    }
    .build()
}

Under the hood ouroboros generates a very similar struct to the one we’ve created manually:

pub(super) struct ZipStreamer {
    reader: ZipFile<'static>,
    archive: ::ouroboros::macro_help::AliasableBox<ZipArchive<File>>,
}

The 'static lifetime is there, as is a reexport of AliasableBox. Ouroboros tries to provide only safe interfaces – if the code compiles, it should be sound, and require no unsafe. For example it puts the above type definition in a module, which makes reader private, and the only way to access it is using with_reader_mut(). This prevents safe code from ever observing the fictitious 'static lifetime.

As mentioned above, the benefit of using a crate like ouroboros is that it provides a safe abstraction over self-referential types. However, there are a number of downsides as well:

  • It generates a lot of code (500+ lines for the above simple example) which affects compilation times. It also adds convenience constructors on the type, which you can’t turn off, and which sometimes actually get in the way.
  • The generated code contains a lot of unsafe hidden behind abstractions that give a sense of safety, but might turn out to be unsound, and have in the past. (I’ll refrain from linking to individual issues, but they’re easy to find on the tracker, and there’s the example of rental which was abandoned due to soundness issues.)
  • It has a lot of hidden complexity, as it modifies the definition of the type in ways that are sometimes hard to reason about.

In summary, it probably makes sense to use a crate like ouroboros if it works for your use case and if you don’t mind all the extra code it generates. But I’d advise at least being aware of what it does for you, as it feels like a bit of a leaky abstraction.

Limitations

The approach shown above, which includes both the hand-written unsafe and ouroboros, does have limitations. An interesting use case of self-referential structs is wrapping foreign data, such as C++ strings and vectors with small string/vector optimization. The approach in this article would technically work for such types, but would not be useful because it requires unconditional allocation, and the whole point of such types is efficiency and avoiding allocations. (They work in C++ by implementing a “move constructor” that fixes up the self-references.)

Another situation where the above doesn’t help is if safe code needs to be able to mutably access parts of the self-referential type. Since it’s safe code, you can’t just prohibit it from moving the reference out of the type; you need to prevent that using the type system. This is remarkably achieved through std::pin, although many people have a hard time wrapping their heads around the concept in practice.

Today self-referential types require either reasoning about unsafe code, which most people shy away from for very good reason, or using crates that take a serious toll on ergonomics. This is what we have to work with until Polonius comes to save us.

The stable HashMap trap

You read about faster hash functions and switch to one. Most of your code gets the expected speed boost, but some parts mysteriously get slower – much slower, especially when dealing with large hashmaps. If this sounds familiar, you might have encountered the stable HashMap trap.

Background

Rust’s HashMap is an implementation of Google’s SwissTable. While the hash table itself provides excellent performance, the SipHash hash function used by default is designed to protect against a large class of DoS attacks, at the price of some performance. When you care about performance and maliciously crafted keys are not a concern, switching to a more efficient hash function can boost performance significantly. Rust hash maps make it fairly easy to swap out the hash function, and there are crates that offer high-performance replacements. The Rust Performance Book recommends, among others, the brutally effective FxHasher from the rustc-hash crate and the higher-quality ahash crate. This recommendation, however, comes at a price that is rarely if ever mentioned – some O(n) hash map operations, including deserialization, degrade to O(n**2)!
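For reference, swapping the hash function is a type-level change; rustc_hash’s FxHashMap and FxHashSet aliases boil down to something like this:

use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use rustc_hash::FxHasher;

// a HashMap that hashes with FxHasher instead of the default SipHash;
// essentially what rustc_hash::FxHashMap expands to
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;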

The issue

Since the issue is easy to stumble upon when deserializing HashMaps, let’s use that as example. We’ll construct a HashSet for simplicity, but HashMap would exhibit the same behavior. This code creates a set of 10m random values, serializes it to a Vec<u8> using bincode, and deserializes it back into a HashSet:

use std::time::Instant;
use rand::Rng;

type TestHashSet<K> = std::collections::HashSet<K>;
//type TestHashSet<K> = rustc_hash::FxHashSet<K>;

fn main() {
    let mut rnd = rand::thread_rng();
    let nums: Vec<_> = (0..10_000_000).map(|_| rnd.gen::<u64>()).collect();

    let t0 = Instant::now();
    let h: TestHashSet<u64> = nums.into_iter().collect();
    let t1 = Instant::now();
    println!("create: {}", (t1 - t0).as_secs_f64());

    let out = bincode::serialize(&h).unwrap();
    let t2 = Instant::now();
    println!("serialize: {}", (t2 - t1).as_secs_f64());

    let h2: TestHashSet<u64> = bincode::deserialize(&out).unwrap();
    let t3 = Instant::now();
    println!("deserialize: {}", (t3 - t2).as_secs_f64());

    println!("{}", h2.len());
}

Output on my machine:

create: 1.440201369
serialize: 0.071765342
deserialize: 1.114031976

Now let’s switch to fxhash to speed it up. We only need to point the TestHashSet alias to rustc_hash::FxHashSet<K> and we’re good to go. Note that FxHashSet is itself an alias for std::collections::HashSet whose hasher is FxHasher, so it only affects the hash function, not the implementation of the table. With that change, I get this output:

create: 1.391839734
serialize: 0.081116361
deserialize: 5.2052695

Creation is slightly faster, serialization takes about the same time, but deserialization is almost 5x slower than with the stdlib hash! Tested against different set sizes, the deserialization time is close to quadratic – it takes 5s with 10m elements, 20s with 20m elements, 69s with 40m elements, and 219s with 80m elements. We’ve seen the trap.

When does it happen?

First, let’s make it clear that the problem has nothing to do with serde or bincode. Serde just happens to create the set from values obtained by iterating it. The same slowdown can be easily reproduced without serde:

// just as slow as the serde version when TestHashSet is FxHashSet
let mut h2: TestHashSet<u64> = Default::default();
for &n in &h {
    h2.insert(n);
}

Second, this is not a consequence of fxhash being of lower quality or hitting some bad edge case. The same quadratic behavior can be observed using any other stable hash function, including the default SipHash when randomization is omitted. To try it, define TestHashSet<K> as std::collections::HashSet<K, BuildHasherDefault<std::collections::hash_map::DefaultHasher>> (quite a mouthful), which uses the default hash function with a non-random initial state – you’ll get a similar slowdown.
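Spelled out as code, that mouthful is:

use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::BuildHasherDefault;

// the default SipHash-based hasher, but with a fixed (non-random) initial
// state - exhibits the same quadratic slowdown as fxhash in the test above
type TestHashSet<K> = HashSet<K, BuildHasherDefault<DefaultHasher>>;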

The third thing to be aware of is that you can’t reproduce the issue with a simple clone() or collect() – things like let h2 = h.clone() and let h2: TestHashSet<u64> = h.iter().copied().collect() will be O(n), and almost 10x faster with fxhash than with the default hasher. It also doesn’t happen when you pre-allocate the set using with_capacity(). To reproduce, you need to start with a HashSet that must resize at least once to accommodate the data, such as an empty one. Both clone() and collect() preallocate the HashSet which avoids this performance degradation.

This is why serde hits it – although serde stores the size of every container as a “size hint”, it also caps this hint during deserialization to a low number, for security reasons.

In summary, to reproduce quadratic complexity of construction, all of the following conditions must be met:

  • the HashMap/HashSet must resize to accommodate the values we’re inserting
  • it uses a stable hash function
  • it is built from the keys of another HashMap/HashSet that uses the same hash function, in that map’s iteration order

Why does it happen?

This bug is thoroughly described in this 2016 article, which I encourage you to read to understand it. In short, copying a larger table into a smaller one assigns the keys from the first half of the larger table to consecutive locations. Then the keys from the second half of the larger table must fit between the keys which were already densely packed, which forms large collision chains. Although this is a temporary situation, fixed on the next resize, it affects performance visibly. But do refer to the article for details.

The article mentions Robin Hood hashing which is no longer used by Rust hash tables, but don’t let that throw you off. Although the implementation switched to a different probing strategy, it still uses open addressing with power-of-two hash table sizes.

The quadratic behavior was fixed for the default hasher by introducing randomization, but the issue is still open because the underlying bug is still present, visible with non-randomized hashers, as in the above code.

Workarounds

All fixes for this bug boil down to shuffling the items in order to avoid the construction in unfavorable order. Since the actual bug that leads to the issue is in the implementation of the hashmap, I refer to these as workarounds rather than proper fixes.

I do hope this bug gets more visibility and gets fixed, as it hurts people who try to use fxhash to get a faster HashMap, and it’s not immediately obvious what’s going on. In general, a nice property of hash tables is that, unlike unbalanced binary trees or quicksort, they’re normally not sensitive to key order. It’s a bit of a shame that that doesn’t apply to Rust’s hash tables with deterministic hashers.

Pre-allocate the HashSet/HashMap

This is by far the simplest fix when you don’t control the exact type of the map, but do control how it is created. In the non-serde case, that means initializing the h2 hash set with its capacity given upfront:

// fast because h2 is of the correct size to begin with
// note: can't just call with_capacity() because it assumes RandomState from stdlib
let mut h2: TestHashSet<u64> = TestHashSet::with_capacity_and_hasher(h.len(), Default::default());
for &n in &h {
    h2.insert(n);
}

This can be made to work with serde, but requires writing a custom deserializer.

Shuffle the items

Shuffling the elements also fixes the issue. For example:

// fast because h2 is built of elements which are not in iteration order of h
let mut elems: Vec<_> = h.iter().copied().collect();
elems.shuffle(&mut rnd);
let mut h2: TestHashSet<u64> = TestHashSet::default();
for n in elems {
    h2.insert(n);
}

This is a good “quick fix” when you don’t control how the table is constructed, but do control the data that flows into it. It is not that easy to apply to the serde case, where it would require a custom serializer.

Randomize the hasher

This is the most general solution, and is how the issue was worked around in the standard library, which uses RandomState to build the default hasher.

If you use another hasher, you just need to make sure that you’re using the randomizing one. (This is what AHashMap uses by default, so you don’t need to do anything special.) If your hasher doesn’t provide a random hasher builder, you can always create one using the public API. For example, for FxHasher we can create an FxRandomState that implements BuildHasher:

use std::hash::{BuildHasher, Hasher};
use rustc_hash::FxHasher;

#[derive(Copy, Clone, Debug)]
pub struct FxRandomState(usize);

impl BuildHasher for FxRandomState {
    type Hasher = FxHasher;

    fn build_hasher(&self) -> FxHasher {
        let mut hasher = FxHasher::default();
        hasher.write_usize(self.0);
        hasher
    }
}

Ideally build_hasher() would just return FxHasher::with_seed(self.0), but with_seed() isn’t provided, so we get equivalent functionality by creating a hasher and mixing the seed into it. The remaining part is to implement Default for our hasher builder, which will obtain the actual seed. To make it efficient, we won’t get a random seed every time, but only when a new thread is created. Then within a thread we’ll assign each FxRandomState (and therefore each hash table) a new consecutive seed.

use std::cell::Cell;

use rand::Rng;

impl Default for FxRandomState {
    fn default() -> Self {
        thread_local! {
            static SEED: Cell<usize> = Cell::new(rand::thread_rng().gen());
        }
        let seed = SEED.with(|seed| {
            let n = seed.get();
            seed.set(n.wrapping_add(1));
            n
        });
        FxRandomState(seed)
    }
}

With this in place, we can alias TestHashSet<K> to HashSet<K, FxRandomState>, and the quadratic behavior goes away.
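
For completeness, here is what that alias looks like (just restating the sentence above in code):

use std::collections::HashSet;

// A hash set keyed through FxHasher, but seeded differently for each table.
type TestHashSet<K> = HashSet<K, FxRandomState>;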

Contention on multi-threaded regex matching

Let’s say you need to match the same regex across a large number of strings – perhaps you’re applying a grep-like filter to data generated or received by your program. This toy example demonstrates it by matching a regex against a hundred million strings:

use std::time::Instant;

use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    static ref IS_INTEGER: Regex = Regex::new("^[0-9]+$").unwrap();
}

fn main() {
    let strings: Vec<&str> = ["foo", "bar", "1234", "1234foo", ""]
        .into_iter()
        .cycle()
        .take(100_000_000)
        .collect();

    let start = Instant::now();
    let n_ints = strings.iter().filter(|s| IS_INTEGER.is_match(s)).count();
    let elapsed = start.elapsed().as_secs_f32();
    println!("{} {}s", n_ints, elapsed);
}

It’s not a scientific benchmark of regex performance, but it does show some interesting and unexpected effects also observed in real-world code. For starters, it takes 2.0s to execute the matches on my laptop.

This is good performance, but it’s a shame to do everything in one thread – let’s try to speed it up by using all cores. This is the kind of thing Rayon makes really easy, just change iter() to par_iter():

use rayon::prelude::*;
let n_ints = strings.par_iter().filter(|s| IS_INTEGER.is_match(s)).count();

Surprisingly, this takes 6-8s to execute on the system with 4 physical cores. In other words, instead of being 3-4x faster due to running on 4 cores, it’s 3-4 times slower. A very similar slowdown occurs on the system with 32 physical cores, where the time grows from 2.05s to 8.2s.

This result can’t be chalked up to an inefficiency in Rayon, as the same slowdown is observed when dividing the work among threads in other ways. Could it be that the compiled regex itself causes contention when accessed from multiple threads?

When this was first suggested in discussions with coworkers, it seemed quite unlikely, as contention would imply that the compiled regex held a lock or other form of synchronization. This runs counter to the idea of a compiled regex, which one would expect to be fully constructed during compilation. Compiled regexes are often seen in lazy_statics, and shared by the whole program. But no matter how unlikely, the possibility of contention is easy to test, simply by switching from lazy_static! to thread_local!:

thread_local! {
    static IS_INTEGER: Regex = Regex::new("^[0-9]+$").unwrap();
}

The match now needs an additional closure to access the thread-local, but is still quite readable:

use rayon::prelude::*;
let n_ints = strings
    .par_iter()
    .filter(|s| IS_INTEGER.with(|is_integer| is_integer.is_match(s)))
    .count();

Continuing the surprise, this takes 0.66s to run, which is 3x faster than the single-threaded version – the kind of speedup one might realistically expect from a 4-core computer. On the 32-core server, it takes 0.086s, a 24x speedup.

So, regex matching does have a contention issue. The compiled regex type, Regex, wraps the internal Exec type, which holds ProgramCache values organized in a Pool that stores them inside a mutex. Accessed from a single thread, or from multiple threads at different times, this mutex is cheap to acquire and is held for a very short time. But under strong contention it becomes a bottleneck, with attempts to acquire it falling back to OS-level waits, causing performance to degrade.

The file PERFORMANCE.md dedicates a whole section to this issue. The text is nuanced, so instead of quoting its parts, I encourage you to go ahead and read it. It warns of the performance impact and shows how to eliminate it (using a slightly different approach than taken above), but it also says that it’s “supported and encouraged” to define regexes using lazy_static! and use them from multiple threads. It explains that, despite expectations, a compiled regex doesn’t contain everything needed to match it – some of the compiled state is built lazily while executing particular kinds of search, and is later reused for searches of the same kind. The mutex protects the “scratch space” used for those updates. Design options and tradeoffs are discussed in this issue in more detail.

In summary, for most use cases it’s perfectly fine to use the same compiled regex from multiple threads. But if you have code that does heavy regex matching from multiple threads, and does most or all of it on one regex, you’ll almost certainly want to give different instances of the compiled regex to different threads. As always, be sure to measure performance before actually making such a change to the code.
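
For example, if the matching is parallelized with Rayon as above, one way to give workers their own instances is map_init(). This is only a sketch, not the exact approach recommended in PERFORMANCE.md, and it assumes that recompiling the regex per work unit is cheap relative to the matching done with it:

use rayon::prelude::*;
use regex::Regex;

let n_ints = strings
    .par_iter()
    .map_init(
        || Regex::new("^[0-9]+$").unwrap(), // fresh instance per Rayon work unit
        |re, s| re.is_match(s),
    )
    .filter(|&matched| matched)
    .count();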

Rust global variables demystified

Rust has a reputation as a language unfriendly to global variables. While this reputation is not entirely undeserved, most of it stems from the guarantees Rust insists on, not from a desire to stifle the programmer’s creativity. In this article we’ll show how to use global variables and how to overcome the limitations.

Note: this article has a followup written two years later.

Is it even ok to use global variables?

Global variables are a controversial topic in programming circles, with many educators taking the moral high ground and condemning them as a code smell, a shortcut, a crutch, a hallmark of throw-away code, and <insert favorite insult>. While there is good reason for the hostility, there is also an abundance of situations where global variables are either appropriate or actually the only way to proceed. Pre-compiled regular expressions or the state of a logger are examples where you’d probably want to use globals rather than threading the state all the way from the top level to the bottom-most part of your program. Many low-level system APIs, such as those of signal handlers or hardware-triggered interrupts, have callbacks that don’t receive a “state” argument, so communication between the callback and the rest of the system must go through globals. There are other examples, and in this article we’ll assume that you have good reasons for using globals that your CS professor would approve of. If that’s not the case, don’t worry, we won’t tell.

Declaring globals

A Rust global variable is declared much like any other variable, except it’s declared at top-level and uses static instead of let:

static LOG_LEVEL: u8 = 0;

So we use the keyword static instead of let, and must spell out the type, because the compiler refuses to infer it for statics – even when it’s unambiguous, so static LOG_LEVEL = 0u8 won’t compile.

Note that what we’ve declared is actually a static, which is not necessarily global. If the same declaration appeared in a function, the variable would be visible only inside the function, but its value would still be shared among all invocations of the function and last until the end of the program. This article talks about global variables, but most of the content equally applies to static variables defined inside functions, the only difference being that of visibility.

The global works much like you’d expect: you can read it from any function in the module, you can import it (if declared pub) from another module and then read it from there. You can even borrow it and get a reference with a 'static lifetime – because the global variable lasts until the end of the program. Neat.

What you can’t do is assign to the global because it’s not declared as mut. This is where things get interesting.

Mutable globals – atomics and locks

Ideally we’d like to declare our global as mut and have a public API that manipulates it – say, a function that reads it and another that writes it:

static mut LOG_LEVEL: u8 = 0;

pub fn get_log_level() -> u8 {
    LOG_LEVEL
}

pub fn set_log_level(level: u8) {
    LOG_LEVEL = level;
}

The compiler rejects both the getter and the setter with a similar error message:

error[E0133]: use of mutable static is unsafe and requires unsafe function or block
 --> src/lib.rs:8:5
  |
8 |     LOG_LEVEL = level;
  |     ^^^^^^^^^^^^^^^^^ use of mutable static

The underlying problem is that a global variable is potentially visible from multiple threads. The above functions don’t synchronize their access to the global, so there’s nothing to prevent set_log_level() from being called in one thread while another thread calls get_log_level() or set_log_level(), either of which would constitute a data race. Rust requires an unsafe block to signal that such synchronization has been implemented by the code that surrounds access to the mutable static. Alternatively, the whole function can be unsafe to signal that the burden of synchronization is transferred to its callers. Since we don’t in fact have such synchronization (and it’s unclear how a caller of set_log_level() and get_log_level() would even ensure it), we won’t attempt to “fix” it by adding an unsafe to get the code to compile. We want to access globals without advanced reasoning about unsafe and undefined behavior.

Since we’re dealing with a potential data race, let’s address it with the mechanisms used elsewhere in Rust to avoid data races – locks and atomics. In case of a u8, we can simply replace u8 with AtomicU8:

use std::sync::atomic::{AtomicU8, Ordering};

static LOG_LEVEL: AtomicU8 = AtomicU8::new(0);

pub fn get_log_level() -> u8 {
    LOG_LEVEL.load(Ordering::Relaxed)
}

pub fn set_log_level(level: u8) {
    LOG_LEVEL.store(level, Ordering::Relaxed);
}

The global variable is no longer mut, so no unsafe is needed. The code is thread-safe and as performant as an unsafe version would be – on x86 the relaxed atomic load compiles into an ordinary load. If you need stricter ordering guarantees between LOG_LEVEL and other data in the program, you can use Ordering::SeqCst instead.

But what if we need something that won’t neatly fit into an atomic – say, a string? This compiles, but won’t allow us to modify the global:

static LOG_FILE: String = String::new();

pub fn get_log_file() -> &'static str {
    &LOG_FILE
}

Since there is no AtomicString, we need to use a proper lock:

use std::sync::Mutex;

// XXX - doesn't compile
static LOG_FILE: Mutex<String> = Mutex::new(String::new());

pub fn get_log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *LOG_FILE.lock().unwrap() = file;
}

Note that get_log_file() must return a fresh copy of the string. Returning a reference would require a lifetime, and there is no lifetime to associate with the global variable other than 'static, and 'static is incorrect (and wouldn’t compile) because set_log_file() can modify it at any time.

The above doesn’t compile for a different reason:

error[E0015]: calls in statics are limited to constant functions, tuple structs and tuple variants
 --> src/lib.rs:3:34
  |
3 | static LOG_FILE: Mutex<String> = Mutex::new(String::new());
  |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^

For more information about this error, try `rustc --explain E0015`.

What’s going on here? Why did String::new() compile, but Mutex::new(String::new()) didn’t?

The difference is that the globals we declared so far were just pieces of data whose initial values were available at compilation time. The compiler didn’t need to generate any initialization code for static LOG_LEVEL: u8 = 0 – it only reserved a byte in the executable’s data segment and ensured that it contained 0 at compile time. String::new() also works because it is a const fn, a function specifically marked as runnable at compile time. It can be marked like that because an empty string doesn’t allocate, so the string returned by String::new() can be represented in the executable by a triple of (0 [length], 0 [capacity], NonNull::dangling() [constant representing an unallocated pointer]). Nothing needs to happen at run time. On the other hand, static LOG_FILE: String = String::from("foo") wouldn’t compile because String::from() requires a run-time allocation and is therefore not a const fn.
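
To make the distinction concrete (both lines just restate the paragraph above):

// compiles: String::new() is a const fn and performs no allocation
static EMPTY: String = String::new();

// wouldn't compile: String::from() allocates at run time and is not a const fn
// static LOG_FILE: String = String::from("foo");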

std::sync::Mutex::new() is not const fn because it requires an allocation in order to keep the system mutex at a fixed address. And even if we used an allocation-free mutex (such as parking_lot::Mutex which supports a const fn constructor on nightly Rust), we’d face the same issue if we wanted to start off with a non-empty string, a data structure coming from a library we don’t control, or information only available at run-time, such as fresh randomness or the current time. In general, we don’t want to be constrained to const fn functions when initializing global variables.

As a side note, C++ supports initializing global variables with arbitrary code by simply allowing the compiler to generate code that runs before main() (or during dlopen() in the case of dynamic libraries). This approach is convenient for simple values, but when used in real programs it led to issues with initialization order, aptly named static initialization order fiasco. To avoid that problem, as well as issues with libraries that require explicit initialization, Rust doesn’t allow pre-main initialization, opting instead for the approach C++ calls the construct on first use idiom.

We will review three ways to initialize a global variable with arbitrary data, two of them based on external (but extremely well reviewed and widely used) crates, and one based on the standard library.

Once cell

The once_cell crate provides a OnceCell type that can be used to define global variables. Here is how one would use OnceCell for LOG_FILE:

use once_cell::sync::OnceCell;
use std::sync::Mutex;

static LOG_FILE: OnceCell<Mutex<String>> = OnceCell::new();

fn ensure_log_file() -> &'static Mutex<String> {
    LOG_FILE.get_or_init(|| Mutex::new(String::new()))
}

pub fn get_log_file() -> String {
    ensure_log_file().lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *ensure_log_file().lock().unwrap() = file;
}

Looking at the implementation of get_log_file() and set_log_file(), it is immediately apparent that they implement the “construct on first use” idiom – both functions call a method that ensures that the inner value is initialized (and that this is only done once), and retrieve a reference to the globally-stored value. This value can then be manipulated in the usual way through interior mutability.

OnceCell<T> is conceptually similar to a RefCell<Option<T>>. Like an Option, it has two states, one empty and another with a useful value. Like RefCell<Option<T>>, it uses interior mutability to allow setting the inner value using just a shared reference. But unlike Option, once set to a non-empty value, the stored value can never be set again. This allows a non-empty OnceCell to give out shared references to the inner data, which RefCell<Option> wouldn’t be allowed to do (it could at best return a Ref<T>) because the contents may change at any time. OnceCell is also thread-safe, so it would actually compare to Mutex<Option<T>>, except it uses an atomic to efficiently check whether the cell has been set.
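
A quick illustration of that behavior (a throwaway snippet, not part of the logging example):

use once_cell::sync::OnceCell;

let cell: OnceCell<u32> = OnceCell::new();
assert!(cell.get().is_none());    // starts out empty
assert_eq!(cell.set(1), Ok(()));  // the first set() succeeds...
assert!(cell.set(2).is_err());    // ...and the value can never be replaced
assert_eq!(cell.get(), Some(&1)); // shared references to the stored value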

once_cell also provides a Lazy type that makes the above even simpler, removing the need for a separate ensure_log_file() function:

use std::sync::Mutex;
use once_cell::sync::Lazy;

static LOG_FILE: Lazy<Mutex<String>> = Lazy::new(|| Mutex::new(String::new()));

pub fn get_log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *LOG_FILE.lock().unwrap() = file;
}

Our globals now work pretty much the way we wanted them to. Lazy<T> performs a sleight of hand that even allows us to call methods like lock() directly on LOG_FILE. It achieves this by implementing Deref, the trait normally used to treat lightweight containers (typically smart pointers) like the values they contain. Deref is the mechanism that allows you to call methods of Foo on Box<Foo>, or methods of &str on String, and so on. Lazy<T> wraps a OnceCell<T> and implements a Deref<Target = T> whose deref() boils down to self.once_cell.get_or_init(|| self.init.take().unwrap()), where init is the closure passed to Lazy::new().

The Lazy version still uses the construct on first use idiom; it’s just hidden behind the magic of Deref. In some cases this can yield surprising results because the actual type of LOG_FILE is not Mutex<String>, it’s Lazy<Mutex<String>>, so if you use it in a context that expects exactly &Mutex<String>, it will fail to compile. It’s not a big deal because you can always obtain the actual &Mutex<String> with &*LOG_FILE (equivalent to LOG_FILE.deref()), but it is something to be aware of.

The OnceCell and Lazy types are in the process of getting stabilized, so we can expect them to become part of the standard library in the near future.

Lazy static

Another popular library for creating global variables is the lazy_static crate, which defines a macro that hides even the lazy initialization, allowing you to write code that looks almost like an ordinary declaration:

use lazy_static::lazy_static;
use std::sync::Mutex;

lazy_static! {
    static ref LOG_FILE: Mutex<String> = Mutex::new(String::new());
}

// get_log_file() and set_log_file() defined as with once_cell::Lazy

pub fn get_log_file() -> String {
    LOG_FILE.lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *LOG_FILE.lock().unwrap() = file;
}

An invocation of lazy_static! is just syntax sugar for defining a Lazy value. Under the hood everything works exactly the same as in the example that used once_cell::Lazy (except lazy_static defines its own lazy_static::Lazy). Like with once_cell::Lazy, the actual type of LOG_FILE is not Mutex<String>, but a different type which uses Deref to give out &'static Mutex<String> on method calls. Some details differ, e.g. lazy_static constructs a dummy type also named LOG_FILE and implements Deref on that, while hiding the actual Lazy<T> value in a static variable defined in a function – but the end result is exactly the same.

If you’re curious, you can run cargo expand on code that invokes lazy_static! {...} to see exactly what the macro generates.

Standard library – Once+unsafe

Note: OnceCell stabilized in Rust 1.70 as std::sync::OnceLock, so this section is useful only for educational/historical purposes.

Until OnceCell stabilizes, the standard library doesn’t offer a way to implement global variables initialized with non-const functions without unsafe code. In most cases this should be avoided because you can use once_cell or lazy_static. But if you must depend only on the standard library, if you want tighter control, or if you just want to learn how it’s done, here is an example that uses std::sync::Once to implement a mutable global:

use std::mem::MaybeUninit;
use std::sync::{Mutex, Once};

fn ensure_log_file() -> &'static Mutex<String> {
    static mut LOG_FILE: MaybeUninit<Mutex<String>> = MaybeUninit::uninit();
    static LOG_FILE_ONCE: Once = Once::new();

    // Safety: initializing the variable is only done once, and reading is
    // possible only after initialization.
    unsafe {
        LOG_FILE_ONCE.call_once(|| {
            LOG_FILE.write(Mutex::new(String::new()));
        });
        // We've initialized it at this point, so it's safe to return the reference.
        LOG_FILE.assume_init_ref()
    }
}

// get_log_file() and set_log_file() defined as with once_cell::OnceCell

pub fn get_log_file() -> String {
    ensure_log_file().lock().unwrap().clone()
}

pub fn set_log_file(file: String) {
    *ensure_log_file().lock().unwrap() = file;
}

Once ensures that the MaybeUninit is initialized only once, which OnceCell guaranteed in previous versions of the code. Once is also efficient, using atomics to optimize the fast path when the variable has already been initialized. The definition of the variable, now static rather than global, is placed inside the function to prevent code outside ensure_log_file() from accessing it directly. All accesses inside ensure_log_file() are synchronized through call_once(), the writer by running inside it, and the readers by waiting for it to complete, which makes the access data-race-free.

Once the initialization of LOG_FILE is complete, ensure_log_file() can proceed to return the reference to the inside of the MaybeUninit, using assume_init_ref().

Which option to choose

In most cases involving real-world types, you’ll want to use once_cell or lazy_static, depending on which syntax you prefer. They have the exact same run-time characteristics, and you won’t go wrong with choosing either. Of course, when once_cell stabilizes in the stdlib, that will become the obvious choice.

There are two exceptions:

  1. When your globals are const-initialized and you don’t need to modify them, you can just declare them as static or const. The difference between the two is that static guarantees that they are stored in only one place and const doesn’t (it inlines them where they’re used).

  2. When you need to modify your globals, but their type is supported by std::sync::atomic: bool, u8 through u64, i8 through i64, usize, or isize. In that case you can declare the variable as static with the appropriate atomic type, and use the atomic API to read and modify it.

A case not covered by this article is thread-local variables, which can also be global. Those are provided by the thread_local! macro from the standard library, and allow the use of non-Sync types in globals.
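
For illustration, here is a minimal thread-local counter; the names are made up, but the pattern is the standard one:

use std::cell::Cell;

thread_local! {
    // Each thread gets its own counter. Cell is not Sync, but that's fine
    // because the value never crosses a thread boundary.
    static COUNTER: Cell<u64> = Cell::new(0);
}

/// Increments this thread's counter and returns the new value.
fn bump() -> u64 {
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}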

Upgradable parking_lot::RwLock might not be what you expect

Let’s say we’re building a simple table indexed by integers starting with 0. Although the keys are contiguous, the table is loaded from key-value pairs that arrive in arbitrary order. The number of elements is not known in advance and can be anywhere from a handful up to a hundred million, but is expected to fit in working memory, given a reasonable representation. Due to these constraints the table loader naturally represents the data as Vec<Option<Payload>>. It starts out empty and sports a set() method that fills the payload at the given position, resizing the table as needed:

/// Called during loading, place `val` at `pos`
fn set(&mut self, pos: usize, val: Payload) {
    if pos >= self.data.len() {
        self.data.resize_with(pos + 1, Default::default);
    }
    self.data[pos] = Some(val);
}

The table payload is comparatively expensive to compute (it is deserialized from JSON), so that’s done from multiple worker threads, and set() must be thread-safe. The obvious approach is to place the whole table under a lock, using the fantastic parking_lot crate to provide the implementation. Under a mutex, data would be a Mutex<Vec<Option<Payload>>>:

fn set(&self, pos: usize, val: Payload) {
    let mut data = self.data.lock();  // no unwrap() because parking_lot is awesome!
    if pos >= data.len() {
        data.resize_with(pos + 1, Default::default);
    }
    data[pos] = Some(val);
}

/// Called once loading is done to extract the data
fn into_data(self) -> Vec<Option<Payload>> {
    self.data.into_inner()
}

This simple solution has a serious problem – now the whole of set() operates with the big lock acquired. On a machine with 32 physical cores, there is a lot of contention around calls to set(), and the worker threads end up spending about 80% of their time idly waiting for the lock. The worst part is that the coarse lock is only really needed to resize the table, which occurs very rarely, because just one large key immediately removes the need for resizes for all smaller keys. The large majority of calls to set() just access non-overlapping slots in data and have no real need for locking.

Fortunately parking_lot’s Mutex is so incredibly well crafted that it only costs one (1) byte of space. Even if it ends up taking up more space due to padding, just compare that to a POSIX mutex that requires 40 bytes on x86-64 Linux, and must be heap-allocated because it’s UB to move it after initialization! With parking_lot fine-grained locking is quite feasible, even on tables with hundreds of millions of elements. Individual locks will be uncontended and therefore acquirable without system calls, and virtually free compared to the big lock. If we knew the table size in advance, we’d represent it as Vec<Mutex<Option<Payload>>> and be done with it.

However, we still need some kind of lock around the whole table as well, in order to be able to resize it when necessary. Since resizing only happens occasionally, we can use a RwLock, acquiring a shared read lock to access the table and its individual row (which has its own mutex for updating), and request an exclusive write lock only to resize the whole table. The type for data now becomes RwLock<Vec<Mutex<Option<Payload>>>>. (Don’t be scared by the piled-up generics, just read them in the order of appearance: read-write lock containing a vector of mutexes which protect optional payloads.)
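
Spelled out as a type, that layout looks like the following sketch; it is generic over Payload only so the snippet stands on its own:

use parking_lot::{Mutex, RwLock};

// Outer RwLock: taken for writing only when the table needs to grow.
// Inner Mutex per slot: taken briefly to store a single payload.
struct Table<Payload> {
    data: RwLock<Vec<Mutex<Option<Payload>>>>,
}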

To top it off, the parking_lot README advertises a feature of its RwLock that looks exactly like what we need:

RwLock supports atomically upgrading an “upgradable” read lock into a write lock.

Perfect – we acquire the read lock for the whole table, and only upgrade it to a write lock if a resize is needed. After the resize, we downgrade it back to a read lock and proceed. With resizing taken care of, we access the inner mutexes through the shared read lock, acquire the finer-grained mutex without any contention, and store the data. Here is an implementation of that idea:

// XXX - looks right, but doesn't scale; see below
pub fn set(&self, pos: usize, val: Payload) {
    let data = self.data.upgradable_read();
    if data.len() <= pos {   // need to resize the table
        let mut wdata = RwLockUpgradableReadGuard::upgrade(data);
        wdata.resize_with(pos + 1, Default::default);
        let data = RwLockWriteGuard::downgrade(wdata);
        *data[pos].lock() = Some(val);
    } else {
        *data[pos].lock() = Some(val);
    }
}

fn into_data(self) -> Vec<Option<Payload>> {
    self.data.into_inner().into_iter().map(Mutex::into_inner).collect()
}

However, benchmarking this code in production shows no improvement whatsoever compared to the version with one big mutex. What’s going on?

After spending a lot of time trying to figure out where else in the program there could be a lock causing contention (including investigating the allocator, among other suspects turned innocent), I finally came back to this code and carefully reviewed the documentation of upgradable_read, which says the following:

Locks this rwlock with upgradable read access, blocking the current thread until it can be acquired. The calling thread will be blocked until there are no more writers or other upgradable reads which hold the lock. There may be other readers currently inside the lock when this method returns.

So, an upgradable read is locking out other upgradable reads, even if they never exercise their option to upgrade the lock! Since every access to the table from set() is an upgradable read, this made it behave just like a mutex. I feel that this should be more prominently stressed in the docs, as it is easy to miss and makes upgradable reads useless for this scenario. The correct way to implement set() is by manually releasing the read lock and reacquiring a write lock:

pub fn set(&self, pos: usize, val: Payload) {
    let mut data = self.data.read();
    if data.len() <= pos {
        // "upgrade" the lock
        drop(data);
        let mut wdata = self.data.write();
        // check that someone else hasn't resized the table in the meantime
        if wdata.len() <= pos {
            wdata.resize_with(pos + 1, Default::default);
        }
        // now "downgrade" it back again
        drop(wdata);
        data = self.data.read();
    }
    *data[pos].lock() = Some(val);
}

The upgrade and the downgrade are of course no longer atomic, but we don’t really care; we just perform another check after the upgrade in case another thread beat us to the punch and resized the table. Most importantly, this version of the code utilizes all the available cores, bringing load times down from 7 minutes to 1m45s for parsing and loading ~70 GiB of JSON data. (Just going through all the data with cat takes 32s, so this is pretty good.)

The immediate moral of the story is: be careful when adopting non-standard features of fundamental APIs like the synchronization devices. The documentation stated the behavior clearly, and quite a bit of debugging would have been avoided by reading it in advance. Even an otherwise excellent library like parking_lot sometimes violates the principle of least surprise.

The deeper lesson is that fearless concurrency, and the more general “if it compiles, it works” principle have their limits. Rust has spoiled us by providing a nice experience with multi-threading, so much that it’s easy to be genuinely surprised when a program that successfully compiles ends up misbehaving in a non-obvious way.

Rust async is colored, and that’s not a big deal

In the last several years async-friendly languages and APIs have received a large amount of attention. One contentious point in the language design space is “colored functions”, i.e. the division of functions into async and non-async ones. The term was introduced by the now-famous 2015 article titled What Color is Your Function?, which uses color as a metaphor for the often painful mismatch between sync and async functions in JavaScript and other languages with explicitly async functions. Since 2015 many more languages have jumped on the async bandwagon, so many more programmers are now getting familiar with the metaphor. Given that some languages managed to provide async IO without colored functions, such as Go, Zig, and in the future likely Java, the discussion around function colors is picking up once again and is raised in the context of Rust. Some people have even tried to argue that the bad rap of colored functions doesn’t apply to Rust’s async because it’s not colored in the first place. An article from several days ago is titled “Rust’s async isn’t f#@king colored!”, and similar arguments have appeared on reddit. I’m not picking on any specific post, but I’d like to provide a response to that sort of argument in general.

In this article I will show that Rust async functions are colored, by both the original definition and in practice. This is not meant as a criticism of Rust async, though – I don’t see function colors as an insurmountable issue, but as a reflection of the fundamental difference between the async and sync models of the world. Languages that hide that difference do so by introducing compromises that might not be acceptable in a systems language like Rust or C++ – for example, by entirely forbidding the use of system threads, or by complicating the invocation of foreign or OS-level blocking calls. Colored functions are also present in at least C#, Python, Kotlin, and C++, so they’re not a quirk of JavaScript and Rust. And additional features of Rust async do make it easier to connect async code with traditional blocking code, something that is just not possible in JavaScript.

Colored functions

“What Color is Your Function?” starts off by describing an imaginary language that perversely defines two types of functions: red and blue. The language enforces a set of seemingly arbitrary rules regarding how the two are allowed to interact:

  1. Every function has a color.
  2. The way you call a function depends on its color.
  3. You can only call a red function from within another red function.
  4. Red functions are more painful to call.
  5. Some core library functions are red.

Without knowing the details, a reasonable person would agree that the described language is not particularly well designed. Of course, readers of this article in 2021 will not find it hard to recognize the analogy with async: red functions are async functions, and blue functions are just ordinary functions. For example, #2 and #4 refer to the fact that calling an async function requires either explicit callback chaining or await, whereas a sync function can just be called. #3 refers to the fact that await and callback resolution work only inside async functions, and JavaScript doesn’t provide a way to block the current non-async function until a promise (async value) is resolved. The article portrays async functions as a leaky abstraction that profoundly and negatively affects the language, starting with the above rules.

The rules of async make async code contagious, because using just one async function in one place requires all the callers up the stack to become async. This splits the ecosystem into async and non-async libraries, with little possibility to use them interchangeably. The article describes async functions as functions that operate on classic JavaScript callbacks, but further argues that async/await, which was novel at the time, doesn’t help with the issue. Although await constitutes a massive ergonomic improvement for calling async from async (#4), it does nothing to alleviate the split – you still cannot call async code from non-async code because await requires async.

Function colors in Rust async

How does all this apply to Rust? Many people believe that it applies only in part, or not at all. Several objections have been raised:

  1. Rust async functions are in essence ordinary functions that happen to return values that implement the Future trait. async fn is just syntactic sugar for defining such a function, but you can write one yourself using an ordinary fn as long as it returns a type that implements Future. Async functions, the argument goes, merely return Future<Output = T> instead of T, and are no more “special” than functions that return a Result<T> instead of T – so rule #1 (“every function has a color”) doesn’t apply.
  2. Unlike JavaScript, Rust async executors provide a block_on() primitive that invokes an async function from a non-async context and blocks until the result is available – so rule #3 (“you can only call a red function from within another red function”) doesn’t apply.
  3. Again unlike JavaScript, Rust async provides spawn_blocking() which invokes a blocking sync function from an async context, temporarily suspending the current async function without blocking the rest of the async environment. This one doesn’t correspond to a rule from the original article because JavaScript doesn’t support blocking sync functions.
  4. Rule #5 (“some core library functions are red”) doesn’t apply because Rust’s stdlib is sync-only.

If these arguments are correct, the only color rule that remains is rule #4, “red functions are more painful to call”, and that part is almost completely alleviated by await. The original JavaScript problems where async functions “don’t compose in expressions because of the callbacks” or “have different error-handling” simply don’t exist with await, in either JavaScript or Rust. Taking these arguments at face value, it would seem that the whole function-color problem is made up or at least wildly exaggerated from some exotic JavaScript problems that Rust async doesn’t inherit. Unfortunately, this is not the case.

First, the split between the sync and async ecosystems is immediately apparent to anyone who looks at the ecosystem. The very existence of async_std, a crate with the explicit purpose to provide an “async version of the Rust standard library”, shows that the regular standard library is not usable in an async context. If function colors weren’t present in Rust, the ordinary stdlib would be used in both sync and async code, as is the case in Go, where a distinction between “sync” and “async” is never made to begin with.

Then what of the above objections? Let’s go through them one by one and see how they hold up under scrutiny.

Aren’t Rust async functions just ordinary functions with a wacky return type?

While this is true in a technical sense, the same is also true in JavaScript and almost all languages with colored async (with the exception of Kotlin) in exactly the same way. JavaScript async functions are syntactic sugar for functions that create and return a Promise. Python’s async functions are regular callables that immediately return a coroutine object. That doesn’t change the fact that in all those languages the caller must handle the returned Promise (coroutine object in Python, Future in Rust) in ways that differ from handling normal values returned from functions. For example, you cannot pass an async function to Iterator::filter() because Iterator::filter() expects a function that returns an actual bool, not an opaque value that just might produce a bool at some point in the future. No matter what you put in the body of your async function, it will never return bool, and extracting the bool requires executor magic that creates other problems, as we’ll see below. Regardless of whether it’s technically possible to call an async function from a sync context, inability to retrieve its result is at the core of function color distinction.

Ok, but doesn’t the same apply to Result? Functions that need a u32 aren’t particularly happy to receive a Result<u32, SomeError>. A generic function that accepts u32, such as Iterator::min(), has no idea what to do with Result<u32, SomeError>. And yet people don’t go around claiming that Result somehow “colors” their functions. I admit that this argument has merit – Result indeed introduces a semantic shift that is not always easy to bridge, including in the example we used above, Iterator::filter(). There is even a proposal to add 21 new iterator methods such as try_filter(), try_min_by_key(), try_is_partitioned(), and so on, in order to support doing IO in your filter function (and key function, etc.). Doing this completely generically might require Haskell-style monads or at least some form of higher-kinded types. All this indicates that supporting both Result and non-Result types in fully generic code is far from a trivial matter. But is that enough to justify the claim that Result and Future are equivalent in how they affect functions that must handle them? I would say it’s not, and here is why.

If the recipient of a Result doesn’t care about the error case, it can locally resolve Result to the actual value by unwrapping it. If it doesn’t want to panic on error, it can choose to convert the error to a fallback value, or skip the processing of the value. While it can use the ? operator to propagate the error to its caller, it is not obliged to do so. The recipient of a Future doesn’t have that option – it can either .await the future, in which case it must become async itself, or it must ask an executor to resolve the future, in which case it must have access to an executor, and license to block. What it cannot do is get to the underlying value without interaction with the async environment.
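
To make the asymmetry concrete, here is a trivial sketch of a sync function absorbing a Result on the spot; there is no equivalent trick for absorbing a Future:

// A sync caller can resolve a Result locally without changing its own signature.
fn parse_port(s: &str) -> u16 {
    s.parse().unwrap_or(8080) // fall back to a default on error
}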

Verdict: Rule #1 mostly applies to Rust – async functions are special because they return values that require async context to retrieve the actual payload.

Doesn’t block_on() offer a convenient way to invoke an async function from a non-async context?

Yes, provided you are actually allowed to use it. Libraries are expected to work with the executor provided by the environment and don’t have an executor lying around which they can just call to resolve async code. The standard library, for example, is certainly not allowed to assume any particular executor, and there are currently no traits that abstract over third-party executors.

But even if you had access to an executor, there is a more fundamental problem with block_on(). Consider a sync function fn foo() that, during its run, needs to obtain the value from an async function async fn bar(). To do so, foo() does something like let bar_result = block_on(bar()). But that means that foo() is no longer just a non-async function, it’s now a blocking non-async function. What does that mean? It means that foo() can block for arbitrarily long while waiting for bar() to complete. Async functions are not allowed to call functions like foo() for the same reason they’re not allowed to call thread::sleep() or TcpStream::connect() – calling a blocking function from async code halts the whole executor thread until the blocking function returns. In case of that happening in multiple threads, or in case of a single-threaded executor, that freezes the whole async system. This is not described in the original function color article because neither block_on() nor blocking functions exist in stock JavaScript. But the implications are clear: a function that uses block_on() is no longer blue, but it’s not red either – it’s of a new color, let’s call it purple.
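
Here is what such a “purple” function looks like in code. This is just a sketch that uses futures::executor::block_on as a stand-in for whatever executor is available:

use futures::executor::block_on;

async fn bar() -> u32 {
    // imagine awaiting some IO here
    42
}

// Sync signature, but the body blocks the calling thread until bar() completes.
// Fine from ordinary sync code, never fine from inside an async task.
fn foo() -> u32 {
    block_on(bar())
}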

If this looks like it’s changing the landscape, that’s because it is. And it gets worse. Consider another async function, xyzzy(), that needs to call foo(). If foo() were a blue/non-async function, xyzzy() would just call it and be done with it, the way it’d call HashMap::get() or Option::take() without thinking. But foo() is a purple function which blocks on block_on(bar()), and xyzzy() is not allowed to call it. The irony is that both xyzzy() and bar() are async, and if xyzzy() could just await bar() directly, everything would be fine. The fact that xyzzy() calls bar() through the non-async foo() is what creates the problem – foo’s use of block_on() breaks the chain of suspensions required for bar() to communicate to xyzzy() that it needs to suspend until further notice. The ability to propagate suspension from the bottom-most awaitee all the way to the executor is the actual reason why async must be contagious. By eliminating async from the signature of foo() one also eliminates much of the advantage of bar() being async, along with the possibility of calling foo() from async code.

Verdict: rule #3 applies because block_on() changes a blue function into something that is neither red nor callable from red.

Doesn’t spawn_blocking() resolve the issue of awaiting blocking functions in async contexts?

spawn_blocking() is a neat bridge between sync and async code: it takes a sync function that might take a long time to execute, and instead of calling it, submits it to a thread pool for execution. It returns a Future, so you can await spawn_blocking(|| some_blocking_call()) like you’d await a true async function, without the issues associated with block_on(). This is because the Future returned by spawn_blocking() is pending until the thread pool reports that it’s done executing the submitted sync function. In our extended color metaphor, spawn_blocking() is an adapter that converts a purple function into a red function. Its main intended use cases are CPU-bound functions that might take a long time to execute, as well as blocking functions that just don’t have a good async alternative. Examples of the latter are functions that work with the file system (which still lacks a true async alternative), or legacy blocking code behind FFI (think ancient database drivers and the like).
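
In code, the bridge looks something like the following sketch, assuming tokio; the config-file name is made up:

use tokio::task;

// Offload a blocking filesystem read to the blocking thread pool and await the
// result without stalling the async executor.
async fn read_config() -> std::io::Result<String> {
    task::spawn_blocking(|| std::fs::read_to_string("config.toml"))
        .await
        .expect("blocking task panicked")
}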

Problems arise when code tries to avoid multiple function colors and use block_on() or spawn_blocking() to hide the “color” of the implementation. For example, a library might be implemented using async code internally, but use block_on() to expose only a sync API. Someone might then use that library in an async context and wrap the sync calls in spawn_blocking(). What would be the consequences if that was done across the board? Recall that the important advantage of async is the ability to scale the number of concurrent agents (futures) without increasing the number of OS threads. As long as the agents are mostly IO-bound, you can have literally millions of them executing (most of them being suspended at any given time) on a single thread. But if an async function like the above xyzzy() uses spawn_blocking() to await a purple function like foo(), which itself uses block_on() to await an async function like bar(), then we have a problem: the number of xyzzy() instances that can run concurrently and make progress is now limited by the number of threads in the thread pool employed by spawn_blocking(). If you need to spawn a large number of tasks awaiting xyzzy() concurrently, most of them will need to wait for a slot in the thread pool to open up before their foo() functions even begin executing. And all this because foo() blocks on bar(), which is again ironic because bar(), being an async function, is designed to scale independently of the number of threads available to execute it.

The above is not just a matter of performance degradation; in the worst case spawn_blocking(|| block_on(...)) can deadlock. Consider what happens if one async function behind spawn_blocking(|| block_on(...)) needs data from another async function started the same way in order to proceed. It is possible that the other async function cannot make progress because it is waiting for a slot in the thread pool to even begin executing. And the slot won’t free up because it is taken by the first async function, which also runs inside a spawn_blocking() invocation. The slot is never going to change owner, and a deadlock occurs. This can’t happen with async functions that are directly executed as async tasks because those don’t require a slot in a fixed-size pool. They can all be in a suspended state waiting for something to happen to any of them, and resume execution at any moment. In an async system the number of OS threads deployed by the executor doesn’t limit the number of async functions that can work concurrently. (There are executors that use a single thread to drive all futures.)

Verdict: spawn_blocking() is fine to use with CPU-bound or true blocking code, but it’s not a good idea to use it with block_on() because the advantages of async are then lost and there is a possibility of deadlock.

But Rust’s stdlib is sync-only.

That’s technically true, but Rust’s stdlib is intentionally minimal. Important parts of functionality associated with Rust are delegated to external crates, with great success. Many of these external crates now require async, or even a specific executor like tokio. So while the standard library is async-free, you cannot ignore async while programming in Rust.

Verdict: technically true but not useful in a language with a minimalistic standard library.

Dealing with a two-colored world

Again, the above is not a criticism of Rust async, but merely of the claim that it’s not colored. Once we accept that it is, it becomes clear that, unlike JavaScript, Rust actually does provide the tools we need to deal with the mismatch. We can:

  1. Accept that sync and async are two separate worlds, and not try to hide it. In particular, don’t write “sync” interfaces that use block_on() to hide async ones, and the other way around with spawn_blocking(). If you absolutely must hide the async interfaces behind sync ones, then do so immediately at the entry point, document that you’re doing so, and provide a public interface to the underlying native call.
  2. Respecting the above, use block_on() and spawn_blocking() in application-level code on the boundaries between the two worlds.
  3. In more complex scenarios, create clear and documented boundaries between the two worlds and use channels to communicate between them. This technique is already used for both multi-threaded and async code, so it should come as no surprise to future maintainers. Ideally you’d use channels that provide both a sync and an async interface, but if those are not available, use async channels with block_on() on the sync side, as in the sketch below.
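
Here is a minimal sketch of that last pattern, assuming tokio: an async task produces values, and an ordinary thread consumes them by blocking on the async channel’s recv():

use tokio::sync::mpsc;

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let (tx, mut rx) = mpsc::channel::<u64>(16);

    // Async world: a task running on the executor produces values.
    rt.spawn(async move {
        for i in 0..5 {
            tx.send(i).await.unwrap();
        }
        // tx is dropped here, which ends the loop on the sync side.
    });

    // Sync world: an ordinary thread bridging with block_on(), as described above.
    let handle = rt.handle().clone();
    std::thread::spawn(move || {
        while let Some(n) = handle.block_on(rx.recv()) {
            println!("got {}", n);
        }
    })
    .join()
    .unwrap();
}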