Exploring lock-free Rust 2: Atomics

The previous article showed a mutex-based implementation of the LazyTransform value container. However, the exercise was explicit about implementing a “lock-free” container. What does that mean, exactly?

Limitations of locks

The interaction between LazyTransform methods was carefully designed to prevent deadlock, but in a busy system one could expect significant contention for the source mutex between get_transformed and set_source. This is not optimal; we would like set_source to just store the new data without any waiting at all, and get_transformed to keep returning the old cached value until a new one is observed.

Consider also the contention for the value read-write lock. After get_transformed performs the transformation, it must acquire the value lock in write mode, which means it must wait until all the readers release it. What is even worse is that once the value lock is finally acquired in write mode, other readers must wait for the writer to release it. Of course, the writer only holds the lock for as long as it takes to store new_value, which is an extremely efficient operation, but all readers are still blocked, and if the writing thread gets suspended by the OS scheduler, all readers will remain blocked until the writer thread resumes. The readers, on the other hand, hold the lock for as long as it takes to clone the value, which is not bounded in any way. What is common to all these cases is that progress of the entire system hinges on the behavior or scheduling of one thread, and this is what lock-free algorithms are designed to avoid.

In a lock-free system, the program will make progress as long as any of its threads are allowed to run. Being lock-free is not just about refraining from using the OS lock primitives, but about carefully designing the system so it cannot lock up, be it by deadlock or livelock or unfortunate thread scheduling patterns. As Jeff Preshing puts it, a lock-free application won’t lock up even with hypothetical thread scheduling decisions being made by your worst enemy.

Another reason to want to avoid mutexes is their cost on object size. On x86_64 Linux, a pthread_mutex_t, internally used by Rust’s Mutex, takes 40 bytes of memory, and a RwLock takes 56. To make Mutex movable (a requirement for all Rust types), the actual OS lock object is dynamically allocated. The smallest possible Rust mutex, a Mutex<()>, takes 16 bytes of memory for the object itself (8 for the pointer to the dynamically allocated OS mutex and 8 more for the poison flag). With each LazyTransform object requiring ~200 bytes of memory and three dynamic allocations just to store the locks, we would not want to have millions of such objects lying around. If we don’t really need the waiting functionality offered by a full-featured Mutex and RwLock, we would do well to eliminate them and significantly reduce the footprint of LazyTransform.

AtomicBool

For a start, let’s eliminate the transform_lock mutex, an easy target as it holds no data and is only ever acquired with try_lock. As used, this lock was already “lock-free” in the above sense, because owning it would never cause another thread to stall.

We could replace it with a bool, but then try_lock would be doomed to a race condition in the window between checking the previous lock state and changing it to “locked”. Not to mention that Rust would see this as a data race and the compiler would reject access to a bool without protecting it with a mutex. Fortunately, a mutex is not the only option for avoiding data races. Modern processors support a number of specialized instructions to read and update values atomically, including one that fetches a machine value and atomically set the new one. In Rust, these instructions are exposed as methods on types in the std::sync::atomic module. The stable channel provides AtomicBool, AtomicUsize, AtomicIsize, and AtomicPtr<T>. For transform_lock, we can use AtomicBool to store a flag indicating whether the lock is currently taken. The resulting try_lock would look like this:

if !self.transform_lock.swap(true, Ordering::Acquire) {
    // lock acquired
    ...
    // release the lock
    self.transform_lock.store(false, Ordering::Release);
} else {
    // failed to acquire the lock
}

swap atomically stores true into the lock and returns the previous value – the operation Java AtomicBoolean calls getAndSet. If the previous value was false, it means it was us that changed the value from false to true and thus successfully acquired the lock. Conversely, if the previous value was true, it means that the lock had been already taken by someone else, and our store of true was a no-op. The Ordering argument tells the compiler which memory order is requested for the operation. The memory order corresponds to guarantees that place constraints on how the compiler and the CPU can reorder data reads and writes, both ordinary and atomic ones, around one particular atomic read/write; see the nomicon for a detailed explanation. For transform_lock we use the acquire and release ordering, so named because of its usefulness in implementing locks. A store with the “release” ordering guarantees that all writes executed prior to the atomic store will be observable by any thread that observes the newly stored atomic value. It is used when releasing a lock by setting the “locked” flag to false with the “release” ordering, which signals to other threads that the lock is now available for the taking, but also ensures that the objects that have been updated while the lock was held are available for reading. Conversely, any code that observes the false value using the “acquire” ordering can be sure that its subsequent load of other locations will also observe the results of all writes executed before the “release” store.

Note that the AtomicBool::swap method operates on the object by &self shared reference. This is another example of the conceptual magic like the one employed by mutexes and described in the previous post. The difference is that with AtomicBool no waiting takes place, so we are not trusted to ever get a mutable reference to the underlying bool. Instead, we tell AtomicBool how to change the object, and it executes the change through hardware mechanisms that are guaranteed to be exclusive, which allows it to declare the mutating methods to take &self.

Here is an implementation using AtomicBool for transform_lock:

use std::sync::{Mutex, RwLock};
use std::sync::atomic::{AtomicBool, Ordering};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Mutex<Option<S>>,
    value: RwLock<Option<T>>,
    transform_lock: AtomicBool,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Mutex::new(None),
            value: RwLock::new(None),
            transform_lock: AtomicBool::new(false),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut locked_source = self.source.lock().unwrap();
        *locked_source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        if !self.transform_lock.swap(true, Ordering::Acquire) {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    self.transform_lock.store(false, Ordering::Release);
                    return new_value;
                }
            }
            self.transform_lock.store(false, Ordering::Release);
        }
        self.value.read().unwrap().clone()
    }
}

Lock abstraction

One disadvantage of directly using an AtomicBool compared to a mutex is that we don’t get a fancy guard that automatically unlocks the lock when going out of scope. Instead, we must remember to unlock it ourselves, which can be difficult to guarantee during panic. A robust implementation would wrap the AtomicBool into a lock object that returns a guard that implements Drop:

struct LightLock(AtomicBool);

impl LightLock {
    pub fn new() -> LightLock {
        LightLock(AtomicBool::new(false))
    }

    pub fn try_lock<'a>(&'a self) -> Option<LightGuard<'a>> {
        let was_locked = self.0.swap(true, Ordering::Acquire);
        if was_locked {
            None
        } else {
            Some(LightGuard { lock: self })
        }
    }
}

struct LightGuard<'a> {
    lock: &'a LightLock,
}

impl<'a> Drop for LightGuard<'a> {
    fn drop(&mut self) {
        self.lock.0.store(false, Ordering::Release);
    }
}

In accordance with Rust conventions, the method that attempts the lock is named try_lock. Our version returns an Option that tells the caller whether taking the lock succeeded and, if so, contains the guard object to automatically unlock it on exit. Rust’s Mutex::try_lock has a more sophisticated return value that allows for other kinds of errors, but the logic of using it is essentially the same.

Using this type, transform_lock gets declared as LightLock and initialized as LightLock::new(). get_transformed automatically becomes panic-safe and also cleaner and more maintainable because it no longer needs to remember to release the lock before every exit point:

    pub fn get_transformed(&self) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    return new_value;
                }
            }
        }
        self.value.read().unwrap().clone()
    }

Getting rid of one mutex wasn’t that hard. Can the same logic apply to other locks? transform_lock was distinctive in that it never guarded concrete data, it was only used to serialize access to the transformation of source into value. This ensured that source and value locks, which are actually exclusive in nature, were held as briefly as possible. Once the lock was obtained, the actual protection of source and value from access by different threads was left to their respective locks. To safely access stored values without locks, we will need some way to access the values atomically like we did with the bool stored by the AtomicBool.

Atomic arbitrary values

Ideally we would have at our disposal a generic atomic type, let’s call it AtomicCell<X>, that provides a swap method to atomically modify the shared value and retrieve the old, and load to only read the existing one. With source and value fields declared as AtomicCell<Option<T>> and AtomicCell<Option<S>> respectively, LazyTransform could look like this:

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn set_source(&self, source: S) {
        self.source.swap(source);
    }

    fn try_transform(&self) {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source_maybe = self.source.swap(None);
            let source = match source_maybe {
                Some(source) => source,
                None => return,
            };
            let newval = match (self.transform_fn)(source) {
                Some(newval) => newval,
                None => return,
            };
            self.value.swap(newval);
        }
    }

    pub fn get_transformed(&self) -> Option<T> {
        let source = self.source.load();
        if source.is_some() {
            self.try_transform();
        }
        self.value.load().as_ref().map(T::clone)
    }
}

get_transformed is now split into two methods: get_transformed, which now only returns the cached value, and a non-public try_transform, which actually attempts the transformation. set_source atomically swaps in the new source, and drops the old one (by virtue of ignoring the return value of swap). try_transform does the exact same thing with the source, and also with value when it obtains a new one. When the old value needs to be read, load() is called to access a reference to the underlying value, which is cloned.

The problem with the above code is that AtomicCell doesn’t exist. Rust’s standard std::sync::atomic module is limited to atomic manipulation of pointer-sized values which correspond to the operands of the CPU’s atomic instructions. Thus the most promising building block for implementing an AtomicCell is the AtomicPtr type, which provides atomic access to a pointer. An AtomicCell<T> would be represented by an AtomicPtr<T> that references T stored dynamically. The dynamic allocation is unfortunate, but unavoidable if we ever want to support T larger than a pointer. Using a pointer, AtomicCell::swap could look like this:

struct AtomicCell<T> {
    ptr: AtomicPtr<T>,
}

impl<T> AtomicCell<T> {
    fn new(initial: T) -> AtomicCell<T> {
        AtomicCell {
            ptr: AtomicPtr::new(Box::into_raw(Box::new(initial)))
        }
    }

    fn swap(&self, new: T) -> T {
        let new_ptr = Box::into_raw(Box::new(new));
        let old_ptr = self.ptr.swap(new_ptr, Ordering::AcqRel);
        unsafe {
            *Box::from_raw(old_ptr)
        }
    }
}

AtomicCell::new simply creates a cell whose ptr points to a valid heap-allocated object, which is AtomicCell‘s invariant. The interesting part happens in swap.

In swap the new object, owned by the local variable received from the caller, is moved to the heap and swapped into AtomicPtr as a raw pointer. The old object is retrieved from the same call to AtomicPtr::swap, also as a raw pointer, immediately converted back to a Box whose contents are moved from it, and returned by value. The ownership of each object is clear at every point. Values stored in the AtomicPtr are (conceptually) owned by the AtomicPtr. When writing a value to the pointer, the ownership over the new value is relinquished to the AtomicPtr using Box::into_raw, and at the same time the ownership of the old value is reasserted with Box::from_raw.

This last part is also the first time we’ve had to use unsafe in the implementation. Box::into_raw is not unsafe because it only gives back a raw pointer, trusting the caller to arrange for the object to be eventually destroyed. Even if the caller fails to uphold their end of the bargain, the worst thing that can happen is a leak, already possible with std::mem::forget. Box::from_raw is unsafe because it reinterprets a raw pointer as a box without any proof that the raw pointer is unreachable from elsewhere or even valid. We, however, know both things to be true. The pointer must be valid because we are only ever passing it pointers created by Box::into_raw, as only such pointers are stored in the AtomicPtr. We also know that the object is not reachable from another thread because we have just obtained it from swap, meaning it has already been disowned by the previous owner. Thus our swap is sound and the unsafe block is a mere implementation detail. Taking ownership is not only safe, but necessary if we want to prevent the old value from leaking.

swap was easy. Now what about AtomicCell::load, which we also need to implement lock-free reading? Based on AtomicPtr::load, a load could be imagined like this:

// doesn't work - see below
fn load(&self) -> T {
    let old_ptr = self.ptr.load(Ordering::Acquire);
    unsafe {
        *Box::from_raw(old_ptr)
    }
}

This code is written in analogy to swap, only without modifying the existing. Unfortunately it is also unsound, which means that it breaks Rust’s ownership rules and causes a crash. The problem is that load takes ownership of the object behind the atomic pointer without also changing the pointer to point to something else. This leads to double free as soon as load is called again to observe the same pointer, or in any subsequent call to swap, which will attempt to take ownership of a freed value. And there is no easy way to fix it, either. For example, suppose we defined load() to return Option<&T>, thus avoiding taking ownership of the observed object. This would prevent double free, but it wouldn’t work on the type level because there is no way to express the lifetime of the returned reference. While multiple loads would no longer be an issue, a swap() could still invalidate the value while a reference is live. Immediately cloning the observed object and returning a new value is also unsound because the reference can be invalidated by a swap before T::clone finishes. The atomic_cell crate resolves the issue using a lock around clone(). Locking is performed with a spinlock, whose lock looks like LightLock::try_lock above, except it retries taking the lock in a busy loop until it succeeds. Despite its efficiency when the lock is uncontended, a spinlock is still a lock and it is not acceptable in the middle of a lock-free algorithm.

Ideally load() would return a guard that provided a reference to the underlying value, much like the guard returned by Mutex::lock. The existence of the guard will prevent writers like swap() from destroying the object. Instead, swap will store the new raw pointer into AtomicPtr and mark the old one for deletion, deferring its actual destruction until all the load guards have been dropped. It is far from obvious how to implement this as efficiently as possible, and of course without heavy synchronization, lest we lose the very benefits lock-free code was supposed to provide.

Lock-free access to arbitrary values with correct memory management is the topic of the next article.

4 thoughts on “Exploring lock-free Rust 2: Atomics”

  1. First of all, famtastic article! Very clear writing on lock free programming–thak you!

    An observation, re: AtomicCell implementation:
    “The pointer must be valid because we are only ever passing it pointers created by Box::into_raw… Thus our swap is sound…”

    The premise of the pointer being valid is not actually true–on first run, the value being swapped out did not come via Box::into_raw–it’s the initialized value of AtomicCell.ptr, which I presume to be 0, right?

    My question is what does it mean to be returning the dreaded null pointer from the first run of AtomicCell.swap()? Isn’t that UB?

    1. The premise of the pointer being valid is not actually true–on first run, the value being swapped out did not come via Box::into_raw–it’s the initialized value of AtomicCell.ptr, which I presume to be 0, right?

      The text assumed that ptr is initialized with a valid object, e.g. by its constructor accepting an initial argument. This is consistent with other Rust “cell” types, which also work with a valid (and owned) object, leaving it to the user to specify Option<T> if they want nullability.

      I have now made the assumption explicit by providing a new implementation that accepts an initial: T to maintain the non-null-ptr invariant.

  2. Thanks for the great post! I’ve just recently discovered how useful lock-free concurrency can be, and this is quite helpful.

    A couple of thoughts on the implementation:

    Your AtomicCell leaks memory because of the heap-allocated pointer; it should have a Drop implementation that converts it back into a Box so it will be de-allocated properly.
    Converting the pointer back into a Box, de-allocating, and re-allocating every time the value is swapped seems like a waste. I’d personally convert it back into a Box, replace the value with mem::replace, and then leak the box again. If you’re confident that the behavior is correct, ptr::replace could probably do all of that with a single statement, but I’d prefer using the actual types involved and let the compiler optimize it.

Leave a Reply