The previous article showed a mutex-based implementation of the LazyTransform
value container. However, the exercise was explicit about implementing a “lock-free” container. What does that mean, exactly?
Limitations of locks
The interaction between LazyTransform
methods was carefully designed to prevent deadlock, but in a busy system one could expect significant contention for the source
mutex between get_transformed
and set_source
. This is not optimal; we would like set_source
to just store the new data without any waiting at all, and get_transformed
to keep returning the old cached value until a new one is observed.
Consider also the contention for the value
read-write lock. After get_transformed
performs the transformation, it must acquire the value
lock in write mode, which means it must wait until all the readers release it. What is even worse is that once the value
lock is finally acquired in write mode, other readers must wait for the writer to release it. Of course, the writer only holds the lock for as long as it takes to store new_value
, which is an extremely efficient operation, but all readers are still blocked, and if the writing thread gets suspended by the OS scheduler, all readers will remain blocked until the writer thread resumes. The readers, on the other hand, hold the lock for as long as it takes to clone the value, which is not bounded in any way. What is common to all these cases is that progress of the entire system hinges on the behavior or scheduling of one thread, and this is what lock-free algorithms are designed to avoid.
In a lock-free system, the program will make progress as long as any of its threads are allowed to run. Being lock-free is not just about refraining from using the OS lock primitives, but about carefully designing the system so it cannot lock up, be it by deadlock or livelock or unfortunate thread scheduling patterns. As Jeff Preshing puts it, a lock-free application won’t lock up even with hypothetical thread scheduling decisions being made by your worst enemy.
Another reason to want to avoid mutexes is their cost on object size. On x86_64 Linux, a pthread_mutex_t
, internally used by Rust’s Mutex
, takes 40 bytes of memory, and a RwLock
takes 56. To make Mutex
movable (a requirement for all Rust types), the actual OS lock object is dynamically allocated. The smallest possible Rust mutex, a Mutex<()>
, takes 16 bytes of memory for the object itself (8 for the pointer to the dynamically allocated OS mutex and 8 more for the poison flag). With each LazyTransform
object requiring ~200 bytes of memory and three dynamic allocations just to store the locks, we would not want to have millions of such objects lying around. If we don’t really need the waiting functionality offered by a full-featured Mutex
and RwLock
, we would do well to eliminate them and significantly reduce the footprint of LazyTransform
.
AtomicBool
For a start, let’s eliminate the transform_lock
mutex, an easy target as it holds no data and is only ever acquired with try_lock
. As used, this lock was already “lock-free” in the above sense, because owning it would never cause another thread to stall.
We could replace it with a bool
, but then try_lock
would be doomed to a race condition in the window between checking the previous lock state and changing it to “locked”. Not to mention that Rust would see this as a data race and the compiler would reject access to a bool
without protecting it with a mutex. Fortunately, a mutex is not the only option for avoiding data races. Modern processors support a number of specialized instructions to read and update values atomically, including one that fetches a machine value and atomically set the new one. In Rust, these instructions are exposed as methods on types in the std::sync::atomic
module. The stable channel provides AtomicBool
, AtomicUsize
, AtomicIsize
, and AtomicPtr<T>
. For transform_lock
, we can use AtomicBool
to store a flag indicating whether the lock is currently taken. The resulting try_lock
would look like this:
if !self.transform_lock.swap(true, Ordering::Acquire) {
// lock acquired
...
// release the lock
self.transform_lock.store(false, Ordering::Release);
} else {
// failed to acquire the lock
}
swap
atomically stores true
into the lock and returns the previous value – the operation Java AtomicBoolean
calls getAndSet
. If the previous value was false, it means it was us that changed the value from false to true and thus successfully acquired the lock. Conversely, if the previous value was true, it means that the lock had been already taken by someone else, and our store of true
was a no-op. The Ordering
argument tells the compiler which memory order is requested for the operation. The memory order corresponds to guarantees that place constraints on how the compiler and the CPU can reorder data reads and writes, both ordinary and atomic ones, around one particular atomic read/write; see the nomicon for a detailed explanation. For transform_lock
we use the acquire and release ordering, so named because of its usefulness in implementing locks. A store with the “release” ordering guarantees that all writes executed prior to the atomic store will be observable by any thread that observes the newly stored atomic value. It is used when releasing a lock by setting the “locked” flag to false with the “release” ordering, which signals to other threads that the lock is now available for the taking, but also ensures that the objects that have been updated while the lock was held are available for reading. Conversely, any code that observes the false value using the “acquire” ordering can be sure that its subsequent load of other locations will also observe the results of all writes executed before the “release” store.
Note that the AtomicBool::swap
method operates on the object by &self
shared reference. This is another example of the conceptual magic like the one employed by mutexes and described in the previous post. The difference is that with AtomicBool
no waiting takes place, so we are not trusted to ever get a mutable reference to the underlying bool
. Instead, we tell AtomicBool
how to change the object, and it executes the change through hardware mechanisms that are guaranteed to be exclusive, which allows it to declare the mutating methods to take &self
.
Here is an implementation using AtomicBool
for transform_lock
:
use std::sync::{Mutex, RwLock};
use std::sync::atomic::{AtomicBool, Ordering};
pub struct LazyTransform<T, S, FN> {
transform_fn: FN,
source: Mutex<Option<S>>,
value: RwLock<Option<T>>,
transform_lock: AtomicBool,
}
impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
LazyTransform {
transform_fn: transform_fn,
source: Mutex::new(None),
value: RwLock::new(None),
transform_lock: AtomicBool::new(false),
}
}
pub fn set_source(&self, source: S) {
let mut locked_source = self.source.lock().unwrap();
*locked_source = Some(source);
}
pub fn get_transformed(&self) -> Option<T> {
if !self.transform_lock.swap(true, Ordering::Acquire) {
let mut new_source = None;
if let Ok(mut locked_source) = self.source.try_lock() {
new_source = locked_source.take();
}
if let Some(new_source) = new_source {
let new_value = (self.transform_fn)(new_source);
if new_value.is_some() {
*self.value.write().unwrap() = new_value.clone();
self.transform_lock.store(false, Ordering::Release);
return new_value;
}
}
self.transform_lock.store(false, Ordering::Release);
}
self.value.read().unwrap().clone()
}
}
Lock abstraction
One disadvantage of directly using an AtomicBool
compared to a mutex is that we don’t get a fancy guard that automatically unlocks the lock when going out of scope. Instead, we must remember to unlock it ourselves, which can be difficult to guarantee during panic. A robust implementation would wrap the AtomicBool
into a lock object that returns a guard that implements Drop
:
struct LightLock(AtomicBool);
impl LightLock {
pub fn new() -> LightLock {
LightLock(AtomicBool::new(false))
}
pub fn try_lock<'a>(&'a self) -> Option<LightGuard<'a>> {
let was_locked = self.0.swap(true, Ordering::Acquire);
if was_locked {
None
} else {
Some(LightGuard { lock: self })
}
}
}
struct LightGuard<'a> {
lock: &'a LightLock,
}
impl<'a> Drop for LightGuard<'a> {
fn drop(&mut self) {
self.lock.0.store(false, Ordering::Release);
}
}
In accordance with Rust conventions, the method that attempts the lock is named try_lock
. Our version returns an Option
that tells the caller whether taking the lock succeeded and, if so, contains the guard object to automatically unlock it on exit. Rust’s Mutex::try_lock
has a more sophisticated return value that allows for other kinds of errors, but the logic of using it is essentially the same.
Using this type, transform_lock
gets declared as LightLock
and initialized as LightLock::new()
. get_transformed
automatically becomes panic-safe and also cleaner and more maintainable because it no longer needs to remember to release the lock before every exit point:
pub fn get_transformed(&self) -> Option<T> {
if let Some(_lock_guard) = self.transform_lock.try_lock() {
let mut new_source = None;
if let Ok(mut locked_source) = self.source.try_lock() {
new_source = locked_source.take();
}
if let Some(new_source) = new_source {
let new_value = (self.transform_fn)(new_source);
if new_value.is_some() {
*self.value.write().unwrap() = new_value.clone();
return new_value;
}
}
}
self.value.read().unwrap().clone()
}
Getting rid of one mutex wasn’t that hard. Can the same logic apply to other locks? transform_lock
was distinctive in that it never guarded concrete data, it was only used to serialize access to the transformation of source
into value
. This ensured that source
and value
locks, which are actually exclusive in nature, were held as briefly as possible. Once the lock was obtained, the actual protection of source
and value
from access by different threads was left to their respective locks. To safely access stored values without locks, we will need some way to access the values atomically like we did with the bool
stored by the AtomicBool
.
Atomic arbitrary values
Ideally we would have at our disposal a generic atomic type, let’s call it AtomicCell<X>
, that provides a swap
method to atomically modify the shared value and retrieve the old, and load
to only read the existing one. With source
and value
fields declared as AtomicCell<Option<T>>
and AtomicCell<Option<S>>
respectively, LazyTransform
could look like this:
impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
pub fn set_source(&self, source: S) {
self.source.swap(source);
}
fn try_transform(&self) {
if let Some(_lock_guard) = self.transform_lock.try_lock() {
let source_maybe = self.source.swap(None);
let source = match source_maybe {
Some(source) => source,
None => return,
};
let newval = match (self.transform_fn)(source) {
Some(newval) => newval,
None => return,
};
self.value.swap(newval);
}
}
pub fn get_transformed(&self) -> Option<T> {
let source = self.source.load();
if source.is_some() {
self.try_transform();
}
self.value.load().as_ref().map(T::clone)
}
}
get_transformed
is now split into two methods: get_transformed
, which now only returns the cached value, and a non-public try_transform
, which actually attempts the transformation. set_source
atomically swaps in the new source, and drops the old one (by virtue of ignoring the return value of swap
). try_transform
does the exact same thing with the source, and also with value
when it obtains a new one. When the old value needs to be read, load()
is called to access a reference to the underlying value, which is cloned.
The problem with the above code is that AtomicCell
doesn’t exist. Rust’s standard std::sync::atomic
module is limited to atomic manipulation of pointer-sized values which correspond to the operands of the CPU’s atomic instructions. Thus the most promising building block for implementing an AtomicCell
is the AtomicPtr
type, which provides atomic access to a pointer. An AtomicCell<T>
would be represented by an AtomicPtr<T>
that references T
stored dynamically. The dynamic allocation is unfortunate, but unavoidable if we ever want to support T
larger than a pointer. Using a pointer, AtomicCell::swap
could look like this:
struct AtomicCell<T> {
ptr: AtomicPtr<T>,
}
impl<T> AtomicCell<T> {
fn new(initial: T) -> AtomicCell<T> {
AtomicCell {
ptr: AtomicPtr::new(Box::into_raw(Box::new(initial)))
}
}
fn swap(&self, new: T) -> T {
let new_ptr = Box::into_raw(Box::new(new));
let old_ptr = self.ptr.swap(new_ptr, Ordering::AcqRel);
unsafe {
*Box::from_raw(old_ptr)
}
}
}
AtomicCell::new
simply creates a cell whose ptr
points to a valid heap-allocated object, which is AtomicCell
‘s invariant. The interesting part happens in swap
.
In swap
the new object, owned by the local variable received from the caller, is moved to the heap and swapped into AtomicPtr
as a raw pointer. The old object is retrieved from the same call to AtomicPtr::swap
, also as a raw pointer, immediately converted back to a Box
whose contents are moved from it, and returned by value. The ownership of each object is clear at every point. Values stored in the AtomicPtr
are (conceptually) owned by the AtomicPtr
. When writing a value to the pointer, the ownership over the new value is relinquished to the AtomicPtr
using Box::into_raw
, and at the same time the ownership of the old value is reasserted with Box::from_raw
.
This last part is also the first time we’ve had to use unsafe
in the implementation. Box::into_raw
is not unsafe because it only gives back a raw pointer, trusting the caller to arrange for the object to be eventually destroyed. Even if the caller fails to uphold their end of the bargain, the worst thing that can happen is a leak, already possible with std::mem::forget
. Box::from_raw
is unsafe because it reinterprets a raw pointer as a box without any proof that the raw pointer is unreachable from elsewhere or even valid. We, however, know both things to be true. The pointer must be valid because we are only ever passing it pointers created by Box::into_raw
, as only such pointers are stored in the AtomicPtr
. We also know that the object is not reachable from another thread because we have just obtained it from swap
, meaning it has already been disowned by the previous owner. Thus our swap
is sound and the unsafe
block is a mere implementation detail. Taking ownership is not only safe, but necessary if we want to prevent the old value from leaking.
swap
was easy. Now what about AtomicCell::load
, which we also need to implement lock-free reading? Based on AtomicPtr::load
, a load
could be imagined like this:
// doesn't work - see below
fn load(&self) -> T {
let old_ptr = self.ptr.load(Ordering::Acquire);
unsafe {
*Box::from_raw(old_ptr)
}
}
This code is written in analogy to swap
, only without modifying the existing. Unfortunately it is also unsound, which means that it breaks Rust’s ownership rules and causes a crash. The problem is that load
takes ownership of the object behind the atomic pointer without also changing the pointer to point to something else. This leads to double free as soon as load
is called again to observe the same pointer, or in any subsequent call to swap
, which will attempt to take ownership of a freed value. And there is no easy way to fix it, either. For example, suppose we defined load()
to return Option<&T>
, thus avoiding taking ownership of the observed object. This would prevent double free, but it wouldn’t work on the type level because there is no way to express the lifetime of the returned reference. While multiple load
s would no longer be an issue, a swap()
could still invalidate the value while a reference is live. Immediately cloning the observed object and returning a new value is also unsound because the reference can be invalidated by a swap
before T::clone
finishes. The atomic_cell
crate resolves the issue using a lock around clone()
. Locking is performed with a spinlock, whose lock
looks like LightLock::try_lock
above, except it retries taking the lock in a busy loop until it succeeds. Despite its efficiency when the lock is uncontended, a spinlock is still a lock and it is not acceptable in the middle of a lock-free algorithm.
Ideally load()
would return a guard that provided a reference to the underlying value, much like the guard returned by Mutex::lock
. The existence of the guard will prevent writers like swap()
from destroying the object. Instead, swap
will store the new raw pointer into AtomicPtr
and mark the old one for deletion, deferring its actual destruction until all the load guards have been dropped. It is far from obvious how to implement this as efficiently as possible, and of course without heavy synchronization, lest we lose the very benefits lock-free code was supposed to provide.
Lock-free access to arbitrary values with correct memory management is the topic of the next article.
First of all, famtastic article! Very clear writing on lock free programming–thak you!
An observation, re: AtomicCell implementation:
“The pointer must be valid because we are only ever passing it pointers created by Box::into_raw… Thus our swap is sound…”
The premise of the pointer being valid is not actually true–on first run, the value being swapped out did not come via Box::into_raw–it’s the initialized value of AtomicCell.ptr, which I presume to be 0, right?
My question is what does it mean to be returning the dreaded null pointer from the first run of AtomicCell.swap()? Isn’t that UB?
The text assumed that
ptr
is initialized with a valid object, e.g. by its constructor accepting aninitial
argument. This is consistent with other Rust “cell” types, which also work with a valid (and owned) object, leaving it to the user to specifyOption<T>
if they want nullability.I have now made the assumption explicit by providing a
new
implementation that accepts aninitial: T
to maintain the non-null-ptr invariant.Thanks for the great post! I’ve just recently discovered how useful lock-free concurrency can be, and this is quite helpful.
A couple of thoughts on the implementation:
Your
AtomicCell
leaks memory because of the heap-allocated pointer; it should have aDrop
implementation that converts it back into aBox
so it will be de-allocated properly.Converting the pointer back into a
Box
, de-allocating, and re-allocating every time the value is swapped seems like a waste. I’d personally convert it back into aBox
, replace the value withmem::replace
, and then leak the box again. If you’re confident that the behavior is correct,ptr::replace
could probably do all of that with a single statement, but I’d prefer using the actual types involved and let the compiler optimize it.