In the previous installment we showed that while atomic types provided by the Rust standard library can be used for lock-free access to shared values, memory reclamation must be ensured manually because Rust’s normal scoping rules do not cleanly map to lock-free concurrency.
Crossbeam
The problem of memory reclamation in lock-free data structures is not unique to Rust; it is shared by other languages without garbage collection, most notably C++. Different solutions have been proposed, sporting exotic names such as quiescent-state-based reclamation, epoch-based reclamation, and hazard pointers. See Tom Hart’s thesis for an extensive description of the memory reclamation strategies and an analysis of their benefits and drawbacks.
In Rust the currently favored strategy is the epoch-based memory reclamation, a clever scheme that keeps track of objects marked for destruction in three thread-local bins. Each bin corresponds to an “epoch”, somewhat similar to a GC generation. When a thread becomes active, i.e. it is expected to start executing lock-free code, it locally stores the current epoch number (0-2) and uses it until deactivated. During this period, objects slated for destruction will be registered in the corresponding bin. A thread performing GC will first check whether all currently active threads belong to the current epoch. If that is the case, it means that there are no threads remaining from the previous epoch, and the epoch number is atomically bumped (incremented modulo 3). The thread that succeeds in incrementing the epoch proceeds to destroy objects from the bin of two epochs ago. For example, the thread that increments epoch from 1 to 2 can at that point safely destroy objects in bin 0. Objects in bin 1 cannot yet be destroyed because the epoch was just switched from 1 to 2, and there can still be active threads from epoch 1. But no new epoch 1 threads are being activated, and as soon as existing ones deactivate, all active threads will have been from epoch 2. At this point it will be safe to bump the epoch to 0 and drop objects from bin 1.
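To make the bookkeeping more concrete, here is a minimal sketch of the epoch-advance step. It is purely illustrative: the names (EPOCH, active_epochs, bins) and the representation of garbage as boxed destructor closures are assumptions made for this example, not how Crossbeam actually implements the scheme.

use std::sync::atomic::{AtomicUsize, Ordering};

// Global epoch counter, cycling through 0, 1, 2.
static EPOCH: AtomicUsize = AtomicUsize::new(0);

/// Attempt to advance the global epoch and collect old garbage.
/// `active_epochs` holds the epoch each currently pinned thread recorded when it
/// was activated; `bins` holds deferred destructors, one bin per epoch.
fn try_collect(active_epochs: &[usize], bins: &mut [Vec<Box<dyn FnOnce()>>; 3]) {
    let current = EPOCH.load(Ordering::SeqCst);
    // The epoch may only advance once every active thread has observed the
    // current epoch, i.e. no thread pinned in the previous epoch remains active.
    if active_epochs.iter().all(|&e| e == current) {
        let next = (current + 1) % 3;
        if EPOCH
            .compare_exchange(current, next, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
        {
            // Garbage queued two epochs before `next` can no longer be reached
            // by any active thread, so it is safe to destroy now.
            for destroy in bins[(current + 2) % 3].drain(..) {
                destroy();
            }
        }
    }
}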
The nice thing about epoch-based memory reclamation is that it is a good fit for libraries, since it can be fully embedded inside the code that, say, implements a lock-free queue, without the rest of the application having to know anything about it. Rust’s implementation of epoch-based memory reclamation is provided by the Crossbeam crate. Aaron Turon’s original blog post is an excellent read on the topic, describing both Crossbeam and epoch-based memory reclamation in some detail using the classic Treiber’s stack as an example.
Here is a Crossbeam-based implementation of LazyTransform:
extern crate crossbeam;

use std::sync::atomic::{AtomicBool, Ordering};
use crossbeam::epoch::{self, Atomic, Owned, Guard};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Atomic<S>,
    value: Atomic<T>,
    transform_lock: LightLock,  // the light-weight lock from the previous installment
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Atomic::null(),
            value: Atomic::null(),
            transform_lock: LightLock::new(),
        }
    }

    pub fn set_source(&self, source: S) {
        let guard = epoch::pin();
        let prev = self.source.swap(Some(Owned::new(source)),
                                    Ordering::AcqRel, &guard);
        if let Some(prev) = prev {
            unsafe { guard.unlinked(prev); }
        }
    }

    fn try_transform(&self, guard: &Guard) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source_maybe = self.source.swap(None, Ordering::AcqRel, &guard);
            let source = match source_maybe {
                Some(source) => source,
                None => return None,
            };
            let source_data = unsafe { ::std::ptr::read(source.as_raw()) };
            let newval = match (self.transform_fn)(source_data) {
                Some(newval) => newval,
                None => return None,
            };
            let prev = self.value.swap(Some(Owned::new(newval.clone())),
                                       Ordering::AcqRel, &guard);
            unsafe {
                if let Some(prev) = prev {
                    guard.unlinked(prev);
                }
                guard.unlinked(source);
            }
            return Some(newval);
        }
        None
    }

    pub fn get_transformed(&self) -> Option<T> {
        let guard = epoch::pin();
        let source = self.source.load(Ordering::Relaxed, &guard);
        if source.is_some() {
            let newval = self.try_transform(&guard);
            if newval.is_some() {
                return newval;
            }
        }
        self.value.load(Ordering::Acquire, &guard)
            .as_ref().map(|x| T::clone(&x))
    }
}
This version is very similar to the one from the last article based on the imaginary AtomicCell, except that it adapts to the requirements of Crossbeam. Let’s first cover the basics:
- source and value are Atomic, Crossbeam’s equivalent of AtomicCell. An enclosing Option is not needed because Crossbeam’s Atomic is always nullable, with null representing None.
- Before calling Atomic::swap and Atomic::load, the thread needs to be “pinned”, i.e. marked as active within the current epoch. The guard returned by epoch::pin serves as proof that the thread has been pinned, and automatically marks it as inactive when destroyed. A reference to this guard can be passed to helper methods such as try_transform.
- Crossbeam’s Atomic::swap accepts Owned, an object similar to Box that guarantees that the value we’re storing is heap-allocated and owned by the caller (who just transferred that ownership to swap). This is similar to AtomicCell::swap from the last post, except that Crossbeam’s design allows reusing a previously extracted box.
- Methods working with Atomic accept an Ordering argument, with the same meaning as in Rust’s atomic types. The initial test of source requests the least strict Relaxed ordering, which is safe because the source, if non-null, will be re-fetched using a stricter ordering once again in try_transform.
The key feature introduced by Crossbeam lies in the mostly automatic memory management implemented on top of epoch reclamation. Atomic::swap returns a Shared guard which encapsulates the pointer obtained from AtomicPtr::swap and provides safe access to the underlying object, concurrently observable by other threads. The lifetime bound on the returned Shared ensures that it cannot outlive the guard returned by epoch::pin(), preventing the object from being collected while reachable through Shared. Once we are done with the object, we must manually mark it for collection. This is an unsafe operation that Crossbeam cannot attempt automatically because it cannot prove that the retrieved pointer is not still used elsewhere in the data model, for example in a linked list chaining to the pointer. We know no such reference exists, so it’s safe to deallocate the object. Atomic::load is used in exactly the same way, only without the final deallocation.
try_transform extracts the source value published by set_source by calling std::ptr::read, a function that moves the object out of an arbitrary location and returns it by value. After the call to std::ptr::read, the memory where the object resided is treated as uninitialized, and it is left to Crossbeam to free it at a later epoch switch. std::ptr::read is marked unsafe because Rust cannot trace the pointer to prove that we own the object at that location. But since we pass it a location that was freshly swapped out and that set_source will never read again, we know calling std::ptr::read is safe. An unsafe block hiding an unsafe implementation behind a completely safe public API forms the essence of unsafe Rust. A safe function is not only one that uses no unsafe code, but also one that can be called with any kind of argument without incurring undefined behavior.
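As an aside, here is a tiny standalone illustration of what std::ptr::read does, independent of Crossbeam: the value is moved out of its heap location, after which that location must be treated as uninitialized and may only be deallocated, never dropped again.

use std::mem::ManuallyDrop;

fn main() {
    // Heap-allocate a String and take ownership of the raw pointer.
    let raw: *mut String = Box::into_raw(Box::new(String::from("hello")));

    // Move the String out of the heap location. From this point on, the
    // memory behind `raw` must be treated as uninitialized.
    let value: String = unsafe { std::ptr::read(raw) };
    assert_eq!(value, "hello");

    // Release the allocation without running the destructor a second time;
    // ManuallyDrop suppresses the drop of the (now moved-out) contents.
    unsafe {
        drop(Box::from_raw(raw as *mut ManuallyDrop<String>));
    }
}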
This version of LazyTransform satisfies the requirements of the exercise. It is not only lock-free, but also wait-free, because it avoids compare-and-swap retry loops. The size of the LazyTransform object equals the size of two pointers and one bool, and the pointers only ever allocate the amount of memory needed to store S and T respectively. Given the requirements, that is as memory-efficient as it gets.
Coco
Having written and tested the above code, I had expected it to be the final version. However, running some additional test code had a strange effect on my memory monitor: the program was leaking memory, and in large quantities! I had both expected and in prior runs observed the memory to fluctuate due to epoch-based memory reclamation, but this was different. What I observed here was memory consumption monotonically growing for as long as the program was running. Also, the leak could only be reproduced when using a value type that allocates, such as a String. It looked like Crossbeam was simply not dropping the unreachable objects.
Carefully looking at the code, it is obviously inconsistent in its memory management of shared values. set_source simply forgets about the previous value, presumably expecting guard.unlinked(prev) to dispose of it. But try_transform() uses std::ptr::read() to move the source data out of the Crossbeam-managed Owned container, and also calls guard.unlinked afterwards. They cannot both be correct: either guard.unlinked doesn’t drop the underlying object and guard.unlinked(prev) in set_source leaks memory, or it does drop it and guard.unlinked(source) in try_transform results in a double free, because the underlying source_data was moved to transform_fn and dropped there.
I posted a StackOverflow question and, again to my surprise, it turned out that not running destructors was a known limitation of the current Crossbeam. The description of Crossbeam does state that “the epoch reclamation scheme does not run destructors [emphasis in the original], but merely deallocates memory.” This means that Crossbeam’s guard.unlinked(prev) deletes the dynamically allocated storage for T internally created by Atomic<T>, but doesn’t drop the underlying T instance. That works for the lock-free collections supported by the current Crossbeam, which automatically remove items “observed” by the collection user (no peeking is allowed) and transfer ownership of the contained object to the caller, similar to our AtomicCell::swap. Such semantics fit the needs of a queue or stack, but not those of, say, a lock-free map, or even of a simple container such as LazyTransform.
Maintainers of Crossbeam are aware of the issue and are working on a new version which will include many improvements, such as support for full dropping of objects and an improved, tunable garbage collection. A preview of the new Crossbeam design is already available in the form of the Concurrent collections (Coco) crate, whose epoch-based reclamation implements the object dropping we need, and also optimizes epoch::pin.
Switching to Coco finally resolves the memory leak and leads to the following LazyTransform implementation:
extern crate coco;

use std::sync::atomic::{AtomicBool, Ordering};
use coco::epoch::{self, Atomic, Owned, Ptr, Scope};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Atomic<S>,
    value: Atomic<T>,
    transform_lock: LightLock,  // the light-weight lock from the previous installment
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Atomic::null(),
            value: Atomic::null(),
            transform_lock: LightLock::new(),
        }
    }

    pub fn set_source(&self, source: S) {
        epoch::pin(|scope| unsafe {
            let source_ptr = Owned::new(source).into_ptr(&scope);
            let prev = self.source.swap(source_ptr, Ordering::AcqRel, &scope);
            if !prev.is_null() {
                scope.defer_drop(prev);
            }
        });
    }

    fn try_transform(&self, scope: &Scope) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source = self.source.swap(Ptr::null(), Ordering::AcqRel, &scope);
            if source.is_null() {
                return None;
            }
            let source_data;
            unsafe {
                source_data = ::std::ptr::read(source.as_raw());
                scope.defer_free(source);
            }
            let newval = match (self.transform_fn)(source_data) {
                Some(newval) => newval,
                None => return None,
            };
            let prev = self.value.swap(Owned::new(newval.clone()).into_ptr(&scope),
                                       Ordering::AcqRel, &scope);
            unsafe {
                if !prev.is_null() {
                    scope.defer_drop(prev);
                }
            }
            return Some(newval);
        }
        None
    }

    pub fn get_transformed(&self) -> Option<T> {
        epoch::pin(|scope| {
            let source = self.source.load(Ordering::Relaxed, &scope);
            if !source.is_null() {
                let newval = self.try_transform(&scope);
                if newval.is_some() {
                    return newval;
                }
            }
            unsafe {
                self.value.load(Ordering::Acquire, &scope)
                    .as_ref().map(T::clone)
            }
        })
    }
}
Compared to Crossbeam, the differences are minor, and mostly to Coco’s advantage.
Where appropriate, defer_drop is used to drop the object in addition to freeing the memory it occupies. This eliminates the leak. The inconsistency regarding ptr::read is also gone: when ptr::read is used to move the object out of the Coco-managed memory, defer_free is used in place of defer_drop.
epoch::pin no longer returns a guard; it now accepts a closure that will be run with the thread pinned to the current epoch (“active”). This makes no practical difference in our example, but it might reduce the readability of Crossbeam code that embedded flow-control constructs such as return or break inside a pinned block.
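For illustration, here is a contrived sketch (my own, not taken from the benchmark or library code) of that difference: an early return inside the pinned closure exits only the closure, so the result has to be propagated as the closure’s return value.

extern crate coco;

use coco::epoch;

// `return` inside the closure leaves only the closure; `epoch::pin` then
// returns the closure's result, which becomes the function's result.
fn find_first_even(values: &[u32]) -> Option<u32> {
    epoch::pin(|_scope| {
        for &v in values {
            if v % 2 == 0 {
                return Some(v); // exits the closure, not find_first_even
            }
        }
        None
    })
}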
Finally, accessing the value through a shared reference now requires an unsafe block. This is unfortunate, as pinning was explicitly designed to guarantee the safety of such access. The problem was that such access was really safe only when memory orderings were correctly specified. As this was impossible to enforce statically, unsafe was introduced to eliminate a serious soundness issue in the current Crossbeam.
Performance
After taking the trouble to write the code, it makes sense to measure it and see how much of a performance benefit Rust brings to the table. The Java version is admittedly much shorter (although not necessarily easier to devise), because it can rely on a volatile variable to achieve atomic access to an object. Likewise, memory reclamation is a non-issue because it is transparently handled by the GC. But surely this comes at a cost? Even with the advantage GC brings to lock-free code, Rust is a statically typed, ahead-of-time compiled language specifically targeted at systems and high-performance programming.
The benchmark simulates a busy single producer thread that occasionally publishes a randomly generated value, and then remains busy spending CPU for several microseconds. At the same time, 8 consumer threads are continuously reading the transformed and (most of the time) cached value, trivially inspecting it in order to prevent a very clever compiler from optimizing away the whole loop. The whole benchmark is run three times, allowing the JVM to warm up the JIT, and also to make it easier to spot anomalies between runs.
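The actual harness is in the downloadable source linked below; the following is only a rough sketch of its structure. The constants other than PRODUCE_ITERS are illustrative, the payload is assumed to be a String, and LazyTransform is assumed to be shareable across threads (Send and Sync) for that payload.

use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

const CONSUMERS: usize = 8;         // illustrative constants, not the real harness
const READ_ITERS: u64 = 1_000_000;
const PRODUCE_ITERS: u64 = 10_000;

fn main() {
    // Identity transform over a String payload (illustrative).
    let lt = Arc::new(LazyTransform::new(|s: String| Some(s)));

    // Consumers continuously read the (mostly cached) transformed value and
    // trivially inspect it so the loop cannot be optimized away.
    let consumers: Vec<_> = (0..CONSUMERS).map(|_| {
        let lt = Arc::clone(&lt);
        thread::spawn(move || {
            let start = Instant::now();
            let mut checksum = 0u64;
            for _ in 0..READ_ITERS {
                if let Some(v) = lt.get_transformed() {
                    checksum = checksum.wrapping_add(v.len() as u64);
                }
            }
            println!("{} ns per get_transformed (checksum {})",
                     start.elapsed().as_nanos() as u64 / READ_ITERS, checksum);
        })
    }).collect();

    // The producer occasionally publishes a new value, then stays busy
    // burning CPU for several microseconds.
    for i in 0..PRODUCE_ITERS {
        lt.set_source(format!("value #{}", i));
        let busy_until = Instant::now() + Duration::from_micros(5);
        while Instant::now() < busy_until {}
    }

    for c in consumers {
        c.join().unwrap();
    }
}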
To run the benchmark:
- Download the source.
- For the Rust version, build it with cargo build --release and run target/release/bench.
- For the Java version, cd to src/java, byte-compile it with javac *.java, and run it with java Benchmark.
Results
On my 2012 desktop workstation with a 3 GHz Xeon W3550, the Java benchmark reports an average of 7.3 ns per getTransformed invocation. The Rust benchmark reports 128 ns per get_transformed call, a whopping 17 times slower. These timings are in stark contrast with the original Crossbeam article, which documents the lock-free queue implemented in Crossbeam as not only competitive with, but consistently faster than, java.util.concurrent.ConcurrentLinkedQueue. What could explain such a performance difference in this case?
Let’s consider the most common “happy case” for get_transformed, when it simply returns the cached value. The Java version performs the following:

- an atomic load of source with sequentially consistent ordering (the docs define get as having “the memory effects of reading a volatile variable”, which is sequentially consistent in Java);
- if non-null, as it will be in the happy case, an atomic load of transformed.
So we have two atomic loads, a check against null, and looping overhead. The Rust version also performs two loads, a relaxed load of self.source and an acquire load of self.value. However, behind the scenes it additionally does the following:

- pin the epoch;
- check the garbage bins for dead objects;
- clone the cached String value, which allocates;
- in the get_transformed caller, destroy the cloned String, again using the allocator.
For a start, using a String value that Rust clones and Java only returns by pointer would appear to favor Java. Since a typical payload object is expected to be a complex object, it would surely be more efficient to make Payload an Arc<String>. “Cloning” the payload will then only increment a reference count, and string allocations will be eliminated. However, making this change not only fails to pay off, it makes the code even slower, with an average get_transformed invocation now taking 290 ns!
Breakdown
To make sense of this measurement, I decided to strip down get_transformed to its very basics, breaking its contract where necessary, just to see which part takes what time. Here are the findings, now measuring only the “happy case” obtained with PRODUCE_ITERS reduced to 1. Repeating the benchmark showed some variation in the numbers, but not significant enough to change their overall meaning. Keep in mind that the absolute figures were obtained on my old desktop; a modern computer would be significantly faster.
- single unsafe relaxed load of a u64 value: 3 ns
- epoch::pin() + u64 payload: 26 ns
- like the above, but with a payload that allocates, Box<u64>: 74 ns
- Arc<u64> payload: 230 ns
- String payload (3 chars): 95 ns
- String payload (128 chars): 105 ns
- String payload (1024 chars): 136 ns
- Arc<String> payload (any string size): 231 ns
- String payload (2048 chars): 280 ns
Pinning the epoch costs around 23 ns on my machine (the 26 ns measurement minus the 3 ns load and some loop overhead). This is consistent with the documentation cautioning that pinning takes 10-15 ns on a modern computer. Pinning is likely the only extra work done here, as no allocation is needed and the thread-local garbage bins are empty. The u64 payload we’re cloning is Copy, so its clone() just loads the primitive value. No garbage is generated in the “happy case” because neither the source nor the cached value is written to, only read.
One surprising finding is that atomic reference counting is expensive, especially so when there is high contention over access to the object. It is no wonder that Rust opted to implement a separate single-threaded reference-counted type: using atomics adds a large overhead to Arc::clone compared to Rc::clone. (This is also confirmed by benchmarking that comparison on its own.) Compared to the cost of cloning an Arc, string allocation and copying are fantastically optimized. It takes strings of almost 2 kilobytes for String::clone to match the cost of a contended Arc::clone. Most surprisingly, it turns out that a heap allocation is actually cheaper than incrementing and decrementing an atomic reference count. The allocation time can be obtained by subtracting the u64 timing from the Box<u64> one, which pegs allocation at under 50 ns in this 8-thread scenario. jemalloc’s segmented locking seems very successful here.
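A crude single-threaded sketch of that separate comparison could look like the following (illustrative only, not the code used for the measurements above; under contention from multiple threads the gap grows further):

use std::rc::Rc;
use std::sync::Arc;
use std::time::Instant;

fn main() {
    const N: u32 = 10_000_000;

    // Non-atomic reference counting.
    let rc = Rc::new(0u64);
    let start = Instant::now();
    let mut acc = 0usize;
    for _ in 0..N {
        let clone = Rc::clone(&rc);
        acc = acc.wrapping_add(Rc::strong_count(&clone)); // keep the clone observable
    }
    println!("Rc::clone:  {:?} ({})", start.elapsed(), acc);

    // Atomic reference counting.
    let arc = Arc::new(0u64);
    let start = Instant::now();
    let mut acc = 0usize;
    for _ in 0..N {
        let clone = Arc::clone(&arc);
        acc = acc.wrapping_add(Arc::strong_count(&clone));
    }
    println!("Arc::clone: {:?} ({})", start.elapsed(), acc);
}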
It would appear that this kind of micro-benchmark favors GC-backed languages, although it’s hard to tell by how much. It would be interesting to extend it to include some sort of processing and test whether the cumulative cost of garbage collection elsewhere in the program tips the scale.
Conclusion
Implementing the exercise was an interesting learning experience in both Rust and lock-free programming. The material presented here of course only scratches the surface of the topic. Jeff Preshing’s articles provide a much more in-depth treatment and further references.
Within Rust, Crossbeam and its successor Coco provide a convenient way to implement custom lock-free algorithms. This convenience does come at a cost: the mere cost of pinning the epoch would make Rust’s get_transformed fast path 3-4 times slower than the equivalent Java. The real challenge comes when sharing objects among threads. The timings show that lock-free Rust requires very careful allocation design, as the cost of memory management can easily dwarf the lock-free operations that were chosen for their efficiency in the first place. Specifically, Arc is not a panacea and can even add significant overhead to performance-sensitive designs. If access to a reasonably small object is needed, it may actually be more efficient to clone the object than to expose it through Arc. If some portion of a large object needs to be accessed, it may be more efficient to temporarily expose a reference to the object to a closure provided by the caller, which can then pick up the information it needs. Whatever solution is chosen, there does not appear to be a silver bullet that would fit all kinds of objects.
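As an illustration of that last option, the Coco-based LazyTransform above could be extended with a method along these lines (hypothetical, not part of the exercise’s code), which lets the caller inspect the cached value in place while the epoch is pinned instead of cloning it out:

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    /// Hypothetical addition: run `f` on a reference to the cached value, if there
    /// is one, without cloning it. The reference is only valid while the epoch is
    /// pinned, i.e. for the duration of `f`.
    pub fn with_transformed<R, F2: FnOnce(&T) -> R>(&self, f: F2) -> Option<R> {
        epoch::pin(|scope| unsafe {
            self.value.load(Ordering::Acquire, &scope)
                .as_ref()
                .map(f)
        })
    }
}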
Both Crossbeam and Coco require unsafe in a couple of key places, so they are not as elegant as the statically checked lock-based design offered by the standard library. The libraries themselves are not to blame here; it is a hard problem that might require additional research, and possibly even support from the language, to resolve satisfactorily. This area is still under active research, especially in the wider C++ community, and it will be interesting to follow how it develops.
What isn’t considered in the benchmark results is the overall cost amortized over time with GC overhead included, and the consistent latency versus the sporadic latency that GC would introduce. That is the trade-off being made.
The benchmark was designed so that the producer generates enough garbage to occasionally trigger GC. As the final time is calculated by dividing the gross run time by the number of iterations, it should actually include the amortized cost of GC.