Let’s say we’re building a simple table indexed by integers starting with 0. Although the keys are contiguous, the table is loaded from key-value pairs that arrive in arbitrary order. The number of elements is not known in advance and can be anywhere from a handful up to a hundred million, but is expected to fit in working memory, given a reasonable representation. Due to these constraints the table loader naturally represents the data as Vec<Option<Payload>>. It starts out empty and sports a set() method that fills in the payload at the given position, resizing the table as needed:
/// Called during loading, place `val` at `pos`
fn set(&mut self, pos: usize, val: Payload) {
    if pos >= self.data.len() {
        self.data.resize_with(pos + 1, Default::default);
    }
    self.data[pos] = Some(val);
}
The table payload is comparatively expensive to compute (it is deserialized from JSON), so that’s done from multiple worker threads, and set() must be thread-safe. The obvious approach is to place the whole table under a lock, using the fantastic parking_lot crate to provide the implementation. Under a mutex, data would be a Mutex<Vec<Option<Payload>>>:
fn set(&self, pos: usize, val: Payload) {
    let mut data = self.data.lock(); // no unwrap() because parking_lot is awesome!
    if pos >= data.len() {
        data.resize_with(pos + 1, Default::default);
    }
    data[pos] = Some(val);
}
/// Called once loading is done to extract the data
fn into_data(self) -> Vec<Option<Payload>> {
    self.data.into_inner()
}
This simple solution has a serious problem – now the whole of set() operates with the big lock acquired. On a machine with 32 physical cores, there is a lot of contention around calls to set(), and the worker threads end up spending about 80% of their time idly waiting for the lock. The worst part is that the coarse lock is only really needed to resize the table, which occurs very rarely, because just one large key immediately removes the need for resizes for all smaller keys. The large majority of calls to set() just access non-overlapping slots in data and have no real need for locking.
Fortunately parking_lot’s Mutex is so incredibly well crafted that it only costs one (1) byte of space. Even if it ends up taking up more space due to padding, just compare that to a POSIX mutex, which requires 40 bytes on x86-64 Linux and must be heap-allocated because it’s UB to move it after initialization! With parking_lot, fine-grained locking is quite feasible, even on tables with hundreds of millions of elements. Individual locks will be uncontended and therefore acquirable without system calls, and virtually free compared to the big lock. If we knew the table size in advance, we’d represent it as Vec<Mutex<Option<Payload>>> and be done with it.
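For illustration, here is a minimal sketch of what that fixed-size variant might look like if the element count were known up front. Payload stands in for the deserialized payload type, and with_len() is a made-up constructor:

use parking_lot::Mutex;

// Stand-in for the deserialized payload type.
struct Payload;

struct Table {
    // One tiny mutex per slot; the Vec itself never changes size.
    data: Vec<Mutex<Option<Payload>>>,
}

impl Table {
    /// Hypothetical constructor, only possible because the size is known in advance.
    fn with_len(len: usize) -> Self {
        Table {
            data: std::iter::repeat_with(|| Mutex::new(None)).take(len).collect(),
        }
    }

    /// Each call locks only its own slot, so threads writing to
    /// different positions never contend.
    fn set(&self, pos: usize, val: Payload) {
        *self.data[pos].lock() = Some(val);
    }
}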
However, we still need some kind of lock around the whole table as well, in order to be able to resize it when necessary. Since resizing only happens occasionally, we can use a RwLock, acquiring a shared read lock to access the table and its individual row (which has its own mutex for updating), and requesting an exclusive write lock only to resize the whole table. The type of data now becomes RwLock<Vec<Mutex<Option<Payload>>>>. (Don’t be scared by the piled-up generics, just read them in the order of appearance: a read-write lock containing a vector of mutexes which protect optional payloads.)
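Spelled out as a struct, the shape would look roughly like this (Payload again being a placeholder, with the field named data as in the article’s snippets):

use parking_lot::{Mutex, RwLock};

// Stand-in for the deserialized payload type.
struct Payload;

struct Table {
    // The outer RwLock guards the Vec as a whole (needed for resizing);
    // the inner Mutexes guard the individual slots.
    data: RwLock<Vec<Mutex<Option<Payload>>>>,
}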
To top it off, the parking_lot README advertises a feature of its RwLock that looks exactly like what we need:
RwLock supports atomically upgrading an “upgradable” read lock into a write lock.
Perfect – we acquire the read lock for the whole table, and only upgrade it to a write lock if a resize is needed. After the resize, we downgrade it back to a read lock and proceed. With resizing taken care of, we access the inner mutexes through the shared read lock, acquire the finer-grained mutex without any contention, and store the data. Here is an implementation of that idea:
// XXX: looks correct, but doesn’t perform as hoped – see below
pub fn set(&self, pos: usize, val: Payload) {
    let data = self.data.upgradable_read();
    if data.len() <= pos { // need to resize the table
        let mut wdata = RwLockUpgradableReadGuard::upgrade(data);
        wdata.resize_with(pos + 1, Default::default);
        let data = RwLockWriteGuard::downgrade(wdata);
        *data[pos].lock() = Some(val);
    } else {
        *data[pos].lock() = Some(val);
    }
}
fn into_data(self) -> Vec<Option<Payload>> {
    self.data.into_inner().into_iter().map(Mutex::into_inner).collect()
}
However, benchmarking this code in production shows no improvement whatsoever compared to the version with one big mutex. What’s going on?
After spending a lot of time trying to figure out where else in the program there could be a lock causing contention (including investigating the allocator, among other suspects that turned out to be innocent), I finally came back to this code and carefully reviewed the documentation of upgradable_read, which says the following:
Locks this rwlock with upgradable read access, blocking the current thread until it can be acquired. The calling thread will be blocked until there are no more writers or other upgradable reads which hold the lock. There may be other readers currently inside the lock when this method returns.
So, an upgradable read locks out other upgradable reads, even if they never exercise their option to upgrade the lock! Since every access to the table from set() is an upgradable read, this made it behave just like a mutex. I feel that this should be stressed more prominently in the docs, as it is easy to miss and makes upgradable reads useless for this scenario.
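Here is a minimal sketch that demonstrates the exclusion in isolation (the timing trick is a made-up illustration, not part of the loader):

use std::{sync::Arc, thread, time::{Duration, Instant}};
use parking_lot::RwLock;

fn main() {
    let lock = Arc::new(RwLock::new(()));
    let start = Instant::now();
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                // Hold an upgradable read for 100 ms without ever upgrading it.
                let _guard = lock.upgradable_read();
                thread::sleep(Duration::from_millis(100));
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Plain read() guards would overlap and finish in ~100 ms; upgradable
    // reads exclude each other, so this prints roughly 200 ms.
    println!("elapsed: {:?}", start.elapsed());
}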
The correct way to implement set() is to manually release the read lock and acquire a write lock:
pub fn set(&self, pos: usize, val: Payload) {
    let mut data = self.data.read();
    if data.len() <= pos {
        // "upgrade" the lock
        drop(data);
        let mut wdata = self.data.write();
        // check that someone else hasn't resized the table in the meantime
        if wdata.len() <= pos {
            wdata.resize_with(pos + 1, Default::default);
        }
        // now "downgrade" it back again
        drop(wdata);
        data = self.data.read();
    }
    *data[pos].lock() = Some(val);
}
The upgrade and the downgrade are of course no longer atomic, but we don’t really care – we just perform another check after the upgrade in case another thread beat us to the punch and resized the table. And most importantly, this version of the code utilizes all the available cores, bringing load times down from 7 minutes to 1m45s to parse and load ~70 GiB of JSON data. (Just going through all the data with cat takes 32s, so this is pretty good.)
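For completeness, here is a rough sketch of how worker threads might drive set() during loading. load_parallel(), Table::new(), and deserialize() are made-up names; the sketch assumes the Table type with the set() and into_data() methods shown above:

use std::thread;

// Stand-ins: Payload for the deserialized value, deserialize() for the
// expensive JSON parsing done by the workers.
struct Payload;

fn deserialize(_raw: &str) -> Payload {
    Payload
}

fn load_parallel(pairs: &[(usize, String)], workers: usize) -> Vec<Option<Payload>> {
    let table = Table::new(); // hypothetical constructor for an empty table
    thread::scope(|s| {
        let table = &table;
        // Split the input among workers; each worker deserializes its share
        // outside any lock and only briefly locks inside set().
        for chunk in pairs.chunks(pairs.len().div_ceil(workers).max(1)) {
            s.spawn(move || {
                for (pos, raw) in chunk {
                    table.set(*pos, deserialize(raw));
                }
            });
        }
    });
    table.into_data()
}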
The immediate moral of the story is: be careful when adopting non-standard features of fundamental APIs like synchronization primitives. The documentation stated the behavior clearly, and quite a bit of debugging would have been avoided by reading it in advance. Even an otherwise excellent library like parking_lot sometimes violates the principle of least surprise.
The deeper lesson is that fearless concurrency, and the more general “if it compiles, it works” principle, have their limits. Rust has spoiled us by providing such a nice experience with multi-threading that it’s easy to be genuinely surprised when a program that successfully compiles ends up misbehaving in a non-obvious way.