The stable HashMap trap

You read about faster hash functions and switch to one. Most of your code gets the expected speed boost, but some parts mysteriously get slower – much slower, especially when dealing with large hashmaps. If this sounds familiar, you might have encountered the stable HashMap trap.

Background

Rust’s HashMap is an implementation of Google’s SwissTable. While the table itself provides excellent performance, the SipHash hash function used by default is designed to protect against a large class of DoS attacks, at the price of some performance. When you care about performance and maliciously crafted keys are not an issue, switching to a more efficient hash function can boost performance by a significant amount. Rust makes it fairly easy to swap out the hash function, and there are crates that offer high-performance hash functions. The Rust Performance Book recommends, among others, the brutally effective FxHasher from the rustc-hash crate and the higher-quality ahash crate. This recommendation, however, comes at a price that is rarely if ever mentioned – some O(n) hash map operations, including deserialization, degrade to O(n²)!
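
As a quick illustration, switching hashers is just a matter of naming a different hasher builder in the map’s type. Here is a sketch using rustc-hash – FxHashMap is an alias the crate itself provides for essentially the spelled-out type, though the exact hasher-builder type may differ between crate versions:

use std::collections::HashMap;
use std::hash::BuildHasherDefault;
use rustc_hash::{FxHashMap, FxHasher};

// spelling out the hasher builder by hand...
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;
// ...or using the crate-provided alias
type FastMap2<K, V> = FxHashMap<K, V>;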

The issue

Since the issue is easy to stumble upon when deserializing HashMaps, let’s use that as an example. We’ll construct a HashSet for simplicity, but HashMap would exhibit the same behavior. This code creates a set of 10m random values, serializes it to a Vec<u8> using bincode, and deserializes it back into a HashSet:

use std::time::Instant;
use rand::Rng;

type TestHashSet<K> = std::collections::HashSet<K>;
//type TestHashSet<K> = rustc_hash::FxHashSet<K>;

fn main() {
    let mut rnd = rand::thread_rng();
    let nums: Vec<_> = (0..10_000_000).map(|_| rnd.gen::<u64>()).collect();

    let t0 = Instant::now();
    let h: TestHashSet<u64> = nums.into_iter().collect();
    let t1 = Instant::now();
    println!("create: {}", (t1 - t0).as_secs_f64());

    let out = bincode::serialize(&h).unwrap();
    let t2 = Instant::now();
    println!("serialize: {}", (t2 - t1).as_secs_f64());

    let h2: TestHashSet<u64> = bincode::deserialize(&out).unwrap();
    let t3 = Instant::now();
    println!("deserialize: {}", (t3 - t2).as_secs_f64());

    println!("{}", h2.len());
}

Output on my machine:

create: 1.440201369
serialize: 0.071765342
deserialize: 1.114031976

Now let’s switch to fxhash to speed it up. We only need to point the TestHashSet alias to rustc_hash::FxHashSet<K> and we’re good to go. Note that FxHashSet is itself an alias for std::collections::HashSet whose hasher is FxHasher, so it only affects the hash function, not the implementation of the table. With that change, I get this output:

create: 1.391839734
serialize: 0.081116361
deserialize: 5.2052695

Creation is slightly faster, serialization takes about the same time, but deserialization is almost 5x slower than with the stdlib hash! Tested against different set sizes, the deserialization time is close to quadratic – it takes 5s with 10m elements, 20s with 20m elements, 69s with 40m elements, and 219s with 80m elements. We’ve seen the trap.

When does it happen?

First, let’s make it clear that the problem has nothing to do with serde or bincode. Serde just happens to create the new set from values obtained by iterating the original one. The same slowdown can be easily reproduced without serde:

// just as slow as the serde version when TestHashSet is FxHashSet
let mut h2: TestHashSet<u64> = Default::default();
for &n in &h {
    h2.insert(n);
}

Second, this is not a consequence of fxhash being of lower quality or hitting some bad edge case. The same quadratic behavior can be observed with any other stable hash function, including the default SipHash when its randomization is omitted. To try it, define TestHashSet<K> as std::collections::HashSet<K, BuildHasherDefault<std::collections::hash_map::DefaultHasher>> (quite a mouthful), which uses the default hash function with a non-random initial state – you’ll get a similar slowdown.
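
Spelled out, that alias looks like this – DefaultHasher::default() creates a SipHash hasher with a fixed initial state, so every table hashes identically:

use std::collections::hash_map::DefaultHasher;
use std::hash::BuildHasherDefault;

// SipHash, but with a fixed seed instead of the usual per-table random one
type TestHashSet<K> = std::collections::HashSet<K, BuildHasherDefault<DefaultHasher>>;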

The third thing to be aware of is that you can’t reproduce the issue with a simple clone() or collect() – things like let h2 = h.clone() and let h2: TestHashSet<u64> = h.iter().copied().collect() will be O(n), and almost 10x faster with fxhash than with the default hasher. That’s because both clone() and collect() pre-allocate a table of the right size, which avoids the degradation, just like an explicit with_capacity() does. To reproduce the issue, you need to start with a HashSet that must resize at least once to accommodate the data, such as an empty one.

This is why serde hits it – although serde stores the size of every container as a “size hint”, it also caps this hint to a low number during deserialization, for security reasons: otherwise a tiny malicious input could claim a huge element count and trick the deserializer into allocating arbitrary amounts of memory.

In summary, to reproduce quadratic complexity of construction, all of the following conditions must be met:

  • the HashMap/HashSet must resize to accommodate the values we’re inserting
  • it must use a stable hash function
  • it must be built from keys in the iteration order of another HashMap/HashSet that uses the same hash function

Why does it happen?

This bug is thoroughly described in this 2016 article, which I encourage you to read to understand it. In short: with a stable hash function and power-of-two table sizes, a key’s bucket index is determined by the low bits of its hash, so iterating a table yields keys ordered by bucket. When those keys are copied from a larger table into a smaller one, the keys from the first half of the larger table land in consecutive locations, densely packing the smaller table. The keys from the second half of the larger table then have to fit between the keys which were already densely packed, which forms large collision chains. Although this is a temporary situation, fixed on the next resize, it affects performance visibly. But do refer to the article for details.

The article mentions Robin Hood hashing, which Rust’s hash tables no longer use, but don’t let that throw you off. Although the implementation has switched to a different probing strategy, it still uses open addressing with power-of-two hash table sizes.

The quadratic behavior was fixed for the default hasher by introducing randomization, but the issue is still open because the underlying bug is still present, visible with non-randomized hashers, as in the above code.

Workarounds

All fixes for this bug boil down to shuffling the items in order to avoid inserting them in an unfavorable order. Since the actual bug that leads to the issue is in the implementation of the hash map, I refer to these as workarounds rather than proper fixes.

I do hope this bug gets more visibility and gets fixed, as it hurts people who try to use fxhash to get a faster HashMap, and it’s not immediately obvious what’s going on. In general, a nice property of hash tables is that, unlike unbalanced binary trees or quicksort, they’re normally not sensitive to key order. It’s a bit of a shame that that doesn’t apply to Rust’s hash tables with deterministic hashers.

Pre-allocate the HashSet/HashMap

This is by far the simplest fix when you don’t control the exact type of the map, but do control how it is created. In the non-serde case, that means initializing the h2 hash set with its capacity given upfront:

// fast because h2 is of the correct size to begin with
// note: can't just call with_capacity() because it assumes RandomState from stdlib
let mut h2: TestHashSet<u64> = TestHashSet::with_capacity_and_hasher(h.len(), Default::default());
for &n in &h {
    h2.insert(n);
}

This can be made to work with serde, but requires writing a custom deserializer.
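
For reference, here is a minimal sketch of what such a deserializer could look like. The helper name is hypothetical, and it deliberately trusts the full length reported by the format through SeqAccess::size_hint() (which bincode fills in from the encoded length) – that is only safe for trusted input, since serde’s cap exists precisely to guard against hostile lengths:

use std::collections::HashSet;
use std::fmt;
use std::hash::{BuildHasher, Hash};
use std::marker::PhantomData;

use serde::de::{Deserialize, Deserializer, SeqAccess, Visitor};

fn deserialize_full_capacity_set<'de, D, T, S>(deserializer: D) -> Result<HashSet<T, S>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de> + Eq + Hash,
    S: BuildHasher + Default,
{
    struct SetVisitor<T, S>(PhantomData<(T, S)>);

    impl<'de, T, S> Visitor<'de> for SetVisitor<T, S>
    where
        T: Deserialize<'de> + Eq + Hash,
        S: BuildHasher + Default,
    {
        type Value = HashSet<T, S>;

        fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
            f.write_str("a sequence of set elements")
        }

        fn visit_seq<A: SeqAccess<'de>>(self, mut seq: A) -> Result<Self::Value, A::Error> {
            // pre-allocate using the full size hint rather than serde's capped one
            let mut set =
                HashSet::with_capacity_and_hasher(seq.size_hint().unwrap_or(0), S::default());
            while let Some(elem) = seq.next_element()? {
                set.insert(elem);
            }
            Ok(set)
        }
    }

    deserializer.deserialize_seq(SetVisitor(PhantomData))
}

A field can then opt into it with #[serde(deserialize_with = "deserialize_full_capacity_set")].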

Shuffle the items

Shuffling the elements also fixes the issue. For example:

// fast because h2 is built of elements which are not in iteration order of h
use rand::seq::SliceRandom; // for shuffle() on slices and Vecs

let mut elems: Vec<_> = h.iter().copied().collect();
elems.shuffle(&mut rnd);
let mut h2: TestHashSet<u64> = TestHashSet::default();
for n in elems {
    h2.insert(n);
}

This is a good “quick fix” when you don’t control how the table is constructed, but do control the data that flows into it. It is not that easy to apply to the serde case, where it would require a custom serializer.

Randomize the hasher

This is the most general solution, and is how the issue was worked around in the standard library, which uses the randomized RandomState to build the default hasher.

If you use another hasher, you just need to make sure you’re using a randomizing hasher builder. (AHashMap does this by default, so you don’t need to do anything special there.) If your hasher doesn’t provide a random hasher builder, you can always create one using the public API. For example, for FxHasher we can create an FxRandomState that implements BuildHasher:

use std::hash::{BuildHasher, Hasher};
use rustc_hash::FxHasher;

#[derive(Copy, Clone, Debug)]
pub struct FxRandomState(usize);

impl BuildHasher for FxRandomState {
    type Hasher = FxHasher;

    fn build_hasher(&self) -> FxHasher {
        // seed the hasher by mixing the per-table random state into it
        let mut hasher = FxHasher::default();
        hasher.write_usize(self.0);
        hasher
    }
}

Ideally build_hasher() would just return FxHasher::with_seed(self.0), but with_seed() isn’t provided, so we get equivalent functionality by creating a hasher and mixing the seed into it. The remaining part is to implement Default for our hasher builder, which will obtain the actual seed. To make it efficient, we won’t get a random seed every time, but only when a new thread is created. Then within a thread we’ll assign each FxRandomState (and therefore each hash table) a new consecutive seed.

use std::cell::Cell;
use rand::Rng;

impl Default for FxRandomState {
    fn default() -> Self {
        thread_local! {
            // random base seed, obtained once per thread
            static SEED: Cell<usize> = Cell::new(rand::thread_rng().gen());
        }
        let seed = SEED.with(|seed| {
            let n = seed.get();
            seed.set(n.wrapping_add(1));
            n
        });
        FxRandomState(seed)
    }
}

With this in place, we can alias TestHashSet<K> to HashSet<K, FxRandomState>, and the quadratic behavior goes away.
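
In terms of the benchmark above, that’s a one-line change to the alias:

type TestHashSet<K> = std::collections::HashSet<K, FxRandomState>;

Each table still hashes with FxHasher, but now starts from a different seed.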

Contention on multi-threaded regex matching

Let’s say you need to match the same regex across a large number of strings – perhaps you’re applying a grep-like filter to data generated or received by your program. This toy example demonstrates it by matching a regex against a hundred million strings:

use std::time::Instant;

use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    static ref IS_INTEGER: Regex = Regex::new("^[0-9]+$").unwrap();
}

fn main() {
    let strings: Vec<&str> = ["foo", "bar", "1234", "1234foo", ""]
        .into_iter()
        .cycle()
        .take(100_000_000)
        .collect();

    let start = Instant::now();
    let n_ints = strings.iter().filter(|s| IS_INTEGER.is_match(s)).count();
    let elapsed = start.elapsed().as_secs_f32();
    println!("{} {}s", n_ints, elapsed);
}

It’s not a scientific benchmark of regex performance, but it does show some interesting and unexpected effects that are also observed in real-world code. For starters, it takes 2.0s to execute the matches on my laptop.

This is good performance, but it’s a shame to do everything in one thread – let’s try to speed it up by using all cores. This is the kind of thing Rayon makes really easy: just change iter() to par_iter():

use rayon::prelude::*;
let n_ints = strings.par_iter().filter(|s| IS_INTEGER.is_match(s)).count();

Surprisingly, this takes 6-8s to execute on a system with 4 physical cores. In other words, instead of being 3-4x faster thanks to running on 4 cores, it’s 3-4x slower. A very similar slowdown occurs on a system with 32 physical cores, where the time grows from 2.05s to 8.2s.

This result can’t be chalked up to an inefficiency in Rayon, as the same slowdown is observed when dividing the work among threads in other ways. Could it be that the compiled regex is causing contention when accessed from multiple threads?

When this was first suggested in discussions with coworkers, it seemed quite unlikely, as contention would imply that the compiled regex holds a lock or some other form of synchronization. This runs counter to the idea of a compiled regex, which one would expect to be fully constructed by the time Regex::new() returns. Compiled regexes are often placed in lazy_statics and shared by the whole program. But no matter how unlikely, the possibility of contention is easy to test, simply by switching from lazy_static! to thread_local!:

thread_local! {
    static IS_INTEGER: Regex = Regex::new("^[0-9]+$").unwrap();
}

The match now needs an additional closure to access the thread-local, but is still quite readable:

use rayon::prelude::*;
let n_ints = strings
    .par_iter()
    .filter(|s| IS_INTEGER.with(|is_integer| is_integer.is_match(s)))
    .count();

Continuing the surprise, this takes 0.66s to run, which is 3x faster than the single-threaded version – the kind of speedup one might realistically expect from a 4-core computer. On the 32-core server, it takes 0.086s, a 24x speedup.

So, regex matching does have a contention issue. The compiled regex type, Regex, wraps the internal Exec type, which holds ProgramCache values organized in a Pool that keeps them behind a mutex. Accessed from a single thread, or from multiple threads at different times, this mutex is cheap to acquire and is held for a very short time. But under strong contention it becomes a bottleneck, with attempts to acquire it falling back to OS-level waits, causing performance to degrade.

The regex crate’s PERFORMANCE.md dedicates a whole section to this issue. The text is nuanced, so instead of quoting parts of it, I encourage you to go ahead and read it. It warns of the performance impact and shows how to eliminate it (using a slightly different approach than the one taken above), but it also says that it’s “supported and encouraged” to define regexes using lazy_static! and use them from multiple threads. It explains that, despite expectations, a compiled regex doesn’t contain everything needed to match it – some of the compiled state is built lazily while executing particular kinds of search, and is later reused for searches of the same kind. The mutex protects the “scratch space” used for those updates. Design options and tradeoffs are discussed in this issue in more detail.

In summary, for most use cases it’s perfectly fine to use the same compiled regex from multiple threads. But if you have code that does heavy regex matching from multiple threads, and does most or all of it on one regex, you’ll almost certainly want to give different instances of the compiled regex to different threads. As always, be sure to measure performance before actually making such a change to the code.
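
If thread_local! is inconvenient, Rayon’s map_init() offers another way to give each worker its own instance. The sketch below rests on an assumption about regex internals – that cloning a Regex is cheap and gives each clone its own mutable scratch space while sharing the read-only compiled program – so verify it against your regex version before relying on it:

use rayon::prelude::*;
use regex::Regex;

fn count_ints(strings: &[&str]) -> usize {
    let is_integer = Regex::new("^[0-9]+$").unwrap();
    strings
        .par_iter()
        // assumption: each clone carries its own cache, so workers don't
        // contend on a shared mutex
        .map_init(|| is_integer.clone(), |re, s| re.is_match(s) as usize)
        .sum()
}

Note that map_init() may invoke the init closure more than once per thread as Rayon splits the work, which is fine here because the clone is cheap.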