Faster sorting with decorate-sort-undecorate

Decorate-sort-undecorate is a simple sorting pattern known for decades, which was additionally popularized by Randall Schwartz in the context of Perl, earning it the moniker Schwartzian transform. While I’ve been aware of the pattern for many years, I only recently realized just how relevant it still is for optimizing everyday sorts.

Consider this code that sorts JSON objects by a field:

/// Sorts a slice of JSON objects by the "age" field.
fn sort_by_age(v: &mut [serde_json::Value]) {
    v.sort_by_key(|o| o["age"].as_i64());

// usage:
fn main() {
    let mut v = vec![
        json!({"name": "John", "age": 25}),
        json!({"name": "Abel", "age": 18}),
    sort_by_age(&mut v);
    assert_eq!(v[0]["name"], "Abel");
    assert_eq!(v[1]["name"], "John");

The key function here is not doing anything big or controversial, it consists of a single field lookup followed by cheap extraction of integer out of the Value enum – the kind of thing you’d expect a reasonable key function to do. But running this on a larger slice shows that performance isn’t exactly stellar:

fn large_vec() -> Vec<serde_json::Value> {
    std::iter::repeat_with(|| json!({"name": "John", "age": rand::thread_rng().gen_range(1..=100)}))

fn main() {
    let mut v = large_vec();
    let t0 = Instant::now();
    sort_by_age(&mut v);
    let n = t0.elapsed().as_secs_f64();
    println!("{}", n);

On my laptop, the above code takes 1.01s to sort a slice of 1 million JSON objects. While not exactly a catastrophe, it feels like we should be able to do better than that.

First, we can switch from stable to unstable sort by changing v.sort_by_key(...) to v.sort_unstable_by_key(...). Doing so improves the timing to 0.44s, a more than 2x speed boost. In the use case that inspired this article unstable sort was fine because the comparison function provided fallback ordering when the primary key compared equal, and the initial order was meaningless anyway. We’ll continue to use unstable sort in the rest of the article because that gives best performance, but it’s trivial to revert to stable sort.

Looking at performance further, we notice that accessing a field of a JSON object must perform a hash table lookup. Since a sort has to do n log n comparisons on average, this means that we’re doing multiple lookups per each element. That’s the kind of situation where decorate-sort-undecorate might help, as it caches the extracted keys, reducing the number of lookups to “only” n, at the expense of some memory:

fn sort_by_age(v: &mut [Value]) {
    // extract JSON objects into a new vector, where each object is decorated with age
    let mut decorated: Vec<_> = v
        .map(|o| (o["age"].as_i64(), o.take()))
    // sort with a key function that trivially refers to the pre-fetched age
    decorated.sort_unstable_by_key(|&(age, _)| age);
    // undecorate, i.e. store just the sorted objects back into the slice
    for (dest, (_, o)) in v.iter_mut().zip(decorated) {
        *dest = o;

With this modification, my laptop takes only 0.059s to sort the slice of 1 million objects! This is 7.5x faster than the previous version, and 17x faster than the original stable sort. This is a much more radical speedup than I ever expected, and it’s something to definitely consider when your key function does anything more than access a struct/tuple field.

What is the explanation for such a drastic improvement? Let’s modify the key function to keep track of how many times it gets called:

    let mut cnt = 0u32;
    decorated.sort_by_key(|&(age, _)| {cnt += 1; age});
    println!("{}", cnt);

This shows that the key function is invoked 39 million times to sort 1 million elements. This is explained by log2(1_000_000) being approximately 20, so n log n is around 20 million. And for each comparison the key function needs to be called twice – once for the left and once for the right-hand side of the comparison – which adds up to 40 million. So the decoration cuts down the number of hash lookups from 40 million to 1 million, and that certainly makes a difference. Another thing to take into account is that sorting the decorated slice sorts by comparing integers directly present in the array, which allows inner loops of the sorting algorithm to be nicely inlined. On the other hand, sorting by a hashmap lookup requires call into the complex hash table code, which doesn’t inline nearly as nicely.

One downside is that we had to replace an elegant call to sort_unstable_by_key() with a bit of manual twiddling. Fortunately it turns out that the standard library has a method that does all this for us – slice::sort_by_cached_key(). According to its documentation, it sorts the slice and calls the key function at most once per element, “using temporary storage to remember the results of key evaluation”. That is functionally equivalent to our decorate-sort-undecorate, differing only in some implementation details (it doesn’t extract the original values to decorate them, but uses indices instead). With it sort_by_age() again becomes a one-liner:

fn sort_by_age(v: &mut [Value]) {
    v.sort_by_cached_key(|o| o["age"].as_i64().unwrap());

On my laptop this takes 0.089s to execute, making it 1.5x times slower than the hand-written decorate-sort-undecorate, but still 11x faster than the original naive implementation. The performance difference compared to the manual implementation seems to come from the fact that sort_by_cached_key() implements stable sorting. If we modify the manual decorate-sort-undecorate to use sort_by_key() instead of sort_unstable_by_key(), sort_by_cached_key() will come out as 1.34x faster, which means it’s reasonably optimized.

In conclusion, if you’re sorting by a non-trivial key function, and you have memory to spare, consider replacing sort_by_key() with sort_by_cached_key() and measure the result. If that helps, and you want to squeeze as much performance as possible, you can further optimize by implementing a manual decorate-sort-undecorate as shown above.