All posts by Hrvoje

Contact: hniksic@gmail.com

Exploring lock-free Rust 2: Atomics

The previous article showed a mutex-based implementation of the LazyTransform value container. However, the exercise was explicit about implementing a “lock-free” container. What does that mean, exactly?

Limitations of locks

The interaction between LazyTransform methods was carefully designed to prevent deadlock, but in a busy system one could expect significant contention for the source mutex between get_transformed and set_source. This is not optimal; we would like set_source to just store the new data without any waiting at all, and get_transformed to keep returning the old cached value until a new one is observed.

Consider also the contention for the value read-write lock. After get_transformed performs the transformation, it must acquire the value lock in write mode, which means it must wait until all the readers release it. What is even worse is that once the value lock is finally acquired in write mode, other readers must wait for the writer to release it. Of course, the writer only holds the lock for as long as it takes to store new_value, which is an extremely efficient operation, but all readers are still blocked, and if the writing thread gets suspended by the OS scheduler, all readers will remain blocked until the writer thread resumes. The readers, on the other hand, hold the lock for as long as it takes to clone the value, which is not bounded in any way. What is common to all these cases is that progress of the entire system hinges on the behavior or scheduling of one thread, and this is what lock-free algorithms are designed to avoid.

In a lock-free system, the program will make progress as long as any of its threads are allowed to run. Being lock-free is not just about refraining from using the OS lock primitives, but about carefully designing the system so it cannot lock up, be it by deadlock or livelock or unfortunate thread scheduling patterns. As Jeff Preshing puts it, a lock-free application won’t lock up even with hypothetical thread scheduling decisions being made by your worst enemy.

Another reason to want to avoid mutexes is their cost on object size. On x86_64 Linux, a pthread_mutex_t, internally used by Rust’s Mutex, takes 40 bytes of memory, and a RwLock takes 56. To make Mutex movable (a requirement for all Rust types), the actual OS lock object is dynamically allocated. The smallest possible Rust mutex, a Mutex<()>, takes 16 bytes of memory for the object itself (8 for the pointer to the dynamically allocated OS mutex and 8 more for the poison flag). With each LazyTransform object requiring ~200 bytes of memory and three dynamic allocations just to store the locks, we would not want to have millions of such objects lying around. If we don’t really need the waiting functionality offered by a full-featured Mutex and RwLock, we would do well to eliminate them and significantly reduce the footprint of LazyTransform.

AtomicBool

For a start, let’s eliminate the transform_lock mutex, an easy target as it holds no data and is only ever acquired with try_lock. As used, this lock was already “lock-free” in the above sense, because owning it would never cause another thread to stall.

We could replace it with a bool, but then try_lock would be doomed to a race condition in the window between checking the previous lock state and changing it to “locked”. Not to mention that Rust would see this as a data race and the compiler would reject access to a bool without protecting it with a mutex. Fortunately, a mutex is not the only option for avoiding data races. Modern processors support a number of specialized instructions to read and update values atomically, including one that fetches a machine value and atomically set the new one. In Rust, these instructions are exposed as methods on types in the std::sync::atomic module. The stable channel provides AtomicBool, AtomicUsize, AtomicIsize, and AtomicPtr<T>. For transform_lock, we can use AtomicBool to store a flag indicating whether the lock is currently taken. The resulting try_lock would look like this:

if !self.transform_lock.swap(true, Ordering::Acquire) {
    // lock acquired
    ...
    // release the lock
    self.transform_lock.store(false, Ordering::Release);
} else {
    // failed to acquire the lock
}

swap atomically stores true into the lock and returns the previous value – the operation Java AtomicBoolean calls getAndSet. If the previous value was false, it means it was us that changed the value from false to true and thus successfully acquired the lock. Conversely, if the previous value was true, it means that the lock had been already taken by someone else, and our store of true was a no-op. The Ordering argument tells the compiler which memory order is requested for the operation. The memory order corresponds to guarantees that place constraints on how the compiler and the CPU can reorder data reads and writes, both ordinary and atomic ones, around one particular atomic read/write; see the nomicon for a detailed explanation. For transform_lock we use the acquire and release ordering, so named because of its usefulness in implementing locks. A store with the “release” ordering guarantees that all writes executed prior to the atomic store will be observable by any thread that observes the newly stored atomic value. It is used when releasing a lock by setting the “locked” flag to false with the “release” ordering, which signals to other threads that the lock is now available for the taking, but also ensures that the objects that have been updated while the lock was held are available for reading. Conversely, any code that observes the false value using the “acquire” ordering can be sure that its subsequent load of other locations will also observe the results of all writes executed before the “release” store.

Note that the AtomicBool::swap method operates on the object by &self shared reference. This is another example of the conceptual magic like the one employed by mutexes and described in the previous post. The difference is that with AtomicBool no waiting takes place, so we are not trusted to ever get a mutable reference to the underlying bool. Instead, we tell AtomicBool how to change the object, and it executes the change through hardware mechanisms that are guaranteed to be exclusive, which allows it to declare the mutating methods to take &self.

Here is an implementation using AtomicBool for transform_lock:

use std::sync::{Mutex, RwLock};
use std::sync::atomic::{AtomicBool, Ordering};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Mutex<Option<S>>,
    value: RwLock<Option<T>>,
    transform_lock: AtomicBool,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Mutex::new(None),
            value: RwLock::new(None),
            transform_lock: AtomicBool::new(false),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut locked_source = self.source.lock().unwrap();
        *locked_source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        if !self.transform_lock.swap(true, Ordering::Acquire) {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    self.transform_lock.store(false, Ordering::Release);
                    return new_value;
                }
            }
            self.transform_lock.store(false, Ordering::Release);
        }
        self.value.read().unwrap().clone()
    }
}

Lock abstraction

One disadvantage of directly using an AtomicBool compared to a mutex is that we don’t get a fancy guard that automatically unlocks the lock when going out of scope. Instead, we must remember to unlock it ourselves, which can be difficult to guarantee during panic. A robust implementation would wrap the AtomicBool into a lock object that returns a guard that implements Drop:

struct LightLock(AtomicBool);

impl LightLock {
    pub fn new() -> LightLock {
        LightLock(AtomicBool::new(false))
    }

    pub fn try_lock<'a>(&'a self) -> Option<LightGuard<'a>> {
        let was_locked = self.0.swap(true, Ordering::Acquire);
        if was_locked {
            None
        } else {
            Some(LightGuard { lock: self })
        }
    }
}

struct LightGuard<'a> {
    lock: &'a LightLock,
}

impl<'a> Drop for LightGuard<'a> {
    fn drop(&mut self) {
        self.lock.0.store(false, Ordering::Release);
    }
}

In accordance with Rust conventions, the method that attempts the lock is named try_lock. Our version returns an Option that tells the caller whether taking the lock succeeded and, if so, contains the guard object to automatically unlock it on exit. Rust’s Mutex::try_lock has a more sophisticated return value that allows for other kinds of errors, but the logic of using it is essentially the same.

Using this type, transform_lock gets declared as LightLock and initialized as LightLock::new(). get_transformed automatically becomes panic-safe and also cleaner and more maintainable because it no longer needs to remember to release the lock before every exit point:

    pub fn get_transformed(&self) -> Option<T> {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    return new_value;
                }
            }
        }
        self.value.read().unwrap().clone()
    }

Getting rid of one mutex wasn’t that hard. Can the same logic apply to other locks? transform_lock was distinctive in that it never guarded concrete data, it was only used to serialize access to the transformation of source into value. This ensured that source and value locks, which are actually exclusive in nature, were held as briefly as possible. Once the lock was obtained, the actual protection of source and value from access by different threads was left to their respective locks. To safely access stored values without locks, we will need some way to access the values atomically like we did with the bool stored by the AtomicBool.

Atomic arbitrary values

Ideally we would have at our disposal a generic atomic type, let’s call it AtomicCell<X>, that provides a swap method to atomically modify the shared value and retrieve the old, and load to only read the existing one. With source and value fields declared as AtomicCell<Option<T>> and AtomicCell<Option<S>> respectively, LazyTransform could look like this:

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn set_source(&self, source: S) {
        self.source.swap(source);
    }

    fn try_transform(&self) {
        if let Some(_lock_guard) = self.transform_lock.try_lock() {
            let source_maybe = self.source.swap(None);
            let source = match source_maybe {
                Some(source) => source,
                None => return,
            };
            let newval = match (self.transform_fn)(source) {
                Some(newval) => newval,
                None => return,
            };
            self.value.swap(newval);
        }
    }

    pub fn get_transformed(&self) -> Option<T> {
        let source = self.source.load();
        if source.is_some() {
            self.try_transform();
        }
        self.value.load().as_ref().map(T::clone)
    }
}

get_transformed is now split into two methods: get_transformed, which now only returns the cached value, and a non-public try_transform, which actually attempts the transformation. set_source atomically swaps in the new source, and drops the old one (by virtue of ignoring the return value of swap). try_transform does the exact same thing with the source, and also with value when it obtains a new one. When the old value needs to be read, load() is called to access a reference to the underlying value, which is cloned.

The problem with the above code is that AtomicCell doesn’t exist. Rust’s standard std::sync::atomic module is limited to atomic manipulation of pointer-sized values which correspond to the operands of the CPU’s atomic instructions. Thus the most promising building block for implementing an AtomicCell is the AtomicPtr type, which provides atomic access to a pointer. An AtomicCell<T> would be represented by an AtomicPtr<T> that references T stored dynamically. The dynamic allocation is unfortunate, but unavoidable if we ever want to support T larger than a pointer. Using a pointer, AtomicCell::swap could look like this:

struct AtomicCell<T> {
    ptr: AtomicPtr<T>,
}

impl<T> AtomicCell<T> {
    fn new(initial: T) -> AtomicCell<T> {
        AtomicCell {
            ptr: AtomicPtr::new(Box::into_raw(Box::new(initial)))
        }
    }

    fn swap(&self, new: T) -> T {
        let new_ptr = Box::into_raw(Box::new(new));
        let old_ptr = self.ptr.swap(new_ptr, Ordering::AcqRel);
        unsafe {
            *Box::from_raw(old_ptr)
        }
    }
}

AtomicCell::new simply creates a cell whose ptr points to a valid heap-allocated object, which is AtomicCell‘s invariant. The interesting part happens in swap.

In swap the new object, owned by the local variable received from the caller, is moved to the heap and swapped into AtomicPtr as a raw pointer. The old object is retrieved from the same call to AtomicPtr::swap, also as a raw pointer, immediately converted back to a Box whose contents are moved from it, and returned by value. The ownership of each object is clear at every point. Values stored in the AtomicPtr are (conceptually) owned by the AtomicPtr. When writing a value to the pointer, the ownership over the new value is relinquished to the AtomicPtr using Box::into_raw, and at the same time the ownership of the old value is reasserted with Box::from_raw.

This last part is also the first time we’ve had to use unsafe in the implementation. Box::into_raw is not unsafe because it only gives back a raw pointer, trusting the caller to arrange for the object to be eventually destroyed. Even if the caller fails to uphold their end of the bargain, the worst thing that can happen is a leak, already possible with std::mem::forget. Box::from_raw is unsafe because it reinterprets a raw pointer as a box without any proof that the raw pointer is unreachable from elsewhere or even valid. We, however, know both things to be true. The pointer must be valid because we are only ever passing it pointers created by Box::into_raw, as only such pointers are stored in the AtomicPtr. We also know that the object is not reachable from another thread because we have just obtained it from swap, meaning it has already been disowned by the previous owner. Thus our swap is sound and the unsafe block is a mere implementation detail. Taking ownership is not only safe, but necessary if we want to prevent the old value from leaking.

swap was easy. Now what about AtomicCell::load, which we also need to implement lock-free reading? Based on AtomicPtr::load, a load could be imagined like this:

// doesn't work - see below
fn load(&self) -> T {
    let old_ptr = self.ptr.load(Ordering::Acquire);
    unsafe {
        *Box::from_raw(old_ptr)
    }
}

This code is written in analogy to swap, only without modifying the existing. Unfortunately it is also unsound, which means that it breaks Rust’s ownership rules and causes a crash. The problem is that load takes ownership of the object behind the atomic pointer without also changing the pointer to point to something else. This leads to double free as soon as load is called again to observe the same pointer, or in any subsequent call to swap, which will attempt to take ownership of a freed value. And there is no easy way to fix it, either. For example, suppose we defined load() to return Option<&T>, thus avoiding taking ownership of the observed object. This would prevent double free, but it wouldn’t work on the type level because there is no way to express the lifetime of the returned reference. While multiple loads would no longer be an issue, a swap() could still invalidate the value while a reference is live. Immediately cloning the observed object and returning a new value is also unsound because the reference can be invalidated by a swap before T::clone finishes. The atomic_cell crate resolves the issue using a lock around clone(). Locking is performed with a spinlock, whose lock looks like LightLock::try_lock above, except it retries taking the lock in a busy loop until it succeeds. Despite its efficiency when the lock is uncontended, a spinlock is still a lock and it is not acceptable in the middle of a lock-free algorithm.

Ideally load() would return a guard that provided a reference to the underlying value, much like the guard returned by Mutex::lock. The existence of the guard will prevent writers like swap() from destroying the object. Instead, swap will store the new raw pointer into AtomicPtr and mark the old one for deletion, deferring its actual destruction until all the load guards have been dropped. It is far from obvious how to implement this as efficiently as possible, and of course without heavy synchronization, lest we lose the very benefits lock-free code was supposed to provide.

Lock-free access to arbitrary values with correct memory management is the topic of the next article.

Exploring lock-free Rust 1: Locks

As a learning exercise I set out to implement a simple lock-free algorithm in Rust. It was inspired by a problem posed at job interviews at a company where a friend works. The problem is simple enough that it can be tackled by a beginner, but tricky enough to require some thought to get right – and Rust presents several new challenges compared to the original Java.

This series of articles presents the evolution of a simple Rust lock-free container starting from single-threaded, progressing to a multi-threaded variant with locks, and finally settling on a lock-free implementation, discussing the trade-offs at each step. A basic understanding of Rust and of multi-threading programming is assumed, but the articles might be useful to adventurous beginners at both. Do note, however, that the author is not an expert at lock-free programming, so there might be errors – if you find some, please do leave a comment.

The Exercise

Implement a class providing access to a cached value. In more details:

Write a LazyTransform class that stores a single value, allowing it to be updated as needed. The value is potentially expensive to compute, so the setter method, set_source receives a “source” that will be used to compute the final value using a transformation function received in the LazyTransform constructor. Transformation must not be attempted until requested by get_transformed. Once generated, the value is cached and returned by further invocations of get_transformed, until invalidated by a new call to set_source.

A single-threaded version can be summarized with the following Python:

class LazyTransform:
    def __init__(self, transform_fn):
        self.transform_fn = transform_fn
        self.source = None
        self.value = None

    def set_source(self, new_source):
        self.source = new_source

    def get_transformed(self):
        if self.source is not None:
            newval = self.transform_fn(self.source)
            if newval is not None:
                self.value = newval
                self.source = None
        return self.value

The class must support being called from multiple threads, with the following semantics:

  • set_source and get_transformed can and will be called on the same LazyTransformer instance in parallel;
  • Once set_source completes, future invocations of get_transformed must eventually start returning the new value.
  • Read-heavy usage pattern is expected, so get_transformed must not block regardless of how many times set_source or get_transformed are called in other threads. The one exception is when a new source is detected – it is allowed for get_transformed to block until the transformation finishes before returning the transformed value (and caching it for future calls).
  • The code must be lock-free: neither set_source nor get_transformed should get stuck waiting on each other, even if they are called in quick succession or in parallel by many threads, or both.

Rust API tradeoffs

Before proceeding to parallelization, let’s review how the above interface would map to Rust’s type system. Ideally we’d want to place as few restrictions as possible on the type of values used for the source and the final objects; either could be as simple as a single u32 or a huge heap-allocated object. We know, for example, that both the source and the value type must be Send, because they need to be accessed from threads different from those that create them.

Another necessary restriction is that the final value type must be Clone. Why? Consider how the concept of “returning of cached value”, the return self.value line in the above Python, maps to Rust. In Python the semantics are clear because all of its objects are heap-allocated, and you always get the same instance shared. This is also the specified by the original Java exercise, which returns an Object. But a correct Rust implementation needs to deal with an actual value stored in the cache, and has three options for returning it:

  1. move the object out of the container, typically by making the value field an Option and returning self.value.take();
  2. return a reference to the object, return &self.value;
  3. clone the object and return the cloned value to the caller.

The first option obviously doesn’t work because it would prevent get_transformed to return the cached value more than once. The second option looks feasible until one considers that the returned reference cannot be allowed to outlive the stored value. Since the stored value can be invalidated by a call to set_source, which can happen literally at any time, it is clear that allowing a reference to be returned would be unsound. Indeed, all such attempts are promptly rejected by the borrow checker.

Although cloning at first appears like it would be inefficient for arbitrary objects, it actually provides the greatest flexibility for the user. Light values, such as numeric IDs which are Copy (and hence also Clone), or small strings which are cheap to clone, can be placed in the cache as-is. Heavy values, on the other hand, can be dynamically allocated and accessed as Arc<ActualData>, ensuring that their clone only increments a reference count, providing the semantics one would expect from equivalent Python or Java. If needed, one can even combine the two and store a tuple of a light object and a heavy one.

So, LazyTransform needs to be generic on the value type (T) and the source type (S). But let’s not forget the transformation function received by the constructor. Fixing its type to fn(S) -> T would limit it to stateless functions, and we would like the user to be able to provide an arbitrary closure for transformation. One option would be to accept a generic function object in the constructor and box it in a Box<Fn(S) -> T>, but that would impose a dynamic allocation on each LazyTransform instance, as well as an indirection when invoking the function. If the transformation function is known at compile time and carries no state, it should incur neither storage nor run-time indirection overhead. This is easily achieved by adding a third type parameter, that of the transformation function type.

As a demonstration of the API, here is a single-threaded implementation of the container:

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Option<S>,
    value: Option<T>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: None, value: None,
        }
    }

    pub fn set_source(&mut self, source: S) {
        self.source = Some(source);
    }

    pub fn get_transformed(&mut self) -> Option<T> {
        if let Some(source) = self.source.take() {
            let newval = (self.transform_fn)(source);
            if newval.is_some() {
                self.value = newval;
            }
        }
        self.value.clone()
    }
}

In spirit this is exactly the same thing as the original Python, except sprinkled with a healthy dose of static typing.

Compile-time thread safety

What happens if we try to share an instance of LazyTransform among threads? Rust will prevent that at compile time — invoking a &mut method from multiple threads would require creating multiple &mut references to the same object, which is prevented by the borrow checker. For example, the following doesn’t compile:

fn main() {
    let mut lt = LazyTransform::new(|x: u64| Some(x + 1));
    std::thread::spawn(move || {            // lt moved here
        for i in 0..10_000 {
            lt.set_source(i);
        }
    });
    while lt.get_transformed().is_none() {  // lt used after move
    }
    let val = lt.get_transformed().unwrap();
    assert!(val >= 0 && val < 10_000);
}

lt gets moved into the closure executed by the new thread, but then it is no longer available for use by the main thread. Sending it by reference wouldn’t work because there can exist only one &mut reference to an object, so we wouldn’t be allowed to send the same reference to multiple threads. Allocating LazyTransform dynamically and using Arc to share it among threads wouldn’t help either because Arc only provides shared access to the data it owns.

In Rust, supporting parallel access to a container requires not only a change in implementation, but also in method signatures. This is an intentional design decision – while Python or Java single-threaded code will happily execute when called from multiple threads, providing incorrect results, the Rust version will refuse to compile when the thread-unsafe LazyTransform object is accessed from two threads.

The compiler uses simple rules to decide whether it is safe to share an object between threads:

  1. Methods invoked from more than one thread must accept &self rather than &mut self. This rule is enforced by the borrow checker for single-threaded code as well.
  2. The object must not contain values of types specifically blacklisted for multi-threaded access even through shared references. In Rust terms, its type must “be Sync”, meaning it implements the Sync marker trait, which most objects do. Examples of non-Sync types are Rc or Cell, and both have thread-safe equivalents.

At a glance, the first rule seems to rule out LazyTransform as a multi-threaded type. Both its public methods clearly modify the object, with set_source even doing that in a way that is observable from the outside. Changing the signatures to accept &self instead of &mut self fails to compile because both methods modify the data behind the &self shared reference, which is prohibited. Accessing an object inside &self will also result in further shared references that are read-only.

To modify data, we must find a way to obtain an exclusive mutable reference from the shared reference to self. This is not allowed for ordinary objects because the compiler would have no way to ensure that writes are exclusive, i.e. that while a thread holds a mutable reference to a value, no other thread can read it or write to it. However, if we could statically convince Rust that the reference’s ownership of the data will be exclusive, it would be within the rules to allow the conversion. This is where mutexes come in.

Mutexes

Rust’s Mutex type provides read-write access to the value it protects, using the appropriate operating system primitive to ensure that this can be done by only one thread at a time. Here is an implementation of LazyTransform updated to use a mutex:

use std::sync::Mutex;

struct LazyState<T, S> {
    source: Option<S>,
    value: Option<T>,
}

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    state: Mutex<LazyState<T, S>>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            state: Mutex::new(LazyState { source: None, value: None }),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut state = self.state.lock().unwrap();
        state.source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        let mut state = self.state.lock().unwrap();
        if let Some(new_source) = state.source.take() {
            let new_value = (self.transform_fn)(new_source);
            if new_value.is_some() {
                state.value = new_value;
            }
        }
        state.value.clone()
    }
}

Both methods now operate on &self, relying on the mutex to obtain write access to the data in self.state. As far as method signatures are concerned, this is the final version of the API – all future versions will only differ in implementation.

The storage is now split into transform_fn, which is itself immutable and can be invoked from a shared reference, and state, the mutable part of the object’s state moved to a separate struct and enclosed in a mutex. As can be seen here, Rust’s Mutex is a container that holds and owns the data it protects. While that coupling looks strange at first, it enables the mutex to safely grant read-write access to the data it owns.

Calling Mutex::lock() waits until an exclusive OS lock is acquired, then returns a “guard” object, that both LazyTransform methods store in a local variable called state. The mutex will not be unlocked until the guard goes out of scope. Therefore the existence of a live guard represents a proof that the mutex is locked and therefore provides read-write access to the underlying data.

In Rust’s twist on mutex semantics, the very meaning of the act of locking a mutex is obtaining temporary exclusive write access to its data through a temporary guard object. Despite self being a shared reference, a successful self.state.lock() grants access to &mut LazyState that may last for as long as the mutex is locked (guard exists) and no more. This is the crux of the way Rust prevents data races through static analysis.

Other than the curious mutex design, there is nothing really interesting about the code itself. Once the mutex is locked, both functions do exactly the same thing that their single-threaded counterparts did. While this code is thread-safe in the sense Rust promises, i.e. free from data races, it is still very far from being efficient when invoked in parallel, even ignoring the stringent lock-free requirements. In particular, get_transformed is extremely inefficient in a read-heavy scenario because each call blocks all other calls even when set_source isn’t called at all. When a transformation is in progress, all the other readers are blocked until it is finished.

Fine-grained locking

To minimize the amount of time spent waiting, we can take advantage of the following facts:

  • The methods are operating on two distinct pieces of data, source and value. set_source, for example, doesn’t access value at all. The two fields can be protected with different locks.
  • get_transformed has two distinct modes of operation: a fast one when it only returns the cached value, and the slow one when it detects that the source has changed and it needs to calculate the new value. The vast majority of calls to get_transformed can be expected to belong to the “fast” scenario.

Here is an implementation that uses finer-grained locking to ensure that readers don’t wait for either writers or other readers:

use std::sync::{Mutex, RwLock};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Mutex<Option<S>>,
    value: RwLock<Option<T>>,
    transform_lock: Mutex<()>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Mutex::new(None),
            value: RwLock::new(None),
            transform_lock: Mutex::new(()),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut locked_source = self.source.lock().unwrap();
        *locked_source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        if let Ok(_) = self.transform_lock.try_lock() {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    return new_value;
                }
            }
        }
        self.value.read().unwrap().clone()
    }
}

In this implementation there is no longer a “state” structure protected by a coarse mutex, we are back to individual fields. The source field is protected by its own mutex and the value field is protected by a separate RwLock, which is like a mutex, except it allows read access by multiple concurrent readers that don’t block each other. Finally, a new transform_lock field doesn’t protect any particular piece of data, it serves as something resembling a conventional mutex.

set_source locks the source mutex and replaces the source with the new value. It assigns to *locked_source because locked_source is just the variable holding the guard, and assigning Option<S> to it would be a type error. Since the guard provides automatic access to &mut Option<S>, *locked_source at the left-hand side of the assignment serves to both coerce the guard to &mut Option<S> (returned by guard’s implementation of DerefMut::deref_mut) and at the same time to dereference it, so that the value behind the reference is replaced with the new one.

get_transformed is more sophisticated. It first ensures that only a single call attempts to interact with the writer at one time. This is for two reasons: first, to avoid set_source being “attacked” by a potentially large number of readers in a read-heavy scenario. Second, we want to prevent more than one transformation happening in parallel, which would require the result of one expensive transformation to be thrown away. The synchronization is implemented using try_lock, which immediately returns if the lock could not be obtained. In case of failure to lock, get_transformed gives up and returns the cached value, which meets its requirements. If it acquires transform_lock, it proceeds to check whether a new source is available, again with a try_lock and a fallback to returning the cached value. This ensures that get_transformed gets out of the way of set_source as much as possible. If it acquires the source lock, it uses Option::take() to grab the new value, leaving None in its place. If the captured source is not None, meaning a new source was published since the last check, get_transformed performs the transformation, caches its result, and returns a copy.

get_transformed uses a RwLock to ensure that readers don’t wait for each other, but that the update is exclusive. RwLock nicely maps to Rust’s ownership system by RwLock::read returning a guard that provides shared reference to the underlying data, and RwLock::write returning a guard that provides a mutable reference.

This implementation is about as good as it can get with the use of locks. The problem statement, however, requires get_transformed and set_source not to block each other regardless of how often they are invoked. The above implementation will attempt an exclusive lock of source just to check if a new source has appeared. When this lock succeeds, set_source will be blocked for the duration of the lock. In a read-heavy scenario, get_transformed will be called often and by many different threads, and it is easy to imagine it hogging the lock enough to slow down set_source, which must wait to acquire the lock (it cannot use try_lock) in order to do its job.

Changing this requires looking outside the constructs offered by safe Rust, as discussed in the next post.

Closure lifetimes in Rust

In a comment on my answer to a StackOverflow question about callbacks in Rust, the commenter asked why it is necessary to specify 'static lifetime when boxing closures. The code in the answer looks similar to this:

struct Processor {
    callback: Box<dyn Fn()>,
}

impl Processor {
    fn new() -> Processor {
        Processor { callback: Box::new(|| ()) }
    }
    fn set_callback<CB: 'static + Fn()>(&mut self, c: CB) {
        self.callback = Box::new(c);
    }
    fn invoke(&self) {
        (self.callback)();
    }
}

It seems redundant to specify the lifetime of a boxed object that we already own. In other places when we create a Box<T>, we don’t need to add 'static to T’s trait bounds. But without the 'static bound the code fails to compile, complaining that “the parameter type CB may not live long enough.” This is a strange error – normally the borrow checker complains of an object not living long enough, but here it specifically refers to the type.

Let’s say Processor::set_callback compiled without the lifetime trait on Fn(). In that case the following usage would be legal as well:

fn crash_rust() {
    let mut p = Processor::new();
    {
        let s = "hi".to_string();
        p.set_callback(|| println!("{}", s.len()));
    }
    // access to destroyed "s"!
    p.invoke();
}

When analyzing set_callback, Rust notices that the returned box could easily outlive the data referenced by the CB closure and requires a harder lifetime bound, even helpfully suggesting 'static as a safe choice. If we add 'static to the bound of CB, set_callback compiles, but crash_rust predictably doesn’t. In case the desire was not to actually crash Rust, it is easy to fix the closure simply by adding move in front of it closure, as is is again helpfully suggested by the compiler. Moving s into the closure makes the closure own it, and it will not be destroyed for as long as the closure is kept alive.

This also explains the error message – it is not c that may not live long enough, it is the references captured by the arbitrary CB closure type. The 'static bound ensures the closure is only allowed to refer to static data which by definition outlives everything. The downside is that it becomes impossible for the closure to refer to any non-static data, even one that outlives Processor. Fixing the closure by moving all captured values inside it is not always possible, sometimes we want the closure to capture by reference because we also need the value elsewhere. For example, we would like the following to compile:

// safe but currently disallowed
{
    let s = "hi".to_string();
    let mut p = Processor::new();
    p.set_callback(|| println!("later {}", s.len()));
    println!("sooner: {}", s.len());
    // safe - "s" lives longer than "p"
    p.invoke();
}

Rust makes it possible to pin the lifetime to one of a specific object. Using this definition of Processor:

struct Processor<'a> {
    // the boxed closure is free to reference any data that
    // doesn't outlive this Processor instance
    callback: Box<dyn 'a + Fn()>,
}

impl<'a> Processor<'a> {
    fn new() -> Processor<'a> {
        Processor { callback: Box::new(|| ()) }
    }
    fn set_callback<CB: 'a + Fn()>(&mut self, c: CB) {
        self.callback = Box::new(c);
    }
    fn invoke(&self) {
        (self.callback)();
    }
}

…allows the safe code to compile, while still disallowing crash_rust.

Potato House

Nema više Potato Housea. Vlasnik tužno stoji ispred lokala i pozdravlja se sa stalnim mušterijama.

Za one koji ne znaju, Potato House je bio fini bistro generalno ruskog stila u kojem su se mogle dobiti odlične guste juhe, npr. boršč, kao i još nekolicina pomno probranih jela. Njihov specijalitet bio je kuhani krumpir izdubljen, na licu mjesta obogaćen smjesom maslaca i krumpira, kao i umacima s nekoliko okusa (indijski, chilli con carne, kikiriki+piletina, ćuftice…). Nazvati ih restoranom brze hrane bilo bi tehnički ispravno, jer tamo bi hrana brzo stigla, ali potpuno promašeno po kvaliteti, koja je bila daleko iznad onog što se normalno podrazumijeva pod fast foodom. I sve to po vrlo prihvatljivim cijenama.

Priča čovjek da zatvara jer “nakon tri godine više ne može”. I nakon svih horor priča o besmislenim propisima, uspjelo me šokirati da “objekt koji nema vlastiti WC mora imati isključivo visoke stolove”. Ha? “Time si mi diskriminirao niže ljude, starce, djecu, invalide”, nastavlja ogorčeno. “I zašto moram plaćati turističkoj komori, naknadu za šume? Kakve to ima veze sa mnom?” Pa problemi s osobljem, skup najam lokala, opće nerazumijevanje na svakom koraku.

“Puno lokala se danas otvara, ali ih se još više zatvara”, kaže, a to i sam vidim. Usprkos današnjem boomu restoranske ponude (ili baš zbog njega) u zadnjih mjesec-dva nestalo nekoliko meni dragih mjesta – “Kroštula” kod Cvjetnog, slastičarnica s vaflima u Masarykovoj, a evo sad i Potato House. A ni propast “Kiše”, čija se vlasnica također žalila na nemoguće propise, nije bila tako davno.

Nadam se da će boom zagrebačke restoranske ponude dovesti u biznis nove ljude, od kojih se neki možda i uspješno izbore s hrvatskom poslovnom klimom.

Kuhano vino, iliti Glühwein

Kuhano vino, iliti Glühwein

gluehwein

Posljednjih godina kuhano vino je postalo omiljeno piće božićnih sajmova širom Hrvatske, tako da je ovo zimsko kontinentalno piće prodrlo i do tople Dalmacije. Na sjeveru Hrvatske se nerijetko kuha i doma, ali najčešće u minimalističkoj varijanti, samo sa šećerom i klinčićima. Kako je autor ovog bloga u mladosti posjećivao minhenske božićne štandove i uživao u njihovom gustom i toplom Glühweinu čiji miris se širi gradom, a okus bogat voćem i začinima još dugo pamti, ovaj članak će opisati pravljenje kuhanog vina te vrste.

KUHANO VINO

za 1.5L vina, oko 10 porcija
vrijeme pripreme: 30 minuta

1 limun
2 naranče, ne prevelike
3 štapića cimeta (npr. Kotany)
10 klinčića
5 sjemenki kardamoma (opcionalno)
300g šećera
3dl vode
2 butelje (1.5L) suhog crnog vina
1dl žestice, npr. neutralnije rakije (izbjegavati travaricu) ili vinjaka

  1. Nožem za guljenje povrća ogulite koru s naranči i limuna i stavite je u lonac. Zatim ih iscijedite i to također ubacite u lonac. Dodajte i začine, šećer i vodu. Stavite lonac da jaku vatru i pustite da provri. Kuhajte na laganoj vatri 20-ak minuta, dok oko pola tekućine ne ishlapi.

  2. Ulijte vino i žesticu, lagano promiješajte, i zagrijavajte dok se ne pojave mjehurići. Uklonite s vatre prije nego što tekućina provri.

Kuhano vino poslužite vruće. Možete ga podgrijavati ili održavati vrućim na vrlo laganoj vatri, pazeći da nikako ne počne kipjeti.

About the “ineffectiveness of torture”

A friend of mine posted a commentary about the ongoing torture debate. He was appalled about people arguing that torture doesn’t work. “Why stop there,” he asks. “Does rape work? Let’s have a debate! Is genocide effective? Get some pundits to discuss pros and cons! Does sexual abuse of children yield results? Get some experts on screen!” This line of inquiry prompted me to consider whether the ineffectiveness of torture is ever an acceptable argument against it.

To consider the issue on a broader level, one can imagine “pundits” of other times hotly debating effectiveness of slavery, systemic censorship, forced prison labor, or gladiator battles.

The debate over torture has revealed the ugly fact that torture is widely supported by Americans, and the support has risen in the last several years. I see it as a result of a careful propaganda campaign that made skillful use of the naive portrayals of torture popular in fiction, and especially in pulp. Fiction tends to show torture as the only measure available to defeat infinitely evil opponents. This is so wide-spread that the tvtropes site documents a number of torture-related tropes, such as the “ticking time bomb”. When Americans support torture, they envision this scenario, a fact the agencies promoting their own use of torture happily embraced. If one’s thinking is confined to the ticking time bomb scenario, their moral judgment (and, one might add, decency and good taste) is suspended; they ask themselves nonsensical questions with leading answers, such as “would you torture one person to save a hundred?” (or a thousand, million, etc.) The fact that this scenario is an unrealistic fictional invention fueled by non-fictional propaganda machinery never enters their mind.

Given the way Americans think about torture, it is probably simply easier, and more efficient in the short term, to point out the ineffectiveness of torture than to debunk the ticking-time-bomb image. But in the long run, I would argue that relying on the ineffectiveness argument is an extremely bad idea. Imagine future interrogators and their teams of psychologists and rogue doctors perfecting techniques to use torture effectively and start extracting reliable information from tortured enemies and suspects. An argument against torture hinging on its ineffectiveness would immediately fall apart.

The dilemma whether to use scientific data to back up what is essentially an ethical choice is present elsewhere. For example, lack of measurable differences between races has been cited as argument against racism, sometimes extended to a complete dismissal any notion of race as a “social construct”, like ethnicity. But what if geneticists do discover measurable and important differences between races? Basing what is essentially an ethical judgment on scientific data positions it on thin ice because all scientific views can (and must be allowed to) change. This is why I am against drawing the ineffectiveness argument into the discussion against torture, even if it appears useful in the short term.

International file names in cross-platform programs

I work for a company that builds simulation software with the front-end GUI developed mostly in Python. This document is a slightly modified version of a guide written for the GUI developers to ensure that file names with international characters work across the supported platforms. Note that this document is specifically about file names, not file contents, which is a separate topic.

Introduction

Modern operating systems support use of international characters in file and directory names. Users not only routinely expect being able to name their files in their native language, but also being able to manipulate files created by users of other languages.

Historically, most systems implemented file names with byte strings where the value of each byte was restricted to the ASCII range (0-127). When operating systems started supporting non-English scripts, byte values between 128 and 255 got used for accented characters. Since there are more than 128 such characters in European languages, they were grouped in character encodings or code pages, and the interpretation of a specific byte value was determined according to the currently active code page. Thus a file with the name specified in Python as '\xa9ibenik.txt' would appear to an Eastern-European language user as Šibenik.txt, but to a Western-European as ©ibenik.txt. As long as users from different code pages never exchanged files, this trick allowed smuggling non-English letters to files names. And while this worked well enough for localization in European countries, it failed at internationalization, which implies exchange and common storage of files from different languages and existence of bilingual and multilingual environments. In addition to that, single-byte code pages failed to accomodate East Asian languages, which required much more than 128 different characters in a single language. The solution chosen for this issue by operating system vendors was allowing the full Unicode repertoire in file names.

Popular operating systems have settled on two strategies for supporting Unicode file names, one taken by Unix systems, and the other by MS Windows. Unix continued to treat file names as bytes, and deployed a scheme for applications to encode Unicode characters into the byte sequence. Windows, on the other hand, switched to natively representing file names in Unicode, and added new Unicode-aware APIs for manipulating them. Old byte-based APIs continued to be available on Windows for backward compatibility, but could not be used to access files other than those with names representable in the currently active code page.

These design differences require consideration on the part of designers of cross-platform software in order to fully support multilingual file names on all relevant platforms.

Unicode encodings

Unicode is a character set designed to support writing all human languages in present use. It currently includes more than 100 thousand characters, each assigned a numeric code called a code point. Characters from ASCII and ISO 8859-1 (Western-European) character sets retained their previous numeric values in Unicode. Thus the code point 65 corresponds to the letter A, and the code point 169 corresponds to the copyright symbol ©. On the other hand, the letter Š has the value 169 in ISO 8859-2, the value 138 in Windows code page 1250, and code point 352 in Unicode.

Unicode strings are sequences of code points. Since computer storage is normally addressed in fixed-size units such as bytes, code point values need to be mapped to such fixed-size code units, or encoded. Mainstream usage has stabilized on a small number of standard encodings.

UTF-8

UTF-8 is an encoding that maps Unicode characters to sequences of 1-4 bytes. ASCII characters are mapped to their ASCII values, so that any ASCII string is also a valid UTF-8 string with the same meaning. Non-ASCII characters are encoded as sequences of up to four bytes.

Compatibility with ASCII makes UTF-8 convenient for introducing Unicode to previously ASCII-only file formats and APIs. Unix internationalization and modern Internet protocols heavily rely on UTF-8.

UTF-16

The UTF-16 encoding maps Unicode characters to 16-bit numbers. Characters with code points that fit in 16 bits are represented by a single 16-bit number, and others are split into pairs of 16-bit numbers, the so-called surrogates.

Windows system APIs use UTF-16 to represent Unicode, and the documentation often refers to UTF-16 strings as “Unicode strings”. Java and DotNET strings also use the UTF-16 encoding.

UTF-32

The UTF-32 encoding maps characters to 32-bit numbers that directly correspond to their code point values. It is the simplest of the standard encodings, and the most memory-intensive one.

System support for Unicode

Windows

Windows file names are natively stored in Unicode. All relevant Win32 calls work with UTF-16 and accept wchar_t * “wide string” arguments, with char * “ansi” versions provided for backward compatibility. Since file names are internally stored as Unicode, only the Unicode APIs are guaranteed to operate on all possible files. The char based APIs are considered legacy and work on a subset of files, namely those whose names can be expressed in the current code page. Windows provides no native support for accessing Unicode file names using UTF-8.

The Win32 API automatically maps C API calls to wide (UTF-16) or single-byte variants according to the value of the UNICODE preprocessor symbol. Functions standardized by C, C++, and POSIX have types specified by the standard and cannot be automatically mapped to Unicode versions. To simplify porting, Windows provides proprietary alternatives, such as the _wfopen() alternative to C fopen(), or the _wstat() alternative to POSIX stat(). Like Win32 byte-oriented functions, the standard functions only work for files whose names can be represented in the current code page. Opening a Japanese-named file on a German-language workstation is simply not possible using standard functions such as fopen() (except by resorting to unreliable workarounds such as 8+3 paths). This is a very important limitation which affects the design of portable applications.

Standard C++ functions, such as std::fstream::open, have overloads for both char * and wchar_t *. Programmers that want their programs to be able to manipulate any file on the file system must make sure to use the wchar_t * overloads. The char * overloads are also limited to opening non-Unicode file names.

Unix

The Unix C library does support the wchar_t type for accessing file contents as Unicode, but not for specifying file names. The operating system kernel treats file names as byte strings, leaving it up to the user environment to interpret them. This interpretation, known as the “file name encoding”, is defined by the locale, itself configured with LC_* environment variables. Modern systems use UTF-8 locales in order to support multilingual use.

For example, when a user wishes to open a file with Unicode characters, such as Šibenik.txt, the application will encode the file name as a UTF-8 byte string, such as "\xc5\xa0ibenik.txt", and pass that string to fopen(). Later, system functions like readdir() will retrieve the same UTF-8 file name, which the application’s file chooser will display to the user as Šibenik.txt. As long as all programs agree on the use of UTF-8, this scheme supports unrestricted use of Unicode characters in file names.

The important consequence of this design is that storing file names as Unicode in the application and encoding them as UTF-8 when passing them to the system will only allow manipulating files whose names are valid UTF-8 strings. To open an arbitrary file on the file system, one must store file names as byte strings. This is exactly the opposite of the situation on Windows, a fact that portable code must take into account.

Python

Beginning with version 2.0, Python optionally supports Unicode strings. However, most libraries work with byte strings natively (often using UTF-8 to support Unicode), and using Unicode strings is slower and leads to problems when Unicode strings interact with ordinary strings.

On Windows, Python internally uses the legacy byte-based APIs when given byte strings and Windows-specific Unicode APIs when given Unicode strings. This means that Unicode files can be manipulated as long as the programmer remembers to create the correct Unicode string. It is not only impossible to open some files using the bytes API, they are misrepresented by functions such as os.listdir::

>>> with open(u'\N{SNOWMAN}.txt', 'w'):
...   pass   # create a file with Unicode name
... 
>>> os.listdir('.')
['?.txt']
>>> open('?.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '?.txt'

Opening the directory in Windows Explorer reveals that Python created the file with the correct name. It is os.listdir and its constraint to return byte strings when given a byte string that creates the problem. os.listdir(u'.') returns the usable [u'\u2603.txt'].

Python 3

Python 3 strings are Unicode by default, so it automatically calls the Unicode versions of Win32 calls and does not exhibit bugs like the listdir bug shown above. On the other hand, Python 3 needs special provisions to map arbitrary Unix file names to Unicode, as described in PEP 383.

File names in applications

Portable programs that want to enable the user to create arbitrary file names must take care how to create and access them. Using portable IO libraries such as gio and Qt resolves many of these problems automatically, but these libraries carry a lot of weight that is unacceptable in many situations. Also, those libraries often don’t interact well with “traditional” C code that accepts file names. In this chapter we present an implementation strategy that enables correct use of Unicode file names with minimal intrusion to the code base.

Since file names are natively bytes on some platforms and Unicode on others, a cross-platform application must choose between these representations. Using Unicode makes programming somewhat easier on platforms with native Unicode APIs, while using UTF-8 bytes has the advantage on platforms with native bytes APIs.

What representation works best depends on the application’s surroundings and the implementation platform. A Python 3 or Java application running on a web server is probably best served by using Unicode consistently and not bothering with Unix non-UTF-8 file names at all. On the other hand, a GTK application, a Python 2 application, or an application needing to interface with C will be off with UTF-8, which guarantees interoperability with the world of bytes, while retaining lossless conversion to Unicode and back.

This guide presents a programming model based on UTF-8 as the file name representation. UTF-8 was chosen for AVL simulation GUIs due to ease of interoperability with various C APIs, including GTK itself. This choice is also shared by the gio library and other modern Unix-based software. Of course, use of UTF-8 is not limited just to file names, it should be used for representation of all user-visible textual data.

Interacting with the file system from Python

Since Python’s built-in functions such as open and os.listdir accept and correctly handle Unicode file names on Windows, the trick is making sure that they are called with correct arguments. This requires two primitives:

  • to_os_pathname— converts a UTF-8 pathname (file or directory name) to OS-native representation, i.e. Unicode when on Windows. The return value should only be used as argument to built-in open(), or to functions that will eventually call it.

  • from_os_pathname — the exact reverse. Given an OS-native representation of pathname, returns a UTF-8-encoded byte string suitable for use in the GUI.

The implementation of both functions is trivial:

  def to_os_pathname(utf8_pathname):
      """Convert UTF-8 pathname to OS-native representation."""
      if os.path.supports_unicode_filenames:
          return unicode(utf8_pathname, 'utf-8')
      else:
          return pathname

  def from_os_pathname(os_pathname):
      """Convert OS-native pathname to UTF-8 representation."""
      if os.path.supports_unicode_filenames:
          return os_pathname.encode('utf-8')
      else:
          return os_pathname

With these in place, the next step is wrapping file name access with calls to to_os_pathname. Likewise, file names obtained from the system, as with a call to os.listdir must be converted back to UTF-8.

def x_open(utf8_pathname, *args, **kwds):
    return open(to_os_pathname(utf8_pathname), *args, **kwds)

def x_stat(utf8_pathname):
    return os.stat(to_os_pathname(utf8_pathname))
...

# The above pattern can be used to wrap other useful functions from
# the os and os.path modules, e.g. os.stat, os.remove, os.mkdir,
# os.makedirs, os.isfile, os.isdir, os.exists, and os.getcwd.

def x_listdir(utf8_pathname):
    return map(from_os_pathname, os.listdir(to_os_pathname(utf8_pathname)))

The function standing out is x_listdir, which is like os.listdir, except it converts file names in both directions: in addition to calling to_os_pathname on the pathname received from the caller, it also calls from_os_pathname on the pathnames provided by the operating system. Taking the example from the previous chapter, x_listdir would correctly return ['\xe2\x98\x83'] (a UTF-8 encoding of the snowman character), which x_open('\xe2\x98\x83') would correctly open.

Any function in the program that accepts a file name must accept — and expect to receive — a UTF-8-encoded file name. Functions that open the file using Python’s open, or those that call third-party functions that do so, have the responsibility to use to_os_pathname to convert the file name to OS-native form.

Legacy path names

to_os_pathname is useful when calling built-in open() or into code that will eventually call built-in open(). However, sometimes C extensions beyond our control will insist on accepting the file name to open the file using the ordinary C fopen() call. Passing an OS-native Unicode file name on Windows serves no purpose here because it will fail on a string check implemented by the Python bindings for the library. And even if it somehow passed the check, the library is still going to call fopen() rather than _wfopen().

A workaround when dealing with such legacy code is possible by retrieving the Windows “short” 8+3 pathnames, which are always all-ASCII. Using the short paths, it is possible to write a to_legacy_pathname function that accepts a UTF-8 pathname and returns a byte string pathname with both Python open() and the C family of functions such as fopen(). Since short pathnames are a legacy feature of the Win32 API and can be disabled on a per-volume basis, to_legacy_pathname should only be used as a last resort, when it is impossible to open the file with other means.

if not os.path.supports_unicode_filenames:
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        return utf8_pathname
else:
    import ctypes, re
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    has_non_ascii = re.compile(r'[^\0-\x7f]').search
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        if not has_non_ascii(utf8_pathname):
            return utf8_pathname
        unicode_pathname = unicode(utf8_pathname, 'utf-8')
        short_length = GetShortPathNameW(unicode_pathname, None, 0)
        if short_length == 0:
            raise ctypes.WinError()
        short_buf = ctypes.create_unicode_buffer(short_length)
        GetShortPathNameW(unicode_pathname, short_buf, short_length)
        short_pathname_unicode = short_buf.value
        return short_pathname_unicode.encode('ascii')

Summary

If this seems like a lot of thought for something as basic as file names with international characters, you are completely right. Doing this shouldn’t be so hard, and this can be considered an argument for moving to Python 3. However, if you are using C extensions and libraries that accept file names, simply switching to Python 3 will not be enough because the libraries and/or their Python bindings will still need to be modified to correctly handle Unicode file names. A future article will describe approaches taken for porting C and C++ code to become, for lack of a better term, Unicode-file-name-correct. Until then, the to_legacy_pathname() hack can come in quite handy.

Stuffed bell peppers, the Croatian way

Stuffed bell peppers are a staple of cuisines of several southeast-European countries, including Croatia. This recipe, originally published in Croatian, presents how I make them. The translation will hopefully help this lovely Croatian dish reach a wider audience.

In Croatia we typically use the bell peppers of the “babura” variety, but other kinds of bell peppers will do nicely, as long as they are of reasonable size – at least 2 inches in diameter, and 4 inches or more in height.

STUFFED BELL PEPPERS

6 portions
preparation time: about 2 hours, largely unattended

1 onion, chopped
olive oil
salt, pepper
2-3 cloves garlic
1 tbsp paprika
1 pound ground meat, mix of beef and pork
1/2 cup rice
10 bell peppers of medium size
2-3 cups tomato purée (passata di pomodoro)
1 cup wine
water as needed

  1. Put the olive oil in a saucepan over medium heat. When the oil is warm, add onions and cook until soft, about five minutes. When the onions are nearly done, add the garlic, salt, pepper, paprika, and other spices if you like (e.g. ginger, nutmeg, or a dash of cumin). Do not overdo the spices.

  2. While the onions are cooking, wash the bell peppers, cut off the stems, and shake out the seeds. Put the ground meat in a bowl and add the cooked onions. Season with salt and pepper to taste (feel free to try it, a bit of raw meat won’t harm you), add the rice, and mix well.

  3. Stuff the bell peppers with the meat mixture, trying not to pack the meat too tightly. Arrange the peppers in a cooking pot, if possible so that they stand upright holding each other; leave as little room as possible between them. If the peppers do not fit in one layer, cook them in two smaller pots.

  4. Mix the tomato purée and wine and season with salt. Pour the mixture over the peppers in the pot and add water until the peppers are almost fully submerged. If you are using two pots, equally divide the purée and wine between them and then add water. While the peppers are cooking, do not stir them, just occasionally shake the whole pot. Cook for an hour and a half on low heat.

Let the cooked peppers rest for at least an hour. Serve with mashed potatoes and some crusty bread.

Do not throw away the puree in which the peppers were cooking. If some remains uneaten, freeze it and use it as stock for a future dish.

Punjene paprike

Punjene paprike su klasično hrvatsko jelo podjednako popularno na sjeveru i jugu. Ova verzija je kako ih ja radim, na više-manje klasičan način, ali uz pokoji suvremeni začin.

PUNJENE PAPRIKE

za 6 porcija
vrijeme pripreme: oko 2 sata, uglavnom bez nadzora

1 glavica luka, sjeckana
maslinovo ulje
sol, papar
2-3 češnja češnjaka
žlica slatke mljevene paprike
1/2 kg miješanog mljevenog mesa
15dkg riže
10-ak paprika srednje veličine
500-750g pasirane rajčice
2dl vina
voda po potrebi

  1. Prekrijte dno tave maslinovim uljem i zažutite luk na srednjoj vatri. Pred kraj dodajte zgnječeni češnjak, sol, papar, mljevenu papriku i po želji druge začine (na primjer, đumbir, kumin, tajlandski curry ili muškatni oraščić). Sa začinima ne pretjerujte da ne prevladaju nad finim okusom paprike.

  2. Dok se luk prži, operite paprike i skinite im poklopce. Mljeveno meso stavite u zdjelu i pomiješajte ga s prženim lukom. Probajte je li dovoljno slano (iako je sirovo, od zalogaja vam neće ništa biti) i po potrebi dosolite. Dodajte rižu i dobro promiješajte.

  3. Napunite paprike mesnom smjesom, pazeći da ih previše ne nabijate. Posložite paprike u lonac, ako je moguće tako da stoje uspravno pridržavajući jedna drugu; neka između paprika bude što manje razmaka. Ako paprike ne stanu u jedan red, bolje ih je kuhati u dva manja lonca nego slagati u dva reda.

  4. Pomiješajte pasirane rajčice s vinom i posolite. Prelijte paprike tekućinom i dolijte vode dok ne dođe blizu vrha paprika. Ako kuhate u dva lonca, ravnomjerno podijelite pasirane rajčice i vino između njih i zatim dolijte vodu. Dok se paprike kuhaju, nemojte ih miješati, samo povremeno protresite lonac. Neka se kuhaju sat i pol na laganoj vatri.

Pustite kuhane paprike da odstoje barem sat vremena i poslužite ih s pireom i kruhom.

Umak koji se ne pojede nemojte baciti, zamrznite ga i upotrijebite kao fini temeljac za neko buduće jelo.

Data transfer with Netcat

The other day my brother, who works as a system administrator, inquired about a puzzling behavior of GNU Netcat, the popular nc utility. Sometimes described as the TCP/IP Swiss army knife, it can come in handy as an ad hoc file transfer tool, capable of transferring large amounts of data at the speed of disk reads/writes.

Data transfer

Typical usage looks like this:

nc -lp60000 | tar x                 # receiver
tar c dir... | nc otherhost 60000   # sender

It may look strange at first, but it’s easy to type and, once understood, almost impossible to forget. The commands work everywhere and require no specialized server software, only a working network and nc itself. The first command listens on port 60000 and pipes the received data to tar x. The second command provides the data by piping output of tar c to the other machine’s port 60000. Dead simple.

Note that transferring files with Netcat offers no encryption, so it should only be used inside a VPN, and even then not for sensitive data.

Data loss

One surprising behavior of this mode of transfer is that both commands remain hanging after the file transfer is done. This is because neither nc is willing to close the connection, as the other side might still want to say something. As long as one is positive that the transfer is finished (typically confirmed by disk and network activity having ceased), they can be safely interrupted with ^C.

The next step is adding compression into the mix, in order to speed up transfer of huge but easily compressible database dumps.

nc -lp60000 | gunzip -c | tar x               # receiver
tar c dir... | gzip -c | nc otherhost 60000   # sender

At first glance, there should be no difference between this pipeline and the one above, except that this one compresses the content sent over the wire and decompresses received content. However, much to my surprise, the latter command consistently failed to correctly transfer the last file in the tar stream, which would end up truncated. And this is not a case of pressing ^C too soon — truncation occurs no matter how long you wait for the transfer to finish. How is this possible?

It took some strace-ing to diagnose the problem. When the sender nc receives EOF on its standard input, it makes no effort to broadcast the EOF condition over the socket. Some Netcat implementations close (“shut down”) the write end of the socket after receiving local EOF, but GNU Netcat doesn’t. Failure to shut down the socket causes the receiving nc to never “see” the end of file, so it in turn never signals EOF to gunzip. This leaves gunzip hanging, waiting for the next 32K chunk to complete, or for EOF to arrive, neither of which ever happens.

Preventing Netcat data loss

Googling this issue immerses one into a twisted maze of incompatible Netcat variants. Most implementations shut down the socket on EOF by default, but GNU Netcat not only doesn’t do so, it doesn’t appear to have an option to do so! Needless to say, the huge environment where my brother works would never tolerate swapping the Netcat implementation on dozens of live servers, possibly breaking other scripts. A solution needed to be devised that would work with GNU Netcat.

At this point, many people punt and use the -w option to resolve the problem. -w SECONDS instructs nc to exit after the specified number of seconds of network inactivity. In the above example, changing nc -lp60000 to nc -lp60000 -w1 on the receiving end causes nc to exit one second after the data stops arriving. nc exiting causes gunzip to receive EOF on standard input, which prompts it to flush the remaining uncompressed data to tar.

The only problem with the above solution is that there is no way to be sure that the one-second timeout occurred because the data stopped arriving. It could as well be the result of a temporary IO or network glitch. One could increase the timeout to decrease the probability of a prematurely terminated transfer, but this kind of gamble is not a good idea in production.

Fortunately, there is a way around the issue without resorting to -w. GNU Netcat has a --exec option that spawns a command whose standard input and standard output point to the actual network socket. This allows the subcommand to manipulate the socket in any way, and fortuitously results in the socket getting closed after the command exits. With the writing end closing the socket, neither nc is left hanging, and the transfer completes:

nc -lp60000 | gunzip -c | tar x                  # receiver
nc -e 'tar c dir... | gzip -c' otherhost 60000   # sender

Self-delimiting streams

There is one just little thing that needs explaining: why did the transfer consistently work with tar, and consistently failed to work with the combination of tar and gzip?

The answer is in the nature of the stream produced by tar and gzip. Data formats generally come in two flavors with respect to streaming:

  1. Self-delimiting: formats whose payload carries information about its termination. Example of a self-delimiting stream is an HTTP response with the Content-Length header — a client can read the whole response without relying on an out-of-band “end of file” flag. (HTTP clients use this very feature, along with some more advanced ones, to reuse the same network socket for communicating multiple requests with the server.) A well-formed XML document without trailing whitespace is another example of a self-delimiting stream.

  2. Non-self-delimiting: data formats that do not contain intrinsic information about their end. A text file or an HTML document are examples of those.

While a tar archive as a whole is not self-delimiting (nor can they be, since tar allows appending additional members at the end of the archive), its individual pieces are. Each file in the archive is preceded by a header announcing the size of the file. This allows the receiving tar to read the correct number of bytes from the pipe without needing additional end-of-file information. Although tar will happily hang forever waiting for more files to arrive on standard input, every individual file will be read to completion.

On the other hand, gzip does not produce a self-delimiting stream. gunzip reads data in 32-kilobyte chunks for efficient processing. When it receives a shorter chunk, say 10k of data, it doesn’t assume that this is because EOF has occurred, it just waits for the remaining 22K of data for its buffer to fill. gunzip relies on EOF condition to tell it when and if the stream has ended, after which it does flush the “short” buffer it last read. This incomplete buffer is what caused the data loss when compression was added.

This is why the -e solution works so nicely: it not only makes the socket close, it ensures that EOF is signaled across the chain of piped commands, including the receiving gunzip.