Category Archives: Tech

Exploring lock-free Rust 1: Locks

As a learning exercise I set out to implement a simple lock-free algorithm in Rust. It was inspired by a problem posed at job interviews at a company where a friend works. The problem is simple enough that it can be tackled by a beginner, but tricky enough to require some thought to get right – and Rust presents several new challenges compared to the original Java.

This series of articles presents the evolution of a simple Rust lock-free container starting from single-threaded, progressing to a multi-threaded variant with locks, and finally settling on a lock-free implementation, discussing the trade-offs at each step. A basic understanding of Rust and of multi-threading programming is assumed, but the articles might be useful to adventurous beginners at both. Do note, however, that the author is not an expert at lock-free programming, so there might be errors – if you find some, please do leave a comment.

The Exercise

Implement a class providing access to a cached value. In more details:

Write a LazyTransform class that stores a single value, allowing it to be updated as needed. The value is potentially expensive to compute, so the setter method, set_source receives a “source” that will be used to compute the final value using a transformation function received in the LazyTransform constructor. Transformation must not be attempted until requested by get_transformed. Once generated, the value is cached and returned by further invocations of get_transformed, until invalidated by a new call to set_source.

A single-threaded version can be summarized with the following Python:

class LazyTransform:
    def __init__(self, transform_fn):
        self.transform_fn = transform_fn
        self.source = None
        self.value = None

    def set_source(self, new_source):
        self.source = new_source

    def get_transformed(self):
        if self.source is not None:
            newval = self.transform_fn(self.source)
            if newval is not None:
                self.value = newval
                self.source = None
        return self.value

The class must support being called from multiple threads, with the following semantics:

  • set_source and get_transformed can and will be called on the same LazyTransformer instance in parallel;
  • Once set_source completes, future invocations of get_transformed must eventually start returning the new value.
  • Read-heavy usage pattern is expected, so get_transformed must not block regardless of how many times set_source or get_transformed are called in other threads. The one exception is when a new source is detected – it is allowed for get_transformed to block until the transformation finishes before returning the transformed value (and caching it for future calls).
  • The code must be lock-free: neither set_source nor get_transformed should get stuck waiting on each other, even if they are called in quick succession or in parallel by many threads, or both.

Rust API tradeoffs

Before proceeding to parallelization, let’s review how the above interface would map to Rust’s type system. Ideally we’d want to place as few restrictions as possible on the type of values used for the source and the final objects; either could be as simple as a single u32 or a huge heap-allocated object. We know, for example, that both the source and the value type must be Send, because they need to be accessed from threads different from those that create them.

Another necessary restriction is that the final value type must be Clone. Why? Consider how the concept of “returning of cached value”, the return self.value line in the above Python, maps to Rust. In Python the semantics are clear because all of its objects are heap-allocated, and you always get the same instance shared. This is also the specified by the original Java exercise, which returns an Object. But a correct Rust implementation needs to deal with an actual value stored in the cache, and has three options for returning it:

  1. move the object out of the container, typically by making the value field an Option and returning self.value.take();
  2. return a reference to the object, return &self.value;
  3. clone the object and return the cloned value to the caller.

The first option obviously doesn’t work because it would prevent get_transformed to return the cached value more than once. The second option looks feasible until one considers that the returned reference cannot be allowed to outlive the stored value. Since the stored value can be invalidated by a call to set_source, which can happen literally at any time, it is clear that allowing a reference to be returned would be unsound. Indeed, all such attempts are promptly rejected by the borrow checker.

Although cloning at first appears like it would be inefficient for arbitrary objects, it actually provides the greatest flexibility for the user. Light values, such as numeric IDs which are Copy (and hence also Clone), or small strings which are cheap to clone, can be placed in the cache as-is. Heavy values, on the other hand, can be dynamically allocated and accessed as Arc<ActualData>, ensuring that their clone only increments a reference count, providing the semantics one would expect from equivalent Python or Java. If needed, one can even combine the two and store a tuple of a light object and a heavy one.

So, LazyTransform needs to be generic on the value type (T) and the source type (S). But let’s not forget the transformation function received by the constructor. Fixing its type to fn(S) -> T would limit it to stateless functions, and we would like the user to be able to provide an arbitrary closure for transformation. One option would be to accept a generic function object in the constructor and box it in a Box<Fn(S) -> T>, but that would impose a dynamic allocation on each LazyTransform instance, as well as an indirection when invoking the function. If the transformation function is known at compile time and carries no state, it should incur neither storage nor run-time indirection overhead. This is easily achieved by adding a third type parameter, that of the transformation function type.

As a demonstration of the API, here is a single-threaded implementation of the container:

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Option<S>,
    value: Option<T>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: None, value: None,
        }
    }

    pub fn set_source(&mut self, source: S) {
        self.source = Some(source);
    }

    pub fn get_transformed(&mut self) -> Option<T> {
        if let Some(source) = self.source.take() {
            let newval = (self.transform_fn)(source);
            if newval.is_some() {
                self.value = newval;
            }
        }
        self.value.clone()
    }
}

In spirit this is exactly the same thing as the original Python, except sprinkled with a healthy dose of static typing.

Compile-time thread safety

What happens if we try to share an instance of LazyTransform among threads? Rust will prevent that at compile time — invoking a &mut method from multiple threads would require creating multiple &mut references to the same object, which is prevented by the borrow checker. For example, the following doesn’t compile:

fn main() {
    let mut lt = LazyTransform::new(|x: u64| Some(x + 1));
    std::thread::spawn(move || {            // lt moved here
        for i in 0..10_000 {
            lt.set_source(i);
        }
    });
    while lt.get_transformed().is_none() {  // lt used after move
    }
    let val = lt.get_transformed().unwrap();
    assert!(val >= 0 && val < 10_000);
}

lt gets moved into the closure executed by the new thread, but then it is no longer available for use by the main thread. Sending it by reference wouldn’t work because there can exist only one &mut reference to an object, so we wouldn’t be allowed to send the same reference to multiple threads. Allocating LazyTransform dynamically and using Arc to share it among threads wouldn’t help either because Arc only provides shared access to the data it owns.

In Rust, supporting parallel access to a container requires not only a change in implementation, but also in method signatures. This is an intentional design decision – while Python or Java single-threaded code will happily execute when called from multiple threads, providing incorrect results, the Rust version will refuse to compile when the thread-unsafe LazyTransform object is accessed from two threads.

The compiler uses simple rules to decide whether it is safe to share an object between threads:

  1. Methods invoked from more than one thread must accept &self rather than &mut self. This rule is enforced by the borrow checker for single-threaded code as well.
  2. The object must not contain values of types specifically blacklisted for multi-threaded access even through shared references. In Rust terms, its type must “be Sync”, meaning it implements the Sync marker trait, which most objects do. Examples of non-Sync types are Rc or Cell, and both have thread-safe equivalents.

At a glance, the first rule seems to rule out LazyTransform as a multi-threaded type. Both its public methods clearly modify the object, with set_source even doing that in a way that is observable from the outside. Changing the signatures to accept &self instead of &mut self fails to compile because both methods modify the data behind the &self shared reference, which is prohibited. Accessing an object inside &self will also result in further shared references that are read-only.

To modify data, we must find a way to obtain an exclusive mutable reference from the shared reference to self. This is not allowed for ordinary objects because the compiler would have no way to ensure that writes are exclusive, i.e. that while a thread holds a mutable reference to a value, no other thread can read it or write to it. However, if we could statically convince Rust that the reference’s ownership of the data will be exclusive, it would be within the rules to allow the conversion. This is where mutexes come in.

Mutexes

Rust’s Mutex type provides read-write access to the value it protects, using the appropriate operating system primitive to ensure that this can be done by only one thread at a time. Here is an implementation of LazyTransform updated to use a mutex:

use std::sync::Mutex;

struct LazyState<T, S> {
    source: Option<S>,
    value: Option<T>,
}

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    state: Mutex<LazyState<T, S>>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            state: Mutex::new(LazyState { source: None, value: None }),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut state = self.state.lock().unwrap();
        state.source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        let mut state = self.state.lock().unwrap();
        if let Some(new_source) = state.source.take() {
            let new_value = (self.transform_fn)(new_source);
            if new_value.is_some() {
                state.value = new_value;
            }
        }
        state.value.clone()
    }
}

Both methods now operate on &self, relying on the mutex to obtain write access to the data in self.state. As far as method signatures are concerned, this is the final version of the API – all future versions will only differ in implementation.

The storage is now split into transform_fn, which is itself immutable and can be invoked from a shared reference, and state, the mutable part of the object’s state moved to a separate struct and enclosed in a mutex. As can be seen here, Rust’s Mutex is a container that holds and owns the data it protects. While that coupling looks strange at first, it enables the mutex to safely grant read-write access to the data it owns.

Calling Mutex::lock() waits until an exclusive OS lock is acquired, then returns a “guard” object, that both LazyTransform methods store in a local variable called state. The mutex will not be unlocked until the guard goes out of scope. Therefore the existence of a live guard represents a proof that the mutex is locked and therefore provides read-write access to the underlying data.

In Rust’s twist on mutex semantics, the very meaning of the act of locking a mutex is obtaining temporary exclusive write access to its data through a temporary guard object. Despite self being a shared reference, a successful self.state.lock() grants access to &mut LazyState that may last for as long as the mutex is locked (guard exists) and no more. This is the crux of the way Rust prevents data races through static analysis.

Other than the curious mutex design, there is nothing really interesting about the code itself. Once the mutex is locked, both functions do exactly the same thing that their single-threaded counterparts did. While this code is thread-safe in the sense Rust promises, i.e. free from data races, it is still very far from being efficient when invoked in parallel, even ignoring the stringent lock-free requirements. In particular, get_transformed is extremely inefficient in a read-heavy scenario because each call blocks all other calls even when set_source isn’t called at all. When a transformation is in progress, all the other readers are blocked until it is finished.

Fine-grained locking

To minimize the amount of time spent waiting, we can take advantage of the following facts:

  • The methods are operating on two distinct pieces of data, source and value. set_source, for example, doesn’t access value at all. The two fields can be protected with different locks.
  • get_transformed has two distinct modes of operation: a fast one when it only returns the cached value, and the slow one when it detects that the source has changed and it needs to calculate the new value. The vast majority of calls to get_transformed can be expected to belong to the “fast” scenario.

Here is an implementation that uses finer-grained locking to ensure that readers don’t wait for either writers or other readers:

use std::sync::{Mutex, RwLock};

pub struct LazyTransform<T, S, FN> {
    transform_fn: FN,
    source: Mutex<Option<S>>,
    value: RwLock<Option<T>>,
    transform_lock: Mutex<()>,
}

impl<T: Clone, S, FN: Fn(S) -> Option<T>> LazyTransform<T, S, FN> {
    pub fn new(transform_fn: FN) -> LazyTransform<T, S, FN> {
        LazyTransform {
            transform_fn: transform_fn,
            source: Mutex::new(None),
            value: RwLock::new(None),
            transform_lock: Mutex::new(()),
        }
    }

    pub fn set_source(&self, source: S) {
        let mut locked_source = self.source.lock().unwrap();
        *locked_source = Some(source);
    }

    pub fn get_transformed(&self) -> Option<T> {
        if let Ok(_) = self.transform_lock.try_lock() {
            let mut new_source = None;
            if let Ok(mut locked_source) = self.source.try_lock() {
                new_source = locked_source.take();
            }
            if let Some(new_source) = new_source {
                let new_value = (self.transform_fn)(new_source);
                if new_value.is_some() {
                    *self.value.write().unwrap() = new_value.clone();
                    return new_value;
                }
            }
        }
        self.value.read().unwrap().clone()
    }
}

In this implementation there is no longer a “state” structure protected by a coarse mutex, we are back to individual fields. The source field is protected by its own mutex and the value field is protected by a separate RwLock, which is like a mutex, except it allows read access by multiple concurrent readers that don’t block each other. Finally, a new transform_lock field doesn’t protect any particular piece of data, it serves as something resembling a conventional mutex.

set_source locks the source mutex and replaces the source with the new value. It assigns to *locked_source because locked_source is just the variable holding the guard, and assigning Option<S> to it would be a type error. Since the guard provides automatic access to &mut Option<S>, *locked_source at the left-hand side of the assignment serves to both coerce the guard to &mut Option<S> (returned by guard’s implementation of DerefMut::deref_mut) and at the same time to dereference it, so that the value behind the reference is replaced with the new one.

get_transformed is more sophisticated. It first ensures that only a single call attempts to interact with the writer at one time. This is for two reasons: first, to avoid set_source being “attacked” by a potentially large number of readers in a read-heavy scenario. Second, we want to prevent more than one transformation happening in parallel, which would require the result of one expensive transformation to be thrown away. The synchronization is implemented using try_lock, which immediately returns if the lock could not be obtained. In case of failure to lock, get_transformed gives up and returns the cached value, which meets its requirements. If it acquires transform_lock, it proceeds to check whether a new source is available, again with a try_lock and a fallback to returning the cached value. This ensures that get_transformed gets out of the way of set_source as much as possible. If it acquires the source lock, it uses Option::take() to grab the new value, leaving None in its place. If the captured source is not None, meaning a new source was published since the last check, get_transformed performs the transformation, caches its result, and returns a copy.

get_transformed uses a RwLock to ensure that readers don’t wait for each other, but that the update is exclusive. RwLock nicely maps to Rust’s ownership system by RwLock::read returning a guard that provides shared reference to the underlying data, and RwLock::write returning a guard that provides a mutable reference.

This implementation is about as good as it can get with the use of locks. The problem statement, however, requires get_transformed and set_source not to block each other regardless of how often they are invoked. The above implementation will attempt an exclusive lock of source just to check if a new source has appeared. When this lock succeeds, set_source will be blocked for the duration of the lock. In a read-heavy scenario, get_transformed will be called often and by many different threads, and it is easy to imagine it hogging the lock enough to slow down set_source, which must wait to acquire the lock (it cannot use try_lock) in order to do its job.

Changing this requires looking outside the constructs offered by safe Rust, as discussed in the next post.

Error reporting on Linux via Gmail for automated tasks

Have a critical cron’d automated task? You’d like to be notified if something fails? With the ubiquity of smartphones, you can notice an error right away and take action.

Wow, someone wrote to me! It’s from someone named “mdadm”. Must be spam again!

Computers sending emails for various purposes is nothing new. I have a couple of critical cron jobs on my home computer; syncing the family photos to my remote server, backing up the said remote server to my local computer, etc. These are all tasks that are defined in the daily crontab, and without a proper or any alerting system, something can go wrong and you can really find yourself in a bind if it turns out the backup procedure died months ago because the ssh key changed or something. You can either check the backup or automated task every single day to make sure nothing went wrong, or you can setup a robust alerting system that will send you an email if something goes wrong. This is not the only use case, Mdadm can also send you an email if a disk drops from a RAID array etc.

Setting up a Gmail relay system with Postfix

Installing and managing an email service is difficult, and you have to contend with all sorts of issues, is your server blacklisted, do you have the appropriate SPF records, is your IP reverse resolvable to the domain name etc, etc. Most of these requirements are difficult or impossible with a simple home computer behind a router without an FQDN. With the relay, you’ll be able to send an email without having to worry if it’ll end up in spam, or not delivered at all as it will be sent from a real Gmail account. Luckily, it’s extremely simple to set it up:

  • Create a Gmail account.
  • Allow “less secure” apps access your new gmail account. Don’t be fooled by how they’re calling it, we’ll be having full encryption for email transfer.
  • Setup Postfix.

I’ll keep the Postfix related setup high level only:

  • Install Postfix with your package manager and select “Internet site”
  • Edit /etc/postfix/sasl_passwd and add:
[smtp.gmail.com]:587    username@gmail.com:PASSWORD
  • Chmod /etc/postfix/sasl_passwd to 600
  • At the end of /etc/postfix/main.cf add:
relayhost = [smtp.gmail.com]:587 # the relayhost variable is empty by default, just fill in the rest
smtp_use_tls = yes
smtp_sasl_auth_enable = yes
smtp_sasl_security_options =
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt
  • Use postmap to hash and compile the contents of the sasl_password file:
# postmap /etc/postfix/sasl_passwd
  • Restart the postfix service

Your computer should now be able to send emails. Test with a little bit of here document magic:

$ mail -s "Testing email" youremail@example.com << EOF
Testing email :)
EOF

If everything went fine, you should be getting the email promptly from your new gmail account. I haven’t tried with other email providers, but it should all be pretty much the same.

Usage example

Now that you have a working relay, it’s time to put it to good use. Here is a simple template script with two key functions that can be sourced through Bash so you can use it within other scripts without having to copy & paste it around.

#!/bin/bash

# Global variables
NAME=$(basename "$0")
LOG=/var/log/"$NAME".log
EMAIL=youremail@whatever.com
LOCKFILE=/tmp/"$NAME".lock
HOST=$(hostname -s)

# All STDERR is appended to $LOG
exec 2>>$LOG

# An alert function if the locking fails
function lock_failure {
  mail -s "Instance of "$0" is already running on $HOST" $EMAIL << EOF
Instance of "$0" already running on $HOST. Locking failed.
EOF
  exit 1
}

# An alert function if something goes wrong in the main procedure
function failure_alert {
  mail -s "An error has occured with "$0" on $HOST" $EMAIL << EOF
An error has occured with "$0" on $HOST. Procedure failed. Please check "$LOG"
EOF
  exit 1
}


function procedure {
  # If file locking with FD 9 fails, lock_failure is invoked
  (
    flock -n 9 || lock_failure
    (
      # The entire procedure is started in a subshell with set -e. If a command fails
      # the subshell will return a non-zero exit status and will trigger failure_alert
      set -e
      date >> $LOG
      command 1
      command 2
      [...]
    )

    if [ $? != 0 ]; then
      failure_alert
    fi

  ) 9>$LOCKFILE
}

function main {
  procedure
}

main

flock(1) is used to make sure there is only one instance of the script running, and we’re checking the exit status of the commands. If you don’t need instance locking, you can simply forego the lock_failure function. The actual work is contained in another subshell which is terminated if any of the commands in the chain fail and sends an email advising you to check $LOG.

Conclusion

A lot of Linux services like Mdadm or S.M.A.R.T. have a feature to send emails if something goes wrong. For example, I set it up to send me an email if a drive fails inside my RAID 1 array, I just had to enter my email address in a variable called MAILADDR in the mdadm.conf file. A couple of days later, I received an email at 7 AM; ooooh someone emailed me. I had a rude awakening, it was Mr. Mdadm saying that I have a degraded array. It turned out to be the SATA cabling that was at fault, but still. This could have gone unnoticed for who knows how long and if the other disk from the RAID 1 failed later on, I could have had serious data loss. If you want to keep your data long term you can’t take any chances, you need to know if your RAID has blown up and not rely on yourself to check it out periodically, you won’t, you can’t, that’s why we automate.

Be careful when you write these programs. If your script is buggy and starts sending a lot of emails at once for no good reason, Gmail will block your ass faster than you can say “Linux rules!” If you’re blocked by Gmail, you might miss an important email from your computer.

Closure lifetimes in Rust

In a comment on my answer to a StackOverflow question about callbacks in Rust, the commenter asked why it is necessary to specify 'static lifetime when boxing closures. The code in the answer looks similar to this:

struct Processor {
    callback: Box<dyn Fn()>,
}

impl Processor {
    fn new() -> Processor {
        Processor { callback: Box::new(|| ()) }
    }
    fn set_callback<CB: 'static + Fn()>(&mut self, c: CB) {
        self.callback = Box::new(c);
    }
    fn invoke(&self) {
        (self.callback)();
    }
}

It seems redundant to specify the lifetime of a boxed object that we already own. In other places when we create a Box<T>, we don’t need to add 'static to T’s trait bounds. But without the 'static bound the code fails to compile, complaining that “the parameter type CB may not live long enough.” This is a strange error – normally the borrow checker complains of an object not living long enough, but here it specifically refers to the type.

Let’s say Processor::set_callback compiled without the lifetime trait on Fn(). In that case the following usage would be legal as well:

fn crash_rust() {
    let mut p = Processor::new();
    {
        let s = "hi".to_string();
        p.set_callback(|| println!("{}", s.len()));
    }
    // access to destroyed "s"!
    p.invoke();
}

When analyzing set_callback, Rust notices that the returned box could easily outlive the data referenced by the CB closure and requires a harder lifetime bound, even helpfully suggesting 'static as a safe choice. If we add 'static to the bound of CB, set_callback compiles, but crash_rust predictably doesn’t. In case the desire was not to actually crash Rust, it is easy to fix the closure simply by adding move in front of it closure, as is is again helpfully suggested by the compiler. Moving s into the closure makes the closure own it, and it will not be destroyed for as long as the closure is kept alive.

This also explains the error message – it is not c that may not live long enough, it is the references captured by the arbitrary CB closure type. The 'static bound ensures the closure is only allowed to refer to static data which by definition outlives everything. The downside is that it becomes impossible for the closure to refer to any non-static data, even one that outlives Processor. Fixing the closure by moving all captured values inside it is not always possible, sometimes we want the closure to capture by reference because we also need the value elsewhere. For example, we would like the following to compile:

// safe but currently disallowed
{
    let s = "hi".to_string();
    let mut p = Processor::new();
    p.set_callback(|| println!("later {}", s.len()));
    println!("sooner: {}", s.len());
    // safe - "s" lives longer than "p"
    p.invoke();
}

Rust makes it possible to pin the lifetime to one of a specific object. Using this definition of Processor:

struct Processor<'a> {
    // the boxed closure is free to reference any data that
    // doesn't outlive this Processor instance
    callback: Box<dyn 'a + Fn()>,
}

impl<'a> Processor<'a> {
    fn new() -> Processor<'a> {
        Processor { callback: Box::new(|| ()) }
    }
    fn set_callback<CB: 'a + Fn()>(&mut self, c: CB) {
        self.callback = Box::new(c);
    }
    fn invoke(&self) {
        (self.callback)();
    }
}

…allows the safe code to compile, while still disallowing crash_rust.

Access your photo collection from a smartphone

The wonders of the smartphone. For us tech savvy people the current modern variant of the smartphone is pretty neat. You can navigate across the globe with the maps, you can of course listen to music or watch films, take photos, view photos, surf the net, etc. As most parents, I too love to talk about my kid, and I like showing a couple of pics to friends.

In earlier posts I was talking about managing, keeping your collection safe, organized. What if you find yourself in the situation that you want to show a specific picture to someone? You can’t expect to have all the photos you’ve taken over various devices to be present on your smartphone. This seemed like a good idea actually, to have access to all of your pictures on the phone.

My wishlist:

  • The pics themselves be somewhere on the internet, or my home computer so I can connect via a dynamic DNS service.
  • The pictures are scaled and quality lowered, so I don’t spend a lot of mobile data viewing unnecessarily large photographs, ~100 KiB per picture or even less. It should be fine for a smartphone display.
  • It would be extremely nice if the whole thing is encrypted end to end.

Turns out there are a couple of solutions already present on the market. One of them is Plex. A cursory glance at the website seems to indicate that it has what I want. There were some issues, however. First off, it’s a totally closed, proprietary bullshit service. How does it work? Supposedly you install the Plex media server on your computer, and with a client on your smartphone you connect directly to your library of photos, videos and music on your PC. Someone even packaged it for Arch Linux. The entire package is over 130 MiB in size. Unfortunately, this service is a bit of an overkill for my requirements. It’s an entire platform for you to play your mostly pirated stuff from a remote location. Sure, that makes sense for some people, but all I need is a gallery viewer and to somehow fetch the photos over the internet.

My first thought was, do they even encrypt this traffic between the phone and your home computer? Of course they didn’t. They do now, but the encryption was implemented a couple of months ago. Seems they even pitched in for a wildcard SSL certificate of some sort. If it’s a wildcard certificate for their domains, that means they proxy the traffic somehow. But considering their track record with basic security, ie, having no encryption of any kind, I’m not so convinced they actually encrypt the entire stream from end to end. To be fair, they have to support a lot of shady devices, like TVs and such, and there’s no telling what kind of CAs do these devices have and what kind of limitations are imposed.

OK, so I gave up on Plex, there has to be some kind of simpler solution. I noticed that an app I was already using, QuickPic, has native support for Flickr, Picasa, some others, and Owncloud. I really don’t want to give all my photos to a nameless cloud provider, but Owncloud, now that’s something. I have a budget dedicated server in Paris, and I decided to give it a try. I won’t go into details on how to install Owncloud. I’m using the Nginx webserver with PHP-FPM. Since the server isn’t exactly mine, and I have issues with trusting private data to anyone, I decided to use a loopback device on the server with a bigass file, and encrypted it with cryptsetup and is mounted to serve as the data directory of Owncloud. This way, once the server is decommissioned or a disk fails, no one will be able to see the content from the Owncloud data directory.

I don’t really want to go into the Owncloud installation. It’s relatively simple, it supports MySQL, Postgresql, sqlite, needs a reasonably recent PHP version and a webserver like Apache, or Nginx. So all I had to do now is prepare the photos, so I can upload them. And once again, Linux comes to the rescue. First, I copied the entire library to somewhere so I can test out the recoding mechanism. All my photos are in an extremely simple hierarchy. YYYY-MM, so that’s 12 directories per year. So:

for i in * ; do (cd $i; mogrify -auto-orient -quality 55 *); done

After that it’s pretty much as you’d expect. Upload the photos via the Owncloud client for Linux which works pretty much Dropbox. Once it’s uploaded you can setup QuickPic on your android phone to connect to the Owncloud instance. And that’s it, you can now access your full photo collection, you’ll just have to periodically add new pictures into the mix. Keep in mind that Owncloud generates the thumbnails on the fly when QuickPic requests photos, which is pretty cool. It’s not blazing fast, but it’s acceptable. The thumbnails are then cached in the Owncloud’s data directory, meaning that it will be a lot faster the second time you view the same directory.

I’ve also setup a script that automatically recodes all the images I take from my digital camera and uploads it to the server. That way I can have access to all the latest photos, I can view them from anywhere. Everything is reasonably secure, under your control.

QuickPic is a great photo app for smartphones in general, not just for Owncloud used in this way.
QuickPic is a great photo app for smartphones in general, not just for Owncloud used in this way.

Simple symmetric encryption

Encryption. It’s one of those words that programmers and sysadmins dread. Always the complications, always the overhead. There is an entire science and math behind encryption, and if you think about it more closely, it makes sense that it’s so complicated. Imagine that you are in a room full of people and you need to say something to your wife that you don’t want anyone else to understand, but they’re all listening. “Honey, are we having sex tonight? Please? – C’mon, we had sex two weeks ago, what do you want from me?”, the wife answers. But if the conversation goes like “Ubarl, ner jr univat frk gbavtug? Cyrnfr? – P’zba, jr unq frk gjb jrrxf ntb, jung qb lbh jnag sebz zr?”, it would be much harder to understand. This is a simple ROT13, but chances are people won’t understand and chances are you won’t be able to pronounce it anyway. Computer encryption works similarly, but needs to set the encryption keys in plain view of everyone, but the implementation is beyond the scope of this article. Take a look at this article for a better explanation.

The cloud & you

In a recent post I spoke briefly about encryption and the omnipresent cloud, but didn’t really get into it. The article is entertaining the idea that you keep a monthly snapshot of all your pictures or something else valuable on a cloud provider, like Dropbox or Google Drive. The point is, keeping possibly sensitive data somewhere that any bored sysadmin can casually go over your files is a bad idea. All you have is their pinky swear that they won’t do such a thing. Your account can get hacked, as we all saw is perfectly possible with the recent “The Fappening” incident. It’s a bad personal security breach. The best course of action is to nip this scenario in the bud, and simply encrypt your stuff before sending over to a cloud provider.

How to easily encrypt your files? The easiest method is using a symmetrical encryption with openssl. You could use GPG, have a complete set of private/public keys, etc. This complicates matters considerably, and you’re screwed if you lost your private key. If this is a case where you offload an encrypted tarball somewhere, and you lost your equipment, better have that 4096-bit key memorized. What we’ll do instead, is use a regular strong password. Remember, we’re not trying to make it as secure as possible, we’re just making it so not every Tom, Dick and Harry from your friendly cloud provider can view your files if they feel like it. This is by far the fastest and easiest way.

Encrypting:

$ tar c documents | openssl aes-256-cbc -in /dev/stdin -out documents.tar.ssl
enter aes-256-cbc encryption password:
Verifying - enter aes-256-cbc encryption password:

That’s it! Your files have been encrypted. Feel free to throw in z or j to tar because openssl won’t compress data. Also, openssl salts by default, so you don’t have to worry about that. Upload the tarball and you’re done. Of course, keep the password to at least 8 characters, no dictionary words, birthdays, use special characters, etc.

Decrypting:

$ openssl aes-256-cbc -d -in documents.tar.ssl | tar x
enter aes-256-cbc decryption password:

This will decrypt your files. There is one big caveat with this. Say your photographs, or important personal projects consume a lot of space, like 20 GB and have a lot of files. Making a single encrypted tarball every month is OK, but uploading a brand new snapshot from your Cable/ADSL line isn’t. I personally use pycryto, it’s a python script I wrote to recursively encrypt all the files within the current directory, delete the original by default, and replace them with .enc files which are encrypted with your password. The timestamps and permissions are preserved, but in the metadata of the files themselves, not contained in the encrypted files. Even still, it’s very rsyncable like this. I have a copy of my photographs on this very server.

Conclusion – if there is any

Why go through all this trouble? They’re just pictures? That part is true, if it’s only pictures, and not an offsite backup of your important work projects that might not be viewable for everyone. It’s more of a principle. And I realize that this way is not cryptographically the best way possible to encrypt your data, but I feel that it’s good enough so it’s not viewable by default. Plaintext sucks. Also, there’s a pretty good chance no one will ever get to see your files, because they don’t do it in general. I’m a sysadmin too, I have access to sensitive data, but I view it as important cargo. I don’t give a flying fuck what’s in it, I really don’t. It’s all just so unimportant for me to actually take a peak. There’s nothing to gain. I have a job to do. I’ve had various jobs throughout my life, from splitting rocks in a quarry, to basic ship maintenance (sanding the chemicals that make it harder for underwater life to latch on to the ship), hauling around cargo, mostly menial jobs. But I’ve always held the same stance. There is nothing to gain from stealing or cheating anything or anyone, you’ll only get a bad rap if you’re caught, and you have to look yourself in the mirror even if you don’t get caught. Not sure how the people that engage in those activities reconcile with their inner-self.

International file names in cross-platform programs

I work for a company that builds simulation software with the front-end GUI developed mostly in Python. This document is a slightly modified version of a guide written for the GUI developers to ensure that file names with international characters work across the supported platforms. Note that this document is specifically about file names, not file contents, which is a separate topic.

Introduction

Modern operating systems support use of international characters in file and directory names. Users not only routinely expect being able to name their files in their native language, but also being able to manipulate files created by users of other languages.

Historically, most systems implemented file names with byte strings where the value of each byte was restricted to the ASCII range (0-127). When operating systems started supporting non-English scripts, byte values between 128 and 255 got used for accented characters. Since there are more than 128 such characters in European languages, they were grouped in character encodings or code pages, and the interpretation of a specific byte value was determined according to the currently active code page. Thus a file with the name specified in Python as '\xa9ibenik.txt' would appear to an Eastern-European language user as Šibenik.txt, but to a Western-European as ©ibenik.txt. As long as users from different code pages never exchanged files, this trick allowed smuggling non-English letters to files names. And while this worked well enough for localization in European countries, it failed at internationalization, which implies exchange and common storage of files from different languages and existence of bilingual and multilingual environments. In addition to that, single-byte code pages failed to accomodate East Asian languages, which required much more than 128 different characters in a single language. The solution chosen for this issue by operating system vendors was allowing the full Unicode repertoire in file names.

Popular operating systems have settled on two strategies for supporting Unicode file names, one taken by Unix systems, and the other by MS Windows. Unix continued to treat file names as bytes, and deployed a scheme for applications to encode Unicode characters into the byte sequence. Windows, on the other hand, switched to natively representing file names in Unicode, and added new Unicode-aware APIs for manipulating them. Old byte-based APIs continued to be available on Windows for backward compatibility, but could not be used to access files other than those with names representable in the currently active code page.

These design differences require consideration on the part of designers of cross-platform software in order to fully support multilingual file names on all relevant platforms.

Unicode encodings

Unicode is a character set designed to support writing all human languages in present use. It currently includes more than 100 thousand characters, each assigned a numeric code called a code point. Characters from ASCII and ISO 8859-1 (Western-European) character sets retained their previous numeric values in Unicode. Thus the code point 65 corresponds to the letter A, and the code point 169 corresponds to the copyright symbol ©. On the other hand, the letter Š has the value 169 in ISO 8859-2, the value 138 in Windows code page 1250, and code point 352 in Unicode.

Unicode strings are sequences of code points. Since computer storage is normally addressed in fixed-size units such as bytes, code point values need to be mapped to such fixed-size code units, or encoded. Mainstream usage has stabilized on a small number of standard encodings.

UTF-8

UTF-8 is an encoding that maps Unicode characters to sequences of 1-4 bytes. ASCII characters are mapped to their ASCII values, so that any ASCII string is also a valid UTF-8 string with the same meaning. Non-ASCII characters are encoded as sequences of up to four bytes.

Compatibility with ASCII makes UTF-8 convenient for introducing Unicode to previously ASCII-only file formats and APIs. Unix internationalization and modern Internet protocols heavily rely on UTF-8.

UTF-16

The UTF-16 encoding maps Unicode characters to 16-bit numbers. Characters with code points that fit in 16 bits are represented by a single 16-bit number, and others are split into pairs of 16-bit numbers, the so-called surrogates.

Windows system APIs use UTF-16 to represent Unicode, and the documentation often refers to UTF-16 strings as “Unicode strings”. Java and DotNET strings also use the UTF-16 encoding.

UTF-32

The UTF-32 encoding maps characters to 32-bit numbers that directly correspond to their code point values. It is the simplest of the standard encodings, and the most memory-intensive one.

System support for Unicode

Windows

Windows file names are natively stored in Unicode. All relevant Win32 calls work with UTF-16 and accept wchar_t * “wide string” arguments, with char * “ansi” versions provided for backward compatibility. Since file names are internally stored as Unicode, only the Unicode APIs are guaranteed to operate on all possible files. The char based APIs are considered legacy and work on a subset of files, namely those whose names can be expressed in the current code page. Windows provides no native support for accessing Unicode file names using UTF-8.

The Win32 API automatically maps C API calls to wide (UTF-16) or single-byte variants according to the value of the UNICODE preprocessor symbol. Functions standardized by C, C++, and POSIX have types specified by the standard and cannot be automatically mapped to Unicode versions. To simplify porting, Windows provides proprietary alternatives, such as the _wfopen() alternative to C fopen(), or the _wstat() alternative to POSIX stat(). Like Win32 byte-oriented functions, the standard functions only work for files whose names can be represented in the current code page. Opening a Japanese-named file on a German-language workstation is simply not possible using standard functions such as fopen() (except by resorting to unreliable workarounds such as 8+3 paths). This is a very important limitation which affects the design of portable applications.

Standard C++ functions, such as std::fstream::open, have overloads for both char * and wchar_t *. Programmers that want their programs to be able to manipulate any file on the file system must make sure to use the wchar_t * overloads. The char * overloads are also limited to opening non-Unicode file names.

Unix

The Unix C library does support the wchar_t type for accessing file contents as Unicode, but not for specifying file names. The operating system kernel treats file names as byte strings, leaving it up to the user environment to interpret them. This interpretation, known as the “file name encoding”, is defined by the locale, itself configured with LC_* environment variables. Modern systems use UTF-8 locales in order to support multilingual use.

For example, when a user wishes to open a file with Unicode characters, such as Šibenik.txt, the application will encode the file name as a UTF-8 byte string, such as "\xc5\xa0ibenik.txt", and pass that string to fopen(). Later, system functions like readdir() will retrieve the same UTF-8 file name, which the application’s file chooser will display to the user as Šibenik.txt. As long as all programs agree on the use of UTF-8, this scheme supports unrestricted use of Unicode characters in file names.

The important consequence of this design is that storing file names as Unicode in the application and encoding them as UTF-8 when passing them to the system will only allow manipulating files whose names are valid UTF-8 strings. To open an arbitrary file on the file system, one must store file names as byte strings. This is exactly the opposite of the situation on Windows, a fact that portable code must take into account.

Python

Beginning with version 2.0, Python optionally supports Unicode strings. However, most libraries work with byte strings natively (often using UTF-8 to support Unicode), and using Unicode strings is slower and leads to problems when Unicode strings interact with ordinary strings.

On Windows, Python internally uses the legacy byte-based APIs when given byte strings and Windows-specific Unicode APIs when given Unicode strings. This means that Unicode files can be manipulated as long as the programmer remembers to create the correct Unicode string. It is not only impossible to open some files using the bytes API, they are misrepresented by functions such as os.listdir::

>>> with open(u'\N{SNOWMAN}.txt', 'w'):
...   pass   # create a file with Unicode name
... 
>>> os.listdir('.')
['?.txt']
>>> open('?.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '?.txt'

Opening the directory in Windows Explorer reveals that Python created the file with the correct name. It is os.listdir and its constraint to return byte strings when given a byte string that creates the problem. os.listdir(u'.') returns the usable [u'\u2603.txt'].

Python 3

Python 3 strings are Unicode by default, so it automatically calls the Unicode versions of Win32 calls and does not exhibit bugs like the listdir bug shown above. On the other hand, Python 3 needs special provisions to map arbitrary Unix file names to Unicode, as described in PEP 383.

File names in applications

Portable programs that want to enable the user to create arbitrary file names must take care how to create and access them. Using portable IO libraries such as gio and Qt resolves many of these problems automatically, but these libraries carry a lot of weight that is unacceptable in many situations. Also, those libraries often don’t interact well with “traditional” C code that accepts file names. In this chapter we present an implementation strategy that enables correct use of Unicode file names with minimal intrusion to the code base.

Since file names are natively bytes on some platforms and Unicode on others, a cross-platform application must choose between these representations. Using Unicode makes programming somewhat easier on platforms with native Unicode APIs, while using UTF-8 bytes has the advantage on platforms with native bytes APIs.

What representation works best depends on the application’s surroundings and the implementation platform. A Python 3 or Java application running on a web server is probably best served by using Unicode consistently and not bothering with Unix non-UTF-8 file names at all. On the other hand, a GTK application, a Python 2 application, or an application needing to interface with C will be off with UTF-8, which guarantees interoperability with the world of bytes, while retaining lossless conversion to Unicode and back.

This guide presents a programming model based on UTF-8 as the file name representation. UTF-8 was chosen for AVL simulation GUIs due to ease of interoperability with various C APIs, including GTK itself. This choice is also shared by the gio library and other modern Unix-based software. Of course, use of UTF-8 is not limited just to file names, it should be used for representation of all user-visible textual data.

Interacting with the file system from Python

Since Python’s built-in functions such as open and os.listdir accept and correctly handle Unicode file names on Windows, the trick is making sure that they are called with correct arguments. This requires two primitives:

  • to_os_pathname— converts a UTF-8 pathname (file or directory name) to OS-native representation, i.e. Unicode when on Windows. The return value should only be used as argument to built-in open(), or to functions that will eventually call it.

  • from_os_pathname — the exact reverse. Given an OS-native representation of pathname, returns a UTF-8-encoded byte string suitable for use in the GUI.

The implementation of both functions is trivial:

  def to_os_pathname(utf8_pathname):
      """Convert UTF-8 pathname to OS-native representation."""
      if os.path.supports_unicode_filenames:
          return unicode(utf8_pathname, 'utf-8')
      else:
          return pathname

  def from_os_pathname(os_pathname):
      """Convert OS-native pathname to UTF-8 representation."""
      if os.path.supports_unicode_filenames:
          return os_pathname.encode('utf-8')
      else:
          return os_pathname

With these in place, the next step is wrapping file name access with calls to to_os_pathname. Likewise, file names obtained from the system, as with a call to os.listdir must be converted back to UTF-8.

def x_open(utf8_pathname, *args, **kwds):
    return open(to_os_pathname(utf8_pathname), *args, **kwds)

def x_stat(utf8_pathname):
    return os.stat(to_os_pathname(utf8_pathname))
...

# The above pattern can be used to wrap other useful functions from
# the os and os.path modules, e.g. os.stat, os.remove, os.mkdir,
# os.makedirs, os.isfile, os.isdir, os.exists, and os.getcwd.

def x_listdir(utf8_pathname):
    return map(from_os_pathname, os.listdir(to_os_pathname(utf8_pathname)))

The function standing out is x_listdir, which is like os.listdir, except it converts file names in both directions: in addition to calling to_os_pathname on the pathname received from the caller, it also calls from_os_pathname on the pathnames provided by the operating system. Taking the example from the previous chapter, x_listdir would correctly return ['\xe2\x98\x83'] (a UTF-8 encoding of the snowman character), which x_open('\xe2\x98\x83') would correctly open.

Any function in the program that accepts a file name must accept — and expect to receive — a UTF-8-encoded file name. Functions that open the file using Python’s open, or those that call third-party functions that do so, have the responsibility to use to_os_pathname to convert the file name to OS-native form.

Legacy path names

to_os_pathname is useful when calling built-in open() or into code that will eventually call built-in open(). However, sometimes C extensions beyond our control will insist on accepting the file name to open the file using the ordinary C fopen() call. Passing an OS-native Unicode file name on Windows serves no purpose here because it will fail on a string check implemented by the Python bindings for the library. And even if it somehow passed the check, the library is still going to call fopen() rather than _wfopen().

A workaround when dealing with such legacy code is possible by retrieving the Windows “short” 8+3 pathnames, which are always all-ASCII. Using the short paths, it is possible to write a to_legacy_pathname function that accepts a UTF-8 pathname and returns a byte string pathname with both Python open() and the C family of functions such as fopen(). Since short pathnames are a legacy feature of the Win32 API and can be disabled on a per-volume basis, to_legacy_pathname should only be used as a last resort, when it is impossible to open the file with other means.

if not os.path.supports_unicode_filenames:
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        return utf8_pathname
else:
    import ctypes, re
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    has_non_ascii = re.compile(r'[^\0-\x7f]').search
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        if not has_non_ascii(utf8_pathname):
            return utf8_pathname
        unicode_pathname = unicode(utf8_pathname, 'utf-8')
        short_length = GetShortPathNameW(unicode_pathname, None, 0)
        if short_length == 0:
            raise ctypes.WinError()
        short_buf = ctypes.create_unicode_buffer(short_length)
        GetShortPathNameW(unicode_pathname, short_buf, short_length)
        short_pathname_unicode = short_buf.value
        return short_pathname_unicode.encode('ascii')

Summary

If this seems like a lot of thought for something as basic as file names with international characters, you are completely right. Doing this shouldn’t be so hard, and this can be considered an argument for moving to Python 3. However, if you are using C extensions and libraries that accept file names, simply switching to Python 3 will not be enough because the libraries and/or their Python bindings will still need to be modified to correctly handle Unicode file names. A future article will describe approaches taken for porting C and C++ code to become, for lack of a better term, Unicode-file-name-correct. Until then, the to_legacy_pathname() hack can come in quite handy.

Handling large sets of photographs and videos

In 2003 my father bought a digital camera for the family. It was an Olympus C-350 Zoom, 3.2 mpix, 3x optical zoom, a 1.8″ LCD display. At that time, at least here in Croatia, having a digital camera was fairly rare. I’m not saying I had it first in my city, but it wasn’t as commonplace as today. This was such a leap from anything that you owned. You could actually take a picture and upload it to the computer. And the image was usually great if the light conditions were optimal, of course. Indoors, and with low lighting the images were terrible.

Šibenik circa 2003/07
Šibenik circa 2003/07 on a good day
Sorry dude, it's 7:59:34 PM on an August the 18th. There's less sunlight than you think.
Sorry dude, it’s 7:59:34 PM on an August the 18th. There’s less sunlight than you think at this time of the year, so better keep the camera perfectly steady for one fifths a second.

This camera wasn’t cheap. It cost a little less than $500. For Croatian standards of the time it was a fair amount of money. Still is actually, but that is the minimum you have to spend to have a decent camera, it was like that then, it’s still like that now.

I was making pictures of the town, taking it on trips, documenting everything. Since I was always a computer enthusiast I was beginning to worry, what if the hard disk failed? I’d lose all of the photographs I had acquired. There are people that seem to underestimate the importance of photos. You take the photos, they’re nice, but they’re not that valuable right at that moment. Looking back 10 years or more, suddenly the pictures become somehow irreplaceable. They’re a direct window into your past, not the blurry vision of the past that most of us have, but something concrete and immutable. I think this especially applies when you have kids, you’ll want the pictures safe and sound, at least for a little while. Everything gets lost in the end, but why be the guy that loses something that could be classified as a family heirloom?

How not to lose the pictures and how to organize them

Here’s a high-level list of what I’ve found to be good practices, to keep it organized and safe:

  • A clear structure of stored photographs/videos. I’ve found that a single root directory with a simple YYYY-MM does the trick. I dislike categorizing pictures with directory names like summer-vacation-2003, party-at-friends-house-whose-name-I-cant-remember or something to that effect. I think that over time, the good times you had get muddled along the way, and you’ll appreciate a simple year-month format to find something or to remember an occasion. It’s like a time machine, let’s see what I was doing in the spring of 2004, and you can find fun pictures along the way.
  • This goes without saying, backups. Buy an external disk, they go cheap, and you can store a lot of photos there. Your disk can die suddenly and without notice, and all your pictures can simply vanish, never to be seen again. Sure, son, I’d love to show you pictures when I was young, but unfortunately, I couldn’t be bothered to have a backup ready and all the pictures are gone.
  • Disaster recovery – imagine your whole building/house burns to the ground. You get nothing but rubble, and although you were meticulously syncrhonizing to your backup every night to an external HDD, everything is gone. Or more realistically, your house gets broken into and they steal your electronics which contain data that is basically irreplaceable. Create a tarball of all your photographs/videos, encrypt it with a GPG key or passphrase, or with a simple SSL encryption and upload it into the cloud of your choice. Even in the event of a burglary/arson with a regular snapshot of about once per two months, you’ll still be able to recover most of the data when you rebuild your house or buy a new computer in the event of a burglary.
  • Print out a yearly compilation of pictures that you like at your local photo lab. Just pick like 40 of the best, with whatever criteria you deem fit. Who knows if the JPEG standard will be readable in 30 years time, but you can always look at a physical picture you can take with you.
Wow, I just called the cops that my house was burglarized. Now it burned down too? If only I had a disaster recovery plan for my valuable photos.
I just called in a burglary at my house. Now it burned down while getting beers from the store? If only I had a disaster recovery plan for my valuable photos on both desktop computer and portable HDD.

Photos

Most digital cameras, be it video or still frames, have pretty lavish defaults with the image quality. This is a very good thing. I like to get a source file that is close as possible as the device has serialized it to a file. Still, if you take a lot of pictures, you’ll quickly notice that it’s piling up. The first thing to do is delete the technically failed ones. Do not delete the pictures where you think that someone is ugly on it, it may end up great in a certain set of circumstances. You never know.

These days even the shittiest cameras boast with huge pixel numbers, like 10, 15 mega pixels or more with a tiny crappy lens and who knows what kind of sensor. Feel free to downsize it to 5-8 mega pixels, with a JPEG quality of 75-80. Quickly you’ll see that now your images consume a lot less space on the HDD, I’m talking about 30% of the original photo, sometimes even less. I spent a lot of time trying to find exactly how the image is degraded. Some slight aberrations can be seen if you go pixel peeping, but screw that, the photos might have sentimental value on the whole, and you’ve saved a lot of hard drive space that you realistically have available. I recommend using the Imagemagick suite for all your resizing needs. Create a directory where you want the recoded images, like lowres:

$ mogrify -path lowres -auto-orient -quality 80 -resize 8640000@ *.jpg

You can set the number of pixels, in this example it’s 8.64Mpix. Choose a resolution and go with it. I generally use 3600×2400 which is 8640000 in pixels. Mogrify is great for this task, it can encode the images in parallel, so if you have a multi-core computer it really shines since the operations involved are very CPU expensive. You can omit the -path switch, and the files will be processed and placed instead of the file, but be careful as this WILL overwrite the original file(s). Don’t test around on your only copy of the file. You can use the generally more safe convert which takes the same argument with a slight difference, it needs the INFILE and OUTFILE argument:

$ convert -auto-orient -quality 80 -resize 8640000@ mypicture.jpg mypicture-output.jpg

or

$ for JPEGS in *.jpg ; do convert -auto-orient -quality 80 -resize 8640000@ $JPEGS $JPEGS-out; done

The problem with this is that you’ll then have a bunch of IMG_xxxx.jpg-out files. This is the longer way around, but once you’re satisfied with the result, delete the original jpeg files and rename it with a program that mass renames or you can use a perl script called ren, my brother and a buddy of his wrote a long time ago and it still works great for a number of circumstances:

$ ren -c -e 's/\-out//'

This will rename all the files that have the -out to empty string, deleting it from the filename essentially. But this is the long way around, I suggest using mogrify. Mogrify had a very very nasty bug. At one point they decided that it would be cool if you have an Nvidia card and the proprietary drivers installed it would use the GPU for all your encoding needs. That sounds great in theory, but I actually had an Nvidia graphics card with the drivers properly installed. How do I know that? Complex 3D video games worked without issues. And guess what else? It didn’t fucking work. It simply hang there, and didn’t do anything, it would never finish a single image. Did I mention that you can’t fallback on the CPU so easily, I mean at all? I googled around, and multiple bugs were filed. I just tried mogrify now when writing this post, seems they have finally fixed it, and I may go back to using it again, instead of unnecessarily complex python scripts that called concurrent converts which number was based on the number of your physical cores.

Video

A nice feature of modern cameras is its ability to record decent video and audio. The cameras mostly use a very good quality preset for the recordings. On my current SLR camera I get 5-6 megabytes per second for a video. Not only that the files are monstrously huge, they also are sometimes in non-standard containers, have weird video and/or audio encodings. You should really convert it to something decent:

$ ffmpeg -i hugefile.mov -c:v libx264 -preset slow -crf 25 -x264opts keyint=123:min-keyint=20 -c:a libmp3lame -q:a 6 output.mkv

This produces a pretty good quality video. I am strongly against rescaleing the video in any way. Use the original resolution, the displays are advancing at a stable pace, you don’t want to unnecessarily scale down the resolution. You can change the quality with -crf from 18-29 are reasonable options, I discussed it in another post in more detail. Also, it decreases the file size by a factor of 15 or more, virtually without perceptible visual loss. As an added bonus you mux it into an open source container with the h264 family of encoders and the venerable mp3 format for audio. That should work on most computer players by default as well as standalone players hooked up to a TV.

I started this post as more of an in-depth technical overview how to store and encode your multimedia and backing it up. But instead I chose to give a high-level overview of what worked for me over the years. Make backups regularly, have a disaster recovery option present if at all possible, and print out some yearly photos. It’s fun to look over the physical pictures, and can be good fun showing it to visiting friends and family. When deciding how much to shrink the files, always keep in mind that you should compress them as much as possible while keeping the subjective perception of the quality as close as possible to the original. What I mean to say, don’t overdo with the quality settings. What matters is how much space is your archive consuming right now, and how are you able to cope with that amount of data.

Data loss is commonplace. Hard drives fail, do not lose 10+ years of photographs because you didn’t have a decent backup. It’s not so hard. Do it now. Don’t lose a part of your personal history, it’s priceless, and cannot be downloaded from the internet again. Always encrypt your stuff before uploading to the ethereal cloud. Maybe you have sensitive pictures that you wouldn’t want anyone else casually looking over just because they happen to be the sysadmin. You wouldn’t make the same kind of privacy breach in other parts of your life, would you?

Recording a game video with Linux

I’m sure a lot of people have always thought, wow, I’d like to record a video of this to have it around! On Linux! Well, with it’s incredibly easy! OK, not really so easy, you’ll have to handle a few hurdles along the way, but it’s nothing terrible. As an example I’ll be using prboom which is an engine to run Doom 1/2 with the original WADs you have obtained legally, paid for fair and square etc. It uses 3D hardware acceleration, no jumping, crouching and shit like that. It’s great to see Doom 1/2 in in high resolution, it looks pretty good, and very true to the original, and makes the game more than playable.

Requirements

There is a beautiful program called glc. Basically it hooks to the video & audio of the system and dumps a shitload of fastest compressing png files, that is one per frame. Depending on the resolution you use for capturing and the framerate, expect very hardcore output per second to your HDD, somewhere around 50 megabytes per second for a full hd experience, and that’s with the quicklz compression method for glc-capture.

I won’t go into too much detail how to install glc, or prboom. I’m sure it’s simple for your favorite Linux distribution. It was a simple aurget -S on Arch. Now,let’s head on to actually capturing some gameplay. The syntax is very simple glc-capture [options] [program] [programs' arguments]:

The initial video capturing

$ glc-capture -j -s -f 60 -n -z none prboom-plus -width 1920 -height 1080 -vidmode gl -iwad dosgames/DOOM2/DOOM2.WAD -warp 13 -skill 5

This was the tricky part, I had to play around with the options to get it glitch free inside the game. I’ve recorded a video three years ago with glc, can’t remember using some of these options. -j – force-sdl-alsa-drv, got better performance, but maybe unneeded, play around with it -s – so recording starts right away -f – sets the framerate -n – locks FPS, didn’t need this before, but you get a glitch-free recording -z none – no PNG compression, I’ve had better performances without compression The prboom-plus options should be self-explanatory, I’ve used the 1920×1080 resolution so it’s youtube friendly. The -warp is to warp to level 13, and -skill 5 is for nightmare. The file output is named $PROGRAM-$PID-0.glc by default.

OK, the easy part is done, apart from the tricky part. Now you have a huge-ass .glc file on your hard drive that is completely unplayable by any video player known to man. And when I say huge-ass, I mean huge-ass. A 54 second video comes out to 1.79GB, which is 34MB per second in 720p, and for 1080p I had up to 42MB per second! The default png compression used by glc-capture is quicklz. For 1080p I had some better experiences using -z none so it simply dumps the PNGs into the file and that’s it. As you might figure, this will also increase the resulting file size, but it could be well worth the disk space if you don’t have a fast CPU. You’ll get close to a 100 MB/s for a 1080p stream. Use the default compression if in doubt. Experiment.

What do we do with an unplayable, unusable, unuploadable gigantic glc dump on our hard drive? I strongly suggest you encode it somehow. I used to use mencoder for all my encoding needs. Due to the way it’s maintained, or a lack thereof, I switched to ffmpeg which has an active development and used a lot in the backends of various video tube sites around the internet. OK, let’s go, step by step:

Extract the audio track

$ glc-play prboom-plus-12745-0.glc -a 1 -o 1080p.wav

This line dumps the audio track from the glc file, of course it’s a completely uncompressed wave file. -a 1 is for track #1, and -o is for output, naturally.

Pipe the uncompressed video to ffmpeg and encode to a reasonable file format

$ glc-play prboom-plus-12745-0.glc -o - -y 1 | ffmpeg -i - -i 1080p.wav -c:v libx264 -preset slow -crf 25 -x264opts keyint=123:min-keyint=20 -c:a libmp3lame -q:a 6 doom-final-file.mkv

-o - dumps it to STDOUT, -y 1 is for video track 1. Now we have used the all might unix PIPE. I love pipes. In this case ffmpeg uses two files as input, one is STDIN, that is the hardcore raw video file, no pngs, just raw video. The other input is the audio track we dumped earlier. This could be streamlined with a FIFO, but that’s overcomplicating things. The rest of the ffmpeg options are beyond the scope of this article, but they’re a reasonable default. The final argument of ffmpeg is the output file. The container type is determined by the file extension, so you can use mp4, mkv, or whatever you want. After this, the video is finally playable, uploadable, usable. Congrats, you have just recorded your video the Linux way!

If you do want to customize the final video quality, take a look at the the ffmpeg documentation at what these mean. The only thing of interest is the -preset and the -crf. The crf is the “quality” of the video. I was astounded that 2-pass encoding is a thing of the past, and it’s all about the crf now. It goes from 0 to 51. And only a small part of that integer range is actually usable. I simply cannot relay the beautiful wording from the docs, so I’ll just paste it here:

The range of the quantizer scale is 0-51: where 0 is lossless, 23 is default, and 51 is worst possible. A lower value is a higher quality and a subjectively sane range is 18-28. Consider 18 to be visually lossless or nearly so: it should look the same or nearly the same as the input but it isn’t technically lossless.

Details like these can really brighten a person’s day. 18 is visually lossless (and no doubt uses a billion bits per second), but technically only 0 is lossless. So you have a full range from 0 to 18 that is basically useless. Of course, it goes the other way around. After -crf 29 the quality really goes downhill.

The resulting video can be found here or you can see it on YouTube. Excuse my cheating and my dying so fast, this is for demonstration purposes.

Conclusion

I realize there are probably better ways of accomplishing this, you can google around for better solutions. Glc-capture supposedly works with wine too, with some tweaks. I haven’t really tried it, but feel free to leave a comment if someone had any experience with it. This is a simple way to make a recording, you can edit it later once you encode the file to something normal. Glc also supports recording multiple audio tracks so you could also record your voice with a microphone and mash it all together. Good luck with that. :)

Data transfer with Netcat

The other day my brother, who works as a system administrator, inquired about a puzzling behavior of GNU Netcat, the popular nc utility. Sometimes described as the TCP/IP Swiss army knife, it can come in handy as an ad hoc file transfer tool, capable of transferring large amounts of data at the speed of disk reads/writes.

Data transfer

Typical usage looks like this:

nc -lp60000 | tar x                 # receiver
tar c dir... | nc otherhost 60000   # sender

It may look strange at first, but it’s easy to type and, once understood, almost impossible to forget. The commands work everywhere and require no specialized server software, only a working network and nc itself. The first command listens on port 60000 and pipes the received data to tar x. The second command provides the data by piping output of tar c to the other machine’s port 60000. Dead simple.

Note that transferring files with Netcat offers no encryption, so it should only be used inside a VPN, and even then not for sensitive data.

Data loss

One surprising behavior of this mode of transfer is that both commands remain hanging after the file transfer is done. This is because neither nc is willing to close the connection, as the other side might still want to say something. As long as one is positive that the transfer is finished (typically confirmed by disk and network activity having ceased), they can be safely interrupted with ^C.

The next step is adding compression into the mix, in order to speed up transfer of huge but easily compressible database dumps.

nc -lp60000 | gunzip -c | tar x               # receiver
tar c dir... | gzip -c | nc otherhost 60000   # sender

At first glance, there should be no difference between this pipeline and the one above, except that this one compresses the content sent over the wire and decompresses received content. However, much to my surprise, the latter command consistently failed to correctly transfer the last file in the tar stream, which would end up truncated. And this is not a case of pressing ^C too soon — truncation occurs no matter how long you wait for the transfer to finish. How is this possible?

It took some strace-ing to diagnose the problem. When the sender nc receives EOF on its standard input, it makes no effort to broadcast the EOF condition over the socket. Some Netcat implementations close (“shut down”) the write end of the socket after receiving local EOF, but GNU Netcat doesn’t. Failure to shut down the socket causes the receiving nc to never “see” the end of file, so it in turn never signals EOF to gunzip. This leaves gunzip hanging, waiting for the next 32K chunk to complete, or for EOF to arrive, neither of which ever happens.

Preventing Netcat data loss

Googling this issue immerses one into a twisted maze of incompatible Netcat variants. Most implementations shut down the socket on EOF by default, but GNU Netcat not only doesn’t do so, it doesn’t appear to have an option to do so! Needless to say, the huge environment where my brother works would never tolerate swapping the Netcat implementation on dozens of live servers, possibly breaking other scripts. A solution needed to be devised that would work with GNU Netcat.

At this point, many people punt and use the -w option to resolve the problem. -w SECONDS instructs nc to exit after the specified number of seconds of network inactivity. In the above example, changing nc -lp60000 to nc -lp60000 -w1 on the receiving end causes nc to exit one second after the data stops arriving. nc exiting causes gunzip to receive EOF on standard input, which prompts it to flush the remaining uncompressed data to tar.

The only problem with the above solution is that there is no way to be sure that the one-second timeout occurred because the data stopped arriving. It could as well be the result of a temporary IO or network glitch. One could increase the timeout to decrease the probability of a prematurely terminated transfer, but this kind of gamble is not a good idea in production.

Fortunately, there is a way around the issue without resorting to -w. GNU Netcat has a --exec option that spawns a command whose standard input and standard output point to the actual network socket. This allows the subcommand to manipulate the socket in any way, and fortuitously results in the socket getting closed after the command exits. With the writing end closing the socket, neither nc is left hanging, and the transfer completes:

nc -lp60000 | gunzip -c | tar x                  # receiver
nc -e 'tar c dir... | gzip -c' otherhost 60000   # sender

Self-delimiting streams

There is one just little thing that needs explaining: why did the transfer consistently work with tar, and consistently failed to work with the combination of tar and gzip?

The answer is in the nature of the stream produced by tar and gzip. Data formats generally come in two flavors with respect to streaming:

  1. Self-delimiting: formats whose payload carries information about its termination. Example of a self-delimiting stream is an HTTP response with the Content-Length header — a client can read the whole response without relying on an out-of-band “end of file” flag. (HTTP clients use this very feature, along with some more advanced ones, to reuse the same network socket for communicating multiple requests with the server.) A well-formed XML document without trailing whitespace is another example of a self-delimiting stream.

  2. Non-self-delimiting: data formats that do not contain intrinsic information about their end. A text file or an HTML document are examples of those.

While a tar archive as a whole is not self-delimiting (nor can they be, since tar allows appending additional members at the end of the archive), its individual pieces are. Each file in the archive is preceded by a header announcing the size of the file. This allows the receiving tar to read the correct number of bytes from the pipe without needing additional end-of-file information. Although tar will happily hang forever waiting for more files to arrive on standard input, every individual file will be read to completion.

On the other hand, gzip does not produce a self-delimiting stream. gunzip reads data in 32-kilobyte chunks for efficient processing. When it receives a shorter chunk, say 10k of data, it doesn’t assume that this is because EOF has occurred, it just waits for the remaining 22K of data for its buffer to fill. gunzip relies on EOF condition to tell it when and if the stream has ended, after which it does flush the “short” buffer it last read. This incomplete buffer is what caused the data loss when compression was added.

This is why the -e solution works so nicely: it not only makes the socket close, it ensures that EOF is signaled across the chain of piped commands, including the receiving gunzip.

Creating panoramic photos

You might have seen some nice pictures around the web that have been taken with a simple compact camera, but they have an astonishing amount of detail. You may wonder, how do they get such a nice, detailed picture? They simply stitch them together. How, you might ask? Do I need to shell out hundreds of currency in order to obtain the latest from Adobe and the likes? Nope, once again free software to the rescue, and it’s incredibly easy!

Step 1

Take the pictures. Bear in mind that they need to overlap, which should really be obvious. A good rule of thumb is to have at least 50% of the picture to overlap with the previous picture. Remember, no one says you can’t take the photos in the portrait mode. It would be a good idea to lock the white balance to a reasonable preset, so the camera doesn’t decide that picture has gone from “cloudy” to “sunny”. Although, not really necessary, as hugin has very advanced features to compensate. Also, you’ll want to lock the exposure so it doesn’t vary between the shots. Once again, this isn’t a problem for hugin, but it might improve your panorama. You can stack an arbitrary grid of pictures, for example 2×3, 3×3, 4×2, etc. For the sake of this article, I used the almost automatic on my EOS 100D with a 40mm pancake lens:

I used the portrait orientation for taking the pictures. I just snapped them and uploaded to my computer.

Step 2

Install Hugin that undoubtebly comes with your favorite distro, or if you’re a Windows user, simply download from their website. Now, I should point out at this time that Hugin is a very feature-full and complex software. The more advanced features are beyond the scope of this article, and quite frankly they somewhat elude me. Anyway, before I get too side-tracked, fire up Hugin, click on Load images, then on Align, and finally Create panorama, choose where you want the stitched photo to end up. There is no step 3:

Beautiful view of Zagreb
Beautiful view of Zagreb

Hugin took care of the exposure and the white balance. You should really use the tips from above, though.

Conclusion

You’ll tell me, but MrKitty, there is wonderful software out there that is waaay better than Hugin, or Hugin is a very advanced tool that you have no idea how to use. Very much true, but the point of this 2-step tutorial is to point out to people that Linux and the associated software CAN be user friendly, and sometimes even more powerful than their proprietary counterparts. I’ve been using Linux for a while now and I sometimes get the question, but why are you using Linux instead of Windows? There is no easy answer. For starters, I work as a Linux sysadmin for a living, so that’s one, though I don’t really need anything more than Putty. It’s the little things, stuff like Hugin, it’s the plethora of programs that are available with your friendly package manager, the ability to write simple code without the need for big frameworks and the like. Try looping through a couple of files and doing something on Windows. You need specialized software for every little thing you want to do.

But MrKitty, you’re a power user, you sometimes code, you’re a professional in the field, of course you like Linux better! Well, I don’t really have anything against Windows, or Mac, or whatever. But I think everyone is forgetting just how much Windows can be a pain in the ass. I won’t even go for the low shots like BSOD.

Billions of dollars have gone into making this as user friendly as possible
Billions of dollars have gone into making this as user friendly as possible

OK, forget BSOD, there are other stuff that Windows lovers might forget. I’m sure everyone cherishes those sweet moments when you’re battling with drivers. I used to fix computers for money. You wouldn’t believe the stuff I would see. The latest one, a colleague of mine asked me to help him out with a mobile USB dongle. The laptop was running Windows 8, I think. Wow, I really lost the touch with the new Windows, in my mind Windows XP is the latest and greatest. Took me a while to actually find the control panel. OK, the drivers were somehow screwed up, even though Windows 8 was supposed to be supported. There was enough signal, the connection was active. Nothing was loading. Pinging 8.8.8.8 seemed to work, but resolving anything did not, even though the DNS settings were correct. A couple of hours of headbanging and googling revealed a nice forum in Polish with people with the exact same problem, and to my surprise there was a solution at hand. A new and improved driver download from somewhere, creeping at a nice 3 – 10 kilobytes per second and it worked, after tweaking the endless carrier-specific options. So yeah, Windows are really user friendly. I have no idea if it would work on Linux.

Anyway, my mother, age 69 is using Linux and loves it. My wife says she can’t imagine ever using Windows again. :)