About the “ineffectiveness of torture”

A friend of mine posted a commentary about the ongoing torture debate. He was appalled about people arguing that torture doesn’t work. “Why stop there,” he asks. “Does rape work? Let’s have a debate! Is genocide effective? Get some pundits to discuss pros and cons! Does sexual abuse of children yield results? Get some experts on screen!” This line of inquiry prompted me to consider whether the ineffectiveness of torture is ever an acceptable argument against it.

To consider the issue on a broader level, one can imagine “pundits” of other times hotly debating effectiveness of slavery, systemic censorship, forced prison labor, or gladiator battles.

The debate over torture has revealed the ugly fact that torture is widely supported by Americans, and the support has risen in the last several years. I see it as a result of a careful propaganda campaign that made skillful use of the naive portrayals of torture popular in fiction, and especially in pulp. Fiction tends to show torture as the only measure available to defeat infinitely evil opponents. This is so wide-spread that the tvtropes site documents a number of torture-related tropes, such as the “ticking time bomb”. When Americans support torture, they envision this scenario, a fact the agencies promoting their own use of torture happily embraced. If one’s thinking is confined to the ticking time bomb scenario, their moral judgment (and, one might add, decency and good taste) is suspended; they ask themselves nonsensical questions with leading answers, such as “would you torture one person to save a hundred?” (or a thousand, million, etc.) The fact that this scenario is an unrealistic fictional invention fueled by non-fictional propaganda machinery never enters their mind.

Given the way Americans think about torture, it is probably simply easier, and more efficient in the short term, to point out the ineffectiveness of torture than to debunk the ticking-time-bomb image. But in the long run, I would argue that relying on the ineffectiveness argument is an extremely bad idea. Imagine future interrogators and their teams of psychologists and rogue doctors perfecting techniques to use torture effectively and start extracting reliable information from tortured enemies and suspects. An argument against torture hinging on its ineffectiveness would immediately fall apart.

The dilemma whether to use scientific data to back up what is essentially an ethical choice is present elsewhere. For example, lack of measurable differences between races has been cited as argument against racism, sometimes extended to a complete dismissal any notion of race as a “social construct”, like ethnicity. But what if geneticists do discover measurable and important differences between races? Basing what is essentially an ethical judgment on scientific data positions it on thin ice because all scientific views can (and must be allowed to) change. This is why I am against drawing the ineffectiveness argument into the discussion against torture, even if it appears useful in the short term.

Simple symmetric encryption

Encryption. It’s one of those words that programmers and sysadmins dread. Always the complications, always the overhead. There is an entire science and math behind encryption, and if you think about it more closely, it makes sense that it’s so complicated. Imagine that you are in a room full of people and you need to say something to your wife that you don’t want anyone else to understand, but they’re all listening. “Honey, are we having sex tonight? Please? – C’mon, we had sex two weeks ago, what do you want from me?”, the wife answers. But if the conversation goes like “Ubarl, ner jr univat frk gbavtug? Cyrnfr? – P’zba, jr unq frk gjb jrrxf ntb, jung qb lbh jnag sebz zr?”, it would be much harder to understand. This is a simple ROT13, but chances are people won’t understand and chances are you won’t be able to pronounce it anyway. Computer encryption works similarly, but needs to set the encryption keys in plain view of everyone, but the implementation is beyond the scope of this article. Take a look at this article for a better explanation.

The cloud & you

In a recent post I spoke briefly about encryption and the omnipresent cloud, but didn’t really get into it. The article is entertaining the idea that you keep a monthly snapshot of all your pictures or something else valuable on a cloud provider, like Dropbox or Google Drive. The point is, keeping possibly sensitive data somewhere that any bored sysadmin can casually go over your files is a bad idea. All you have is their pinky swear that they won’t do such a thing. Your account can get hacked, as we all saw is perfectly possible with the recent “The Fappening” incident. It’s a bad personal security breach. The best course of action is to nip this scenario in the bud, and simply encrypt your stuff before sending over to a cloud provider.

How to easily encrypt your files? The easiest method is using a symmetrical encryption with openssl. You could use GPG, have a complete set of private/public keys, etc. This complicates matters considerably, and you’re screwed if you lost your private key. If this is a case where you offload an encrypted tarball somewhere, and you lost your equipment, better have that 4096-bit key memorized. What we’ll do instead, is use a regular strong password. Remember, we’re not trying to make it as secure as possible, we’re just making it so not every Tom, Dick and Harry from your friendly cloud provider can view your files if they feel like it. This is by far the fastest and easiest way.

Encrypting:

$ tar c documents | openssl aes-256-cbc -in /dev/stdin -out documents.tar.ssl
enter aes-256-cbc encryption password:
Verifying - enter aes-256-cbc encryption password:

That’s it! Your files have been encrypted. Feel free to throw in z or j to tar because openssl won’t compress data. Also, openssl salts by default, so you don’t have to worry about that. Upload the tarball and you’re done. Of course, keep the password to at least 8 characters, no dictionary words, birthdays, use special characters, etc.

Decrypting:

$ openssl aes-256-cbc -d -in documents.tar.ssl | tar x
enter aes-256-cbc decryption password:

This will decrypt your files. There is one big caveat with this. Say your photographs, or important personal projects consume a lot of space, like 20 GB and have a lot of files. Making a single encrypted tarball every month is OK, but uploading a brand new snapshot from your Cable/ADSL line isn’t. I personally use pycryto, it’s a python script I wrote to recursively encrypt all the files within the current directory, delete the original by default, and replace them with .enc files which are encrypted with your password. The timestamps and permissions are preserved, but in the metadata of the files themselves, not contained in the encrypted files. Even still, it’s very rsyncable like this. I have a copy of my photographs on this very server.

Conclusion – if there is any

Why go through all this trouble? They’re just pictures? That part is true, if it’s only pictures, and not an offsite backup of your important work projects that might not be viewable for everyone. It’s more of a principle. And I realize that this way is not cryptographically the best way possible to encrypt your data, but I feel that it’s good enough so it’s not viewable by default. Plaintext sucks. Also, there’s a pretty good chance no one will ever get to see your files, because they don’t do it in general. I’m a sysadmin too, I have access to sensitive data, but I view it as important cargo. I don’t give a flying fuck what’s in it, I really don’t. It’s all just so unimportant for me to actually take a peak. There’s nothing to gain. I have a job to do. I’ve had various jobs throughout my life, from splitting rocks in a quarry, to basic ship maintenance (sanding the chemicals that make it harder for underwater life to latch on to the ship), hauling around cargo, mostly menial jobs. But I’ve always held the same stance. There is nothing to gain from stealing or cheating anything or anyone, you’ll only get a bad rap if you’re caught, and you have to look yourself in the mirror even if you don’t get caught. Not sure how the people that engage in those activities reconcile with their inner-self.

Encrypt files recursively with openssl

I wrote this program because I had a great idea to offload encrypted versions of my data, but conserving the full directory structure, keeping the permissions and timestamps. This way you can do incremental encrypted snapshots to an untrusted remote server. Always rsync the directory you wish to encrypt someplace else locally, and then run this script.

$ pycrypto.py -e 

This encrypts everything in the "." directory, recursively, and removes the original files. Pass the -n switch to not remove the files.

$ pyrcypto.py -d

This will decrypt everything within "." directory, with the .enc extension. Please be aware that because of the nature of the encryption used, there is no sanity check. If you enter the wrong password while decrypting, the files will be “decrypted” with the wrong password. You’ll get the files, but it won’t be the files you have encrypted in the first place. I’ve considered using a hashing method, but this comprimises security and slows down the process considerably and this was designed to be fast. You can download pycrypto here.

#!/usr/bin/env python2

"""
This script encrypts files in the current (.) directory, including 
hidden files using the AES encryption. The original timestamps are 
preserved and the original files are deleted. A .enc suffix is added 
at the end of each encrypted file. The purpose of the program is to 
encrypt the data, while preserving the original directory structure 
and timestamps so you can safely rsync it to an unsecure location. 
The passphrase needs to be either 16, 24 or 32 bytes long.

You will need to have python-crypto installed on your system, most
distributions have it in their repositories.
"""

from Crypto.Cipher import AES
from stat import *

import os, random, struct, optparse, sys, getpass

# Define options
parser = optparse.OptionParser(usage="%prog --encrypt | --decrypt")
parser.add_option('-e', '--encrypt', dest='enc', action="store_true",
       help='Encrypt the entire current directory including files in'
       ' subtrees and hidden files')
parser.add_option('-d', '--decrypt', dest='dec', action="store_true",
       help='Decrypt the entire current directory including files in'
       ' subtrees and hidden files')
parser.add_option('-v', '--verbose', dest='verb', action="store_true",
       help='Verbose mode')
parser.add_option('-n', '--no-remove', dest='delete', action="store_true",
       help='Do not remove input files once encrypted or decrypted')       
options, files = parser.parse_args()

FILES=[]

def pad_password(pwd):
    "Pad the password to lengths 16, 24, or 32, as needed for AES encryption."
    for size in 16, 24, 32:
        if len(pwd) <= size:
            return pwd + 'x' * (size - len(pwd))
    raise ValueError("password must be 32 characters, or shorter")



def encrypt_file(key, in_filename, out_filename, atime, mtime, perm, chunksize=64*1024):
    if options.verb:
        sys.stdout.write("Encrypting %s\n" % in_filename[2:])
    iv = ''.join(chr(random.randint(0, 0xFF)) for i in range(16))
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    filesize = os.path.getsize(in_filename)

    with open(in_filename, 'rb') as infile:
        with open(out_filename, 'wb') as outfile:
            outfile.write(struct.pack('<Q', filesize))
            outfile.write(iv)

            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)

                outfile.write(encryptor.encrypt(chunk))
    os.utime(out_filename, (atime, mtime))
    os.chmod(out_filename, perm)
    if options.delete == None:
       os.remove(in_filename)


def decrypt_file(key, in_filename, out_filename, atime, mtime, perm, chunksize=24*1024):
    if options.verb:
        sys.stdout.write("Decrypting %s\n" % in_filename[2:])

    with open(in_filename, 'rb') as infile:
        origsize = struct.unpack('<Q', infile.read(struct.calcsize('Q')))[0]
        iv = infile.read(16)
        decryptor = AES.new(key, AES.MODE_CBC, iv)

        with open(out_filename, 'wb') as outfile:
            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                outfile.write(decryptor.decrypt(chunk))

            outfile.truncate(origsize)
    os.utime(out_filename, (atime, mtime))
    os.chmod(out_filename, perm)
    if options.delete == None:
       os.remove(in_filename)

if options.enc and options.dec:
    sys.stderr.write("Please use -e or -d")
    sys.exit(1)

if options.enc == None and options.dec == None:
    sys.stderr.write("Please use -e or -d\n")
    sys.exit(1)

for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        FILES.append(os.path.join(dirname, filename))

if options.enc:
    pwd = pad_password(getpass.getpass())
    for i in FILES:
       if os.path.islink(i):
           continue 
       encrypt_file(pwd, i, i+".enc",
       os.path.getatime(i), os.path.getmtime(i), os.stat(i)[ST_MODE])

if options.dec:
   pwd = pad_password(getpass.getpass())       
   for i in FILES:
       if os.path.islink(i):
           continue 
       decrypt_file(pwd, i, i[:-4], 
       os.path.getatime(i), os.path.getmtime(i), os.stat(i)[ST_MODE])

International file names in cross-platform programs

I work for a company that builds simulation software with the front-end GUI developed mostly in Python. This document is a slightly modified version of a guide written for the GUI developers to ensure that file names with international characters work across the supported platforms. Note that this document is specifically about file names, not file contents, which is a separate topic.

Introduction

Modern operating systems support use of international characters in file and directory names. Users not only routinely expect being able to name their files in their native language, but also being able to manipulate files created by users of other languages.

Historically, most systems implemented file names with byte strings where the value of each byte was restricted to the ASCII range (0-127). When operating systems started supporting non-English scripts, byte values between 128 and 255 got used for accented characters. Since there are more than 128 such characters in European languages, they were grouped in character encodings or code pages, and the interpretation of a specific byte value was determined according to the currently active code page. Thus a file with the name specified in Python as '\xa9ibenik.txt' would appear to an Eastern-European language user as Šibenik.txt, but to a Western-European as ©ibenik.txt. As long as users from different code pages never exchanged files, this trick allowed smuggling non-English letters to files names. And while this worked well enough for localization in European countries, it failed at internationalization, which implies exchange and common storage of files from different languages and existence of bilingual and multilingual environments. In addition to that, single-byte code pages failed to accomodate East Asian languages, which required much more than 128 different characters in a single language. The solution chosen for this issue by operating system vendors was allowing the full Unicode repertoire in file names.

Popular operating systems have settled on two strategies for supporting Unicode file names, one taken by Unix systems, and the other by MS Windows. Unix continued to treat file names as bytes, and deployed a scheme for applications to encode Unicode characters into the byte sequence. Windows, on the other hand, switched to natively representing file names in Unicode, and added new Unicode-aware APIs for manipulating them. Old byte-based APIs continued to be available on Windows for backward compatibility, but could not be used to access files other than those with names representable in the currently active code page.

These design differences require consideration on the part of designers of cross-platform software in order to fully support multilingual file names on all relevant platforms.

Unicode encodings

Unicode is a character set designed to support writing all human languages in present use. It currently includes more than 100 thousand characters, each assigned a numeric code called a code point. Characters from ASCII and ISO 8859-1 (Western-European) character sets retained their previous numeric values in Unicode. Thus the code point 65 corresponds to the letter A, and the code point 169 corresponds to the copyright symbol ©. On the other hand, the letter Š has the value 169 in ISO 8859-2, the value 138 in Windows code page 1250, and code point 352 in Unicode.

Unicode strings are sequences of code points. Since computer storage is normally addressed in fixed-size units such as bytes, code point values need to be mapped to such fixed-size code units, or encoded. Mainstream usage has stabilized on a small number of standard encodings.

UTF-8

UTF-8 is an encoding that maps Unicode characters to sequences of 1-4 bytes. ASCII characters are mapped to their ASCII values, so that any ASCII string is also a valid UTF-8 string with the same meaning. Non-ASCII characters are encoded as sequences of up to four bytes.

Compatibility with ASCII makes UTF-8 convenient for introducing Unicode to previously ASCII-only file formats and APIs. Unix internationalization and modern Internet protocols heavily rely on UTF-8.

UTF-16

The UTF-16 encoding maps Unicode characters to 16-bit numbers. Characters with code points that fit in 16 bits are represented by a single 16-bit number, and others are split into pairs of 16-bit numbers, the so-called surrogates.

Windows system APIs use UTF-16 to represent Unicode, and the documentation often refers to UTF-16 strings as “Unicode strings”. Java and DotNET strings also use the UTF-16 encoding.

UTF-32

The UTF-32 encoding maps characters to 32-bit numbers that directly correspond to their code point values. It is the simplest of the standard encodings, and the most memory-intensive one.

System support for Unicode

Windows

Windows file names are natively stored in Unicode. All relevant Win32 calls work with UTF-16 and accept wchar_t * “wide string” arguments, with char * “ansi” versions provided for backward compatibility. Since file names are internally stored as Unicode, only the Unicode APIs are guaranteed to operate on all possible files. The char based APIs are considered legacy and work on a subset of files, namely those whose names can be expressed in the current code page. Windows provides no native support for accessing Unicode file names using UTF-8.

The Win32 API automatically maps C API calls to wide (UTF-16) or single-byte variants according to the value of the UNICODE preprocessor symbol. Functions standardized by C, C++, and POSIX have types specified by the standard and cannot be automatically mapped to Unicode versions. To simplify porting, Windows provides proprietary alternatives, such as the _wfopen() alternative to C fopen(), or the _wstat() alternative to POSIX stat(). Like Win32 byte-oriented functions, the standard functions only work for files whose names can be represented in the current code page. Opening a Japanese-named file on a German-language workstation is simply not possible using standard functions such as fopen() (except by resorting to unreliable workarounds such as 8+3 paths). This is a very important limitation which affects the design of portable applications.

Standard C++ functions, such as std::fstream::open, have overloads for both char * and wchar_t *. Programmers that want their programs to be able to manipulate any file on the file system must make sure to use the wchar_t * overloads. The char * overloads are also limited to opening non-Unicode file names.

Unix

The Unix C library does support the wchar_t type for accessing file contents as Unicode, but not for specifying file names. The operating system kernel treats file names as byte strings, leaving it up to the user environment to interpret them. This interpretation, known as the “file name encoding”, is defined by the locale, itself configured with LC_* environment variables. Modern systems use UTF-8 locales in order to support multilingual use.

For example, when a user wishes to open a file with Unicode characters, such as Šibenik.txt, the application will encode the file name as a UTF-8 byte string, such as "\xc5\xa0ibenik.txt", and pass that string to fopen(). Later, system functions like readdir() will retrieve the same UTF-8 file name, which the application’s file chooser will display to the user as Šibenik.txt. As long as all programs agree on the use of UTF-8, this scheme supports unrestricted use of Unicode characters in file names.

The important consequence of this design is that storing file names as Unicode in the application and encoding them as UTF-8 when passing them to the system will only allow manipulating files whose names are valid UTF-8 strings. To open an arbitrary file on the file system, one must store file names as byte strings. This is exactly the opposite of the situation on Windows, a fact that portable code must take into account.

Python

Beginning with version 2.0, Python optionally supports Unicode strings. However, most libraries work with byte strings natively (often using UTF-8 to support Unicode), and using Unicode strings is slower and leads to problems when Unicode strings interact with ordinary strings.

On Windows, Python internally uses the legacy byte-based APIs when given byte strings and Windows-specific Unicode APIs when given Unicode strings. This means that Unicode files can be manipulated as long as the programmer remembers to create the correct Unicode string. It is not only impossible to open some files using the bytes API, they are misrepresented by functions such as os.listdir::

>>> with open(u'\N{SNOWMAN}.txt', 'w'):
...   pass   # create a file with Unicode name
... 
>>> os.listdir('.')
['?.txt']
>>> open('?.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '?.txt'

Opening the directory in Windows Explorer reveals that Python created the file with the correct name. It is os.listdir and its constraint to return byte strings when given a byte string that creates the problem. os.listdir(u'.') returns the usable [u'\u2603.txt'].

Python 3

Python 3 strings are Unicode by default, so it automatically calls the Unicode versions of Win32 calls and does not exhibit bugs like the listdir bug shown above. On the other hand, Python 3 needs special provisions to map arbitrary Unix file names to Unicode, as described in PEP 383.

File names in applications

Portable programs that want to enable the user to create arbitrary file names must take care how to create and access them. Using portable IO libraries such as gio and Qt resolves many of these problems automatically, but these libraries carry a lot of weight that is unacceptable in many situations. Also, those libraries often don’t interact well with “traditional” C code that accepts file names. In this chapter we present an implementation strategy that enables correct use of Unicode file names with minimal intrusion to the code base.

Since file names are natively bytes on some platforms and Unicode on others, a cross-platform application must choose between these representations. Using Unicode makes programming somewhat easier on platforms with native Unicode APIs, while using UTF-8 bytes has the advantage on platforms with native bytes APIs.

What representation works best depends on the application’s surroundings and the implementation platform. A Python 3 or Java application running on a web server is probably best served by using Unicode consistently and not bothering with Unix non-UTF-8 file names at all. On the other hand, a GTK application, a Python 2 application, or an application needing to interface with C will be off with UTF-8, which guarantees interoperability with the world of bytes, while retaining lossless conversion to Unicode and back.

This guide presents a programming model based on UTF-8 as the file name representation. UTF-8 was chosen for AVL simulation GUIs due to ease of interoperability with various C APIs, including GTK itself. This choice is also shared by the gio library and other modern Unix-based software. Of course, use of UTF-8 is not limited just to file names, it should be used for representation of all user-visible textual data.

Interacting with the file system from Python

Since Python’s built-in functions such as open and os.listdir accept and correctly handle Unicode file names on Windows, the trick is making sure that they are called with correct arguments. This requires two primitives:

  • to_os_pathname— converts a UTF-8 pathname (file or directory name) to OS-native representation, i.e. Unicode when on Windows. The return value should only be used as argument to built-in open(), or to functions that will eventually call it.

  • from_os_pathname — the exact reverse. Given an OS-native representation of pathname, returns a UTF-8-encoded byte string suitable for use in the GUI.

The implementation of both functions is trivial:

  def to_os_pathname(utf8_pathname):
      """Convert UTF-8 pathname to OS-native representation."""
      if os.path.supports_unicode_filenames:
          return unicode(utf8_pathname, 'utf-8')
      else:
          return pathname

  def from_os_pathname(os_pathname):
      """Convert OS-native pathname to UTF-8 representation."""
      if os.path.supports_unicode_filenames:
          return os_pathname.encode('utf-8')
      else:
          return os_pathname

With these in place, the next step is wrapping file name access with calls to to_os_pathname. Likewise, file names obtained from the system, as with a call to os.listdir must be converted back to UTF-8.

def x_open(utf8_pathname, *args, **kwds):
    return open(to_os_pathname(utf8_pathname), *args, **kwds)

def x_stat(utf8_pathname):
    return os.stat(to_os_pathname(utf8_pathname))
...

# The above pattern can be used to wrap other useful functions from
# the os and os.path modules, e.g. os.stat, os.remove, os.mkdir,
# os.makedirs, os.isfile, os.isdir, os.exists, and os.getcwd.

def x_listdir(utf8_pathname):
    return map(from_os_pathname, os.listdir(to_os_pathname(utf8_pathname)))

The function standing out is x_listdir, which is like os.listdir, except it converts file names in both directions: in addition to calling to_os_pathname on the pathname received from the caller, it also calls from_os_pathname on the pathnames provided by the operating system. Taking the example from the previous chapter, x_listdir would correctly return ['\xe2\x98\x83'] (a UTF-8 encoding of the snowman character), which x_open('\xe2\x98\x83') would correctly open.

Any function in the program that accepts a file name must accept — and expect to receive — a UTF-8-encoded file name. Functions that open the file using Python’s open, or those that call third-party functions that do so, have the responsibility to use to_os_pathname to convert the file name to OS-native form.

Legacy path names

to_os_pathname is useful when calling built-in open() or into code that will eventually call built-in open(). However, sometimes C extensions beyond our control will insist on accepting the file name to open the file using the ordinary C fopen() call. Passing an OS-native Unicode file name on Windows serves no purpose here because it will fail on a string check implemented by the Python bindings for the library. And even if it somehow passed the check, the library is still going to call fopen() rather than _wfopen().

A workaround when dealing with such legacy code is possible by retrieving the Windows “short” 8+3 pathnames, which are always all-ASCII. Using the short paths, it is possible to write a to_legacy_pathname function that accepts a UTF-8 pathname and returns a byte string pathname with both Python open() and the C family of functions such as fopen(). Since short pathnames are a legacy feature of the Win32 API and can be disabled on a per-volume basis, to_legacy_pathname should only be used as a last resort, when it is impossible to open the file with other means.

if not os.path.supports_unicode_filenames:
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        return utf8_pathname
else:
    import ctypes, re
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    has_non_ascii = re.compile(r'[^\0-\x7f]').search
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        if not has_non_ascii(utf8_pathname):
            return utf8_pathname
        unicode_pathname = unicode(utf8_pathname, 'utf-8')
        short_length = GetShortPathNameW(unicode_pathname, None, 0)
        if short_length == 0:
            raise ctypes.WinError()
        short_buf = ctypes.create_unicode_buffer(short_length)
        GetShortPathNameW(unicode_pathname, short_buf, short_length)
        short_pathname_unicode = short_buf.value
        return short_pathname_unicode.encode('ascii')

Summary

If this seems like a lot of thought for something as basic as file names with international characters, you are completely right. Doing this shouldn’t be so hard, and this can be considered an argument for moving to Python 3. However, if you are using C extensions and libraries that accept file names, simply switching to Python 3 will not be enough because the libraries and/or their Python bindings will still need to be modified to correctly handle Unicode file names. A future article will describe approaches taken for porting C and C++ code to become, for lack of a better term, Unicode-file-name-correct. Until then, the to_legacy_pathname() hack can come in quite handy.

Handling large sets of photographs and videos

In 2003 my father bought a digital camera for the family. It was an Olympus C-350 Zoom, 3.2 mpix, 3x optical zoom, a 1.8″ LCD display. At that time, at least here in Croatia, having a digital camera was fairly rare. I’m not saying I had it first in my city, but it wasn’t as commonplace as today. This was such a leap from anything that you owned. You could actually take a picture and upload it to the computer. And the image was usually great if the light conditions were optimal, of course. Indoors, and with low lighting the images were terrible.

Šibenik circa 2003/07
Šibenik circa 2003/07 on a good day
Sorry dude, it's 7:59:34 PM on an August the 18th. There's less sunlight than you think.
Sorry dude, it’s 7:59:34 PM on an August the 18th. There’s less sunlight than you think at this time of the year, so better keep the camera perfectly steady for one fifths a second.

This camera wasn’t cheap. It cost a little less than $500. For Croatian standards of the time it was a fair amount of money. Still is actually, but that is the minimum you have to spend to have a decent camera, it was like that then, it’s still like that now.

I was making pictures of the town, taking it on trips, documenting everything. Since I was always a computer enthusiast I was beginning to worry, what if the hard disk failed? I’d lose all of the photographs I had acquired. There are people that seem to underestimate the importance of photos. You take the photos, they’re nice, but they’re not that valuable right at that moment. Looking back 10 years or more, suddenly the pictures become somehow irreplaceable. They’re a direct window into your past, not the blurry vision of the past that most of us have, but something concrete and immutable. I think this especially applies when you have kids, you’ll want the pictures safe and sound, at least for a little while. Everything gets lost in the end, but why be the guy that loses something that could be classified as a family heirloom?

How not to lose the pictures and how to organize them

Here’s a high-level list of what I’ve found to be good practices, to keep it organized and safe:

  • A clear structure of stored photographs/videos. I’ve found that a single root directory with a simple YYYY-MM does the trick. I dislike categorizing pictures with directory names like summer-vacation-2003, party-at-friends-house-whose-name-I-cant-remember or something to that effect. I think that over time, the good times you had get muddled along the way, and you’ll appreciate a simple year-month format to find something or to remember an occasion. It’s like a time machine, let’s see what I was doing in the spring of 2004, and you can find fun pictures along the way.
  • This goes without saying, backups. Buy an external disk, they go cheap, and you can store a lot of photos there. Your disk can die suddenly and without notice, and all your pictures can simply vanish, never to be seen again. Sure, son, I’d love to show you pictures when I was young, but unfortunately, I couldn’t be bothered to have a backup ready and all the pictures are gone.
  • Disaster recovery – imagine your whole building/house burns to the ground. You get nothing but rubble, and although you were meticulously syncrhonizing to your backup every night to an external HDD, everything is gone. Or more realistically, your house gets broken into and they steal your electronics which contain data that is basically irreplaceable. Create a tarball of all your photographs/videos, encrypt it with a GPG key or passphrase, or with a simple SSL encryption and upload it into the cloud of your choice. Even in the event of a burglary/arson with a regular snapshot of about once per two months, you’ll still be able to recover most of the data when you rebuild your house or buy a new computer in the event of a burglary.
  • Print out a yearly compilation of pictures that you like at your local photo lab. Just pick like 40 of the best, with whatever criteria you deem fit. Who knows if the JPEG standard will be readable in 30 years time, but you can always look at a physical picture you can take with you.
Wow, I just called the cops that my house was burglarized. Now it burned down too? If only I had a disaster recovery plan for my valuable photos.
I just called in a burglary at my house. Now it burned down while getting beers from the store? If only I had a disaster recovery plan for my valuable photos on both desktop computer and portable HDD.

Photos

Most digital cameras, be it video or still frames, have pretty lavish defaults with the image quality. This is a very good thing. I like to get a source file that is close as possible as the device has serialized it to a file. Still, if you take a lot of pictures, you’ll quickly notice that it’s piling up. The first thing to do is delete the technically failed ones. Do not delete the pictures where you think that someone is ugly on it, it may end up great in a certain set of circumstances. You never know.

These days even the shittiest cameras boast with huge pixel numbers, like 10, 15 mega pixels or more with a tiny crappy lens and who knows what kind of sensor. Feel free to downsize it to 5-8 mega pixels, with a JPEG quality of 75-80. Quickly you’ll see that now your images consume a lot less space on the HDD, I’m talking about 30% of the original photo, sometimes even less. I spent a lot of time trying to find exactly how the image is degraded. Some slight aberrations can be seen if you go pixel peeping, but screw that, the photos might have sentimental value on the whole, and you’ve saved a lot of hard drive space that you realistically have available. I recommend using the Imagemagick suite for all your resizing needs. Create a directory where you want the recoded images, like lowres:

$ mogrify -path lowres -auto-orient -quality 80 -resize 8640000@ *.jpg

You can set the number of pixels, in this example it’s 8.64Mpix. Choose a resolution and go with it. I generally use 3600×2400 which is 8640000 in pixels. Mogrify is great for this task, it can encode the images in parallel, so if you have a multi-core computer it really shines since the operations involved are very CPU expensive. You can omit the -path switch, and the files will be processed and placed instead of the file, but be careful as this WILL overwrite the original file(s). Don’t test around on your only copy of the file. You can use the generally more safe convert which takes the same argument with a slight difference, it needs the INFILE and OUTFILE argument:

$ convert -auto-orient -quality 80 -resize 8640000@ mypicture.jpg mypicture-output.jpg

or

$ for JPEGS in *.jpg ; do convert -auto-orient -quality 80 -resize 8640000@ $JPEGS $JPEGS-out; done

The problem with this is that you’ll then have a bunch of IMG_xxxx.jpg-out files. This is the longer way around, but once you’re satisfied with the result, delete the original jpeg files and rename it with a program that mass renames or you can use a perl script called ren, my brother and a buddy of his wrote a long time ago and it still works great for a number of circumstances:

$ ren -c -e 's/\-out//'

This will rename all the files that have the -out to empty string, deleting it from the filename essentially. But this is the long way around, I suggest using mogrify. Mogrify had a very very nasty bug. At one point they decided that it would be cool if you have an Nvidia card and the proprietary drivers installed it would use the GPU for all your encoding needs. That sounds great in theory, but I actually had an Nvidia graphics card with the drivers properly installed. How do I know that? Complex 3D video games worked without issues. And guess what else? It didn’t fucking work. It simply hang there, and didn’t do anything, it would never finish a single image. Did I mention that you can’t fallback on the CPU so easily, I mean at all? I googled around, and multiple bugs were filed. I just tried mogrify now when writing this post, seems they have finally fixed it, and I may go back to using it again, instead of unnecessarily complex python scripts that called concurrent converts which number was based on the number of your physical cores.

Video

A nice feature of modern cameras is its ability to record decent video and audio. The cameras mostly use a very good quality preset for the recordings. On my current SLR camera I get 5-6 megabytes per second for a video. Not only that the files are monstrously huge, they also are sometimes in non-standard containers, have weird video and/or audio encodings. You should really convert it to something decent:

$ ffmpeg -i hugefile.mov -c:v libx264 -preset slow -crf 25 -x264opts keyint=123:min-keyint=20 -c:a libmp3lame -q:a 6 output.mkv

This produces a pretty good quality video. I am strongly against rescaleing the video in any way. Use the original resolution, the displays are advancing at a stable pace, you don’t want to unnecessarily scale down the resolution. You can change the quality with -crf from 18-29 are reasonable options, I discussed it in another post in more detail. Also, it decreases the file size by a factor of 15 or more, virtually without perceptible visual loss. As an added bonus you mux it into an open source container with the h264 family of encoders and the venerable mp3 format for audio. That should work on most computer players by default as well as standalone players hooked up to a TV.

I started this post as more of an in-depth technical overview how to store and encode your multimedia and backing it up. But instead I chose to give a high-level overview of what worked for me over the years. Make backups regularly, have a disaster recovery option present if at all possible, and print out some yearly photos. It’s fun to look over the physical pictures, and can be good fun showing it to visiting friends and family. When deciding how much to shrink the files, always keep in mind that you should compress them as much as possible while keeping the subjective perception of the quality as close as possible to the original. What I mean to say, don’t overdo with the quality settings. What matters is how much space is your archive consuming right now, and how are you able to cope with that amount of data.

Data loss is commonplace. Hard drives fail, do not lose 10+ years of photographs because you didn’t have a decent backup. It’s not so hard. Do it now. Don’t lose a part of your personal history, it’s priceless, and cannot be downloaded from the internet again. Always encrypt your stuff before uploading to the ethereal cloud. Maybe you have sensitive pictures that you wouldn’t want anyone else casually looking over just because they happen to be the sysadmin. You wouldn’t make the same kind of privacy breach in other parts of your life, would you?

Match file timestamp with EXIF data

Over the years I’ve collected a lot of pictures, coming close to 20000. Most of these pictures have the exif metadata embedded in the JPEG files. Alas, I was careless with some of the photographs, and when copying over from filesystem to filesystem, creating backups etc., the timestamps got overwritten. So now I had loads of files that had a timestamp of 22nd January 2010 for example. They were most definitely not taken at that date, but rather they were copied then and no preserve timestamp flag was enabled at the time of the cp issued. I googled for a quick solution to my problems, but I could not find anything that would be simple to use, and not clogged up with bullshit. So, I cracked my knuckles and delved into the world of Python:

#!/usr/bin/python2

# You need the exifread module installed
import exifread, time, datetime, os, sys

def collect_data(file):
    tags = exifread.process_file(open(file), 'rb')
    for tag in tags.keys():
        if tag in ('EXIF DateTimeOriginal'):
            return "%s" % (tags[tag])

for file in sys.argv[1:]:
    try:
        phototaken = int(datetime.datetime.strptime(collect_data(file), '%Y:%m:%d %H:%M:%S').strftime("%s"))
        os.utime(file,(phototaken,phototaken))
    except Exception, e:
        sys.stderr.write(str(e))
        sys.stderr.write('Failed on ' + str(file) + '\n')

Basically it takes each file, reads the exif metadata for photo taken and invokes the os.utime() function to set the timestamp to that date. You’ll need the exifread module for Python, this is the simplest one I could find that can do what I needed it to do. I hope someone will find this script useful. You can start it simply with $ exify *.JPG. You can download it here.

Stuffed bell peppers, the Croatian way

Stuffed bell peppers are a staple of cuisines of several southeast-European countries, including Croatia. This recipe, originally published in Croatian, presents how I make them. The translation will hopefully help this lovely Croatian dish reach a wider audience.

In Croatia we typically use the bell peppers of the “babura” variety, but other kinds of bell peppers will do nicely, as long as they are of reasonable size – at least 2 inches in diameter, and 4 inches or more in height.

STUFFED BELL PEPPERS

6 portions
preparation time: about 2 hours, largely unattended

1 onion, chopped
olive oil
salt, pepper
2-3 cloves garlic
1 tbsp paprika
1 pound ground meat, mix of beef and pork
1/2 cup rice
10 bell peppers of medium size
2-3 cups tomato purée (passata di pomodoro)
1 cup wine
water as needed

  1. Put the olive oil in a saucepan over medium heat. When the oil is warm, add onions and cook until soft, about five minutes. When the onions are nearly done, add the garlic, salt, pepper, paprika, and other spices if you like (e.g. ginger, nutmeg, or a dash of cumin). Do not overdo the spices.

  2. While the onions are cooking, wash the bell peppers, cut off the stems, and shake out the seeds. Put the ground meat in a bowl and add the cooked onions. Season with salt and pepper to taste (feel free to try it, a bit of raw meat won’t harm you), add the rice, and mix well.

  3. Stuff the bell peppers with the meat mixture, trying not to pack the meat too tightly. Arrange the peppers in a cooking pot, if possible so that they stand upright holding each other; leave as little room as possible between them. If the peppers do not fit in one layer, cook them in two smaller pots.

  4. Mix the tomato purée and wine and season with salt. Pour the mixture over the peppers in the pot and add water until the peppers are almost fully submerged. If you are using two pots, equally divide the purée and wine between them and then add water. While the peppers are cooking, do not stir them, just occasionally shake the whole pot. Cook for an hour and a half on low heat.

Let the cooked peppers rest for at least an hour. Serve with mashed potatoes and some crusty bread.

Do not throw away the puree in which the peppers were cooking. If some remains uneaten, freeze it and use it as stock for a future dish.

Recording a game video with Linux

I’m sure a lot of people have always thought, wow, I’d like to record a video of this to have it around! On Linux! Well, with it’s incredibly easy! OK, not really so easy, you’ll have to handle a few hurdles along the way, but it’s nothing terrible. As an example I’ll be using prboom which is an engine to run Doom 1/2 with the original WADs you have obtained legally, paid for fair and square etc. It uses 3D hardware acceleration, no jumping, crouching and shit like that. It’s great to see Doom 1/2 in in high resolution, it looks pretty good, and very true to the original, and makes the game more than playable.

Requirements

There is a beautiful program called glc. Basically it hooks to the video & audio of the system and dumps a shitload of fastest compressing png files, that is one per frame. Depending on the resolution you use for capturing and the framerate, expect very hardcore output per second to your HDD, somewhere around 50 megabytes per second for a full hd experience, and that’s with the quicklz compression method for glc-capture.

I won’t go into too much detail how to install glc, or prboom. I’m sure it’s simple for your favorite Linux distribution. It was a simple aurget -S on Arch. Now,let’s head on to actually capturing some gameplay. The syntax is very simple glc-capture [options] [program] [programs' arguments]:

The initial video capturing

$ glc-capture -j -s -f 60 -n -z none prboom-plus -width 1920 -height 1080 -vidmode gl -iwad dosgames/DOOM2/DOOM2.WAD -warp 13 -skill 5

This was the tricky part, I had to play around with the options to get it glitch free inside the game. I’ve recorded a video three years ago with glc, can’t remember using some of these options. -j – force-sdl-alsa-drv, got better performance, but maybe unneeded, play around with it -s – so recording starts right away -f – sets the framerate -n – locks FPS, didn’t need this before, but you get a glitch-free recording -z none – no PNG compression, I’ve had better performances without compression The prboom-plus options should be self-explanatory, I’ve used the 1920×1080 resolution so it’s youtube friendly. The -warp is to warp to level 13, and -skill 5 is for nightmare. The file output is named $PROGRAM-$PID-0.glc by default.

OK, the easy part is done, apart from the tricky part. Now you have a huge-ass .glc file on your hard drive that is completely unplayable by any video player known to man. And when I say huge-ass, I mean huge-ass. A 54 second video comes out to 1.79GB, which is 34MB per second in 720p, and for 1080p I had up to 42MB per second! The default png compression used by glc-capture is quicklz. For 1080p I had some better experiences using -z none so it simply dumps the PNGs into the file and that’s it. As you might figure, this will also increase the resulting file size, but it could be well worth the disk space if you don’t have a fast CPU. You’ll get close to a 100 MB/s for a 1080p stream. Use the default compression if in doubt. Experiment.

What do we do with an unplayable, unusable, unuploadable gigantic glc dump on our hard drive? I strongly suggest you encode it somehow. I used to use mencoder for all my encoding needs. Due to the way it’s maintained, or a lack thereof, I switched to ffmpeg which has an active development and used a lot in the backends of various video tube sites around the internet. OK, let’s go, step by step:

Extract the audio track

$ glc-play prboom-plus-12745-0.glc -a 1 -o 1080p.wav

This line dumps the audio track from the glc file, of course it’s a completely uncompressed wave file. -a 1 is for track #1, and -o is for output, naturally.

Pipe the uncompressed video to ffmpeg and encode to a reasonable file format

$ glc-play prboom-plus-12745-0.glc -o - -y 1 | ffmpeg -i - -i 1080p.wav -c:v libx264 -preset slow -crf 25 -x264opts keyint=123:min-keyint=20 -c:a libmp3lame -q:a 6 doom-final-file.mkv

-o - dumps it to STDOUT, -y 1 is for video track 1. Now we have used the all might unix PIPE. I love pipes. In this case ffmpeg uses two files as input, one is STDIN, that is the hardcore raw video file, no pngs, just raw video. The other input is the audio track we dumped earlier. This could be streamlined with a FIFO, but that’s overcomplicating things. The rest of the ffmpeg options are beyond the scope of this article, but they’re a reasonable default. The final argument of ffmpeg is the output file. The container type is determined by the file extension, so you can use mp4, mkv, or whatever you want. After this, the video is finally playable, uploadable, usable. Congrats, you have just recorded your video the Linux way!

If you do want to customize the final video quality, take a look at the the ffmpeg documentation at what these mean. The only thing of interest is the -preset and the -crf. The crf is the “quality” of the video. I was astounded that 2-pass encoding is a thing of the past, and it’s all about the crf now. It goes from 0 to 51. And only a small part of that integer range is actually usable. I simply cannot relay the beautiful wording from the docs, so I’ll just paste it here:

The range of the quantizer scale is 0-51: where 0 is lossless, 23 is default, and 51 is worst possible. A lower value is a higher quality and a subjectively sane range is 18-28. Consider 18 to be visually lossless or nearly so: it should look the same or nearly the same as the input but it isn’t technically lossless.

Details like these can really brighten a person’s day. 18 is visually lossless (and no doubt uses a billion bits per second), but technically only 0 is lossless. So you have a full range from 0 to 18 that is basically useless. Of course, it goes the other way around. After -crf 29 the quality really goes downhill.

The resulting video can be found here or you can see it on YouTube. Excuse my cheating and my dying so fast, this is for demonstration purposes.

Conclusion

I realize there are probably better ways of accomplishing this, you can google around for better solutions. Glc-capture supposedly works with wine too, with some tweaks. I haven’t really tried it, but feel free to leave a comment if someone had any experience with it. This is a simple way to make a recording, you can edit it later once you encode the file to something normal. Glc also supports recording multiple audio tracks so you could also record your voice with a microphone and mash it all together. Good luck with that. :)

A different view

Some years ago my sister-in-law asked me to write an essay she had for her Croatian homework. I said, I suck at writing essays, but if it’s an interesting topic, why the heck not. After all, I’m sometimes known for my ability to be a smartass which can come in handy. The topic of the essay was “The universe and my place within it”. Wow, now that’s something I can do. I wrote it all down, and her teacher was genuinely impressed with it, so much that she had pinned it in the school lobby as the best essay written. Supposedly the essay is still pinned there, 4 years later, but that’s speculation on her part.

However, we were caught. Our genius plan has been foiled by her teacher. She immediately knew she didn’t actually write a word of it, and had asked her a simple question; what exactly is a galaxy? The sister-in-law, then 15, replied with a blank stare. The teacher made a proposition: “Would you like a passing grade, or write your own?” She chose the barely passing grade. Anyway, looking back at the piece I wrote, it’s kinda cute and decided to share it here, translated to English. Since this is somewhat lost in the translation, the protagonist is the sister-in-law and is set from her point of view after returning home from school.


During a windy and cold, starry winter night while coming home from school I saw the Milky Way in the sky. This wasn’t the first time I had laid my eyes on it, but it was the first time I appreciated the implications of such a sight. It’s a view to the center of our galaxy, in which we are nothing but an insignificant planet in our galactic neighborhood, just one of many. I felt how I was standing on one of those tiny rocks, the gravitational pull of the entire planet pulling me down to the center of the rock that is enormous to us, but a speck of dust compared to other astronomical bodies. The seemingly infinite number of stars only in our galaxy, all with their planets moving of their own accord can leave a person breathless.

You can ask yourself a lot of questions. The first question that pops into one’s mind is “are we alone in the universe?” The answer is, of course we’re not alone in the universe! Out of all the countless planets that are woven throughout the universe, it’s impossible that the Earth is the only one blessed with the prerequisites for life. Where are the little green men, then? No one said they were anywhere close. Life, at least the way we know it could be extremely rare. It’s possible that a planet like Earth is only present in one out of a billion galaxies. Even taking into account such a grim approximation it would mean there are at least 200 to 300 alien civilizations in the visible universe that are asking themselves the very same questions I’m asking while gazing into the beautiful night sky.

The added problem of our alien brethren is the time and the vast distances involved. An alien civilization could have been at arms reach a mere 3 billion years ago, long dead by now with their sun dying a violent death which can now be seen as the remains of a long-gone alien solar system through telescopes and pretty pictures on the internet. The roles could be reversed, in 10 billion years we’ll be the lovely false color pictures. The distances and times involved are so colossal that it begs the question can technology within one system, our universe, ever be outside the bounds of said system so we could conceivably call our galactic neighbors to lunch.

And this thing is totally nearby
The commute is terrible, and this thing is totally nearby

You can’t rule out the possibility of a consciousness and intelligence so alien to us that even with our most intense efforts we can never hope to detect them, so different that it’s not even possible in a sense. They might not be living on a planet somewhere and asking themselves dumb questions like us. Those kind of analogies could be applied here, to Earth. You’re going somewhere, minding your own business, when you encounter a “lower” life-form, a worm or something similar. Can we help it in some way, step on it by accident or with intent? We can even simply not perceive it in any way, shape or form and go about your merry way. Can a worm discern us in a sensible kind of way, inside its own limited perception of the real world, whatever that may be? It could very likely be that we’re the worm in a scenario where a “higher” life-form is asking similar questions about us. Is this what they call a deity?

The Milky Way is abruptly replaced by concrete and shingles of my own home. Oh, I’ve arrived. The level of inspiration gathered by a simple gaze at the center of our galaxy is unbeatable, albeit it almost completely vanishes as soon as you go through the doorstep. Watching other Suns, other worlds, even if it is through your mind is now replaced by other simpler and in a way harder questions and problems. In any case, I can’t wait for my next walk through the cosmos!

Punjene paprike

Punjene paprike su klasično hrvatsko jelo podjednako popularno na sjeveru i jugu. Ova verzija je kako ih ja radim, na više-manje klasičan način, ali uz pokoji suvremeni začin.

PUNJENE PAPRIKE

za 6 porcija
vrijeme pripreme: oko 2 sata, uglavnom bez nadzora

1 glavica luka, sjeckana
maslinovo ulje
sol, papar
2-3 češnja češnjaka
žlica slatke mljevene paprike
1/2 kg miješanog mljevenog mesa
15dkg riže
10-ak paprika srednje veličine
500-750g pasirane rajčice
2dl vina
voda po potrebi

  1. Prekrijte dno tave maslinovim uljem i zažutite luk na srednjoj vatri. Pred kraj dodajte zgnječeni češnjak, sol, papar, mljevenu papriku i po želji druge začine (na primjer, đumbir, kumin, tajlandski curry ili muškatni oraščić). Sa začinima ne pretjerujte da ne prevladaju nad finim okusom paprike.

  2. Dok se luk prži, operite paprike i skinite im poklopce. Mljeveno meso stavite u zdjelu i pomiješajte ga s prženim lukom. Probajte je li dovoljno slano (iako je sirovo, od zalogaja vam neće ništa biti) i po potrebi dosolite. Dodajte rižu i dobro promiješajte.

  3. Napunite paprike mesnom smjesom, pazeći da ih previše ne nabijate. Posložite paprike u lonac, ako je moguće tako da stoje uspravno pridržavajući jedna drugu; neka između paprika bude što manje razmaka. Ako paprike ne stanu u jedan red, bolje ih je kuhati u dva manja lonca nego slagati u dva reda.

  4. Pomiješajte pasirane rajčice s vinom i posolite. Prelijte paprike tekućinom i dolijte vode dok ne dođe blizu vrha paprika. Ako kuhate u dva lonca, ravnomjerno podijelite pasirane rajčice i vino između njih i zatim dolijte vodu. Dok se paprike kuhaju, nemojte ih miješati, samo povremeno protresite lonac. Neka se kuhaju sat i pol na laganoj vatri.

Pustite kuhane paprike da odstoje barem sat vremena i poslužite ih s pireom i kruhom.

Umak koji se ne pojede nemojte baciti, zamrznite ga i upotrijebite kao fini temeljac za neko buduće jelo.