Simple symmetric encryption

Encryption. It’s one of those words that programmers and sysadmins dread. Always the complications, always the overhead. There is an entire science and math behind encryption, and if you think about it more closely, it makes sense that it’s so complicated. Imagine that you are in a room full of people and you need to say something to your wife that you don’t want anyone else to understand, but they’re all listening. “Honey, are we having sex tonight? Please? – C’mon, we had sex two weeks ago, what do you want from me?”, the wife answers. But if the conversation goes like “Ubarl, ner jr univat frk gbavtug? Cyrnfr? – P’zba, jr unq frk gjb jrrxf ntb, jung qb lbh jnag sebz zr?”, it would be much harder to understand. This is a simple ROT13, but chances are people won’t understand it, and chances are you won’t be able to pronounce it anyway. Computer encryption works similarly, except that the two sides have to agree on the encryption keys in plain view of everyone; how that is implemented is beyond the scope of this article. Take a look at this article for a better explanation.
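
The ROT13 exchange above is easy to reproduce. This is a minimal sketch, assuming Python 3 and its standard codecs module; the rot13 helper name is made up for illustration. Applying the function twice returns the original text, which is why ROT13 hides nothing from anyone who knows the trick.

```python
import codecs


def rot13(text):
    """Rotate every ASCII letter by 13 places; other characters stay as they are."""
    return codecs.encode(text, "rot_13")


secret = rot13("Honey, are we having sex tonight? Please?")
print(secret)         # Ubarl, ner jr univat frk gbavtug? Cyrnfr?
print(rot13(secret))  # Honey, are we having sex tonight? Please?
```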

The cloud & you

In a recent post I spoke briefly about encryption and the omnipresent cloud, but didn’t really get into it. The article is entertaining the idea that you keep a monthly snapshot of all your pictures or something else valuable on a cloud provider, like Dropbox or Google Drive. The point is, keeping possibly sensitive data somewhere that any bored sysadmin can casually go over your files is a bad idea. All you have is their pinky swear that they won’t do such a thing. Your account can get hacked, as we all saw is perfectly possible with the recent “The Fappening” incident. It’s a bad personal security breach. The best course of action is to nip this scenario in the bud, and simply encrypt your stuff before sending over to a cloud provider.

How to easily encrypt your files? The easiest method is symmetric encryption with openssl. You could use GPG, have a complete set of private/public keys, etc. This complicates matters considerably, and you’re screwed if you lose your private key. If this is a case where you offload an encrypted tarball somewhere, and you lose your equipment, you had better have that 4096-bit key memorized. What we’ll do instead is use a regular strong password. Remember, we’re not trying to make it as secure as possible, we’re just making it so not every Tom, Dick and Harry from your friendly cloud provider can view your files if they feel like it. This is by far the fastest and easiest way.

Encrypting:

$ tar cf - documents | openssl aes-256-cbc -in /dev/stdin -out documents.tar.ssl
enter aes-256-cbc encryption password:
Verifying - enter aes-256-cbc encryption password:

That’s it! Your files have been encrypted. Feel free to throw in z or j to tar because openssl won’t compress data. Also, openssl salts by default, so you don’t have to worry about that. Upload the tarball and you’re done. Of course, keep the password to at least 8 characters, no dictionary words, birthdays, use special characters, etc.

Decrypting:

$ openssl aes-256-cbc -d -in documents.tar.ssl | tar xf -
enter aes-256-cbc decryption password:

This will decrypt your files. There is one big caveat with this. Say your photographs or important personal projects consume a lot of space, like 20 GB, and consist of a lot of files. Making a single encrypted tarball every month is OK, but uploading a brand new snapshot over your Cable/ADSL line isn’t. I personally use pycrypto, a Python script I wrote to recursively encrypt all the files within the current directory, delete the originals by default, and replace them with .enc files which are encrypted with your password. The timestamps and permissions are preserved, but in the metadata of the files themselves, not contained in the encrypted files. Even so, it’s very rsyncable like this. I have a copy of my photographs on this very server.

Conclusion – if there is any

Why go through all this trouble? They’re just pictures? That part is true, if it’s only pictures, and not an offsite backup of your important work projects that should not be viewable by everyone. It’s more of a principle. And I realize that this way is not cryptographically the best way possible to encrypt your data, but I feel that it’s good enough so it’s not viewable by default. Plaintext sucks. Also, there’s a pretty good chance no one will ever get to see your files, because they don’t do it in general. I’m a sysadmin too, I have access to sensitive data, but I view it as important cargo. I don’t give a flying fuck what’s in it, I really don’t. It’s all just too unimportant for me to actually take a peek. There’s nothing to gain. I have a job to do. I’ve had various jobs throughout my life, from splitting rocks in a quarry, to basic ship maintenance (sanding the chemicals that make it harder for underwater life to latch on to the ship), hauling around cargo, mostly menial jobs. But I’ve always held the same stance. There is nothing to gain from stealing or cheating anything or anyone, you’ll only get a bad rap if you’re caught, and you have to look yourself in the mirror even if you don’t get caught. Not sure how the people that engage in those activities reconcile with their inner self.

Encrypt files recursively with openssl

I wrote this program because I had a great idea to offload encrypted versions of my data, but conserving the full directory structure, keeping the permissions and timestamps. This way you can do incremental encrypted snapshots to an untrusted remote server. Always rsync the directory you wish to encrypt someplace else locally, and then run this script.

$ pycrypto.py -e 

This encrypts everything in the "." directory, recursively, and removes the original files. Pass the -n switch to not remove the files.

$ pycrypto.py -d

This will decrypt everything within the "." directory that has the .enc extension. Please be aware that because of the nature of the encryption used, there is no sanity check. If you enter the wrong password while decrypting, the files will be “decrypted” with the wrong password. You’ll get the files, but they won’t be the files you encrypted in the first place. I’ve considered using a hashing method, but this compromises security and slows down the process considerably, and this was designed to be fast. You can download pycrypto here.

#!/usr/bin/env python2

"""
This script encrypts files in the current (.) directory, including 
hidden files using the AES encryption. The original timestamps are 
preserved and the original files are deleted. A .enc suffix is added 
at the end of each encrypted file. The purpose of the program is to 
encrypt the data, while preserving the original directory structure 
and timestamps so you can safely rsync it to an unsecure location. 
The passphrase can be up to 32 bytes long; shorter passphrases are
padded to 16, 24 or 32 bytes, the key sizes AES accepts.

You will need to have python-crypto installed on your system, most
distributions have it in their repositories.
"""

from Crypto.Cipher import AES
from stat import *

import os, random, struct, optparse, sys, getpass

# Define options
parser = optparse.OptionParser(usage="%prog --encrypt | --decrypt")
parser.add_option('-e', '--encrypt', dest='enc', action="store_true",
       help='Encrypt the entire current directory including files in'
       ' subtrees and hidden files')
parser.add_option('-d', '--decrypt', dest='dec', action="store_true",
       help='Decrypt the entire current directory including files in'
       ' subtrees and hidden files')
parser.add_option('-v', '--verbose', dest='verb', action="store_true",
       help='Verbose mode')
parser.add_option('-n', '--no-remove', dest='delete', action="store_true",
       help='Do not remove input files once encrypted or decrypted')       
options, files = parser.parse_args()

FILES=[]

def pad_password(pwd):
    "Pad the password to lengths 16, 24, or 32, as needed for AES encryption."
    for size in 16, 24, 32:
        if len(pwd) <= size:
            return pwd + 'x' * (size - len(pwd))
    raise ValueError("password must be 32 characters, or shorter")



def encrypt_file(key, in_filename, out_filename, atime, mtime, perm, chunksize=64*1024):
    if options.verb:
        sys.stdout.write("Encrypting %s\n" % in_filename[2:])
    # Take the IV from the OS random source rather than the random module
    iv = os.urandom(16)
    encryptor = AES.new(key, AES.MODE_CBC, iv)
    filesize = os.path.getsize(in_filename)

    with open(in_filename, 'rb') as infile:
        with open(out_filename, 'wb') as outfile:
            outfile.write(struct.pack('<Q', filesize))
            outfile.write(iv)

            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)

                outfile.write(encryptor.encrypt(chunk))
    os.utime(out_filename, (atime, mtime))
    os.chmod(out_filename, perm)
    if options.delete == None:
       os.remove(in_filename)


def decrypt_file(key, in_filename, out_filename, atime, mtime, perm, chunksize=24*1024):
    if options.verb:
        sys.stdout.write("Decrypting %s\n" % in_filename[2:])

    with open(in_filename, 'rb') as infile:
        origsize = struct.unpack('<Q', infile.read(struct.calcsize('Q')))[0]
        iv = infile.read(16)
        decryptor = AES.new(key, AES.MODE_CBC, iv)

        with open(out_filename, 'wb') as outfile:
            while True:
                chunk = infile.read(chunksize)
                if len(chunk) == 0:
                    break
                outfile.write(decryptor.decrypt(chunk))

            outfile.truncate(origsize)
    os.utime(out_filename, (atime, mtime))
    os.chmod(out_filename, perm)
    if options.delete == None:
       os.remove(in_filename)

if options.enc and options.dec:
    sys.stderr.write("Please use -e or -d\n")
    sys.exit(1)

if options.enc == None and options.dec == None:
    sys.stderr.write("Please use -e or -d\n")
    sys.exit(1)

for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        FILES.append(os.path.join(dirname, filename))

if options.enc:
    pwd = pad_password(getpass.getpass())
    for i in FILES:
       if os.path.islink(i):
           continue 
       encrypt_file(pwd, i, i+".enc",
       os.path.getatime(i), os.path.getmtime(i), os.stat(i)[ST_MODE])

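if options.dec:
   pwd = pad_password(getpass.getpass())
   for i in FILES:
       # Only decrypt files produced by --encrypt; skip symlinks and anything else
       if os.path.islink(i) or not i.endswith(".enc"):
           continue
       decrypt_file(pwd, i, i[:-4],
       os.path.getatime(i), os.path.getmtime(i), os.stat(i)[ST_MODE])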
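
The script above turns the passphrase into an AES key by padding it with 'x' characters, and it has no way to tell a wrong passphrase from the right one. As a hedged sketch of one alternative, not what the script does, a key can be derived with PBKDF2 from the standard library, and a short check value can be stored next to the encrypted files. This assumes Python 3; the names derive_key and make_check, the iteration count and the b"pycrypto-check" label are all invented for illustration.

```python
import hashlib
import hmac
import os


def derive_key(password, salt, iterations=200000):
    """Derive a 32-byte AES key from a passphrase with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                               salt, iterations, dklen=32)


def make_check(key):
    """Short tag that can be stored in the clear to detect a wrong passphrase."""
    return hmac.new(key, b"pycrypto-check", hashlib.sha256).digest()[:8]


# At encryption time: pick a random salt, store salt and check with the data.
salt = os.urandom(16)
stored_check = make_check(derive_key("correct horse battery", salt))

# At decryption time: re-derive the key and compare before touching any file.
candidate = derive_key("wrong guess", salt)
print(hmac.compare_digest(stored_check, make_check(candidate)))  # False
```

The derivation is deliberately slow, so the cost is paid once per run rather than once per file, which keeps the per-file work as fast as before.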

International file names in cross-platform programs

I work for a company that builds simulation software with the front-end GUI developed mostly in Python. This document is a slightly modified version of a guide written for the GUI developers to ensure that file names with international characters work across the supported platforms. Note that this document is specifically about file names, not file contents, which is a separate topic.

Introduction

Modern operating systems support use of international characters in file and directory names. Users not only routinely expect being able to name their files in their native language, but also being able to manipulate files created by users of other languages.

Historically, most systems implemented file names with byte strings where the value of each byte was restricted to the ASCII range (0-127). When operating systems started supporting non-English scripts, byte values between 128 and 255 got used for accented characters. Since there are more than 128 such characters in European languages, they were grouped in character encodings or code pages, and the interpretation of a specific byte value was determined according to the currently active code page. Thus a file with the name specified in Python as '\xa9ibenik.txt' would appear to an Eastern-European language user as Šibenik.txt, but to a Western-European as ©ibenik.txt. As long as users from different code pages never exchanged files, this trick allowed smuggling non-English letters into file names. And while this worked well enough for localization in European countries, it failed at internationalization, which implies exchange and common storage of files from different languages and existence of bilingual and multilingual environments. In addition to that, single-byte code pages failed to accommodate East Asian languages, which required much more than 128 different characters in a single language. The solution chosen for this issue by operating system vendors was allowing the full Unicode repertoire in file names.
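
The code page ambiguity described above can be reproduced directly. This is a small sketch, assuming Python 3 (where byte strings and text are separate types) and taking ISO 8859-2 and ISO 8859-1 as stand-ins for the Eastern-European and Western-European code pages.

```python
# The same bytes on disk, as stored by a byte-oriented file system.
raw_name = b"\xa9ibenik.txt"

# What an Eastern-European (ISO 8859-2) user sees: 'Šibenik.txt'
eastern = raw_name.decode("iso8859_2")

# What a Western-European (ISO 8859-1) user sees: '©ibenik.txt'
western = raw_name.decode("iso8859_1")

# Same bytes, different first character depending on the active code page.
print(eastern == western)  # False
```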

Popular operating systems have settled on two strategies for supporting Unicode file names, one taken by Unix systems, and the other by MS Windows. Unix continued to treat file names as bytes, and deployed a scheme for applications to encode Unicode characters into the byte sequence. Windows, on the other hand, switched to natively representing file names in Unicode, and added new Unicode-aware APIs for manipulating them. Old byte-based APIs continued to be available on Windows for backward compatibility, but could not be used to access files other than those with names representable in the currently active code page.

These design differences require consideration on the part of designers of cross-platform software in order to fully support multilingual file names on all relevant platforms.

Unicode encodings

Unicode is a character set designed to support writing all human languages in present use. It currently includes more than 100 thousand characters, each assigned a numeric code called a code point. Characters from ASCII and ISO 8859-1 (Western-European) character sets retained their previous numeric values in Unicode. Thus the code point 65 corresponds to the letter A, and the code point 169 corresponds to the copyright symbol ©. On the other hand, the letter Š has the value 169 in ISO 8859-2, the value 138 in Windows code page 1250, and code point 352 in Unicode.
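
The numbers quoted above can be checked from a Python 3 prompt. This is only an illustrative sketch; the variable names are arbitrary, and it relies on the codecs that ship with the standard library.

```python
# Code points are what ord() returns for a one-character string.
code_points = [ord(ch) for ch in "A\u00a9\u0160"]   # A, ©, Š
print(code_points)  # [65, 169, 352]

# The byte value of Š in two legacy single-byte encodings.
legacy_values = {codec: "\u0160".encode(codec)[0]
                 for codec in ("iso8859_2", "cp1250")}
print(legacy_values)  # {'iso8859_2': 169, 'cp1250': 138}
```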

Unicode strings are sequences of code points. Since computer storage is normally addressed in fixed-size units such as bytes, code point values need to be mapped to such fixed-size code units, or encoded. Mainstream usage has stabilized on a small number of standard encodings.

UTF-8

UTF-8 is an encoding that maps Unicode characters to sequences of 1-4 bytes. ASCII characters are mapped to their ASCII values, so that any ASCII string is also a valid UTF-8 string with the same meaning. Non-ASCII characters are encoded as sequences of two to four bytes.

Compatibility with ASCII makes UTF-8 convenient for introducing Unicode to previously ASCII-only file formats and APIs. Unix internationalization and modern Internet protocols heavily rely on UTF-8.

UTF-16

The UTF-16 encoding maps Unicode characters to 16-bit numbers. Characters with code points that fit in 16 bits are represented by a single 16-bit number, and others are split into pairs of 16-bit numbers, the so-called surrogates.

Windows system APIs use UTF-16 to represent Unicode, and the documentation often refers to UTF-16 strings as “Unicode strings”. Java and .NET strings also use the UTF-16 encoding.

UTF-32

The UTF-32 encoding maps characters to 32-bit numbers that directly correspond to their code point values. It is the simplest of the standard encodings, and the most memory-intensive one.
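
To make the trade-offs between the three encodings concrete, the following sketch counts how many bytes a single character occupies in each. It assumes Python 3; encoded_sizes is a hypothetical helper, and the little-endian codec variants are used only to keep a byte order mark out of the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each standard encoding."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}


samples = [
    ("LATIN CAPITAL LETTER A", "A"),
    ("LATIN CAPITAL LETTER S WITH CARON", "\u0160"),
    ("SNOWMAN", "\u2603"),
    ("GRINNING FACE (outside the 16-bit range)", "\U0001F600"),
]
for name, char in samples:
    print(name, encoded_sizes(char))
```

Only the last sample needs a surrogate pair in UTF-16, so it takes four bytes there, while UTF-32 uses four bytes for every one of them.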

System support for Unicode

Windows

Windows file names are natively stored in Unicode. All relevant Win32 calls work with UTF-16 and accept wchar_t * “wide string” arguments, with char * “ansi” versions provided for backward compatibility. Since file names are internally stored as Unicode, only the Unicode APIs are guaranteed to operate on all possible files. The char based APIs are considered legacy and work on a subset of files, namely those whose names can be expressed in the current code page. Windows provides no native support for accessing Unicode file names using UTF-8.

The Win32 API automatically maps C API calls to wide (UTF-16) or single-byte variants according to the value of the UNICODE preprocessor symbol. Functions standardized by C, C++, and POSIX have types specified by the standard and cannot be automatically mapped to Unicode versions. To simplify porting, Windows provides proprietary alternatives, such as the _wfopen() alternative to C fopen(), or the _wstat() alternative to POSIX stat(). Like Win32 byte-oriented functions, the standard functions only work for files whose names can be represented in the current code page. Opening a Japanese-named file on a German-language workstation is simply not possible using standard functions such as fopen() (except by resorting to unreliable workarounds such as 8+3 paths). This is a very important limitation which affects the design of portable applications.

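Microsoft’s implementation of the C++ standard library provides overloads of functions such as std::fstream::open for both char * and wchar_t *; the wchar_t * overloads are an extension rather than part of standard C++. Programmers that want their programs to be able to manipulate any file on the file system must make sure to use the wchar_t * overloads. The char * overloads are, like fopen(), limited to file names that can be represented in the current code page.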

Unix

The Unix C library does support the wchar_t type for accessing file contents as Unicode, but not for specifying file names. The operating system kernel treats file names as byte strings, leaving it up to the user environment to interpret them. This interpretation, known as the “file name encoding”, is defined by the locale, itself configured with LC_* environment variables. Modern systems use UTF-8 locales in order to support multilingual use.

For example, when a user wishes to open a file with Unicode characters, such as Šibenik.txt, the application will encode the file name as a UTF-8 byte string, such as "\xc5\xa0ibenik.txt", and pass that string to fopen(). Later, system functions like readdir() will retrieve the same UTF-8 file name, which the application’s file chooser will display to the user as Šibenik.txt. As long as all programs agree on the use of UTF-8, this scheme supports unrestricted use of Unicode characters in file names.

The important consequence of this design is that storing file names as Unicode in the application and encoding them as UTF-8 when passing them to the system will only allow manipulating files whose names are valid UTF-8 strings. To open an arbitrary file on the file system, one must store file names as byte strings. This is exactly the opposite of the situation on Windows, a fact that portable code must take into account.

Python

Beginning with version 2.0, Python optionally supports Unicode strings. However, most libraries work with byte strings natively (often using UTF-8 to support Unicode), and using Unicode strings is slower and leads to problems when Unicode strings interact with ordinary strings.

On Windows, Python internally uses the legacy byte-based APIs when given byte strings and Windows-specific Unicode APIs when given Unicode strings. This means that Unicode files can be manipulated as long as the programmer remembers to create the correct Unicode string. Not only is it impossible to open some files using the bytes API, they are also misrepresented by functions such as os.listdir:

>>> with open(u'\N{SNOWMAN}.txt', 'w'):
...   pass   # create a file with Unicode name
... 
>>> os.listdir('.')
['?.txt']
>>> open('?.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '?.txt'

Opening the directory in Windows Explorer reveals that Python created the file with the correct name. It is os.listdir and its constraint to return byte strings when given a byte string that creates the problem. os.listdir(u'.') returns the usable [u'\u2603.txt'].

Python 3

Python 3 strings are Unicode by default, so it automatically calls the Unicode versions of Win32 calls and does not exhibit bugs like the listdir bug shown above. On the other hand, Python 3 needs special provisions to map arbitrary Unix file names to Unicode, as described in PEP 383.
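
The special provision from PEP 383 is the surrogateescape error handler. The sketch below shows the idea on a plain byte string, without touching the file system, assuming Python 3; the example name is the Latin-2 file name used earlier in this document.

```python
# A Unix file name stored in ISO 8859-2; the byte 0xA9 is not valid UTF-8.
raw_name = b"\xa9ibenik.txt"

try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False           # strict decoding refuses the name

# surrogateescape smuggles the undecodable byte through as U+DCA9 ...
as_text = raw_name.decode("utf-8", "surrogateescape")

# ... and encoding with the same handler restores the original bytes exactly.
round_trip = as_text.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(as_text), round_trip == raw_name)
```

The resulting string is not meant for display; it only guarantees that a name read from the system can be handed back to the system unchanged.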

File names in applications

Portable programs that want to enable the user to create arbitrary file names must take care how to create and access them. Using portable IO libraries such as gio and Qt resolves many of these problems automatically, but these libraries carry a lot of weight that is unacceptable in many situations. Also, those libraries often don’t interact well with “traditional” C code that accepts file names. In this chapter we present an implementation strategy that enables correct use of Unicode file names with minimal intrusion to the code base.

Since file names are natively bytes on some platforms and Unicode on others, a cross-platform application must choose between these representations. Using Unicode makes programming somewhat easier on platforms with native Unicode APIs, while using UTF-8 bytes has the advantage on platforms with native bytes APIs.

What representation works best depends on the application’s surroundings and the implementation platform. A Python 3 or Java application running on a web server is probably best served by using Unicode consistently and not bothering with Unix non-UTF-8 file names at all. On the other hand, a GTK application, a Python 2 application, or an application needing to interface with C will be better off with UTF-8, which guarantees interoperability with the world of bytes, while retaining lossless conversion to Unicode and back.

This guide presents a programming model based on UTF-8 as the file name representation. UTF-8 was chosen for AVL simulation GUIs due to ease of interoperability with various C APIs, including GTK itself. This choice is also shared by the gio library and other modern Unix-based software. Of course, use of UTF-8 is not limited just to file names, it should be used for representation of all user-visible textual data.

Interacting with the file system from Python

Since Python’s built-in functions such as open and os.listdir accept and correctly handle Unicode file names on Windows, the trick is making sure that they are called with correct arguments. This requires two primitives:

  • to_os_pathname— converts a UTF-8 pathname (file or directory name) to OS-native representation, i.e. Unicode when on Windows. The return value should only be used as argument to built-in open(), or to functions that will eventually call it.

  • from_os_pathname — the exact reverse. Given an OS-native representation of pathname, returns a UTF-8-encoded byte string suitable for use in the GUI.

The implementation of both functions is trivial:

  def to_os_pathname(utf8_pathname):
      """Convert UTF-8 pathname to OS-native representation."""
      if os.path.supports_unicode_filenames:
          return unicode(utf8_pathname, 'utf-8')
      else:
          return utf8_pathname

  def from_os_pathname(os_pathname):
      """Convert OS-native pathname to UTF-8 representation."""
      if os.path.supports_unicode_filenames:
          return os_pathname.encode('utf-8')
      else:
          return os_pathname

With these in place, the next step is wrapping file name access with calls to to_os_pathname. Likewise, file names obtained from the system, as with a call to os.listdir must be converted back to UTF-8.

def x_open(utf8_pathname, *args, **kwds):
    return open(to_os_pathname(utf8_pathname), *args, **kwds)

def x_stat(utf8_pathname):
    return os.stat(to_os_pathname(utf8_pathname))
...

# The above pattern can be used to wrap other useful functions from
# the os and os.path modules, e.g. os.stat, os.remove, os.mkdir,
# os.makedirs, os.path.isfile, os.path.isdir, os.path.exists, and os.getcwd.

def x_listdir(utf8_pathname):
    return map(from_os_pathname, os.listdir(to_os_pathname(utf8_pathname)))

The function that stands out is x_listdir, which is like os.listdir, except it converts file names in both directions: in addition to calling to_os_pathname on the pathname received from the caller, it also calls from_os_pathname on the pathnames provided by the operating system. Taking the example from the previous chapter, x_listdir would correctly return ['\xe2\x98\x83.txt'] (the snowman character encoded as UTF-8, followed by .txt), which x_open('\xe2\x98\x83.txt') would correctly open.

Any function in the program that accepts a file name must accept — and expect to receive — a UTF-8-encoded file name. Functions that open the file using Python’s open, or those that call third-party functions that do so, have the responsibility to use to_os_pathname to convert the file name to OS-native form.

Legacy path names

to_os_pathname is useful when calling built-in open() or into code that will eventually call built-in open(). However, sometimes C extensions beyond our control will insist on accepting the file name to open the file using the ordinary C fopen() call. Passing an OS-native Unicode file name on Windows serves no purpose here because it will fail on a string check implemented by the Python bindings for the library. And even if it somehow passed the check, the library is still going to call fopen() rather than _wfopen().

A workaround when dealing with such legacy code is possible by retrieving the Windows “short” 8+3 pathnames, which are always all-ASCII. Using the short paths, it is possible to write a to_legacy_pathname function that accepts a UTF-8 pathname and returns a byte string pathname usable with both Python open() and the C family of functions such as fopen(). Since short pathnames are a legacy feature of the Win32 API and can be disabled on a per-volume basis, to_legacy_pathname should only be used as a last resort, when it is impossible to open the file by other means.

if not os.path.supports_unicode_filenames:
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        return utf8_pathname
else:
    import ctypes, re
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    has_non_ascii = re.compile(r'[^\0-\x7f]').search
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        if not has_non_ascii(utf8_pathname):
            return utf8_pathname
        unicode_pathname = unicode(utf8_pathname, 'utf-8')
        short_length = GetShortPathNameW(unicode_pathname, None, 0)
        if short_length == 0:
            raise ctypes.WinError()
        short_buf = ctypes.create_unicode_buffer(short_length)
        GetShortPathNameW(unicode_pathname, short_buf, short_length)
        short_pathname_unicode = short_buf.value
        return short_pathname_unicode.encode('ascii')

Summary

If this seems like a lot of thought for something as basic as file names with international characters, you are completely right. Doing this shouldn’t be so hard, and this can be considered an argument for moving to Python 3. However, if you are using C extensions and libraries that accept file names, simply switching to Python 3 will not be enough because the libraries and/or their Python bindings will still need to be modified to correctly handle Unicode file names. A future article will describe approaches taken for porting C and C++ code to become, for lack of a better term, Unicode-file-name-correct. Until then, the to_legacy_pathname() hack can come in quite handy.

Handling large sets of photographs and videos

In 2003 my father bought a digital camera for the family. It was an Olympus C-350 Zoom, 3.2 mpix, 3x optical zoom, a 1.8″ LCD display. At that time, at least here in Croatia, having a digital camera was fairly rare. I’m not saying I had it first in my city, but it wasn’t as commonplace as today. This was such a leap from anything that you owned. You could actually take a picture and upload it to the computer. And the image was usually great if the light conditions were optimal, of course. Indoors, and with low lighting the images were terrible.

Šibenik circa 2003/07 on a good day
Sorry dude, it’s 7:59:34 PM on an August the 18th. There’s less sunlight than you think at this time of the year, so better keep the camera perfectly steady for one fifth of a second.

This camera wasn’t cheap. It cost a little less than $500. For Croatian standards of the time it was a fair amount of money. Still is actually, but that is the minimum you have to spend to have a decent camera, it was like that then, it’s still like that now.

I was making pictures of the town, taking it on trips, documenting everything. Since I was always a computer enthusiast I was beginning to worry, what if the hard disk failed? I’d lose all of the photographs I had acquired. There are people that seem to underestimate the importance of photos. You take the photos, they’re nice, but they’re not that valuable right at that moment. Looking back 10 years or more, suddenly the pictures become somehow irreplaceable. They’re a direct window into your past, not the blurry vision of the past that most of us have, but something concrete and immutable. I think this especially applies when you have kids, you’ll want the pictures safe and sound, at least for a little while. Everything gets lost in the end, but why be the guy that loses something that could be classified as a family heirloom?

How not to lose the pictures and how to organize them

Here’s a high-level list of what I’ve found to be good practices, to keep it organized and safe:

  • A clear structure of stored photographs/videos. I’ve found that a single root directory with a simple YYYY-MM does the trick. I dislike categorizing pictures with directory names like summer-vacation-2003, party-at-friends-house-whose-name-I-cant-remember or something to that effect. I think that over time, the good times you had get muddled along the way, and you’ll appreciate a simple year-month format to find something or to remember an occasion. It’s like a time machine, let’s see what I was doing in the spring of 2004, and you can find fun pictures along the way.
  • This goes without saying, backups. Buy an external disk, they go cheap, and you can store a lot of photos there. Your disk can die suddenly and without notice, and all your pictures can simply vanish, never to be seen again. Sure, son, I’d love to show you pictures when I was young, but unfortunately, I couldn’t be bothered to have a backup ready and all the pictures are gone.
  • Disaster recovery – imagine your whole building/house burns to the ground. You get nothing but rubble, and although you were meticulously synchronizing to an external HDD every night, everything is gone. Or more realistically, your house gets broken into and they steal your electronics, which contain data that is basically irreplaceable. Create a tarball of all your photographs/videos, encrypt it with a GPG key or passphrase, or with a simple openssl encryption, and upload it to the cloud of your choice. Even in the event of a burglary/arson, with a regular snapshot about once every two months you’ll still be able to recover most of the data when you rebuild your house or buy a new computer.
  • Print out a yearly compilation of pictures that you like at your local photo lab. Just pick like 40 of the best, with whatever criteria you deem fit. Who knows if the JPEG standard will be readable in 30 years time, but you can always look at a physical picture you can take with you.
I just called in a burglary at my house. Now it burned down while getting beers from the store? If only I had a disaster recovery plan for my valuable photos on both desktop computer and portable HDD.
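
The year-month layout from the list above is simple enough to automate. The following is a rough sketch rather than a finished tool, assuming Python 3. It uses each file's modification time as the date the picture was taken, which is an assumption (the EXIF date would be more reliable but needs a third-party library), and the names month_directory and file_into_months are made up for this example. Files with the same name in the same month would overwrite each other.

```python
import os
import shutil
from datetime import datetime


def month_directory(root, path):
    """Return root/YYYY-MM based on the file's modification time (local time)."""
    taken = datetime.fromtimestamp(os.path.getmtime(path))
    return os.path.join(root, taken.strftime("%Y-%m"))


def file_into_months(source_dir, root):
    """Copy every regular file in source_dir into root/YYYY-MM.

    copy2 keeps the timestamps, so the copies sort the same way later.
    Returns the list of destination paths.
    """
    copied = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        target_dir = month_directory(root, path)
        os.makedirs(target_dir, exist_ok=True)
        copied.append(shutil.copy2(path, os.path.join(target_dir, name)))
    return copied
```

Calling file_into_months("incoming", "photos"), with both directory names being placeholders, would leave the originals in place and build photos/2003-07, photos/2003-08 and so on.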

Photos

Most digital cameras, be it video or still frames, have pretty lavish defaults for image quality. This is a very good thing. I like to get a source file that is as close as possible to what the device serialized. Still, if you take a lot of pictures, you’ll quickly notice that it’s piling up. The first thing to do is delete the technically failed ones. Do not delete the pictures where you think someone looks ugly, they may end up great in a certain set of circumstances. You never know.

These days even the shittiest cameras boast huge pixel counts, like 10, 15 megapixels or more, with a tiny crappy lens and who knows what kind of sensor. Feel free to downsize the images to 5-8 megapixels, with a JPEG quality of 75-80. You’ll quickly see that your images now consume a lot less space on the HDD; I’m talking about 30% of the original file size, sometimes even less. I spent a lot of time trying to find out exactly how the image is degraded. Some slight aberrations can be seen if you go pixel peeping, but screw that, the photos have sentimental value as a whole, and you’ve saved a lot of hard drive space, which is never unlimited. I recommend using the ImageMagick suite for all your resizing needs. Create a directory where you want the recoded images, like lowres:

$ mogrify -path lowres -auto-orient -quality 80 -resize 8640000@ *.jpg

You can set the number of pixels; in this example it’s 8.64 Mpix. Choose a resolution and go with it. I generally use 3600×2400, which is 8640000 pixels. Mogrify is great for this task: it can encode the images in parallel, so it really shines on a multi-core computer, since the operations involved are very CPU-expensive. You can omit the -path switch, and the processed files will be written in place of the originals, but be careful, as this WILL overwrite the original file(s). Don’t experiment on your only copy of a file. You can use the generally safer convert, which takes the same arguments with a slight difference: it needs the INFILE and OUTFILE arguments:
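
To get a feel for what a pixel-count target like 8640000@ means for a given photo, here is a rough Python sketch of the arithmetic: scale both sides by the same factor so the area lands on the target and the aspect ratio stays put. The function name fit_to_area is made up, the rounding may differ by a pixel from what ImageMagick actually produces, and leaving smaller images untouched is a choice of this sketch, not necessarily what mogrify does.

```python
import math


def fit_to_area(width, height, max_pixels=8_640_000):
    """Return (w, h) scaled to roughly max_pixels, keeping the aspect ratio.

    fit_to_area is a hypothetical helper that only illustrates the math.
    """
    area = width * height
    if area <= max_pixels:
        # Assumption of this sketch: never enlarge a smaller image
        return width, height
    # Area scales with the square of the side length, hence the square root
    scale = math.sqrt(max_pixels / area)
    return round(width * scale), round(height * scale)


print(fit_to_area(6000, 4000))  # a 24 Mpix 3:2 photo comes out as (3600, 2400)
```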

$ convert -auto-orient -quality 80 -resize 8640000@ mypicture.jpg mypicture-output.jpg

or

$ for JPEGS in *.jpg ; do convert -auto-orient -quality 80 -resize 8640000@ "$JPEGS" "$JPEGS-out"; done

The problem with this is that you’ll then have a bunch of IMG_xxxx.jpg-out files. Once you’re satisfied with the result, delete the original JPEG files and rename the new ones with a mass-renaming program, or use a perl script called ren, which my brother and a buddy of his wrote a long time ago and which still works great in a number of circumstances:

$ ren -c -e 's/\-out//'

This replaces -out with an empty string in every filename that contains it, essentially deleting it from the filename. But this is the long way around; I suggest using mogrify. Mogrify had a very, very nasty bug. At one point they decided that it would be cool if, when you have an Nvidia card and the proprietary drivers installed, it used the GPU for all your encoding needs. That sounds great in theory, and I actually had an Nvidia graphics card with the drivers properly installed. How do I know that? Complex 3D video games worked without issues. And guess what else? It didn’t fucking work. It simply hung there and didn’t do anything; it would never finish a single image. Did I mention that you can’t fall back on the CPU so easily, I mean at all? I googled around, and multiple bugs had been filed. I just tried mogrify now while writing this post, and it seems they have finally fixed it, so I may go back to using it again instead of unnecessarily complex Python scripts that launched concurrent converts, as many at a time as there were physical cores.
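
For the curious, the "concurrent converts" workaround does not have to be complicated. The following is a minimal sketch of the idea, not the script mentioned above: the names build_convert_cmd and recode_all are made up, it assumes convert is on your PATH, and it uses os.cpu_count(), which reports logical CPUs rather than physical cores.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_convert_cmd(src, dest_dir, pixels=8_640_000, quality=80):
    """Build the same convert call as above, writing into dest_dir."""
    dest = Path(dest_dir) / Path(src).name
    return ["convert", "-auto-orient", "-quality", str(quality),
            "-resize", f"{pixels}@", str(src), str(dest)]


def recode_all(files, dest_dir, workers=None, run=subprocess.run):
    """Run one convert per file, at most `workers` at a time."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # Assumption: logical CPU count is a good enough stand-in for core count
    workers = workers or os.cpu_count() or 1
    cmds = [build_convert_cmd(f, dest_dir) for f in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads suffice here: the real work happens in the convert processes
        return list(pool.map(lambda cmd: run(cmd, check=False), cmds))
```

Calling recode_all(sorted(Path(".").glob("*.jpg")), "lowres") would fill lowres with the recoded copies and leave the originals alone. The run parameter exists only so the function can be tried out without ImageMagick installed.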

Video

A nice feature of modern cameras is their ability to record decent video and audio. The cameras mostly use a very good quality preset for the recordings. On my current SLR camera I get 5-6 megabytes per second of video. Not only are the files monstrously huge, they are also sometimes in non-standard containers and have weird video and/or audio encodings. You should really convert them to something decent:

$ ffmpeg -i hugefile.mov -c:v libx264 -preset slow -crf 25 -x264opts keyint=123:min-keyint=20 -c:a libmp3lame -q:a 6 output.mkv

This produces a pretty good quality video. I am strongly against rescaling the video in any way. Use the original resolution; displays are advancing at a steady pace, and you don’t want to scale down the resolution unnecessarily. You can change the quality with -crf; values from 18 to 29 are reasonable options, and I discussed it in more detail in another post. The conversion also decreases the file size by a factor of 15 or more, virtually without perceptible visual loss. As an added bonus, you mux H.264 video and the venerable MP3 format for audio into an open container, Matroska. That should work on most computer players by default, as well as on standalone players hooked up to a TV.
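
If you have a whole directory of clips, typing that command out for each one gets old fast. Here is a small sketch that builds the same ffmpeg call per file and only lets you vary the CRF within the range suggested above. The function name build_ffmpeg_cmd and the choice to name the output after the input with an .mkv suffix are assumptions of this sketch.

```python
from pathlib import Path


def build_ffmpeg_cmd(src, crf=25):
    """Build the ffmpeg call from above for one input file."""
    if not 18 <= crf <= 29:
        # Range taken from the text above, not a limit imposed by x264
        raise ValueError("crf outside the 18-29 range suggested in this post")
    out = Path(src).with_suffix(".mkv")
    if out == Path(src):
        raise ValueError("input is already an .mkv, refusing to overwrite it")
    return ["ffmpeg", "-i", str(src),
            "-c:v", "libx264", "-preset", "slow", "-crf", str(crf),
            "-x264opts", "keyint=123:min-keyint=20",
            "-c:a", "libmp3lame", "-q:a", "6", str(out)]
```

The returned list can be handed straight to subprocess.run(), one clip at a time, since x264 already keeps all your cores busy on its own.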

I started this post as more of an in-depth technical overview of how to store, encode and back up your multimedia. But instead I chose to give a high-level overview of what has worked for me over the years. Make backups regularly, have a disaster recovery option in place if at all possible, and print out some photos every year. It’s fun to look over the physical pictures, and it can be good fun showing them to visiting friends and family. When deciding how much to shrink the files, always keep in mind that you should compress them as much as possible while keeping the perceived quality as close as possible to the original. What I mean to say is, don’t go overboard with the quality settings. What matters is how much space your archive is consuming right now, and how well you are able to cope with that amount of data.

Data loss is commonplace. Hard drives fail. Do not lose 10+ years of photographs because you didn’t have a decent backup. It’s not so hard. Do it now. Don’t lose a part of your personal history; it’s priceless, and it cannot be downloaded from the internet again. Always encrypt your stuff before uploading it to the ethereal cloud. Maybe you have sensitive pictures that you wouldn’t want anyone else casually looking over just because they happen to be the sysadmin. You wouldn’t allow the same kind of privacy breach in other parts of your life, would you?

Match file timestamp with EXIF data

Over the years I’ve collected a lot of pictures, coming close to 20000. Most of these pictures have EXIF metadata embedded in the JPEG files. Alas, I was careless with some of the photographs, and when copying them from filesystem to filesystem, creating backups etc., the timestamps got overwritten. So now I had loads of files that had a timestamp of 22nd January 2010, for example. They were most definitely not taken on that date; rather, they were copied then, and cp was run without the flag that preserves timestamps. I googled for a quick solution to my problem, but I could not find anything that would be simple to use and not clogged up with bullshit. So, I cracked my knuckles and delved into the world of Python:

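#!/usr/bin/env python3

# You need the exifread module installed (pip install exifread)
import datetime
import os
import sys

import exifread


def collect_data(path):
    # Open in binary mode; exifread parses the raw bytes of the JPEG
    with open(path, 'rb') as fh:
        tags = exifread.process_file(fh, details=False)
    tag = tags.get('EXIF DateTimeOriginal')
    return str(tag) if tag is not None else None


for path in sys.argv[1:]:
    try:
        taken = collect_data(path)
        if taken is None:
            raise ValueError('no EXIF DateTimeOriginal tag')
        # EXIF has no timezone here, so the value is treated as local time
        phototaken = datetime.datetime.strptime(taken, '%Y:%m:%d %H:%M:%S').timestamp()
        # Set both access and modification time to the moment the photo was taken
        os.utime(path, (phototaken, phototaken))
    except Exception as e:
        sys.stderr.write('Failed on %s: %s\n' % (path, e))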

Basically, it takes each file, reads the EXIF DateTimeOriginal tag (the time the photo was taken) and calls os.utime() to set the file’s timestamps to that date. You’ll need the exifread module for Python; it is the simplest one I could find that does what I needed. I hope someone will find this script useful. You can start it simply with $ exify *.JPG. You can download it here.
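
Better still is not to lose the timestamps in the first place. On the command line that means cp -p; from Python, shutil.copy2() does roughly the same job, whereas shutil.copy() leaves the copy with a fresh modification time. A minimal sketch, with copy_keeping_times being a made-up wrapper name:

```python
import os
import shutil


def copy_keeping_times(src, dst):
    """Copy src to dst and keep the modification time, much like cp -p."""
    # copy2 copies the data first and then the file metadata, mtime included
    shutil.copy2(src, dst)
    return os.stat(dst).st_mtime
```

The returned value is the modification time of the copy, which should match the original to within whatever precision the target filesystem stores.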