All posts by Hrvoje

About the “ineffectiveness of torture”

A friend of mine posted a commentary about the ongoing torture debate. He was appalled about people arguing that torture doesn’t work. “Why stop there,” he asks. “Does rape work? Let’s have a debate! Is genocide effective? Get some pundits to discuss pros and cons! Does sexual abuse of children yield results? Get some experts on screen!” This line of inquiry prompted me to consider whether the ineffectiveness of torture is ever an acceptable argument against it.

To consider the issue on a broader level, one can imagine “pundits” of other times hotly debating the effectiveness of slavery, systemic censorship, forced prison labor, or gladiator battles.

The debate over torture has revealed the ugly fact that torture is widely supported by Americans, and that the support has risen in the last several years. I see it as a result of a careful propaganda campaign that made skillful use of the naive portrayals of torture popular in fiction, and especially in pulp. Fiction tends to show torture as the only measure available to defeat infinitely evil opponents. This is so widespread that the TV Tropes site documents a number of torture-related tropes, such as the “ticking time bomb”. When Americans support torture, they envision this scenario, a fact the agencies promoting their own use of torture happily embraced. If one’s thinking is confined to the ticking time bomb scenario, their moral judgment (and, one might add, decency and good taste) is suspended; they ask themselves nonsensical questions with leading answers, such as “would you torture one person to save a hundred?” (or a thousand, a million, etc.) The fact that this scenario is an unrealistic fictional invention fueled by non-fictional propaganda machinery never enters their mind.

Given the way Americans think about torture, it is probably simply easier, and more efficient in the short term, to point out the ineffectiveness of torture than to debunk the ticking-time-bomb image. But in the long run, I would argue that relying on the ineffectiveness argument is an extremely bad idea. Imagine future interrogators and their teams of psychologists and rogue doctors perfecting techniques to use torture effectively and starting to extract reliable information from tortured enemies and suspects. An argument against torture hinging on its ineffectiveness would immediately fall apart.

The dilemma of whether to use scientific data to back up what is essentially an ethical choice is present elsewhere. For example, the lack of measurable differences between races has been cited as an argument against racism, sometimes extended to a complete dismissal of any notion of race as a “social construct”, like ethnicity. But what if geneticists do discover measurable and important differences between races? Basing what is essentially an ethical judgment on scientific data positions it on thin ice because all scientific views can (and must be allowed to) change. This is why I am against drawing the ineffectiveness argument into the discussion against torture, even if it appears useful in the short term.

International file names in cross-platform programs

I work for a company that builds simulation software with the front-end GUI developed mostly in Python. This document is a slightly modified version of a guide written for the GUI developers to ensure that file names with international characters work across the supported platforms. Note that this document is specifically about file names, not file contents, which is a separate topic.

Introduction

Modern operating systems support the use of international characters in file and directory names. Users routinely expect not only to be able to name their files in their native language, but also to manipulate files created by users of other languages.

Historically, most systems implemented file names with byte strings where the value of each byte was restricted to the ASCII range (0-127). When operating systems started supporting non-English scripts, byte values between 128 and 255 got used for accented characters. Since there are more than 128 such characters in European languages, they were grouped in character encodings or code pages, and the interpretation of a specific byte value was determined according to the currently active code page. Thus a file with the name specified in Python as '\xa9ibenik.txt' would appear to an Eastern-European language user as Šibenik.txt, but to a Western-European as ©ibenik.txt. As long as users from different code pages never exchanged files, this trick allowed smuggling non-English letters into file names. And while this worked well enough for localization in European countries, it failed at internationalization, which implies exchange and common storage of files from different languages and the existence of bilingual and multilingual environments. In addition to that, single-byte code pages failed to accommodate East Asian languages, which required many more than 128 different characters in a single language. The solution chosen for this issue by operating system vendors was allowing the full Unicode repertoire in file names.
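The code page effect described above is easy to reproduce. The following is a minimal Python 3 sketch; it assumes that ISO 8859-2 and ISO 8859-1 stand in for the Eastern-European and Western-European code pages, and the variable names are only illustrative.

```python
# The same byte string, interpreted under two different code pages.
raw_name = b'\xa9ibenik.txt'

eastern = raw_name.decode('iso8859_2')   # read as ISO 8859-2: starts with Š
western = raw_name.decode('iso8859_1')   # read as ISO 8859-1: starts with ©

# The bytes on disk never changed; only their interpretation did.
same_bytes = eastern.encode('iso8859_2') == western.encode('iso8859_1')
```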

Popular operating systems have settled on two strategies for supporting Unicode file names, one taken by Unix systems, and the other by MS Windows. Unix continued to treat file names as bytes, and deployed a scheme for applications to encode Unicode characters into the byte sequence. Windows, on the other hand, switched to natively representing file names in Unicode, and added new Unicode-aware APIs for manipulating them. Old byte-based APIs continued to be available on Windows for backward compatibility, but could not be used to access files other than those with names representable in the currently active code page.

These design differences require consideration on the part of designers of cross-platform software in order to fully support multilingual file names on all relevant platforms.

Unicode encodings

Unicode is a character set designed to support writing all human languages in present use. It currently includes more than 100 thousand characters, each assigned a numeric code called a code point. Characters from ASCII and ISO 8859-1 (Western-European) character sets retained their previous numeric values in Unicode. Thus the code point 65 corresponds to the letter A, and the code point 169 corresponds to the copyright symbol ©. On the other hand, the letter Š has the value 169 in ISO 8859-2, the value 138 in Windows code page 1250, and code point 352 in Unicode.
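The numeric values quoted above can be checked directly. This is a small Python 3 sketch using only built-in codecs; the variable names are made up for the illustration.

```python
# Code points are plain integers; ord() returns the code point of a character.
code_points = {ch: ord(ch) for ch in 'A©Š'}

# The same letter has a different byte value in each legacy code page.
s_in_iso8859_2 = 'Š'.encode('iso8859_2')[0]
s_in_cp1250 = 'Š'.encode('cp1250')[0]
```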

Unicode strings are sequences of code points. Since computer storage is normally addressed in fixed-size units such as bytes, code point values need to be mapped to such fixed-size code units, or encoded. Mainstream usage has stabilized on a small number of standard encodings.

UTF-8

UTF-8 is an encoding that maps Unicode characters to sequences of 1-4 bytes. ASCII characters are mapped to their ASCII values, so that any ASCII string is also a valid UTF-8 string with the same meaning. Non-ASCII characters are encoded as sequences of up to four bytes.

Compatibility with ASCII makes UTF-8 convenient for introducing Unicode to previously ASCII-only file formats and APIs. Unix internationalization and modern Internet protocols heavily rely on UTF-8.

UTF-16

The UTF-16 encoding maps Unicode characters to 16-bit numbers. Characters with code points that fit in 16 bits are represented by a single 16-bit number, and others are split into pairs of 16-bit numbers, the so-called surrogates.

Windows system APIs use UTF-16 to represent Unicode, and the documentation often refers to UTF-16 strings as “Unicode strings”. Java and .NET strings also use the UTF-16 encoding.

UTF-32

The UTF-32 encoding maps characters to 32-bit numbers that directly correspond to their code point values. It is the simplest of the standard encodings, and the most memory-intensive one.
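To compare the three encodings side by side, the following Python 3 sketch counts the bytes each one needs for a few sample characters. The helper name and the choice of samples are assumptions made for the illustration.

```python
# The "-le" codec variants are used so that no byte order mark is added.
ENCODINGS = ('utf-8', 'utf-16-le', 'utf-32-le')

def encoded_sizes(ch):
    """Return the (UTF-8, UTF-16, UTF-32) byte counts for one character."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)

# ASCII letter, accented Latin letter, BMP symbol, character outside the BMP.
samples = ['A', '\u0160', '\u2603', '\U0001F600']
sizes = [encoded_sizes(ch) for ch in samples]
# sizes == [(1, 2, 4), (2, 2, 4), (3, 2, 4), (4, 4, 4)]
```

The last sample is the only one that needs a surrogate pair in UTF-16, which is why its UTF-16 size doubles to four bytes.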

System support for Unicode

Windows

Windows file names are natively stored in Unicode. All relevant Win32 calls work with UTF-16 and accept wchar_t * “wide string” arguments, with char * “ansi” versions provided for backward compatibility. Since file names are internally stored as Unicode, only the Unicode APIs are guaranteed to operate on all possible files. The char based APIs are considered legacy and work on a subset of files, namely those whose names can be expressed in the current code page. Windows provides no native support for accessing Unicode file names using UTF-8.

The Win32 API automatically maps C API calls to wide (UTF-16) or single-byte variants according to the value of the UNICODE preprocessor symbol. Functions standardized by C, C++, and POSIX have types specified by the standard and cannot be automatically mapped to Unicode versions. To simplify porting, Windows provides proprietary alternatives, such as the _wfopen() alternative to C fopen(), or the _wstat() alternative to POSIX stat(). Like Win32 byte-oriented functions, the standard functions only work for files whose names can be represented in the current code page. Opening a Japanese-named file on a German-language workstation is simply not possible using standard functions such as fopen() (except by resorting to unreliable workarounds such as 8+3 paths). This is a very important limitation which affects the design of portable applications.

On Windows, the C++ standard library shipped with Microsoft's compiler gives functions such as std::fstream::open overloads for both char * and wchar_t *; the wchar_t * overloads are an extension rather than part of standard C++. Programmers who want their programs to be able to manipulate any file on the file system must make sure to use the wchar_t * overloads, because the char * overloads are limited to file names representable in the current code page.

Unix

The Unix C library does support the wchar_t type for accessing file contents as Unicode, but not for specifying file names. The operating system kernel treats file names as byte strings, leaving it up to the user environment to interpret them. This interpretation, known as the “file name encoding”, is defined by the locale, itself configured with LC_* environment variables. Modern systems use UTF-8 locales in order to support multilingual use.

For example, when a user wishes to open a file with Unicode characters, such as Šibenik.txt, the application will encode the file name as a UTF-8 byte string, such as "\xc5\xa0ibenik.txt", and pass that string to fopen(). Later, system functions like readdir() will retrieve the same UTF-8 file name, which the application’s file chooser will display to the user as Šibenik.txt. As long as all programs agree on the use of UTF-8, this scheme supports unrestricted use of Unicode characters in file names.

The important consequence of this design is that storing file names as Unicode in the application and encoding them as UTF-8 when passing them to the system will only allow manipulating files whose names are valid UTF-8 strings. To open an arbitrary file on the file system, one must store file names as byte strings. This is exactly the opposite of the situation on Windows, a fact that portable code must take into account.
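The limitation can be seen without touching the file system. The Python 3 sketch below contrasts a name that is valid UTF-8 with a legacy ISO 8859-2 name that is not; is_valid_utf8 is a hypothetical helper written for this example.

```python
def is_valid_utf8(name_bytes):
    """Return True if the byte-string file name decodes as UTF-8."""
    try:
        name_bytes.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# What an application sends to fopen() in a UTF-8 locale.
modern_name = '\u0160ibenik.txt'.encode('utf-8')   # b'\xc5\xa0ibenik.txt'

# A file created long ago under an ISO 8859-2 locale. The kernel accepts it,
# but an application that stores names as Unicode cannot represent it.
legacy_name = b'\xa9ibenik.txt'
```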

Python

Beginning with version 2.0, Python optionally supports Unicode strings. However, most libraries work with byte strings natively (often using UTF-8 to support Unicode), and using Unicode strings is slower and leads to problems when Unicode strings interact with ordinary strings.

On Windows, Python internally uses the legacy byte-based APIs when given byte strings and Windows-specific Unicode APIs when given Unicode strings. This means that Unicode files can be manipulated as long as the programmer remembers to create the correct Unicode string. Not only is it impossible to open some files using the bytes API, but such files are also misrepresented by functions such as os.listdir:

>>> import os
>>> with open(u'\N{SNOWMAN}.txt', 'w'):
...   pass   # create a file with Unicode name
... 
>>> os.listdir('.')
['?.txt']
>>> open('?.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '?.txt'

Opening the directory in Windows Explorer reveals that Python created the file with the correct name. It is os.listdir and its constraint to return byte strings when given a byte string that creates the problem. os.listdir(u'.') returns the usable [u'\u2603.txt'].

Python 3

Python 3 strings are Unicode by default, so it automatically calls the Unicode versions of Win32 calls and does not exhibit bugs like the listdir bug shown above. On the other hand, Python 3 needs special provisions to map arbitrary Unix file names to Unicode, as described in PEP 383.
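The special provision from PEP 383 is the surrogateescape error handler. A minimal Python 3 sketch of the round trip follows; it calls the codec explicitly so that the result does not depend on the locale, whereas on Unix os.fsdecode and os.fsencode apply the same handler using the file system encoding.

```python
# A legacy file name that is not valid UTF-8.
legacy_name = b'\xa9ibenik.txt'

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
as_text = legacy_name.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode('utf-8', 'surrogateescape')
```

The resulting string is not well-formed Unicode, so it is suitable for passing back to the operating system but not for strict encoding to UTF-8 elsewhere.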

File names in applications

Portable programs that want to enable the user to create arbitrary file names must take care how to create and access them. Using portable IO libraries such as gio and Qt resolves many of these problems automatically, but these libraries carry a lot of weight that is unacceptable in many situations. Also, those libraries often don’t interact well with “traditional” C code that accepts file names. In this chapter we present an implementation strategy that enables correct use of Unicode file names with minimal intrusion to the code base.

Since file names are natively bytes on some platforms and Unicode on others, a cross-platform application must choose between these representations. Using Unicode makes programming somewhat easier on platforms with native Unicode APIs, while using UTF-8 bytes has the advantage on platforms with native bytes APIs.

What representation works best depends on the application’s surroundings and the implementation platform. A Python 3 or Java application running on a web server is probably best served by using Unicode consistently and not bothering with Unix non-UTF-8 file names at all. On the other hand, a GTK application, a Python 2 application, or an application needing to interface with C will be better off with UTF-8, which guarantees interoperability with the world of bytes, while retaining lossless conversion to Unicode and back.

This guide presents a programming model based on UTF-8 as the file name representation. UTF-8 was chosen for AVL simulation GUIs due to ease of interoperability with various C APIs, including GTK itself. This choice is also shared by the gio library and other modern Unix-based software. Of course, use of UTF-8 is not limited to file names; it should be used for representation of all user-visible textual data.

Interacting with the file system from Python

Since Python’s built-in functions such as open and os.listdir accept and correctly handle Unicode file names on Windows, the trick is making sure that they are called with correct arguments. This requires two primitives:

  • to_os_pathname: converts a UTF-8 pathname (file or directory name) to OS-native representation, i.e. Unicode when on Windows. The return value should only be used as argument to built-in open(), or to functions that will eventually call it.

  • from_os_pathname: the exact reverse. Given an OS-native representation of a pathname, returns a UTF-8-encoded byte string suitable for use in the GUI.

The implementation of both functions is trivial:

  import os

  def to_os_pathname(utf8_pathname):
      """Convert UTF-8 pathname to OS-native representation."""
      if os.path.supports_unicode_filenames:
          # Unicode-native platform: decode the UTF-8 bytes.
          return unicode(utf8_pathname, 'utf-8')
      else:
          # Bytes-native platform: the UTF-8 bytes are used as-is.
          return utf8_pathname

  def from_os_pathname(os_pathname):
      """Convert OS-native pathname to UTF-8 representation."""
      if os.path.supports_unicode_filenames:
          return os_pathname.encode('utf-8')
      else:
          return os_pathname

With these in place, the next step is wrapping file name access with calls to to_os_pathname. Likewise, file names obtained from the system, as with a call to os.listdir, must be converted back to UTF-8.

def x_open(utf8_pathname, *args, **kwds):
    return open(to_os_pathname(utf8_pathname), *args, **kwds)

def x_stat(utf8_pathname):
    return os.stat(to_os_pathname(utf8_pathname))

# The above pattern can be used to wrap other useful functions from
# the os and os.path modules, e.g. os.remove, os.mkdir, os.makedirs,
# os.path.isfile, os.path.isdir, os.path.exists, and os.getcwd.

def x_listdir(utf8_pathname):
    # Convert the argument on the way in and every result on the way out.
    return [from_os_pathname(name)
            for name in os.listdir(to_os_pathname(utf8_pathname))]

The function that stands out is x_listdir, which is like os.listdir, except it converts file names in both directions: in addition to calling to_os_pathname on the pathname received from the caller, it also calls from_os_pathname on the pathnames provided by the operating system. Taking the example from the previous chapter, x_listdir('.') would correctly return ['\xe2\x98\x83.txt'] (the snowman character encoded as UTF-8, followed by the extension), which x_open('\xe2\x98\x83.txt') would correctly open.
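For readers who want to try the round trip, here is a sketch of the same wrappers ported to Python 3. The port is an assumption of this example rather than part of the original guide: UTF-8 names are bytes objects, OS-native names are str on Unicode-native platforms, and roundtrip_demo is a hypothetical helper that works in a scratch directory.

```python
import os
import tempfile

def to_os_pathname(utf8_pathname):
    """UTF-8 bytes -> OS-native name (str where the OS is Unicode-native)."""
    if os.path.supports_unicode_filenames:
        return utf8_pathname.decode('utf-8')
    return utf8_pathname

def from_os_pathname(os_pathname):
    """OS-native name -> UTF-8 bytes."""
    if os.path.supports_unicode_filenames:
        return os_pathname.encode('utf-8')
    return os_pathname

def x_open(utf8_pathname, *args, **kwds):
    return open(to_os_pathname(utf8_pathname), *args, **kwds)

def x_listdir(utf8_pathname):
    return [from_os_pathname(name)
            for name in os.listdir(to_os_pathname(utf8_pathname))]

def roundtrip_demo():
    """Create a snowman-named file in a scratch directory and list it."""
    with tempfile.TemporaryDirectory() as scratch:
        utf8_dir = os.fsencode(scratch)     # scratch directory as bytes
        snowman = b'\xe2\x98\x83.txt'       # UTF-8 encoding of the name
        x_open(os.path.join(utf8_dir, snowman), 'w').close()
        return x_listdir(utf8_dir)
```

Calling roundtrip_demo() should give back the same UTF-8 bytes that were used to create the file, regardless of which representation the platform uses natively.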

Any function in the program that accepts a file name must accept — and expect to receive — a UTF-8-encoded file name. Functions that open the file using Python’s open, or those that call third-party functions that do so, have the responsibility to use to_os_pathname to convert the file name to OS-native form.

Legacy path names

to_os_pathname is useful when calling built-in open() or into code that will eventually call built-in open(). However, sometimes C extensions beyond our control will insist on accepting the file name to open the file using the ordinary C fopen() call. Passing an OS-native Unicode file name on Windows serves no purpose here because it will fail on a string check implemented by the Python bindings for the library. And even if it somehow passed the check, the library is still going to call fopen() rather than _wfopen().

A workaround when dealing with such legacy code is possible by retrieving the Windows “short” 8+3 pathnames, which are always all-ASCII. Using the short paths, it is possible to write a to_legacy_pathname function that accepts a UTF-8 pathname and returns a byte string pathname usable with both Python open() and the C family of functions such as fopen(). Since short pathnames are a legacy feature of the Win32 API and can be disabled on a per-volume basis, to_legacy_pathname should only be used as a last resort, when it is impossible to open the file by other means.

import os

if not os.path.supports_unicode_filenames:
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        # File names are already bytes here, so there is nothing to do.
        return utf8_pathname
else:
    import ctypes, re
    GetShortPathNameW = ctypes.windll.kernel32.GetShortPathNameW
    has_non_ascii = re.compile(r'[^\0-\x7f]').search
    def to_legacy_pathname(utf8_pathname):
        """Convert UTF-8 pathname to legacy byte-string pathname."""
        if not has_non_ascii(utf8_pathname):
            return utf8_pathname
        unicode_pathname = unicode(utf8_pathname, 'utf-8')
        # First call: ask for the required buffer size, including the NUL.
        short_length = GetShortPathNameW(unicode_pathname, None, 0)
        if short_length == 0:
            raise ctypes.WinError()
        short_buf = ctypes.create_unicode_buffer(short_length)
        # Second call: fill the buffer with the short pathname.
        if GetShortPathNameW(unicode_pathname, short_buf, short_length) == 0:
            raise ctypes.WinError()
        return short_buf.value.encode('ascii')

Summary

If this seems like a lot of thought for something as basic as file names with international characters, you are completely right. Doing this shouldn’t be so hard, and this can be considered an argument for moving to Python 3. However, if you are using C extensions and libraries that accept file names, simply switching to Python 3 will not be enough because the libraries and/or their Python bindings will still need to be modified to correctly handle Unicode file names. A future article will describe approaches taken for porting C and C++ code to become, for lack of a better term, Unicode-file-name-correct. Until then, the to_legacy_pathname() hack can come in quite handy.

Stuffed bell peppers, the Croatian way

Stuffed bell peppers are a staple of cuisines of several southeast-European countries, including Croatia. This recipe, originally published in Croatian, presents how I make them. The translation will hopefully help this lovely Croatian dish reach a wider audience.

In Croatia we typically use the bell peppers of the “babura” variety, but other kinds of bell peppers will do nicely, as long as they are of reasonable size – at least 2 inches in diameter, and 4 inches or more in height.

STUFFED BELL PEPPERS

6 portions
preparation time: about 2 hours, largely unattended

1 onion, chopped
olive oil
salt, pepper
2-3 cloves garlic
1 tbsp paprika
1 pound ground meat, mix of beef and pork
1/2 cup rice
10 bell peppers of medium size
2-3 cups tomato purée (passata di pomodoro)
1 cup wine
water as needed

  1. Put the olive oil in a saucepan over medium heat. When the oil is warm, add onions and cook until soft, about five minutes. When the onions are nearly done, add the garlic, salt, pepper, paprika, and other spices if you like (e.g. ginger, nutmeg, or a dash of cumin). Do not overdo the spices.

  2. While the onions are cooking, wash the bell peppers, cut off the stems, and shake out the seeds. Put the ground meat in a bowl and add the cooked onions. Season with salt and pepper to taste (feel free to try it, a bit of raw meat won’t harm you), add the rice, and mix well.

  3. Stuff the bell peppers with the meat mixture, trying not to pack the meat too tightly. Arrange the peppers in a cooking pot, if possible so that they stand upright holding each other; leave as little room as possible between them. If the peppers do not fit in one layer, cook them in two smaller pots.

  4. Mix the tomato purée and wine and season with salt. Pour the mixture over the peppers in the pot and add water until the peppers are almost fully submerged. If you are using two pots, equally divide the purée and wine between them and then add water. While the peppers are cooking, do not stir them, just occasionally shake the whole pot. Cook for an hour and a half on low heat.

Let the cooked peppers rest for at least an hour. Serve with mashed potatoes and some crusty bread.

Do not throw away the purée in which the peppers were cooking. If some remains uneaten, freeze it and use it as stock for a future dish.

Punjene paprike

Stuffed peppers are a classic Croatian dish, equally popular in the north and in the south. This version is how I make them: in a more or less classic way, but with the occasional modern spice.

PUNJENE PAPRIKE

6 portions
preparation time: about 2 hours, largely unattended

1 onion, chopped
olive oil
salt, pepper
2-3 cloves garlic
1 tbsp sweet ground paprika
1/2 kg mixed ground meat
15 dkg rice
about 10 peppers of medium size
500-750 g tomato purée (passata)
2 dl wine
water as needed

  1. Cover the bottom of a pan with olive oil and cook the onion over medium heat until golden. Towards the end add the crushed garlic, salt, pepper, ground paprika and, if you like, other spices (for example ginger, cumin, Thai curry, or nutmeg). Do not overdo the spices, so that they do not overpower the fine flavor of the peppers.

  2. While the onion is frying, wash the peppers and cut off their tops. Put the ground meat in a bowl and mix it with the fried onion. Taste it to check that it is salty enough (although it is raw, a bite won't harm you) and add salt if needed. Add the rice and mix well.

  3. Stuff the peppers with the meat mixture, taking care not to pack them too tightly. Arrange the peppers in a pot, if possible so that they stand upright holding each other; leave as little room as possible between them. If the peppers do not fit in one layer, it is better to cook them in two smaller pots than to stack them in two layers.

  4. Mix the tomato purée with the wine and season with salt. Pour the liquid over the peppers and add water until it comes close to the top of the peppers. If you are cooking in two pots, divide the purée and wine equally between them and then add the water. While the peppers are cooking, do not stir them, just occasionally shake the pot. Cook for an hour and a half on low heat.

Let the cooked peppers rest for at least an hour and serve them with mashed potatoes and bread.

Do not throw away any sauce that remains uneaten; freeze it and use it as a fine stock for a future dish.

Data transfer with Netcat

The other day my brother, who works as a system administrator, inquired about a puzzling behavior of GNU Netcat, the popular nc utility. Sometimes described as the TCP/IP Swiss army knife, it can come in handy as an ad hoc file transfer tool, capable of transferring large amounts of data at the speed of disk reads/writes.

Data transfer

Typical usage looks like this:

nc -lp60000 | tar x                 # receiver
tar c dir... | nc otherhost 60000   # sender

It may look strange at first, but it’s easy to type and, once understood, almost impossible to forget. The commands work everywhere and require no specialized server software, only a working network and nc itself. The first command listens on port 60000 and pipes the received data to tar x. The second command provides the data by piping output of tar c to the other machine’s port 60000. Dead simple.

Note that transferring files with Netcat offers no encryption, so it should only be used inside a VPN, and even then not for sensitive data.

Data loss

One surprising behavior of this mode of transfer is that both commands remain hanging after the file transfer is done. This is because neither nc is willing to close the connection, as the other side might still want to say something. As long as one is positive that the transfer is finished (typically confirmed by disk and network activity having ceased), both commands can be safely interrupted with ^C.

The next step is adding compression into the mix, in order to speed up transfer of huge but easily compressible database dumps.

nc -lp60000 | gunzip -c | tar x               # receiver
tar c dir... | gzip -c | nc otherhost 60000   # sender

At first glance, there should be no difference between this pipeline and the one above, except that this one compresses the content sent over the wire and decompresses received content. However, much to my surprise, the compressed pipeline consistently failed to correctly transfer the last file in the tar stream, which would end up truncated. And this is not a case of pressing ^C too soon: truncation occurs no matter how long you wait for the transfer to finish. How is this possible?

It took some strace-ing to diagnose the problem. When the sender nc receives EOF on its standard input, it makes no effort to broadcast the EOF condition over the socket. Some Netcat implementations close (“shut down”) the write end of the socket after receiving local EOF, but GNU Netcat doesn’t. Failure to shut down the socket causes the receiving nc to never “see” the end of file, so it in turn never signals EOF to gunzip. This leaves gunzip hanging, waiting for the next 32K chunk to complete, or for EOF to arrive, neither of which ever happens.
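What “shutting down the write end” means at the socket level can be sketched in a few lines of Python 3. This is an illustration of the mechanism, not Netcat's actual code; the function names are made up, and a local socket pair stands in for the network connection.

```python
import socket

def send_then_half_close(sock, payload):
    """Send everything, then signal EOF to the peer by closing the write side."""
    sock.sendall(payload)
    sock.shutdown(socket.SHUT_WR)   # the peer's recv() will now return b''

def read_until_eof(sock):
    """Collect data until the peer shuts down its write side."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:               # empty read means EOF was signaled
            break
        chunks.append(chunk)
    return b''.join(chunks)

def shutdown_demo():
    sender, receiver = socket.socketpair()
    with sender, receiver:
        send_then_half_close(sender, b'last file contents')
        return read_until_eof(receiver)
```

Without the shutdown call, read_until_eof would block in recv() indefinitely, which is the situation the receiving nc ends up in.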

Preventing Netcat data loss

Googling this issue immerses one into a twisted maze of incompatible Netcat variants. Most implementations shut down the socket on EOF by default, but GNU Netcat not only doesn’t do so, it doesn’t appear to have an option to do so! Needless to say, the huge environment where my brother works would never tolerate swapping the Netcat implementation on dozens of live servers, possibly breaking other scripts. A solution needed to be devised that would work with GNU Netcat.

At this point, many people punt and use the -w option to resolve the problem. -w SECONDS instructs nc to exit after the specified number of seconds of network inactivity. In the above example, changing nc -lp60000 to nc -lp60000 -w1 on the receiving end causes nc to exit one second after the data stops arriving. nc exiting causes gunzip to receive EOF on standard input, which prompts it to flush the remaining uncompressed data to tar.

The only problem with the above solution is that there is no way to be sure that the one-second timeout occurred because the data stopped arriving. It could as well be the result of a temporary IO or network glitch. One could increase the timeout to decrease the probability of a prematurely terminated transfer, but this kind of gamble is not a good idea in production.

Fortunately, there is a way around the issue without resorting to -w. GNU Netcat has a --exec option that spawns a command whose standard input and standard output point to the actual network socket. This allows the subcommand to manipulate the socket in any way, and fortuitously results in the socket getting closed after the command exits. With the writing end closing the socket, neither nc is left hanging, and the transfer completes:

nc -lp60000 | gunzip -c | tar x                  # receiver
nc -e 'tar c dir... | gzip -c' otherhost 60000   # sender

Self-delimiting streams

There is just one little thing that needs explaining: why did the transfer consistently work with tar, and consistently fail to work with the combination of tar and gzip?

The answer is in the nature of the stream produced by tar and gzip. Data formats generally come in two flavors with respect to streaming:

  1. Self-delimiting: formats whose payload carries information about its termination. An example of a self-delimiting stream is an HTTP response with the Content-Length header: a client can read the whole response without relying on an out-of-band “end of file” flag. (HTTP clients use this very feature, along with some more advanced ones, to reuse the same network socket for communicating multiple requests with the server.) A well-formed XML document without trailing whitespace is another example of a self-delimiting stream.

  2. Non-self-delimiting: data formats that do not contain intrinsic information about their end. A text file or an HTML document are examples of those.
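The distinction can be made concrete with a toy length-prefixed format, sketched here in Python 3. The 4-byte big-endian prefix and the function names are choices made for this example and do not correspond to any of the formats mentioned above.

```python
import io
import struct

def write_frame(stream, payload):
    """Write a 4-byte big-endian length followed by the payload."""
    stream.write(struct.pack('>I', len(payload)))
    stream.write(payload)

def read_frame(stream):
    """Read exactly one frame; the length prefix says where it ends."""
    (length,) = struct.unpack('>I', stream.read(4))
    return stream.read(length)

def framing_demo():
    stream = io.BytesIO()
    write_frame(stream, b'first')
    write_frame(stream, b'second message')
    stream.seek(0)
    # Both frames are recovered without ever looking for end of file.
    return read_frame(stream), read_frame(stream)
```

A reader of such a stream knows when each message is complete, so it never has to wait for EOF to hand the data on.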

While a tar archive as a whole is not self-delimiting (nor can it be, since tar allows appending additional members at the end of the archive), its individual pieces are. Each file in the archive is preceded by a header announcing the size of the file. This allows the receiving tar to read the correct number of bytes from the pipe without needing additional end-of-file information. Although tar will happily hang forever waiting for more files to arrive on standard input, every individual file will be read to completion.
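The size announcement is visible in the raw bytes. The Python 3 sketch below builds a one-member archive in memory with the standard tarfile module and reads the size back from the first 512-byte header block, where it is stored as octal digits at offset 124. The helper names are hypothetical, and the ustar format is requested explicitly to keep the layout simple.

```python
import io
import tarfile

def build_archive(name, content):
    """Build a one-member ustar archive in memory."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode='w',
                      format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return out.getvalue()

def first_member_size(archive_bytes):
    """Parse the size field of the first header block."""
    header = archive_bytes[:512]
    size_field = header[124:136]            # 12 bytes holding octal digits
    return int(size_field.rstrip(b'\0 ').decode('ascii'), 8)

archive = build_archive('hello.txt', b'hello')
announced = first_member_size(archive)      # size tar reads before the data
```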

On the other hand, gzip does not produce a self-delimiting stream. gunzip reads data in 32-kilobyte chunks for efficient processing. When it receives a shorter chunk, say 10K of data, it doesn’t assume that this is because EOF has occurred; it just waits for the remaining 22K of data to fill its buffer. gunzip relies on the EOF condition to tell it when and if the stream has ended, after which it does flush the “short” buffer it last read. This incomplete buffer is what caused the data loss when compression was added.
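The buffer-filling behavior described here can be modeled with a short read loop. This Python 3 sketch is an illustration of the pattern, not gunzip's actual implementation; read_full is a hypothetical helper, and a local pipe stands in for the connection between nc and gunzip.

```python
import os

def read_full(fd, size):
    """Read until `size` bytes are collected or EOF is reached."""
    buf = b''
    while len(buf) < size:
        chunk = os.read(fd, size - len(buf))
        if not chunk:        # EOF: only now is a short buffer handed back
            break
        buf += chunk
    return buf

def short_buffer_demo():
    read_end, write_end = os.pipe()
    os.write(write_end, b'x' * 10)
    # Closing the write end delivers EOF. Without this close, read_full
    # would block forever waiting for the rest of its 32K buffer.
    os.close(write_end)
    try:
        return read_full(read_end, 32 * 1024)
    finally:
        os.close(read_end)
```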

This is why the -e solution works so nicely: it not only makes the socket close, it ensures that EOF is signaled across the chain of piped commands, including the receiving gunzip.