The other day my brother, who works as a system administrator, inquired about a puzzling behavior of GNU Netcat, the popular nc utility. Sometimes described as the TCP/IP Swiss army knife, it can come in handy as an ad hoc file transfer tool, capable of transferring large amounts of data at the speed of disk reads/writes.
Data transfer
Typical usage looks like this:
nc -lp60000 | tar x # receiver
tar c dir... | nc otherhost 60000 # sender
It may look strange at first, but it’s easy to type and, once understood, almost impossible to forget. The commands work everywhere and require no specialized server software, only a working network and nc itself. The first command listens on port 60000 and pipes the received data to tar x. The second command provides the data by piping the output of tar c to the other machine’s port 60000. Dead simple.
Note that transferring files with Netcat offers no encryption, so it should only be used inside a VPN, and even then not for sensitive data.
Data loss
One surprising behavior of this mode of transfer is that both commands remain hanging after the file transfer is done. This is because neither nc is willing to close the connection, as the other side might still want to say something. As long as one is positive that the transfer is finished (typically confirmed by disk and network activity having ceased), both commands can be safely interrupted with ^C.
The next step is adding compression into the mix, in order to speed up transfer of huge but easily compressible database dumps.
nc -lp60000 | gunzip -c | tar x # receiver
tar c dir... | gzip -c | nc otherhost 60000 # sender
At first glance, there should be no difference between this pipeline and the one above, except that this one compresses the content sent over the wire and decompresses the received content. However, much to my surprise, the compressed pipeline consistently failed to correctly transfer the last file in the tar stream, which would end up truncated. And this is not a case of pressing ^C too soon: truncation occurs no matter how long you wait for the transfer to finish. How is this possible?
It took some strace-ing to diagnose the problem. When the sender nc receives EOF on its standard input, it makes no effort to broadcast the EOF condition over the socket. Some Netcat implementations close (“shut down”) the write end of the socket after receiving local EOF, but GNU Netcat doesn’t. Failure to shut down the socket causes the receiving nc to never “see” the end of file, so it in turn never signals EOF to gunzip. This leaves gunzip hanging, waiting for the next 32K chunk to complete, or for EOF to arrive, neither of which ever happens.
Preventing Netcat data loss
Googling this issue immerses one into a twisted maze of incompatible Netcat variants. Most implementations shut down the socket on EOF by default, but GNU Netcat not only doesn’t do so, it doesn’t appear to have an option to do so! Needless to say, the huge environment where my brother works would never tolerate swapping the Netcat implementation on dozens of live servers, possibly breaking other scripts. A solution needed to be devised that would work with GNU Netcat.
At this point, many people punt and use the -w option to resolve the problem. -w SECONDS instructs nc to exit after the specified number of seconds of network inactivity. In the above example, changing nc -lp60000 to nc -lp60000 -w1 on the receiving end causes nc to exit one second after the data stops arriving. nc exiting causes gunzip to receive EOF on standard input, which prompts it to flush the remaining uncompressed data to tar.
The only problem with the above solution is that there is no way to be sure that the one-second timeout occurred because the data stopped arriving. It could as well be the result of a temporary IO or network glitch. One could increase the timeout to decrease the probability of a prematurely terminated transfer, but this kind of gamble is not a good idea in production.
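If the compressed stream is spooled to a file before unpacking, a premature cut-off can at least be detected after the fact, because a gzip file ends with a CRC-32 and length trailer that gzip -t checks. The following is a minimal local sketch, not part of the original setup: the file names are made up, and the cut is simulated with head rather than a real network timeout.

```shell
#!/bin/sh
# Sketch: gzip -t tells a complete stream from a truncated one.
# All file names below are hypothetical; the truncation is simulated locally.
tmp=$(mktemp -d)

# Generate some compressible sample data.
awk 'BEGIN { for (i = 0; i < 100000; i++) print "row", i }' > "$tmp/data.txt"
gzip -c "$tmp/data.txt" > "$tmp/full.gz"

# Simulate a transfer that was cut off halfway through.
size=$(wc -c < "$tmp/full.gz")
head -c $((size / 2)) "$tmp/full.gz" > "$tmp/cut.gz"

# gzip -t exits non-zero when the stream ends before its trailer.
gzip -t "$tmp/full.gz" 2>/dev/null && full_ok=yes || full_ok=no
gzip -t "$tmp/cut.gz" 2>/dev/null && cut_ok=yes || cut_ok=no

echo "complete stream passes: $full_ok"
echo "truncated stream passes: $cut_ok"
rm -rf "$tmp"
```

This only reports the damage; it does nothing to prevent a timeout from firing too early.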
Fortunately, there is a way around the issue without resorting to -w. GNU Netcat has a --exec option that spawns a command whose standard input and standard output point to the actual network socket. This allows the subcommand to manipulate the socket in any way, and fortuitously results in the socket getting closed after the command exits. With the writing end closing the socket, neither nc is left hanging, and the transfer completes:
nc -lp60000 | gunzip -c | tar x # receiver
nc -e 'tar c dir... | gzip -c' otherhost 60000 # sender
Self-delimiting streams
There is just one little thing that needs explaining: why did the transfer consistently work with tar, and consistently fail with the combination of tar and gzip?
The answer is in the nature of the streams produced by tar and gzip. Data formats generally come in two flavors with respect to streaming:
- Self-delimiting: formats whose payload carries information about its termination. An example of a self-delimiting stream is an HTTP response with the Content-Length header: a client can read the whole response without relying on an out-of-band “end of file” flag. (HTTP clients use this very feature, along with some more advanced ones, to reuse the same network socket for communicating multiple requests with the server.) A well-formed XML document without trailing whitespace is another example of a self-delimiting stream.
- Non-self-delimiting: data formats that do not contain intrinsic information about their end. A text file or an HTML document are examples of those.
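To make the distinction concrete, here is a small sketch of a self-delimiting framing in shell. The format (a decimal length, a newline, then the payload) and the function names send_frame and recv_frame are invented for illustration, and the sketch assumes payloads of single-byte characters.

```shell
#!/bin/sh
# Sketch of a length-prefixed, self-delimiting framing (hypothetical format).

# send_frame writes "<length>\n<payload>".
# Assumes one byte per character, so ${#1} equals the byte count.
send_frame() {
    printf '%s\n%s' "${#1}" "$1"
}

# recv_frame reads the length line, then exactly that many bytes.
# It never needs to see EOF to know where the payload ends.
recv_frame() {
    IFS= read -r len
    dd bs=1 count="$len" 2>/dev/null
}

# Two frames travel over one pipe; the reader separates them
# using nothing but the announced lengths.
frames=$(
    { send_frame 'hello'; send_frame 'big world'; } |
    { recv_frame; echo; recv_frame; echo; }
)
printf '%s\n' "$frames"
```

The reader recovers both payloads intact even though nothing in the stream marks where one ends and the next begins, other than the length prefix.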
While a tar archive as a whole is not self-delimiting (nor can it be, since tar allows appending additional members at the end of the archive), its individual pieces are. Each file in the archive is preceded by a header announcing the size of the file. This allows the receiving tar to read the correct number of bytes from the pipe without needing additional end-of-file information. Although tar will happily hang forever waiting for more files to arrive on standard input, every individual file will be read to completion.
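The size announcement can be inspected directly. The sketch below assumes a tar that accepts --format=ustar (GNU tar and bsdtar both do) and uses a made-up member name; in the ustar layout the size field starts at byte offset 124 of the 512-byte header and is stored as octal digits.

```shell
#!/bin/sh
# Sketch: read the size field out of the first tar header.
# Assumes tar supports --format=ustar; member.bin is a hypothetical file.
tmp=$(mktemp -d)

# Create a member with a known size of 1234 bytes.
head -c 1234 /dev/zero > "$tmp/member.bin"
tar -C "$tmp" --format=ustar -cf "$tmp/demo.tar" member.bin

# Header bytes 124..134 hold the member size as 11 octal digits.
size_octal=$(dd if="$tmp/demo.tar" bs=1 skip=124 count=11 2>/dev/null)

# The extra leading 0 makes printf parse the digits as octal.
size_bytes=$(printf '%d' "0$size_octal")

echo "header announces $size_bytes bytes"
rm -rf "$tmp"
```

With this field in hand, a reader knows how many payload bytes follow the header before the next one begins.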
On the other hand, gzip does not produce a self-delimiting stream. gunzip reads data in 32-kilobyte chunks for efficient processing. When it receives a shorter chunk, say 10K of data, it doesn’t assume that this is because EOF has occurred; it just waits for the remaining 22K of data to fill its buffer. gunzip relies on the EOF condition to tell it when and if the stream has ended, after which it does flush the “short” buffer it last read. This incomplete buffer is what caused the data loss when compression was added.
This is why the -e solution works so nicely: it not only makes the socket close, it ensures that EOF is signaled across the chain of piped commands, including the receiving gunzip.