Have fun with Unix
Just one line of code, but lots of confusion. What does this program do?
Who wrote this code?
The code won the "Best One-Liner" prize at the IOCCC in 1987. It was written by David Korn, who is also the author of the Korn shell (ksh).
Let's run it
The code compiles just fine with gcc on Linux, giving a couple of harmless warnings about the implicit declaration of printf and the omitted return type of main(). When run, it prints one single word:
% ./korn
unix
Where does this come from? A quick glance at the source shows what appear to be some NUL characters and the strings "six", "have" and "fun", but unix in this code looks more like an implicitly-declared variable than a character string.
What is UNIX and where does it come from?
The obvious and boring answer is that UNIX is an operating system that comes from Bell Labs, but what we're looking for is the symbol unix and its value in the program. Let's run the code through the preprocessor now.
% cpp korn.c
# 1 "korn.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "korn.c"
The output shows that the preprocessor has replaced unix with 1 in the code. But why does it do that?
The online GNU documentation for cpp says "it is common to find unix defined on Unix systems", citing historical reasons: code could then contain clauses like #ifdef unix ... #endif for conditional compilation. Luckily, it is possible to make cpp output all the #define directives that are in effect:
% cpp -dM korn.c | grep unix
#define __unix__ 1
#define __unix 1
#define unix 1
Here's the confirmation -- unix is a system-specific predefined macro that has the value of 1. Let's make the substitution in the source too:
int
Why so many NULs?
We will begin by clearing up the confusion about the multiple "\0"s found in the string. It turns out that "\021" does not mean a NUL followed by a '2' and a '1', but is an escape sequence denoting a single byte whose value is written in octal (base 8). The same is true for "\012". These two are better written as 0x11 and 0x0A; the former is defined as "Device Control 1" and (as we will see soon) is not really important here, the latter is just a newline, "\n".
int
Confusing pointers
The code uses the commutativity of addition to cleverly obfuscate the character strings passed to the printf() function. Let's take a closer look at the technique now.
C lets you access array elements using square brackets: array[index]. Since array names behave like pointers, the elements can also be accessed like this: *(array+index). Now, as addition is commutative, it is also possible to write the latter as *(index+array) and, as a consequence, as index[array]. This trick alone is enough to confuse programmers not used to seeing constructs like 5["abcdef"], and here it is wrapped in yet another layer of obfuscation.
Knowing all this, it is possible to write the first argument, 1["\x11%six\n\0"], more clearly as "\x11%six\n\0"[1]. Since strings in C are just zero-indexed character arrays, [1] simply skips the 0th character, '\x11', and selects the percent sign. The ampersand & then takes the address of that character, and this pointer is what gets passed to printf(). In the end, the format string passed as the first argument is "%six\n\0"; printf() stops reading at the first NUL, so the explicit "\0" at the end changes nothing.
Time to take the second argument apart now. The %s at the beginning of the format string says that the next argument is going to be a null-terminated character string.
After what we have gone through here, the expression (1)["have"]+"fun"-0x60 is pretty easy to take apart. First there is (1)["have"], which can be rewritten as "have"[1] and then as 'a'. Next there's "fun", and 0x60. A look at the ASCII table shows that 'a' has the value 0x61, so the expression can be simplified to "fun"+0x61-0x60, and then to "fun"+1, which evaluates to "un".
The final version
int
This is the code with all the obfuscations removed. There's no "fun" at all here; what remains is just an "un", the first half of the "unix" that's written to the standard output.
An exercise for the reader
Can you have fun without UNIX? What happens when the code is compiled on a non-UNIX machine? What if unix is defined as 0?