Brute logic

There may be a risk to pursuing elegance too much: you can learn to ignore solutions that, while straightforward, are not particularly interesting. I recently added a feature I'd wanted for years to a program I maintain. The implementation didn't require any clever insights, or fall naturally out of the right view of the problem, nor was there anything particularly ugly about it. It was just ten or so lines of unexciting logic, with no surprises and no cleverness. So even though I knew this feature would be handy, I put it off because I didn't see a good way to do it. But it didn't require a good way; the obvious one was fine. If I were less hung up on elegance, I'd have written it years ago.

This is not the first time I've had this problem. To ignore anything that's not beautiful or interesting may be a good guide for hard problems, but for easy ones it's paralysing. Sometimes you just have to write the code.

Efficiency above all else

I hack C++ at work. When it comes to collections, this means using the Standard Template Library, which can be frustrating, because STL was designed with efficiency in mind, and convenience decidedly not in mind. For instance, many of STL's collection operations are defined not on collections but on pairs of iterators, which are to arbitrary collections as pointers are to arrays. That's so STL can handle arrays in the familiar, efficient way, but the extra abstraction makes them even more inconvenient than pointers. Iterator code is full of things like if (find(collection.begin(), collection.end(), x) != collection.end()) ..., but there are no versions that simply take collections instead, even though they're easy to define and they make for much clearer code. That's just not considered important in C++ culture.

Sometimes the obsession with efficiency makes for strange omissions. STL includes a stack class, but it doesn't quite support pop. It has a pop method, but that only removes the top element; it doesn't return it, which is the part you almost always want. What's going on? The STL documentation explains:

One might wonder why pop() returns void, instead of value_type. That is, why must one use top() and pop() to examine and remove the top element, instead of combining the two in a single member function? In fact, there is a good reason for this design. If pop() returned the top element, it would have to return by value rather than by reference: return by reference would create a dangling pointer. Return by value, however, is inefficient: it involves at least one redundant copy constructor call. Since it is impossible for pop() to return a value in such a way as to be both efficient and correct, it is more sensible for it to return no value at all and to require clients to use top() to inspect the value at the top of the stack.

This is a peculiar inversion: stack doesn't have the most useful form of pop, not because the language is too weak to express it, but because it's too powerful. Since C++ has types with copy constructors, which might be inefficient to return from pop, nothing can be returned - not even the primitive types which can be returned efficiently, and which (especially pointers) are most common in collections anyway.

OK, stack really ought to just support both operations. Just as all the functions accepting pairs of iterators should be overloaded to accept collections as well. But these are matters of convenience, not efficiency, so in C++ they don't matter.

Better Rebol info

My Google-fu is weak. Rebol's standard documentation may not be good, but there are better alternatives, and I didn't find them. Fortunately commenters supplied links and hints on what to search for, leading to at least:

  • A better tutorial that shows off Rebol's GUI library. How many languages can do one-liners with GUIs? How many language designers even consider the possibility, despite its obvious value?
  • Some articles on Rebol's semantics, including a metacircular interpreter. They're not entirely clear but they may be the best available.
  • ...and even a surviving copy of Joe Marshall's Rebol-to-Scheme compiler. (But that's for Rebol 1.0; the semantics have changed since.)

If I'm reading this right, Rebol's bind (which is implicit in evaluation) makes closures, but it doesn't store the environment once per function - it stores it once per word! That presumably makes func (λ) more expensive, but allows lexical scoping of code that's passed around without being compiled first. Which is very common in Rebol - one of the main ways to extend the language is to define interpreters that operate on blocks. Because blocks are self-evaluating, there's no syntactic overhead to this - it's just like FEXPRs. Rebol appears to have actual FEXPRs too, via unevaluated arguments, but they're not needed much.

Next (when I get around to it): investigating the libraries.

Why I haven't looked at Rebol much

I looked at Rebol a few years ago, but lost interest pretty quickly. Recently commenter Edoc recommended it, so I took another look, and remembered why I lost interest last time.

The first problem is that the docs are frustratingly introductory. Everything I can find is aimed at explaining the language to beginners, with lots of repetition and very little precision. Would it be so hard to explain the language kernel in a way someone who knows languages can understand? The closest I can find to a description of the semantics is the overview of evaluation and one entry in the FAQ. There are grandiose statements about how different Rebol is from other languages, and how it's supposed to take over the world, but not much explanation.

Okay, even from the beginner documentation you can learn something. Rebol is intended for making little DSLs - so, like Lisp, it separates the syntax from the language. Programs are defined in terms of trees, not text, and the trees are easily available to interpreters. But Rebol also tries to preserve a familiar appearance of programs - infix, and without too many parentheses. The usual way to do this is to keep the trees as abstract as practical, and have the parser transform complex syntax into a simple underlying representation. Rebol does the opposite. It preserves most of the syntax in its parse trees, so the parser is simple but the interpreter is complicated. In Rebol, 2 * sin :x reads as a four-element sequence: [2 * sin :x]. The advantage is that code closely resembles its source text; the disadvantage is that even parsed code is rather complicated and awkward to operate on.

The parser is not actually all that simple, because Rebol has syntax for a great many data types. In addition to the four kinds of symbols and three kinds of sequences used for programs, there are a variety of other types with syntax. The complete list is:

  • Word (symbol; 4 kinds): foo: bar :baz 'error-out-of-metasyntactic-variables
  • Block (vector): [1 two [three] "IV"]
  • Paren (same as Block, but different eval semantics): (alpha [beta * gamma] delta)
  • Refinement (generalized field accessor): /x
  • Path (field access expression): a/b/c
  • Integer: -55 16'777'216
  • "Decimal" (actually binary floating point; there's nothing decimal about them): 10.0 2,718'281'828'4 6.023E23
  • String: "Hello,^(line)world!" {Hello,
    world!} (braced strings may span lines)
  • Binary: #{F00DF00DF00D}
  • Email:
  • File (pathname): %/etc/passwd %rebol%20stuff\test.r
  • Issue: #123-456-789 (how is this usefully different from String?)
  • Tag: <a href="">
  • URL: file:///etc/passwd
  • Character: #";" #"字"
  • Date: 2008-07-26 26-Jul-2008 7/26/2008
  • Time: 20:51:33
  • Tuple (of 3 or more small integers):
  • Money: EUR$0.02 $0.03
  • Pair (2-D point): 1024x768

There are only a few types that don't have syntax:

  • Logic (boolean): true false yes no on off (these are constants, not syntax)
  • None (nil): none (constant)
  • List
  • Hash (hashtable)
  • Image (bitmap)

Curiously, much of the syntax is defined by special kinds of tokens rather than with special characters. It's not just numbers, but dates, times, and even pairs. This makes it a bit hard to tell how a token will be parsed, and thus whether an unusual symbol needs to be escaped. It seems to me that Rebol goes a bit overboard on this. I'm not sure why dates or email addresses need to be handled by the parser rather than by the few operations that use them. The syntactic support is slightly convenient for some small programs, but I have a hard time believing it helps with larger ones.

Of course a language's semantics are more important than its syntax, and it's tricky to figure them out from a tutorial. The obvious alternative is to read the source - but here's the other reason I stopped looking at Rebol: it's not open-source. “REBOL/Core is free and will always be free”, says the license page, but they only mean gratis. Quite frustrating to someone who might understand it by looking at eval, or to anyone considering using the language. Joe Marshall (who used to work on Rebol) apparently wrote a Rebol-to-Scheme compiler, but the actual code seems to have vanished from the face of the Internet. (Update: Here's a copy. Thanks, anonymous!) Fortunately, bits of a Rebol interpreter survive in Joe's talk on the difficulties of doing tail-call optimization. From these, and from the documentation, I think I understand Rebol's eval now, though not with enough confidence to explain it formally - go read that talk if you want to know.

The odd thing is that it operates on sequences, so it has to parse as it evaluates. The two operations which consume arguments (assignments and function calls) evaluate as many as they need from the rest of the sequence, not knowing where an expression ends until they've evaluated it. (This is why tail-call optimization is hard: you can't tell whether you're in a tail context until the last moment.) After evaluating, if there's nothing left, then you just did a tail call, and its result is the value of the block. If there is something left, and it begins with an infix operator, then what you just evaluated was its first argument, so you evaluate another argument and call the infix operator. (This means all infix operators are left-associative.) This is what it takes to interpret unparenthesised prefix with variable arity, but it's certainly awkward.

Here's something disturbing: Rebol 1.0 had lexical scope, but Rebol 2.0 doesn't. Blocks are just self-evaluating, and it's up to functions (or should I say FEXPRs?) like if to evaluate them in some environment. (However, this article suggests words know something about their environment.) I don't know how assignment handles nested scopes. It is worrying, especially in a poorly specified language with no reference implementation, to see what appears to be a well-known mistake, and the possibility of several others. I don't trust my intuition when evaluating a language, but I'm beginning to get the impression that Rebol has a lot of cute features on a rather shaky foundation.

Trying to reconstruct a language from introductory documentation is a frustrating task, so I'll stop here. Unfortunately I've exhausted my patience before learning much about its library or how it works in practice, so I still don't know what the good details were that Edoc was talking about. Maybe some other time.

Array mapped hash tries and the nature of data structures

Clojure provides several immutable collection types: ordinary lists, associative arrays, and "a hash map and vector both based upon array mapped hash tries". Array mapped hash tries? What are those?

The short answer is that they're hashtables whose bodies are tries (prefix trees) made of bitmapped sparse arrays. ("Array mapped" refers, rather confusingly, to the last part.) The long answer can be found in two of Phil Bagwell's papers: Fast and Space Efficient Trie Searches and Ideal Hash Trees.

Nowadays most programmers are used to thinking of hashtables as a primitive data structure, but they're not as simple as they look. Ordinary hashtables combine three ideas in their implementation: arrays, hashing, and reprobing. Arrays are the most space-efficient collection type, so naturally you want to use them for everything, but they have the limitation that they can only be indexed with (a contiguous range of) integers. Hashing fixes this - it turns anything into an integer, suitable for indexing arrays or whatever data structure you please, at the cost of introducing collisions. Reprobing deals with those collisions, without the inefficiency of chaining. The combination has the coveted trait of fast constant-time lookup, so almost everyone uses it, but it's not the only way to build a hashtable.

Each of those three concepts can be used independently of the others. In particular, hashes can be used to index other structures than arrays. This is what Clojure does. Rather than backing its hashtables with (reprobed) arrays, it backs them with tries, treating the hashcode as a sequence of 5-bit integers. Tries are less efficient than arrays, of course, but they have the advantage of nondestructive update in log time. Clojure's collections are immutable, so efficient nondestructive update is a must.

(Mini-rant: I refuse to call it "persistence". There are plenty of other words for this - "nondestructive", "functional", "incremental". Do we have to appropriate a term with a well-established meaning that is both important and likely to be used in the same context?)

Tries have a singularly confusing name, which may explain how little attention they get. But they also have a space problem: each node may have a significant number of children, which must be accessed quickly, so an array is the obvious choice. But most nodes have only a few children, so arrays would waste a lot of space. Bagwell's (and Clojure's) solution is to use bitmapped sparse arrays. Each node of the tree has a bitmap showing which children are present - and then a small array containing only those entries. For example, in a base-32 trie, a node with children 1, 5, and 11 would be represented by a bitmap 100000100010 and a 3-element array containing pointers to the children. This represents a 32-element sparse array in only 4 words! It's not even inefficient to index into, as you can determine the index of an entry by counting the 1 bits below it in the bitmap. This is supported in hardware on most architectures - but not, alas, on x86.

But even with hardware support, these tries are considerably slower than ordinary hashtables - they're fast log time, not fast constant time. There is an unavoidable price to nondestructive updates: they require keeping more information around than destructive ones. We functional programmers are used to thinking of state as a valuable (if hazardous) optimization tool, but we often mistake the reason. The main performance benefit of side effects isn't from the obvious operations like nconc, that use state to recycle old objects and slightly reduce allocation. Much more important is the freedom they give us, to use different data structures and different algorithms, and to express our programs in different ways.