How toy problems are different

A lot of simple text-processing problems involve operating on the words of the input. Norvig's spellchecker, for instance, begins by extracting the words:

import re

def words(text): return re.findall('[a-z]+', text.lower())
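
For instance (my example, not Norvig's), it lowercases the text and returns the runs of letters it finds:

words("The quick brown fox...")  # ['the', 'quick', 'brown', 'fox']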

So do some of the old Shootout problems (though none of the currently active ones, and they don't keep old ones around, so if you want to see an example you'll have to use the Wayback Machine). This sets my library-senses tingling. If so many programs need to split text into words, shouldn't words be in the standard library?

There are two problems here. The lesser is that the definition of "word" varies a bit - sometimes it's defined by alphanumeric characters, sometimes by whitespace. This isn't a showstopper, because some reasonable definition will probably work for a large fraction of the potential uses. The greater problem is that splitting things into words is much less common in real programs than in toy examples. Words are like Fibonacci numbers: they're familiar and easy to ask for, so they occur unusually often in made-up requirements. Neither is common in real programs - probably not common enough to warrant including them in a language's library.
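
To make the first point concrete, here's a throwaway comparison (my own example, not from any particular program) of the two definitions applied to the same input:

import re

text = "Pick-up the 2nd item, won't you?"
re.findall('[a-z]+', text.lower())  # ['pick', 'up', 'the', 'nd', 'item', 'won', 't', 'you']
text.split()                        # ['Pick-up', 'the', '2nd', 'item,', "won't", 'you?']

Neither answer is wrong; they're just different notions of "word".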

This is one of the pitfalls of testing a language against toy problems. Unlike real problems, they're strongly biased toward simplicity of requirements, which affects what they demand from a language. Real programs tend to spend a lot of code on I/O and error handling, neither of which turns up much in toy problems. Is it a coincidence that these two areas are awkward to express in most languages?
