The name of things
by TheWatcher on Dec.27, 2013, under General
Yes, I really suck at this thing. I blame the fact that the Earth spins too rapidly, inconsiderate ball of damp rock.
A couple of days ago someone mentioned the (broken) behaviour of a system they developed when people whose names do not conform to an expected format are added to it. This reminded me that a while ago a fellow member of the Secret Cabal of Shadowy Associates linked me to a blog post about programmers’ misconceptions and errors when dealing with names, specifically names of individuals. The entire thing actually irritated the everliving crap out of me: it is a prime example of a smug bastard saying “here’s a problem I’ve identified, and I know how to solve it… but I’m not going to tell you how! Aren’t I awesome?” ((My initial reaction to it was: “‘[I] have theoretically designed [whatever that means] their systems to allow all names to work in them’? Okay genius, how? Share your wisdom with us lesser mortals!”))
Part of the problem is that, at the root of it, he’s technically correct: most systems you’ll run into have a horribly westernised concept of given and family names (forenames/surnames, whatever you want to call them) that falls down the instant the system has to deal with non-WEIRD individuals. Working for a university, I constantly run into cases where students with names that do not conform to the traditional western scheme have been kludged into systems that enforce it ((Embarrassingly, one of these is a very old system I developed before I was Enlightened. Some day I hope to go back and fix that thing…)). That blog author takes things to extremes, and if all his statements are taken as accurate, there is pretty much no way to realistically create a system that handles all of them sensibly. But it is quite possible to cover the majority of issues, and unlike the blog post linked, I am going to actually give some thoughts about ways to handle these things in real situations.
For a start, use Unicode: at the very least, UTF-8 is widely supported either directly or via libraries in pretty much every language worth using (and several that aren’t, like PHP). If you’re still using ISO/IEC 8859-* and similar single-byte character encodings, your code is broken. Seriously. You’re not doing anyone any favours holding onto that shit. Your code may work fine for very specific situations, it may even work fine most of the time, but for anything but toy programs you will eventually run into cases where you simply cannot handle some characters, and fixing it will be an unholy mess of kludges and spiders. Avoid it from the beginning; use Unicode throughout.
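To make the single-byte problem concrete, here is a minimal Python sketch. The names are purely illustrative and `can_encode` is a hypothetical helper of my own, not something from a library. It just asks whether each name can be represented in ISO 8859-1 and in UTF-8.

```python
# Illustrative names only; can_encode is a hypothetical helper, not a library call.
names = ["Zoë Müller", "山田太郎", "Нгуен Ван"]


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be represented in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ISO 8859-1 (Latin-1) only covers a slice of western European characters.
latin1_ok = [can_encode(n, "iso-8859-1") for n in names]

# UTF-8 can represent every one of these names.
utf8_ok = [can_encode(n, "utf-8") for n in names]

# Encoding to UTF-8 and decoding again gives back exactly what went in.
round_trip_ok = all(n.encode("utf-8").decode("utf-8") == n for n in names)

print(latin1_ok)      # [True, False, False]
print(utf8_ok)        # [True, True, True]
print(round_trip_ok)  # True
```

Only the first name survives ISO 8859-1; the Japanese and Cyrillic ones cannot be stored at all without some kludge.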
Next, give up on the concept of given names and family names. Do not attempt to split the name data along any seemingly ‘sensible’ lines: there will inevitably be a naming scheme out there that will not work with your rules. Yes, this sort of thing will be entirely contrary to years of conventional western wisdom, and it means that things like sorting by last name don’t work – but if the individual has no last name (yes, it does happen), or comes from a culture that reverses or discards the forenames-followed-by-surname idiom ((While collaborating with some Japanese developers some years ago, I spent a lot of time being referred to as Mr Chris because of this…)), that sorting would be invalid, or at least inaccurate, either way.
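As a rough sketch of what that looks like in code (Python again, with a made-up `Person` record and `greeting` function that exist only for illustration), the record carries exactly one name field and nothing ever tries to pick it apart:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical record: one opaque name field, no given/family split."""
    person_id: int
    name: str  # the full name, exactly as the person supplied it


def greeting(person: Person) -> str:
    # Use the whole name as given; never guess which part is the "first" name.
    return f"Hello, {person.name}"


# Illustrative entries: a patronymic, a family-name-first name, and a single name.
people = [
    Person(1, "Björk Guðmundsdóttir"),
    Person(2, "山田 太郎"),
    Person(3, "Sukarno"),
]

greetings = [greeting(p) for p in people]
for line in greetings:
    print(line)
```

The obvious cost is that there is no surname column to sort or filter on. That is the trade-off described above, not a bug in the sketch.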
Do not simply assume that names will only ever consist of simple alphabetic characters, either. Hyphens and apostrophes are widely used even in western names, but some cultures can use a variety of other punctuation marks in names (and not just the ones commonly found on western keyboards!). Limiting the characters the user can input is artificial and undermines the whole point of trying to be better at supporting names: sanitise the input before you use it ((Which you’re doing with all your input, right?)), but don’t artificially limit it.
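One way to sanitise without whitelisting characters might look like the following Python sketch. `clean_name` is a hypothetical helper, and the specific choices (NFC normalisation, dropping non-whitespace control characters, collapsing whitespace) are my assumptions about what "sanitise" should mean here, not a standard recipe.

```python
import unicodedata


def clean_name(raw: str) -> str:
    """Tidy a name without restricting which letters or punctuation it may contain."""
    # Normalise to NFC so that visually identical names compare equal.
    text = unicodedata.normalize("NFC", raw)
    # Drop control characters (category Cc) unless they are whitespace.
    # Format characters (category Cf) are deliberately kept, because zero-width
    # joiners and non-joiners are meaningful in some scripts.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


print(clean_name("  O'Brien-Smith \n"))    # O'Brien-Smith
print(clean_name("Ana\x00  Mar\u00eda"))   # Ana María
print(clean_name("Zoe\u0308") == "Zo\u00eb")  # True
```

Note what is not in there: no regular expression deciding which characters count as "valid" for a name. Anything dangerous for a particular output (SQL, HTML, a shell) is better dealt with at that boundary, with parameterised queries or output escaping, rather than by refusing the apostrophe in O'Brien.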
These steps will handle the vast majority of names out there. Searching and sorting are trickier, but storing an individual’s name in one UTF-8 encoded string means that pretty much every language and cultural naming scheme will work. Allow for optional names, and you can handle even the weirdest edge cases.
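For the trickier searching and sorting part, here is one possible sketch in Python. `search_key`, `matches` and `sort_names` are hypothetical helpers. The approach (decompose, strip combining marks, case-fold) is a deliberately crude assumption of mine: it is lossy, it knows nothing about any language's collation rules, and it should only be used to build lookup keys, never to alter the stored name.

```python
import unicodedata
from typing import List, Optional


def search_key(name: Optional[str]) -> str:
    """Build a lossy, case-insensitive key for matching and ordering names."""
    if not name:
        return ""
    # Decompose so that accents become separate combining characters...
    decomposed = unicodedata.normalize("NFKD", name)
    # ...then drop those combining marks and fold case.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, name: Optional[str]) -> bool:
    """Substring search that ignores case and accents; empty queries match nothing."""
    key = search_key(query)
    return bool(key) and key in search_key(name)


def sort_names(names: List[Optional[str]]) -> List[Optional[str]]:
    """Order by the whole-name key, with missing names placed last."""
    return sorted(names, key=lambda n: (n is None, search_key(n)))


print(matches("muller", "Zoë Müller"))  # True
print(matches("JOSE", "José Ángel"))    # True
print(sort_names(["Émile", None, "alice", "Zoë"]))
# ['alice', 'Émile', 'Zoë', None]
```

Sorting on the whole string is not the same as sorting by surname, and a real system that cares about per-language ordering would want a proper collation library instead of this key. It does at least behave predictably for every name, including the missing ones.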