Mrs DoBlog and I are anxiously awaiting the arrival of a mini-DoBlog any day now. So we have spent some time flicking through baby name books seeking inspiration for a name other than DoBlog 2.0.
In doing so I have been yet again reminded of the challenges faced by information quality professionals when trying to unpick a concatenated string of text in a field that is labelled “Name”. The challenges are manifold:
- Name formats differ from culture to culture – and it is not a Western/Asian divide as some people might assume at first.
- Master Data for name spellings is notoriously difficult to obtain. My wife and I compared spellings of some common names in two books of baby names and the variations were staggering, with a number of spellings we are very familiar with (including my own name) not listed in either.
- Often Family Names (surnames) can be used as Given Names (first names) such as Darcy (D’Arcy) or Jackson (Jackson) or Casey.
- Often people pick names for their children based on where they were born or where they were conceived (Brooklyn Beckham, the son of footballer David Beckham is a good example).
- Non-name words can appear in names, such as “Meat Loaf” or “Bear Grylls”.
- Douglas Adams famously named a character in the Hitchhiker’s Guide to the Galaxy after one of the “dominant life forms” – a car called a “Ford Prefect”.
- Names don’t always fit into an assumed varchar(30) or even varchar(100) field.
- It is possible to have a one character Given name and a one character Family name.
- Two character Family names are more common than we think.
- Unicode characters, hyphens, spaces, apostrophes are all VALID in names – particularly if they are diacritical marks which change the meaning of words in particular languages.
- And then you have people who change their names to silly things to be “different” or “special”, but who create interesting statistical challenges for data profilers and parsing tools.
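To make a few of those points concrete, here is a minimal sketch of a deliberately permissive name check. It assumes Python and its standard `unicodedata` module; the function name `is_plausible_name` and the set of allowed punctuation are my own illustrative choices, not a standard. It imposes no length limit, accepts any Unicode letter or combining mark, and treats spaces, hyphens and apostrophes as valid.

```python
import unicodedata

# Punctuation treated as legitimate inside a name.
# This set is an assumption for illustration, not an agreed standard.
ALLOWED_PUNCTUATION = {" ", "-", "'", "\u2019", "."}


def is_plausible_name(value: str) -> bool:
    """Return True if value holds only letters, combining marks and allowed punctuation.

    There is deliberately no minimum or maximum length beyond "at least one letter".
    """
    text = unicodedata.normalize("NFC", value.strip())
    has_letter = False
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):      # any letter, in any script
            has_letter = True
        elif category.startswith("M"):    # combining diacritical marks
            continue
        elif ch in ALLOWED_PUNCTUATION:
            continue
        else:
            return False                  # digits, symbols, control characters
    return has_letter
```

A check like this will happily accept “Meat Loaf”, “D’Arcy” and a one-character family name, and will reject “DoBlog 2.0” because of the digits. It says nothing about whether the name is spelled the way its owner spells it, which is the harder problem.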
Among the examples I found flicking through one of our baby name books last evening were “Alpha” and “Beta”. Personally I think it sends the wrong signals to name your children after letters of the Greek alphabet, but I’m sure it is helpful if you have had twins to keep them in order.
I also found “Bairn” given as a Scots Gaelic name for a baby girl. I had to laugh at this as “Bairn” is actually a Scots dialect word for Child. Even Wikipedia recognises this and has a redirect from “Bairn” to “child”. But it does remind me of the terribly sexist “joke” where the father asks the doctor after the birth whether it is a boy or a child his wife has just delivered.
The trouble with names, from an information quality point of view, is that they are inherently personal things which people have a strong attachment to. So getting spellings wrong can have negative effects on your business and your relationship with your customers (like my on-going gripe with Vodafone). But often companies need to accept the “fuzziness” of identity in order to match records and meet the needs of Anti-money laundering or similar regulations or simply to create a single view of their customers. But the EU Data Protection regulations require organisations to hold data accurately – with accuracy being defined from the point of view of the data subject.
So, when your head has stopped spinning from managing all the Alphas, Betas, Brooklyns, and Ford Prefects, as an Information Quality practitioner you are faced with juggling the needs of Customer Intimacy, the demands of Data Protection, and a range of other legal requirements when you are deciding how to clean your name data up.
Jim Harris’ excellent series of posts on Data Profiling gives a great run through of how data profiling tools can help you figure out what is in those strings of text in that field labelled “Name”. However, you should exercise caution in your assumptions about what a name might be and might look like.
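As a rough illustration of the kind of pattern analysis a profiling tool performs, here is a small sketch in Python. It is my own simplification, not taken from Jim’s posts or from any particular product, and the function names `shape`, `collapsed_shape` and `profile` are hypothetical. Each value in the “Name” field is reduced to a shape (`A` for an upper-case letter, `a` for any other letter, `9` for a digit, everything else kept as-is) and the shapes are then counted.

```python
from collections import Counter
from itertools import groupby
import unicodedata


def shape(value: str) -> str:
    """Map each character to a class symbol: A (upper), a (other letter), 9 (digit)."""
    out = []
    for ch in value:
        category = unicodedata.category(ch)
        if category == "Lu":
            out.append("A")
        elif category.startswith("L"):
            out.append("a")
        elif category.startswith("N"):
            out.append("9")
        elif category.startswith("M"):
            continue  # a combining mark belongs to the letter before it
        else:
            out.append(ch)  # keep spaces, hyphens, apostrophes visible
    return "".join(out)


def collapsed_shape(value: str) -> str:
    """Collapse runs of the same symbol so that name length does not split patterns."""
    return "".join(symbol for symbol, _ in groupby(shape(value)))


def profile(values) -> Counter:
    """Count how often each collapsed shape occurs in a column of values."""
    return Counter(collapsed_shape(v) for v in values)
```

Run over “Meat Loaf”, “Bear Grylls” and “Ford Prefect”, all three collapse to the shape `Aa Aa`, which looks exactly like a perfectly ordinary given name and family name. “D'Arcy” comes out as `A'Aa`. The rare shapes are worth a look, but as the examples show, a common shape is no guarantee of a conventional name, and a rare one is no proof of an error.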
For example, allegedly the longest Name in the world is
Now, that’s 802 characters long (including the Mr). It also doesn’t fit very easily into a <given name><middle_initial><Family_name> format which most of us would probably start with as our template for parsing a name string. Note also, that he was a “senior”, so there is another one of this name out there somewhere. Perhaps he just goes by the name “Mr Adolph Ingram”. I’d also hate to see what a matching process would make of that name (how many match keys would need to be created?)
There are some interesting comments about this name on the Stackoverflow.com website. Some of them are helpful pointers to the different structures of names that exist out there. Others show the risks that are run in designing and developing systems based on a particular cultural bias or perception of what a name is (a lot of commenters refer to US government forms and how people with long names usually have a form that they use for “official purposes”. This is not necessarily a “safe” assumption… not every government form in each country is the same. Indeed, many Irish Government forms don’t have enough space for my address or my wife’s first name and compound family name).
In a case that will be put up on IQTrainwrecks.com, the impact of cultural assumptions about what a “valid” name is can be seen in this story of a woman who had trouble boarding a plane because of her name.
Fans of Star Trek: Deep Space 9 will recall how the actor who played Doctor Bashir changed his credit name from Siddig el Fadil to “Alexander Siddig” because (it is claimed) fans couldn’t pronounce Siddig el Fadil properly. The full version of his name would be an interesting challenge for a data profiling tool in the hands of an Information Quality professional and certainly challenges the <Given_name><middle_name><family_name> format used in Anglo-Saxon cultures.
Wikipedia has an interesting page of references for unusual and long names which I would recommend at the very least as a tool to blow away any assumptions you have of what’s in a name.
No set of name Master Data in a reference dictionary will ever be complete or fully accurate. When balancing the needs for accuracy and correctness of data versus the needs to match and consolidate data (either for internal business purposes like CRM or for legally mandated purposes such as AML or PEP processes), you need to give some thought to how you will weight and manage your priorities within the data. Furthermore, assumptions you might make about the “correct” structure of a name could actually create information quality problems for you.
For now, Mrs DoBlog and I will continue to see if we can find a name that fits the impending arrival. But it has been made a lot harder because of my insights into the fun a name can cause for an information quality team.
I’m angling for something very traditional and Irish…. just to really confuse people and break Soundex keys.
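For anyone wondering why a traditional Irish name would break Soundex keys, here is a minimal sketch of the common American Soundex rules in Python. It is an illustrative implementation, not the code of any particular matching tool, and stripping diacritics before encoding is my own assumption about how a tool might handle a fada.

```python
import unicodedata

# Standard Soundex consonant groups; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex key for a name."""
    # Assumption: decompose accented letters and keep only A-Z.
    decomposed = unicodedata.normalize("NFKD", name).upper()
    letters = [ch for ch in decomposed if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so a repeated code is kept
    return (key + "000")[:4]
```

Soundex was built around English spelling, so “Smith” and “Smyth” both come out as `S530`. Niamh, which is pronounced roughly “Neev”, encodes as `N500`, while the phonetic spelling “Neve” encodes as `N100`. Caoimhe gives `C500` against `K100` for “Keeva”. Same name to the ear, no match on the key.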