There is an old saying that the word “Assume” makes an “Ass” out of “You” and “Me”.
Yet we see (and make) assumptions every day when it comes to assessing the quality (or otherwise) of information. Anglo-Saxon biased peoples (US, English-speaking Europe etc.) often assume that names are structured Firstname Surname. “Daragh” = First Name, “O Brien” = Surname. The cultural bias here is well documented by people like Graham Rhind (who advises the use of “Given Name/Family Name” constructs on web forms etc. to improve cross-cultural usability).
But what if you see “George Michael” written down (without the context of labels for each name part) with a reference to “singer”? Would this relate to the pop singer George Michael, or the bass baritone singer Michael George?
One of the common ‘rules of thumb’ with telephone numbers is that, when you are trying to create the full ‘internationalised’ version of a telephone number (+[international access code] [local area code] [local number]), you take the number as written ‘locally’ and drop the leading zero. Of course, like most conventional wisdom, a little scrutiny causes this rule of thumb to fall apart.
For example, in the Czech Republic there is no ‘leading zero’ as it is actually part of the international access code (which actually makes more sense to me…). One might assume that Europe, with the standardisation ethos of the European Union, would all have plumped for “0” as a leading digit on local area codes. Not so, as Portugal doesn’t use any leading digit on its area codes. Some countries that used to be part of the USSR (like Russia, Belarus and Azerbaijan) use 8 instead of 0.
You might not be safe in assuming that you just need to consider the first digit of the local area code. Hungary has a 2-digit prefix (06), so you would need to parse in 2 characters in the string to remove the correct digits. Just stripping the leading zero will result in a totally embuggered piece of information.
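To make the point concrete, here is a minimal sketch of a per-country lookup instead of a blanket “strip the leading zero” rule. The table name `TRUNK_PREFIX` and the function `to_international` are hypothetical names of my own, the prefixes are simply the ones mentioned in this post (not checked against current numbering plans), and the country calling codes are added for illustration.

```python
# Illustrative only: prefixes as described in this post, not an
# authoritative or complete numbering-plan reference.
# country -> (country calling code, local prefix to drop)
TRUNK_PREFIX = {
    "IE": ("353", "0"),   # Ireland: drop the leading 0
    "HU": ("36", "06"),   # Hungary: two-digit prefix
    "PT": ("351", ""),    # Portugal: nothing to drop
    "CZ": ("420", ""),    # Czech Republic: nothing to drop
    "RU": ("7", "8"),     # Russia: 8 instead of 0
    "BY": ("375", "8"),   # Belarus: 8 instead of 0
}


def to_international(local_number: str, country: str) -> str:
    """Convert a locally written number using that country's own rule."""
    calling_code, prefix = TRUNK_PREFIX[country]
    # Keep digits only; spaces, dashes and brackets are dropped.
    digits = "".join(ch for ch in local_number if ch.isdigit())
    # Remove the country-specific prefix, and only that prefix.
    if prefix and digits.startswith(prefix):
        digits = digits[len(prefix):]
    return f"+{calling_code} {digits}"


print(to_international("06 1 234 5678", "HU"))
print(to_international("212 345 678", "PT"))
```

Note that this assumes the input really is in the local format. A number that was already written without its prefix, but happens to start with the same digit, would be mangled, which is exactly the kind of assumption this post is warning about.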
Also, everyone assumes that a telephone number will consist only of numbers. However, there are a few instances where the code required to dial out from a country (the International Direct Dial code) is not purely numeric, in that it contains either the * (star) or # (hash key/pound key). Our buddies in Belarus are an example of this, where to dial out from Belarus you need to dial “8**10” (which, even more confusingly, is often written “8~10”).
So what does this mean for people who are assessing or seeking to improve the quality of telephone number data in their systems?
Well, first off it means you need to have some context to understand the correct business rules to apply. For example, the rules I would apply to assessing the quality (and likely defects) in a telephone number from Ireland would be different to what I’d need to apply to telephone numbers relating to Belarus. In an Irish telephone number it would be correct to strip out instances of “**” and then validate the rest of the string based on its length (if stripping the ** made it too short to be a telephone number then we would need to tag it as duff data and remove it). With data relating to Belarus it might simply be that the person filling in the form (the source of the data) got confused about what codes to use.
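As a rough sketch of what “context first” might look like in code, the snippet below treats “**” differently depending on the country the number relates to. The function name `assess`, the status labels and the `MIN_DIGITS` threshold are all assumptions of mine for illustration; a real rule set would need properly researched lengths for each country.

```python
MIN_DIGITS = 7  # assumed minimum length, for illustration only


def assess(raw: str, country: str) -> tuple[str, str]:
    """Return (status, value) for a raw telephone number string."""
    if "**" not in raw:
        # This sketch only deals with the "**" case discussed above.
        return ("ok", raw)
    if country == "BY":
        # In Belarus "**" may be part of a dialling code the person
        # entered by mistake, so send it for review instead of stripping.
        return ("review", raw)
    # Elsewhere (e.g. Ireland) "**" is treated as noise and stripped.
    cleaned = raw.replace("**", "")
    digit_count = sum(ch.isdigit() for ch in cleaned)
    if digit_count < MIN_DIGITS:
        # Too short to be a telephone number once cleaned: duff data.
        return ("reject", cleaned)
    return ("ok", cleaned)


print(assess("01**2345678", "IE"))
print(assess("8**10 353 1 2345678", "BY"))
```

The same input string gets a different outcome depending on the country, which is the whole point: the rule lives with the context, not with the raw string.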
Secondly, it means you need to put some thought into the design of information capture processes to reduce the chances of errors occurring. Defining a structure with separate fields, linking the international access code to a country drop down (and a library of business rules for how to interpret and ‘standardize’ subsequent inputs) would not be too difficult. It would just require investment of effort in researching the rules and maintaining them once deployed. Here’s a link to a useful resource I’ve found (note that I can’t vouch for the frequency of updates to this site, but I’ve found it a fun way to figure out what the rules might be for various countries). Also, Wikipedia has a good piece on Telephone number plans. Graham Rhind also has some good links to references for telephone number format rules.
Looking at the data of a telephone number in isolation will most likely result in you screwing up some of the data (if you have international telephone numbers). Having the country information for that data (is the number in France or Belarus?) allows you to construct appropriate rules and make your assumptions in the appropriate context to reduce your risks of error.
Ultimately, blundering in with a crude rule of thumb and simply stripping any leading zeros you find because that is the assumption you’ve made will result in you making an ass out of you and your data.
Which raises an interesting question…
Imagine you have been given a spreadsheet of telephone numbers that you have been told are international numbers in the ‘local’ formats for the respective countries. You open the spreadsheet and there are no leading zeros (because Excel, like most other spreadsheets, assumes that numbers don’t begin with zero and strips it out). What do you do to get the data back to a format that you can actually use?
Answers on a post card (or in the comments) please.