February 27, 2008

Final post and update on IBTS issues

OK. This is (hopefully) my final post on the IBTS issues. I may post their response to my queries about why I received a letter and why my data was in New York. I may not. So here we go..

First off, courtesy of a source who enquired about the investigation, the Data Protection Commissioner has finished their investigation and, in the eyes of the DPC, the IBTS seems to have done everything as correctly as they could with regard to managing risk and tending to the security of the data. The issue of why the data was not anonymised seems to be dealt with on the grounds that the fields with personal data could not be isolated in the log files. The DPC finding was that the data provided was not excessive in the circumstances.

[Update: Here's a link to the Data Protection Commissioner's report. ]

This suggests to me that the log files effectively amounted to long strings of text which would have needed to be parsed to extract given name/family name/telephone number/address details, or else the fields in the log tables are named strangely and unintuitively (not as uncommon as you might think) and the IBTS does not have a mapping of the fields to the data that they contain.

In either case, parsing software is not that expensive (in the grand scheme of things) and a wide array of data quality tools provide very powerful parsing capabilities at moderate costs. I think of Informatica’s Data Quality Workbench (a product originally developed in Ireland), Trillium Software’s offerings or the nice tools from Datanomic.

Many of these tools (or others from similar vendors) can also help identify the type of data in fields so that organisations can identify what information they have where in their systems. “Ah, field x_system_operator_label actually has names in it!… now what?”.

If the log files effectively contained totally unintelligible data, one would need to ask what the value of it for testing would be, unless the project involved the parsing of this data in some way to make it ‘useable’? As such, one must assume that there was some inherent structure/pattern to the data that information quality tools would be able to interpret.

Given that, according to the DPC, the NYBC were selected after a public tender process to provide a data extraction tool, this would suggest that there was some structure to the data that could be interpreted. It also (for me) raises the question as to whether any data had been extracted in a structured format from the log files.

Also, the “the data is secure because we couldn’t figure out where it was in the file so no-one else will” defence is not the strongest plank to stand on. Using any of the tools described above (or similar ones that exist in the open source space, or can be assembled from tools such as Python or Tcl/Tk or put together in Java) it would be possible to parse out key data from a string of text without a lot of ‘technical’ expertise (OK, if you are ‘home rolling’ a solution using Tcl or Python you’d need to be up to speed on techie things, but not that much). Some context data might be needed (such as a list of possible first names and a list of last names), but that type of data is relatively easy to put together. Of course, it would need to be considered worth the effort, and the laptop itself was probably worth more than Irish data would be to a NYC criminal.
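To show how little ‘home rolling’ is actually involved, here is a minimal Python sketch of the idea. Everything in it is invented for illustration: the log line layout, the tiny name lists and the phone number pattern are my assumptions, and nothing here reflects the real structure of the IBTS files.

```python
import re

# Hypothetical reference lists; a real one would hold thousands of entries.
FIRST_NAMES = {"mary", "john", "sean", "aoife"}
LAST_NAMES = {"murphy", "kelly", "byrne", "walsh"}

# Loose, assumed pattern for numbers written like 01-5551234 or 087 5551234.
PHONE_RE = re.compile(r"\b0\d{1,2}[- ]?\d{7}\b")


def extract_personal_data(line):
    """Return (first name, last name) pairs and phone numbers found in a line."""
    tokens = re.findall(r"[A-Za-z']+", line)
    # A known first name immediately followed by a known last name counts as a hit.
    names = [
        (first, last)
        for first, last in zip(tokens, tokens[1:])
        if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES
    ]
    phones = PHONE_RE.findall(line)
    return names, phones


# Invented example of an opaque-looking log string.
sample = "UPD|0042|x9f3|Mary Murphy|12 Main St|087 5551234|grpO+"
print(extract_personal_data(sample))
```

The point is not that this is robust (it isn't), only that a name list and a couple of regular expressions go a long way against data that is merely hard to read.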

The response from the DPC that I’ve seen doesn’t address the question of whether NYBC failed to act in a manner consistent with their duty of care by letting the data out of a controlled environment (it looks like there was a near blind reliance on the security of the encryption). However, that is more a fault of the NYBC than the IBTS… I suspect more attention will be paid to physical control of data issues in future. While the EU model contract arrangements regarding encryption are all well and good, sometimes it serves to exceed the minimum standards set.

The other part of this post relates to the letter template that Fitz kindly offered to put together for visitors here. Fitz lives over at http://tugofwar.spaces.live.com if anyone is interested. I’ve gussied up the text he posted elsewhere on this site into a Word doc for download ==> Template Letter.

Fitz invites people to take this letter as a starting point and edit it as they see fit. My suggestion is to edit it to reflect an accurate statement of your situation. For example… if you haven’t received a letter from the IBTS then just jump to the end and request a copy of your personal data from the IBTS (it will cost you a few quid to get it), if you haven’t phoned their help-line don’t mention it in the letter etc…. keep it real to you rather than looking like a totally formulaic letter.

On a lighter note, a friend of mine has received multiple letters from the Road Safety Authority telling him he’s missed his driving test and will now forfeit his fee. Thing is, he passed his test three years ago. Which begs the question (apart from the question of why they are sending him letters now)… why the RSA still has his application details given that data should only be retained for as long as it is required for the stated purpose for which it was collected? And why has the RSA failed to maintain the information accurately (it is wrong in at least one significant way)?

IBTS… returning to the scene of the crime

Some days I wake up feeling like Lt. Columbo. I bound out of bed assured in myself that, throughout the day, I’ll be niggled by, or rather niggle others with, ‘just one more question’.

Today was not one of those days. But you’d be surprised what can happen while going about the morning ablutions. “Over 171000 (174618 in total) records sent to New York. Sheesh. That’s a lot. Particularly for a sub-set of the database reflecting records that were updated between 2nd July 2007 and 11th October 2007. That’s a lot of people giving blood or having blood tests, particularly during a short period. The statistics for blood donation in Ireland must be phenomenal. I’m surprised we can drag our anaemic carcasses from the leaba and do anything; thank god for steak sandwiches, breakfast rolls and pints of Guinness!”, I hummed to myself as I scrubbed the dentation and hacked the night’s stubble off the otherwise babysoft and unblemished chin (apologies - read Twenty Major’s book from cover to cover yesterday and the rich prose rubbed off on me).

“I wonder where I’d get some stats for blood donation in Ireland. If only there was some form of Service or agency that managed these things. Oh.. hang on…, what’s that Internet? Silly me.”

So I took a look at the IBTS annual report for 2006 to see if there was any evidence of back slapping and awards for our doubtlessly Olympian donation efforts.

According to the IBTS, “Only 4% of our population are regular donors” (source: Chairperson’s statement on page 3 of the report). Assuming the population in 2006 (pre census data publication) was around 4.5 million (including children), this would suggest a maximum regular donor pool of 180,000. If we take the CSO data breaking out population by age, and make a crude guess on the % of 15-24 year olds that are over 18 (we’ll assume 60%) then the pool shrinks further… to around 3.1 million, giving a regular donor pool of approximately 124,000.
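For anyone who wants to check my sums, the back-of-envelope estimate looks like this in Python. The population figures are my own rough assumptions as stated above, not official numbers; the records total is the 174618 figure mentioned earlier.

```python
population_total = 4_500_000   # rough 2006 population, as assumed above
population_adult = 3_100_000   # crude estimate of the over-18 population
regular_donor_rate = 0.04      # "Only 4% of our population are regular donors"

max_pool = round(population_total * regular_donor_rate)
adult_pool = round(population_adult * regular_donor_rate)
records_sent = 174_618         # total records reported as sent to New York

print(max_pool, adult_pool)        # 180000 124000
print(records_sent > adult_pool)   # True
```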

Hmm… that’s less than the number of records sent as test data to New York based on a sub-set of the database. But my estimations could be wrong.

The IBTS Annual Report for 2006 tells us (on page 13) that

The average age of the donors who gave blood in 2006 was 38 years and 43,678 or 46% of our donors were between the ages of 18 and 35 years.

OK. So let’s stop piddling around with assumptions based on the 4% of population hypothesis. Here’s a simpler sum to work out… If X = 46% of Y, calculate Y.

(43678/46) × 100 = 94952 people giving blood in total in 2006. Oh. That’s even less than the other number. And that’s for a full year. Not a sample date range. That is less than 56% of the figure quoted by the IBTS. Of course, this may be the number of unique people donating rather than a count of individual instances of donation… if people donated more than once the figure could be higher.
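The same sum as a quick Python sketch, for the sceptics. The one thing I’ve added is a rough range to allow for the 46% having been rounded in the report; the half-point either side is my assumption, not something the IBTS states.

```python
donors_18_to_35 = 43_678   # from page 13 of the 2006 annual report
share_18_to_35 = 0.46      # "46% of our donors"

total_donors_2006 = round(donors_18_to_35 / share_18_to_35)   # 94952

# Assumed rounding window: the true share could be anywhere from 45.5% to 46.5%.
low = donors_18_to_35 / 0.465
high = donors_18_to_35 / 0.455

records_sent = 174_618     # total records reported as sent to New York
ratio = total_donors_2006 / records_sent

print(total_donors_2006)
print(round(low), round(high))
print(f"{ratio:.1%} of the records figure")
```

Even at the top of that range, a full year of donors comes nowhere near the number of records in the extract.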

The explanation may also lie with the fact that transaction data was included in the extract given to the NYBC (and record of a donation could be a transaction). As a result there may be more than one row of data for each person who had their data sent to New York (unless in 2007 there was a magical doubling of the numbers of people giving blood).

According to the IBTS press release:

The transaction files are generated when any modification is made to any record in Progesa and the relevant period was 2nd July 2007 to 11th October 2007 when 171,324 donor records and 3,294 patient blood group records were updated.

(the emphasis is mine).

The key element of that sentence is “any modification is made to any record”. Any change. At all. So, the question I would pose now is what modifications are made to records in Progesa? Are, for example, records of SMS messages sent to the donor pool kept associated with donor records? Are, for example, records of mailings sent to donors kept associated? Is an audit trail of changes to personal data kept? If so, why and for how long? (Data can only be kept for as long as it is needed). Who has access rights to modify records in the Progesa system? Does any access of personal data create a log record? I know that the act of donating blood is not the primary trigger here… apart from anything else, the numbers just don’t add up.

It would also suggest that the data was sent in a ‘flat file’ structure with personal data repeated in the file for each row of transaction data.

How many distinct person records were sent to NYBC in New York? Was it

  • A defined subset of the donors on the Progesa system who have been ‘double counted’ in the headlines due to transaction records being included in the file? ….or
  • All donors?
  • Something in between?

If the IBTS can’t answer that, perhaps they might be able to provide information on the average number of transactions logged per unique identified person in their database during the period July to October 2007?
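Answering that should not be hard. Here is a rough Python sketch, assuming (and it is only my assumption) that the extract was a flat file with some donor identifier repeated on every transaction row. The column names and the rows themselves are invented.

```python
import csv
import io
from collections import Counter

# Invented flat-file extract: one row per modification, personal data repeated.
extract = io.StringIO(
    "donor_id,name,change_type\n"
    "D001,Mary Murphy,donation\n"
    "D001,Mary Murphy,sms_sent\n"
    "D002,John Kelly,address_update\n"
    "D001,Mary Murphy,mailing_sent\n"
)

# Count how many transaction rows each distinct person has.
rows_per_person = Counter(row["donor_id"] for row in csv.DictReader(extract))

total_rows = sum(rows_per_person.values())
distinct_people = len(rows_per_person)
average = total_rows / distinct_people

print(total_rows, distinct_people, average)   # 4 2 2.0
```

Four ‘records’ in the headline, two actual people. Scale that up and you can see why the row count and the head count need not be the same number.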

Of course, this brings the question arc back to the simplest question of all… while production transaction records might have been required, why were ‘live’ personal details required for this software development project and why was anonymised or ‘defused’ personal data not used?
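To make ‘defused’ a bit more concrete, here is one possible approach sketched in Python: swap each personal field for a keyed hash before the file leaves the building, and leave the transaction fields alone. The field names, the key and the record are all hypothetical, and strictly speaking this is pseudonymisation rather than full anonymisation, since whoever holds the key can regenerate the tokens. A real project would also need to decide whether identifiers like the donor number need the same treatment.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data owner, never shipped with the data.
SECRET_KEY = b"replace-with-a-random-secret"

# Assumed names for the fields that carry personal data.
PERSONAL_FIELDS = ("name", "phone", "address")


def defuse(record):
    """Return a copy of the record with personal fields replaced by opaque tokens."""
    safe = dict(record)
    for field in PERSONAL_FIELDS:
        if field in safe:
            digest = hmac.new(SECRET_KEY, safe[field].encode("utf-8"), hashlib.sha256)
            safe[field] = digest.hexdigest()[:12]
    return safe


record = {
    "donor_id": "D001",
    "name": "Mary Murphy",
    "phone": "087 5551234",
    "address": "12 Main St",
    "change_type": "donation",
}
print(defuse(record))
```

Because the same input always produces the same token, repeated rows for one person still line up with each other, which is usually all a developer testing an extraction tool needs.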

To conclude…
Poor quality information may have leaked out of the IBTS as regards the total numbers of people affected by this data breach. The volume of records they claim to have sent cannot (at least by me) be reconciled with the statistics for blood donations. They are not even close.

The happy path news here is that the total number of people could be a lot less. If we assume ‘double dipping’ as a result of more than one modification of a donor record, then the worst case scenario is that almost their entire ‘active’ donor list has been lost. The best case scenario is that a subset of that list has gone walkies. It really does boil down to how many rows of transaction information were included alongside each personal record.

However, it is clear that, despite how it may have been spun in the media, the persons affected by this are NOT necessarily confined to the pool of people who may have donated blood or had blood tests performed between July 2007 and October 2007. Any modification to data about you in the Progesa system would have created a transaction record. We have no information on what these modifications might entail or how many modifications might have occurred, on average, per person during that period.

In that context the maximum pool of people potentially affected becomes anyone who has given blood or had blood tests and might have a record on the Progesa system.

That is the crappy path scenario.

Reality is probably somewhere in between.

But, in the final analysis, it should be clear that real personal data should never have been used and providing such data to NYBC was most likely in breach of the IBTS’s own data protection policy.