Monday, 15 April 2013

Record retention and proprietary data formats

My recent experience with an application upgrade left me considering the true implications of using proprietary data formats. And I have realized that they are an often overlooked topic, but with profound and significant implications that are often not addressed.

Say you live in a country where the law requires you to keep electronic records for 14 years. Do you think it is an exaggeration? Sarbanes-Oxley says auditors must keep all audit or review work papers from 5 to 7 years.You are carefully archiving and backing up all that data. You are even copying the data to fresh tapes from time to time, to avoid changes in tape technology leaving you unable to read that perfectly preserved tape -or making it very hard, or having to depend on an external service to restore it.

But I've not seen a lot of people make themselves the question, once you restore the data, which program you'll use to read it? Which operating system will that program run on? Which machine will run that operating system?

First, what is a proprietary data format? Simple, anything that is not properly documented in a way that would allow anyone with general programming skills to write a program to extract data from a file.

Note that I'm leaving patents out of the discussion here. Patents create additional difficulties when you want to deal with a data format, but do not completely lock you out of it. It merely makes things more expensive, but you'll definitely be able to read your data, even if you have to deal with patent issues, which are another different discussion altogether.

Patented or not, an undocumented data format is a form of customer lock in. The most powerful there is, in fact. It means that you depend on the supplier of the programs that read and write that data forever. But the lock in does not stop here. It also means that you are linking your choices of platform, hardware, software, operating system, middleware, or anything else your supplier has decided that is a dependency to read your data.

In the last few years, virtualization has helped somewhat with the hardware part. But still does not remove it completely, in that there could be custom hardware or dongles attached to the machine. Yes, it can get even worse. Copy protection schemes are an additional complication, in that they make it even more difficult for you to get at your data on the long term.

So in the end, the "data retention" and "data archiving" activities are really trying to hit a moving target, one that is very, very difficult to actually hit. Most of the plans that I've seen only focus on some specific problems, but all of them fail to deliver an end to end solution that really address the ability to read the legacy data on the long term.

I suppose that at this point, most of the people reading this is going back to check their data retention and archiving plans and looking for gaping holes in the plans. You found them? Ok, keep reading then.

A true data archiving solution has to address all the problems of the hardware and software necessary to retrieve the data over the retention period. If any of the steps is missing, the whole plan is not worth spending in. Unless of course you want your plan to be used as mean for auditors to thick the corresponding box in their checklist. It is ok for the plan to say "this only covers xxx years of retention, we need to review it in the next yyy years to make sure daat is still retrievable", it is at least much better and more realistic than saying "this plan will ensure that the data can be retrieved in the following zzz years" without even considering that way before zzz years have passed the hardware and software used will become unsupported, or the software supplier could disappear without no one able to read the proprietary data format.

There is an easy way of visualizing this. Instead of talking about the business side of record retention, think about your personal data. All your photos and videos of your relatives and loved ones, taken over the years. All the memories that they contain, they are irreplaceable and also they are something you're likely to want to access in the long term future.

Sure, photos are ok. They are in paper, or perhaps in JPG files, which are by the way very well documented. But what about video? Go and check your video camera. It is probably using some standard format, but some of them use weird combination of audio and video codecs, with the camera manufacturer providing a disk with the codecs. What will happen when the camera manufacturer goes out of business or stops supporting that specific camera model? How you will be able to read the video files and convert to something else? That should make you think about data retention from the right point of view. And dismiss anything that is in an undocumented file format.