Database Performance Tuning: Kill the flat text file

Perhaps I'm just unlucky and always end up working with old systems, but I've become increasingly irritated by a practice that should have died years ago: the flat text file interface. In the stone age, this was the standard way of exchanging data between systems, because the plethora of options that integration technologies have created over the last 20 years or so was not yet available.

Probably because of its ubiquity, this practice was noticed by the office tool creators (yes, Lotus 1-2-3, I'm looking at you) and they built nice functions into their applications to deal with flat text files. Spreadsheets became common office tools, and people grew up using text files as their only way of getting data out of systems. Perhaps because of that, a whole generation of office workers assumed that this was "the way" of doing things. When those office workers face a technology problem, they of course tend to prefer the tools and techniques they know best. After all, if you can open a flat text file in Excel and understand its contents, what's wrong with using that to talk to other applications?

The answer is: a lot of things. Some of them, but not all, are related to performance. But that's not stopping the flat text files from flying around. Even this year, new functionality that relies on flat text file interfaces to move information between applications is still being built, even when much better ways of doing it exist.

Let's review the most common sins committed by the practitioners of the flat text file interface approach.

Internationalization

30 years ago the world was 7-bit ASCII. Then information systems became global, standardized and multilingual. English is probably no longer the dominant language of computer users across the world. Yet the flat text file has no means of specifying its encoding, nor are most legacy tools able to handle international character sets correctly: remember, they were designed and created when the world was 7-bit ASCII. Before Unicode became widely accepted, there were a lot of workarounds that tried to deal with this problem in a way that did not break legacy applications. Most of them revolved around, yes, you guessed it, 8-bit extensions of ASCII (the so-called code pages). But 8 bits are not enough to cover even the different character sets of the western world, much less deal with East Asian character sets.

The result is that data quality deteriorates as information moves across systems in those flat text files. Each time a file is exported and/or imported, an assumption is made about the source and target character sets that most of the time is untested, if not plain wrong.

And no, flat text files do not have a way to encode character set information in them, unless of course you make them non-flat and irregular. Therefore, each time you try to compare what should be the same information in two different systems, you get all kinds of mismatches in textual data that are simply due to wrong character set conversions. In some cases, there are so many of them, and the signal-to-noise ratio in the data becomes so low, that an actual data quality audit is impossible.
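
To make the problem concrete, here is a minimal Python sketch of a single export/import round with mismatched assumptions. The sample name and the choice of UTF-8 on the writing side and Latin-1 on the reading side are mine, picked purely for illustration; other mismatched pairs produce similar damage.

```python
def export_record(text: str) -> bytes:
    """Hypothetical exporter: writes the flat file as UTF-8."""
    return text.encode("utf-8")


def import_record(raw: bytes) -> str:
    """Hypothetical importer: assumes the file is Latin-1 (ISO 8859-1)."""
    return raw.decode("latin-1")


original = "José Muñoz"
imported = import_record(export_record(original))

print(imported)              # each accented letter comes back as two characters
print(imported == original)  # False: the "same" customer no longer matches
```

Nothing in the file itself tells the importer which decoding to apply, so the mismatch goes unnoticed until someone compares both systems.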

Cost

A flat text file has to be written by a program somewhere, according to some agreed format between parties. It then has to be transferred to the other system and read back by the other machine, interpreting the agreed format. Simple, isn't it?

No, it is not simple. Think about it. Now think again, twice. Each time you exchange information between systems you're essentially performing a data transformation to adapt from one data model (or database schema, if you prefer) to another. Now, by writing a flat text file, you're introducing yet another transformation: the one from the source or target data model to the flat text file format. So you're paying the cost not of two data transformations, but of at least three. Usually, text files have a format that closely matches one of the two data models, with the other, unlucky side having to perform the bulk of the transformation job. In extreme cases, I've witnessed otherwise intelligent people engage in even more ridiculous exercises, trying to define so-called "neutral data formats", for which both parties have to perform extensive data transformations to be able to use the data in their applications.

Still thinking that it's cheaper to use text files? Ah, that's because you have lots of people who can write programs that read and write text files, isn't it? Well, perhaps it's time to look for more qualified resources, as the incremental cost of learning other technologies for application information exchange is slim compared with the additional costs you're incurring by performing all those extra transformations. Don't want to invest in that knowledge? Think again: you're basically letting yourself or your team stagnate in old technologies that are simply being replaced by application connectivity and middleware.

Ah, I hear you cry, but flat text files can easily be audited, can't they? I have to say, yes, they can. Have you ever heard of application log files? What's stopping you from writing something to an application log or journal instead of keeping your flat text files around? Isn't that easier, safer and better covered by regular application security than keeping a folder somewhere with a bunch of flat text files? Answer left as an exercise to the reader.
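
As a rough sketch of that alternative, the following Python snippet uses the standard sqlite3 module and a trigger to record every change in a journal table inside the same database. The table, column and trigger names are hypothetical, and SQLite just stands in for whatever database your application really uses.

```python
import sqlite3

# Hypothetical schema: one business table plus a journal table for auditing.
SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE transfer_journal (
    seq         INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER NOT NULL,
    new_name    TEXT NOT NULL,
    logged_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Every name change is journaled in the same transaction as the change.
CREATE TRIGGER customer_journal AFTER UPDATE OF name ON customer
BEGIN
    INSERT INTO transfer_journal (customer_id, new_name)
    VALUES (NEW.id, NEW.name);
END;
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def journal_entries(conn: sqlite3.Connection) -> list:
    """Return the audit trail in the order the changes happened."""
    cur = conn.execute(
        "SELECT customer_id, new_name FROM transfer_journal ORDER BY seq"
    )
    return cur.fetchall()


conn = open_db()
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Acme')")
conn.execute("UPDATE customer SET name = 'Acme Ltd' WHERE id = 1")
conn.commit()
print(journal_entries(conn))
```

The journal lives under the same access control and backup regime as the data itself, and a rolled-back change leaves no entry behind.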

On top of that, each time you need to transfer additional attributes, you have to modify two programs: the one that writes and the one that reads. How does that affect the cost?
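
A tiny sketch of why that is, assuming a semicolon-separated record and a reader that unpacks fields by position. The field names and the sample records are made up for the example.

```python
def read_record(line: str) -> dict:
    """Hypothetical reader written against the original three-field format."""
    customer_id, name, amount = line.rstrip("\n").split(";")
    return {"id": int(customer_id), "name": name, "amount": float(amount)}


old_line = "1;Acme;100.50"
new_line = "1;Acme;EUR;100.50"   # the writer now also sends a currency code

print(read_record(old_line))
try:
    read_record(new_line)
except ValueError as exc:
    # The reader fails until it is changed to match the writer.
    print("reader broke:", exc)
```

This is the lucky case, where the reader fails loudly instead of silently loading values into the wrong fields.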

If you think all of the above is not enough to stop using text files, read on.

Synchronization

The flat text file is static. Your data is dynamic. Your business is dynamic. Enough said. Writing something to a text file means that it has to be processed later. It means that the data cannot be changed at the source unless you're sure that the other side has captured the previous change. It means that there will be a delay between the time one application has a piece of data and the time the other has the same piece. This causes no end of race conditions, unless both applications are changed to reflect the special state the data is in while it is being sent across. Which introduces yet more artificial complexity in data management.

Not only that, but those files have to be moved around. Which means another protocol is involved in the transfer, be it FTP, CIFS (Windows file sharing) or similar. Some of those protocols insist on introducing additional layers of character set translation, require customized firewall settings and need to be taken care of. Not only does that mean additional cost, it is also another point of failure where things can go wrong.

Experienced developers have a solution for this problem: usually, closing the system to end user access while the flat text files are being read or written. Sometimes I've seen flat text files used as some sort of journal to defer the import or export, usually to a nightly batch process. Those nightly jobs can become monster bottlenecks, where all useful application processing is deferred to the night. Cross your fingers, because if anything goes wrong it will have to be fixed the following day, delaying whatever processes depend on the data being moved, and sometimes forcing the regular services to stop to give the systems enough breathing room to do their monster nightly batch runs during the day.

You can fix this, of course, by continuously monitoring your beloved flat text file import and export processes. Remember to add those costs to the total cost of support.

If you believe that replacing the flat text file with some other synchronous or asynchronous solution will involve expensive distributed transaction technologies, I have an answer for that: only if you want it to. Your systems will be much more robust and data integrity much higher if you use transaction monitors or message queues. But even if you don't use any of those facilities you'll end up in much better shape than if you used a flat text file. The explanation is that your method will update one entity at a time, and you can have full error control during the transfer (be it transactional or not) and alert the user or the support staff of any problems on the spot. Whereas when you read your flat text file records and act on them, you're going to generate the same error condition over and over. Fixing problems as soon as they appear is much easier and faster in the long run than deferring the fix to the time you analyze the results of your flat text file data load, when a lot of damage may already be done.
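
Here is a minimal sketch of that "one entity at a time" idea in Python, with no transaction monitor or queue involved. Everything in it (the Customer class, send_to_target, the alert callback and the validation rule) is hypothetical and only stands in for whatever call your target application exposes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Customer:
    id: int
    name: str


class TransferError(Exception):
    """Raised by the (hypothetical) target system when it rejects an entity."""


def send_to_target(customer: Customer, target: List[Customer]) -> None:
    # Stand-in for a remote call; the target refuses customers without a name.
    if not customer.name.strip():
        raise TransferError("name is empty")
    target.append(customer)


def transfer(customer: Customer, target: List[Customer],
             alert: Callable[[str], None]) -> bool:
    """Send one entity as soon as it changes and report failure on the spot."""
    try:
        send_to_target(customer, target)
    except TransferError as exc:
        alert(f"customer {customer.id} not transferred: {exc}")
        return False
    return True


target_system: List[Customer] = []
alerts: List[str] = []

transfer(Customer(1, "Acme"), target_system, alerts.append)
transfer(Customer(2, "   "), target_system, alerts.append)

print(target_system)
print(alerts)
```

The bad record is reported while the person who typed it is still in front of the screen, instead of surfacing in a load report the following morning.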

Performance

And now to the final point. After what has been said, the people still favoring the flat text file are about to show their last card: sometimes it's faster to import or export a flat text file than to do it in other ways, right?

Yes, sometimes it is. Large data loads are probably done faster by using some intermediate storage where the application or database can perform bulk reads or writes. Of course, this intermediate storage is our beloved flat text file(s). Usually, those large data loads are one-off jobs. In practice, there are very few applications, if any, that need to perform them on a regular basis. Even large data warehouses are updated incrementally, simply because there comes a point where there is not enough time to rebuild them from scratch each time a new bit of data arrives. And in those cases, the bulk of the time is spent refreshing the computed results, not the base level data.

Still, the idea of using the same formats for your one-off and incremental loads looks attractive. Again, look at the reasons explained above and think about whether it is still worth enslaving yourself and paying those costs forever just because you want to keep the same text file format. Especially the first time you need to change that written-in-stone flat file format used by all those programs around.

As for performance, while there is no discussion that databases usually like to work on data sets as big as possible, incremental updates are not usually big enough to provide a significant performance improvement over doing them on the spot. And remember, doing the information transfer at the time the change is made at the source means that the sender application already has most, if not all, of the data that needs to be sent to the other application, which means that you are saving at least another read if you don't defer the transfer for later. In any case, you're saving the read/write time of the flat text file. If you're still doing significantly better with bulk batch loads, it probably means that there is still some fundamental flaw in the way the two applications are exchanging data, as taking that logic to the extreme would mean that you would stop using anything but flat text files for exchanging information, even inside the application, wouldn't you?

Final words of hate against flat text files

I realize that this post reflects that I passionately hate text files. Perhaps it's because I've had enough of them and my hatred is irrational, but I cannot see how any of the above arguments can be challenged. I've yet to face a situation where, using tools less than 10 years old, one has to resort to flat files for data transfer, yet I keep seeing them being created over and over. By this time, I think that you have enough arguments to stop trusting anyone who says that flat text files are "flexible", "efficient" or "cheap to implement". They aren't. Of course, someone selling you this idea probably has a vested interest in you using flat text files, as they know too well that you'll become dependent on them forever to maintain them. Do not trust them.

If you come across one of these consultants, briefly mention SOA, REST, RPC, SAP RFC, RMI, .NET Remoting, Oracle SQL Links or the myriad other technologies developed over the last 20 years to address all those needs.

I sometimes dream about the end of flat files. Somehow, I break into the customer data center during the night and replace each and every flat text file interface with a modern, well-performing approach. I then add a trigger for each change in the target database that writes a record to the flat text files that should have been read or written to transfer the data. I bet that in most cases it will take them a week or so to realize that there is something different. Well, for some customers the job would take me much longer than a night, as the number of different flat files flying around is in the hundreds, if not thousands.

Flat text files should have died a long time ago as a way of exchanging data among applications. Let's let the accountants and finance people play with them in their Excel spreadsheets, sparing them from having to learn anything new and us from having to police their abuse of databases. For anything else, let's drop them as soon as we can.

Thursday, 2 July 2009
