Database Performance Tuning: February 2012

TL;DR: database version control does not fit well with source code version control practices because challenges associated with data size.

I could not help but think about posting a comment on this well written blog post, but realized that it was a topic worth discussing at length in a separate entry. If you've not clicked and read already the rather interesting article, here's the summary: there is a big impedance mismatch between what are the current best practices regarding changes in source code and change control inside databases. The solution proposed is quite simple: develop a change script, review it, test it, do it in production. If anything fails, fix the database by restoring a backup and start all over again.

For added convenience, the author proposes also to store the scripts in the database itself, so that everything is neatly documented and can be reproduced at will.

There are a number of very interesting comments proposing variations on this process, and all of them really reflect some fundamental problems with databases that do not have their reciprocal in the world of source code control. While I seriously think that the author is basing his proposal on real world experience and that the process works well for the systems he's involved with, there are a few environmental factors that he is ignoring that render the approach impractical in some cases. It is as if he is falling into the classic trap of believing that everyone's view of the world has to be the same as yours. Here are a few reasons why not.

Databases can be huge

This is the crux of the problem. Code is the definition of some data structures plus how to process them. Databases contain structure, data and perhaps also instructions on how to deal with data. Compilers can process the source code in a matter of minutes, but adding the data takes much longer. Either by restoring files, running scripts or whatever other means, there is no way to avoid the fact that the time to restore data is at least a couple of orders of magnitude above the time needed to compile something.

This makes all the "simple" strategies for dealing with databases fail above certain size, and break the agilistic development cycles. In the end, if you want to have continuous integration or something similar, you simply cannot afford to start from an empty database in each build cycle.

Big databases have big storage costs

In an ideal world, you have your production database, plus a test environment plus a copy of the production database in each development workstation so that you can make changes without disturbing anyone. This works as long as the data fits comfortably in a hard disk.

In the real world, with a database big enough, this is simply not possible. What ends up happening, in the best case, is that developers have a local database with some test data seeded in it. In the worst case, all developers share a database running in a server that is not likely able to hold a full copy of the production environment.

Performance testing for big databases is done usually on what is sometimes called a pre-production environment: a full copy of the production database restored on separate storage. That already doubles the total cost of storage: production environment plus pre-production environment. For each additional environment you want to have, say, end user testing, you're adding another unit to the multiple.

Before you blame management greed for this practice, think again. Each full copy of the production database is increasing storage costs linearly. For $100 hard disks, this is perfectly acceptable. For $100.000 storage arrays, it is not.

We've had for decades Moore's law on our side for RAM capacity and CPU power. But the amount of data that we can capture and process is increasing at a much faster rate. Of course, having infinite infrastructure budgets could help, but even the advocates of setting up the best developemnt environment agree that there are limits on what you can afford to spend on storage.

One promising evolution of storage technology is the snapshot-copy on write based systems. They provide nearly instantaneous copy time -only metadata is copied, not the actual data- and only store what is changed across copies. This looks ideal for big databases with small changes, but is unlikely to work well with databases with big changes, either big or small, as you're going to pay the "price" -in terms of amount of storage- of the full copy at the time you do the changes. But, don't forget that the copied databases will be impacting the performance of the source database when they access unchanged parts. To prevent that from happening, you need to have a standalone copy for production, and another for pre-production, and another for development. So at a minimum, you need three different copies.

Restores mean downtime

So does application code upgrades, one could say. And in fact they do. However, the business impact of asking someone to close an application or web page and reopening it later can be measured in minutes. Restoring a huge database can mean hours of downtime. So it's not as easy as saying "if anything goes wrong, restore the database" Even in a development environment, this means developers waiting hours for the database to be available. In a big enough database, you want to avoid restores at all costs and if you do them, you schedule them off hours or in weekends.

Data changes often mean downtime too

While in the past adding a column to a table required an exclusive lock on a table, or worse, on the whole database, relational DB technology has evolved to the point of allowing some data definition changes not to require exclusive access to a table. However, there are still some other changes that need that nobody else is touching the object being changed. While not technically bringing down the application, this in practice means that there is a time frame when your application is not available, which in business terms means downtime.

It's even worse: changes that don't need exclusive locks it usually run inside a transaction, which can represent a significant resource drag on the system. Again, for a small database this is not a problem. For a big enough database, it is not likely to have have enough resources to update 100 million records and at the same time allow the rest of users to use the application without taking a huge performance hit.

Is there a way of doing it right?

Simple answer: yes. Complex answer: as you increase the database size, the cost of doing it right increases, and is not linear. So it is going to become more and more difficult, and it's up to you to decide where the right balance of safety and cost is.

However, a few of the comments in the post suggested improvements that are worth considering. In particular, having an undo script for each change script seems to me the most reasonable option. Bear in mind that some kind of data changes do not have an inverse function: for example UPDATE A SET A.B=A.B*A.B is always going to yield B positive regardless of the sign of the original value of B. In those cases, the change script has to save a copy of the data that before updating it. With that addition, at least you have some way of avoiding restores. This does not completely remove the problem of downtime, but at least mitigates it making it shorter.

This, plus the practice of keeping the scripts inside the database, has also the added benefit of keeping the database changes self contained. That means less complexity should you need to restore, which is something DBAs appreciate.

According to the different scales, these are then the different scenarios:

Small database: ideal world, one full production copy for each developer. Use the original process. When moving to production there is a low chance of problems, but if they appear, use the original process: restore and go back to step 1.
Mid size database: devs with small, almost empty database, final testing on pre-production. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.
Large database: devs with small, almost empty database. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.

Not pretty or perfect, but likely the world we live in. And note that NoSQL does not change anything here, except on the simplistic views of those novices to the NoSQL or development world at large. In some NoSQL implementations, you don't have to do anything special because there is no real database schema. You only change the code, deploy and everything is done. Really? Well, yes if you don't count all the places where you, or anyone else, made assumptions about the implicit structure of your documents. Which could be everywhere any of that data is being used. The relational world has some degree of enforced structure (keys, cardinality, nullability, relationships, etc) that makes certain category of errors to surface immediately instead of three weeks after the code is in production.

Today, Canonical announced that they are relegating Kubuntu, one of their "official" variants of their flagship Ubuntu Linux distribution, to the same status as the other distribution derivatives.

Canonical is the brainchild of Mark Shuttleworth, a dot com boomer that wanted to give back to the same community that provided some of the wonderful FOSS software that helped him becoming a millionarie. When it started, Canonical did not had any clear financial constraints, or objectives for that matter. Bug #1 in Ubuntu's bug database simply reads "Microsoft has a majority market share" Was Canonical's objective to take away market share from Windows? At the time, that seemed to be a bold statement, but the first Ubuntu releases were making giant strides towards that objective, to the point of being considered a credible alternative by many established players, including Microsoft itself.
Kubuntu, one of the Canonical projects, is an attempt to merge the friendliest Linux distribution -Ubuntu- with the desktop environment that I find that is the closest fit for a Mac or Windows user: KDE. An ideal combination for someone that switches between operating systems or is a seasoned user that wants to move away from the proprietary environments.

What the latest announcement means essentially is that the single individual paid by Canonical to develop and maintain Kubuntu will no longer be assigned to that role, and any further developments in Kubuntu will have to come from the community instead. This does not necessarily implies that there will be no more Kubuntu releases after Pangolin: for example, the Edubuntu community has managed to keep up with releases.

It's not that Kubuntu users are left in the cold, however. The next release (Precise Pangolin) of the Ubuntu family will still include Kubuntu, and being a Long Term Support (LTS) one, means that existing Kubuntu users will get patches and support for the next three years.

So what's so important about the announcement, then? By itself, very little, except for the small minority of Kubuntu users. Kubuntu had not enough user base, reputation or visibility to be worth keeping, hence Canonical has retired the single individual dedicated to Kubuntu because it does not make economic sense to keep paying him to do that.

What is important is not the announcement, is the trend: Canonical is more and more taking decisions based on economic, not idealistic, considerations.

Now, those idealistic goals are set aside more and more in search of a more mundane objective: profitability. The turning point was reached last year: they released Unity, a new desktop environment, targeted at non computer users, with an eye on using it as the interface for Ubuntu based touch devices and other non-PC environments. That was a big change that was received by the existing user community with a lot of backslash, yet Canonical is firmly resoluted to develop Unity in spite of that, and not willing to devote time or resources to keep alive an alternative to Unity.

What Canonical wanted to be at the beginning was not clear, but now it is: to be profitable.

And is hard to blame Canonical for not trying. After all, they have an extensive -read, expensive- staff dedicated to the many projects they sponsor. They have clearly invested a lot of effort -read, money- into many initiatives targetting everything from the office productivity desktop to the settop TV box, with incursions in the music store business, cloud storage, cloud OS and management, alternative packagings (Kubuntu still appears there at the time of writing this, by the way), education, home multimedia hubs, corporate desktop management and who knows what else.

All these projects have generated a considerable user base, at least in comparison with previous attempts, and helped Canonical to accomplish Shuttleworth original intention of giving back to the community, even if there are differing opinions on how much actual contribution has been made.

Yet, none of these projects seems to have generated a respectable enough line of business. Maybe some of these projects are self sustaining, maybe some of them generate some profit. But are they taking over the world? No. Are they going to be a repeat of the first Shuttleworth success? No. Are they making headlines? No. All that investiment is certainly producing something that is valued and appreciated by the open source community, but profitable is not.

At least yet. How do I know ? Honestly, I don't know for sure. What I know for sure is that any degree of significant success would be heralded and flagged as an example by the always enthusiast open source community. And that is not happening.

So Canonical is looking for profitability, big time. And if it means losing all their existing user base, so be it. Which will not be a big loss, because their existing user base is demonstrably not very profitable to begin with.

Which makes complete sense from a business perspective. If I was a millionare and had put a lot of my own personal fortune in something, I'd be expecting to see something back. Another, completely different issue, is how they can become profitable. That would mean looking into Unity, their biggest bet so far, and... well, you already know my opinion. Is a keyboard search the most effective way of finding things in a device without a physical keyboard? You judge.

Anyway, it was about time to change distro. With all due respect for the huge contribution Canonical and Ubuntu have made to build a robust, flexible and fast desktop environment. Until they stopped wanting to do that, of course.

Database Performance Tuning

Wednesday, 29 February 2012

Is there a right way of doing database version control?

Databases can be huge

Big databases have big storage costs

Restores mean downtime

Data changes often mean downtime too

Is there a way of doing it right?

Tuesday, 7 February 2012

The changing goals of Canonical

Useful links