TL;DR: database version control does not fit well with source code version control practices because challenges associated with data size.
I could not help but think about posting a comment on
this well written blog post, but realized that it was a topic worth discussing at length in a separate entry. If you've not clicked and read already the rather interesting article, here's the summary: there is a big impedance mismatch between what are the current best practices regarding changes in source code and change control inside databases. The solution proposed is quite simple: develop a change script, review it, test it, do it in production. If anything fails, fix the database by restoring a backup and start all over again.
For added convenience, the author proposes also to store the scripts in the database itself, so that everything is neatly documented and can be reproduced at will.
There are a number of very interesting comments proposing variations on this process, and all of them really reflect some fundamental problems with databases that do not have their reciprocal in the world of source code control. While I seriously think that the author is basing his proposal on real world experience and that the process works well for the systems he's involved with, there are a few environmental factors that he is ignoring that render the approach impractical in some cases. It is as if he is falling into the classic trap of believing that everyone's view of the world has to be the same as yours. Here are a few reasons why not.
Databases can be huge
This is the crux of the problem. Code is the definition of some data structures plus how to process them. Databases contain structure, data and perhaps also instructions on how to deal with data. Compilers can process the source code in a matter of minutes, but adding the data takes much longer. Either by restoring files, running scripts or whatever other means, there is no way to avoid the fact that the time to restore data is at least a couple of orders of magnitude above the time needed to compile something.
This makes all the "simple" strategies for dealing with databases fail above certain size, and break the agilistic development cycles. In the end, if you want to have continuous integration or something similar, you simply cannot afford to start from an empty database in each build cycle.
Big databases have big storage costs
In an ideal world, you have your production database, plus a test environment plus a copy of the production database in each development workstation so that you can make changes without disturbing anyone. This works as long as the data fits comfortably in a hard disk.
In the real world, with a database big enough, this is simply not possible. What ends up happening, in the best case, is that developers have a local database with some test data seeded in it. In the worst case, all developers share a database running in a server that is not likely able to hold a full copy of the production environment.
Performance testing for big databases is done
usually on what is sometimes called a pre-production environment: a full
copy of the production database restored on separate storage. That
already doubles the total cost of storage: production environment plus
pre-production environment. For each additional environment you want to have, say, end user testing, you're adding another unit to the multiple.
Before you blame management greed for this practice, think again. Each full copy of the production database is increasing storage costs linearly. For $100 hard disks, this is perfectly acceptable. For $100.000 storage arrays, it is not.
We've had for decades Moore's law on our side for RAM capacity and CPU power. But the amount of data that we can capture and process is increasing at a much faster rate. Of course, having infinite infrastructure budgets could help, but even the advocates of setting up the best developemnt environment agree that there are limits on what you can afford to spend on storage.
One promising evolution of storage technology is the snapshot-copy on
write based systems. They provide nearly instantaneous copy time -only
metadata is copied, not the actual data- and only store what is changed
across copies. This looks ideal for big databases with small changes,
but is unlikely to work well with databases with big changes, either big
or small, as you're going to pay the "price" -in terms of amount of
storage- of the full copy at the time you do the changes. But, don't forget that the copied databases will be impacting the performance of the source database when they access unchanged parts. To prevent that from happening, you need to have a standalone copy for production, and another for pre-production, and another for development. So at a minimum, you need three different copies.
Restores mean downtime
So does application code upgrades, one could say. And in fact they do. However, the business impact of asking someone to close an application or web page and reopening it later can be measured in minutes. Restoring a huge database can mean hours of downtime. So it's not as easy as saying "if anything goes wrong, restore the database" Even in a development environment, this means developers waiting hours for the database to be available. In a big enough database, you want to avoid restores at all costs and if you do them, you schedule them off hours or in weekends.
Data changes often mean downtime too
While in the past adding a column to a table required an exclusive lock on a table, or worse, on the whole database, relational DB technology has evolved to the point of allowing some data definition changes not to require exclusive access to a table. However, there are still some other changes that need that nobody else is touching the object being changed. While not technically bringing down the application, this in practice means that there is a time frame when your application is not available, which in business terms means downtime.
It's even worse: changes that don't need exclusive locks it usually run inside a transaction, which can represent a significant resource drag on the system. Again, for a small database this is not a problem. For a big enough database, it is not likely to have have enough resources to update 100 million records and at the same time allow the rest of users to use the application without taking a huge performance hit.
Is there a way of doing it right?
Simple answer: yes. Complex answer: as you increase the database size, the cost of doing it right increases, and is not linear. So it is going to become more and more difficult, and it's up to you to decide where the right balance of safety and cost is.
However, a few of the comments in the post suggested improvements that are worth considering. In particular, having an undo script for each change script seems to me the most reasonable option. Bear in mind that some kind of data changes do not have an inverse function: for example UPDATE A SET A.B=A.B*A.B is always going to yield B positive regardless of the sign of the original value of B. In those cases, the change script has to save a copy of the data that before updating it. With that addition, at least you have some way of avoiding restores. This does not completely remove the problem of downtime, but at least mitigates it making it shorter.
This, plus the practice of keeping the scripts inside the database, has also the added benefit of keeping the database changes self contained. That means less complexity should you need to restore, which is something DBAs appreciate.
According to the different scales, these are then the different scenarios:
- Small database: ideal world, one full production copy for each developer. Use the original process. When moving to production there is a low chance of problems, but if they appear, use the original process: restore and go back to step 1.
- Mid size database: devs with small, almost empty database, final testing on pre-production. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.
- Large database: devs with small, almost empty database. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.
Not pretty or perfect, but likely the world we live in. And note that NoSQL does not change anything here, except on the simplistic views of those novices to the NoSQL or development world at large. In some NoSQL implementations, you don't have to do anything special because there is no real database schema. You only change the code, deploy and everything is done. Really? Well, yes if you don't count all the places where you, or anyone else, made assumptions about the implicit structure of your documents. Which could be everywhere any of that data is being used. The relational world has some degree of enforced structure (keys, cardinality, nullability, relationships, etc) that makes certain category of errors to surface immediately instead of three weeks after the code is in production.