Wednesday, 29 February 2012

Is there a right way of doing database version control?

TL;DR: database version control does not fit well with source code version control practices because challenges associated with data size.

I could not help but think about posting a comment on this well written blog post, but realized that it was a topic worth discussing at length in a separate entry. If you've not clicked and read already the rather interesting article, here's the summary: there is a big impedance mismatch between what are the current best practices regarding changes in source code and change control inside databases. The solution proposed is quite simple: develop a change script, review it, test it, do it in production. If anything fails, fix the database by restoring a backup and start all over again.

For added convenience, the author proposes also to store the scripts in the database itself, so that everything is neatly documented and can be reproduced at will.

There are a number of very interesting comments proposing variations on this process, and all of them really reflect some fundamental problems with databases that do not have their reciprocal in the world of source code control. While I seriously think that the author is basing his proposal on real world experience and that the process works well for the systems he's involved with, there are a few environmental factors that he is ignoring that render the approach impractical in some cases. It is as if he is falling into the classic trap of believing that everyone's view of the world has to be the same as yours. Here are a few reasons why not.

Databases can be huge

This is the crux of the problem. Code is the definition of some data structures plus how to process them. Databases contain structure, data and perhaps also instructions on how to deal with data. Compilers can process the source code in a matter of minutes, but adding the data takes much longer. Either by restoring files, running scripts or whatever other means, there is no way to avoid the fact that the time to restore data is at least a couple of orders of magnitude above the time needed to compile something.

This makes all the "simple" strategies for dealing with databases fail above certain size, and break the agilistic development cycles. In the end, if you want to have continuous integration or something similar, you simply cannot afford to start from an empty database in each build cycle.

Big databases have big storage costs

In an ideal world, you have your production database, plus a test environment plus a copy of the production database in each development workstation so that you can make changes without disturbing anyone. This works as long as the data fits comfortably in a hard disk.

In the real world, with a database big enough, this is simply not possible. What ends up happening, in the best case, is that developers have a local database with some test data seeded in it. In the worst case, all developers share a database running in a server that is not likely able to hold a full copy of the production environment.

Performance testing for big databases is done usually on what is sometimes called a pre-production environment: a full copy of the production database restored on separate storage. That already doubles the total cost of storage: production environment plus pre-production environment. For each additional environment you want to have, say, end user testing, you're adding another unit to the multiple.

Before you blame management greed for this practice, think again. Each full copy of the production database is increasing storage costs linearly. For $100 hard disks, this is perfectly acceptable. For $100.000 storage arrays, it is not.

We've had for decades Moore's law on our side for RAM capacity and CPU power. But the amount of data that we can capture and process is increasing at a much faster rate. Of course, having infinite infrastructure budgets could help, but even the advocates of setting up the best developemnt environment agree that there are limits on what you can afford to spend on storage.

One promising evolution of storage technology is the snapshot-copy on write based systems. They provide nearly instantaneous copy time -only metadata is copied, not the actual data- and only store what is changed across copies. This looks ideal for big databases with small changes, but is unlikely to work well with databases with big changes, either big or small, as you're going to pay the "price" -in terms of amount of storage- of the full copy at the time you do the changes. But, don't forget that the copied databases will be impacting the performance of the source database when they access unchanged parts. To prevent that from happening, you need to have a standalone copy for production, and another for pre-production, and another for development. So at a minimum, you need three different copies.

Restores mean downtime

So does application code upgrades, one could say. And in fact they do. However, the business impact of asking someone to close an application or web page and reopening it later can be measured in minutes. Restoring a huge database can mean hours of downtime. So it's not as easy as saying "if anything goes wrong, restore the database" Even in a development environment, this means developers waiting hours for the database to be available. In a big enough database, you want to avoid restores at all costs and if you do them, you schedule them off hours or in weekends.

Data changes often mean downtime too

While in the past adding a column to a table required an exclusive lock on a table, or worse, on the whole database, relational DB technology has evolved to the point of allowing some data definition changes not to require exclusive access to a table. However, there are still some other changes that need that nobody else is touching the object being changed. While not technically bringing down the application, this in practice means that there is a time frame when your application is not available, which in business terms means downtime.

It's even worse: changes that don't need exclusive locks it usually run inside a transaction, which can represent a significant resource drag on the system. Again, for a small database this is not a problem. For a big enough database, it is not likely to have have enough resources to update 100 million records and at the same time allow the rest of users to use the application without taking a huge performance hit.

Is there a way of doing it right?

Simple answer: yes. Complex answer: as you increase the database size, the cost of doing it right increases, and is not linear. So it is going to become more and more difficult, and it's up to you to decide where the right balance of safety and cost is.

However, a few of the comments in the post suggested improvements that are worth considering. In particular, having an undo script for each change script seems to me the most reasonable option. Bear in mind that some kind of data changes do not have an inverse function: for example UPDATE A SET A.B=A.B*A.B is always going to yield B positive regardless of the sign of the original value of B. In those cases, the change script has to save a copy of the data that before updating it. With that addition, at least you have some way of avoiding restores. This does not completely remove the problem of downtime, but at least mitigates it making it shorter.

This, plus the practice of keeping the scripts inside the database, has also the added benefit of keeping the database changes self contained. That means less complexity should you need to restore, which is something DBAs appreciate.

According to the different scales, these are then the different scenarios:
  • Small database: ideal world, one full production copy for each developer. Use the original process. When moving to production there is a low chance of problems, but if they appear, use the original process: restore and go back to step 1.
  • Mid size database:  devs with small, almost empty database, final testing on pre-production. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.
  • Large database: devs with small, almost empty database. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.

Not pretty or perfect, but likely the world we live in. And note that NoSQL does not change anything here, except on the simplistic views of those novices to the NoSQL or development world at large. In some NoSQL implementations, you don't have to do anything special because there is no real database schema. You only change the code, deploy and everything is done. Really? Well, yes if you don't count all the places where you, or anyone else, made assumptions about the implicit structure of your documents. Which could be everywhere any of that data is being used. The relational world has some degree of enforced structure (keys, cardinality, nullability, relationships, etc) that makes certain category of errors to surface immediately instead of three weeks after the code is in production.


  1. The copy-on-write you propose is a fascinating idea and one we are exploring at along with versioning of course.

    The main pain-point you are bringing up of "When moving to production, cross your fingers and pray" is something we already extinquished.

  2. (Disclaimer - I work for Red gate Software)

    A neat way of version controlling database development using your existing source control system (SVN, TFS, Vault, Hg, Git, Perforce,... all others) is using the SSMS plug-in SQL Source Control - there are free webinars showing you how -

    1. Michael, the post is not about the lack of version control products, of which yours is a fairly well know one, but the problems associated with introducing changes in large databases. I cannot see how your product specifically helps with those.

      Specifically, I mention the downtime associated with running change scripts, and the long time needed to restore a scrambled database. Does your product have any features to address these issues? I could not find them.

  3. Version control is actually database deployment management - you are not managing the change.
    dbMaestro, is a complete database change management solution
    that covers all of the aspects of your database from development to deployment
    We offer it for FREE for small teams.

    Covers all aspects such as:
    Control- Control and manage all changes being implemented in the database- no out of process updates or changes can occur to the DB
    o Database change repository covering all DB object types
    o An enforced check in/out process
    o Table structure, PL/SQL code, Meta Data (relevant table contact – usually the data that influences application behavior)
    o Saving Object dependencies for future deployments
    o Enabling easy rollbacks
    o Interconnected to your existing SCM
    • Change Documentation
    o No code overrides (two developers start working on one package only to lose one's work at some point)
    o No undocumented changes are allowed

    Deploy – Eliminate risks threatening your deployment and decrease Your Deployment Costs by 95% with an integrated deployment engine
    o Create deployment scripts, instead of writing them manually – efficient, reliable, repeatable and documented and a huge time saver
    o Three way analysis/ impact analysis
    o Merge engine to automatically merge code from different teams doing development on the same time
    o Deploy changes based on business requirements (work items, tasks etc)

    Protect - Enhance your Database Security – control who is doing what in the database
    Track each database change back to the requirement and tasks that initiated it (+ integration with SCM for A to Z coverage)

    Comply- Improve Compliance to Regulatory Bodies
    o Roles and responsibilities enforcement – (important for audit compliance)
    o Instead of determining who can do what based on their access rights (can be very problematic if whole groups are using the same login credentials)
    o Make sure each person can do only what he should (linking with directory groups to determine project scopes of work)
    o Set roles: a DBA can change table structure while the developer can only change the procedures even if sharing the same login credentials

    1. Brenda, I really appreciate you leaving a comment here, but hard as as I tried, I could not see how your product is helping to remediate any of the problems discussed in the post.

      Yes, there are database version control products. Yes, they help to some degree to relieve the pain of database changes. No, none of them address the problems listed in the original post. Maybe if you read the article and take the time to discuss with some DBAs and developers you can contribute something useful, instead of copying and pasting your product feature list.

  4. Sorry if I sounded like a data sheet here- I just wanted to point out that I totally agree with the fact that there are bigger difficulties as the data grows. If we take it a step backward in order to manage the data you need to make sure that the right script is being deployed. In order to do so it is important to use the correct method of control (not all version controls are the same). The way that the dbMaestro TeamWork deals with this is by working directly on the database ensuring that the correct changes are being deployed instead of "crossing fingers" that the script you are deploying is the same as that being tested.
    Ensuring that the object dependencies are taken into consideration by using TeamWork to generate the deploy changes scripts. (no need to know the complicated DDL syntax either)
    Include business logic in the deployment script
    To make it short : all of the preparation, testing and executing in production wont pay off unless you make sure that all your correct changes are included in the deployment.