Tuesday, 20 May 2008

Database tuning myths

Businesses are always looking for predictability. From financial plans that span three years, to operational plans for next year, to five-year strategic plans, a good amount of time and resources at any business is devoted just to predicting what the future is going to look like. Entire business functions exist to create the plans and track their evolution. And the stock market regularly punishes companies that do not meet their financial plans. Ironically, it punishes only the ones that fall short of their objectives, never the ones that overshoot them. Someday the markets should look at some medical disciplines and discover why growing faster than you should is not always good for your health, but that's another story.

It's not surprising that all those processes create a culture of predictability, where everything must be planned in advance and coordinated. Over time, and by repetition, those processes become so embedded in the company culture that people forget why things are done in a particular way. This has many benefits, as the process of planning itself becomes predictable. But it also spreads to other areas of the company, and everything that is either important or big enough ends up having a detailed process that everyone has to follow. Detailed and predictable outcomes warm the hearts and souls of the financial people because they fulfil one of the basic tenets of capitalism: trust in the future.

Unfortunately, this model does not scale well if you want to keep an eye on all the details, even at the business process level. At some point, you have to forget about the details of how something is manufactured, or how the accounting is done. That gives you abstractions, and those are an essential tool for scaling up the complexity of any model, because without them you would get lost in the details very quickly.

Out of those abstractions, things known as "rules of thumb", or in more corporate/engineering parlance "best practices", emerge over time. These are supposed to be generic rules or techniques that can be applied to any problem and give satisfactory, predictable results.

But as many are discovering, abstractions are not perfect. By their very nature, they ignore the details and focus on a few areas of interest to remain reasonably simple to use; their usefulness comes precisely from ignoring the details.

If you've kept reading up to now (thank you very much), you may be wondering how all this relates to database performance tuning. Well, this is one of the places where I see rules of thumb applied daily, often with disastrous results. Computer systems have evolved from the centralized mainframe to the distributed storage, processing and client access environment of today. Where in the past you had a machine and storage system with a thin operating system layer on top, now you have to deal with an operating system that attempts to coordinate a wide range of things connected to the system, plus a network stack, a database persistence layer, a business process model layer, a presentation layer and a client (thin or thick) layer.

What were valid "rules of thumb" in one environment in the past cannot simply be applied to a different, more complex one. In fact, some of them were never applicable in any environment at all. Here's my "best practices hell" top three:

1- We just need to use fancy new hardware

The number one, because there is a natural tendency to associate the performance of the system with something physical. And that's the box sitting in the data center. Just as you can get a faster car, why not just get a faster computer?

The answer is yes, as long as you make sure that you actually need it. Adding a faster CPU to a system that is disk bound is not going to improve its performance. Putting a faster disk in a machine that runs processes that eat up all its RAM will only make it swap faster. Adding more memory to a system that spends most of its time sending data across the network does not improve its performance.
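One cheap way to tell which resource a process is actually bound by (a sketch, not a substitute for real monitoring tools) is to compare wall-clock time with CPU time: if elapsed time is much larger than CPU time, the process is mostly waiting on disk, network or locks, and a faster CPU won't help it.

```python
import time

def profile_wait(fn):
    """Run fn() and report how much of its elapsed time was spent on CPU.

    A large gap between wall-clock and CPU time means the process was
    mostly waiting (disk, network, locks) rather than computing.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s waiting={wall - cpu:.3f}s")
    return wall, cpu

# A workload that only sleeps looks I/O-bound: almost no CPU time.
wall, cpu = profile_wait(lambda: time.sleep(0.2))
```

Wrap the suspect batch step or query in something like this before deciding what hardware, if any, to buy.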

In short, you need to diagnose before healing. You need to know your system, and you need to know your application and how it's using the system. Without that knowledge, you are just guessing, and spending a lot of money at it. And I've watched this happen more than a few times. It's very frustrating to spend money and resources on an expensive hardware upgrade only to find that the results are unnoticeable.

The reason behind that is that you've reached some hard limit and your whole system cannot be easily improved by throwing more hardware at it. Then it's time to examine your application and move its performance boundaries to other components. This sounds complicated, and it sometimes is. But often it just means checking the application and making sure it's using the database resources sparingly. The classic case is the 10,000-item grid that is completely filled up on application start-up. The network round trips necessary for this are huge, and the application changes needed to avoid them are usually minor. The irony is that after this exercise, you'll probably discover that you did not actually need any new hardware at all.
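The grid case can be sketched in a few lines. This toy example uses an in-memory SQLite database as a stand-in for the real server, and the table name and page size are invented; the point is to ask the database only for the page the user can actually see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item {i}",) for i in range(10_000)])

# Anti-pattern: drag all 10,000 rows to the client just to fill a grid.
all_rows = conn.execute("SELECT id, name FROM items").fetchall()

# Better: fetch only the page the user can actually see.
page_size, page = 50, 0
visible = conn.execute(
    "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
    (page_size, page * page_size),
).fetchall()

print(len(all_rows), len(visible))  # 10000 50
```

Over a real network, the difference between shipping 10,000 rows and 50 on every start-up is exactly the kind of boundary that no hardware upgrade will move.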

But the myth stays firmly planted in the industry because a hardware replacement often means a performance increase in all of the system's components. Therefore, wherever the performance boundaries lie, the system performs better after the hardware upgrade.

2- We just need to create a few indexes

This is one of the most popular, yet dangerous myths. Come on, let's settle this discussion by taking the myth to the extreme. Go ahead, index everything in your database. And I mean every single column and every different combination of columns in each table. Now try a few queries. Wow, some of them are probably faster. Now try any INSERT or UPDATE on it and watch the extreme performance degradation. Convinced?

The explanation is that an index is just a table with a special organization that allows you to find things in it quickly. Yes, it's a small table because it contains only the indexed columns, but it's a table nonetheless. Each time you add an index you're creating one of those tables and increasing the overhead of keeping all of them up to date, not to mention the concurrency issues your database has to deal with each time it updates a table and its indexes. Oh yes, I hear you saying: ha!, clustered indexes do not have this problem! Of course, but the problem with clustered indexes is that you cannot have more than one per table, so it's very unlikely that you're going to cover all cases with them.
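You can watch the write overhead happen. The sketch below uses an in-memory SQLite database (table, columns and the particular set of indexes are invented for the experiment): the same batch of inserts runs against a bare table and against one carrying a dozen indexes.

```python
import sqlite3
import time

def insert_rows(db, n=5000):
    """Insert n rows and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    db.executemany("INSERT INTO t (a, b, c, d) VALUES (?, ?, ?, ?)",
                   [(i, i % 7, i % 13, f"row {i}") for i in range(n)])
    db.commit()
    return time.perf_counter() - start

lean = sqlite3.connect(":memory:")
lean.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")

heavy = sqlite3.connect(":memory:")
heavy.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
# "Index everything": every column plus several combinations.
for i, cols in enumerate(["a", "b", "c", "d", "a,b", "a,c", "a,d",
                          "b,c", "b,d", "c,d", "a,b,c", "a,b,c,d"]):
    heavy.execute(f"CREATE INDEX idx{i} ON t ({cols})")

t_lean = insert_rows(lean)
t_heavy = insert_rows(heavy)
print(f"no extra indexes: {t_lean:.3f}s, 12 indexes: {t_heavy:.3f}s")
```

Every one of those twelve indexes is an extra table the engine must update on each INSERT; the heavily indexed copy pays that price on every write, forever.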

Remember, performance tuning is a trade-off. It's about using available resources where they are most useful. And this is clearly not a way to get the most from your system.

This is not to say that indexes are not useful. Yes, they are. But anyone applying rules of thumb is sooner or later going to fall into this trap. Remember, you've been warned.

3- We did this the last time and it improved performance

As explained in the introduction, "rules of thumb" do not account for complexity. Face it, the systems you're dealing with are very, very complex. The performance of a system is dictated as much by its external environment and constraints as by its own internal characteristics.

So, unless the previous system was identical to the new one, this assumption is likely to be false. This myth is particularly firmly implanted among single-product specialists and consultants, and it becomes very visible when they move to something different. Remember, reject any changes that do not come with good supporting evidence. Just saying "because I have a book that says so", "because the vendor recommends it" or "because it worked well in my last project at customer xxxx" are not valid reasons per se.

Wednesday, 14 May 2008

Having a baseline

While the technical part of database performance tuning is different in almost every case, on the business side of things I almost always run into the same conversations. My favorite one is the conversation with potential customers. These follow more or less the same structure. And they always begin with the same sentences:

-Customer: our application has performance problems, we're looking for someone to improve it.
-Consultuning: have you already identified specific parts of the application that are not performing as you need?
-Customer: well it's this screen/process/job that is being perceived as slow.
-Consultuning: ok, let's have a look at it. What exactly do you mean by slow?
-Customer: I mean awfully slow, our business cannot operate with these response times.

Ironically, you'd think that at this point the customer's answer would be "it takes xxxx seconds", but usually that's not the case. It's unusual to face a customer that has actually timed it, even to the second. For interactive applications ("dialog steps" in the SAP world) this is understandable, as usability studies proved a long time ago that anything longer than three seconds is perceived as "slow". For batch processes, since they run unattended, the execution time is usually not monitored even if it's captured in the system logs.

But remember: if you don't measure it, you cannot tune it. Any performance improvement effort will be a balance of cost, time to implement, and resource consumption. These have to be weighed against the benefits you're getting. And unless you have some tangible numbers, you simply cannot do that, so you'll have no basis to accept or reject any proposed changes. One of my rules of thumb is to reject any proposed change if it's not backed up by measurements that prove the performance gain is worth the cost. This is also one of the points of friction when dealing with people (or automated tools) trying to improve performance by applying "rules of thumb", as the cost and impact of changes is often ignored. But it's a discussion worth having.
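Getting a baseline doesn't need fancy tooling. A minimal sketch (the job name and the timings here are invented) is simply to record how long each step takes, every time it runs, so that "slow" becomes a number you can compare against later:

```python
import time
from functools import wraps

baseline = {}  # step name -> list of observed durations in seconds

def measured(name):
    """Record the wall-clock duration of every call under the given name."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                baseline.setdefault(name, []).append(
                    time.perf_counter() - start)
        return wrapper
    return decorator

@measured("nightly_import")
def nightly_import():
    time.sleep(0.05)  # stand-in for the real batch job

nightly_import()
nightly_import()
runs = baseline["nightly_import"]
avg = sum(runs) / len(runs)
print(f"nightly_import: {len(runs)} runs, avg {avg:.3f}s")
```

With even this much in place, "awfully slow" turns into "it went from 40 to 190 seconds since March", which is something you can actually act on.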

And what about the targets?

Yes, ideally, you should have a performance target. The problem is that performance tuning, by its very nature, is not deterministic. If you already knew what your target was and that you could reach it, you would probably have already solved the tuning problem. So the conversation continues...

-Consultuning: well, how fast do you want it to be?
-Customer: of course, I want it to be fast enough.

My advice is that you try to define "fast enough", but be aware that you'll not get any commitment to reach that performance level from any tuning expert worth their title, except in the "been there, done that" cases, and those are very rare with bespoke applications. So we have two cases:

- You have a performance target. Stop tuning when you reach it. Save your tuning budget for later rounds.
- If you don't have a target, keep this in your head instead: each tuning step will yield fewer performance benefits than the previous one. At some point, the gains will be minimal and the cost to implement them too high. That is the moment to stop tuning.

Wednesday, 7 May 2008

Sometimes a tenth of a second is important

As in any other field where a certain skill level is reached, performance tuning beyond a certain point becomes some kind of magical black art to the uninitiated. It's then part of your skill set to explain, in terms that a non-specialist can understand, what you're really doing (and what they are paying for).

This is the point where excessive simplification can, and in fact does, give you problems. A case in point was a tuning exercise done on a large import process for a database. The import process iteratively issued, row by row, an SQL query that retrieved a single row. The application design was not optimal, as this access was not even partially indexed, and the import process was agonizingly slow.

The data model had been "customized" from the big application vendor's original design to adapt it to business needs, and that customization had not taken into account all possible impacts of changing some metadata. But since the import process was also supplied by the vendor, there was no way to change it. Nor was calling the vendor for support an option.

Ah, I said, no problem, we can create the necessary index and be done with it. But first, remembering that performance tuning is always a balancing exercise, I took the extra caution of checking the cardinality of the fields involved. The query in question had four columns in the WHERE criteria, but only three of them were significant in their selectivity. The fourth one had just a 2:1 ratio, while the other three together had more than a 10,000:1 ratio. Thus, I created an index on the first three fields and left the fourth alone.

It is always a good idea to check the selectivity of your indexes; having unnecessary columns in them simply degrades performance without any benefit in access times. There's also a very common, and wrong, performance optimization "rule" among non-specialists that says "just add more indexes" as an answer to almost any database performance problem. The cardinality check is usually a good way to dispel that myth, because more indexes do not always equate to better performance.
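The cardinality check itself is cheap to run. Here's a sketch against an in-memory SQLite database; the table, columns and data distributions are invented to mirror the situation above, with one near-useless 2:1 column alongside highly selective ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders "
             "(region TEXT, status TEXT, customer INTEGER, item INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(f"R{i % 50}", "open" if i % 2 else "closed", i % 500, i)
     for i in range(10_000)],
)

# Selectivity = distinct values / total rows. Near 1.0 means the column
# narrows the search a lot; near 0.0 means it barely helps an index.
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
for col in ("region", "status", "customer", "item"):
    distinct = conn.execute(
        f"SELECT COUNT(DISTINCT {col}) FROM orders").fetchone()[0]
    print(f"{col:>8}: {distinct:>6} distinct / {total} rows "
          f"= {distinct / total:.4f}")
```

A column like `status` here (two distinct values in ten thousand rows) is the 2:1 case from the story: adding it to the index buys almost nothing while making every write pay for it.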

After I did that, the results were good: the single-row select was down from 10 seconds per row to 0.2 seconds per row.

Well, that's it, I said. That is, until the application developer came in later and asked if I could improve the performance further. Since it was already 50 times better, I was reluctant to add the remaining column to the index. But when I did, performance went down to 0.1 seconds per row instead of 0.2. This kind of improvement, while spectacular in itself (how many times can you double performance for such a small price?), was nowhere near the order of magnitude gained by the first one, and probably not worth doing in the general case.

Unless, of course, the single-row SELECT is performed 100,000 times. Then it really makes a difference whether you wait 10,000 seconds or 20,000. In the first case your process will finish within normal working hours. In the second, you'll have to stay late in the office until it finishes. Case closed.

Over-specify to avoid performance problems

Nobody likes to stop the press, the train or the plane, or the cash collection, or whatever process their business is engaged in, waiting for a computer to finish its task. Nobody likes having to fit the process ("the list of deliveries will not be available until 11:30 in the morning") to the timing of a machine. Those are usually the symptoms of a computer system not performing up to the business requirements.

Of course, this is most commonly seen in the business world. Software manufactured in other fields simply cannot afford to have performance defects. Think of life support or vehicle control systems, where performance is a requirement expressed in very precise terms. Think of video games, where jerky animation means end users will not purchase them. Software used to support business processes, on the other hand, is usually viewed as a constant balance between cost, time to market and quality. Seasoned project managers, when confronted with the previous sentence, will instantly recognize it and say "you cannot have all three together, you have to pick two".

In the cost, time and quality triangle, where do you place performance? As veteran project managers and performance consultants already know, the answer probably lies somewhere between the three. Performance is surely going to be considered a quality item, but it also happens to impact both price and time to market. Here's my experience in the field; note that your mileage may vary.

In the end, it's the quality that suffers

The traditional project life cycle in a business environment is seldom completed as initially envisioned. Very few projects end up with their scope or schedule unchanged, if only because the traditional business practice of "optimizing" everything cannot resist the temptation of subtracting resources or shortening the timeline when the prospects of a project finishing on time and budget are good. Businesses will very seldom drop features or accept postponed delivery dates, perhaps because the business side of the project understands those concepts very well, whereas "quality" is usually more abstract and harder to quantify.

Besides, this is software, right? Being software, it means that the same principles applied in other engineering disciplines do not apply here. This is not the place to discuss whether software is engineering or not, but certainly some rules that are usually applied in well-established engineering practices do not apply. Can you imagine a civil engineer changing the width of a bridge once construction work has started? Hardly. But the same civil engineers, when leading a software project, have no problem changing the specifications for data volumes while building the system. This makes it even more unlikely that all the spec changes stay in sync, not to mention the performance ones.

To have any chance of surviving schedule and scope changes, performance testing, like unit, functional or integration testing, should be an integral part of the project plan. If you don't have performance requirements and testing at the beginning of the project, skip to another article, as the damage has already been done. You'll resort to monitoring the live system's performance and watching for performance trends to show up, before even having a clue of what you need to do to keep acceptable performance levels.

But if you do have performance testing as an integral part of the project plan, don't hold any high hopes either. In my experience, it's the "quality" part of the triangle that suffers. Outside of mission-critical contexts with human health at stake, and in spite of all the "quality" buzzwords that surround modern management methodologies, changes in scope or schedule will always win over quality. Partly because "quality" is a hard concept to define for business software, and partly because the number of defects in a software application is largely irrelevant as long as it fulfills business requirements today. And today means this quarter, not even this year. Since software is infinitely flexible, we can always fix defects later, right?

This philosophy largely permeates management because, much like performance tuning, everything is seen as a balance. Your application is perhaps not complete, has a few outstanding bugs and has not been performance tested. But the business benefits, and sometimes management reputations, are much preferable to moving the delivery date to have time to fix those problems. Let's not even talk about performance testing, for which, except in cases where a good previous baseline exists, it is almost impossible to estimate data volumes or densities.

But it is manageable, isn't it?

The answer is both yes and no. Yes, because everything that can be measured is manageable (and even things that are not measurable should be manageable; it's just that you cannot easily rate how good you are at doing it). No, when you reach the point where managing an issue becomes more expensive than any losses you may incur by simply not trying to manage it until you are sure of what the issue actually is. That is, if you have to specify performance targets for each of your application processes, you're probably going to spend much more time than just saying "no process should take more than 3 seconds". After stating that, you'll discover that 3 seconds is actually a very short time to generate a couple of thousand invoices, so you'll adjust your performance requirements for that. In the end, your performance requirements will probably consist of a single word: "enough".

Paradoxically, this is the point where you've realized what your performance requirements are. It makes no sense to tune any process as long as it's performing well enough. It only makes sense to invest in tuning the processes where performance is not enough.

Trouble is, over the lifetime of an application, what is "enough" will change. Because the number of your customers can change, your two thousand invoices can become twenty thousand. Therefore, performance tuning becomes part of application maintenance. That does not mean you'll always be tuning your application; it just means you have to keep an eye on its performance, so it would be more precise to say that monitoring is what actually becomes part of the standard maintenance process. Actual tuning happens when the monitoring detects something going wrong.

But you want to avoid that, right? Because tuning is an expensive process that does not deliver a clear business case, unless it belongs to the survival dimension. The question is, can I really avoid performance tuning, at least outside of the normal maintenance activities?

And again, the answer is both yes and no. No, unless you're always right in your business forecasts and the numbers you're playing with don't change over the application lifetime. Remember, we're talking three to five years minimum application lifetime. Are you so confident?

And yes, because there's an easy way of avoiding it entirely. In essence, when you plan your system there is a stage we'll call "capacity planning", which is what you're supposed to do when you specify your system hardware. The trick is very simple: take the hardware specs recommended by the application vendor/developer/consultant and triple them. No, better, make them four times better. In all dimensions: not just 4x disk space, but also 4x memory and 4x CPU speed. But don't set that system up as your dev or test environment. Keep the test environment on the vendor/developer/consultant's original specifications, and make them comply with your performance requirements in that environment. Sounds extreme?

Perhaps you are thinking, "why not just make the performance requirements four times better and set up a test system that is equal to the live one? Wouldn't that be more supportable?" If you're in the kind of environment where the quality of application support depends on having the exact same hardware for test and live, you probably stopped reading this article a while ago, because you belong to the minority that develops operating systems, those pieces of software that isolate applications from the hardware they run on. I'm not advocating that you choose different architectures, such as testing on Intel and deploying on SPARC, just that you get a platform scalable enough to let you set up different environments for test and live. Besides, in the long run, your test and live systems will in fact be different. It only takes one expensive hardware upgrade to realize that you don't need the fastest and most expensive disks on earth for your test environment. In the realm of business applications, identical test and live systems just do not happen.

You need some discipline, because invariably there will be performance tests that pass on the live environment and fail on the test one. Avoid being tolerant: letting this slide will make it more frequent over time and will inevitably lead to performance problems, because it makes the developer/vendor less disciplined. Set your performance goals on test, and keep in your head some rough numbers that allow you to predict how things will scale on live.

If you're willing to go this route, be prepared. First, the live system is going to be expensive. Very expensive. Second, pressure will grow to accept performance results as they appear on the live system, not on the test one. Third, the 4x figure is based on experience, not hard data. After all, you're buying four times the system that supposedly runs your application well. Only a lot of experience and, if they exist, learnings from past projects will make your point strong enough to justify the increased hardware price.

It's your risk, you manage it

Did I mention that before? Yes, as always, you're facing a balancing decision: spend more than you theoretically need, or face the performance problem later. In a business context, there's nothing that cannot be prevented or mitigated by adding money to it. It's just that sometimes it's too much money.

If you choose the traditional route, accept that at some point your monitoring will scream "tuning" at you. Set aside some budget for performance improvements; it will be much easier to justify.