Wednesday, 7 May 2008

Over-specify to avoid performance problems

Nobody likes to stop the press, the train or the plane, or the cash collection, or whatever process their business is engaged in, waiting for a computer to finish its task. Nobody likes having to fit the process ("the list of deliveries will not be available until 11:30 in the morning") to the timing of a machine. These are usually the symptoms of a computer system not performing up to the business requirements.

Of course, this is more commonly seen in the business world. Software built in other fields simply cannot afford performance defects. Think of life support or vehicle control systems, where performance is a requirement expressed in very precise terms. Think of video games, where jerky animation means end users simply will not buy them. Software used to support business processes, by contrast, is usually viewed as a constant balance between cost, time to market and quality. Seasoned project managers, when confronted with the previous sentence, will instantly recognize it and say "you cannot have the three together, you have to pick two".

In the cost, time and quality triangle, where do you place performance? As veteran project managers and performance consultants already know, the answer probably lies somewhere among the three. Performance is surely going to be considered a quality item, but it also happens to impact both cost and time to market. Here's my experience in the field; note that your mileage may vary.

In the end, it's the quality that suffers


The traditional project life cycle in a business environment is seldom completed as initially envisioned. Very few projects end with their scope or schedule unchanged, if only because the traditional business practice of "optimizing" everything cannot resist the temptation of subtracting resources or shortening the timeline whenever the prospects of a project finishing on time and on budget look good. Very seldom will the business drop features or accept postponing delivery dates, perhaps because the business side of the project understands those concepts very well, whereas "quality" is usually more abstract and harder to quantify.

Besides, this is software, right? Being software, the same principles applied to other engineering disciplines supposedly do not apply here. This is not the place to discuss whether software is engineering or not, but certainly some rules that are usually applied to well established engineering practices do not apply. Can you imagine a civil engineer changing the width of a bridge once construction work has started? Hardly. But the same civil engineer, when leading a software project, has no problem changing the specifications for data volumes while the system is being built. That makes it even more unlikely that all the spec changes stay in sync, the performance ones least of all.

To have any chance of surviving schedule and scope changes, performance testing, like unit, functional or integration testing, should be an integral part of the project plan. If you don't have performance requirements and testing from the beginning of the project, skip to another article, as the damage has already been done. You'll resort to monitoring the live system's performance and watching for performance trends to show up before you even have a clue about what you need to do to keep acceptable performance levels.
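
As an illustration of what "integral part of the project plan" can mean in practice, a performance check can sit right next to the unit tests. This is only a sketch, in pytest style; the generate_invoices function and the 5-second budget are hypothetical stand-ins, not taken from any real project.

import time

def generate_invoices(count):
    # Stand-in for the real business process under test.
    return ["INV-%05d" % i for i in range(count)]

def test_invoice_generation_meets_budget():
    start = time.perf_counter()
    generate_invoices(2000)
    elapsed = time.perf_counter() - start
    # The budget comes from the performance requirement agreed up front,
    # not from whatever the code happens to do today.
    assert elapsed < 5.0, "Invoice generation took %.1fs" % elapsed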

But even if you have performance testing as an integral part of the project plan, don't hold out high hopes. In my experience, it's the "quality" part of the triangle that suffers. Outside of mission-critical contexts with human health at stake, and in spite of all the "quality" buzzwords that surround modern management methodologies, changes in scope or schedule will always win over quality. Partly because "quality" is a hard-to-define concept for business software, and partly because the number of defects in a software application is largely irrelevant as long as it fulfills business requirements today. And today means this quarter, not even this year. Since software is infinitely flexible, we can always fix defects later, right?

This philosophy largely permeates management because, much like performance tuning, everything is seen as a balance. Your application is perhaps not complete, has a few outstanding bugs and has not been performance tested. But the business benefits, and sometimes management reputations, are much preferable to moving the delivery date in order to have time to fix those problems. Let's not even talk about performance testing, for which, except in cases where a good previous baseline exists, it is almost impossible to estimate data volumes or densities.

But it is manageable, isn't it?


The answer is both yes and no. Yes, because everything that can be measured is manageable (and even things that are not measurable should be manageable; it's only that you cannot easily rate how good you are at doing it). No, when you reach the point where managing an issue becomes more expensive than any losses you may incur by simply not trying to manage it until you are sure of what the issue actually is. That is, if you have to specify performance targets for each of your application processes, you're probably going to spend much more time than just saying "no process should take more than 3 seconds". After stating that, you'll discover that 3 seconds is actually a very short time to generate a couple of thousand invoices, so you'll adjust your performance requirements for that. In the end, your performance requirements will probably consist of a single word: "enough".
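
To make that concrete, a per-process performance requirement can be as simple as a table of time budgets checked against measured durations. The process names and numbers below are purely illustrative; this is a minimal sketch, not a framework.

import time

# Seconds allowed per process; the blanket "3 seconds" rule quickly needs
# exceptions for batch work such as invoice generation.
TARGETS = {
    "customer_lookup": 3.0,        # interactive, the generic rule applies
    "invoice_batch_2000": 600.0,   # batch job, needs its own larger budget
}

def check_performance(process_name, func, *args, **kwargs):
    # Run a process and report whether it met its time budget.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    budget = TARGETS[process_name]
    status = "OK" if elapsed <= budget else "TOO SLOW"
    print("%s: %.1fs (budget %.1fs) -> %s" % (process_name, elapsed, budget, status))
    return result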

Paradoxically, this is the point where you've realized what your performance requirements actually are. It makes no sense to tune any process as long as it's performing well enough. It only makes sense to invest in tuning the processes where performance is not enough.

Trouble is, over the lifetime of an application, what is "enough" will change. Because the number of your customers can change, your two thousand invoices can become twenty thousand. Therefore, performance tuning becomes part of application maintenance. That does not mean you'll always be tuning your application, just that you have to keep an eye on its performance, so it would be more precise to say that it is monitoring that's actually part of the standard maintenance process. Actual tuning happens when the monitoring detects something going wrong.
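
A rough sketch of what "keeping an eye on it" could look like: record run times and raise a flag only when a recent trend drifts past the agreed threshold. The class, window size and threshold below are assumptions made for illustration, not part of any real monitoring tool.

from collections import deque
from statistics import mean

class PerformanceWatch:
    def __init__(self, threshold_seconds, window=10):
        self.threshold = threshold_seconds
        self.samples = deque(maxlen=window)   # keep only the latest runs

    def record(self, elapsed_seconds):
        self.samples.append(elapsed_seconds)

    def needs_tuning(self):
        # A single slow run is not a trend; only worry once the recent
        # average breaches the threshold over a full window of runs.
        return len(self.samples) == self.samples.maxlen and mean(self.samples) > self.threshold

# After each nightly invoice run, for example:
# watch = PerformanceWatch(threshold_seconds=600)
# watch.record(runtime)
# if watch.needs_tuning(): open a tuning task with the maintenance team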

But you want to avoid that, right? Because tuning is an expensive process that does not deliver a clear business case, unless it belongs to the survival dimension. The question is, can I really keep at least performance monitoring outside of the normal maintenance activities?

And again, the answer is both yes and no. No, unless you're always right in your business forecasts and the numbers you're playing with don't change over the application's lifetime. Remember, we're talking a minimum application lifetime of three to five years. Are you that confident?

And yes, because there's an easy way of avoiding it entirely. In essence, when you plan your system there is a stage we'll call "capacity planning", which is what you're supposed to do when you specify your system hardware. The trick is very simple: take the hardware specs recommended by the application vendor/developer/consultant and triple them. No, better: make them four times bigger. In all dimensions, not just 4x disk space, but also 4x memory and 4x CPU speed. But don't use that system as your dev or test environment. Keep the test environment on the vendor's/developer's/consultant's original specifications and make them comply with your performance requirements in that environment. Sounds extreme?
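
In arithmetic terms the trick is nothing more than multiplying every dimension of the recommended sizing, not just one of them. The baseline figures below are invented for the example.

OVERSPEC_FACTOR = 4

vendor_recommended = {      # hypothetical vendor sizing
    "cpu_cores": 8,
    "memory_gb": 32,
    "disk_tb": 1,
}

# The live system gets every dimension multiplied; test stays on the original specs.
live_system = {k: v * OVERSPEC_FACTOR for k, v in vendor_recommended.items()}
test_system = dict(vendor_recommended)

print(live_system)   # {'cpu_cores': 32, 'memory_gb': 128, 'disk_tb': 4}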

Perhaps you are thinking "why not just make the performance requirements four times stricter and set up a test system identical to the live one? Wouldn't that be more supportable?" If you're in the kind of environment where the quality of application support depends on having exactly the same hardware for test and live, you probably stopped reading this article a while ago, because you belong to the minority that develops operating systems, those pieces of software that isolate applications from the hardware they run on. I'm not advocating that you choose different architectures, such as testing on Intel and deploying on SPARC, just that you get a platform scalable enough to let you set up different environments for test and live. Besides, in the long run, your test and live systems will in fact be different. It only takes one expensive hardware upgrade to realize that you don't need the fastest and most expensive disks on earth for your test environment. In the realm of business applications, keeping test and live identical just does not happen.

You need some discipline, because invariably there will be performance tests that pass on the live environment and fail on the test one. Avoid being tolerant: over time this will become more frequent and will inevitably lead to performance problems, because it will make the developer/vendor less disciplined. Set your performance goals on test, and keep in your head some rough numbers that will let you predict how things will scale on live.
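
Those "rough numbers in your head" can be as crude as a single assumed speed-up factor between the two environments. Real scaling is rarely linear, so treat the sketch below, with its made-up factor and timing, as a sanity check rather than a prediction.

ASSUMED_SPEEDUP = 2.5   # how much faster the 4x live box is believed to be

def predicted_live_time(test_seconds):
    return test_seconds / ASSUMED_SPEEDUP

test_result = 900       # nightly batch measured on the test environment, in seconds
print("Expected on live: roughly %.0f seconds" % predicted_live_time(test_result))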

If you're willing to go this route, be prepared. First, the live system is going to be expensive. Very expensive. Second, the pressure will grow to accept performance results as they appear on the live system, not on the test one. Third, the 4x figure is based on experience, not hard data. After all, you're buying four times the system that supposedly runs your application well. Only a lot of experience and, if they exist, lessons learned from past projects will make your case strong enough to justify the increased hardware price.

It's your risk, you manage it


Did I mention that before? Yes, as always, you're facing a balancing decision: spend more than you theoretically need, or face the performance problem later. In a business context, there's nothing that cannot be prevented or mitigated by adding money to it. It's just that sometimes it's too much money.

If you choose the traditional route, accept that at some point in time your monitoring will scream "tuning" at you. Set aside some budget for performance improvements; it will be much easier to justify.
