Thursday, 27 November 2008

The future of query optimizers

Future performance optimizations


While learning about the Oracle 11g invisible index feature, I really became aware of how complex modern cost based optimizers have become. Essentially, Oracle has added a feature that allows you to make an index invisible to its cost based optimizer (CBO).
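For readers who haven't played with it yet, the feature looks roughly like this; the table and index names below are invented for illustration, and only the INVISIBLE/VISIBLE syntax and the session parameter are the actual 11g feature:

    -- Create an index the CBO will ignore by default.
    CREATE INDEX orders_customer_ix ON orders (customer_id) INVISIBLE;

    -- Toggle visibility without dropping the index or losing its statistics.
    ALTER INDEX orders_customer_ix VISIBLE;
    ALTER INDEX orders_customer_ix INVISIBLE;

    -- A session can still ask the CBO to consider invisible indexes.
    ALTER SESSION SET optimizer_use_invisible_indexes = TRUE;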

This brought back fond memories of times when I had to revisit an already tuned statement because performance had mysteriously been dropping over the previous few days, or hours, without an associated increase in data volumes or system load.

In almost all cases, and after some investigation, I realized that the previous access plan for the statement was no longer optimal. This was always due to one of these: either the size of the result set was drastically different, the cardinality of some of the columns involved had changed, or the data volumes, while not having increased in a meaningful way, had invalidated some of the assumptions of the CBO. In all cases, the fix was either re-running the statistics collection utilities or rewording the statement, perhaps also adding optimizer hints, so that a different access plan was used.
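In Oracle terms, those two fixes usually boil down to something like the following sketch; the schema, table, index and bind variable names are placeholders, not anything from a real system:

    -- Refresh the statistics the CBO bases its estimates on.
    EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'ORDERS');

    -- Or nudge the CBO towards a specific access path with a hint.
    SELECT /*+ INDEX(o orders_customer_ix) */ o.order_id, o.total
      FROM orders o
     WHERE o.customer_id = :cust_id;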

And yes, having a feature back then that made indexes invisible to the CBO would have been nice. Very nice, in fact. But have we gone too far?

Too clever and too rigid?


At the time, I just accepted that this was a shortcoming of the CBO; for all its advantages, one could not reasonably expect the CBO to always get it right. That was what I was being paid for, after all. No system is perfect, and the Oracle CBO is especially good at picking the best access plan. Very seldom have I had to use optimizer hints or other tricks to force it to use a better alternative, and for most of the work in an application (the famous 80-20 rule) the CBO does a good enough job.

But recently I was involved in some work for a content recommendation system.

Initially it was only a performance problem, but after becoming familiar with the code and algorithms, I was able to improve the recommendation accuracy as well as pull off one of those spectacular "14 hour process converted into a 0.02 second online query" feats.

Now, I'm no expert in statistics; I have sometimes tried to understand all the complexities of the Netflix Prize submissions, and would need six months just to grasp the finer points of all the algorithms. But one thing became clear: all of them try to predict future trends based on the past. And the best algorithms even adjust themselves over time to correct their own mistakes. Of course, all of them reach a point where they cannot get better accuracy and their results stabilize. They will never reach 100% accuracy, as the mistakes they make when predicting are also the basis for improving their predictions.

Sorry for the digression; what does this have to do with database performance, after all? Somehow, after reading the Oracle specs, I was left with the impression that Oracle had created a fix for a problem that its CBO had created in the first place. By the way, if you check the documentation of other RDBMSs, for example MySQL and Postgres, you'll find that they don't have a similar feature (so if you're doing one of those feature list comparisons, here's one for Oracle). Even the fairly good, sometimes better than Oracle's, Postgres genetic algorithm for selecting an access plan does not allow you to selectively "hide" its indexes. In short, Oracle allows me to protect the CBO from itself. This led me to think: why is this necessary? Should the problem not have happened to begin with?

Tuning approach today


More or less, the standard way of approaching database tuning always starts with the 80-20 rule. Simply put, for 80% of the application, performance is perfectly acceptable. In those cases, nobody is willing to invest time in improving performance, because there is no real benefit in doing so. What we should not forget, however, is that the database could in fact be performing very poorly in that 80% in terms of resource usage. But as business priorities are what drives the focus on performance improvement areas, it does not make sense to look at them.
But, and this is important, that does not mean that we're not wasting resources there.

I predict that with the recent rise of application service providers and cloud computing, the utilization ratio of large server farms will become an important issue. Let's say that I have a cluster: if it is being used at 50% capacity and I can magically make it 10% more efficient, that really translates into a 10% improvement in my ability to serve more customers. Note that the 10% applies equally to the 80% that performs satisfactorily from a business point of view and to the usually problematic 20%.

But it is not practical to even try to tackle the 80%, simply because of the sheer size of the job. As applications grow in size and complexity, no single team of manageable size can even think about doing such a systematic analysis. Simply looking at thousands of access plans is not viable. We must trust the CBO; it's our only way of keeping up with growing application complexity.

The four traps with current optimizers


Access plan generation is one of the features that differentiate RDBMSs in the market, and one that is probably very complex to implement, as theoretical principles and real world experience must be merged to produce something that is "almost" always right. And by "almost" we should read "good enough in most cases". Unless you're dealing with really tight constraints, it's not that relevant whether a SQL statement takes 0.01 or 0.02 seconds to execute, as the difference usually does not have a major impact on overall system performance (again, of course, there will be extreme cases where this actually matters a great deal). What is important is that the database maintains a consistent, real world reflection of the complexity and data volumes that it is handling. One finds it reasonable that a statement dealing with small data volumes executes quickly, and that statements handling millions of rows complete in a reasonable amount of time.
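For reference, this is the kind of access plan I mean, as Oracle exposes it; the query, tables and columns are again just placeholders for the example:

    -- Ask the CBO what it intends to do, without actually running the statement.
    EXPLAIN PLAN FOR
      SELECT o.order_id, c.name
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
       WHERE o.order_date > SYSDATE - 7;

    -- Show the chosen plan: join order, index usage, estimated rows and cost.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);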

And current optimizers do quite a good job at that. But they fall into four traps.
  • The context trap. As I've said many times, optimizers don't know the business context of the data, nor the business priorities. So they cannot say "nah, no problem in not using this index this time; the system is not loaded at the moment, I have power to spare, and I'll finish a bit later than usual, but nobody will care" or "hey, it's really critical that we execute this process in two hours, go head over heels if necessary and make it faster". The on-site presence of some performance specialist is usually a symptom that a database has fallen into this trap.

  • The complexity trap. As seen in the Oracle example, sometimes they try so aggressively to optimize that they overlook better opportunities for optimization. The presence of optimizer hints is usually a tell-tale sign of this trap.

  • The change trap. An access plan is expensive to generate, and is usually cached to avoid having to pay the cost of generating it again. This means that the plan cannot change should the cardinality, data volumes, etc., change. The existence of features that hide, or avoid using, data structures whose only reason for existing is to improve performance is a symptom that the system has fallen into this trap (see the sketch after this list).

  • The 80-20 trap. As I've said, some of the most frequent operations can in fact be under-optimized, but as long as we don't get any feedback from system owners, we'll never notice or care.
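The change trap, at least, is easy to observe. A hedged example in Oracle terms: the shared pool will typically show a single PLAN_HASH_VALUE being reused across thousands of executions of the same statement, whatever happened to the data in between. The filter on the SQL text below is only there to narrow the example down:

    -- One row per cached cursor: note how one plan_hash_value keeps
    -- being reused for every execution until the cursor is aged out.
    SELECT sql_id,
           plan_hash_value,
           executions,
           buffer_gets,
           elapsed_time / 1000000 AS elapsed_secs
      FROM v$sql
     WHERE sql_text LIKE '%orders%'      -- placeholder filter
     ORDER BY executions DESC;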

Something is wrong here, and I just realized that optimizers should learn from content recommendation systems. As far as I know, the optimizers currently in use generate an access plan and then store it to be reused later. None of them try to verify that their assumptions were right, or that the resource consumption was in line with their expectations. In short, none of them learn from their mistakes.

Adaptive access plans


This, I predict, will be the next step in cost based optimizers: the Adaptive Access Plan. In the future, databases will still keep a log of the statements they execute, but they will also record how long each statement took to execute with a given parameter set. They will keep track of how many reads were made, and how many of them came from the memory cache. They will analyze that history and adapt their access plans to the data volumes, system concurrency factors, previous execution times and available resources.
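Some of the raw material for such a feedback loop already exists. As a hedged sketch of what "checking your own assumptions" could look like with today's Oracle tooling, you can compare the optimizer's estimated row counts with what actually happened (table and bind names are placeholders):

    -- GATHER_PLAN_STATISTICS makes Oracle collect actual row counts and
    -- timings per plan step while the statement runs.
    SELECT /*+ GATHER_PLAN_STATISTICS */ o.order_id
      FROM orders o
     WHERE o.customer_id = :cust_id;

    -- Compare estimates (E-Rows) with reality (A-Rows) for the last run.
    -- A large mismatch is exactly the kind of mistake an adaptive
    -- optimizer could learn from automatically.
    SELECT *
      FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));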

In that bright future, we'll see performance improve over time as the system adapts its access plans to changing data patterns. Databases will tune themselves, not merely suggesting different ways of running, but even running different access plans in parallel and choosing the one that performs best for future executions.

Will this be the end of the database tuner? Probably not. The context trap, unless some sophisticated Turing test compliant software is created, will remain the realm of human beings.

PS: I'm sure that there are people playing around with the idea of adaptive access plans, and some of them are even thinking of getting a patent granted on it. Consider this article as prior art.
