Thursday 10 July 2008

Procedural programming and SQL don't mix well

Some of the biggest performance problems I've seen came from perfectly reasonable decisions made from fairly intelligent people. Specially, good developers coming from procedural programming languages trying to adapt the paradigms they know to the SQL language. I think that there's a strong parallelism between these cases and some of the most cited problems with people new to object oriented programming: at first, they program with the OO language but they are not really using objects, creating instead one big class comprising 90% of application functionality that essentially contains a well structured procedural program.

Data sets versus control flow

At its hearth, SQL is a set oriented language. Its main constructs and the philosophy behind them is that you're manipulating sets of rows. Years of research have been invested so that you are as abstracted as far as possible from how the database actually does that, and to do that in the most efficient way. You just write the columns that you want, the relationships between your tables and the conditions that you want to meet and off it goes. Behind that there are sophisticated tuning engines, monitors, clusters, load balancers and god knows what else, all looking to give you the results in the fastest possible way without you having to know how they actually are performed. Years of research and tons of money have been throw at the problem of how to do that in the most efficient and safe way.

In sharp contrast we have the classic procedural languages. Basically, those are a set of constructs (data types, structures, control flow) that allows you to exactly specify to the machine the instructions you want to execute. Notice that the procedural programming language is not concerned with why you want to do this or that, much less with the correctness of your steps. It just wants a set of instructions to execute on a set of data. Over the course of time and thanks to some very bright people, procedural languages have been improving, trying to make the job of the programmer easier. Mostly by abstracting, encapsulating and hiding the parts and details of the problem not relevant to your main concern.

Those are very different concepts. On one corner, a language designed only for data manipulation in the form of sets of rows with columns and relations amongst them. On the other, a set of primitive instructions that, like Lego blocks, allow you to do anything.
Needles to say, procedural languages are very powerful. They allow you, depending on the level at which they operate, to specify all the details of what you want the machine to do. They are the lowest level of abstraction, and with that comes the price of complexity. Your domain specific language sacrifices flexibility and applicability giving you in exchange much more productivity as well as a view of the world that more closely matches the problem you're trying to solve.
SQL is a domain specific language, its domain being the manipulation of sets of rows with columns that have relationships amongst them. It happens to fit a lot of scenarios, specially business automation problems where machines are replacing and augmenting manual, well specified processes. But SQL sacrifices detail, from the concrete operation of the database to its inability to perform seemingly simple tasks. How do you open a window in SQL? How do you print a report in SQL? The answer is, you don't. Well, that's not strictly true, some extensions of SQL can allow you to do such things, but they are not really SQL. They are hacks built into the language so that managers and programmers convince themselves that they don't need to learn another language.

In fact, they are really learning to fit two paradigms in the same language, but marketing is calling it the same as what they already know so that it's easier to sell them the whole thing.

Some examples


The following are real world examples of these types of problems. I've found them everywhere, from SAP to custom built J2EE applications. Bear in mind, those are general rules and you'll sometimes have a valid exception for them. But surely it's not going to be the first, second or third time you encounter this. Exceptions to these rules are very, very uncommon.

Loops versus single SQL statements

Ignoring transaction isolation levels for a moment, you will agree that the following pieces of code are equivalent:

FOREACH R IN (SELECT * FROM T WHERE T.B=2)
UPDATE T SET T.A = 1 WHERE = R.
LOOP

UPDATE T
SET T.A = 1
WHERE T.B = 2;

And yes, they do the same thing. But the seasoned procedural programmer delving into SQL tends to write the first version, because of the feeling of having complete control. But, and this should not come as a surprise, the second version usually performs much better, or at worst on par with the first one, is shorter and easier to maintain. This, taken to the extreme, makes people write:

A = MIN_INTEGER;
FOREACH R IN (SELECT VA FROM T)
IF R.VA > A THEN A = R.VA;
LOOP;

instead of

SELECT MAX(VA) INTO A FROM T;

and this time, believe me, for enough rows in T the performance difference is spectacular. Of course, in favor of the SQL statement.

Function calls in WHERE clauses

The following two pieces of code are also equivalent

SELECT * FROM R WHERE SQRT(R.A) = 2;

SELECT * FROM R WHERE R.A = 4;


But as you have guessed from the subject, the second is usually much faster, because you can use indexes to look up the values of R.A directly. Oh, yes, the latest version of your favorite database has function-based indexes, I hear you say. Well, what if you need to look up values based on the results of three different functions? Are you going to create an index for each and every different operation you perform? Function-based indexes are one of those features that can save you from disaster but only if you're one of the selected few that actually needs them, but are not useful on the general case.

Function calls versus in-line calculations

This one has already been mentioned in this blog. It is very nice to have a function like getCustomerCredit(CustID) that calculates the credit rating for a single customer based on its order history. That way, your procedural code that deals with setting up the screen display for the order entry can just do a lCustCreditLabel.Value = getCustomerCredit(custID). But when the procedural programmer is tasked to create a printed report of customer ratings, the first version is always:

SELECT custID, custName, getCustomerCredit(custId)
FROM customers....

Probably getCustomerCredit looks somewhat like

SELECT Customer.creditLimit - SUM(orders.OrderValue)
FROM customers, orders
WHERE customers.custId = :custId AND orders.custId = :custId AND orders.stat = 'OUTSTANDING';

Instead, the following alternative will be far, far more efficient

SELECT custId, custName, creditLimit - SUM(orders.OrderValue)
FROM customers, orders
WHERE customers.custId = :custId AND orders.custId = :custId AND orders.stat = 'OUTSTANDING';
GROUP BY custId, custName
;

Replacing built-in SQL functionality

This is probably a repeat of all previous ones together. When something can be done with raw SQL, it is almost always faster and better to do it in raw SQL. Fancy frameworks may hide some of the complexity, but will make you paying a price. And this price is usually performance.

You can have some structure in your SQL


As you have seen in the examples above, the more efficient alternatives are breaking structured programming, abstraction, detail hiding and probably some other principles of structured programming in some way or another. After you've worked long and hard to encapsulate the business rules in a way that avoids repetition and hides complexity, I'm basically telling you to throw them away in favor of better performance. I don't have an answer, and I think that nobody has, to this dilemma. Possibly because SQL was not designed to abstract business rules or abstract data structures beyond tuples (rows) with attributes (columns)

That's not to say that you cannot have any kind of structure in your SQL. Quite the opposite, you should strive to have it. There's nothing worse and unmantainable than an application that is hand crafting SQL sentences and sending them to the database to execute for each little thing it does. But always keep in mind that abstractions are useful as long as they pay off. Don't abstract breaking the assumptions made at either language desing or database engine development.
For example, it's very common and good practice to abstract business rules (data manipulation and validations) in stored procedures.

I'm not advocating cut and paste programming or that you should not try to be as structured as possible in the coding and design or your applications. But you really have to be aware of the trade offs, and above all, don't try to reinvient the wheel. The designers of your RDMBS spent probably a long time thinking about how to make their engine efficient and easy to use. Don't fight with them, take the red pill and do things its way. It's better on the long term.

1 comment:

  1. thanks for sharing. a DBA for a major corporation shared that he rewrote SQL code to perform in several hours vs >1/2 day

    ReplyDelete