Monday, 15 September 2008

Persistence engines

How things came to be this way

Anyone involved long enough in software development ends up discovering a truth that is long ago known by other craftsmen and engineers: each problem is best solved with the tool more appropriate to solve it. Only problem for software is, it is such a young discipline that is all the time trying to tackle new problems not previously solved. Thus, sometimes it's very difficult to know beforehand if a tool is going to be adequate or not until you actually try it.

All software development tools, from assemblers to compilers to scripting languages, are a testament of this constant evolution of software to meet the ever increasing requirements of solving new problems using computers. It's easy to see that over time, as long as the more generic tools are developed and consolidated, more and more tools are created to attack more and more specific problems.

Structured data persistence was one of the most often attacked problems since the inception of the first computers, so at some point (Codd et al?) invented the concept of the relational database. This was so successful that the concept has survived to this day and is used extensively as a standard means of storing and retrieving data.

Reuse, encapsulation, inheritance and polymorphism was another recurring problem that software projects, specially at large scale, had to face. Thus, the object oriented programming model was invented and expressed in a number of different languages. Some of the more successful ones (C++, Java, C#) are the standard means of developing software today, usually supported by a large number of pre-built components that provide a number of useful features that can be reused across many different projects.

Sometimes when facing a new problem, software designers even find worthwhile to create a new tool just for the purpose of solving that specific problem. That's how small (and sometimes big) scripting languages are born and a lot of other tools (parsers, code generators, etc) are created. Other kind of tools that are also in wide use are "Frameworks", which are collection of software components that work together to solve a problem. The distinction between frameworks and applications lies mainly in that frameworks are designed to be used in applications, but they don't usually perform any function directly useful to the business objective of a piece of software.

Frameworks are also means of abstracting complex subjects into more manageable pieces. Specially with object oriented languages, properly designed frameworks allow programmers to selectively use the parts that they are interested in without having to learn the whole thing at once.

So, in a perfect world, developers have a wide range of options to choose form when creating a new application. In practice, this is not the case and often business standardize around a set of tools that are used across projects. Three factors conspire together. First, the complexity of tools mean that learning them represents a usually significant investment in time. Second, the biggest the number of skilled resources available for a tool makes much easier and cheaper to hire people for your shop. Third, humans have a natural tendency to avoid change and thus once you find something that has a reasonable price and works for you, you tend to keep it forever.

The end result is that usually applications are built using a set of components consistently across the same or many applications. One of the latest category of frameworks that have been consolidated are the persistence frameworks (or engines) Those are a set of classes that make your objects get stored or retrieved from a relational database without you (the programmer) even being aware of all the exchange among the application server and the database.

The object-relational impedance mismatch

Wonderful as they are, persistence engines have a problem: they are designed to store objects into relational tables. Objects have methods and data put together, plus (depending on the language) visibility rules. Tables are just tuples of data with relationships defined between them. (Digression: now it's time for me to hide from object/database purists looking to kill me since I've oversimplified concepts tremendously, but for the purpose of this discussion it's enough) Now it's easy to see that somehow in the translation, but suffice to say that the relational model is a subset of the object model, in the sense that everything that can be expressed with tables and relations can be expressed with objects. But not the other way, as you cannot express methods or polymorphism with relational structures.

Notice that there are some Object Databases that do away with the persistence engine and allow you to store and retrieve objects without losing anything on the way back. But, as you've guessed, these are not relational. While these engines someday may enjoy popularity, they are not presently part of the standard tool set in the majority of shops. And that's not likely to change in the near future. So for now we're stuck with relational databases. There have been some attempts, most notably from Oracle, to introduce objects in the relational database. While the technology does have its merits, it has never been very popular and the rest of the major industry players have not followed suit.

The performance compromise: how bad is that?

As always, there is no free lunch. For the persistence engine to work as transparently as possible, there has to be some kind of compromise. You give something, you get something. Since this is a blog about database performance, you are probably expecting me now to tell you how bad is it. Well, surprisingly not too bad. Provided that you use the persistence engine for what is intended to be used.

Here is the list of things that persistence engines are intended to be:

  1. A way to abstract reading and writing objects to your database

  2. A way to map your database tables into objects

  3. A way to perform basic operations (CRUD: Create, Read, Update, Delete) on single instances of objects (read: database tables)

  4. A way to abstract business process rules in your object layer instead of doing it in your database layer. This means that things that "sales that are above x limit should be approved by manager, and those above 2x should be approved by Finance Director" are much better easily expressed in object methods than in SQL triggers

  5. Here is a list of things that persistence engines are not intended to be:

  6. A way to avoid the relational database model, SQL language, or basic principles of database design. Sorry, you'll have to concede some things to your object model, because it is stored in a relational database. Please don't believe claims like the ones of the EVL design, they are simply not true.

  7. A way to solve every database related problem under the sun. It's always best to choose a simple engine that solves 80% of the problem that a complex one that solves the 100%. The logic here is that the remaining 20% probably has a better tool around.

  8. A way to forget about performance, the persistence engine takes care of that, does it?

  9. A way to perform analytical queries that allows you to avoid writing SQL

Please do not disregard persistence engines as useful tools just because of points 5-8. The strengths of the first for make more than enough for them, as most applications have most of their bulk in the advantages. In short, properly used persistence engines can make your application much easier to write and will not cost you too much in terms of performance. But the engine will not solve all of your problems. Special mention should be made of so-called analytical queries (those that have SUM(), GROUP BY, etc) Those are specially badly done using persistence engines, as they were not designed to do that. Consider that:

SELECT SUM(sales) as TotalSales FROM ORDERS WHERE CustId = xxxx;

Is hundred times more efficient than:

resultSet = engine.Query("SELECT FROM ORDERS WHERE Customer = ", xxxx);
totalSales = 0
foreach( rec in resultSet ) {
totalSales = rec.Sales + totalSales

Chainsaws are not good for juggling

The golden rule to avoid performance problems with persistence engine is to apply the same logic that you would apply if you were using a hammer. I'm sure that when facing a screw, even if you've pounded thousands of nails before, you'll not even think of using the hammer, would you? The moment you feel that you're forcing a tool to do something it's not suited to do is the time to stop and reconsider your choice of tool for the task.

In short, avoid fighting with the tool. When you start fighting with the tool, it's a symptom that you should stop and think.