Wednesday, 22 October 2008

Optimizing MySQL data loads

Classic Mystery

Sometimes, performance problems can be like a classic mystery novel. Hey, a bit of fantasy sometimes helps to set the stage. Here we are, surrounded by the usual suspects, and pondering which one is the most likely to blame. And then the underdog detective comes up pointing to the least likely person to have committed the murder. Of course, all the police detectives that have been tracking the evidence scattered around the crime scene, changing their opinions all through the book about who was the most likely person to blame, shake their heads in disagreement and manifest their disbelief.

Of course, the characters themselves, had they read some books of this type before, would know that they were wasting time all along. Everything they've done so far has only created reasons to discard all the usual suspects and to almost reach the end of the book with no good solution to the problem.

But somehow, and perhaps with a bit of cruelty, they allow the strange person in the room to explain the logic behind his claim. What they really want is a bit of relief from all the tension and stress that having an unsolvable mystery on your hands creates. After all, they want to see that the rest of the world is as lost and clueless as they are, or even more. When the underdog starts his explanation, they even refute and joke about each fact he states, and about how absurd his reasoning is.

However, after the first laughs they shut up and listen, as facts begin to stack one on top of another, without any cracks in the logic. At some point, the audience seems mesmerized, following each sentence in silence, their minds trying to anticipate the next logical conclusion.

And suddenly, it all makes sense. Everyone is convinced that it must be right. Of course, when the real culprit faces detention, he (or she) either gives up and confesses or runs away.

Database tuning mysteries

Performance tuning, while less dramatic, sometimes has all the ingredients of a classic mystery novel. Well, except for the runaway (of course, there is always that possibility if one cannot reach the customer's performance targets, but it's a bit more professional not to bill them in that case). Also, there are no documented cases of databases escaping from their tuners. But the best tuning jobs are usually the ones where the best solution to the problem is the least expected one. Talk about lateral thinking.

Recently I engaged with a potential customer that wanted to improve the load times of a text file into a MySQL database. This potential customer had a very well defined problem: a text file containing roughly 110.000 records had to be loaded frequently into its database.

The crime scene

Data was loaded into the IMPORT_DATA table. For each record in that table, a check was made to see if the record met some conditions. If it did, the record was stored in the DATA_VERSIONS table, after looking it up first to see if there was an existing record with the same values. Depending on the result, either LATEST_VERSION was updated with the current timestamp to reflect the latest time this record was received, or a new record was added to LATEST_VERSION with the current timestamp and the primary key of the new record.

The net result is that all records ever received from the customer that met the conditions to be interesting were stored in the DATA_VERSIONS table, with a timestamp in the LATEST_VERSION table to reflect the last time they were loaded.

The baseline time I had for the basic operation was around 10 minutes. The customer wanted to optimize the load to be as fast as possible. He was talking about creating table partitions, using the BlackHole engine (more on that later), tuning memory allocation, index structure, whatever.

The data was loaded from a plain text file, one line per record, with fields separated by the pipe (|) character.

The essential clues

The way this whole process was done was curious, at least to me. Each time a row was inserted into the IMPORT_DATA table, a trigger was fired. The trigger in pseudocode read something like:

If RecordIsInteresting Then
    Look up record in DATA_VERSIONS
    If found Then
        Update LATEST_VERSION for the record's primary key with the current timestamp
    Else
        Insert record into DATA_VERSIONS
        Insert record into LATEST_VERSION with the current timestamp
    End If
End If

The RecordIsInteresting condition read more or less like FIELD_1 = 'A' AND FIELD_14 = 'B'.

False suspects

Mmmm... By this time, you'll probably already have a good mental model of the process. And also, if you've faced situations such as this one, I can almost hear your thoughts. Hey, let's check the look ups. Was the DATA_VERSIONS correctly indexed for the look up of the just loaded record? Was the LATEST_VERSION table indexed also for the existence check? Was the IMPORT_DATA table emptied before the load started?

A good start, then. But it turned out that the answer to all those questions was yes. Everything was correctly indexed, and the access plans looked optimal. I even learned that MySQL is pretty efficient with LOAD DATA ... REPLACE, as the timings were not that different from a plain, appending LOAD DATA. Even so, the data volumes loaded were well within the reach of the machine's CPU and RAM, so no resource seemed to be exhausted. Repeated tests with detailed checking of operating system statistics did not reveal any particular bottleneck. Machine resources were used efficiently, and the CPU was not sitting idle waiting for the disks to read the data.

So none of the usual first aid measures were going to help. This is where the fun really begins, as improving performance that way would not have been anything to write home about (much less a complete novel). But then, mystery novels have those characters there just as a vehicle for the reader to discover the essential facts about the crime.

The conventional thinking

Enter the expert and veteran detective who is looking at this case as his golden opportunity to become a media star. As clearly something has to be done to solve the mystery, the veteran takes a long, hard look at the problem. Of course, he examines all the false starts. And after a while, he has an idea. After all, why use a trigger to process the records one by one when all of them can be processed at once? This is a classic optimization technique that I've already covered. What's the point of doing it record by record if you can do all of them in a single pass?

So the expert sits down and writes something like

    Update LATEST_VERSION, setting TimeStamp = Now, for every interesting
    record in IMPORT_DATA already present in DATA_VERSIONS

    Insert into DATA_VERSIONS (and into LATEST_VERSION, with TimeStamp = Now)
    every interesting record in IMPORT_DATA not yet present

With a smile, the expert disables the trigger on the IMPORT_DATA table, loads the data and executes the two statements. Before claiming victory, he checks the timings. Now he's down to 8 minutes! What has happened? He checks that the RecordIsInteresting condition is covered by an index (it is!). Not very proud of his achievement, but satisfied nonetheless as he has won where others have failed, he announces that his improvement delivers 20% better performance.

Lateral thinking

Enter the underdog. Great, he says, at least we know something. Wait, what exactly do you mean? - answers the expert. Of course, we know that MySQL triggers are pretty efficient. It's only a 20% overhead, which means that MySQL developers have done quite a good job in their trigger runtime. But, have you asked yourselves if the customer is going to do anything with the IMPORT_DATA information after the load?

- Why? - answers the expert. Of course, he does not want to do anything with the IMPORT_DATA table. That's why he was thinking of using the BlackHole engine for this table, since his only interest was in the data during the trigger execution. Of course, my optimization will not work with the BlackHole engine, as it depends on the data being available to be processed in a single pass.

Realizing what he had just said, the expert recreates the IMPORT_DATA table with the BlackHole engine and tests the original data load timing. Well, it now takes 9 minutes to load, meaning that he's still able to claim his 20% optimization. Relieved, he then proceeds to prove to the underdog that the time it takes MySQL to actually store the data is not that relevant after all.

- Oh, yes, I see, so the IMPORT_DATA information is useless after the load. Can I do a quick check? - asks the underdog.

The underdog bangs away at the keyboard for ten minutes or so, then he claims - See, you can load the data in 1 minute. That's it.

The expert is amazed. At first, he does not believe it. He checks and rechecks the resulting LATEST_VERSION and DATA_VERSIONS tables and everything is correct. Well, now, please show us what amazing MySQL technique you've used, he says.

The underdog is, at heart, a shy person, so it takes him a moment to sort out his argument. It's not that simple, he says. I had to check how many interesting records were in the table. As it turns out, only 10.000 out of the 100.000 records loaded met the RecordIsInteresting condition. So I just said to myself, what if I don't load the other 90.000 at all?

- But how do you know which record is interesting or not without loading them?, asks the expert
- Well, I wrote this awk program to filter the file before loading it.

# Fields are separated by |
BEGIN { FS = "|" }

# The first line holds the headers; keep it so the MySQL LOAD DATA
# statement does not need to change
NR == 1 { print; next }

# Keep only the interesting records
($1 == "A") && ($14 == "B") { print }

- The awk program takes two seconds to execute and creates a file with just the 10.000 interesting records. If you load that file instead of the original one, the results are exactly the same. The only difference is that you're saving MySQL from reading a record from a text file, splitting the line into field values, filling the necessary temporary buffers, firing a trigger and evaluating a condition 90.000 times. You only use MySQL to process the records you already know are interesting.
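Wiring the filter into the load is a one-liner of shell glue. Here is a minimal, self-contained sketch of the idea; the file names are made up for illustration, the sample data has only the two fields the condition cares about filled in, and the mysql invocation (connection settings, table and column layout all hypothetical) is shown commented out since it needs a running server:

```shell
#!/bin/sh
# Build a tiny sample extract in the pipe-delimited format described above:
# a header line, one interesting record and one uninteresting one.
cat > import_data.txt <<'EOF'
FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|FIELD_12|FIELD_13|FIELD_14
A|x|x|x|x|x|x|x|x|x|x|x|x|B
C|x|x|x|x|x|x|x|x|x|x|x|x|D
EOF

# Keep the header plus the records matching the RecordIsInteresting condition
awk 'BEGIN { FS = "|" }
     NR == 1 { print; next }
     ($1 == "A") && ($14 == "B") { print }' import_data.txt > interesting.txt

wc -l < interesting.txt   # header + the single interesting record

# Then load only the filtered file (hypothetical connection and table names):
# mysql -u loader -p mydb -e \
#   "LOAD DATA LOCAL INFILE 'interesting.txt' INTO TABLE IMPORT_DATA
#    FIELDS TERMINATED BY '|' IGNORE 1 LINES"
```

The point is that the expensive per-record work now happens only for records that will actually be kept; awk streams through the rest at I/O speed.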

... end of the story. The real mystery is that my real customer never heard of this solution, as I lost contact with him just after we reached an agreement on pricing. I don't know where he is, and I hope he doesn't mind that I used his problem as an excuse to poke fun at classic mystery novels.

PS - I had very good fun when I was young reading those novels. And I must admit that I never, ever was able to guess who was guilty in advance.
PPS - of course names have been changed to protect the innocent

Wednesday, 15 October 2008

The Google Application Engine

In the beginning, there was timesharing

In the beginning, there was timesharing. This was in the early, early days of computing, before most of us were even born. At the time, computers were big, expensive, complex and extremely difficult to operate. Not to mention underpowered in comparison with, say, a domestic media player. Since most businesses did not have the resources (read: money) or even the need to use one of those monsters full time, you basically rented time on the machine. This model of using applications was perceived as a win-win, as the customer did not have to pay the up-front costs of maintaining all that infrastructure and the supplier could maximize its return on what was, at the time, a significant investment.

Over the years, as demand for computers increased and prices went down, more and more businesses found that it was profitable in the long run to own and operate the monsters themselves. Thus, the timesharing business model gradually faded, although some parts of it survived. Especially for the big mainframe iron, where IBM still continues to "rent" its software and "cap" the CPU cycles that a modern mainframe uses based on a monthly fee.

I'm not sure if this model is still used literally. I'm sure that 10 years ago it was still in use (does anyone remember the defunct Comshare company?). But as with most everything in the computer industry, the model has not survived literally; it has morphed over time while keeping the same conceptual underpinnings.

Cloud Computing

"Cloud Computing" is the latest industry buzzword. Basically, it means that the applications you use are not running on hardware you own; you don't even know what hardware is used. It also means that the provider of those applications has to worry about availability, scalability and other concerns that used to be the responsibility of the company's IT department. It also means that the provider can take advantage of economies of scale, for example by transparently reusing its machine resources to support many different customers across different timezones. It can also spread its machines geographically to save on communication costs and improve response times, and apply other optimizations.

So, are we seeing the return of timesharing? Some would say that yes, this is essentially the same model applied in the 60s, only using bigger, more powerful and cheaper machines. And yes, this is a repeat of the SaaS (Software as a Service) wave that was the latest industry buzzword five years ago. Of course there are a lot of similarities, but there are a couple of points that make things different now.

First, companies adopting cloud services are making a choice between their current established infrastructure and the supplier offerings. They have experience, can compare costs and are much more conscious of the trade-offs, limitations and opportunities of each model. This was not the case with timesharing, since the costs were so high that no one could really afford to supply the same computing resources internally. This time it's a choice, not an obligation.

Second, some software categories have become "commoditized". For some of them (mail clients, office productivity apps, even CRM) one can choose from different offerings, and there is a clear price/features relationship. The timesharing of the past was almost always done on either highly specialized number-crunching applications (hardly classifiable as commodities) or on completely custom-built business software.

The Google Application Engine (GAE)

In the past months we've seen the launch of Amazon EC2, a notable offering that is proving to be a cost-effective alternative to housing or hosting providers and also has the resources to scale should the needs of your application increase. As a counter-offering, Google has launched its beta Application Engine. The main difference is that Amazon has split its offerings into infrastructure categories, such as a storage service (S3) and a "super hosting" service (EC2), whereas Google's offering is centred around an application hosting environment that includes more built-in facilities, but is also more restricted than Amazon's.

Basically, Google offers to share its enormous infrastructure, for an as yet undisclosed price. The advantage is obvious: Google has already proved that it can scale to a massive number of users and store massive amounts of data without disruptions and without degrading the quality of the service provided to its users. They have been developing and refining their own technology stack for years, all of it with an eye on massive scalability. It's very unlikely that anybody outside Yahoo, Amazon or Microsoft can match that, much less with limited resources in comparison with what Google can put together.

Google has not yet provided all the details, and the service is still in beta. But some of the main points of the offering are already apparent. There are already lots of reviews where you can find its strengths and weaknesses, written by people much more knowledgeable than me about web development. But for me, the interesting part is the storage engine. Google provides an abstraction called the "datastore", that is, an API that allows programs hosted on the App Engine servers to store and retrieve data. You have no other means of accessing the persisted data than the datastore API, so you had better familiarize yourself with it before delving further into any application design.

Google datastore, limited?

The most striking feature of the datastore is that it is neither relational, nor does it try to be. It's a sort of object-oriented database, where Python (the only language the App Engine supports so far) objects defined in terms of Python classes are stored and retrieved. Optionally, indexes can be defined on properties, but an index only indexes a single property. Primary keys are usually automatically generated, and relationships between objects are handled via "Reference" types.

The datastore can then be queried and objects retrieved. The Google Query Language (GQL) has familiar SQL syntax, but beyond the syntax it is a completely different animal. It neither allows joining entities nor using very common combinations of conditions in its WHERE clause. There are no aggregate functions (COUNT, SUM) and there is a limit of 1000 objects per result set.

Is this a rich environment for building data-intensive applications? It all depends on your idea of "rich" and "data intensive". Of course, all the missing features can be worked around in some way or another. But, and this is the difficult part, they probably should not be. By that I mean that anyone can write iterative code to perform the joins, aggregations and whatever else you find missing from GQL, but that does not mean you should do it.

For one thing, the GAE has strict limits on the CPU cycles a request can consume, so you'll probably use them up quickly if you write too much code. Those limits are there to protect Google's shared resources (nobody likes their machine cycles being sucked up by the guy next door). So joining entities as if you were joining tables is out of the question. Same for counting, aggregating and other time-consuming operations.

Scaling in a scale that you cannot even start to imagine

All those limitations are, at the very least, irritating. Some of them even make some kinds of applications completely unfeasible. At first, I felt disappointed by GAE. However, after browsing some more documentation I soon realized what GAE is all about.

Some background is in order: I live and breathe business applications. By that, I mean that I'm used to the scale of ERPs or data warehouses that span, at best (one would sometimes say at worst), thousands of users. That's a formidable challenge in itself, and one that takes a dedicated army of engineers to keep running day to day. The usual mid-sized system, aimed at a medium-sized business or a department inside a big corporation, never tops 150 users. And I've devoted a significant part of my professional efforts to studying those environments and making them perform better.

Now, Google plays in a different league. They provide consumer applications. While business users may number in the thousands, consumers are in the millions range. That requires a completely different mindset when planning and scaling applications. The GAE restrictions are not there just because Google engineers are timid or have developed an incomplete environment. They are there because GAE can make your application scale very big. Awfully big. With that in mind, you can start to explain to yourself why the GAE datastore has so many restrictions. The features you miss are luxuries that you can only afford when your user base is limited to at most a few thousand users. If you want to scale beyond that, you have to give up on some things.

Consumer applications are different

In short, don't expect to see an ERP built atop GAE any time soon. It's not the ideal environment for that. But it's equally unlikely that there will ever be consumer-level demand for your own ERP, isn't it? Think of GAE then as an ideal platform for consumer-level applications. Then it starts to make sense. Think of Amazon EC2 as a super-scalable hosting service, where you can basically make your application scale in terms of hardware, but you're going to be on your own when deciding how to make that hardware perform.

Even so, GAE and related environments bring a new set of challenges, at least for me, in terms of how to efficiently store and retrieve data. I'll keep posting my findings, and I'm sure it will be a lot of fun to explore and discover.