Wednesday 22 October 2008

Optimizing MySQL data loads

Classic Mystery


Sometimes, performance problems can be like a classic mystery novel. Hey, just a bit of fantasy sometimes help to get the stage. Here we are, surrounded by the usual suspects, and pondering which one is the most likely to blame. And then the underdog detective comes up pointing to the least likely person to have committed the murder. Of course, all the police detectives that have been tracking the evidence scattered around the crime scene and changing its opinion all the time about who was the most likely person to blame during the book, nod in disagreement and manifest their disbelief.

Of course, the characters themselves, had their read some books of this type before, would know that they were wasting time all along. Everything they've done so far has only created reasons to discard all the usual suspects and almost reach the end of the book with no good solution to the problem.

But somehow, and perhaps with a bit of cruelty, they allow the strange person in the room to explain the logic behind. What they really want is to have a bit of relief from all the tension and stress that having to face an unsolvable mystery in your hands create. After all, they want to see how the rest of the world is as lost and clueless as they are, or even more. When you start your explanation they even refute and joke on each fact that you state, and about how absurd your reasoning is.

However, after the first laughs they shut up and listen, as facts begin to stack one on top of another, without any cracks in the logic. At some point, the audience seems mesmerized, following each sentence in silence, their minds trying to anticipate what is going to be your next logical conclusion.

And suddenly, it all makes sense. Everyone is convinced that should be right. Of course, when the real culprit faces detention, he (or she) either gives up and confesses or runs away.

Database tuning mysteries


Performance tuning, while less dramatic, sometimes has all the ingredients of a classic mystery novel. Well, except for the runaway (of course, there is always the possibility if one cannot reach the customer performance targets, but it's a bit more professional to not bill them in that case) Also, there are no documented cases of databases escaping from their tuners. But the best tuning jobs usually are the ones where the best solution for the problem is the least unexpected one. Talk about lateral thinking.

Recently I engaged with a potential customer that wanted to improve the load times of a text file into a MySQL database. This potential customer had a very well defined problem. A text file containing roughly 110.000 records had to be loaded frequently on its database.

The crime scene


Data was loaded into the IMPORT_DATA table. For each record on that table, a comparison was made to check if the record met some conditions. If true, the data record was stored in the DATA_VERSIONS table, looking it up first to see if there was an existing record with the same values. In each case, either the LATEST_VERSION was updated with the current timestamp of the day to reflect the latest time this record was received or a new record was added to LATEST_VERSION with the current timestamp and the primary key of the new record.

The net result is that all records ever received from the customer that met the conditions to be interesting were stored in the DATA_VERSIONS table, with a timestamp in the LATEST_VERSION table to reflect the last time they were loaded.

The baseline time I have from the basic operation was around 10 minutes. The customer wanted to optimize the load to be as fast as possible. He was talking about creating table partitions, using the BlackHole engine (more on that later), tuning its memory allocation, index structure, whatever.

The data was loaded from a plain text file, one line per record, with fields separated by the pipe (|) character.

The essential clues


The way all this process was done was curious, at least for me. Each time a row was inserted into the IMPORT_DATA table, a trigger was fired. The trigger in pseudocode read something like:


If RecordIsInteresting Then
    Look up record in DATA_VERSIONS
    If found
        Update LATEST_VERSION for the primary key record with timestamp
    Else
        Insert record into LATEST_VERSION
     End If
End If



The condition at RecordIsInteresting read more or less like FIELD_1 = 'A' AND FIELD_14 = 'B'

False suspects


Mmmm... By this time, you'll probably already have a good mental model of the process. And also, if you've faced situations such as this one, I can almost hear your thoughts. Hey, let's check the look ups. Was the DATA_VERSIONS correctly indexed for the look up of the just loaded record? Was the LATEST_VERSION table indexed also for the existence check? Was the IMPORT_DATA table emptied before the load started?

A good start, then. But it turned out that the answer to all those questions was yes. Everything was correctly indexed, access plans looked optimal. I even learned that MySQL is pretty efficient using the LOAD DATA ... REPLACE, as the timings were not that different versus LOAD DATA ... APPEND. Even so, the data volumes loaded were within the reach of machine CPU and RAM so no resource seemed to be exhausted. Repeated tests with detailed checking of operating system statistics did not reveal any particular bottleneck. Machine resources were used efficiently, and the CPU was not waiting idle for the disks to read the data.

So all the usual first aid measures were not going to help. This is were the fun really begins, as improving performance in this way would not be anything to write home about (much less a complete novel) But then, mystery novels had those characters there just as a vehicle for the reader to discover the essential facts about the crime.

The conventional thinking


Enter the expert and veteran detective that is looking at this case as his golden opportunity to become a media star. As clearly something has to be done to solve the mystery, the veteran takes a long and hard look at the problem. Of course, he examines all the false starts. And after a while, he has an idea. After all, why use a trigger to process the records one by one when all of them can be processed at once? This is a classic optimization technique that I've already covered. After all, what's the point of doing it record by record if you can do all of them in one single pass?

So the expert sits down and writes something like



INSERT INTO DATA_VERSIONS
SELECT
    *
FROM
    IMPORT_DATA
WHERE
    RecordIsInteresting
ON DUPLICATE KEY UPDATE
    TimeStamp = Now

INSERT ON LATEST_VERSION
SELECT
    *
FROM
    IMPORT_DATA
WHERE
    RecordIsInteresting
ON DUPLICATE KEY UPDATE
    TimeStamp = Now



With a smile, the expert disables the trigger on the IMPORT_DATA table, loads the data and executes the two sentences. Before claiming victory, he checks the timings. Now, he's down to 8 minutes! What has happened? He checks that the RecordIsInteresting condition is covered by an index (it is!) Not very proud of his achievement, but satisfied nonetheless as he has won were others have failed, he announces that his improvements deliver 20% better performance.

Lateral thinking


Enter the underdog. Great, he says, at least we know something. Wait, what exactly do you mean? - answers the expert. Of course, we know that MySQL triggers are pretty efficient. It's only a 20% overhead, which means that MySQL developers have done quite a good job in their trigger runtime. But, have you asked yourselves if the customer is going to do anything with the IMPORT_DATA information after the load?

- Why? - answers the expert. Of course, he does not want to do anything with the IMPORT_DATA table. That's why he was thinking of using the BlackHole engine for this table, since his only interest was in the data during the trigger execution. Of course, my optimization will not work with the BlackHole engine, as it depends on the data being available to be processed in a single pass.

Realizing what he had just said, the expert creats the IMPORT_DATA table with the BlackHole engine and tests the original data load timing. Well, it now takes 9 mins to load, meaning that he's still able to claim his 20% optimization. Relieved, he then proceeds to probe the underdog that the time it takes MySQL to actually store the data is not that relevant after all.

- Oh, yes, I see, so the IMPORT_DATA information is useless after the load. Can I do a quick check? - asks the underdog.

The underdog bangs away at the keyboard for ten minutes or so, then he claims - See, you can load the data in 1 minute. That's it.

The expert is amazed. At first, he does not believe it. He checks and rechecks the resulting LATEST_VERSION and DATA_VERSIONS tables and everything is correct. Well, now, please show us what amazing MySQL technique you've used, he says.

The underdog is, at heart, a shy person. So it takes him a moment to sort out his argument. It's not that simple. I had to check how many interesting records were in the table. As it turns out, only 10.000 out of the 100.000 records loaded met the RecordIsInteresting condition. So I just said to myself, what if I dont' load those at all?

- But how do you know which record is interesting or not without loading them?, asks the expert
- Well, I wrote this awk program to filter the file before loading it.



# Records are separated by |
BEGIN { FS = "|" }

# First line is for headers, keep it to not having to change the MySQL LOAD DATA statement
NR == 1 { print $0 }

($1 == "A") && ($14 == "B") { print $0 }



- The awk program takes two seconds to execute and creates a file with just the 10.000 interesting records. If you load that file instead of the original one, the results are exactly the same. The only difference is that you're saving MySQL to read a record from a text file, splitting the line into field values, filling the temporary buffers necessary, firing a trigger and evaluating a condition 90.000 times. You only use MySQL to process the records you already know that are interesting.

... end of the story. The real mistery is that my real customer never heard of this solution, as I lost contact with him just after we reached an agreement on the pricing. I don't know where he is, and I hope that he does not mind if I use his problem as an excuse to poke fun on classic mystery novels.

PS - I had very good fun when I was young reading those novels. And I must admit that I never, ever was able to guess who was guilty in advance.
PPS - of course names have been changed to protect the innocent

1 comment:

  1. Thank you so much for this nice information. Hope so many people will get aware of this and useful as well. And please keep update like this.

    Big Data Consulting Services

    Data Lake Solutions

    Advanced Analytics Services

    Full Stack Development Solutions

    ReplyDelete