Thursday, 9 June 2011

SELECT * is bad for performance



And I thought that this discussion was forgotten a long time ago. But yes, at some point in the distant past, I wasted some time reading the different arguments for using an attribute list versus using the wildcard in a SELECT statement.

At the time, those discussions about the performance differences did not made sense to me. In the grand scheme of things, the difference was so small at the time that anyone trying to improve performance significantly from these tricks was basically wasting time. Further, improvements on CPU speed have left the disk I/O in databases as the basic bottleneck, so there was little point in looking at that. The conclusion was: use whatever you want, it is not going to make a difference performance wise.

Besides the nonexistent performance implications, the discussion had an interesting twist: from a maintenance point of view, both sides were arguing that their approach was better. One side said "SELECT * is bad because you don't know what you're including in the result set, so the day you remove an attribute from a table you're not going to detect the problem until you test the app" To which the other side said "yeah, but removing attributes is quite uncommon, what happens often is that you add them, and in those cases you not only need to touch the database table, but the query as well. Had you used SELECT *, everything will be included in your result set by default"

Not being involved too much in maintenance work, I forgot this discussion a long time ago. But then I ran into a case where SELECT * can hurt performance. A lot.

Picture this. Developer A creates the classic company directory application. Naturally, there is a EMPLOYEE table that contains the necessary information, such as company ID, start of employment, first and last names and so on.

All is well. Application passes all tests and goes live. The application performs well and everyone is happy with it, so happy that they decide to expand its functionality a bit. One of the best ideas comes from HR: add a photo to the employee profile. With all the remote working and teams scattered across countries, it is always good to put a face to a name.

Developer B is assigned to the task. The obvious place to add this content is in the EMPLOYEE table, and provide some end user facility to upload a picture. So he adds a picture blob field to the EMPLOYEE table.

Again, all is well. Application passes all tests and goes live. Everyone starts uploading their photos to their profiles, with the usual competitions to look good or funny in your personal profile. Photos are usually coming from high end mobile phones or digital cameras, so they are usually bigger than 1MB in size. After all, who wants to post a crappy picture of himself just because you're too lazy to use more than your web cam?

However, as time passes the application becomes slower and slower. Simply listing employees or searching for one takes ages, whereas in the pasts was nearly instantaneous. Developer A denies all responsibility, everything was working fine before Developer B touched the application, and there has not been a significant increase in the number of employees to justify such performance differences.

DBAs and sysadmins all deny making any changes in the system. Except for the expected database growth in size, due to the uploaded photos, nothing has been changed. Furthermore, database machine and web servers do not show any capacity exhaustion of memory or CPU  resources.

Developer B is puzzled. He cannot see how his changes have introduced performance regressions, more so when he has not even touched any parts of the application written by Developer A, except of course the employee profile that now shows the photo, if exists. Developer B denies any responsibility on the performance regression. After all he has not touched any of the functionality that has become slow over time. But there is a clear relationship between his changes and the application slow down, as before he touched it everything was fine. So everyone points their fingers to his changes as the cause of the problems. No doubt that his changes are somewhat related to the performance problems, but how?

The answer to the quiz, as you may already have guessed, lies in the way Developer A implemented database access to the EMPLOYEE table. SELECT * was fine to list the employee names and to render the employee profile. But now that the picture blob is part of the EMPLOYEE table, SELECT * retrieves the photo from the database. Each and every time, even if it does not need it. What was in the past a KB sized result set, for example to list employees, has become a multi megabyte affair that has to travel from the database to the application each time this SELECT * is executed.

Developer B realizes that, with the help of a database performance tuning expert of course (shameless plug), fixes the SELECT * statements and application performance comes back to its previous levels. Developer A hides in a cave to avoid embarrassment, and everyone is happy.

So leave SELECT * for your database console sessions. Applications should always ask explicitly for what they want. Otherwise you can inadvertently introduce problems that are going very hard to track.

No comments:

Post a Comment