Sunday 31 October 2010

Unicode explained to the younger developers



I recently had a question from a developer about the different ways Oracle has to declare a string column. He was specifically confused by the difference between VARCHAR2(xx), VARCHAR(xx) and VARCHAR2(xx CHAR).

Short answer: forget about VARCHAR and always use VARCHAR2. The difference between the other two is in how they handle Unicode characters. VARCHAR2(xx) has room for xx bytes, whereas VARCHAR2(xx CHAR) has room for xx characters. If you're writing new code, use VARCHAR2(xx CHAR); if you're maintaining legacy code, use VARCHAR2(xx). The legacy code will have issues dealing with Unicode, but you're fooling yourself if you think that switching to VARCHAR2(xx CHAR) will improve the situation, because other places in the code are likely assuming that VARCHAR2(xx) either has some way of dealing with Unicode or none at all. Either way, when those parts use your tables, they will likely not understand properly what's stored there.
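
To make the difference concrete, here is a minimal sketch of the two declarations side by side, assuming a database created with the AL32UTF8 character set and the default byte length semantics (the table and column names are made up for illustration):

    -- Hypothetical table on an AL32UTF8 database with the default
    -- byte length semantics (NLS_LENGTH_SEMANTICS = BYTE).
    CREATE TABLE customers (
        -- Room for 50 bytes: a value full of multi-byte UTF-8 characters
        -- can overflow the column (ORA-12899) well before 50 characters.
        name_legacy   VARCHAR2(50),
        -- Room for 50 characters, however many bytes each one takes.
        name_unicode  VARCHAR2(50 CHAR)
    );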

While the question was interesting in itself, he was puzzled by the answer. His reply was, "that's all well and good, but why are there so many different ways of declaring what is simply a string?"

Here's one of those moments where experience weighs in and you can indulge in a bit of history, while your interlocutor, who feels that your answer at least deserves some gratitude, listens to your explanation anyway. So here it goes.

In the dawn of time, character sets had 7 or 8 bits: ASCII, EBCDIC or some ASCII variant. In those ancient times, 8 bits were enough to store a character. Maybe you had some bits to spare, especially if you were using English characters, but 8 was enough. The only problem for applications was knowing which character set they were using, but that was usually easy to solve. Did I say the only problem? No, there was another, much bigger problem. 8 bits were enough to represent most of the Western world's characters, but not enough to represent all the characters at the same time.

That meant that if your application had to deal with one character set, you were fine. If it had to deal with many different languages at once, then you were in trouble. What could you do, store alongside each string the character set it was using?

There was no good solution for this problem. But using 8 bits for everything had a lot of advantages. Each character was a byte. Millions of lines of code were written assuming sizeof(char) == 1. Copying, comparing and storing strings assumed that each character took one byte. The world was a stable place for almost everyone, except for the poor souls who had to maintain applications that worked with languages (Chinese, say) that needed more than one byte to represent a character.

Then came Unicode to save the world. The only problem is that, depending on the way you choose to represent Unicode, you may need more than one byte for each character. In some of the most popular, backwards-compatible Unicode encodings, you actually need a variable number of bytes to represent a character. Time to review all your string handling code. You can no longer assume that incrementing a pointer by one will get you the next character. You can no longer assume that a string needs as many bytes in memory as it has characters. You can no longer compare the byte values of each character to determine their sort order.
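
You don't even need to leave the database to see the mismatch between bytes and characters. A quick sketch, assuming again an AL32UTF8 database (the test string is just an example):

    -- 'España' is 6 characters but 7 bytes in UTF-8,
    -- because 'ñ' needs two bytes.
    SELECT LENGTH('España')  AS char_count,  -- 6
           LENGTHB('España') AS byte_count   -- 7
      FROM dual;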

Of course, if you're young enough, or have never had the curiosity to use C/C++ or FORTRAN, you've never seen this problem. Your handy string class provides everything you need to handle Unicode, wrapped in a nice package. The memory sizes of char[] and byte[] are different, but you essentially don't care about that.

Oracle, having been around since long before the Unicode days, is of course greatly affected by the change. Not because Oracle cannot adapt itself to Unicode, but because of the huge codebase it needs to remain backwards compatible with. That's why they invented VARCHAR2(xx CHAR). For them, it is the best way to support modern Unicode encodings while remaining backwards compatible.

Since VARCHAR2 is not standard SQL to begin with, extending its syntax further is not a big loss anyway. So next time you have to interface with legacy string data, think about how it is encoded.

Friday 15 October 2010

This code is crap



Helping customers get the most out of their systems is a good and interesting job. One gets to know a lot of industry sectors, processes and ways of working, as well as some good people.

Invariably, the job involves reading code. And writing code, usually a tiny fraction of the whole body, because you focus on the parts that provide the most benefit to your customer. Whenever I wrote code, I always tried to make it stand out for its clarity, performance and readability. Experience has taught me that having your code reused or called from a lot of places is a sign of a happy customer, so it made sense to do my best for the customer.

As you can imagine, one develops a good eye for reading code. And I've seen my fair share of bad code over the years. Very, very bad code. It is sometimes difficult to keep yourself neutral and resist the temptation to think of yourself as some kind of elite coder who can see what other people can't. With enough time and experience you learn that there are a number of factors, all of them human related, that can significantly affect the outcome of your work. Even if you tried to deliver your best effort, sometimes you were just under pressure to meet a deadline and had to rush something out. Or perhaps you had a child with a cold waiting for you at home, and that worried you much more than the quality or performance of what you were writing.

But that perception changed when I recently had to look at some code written five years ago.

At first, it was not that bad: at least the formatting was consistent. But it was pretentious and sophisticated beyond necessity. There were missed opportunities for simplification everywhere. Performance could easily have been doubled with a number of simple, apparently obvious changes. There were comments, but they were essentially useless because they were centered on irrelevant parts of the code. Design decisions were not explained properly; there was no summary explaining why on earth a certain algorithm was chosen or modified.

There were obvious refactoring spots all over the place. The code could have been much shorter, cleaner and more efficient.

I was becoming impatient with that code, and finding it progressively more difficult to sympathize with the original coder.

The only problem was, I was the one who wrote that horrible code.

As it turns out, that was what I considered good code five years ago. I was so shocked that I started to dig around in other, older fragments of my code. Ten years ago it was even worse. What I wrote fifteen years ago was practically unreadable.

Suddenly, I felt the urge to fix all of it. Then I realized that those lines are still happily executing many times every day, processing millions of transactions. And nobody has yet replaced them with something better, probably because they don't need to touch them.

It took me a while to recover, but then I think I learned something. I'm not as good a coder as I think I am. You're probably not a good coder either. Yes, there are good coders out there. Probably the club of good coders has members whose last names are Knuth, Kernighan, Aho, Torvalds, Duff or Catmull. But I'm not there. Not now, at least.

I'm now a good enough coder to recognize good code. Perhaps over time I'll get better, enough to be considered a good coder. In the meantime, what I deliver seems to be good enough.