Sunday, 31 October 2010

Unicode explained to the younger developers



I recently got a question from a developer about the different ways Oracle has to declare a string column. He was specifically confused by the difference between VARCHAR2(xx), VARCHAR(xx) and, finally, VARCHAR2(xx CHAR).

Short answer: forget about VARCHAR, always use VARCHAR2. The interesting difference is in how the column handles Unicode characters. VARCHAR2(xx) will have room for xx bytes, whereas VARCHAR2(xx CHAR) will have room for xx characters. If you're writing new code, use VARCHAR2(xx CHAR); if you're maintaining legacy code, use VARCHAR2(xx). The legacy code will have issues dealing with Unicode, but you're fooling yourself if you think that using VARCHAR2(xx CHAR) will improve the situation, because other places in the code are probably assuming that VARCHAR2(xx) deals with Unicode in some particular way, or not at all. Either way, when those parts use your tables, they will likely not properly understand what's there.
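To make the difference concrete, here is a minimal sketch, assuming a database whose character set is AL32UTF8 and the default (byte) length semantics; the table and column names are made up for the example:

CREATE TABLE demo_strings (
  by_bytes VARCHAR2(5),       -- default semantics: room for 5 bytes
  by_chars VARCHAR2(5 CHAR)   -- room for 5 characters, however many bytes each takes
);

-- 'voilà' is 5 characters, but the à takes 2 bytes in UTF-8, so 6 bytes in total.
INSERT INTO demo_strings (by_chars) VALUES ('voilà');   -- fits
INSERT INTO demo_strings (by_bytes) VALUES ('voilà');   -- rejected with ORA-12899: value too large for column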

While the question was interesting in itself, he was puzzled by the answer. His reply was, "that's all well and good, but why are there so many different ways of declaring what is simply a string?"

Here's one of those moments where experience weighs in and you can indulge in a bit of history, while your interlocutor, who feels that your answer at least deserves some gratitude, listens to the explanation anyway. So here it goes.

In the dawn of time, character sets had 7 or 8 bits: ASCII, EBCDIC or some ASCII variant. In those ancient times, 8 bits were enough to store a character. Maybe you had some bits to spare, especially if you were using English characters, but 8 was enough. The only problem for applications was knowing which character set they were using, but that was usually easy to solve. Did I say the only problem? No, there was another, much bigger problem. 8 bits were enough to represent most of the Western world's characters, but 8 bits were not enough to represent all the world's characters at the same time.

That meant that if your application had to deal with one character set, you were fine. If your application had to deal with many different languages at once, then you were in trouble. What do you do, store alongside each string the character set it is using?

There was no good solution to this problem. But using 8 bits for everything had a lot of advantages. Each character was a byte. Millions of lines of code were written assuming sizeof(char) == 1. Code that copied, compared and stored strings assumed that each character took one byte. The world was a stable place for almost everyone, except for the poor souls who had to maintain applications that worked with languages (Chinese, for example) that needed more than one byte to represent a character.

Then came Unicode to save the world. The only problem is, depending on the way you choose to represent Unicode, you may need more than one byte for each character. In some of the most popular, backward compatible Unicode encodings (UTF-8, for example), you actually need a variable number of bytes to represent a character. Time to review all your string handling code. You can no longer assume that incrementing a pointer by one will get you the next character. You can no longer assume that a string needs as many bytes in memory as it has characters. You can no longer compare the byte values of each character to determine their sort order.
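You don't even need C to see the mismatch; the database where this whole question started will happily show both views of the same string. A minimal sketch, again assuming an AL32UTF8 database; LENGTH counts characters, LENGTHB counts bytes, and the customers table at the end is hypothetical:

-- 'voilà' again: 5 characters, 6 bytes in UTF-8.
SELECT LENGTH('voilà') AS char_count,     -- 5
       LENGTHB('voilà') AS byte_count     -- 6
FROM dual;

-- Sort order has the same issue: a BINARY sort compares byte values, while a
-- linguistic sort compares characters the way a reader would expect.
SELECT name
FROM customers                            -- hypothetical table
ORDER BY NLSSORT(name, 'NLS_SORT=FRENCH');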

Of course, if you're young enough, or have never had the curiosity to use C/C++ or FORTRAN, you've never seen this problem. Your handy string class provides everything you need to handle Unicode, wrapped in a nice package. The memory sizes of a char[] and a byte[] are different, but you essentially don't care about that.

Oracle, having been around since before the Unicode days, is of course greatly affected by the change. Not because Oracle cannot adapt itself to Unicode, but because of the huge amount of existing code it needs to remain backwards compatible with. That's why they invented VARCHAR2(xx CHAR). It is for them the best way to support modern Unicode encodings and at the same time remain backwards compatible.

Since VARCHAR2 is not standard SQL to begin with, extending its syntax further is not a big loss anyway. So next time you have to interface with legacy string data, think about how it is encoded.
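One practical tip for that: Oracle's DUMP function shows the raw bytes a column actually stores, which is a quick way to find out how a legacy string is really encoded. A sketch, with hypothetical table and column names:

-- Return format 1016 prints the stored bytes in hex along with the character set,
-- so you can see exactly what ended up in the column.
SELECT legacy_name,
       DUMP(legacy_name, 1016) AS raw_bytes
FROM legacy_customers;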
