Database Performance Tuning: 2012

Sunday, 6 May 2012

Developing Android applications with Ubuntu - I

The journey begins

What? Hey, you are usually focused in ranting about random topics, database performance, and generally proving the world how smart you are. Why then this sudden curiosity for creating an Android application?

It is part curiosity, part opportunity. As they say, opportunities are there waiting for someone that is in the right place at the right time to catch them. I'm not that one, for sure, but still, after the sad news that come from the Java camp, I wanted to explore new ways of writing applications.

Of course, it also helps if the potential audience for your application is numbered in the hundred of millions, if not more.

So, I wanted to develop a simple Android application. Being a Linux aficionado, and looking at the Google docs, Eclipse under Linux seemed like the main opportunity. Let's start with the basics.

Setting up the stage

First, install the Android SDK. Well, the Android SDK is just a zip file that you extract somewhere in your local disk. According to what I read later, one can create whole applications with the SDK without needing any IDE at all. It has been a long time since I created user interfaces out of raw hexadecimal dumps, so I'm not one of those brave souls. In any case, take note of the folder where you extract the Android SDK. You'll need it later.

Android likes you to use Eclipse to create applications. Perhaps, after my long stint with NetBeans it's time to go back to Eclipse again? For some reason, I tend to go from NetBeans to Eclipse and back each year or so. I tend to like the all-included NetBeans philosophy, whereas Eclipse is the place where the minority and cutting edge tools start to appear. This time is back to Eclipse, I guess.

So go to Kubuntu and start Muon. Oh, or Software Center or something similar if you're using Ubuntu. Make sure Eclipse is installed. Start Eclipse to make sure everything is ok. Choose a suitable folder as your workspace.

Next, you can finally go to http://developer.android.com/sdk/eclipse-adt.html#installing and attempt to follow the steps to install the Eclipse infrastructure for Android. You go to Help->Install new software. You add https://dl-ssl.google.com/android/eclipse/ to the Eclipse list of sources, select the Developer Tools, click next and after a quite long pause you get... an error.

Cannot complete the install because one or more required items could not be found.

Software being installed: Android Development Tools 16.0.1.v201112150204-238534 (com.android.ide.eclipse.adt.feature.group 16.0.1.v201112150204-238534)

Missing requirement: Android Development Tools 16.0.1.v201112150204-238534 (com.android.ide.eclipse.adt.feature.group 16.0.1.v201112150204-238534) requires 'org.eclipse.wst.sse.core 0.0.0' but it could not be found

This is one of these errors that if it were not for Google, I'd never be able to resolve. Fortunately, a noble soul has documented the fix, even with a video here. Thanks a million. However, I'm feeling that this is threading into waters that I don't know well enough. There is something very good about the Internet. Being able to tap such huge resources of information is fantastic, but am I really learning something by applying the fix? Yes, that there are people out there that know a lot more than I. Better respect these people and try to contribute something back, like with this article.

Are you ready to create your first Android app? Not yet. When you restart, Eclipse warns you that you have not selected an Android SDK. Go and define one, choosing the right API level for your target and using the folder where you extracted the SDK package. My target is going to be Android version 2.1, just because I happen to have a phone that runs that version.

Now, I'm ready for Hello World.

Wednesday, 21 March 2012

Microsoft is now a niche player

If you're about to purchase a smartphone, a tablet, or even a PC, you probably have already noticed it: Microsoft now has become a niche player.

It is all about how the balance of producers and consumers of content has evolved. When the PC revolution started, PCs were used to create content that was consumed by other means. PCs were, and they are still, used to create music, graphics, movies, books or movies. They were used to enter data. But the content was primarily consumed in non electronic forms. Magazines, theatres, records. Paper, film or vinyl. Computers helped to create content that was consumed in other mediums.

The only exception of this rule was, and still is, data processing applications. Data is entered in an application, and then transformed and retrieved in many ways, but the results rarely go out of the application, perhaps they are interfaced with other applications and is transformed. But the ratio between the amount of transactions entered and the volume of information that is extracted is increasingly smaller. Data is condensed in tiny amounts of information for dashboards, account statements or check balances.

Then things started to go digital. Content created on computers is increasingly consumed only on electronic devices. And the PC was the main device used to consume content. Databases, on the other hand, increased in size and complexity, with each evolution of the technology, each iteration generating bigger and bigger amounts of data. A significant trend is that the most of today's data is directly entered by the end user, be it plane reservations, shopping carts or generated based on clickstreams from web sites. There are less and less data entry clerks, for each iteration of process optimisation attempts to reduce or eliminate the need for human intervention. Warehouses and store shelves are full of bar code labels that reduce or eliminate data entry to its minimal expression.

Ten years ago, if you wanted to do anything useful with a PC, there was little choice but use Windows. It was the result of a three pronged approach: the tight control Microsoft exerted over the hardware manufacturers ensured that Windows was a popular, even cost effective choice, for PC hardware. Their product portfolio covering such a wide surface of applications allowed them to offer very seductive deals to their customers. In the database area, for example, it was not uncommon years ago to hear someone going to standardise in SQL Server, and learning from insiders that the product was throw in the box close to free as part of a much larger deal involving workstation, office and server software. And finally, their lock in in the proprietary formats and protocols kept everyone else from making competing products.

When the PC was the only device capable of running applications for content creation, there was little choice but use Windows. When the mainframe terminal died, the PC was the only alternative for data entry.

The world of today is different. The balance of content creators versus content consumers has shifted. Content can be created and consumed in many different ways, all of them completely digital. There are now orders of magnitude more devices in the world capable of running applications than personal computers running Windows. New classes of devices (phones, tablets, settop boxes, book readers) have separated clearly the roles of creator and consumer. You no longer need to use the same device for creating and consuming content. Data entry happens by means of bar code scanners or users entering the information themselves, and behaviour data is collected automatically by web logs or TV settop boxes.

And almost none, if not all, of those devices run Windows. Windows and windows applications have failed to move to these scenarios, except when they have managed to hide an embedded PC inside the devices (think of ATMs). At this point, I can only see three Windows use cases, and each is getting weaker and weaker.

Enterprise applications and office productivity: that is a now niche that is restricted only to people needing five year old applications that depend on Windows being compatible to run them. That plus people at home that want to have a home computing environment similar to the one in the office. This segment is being attacked very effectively by cloud services and apps, but the inertia here is huge, so it's going to last them a few years. It is also the most profitable, so expect Microsoft to fight to death to preserve it.
Content creators: people that still need the full power and ergonomy of a desktop or laptop computer to create content. Note that even with the empowerment of the digital technology to create, the ratio of content creators vs. content consumers is still like 1 to 1000. This is not very profitable for Microsoft, but is a key segment because this channel in the past has served to promote content in propietary formats (VB, C#, SliverLight, Office formats, WMA, .AVI, DRM music, .avi,....) that were essential to increase the desirability of their products for the consumer segment. Unfortunately for them, open standards and/or reverse engineering of formats and aversion to DRM are destroying the virtuous cycle of created content that can only be consumed on the Windows platform.
People that simply want a computer for basic tasks (browsing, mail, light content creation) and make a cost conscious purchase. It is actually true that Windows PCs are cheaper than Macs. While this is likely Microsoft's safest niche for now, it is so for a reason: this segment is the bottom of the barrel in terms of profitability. And both Mac and open source based alternatives are eroding market share from both the long and short ends of the profitability spectrum.

Microsoft Windows can now be considered a niche player in these three segments. It is a huge niche, and most anyone else would be happy to own these niches, but still a niche nonetheless. Either because of self complacency, protection of their cash cows, or lack of vision, Microsoft has failed to make any significant presence in any new technology since the year 2000 or so. The cruel irony is that protecting those niches is also what has lead them to losing in other segments. Disruptive players do not care about preserving their legacy because they don't have one to preserve.

Some of you may point to the XBox as a counter example. Check the financials of the Microsoft console division and see how long, if ever, they will recover all the money thrown to make XBox fight for number two or three in the console market before progressing the discussion.

In the database arena, things have been very similar. SQL Server has always been limited in scale by the underlying Windows platform. SQL Server could only grow as far as the type and number of CPUs (Intel or Alpha in the early days) word and RAM size of the Windows OS, and this prevented it being used for big loads, or even small or medium loads if there were plans to make them bigger. Since the definition of "big load" keeps changing with Moore's law, SQL Server has never made any serious inroads beyond the medium sized or departmental database, facing competition from above (Oracle, DB2) and below (open source) Could Microsoft have made SQL Server cross platform and have it running on big iron? Probably, at an enormous expense yes. But that would also miss the nice integration features that made it such a good fit to run under Windows. And also the reason to buy a Windows Server license.

And when SQL Server was seemingly ready for the enterprise, a number of competitors arrived that made unnecessary to host your database on your own server (Amazon). Or to have a relational database at all (NoSQL). Could Microsoft have moved earlier to prevent that? Probably, but that would have required first to foresee it, and it would have happened at the expense of those lucrative Windows licenses sold for each SQL Server instance.

So the genie is now out of the bottle, and Microsoft can't do anything to put him back in. They are now niche players. Get used to it. The next point of sale terminal may not be a PC with a connected cash drawer .

Wednesday, 29 February 2012

Is there a right way of doing database version control?

TL;DR: database version control does not fit well with source code version control practices because challenges associated with data size.

I could not help but think about posting a comment on this well written blog post, but realized that it was a topic worth discussing at length in a separate entry. If you've not clicked and read already the rather interesting article, here's the summary: there is a big impedance mismatch between what are the current best practices regarding changes in source code and change control inside databases. The solution proposed is quite simple: develop a change script, review it, test it, do it in production. If anything fails, fix the database by restoring a backup and start all over again.

For added convenience, the author proposes also to store the scripts in the database itself, so that everything is neatly documented and can be reproduced at will.

There are a number of very interesting comments proposing variations on this process, and all of them really reflect some fundamental problems with databases that do not have their reciprocal in the world of source code control. While I seriously think that the author is basing his proposal on real world experience and that the process works well for the systems he's involved with, there are a few environmental factors that he is ignoring that render the approach impractical in some cases. It is as if he is falling into the classic trap of believing that everyone's view of the world has to be the same as yours. Here are a few reasons why not.

Databases can be huge

This is the crux of the problem. Code is the definition of some data structures plus how to process them. Databases contain structure, data and perhaps also instructions on how to deal with data. Compilers can process the source code in a matter of minutes, but adding the data takes much longer. Either by restoring files, running scripts or whatever other means, there is no way to avoid the fact that the time to restore data is at least a couple of orders of magnitude above the time needed to compile something.

This makes all the "simple" strategies for dealing with databases fail above certain size, and break the agilistic development cycles. In the end, if you want to have continuous integration or something similar, you simply cannot afford to start from an empty database in each build cycle.

Big databases have big storage costs

In an ideal world, you have your production database, plus a test environment plus a copy of the production database in each development workstation so that you can make changes without disturbing anyone. This works as long as the data fits comfortably in a hard disk.

In the real world, with a database big enough, this is simply not possible. What ends up happening, in the best case, is that developers have a local database with some test data seeded in it. In the worst case, all developers share a database running in a server that is not likely able to hold a full copy of the production environment.

Performance testing for big databases is done usually on what is sometimes called a pre-production environment: a full copy of the production database restored on separate storage. That already doubles the total cost of storage: production environment plus pre-production environment. For each additional environment you want to have, say, end user testing, you're adding another unit to the multiple.

Before you blame management greed for this practice, think again. Each full copy of the production database is increasing storage costs linearly. For $100 hard disks, this is perfectly acceptable. For $100.000 storage arrays, it is not.

We've had for decades Moore's law on our side for RAM capacity and CPU power. But the amount of data that we can capture and process is increasing at a much faster rate. Of course, having infinite infrastructure budgets could help, but even the advocates of setting up the best developemnt environment agree that there are limits on what you can afford to spend on storage.

One promising evolution of storage technology is the snapshot-copy on write based systems. They provide nearly instantaneous copy time -only metadata is copied, not the actual data- and only store what is changed across copies. This looks ideal for big databases with small changes, but is unlikely to work well with databases with big changes, either big or small, as you're going to pay the "price" -in terms of amount of storage- of the full copy at the time you do the changes. But, don't forget that the copied databases will be impacting the performance of the source database when they access unchanged parts. To prevent that from happening, you need to have a standalone copy for production, and another for pre-production, and another for development. So at a minimum, you need three different copies.

Restores mean downtime

So does application code upgrades, one could say. And in fact they do. However, the business impact of asking someone to close an application or web page and reopening it later can be measured in minutes. Restoring a huge database can mean hours of downtime. So it's not as easy as saying "if anything goes wrong, restore the database" Even in a development environment, this means developers waiting hours for the database to be available. In a big enough database, you want to avoid restores at all costs and if you do them, you schedule them off hours or in weekends.

Data changes often mean downtime too

While in the past adding a column to a table required an exclusive lock on a table, or worse, on the whole database, relational DB technology has evolved to the point of allowing some data definition changes not to require exclusive access to a table. However, there are still some other changes that need that nobody else is touching the object being changed. While not technically bringing down the application, this in practice means that there is a time frame when your application is not available, which in business terms means downtime.

It's even worse: changes that don't need exclusive locks it usually run inside a transaction, which can represent a significant resource drag on the system. Again, for a small database this is not a problem. For a big enough database, it is not likely to have have enough resources to update 100 million records and at the same time allow the rest of users to use the application without taking a huge performance hit.

Is there a way of doing it right?

Simple answer: yes. Complex answer: as you increase the database size, the cost of doing it right increases, and is not linear. So it is going to become more and more difficult, and it's up to you to decide where the right balance of safety and cost is.

However, a few of the comments in the post suggested improvements that are worth considering. In particular, having an undo script for each change script seems to me the most reasonable option. Bear in mind that some kind of data changes do not have an inverse function: for example UPDATE A SET A.B=A.B*A.B is always going to yield B positive regardless of the sign of the original value of B. In those cases, the change script has to save a copy of the data that before updating it. With that addition, at least you have some way of avoiding restores. This does not completely remove the problem of downtime, but at least mitigates it making it shorter.

This, plus the practice of keeping the scripts inside the database, has also the added benefit of keeping the database changes self contained. That means less complexity should you need to restore, which is something DBAs appreciate.

According to the different scales, these are then the different scenarios:

Small database: ideal world, one full production copy for each developer. Use the original process. When moving to production there is a low chance of problems, but if they appear, use the original process: restore and go back to step 1.
Mid size database: devs with small, almost empty database, final testing on pre-production. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.
Large database: devs with small, almost empty database. When moving to production, cross your fingers and pray. If something goes wrong, apply undo scripts.

Not pretty or perfect, but likely the world we live in. And note that NoSQL does not change anything here, except on the simplistic views of those novices to the NoSQL or development world at large. In some NoSQL implementations, you don't have to do anything special because there is no real database schema. You only change the code, deploy and everything is done. Really? Well, yes if you don't count all the places where you, or anyone else, made assumptions about the implicit structure of your documents. Which could be everywhere any of that data is being used. The relational world has some degree of enforced structure (keys, cardinality, nullability, relationships, etc) that makes certain category of errors to surface immediately instead of three weeks after the code is in production.

Tuesday, 7 February 2012

The changing goals of Canonical

Today, Canonical announced that they are relegating Kubuntu, one of their "official" variants of their flagship Ubuntu Linux distribution, to the same status as the other distribution derivatives.

Canonical is the brainchild of Mark Shuttleworth, a dot com boomer that wanted to give back to the same community that provided some of the wonderful FOSS software that helped him becoming a millionarie. When it started, Canonical did not had any clear financial constraints, or objectives for that matter. Bug #1 in Ubuntu's bug database simply reads "Microsoft has a majority market share" Was Canonical's objective to take away market share from Windows? At the time, that seemed to be a bold statement, but the first Ubuntu releases were making giant strides towards that objective, to the point of being considered a credible alternative by many established players, including Microsoft itself.
Kubuntu, one of the Canonical projects, is an attempt to merge the friendliest Linux distribution -Ubuntu- with the desktop environment that I find that is the closest fit for a Mac or Windows user: KDE. An ideal combination for someone that switches between operating systems or is a seasoned user that wants to move away from the proprietary environments.

What the latest announcement means essentially is that the single individual paid by Canonical to develop and maintain Kubuntu will no longer be assigned to that role, and any further developments in Kubuntu will have to come from the community instead. This does not necessarily implies that there will be no more Kubuntu releases after Pangolin: for example, the Edubuntu community has managed to keep up with releases.

It's not that Kubuntu users are left in the cold, however. The next release (Precise Pangolin) of the Ubuntu family will still include Kubuntu, and being a Long Term Support (LTS) one, means that existing Kubuntu users will get patches and support for the next three years.

So what's so important about the announcement, then? By itself, very little, except for the small minority of Kubuntu users. Kubuntu had not enough user base, reputation or visibility to be worth keeping, hence Canonical has retired the single individual dedicated to Kubuntu because it does not make economic sense to keep paying him to do that.

What is important is not the announcement, is the trend: Canonical is more and more taking decisions based on economic, not idealistic, considerations.

Now, those idealistic goals are set aside more and more in search of a more mundane objective: profitability. The turning point was reached last year: they released Unity, a new desktop environment, targeted at non computer users, with an eye on using it as the interface for Ubuntu based touch devices and other non-PC environments. That was a big change that was received by the existing user community with a lot of backslash, yet Canonical is firmly resoluted to develop Unity in spite of that, and not willing to devote time or resources to keep alive an alternative to Unity.

What Canonical wanted to be at the beginning was not clear, but now it is: to be profitable.

And is hard to blame Canonical for not trying. After all, they have an extensive -read, expensive- staff dedicated to the many projects they sponsor. They have clearly invested a lot of effort -read, money- into many initiatives targetting everything from the office productivity desktop to the settop TV box, with incursions in the music store business, cloud storage, cloud OS and management, alternative packagings (Kubuntu still appears there at the time of writing this, by the way), education, home multimedia hubs, corporate desktop management and who knows what else.

All these projects have generated a considerable user base, at least in comparison with previous attempts, and helped Canonical to accomplish Shuttleworth original intention of giving back to the community, even if there are differing opinions on how much actual contribution has been made.

Yet, none of these projects seems to have generated a respectable enough line of business. Maybe some of these projects are self sustaining, maybe some of them generate some profit. But are they taking over the world? No. Are they going to be a repeat of the first Shuttleworth success? No. Are they making headlines? No. All that investiment is certainly producing something that is valued and appreciated by the open source community, but profitable is not.

At least yet. How do I know ? Honestly, I don't know for sure. What I know for sure is that any degree of significant success would be heralded and flagged as an example by the always enthusiast open source community. And that is not happening.

So Canonical is looking for profitability, big time. And if it means losing all their existing user base, so be it. Which will not be a big loss, because their existing user base is demonstrably not very profitable to begin with.

Which makes complete sense from a business perspective. If I was a millionare and had put a lot of my own personal fortune in something, I'd be expecting to see something back. Another, completely different issue, is how they can become profitable. That would mean looking into Unity, their biggest bet so far, and... well, you already know my opinion. Is a keyboard search the most effective way of finding things in a device without a physical keyboard? You judge.

Anyway, it was about time to change distro. With all due respect for the huge contribution Canonical and Ubuntu have made to build a robust, flexible and fast desktop environment. Until they stopped wanting to do that, of course.