Database Performance Tuning: 2009

Sunday, 15 November 2009

Indexes cost money

All this industry talk about cloud computing is, more than ever, placing under the spotlight the true cost of IT infrastructure. Combined with the current economic climate, this is sparkling an interesting and refreshing debate on what is the actual cost of supporting and growing an IT infrastructure. And more and more, a realistic cost breakdown of how much each piece of the infrastructure is emerging.

And one of the surprises of this exercise is the exploding costs of mass storage. Data centre storage is under an increasing pressure. Data Warehouses keep storing more and more data points and history. ERPs continue to take on more business processes. Web front ends generate incredible amounts of detailed user activity information. Graphics and multimedia file formats are taxing file servers and document repositories at the far end of their capacity. Mail is now an accepted medium for conducting business, and mailbox size is increasing without an apparent limit.

On top of that, legal regulations demand that most everything has to be kept during five, seven, or in extreme cases I've seen, fourteen years.

Technology is also evolving to catch up with all those demands. From SAN/NAS storage, tiered data management, storage outsourcing, you name it. But, except for some notable examples (Amazon EBS, Google) the business IT environment at large is failing to meet that demand.

This is even more difficult to explain to the non-IT executive levels, since at the consumer level storage, there is an expanding and ever increasing amount of USB and firewire devices offering amazing capacities at prices that are getting lower and lower. Not to mention that the average desktop PC now comes with a 1GB hard disk as the standard size.

Usually, the point that non-IT business folks fail to realize is that it is cheap and easy to increase the raw amount of storage available. Not so cheap is to back it up daily, keep that archive during five years and make sure that is recoverable.

My usual response in the typical elevator conversation about storage cost with business types (yes, those conversations actually happen) is to answer, "yes, you can purchase a couple of TB for a few bucks. Now, tell me what you do if those disks fail. Will you be making back ups? Will you be able to recover them in five years? Will you be able to provide a replacement in 4 hours if they fail?" It's not that they are trying to outsmart you, it's simply that all those dimensions of the support infrastructure costs are simply invisible for them.

Oh, I hear you cry, is this one of these posts of late that do not have anything to do with database tuning? No, this time my point is actually related to DB tuning. In fact, all of the above is essential to introduce the topic. The topic is "just create an index and this will be faster"

This is so wrong, and I think that I've blogged it before, that deserves to be refuted in all its possible dimensions. First, yes, there are situations where it actually makes sense to create an index.

But an index is not free. First, an extra index in your database will penalize inserts and updates on the affected table. Second, the index takes storage space, that expensive and scarce resource. Third, sometimes there are better alternatives to adding an index. Two of the typical are:

- Caching values read from the database, instead of querying for them each time they are needed. This has to do with a tendency during maintenance to prefer always adding new code rather than change existing code. The more difficult to maintain the code is, the higher the chance that maintainers simply do not want to touch it. This is safer for them, but also imposes a performance overhead as it cannot take advantage of work already done by the existing code.

- Converting procedural code to SQL data set code. Query execution plans favor working with whole tables and result sets over slowly iterating over them.

So, next time someone proposes an index on that pesky 4TB table, ask, are there any other alternatives? Tuning exercises should not provide a single answer to a problem, but a balanced cost/benefit analysis of different alternatives. It's not enough to answer "just create an index", you have to say "creating an index is the best option as opposed to...." That puts the conversation with the paying customer in the right level.

Saturday, 7 November 2009

The tragedy of tactical vs. strategic solutions

I've noticed lately an increased tendency for the so-called "Tactical Solutions" in big organizations. Those are usually focusing their IS efforts around a few key systems, termed the "Strategic" ones. Strategic is synonym of big, complex and ambitious.

I've always heard the term "tactical solution" used to denote something that was put together quickly, without paying too much attention to the strategic aspects of an organization. Its data model does not try to encompass every possible process in the world, and its functionality is a fit for purpose for the existing business process, without taking into account the expected changes introduced by the business in the future. It's something built knowing that it is going to be replaced in the near future by the all encompassing, strategic application, but solves now today's problem.

Fact is, the all encompassing, strategic application does not always arrive. Well, in fact, it's quite likely that it will never arrive. Two are the strongest contributing factors:

Plans are never completely executed: forget about five year long plans, face it. Nobody in any organization even remotely knows what they will be doing in three years time. Plans change as the economic environment changes, as competition changes, as legislation changes , as technology changes and as boards of directors change. When I see one of those five year long plans, I always do as if I accept them and cannot but admire the lack of grounding in reality that the people that make them and those that believe them have.

Complex solutions are likely to fail: the more complex and ambitious the strategic solution is, the more it is likely to fail to deliver. This happens also outside IS and is a inevitable side effect of having to deal with a complex problem. It's a fact. Complex development fails, at a much higher rate than simpler development. So even if people sticks to the plan and do their best to execute them, there is a built-in high probability of failure in those three to five year plans.

If you think I'm exagerating, I've seen tactical solutions that were intended to last 12 months being used for five years. And did the strategic solution arrived after five years? No, the tactical solution was replaced by another tactical solution, yes, you guessed right, intended to last only 12 months.

Of course, this is an extreme example, and I'm ready to admit that there are organizations (very few) that are actually able to stick to the plan and execute it during three to five years. But rarely the tactical solution is effectively replaced. Similar to the demise of dinosaurs, the tactical solution lower size and complexity allows it to change and adapt much faster and better to the environment than the monolithic and big strategic solution. And the point is, the "tactical solution" ends up playing a key role in the application landscape, and thus becomes strategic. As with any strategic system, it ends having to comply with the same requirements as its bigger brothers. It has to be fault tolerant, it has to scale up, it has to have change control, it has to be changed to adapt to business requirements and it has to be auditable.

What? You're not able to build those attributes into your tactical solution? Didn't you think about those requirements while building it and it is now heniuosly expensive to do it? Well done, you built actually what is a tactical solution. Instead, what if your tactical solution is fault tolerant, can scale up, is under change control, is easily changed to fit new requirements and is auditable? Congratulations, you built an strategic solution much cheaper and quicker than the strategic solution is going to be built.

Whatever the result, it's time to stop using the term "tactical solution" as something quick and dirty developed for the short term and instead start describing it as what they really are. Like the Ubuntu joke (Ubuntu is an ancient African word that means "I don't know how to install Debian") "Tactical Solution" means "I don't have time for the big guys to tackle my problem and I'm solving it myself"

Just take care of making it strategic while building it.

Thursday, 29 October 2009

GAE: the good, the bad and the irrelevant

From what I read, the general feeling is that there are two irreconcilable factions in the cloud/no cloud debate: those that go for it and those that do not.

For such a highly technical topic, I'm surprised to see GAE being debated at almost emotional levels. In one side, there are folks that look like Google fanboys and see nothing but advantages for your business to host things in GAE. The other side seems to be in total denial of even consideration of using GAE for anything even remotely related to their line of work.

My view is that as with any question complex enough in life, the answer is "it depends". That's why I put together the following points that I think you should consider before betting for GAE.

Note that Amazon EC2 offers you a much lower level (system instance) hosting, these considerations still partially apply. And note that they are based more on experience with application hosting and development in general than my experience in GAE, which still is below what I wish it were. Bombs away.

The irrelevant

This may sound provocative, and in fact this is exactly the feeling I want to evoke. There a couple of things that I hear over and over against GAE but I'm still trying to understand the logic behind them.

Data Protection and security

GAE detractors say that they will lose control of their data by using GAE. Your data will be in Google's hands and that is bad. Let's face it, and please be realistic. Depending on how many degrees of outsourcing you have in place, your data is already in many hands, and is as safest as the weakest link on the chain. Very few business take even elementary precautions with their business data when dealing with it internally. Simple things like backup encryption are not being done for business sensitive data.

If anything, Google's massive scale and focus on technology will give the average business a much higher degree of data protection than they have today, if only because they can spread the huge capital cost of setting up all that infrastructure across thousands of customers. In theory, Google could potentially do whatever they wanted with your valuable business data, but it would be probably against their own interest to do so, if only for the lost business due to lack of confidence. And who do you think it has the greater incentive to use your data for nefarious purposes, the guy that takes the daily backup tapes away from your (shared? hosted?) data center or Google? Have you already taken any measures to encrypt your laptop hard disk? Do you allow for USB disks to be plugged on your workstations? Both thinks pose a greater risk to your data, at least judging by public reports and security experts, than anything that can happen inside Google's cloud.

I'm not saying that the folks at Google are perfect. They will, at some point, have a problem with the data they keep for third parties. Nobody knows when, but it will happen, if only because in any security scheme the weakest link in the chain is the human at the keyboard, and no matter how well Google recruits staff, they are not perfect.

I've yet to read any reports of security breaches at Google, and I'm sure that any hacker wishing to raise its market value would be quick to announce any kind of success with Google infrastructure. Security is a multiple layered onion and that the two topmost layers, the users and the application, are still being nursed by you. So look again at all those expensive firewalls and IDSs in your data center (that is, if you have any of those), guess how many of those Google has and start worrying instead about the most likely risks as of today: an application programmer leaving any SQL injection holes in your application or a user saving a spreadsheet with a list of all your customers in his laptop and the forgetting it in the back seat of a cab or in a train.

Uptime

This comes down to two reasons. The first is that they don't have the same level of control over scheduled downtime windows as they have on their own environment. I'm fine with that reason, and to some degree I'm ready to accept that as valid. But looking at those windows in GAE, while some of them could be inconvenient for your business, I'm sure that will compare favorably in terms of quantity and duration with the scheduled downtime that any local application you may have. Google environment is redundant and fail safe by design, that is, their need to bring the whole thing down to perform any chances is much, much less often than the typical environment. As for scheduled downtime due to application changes, this is more a factor of your application than the environment you choose to run it.

You also need to remember that GAE downtime history has been up until now more governed by it being a beta system than anything else. You should not expect to have similar degrees of downtime, scheduled or not, once they are fully up and running.

The second reason is truly, really, genuinely irrelevant. Unscheduled maintenance windows, that is, the time when the system crashes when nobody expects to, are the wors nightmares for any IT organization. And frankly, at any big consumer facing web business, be it Microsoft, Yahoo, eBay or Google downtime is way lower than anybody else in the closed environment data center world. Each time any of Google services, be it gmail or GAE, goes down, the critics raise their voices and say "that's why I'd never trust the cloud" I find it very hard to find environments where uptime is on par with the web-facing consumer giants, not even closer. Those folks have economies of scale that allow them to set up redundancy and fault tolerance on levels that the ordinary IT budgets cannot even dream of. So their unscheduled uptime is going to be lower than yours, period.

The most ironic part of it is that the people that make these kind of arguments against GAE are completely out of the GAE radar. The GAE proposition does not make sense in those enterprise environments that are above Google levels of security and uptime, that is, if they are actually meeting them outside of their dreams. But it's hard for me to find as many 99.99% uptime environments as GAE critics exist. My only possible explanation is that the criticism come mostly from armchair strategists. It is easy to "design" security and availability on paper. It is not easy or cheap to implement those designs.

And please, please, remember: I consider those points not being relevant in the context of discussing going with GAE or not. But that does not mean that they are not relevant. By all means they are, so always make sure that you have the right clauses in place whatever direction you choose.

The Bad

And yet I think there are genuine concerns you should be aware of when commiting to GAE, yet I don't see them widely discussed, perhaps because the bulk of the arguments belong to the the irrelevant ones.

Lock-in what you buy when you choose GAE is lock in. Good old lock in, of the deep variety. Any application you write for GAE will remain running under GAE forever as soon as it reaches any reasonable level of complexity. Don't fool yourself thinking that if you use Java your chances of being free are higher, because they are not. The GAE storage engine is different enough from anything else that you cannot run your application anywhere else. Period. If you want a good example, just look at GAE own framework, Django. The Django GAE has so many modifications over the standard Django framework that even Google is not making a sales argument of their Django heritage. Django is mentioned, but not promoted.

Rigid environment: by desing, GAE is not well suited to some applications. You may start your application with certain functionality in mind and later on want to expand it. In that expansion process, you may run into some of the limits imposed by GAE. By that time, by virtue of the lock-in, you'll have already invested some time and money in your application and you will then face the choice of basically throwing away most of that investment and switch to another, less restrictive platform, or start fitting your functionality to the GAE mindset: no analytical queries, no big batched transactions, no incoming mailbox, no …. I'm not completely aware of all the things you cannot do with GAE (or are extremely difficult and cumbersome to do as to not being practical) but it's likely that you'll run into any of them at some point. The problem with that, as I've already written somewhere else, is that Google itself does not have to deal with those as they have access to MapReduce and other technologies that allow them to sidestep any GAE limitations nicely. But you will not.

The good

Scalability. if there's one single thing they have mastered is how to massively scale everything they do. Massively. If you need that kind of scalability, Google is the way to go. And they keep commiting resources and capital to make their infrastructure even more big and scalable all the time. They have simply no competition in that front at this moment.

Uptime. Hey, all of us have heard of that server that has been running for the last eight years without a single reboot. Like urban legends, you never hear that from someone that has actually seen this machine, but only form people who know someone who know… While not breaking any records at this moment, Google uptime is among the best in the industry. Much more so taking into consideration the size of their operations. Also, I have the feeling that Google knows that this point is key to win the customer hearts.

In summary, I think that you should take a hard look at the bad and good before going to GAE. And, while not completely ignoring the irrelevant, take the time to measure them against what it is going to cost you to attain similar levels of uptime and security on your own. My bet is that what I've classified as irrelevant will actually look like, well, irrelevant, unless there are up front requirements (legal, for example) that prevent you to even pondering them in the decision. Then decide.

Monday, 21 September 2009

The demise of global software services

I came across this post by no other than Kent Beck. It's no wonder how he has developed the status he has in the industry today: he is a very good communicator. What could take me half a book to describe, Beck can do in three paragraphs. This style of communication is a perfect fit for the eternally busy business people that cannot take away his eyes off the BlackBerry for more than fifteen minutes. And at the end of the article, I was amazed at both how simply and accurately he describes the business model of many service companies I know about.

Beck is puzzled as to why absolutely no software services company was interested in his idea. He then goes on to explain and prove that, in he general case, service companies don't see increasing the skills of their workforce as a good investment. Note that I'm not passing any judgement on whether his “Responsible Design” idea actually makes sense or not. The relevant information here is that he could not find anyone interested in his idea among software services companies, but software product companies were interested in it.

This is strange. The same service companies that are ready to push the next set of industry buzzwords lead by the next wave of luminaries are not interested in Beck's ideas.

Perhaps he does not want to raise hard feelings against him in the software services community. Perhaps he is somewhat blinded into thinking that the problem lies in his proposition. Perhaps he just wants the reader to extract his or her own conclusions. Since Beck is both an intelligent person and obviously not ignorant of the evolution and state of the software industry, I'm taking that as an invitation to extract my own conclusions. Here they are.

In my view, his model depicts exactly the core business model of any large service company, and the reason all of them are in demise. That is, investing in skills is simply not as profitable as increasing the quantity of cheaper resources they provide to the customers. They know and assume a high turnover rate at the bottom (or "burn rate" as they call it) and keep senior staff at the minimum level to prevent disasters.

Many years ago, software services companies reached market saturation in their big accounts. There was simply no more room to employ more staff in big accounts. To further expand and grew, they started to look for smaller accounts, using the reputation acquired in the biggest ones for commitment, delivery and responsiveness. They were competing in that market with the smaller, possibly local, software services companies, Their value proposition basically said “yes, we're a bit more expensive, but you are getting quality and experience beyond the local players” And many smaller accounts bought that argument.

But that has not turned out to be true. You know that if you've experienced it. You hired a global services company because you were sold on the idea that they could access a level of knowledge and experience way beyond you could on your own. You were sold on the idea that their resources were skilled and experienced beyond anything you could have in your own organization. But you did not got any of that.

This is because the number of senior level people now is so small that they have started to decrease their quality of service to their "lower valued" customers. It is very frustrating for someone who is not in their "big fish" pond to know that they have the right people for the job at hand, yet you'll never see them as they are allocated to accounts bigger than yours or even idle, but not able to work on your project because its budget is not big enough to reach the profit levels that these senior resources are supposed to generate.

The irony is that those smaller customers have been their source of growth for a long time. Instead of looking of ways to increase their business with existing customers (pretty hard in current economic environment) or expand their market (much harder than the previous one) they start reducing overheads. The next logical step in order to keep profit levels is getting rid of more senior people. If you do this enough times, and many of those companies have done it at least once per year in the last three years, you introduce the same quality problem in their highest valued customers.

All the above is already happening. The ratio of senior vs. junior resources is approaching zero. This is creating an opportunity for smaller size operations where quality and customer satisfaction is actually in their core values, as opposed to short term profit. For them, any investment in skills has a payback in the long run, either in increased volume with existing customers or expanded market share and increased employee retention. Whether they will buy into Beck's ideas is anotter matter, but at least they will hear him.

Thursday, 9 July 2009

Open Source does not have market share - does not even have a market?

According to the Wikipedia definition:

"A market is any one of a variety of different systems, institutions, procedures, social relations and infrastructures whereby persons trade, and goods and services are exchanged, forming part of the economy. It is an arrangement that allows buyers and sellers to exchange things"

I'm neither economics nor market expert. But certainly that definition implies that something is exchanged, right? Then let's look up the definition of Market Share

Market share, in strategic management and marketing is, according to Carlton O'Neal, the percentage or proportion of the total available market or market segment that is being serviced by a company.

Wow, again those business words like "company", and "market" All that lead me to think about all the different measurements that are constantly trying to determine the "market share" of open source products.

First, of course, you have all the debate around which numbers are correct. Because there are as many numbers of "market share" as there are ways of measuring them. There is always a heated debate about which numbers are more accurate. Recently for example, Mozilla Firefox "share" has been seen rising quickly, to the point of being already a serious contender for Web client dominance against IE.

What bothers me is that people keep talking about usage figures as if they were "market share" But they are not. When something is essentially free, such as a Firefox download, there is no exchange of things. You just download something and install it. There is no direct exchange of anything between you and the Mozilla foundation, or Canonical, or Red Hat when you use their products.

Yes, the open source community argues correctly that there is in fact a return on that. Be it bug reports, suggestion for enhancements, code fixes , hardware or money donations, or translations, there is usually a return for the successful open source products. But note that it probably applies for the more successful ones only, and that there is no direct exchange.

The ironic situation is that those products are placed and measured against other where their usage, piracy notwithstanding, do have a market. Windows, Internet Explorer or MacOSX do cost money beyond what you're paying for your Internet bandwidth. More ironic is that, if you tried to estimate the economic value of producing things like Firefox, Red Hat or Ubuntu have, you'll probably be in the billion dollar range. Yet they don't cost you anything. And in spite of the vast majority of their users not doing any kind of contribution, direct or indirect, these things thrive and continue to grow, threating to crush their paid for alternatives.

Therefore is not correct to talk about "market share" in this situation. There is no market involved, and the same logic used by business and marketeers cannot be applied here. So please stop using the term "market share" and start using "usage stats" or something like that.

Thursday, 2 July 2009

Kill the flat text file

Perhaps I'm not lucky and I always end up working with old systems, but I've become increasingly irritated by a practice that should have died years ago: the flat text file interface. In the stone age era, this was the standard way of exchanging data between systems, as the plethora of options that integration technologies have been creating in the last 20 years or so were not available.

Probably because of its ubiquitous availability, this practice was noticed by the office tool creators (yes, Lotus 1-2-3, I'm looking at you) and they built nice functions in the applications to deal with flat text files. Spreadsheets became common office tools, and people grew up using text files as their only way of getting data from systems. Perhaps because of that, a whole generation of office workers assumed that this was "the way" of doing things. When those office workers face a technology problem, they of course tend to prefer to use the tools and techniques they know better. After all, if you can open in Excel a flat text file and understand its contents, what's wrong with using that to talk to other applications?

The answer is, a lot of things. And some, but not all of them, related to performance. But that's not stopping the flat text files flying around. Even in this year, new functionality is built into applications that rely on flat text file interfaces to move information across applications even if much better ways of doing it exists.

Let's review the most common sins incurred by the practitioners of the flat text file interface approach.

Internationalization

30 years ago the world was 7-bit ASCII. Then information systems became global, standardized and multilingual. Probably, English is no longer the dominant language of computer users across the world. Yet the flat text file does not have any means of specify encoding, nor most legacy tools are able to handle international character sets correctly - remember, they were designed and created when the world was 7-bit ASCII. Before Unicode became universally accepted, there were a lot of workarounds that tried to deal with this problem in a way that did not broke legacy applications. Most of them were around using, yes, you guessed it, 8-bit ASCII. But 8-bit ASCII is not enough to cover even the different character sets across the western world, much less deal with oriental character sets.

The result is that data quality deteriorates as information moves across systems in those flat text files. Each time a file is exported and/or imported, an assumption is made about source and target character data sets that most of the time is not tested, if not plain wrong.

And no, flat text files do not have a way to encode character set information in them, unless of course you make them non-flat and irregular. Therefore, each time you try to compare what should be the same information in two different systems, you have all kinds of mismatches in textual data that are simply due to wrong character set conversions. In some cases, there are so many of them that they make the signal/noise ratio in the data so low that the actual data quality audit is impossible.

Cost

A flat text file has to be written by a program somewhere, according to some agreed format between parties. It then has to be transferred to the other system and read back by the other machine, interpreting the agreed format. Simple, isn't it?

No, it is not simple. Think about it. Now think again, twice. Each time you exchange information across systems you're performing essentially a data transformation to adapt from one data model (or database schema, if you prefer) to another. Now, by writing a flat text file, you're introducing yet another transformation, the one you're doing from the source or target data model to the flat text file format. So you're paying the cost not only of two data transformations, but at least three. Usually, text files have a format that closely matches one of the two data models, with the other unlucky side having to perform the bulk of the transformation job. In extreme cases, I've witnessed otherwise intelligent people engage in even more ridiculous exercises, trying to define so called "neutral data formats", in which the two parties have to perform extensive data transformations to be able to use the data in their applications.

Still thinking that it's cheaper to use text files? Ah, that's because you have lots of people who can write programs that read and write text files, isn't it? Well, perhaps it's time to look for more qualified resources, as the incremental cost of learning other technologies for application information exchange is slim compared with the additional costs you're incurring by performing all those additional transformations. Don't want to invest in that knowledge? Think again, you're basically stagnating yourself or your team in old technologies that simply are being replaced by application connectivity and middleware.

Ah, I hear you cry, but flat text files can easily be audited, can't they? I have to say, yes, they can. Have you ever heard of application log files? What's stopping you writing something on an application log or journal instead of keeping your flat text files around? Isn't that easier, safer and subject to regular application security than keeping a folder somewhere with a bunch of flat text files? Answer left as an exercise to the reader.

On top of that, each time you need to transfer additional attributes, you have to modify two programs, the one that writes and the one that reads. How's that impacting the cost?

If you think all of the above is not enough to stop using text files, read on.

Synchronization

The flat text file is static. Your data is dynamic. Your business is dynamic. Enough said. Writing something to a text file means that it has to be processed later. Means that it cannot be changed at the source unless you're sure that the other side has captured the previous change. Means that there will be a delay between the time one application has a piece of data and the other has the same piece. This causes no end of race conditions, unless both applications are changed to reflect the special state the data is while it is being sent across. Which introduces yet more artificial complexity in data management.

Not only that, but those files have to be moved around. Which means another protocol, be it FTP, CIFS (Windows file sharing) or similar, that is involved in the transfer. Some of those protocols insist in introducing some additional layers of character set translation, require customized firewall settings and need to be taken care of. Not only that means additional cost, but also another point of failure where things can go wrong.

Experienced developers have a solution for this problem, usually closing the system for end user access while the flat text files are being read or written. Sometimes, I've seen flat text files as some sort of journal used to defer the import or export, usually to be done on a night batch process. Those nightly jobs can become monster bottlenecks, where all useful application processing is deferred to be done during the night. Cross your fingers, because if anything goes wrong it will have to be fixed the following day, delaying whatever processes depend on the data being moved, sometimes making the regular services stop to give the systems enough breathing room to do their monster nightly batch runs during the day.

You can fix this, of course, by having continuous monitoring of your beloved flat text files import and export processes. Remember to add those costs to the total cost of support.

If you believe that replacing the flat text file with some other synchronous or asynchronous solution will involve expensive distribute transaction technologies, I have an answer for that: only if you want it. Your systems will be much more robust and data integrity much higher if you use transaction monitors or messaging queues. But even if you don't use any of those facilities you'll end up in a much better shape than if you used a flat text file. The explanation is, your method will update an entity at at time and you can have full error control during the transfer (be it transactional or not) and alert the user or the support staff of any problems on the spot. Whereas when you read your flat text file records and act on them, you're going to generate the same error condition over and over. Fixing problems as soon as they appear is much easier and faster on the long run than just deferring the fix to the time you analyze the results of your flat text file data load, when a lot of damage may be already done.

Performance

And now to the final point. After what has been said, the people still favoring the flat text file are about to show their last card: sometimes it's faster to import or export a flat text file than to do it in other ways, right?

Yes, sometimes it is. Large data loads are probably done faster by using some intermediate storage where the application or database can perform bulk reads or writes. Of course, this intermediate storage is our beloved flat text file(s). Usually, those large data loads are one off jobs. In practice, there are very few, if any, applications that need to perform those on a regular basis. Even large data warehouses are updated incrementally, simply because there is a point where there is not enough time to rebuild them from scratch each time a new bit of data arrives. And in those cases, the bulk of the time is spent refreshing the computed results, not the base level data.

Still, the idea of using the same formats for your one-off and incremental loads looks attractive. Again, look at the reasons explained above and think if it is still worth enslaving yourself and paying those costs forever just because you want to keep the same text file format forever. Specially the first time you need to change that written in stone flat file format used by all those programs around.

As for performance, while there is no discussion that databases usually like to work on data sets as big as possible, the incremental updates are not usually big enough to provide a significant performance improvement over doing them on the spot. And remember, doing the information transfer at the time it is done on the source means that the sender application already has most, if not all, the data that needs to be sent to the other application, which means that you are saving at least another read if you don't defer the transfer for later. In any case, you're saving the read/write time of the flat text file. If you're still doing significantly better with bulk batch loads it probably means that there is still some fundamental flaw in the way the two applications are exchanging data, as taking that logic to the extreme would mean that you would stop using anything but flat text files for exchanging information, even inside the application, wouldn't' you?

Final words of hate against flat text files

I realized that this posts reflects that I passionately hate text files. Perhaps it's because I've had enough of them and my hatred is irrational, but I cannot see how any of the above arguments can be challenged. I've yet to face a situation where using tools less than 10 years old, one has to resort to flat files for data transfer, yet I keep seeing them being created over and over. By this time, I think that you've enough arguments to stop trusting anyone that says that flat text files are "flexible", "efficient" or "cheap to implement". They aren't. Of course, someone selling you this idea probably has a vested interest in you using flat text files, as they know too well that you'll become dependent on them forever to maintain them. Do not trust them.

If you come across one of these consultants, mention briefly SOA, REST, RPC, SAP RFC, RMI, .Net remoting, Oracle SQL Links or the myriad other technologies developed over the last 20 years to address all those needs.

I sometimes dream about the end flat files. Somehow, I break during the night into the customer data center and replace each and every single flat text file interface with a modern, performing approach. I then add a trigger for each change in the target database that writes a record in the flat text files that should have been or written to transfer the data. I bet that in most cases it will take them a week or so to realize that there is something different. Well, for some customers the job would take me much longer than a night, as the number of different flat files flying around is in the hundreds, if not thousands.

Flat text files should have died a long time ago as a way of exchanging data among applications. Let's leave the accountants and finance people play with them for their Excel spreadsheets, saving them to learn anything new and us from having to police them in abusing databases. Let's drop them as soon as we can for anything else.

Friday, 19 June 2009

Google Appengine (III) – Tuning means code refactoring

One of the last tuning gigs I was involved with was related to a content recommendation system. The original implementation was a one by one translation of a statistics algorithms into SQL sentences. It generated a huge number of precalculations stored in a multi million record table that were later on reused when visitors were looking for content. It was even worse than that, as the result size grew exponentially with the number of users.

The solution involved creating a set of stored procedures that were able to calculate in submillisecond times the same results without needing to store any intermediate results. Ah, and I discovered a few bugs on the original implementation that rendered the results of all those precalculations incorrect. Overall the customer was very satisfied with the end result.

After we were done with the testing and verification of results, something that is not an easy task with recommendation systems as it basically involves asking people if the results are more or less matching with their tastes, we set out to incorporate the changes to the live system.
Apart of creating the stored procedures, the changes in client application code amounted to five lines of code. Total. This is not unusual. And application down time was zero. Nothing. Users were using the old recommendation and in their next request they were using the new.

Application code changes are expensive, so usually the ratio of cost/benefit has to be quite high to be worth doing any performance optimization work in this area (except if you're dealing with SAP, of course, where everything is ABAP code)

Fact is, by its own design, SQL is an intermediate middleware layer sitting between application code and your data. Performance tuning happens mostly at the SQL layer or below. Seldom application code is changed.

Not so with Google AppEngine.

GAE SQL Dialect

GAE implements something remotely similar to a very small subset of SQL. Now, SELECT, FROM and WHERE have become so popular keywords that each time anyone wants to implement something even remotely related to a query language is using those keywords.

Be it WMI, Lotus Notes ODBC, GAE, Hibernate or whatever, they want to make it easy to learn at least the basics.

But beyond those keywords, they don't provide anything remotely similar to the richness and expressiveness of the SQL language. Simply put, GAE SQL dialect (called GQL) does not go beyond the simplest of data accesses. Fundamentally, GQL is not a set oriented language. Furthermore, whereas SQL provides many different ways of reaching the same final result, GQL usually provides on, if any at all.

Object model

Say that you're playing with performance improvements. You're looking for ways of minimizing the number of IO operations, the number of indexes or both.

Due to the limitations of GQL, none of the different options can be expressed using the query language. The possible choices range from using references, property lists, inheritance, or a combination of them all. The key part of in all of them is that you'll need to change application code to even try them.

This has a number of implications that make the evaluation of different options much more convoluted than in the traditional SQL server scenario.

Each time you want to test something, you need to do an extensive data conversion from one object model to another. This means time invested in data conversion code, time invested in running and verifying that code. Furthermore, if you finally decide to go for a change in pursuit of better performance, you have to execute that code in the live environment, and that probably means application downtime.
No development environment. GAE SDK does not implement the BigTable functionality or indexing. You cannot simply compare performance of different approaches on the local development environment. You need to use the live GAE for that.
Cost. Free GAE has limitations in the amount of IO you can do and the storage you can use, if you reach those during testing, you'll exceed your quota. Either you'll have to pay or wait for quota renewal.

All this is a bit contradictory. While GAE really makes rapid application development easy and attractive, it is still too young to have developed a set of performance best practices. Your best bet is to reach out for all the documentation available and try to get it right in the early stages.

Good documentation, wrong format

Fortunately, Google has created quite good documentation and tutorials and you can get a lot of learning out of those. Especially the Google IO sessions are valuable insights into how the App Engine operates. Unfortunately, most of that is in video format.

Mind you, I'm not whining about the quality of the conferences. They are excellent, and I'm surprised about how good the GAE team are at that. It is a common topic among the developer community that software development profiles do not value much the quality of verbal communication. Not so with the GAE team.

But video as a documentation platform has a lot of problems. First, you have to watch them end to end to really decide if the content is interesting as they are not structured in a book format with chapters, sections, etc. You cannot read the first paragraphs to decide if it's relevant to the topic you're interested in. And second, they cannot be used as reference material or something that you can bookmark to return later. Third, I can read a lot faster than listen to an speech, so what could have been digested in 15 minutes takes an hour instead.

The supporting materials, mostly slides, help to alleviate those problems somewhat. Ironically, they probably also provide the text that also makes those contents easily searchable on Google.

Now, if I just had time to put together a real GAE app.....

Monday, 23 March 2009

GAE - Abstracting complexity and sharding counters

The history of application development is littered with inventions addressing the problem of filling the conceptual gap between a machine serially executing some instructions acting on data and a business person trying to automate some process. Over the years, layer upon layer of abstractions have been developed, from assemblers, compilers, operating systems to the highly abstracted REST and SOA specs of today.

If you're exploring the Google AppEngine (GAE) you probably have already come across sharded counters.

Basic counters explained

Many applications present some kind of counter to the end user: message boards tell you the number of responses to a topic. Auction systems give you how many people are bidding for an item. Digg tells you how many people have rated an item. And so on, there are too many examples of counters used everywhere.

For performance reasons, it is a common practice to keep those counts stored in the database. It's not that they cannot be calculated at page rendering time, but in some cases the performance hit of retrieving those records and counting them is considered to be excessive.

As always with performance tuning, there is a compromise here, as keeping the counter accurate each time something in the database related to the count changes is expensive and complex in terms of database operations. But for web apps where the count is being retrieved hundreds of times more than needs updating, the trade off is usually worthwile.

Let me remark that as always, "usually" is the pivotal concept. There are specific situations where this does not apply and is not worth performance wise to keep the counter updated. Always apply common sense first, apply some commmon sense second, and third do some performance testing.

The Google Datastore and counters

Since the Google Datastore does not have any analytical capabilities and given its constraints in response times, any counters except the most trivial ones, cannot be calculated at page rendering time. There is simple no other option, as the GAE puts sever restrictions on the amount of data that can be fetched and the time it should take a page to be served.

This is all well and good, Google in this case is simply watching your back so that your users get good response times and its infrastructure is not abused.

However, due to the way the Datastore engine works, a single data item is stored in a single machine. That can create problems if lots of application instances try to update the counter at the same time. Google's advice to solve that is that you "Shard" your counter. "Sharding" a counter means to create many counters and incrementing one of them randomly. Each time you need to present the counter value, you issue a query to retrieve all the counters and add them together.

This nicely avoids the botteleneck in high concurrency situations, but notice how this has increased the overhead of retrieving the counter, and more important, forces the designer to make a decision on the data type of a simple counter.

This breaks the rule of abstractions making things simpler. Instead of the engine being able to nicely scale to your requirements, you need to tweak the engine to adapt to the expected workload. Of course, if someone has to consider that decision is probably happy to face scalability problems, as that can only mean that his/her site is being successful.

I'm not picking up on Google for this, but this is another example where GAE strikes me as unbalanced. On one side it offers excellent tools to quickly and scalably put together web applications, with a low barrier of entry in terms of learning. On the other side, it makes you face design considerations that seem to be more the province of specialized parallel processing engineers.

GAE, continued

I've continued exploring Google AppEngine since my last post. Honestly, I've not had much time to devote to self education or blogging lately, so at this moment my opinions are not yet based on real world, live application, experience. And they are mostly opinions, not hard, evidence based, facts.

GAE fundamentals

I'm new to Phyton and Django. Both of them have been a pleasant surprise. The Phyton language is focused on being practical on a day to day use and have good performance. It's also well designed in two ways: one, there is nothing that looks odd or added as an aftertought, althought I'm sure that the language was not born with all the features it has today. That speaks very well of the language designer.

Second, the libraries and syntax almost lead you to choose the "right" approach to solve a problem, in a way that is structured and readable instead of going the quick hack route. (Perl asbestos suit:on) I've always hated how you could in Perl devise many different solutions to the same problem, the quickest to write being almost always the least readable and plain ugly (Perl asbestos suit:off) While it is perfectly possible to write good structured and object oriented code in PHP and non-.NET VB, there is nothing in both languages to prevent a quick and dirty approach or encourage doing it the right way.

In fact, Python blocking convention could be an advantage here. Enforcing proper indentation discourages deep nesting of code blocks, increasing the chance that a problem is break down into smaller, more manageable pieces. Of course, crappy code can be written in any language you want, and the likelyhood of crapiness increases with the popularity of a programming language, so this is by no means a guarantee that you'll always come across clean and nice Python code. Off topic question, does anyone know what is the shape of the crapiness increase function? Linear? Log? Exp?

If you can forgive the lack of a statement terminator, the weak typing, or the odd blocking scope syntax, then Phyton could not be a bad choice as your next general purpose programming language.

The Django motto "A framework for perfectionists with deadlines" holds true everywhere you look at. I'm convinced that there's a huge productivity advantage using Django over a raw, no framework, approach. I'm no expert in web frameworks, so I'm not sure if the productivity delta is convincing enough for a PHP or .NET web developer to switch. But Django surely is easy to learn, extensible, consistent, scalable, and covers the whole gamut of web app scenarios.

Unfortunately, what is built into GAE is not 100% Django. Not even 1.0 Django. Apparently, Django 1.0 was not released on time for GAE to include it. Also, the GAE storage engine is not even remotely similar to a relational database, and thus the persistence part of the Django framework cannot work unchanged under GAE. To make things a bit worse for the newcomer, instead of not including the supported Django modules, code that imports and references them is accepted by the GAE SDK, only to find those modules missing at run time. Of course, if someone digests all the Django AND GAE documentation beforehand this is not a problem, as you're not going to use a Django feature missing in GAE. But for those learning by exploring, example, and reading documentation at the same time, it makes learning somewhat frustrating and does not add to the "perfectionist with deadlines" motto.

In summary, I've learned a lot and still continue learning, but looks like Google has done overall a good choice on those two fundamentals for their App Engine.

Warning: in my opinion, hands on experience is a primary source of knowledge, at least as important as learning the theoretical fundations. Therefore, what follows could be an entirely wrong opinion on what is GAE market niche not much based on experience but on what I've learned from second hand or lab experiments. In fact, I've yet to set up a live GAE application.

An engine looking for apps

The way one desings your data model is very much influenced by the limitations of the persistence layer that you use. For any given system, you tend to delegate each task to the layer that is best suited to handle them. That means that if you're using a relational database, that layer usually handles querying and data integrity in as much as possible. And in a relational database, that "in as much as possible" is a lot, probably most of it.

Application design for GAE datastore is no different. Python code is going to handle the roles that the Datastore does not support, and given that there are no joins , analytical functions or complex WHERE clauses, that's going to be a lot. In fairness, Google are not calling their storage engine a database. Because it's not. Google calls it a "Datastore", which much more accurate describes what it does and what it does not. It took me a while to realize the difference, as I was hoping that my initial impressions were just a consequence of ignorance. So I kept looking for capabilities similar to a relational engine. But they are simply not there. On the other hand, transactions are there, indexes are there, and a query language is there, but with restrictions that don't put them on par with your basic RDBMS.

So the GAE Datastore is ideally suited to store and retrieve data, and huge amounts of it. But I keep asking myself, what is the point of being able to store such huge amounts of data?

Of course, if you are a Google engineer, this is a no brainer. There are lots and lots of uses for this data, anything that enhances your knowledge and/or service to your customers, and thus your revenue. And probably Google's first choice would be to target ads. But from flat reporting suppporting day to day business processes to decision support related information, those vasts amounts of data are an invaluable asset to your business. So you should be able to take advantage of them, right?

Yes, but currently only if you're a Google engineer. Because then you have access to a nice MapReduce implementation with a few (thousands?) nodes to run it. Armed with that, you can analyze, slice and dice and otherwise play with your data to your heart's content. But unfortunately, you cannot do any of that with GAE. At least today.

And that is what I keep asking myself each time I revisit the GAE. As it stands today, there are alternatives that provide, if not the huge storage and scalability, at least the analytical and reporting power that current applications need today. So nobody is going to move its shopping cart, blog or bulletin board to GAE. Nobody that needs database complexity above the basic level, at least. So what are the kind of applications Google expects to host in GAE? Today, anyone thinking of taking advantage of cloud computing is looking at the likes of Amazon EC or watching closely how Microsoft is adding relational capabilities to its Azure storage engine and waiting for them to be mature enough to jump in.

I can only think of three answers to this. One, Google is just seeding the field and creating some nifty toys expecting that the next Web 3.0 revolution is sparkled by something hosted in AppEngine. This revolutionary application has not yet written or even imagined by anyone, but Google expects GAE to be the platform of choice for running it. Two, Google is going to expand GAE so that its storage engine offers more capabilities that make it attractive to existing applications. It surely would be nice to have access to MapReduce-scale resources or more sophisticated database functionality built in. Third, Google is trying to compete with Amazon EC and Microsoft Azure and falling behind them.

I just hope that it's not the third one. In any event, I've learned to look at AppEngine database performance as the combination of Python and Datastore tuning. This is different from relational performance, where one looks to non-SQL code as the part that usually slows down database operations.

Wednesday, 4 February 2009

Changed beyond recognition and the EAV data model

One of the much maligned performance drags an application can have is its abuse of the so called Entity Attribute-Value (EAV) model for data storage. However, as with many catastrophic events, the performance problem is the result of the failure of multiple elements in the decision chain. Let's examine the motivations behind the EAV on each of the parties involved.

The buyer (or designer) dilemma

You're a person in charge of selecting an standard package. Or you're in charge of design or development of a bespoke package in your medium to large business. At some point, you'll be collecting the so-called "requirements" from future "customers", or application users.

No matter how hard you try, at some point in this process it will became clear to you that you cannot exactly define the data model with enough precision to guarantee that the database design is not going to change along the life of the application.

Usually, what is specially troublesome is to define the information structures to represent data for entities that are external to your business: your consumers and customers. This is because the amount of data that the business has at some point in time depends heavily on the number of initiatives addressing them. The more initiatives, the more disparate information is being collected from the outside.

On top of that, in extreme cases it's not even clear what is the purpose of this information. Somehow everybody agrees that having customer age information is good, but they are not sure of how to cluster it. And if they know how to group ages, this grouping is likely to change in the next six months. Worse yet, they start talking about targeting promotions according to age.

So you design, or request to the potential suppliers, some provision for storing additional entity data in your application. You don't know exactly what information you'll store, how much of it or how often. Buy you surely know that no matter how complete the data model definition appears, someone in the near future is going to request to store something that was not in the data model design.

The marketing dilemma

Whether you're internal developer, consultant or purchaser, you need to be able to answer with a big smile and a resounding YES to the question: does your application support additional data fields? This helps a lot, as any concerns about suitability of the solution can be addressed, "hey, we can always add this bit later" is the answer to all the problems.

No matter how hard you argue about this being the right solution or not, nothing beats the convenience of "unlimited" additional "fields" that can be added to the application.

The implementors dilemma

Now it's time to deliver. How do you comply with the requirement?

Imagine that you're creating a standard, packaged application. To support additional attributes, you start exploring and arrive at pretty much three options.

Provide some interface from your application that allows users to change the data model. Of course this interface must be portable across as many database flavors you're going to support, adapting to the specifics of each data model. Or just make everything of the most generic "string" data type you're able to find. In any case, your database API needs to have support for data definition statements. Last time I checked, the most popular ones did not.
For each entity add to it additional columns such as "ADDED_ATTR_1",
"ADDED_ATTR_2", "ADDED_ATTR_3", and so on. The trouble with this approach is that you're not meeting marketing requirements. The number of fields is not "infinite". On top of that, the names are not exactly descriptive and you have the same problems of data model specific. Note that regardless of your choice, your app will need also some metadata tables to store what those additional attributes mean in the particular customer context.
In this way, you arrive at the EAV design. You simply define a table
like this:
```
ADDITIONAL_INFO(
  ENTITY_NAME
  ENTITY_ID
  ATTRIBUTE_ID
  ATTRIBUTE_VALUE
)
```
That's it, you have designed a generic facility that is able to store customer age. Simple, if customer with ID 35687 is 25 years old you just store ('CUSTOMER',35687, 'AGE','25') Repeat that for all customers. GPS Coordinates? No problem, ('CUSTOMER',35678,'LOCATION','1.123 2.45')

And here, you have it, the origin of the EAV model in relational databases.

Changed beyond recognition

By itself, this technique (should I call it "pattern"?) is not good or bad. As stated, it is a solution for a problem and seems to be a good compromise between flexibility, customization and usability.

Those big applications (Siebel & SAP are the examples closer to me) ship with at least rudimentary facilities to customize their behavior. These facilities are no match for a full featured IDE like Eclipse of Visual Studio, since they are intended to handle small changes in functionality so they don't need to have the bells and whistles of your regular Java or C# IDE. But using those you can easily add to the views the additional data items coming from this table. It's all solved, right?

Yes, up to this point everything fits with the original plan. You have a generic mechanism for extending entities that is flexible enough to accommodate all future needs. However, two things stand in the way to data model nirvana.

First, you'll have to face the problem of exploding data volumes in this table. Let's suppose you have X customers and wish to add four additional data items to the customer entity. You then have 4*X records in this table. Worse yet, each time you display your customer data you are accessing this table four times, making it a point of contention far worse than the original customer table. If you use this table to add detail to transactions the volume explodes even more.

Second, you essentially give up on the database providing any kind of integrity on those additional data items. In general, you cannot enforce integrity constraints, nullability or uniqueness, unless you immerse yourself in writing some complicated triggers. You were doing this because you wanted to store arbitrary data, therefore it seems ilogic to try to enforce rules on what is essentially arbitrary data.

Third, and this is the worst problem, this can easily go out of control. Specially with packaged applications, one must always resist the temptation to rewrite them. As it often happens, the usual way of deploying such packages entails some choices when there is a mismatch between the business process as supported by the package and the same process run by the company. Guess who usually wins? The company, of course. After all, the process has been there for years and even may be regarded as competitive advantage, so why they should bend to the standard packaged way?

Very few companies take the brave step of changing their processes to fit the package. Instead, they customize the application to fit with their purposes. But this when taken to the extreme, specially in business driven projects with no technical expert on the business side leaves the original package unrecognizable. In some cases, I've seen EAV entries used as foreign keys to other tables!

Aside from the problems that this creates at version upgrade time (some packaged applications have not been upgraded in 12 years just because of the cost of redoing all the customization) this also invalidates any assumptions made by the application original developers. Assumptions made about performance, usage scenarios, volumetrics, everything can be invalidated.

In fact, my completely unscientific research indicates that there is an ongoing new trend against packaged applications as the universal solution for all in-house development problems. More specifically, big IT analysis groups are discouraging business to go the packaged application route and write their own from scratch if they anticipate having to change or customize more than 20% of the functionality they are going to use. The argument is that it's cheaper on the long term, as completely overriding application functionality often takes more work than creating that functionality in the first place. On top of that, the development environments for those packaged applications are usually orders of magnitude less productive than standard IDE's.

The performance tuning dilemma

Whereas classic data modeling techniques translate "entities" into tables and attributes into columns, the EAV model tries to store different entities in the same table and also allows for entities to have a variable number of attributes. This works under some circumstances, but almost never works with relational databases.

As for performance, the overall consensus around EAV is that it's bad. In some cases it's just bad as when you get a rainy day just when you intended to have that nice bike ride, in some others bad as when you go thru extensive dental surgery. Some of the problems you may encounter when trying to optimize an EAV design:

As hard as they try, RDBMS optimizers still have lots of problems dealing with a high number of table joins in the same statement. The access plans will be dominated by he need to access the EAV tables over and over.

Yes, I know, relational databases were supposed to be built to solve the problem of handling joins. But never in the scale required for the EAV explosion. In a well designed data model, very few if any operations need to join 12 tables. But remember, EAV will require a join for each attribute accessed.

CPU utilization will be very high since you'll want to access EAV attributes using an index lookup, trying to avoid scanning a large table when you're interested only in always a small subset of it. What is retrieved in a "sane" data design as part of the entity now requires an extra lookup and table access per attribute.

Contention on data and index pages will be very high as those attributes are all the time accessed by everybody even in the simplest of cases. Database without row-level locking will suffer the most, as without partition you cannot ensure that the attributes for one entity and another are in different pages or tables.

What can be done to improve performance of EAV applications? The problem when tuning such designs is that, at least in my case, I always try to avoid changing the data model as part of the exercise. Changing the data model and putting all those attributes in tables where they belong means to change the data model and by definition this means changing also controller logic. Which lands us in very expensive territory, if possible at all (do you have source code?)

The best I've been able to do with those applications without touching the data model has been always around storage layout. Partitioning the table using DMBS features. Slicing the big table based on ENTITY_NAME will reduce overall contention and confuse the optimizer a bit less because you're now more or less localizing access according to entity.
Making sure that the EAV tables are cached (if appropriate) and do not fragment with frequent updates is another way of at least avoiding the performance deteriorate even more.
Apply (1) and (2) to indexes if possible. Needless to say, ensure that
```
(ENTITY_NAME,ENTITY_ID)
```
is the primary key.
Check SQL statements and verify that they always access the table using ENTITY_NAME and ENTITY_ID values. Yes, someone knows that there is not a single object other than customer 24 with ID=24 in the whole system, but the RDBMS cannot use the primary key to access it.
Avoid usage of EAV values as foreign key, if possible.

None of those techniques will produce any spectacular performance improvements, so your reputation may suffer. But you can always point your clients to this article so they understand a bit better the situation. And someone would possibly think again before committing this mistake in the future.

Does this means that EAV is essentially bad? No, I was not trying to express that. What I was trying to express was that this EAV idea is of very limited applicability in relational databases. It will not scale or support anything beyond the simplest "let's store the second telephone number here".

For those that really need this kind of flexibility, my advice would be to look for hash table based object storage or document indexing. Of course, you'll miss the ability of the relational database to join things together, so perhaps a mixed approach would be better.

And by any means, follow the expert advice: if you need to change more than 20% of your purchased package (make that 15%) throw it away and write your own.

Wednesday, 28 January 2009

SQL Server Express installation

How things have changed, that what I was thinking when I finished the install of SQL Server Express edition.

But, as almost always, I've started with the end of the story.

A close relative is soon going to change roles at work. She's moving to a system administration position, where, among other tasks, she's going to have to deal with a lot of MS SQL server tasks.

I was therefore asked about the best way to learn "SQL" After some more questions it became clear to me that "SQL" was in fact Microsoft SQL Server. I've always been amazed at how well Microsoft has done in the marketing side, since for quite a lot of people, even technical ones, the word "SQL" means "Microsoft SQL Server", apparently ignoring all the rest of the relational databases out there.

Since I know too well that the SQL language is not that standard, my advice was to get some sort of limited free SQL Server edition and use that as a learning platform. You can probably learn any other RDBMS and translate most of your knowledge to MS SQL, but there will be always parts of the language, storage and indexing concepts that are specific to each product.

How wrong I was, not anything sort of pared down RDBMS. MS makes available for download the full SQL server software suite, including the multidimensional database engine. There are restrictions on its use, of course, and I must confess that I read them, understood them, confirmed that it was Ok to use for learning and then and forgot the details. Remember, you should read them to make sure you're not accidentally infringing. If you want to learn the ins and outs of the MS product this is the way to go, as nothing is missing from the commercial package.

I've always been very supportive of these kind of offerings. People get a chance to get marketable skills and vendors get a pool of people that at least is familiar with the software that can help paying customers get the most out of their purchased product. Before the big players made those kind of packages available, I've seen too many sales reps declaring "piracy" on a junior developer taking software home just to be able to learn more in his/her spare time.

Back to the story. I proceeded to download the package. At this point, I faced the reality I have been able to ignore for the last few years. Windows server product installations have become complex. Way more than they used to be. I had first to read the prerequisites and install and update the required OS components. After a couple of reboots, I was ready to execute the installation program.

But I (wrongly) assumed that installation was going to be more or less the same as it was before. In the past, to get a basic installation working it was just a matter of accepting the defaults and letting the installer do its thing. Oh well, and perhaps a reboot or two. The end result was probably not efficient or secure, but at least got you to point of having something that worked. However, this path of least resistance has been taken to the extreme by Microsoft, to the point of being the root of much of the problems Windows is famous for.

First, there are the security problems. In terms of security, Microsoft always in the past compromised safety in the name of usability. That philosophy bounced back to them in many dire ways. Users got used to the idea that they could change anything in their machines without restrictions and started to complain when Vista was released, as it was for them an anomaly to respond to UAC prompts. The lives of malware creators were much easier as in this default environment any user level action could trigger a system change without further permission.

Application installations also dumped by default into the hard disks tens of megabytes of code, templates, galleries and whatever else that were not really going to be used in 90% of the cases. This also created a competitive disadvantage since latest netbooks have limited disk capacity and thus cannot easily accommodate the standard Office installations.

Seems that Microsoft is trying to fix those problems. In MS SQL Server case, I was confronted with a lot of choices. The installation package wants you to specify exactly what you want to install, where and which user accounts should run the services. Which made me trying to get on with the most obvious answers without checking the documentation first. The installation program then went ahead, only to be stopped at the last stage with an error stating the user account I specified was not valid to run the service. Complaining to myself, I then read the documentation and realized that I had not given the correct access rights to the user account.

Ok, I thought, let's tweak the service startup parameters and change the account for the SQL Service. Ooops, since the initialization scripts of the database were not executed this instance cannot be started, regardless of the user you choose. Ok, let's run again the installation. No way, you've already an installation of SQL server in the machine and the install program rejects overwriting it. Ok, let's remove the instance that cannot start. No way, the instance cannot be removed without the service being started at least once first, which cannot happen until you the installation executes the installation scripts. Which cannot happen until the instance can be started, which cannot happen until....

I found a way (and perhaps it's not the best, I'm no expert on database installation procedures) of this chicken and egg problem by using the uninstall program icon. This program does not need your database instances to be in an usable state and just wipes all the installation from the machine. After doing that, I ran again the install program taking care this time to specify the correct user account with the correct access rights and everything went fine.

Moral of the story? Microsoft listened to criticisms. They tried to make their products more secure by default and less bloated by default. By doing that, they've sacrificed the ease and convenience that were once sales arguments for their products. I guess that this is no longer relevant for them, giving that they have a market dominand position in a lot of markets, databases however not being one of them.

This also gives good arguments in favor of a standard for application packaging. In the last few years, the only software installs that I've actually done myself have been Debian packages. They are self-contained, install automatically the dependent packages (no need for you to track and install separate updates or service packs) and offer secure reasonable defaults that you can change later. And best of all, no reboots. Unless, of course, you're upgrading your kernel.

If Microsoft could dictate such standard, lives would be much simpler for everybody. Unfortunately, Microsoft has different ways of updating across its own range of products, so this is probably never going to happen.