Monday, 23 March 2009

GAE - Abstracting complexity and sharding counters

The history of application development is littered with inventions addressing the problem of filling the conceptual gap between a machine serially executing some instructions acting on data and a business person trying to automate some process. Over the years, layer upon layer of abstractions have been developed, from assemblers, compilers, operating systems to the highly abstracted REST and SOA specs of today.

If you're exploring the Google AppEngine (GAE) you probably have already come across sharded counters.

Basic counters explained

Many applications present some kind of counter to the end user: message boards tell you the number of responses to a topic. Auction systems give you how many people are bidding for an item. Digg tells you how many people have rated an item. And so on, there are too many examples of counters used everywhere.

For performance reasons, it is a common practice to keep those counts stored in the database. It's not that they cannot be calculated at page rendering time, but in some cases the performance hit of retrieving those records and counting them is considered to be excessive.

As always with performance tuning, there is a compromise here, as keeping the counter accurate each time something in the database related to the count changes is expensive and complex in terms of database operations. But for web apps where the count is being retrieved hundreds of times more than needs updating, the trade off is usually worthwile.

Let me remark that as always, "usually" is the pivotal concept. There are specific situations where this does not apply and is not worth performance wise to keep the counter updated. Always apply common sense first, apply some commmon sense second, and third do some performance testing.

The Google Datastore and counters

Since the Google Datastore does not have any analytical capabilities and given its constraints in response times, any counters except the most trivial ones, cannot be calculated at page rendering time. There is simple no other option, as the GAE puts sever restrictions on the amount of data that can be fetched and the time it should take a page to be served.

This is all well and good, Google in this case is simply watching your back so that your users get good response times and its infrastructure is not abused.

However, due to the way the Datastore engine works, a single data item is stored in a single machine. That can create problems if lots of application instances try to update the counter at the same time. Google's advice to solve that is that you "Shard" your counter. "Sharding" a counter means to create many counters and incrementing one of them randomly. Each time you need to present the counter value, you issue a query to retrieve all the counters and add them together.

This nicely avoids the botteleneck in high concurrency situations, but notice how this has increased the overhead of retrieving the counter, and more important, forces the designer to make a decision on the data type of a simple counter.

This breaks the rule of abstractions making things simpler. Instead of the engine being able to nicely scale to your requirements, you need to tweak the engine to adapt to the expected workload. Of course, if someone has to consider that decision is probably happy to face scalability problems, as that can only mean that his/her site is being successful.

I'm not picking up on Google for this, but this is another example where GAE strikes me as unbalanced. On one side it offers excellent tools to quickly and scalably put together web applications, with a low barrier of entry in terms of learning. On the other side, it makes you face design considerations that seem to be more the province of specialized parallel processing engineers.

GAE, continued

I've continued exploring Google AppEngine since my last post. Honestly, I've not had much time to devote to self education or blogging lately, so at this moment my opinions are not yet based on real world, live application, experience. And they are mostly opinions, not hard, evidence based, facts.

GAE fundamentals

I'm new to Phyton and Django. Both of them have been a pleasant surprise. The Phyton language is focused on being practical on a day to day use and have good performance. It's also well designed in two ways: one, there is nothing that looks odd or added as an aftertought, althought I'm sure that the language was not born with all the features it has today. That speaks very well of the language designer.

Second, the libraries and syntax almost lead you to choose the "right" approach to solve a problem, in a way that is structured and readable instead of going the quick hack route. (Perl asbestos suit:on) I've always hated how you could in Perl devise many different solutions to the same problem, the quickest to write being almost always the least readable and plain ugly (Perl asbestos suit:off) While it is perfectly possible to write good structured and object oriented code in PHP and non-.NET VB, there is nothing in both languages to prevent a quick and dirty approach or encourage doing it the right way.

In fact, Python blocking convention could be an advantage here. Enforcing proper indentation discourages deep nesting of code blocks, increasing the chance that a problem is break down into smaller, more manageable pieces. Of course, crappy code can be written in any language you want, and the likelyhood of crapiness increases with the popularity of a programming language, so this is by no means a guarantee that you'll always come across clean and nice Python code. Off topic question, does anyone know what is the shape of the crapiness increase function? Linear? Log? Exp?

If you can forgive the lack of a statement terminator, the weak typing, or the odd blocking scope syntax, then Phyton could not be a bad choice as your next general purpose programming language.

The Django motto "A framework for perfectionists with deadlines" holds true everywhere you look at. I'm convinced that there's a huge productivity advantage using Django over a raw, no framework, approach. I'm no expert in web frameworks, so I'm not sure if the productivity delta is convincing enough for a PHP or .NET web developer to switch. But Django surely is easy to learn, extensible, consistent, scalable, and covers the whole gamut of web app scenarios.

Unfortunately, what is built into GAE is not 100% Django. Not even 1.0 Django. Apparently, Django 1.0 was not released on time for GAE to include it. Also, the GAE storage engine is not even remotely similar to a relational database, and thus the persistence part of the Django framework cannot work unchanged under GAE. To make things a bit worse for the newcomer, instead of not including the supported Django modules, code that imports and references them is accepted by the GAE SDK, only to find those modules missing at run time. Of course, if someone digests all the Django AND GAE documentation beforehand this is not a problem, as you're not going to use a Django feature missing in GAE. But for those learning by exploring, example, and reading documentation at the same time, it makes learning somewhat frustrating and does not add to the "perfectionist with deadlines" motto.

In summary, I've learned a lot and still continue learning, but looks like Google has done overall a good choice on those two fundamentals for their App Engine.

Warning: in my opinion, hands on experience is a primary source of knowledge, at least as important as learning the theoretical fundations. Therefore, what follows could be an entirely wrong opinion on what is GAE market niche not much based on experience but on what I've learned from second hand or lab experiments. In fact, I've yet to set up a live GAE application.

An engine looking for apps

The way one desings your data model is very much influenced by the limitations of the persistence layer that you use. For any given system, you tend to delegate each task to the layer that is best suited to handle them. That means that if you're using a relational database, that layer usually handles querying and data integrity in as much as possible. And in a relational database, that "in as much as possible" is a lot, probably most of it.

Application design for GAE datastore is no different. Python code is going to handle the roles that the Datastore does not support, and given that there are no joins , analytical functions or complex WHERE clauses, that's going to be a lot. In fairness, Google are not calling their storage engine a database. Because it's not. Google calls it a "Datastore", which much more accurate describes what it does and what it does not. It took me a while to realize the difference, as I was hoping that my initial impressions were just a consequence of ignorance. So I kept looking for capabilities similar to a relational engine. But they are simply not there. On the other hand, transactions are there, indexes are there, and a query language is there, but with restrictions that don't put them on par with your basic RDBMS.

So the GAE Datastore is ideally suited to store and retrieve data, and huge amounts of it. But I keep asking myself, what is the point of being able to store such huge amounts of data?

Of course, if you are a Google engineer, this is a no brainer. There are lots and lots of uses for this data, anything that enhances your knowledge and/or service to your customers, and thus your revenue. And probably Google's first choice would be to target ads. But from flat reporting suppporting day to day business processes to decision support related information, those vasts amounts of data are an invaluable asset to your business. So you should be able to take advantage of them, right?

Yes, but currently only if you're a Google engineer. Because then you have access to a nice MapReduce implementation with a few (thousands?) nodes to run it. Armed with that, you can analyze, slice and dice and otherwise play with your data to your heart's content. But unfortunately, you cannot do any of that with GAE. At least today.

And that is what I keep asking myself each time I revisit the GAE. As it stands today, there are alternatives that provide, if not the huge storage and scalability, at least the analytical and reporting power that current applications need today. So nobody is going to move its shopping cart, blog or bulletin board to GAE. Nobody that needs database complexity above the basic level, at least. So what are the kind of applications Google expects to host in GAE? Today, anyone thinking of taking advantage of cloud computing is looking at the likes of Amazon EC or watching closely how Microsoft is adding relational capabilities to its Azure storage engine and waiting for them to be mature enough to jump in.

I can only think of three answers to this. One, Google is just seeding the field and creating some nifty toys expecting that the next Web 3.0 revolution is sparkled by something hosted in AppEngine. This revolutionary application has not yet written or even imagined by anyone, but Google expects GAE to be the platform of choice for running it. Two, Google is going to expand GAE so that its storage engine offers more capabilities that make it attractive to existing applications. It surely would be nice to have access to MapReduce-scale resources or more sophisticated database functionality built in. Third, Google is trying to compete with Amazon EC and Microsoft Azure and falling behind them.

I just hope that it's not the third one. In any event, I've learned to look at AppEngine database performance as the combination of Python and Datastore tuning. This is different from relational performance, where one looks to non-SQL code as the part that usually slows down database operations.