Wednesday, 15 October 2008

The Google Application Engine

In the beginning, there was timesharing

In the beginning, there was timesharing. This was in the early, early days of computing, before most of us were even born. At the time, computers were big, expensive, complex and extremely difficult to operate. Not to mention underpowered in comparison with, say, a domestic media player. Since most business did not have the resources -read money- or even the need to use one of those monsters full time, you basically rented time from the machine. This model of using applications was perceived as a win-win, as the customer did not have to pay the up front costs of maintaining all that infrastructure and the supplier could maximize its return on what was at the time, a significant investiment.

Over the years, as demand for usage of computers increased and prices went down, more and more business found that it was profitable in the long run to own and operate the monsters by themselves. Thus, the timesharing business model gradually faded, althought some parts of it survived. Specially for the big mainframe iron, IBM still continues to "rent" their software and "cap" the CPU cycles that a modern mainframe uses based on a monthly fee.

I'm not sure if this model is still used literally. I'm sure that 10 years ago it was still in use (anyone remembers the defunct Comshare company?) But as most everything in the computer industry, the model has not survivied literally, it has morphed over time while keeping the same conceptual underpinnings.

Cloud Computing

"Cloud Computing" is the latest industry buzzword. Basically, it means that when you use applications, those applications are not running in hardware you own, you don't even know what hardware is used. It also means that the provider of those applications has to worry about availability, scalability and other concerns that were responsibility of the company IT department. It also means that the provider can take advantage of economies of scale, by for example reusing transparently its machine resources to support many different customers across different timezones. It can also spread geographically its machines to save on communication costs and improve response times, and apply other optimizations.

So, are we seeing the return of timesharing? Some would say that yes, this is essentially the same model applied in the 60s, only using bigger, more powerful and cheaper machines. And yes, this is a repeat of the SaaS (Software as a Service) wave that was the latest industry buzzword five years ago. Of course there are a lot of similarities, but there are a couple of points that make things different now.

First companies adopting cloud services are making a choice from their current established infrastructure and the supplier offerings. They have experience, can compare costs and are much more conscious of the trade offs, limitations and opportunities that each model has. This was not the case with timesharing, since the costs were so high that none could really afford to supply internally the same computing resources. This time it's a choice, not an obligation.

Second, some software categories have become "commoditized" For some software categories (Mail clients, Office productivity apss, even CRM) one can choose from different offerings and there is a clear relationship of the price/features ratio. The timesharing of the past was almost always done on either higly specialized number crunching applications (hardly classificable as commodities) or in completely custom built business software.

The Google Application Engine (GAE)

In the past months we've seen the launch of Amazon E3, a notable offering that is proving to be a cost effective alternative to housing or hosting providers and also have the resources to scale should the needs of your application increase. As a counter offering, Google has launched its beta Application Engine. Its main difference is that Amazon has split their offerings into infrastructure categories, such a storage service (S3) and a "super hosting" service (EC) whereas Google's offering is centred around an application hosting environment that includes more built in facilities, but is also more restricted than Amazon's.

Basically, Google offers to share its enormous infrastructure, for as yet undisclosed price. The advantage is obvious, Google has proved already that they can scale to a massive number of users and storing massive amounts of data without disruptions and without degrading the quality of the service provided to the users. They have been developing and refining their own technology stack for years, all of that with the eye on massive scalability. It's very unlikely that anybody outside Yahoo, Amazon or Microsoft can match that, much less with limited resources in comparison with what Google can put together.

Google has not yet provided all the details, and the service is still in beta. But some of the main points of the offering are already apparent. There are already lots of reviews where you can find their strengths and weaknesses, written by people much more knowledgeable than me at web development. But for me, the interesting part is their storage engine. Google provide an abstraction called the "datastore", that is, an API that allows programs hosted in the App Engine servers to store and retrieve data. You don't have any other means of accessing the data persisted than the datastore API, so you better familiarize with it before delving further in any application design.

Google datastore, limited?

The most striking feature of the datastore is that neither it's relational, nor it tries to be. It's a sort of object oriented database, where Python (the only language the App Engine supports so far) objects defined in terms of Python classes are stored and retrieved. Optionally, indexes can be defined on properties, but an index only indexes a single property. Primary keys are usually automatically generated, and relationships between objects are handled via "Reference" types.

The datastore can then be queried and objects retrieved. The Google Query Language (gql) has familiar SQL syntax, but outside of the syntax is a completely different animal. It neither allow joining entities, nor to use very common combinations of conditionals in its WHERE clause. There are no analytical functions (COUNT, SUM) and there is a limit of 1000 objects per result set.

Is this a rich environment for building data intensive applications? It all depends on your idea of "rich" and "data intensive" Of course all the missing features can be worked around in some way or another. But, and this is the difficult part, probably they should not be. By that I mean that anyone can provide iterative code to perform the joins, aggregations and whatever else you find missing from gql, but that does not mean you should do it.

For one thing, the GAE has strict limits on the CPU cycles that a request can consume, so you'll probably use them quickly if you write too much code. Those limits are there to protect Google shared resources (nobody likes its machine cycles being sucked up by the guy from the next door) So joining entities like you were joining tables is out of the question. Same for counting, aggregating and other time consuming operations.

Scaling in a scale that you cannot even start to imagine

All those limitations are, at the very least, irritating. Some of them even make completely unfeasible some kind of applications. At first, I was felt deceived by GAE. However, after browsing some more documentation I soon realized what GAE is all about.

Some background is in order: I live and breathe business applications. By that, I mean that I'm used to the scale of ERPs or Data Warehouses that span, at best (one would say sometimes worst) thousands of users. That's a formidable challenge in itself, and one that takes a dedicated army of engineers to keep running day to day. The usual mid sized system is aimed at a medium size business or a department inside a big corporation, never tops 150 users. And I've devoted a significant part of my professional efforts to study those environments and to make them perform better.

Now, Google plays in another different league. They provide consumer applications. While business users can be in the thousands, consumers are in the millions range. That takes another completely different mindset when planning and scaling applications. The GAE restrictions are not there just because Google engineers are timid or have developed an incomplete environment. They are there because GAE can make your application scale very big. Awfully big. With that in mind, you can start to explain to yourself why GAE datastore has so many restrictions. Those restrictions are luxuries that you can only afford when you can limit your user base to most a few thousands users. If you want to scale beyond that, you have to give up on some things.

Consumer applications are different

In short, don't expect to see shortly an ERP built atop of GAE. It's not the ideal environment for that. But it's equally unlikely that there is ever consumer level demand for your own ERP, isn't it? Think of GAE then as an ideal platform for consumer level applications. Then it starts to make sense. Think of Amazon EC as a super scalable hosting service, where you basically can make your application scale in terms of hardware, but you're going to be on your own when defining how to make that hardware perform.

Even so, GAE and related environments bring a new set of challenges, at least for me, in terms of how to efficiently store and retrieve data. I'll keep posting my findings, and I'm sure it will be a lot of fun to explore and discover.

No comments:

Post a Comment