Database Performance Tuning: GAE, continued

I've continued exploring Google App Engine since my last post. Honestly, I've not had much time to devote to self-education or blogging lately, so at this moment my opinions are not yet based on real-world experience with a live application. And they are mostly opinions, not hard, evidence-based facts.

GAE fundamentals

I'm new to Python and Django. Both of them have been a pleasant surprise. The Python language is focused on being practical for day-to-day use and on having good performance. It's also well designed in two ways. First, there is nothing that looks odd or added as an afterthought, although I'm sure that the language was not born with all the features it has today. That speaks very well of the language designer.

Second, the libraries and syntax almost lead you to choose the "right" approach to solve a problem, in a way that is structured and readable, instead of going the quick hack route. (Perl asbestos suit: on) I've always hated how in Perl you could devise many different solutions to the same problem, the quickest to write being almost always the least readable and plain ugly. (Perl asbestos suit: off) While it is perfectly possible to write well-structured, object-oriented code in PHP and non-.NET VB, there is nothing in either language to prevent a quick and dirty approach or to encourage doing it the right way.

In fact, Python's block convention could be an advantage here. Enforcing proper indentation discourages deep nesting of code blocks, increasing the chance that a problem is broken down into smaller, more manageable pieces. Of course, crappy code can be written in any language you want, and the likelihood of crappiness increases with the popularity of a programming language, so this is by no means a guarantee that you'll always come across clean and nice Python code. Off-topic question: does anyone know the shape of the crappiness increase function? Linear? Log? Exp?
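To make the nesting point concrete, here is a minimal sketch. The order structure and all the function names are made up for illustration; they are not from any real application. The first function piles up indentation levels, while the second version gets the same result from small pieces that each stay shallow.

```python
# Hypothetical data and names, used only to illustrate nesting depth.

def total_nested(orders):
    """Deeply nested version: every loop and condition adds a level."""
    total = 0
    for order in orders:
        if order["paid"]:
            for item in order["items"]:
                if item["qty"] > 0:
                    total += item["qty"] * item["price"]
    return total


def item_value(item):
    """Value of one order line; lines with no quantity count as zero."""
    return item["qty"] * item["price"] if item["qty"] > 0 else 0


def order_value(order):
    """Value of one order; unpaid orders count as zero."""
    if not order["paid"]:
        return 0
    return sum(item_value(item) for item in order["items"])


def total_flat(orders):
    """Same result as total_nested, built from small, shallow functions."""
    return sum(order_value(order) for order in orders)


orders = [
    {"paid": True, "items": [{"qty": 2, "price": 5}, {"qty": 0, "price": 9}]},
    {"paid": False, "items": [{"qty": 1, "price": 100}]},
]
```

Both functions return the same total for the sample data. Nothing forces the second style, of course, but four levels of indentation are uncomfortable enough that splitting the work up starts to look attractive.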

If you can forgive the lack of a statement terminator, the dynamic typing, or the odd indentation-based block syntax, then Python would not be a bad choice as your next general-purpose programming language.

The Django motto, "The web framework for perfectionists with deadlines", holds true everywhere you look. I'm convinced that there's a huge productivity advantage in using Django over a raw, no-framework approach. I'm no expert in web frameworks, so I'm not sure if the productivity delta is convincing enough for a PHP or .NET web developer to switch. But Django surely is easy to learn, extensible, consistent, scalable, and covers the whole gamut of web app scenarios.

Unfortunately, what is built into GAE is not 100% Django. Not even Django 1.0. Apparently, Django 1.0 was not released in time for GAE to include it. Also, the GAE storage engine is not even remotely similar to a relational database, and thus the persistence part of the Django framework cannot work unchanged under GAE. To make things a bit worse for the newcomer, instead of rejecting the unsupported Django modules up front, the GAE SDK accepts code that imports and references them, and you only find those modules missing at run time. Of course, if someone digests all the Django AND GAE documentation beforehand this is not a problem, as you're not going to use a Django feature missing in GAE. But for those learning by exploring, following examples, and reading documentation at the same time, it makes learning somewhat frustrating and does not live up to the "perfectionists with deadlines" motto.

In summary, I've learned a lot and I'm still learning, but it looks like Google has, overall, made a good choice with those two fundamentals for App Engine.

Warning: in my opinion, hands-on experience is a primary source of knowledge, at least as important as learning the theoretical foundations. Therefore, what follows could be an entirely wrong opinion on GAE's market niche, based not so much on experience as on what I've learned second hand or from lab experiments. In fact, I've yet to set up a live GAE application.

An engine looking for apps

The way you design your data model is very much influenced by the limitations of the persistence layer that you use. For any given system, you tend to delegate each task to the layer that is best suited to handle it. That means that if you're using a relational database, that layer usually handles querying and data integrity as much as possible. And in a relational database, that "as much as possible" is a lot, probably most of it.

Application design for the GAE Datastore is no different. Python code is going to handle the roles that the Datastore does not support, and given that there are no joins, analytical functions or complex WHERE clauses, that's going to be a lot. In fairness, Google is not calling its storage engine a database. Because it's not. Google calls it a "Datastore", which much more accurately describes what it does and what it does not do. It took me a while to realize the difference, as I was hoping that my initial impressions were just a consequence of ignorance. So I kept looking for capabilities similar to a relational engine. But they are simply not there. On the other hand, transactions are there, indexes are there, and a query language is there, but with restrictions that don't put them on par with your basic RDBMS.
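As a rough sketch of what "Python code handles it" means in practice, the following emulates a join and a GROUP BY in application code. Plain dictionaries stand in for stored entities, no GAE API is used, and the entity shapes and function names are assumptions made only for this illustration.

```python
# Plain dicts stand in for Datastore entities; nothing here calls GAE.
authors = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
]
posts = [
    {"title": "GAE notes", "author_id": 1},
    {"title": "Django tips", "author_id": 2},
    {"title": "More GAE", "author_id": 1},
]


def join_posts_with_authors(posts, authors):
    """Emulate an inner join on posts.author_id = authors.id."""
    authors_by_id = {author["id"]: author for author in authors}
    joined = []
    for post in posts:
        author = authors_by_id.get(post["author_id"])
        if author is not None:  # inner join: drop posts without an author
            joined.append({"title": post["title"], "author": author["name"]})
    return joined


def count_posts_per_author(joined_rows):
    """Emulate SELECT author, COUNT(*) ... GROUP BY author."""
    counts = {}
    for row in joined_rows:
        counts[row["author"]] = counts.get(row["author"], 0) + 1
    return counts


rows = join_posts_with_authors(posts, authors)
post_counts = count_posts_per_author(rows)
```

In a relational database both steps would be one SQL statement executed by the engine. Here every row has to be fetched and walked through in Python, which is why the amount and quality of that Python code matters so much.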

So the GAE Datastore is ideally suited to store and retrieve data, and huge amounts of it. But I keep asking myself, what is the point of being able to store such huge amounts of data?

Of course, if you are a Google engineer, this is a no-brainer. There are lots and lots of uses for this data, anything that enhances your knowledge of and/or service to your customers, and thus your revenue. And probably Google's first choice would be to target ads. But from flat reporting supporting day-to-day business processes to decision support information, those vast amounts of data are an invaluable asset to your business. So you should be able to take advantage of them, right?

Yes, but currently only if you're a Google engineer. Because then you have access to a nice MapReduce implementation with a few (thousand?) nodes to run it. Armed with that, you can analyze, slice and dice and otherwise play with your data to your heart's content. But unfortunately, you cannot do any of that with GAE. At least today.
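For readers who have not met the MapReduce style, here is a toy, single-process sketch of its three steps (map, group by key, reduce). It is in no way Google's implementation and it distributes nothing; the page view records and every function name are invented for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record, yielding (key, value) pairs."""
    for record in records:
        for key, value in mapper(record):
            yield key, value


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each group of values into a single result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical page view log.
page_views = [
    {"page": "/home", "ms": 120},
    {"page": "/cart", "ms": 300},
    {"page": "/home", "ms": 80},
]


def views_mapper(record):
    """Emit one count per page view."""
    yield record["page"], 1


def sum_reducer(key, values):
    """Add up the counts for one page."""
    return sum(values)


views_per_page = reduce_phase(
    shuffle(map_phase(page_views, views_mapper)), sum_reducer
)
```

The appeal of the real thing is that the map and reduce steps can be spread over many machines, which is exactly the part a GAE application cannot reach.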

And that is what I keep asking myself each time I revisit GAE. As it stands, there are alternatives that provide, if not the huge storage and scalability, at least the analytical and reporting power that current applications need. So nobody is going to move their shopping cart, blog or bulletin board to GAE. Nobody that needs database complexity above the basic level, at least. So what kind of applications does Google expect to host in GAE? Today, anyone thinking of taking advantage of cloud computing is looking at the likes of Amazon EC2, or watching closely how Microsoft is adding relational capabilities to its Azure storage engine and waiting for them to be mature enough to jump in.

I can only think of three answers to this. One, Google is just seeding the field and creating some nifty toys, expecting that the next Web 3.0 revolution will be sparked by something hosted in App Engine. This revolutionary application has not yet been written or even imagined by anyone, but Google expects GAE to be the platform of choice for running it. Two, Google is going to expand GAE so that its storage engine offers more capabilities that make it attractive to existing applications. It surely would be nice to have access to MapReduce-scale resources or more sophisticated database functionality built in. Three, Google is trying to compete with Amazon EC2 and Microsoft Azure and falling behind them.

I just hope that it's not the third one. In any event, I've learned to look at App Engine database performance as the combination of Python and Datastore tuning. This is different from relational performance, where one usually looks at the non-SQL code as the part that slows down database operations.

Monday, 23 March 2009
