Database Performance Tuning: 2015

Monday, 22 June 2015

Django and Python best practices

After a few weeks reading an extensive, and quite complex, Python/Django code base, I’ve realized that there are a few simple practices that can make a significant difference in how effectively and quickly one can pick up an application. Not being up to now an intensive Python user, I was expecting to catch up on the code with more or less the same level of effort it takes me to grasp a piece of C, SQL, or Java.

But it hasn’t happened as quickly as I expected.

I’ve found myself tracing with a debugger the application, not looking for bugs, but trying to understand what it does. In my mind, this is an admission of defeat: I can’t understand the code by reading it, I’ve to watch it in motion to be sure my mental image of what the code does and what the code actually does match.

Debugging is the task of verifying why your mental image of the code should do is not matching with what it actually does. Not the opposite. When you don’t know what some code does, you should be able to know it by reading it.

And I’ve realized why it was taking so much time. Python is so powerful and expressive that has its own shoot-yourself-in-the-foot factor that can be, with a big enough code base, equally dangerous than C, SQL or Java shoot-yourself-in-the-foot pitfalls.

So I’ve put together this short guide with the list of things I’d want to see in code that I’ve not written myself. Which are really the list of things I want to keep an eye when I write Python code in the future.

So this being a case of either my code reading abilities being weak or the code not being well written, I of course prefer to blame the code. Not the people to wrote it, of course: I’ve real world examples of each of these entries, but the point is not to shoot blame around, but rather to make the code more readable, shareable, and in the future, easier for newcomers to understand without resorting to tracing it with a debugger.

Don’t fight or otherwise reinvent Django or the standard library

Django provides built-in functionality to validate your data. To enable referential integrity. To deliver web pages. To do a lot of things. Django has been around for years, and improved over and over by a lot of people. So before implementing anything, think twice. Look around in the Django docs and check if there’s something already built in to do that.

In particular, use Django forms to validate data. Use Django validators in models. Use Django ForeingKey, use clean_data to actually … clean data. Don’t reinvent the wheel if there’s a perfectly good, already debugged and reusable, wheel available.

Use the Python standard comment syntax

Document every single parameter your function accepts. If your function has side effects, document them. If your function throws exceptions, state so in the documents.

The only acceptable exception for this rule is for methods that override or implement an existing Django convention. That would be an unnecessary restatement of what is already said. Except of course if your override a standard framework method and add some special contract.

It is much, much worse to have misleading documentation than no documentation at all because it creates cognitive dissonance. If you’re changing a method and not updating the documentation, you’re just confusing future readers that will discover sooner or later that the documentation does not match what the code does, and they’ll throw away the documentation anyway and throw a few expletives at you or your family. So it is best to throw away the documentation than to keep it obsolete before someone else loses time unnecessarily discovering that it is outdated.

Use Python parameter passing to… actually pass parameters

Python is famous for its readability, and its duck typing prevents a lot of mistakes. Named parameters and default values are a convenient way to plainly state what a method does. Method signatures can also be read by your IDE and used in code completion.

There are native ways to pass arguments to method calls. Don’t use JSON or HttpRequest to pass parameter values to a function that is not an URL handler. Period.

See the point on kwargs for more details.

Be explicit. Be defensive

When you consider augmenting the signature of a function by adding more parameters, just add them and provide sensible defaults.

You may think that you’re just making your code future-proof by using variable parameter lists. You are wrong. You’re just making it more confusing and difficult to follow.

Consider this code

 class base(object):  
   def blah(self, param1, **kwargs)  
   ...

 class derived(base):  
   def blah(elf, param1, **kwargs)

Don’t do it. If when you create it, blah does not need more than 1 argument, declare it like so. Leaving **kwargs forces the reader of the code to go thru the whole function body to verify if you’re actually using it. Don’t worry about future proofing your code, any half decent IDE will tell you which methods could have issues by your change in much less time than you can think about it. So just declare this:

 class base(object):  
   def blah(self, param1, param2=None)  
   ....

And future readers of your code will be able to tell what your function accepts.

MVC: the whole application is NOT a web page

According to Django’s own site:

In our interpretation of MVC, the “view” describes the data that gets presented to the user. It’s not necessarily how the data looks, but which data is presented. The view describes which data you see, not how you see it. It’s a subtle distinction.

So, in our case, a “view” is the Python callback function for a particular URL, because that callback function describes which data is presented.
Furthermore, it’s sensible to separate content from presentation – which is where templates come in. In Django, a “view” describes which data is presented, but a view normally delegates to a template, which describes how the data is presented.

Where does the “controller” fit in, then? In Django’s case, it’s probably the framework itself: the machinery that sends a request to the appropriate view, according to the Django URL configuration.

That does not mean that you have to use the same parameter passing conventions as an HTTP request. If you do that, you’re giving up on all the parameter validation and readability that Python provides.

There are native ways to pass arguments to method calls. Don’t use JSON format or HttpRequest to pass parameter values to a function that is not an URL handler. Yes, I'm repeating a sentence from the previous point here because it is very important to keep this in mind.

Avoid the temptation to create über-powerful handler() or update() methods/objects that can do everything inside a single entry point in a "util" or "lib" module (with a possibly associated evil kwargs parameter list) The fact is, this single entry point will branch to a myriad places and it will be a nightmare to follow and change in the future, becoming the dreaded single point of failure that no one wants to touch even with the end of a long sitck.

Instead, move the functionality related to each data item as close to the data as possible. Which means, to the module where it is declared. These should be much smaller and easier to test and manage than the über monsther methods.

Then, use the controller to glue together all these small pieces to build a response to your clients.

kwargs is EVIL. Deeply EVIL. Root-canal-extraction-level evil.

kwargs is a Python facility designed to provide enormous flexibility in some situations. Particularly, decorators, generators and other kind of functions benefit greatly from being able to accept an arbitrary number of parameters. But It is not a general purpose facility to call methods.

The ONLY acceptable use of kwargs in normal application development, that is, outside framework code, is when the function actually can accept an arbitrary number of arguments and its actions and results are not affected by the values received in the kwargs parameter list.

In particular, the following code is NOT ACCEPTABLE:

 def blah(**kwargs):  
   if ‘destroy_world’ in kwargs:  
     do_something()  
   if ‘save_world’ in kwargs:  
     do_something_else()

See how many things are wrong with this function? Let’s see: first, the caller does have to either read the documentation you provided in the function to know what are acceptable values to send in the kwargs dictionary. Second, a small syntax error when composing the arguments for the function call can make a significant difference in what the function does. Third, anyone reading your code will have to go thru all the function body to understand what valid kwargs arguments are.

And finally, why stop there? Why don’t you define all of your methods accepting **kwargs and be done with parameter lists? Can you imagine how completely unreadable your code will become?

Seriously, each time you use kwargs in application code, a baby unicorn dies somewhere.

DRY - Don’t repeat yourself

If you’re doing it more than twice, it is worth thinking about it. Consider this code:

 def blah(self, some_dict):  
   if ‘name’ in some_dict:  
     data[‘name’] = some_dict[‘name’]  
   if ‘address’ in some_dict:  
     data[‘address’] = some_dict[‘address’]  
   ....  
   ....  
   if ‘postal_code’ in ‘some_dict’:  
     data[‘postal_code’] = some_dict[‘postal_code’]

Why not use this instead?

 def blah(self, some_dict):  
   allowed_entries = [‘name’, ‘address’, ... ‘postal_code’]   
   for entry in allowed_entries.keys()   
     if entry in some_dict:   
       data[entry] = some_dict[entry]

Or even better and surely more pythonic and satisfying:

 def blah(self, some_dict):  
   allowed_entries = [‘name’, ‘address’, ... ‘postal_code’]   
   data = { key : some_dict[key] for entry in allowed_entries if key in some_dict }

There are a lot of advantages of doing this. You can arbitrarily extend the list of things that you transfer. You can easily test this code. The code coverage report will keep giving you 100% no matter how many values you include in some_dict. The code is explicit and simple to understand.

And even better, someone reading your code will not have to go thru a page or two of if statements just to see what you’re doing.

Avoid micro optimizations

You may code this thinking you’re just writing efficient code by saving a function call:

 if a in some_dict:  
   result = some_dict[a]  
 else  
   result = some_default_value

instead of

 result = some_dict.get(a, some_default_value)

Now, go back to your console and time these two examples executing a few thousand times. Measure the difference and think how many .0001’s of a seconds you’re saving, if any. Now, go back to your app and remember the point about using the Python standard library and the Django provided functionality.

Monday, 4 May 2015

Javascript: has everyone forgot how it became what it is?

From time to time there's the occasional question from people that think that I'm really smart and ask me about advice on which programming language they should learn to ensure they have a good career ahead. Of course, I try always to answer these questions instead of focusing on the real question, which is why they think I'm so smart when in fact I am not.

And things always end up being a debate about how important is to know Javascript and how much of a future it has. Then you stumble upon debates about how good Node.js because you are using the same language on client and server, and what a great language Javascript is for writing the server side of an application.

And while I think that it is important that everyone knows Javascript, I don't think it is going to be the only programming language they are going to need. Or that they are going to work in Javascript a lot. Because Javascript is good for what it was created for, not for writing performance sensitive server code or huge code bases.

And when raising that point, it seems most people seem to forget how Javascript became what it is today.

See, in the beginning of the web, there was only one browser. It was called Mosaic, and did not had any scripting capabilities. You could not do any client side programming in web pages. Period. If you wanted your web pages to change, you had to write some server code. Usually using something called CGI and a language that was able to read/write to standard input/output. But let's not disgress.

Then came Netscape. A company where many of the authors of the original Mosaic code ended up working in. These guys forgot about their previous Mosaic code, started from scratch and created a web browser that was the seed that started the web revolution. Besides being faster and more stable than Mosaic, the Netscape browser known as Navigator had a lot of new features, some of them became crucial for the development of the world wide web as we know it today. Yes, Javascript was one of those.

So they needed a programming language. They created something with a syntax similar to Java, and even received permission from Sun Systems (owner of Java at the time) to call it Javascript. Legend says Javascript was created in 10 days, which is in itself no small feat and speaks volumes about the technical abilities of the Netscape team, most notably in this case of Brendan Eich

At that point, Javascript was a nice and welcome addition to the browser, and to your programming toolbox, because it enabled things that previously were simply not possible with a strict client-server model.

Then it all went boomy and bubbly, and later crashy. The web was the disruptive platform that ... changed everything. The server side (running Perl, Java, ASP or whatever) plus the client side executing Javascript was soon used to create sophisticated applications that replaced their desktop counterparts, but also being universally available from anywhere, instantly accessible, without requiring any client capable of running anything but a browser and a TCP/IP network stack.

Javascript provided the missing piece in the puzzle necessary for replacing applications running in desktops and laptops of the time with just a URL typed in the address bar of a web browser. Instantly available and updated, accessible from any device, anywhere. Remember, there were no mobile smartphones back then.

That of course ringed a lot of bells at Microsoft. They saw the internet and the browser as a threat to their Windows desktop monopoly and Microsoft, being the smart people they are, set out to counter that threat. The result was Internet Explorer.

Internet Explorer was Microsoft's vision of a web browser integrated in Windows. It was faster than Netscape's Navigator. It was more stable. It crashed less. It came already installed with Windows, so you did not have to download anything to start browsing the web. Regardless of the anti-monopoly lawsuits arising from how Microsoft pushed Internet Explorer in the market, the truth was the Internet Explorer was a better browser than Netscape's Navigator in almost any dimension And I say almost because I'm sure someone can remember something where Navigator was better, but I sincerely can't.

And it contained a number of technologies designed by Microsoft to regain control of their Windows desktop monopoly. Among them, the ill-fated ActiveX technology (later to become one of the greatest sources of security vulnerabilities of all times) and the VB scripting engine. That was part of Microsoft "embrace, extend, extinguish" tactic. Now, you could write your web page scripts in a Visual Basic dialect instead of Javascript.

Internet Explorer practically crushed Navigator out of the browser market, leaving it with 20% or so of their previous 99% market share. It was normal at the time for web developers to place "works best with Internet Explorer" stickers on their pages, or even directly refuse to to load a page with any other browser than Explorer and pop up a message asking you to use Explorer to view their pages. Microsoft was close to realizing their dreams of controlling the web and keeping their Windows desktop monopoly untouched.

And then came Mozilla. And then came the iPhone. Which are other stories, and very interesting by themselves, but not the point of this post...

What is interesting from an history perspective at that point is that developers were using many proprietary IE features and quirks, yet their web page scripts were still mostly written in Javascript. Not in VBScript. And VBScript faded away like ActiveX, FrontPage and other Microsoft ideas about how web pages should be created. Web developers were happily using Microsoft proprietary extensions but kept using Javascript.

Why that happened? Why embrace lots of proprietary extensions to the point of making your pages unreadable outside of a specific browser but keep your code in Javascript instead of the Microsoft's nurtured VBScript? Basically two reasons: first, there were still a significant minority of non-Internet Explorer users browsing the web, so Javascript programs worked on both browsers with little changes from one to another. Second: VBScript sucked. You may think that developers immersed in a proprietary web were choosing Javascript over VBScript because the language was superior. And it was. But this was not the case of choosing among the very best available. It was just a matter of keeping that remaining 20% happy and at the same time picking up the one of the two languages that sucked less.

Mind you, Javascript had no notion of a module or package system. No strong typing. Almost no typing at all. No notion of what a thread was. No standard way of calling libraries written in other languages.

But if after reading all these missing items you think Javascript sucks, you have to see VBScript to appreciate the difference. A language sharing all the deficiencies of Javascript, and then having more of its own. VBScript sucked more than Javascript. Javascript was the lesser of two evils.

And as of today, Javascript still has all those deficiencies. Don't think of Javascript as the language of choice for writing web page front ends. Think of it as your only choice. You don't have any other alternatives when working with web pages. Period. Javascript is not used because it is the best language, it is used because it is the only one available.

It was much later when Google created the V8 Javascript interpreter, making Javascript fast enough to be considered acceptable for anything else beyond animations and data validations. It was even later when Ryan Dahl, the creator of Node.js, had the crazy idea of running V8 on a server and have it handle incoming http requests. Node.js works very well on a very limited subset of problems, and fails completely outside those.

The corollary is: Javascript will be around for ages. You need to know it if you want to do anything at all on the client side. And know it well, together with the framework of the week if you want to do anything at all on the client side. But it will not be the language where in the future web servers are programmed in.

Phew. And all this still does not completely answer the question of which programming languages you need to know. Javascript is one of them, for sure, but not the most important or the most relevant. It is a necessary evil.

Three guys in a garage, the NIH syndrome and big projects

As the old adage says, experience is the root (or perhaps was it the mother?) of science. Nowhere near software development, it seems. For what is worth, not a week passes without another report of a disastrous software project being horrendously late, over budget, under performing or all of these at the same time.

Usually, these bad news are often about the public sector. Which are usually great news to those ideologically inclined to think that government should be doing as little as possible, even nothing, in our current society. This argument usually does not take into account that these government run projects are almost always awarded to private contractors. Apparently, these same contractors are able to deliver as expected when they the money does not come from public funds, hence the blame should sit squarely on the way government entities manage those projects not on these contractors, right?

I have bad news for these kind of arguments: it is just that these publicly funded failures are just more visible than the private ones. With various degrees of transparency, and by their own nature, these kind of projects are much more likely to be audited and reviewed than any privately funded project.

That is, for each Avon Canada or LSE failure that is made public, we have many, many more news of public sector failures such as healthcare.gov or Queensland failures. So next time you consider that argument, think if it is simply that you don't hear about private sector projects going wrong so often. Is it really because all goes well? Or is it because simply the private sector is more opaque and thus can hide its failures better?

Anyway, I'm digressing here. The point is that big projects, no matter where the funding comes from, are much more likely to fail. Brooks explained it in the TMM (mandatory reading) years ago. Complex projects means complex requirements which means bigger systems with more moving parts and interactions, which technically are challenging enough but are nothing compared to the challenges of human communication and interaction that rise exponentially as the number of people involved increases.

What is more surprising how often when one reads the post mortem of these projects there is some kind of odd technical decision that contributes to the failure. This is usually discussed in detail, with long discussion threads pondering how one solution can be better than another and inevitably pointing to the NIH syndrome. This can take the shape of using XML as a database, a home grown transaction processor or home grown database, using a document database as structured storage (or viceversa), using an unstructured language to develop an object oriented system and so son.

There is an explanation for focusing on technical vs. organizational issues when discussing these failures: technical bits are much more visible and easier to understand. Organizational, process or methodology issues, except for those involved directly in the project, are much more opaque. While technical decisions usually contribute to a project's failure, fact is that very, very complex and long projects have been successfully executed with way more rudimentary technologies in the past, so it is only logical to conclude that the the technology choices should not be that determinant in the fate of a project.

And we usually quickly forget something that we tend to apply in our own projects: the old "better the devil you know" adage. More often than not, technologies are chosen conservatively, as it is far easier to deal with their knows weaknesses than to battle with new unknowns. We cannot, of course, disregard other reasons in these choices. Sometimes there are commercial interests from vendors interested in shoehorning their technology, and these are difficult to detect. But we have to admit that sometimes the project team believed the chosen option as the best for the problem at hand. Which leads to the second point in the post: the NIH syndrome.

What someone unfamiliar with a strange choice of technology can be dismissed just as another instance of "Not invented here" syndrome. But what is NIH for someone claiming to "know best" for a specific technology area, it is perhaps the most logical decision for someone else. What looks attractive as a standalone technology for a specific use case may not look so good when integrated into a bigger solution. This is why people still choose to add text indexing plugins to a relational database instead of using a standalone unstructured text search engine, for example.

Another often cited reason for failure in these projects -and in all software projects in general- is that there are huge volumes of changes introduced once the project started. What is missing here is not some magic ingredient, but a consistent means of communication that states clearly changing anything already done is going to cost time and money. Projects building physical things -as opposed to software- seem to be able to get along with this quite well, if only because one does not have any issues explaining the effects of change on physical things. But the software world has not yet managed to create the same culture.

So now that we've reasonably concluded that technology choices are not usually the reasons for a project failing, that change is a constant that needs to be factored in and there is no way to avoid it, and that there is a strong correlation between project size and complexity, is there a way of keeping these projects to fail?

In my opinion, there is only one way: start small. Grow as it succeeds, and if it does not, just discard it. But I hear you crying, there is no way for these projects to "start small", as their up front requirements usually cover 100s of pages full of technical and legal constraints that must be met from day one. These projects don't have "small" states. They just go in a big bang and have to work as designed from day one. And that's exactly why they fail so often.

Otherwise, three guys in a garage will always outperform and deliver something superior. They are not constrained by 100s of requirements, corporate policies and processes, or financial budget projections.