Monday, 22 June 2015

Django and Python best practices


After a few weeks reading an extensive, and quite complex, Python/Django code base, I’ve realized that there are a few simple practices that can make a significant difference in how effectively and quickly one can pick up an application. Not having been an intensive Python user up to now, I was expecting to catch up on the code with more or less the same level of effort it takes me to grasp a piece of C, SQL, or Java.

But it hasn’t happened as quickly as I expected.

I’ve found myself tracing the application with a debugger, not looking for bugs, but trying to understand what it does. In my mind, this is an admission of defeat: I can’t understand the code by reading it, I have to watch it in motion to be sure that my mental image of what the code does matches what the code actually does.

Debugging is the task of finding out why your mental image of what the code should do does not match what it actually does. Not the opposite. When you don’t know what some code does, you should be able to find out by reading it.

And I’ve realized why it was taking so much time. Python is so powerful and expressive that it has its own shoot-yourself-in-the-foot factor that can be, with a big enough code base, as dangerous as the C, SQL or Java shoot-yourself-in-the-foot pitfalls.

So I’ve put together this short guide with the list of things I’d want to see in code that I’ve not written myself. Which is really the list of things I want to keep an eye on when I write Python code in the future.

So this being a case of either my code reading abilities being weak or the code not being well written, I of course prefer to blame the code. Not the people who wrote it, of course: I have real world examples of each of these entries, but the point is not to shoot blame around, but rather to make the code more readable, shareable, and in the future, easier for newcomers to understand without resorting to tracing it with a debugger.

Don’t fight or otherwise reinvent Django or the standard library


Django provides built-in functionality to validate your data. To enable referential integrity. To deliver web pages. To do a lot of things. Django has been around for years, and improved over and over by a lot of people. So before implementing anything, think twice. Look around in the Django docs and check if there’s something already built in to do that.

In particular, use Django forms to validate data. Use Django validators in models. Use Django ForeignKey, use cleaned_data to actually … clean data. Don’t reinvent the wheel if there’s a perfectly good, already debugged and reusable, wheel available.


Use the Python standard comment syntax


Document every single parameter your function accepts. If your function has side effects, document them. If your function throws exceptions, state so in the documentation.
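As a rough illustration of what such a docstring could look like, here is a minimal sketch. The function name transfer_funds, its parameters and the in-memory ledger are hypothetical, made up only for this example, and the :param / :raises field style is just one common convention, not the only acceptable one.

```python
def transfer_funds(accounts, source, target, amount=0):
    """Move money between two entries of an in-memory ledger.

    :param accounts: dict mapping account name to balance; modified in place.
    :param source: name of the account to debit.
    :param target: name of the account to credit.
    :param amount: non-negative amount to move; defaults to 0.
    :returns: the new balance of ``source``.
    :raises KeyError: if either account name is unknown.
    :raises ValueError: if ``amount`` is negative or exceeds the source balance.

    Side effects: updates both balances in ``accounts``.
    """
    # Hypothetical example: every check below is documented in the docstring.
    if amount < 0:
        raise ValueError("amount must be non-negative")
    if source not in accounts or target not in accounts:
        raise KeyError("unknown account")
    if accounts[source] < amount:
        raise ValueError("insufficient funds")
    accounts[source] -= amount
    accounts[target] += amount
    return accounts[source]
```

A reader can tell from the docstring alone what the function accepts, what it changes and how it can fail, without opening the body.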

The only acceptable exception to this rule is for methods that override or implement an existing Django convention. Documenting those would be an unnecessary restatement of what is already said. Except, of course, if you override a standard framework method and add some special contract.

It is much, much worse to have misleading documentation than no documentation at all because it creates cognitive dissonance. If you’re changing a method and not updating the documentation, you’re just confusing future readers, who will discover sooner or later that the documentation does not match what the code does, and they’ll throw away the documentation anyway and throw a few expletives at you or your family. So it is better to throw away the documentation than to keep it obsolete, before someone else loses time unnecessarily discovering that it is outdated.

Use Python parameter passing to… actually pass parameters


Python is famous for its readability, and its duck typing prevents a lot of mistakes. Named parameters and default values are a convenient way to plainly state what a method does. Method signatures can also be read by your IDE and used in code completion.

There are native ways to pass arguments to method calls. Don’t use JSON or HttpRequest to pass parameter values to a function that is not a URL handler. Period.

See the point on kwargs for more details.
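To make the point about named parameters and defaults concrete, here is a small sketch. The function create_user and its parameters are hypothetical, invented for illustration only. It uses the standard inspect module to show that the signature is something tools can read, which is roughly what an IDE relies on for code completion.

```python
import inspect


def create_user(name, email, is_admin=False, language="en"):
    """Build a user record from explicit, named arguments (hypothetical example)."""
    return {
        "name": name,
        "email": email,
        "is_admin": is_admin,
        "language": language,
    }


# The call site states plainly which optional value is being overridden.
user = create_user("Ada", "ada@example.com", language="es")

# The signature itself is available to tools, no body reading required.
signature_text = str(inspect.signature(create_user))
```

Had the same values been packed into a JSON string or a request-like object, neither the defaults nor the accepted names would be visible from the signature.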

Be explicit. Be defensive


When you consider augmenting the signature of a function by adding more parameters, just add them and provide sensible defaults.

You may think that you’re just making your code future-proof by using variable parameter lists. You are wrong. You’re just making it more confusing and difficult to follow.

Consider this code

 class base(object):
     def blah(self, param1, **kwargs):
         ...

 class derived(base):
     def blah(self, param1, **kwargs):
         ...

Don’t do it. If when you create it, blah does not need more than 1 argument, declare it like so. Leaving **kwargs forces the reader of the code to go thru the whole function body to verify if you’re actually using it. Don’t worry about future proofing your code, any half decent IDE will tell you which methods could have issues by your change in much less time than you can think about it. So just declare this:

 class base(object):
     def blah(self, param1, param2=None):
         ...

And future readers of your code will be able to tell what your function accepts.

MVC: the whole application is NOT a web page


According to Django’s own site:
In our interpretation of MVC, the “view” describes the data that gets presented to the user. It’s not necessarily how the data looks, but which data is presented. The view describes which data you see, not how you see it. It’s a subtle distinction. 
So, in our case, a “view” is the Python callback function for a particular URL, because that callback function describes which data is presented.
Furthermore, it’s sensible to separate content from presentation – which is where templates come in. In Django, a “view” describes which data is presented, but a view normally delegates to a template, which describes how the data is presented.

Where does the “controller” fit in, then? In Django’s case, it’s probably the framework itself: the machinery that sends a request to the appropriate view, according to the Django URL configuration.
That does not mean that you have to use the same parameter passing conventions as an HTTP request. If you do that, you’re giving up on all the parameter validation and readability that Python provides.

There are native ways to pass arguments to method calls. Don’t use JSON format or HttpRequest to pass parameter values to a function that is not a URL handler. Yes, I'm repeating a sentence from the previous point here because it is very important to keep this in mind.

Avoid the temptation to create über-powerful handler() or update() methods/objects that can do everything inside a single entry point in a "util" or "lib" module (with a possibly associated evil kwargs parameter list). The fact is, this single entry point will branch to a myriad places and it will be a nightmare to follow and change in the future, becoming the dreaded single point of failure that no one wants to touch even with the end of a long stick.

Instead, move the functionality related to each data item as close to the data as possible. Which means, to the module where it is declared. These pieces should be much smaller and easier to test and manage than the über monster methods.

Then, use the controller to glue together all these small pieces to build a response to your clients.
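A minimal sketch of that idea in plain Python, without any Django dependency. The Customer and Order classes and the order_summary function are hypothetical names chosen for this example. In a real Django application the first two would likely be models and the last one a view, but that mapping is an assumption, not something this sketch demonstrates.

```python
class Customer(object):
    """Owns everything that concerns customer data (hypothetical example)."""

    def __init__(self, name, postal_code):
        self.name = name
        self.postal_code = postal_code

    def change_postal_code(self, postal_code):
        # Validation lives next to the data it protects.
        cleaned = postal_code.strip()
        if not cleaned:
            raise ValueError("postal_code must not be empty")
        self.postal_code = cleaned


class Order(object):
    """Owns the order lines and the arithmetic on them."""

    def __init__(self, customer, lines):
        self.customer = customer
        self.lines = list(lines)  # (description, price) pairs

    def total(self):
        return sum(price for _description, price in self.lines)


def order_summary(order):
    """Glue code: combines the small pieces into one response."""
    return {
        "customer": order.customer.name,
        "postal_code": order.customer.postal_code,
        "total": order.total(),
    }
```

Each piece can be tested on its own, and the glue function stays short enough to read in one go.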

kwargs is EVIL. Deeply EVIL. Root-canal-extraction-level evil.


kwargs is a Python facility designed to provide enormous flexibility in some situations. In particular, decorators, generators and other kinds of functions benefit greatly from being able to accept an arbitrary number of parameters. But it is not a general purpose facility to call methods.

The ONLY acceptable use of kwargs in normal application development, that is, outside framework code, is when the function actually can accept an arbitrary number of arguments and its actions and results are not affected by the values received in the kwargs parameter list.

In particular, the following code is NOT ACCEPTABLE:

 def blah(**kwargs):
     if 'destroy_world' in kwargs:
         do_something()
     if 'save_world' in kwargs:
         do_something_else()

See how many things are wrong with this function? Let’s see: first, the caller has to read the documentation you provided for the function to know which values are acceptable in the kwargs dictionary. Second, a small typo when composing the arguments for the function call can make a significant difference in what the function does, and nothing will complain about it. Third, anyone reading your code will have to go thru all the function body to understand what the valid kwargs arguments are.

And finally, why stop there? Why don’t you define all of your methods accepting **kwargs and be done with parameter lists? Can you imagine how completely unreadable your code will become?

Seriously, each time you use kwargs in application code, a baby unicorn dies somewhere.
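For comparison, here is one possible explicit version of the function above. This is only a sketch under assumptions: the boolean flags stand in for the "key is present" checks, and returning a list of action names replaces the undefined do_something() and do_something_else() so that the example runs on its own.

```python
def blah(destroy_world=False, save_world=False):
    """Explicit variant: the accepted options are visible in the signature."""
    actions = []
    if destroy_world:
        actions.append("destroy_world")  # stand-in for do_something()
    if save_world:
        actions.append("save_world")  # stand-in for do_something_else()
    return actions


# A misspelled option is rejected by Python itself instead of being ignored.
try:
    blah(save_wrold=True)
except TypeError as error:
    typo_message = str(error)
```

With **kwargs the misspelled call would have run silently and done nothing.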

DRY - Don’t repeat yourself


If you’re doing it more than twice, it is worth thinking about it. Consider this code:

 def blah(self, some_dict):
     if 'name' in some_dict:
         data['name'] = some_dict['name']
     if 'address' in some_dict:
         data['address'] = some_dict['address']
     # ....
     # ....
     if 'postal_code' in some_dict:
         data['postal_code'] = some_dict['postal_code']

Why not use this instead?

 def blah(self, some_dict):
     allowed_entries = ['name', 'address', 'postal_code']  # ... and the rest
     for entry in allowed_entries:
         if entry in some_dict:
             data[entry] = some_dict[entry]

Or even better and surely more pythonic and satisfying:

 def blah(self, some_dict):
     allowed_entries = ['name', 'address', 'postal_code']  # ... and the rest
     data = {key: some_dict[key] for key in allowed_entries if key in some_dict}

There are a lot of advantages of doing this. You can arbitrarily extend the list of things that you transfer. You can easily test this code. The code coverage report will keep giving you 100% no matter how many values you include in some_dict. The code is explicit and simple to understand.

And even better, someone reading your code will not have to go thru a page or two of if statements just to see what you’re doing.

Avoid micro optimizations


You may code this thinking you’re just writing efficient code by saving a function call:

 if a in some_dict:
     result = some_dict[a]
 else:
     result = some_default_value

instead of

 result = some_dict.get(a, some_default_value)  

Now, go back to your console and time these two examples executing a few thousand times. Measure the difference and think how many .0001’s of a second you’re saving, if any. Now, go back to your app and remember the point about using the Python standard library and the Django provided functionality.
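One way to run that measurement is with the standard timeit module. The sketch below is only an illustration: the sample dictionary, the function names and the repetition count are arbitrary choices, and the numbers you get will depend on your machine and Python version.

```python
import timeit

some_dict = {"a": 1}
some_default_value = 0


def with_branch(key):
    # The "hand optimized" version with an explicit membership test.
    if key in some_dict:
        result = some_dict[key]
    else:
        result = some_default_value
    return result


def with_get(key):
    # The standard library version.
    return some_dict.get(key, some_default_value)


def compare(number=10000):
    """Return the total seconds each variant takes for `number` calls."""
    return {
        "branch": timeit.timeit(lambda: with_branch("a"), number=number),
        "get": timeit.timeit(lambda: with_get("a"), number=number),
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print("%s: %.6f s" % (name, seconds))
```

Whatever the result on your machine, divide it by the number of calls before deciding whether the longer version is worth the extra lines.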

Monday, 4 May 2015

Javascript: has everyone forgot how it became what it is?

From time to time there's the occasional question from people who think that I'm really smart and ask me for advice on which programming language they should learn to ensure they have a good career ahead. Of course, I always try to answer these questions instead of focusing on the real question, which is why they think I'm so smart when in fact I am not.
 
And things always end up being a debate about how important it is to know Javascript and how much of a future it has. Then you stumble upon debates about how good Node.js is because you are using the same language on client and server, and what a great language Javascript is for writing the server side of an application.

And while I think that it is important that everyone knows Javascript, I don't think it is going to be the only programming language they are going to need. Or that they are going to work in Javascript a lot. Because Javascript is good for what it was created for, not for writing performance sensitive server code or huge code bases.

And when raising that point, most people seem to forget how Javascript became what it is today.

See, in the beginning of the web, there was only one browser. It was called Mosaic, and did not have any scripting capabilities. You could not do any client side programming in web pages. Period. If you wanted your web pages to change, you had to write some server code. Usually using something called CGI and a language that was able to read/write to standard input/output. But let's not digress.

Then came Netscape. A company where many of the authors of the original Mosaic code ended up working. These guys forgot about their previous Mosaic code, started from scratch and created a web browser that was the seed that started the web revolution. Besides being faster and more stable than Mosaic, the Netscape browser known as Navigator had a lot of new features, some of which became crucial for the development of the world wide web as we know it today. Yes, Javascript was one of those.

So they needed a programming language. They created something with a syntax similar to Java, and even received permission from Sun Microsystems (owner of Java at the time) to call it Javascript. Legend says Javascript was created in 10 days, which is in itself no small feat and speaks volumes about the technical abilities of the Netscape team, most notably in this case those of Brendan Eich.

At that point, Javascript was a nice and welcome addition to the browser, and to your programming toolbox, because it enabled things that previously were simply not possible with a strict client-server model.

Then it all went boomy and bubbly, and later crashy. The web was the disruptive platform that ... changed everything. The server side (running Perl, Java, ASP or whatever) plus the client side executing Javascript was soon used to create sophisticated applications that replaced their desktop counterparts, while also being universally available from anywhere, instantly accessible, and not requiring the client to run anything but a browser and a TCP/IP network stack.

Javascript provided the missing piece in the puzzle necessary for replacing applications running in desktops and laptops of the time with just a URL typed in the address bar of a web browser. Instantly available and updated, accessible from any device, anywhere. Remember, there were no smartphones back then.

That of course rang a lot of bells at Microsoft. They saw the internet and the browser as a threat to their Windows desktop monopoly and Microsoft, being the smart people they are, set out to counter that threat. The result was Internet Explorer.

Internet Explorer was Microsoft's vision of a web browser integrated in Windows. It was faster than Netscape's Navigator. It was more stable. It crashed less. It came already installed with Windows, so you did not have to download anything to start browsing the web. Regardless of the anti-monopoly lawsuits arising from how Microsoft pushed Internet Explorer in the market, the truth was that Internet Explorer was a better browser than Netscape's Navigator in almost any dimension. And I say almost because I'm sure someone can remember something where Navigator was better, but I sincerely can't.

And it contained a number of technologies designed by Microsoft to regain control of their Windows desktop monopoly. Among them, the ill-fated ActiveX technology (later to become one of the greatest sources of security vulnerabilities of all time) and the VB scripting engine. That was part of Microsoft's "embrace, extend, extinguish" tactic. Now, you could write your web page scripts in a Visual Basic dialect instead of Javascript.

Internet Explorer practically crushed Navigator out of the browser market, leaving it with 20% or so of its previous 99% market share. It was normal at the time for web developers to place "works best with Internet Explorer" stickers on their pages, or even directly refuse to load a page with any other browser than Explorer and pop up a message asking you to use Explorer to view their pages. Microsoft was close to realizing their dreams of controlling the web and keeping their Windows desktop monopoly untouched.

And then came Mozilla. And then came the iPhone. Which are other stories, and very interesting by themselves, but not the point of this post...

What is interesting from a historical perspective at that point is that developers were using many proprietary IE features and quirks, yet their web page scripts were still mostly written in Javascript. Not in VBScript. And VBScript faded away like ActiveX, FrontPage and other Microsoft ideas about how web pages should be created. Web developers were happily using Microsoft proprietary extensions but kept using Javascript.

Why did that happen? Why embrace lots of proprietary extensions to the point of making your pages unreadable outside of a specific browser but keep your code in Javascript instead of Microsoft's nurtured VBScript? Basically two reasons: first, there was still a significant minority of non-Internet Explorer users browsing the web, and Javascript programs worked on both browsers with few changes from one to another. Second: VBScript sucked. You may think that developers immersed in a proprietary web were choosing Javascript over VBScript because the language was superior. And it was. But this was not a case of choosing the very best available. It was just a matter of keeping that remaining 20% happy and at the same time picking the one of the two languages that sucked less.

Mind you, Javascript had no notion of a module or package system. No strong typing. Almost no typing at all. No notion of what a thread was. No standard way of calling libraries written in other languages.

But if after reading all these missing items you think Javascript sucks, you have to see VBScript to appreciate the difference. A language sharing all the deficiencies of Javascript, and then having more of its own. VBScript sucked more than Javascript. Javascript was the lesser of two evils.

And as of today, Javascript still has all those deficiencies. Don't think of Javascript as the language of choice for writing web page front ends. Think of it as your only choice. You don't have any other alternatives when working with web pages. Period. Javascript is not used because it is the best language, it is used because it is the only one available.

It was much later when Google created the V8 Javascript engine, making Javascript fast enough to be considered acceptable for anything beyond animations and data validations. It was even later when Ryan Dahl, the creator of Node.js, had the crazy idea of running V8 on a server and having it handle incoming HTTP requests. Node.js works very well on a very limited subset of problems, and fails completely outside those.

The corollary is: Javascript will be around for ages. You need to know it, and know it well, together with the framework of the week, if you want to do anything at all on the client side. But it will not be the language in which web servers are programmed in the future.

Phew. And all this still does not completely answer the question of which programming languages you need to know. Javascript is one of them, for sure, but not the most important or the most relevant. It is a necessary evil.

Three guys in a garage, the NIH syndrome and big projects

As the old adage says, experience is the root (or perhaps was it the mother?) of science. Not in software development, it seems. For what it's worth, not a week passes without another report of a disastrous software project being horrendously late, over budget, underperforming or all of these at the same time.

This bad news is usually about the public sector, which is great news for those ideologically inclined to think that government should be doing as little as possible, even nothing, in our current society. This argument usually does not take into account that these government-run projects are almost always awarded to private contractors. Apparently, these same contractors are able to deliver as expected when the money does not come from public funds, hence the blame should sit squarely on the way government entities manage those projects, not on these contractors, right?

I have bad news for this kind of argument: these publicly funded failures are simply more visible than the private ones. With various degrees of transparency, and by their own nature, these kinds of projects are much more likely to be audited and reviewed than any privately funded project.

That is, for each Avon Canada or LSE failure that is made public, we have many, many more reports of public sector failures such as the healthcare.gov or Queensland ones. So next time you consider that argument, think if it is simply that you don't hear about private sector projects going wrong so often. Is it really because all goes well? Or is it simply because the private sector is more opaque and thus can hide its failures better?

Anyway, I'm digressing here. The point is that big projects, no matter where the funding comes from, are much more likely to fail. Brooks explained it in The Mythical Man-Month (mandatory reading) years ago. Complex projects mean complex requirements, which mean bigger systems with more moving parts and interactions. These are technically challenging enough, but are nothing compared to the challenges of human communication and interaction that rise exponentially as the number of people involved increases.

What is more surprising is how often, when one reads the post mortem of these projects, there is some kind of odd technical decision that contributes to the failure. This is usually discussed in detail, with long discussion threads pondering how one solution can be better than another and inevitably pointing to the NIH syndrome. This can take the shape of using XML as a database, a home grown transaction processor or home grown database, using a document database as structured storage (or vice versa), using an unstructured language to develop an object oriented system and so on.

There is an explanation for focusing on technical vs. organizational issues when discussing these failures: technical bits are much more visible and easier to understand. Organizational, process or methodology issues, except for those involved directly in the project, are much more opaque. While technical decisions usually contribute to a project's failure, the fact is that very, very complex and long projects have been successfully executed with way more rudimentary technologies in the past, so it is only logical to conclude that the technology choices should not be that decisive in the fate of a project.

And we usually quickly forget something that we tend to apply in our own projects: the old "better the devil you know" adage. More often than not, technologies are chosen conservatively, as it is far easier to deal with their known weaknesses than to battle with new unknowns. We cannot, of course, disregard other reasons in these choices. Sometimes there are commercial interests from vendors interested in shoehorning their technology, and these are difficult to detect. But we have to admit that sometimes the project team believed the chosen option to be the best for the problem at hand. Which leads to the second point in the post: the NIH syndrome.

A technology choice that looks strange to someone unfamiliar with it can be dismissed as just another instance of the "Not invented here" syndrome. But what is NIH for someone claiming to "know best" in a specific technology area is perhaps the most logical decision for someone else. What looks attractive as a standalone technology for a specific use case may not look so good when integrated into a bigger solution. This is why people still choose to add text indexing plugins to a relational database instead of using a standalone unstructured text search engine, for example.

Another often cited reason for failure in these projects -and in all software projects in general- is that there are huge volumes of changes introduced once the project has started. What is missing here is not some magic ingredient, but a consistent means of communication that states clearly that changing anything already done is going to cost time and money. Projects building physical things -as opposed to software- seem to be able to get along with this quite well, if only because one does not have any issues explaining the effects of change on physical things. But the software world has not yet managed to create the same culture.

So now that we've reasonably concluded that technology choices are not usually the reasons for a project failing, that change is a constant that needs to be factored in and there is no way to avoid it, and that there is a strong correlation between project size and complexity and the likelihood of failure, is there a way of keeping these projects from failing?

In my opinion, there is only one way: start small. Grow as it succeeds, and if it does not, just discard it. But I hear you crying, there is no way for these projects to "start small", as their up front requirements usually cover 100s of pages full of technical and legal constraints that must be met from day one. These projects don't have "small" states. They just go in a big bang and have to work as designed from day one. And that's exactly why they fail so often.

Otherwise, three guys in a garage will always outperform and deliver something superior. They are not constrained by 100s of requirements, corporate policies and processes, or financial budget projections.

Sunday, 2 November 2014

Accidental complexity, Excel and shadow IT

You probably have felt like this. You've devoted some intense time to solve a very complex and difficult problem. In the process, you've researched the field, made attempts to solve it in a few different ways. You've come across and tested some frameworks as a means of getting close to the solution without having to do it all by yourself.

You've discarded some of these frameworks and kept others. Cursed and blessed the framework's documentation, Google and StackOverflow all at the same time. You have made a few interesting discoveries along the way, as always, the more interesting ones coming from your failures.

You've learned a lot and now are ready to transfer that knowledge to your customer, so you start preparing documentation and code so that it is in a deliverable state. Your tests become more robust and reach close to 100% coverage. Your documents start growing and get data from your notebook, spreadsheets and test results. Everything comes together nicely and ready to be delivered as a coherent and integrated package, something that your customer will value, appreciate and use for some time in the future (and pay for, of course).

And along the way, you've committed to your version control of choice all your false starts. All the successes. You've built a rather interesting history.

Of course, at this point your customer will not see any of the failures, except when you need to refer to them as supporting evidence for taking the approach you propose. But that's ok, because the customer is paying for your results and your time, and he's not really interested in knowing how to get these. If your customer knew how to do it in the first place, you would not be there doing anything, after all.

But it is exactly in these final stages where the topic of accidental versus essential complexity rears its head. You've spent some time solving a problem, solved it and yet you still need to wrap up the deliverables so that they are easily consumed by your customer(s). This can take many different shapes, from a code library to a patch set or a set of documents stating basic guidelines and best practices. Or all of them.

The moment you solved the problem you mastered the essential complexity, yet you have not even started mastering the accidental complexity. And that still takes some time. And depending on the project, the problem, and the people and organization(s) involved this can take much more time and cost than solving the problem itself.

Which nicely ties to the "Excel" part of the post. At this point you're likely asking yourself what exactly Excel has to do with accidental complexity.

The answer is: Excel has nearly zero accidental complexity. Start Excel, write some formulas, some headers, add some format, perhaps write a few data export/import VBA scripts, do a few test runs and you can proudly claim you're done. Another problem solved, with close to 100% of your time devoted to the essential complexity of the problem. You did not write any tests. You did not use any kind of source control. You did not analyze code coverage. You did not document why cell D6 has a formula like =COUNTIF(B:B;"="&A4&TEXT(A6)). You did not document how many DLLs, add-ons or JDBC connections your spreadsheet needs. You did not care about someone using it on a different language or culture where dates are expressed differently. Yet all of it works. Now. Today. With your data.

That is zero accidental complexity. Yes, it has its drawbacks, but it works. This kind of solution is usually what is described as "shadow IT", and hardly a day passes without you coming across one of them.

What I've found empirically is that the amount of shadow IT in an organization remains roughly proportional to the size of the organization. You may assume that larger organizations should be more mature, and by virtue of that maturity would have eliminated or reduced shadow IT. Not true. And that is because the bigger the organization, the bigger the accidental complexity is.

If you look at the many layers of accidental complexity, you'll find some of them common to organizations of all sizes, mostly at the technical level: your compiler is accidental complexity. Test Driven Development is accidental complexity. Ant, make, cmake and maven are all accidental complexity. Your IDE is accidental complexity. Version control is accidental complexity.

But then there's organizational accidental complexity. Whereas in a small business you'll likely have to talk to very few individuals in order to roll out your system or change, the larger the organization the thicker the layers of control are going to be. So you'll have to have your thing reviewed by some architect. Some coding standards may apply. You may have to use some standard programming language, IDE and/or framework, perhaps particularly unsuited to the problem you are solving. Then you'll have to go thru change control, and then... hell may freeze before you overcome the accidental complexity, and that means more time and more cost.

So at some point, the cost of the accidental complexity is way higher than the cost of the essential complexity. That is when you fire up Excel/Access and start doing shadow IT.

Monday, 18 August 2014

The code garage - What to do with old code?

From time to time I have to clean up my hard disk. No matter how big my partitions are, or how big the hard disk is, there always comes a point where I start to be dangerously close to running out of disk space.

It is in these moments when you find that you forgot to delete the WAV tracks of that CD you ripped. That you don't need to have duplicate copies of everything you may want to use from both Windows and Linux because you can keep these in an NTFS partition and Linux will be happy to use them without prejudice.

And it is in these moments when I realise how much code I've abandoned over the years. Mainly in exploratory endeavours, I've sometimes written what in retrospect seem to be substantial amounts of code.

Just looking at abandoned Eclipse and Netbeans folders I find unfinished projects from many years ago. Sometimes I recognise them instantly, and always wonder at how subjective the perception of time is: in my mind that code is fairly fresh, but then looking at the timestamps I realize that I wrote that code seven years ago. Sometimes I wonder why I even thought that the idea was worth even trying at the time.

Yet here they are: a JPEG image decoder written in pure Java whose performance is only about 20% slower than a native C implementation. A colour space based image search algorithm complete with a web front end and a back end for analysis. A Python arbitrage engine that can scrape websites and alert on price differences, applying Levenshtein comparisons across item descriptions. Enhancements to a remote control Android app that is able to drive a Lego Mindstorms vehicle over Bluetooth. That amalgamation of scripts that read EDI messages and extract key data from them. Something like seven different scripts to deal with different media formats, one for each camera that I've owned. And many more assorted pieces of code.

The question is, what should I do with this code? I'm afraid of open sourcing it, not because of patents or lawyers but because its quality is diverse: from slightly above alpha stage to close to rock solid. Some has test cases, some does not. In short, I don't feel it is production quality.

And I can't evade the thought that everything one writes starts in that state: we tend to judge the final product and tend to think that it was conceived in that pristine shape and form from the beginning. I know that's simply false: just look at the version control history of any open source project. But I want to have that smooth finish, clean formatting, impeccable documentation and fully automated build, test and deploy scripts from day one.

Yet some of this could be potentially useful to someone, even to me at some time in the future. So it is a shame to throw it away. So it always ends up surviving the disk cleanup. And I'll see it again in a few years and ask myself the same question... why not have the equivalent of a code garage? Some place where you could throw all the stuff you no longer use, or don't think you are going to use again, and leave it there so anyone passing by can take a look and take the pieces if he/she is interested in them?

Monday, 14 April 2014

Heartbleed: the root cause

I can't resist commenting on this, because Heartbleed is the subject of countless debates in forums. In case you've been enjoying your privately owned tropical island for the past week or so, Heartbleed is the name given to a bug discovered in the OpenSSL package. OpenSSL is an Open Source package that implements the SSL protocol, and is used across many, many products and sites to encrypt communications between two endpoints across insecure channels (and anything connected through the internet is by definition insecure).

The so-called Heartbleed bug accidentally discloses part of the server memory contents, and thus can leak information that is not intended to be known by anyone else but the OpenSSL server. Private keys, passwords, anything stored in a memory region close to the one involved in the bug can potentially be transmitted back to an attacker.
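To make the mechanism concrete, here is a toy Python simulation of this class of bug. It is not OpenSSL code: the memory layout, the function names and the secret are all invented for illustration. The only point it shows is a handler that trusts the length claimed by the client instead of the length of the payload it actually received.

```python
# Toy model of a Heartbleed-style over-read. Nothing here is OpenSSL code;
# SECRET, build_memory and the handle_heartbeat_* functions are hypothetical.

SECRET = b"private-key-material"


def build_memory(payload: bytes) -> bytearray:
    """Pretend process memory: the request payload sits right before a secret."""
    return bytearray(payload + SECRET)


def handle_heartbeat_vulnerable(payload: bytes, claimed_length: int) -> bytes:
    """Echo back claimed_length bytes, trusting the client-supplied length."""
    memory = build_memory(payload)
    # Bug: no check that claimed_length <= len(payload).
    return bytes(memory[:claimed_length])


def handle_heartbeat_fixed(payload: bytes, claimed_length: int) -> bytes:
    """Same echo, but drop requests whose claimed length exceeds the payload."""
    if claimed_length > len(payload):
        return b""  # malformed request: answer with nothing
    memory = build_memory(payload)
    return bytes(memory[:claimed_length])


if __name__ == "__main__":
    # The client sends 4 bytes but claims to have sent 24.
    print(handle_heartbeat_vulnerable(b"ping", 24))
    print(handle_heartbeat_fixed(b"ping", 24))
```

In the vulnerable version the reply contains the four payload bytes followed by whatever happened to be stored next to them; the fixed version differs by a single comparison.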

This is serious. Dead serious. Hundreds of millions of affected machines serious. Thousands of millions of password resets serious. Hundreds of thousands of SSL certificates renewed serious. Many, many man years of work serious. Patching and fixing this is going to cost real money, not to mention the undisclosed and potential damage arising from the use of the leaked information.

Yet the bug can be reproduced in nine lines of code. That's all it takes to compromise a system.

Yet with all its dire consequences, the worst part of Heartbleed for me is what we're NOT learning from it. Here are a few of the wrong lessons that interested parties extract:
  • Security "experts": this is why you need security "experts", because you can never be safe and you need their "expertise" to mitigate this, to prevent such simple mistakes from surfacing, to audit everything right and left, and to write security and risk assessment statements.
  • Programmers: this Heartbleed bug happened because the programmer was not using memory allocator X, or framework Y, or programming language Z. Yes, all these could have prevented this mistake, yet none of them were used, or could be retrofitted easily into the existing codebase.
  • Open Source opponents: this is what you get when you trust the Open Source mantra "given enough eyeballs, all bugs are shallow". Because in this case a severe bug was introduced without anyone realizing it, hence you can't trust Open Source code.
All these arguments are superficially coherent, yet they are at best wrong but well intentioned and at worst simply lies.

In the well intentioned area we have the "Programmers" perspective. Yes, there are more secure frameworks and languages, yet no programmer in his right mind would want to rewrite something of this level of complexity without at least a sizeable test case baseline to verify it. Where's that test case baseline? Who has to write it? Some programmer out there, I guess, yet no one seems to have bothered with it in the decade or so that OpenSSL has been around. So these suggestions are similar to saying that you will not be involved in a car crash if you rebuild all roads so that they are safer. Not realistic.

Then we have the interested liars. Security "experts" were not seen anywhere during the two years that the bug has existed. None of them analyzed the code, assuming of course that they were qualified to even start understanding it. None of them had a clue that OpenSSL had a bug. Yet they descend like vultures on a carcass on this and other security incidents to demonstrate how necessary they are. Which in a way is true: they were necessary much earlier, when the bug was introduced. OpenSSL being open source means anyone at any time could have "audited" the code, highlighted all the flaws -of which there could be more of this kind- and raised all the alerts. None did that. Really, 99% of these "experts" are not qualified to do such a thing. All bugs are trivial when exposed, yet to expose them one needs code reading skills, test development skills and theoretical knowledge. Which is something not everyone has.

And finally, in the deep end of the lies area, we have the Open Source opponents' perspective. Look at how this Open Source thing is all about a bunch of amateurs pretending that they can create professional level components that can be used by the industry in general. Because you know, commercial software is rigorously tested and has the backing and support of commercial entities whose best interest is to deliver a product that works as expected.

And that is the most dangerous lie of all. Well intentioned programmers can propose unrealistic solutions, and the "security" experts can leech off the IT industry a bit more, but that creates at best inconvenience and at worst a false sense of security. But assuming that these kinds of problems will disappear by using commercial software puts everyone in danger.

First, because all kinds of software have security flaws. Ever heard of Patch Tuesday? Second, because when there is no source code, there is no way of auditing anything and you rely on trusting the vendor. And third, because the biggest OpenSSL users are precisely commercial entities.

However, as easy as it is to say after the fact, it remains true that there are ways of preventing future Heartbleed-class disasters: more testing, more tooling and more auditing could have prevented this. And do you know what the prerequisite for all these things is? Resources. Currently the core OpenSSL team consists of ... two individuals. Neither of whom is paid directly for development of OpenSSL. So the real root cause of Heartbleed is lack of money, because there could be a lot more people auditing and crash proofing OpenSSL, if only they were paid to do it.

But ironically, it seems that there is plenty of money among some OpenSSL users, whose business relies heavily on a tool that allows them to communicate securely over the Internet. Looking at it from this perspective, Heartbleed could have been prevented if any of the commercial entities using OpenSSL had invested some resources in auditing or improving OpenSSL instead of just profiting from it.

So the real root cause of Heartbleed lies in these entities taking without giving back. And when you look at the list, boy, how they could have given back to OpenSSL. A lot. Akamai, Google, Yahoo, Dropbox, Cisco or Juniper, to name a few, have been using OpenSSL for years, benefiting from the package yet not giving back to the community some of what they got. So think twice before basing part of your commercial success on unpaid volunteer effort, because you may not have to pay for it at the beginning, but later on it could bite you. A few hundred million bites. And don't think that by keeping the source code secret you're doing any better, because in fact you're doing much worse.

Monday, 26 August 2013

What is wrong with security: "don't use bcrypt"

You know, security is lately one of my biggest sources of irritation. More so when I read articles like this one. On the surface, the article is well written, even informative. But it also shows off most of what is currently wrong with computer security.

Security, like most other areas of the IT world, is an area of specialization. If you look around, you'll see that we have database, operating system, embedded system, storage and network experts. While it is true that the job role with the best future prospects is the generalist who can also understand and even learn deeply any subject, it is also true that after a few years of working focused on a specific subject, there is a general tendency to develop deeper knowledge in some subjects than in others.

Security is no different in that regard, but has one important difference from all the others: what it ultimately delivers is the absence of something that is not even known. While the rest of the functions have more or less clearly defined goals in any project or organization, security can only provide as proof of effectiveness the lack -or a reduction- of security incidents over time. The problem is, while incidents in other areas of computing are always defined by "something that is not behaving as it should", in security an incident is "something that we did not even know could happen is actually happening".

Instead of focusing on what they don't know, the bad security people focus on what they know. They know what has been used so far to exploit an application or OS, so off they go with their vulnerability and antivirus scanners and willingly tell you if your system is vulnerable or not. Something that you can easily do yourself, using the exact same tools. But it is not often that you hear from them an analysis of why a piece of code is vulnerable, or what the risky practices you should avoid are. Or how the vulnerability was discovered.

And that is part of the problem. Another part of the problem is their seeming lack of any consideration for the environment. In a similar way to the "architecture astronauts", the security people seem to live in a different world. One where there is no actual cost-benefit analysis of anything and you only have a list of known vulnerabilities to deal with, and at best a list of "best practices" to follow. Such as "don't use bcrypt".

And finally, security guys are often unable to communicate in a way that is meaningful to their target audience. Outside a few specialists, most people in the IT field (me included) lack the math skills required to understand the subtle points of encryption, much less the results of the years of analysis and heavy number theory required to even attempt to efficiently crack encryption.

Ironically, the article gets some of these points right. At the beginning of the article, there is an estimate of cracking cost vs. derivation method that should help the reader make an informed decision. There is advice about the bcrypt alternatives and how they stack up against each other.

But as I read further into the article, it seems to fall into all these security traps: for example, the title says "don't use bcrypt", only to say in its first paragraph "If you're already using bcrypt, relax, you're fine, probably". Hold on, what was the point of the article then? And if you try to read the article comments, my guess is that unless you're very strong on crypto, you'll not fully understand half of them and will come away confused and even more disoriented.

But what best summarizes what is wrong with security is the second paragraph: "I write this post because I've noticed a sort of "JUST USE BCRYPT" cargo cult (thanks Coda Hale!) This is absolutely the wrong attitude to have about cryptography".

How is detailing the reasons for using bcrypt a wrong attitude about cryptography? The original article is a good description of the tradeoffs of bcrypt against other methods. That is not cargo cult. Not at least in the same way as "just use a SQL database", "just use a NoSQL database", "just use Windows" or "just use Linux" are cargo cult statements. Those statements are cargo cult only when taken out of context. Like the DBA that indexes each and every field in a table in the hope that sacrificing his disk space, memory and CPU to the cargo cult church will speed things up.

But the original article was not cargo cult. Not more than the "don't use bcrypt" article is cargo cult.


I guess that what I'm trying to say is that there are "bad" and "good" security. The "bad" security will tell you all about what is wrong with something and that you should fix all this immediately. The good security should tell you not only what is vulnerable, but also how to avoid creating vulnerabilities in the future. And provide you ready made and usable tools for the job. And articles like "don't use bcrypt" are frustrating in that they give almost what you need, but in a confusing and contradictory way.

When I choose a database, or operating system, or programming language, or whatever tool to do some job, I do it having only a superficial knowledge of the trade offs of each option. But I don't have to be a specialist in any of these to decide. I don't know the nuts and bolts of round robin vs. priority based scheduling, or how O(1) task schedulers work. Or the details of B-Tree vs. hash table index implementations. Or the COW strategy for virtual memory. I know the basics and what works best in each situation, mostly out of experience and education. True, with time I will learn the details of some of these as needed. But a lot of the time software developers are really making educated best guesses. And the more complex the subject -and crypto is one of the most complex- the more difficult these decisions are.

If I want to encrypt something, I want to have an encrypt function, with the encryption method as a parameter and a brief explanation of the trade offs of each method. And make it fool proof, without any way of misusing it. Yes, someone will find a way of misusing it and it will probably be a disaster. So find ways of detecting these misuses.
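A minimal sketch of what such an interface could look like, applied to password hashing (the subject of the bcrypt discussion) and using only Python's standard `hashlib`. The function names, the method names and the cost parameters are assumptions made for illustration, not vetted recommendations. The idea is only that the caller picks a method by name and never touches salts or iteration counts.

```python
import hashlib
import hmac
import os

# Hypothetical method table: names and cost parameters are illustrative only.
METHODS = {
    # Trade off: available everywhere, but only CPU-bound.
    "pbkdf2-sha256": lambda pw, salt: hashlib.pbkdf2_hmac(
        "sha256", pw, salt, 200_000
    ),
    # Trade off: memory-hard, but needs a Python built with scrypt support.
    "scrypt": lambda pw, salt: hashlib.scrypt(pw, salt=salt, n=2**14, r=8, p=1),
}


def hash_password(password: str, method: str = "pbkdf2-sha256") -> str:
    """Return 'method$salt$digest'; the salt is generated here, not by the caller."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(METHODS)}")
    salt = os.urandom(16)
    digest = METHODS[method](password.encode("utf-8"), salt)
    return f"{method}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the digest with the stored method and salt, compare in constant time."""
    method, salt_hex, digest_hex = stored.split("$")
    digest = METHODS[method](password.encode("utf-8"), bytes.fromhex(salt_hex))
    return hmac.compare_digest(digest, bytes.fromhex(digest_hex))
```

Because the method name travels with the stored value, changing the default later does not break verification of old entries.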

So please security guys, give us tools and techniques to prevent security issues. With a balanced view of their costs and benefits. And let the rest of the world sleep safely in their ignorance of 250 years of number theory. That is your real job. Creating huge repositories of vulnerabilities and malware signatures is not good enough. That in fact does little to protect us from future threats. Give us instead the tools to prevent these in the first place. And in a way that everyone can understand them. Thank you.

Friday, 17 May 2013

IT Security: the ones following the rules are those without enough power to override them

With all the talk about IT governance, risk management, security compliance and all that terminology, it seems that most IT people ignore the realities of the environment they are working in.

As an example, let's have a corporate security department, defining security standards and imposing them on the IT organization for almost all possible situations. All in the name of keeping the company away from security incidents, yes. They dismiss all objections about usability, convenience, and even how the security standards are relevant or not to the company business.

That latter point is a pet peeve of mine. It is very easy to define security standards if you ignore everything else and just apply the highest levels of security to everyone. By doing that, nobody is ever going to come back to you and say that the security is not good enough, because you are simply applying the strongest one. However, unless your company or organization is actually a secret security agency, you're seriously restricting usability and the ability of the systems to actually help people do their jobs. But hey, that's not in my mission statement, right?

What they forget is that applying these standards implies adding overhead for the company. All these security policies not only add time and implementation cost to the company, but also create day to day friction in how people use their tools to accomplish their work.

Not surprisingly, the end result is that all these policies end up being overridden by exception. Let's see a few examples coming from real life. Or real life plus a bit of exaggeration to add some humor (note: in the following paragraphs you can replace CEO with whatever role has enough power to override a policy).
  • Everyone has to enter a 16 character password that has at least two digits and special characters, and that uses words that do not appear in the Bible. That is, until the CEO gets to type that.
  • Everyone has to use two factor authentication, until the CEO loses his/her RSA token or forgets to take it to the beach resort.
  • Nobody can relay or forward mail automatically to external accounts. Until the CEO's mailbox becomes full and Outlook does not allow him/her to respond to a critical message.
  • Nobody can connect their own devices to the office network. Until the CEO brings to the office his/her latest iPad.
  • Nobody can share passwords, until the CEO's assistant needs to update the CEO location information in the corporate executive location database. Security forbids delegation for some tasks and this is one of them.
  • Nobody can use the built in browser password store, until the CEO forgets his/her password for the GMail account that is receiving all the mail forwarded from his/her corporate account.
  • All internet access is logged, monitored and subject to blacklist filters. Until the CEO tries to download his/her son's latest Minecraft update.
  • No end user can have admin rights on his/her laptop, until the CEO tries to install the latest Outlook add-on that manages his/her important network of contacts.
  • USB drives are locked, that is, until the CEO wants to see the interesting marketing materials given away in a USB thumb drive in the last marketing agency presentation, or wants to upload some pictures of the latest executive gathering from a digital camera.
I'm sure you can relate these examples to your real world experience. Now, except for a few perfectly understandable cases of industries or sectors where security is actually essential for the operations of the company, what do you think will happen? Experience tells me that the CEO will get an exception for all these cases.

The corollary is: security policies are only applicable to people without enough power to override them. Which often means that the most likely place for a security incident to happen is in... the higher levels of the company hierarchy. Either that or you make sure the security policy does not allow exceptions. None at all, for anyone. I'm sure that would make the higher company executive levels much more interested in the actual security policies and what they mean for the company they are managing.

Monday, 15 April 2013

Record retention and proprietary data formats

My recent experience with an application upgrade left me considering the true implications of using proprietary data formats. And I have realized that they are an often overlooked topic, but with profound and significant implications that are often not addressed.

Say you live in a country where the law requires you to keep electronic records for 14 years. Do you think it is an exaggeration? Sarbanes-Oxley says auditors must keep all audit or review work papers for 5 to 7 years. You are carefully archiving and backing up all that data. You are even copying the data to fresh tapes from time to time, to avoid changes in tape technology leaving you unable to read that perfectly preserved tape -or making it very hard, or having to depend on an external service to restore it.

But I've not seen a lot of people ask themselves the question: once you restore the data, which program will you use to read it? Which operating system will that program run on? Which machine will run that operating system?

First, what is a proprietary data format? Simple, anything that is not properly documented in a way that would allow anyone with general programming skills to write a program to extract data from a file.

Note that I'm leaving patents out of the discussion here. Patents create additional difficulties when you want to deal with a data format, but do not completely lock you out of it. They merely make things more expensive, but you'll definitely be able to read your data, even if you have to deal with patent issues, which are a different discussion altogether.

Patented or not, an undocumented data format is a form of customer lock in. The most powerful there is, in fact. It means that you depend forever on the supplier of the programs that read and write that data. But the lock in does not stop here. It also means that you are tying your choices of platform, hardware, software, operating system, middleware, or anything else to whatever your supplier has decided is a dependency for reading your data.

In the last few years, virtualization has helped somewhat with the hardware part. But it still does not remove the problem completely, in that there could be custom hardware or dongles attached to the machine. Yes, it can get even worse. Copy protection schemes are an additional complication, in that they make it even more difficult for you to get at your data in the long term.

So in the end, the "data retention" and "data archiving" activities are really trying to hit a moving target, one that is very, very difficult to actually hit. Most of the plans that I've seen only focus on some specific problems, and all of them fail to deliver an end to end solution that really addresses the ability to read the legacy data in the long term.

I suppose that at this point, most of the people reading this are going back to check their data retention and archiving plans and looking for gaping holes in them. You found them? Ok, keep reading then.

A true data archiving solution has to address all the problems of the hardware and software necessary to retrieve the data over the retention period. If any of the steps is missing, the whole plan is not worth spending on. Unless of course you want your plan to be used as a means for auditors to tick the corresponding box in their checklist. It is ok for the plan to say "this only covers xxx years of retention, we need to review it in the next yyy years to make sure data is still retrievable". That is at least much better and more realistic than saying "this plan will ensure that the data can be retrieved in the following zzz years" without even considering that way before zzz years have passed the hardware and software used will become unsupported, or the software supplier could disappear with no one left able to read the proprietary data format.

There is an easy way of visualizing this. Instead of talking about the business side of record retention, think about your personal data. All your photos and videos of your relatives and loved ones, taken over the years. All the memories that they contain, they are irreplaceable and also they are something you're likely to want to access in the long term future.

Sure, photos are ok. They are on paper, or perhaps in JPG files, which are by the way very well documented. But what about video? Go and check your video camera. It is probably using some standard format, but some of them use weird combinations of audio and video codecs, with the camera manufacturer providing a disk with the codecs. What will happen when the camera manufacturer goes out of business or stops supporting that specific camera model? How will you be able to read the video files and convert them to something else? That should make you think about data retention from the right point of view. And dismiss anything that is in an undocumented file format.
 

Monday, 11 February 2013

I just wanted to compile a 200 line C program on Windows

Well, 201 lines to be exact. How foolish I was.

Short story: we have a strange TIFF file. There has to be an image somehow stored there, but double clicking on it gives nothing. By the way, this file, together with a million more of them, contains the entire document archive of a company. Some seven years ago they purchased a package to archive digitized versions of all their paper documents, and have been dutifully scanning and archiving all their documents there since then. After making the effort of scanning all those documents, they archived the paper originals off site, but only organized them by year. Why pay any more attention to the paper archive after all? In the event of someone wanting a copy of an original document, the place to get it is the document archiving system. Only in extreme cases are the paper originals required, and in those cases yes, one may need a couple of hours to locate the paper original, as you have to visually scan a whole year of documents. But that is not a big deal, especially considering the time saved by not having to classify paper.

All was good during these seven years, because they used the document viewer built into the application, which works perfectly. However, now they want to upgrade the application, and for the first time in seven years they have tried to open one of these files (that have the .tif extension) with a standard file viewer. The result is that they cannot open the documents with a standard file viewer, yet the old application displays them fine. Trying many standard file viewers at best displays garbage, at worst crashes the viewer. The file is 700K in size and the app displays it perfectly, so what exactly is in there?

Some hours of puzzling, a few hexdumps and a few wild guesses later, the truth emerges: the application is storing files with the .tif extension, but is using its own "version" of the .tif standard format. Their "version" uses perhaps the first ten pages of the .tif standard and then goes its own way. The reasons for doing this could be many, however I always try to keep in mind that wise statement: "never attribute to malice what can be adequately explained by incompetence".
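The very first check one makes on such a hexdump can be sketched in a few lines of Python. This is not the conversion program mentioned in this post, only an illustration of reading the 8-byte header that the TIFF specification defines: a byte-order mark (`II` or `MM`), the number 42, and the offset of the first image file directory. The function name is made up for the example.

```python
import struct


def read_tiff_header(data: bytes):
    """Return (byte_order, first_ifd_offset), or raise ValueError if not TIFF."""
    if len(data) < 8:
        raise ValueError("too short to hold a TIFF header")
    order = data[:2]
    if order == b"II":
        endian = "<"  # little-endian ("Intel")
    elif order == b"MM":
        endian = ">"  # big-endian ("Motorola")
    else:
        raise ValueError("missing II/MM byte-order mark")
    # Bytes 2-3: the magic number 42. Bytes 4-7: offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("magic number is not 42")
    return order.decode("ascii"), ifd_offset


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        print(read_tiff_header(f.read(8)))
```

A valid header only proves that the file starts like a TIFF. As in the case above, everything after it can still wander away from the standard.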

The misdeed was, however, easy to fix. A quite simple 200 line C program (including comments) was able to extract the image and convert it to a standard file format. At least on my Linux workstation.

I was very happy with the prospect of telling the good news to the business stakeholders: your data is there, you've not lost seven years of electronic document archives, it is actually quite easy and quick to convert these to a standard format and you can forget about proprietary formats after doing that. However, I then realized that they used Windows, so I had to compile the 200 line C program in Windows just to make sure everything was right.

Checking the source, I could not spot any Linux specific things in the program; it all appeared to be fairly vanilla POSIX. However, what if they are not able to compile it, or the program does something differently? This is one of the moments when you actually want to try it, if only to be absolutely sure that your customer is not going to experience another frustration after their bitter experience with their "document imaging" system, and also to learn how portable your C-fu is across OSs. Too many years of Java and PL/SQL and you get used to thinking that every line of code you write has to run unchanged anywhere else.

So I set out to compile the C source in Windows before delivering it. That's where, as almost always, the frustration began. The most popular computing platform became what it is now, among other things, by being developer friendly. Now it seems that it is on its way to becoming almost developer hostile.

First, start with your vanilla Windows OS installation that likely came with your hardware. Then remove all the nagware, crappleware, adware and the rest of things included by your friendly hardware vendor in order to increase their unit margins. Then deal with Windows registration, licensing or both. Then patch it. Then patch it again, just in case some new patches have been released between the time you started the patching and now that the patching round has finished. About four hours and a few reboots later, you likely have an up to date and stable Windows instance, ready to install your C compiler.

Still with me? In fairness, if you already have a Windows machine all of the above is already done, so let's not make much ado about that. Now we're on the interesting part, downloading and installing your C compiler. Of course, for a 200 line program you don't need a full fledged IDE. You don't need a profiler, or debugger. You need something simple, so simple that you think one of the "Express" editions of the much renowned Microsoft development tools will do. So off we go to the MS site in order to download one of these "Express" products.

So you get here and look at your options. Now, be careful, because there are two versions of VS Express 2012. There's VS Express 2012 for Windows 8 and there's VS Express 2012 for Windows Desktop, depending on whether you're targeting the Windows store or want to create... what, an executable? But I thought Windows was Windows. In fact, I can run a ten year old binary on Windows and it will still work. Oh, yes, that's true, but now MSFT seems to think that creating Windows 8 applications is so different from creating Windows Desktop applications that they have created a different Express product for each. Except for paying VS customers, who have the ability to create both kinds of applications with the same product. Express is Express and is different. And you don't complain too much, after all this is free stuff, right?

As I wanted to create a command line application, with little interest in the Windows Store, and without being sure of whether an inner circle of hell awaited if I chose one or the other, I simply chose VS Express 2010. That would surely protect me from the danger of accidentally creating a Windows Store application, or discovering that command line apps, for example, were no longer considered "Windows Desktop Applications". You may think that I was being too cautious or risk averse at this point, but really, after investing so much time in compiling a 200 line C command line utility in Windows I was not willing to lose much more time with this.

Ah, hold on, the joy did not end there. I finally downloaded VS 2010 Express and started the installation, which dutifully started and informed me that it was about to install .Net 4.0. How good that the .Net 4.0 install required a reboot, as I was starting to really miss a reboot once in a while after all the other reboots I had to do due to the patching. At least the install program was nice enough to resume installation by itself after the reboot. Anyway, 150 MB of downloads later, I had my "Express" product ready to use.

What is a real shame is that the "Express" product seems to be, once installed, actually quite good. I say "seems" because I did not play with it much. My code was in fact 100% portable, and it was a short job to discover how to create a project and compile it. Admittedly, I'm going to ship the customer an executable built with debug symbols, as I was not able to find where to turn off debug information. Since the program is 30K in size, that's hardly going to be a problem, and if it is, it's 100% my fault. To be honest, I lost interest in VS Express 2010 quickly once I was able to test the executable and verify that it did exactly the same as the Linux version.

But the point is, in comparison, I can build a quite complete Linux development environment in less than two hours, operating system installation included, incurring zero licensing cost and using hardware much cheaper than the one needed to run Windows. Why is it that to create a Windows program I need to spend so much time?

What happened to the "developers, developers, developers" mantra? Where is it today? Anyone old enough can remember the times when Microsoft gave away free stacks of floppy disks to anyone remotely interested in their Win32 SDK. And those were the days without internet and when CD-ROMs were a luxury commodity. And the days when IBM was charging $700 for their OS/2 developer kit. Guess who won the OS wars?

Things have changed, for the worse. Seriously, Microsoft needs to rethink this model if they want to at least slow their decline. Still, I guess I've discovered one pattern that can probably be applied to any future OS or platform. Today, to write iOS/MacOS programs you need to buy a Mac and pay Apple $100. The day it becomes more difficult, complex, or expensive (as if Apple hardware were cheap), that day will be the beginning of the end for Apple.

Tuesday, 5 February 2013

The results of my 2012 predictions - 3 wrong, 8 right

A bit late, but time to review what has happened with my 2012 predictions. Since the score is clearly favorable to me, please allow me the time to indulge in some self congratulation, and offer also my services as a technology trend predictor at least better than big name market analysis firms. No, not really. But nonetheless having scored so high deserves some self appraisal, at least.

The bad

Windows becoming legacy. I was wrong on this one, but only on the timing. Microsoft's latest attempt to revive the franchise is flopping on the market, to the tune of people paying for getting Windows 8 removed from computers and replaced by Windows 7. Perhaps Redmond can reverse the trend over time, perhaps Windows 9 will be the one correcting the trend. But they have already wasted a lot of credibility, and as time passes it is becoming clear that many pillars of the Windows revenue model are not sustainable in the future.

  • Selling new hardware with the OS already installed worked well for the last twenty years, but the fusion of mobile and desktop, together with Apple and Chromebooks, is already eroding that to a point where hardware manufacturers are starting to have the dominant position in the negotiation.
  • The link between the home and business market is broken. Ten years ago people were buying computers essentially with the same architecture and components for both places, except perhaps with richer multimedia capabilities at home. Nowadays people are buying tablets for home use, and use smartphones as complete replacements of things done in the past with desktops and laptops.
  • On the server side, the open source alternatives gain credibility and volume. Amazon EC2 is a key example where Windows Server, however good it is, is being sidetracked in the battle for the bottom of the margin pool.

JVM based languages. I was plain wrong on this one. I thought that the start of Java's decline would give way to JVM based alternatives, but those alternatives, while not dead, have not flourished. Rails keeps growing, PHP keeps growing and all kinds of JavaScript client and server based technologies are starting to gain followers.

As for computer security... well, the shakeup in the industry has not happened. Yet. I still think that most of the enterprise level approach to security is plain wrong, focused more on "checklist" security than on actual reflection on the dangers and implications of their actions. But it seems that no one has started to notice except me. Time will tell. In the end, I think this one was more of a personal desire than a prediction in itself.

The good 

Mayan prophecy. Hey, this one was easy. Besides, if it had come true, I wouldn't have had to acknowledge the mistake in a predictions result post.

Javascript. Flash is now irrelevant. Internet connected devices with either no Flash support at all or weak Flash support have massively outnumbered the Flash enabled devices. jQuery and similar technologies now provide almost the same user experience. Yes, there are still some pockets of Flash around, notably games and the VMWare console client, but Flash no longer is the solution that can be used for everything.

NoSQL. I don't have hard data to prove it, but some evidence -admittedly a bit anecdotal- from its most visible representative, MongoDB, strongly suggests that the strengths and weaknesses of both NoSQL and SQL are now better understood. NoSQL is no longer the solution for all the problems, but a tool that, like any other, has to be applied where it is most convenient.

Java. I have to confess that I did not expect Java to decline so quickly, but as I said a year ago, Oracle had to change a lot to avoid that, and it has not. The latest batches of security vulnerabilities (plus Oracle's late, incomplete and plain wrong reaction) have finally nailed the coffin shut for Java in the browser, with no chance of going back. A pity, now that we have Minecraft. On the server side, the innovation rate in Java has stagnated and the previously lightweight and powerful framework alternatives are now seen as being as bloated and complex as their standards derived by committee brethren.

Apple. Both on the tablet and mobile fronts. Android based alternatives already outnumber Apple's products in volume, if not in revenue. And Apple still continues to be one of the best functioning marketing and money making machines on the planet.

MySQL. This one really is tied down again to Oracle's attitude. But it has happened, both for the benefit of Postgres and the many MySQL forks (MariaDB, Percona, etc) that keep in their core what made MySQL so successful.

Postgres. In retrospect, that was easy to guess, given the consistent series of high quality updates received in the last few years and the void left by Oracle's bad handling of MySQL and the increasingly greedy SQL Server licensing terms.

Windows Phone. Again, an easy one. A pity, because more competition is always good. As with Windows 8, it remains to be seen if Microsoft can -or wants to- rescue this product from oblivion.

Will there be any 2013 predictions now that we're in February?

On reflection, some of these predictions were quite easy to formulate, if somewhat against what the general consensus was at the time. That's why there are likely not going to be 2013 predictions. I still firmly think that Windows will go niche. That is happening today, but we have not yet reached the "Flash is no longer relevant" tipping point. You'll know that we've arrived there when all the big name technologists start saying that they had seen it coming for years. But they have not started saying that. At least not yet.

Anyway, this prediction exercise left my psychic powers exhausted. Which is to say, I don't have that many ideas of how the technology landscape will change during 2013. So as of today, the only prediction I can reliably make is that there won't be 2013 predictions.

Developing Android applications with Ubuntu - II

It has been a few months since my latest post, and I've been quite busy with other interests during these times, but finally got some time to reflect and post a few updates.

Last time I wrote something, it was my intention to start playing around with Android applications.
Note that in this context, "applications" means software packages where the final user is also the one who is paying for the application. Enterprise packages can have notoriously bad user interfaces and people using these can complain as much as they want, but in the end they are being paid for using them, and unless someone can positively prove some productivity gains from a UI upgrade, these user interfaces will remain there now and forever.

Android applications fall squarely in the category where asking someone for money raises the level of expectations. Nowadays, the race to the bottom in pricing applications has left very little margin per unit sold. Very few Android apps cost more than 99 cents; the underlying idea is that you'll make up what is lost in per unit margins through the sheer sales volume that a market of billions of Android devices allows. The end result is that for such a low amount of money, users expect polished, well designed, reliable and well behaved applications.

Compound that with the problem of market saturation. "There is an app for that" is a very convincing slogan, and is also true in the Android market. Almost all types of market niches for applications have already been occupied. It's very hard to think of an application that is not either already done well enough to occupy its niche or has enough free good enough alternatives that nobody is seriously thinking of making money selling one. There is always the ad-supported option, of course, but that is something that introduces a lot more uncertainty in the equation.

(Now someone will say that the market saturation problem is only an idea problem, and will probably be right. It could be entirely my own problem, not being able to come up with new ideas.)

So far I've created very few things worth trying to sell, or even give away. But all is not lost, at least this experience has reminded me of an important fact that I had almost forgotten: developing applications is difficult. I mean, one gets used to looking only at the server side portions of an application and analyzing them in detail, while essentially ignoring all the other components.

The phone development environment starts by throwing you back to the days of the past. Seemingly innocent development decisions have consequences on CPU and RAM usage that you're used to dismissing as transient load spikes on a desktop or server, but on those limited machines they can make the difference between a usable application and one that the OS decides to close because it's taking too long to respond or too much memory to run.

What we take for granted today, such as dealing with different timezones (with daylight saving time rules that change from year to year), different character sets and different localization rules, is the result of lots of people working for a long time, including on such unglamorous things as standards committees. Those are amazing achievements that have standardized and abstracted huge portions of application specific functionality, but even so, they are only a small part of the scope that an application has to provide.
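To make the timezone point concrete, here is a minimal sketch in Python of how much the standard library hides. It assumes Python 3.9 or later and a system timezone database being available; the zone name and the dates are just illustrative choices of mine. The same wall-clock hour maps to a different UTC offset depending on the time of year, and nobody had to hand-code the daylight saving rules.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Illustrative zone only; any zone that observes daylight saving time behaves alike.
MADRID = ZoneInfo("Europe/Madrid")


def utc_offset_hours(naive_local: datetime, zone: ZoneInfo) -> float:
    """Return the UTC offset, in hours, in force at a local wall-clock time."""
    aware = naive_local.replace(tzinfo=zone)  # attach the zone, keep the clock reading
    return aware.utcoffset().total_seconds() / 3600


winter_noon = datetime(2013, 1, 15, 12, 0)
summer_noon = datetime(2013, 7, 15, 12, 0)

# Same clock reading, different offset: the rules come from the tz database.
print(utc_offset_hours(winter_noon, MADRID))
print(utc_offset_hours(summer_noon, MADRID))
```

The interesting part is what is missing: there is no table of transition dates anywhere in the code. That table is maintained by other people and shipped with the operating system or the language runtime.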

And let's face it, the most unpredictable, irrational, demanding and unforgiving component in any software application is the human sitting in front of it. In any application, even the trivial looking ones, there is a lot of user interaction code out there that has to deal with human events happening in crazy order, data entered in weird formats that is expected to be understood and business rules that have to match the regulatory landscape changes of the last fifty years or so.

Further proof of that: the number one category of security vulnerabilities is exploiting memory management errors (buffer overflows, use of dangling pointers) by... usually sending the application malformed input. This is not by accident: dealing with user input correctly is one of the hardest parts of creating a satisfactory user experience.
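As a small illustration of how much code even a trivial input field needs, here is a sketch in Python. The function name, the length cap and the accepted range are all hypothetical, made up for this example. Python is memory safe, so this does not show a buffer overflow; it only shows the "never trust what the human typed" discipline that the memory-unsafe languages need even more.

```python
def parse_quantity(raw: str, minimum: int = 1, maximum: int = 999) -> int:
    """Turn free-form user text into an integer, or raise ValueError with a reason.

    Hypothetical helper: the limits are arbitrary and only serve the example.
    """
    text = raw.strip()
    if not text:
        raise ValueError("empty input")
    if len(text) > 10:
        # Reject absurdly long input before trying to interpret it.
        raise ValueError("input too long")
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"not a whole number: {text!r}") from None
    if not minimum <= value <= maximum:
        raise ValueError(f"out of range {minimum}..{maximum}: {value}")
    return value


# A few things a real user might type into a "quantity" box.
for sample in [" 42 ", "", "abc", "1000", "-5"]:
    try:
        print(repr(sample), "->", parse_quantity(sample))
    except ValueError as err:
        print(repr(sample), "-> rejected:", err)
```

That is four separate checks for a single number, and it still says nothing about thousands separators, decimal commas or pasted text with invisible characters.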

Let's not even add the regulatory compliance, audit requirements, the integration requirements with the rest of the environment -perhaps using those beloved text files- and the technical standard compliance and cross platform requirements.

All this adds up to a delicate balance between the user experience, the real world metaphors and processes being modeled and implemented, and the technical environment. And all this for 99 cents.
I'm not completely dropping the idea of selling an Android application some day, but it will have to wait for the right idea to come, and also for the necessary time to execute it properly.

There is also an emerging market for Android applications, one that is starting to surface and gain momentum as business adoption of Android and iPhones expands: the enterprise application, mobile version. Yes, expect some of these ugly user interfaces to be ported over to mobile platforms, and this is likely the next big revenue source for mobile developers. And of course, I expect these applications to have performance issues, too.

But so far, my biggest lesson is not about the ADK, Dalvik, ICS vs. Jelly Bean or Eclipse, for that matter. My biggest lesson from all this is that there is a world of difference between focusing on a single area of an application and improving its performance or resource usage, and delivering a complete application. The latter requires a different skill set. And after looking for a while at creating mostly toy Android applications, I'm glad that this experience has reminded me of all this. Living too long in the ivory tower can make you forget that these simple things are, in fact, quite complex.

Sunday, 6 May 2012

Developing Android applications with Ubuntu - I

The journey begins


What? Hey, you are usually focused on ranting about random topics, database performance, and generally proving to the world how smart you are. Why then this sudden curiosity for creating an Android application?

It is part curiosity, part opportunity. As they say, opportunities are there waiting for someone that is in the right place at the right time to catch them. I'm not that one, for sure, but still, after the sad news that comes from the Java camp, I wanted to explore new ways of writing applications.

Of course, it also helps if the potential audience for your application is numbered in the hundreds of millions, if not more.

So, I wanted to develop a simple Android application. Being a Linux aficionado, and looking at the Google docs, Eclipse under Linux seemed like the main opportunity. Let's start with the basics.

Setting up the stage


First, install the Android SDK. Well, the Android SDK is just a zip file that you extract somewhere on your local disk. According to what I read later, one can create whole applications with the SDK without needing any IDE at all. It has been a long time since I created user interfaces out of raw hexadecimal dumps, so I'm not one of those brave souls. In any case, take note of the folder where you extract the Android SDK. You'll need it later.

Android likes you to use Eclipse to create applications. Perhaps, after my long stint with NetBeans, it's time to go back to Eclipse again? For some reason, I tend to go from NetBeans to Eclipse and back each year or so. I tend to like the all-included NetBeans philosophy, whereas Eclipse is the place where the minority and cutting edge tools start to appear. This time it's back to Eclipse, I guess.

So go to Kubuntu and start Muon. Oh, or Software Center or something similar if you're using Ubuntu. Make sure Eclipse is installed. Start Eclipse to make sure everything is ok. Choose a suitable folder as your workspace.

Next, you can finally go to http://developer.android.com/sdk/eclipse-adt.html#installing and attempt to follow the steps to install the Eclipse infrastructure for Android. You go to Help->Install new software. You add https://dl-ssl.google.com/android/eclipse/ to the Eclipse list of sources, select the Developer Tools, click next and after a quite long pause you get... an error.

Cannot complete the install because one or more required items could not be found.
Software being installed: Android Development Tools 16.0.1.v201112150204-238534 (com.android.ide.eclipse.adt.feature.group 16.0.1.v201112150204-238534)
Missing requirement: Android Development Tools 16.0.1.v201112150204-238534 (com.android.ide.eclipse.adt.feature.group 16.0.1.v201112150204-238534) requires 'org.eclipse.wst.sse.core 0.0.0' but it could not be found


This is one of those errors that, if it were not for Google, I'd never be able to resolve. Fortunately, a noble soul has documented the fix, even with a video here. Thanks a million. However, I'm feeling that this is treading into waters that I don't know well enough. There is something very good about the Internet. Being able to tap such huge resources of information is fantastic, but am I really learning something by applying the fix? Yes, that there are people out there that know a lot more than I. Better respect these people and try to contribute something back, like with this article.

Are you ready to create your first Android app? Not yet. When you restart, Eclipse warns you that you have not selected an Android SDK. Go and define one, choosing the right API level for your target and using the folder where you extracted the SDK package. My target is going to be Android version 2.1, just because I happen to have a phone that runs that version.

Now, I'm ready for Hello World.

Wednesday, 21 March 2012

Microsoft is now a niche player


If you're about to purchase a smartphone, a tablet, or even a PC, you probably have already noticed it: Microsoft now has become a niche player.

It is all about how the balance of producers and consumers of content has evolved. When the PC revolution started, PCs were used to create content that was consumed by other means. PCs were, and they are still, used to create music, graphics, movies or books. They were used to enter data. But the content was primarily consumed in non electronic forms. Magazines, theatres, records. Paper, film or vinyl. Computers helped to create content that was consumed in other media.

The only exception to this rule was, and still is, data processing applications. Data is entered in an application, and then transformed and retrieved in many ways, but the results rarely go out of the application; at most they are interfaced with other applications and transformed again. And the ratio between the volume of information that is extracted and the amount of transactions entered keeps getting smaller. Data is condensed into tiny amounts of information for dashboards, account statements or check balances.

Then things started to go digital. Content created on computers is increasingly consumed only on electronic devices. And the PC was the main device used to consume content. Databases, on the other hand, increased in size and complexity with each evolution of the technology, each iteration generating bigger and bigger amounts of data. A significant trend is that most of today's data is directly entered by the end user, be it plane reservations or shopping carts, or generated from the clickstreams of web sites. There are fewer and fewer data entry clerks, for each iteration of process optimisation attempts to reduce or eliminate the need for human intervention. Warehouses and store shelves are full of bar code labels that reduce data entry to its minimal expression.

Ten years ago, if you wanted to do anything useful with a PC, there was little choice but to use Windows. It was the result of a three pronged approach. First, the tight control Microsoft exerted over the hardware manufacturers ensured that Windows was a popular, even cost effective, choice for PC hardware. Second, their product portfolio, covering such a wide surface of applications, allowed them to offer very seductive deals to their customers. In the database area, for example, it was not uncommon years ago to hear of someone going to standardise on SQL Server, and to learn from insiders that the product was thrown in the box close to free as part of a much larger deal involving workstation, office and server software. And finally, their lock-in through proprietary formats and protocols kept everyone else from making competing products.

When the PC was the only device capable of running applications for content creation, there was little choice but to use Windows. When the mainframe terminal died, the PC was the only alternative for data entry.

The world of today is different. The balance of content creators versus content consumers has shifted. Content can be created and consumed in many different ways, all of them completely digital. There are now orders of magnitude more devices in the world capable of running applications than personal computers running Windows. New classes of devices (phones, tablets, settop boxes, book readers) have separated clearly the roles of creator and consumer. You no longer need to use the same device for creating and consuming content. Data entry happens by means of bar code scanners or users entering the information themselves, and behaviour data is collected automatically by web logs or TV settop boxes.

And almost none, if any, of those devices run Windows. Windows and Windows applications have failed to move to these scenarios, except when they have managed to hide an embedded PC inside the devices (think of ATMs). At this point, I can only see three Windows use cases, and each is getting weaker and weaker.

  • Enterprise applications and office productivity: that is now a niche that is restricted only to people needing five year old applications that depend on Windows being compatible to run them. That plus people at home that want to have a home computing environment similar to the one in the office. This segment is being attacked very effectively by cloud services and apps, but the inertia here is huge, so it's going to last them a few years. It is also the most profitable, so expect Microsoft to fight to the death to preserve it.
  • Content creators: people that still need the full power and ergonomics of a desktop or laptop computer to create content. Note that even with the empowerment that digital technology gives to creators, the ratio of content creators vs. content consumers is still like 1 to 1000. This is not very profitable for Microsoft, but is a key segment because this channel in the past has served to promote content in proprietary formats (VB, C#, Silverlight, Office formats, WMA, .AVI, DRM music,...) that were essential to increase the desirability of their products for the consumer segment. Unfortunately for them, open standards, reverse engineering of formats and aversion to DRM are destroying the virtuous cycle of created content that can only be consumed on the Windows platform.
  • People that simply want a computer for basic tasks (browsing, mail, light content creation) and make a cost conscious purchase. It is actually true that Windows PCs are cheaper than Macs. While this is likely Microsoft's safest niche for now, it is so for a reason: this segment is the bottom of the barrel in terms of profitability. And both Mac and open source based alternatives are eroding market share from both the long and short ends of the profitability spectrum.

Microsoft Windows can now be considered a niche player in these three segments. It is a huge niche, and most anyone else would be happy to own these niches, but still a niche nonetheless. Either because of self complacency, protection of their cash cows, or lack of vision, Microsoft has failed to make any significant presence in any new technology since the year 2000 or so. The cruel irony is that protecting those niches is also what has led them to losing in other segments. Disruptive players do not care about preserving their legacy because they don't have one to preserve.

Some of you may point to the Xbox as a counter example. Before taking the discussion further, check the financials of the Microsoft console division and see how long it will take them, if ever, to recover all the money thrown at making the Xbox fight for number two or three in the console market.

In the database arena, things have been very similar. SQL Server has always been limited in scale by the underlying Windows platform. SQL Server could only grow as far as the type and number of CPUs (Intel, or Alpha in the early days), word size and RAM size supported by the Windows OS allowed, and this prevented it from being used for big loads, or even small or medium loads if there were plans to make them bigger. Since the definition of "big load" keeps changing with Moore's law, SQL Server has never made any serious inroads beyond the medium sized or departmental database, facing competition from above (Oracle, DB2) and below (open source). Could Microsoft have made SQL Server cross platform and have it running on big iron? Probably yes, at an enormous expense. But that would also have lost the nice integration features that made it such a good fit to run under Windows. And also the reason to buy a Windows Server license.

And when SQL Server was seemingly ready for the enterprise, a number of competitors arrived that made it unnecessary to host your database on your own server (Amazon). Or to have a relational database at all (NoSQL). Could Microsoft have moved earlier to prevent that? Probably, but that would have required foreseeing it first, and it would have happened at the expense of those lucrative Windows licenses sold for each SQL Server instance.

So the genie is now out of the bottle, and Microsoft can't do anything to put him back in. They are now niche players. Get used to it. The next point of sale terminal may not be a PC with a connected cash drawer.