Sunday, 27 February 2011

Anything not scripted is a risk


From time to time, I've encountered situations that left me wondering about their root causes. Here are a few examples:

  • People joining the development team need weeks to be able to build the application, set up their development environment, or both. In the worst cases I've seen, it takes months for someone to become productive.
  • The staging environment is terribly out of date. Nobody dares to propose refreshing it with a copy of the live environment, for fear of not being able to get it performing at least as well as it does now. Which is not very close to the live environment, but at least not too far off. Well, in fact, nobody is quite sure how close it is.
  • Disaster recovery is a nightmare. During disaster recovery tests, the actual time for a cold restore is way off the committed figure. Someone always has to step in and fix by hand whatever is blocking the test from progressing.
  • Deploying changes to the live environment is a hit-and-miss affair. Even after acceptance testing, nobody really trusts that a change is not going to break something else.
  • Nobody even dares to think of backing out changes in the live environment. Debuggers are installed on production servers as a matter of routine. When a problem happens in live, nobody can reproduce it in the development or staging environments. Worse, backing out a change is seen as an effort equivalent to invoking disaster recovery.

The truth is, the larger the system, the more likely those symptoms. In all these cases, the reason was the same: there were small (and big) details requiring manual intervention, known only by a few people, if any. Everyone gets by copying configuration files from one machine to another without really knowing what they contain.

Until recently, I was convinced that this kind of problem was simply an artifact of system complexity. With really large and complex systems, there is no single individual in the organization who knows everything about the system.

Oh yes, I hear you cry, this is all because there is no documentation. Lack of documentation is always the problem, isn't it? Sorry, lawyers, accountants and non-software engineers: you see every problem as a lack of documentation because in your world a change, however small, that is not documented is simply not done. Your tools of the trade are paper and pencil, and the people following your processes necessarily need to refer to those papers to do anything. In the computing world, paper means nothing until it is implemented, and it is often faster to implement something than to document it. Thus, things tend to become undocumented over time, if only because there are always higher priorities than writing documentation and, anyway, the system is working fine. How can documentation be a priority?

The revelation for me came when speaking with a very respected professional working at one of the largest internet companies in the world. In such companies, deploying changes has been compared to changing a tyre while the car is in motion. And it really is like that. We were having a conversation about uptime, and about how different "scheduled uptime" can be from real-world uptime, at least in the context of a traditional business application. He was saying that for them, scheduled uptime flatly equals real-world uptime; that is, if a system is 24x7, then there are no scheduled downtime windows. Period.

None of the non-customer-facing environments I've worked with comply with that rule. Even the ones that boast of high availability and long uptime always exclude their maintenance windows from those counts. It is simply too risky for them to make changes to the live environment without a number of additional safeguards and tests in place. How, then, can you make changes to a live environment without fear of breaking it?

It turns out that the key for this "real-world uptime" to work is: everything has to be scripted. Changes are applied to a subset of the infrastructure first, tested, and then applied to the whole environment. Nothing is manual except the execution of the script. And for each change script there is always a rollback script.
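To make this concrete, here is a minimal sketch of what such a paired change/rollback run could look like, assuming changes are pushed over key-based SSH. The host names, the change command, the health check and the rollback command are all hypothetical placeholders, not anyone's real procedure:

```python
#!/usr/bin/env python3
"""Sketch of the "change script plus rollback script" pattern.

Everything here is a hypothetical placeholder: host names, the change
command, the health check and the rollback command are assumptions made
for illustration only.
"""
import subprocess
import sys

CANARIES = ["app01.example.com"]                       # tested subset first
THE_REST = ["app02.example.com", "app03.example.com"]  # then the whole fleet

CHANGE = "sudo systemctl restart myapp"                # the scripted change
ROLLBACK = "sudo systemctl start myapp-previous"       # its paired rollback
HEALTH = "curl -fsS http://localhost:8080/health"      # post-change test


def run_on(host: str, command: str) -> bool:
    """Run a command on a host over ssh (assumes key-based access)."""
    return subprocess.run(["ssh", host, command]).returncode == 0


def apply_to(hosts: list[str], changed: list[str]) -> bool:
    """Apply and verify the change host by host, recording touched hosts."""
    for host in hosts:
        if not run_on(host, CHANGE):
            return False
        changed.append(host)          # touched, so eligible for rollback
        if not run_on(host, HEALTH):
            return False
    return True


def main() -> int:
    changed: list[str] = []
    # Canaries first, then the rest; on any failure, roll back what changed.
    if apply_to(CANARIES, changed) and apply_to(THE_REST, changed):
        return 0
    print("Change failed, rolling back touched hosts", file=sys.stderr)
    for host in changed:
        run_on(host, ROLLBACK)
    return 1


if __name__ == "__main__":
    sys.exit(main())
```

The interesting property is that the rollback path is written, reviewed and rehearsed together with the change itself, not improvised in the middle of the night.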

Such a simple rule is not easy to enforce, and at first it may look too extreme.

Diving deeper into the idea, there's another key point. The statement "everything has to be scripted" is applied literally, word for word. No exceptions. Server builds are scripted. Network configuration changes are scripted. Database changes are scripted. Development environments are scripted. Application deployments are scripted. Everything means everything, no exceptions. And everything can be undone with another script. No exceptions, except for destructive changes: there is little point in rolling back a server build to its factory-origin state. Database changes are NOT destructive, except when you create a brand-new database.
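As an illustration of a reversible database change, here is a rough sketch using SQLite from the Python standard library, so it runs anywhere; the table and index names are made up. The forward step is purely additive, and the rollback removes only what the forward step created:

```python
#!/usr/bin/env python3
"""Sketch of a database change shipped together with its rollback.

Uses sqlite3 from the standard library; the table and index names are
hypothetical examples, not a real schema.
"""
import sqlite3
import sys

FORWARD = [
    # Additive change: a new table plus an index, existing data untouched.
    "CREATE TABLE IF NOT EXISTS audit_log ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " action TEXT NOT NULL,"
    " created_at TEXT DEFAULT CURRENT_TIMESTAMP)",
    "CREATE INDEX IF NOT EXISTS idx_audit_user ON audit_log(user_id)",
]

ROLLBACK = [
    # Undo exactly what FORWARD created, in reverse order.
    "DROP INDEX IF EXISTS idx_audit_user",
    "DROP TABLE IF EXISTS audit_log",
]


def run(db_path: str, statements: list[str]) -> None:
    """Apply a list of statements inside a single transaction."""
    with sqlite3.connect(db_path) as conn:
        for stmt in statements:
            conn.execute(stmt)


if __name__ == "__main__":
    database = sys.argv[1] if len(sys.argv) > 1 else "app.db"
    direction = sys.argv[2] if len(sys.argv) > 2 else "forward"
    run(database, FORWARD if direction == "forward" else ROLLBACK)
    print(f"{direction} migration applied to {database}")
```

Saved as, say, migrate.py, the same pair of forward and rollback statements can be rehearsed in staging before ever touching the live database.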

Such a simple idea should not raise much resistance, but in fact it does. I've heard the same counterarguments too many times, so in the true spirit of this post, I've scripted the responses for you:

"There is no way of scripting this". That was true in the past, especially for MS Windows. The latest releases of Windows have (finally!!) embraced scripting end to end. There are very few, if any, details in server setup that cannot be scripted. If your application/middleware/whatever does not give you the option of scripting the installation and configuration, consider changing to another brand/supplier. Manual configuration is not an option. In this day and age there is simply no excuse for not listing the configuration file values or registry entries needed for configuration. Anyone that is saving configuration options in proprietary binary files is plain wrong.


While the rise of Windows in server environments is partly due to its ease of use and point-and-click administrative tools, it is time to recognize that those methods only work at extremely small scales. It is worth taking another line from the development community: anything that you repeat three or more times is worth encapsulating. Anyone with more than three servers should be scripting everything, unless your idea of a professional life is hunting mysterious problems and configuration differences.
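For instance, here is a rough sketch of that rule in practice: rather than logging into four servers to eyeball a config file, script the comparison once. The host names and file path are hypothetical, and key-based SSH access is assumed:

```python
#!/usr/bin/env python3
"""Sketch of the "more than three servers" rule: script the hunt for
configuration differences instead of doing it by hand.

Host names and the compared file are hypothetical placeholders.
"""
import subprocess

HOSTS = ["web01.example.com", "web02.example.com",
         "web03.example.com", "web04.example.com"]
FILE_TO_COMPARE = "/etc/myapp/myapp.conf"


def checksum(host: str, path: str) -> str:
    """Return the sha256 of a remote file, or a marker if unreachable."""
    result = subprocess.run(["ssh", host, f"sha256sum {path}"],
                            capture_output=True, text=True)
    fields = result.stdout.split()
    return fields[0] if result.returncode == 0 and fields else "UNREACHABLE"


if __name__ == "__main__":
    sums = {host: checksum(host, FILE_TO_COMPARE) for host in HOSTS}
    for host, digest in sums.items():
        print(f"{host}: {digest}")
    if len(set(sums.values())) > 1:
        print("WARNING: configuration drift detected between hosts")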

"But it's all very well documented". Agile "Code is documentation" adage is perfectly applicable here. Do you trust the script source code or some outdated, incomplete piece of paper that is kept around and passed as some sort of cargo cult sysadmin trivia?

"It's not worth the time to script simple changes" Oh yes, in fact any piece of software is composed of very simple instructions. It's only when the number of these instructions reaches the thousands that you start needing tools that help you remember them and make nice packages that can be reused without knowing the details. The truth is, enough simple things chained together make a complex one, and small changes are part of a bigger picture. Or looking at it in another way, which role do you want to play as a sysadmin? Do you want to be the factory line worker, mouse clicking these changes by hand over and over? Or you prefer to invest your time in setting up solutions that actually add value to the company? I have already made my choice.

"hey, I'm a sysadmin, not a developer" No, you're not a developer. Developers deal with scripts that are thousands of magnitude more complex than deployment scripts. If you were a programmer, your complexity rating of the typical deployment script would be on the lower bottom of the scale. If you call yourself a sysadmin and are not able to grok a script, then perhaps you should think of steering your career to somewhere else. Of course, applying programming best practices, such as not hard coding literals and structure properly your code, will make your efforts much more efficient. But typical 100 line scripts don't require this level of self control and are easily refactored, so don't be shy of creating something that a developer would judge as a bad program.

Please, sysadmins, do not take offense. I'm not saying that sysadmins are a bunch of not-good-enough developers. Development and system administration are different roles, with a focus on very different subjects.

"It's too complex to be scripted" This is a good one, because it implies that you're judging things by your limited knowledge of the system, and the system operators know better than you. The proper answer in this case is to ask for the equivalent paper documentation. Surely it will be equally long and complex, wouldn't it? Check and review the equivalent manual steps before giving any credit to this argument.

Considering how many times I have seen this pattern repeated, I cannot imagine why it is not a universal rule. Perhaps I'm not in contact with the leading practices, or perhaps I'm the one getting outdated. But it is worth repeating:

Everything software-related in operations should be scripted. No exceptions. Any exception is a risk worth watching.
