WorldTurner Blog
Stateless computing as I see it
Stateless computing has had some press coverage in the past. What it means to me is that you don't deliver your software as some package that you need to install on a computer, but rather as the entire configuration of the computer as a whole.
That doesn't sound too far-fetched, considering what many are already doing, and I don't think it should be considered far-fetched. Some points though:
Self-modifying code
Self-modifying code used to be a no-no, at least in general-purpose business software. (In embedded systems, high-performance assembly code or self-optimizing just-in-time compilers, it's acceptable.) The boundaries blur a bit when scripting languages are considered; many Ruby and Python apps do modify their own code through frameworks, but they typically do that only once at startup, and don't change the way that the applications logically operate.
However, when it comes to running Unix servers, it is usually considered acceptable, or at least common practice, to change the system while it is in operation, without fully foreseeing the consequences of the changes. Adding, removing or upgrading system packages may very well change the way that the system operates as a whole, including the functions of any business applications that are running on the system. Worse, these changes are usually not tracked, not stored in version control systems, not executed in the same way on all systems which are supposed to have the same configuration, etc.
Tools like Puppet, Chef, CFengine, etc. (let's call them the PCCE tools) address some of this, in particular the consistency of the configuration over multiple systems. They allow some tracking and version control, but only of the changes that are made by themselves. It is still possible to make changes and modify packages that are not configured by them.
PCCE tools typically configure the system after it has been imaged; they use a base OS image, perhaps one that already includes the tool itself, and then go about configuring it like they are programmed to do. Another approach is to configure as much as possible in the image already. The image could be a virtual machine image or a disk image that is used by physical hardware. Then either:
- You make the disk(s) on which your system resides read-only (not those where your app's writable data reside, obviously)
- Or you make sure that after each reboot, a pristine image is applied again to the system
Advantages
The major advantage is that you have a very good idea what your system is running. This makes it much easier to understand what is going on in the system. It helps to investigate problems; you can quickly build another system with the exact same configuration to investigate and test fixes. It makes sure that no changes are made to runtime systems that should actually have been made in the business application's source code or in the system configuration documentation, so you don't run into surprising problems when you upgrade the application or build a new instance of the system.
In large organizations (and I suppose in smaller ones as well), most production issues are caused by changes that people made on production systems without telling anyone, without applying them to the source of the application or system configuration, and without testing. In good DevOps fashion, this can be attributed equally to developers and system administrators; both fall for the appeal of a quick fix that makes the users happy and lets them go home early, only to cause even bigger problems some time later. There is no malice assumed, but like writing code without proper testing, it's a flaw in your engineering skills if you do that. And current approaches to system management invite this way of changing things, since it is impossible on most operating systems to adequately track all changes and attribute them to those who made them. In software development, people are used to the fact that the CI server will point a stern finger at you when you break the build, forget to check in code or drop the test coverage level. But in system administration that rarely happens now.
How?
There are some issues to resolve before you can successfully use such an approach to stateless computing. You need to decide what type of image you want to make. The most obvious one looks to be a disk image, but other approaches like xCat's statelite are also possible. And then you need to decide how to build such an image. You can see the approach that I like most in my previous blog posts. Possible approaches are:
- Building (bootable) images on another machine using scripts, chroot and PCCE
- Starting a base image in a VM (or on physical hardware), configuring it using scripts and PCCE, then snapshotting it
- Using a tool like rPath to build your images, and doing the fine-tuning with PCCE
Configuration!
When a business application has configuration items, they are usually in one file which you can modify. These usually involve settings that must be changed depending on the environment in which the application is run. For Development, (Integration) Testing, Acceptance Testing and Production environments, things like database servers and queue names usually need to be changed. Or the application may be run for multiple different customers (SaaS) or in different regions and be configured differently for each. If the application is well designed, then only those items that actually do need to be changed are in this configuration file; other items are in a 'static' configuration that is not modifiable.
On a typical Unix system, there are very many configuration files. However, only a few of those need changes after the image is built: the ones that actually differ between the environments that the application will run in. It makes a lot of sense to put all these configuration items in one file, or in one directory, or perhaps as one item on a configuration server on your network, and then use a PCCE tool (using templates) to apply these changes every time that the system boots. All other configuration should be done during the creation of the image.
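To make the boot-time step concrete, here is a minimal sketch. It assumes a hypothetical JSON file holding the per-environment settings, and it uses Python's standard string.Template as a stand-in for the template engine of a real PCCE tool. The keys, file format and template text are my own illustrations, not something any of those tools prescribes.

```python
import json
from pathlib import Path
from string import Template

# Hypothetical template that is baked into the image. Only the
# $placeholders differ between environments.
APP_CONFIG_TEMPLATE = Template(
    "db.host=$db_host\n"
    "db.name=$db_name\n"
    "queue.name=$queue_name\n"
)


def load_environment_settings(path):
    """Read the single per-environment settings file (assumed to be JSON)."""
    return json.loads(Path(path).read_text())


def render_config(settings):
    """Fill in the baked-in template; raises KeyError if a setting is missing."""
    return APP_CONFIG_TEMPLATE.substitute(settings)


def apply_at_boot(settings_path, target_path):
    """Render the config and write it where the application expects it."""
    rendered = render_config(load_environment_settings(settings_path))
    Path(target_path).write_text(rendered)
    return rendered
```

Using substitute rather than safe_substitute is deliberate: a missing setting then fails loudly at boot instead of leaving a half-filled config file behind.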
If you didn't separate the configuration and place it outside the image, you would have to build images for each possible configuration, and worse, you would have to build different images for each phase in a DTAP environment, with no guarantee that they would actually behave in a similar fashion. I believe that configuration is actually the most difficult part of this story, and I haven't made up my mind about the best way to do this. I am certainly looking for flexible best practices, products, etc. in this area.
The whole approach works best when the servers that you manage are somewhat similar or fall into a few groups of similar servers; if not, there would be too much variation in configuration and the variation would go to a deeper level. This is especially clear when you are considering different CPU architectures or operating systems; it is immediately apparent that these require completely new images, and you want as few images as possible to keep things manageable. Yet, if you have fewer images than hosts in your environment, you still have less to worry about than in a situation in which you are configuring individual machines.
Convention over configuration
Some of the configuration problems could be alleviated by conventions. Especially conventions for commonly used middle-ware and back-ends contribute to a reduction in the number of configuration items. By incorporating the environment name in the name of each of these resources, you don't actually have to configure these resource names anymore. If your application is called X and your databases DEV_X, TEST_X, QA_X and PROD_X, then you don't have to configure the database names anymore; you just configure the type of environment that you are in and derive your name from that. The same applies to these resources:
- Database servers
- Database names
- Messaging middle-ware servers
- Queue names
- Logging hosts
- Monitoring servers
- License servers
- Directory servers
- etc
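A minimal sketch of deriving names by convention, using the DEV_X style from the database example above. The queue and logging host patterns are hypothetical conventions that I made up for illustration; only the database pattern comes from the text.

```python
ENVIRONMENTS = ("DEV", "TEST", "QA", "PROD")


def database_name(environment, application):
    """Derive the database name, e.g. DEV_X for application X in DEV."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return f"{environment}_{application}"


def resource_names(environment, application):
    """Derive all resource names from the environment type alone."""
    names = {"database": database_name(environment, application)}
    # Hypothetical conventions for the other resources in the list.
    names["queue"] = f"{environment}_{application}_QUEUE"
    names["log_host"] = f"log.{environment.lower()}.example.internal"
    return names
```

With this in place, the only item left to configure per environment is the environment type itself, for example resource_names("PROD", "X").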
Writable data!
Another thing to tackle is writable data. Many Unix packages and business applications write data in their install directory. That's not the best place. You may already have had to change that to point to your NAS or SAN; if not, you're going to need to do that now. You can then make a further distinction between 'write-and-forget' data, such as logging output, and data that needs to be read and written (such as business data).
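As a rough sketch of that separation, the snippet below resolves the two kinds of writable locations from environment variables and checks that neither ends up under the install directory. The variable names APP_DATA_ROOT and APP_LOG_ROOT and the default paths are assumptions of mine, not an existing convention.

```python
import os
from pathlib import PurePosixPath


def writable_paths(application, environ=os.environ):
    """Return the locations for log output and for business data."""
    # Hypothetical variable names and defaults; adjust to your NAS or SAN mounts.
    data_root = PurePosixPath(environ.get("APP_DATA_ROOT", "/mnt/appdata"))
    log_root = PurePosixPath(environ.get("APP_LOG_ROOT", "/var/log"))
    return {
        "logs": log_root / application,           # write-and-forget
        "business_data": data_root / application,  # read and written
    }


def is_inside(path, directory):
    """Lexical check whether path lies under directory (symlinks not followed)."""
    try:
        PurePosixPath(path).relative_to(PurePosixPath(directory))
        return True
    except ValueError:
        return False
```

A startup script could call is_inside on each returned path with the install directory and refuse to start if any check returns True.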
Conclusion
I think this approach can really work. By treating complete machines (or even sets of machines in a multi-tiered environment) as the final deliverable, you make your environment more predictable and less complex. Technically, it is clear that it can be done. Organizationally, it should also be clear that this can only be done if development and operations work closely together, because it requires knowledge of both, closely intertwined. Yet this need is one of the stated goals of the DevOps movement, and is an implied requirement anyhow in smooth application delivery and operations, so it shouldn't hold you back from pursuing stateless computing.
Disclaimer: everything in this post or any other post on my blog is purely my own opinion and does not (necessarily) represent the opinion of anybody else, including business partners, current or former employers, current or former clients, etc.
Posted at 07:59PM Mar 06, 2011 by Erwin Bolwidt in DevOps | Comments[2]
Posted by Justin Cormack on March 12, 2011 at 08:54 PM CET #
Posted by Spike Morelli on March 17, 2011 at 08:28 AM CET #