Principles of Automation
06 Dec 2015
I’m currently involved in a large-scale automation project. A key part of this project is aligning a large number of teams, platforms and technologies. To provide guidance to all involved, the aims and principles of the project have been documented to ensure consistency during the design and proof-of-concept implementation. Whilst writing this document I deliberately pulled references from non-IT sources, to reinforce how fundamental these principles are to success.
- Why Automate
- Why don’t we fix it with X?
- What is the alternative?
- Automation is new (no it isn’t)
- Our principles
The demands placed on IT infrastructure have changed massively over the past 10 years, driven by the need for organizations to respond to modern business requirements.
Legacy IT environments were built around large, critical applications, often made up of three tiers: presentation, application and database. These would be managed as a standalone unit. In stark contrast, a modern IT environment consists of multiple small, agile applications that each specialize in a small subset of an overall business requirement. These applications are highly dependent on one another in order to function correctly. This evolution in software methodology, along with ubiquitous access to virtualization technology, has resulted in an explosion in the number of virtual machines an organization must manage. Managing the lifecycle of these VMs has become one of the primary differentiators between IT organizations that are successful and those that are not.
“Teams who treat their infrastructure as a software system can scale the number of systems they manage without needing to scale their team size to match. They can make changes and improvements to their production infrastructure frequently and rapidly, with short turnaround times, low failure rates, and fast fix times.”
Infrastructure as code
Why don’t we fix it with X?
IT automation efforts often start with a selected tool. For example, an organization might pick Puppet or Chef, then look at its existing processes and replicate that workflow in the chosen automation tool. However, as many organisations have found, this often results in complex and difficult-to-manage interactions between the automation systems and existing legacy solutions, which in turn leads to poor flexibility and high rates of failure.
What is the alternative?
Another approach is to begin by defining our requirements, make use of recognised industry best practices, and design high-level process flows before selecting supporting infrastructure and tools based on their abilities to meet these requirements. In documenting and maintaining our automation processes outside of a particular toolset, we will enable a number of things:
- It will help avoid the classic ‘if we’d thought of that up front’ issues which often hit automation efforts.
- Processes that are automated may be moved between products quickly and accurately.
- When designing our processes we will quickly be able to identify which process / tool chain / system an action should reside in.
Automation is new (no it isn’t)
Automation is not new. Many industries have embraced automation in order to scale and manage what would otherwise be an impractical task. Examples include supermarket supply chains and manufacturing. Although each of these is very different, there are lessons we can learn from each. Within this document I will be primarily referencing three sources:
To ensure our automation is effective we will establish a number of Principles of Operation. These will be referenced and enforced throughout the automation process, so it is essential that all parties involved agree to them.
IT automation is not mechanization – this would be the scripting of the process.
IT automation is the facilitation of the automated delivery of the business requirements.
Example: “Business requires automated deployment of application X.”
Working Definition of Automation
Automation in a working context means more than just automatic machinery. Machinery implies mechanization. Automation also means the system information to direct and control the people, materials, and machines, or as coined by many, systemization. Automation, then, is made up of two components, like a vector: the mechanization or material flow, and systemization, the information flow.
Automation includes every interaction IT has with the requester, including:
- The interface the requester utilises to issue the request
- The logic determining where the VM should be built
- The deployment and configuration of the VM
- The lifecycle of the VM – patching, management and eventual decommissioning
The automated infrastructure must facilitate the customer’s (internal or external) complete self-service of all Managed Virtualization components.
Example: A customer requires two new Windows VMs to be built in their ‘WEB’ cluster.
Automation requires that the destination (resource pool / host etc.) is captured as the customer enters the request, so that no break points or manual intervention are required at a later point in the process.
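As a rough illustration of capturing the destination up front, the request itself can refuse to exist without one. This is only a sketch: the `VMRequest` class, its field names and the `KNOWN_CLUSTERS` set are hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass

# Hypothetical catalogue of valid destinations; in practice this would come
# from the system of record rather than a hard-coded set.
KNOWN_CLUSTERS = {"WEB", "APP", "DB"}


@dataclass(frozen=True)
class VMRequest:
    """Everything needed to build the VMs, captured when the request is made."""
    requester: str
    os_family: str
    count: int
    cluster: str

    def __post_init__(self):
        # Reject incomplete requests immediately so nothing has to be
        # clarified by hand later in the process.
        if self.count < 1:
            raise ValueError("count must be at least 1")
        if self.cluster not in KNOWN_CLUSTERS:
            raise ValueError(f"unknown cluster: {self.cluster!r}")


# The example above: two Windows VMs in the 'WEB' cluster.
request = VMRequest(requester="customer-a", os_family="windows", count=2, cluster="WEB")
```

A request with a missing or unknown cluster fails at submission time, which is the point: the break point is moved to the moment the customer can still fix it.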
Antipattern - Hand-cranked virtual infrastructure
Hand-cranked virtual infrastructure uses virtualization tools to manage hardware resources, but doesn’t provide them to users dynamically or with a self-service model. I’ve seen organizations do this with expensive virtualization software, and even cloud-capable software like VMWare vCloud. They keep the old, centralized model, requiring users to file tickets to request a server. An IT person creates each server, working under an SLA that gives them a few days to do the work, and then returns the login details to the user. No matter how impressive and expensive the software used is, if the service model is neither dynamic nor self-service, it won’t support infrastructure as code.
Infrastructure as Code
Modern IT environments are highly complex, interdependent ecosystems. As such ‘uncontrolled’ automation of these environments will lead to highly complex logic and processes that are unmanageable and fragile.
Therefore care must be taken when designing and implementing this environment to ensure that all components meet the below requirements:
- Single purpose – each ‘component’ (application, script, workflow) should carry out a single task. When a component is responsible for multiple tasks it invariably becomes complicated.
- Resiliency – components will call each other to perform actions or retrieve information; however, they must be resilient to the failure of other components and validate all information provided.
- Retry on failure
- Component discovery (there may be multiple locations a component could be called from; if an error code is returned from one, another should be tried)
- Graceful failure – if a component fails and cannot fulfil a request, this must be handled correctly and detailed information provided.
- Standardised API – standards will be created for the passing of information from one component to another. For example when passing VM information between components a standard set of information is provided (whether it is needed or not by the client application). This standardisation will make interoperation between components more reliable.
- Clear data models – it is important that clear distinctions and relationships between entities are documented and honoured by components in the automation.
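The resiliency points in the list (retry on failure, trying another location, validating what comes back, and failing gracefully with detail) could be sketched roughly as follows. The function `call_with_fallback`, the `ComponentUnavailable` exception and the `vm_id` key are all assumptions made for illustration; the endpoints are plain Python callables standing in for real service calls.

```python
import time


class ComponentUnavailable(Exception):
    """Raised when no endpoint could serve the request."""


def call_with_fallback(endpoints, request, attempts=3, delay=0.0):
    """Try each endpoint in turn, retrying each one before moving on.

    `endpoints` is a list of callables standing in for the locations a
    component can be reached at (hypothetical; real ones would be remote calls).
    """
    errors = []
    for endpoint in endpoints:
        for attempt in range(1, attempts + 1):
            try:
                result = endpoint(request)
            except Exception as exc:  # a failed call: record it and retry
                errors.append(f"{endpoint.__name__} attempt {attempt}: {exc}")
                time.sleep(delay)
                continue
            # Validate what came back rather than trusting the other component.
            if isinstance(result, dict) and "vm_id" in result:
                return result
            errors.append(f"{endpoint.__name__} attempt {attempt}: invalid response")
            break  # a malformed answer is unlikely to improve on retry
    # Graceful failure: one clear exception carrying every detail collected.
    raise ComponentUnavailable("; ".join(errors))


def primary(request):
    raise ConnectionError("primary is down")


def secondary(request):
    return {"vm_id": "vm-001", "name": request["name"]}


result = call_with_fallback([primary, secondary], {"name": "web01"})
```

Here the primary location fails three times, the secondary answers, and the caller never needs to know which one served the request.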
Further reading: http://www.cloudscaling.com/blog/cloud-computing/clouds-are-complex-but-simplicity-scales-a-winning-strategy-for-cloud-builders/
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system. – John Gall (1975, p.71)
Establishing an automation platform within any large organisation will be a considerable investment of time, people and money. It is therefore essential that a modern automation platform meets the following requirements:
- Target agnostic – the automation processes defined should be generic in nature and apply to multiple target platforms, for example we should consider whether our models will apply to:
- VMware vSphere
- VMware vCloud
- VMware Photon
- Amazon Web Services
By ensuring our processes are applicable to all of the above systems, we can ensure we are building a flexible platform that is ready for growth and possible changes in the future.
- Extensible – the automation platform should treat all managed systems and services as first-class citizens. A hypervisor, VM, resource pool, network, replication service, backups etc. are all components which should be managed individually. Together they create a solution; however, each should be discretely managed and its relationship to other entities clearly documented.
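One way to keep processes target agnostic is to write them against a small interface and keep platform details behind it. The sketch below is illustrative only: `Target`, `InMemoryTarget` and `deploy` are hypothetical names, and the stand-in target does not talk to vSphere, AWS or any real platform.

```python
from abc import ABC, abstractmethod


class Target(ABC):
    """What every target platform must offer to the automation processes."""

    @abstractmethod
    def create_vm(self, name: str, cpus: int, memory_gb: int) -> str:
        """Create a VM and return its platform-specific identifier."""


class InMemoryTarget(Target):
    """Stand-in target used here so the sketch runs without a real platform."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self.vms = {}

    def create_vm(self, name, cpus, memory_gb):
        vm_id = f"{self.prefix}-{len(self.vms) + 1}"
        self.vms[vm_id] = {"name": name, "cpus": cpus, "memory_gb": memory_gb}
        return vm_id


def deploy(target: Target, name: str) -> str:
    # The process only knows the Target interface, never the platform behind it.
    return target.create_vm(name, cpus=2, memory_gb=4)


vsphere_like = InMemoryTarget("vsphere")
aws_like = InMemoryTarget("aws")
ids = [deploy(t, "web01") for t in (vsphere_like, aws_like)]
```

The same `deploy` process runs unchanged against both targets; supporting a new platform means adding another `Target` implementation rather than rewriting the process.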
“Automation must adapt to changes without repeating the initial investment trends in automation”
System of record
It is essential to the success, stability and auditability of the automation environment that a single source of truth is created, referenced and maintained for each entity that is managed. This will serve a number of key functions within the environment:
- Document the desired state of the solution. This will provide a reference we can audit and remediate the physical state of the environment against, as well as other external systems (such as CORE, Environment Manager etc)
- Provide an authoritative source of information for all components of the automation environment to make decisions against
In order to ensure the availability and manageability of the environment the system of record must honour the overall aims we set for the automation project as a whole, including:
- Simplicity
- Flexibility
Using externalized configuration makes it easy to replicate instances of a given service, application, environment, network configuration, or other element of an infrastructure. This is critical for reliable testing. Instances of the same tool which rely on people to manually enter the configuration by poking and clicking at a GUI inevitably become inconsistently configured. Tests that work on a test instance don’t work on production, either because the changes that were tested weren’t made quite the same way, or because other parts of the configuration are inconsistent between instances. Externalized configuration can be automatically loaded into each instance in the testing pipeline in turn. Done correctly, this guarantees the consistency of each instance, and so the reliability of the testing process. This is the basis of change management pipelines.
Infrastructure as Code
In order to ensure the environment functions in a consistent and reliable manner, all components which affect the state of the environment must honour the below (taken from Infrastructure as Code):
- Idempotent. It should be possible to execute the same script or task multiple times without bad effects.
- Pre-checks. A task should validate its starting conditions are correct, and fail with a visible and useful error if not.
- Post-checks. A task should check that it has succeeded in making the changes. This isn’t just a matter of checking return codes on commands, but proving that the end-result is there. For example, checking that a virtual host has been added to a web server could involve making an http request to the web server.
- Visible failure. When a task fails to execute correctly, it should be visible to the team. This may involve an information radiator (“What is an information radiator?”) and/or integration with monitoring services (“Alerting - tell me when something is wrong”).
- Parameterized. Tasks should be applicable to multiple operations of a similar type. For example, a single script can be used to configure multiple virtual hosts, even ones with different characteristics. The script will need a way to find the parameters for a particular virtual host, and some conditional logic or templating to configure it for the specific situation.
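To make these properties concrete, here is a minimal sketch of a task that is idempotent, parameterized, and wrapped in pre- and post-checks. The function `ensure_virtual_host` and the `TaskError` exception are hypothetical, and a plain dict stands in for a real web server's configuration, so the post-check here reads the dict back rather than making an http request.

```python
class TaskError(Exception):
    """Visible failure: raised with a message a person can act on."""


def ensure_virtual_host(config: dict, name: str, port: int) -> bool:
    """Make sure `name` is configured on `port`; return True if a change was made.

    `config` is a plain dict standing in for a web server's configuration.
    """
    # Pre-check: validate the starting conditions before touching anything.
    if not name or not (1 <= port <= 65535):
        raise TaskError(f"invalid virtual host parameters: {name!r}, {port!r}")

    # Idempotent: if the desired state is already present, do nothing.
    if config.get(name) == {"port": port}:
        return False

    config[name] = {"port": port}

    # Post-check: prove the end result exists rather than assuming success.
    if config.get(name) != {"port": port}:
        raise TaskError(f"virtual host {name!r} was not applied")
    return True


web_server = {}
first = ensure_virtual_host(web_server, "example.test", 8080)   # makes the change
second = ensure_virtual_host(web_server, "example.test", 8080)  # nothing left to do
```

Running the task a second time changes nothing, and the same function configures any virtual host simply by passing different parameters.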
In addition to the above reference requirements, we will add the below, specific to our implementation:
- Any aspect of an entity that is under management must be represented in the system of record.
- Any setting that is managed by automation must be audited against the system of record by automation, such that the environment is maintained in a known state. Deviations between the actual and desired state may either be automatically rectified, or exception tickets raised as appropriate.
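Auditing against the system of record comes down to comparing desired and actual state for every managed setting. The sketch below assumes both can be represented as nested dicts; the `find_drift` function and the sample data are hypothetical.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {entity: {setting: (desired, actual)}} for each managed setting that differs."""
    drift = {}
    for entity, settings in desired.items():
        observed = actual.get(entity, {})
        # Only settings present in the system of record are audited;
        # anything else on the entity is not under management.
        diffs = {
            key: (value, observed.get(key))
            for key, value in settings.items()
            if observed.get(key) != value
        }
        if diffs:
            drift[entity] = diffs
    return drift


system_of_record = {"web01": {"cpus": 2, "memory_gb": 4}}
environment = {"web01": {"cpus": 2, "memory_gb": 8, "notes": "unmanaged"}}
drift = find_drift(system_of_record, environment)
```

Each entry in the result is a deviation that can either be rectified automatically or turned into an exception ticket, depending on the setting.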
Consistent, centralised logging and auditing should be present across the automation environment. All automation tasks should log key information in a standard format, so that exceptions are easily tracked and remediated. All automation exceptions should be captured and investigated; by remediating these we ensure the automation success rate is maintained.
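A standard log format can be as simple as every component emitting the same set of keys. The following sketch uses Python's standard `logging` module; the `JsonFormatter` class, the `get_logger` helper and the choice of keys (including `task_id`) are assumptions for illustration, not an agreed standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
            # task_id is supplied by the caller through `extra`; None if absent.
            "task_id": getattr(record, "task_id", None),
        })


def get_logger(component: str) -> logging.Logger:
    logger = logging.getLogger(component)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger("vm-builder")
log.error("deployment failed", extra={"task_id": "task-42"})
```

Because every line has the same shape, a central collector can filter on `level` and `task_id` to track an exception across components.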
“Automation must be robust and tolerant enough to keep functioning, and functioning well even under adverse conditions” Automation will usually entail a sizable investment. If so, the return on this investment is most assuredly based on continuous use. Inoperability due to breakdown, spare parts, operator mistakes or undue complexity cannot be tolerated.