Our Approach to Runbooks at Octopus Deploy
You probably use software tools to help automate your development processes, from git to build servers to continuous testing suites. These tools let you streamline your development flow to focus on more critical tasks.
Runbooks simplify the “ops” side of DevOps by allowing operations teams to share standard processes.
If you work in operations, you’ve likely completed a runbook detailing a sequence of manual steps or scripted a runbook yourself to help automate routine tasks.
Runbooks, as they’re frequently used in ops teams, are just checklists that people work through manually. At Octopus, we developed a runbook feature to run alongside our deployment tool because runbooks and deployments go hand in hand.
Octopus Runbooks are always automated processes, making DevOps teams more efficient. We use and create runbooks extensively and have a few thoughts on how you should think about designing and executing them.
What Are Runbooks?
A runbook is a series of steps used to complete a routine computer process. Traditionally, runbooks were printed on paper and referred to when needed. Of course, this led to issues when the paper-based checklists weren’t updated and people tried to follow instructions that were no longer valid.
Runbooks in modern IT systems take these instructions and automate them in a repeatable process. In a world where developers have many automation tools, runbooks are a welcome addition to an operational workflow.
You can use runbooks to:
- Automate routine configuration, maintenance and emergency operations tasks.
- Provision infrastructure.
- Install dependencies.
- Restart infrastructure.
- Streamline operations work to focus on critical tasks you can’t automate.
Why We Built Runbooks into Our Platform
Providing our users with runbook functionality made sense because Octopus Deploy already connects to your deployment infrastructure. We know where your deployment targets are and have integrations to communicate with them. By adding runbooks to our platform, we let you perform operations tasks alongside your deployments in one tool.
After testing and using runbooks internally, we released a runbook feature to run alongside our deployment process. Operations tasks and deployment tasks are both dedicated features in Octopus. You can share variables across runbooks and deployments, allowing full integration across both development and operations.
The deployment process takes an artifact built by a CI server to a live deployment, but there are many points in the process where a runbook can help. A runbook can update the deployment targets with the latest dependencies or remove unnecessary infrastructure after a deployment is complete. You may need to restart a server at a specific time or trigger a process on a condition. A runbook is the perfect tool for the job.
Our internal testing showed that runbooks provide discovery and visibility in the deployment process. Since deployment targets are already known, automating operations tasks surfaces logs and metrics that you can use to troubleshoot your system. Tracing can connect interrelated deployment targets and help diagnose broader intersystem issues.
Runbooks also help with security requirements. You can use them to:
- Renew SSL certificates.
- Configure permissions when a new user joins a system.
- Automatically remove someone who shouldn’t have access to a system anymore.
The logs from your runbook are also helpful during auditing, especially if your business is pursuing security certifications such as ISO 27001 and SOC 2.
How We Think You Should Use Runbooks
When we designed runbooks for Octopus Deploy, we knew adding this feature would give developers an open environment to create complicated runbooks. Despite this open environment, it’s helpful to have a few ground rules to get the most out of runbooks. We’ve spent a lot of time using and thinking about runbooks ourselves, so we have some opinions on how best to use them.
What Problem Is Your Runbook Solving?
A common pitfall for developers is jumping headfirst into a problem without fully understanding the issue. Our runbooks feature lets you create steps to solve many different operations tasks. However, you need to understand the purpose of the runbook to avoid ending up with a bloated process that tries to do too much.
It’s better to map out all the use cases and “gotchas” up front so that the final runbook solves one specific use case. Ask yourself:
- What will happen if this runbook fails or gets stuck?
- What dependencies does this runbook rely on to run?
- Does this runbook have one function?
If a potential runbook has many use cases, split the use cases into multiple runbooks and chain them together.
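As a rough illustration of splitting work into single-purpose runbooks and chaining them, the Python sketch below models each runbook as a plain function. The names (renew_certificate, restart_web_server, run_chain) are hypothetical and are not part of Octopus Deploy; they only show the idea of small runbooks run in sequence, where the chain stops at the first failure.

```python
from typing import Callable

# Each hypothetical runbook does exactly one job and reports success or failure.
Runbook = Callable[[dict], bool]


def renew_certificate(context: dict) -> bool:
    context["certificate"] = "renewed"
    return True


def restart_web_server(context: dict) -> bool:
    # This runbook depends on the certificate runbook having run first.
    if context.get("certificate") != "renewed":
        return False
    context["web_server"] = "restarted"
    return True


def run_chain(runbooks: list[Runbook], context: dict) -> list[str]:
    """Run runbooks in order and stop at the first failure.

    Returns the names of the runbooks that completed successfully.
    """
    completed = []
    for runbook in runbooks:
        if not runbook(context):
            break
        completed.append(runbook.__name__)
    return completed


context: dict = {}
completed = run_chain([renew_certificate, restart_web_server], context)
print(completed)
```

Because each function has one job, it’s easy to see which dependency a failing runbook was missing and where the chain stopped.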
Have We Tried a Runbook Like This Before, and What Did We Learn?
It’s easy to work in silos in a decentralized workforce where different teams solve the same problems. Other teams in your business might already have a runbook to solve a problem like yours.
When designing your runbook, ask whether anyone else has already tried to create a similar runbook. Maybe you can reuse a runbook. Or, if a runbook didn’t work, there might be lessons learned that will affect the design of your runbook.
In Octopus Deploy, you can export runbooks for sharing and reuse. We recommend you build your runbooks using prior knowledge so you don’t have to reinvent the wheel. Our step templates are an excellent option for this, as you can save a parameterized script step and call that in your runbooks.
Constantly Tweaking Runbooks Is a Bad Thing
When testing our runbooks feature, we made some pretty clever runbooks. They did exactly what we wanted them to do, and it was all automated! When systems changed and evolved, though, the runbooks became outdated. We were constantly tweaking and editing the runbooks to keep them current. We learned that, where possible, runbooks should be future-proof. Even if they’ve been idle for a few months, they should still work.
To make this happen, we plan where our runbooks should run.
- Should it be on a deployment target?
- Should it run on a Worker?
Understanding your runbooks’ life cycles goes a long way toward keeping them from becoming obsolete. You don’t want to come back to a runbook and find it connected to sandbox targets that no longer exist.
You should also know whether a person or an automated process runs your runbooks. You don’t want to bottleneck your system by requiring a person to run a runbook that a system could trigger automatically. Understanding the dependencies will help you keep your runbooks streamlined and up to date.
What’s the Alerting Policy for the Runbook?
Runbooks in Octopus are automated by default. Automation helps the DevOps practice of continuous integration. Although it’s appealing to set up your runbook and forget about it, you need a plan for requesting the relevant approvals and alerting the appropriate people. Before implementing a runbook, consider what approvals are required and who to notify when the runbook runs. Octopus Deploy has built-in approvals and alerts to help you set up this process in your workflow.
What Information Do You Need, and How Long Should It Be Retained?
The runbook you create will leave a trail of logs, and making sense of these is helpful for the observability and security of your system. Logs, metrics and traces make up the bulk of telemetry data. Telemetry data enables system observability by allowing developers to fix or improve the system based on the output.
Once you understand what data you need for system observability, you can design your runbook to capture the relevant data for your teams. If you’re dealing with sensitive data, you shouldn’t capture every type of log, and you shouldn’t need to store all logs indefinitely. Try to plan your runbooks as part of your overall security strategy to meet your certification requirements, like ISO 27001 and SOC 2.
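To make the idea of capturing only what you need, and keeping it only as long as you need it, more concrete, here is a minimal Python sketch. The list of sensitive keys and the 90-day retention window are assumptions chosen for illustration. They aren’t Octopus Deploy settings or certification requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy values; choose them to match your own requirements.
SENSITIVE_KEYS = {"password", "api_key", "token"}
RETENTION = timedelta(days=90)


def redact(entry: dict) -> dict:
    """Mask sensitive fields so they never reach long-term storage."""
    return {
        key: "***" if key in SENSITIVE_KEYS else value
        for key, value in entry.items()
    }


def apply_retention(entries: list[dict], now: datetime) -> list[dict]:
    """Drop entries older than the retention window and redact the rest."""
    cutoff = now - RETENTION
    return [redact(entry) for entry in entries if entry["timestamp"] >= cutoff]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
logs = [
    {"timestamp": now - timedelta(days=120), "message": "old run"},
    {"timestamp": now - timedelta(days=1), "message": "login", "password": "hunter2"},
]
kept = apply_retention(logs, now)
```

Here the 120-day-old entry is dropped and the password in the recent entry is masked before it’s stored.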
Runbooks Best Practices
In our experience, most good runbooks follow a pattern. We view best-practice runbooks as a series of steps connected by testing and automation:
An ideal runbook is self-descriptive and first inspects the problem to confirm that it has to run. It collects diagnostic information about the system to identify the root cause of the issue. It confirms whether the runbook requires manual approvals before executing code to rectify the problem. After the code executes, the runbook verifies that the system is fixed and notifies the relevant channels. At every point throughout these steps, the runbook is tested and automated. Here are our thoughts on each step in the process:
Quality descriptions for runbooks are essential to promoting reuse and self-documentation. When you’re creating many runbooks, you want to be able to come back to them and understand what each one does. In Octopus Deploy, you can attach a description to a runbook to connect it to a deployment. The descriptions are searchable for specific use cases such as “web server” or “IAM.”
A runbook helps you fix a specific part of a system. It could be installing the latest patch, restarting the system or adding a user to a security list. The first step in the runbook should inspect the system’s current state to determine if it’s degraded or needs fixing. The inspection is a confirmation check to verify that the system requires a patch or that a server is returning a 404 error.
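An inspection check like this can be sketched in a few lines of Python. The function name and its inputs are hypothetical: it assumes you’ve already fetched the server’s HTTP status and the installed patch version, and it returns the reasons the runbook should run. An empty list means there’s nothing to fix.

```python
def version_text(version: tuple[int, ...]) -> str:
    return ".".join(str(part) for part in version)


def reasons_to_run(
    http_status: int,
    installed: tuple[int, ...],
    required: tuple[int, ...],
) -> list[str]:
    """Return the reasons the runbook needs to run; empty if the system is fine."""
    reasons = []
    if http_status >= 400:
        reasons.append(f"server returned HTTP {http_status}")
    # Tuples compare element by element, so (1, 1, 9) < (1, 2, 0).
    if installed < required:
        reasons.append(
            f"patch {version_text(required)} required, "
            f"{version_text(installed)} installed"
        )
    return reasons


print(reasons_to_run(404, (1, 2, 0), (1, 2, 0)))
```

Returning the reasons, rather than a plain yes or no, gives the later steps something to record and report.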
After the runbook confirms the degraded part of the system, your runbook should collect diagnostic information to determine the root cause of any issues.
You can use this information later to prevent the system from returning to that state. It’s important to know what kind of information you need to troubleshoot problems since not all data is valuable.
After diagnosing the problem, your runbook should invoke any manual intervention steps before continuing. The runbook should carry on if the diagnosed error is routine and doesn’t require manual intervention. The key in this step is to request a manual intervention when it’s needed. Requesting an unnecessary manual intervention is counterproductive, and failing to request a manual intervention when it’s necessary could be catastrophic.
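One way to express this decision is a small gate that asks for approval only when the diagnosed error isn’t routine. This is a sketch under assumptions: the error codes and the request_approval callback are made up for the example, and they stand in for whatever approval mechanism you use.

```python
from typing import Callable

# Hypothetical error codes that are safe to fix without a person in the loop.
ROUTINE_ERRORS = {"service_stopped", "disk_nearly_full"}


def may_proceed(error_code: str, request_approval: Callable[[str], bool]) -> bool:
    """Continue automatically for routine errors; otherwise ask a person."""
    if error_code in ROUTINE_ERRORS:
        return True
    return request_approval(error_code)


def always_refuse(error_code: str) -> bool:
    # Stand-in approver for the example: records nothing and says no.
    return False


print(may_proceed("service_stopped", always_refuse))
print(may_proceed("database_corrupt", always_refuse))
```

The routine error proceeds without the approver ever being asked, while the unexpected error waits on a person’s decision.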
The rectify step is where the runbook executes the step to fix the problem. It could be running the restart server code or applying the patch. At Octopus, we let users define a series of steps to complete the runbook process. There are templates for these steps, with integrations into the main DevOps platforms. If the runbook fails, you can pinpoint which step caused the failure and which ones succeeded.
After your runbook executes, it should verify that the issue is fixed by querying the system for health indicators, such as the installed version or the status of a server. This step is like the green light for deployment builds: once it’s green, the runbook has succeeded. If the verification step fails, the runbook will need to be restarted or reconfigured.
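A verification step might look like the Python sketch below. Polling the health check a few times before giving up is our assumption here, since a restarted service often needs a moment to come back. The function name, the attempt count and the delay are all illustrative.

```python
import time
from typing import Callable


def verify_fixed(
    check_health: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a health check until it passes or the attempts run out."""
    for attempt in range(attempts):
        if check_health():
            return True
        # Wait between attempts, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Example: the system reports healthy on the third check.
responses = iter([False, False, True])
print(verify_fixed(lambda: next(responses), sleep=lambda seconds: None))
```

Passing the sleep function in as a parameter keeps the example quick to run and easy to test.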
If verification succeeds, your runbook should send alerts to the appropriate channels. Notifications could be Slack messages or emails to report the status of the runbook. ChatOps tools like Slack and Teams have excellent mobile interfaces. You can use them to post a status message to a dedicated channel for every runbook run. While you can send all the status messages to a channel, you should only ping people about critical errors. We all know what it’s like to be so overloaded with notifications that we learn to ignore them.
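The rule of posting every status but pinging people only for critical errors can be sketched like this. The channel naming scheme and the @on-call mention are assumptions for the example, and the code builds the message without calling any real chat API.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    channel: str
    text: str
    ping_on_call: bool


def build_notification(runbook: str, status: str, critical: bool = False) -> Notification:
    """Every run gets a channel message; only critical errors mention a person."""
    text = f"{runbook}: {status}"
    if critical:
        text = f"@on-call {text}"
    return Notification(channel=f"#runbook-{runbook}", text=text, ping_on_call=critical)


print(build_notification("restart-web", "succeeded"))
print(build_notification("restart-web", "verification failed", critical=True))
```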
Automation is the value proposition of a runbook. Without automation, we’d still be reading and executing instructions from a binder! In Octopus, you can configure the steps and conditions of the runbook and let it run in the background according to a schedule. When you’ve successfully created an automated runbook, you can export the runbook and use the automation in another environment, saving you lots of time.
Continuous testing is a DevOps practice that runs automated tests on every release before deploying it to production. You can apply a similar concept to runbooks to gain the same confidence. In Octopus Deploy, you can execute a runbook in a test environment to validate its behavior before using it in production.
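Putting the steps described above together, the overall flow can be sketched as a single Python function. This is a simplified model under assumptions, not how Octopus Deploy executes runbooks: the system is a plain dictionary, the fix is a stand-in for a restart or patch, and the approve callback is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    steps: list[str] = field(default_factory=list)
    fixed: bool = False


def run_runbook(
    system: dict,
    approve: Callable[[dict], bool] = lambda diagnostics: True,
) -> RunResult:
    result = RunResult()

    # Inspect: confirm the runbook has to run at all.
    result.steps.append("inspect")
    if system["status"] == "healthy":
        return result

    # Diagnose: collect information about the root cause.
    result.steps.append("diagnose")
    diagnostics = {"status": system["status"], "version": system["version"]}

    # Approve: stop if a required approval is refused.
    result.steps.append("approve")
    if not approve(diagnostics):
        return result

    # Rectify: execute the fix (a stand-in for a restart or patch).
    result.steps.append("rectify")
    system["status"] = "healthy"

    # Verify: confirm the system is fixed.
    result.steps.append("verify")
    result.fixed = system["status"] == "healthy"

    # Notify: report the outcome to the relevant channel.
    result.steps.append("notify")
    return result


print(run_runbook({"status": "degraded", "version": "1.0"}).steps)
```

Recording each step as it runs mirrors the point made earlier: if the runbook stops, you can see which step it reached.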
Runbooks are a built-in feature in Octopus, backed by positive user feedback. When done well, runbooks save a lot of time and remove manual work. If not planned properly, however, runbooks can be difficult to maintain, their purpose can become unclear and the runbooks won’t achieve their goals.
By knowing who your runbook is for, what it will achieve and how the data is collected, you’ll unlock the value of runbooks. We’ve supplied some best practices that you can follow to structure your runbooks for success.