How Much of Cloud Native Incidents Should Developers Manage?
How much responsibility do developers have for managing incidents in the cloud native space? Although this will vary by organization, it’s clear that some of this responsibility is shifting and becoming a concern for developers. Developers haven’t traditionally been the “on-call” party for incident response, but in cloud native, microservices architectures, this is beginning to change.
Because developers have a better understanding of the “ingredients” they have put into the code, they likely have a better understanding of all the dependencies downstream. While this knowledge won’t necessarily require developers to manage incidents, it does put them in a position to get more involved in troubleshooting when incidents do arise, as well as to take part in prevention exercises.
What Is Incident Response, and Why Is It Important?
Incident response is the set of procedures, actions and decisions that are made to resolve, mitigate and prevent unexpected behavior that can negatively affect application or system runtime performance. Incident response processes tend to evolve based on learnings from previous scenarios and are critical to resolving issues fast, so developers can continue to develop and ship software, sustain optimal performance and learn from that process each time.
Failure Is Inevitable: How to Achieve Quick Incident Response
The modularity of cloud native applications makes it easier, on one hand, to shut broken things down. On the other hand, the distributed and dynamic nature of microservices makes identifying the problem a lot more complex, as deployments may be running across several clouds in hundreds or even thousands of containers. In a setting where something will inevitably go wrong, and it is more challenging than ever to find out what, having a plan of action for quick incident response is critical. Two keys to quick response include:
- Seek situational awareness ASAP: Get fine-grained data about services (in Kubernetes, for example) from a service catalog, which can deliver core metadata and provide a single view of all services across clusters.
- Troubleshoot quickly and identify the root cause: Observability is critical to getting to the bottom of the problem, for example by gathering information from logs and distributed traces.
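To make the situational-awareness point concrete, here is a minimal sketch of how catalog metadata could answer the first question in an incident: which services are affected when one fails? The catalog below is a hypothetical in-memory stand-in, and the service names, owners, clusters and the `impacted_services` helper are all assumptions for illustration, not the API of any real service catalog.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEntry:
    """Core metadata a catalog might hold for one service (hypothetical shape)."""
    name: str
    owner: str            # team to contact during an incident
    cluster: str
    depends_on: tuple = ()


def build_catalog(entries):
    """Index catalog entries by service name."""
    return {entry.name: entry for entry in entries}


def impacted_services(catalog, failing):
    """Return every service that depends on `failing`, directly or transitively."""
    # Reverse the dependency edges: for each service, who depends on it?
    dependents = {name: set() for name in catalog}
    for entry in catalog.values():
        for dependency in entry.depends_on:
            dependents.setdefault(dependency, set()).add(entry.name)

    # Breadth-first walk outward from the failing service.
    seen = {failing}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for name in dependents.get(current, ()):
            if name not in seen:
                seen.add(name)
                queue.append(name)
    return sorted(seen - {failing})


# Illustrative data only; none of these services come from the article.
CATALOG = build_catalog([
    ServiceEntry("ledger-db", "data-platform", "prod-eu"),
    ServiceEntry("payments", "payments-team", "prod-eu", ("ledger-db",)),
    ServiceEntry("inventory", "supply-team", "prod-us"),
    ServiceEntry("checkout", "checkout-team", "prod-us", ("payments", "inventory")),
    ServiceEntry("storefront", "web-team", "prod-us", ("checkout",)),
])
```

With this data, `impacted_services(CATALOG, "ledger-db")` walks from the database up through `payments`, `checkout` and `storefront`, and the `owner` field on each entry tells responders which teams to bring in.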
The Developer Role in Incident Response
Regardless of whether developers do (or should) actively respond to incidents, this kind of insight can help them support, or provide information to, the operational teams who are responsible for incident response. As we highlighted in our previous article, Cloud Native Day 2 Operations, platform engineers and others responsible for running applications in production agree that Day 2 operations (or anything that happens after deployment), such as security, reliability and observability, are key parts of the early design process.
Developers should be baking aspects of running the software into the code they create. Developers drive, and are at the center of, this design, meaning that they are best positioned to provide guidance on incident management. Empowering developers with service ownership, observability strategies and tools, and progressive delivery techniques, all of which can be part of a comprehensive developer control platform, can help identify vulnerabilities in the code and mitigate deployment risks before issues ever have a chance to become incidents.
Developers’ Role in Managing Cloud Native Incidents: Pre-Emptive Understanding
Observability and visibility give insight into what is going wrong during an incident and enable the necessary digging, after the incident, into what happened. Developers are well-placed to help provide some of this insight.
Visibility is largely about monitoring and indicates that something is going wrong. As containerized applications introduced increasing complexity to software development, new tools and techniques have arisen, often specifically for developers, to provide the visibility needed to investigate and diagnose issues. An example of this is the “single source of truth” offered by a service catalog, which can offer a developer instant visibility into the full picture. Visibility is step one, and observability is step two.
Observability is the constant monitoring of both system and business KPIs with the goal of understanding why something is happening and getting to the root cause of the problem. While observability might not seem critical to the immediate amelioration of an incident, it can be a critical tool in avoiding future incidents and coding against them. It can be a tool for incident postmortems, which can be partly automated, as well as the foundation for “observability-driven development (ODD)”: “defining instrumentation to determine what is happening in relation to a requirement before any code is written. Just as you wouldn’t accept a pull request without tests, you should never accept a pull request unless you can answer the question, ‘how will I know when this isn’t working?’”
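As a rough illustration of the ODD question “how will I know when this isn’t working?”, the sketch below decides on the failure signal first and then writes the function around it. Everything here is an assumption for illustration: `reserve_stock`, the metric names and the in-process `METRICS` counter, which stands in for whatever metrics client a team actually uses.

```python
import logging
from collections import Counter

logger = logging.getLogger("orders")

# Stand-in for a real metrics client (assumption for this sketch).
METRICS = Counter()


def reserve_stock(inventory, sku, quantity):
    """Reserve stock for an order and emit a signal for both outcomes."""
    available = inventory.get(sku, 0)
    if quantity > available:
        # The signal that answers "how will I know when this isn't working?"
        METRICS["reserve_stock.rejected"] += 1
        logger.warning(
            "reserve_stock rejected sku=%s requested=%d available=%d",
            sku, quantity, available,
        )
        return False
    inventory[sku] = available - quantity
    METRICS["reserve_stock.ok"] += 1
    return True
```

A reviewer applying the ODD rule could then ask for a test that asserts `METRICS["reserve_stock.rejected"]` increases when a reservation fails, in the same way they would ask for a test of the return value.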
Insight gathered from shipping and running applications, and examining the outcomes and causes of incidents, will strengthen future development and enable better, data-driven development and product decisions throughout the development process. In this way, the developer’s role is not necessarily linear, or that of an active participant in resolving issues, but rather one of taking key insight away from incidents and feeding it back into the coding process.
Preparing for and Guarding against Cloud Native Incidents
In addition to the defensive coding developers can adopt with ODD, there are other ways to ensure that developers are prepared for and can protect against cloud native incidents.
Make a Developer Platform for Day 2 Operations
As our previous article explained, making Day 2 ops more accessible for developers involves empowering developers with a developer control plane or platform that supplies them with the defaults and automation they need when they create a service, which will both relieve cognitive toil and let the developer follow the service through its entire life cycle. Introducing a developer platform gives developers everything they need, from visibility to observability to performance metrics, to manage Day 2 operations and contribute to preventing or mitigating incidents.
Use Game Days
Game days provide a no-fault opportunity to simulate and recreate incidents after the pressure of responding to actual incidents is over. Keeping incident response skills sharp is the name of the game, and practice with scheduled exercises that mimic critical incidents makes up the bulk of game days. It is expensive and sometimes impossible to practice with real incidents, which is why game days are the next-best, hands-on facsimile.
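One simple way to mimic a critical incident in a game day exercise is to inject failures into a dependency call and watch how the calling code copes. The sketch below is only an illustration under stated assumptions: `with_fault_injection`, `call_with_retry`, `InjectedFault` and `fetch_price` are hypothetical names, and a real exercise would more likely use dedicated chaos tooling in a controlled environment.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a game day."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap `func` so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Call `func` up to `attempts` times, re-raising the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


def fetch_price():
    """Hypothetical dependency call used as the target of the exercise."""
    return 42


# A seeded generator keeps the exercise repeatable between runs.
flaky_fetch_price = with_fault_injection(fetch_price, 0.5, random.Random(7))
```

Setting `failure_rate` to `1.0` simulates a full outage of the dependency, which lets the team rehearse what happens once retries are exhausted; setting it to `0.0` restores normal behavior.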
Summary: Where Does Developer Responsibility for Managing Incidents Begin and End?
Developers may not need to run the whole show for coding, shipping and running their applications, but knowing what their software contains and becoming well-acquainted with the tools that enable rapid situational awareness, mitigation and potential incident prevention help form the possible boundaries of the developer experience and responsibility with regard to managing incidents.
Most importantly for developers, it is key to keep in mind that:
- Incidents are learning opportunities: Development teams can use incidents to examine different types of real-world issues carefully. Both the immediate firefighting and the analysis of root causes in the aftermath are valuable in preventing future incidents, in particular by giving developers insight into how to make their code more resilient and “incident-free” from the outset.
- Incident management for developers takes a long-term view: Performance metrics and “signals” form the basis for tracking incidents over time, identifying and understanding patterns, and discovering root causes of problems. Together, they support a more comprehensive, long-term approach to incident management, including introducing and structuring blameless postmortems in the organizational workflow.
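Tracking incidents over time can start small. The sketch below assumes a hypothetical `Incident` record with a root cause and detection and resolution timestamps, then computes the mean time to resolve and the root causes that recur. The record shape, the function names and the sample data are all illustrative assumptions rather than a prescribed schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Minimal incident record (hypothetical shape)."""
    service: str
    root_cause: str
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_resolve(incidents):
    """Average time between detection and resolution."""
    if not incidents:
        return timedelta(0)
    total = sum(
        (incident.resolved_at - incident.detected_at for incident in incidents),
        timedelta(0),
    )
    return total / len(incidents)


def recurring_causes(incidents, minimum=2):
    """Root causes seen at least `minimum` times, most frequent first."""
    counts = Counter(incident.root_cause for incident in incidents)
    return [cause for cause, seen in counts.most_common() if seen >= minimum]


# Illustrative data only.
START = datetime(2024, 1, 1, 12, 0)
INCIDENTS = [
    Incident("payments", "expired certificate", START, START + timedelta(minutes=30)),
    Incident("checkout", "bad config rollout", START, START + timedelta(minutes=60)),
    Incident("storefront", "expired certificate", START, START + timedelta(minutes=90)),
]
```

For this sample, `mean_time_to_resolve(INCIDENTS)` is one hour and `recurring_causes(INCIDENTS)` returns only the expired certificate, which is the kind of pattern a blameless postmortem can then examine.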