Usenix: Continuous Integration Is Just SRE Alerting ‘Shifted Left’
Should Site Reliability Engineering alerts be "shifted left" into the Continuous Integration stage of software deployment, that is, before the software is even deployed?
A recent Usenix opinion piece, “CI is Alerting,” written by Titus Winters, Principal Software Engineer at Google, explains how this potential practice could be useful.
As Winters points out, CI systems automate the build-and-test routine: build the code and run the tests as often as is reasonable. Adding site reliability engineering alerts to CI should be possible, given that CI alerts can be treated the same way, and judged by the same criteria, as production alerts. That means CI shouldn't chase a 100% passing rate; instead, an error budget should be adopted. Brittleness is a leading cause of both non-actionable alerts and flaky tests, and it can be addressed by adding more high-level, expressive infrastructure.
Although CI and alerting are guided by different groups, the article argues that they serve the same purpose and at times even use the same data. Running CI on large-scale integration tests is the equivalent of a canary deployment, and with high-fidelity test data, the large-scale integration test failures reported in staging are essentially the same failures seen in production alerts.
The point of these parallels is that what works for CI can work for alerting, and what doesn't work for one may not work for the other. That framing is what exposes brittleness as a shared problem.
“Given the higher stakes involved, it’s perhaps unsurprising that SRE has put a lot of thought into best practices surrounding monitoring and alerting, while CI has traditionally been viewed as a bit more of a luxury feature,” Winters writes. “For the next few years, the task will be to see where existing SRE practice can be reconceptualized in a CI context to help explain testing and CI.”
How Alerting Is Like CI
Here’s a production alert:
Engineer 1: “We got a 2% bump in retries in the past hour, which put us over the alerting threshold for retries per day.”
Engineer 2: “Is the system suffering as a result? Are users noticing increased latency or increased failed requests?”
Engineer 1: “No.”
Engineer 2: “Then … ignore the alert I guess. Or update the failure threshold.”
The alerting threshold is brittle, but it didn't come out of thin air. Even if there is no fundamental truth to the specific threshold, it's correlated with what actually matters: degradation in service.
Here’s a unit test failure:
Engineer 1: “We got a test failure from our CI system. The image renderer test is failing after someone upgraded the JPEG compressor library.”
Engineer 2: “How is the test failing?”
Engineer 1: “Looks like we get a different sequence of bytes out of the compressor than we did previously.”
Engineer 2: “Do they render the same?”
Engineer 1: “Basically.”
Engineer 2: “Then … ignore the alert I guess. Or update the test.”
As with the alert, the test failed on criteria that didn't fully apply. The specific sequence of bytes doesn't matter, as long as decoding it as a JPEG produces a valid bitmap that renders visually the same.
This happens when there isn't enough high-level, expressive infrastructure to easily assert the condition that actually matters, so the next best thing is to test or monitor for the easy-to-express-but-brittle condition.
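The byte-comparison trap is easy to reproduce outside of image rendering. As a sketch (using zlib as a stand-in for the JPEG library in the story, not the article's actual example), two compressor settings yield different byte streams, so a golden-bytes test breaks spuriously, while an assertion on the decoded output — the condition that actually matters — still passes:

```python
import zlib

payload = b"pixel data " * 1000  # stand-in for a rendered bitmap

# "Before the upgrade": golden bytes recorded at compression level 9.
golden_bytes = zlib.compress(payload, 9)

# "After the upgrade": same input, different compressor settings.
new_bytes = zlib.compress(payload, 1)

# Brittle assertion: compares the exact byte stream. Fails spuriously.
brittle_test_passes = (new_bytes == golden_bytes)

# Robust assertion: compares what decoding actually produces.
robust_test_passes = (zlib.decompress(new_bytes) == payload)

print(brittle_test_passes, robust_test_passes)  # False True
```

The robust assertion survives the library upgrade precisely because it expresses the condition the team actually cares about.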
The Trouble with Brittleness
When an end-to-end probe isn't available but collecting aggregate statistics is, teams are likely to write threshold alerts on arbitrary statistics. In lieu of a high-level way to say, "Fail the test if the decoded image isn't roughly the same as this known-good decoded image," teams will test byte streams. Such brittleness reduces the value of testing and alerting by triggering false positives, but it also serves as a clear signal of where it may be valuable to invest in higher-level design.
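The same pattern shows up in the retry-threshold dialog above. As a sketch (the metric names and thresholds here are invented for illustration), compare an alert on an arbitrary internal statistic with one on user-facing symptoms:

```python
# Hypothetical metrics snapshot; names and numbers are illustrative only.
metrics = {
    "retries_per_day": 10_500,      # up 2%, as in the dialog above
    "failed_request_ratio": 0.0004,
    "p99_latency_ms": 180,
}

def brittle_alert(m):
    # Fires on an internal statistic that merely correlates with harm.
    return m["retries_per_day"] > 10_000

def symptom_alert(m):
    # Fires only when users would actually notice degradation.
    return m["failed_request_ratio"] > 0.001 or m["p99_latency_ms"] > 500

print(brittle_alert(metrics), symptom_alert(metrics))  # True False
```

Here the brittle alert pages an engineer even though no user-visible condition has changed, which is exactly the conversation the two engineers have above.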
Just because brittle isn’t best doesn’t mean brittle is bad. These tests and alerts still point to something that might be actionable. Data surrounding the alert will lead to more clues about the importance of the alert. This is why Winters explains that flaky tests are more negatively impactful than non-actionable alerts. There’s usually fewer data in the testing environment to show whether or not the test failed because of a software-related or test-related issue.
What Is the Pathway Forward?
Treat every alert with the priority it deserves rather than always being on high alert. Winters recommends adding the flexibility of an error budget to CI, instead of reserving error budgets for alerting while holding CI to absolutes. He views the absolutist stance as a narrow perspective and suggests refining objectives and budgeting for failure, because a 100% passing rate on CI is just like 100% uptime: awfully expensive.
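One way to picture an error budget applied to CI (the 99% target and window size here are invented for illustration, not figures from the article):

```python
# Hypothetical: track CI pass rate over a rolling window and compare it
# to a target, escalating only once the budget is exhausted.
TARGET_PASS_RATE = 0.99  # illustrative objective, not a standard
WINDOW = 200             # consider the last 200 CI runs

def budget_remaining(results):
    """results: list of booleans, True = passing CI run."""
    recent = results[-WINDOW:]
    allowed_failures = len(recent) * (1 - TARGET_PASS_RATE)
    actual_failures = recent.count(False)
    return allowed_failures - actual_failures

# 200 runs with 1 failure: within budget (2 failures allowed), keep moving.
print(budget_remaining([True] * 199 + [False]) > 0)   # True
# 200 runs with 5 failures: budget spent, time to investigate brittleness.
print(budget_remaining([True] * 195 + [False] * 5) > 0)  # False
```

The point of the budget is the same as in production SRE: occasional failures are expected and tolerated, and attention is spent only when the failure rate exceeds what the objective allows.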
Some other lessons learned:
Treating every alert as an equal cause for alarm generally isn't the right approach. This is one of those "alarm snooze" situations: the alarm matters, but if it isn't highly impactful, it's OK to keep moving. That doesn't mean the alarm should be thrown out the window, because tomorrow is a new day.
Reconsider policies where no commits can land unless all CI results are green. Don't throw out the alarm: if CI is reporting an issue, investigate. But if the root cause is well understood and won't affect production, blocking commits may not be the best path forward and could be counterproductive in the long run.
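A commit gate following this advice might distinguish unexplained failures from failures with a well-understood, production-safe root cause. This is a sketch of the idea only; the allowlist mechanism and test names are invented for illustration:

```python
# Hypothetical gate: block merges only on failures that haven't already
# been triaged as known, production-safe issues.
KNOWN_SAFE_FAILURES = {"image_renderer_golden_bytes"}  # triaged and tracked

def can_merge(failing_tests):
    unexplained = set(failing_tests) - KNOWN_SAFE_FAILURES
    return len(unexplained) == 0

print(can_merge([]))                               # True: all green
print(can_merge(["image_renderer_golden_bytes"]))  # True: known-safe red
print(can_merge(["auth_login_flow"]))              # False: investigate
```

The allowlist keeps the alarm alive — the known failure stays visible and tracked — without letting it freeze all development.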
This is a novel idea, and Winters says he's "still figuring out how to fully draw parallels." For the next few years, "the task will be to see where existing SRE practice can be reconceptualized in a CI context to help explain testing and CI." He looks to best practices in testing to clarify goals and policies on monitoring and alerting.