Avoid Architectural Change Disaster with App Performance Monitoring
Raygun sponsored this post.
If you’re contemplating a major piece of rework to your application or its architecture, application performance monitoring probably isn’t the first thing that comes to mind. In fact, it’s probably not the second, third, or fourth, either — but maybe it should be.
Today, let’s take a look at the perhaps surprising relationship between application performance monitoring (APM) and major structural code changes. These types of changes inevitably come with a great deal of risk. Strategic use of APM helps you mitigate that risk.
Major Application Changes Create a Lot of Risk
Anything in the software world comes with risk. It’s a constantly and rapidly evolving landscape, so even standing still — keeping up with OS and server updates, and the like — is risky, like hanging out at the edge of a volcano. Once you start building things and shipping features, that risk grows.
But there’s nothing quite as risky as making substantial infrastructure changes. Consider the example of a push to move from a monolith to a series of microservices. If you were just shipping a feature, it might mean additional users and additional revenue. Moving to microservices isn’t something users will buy, appreciate or probably even care about — unless it goes badly, that is.
With a change like this — a major architectural change — you’re gambling that your current efforts will make future features go more smoothly. And you’re simultaneously gambling that you won’t blow anything up in the short term.
That’s a lot of gambling. So you should seize every advantage you possibly can to avoid those inherent risks.
Typically, you’ll hear sound advice on how to seize those advantages. Make sure you have a good automated test suite, for example. Migrate your application in thin slices. That’s all great advice, but it’s, well, typical.
Let’s look at some non-typical advice and see how you can use APM to help manage a major architectural overhaul.
1. Avoid Unnecessary or Non-Justifiable Change
The first source of risk that I’ll cover is one that people overlook way too frequently: I’m talking about the risk of undertaking unnecessary or unhelpful changes. This wastes time and money and, in the case of an architectural change, puts you at risk of inconveniencing users with no upside.
Typically, diligence here involves relying on members of the dev team for expert opinions. They’ll insist that the current architecture is loaded with technical debt and will hamstring future feature work. They might be right, but they might also be experiencing a wee bit of cognitive bias. Working with a newer, cleaner architecture is more pleasant, and, as engineers, let’s face it: we’re all subject to the new-shiny-object syndrome.
Wouldn’t it be nice if we had some actual data related to user experience?
Well, through monitoring applications in the wild, you can get exactly this kind of data. Real User Monitoring (RUM) can tell you about performance bottlenecks and about user experience as a whole. For example, are certain requests taking way longer than a reasonable user would expect to wait? Are lots of users dropping off or ending their sessions at certain points in their journey?
Answering questions like these can give you a lot of insight into what pains your users have, what’s causing those pains and whether a significant architectural shift would address those root causes. Monitoring your users thus gives you the ability to reduce the risk of making change just for change’s sake.
2. Get Ahead of Regression Defects
Earlier, I mentioned an automated test suite. Historically, this represents the most commonly prescribed inoculation against the scourge of regression defects. But, once again, that’s not your only option.
Regression defects, simply put, occur when you release a software change and it breaks something that used to work. Even if you don't know this term, you can probably relate. Your fitness tracker released some ancillary feature you don't care about, and now the thing that tells you how many steps you've taken doesn't work anymore. This is an absolute user experience killer.
An automated test suite is great for this, but it won’t capture everything. This is doubly true when, rather than minor refactorings, you’re doing major overhauls with lots of moving parts. You need a strategy that works in production in addition to test environments.
APM shines here as well. With crash reporting, you can find out about errors faster than you would otherwise and have fixes for them faster as well. Generally, by the time users actually start reporting a defect, way more users have experienced it, shrugged in frustration and given up. By keeping close tabs on crashes and errors, you can potentially become aware of regression defects in production even before your users do.
This added intelligence helps remove the risk of users experiencing bugs as you do your restructuring.
3. Prevent Performance Degradation
Regressions are behavioral issues, and regression testing can help you catch them. But a truly sophisticated and comprehensive testing approach features all sorts of other flavors of testing as well. This includes various forms of performance testing, such as load testing, stress testing, etc.
So if you have a comprehensive testing strategy, and if that strategy includes well-constructed performance tests, you might have a fighting chance of catching all such problems before production. But that's a lot of "ifs." Take the classic example of Netflix if you want to see how fragile this reasoning is. They have such a uniquely resource-intensive production environment that there is simply no simulating it.
Whether in Netflix's environment or yours, performance problems present a huge risk. This is always true, but it's never truer than when you're pushing out a large architectural change. You think that fleet of microservices will route messages as quickly as your monolith did. And your hypothesis has held up through all of your lower environments. But do you know that the same will hold true in production, when you're fully converted to the new code?
APM again shines here. You can put both performance and errors under a microscope and engage a remediation plan immediately if things start to go off the rails.
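One simple way to operationalize this is to record a pre-migration latency baseline and flag any operation whose production numbers drift past it. The baselines, operation names, and tolerance below are illustrative assumptions, not measurements from any real system.

```python
# Sketch: flagging performance regressions against a pre-migration baseline
# so you can trigger remediation (or a rollback) early. All numbers are
# illustrative assumptions.

# p95 latencies measured on the old monolith, in milliseconds (hypothetical).
BASELINE_P95_MS = {"route_message": 120, "render_feed": 450}

def degraded_operations(current_p95_ms, tolerance=1.25):
    """Return operations whose current p95 exceeds baseline * tolerance."""
    return sorted(
        op for op, current in current_p95_ms.items()
        if current > BASELINE_P95_MS.get(op, float("inf")) * tolerance
    )
```

Run a check like this against live metrics after each slice of the migration ships, and "is the new architecture actually as fast?" stops being a hypothesis and becomes a dashboard.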
Protect the Bottom Line
Let’s close on a philosophical and business-oriented note. Don’t get me wrong. As a lifelong software developer, I know firsthand the joy of deploying a significantly cleaned codebase and the satisfaction of shipping a really slick, well-implemented feature.
But it’s all for naught if the software and the business start losing money or shedding users.
If too much of that happens, you’ll know only the joy of updating your resume and the satisfaction of submitting open source pull requests for a while. So it’s important to mitigate the biggest risk of all: issues that hurt the bottom line.
APM helps you do this in a variety of ways. Using the aforementioned Real User Monitoring, you can keep an eye on what users are doing, or what they’re not doing. If you notice dips in traffic to things like sign-up and checkout pages, for example, you can act accordingly.
You can, likewise, set special high-priority alerts to keep you informed of unusual errors or problems in business-critical places. Perhaps you can live with slight performance dips or the occasional regression defect elsewhere in the application. So set up your monitoring accordingly, and stay on particularly high alert for issues that threaten the bottom line. If the checkout microservice's messages are bouncing back from the payment recorder, you're going to want to know about it and roll back to the old monolith posthaste.
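That tiered tolerance can be expressed directly in your alerting rules: strict thresholds on revenue-critical paths, looser ones everywhere else. The path names and threshold values in this sketch are illustrative, not defaults from any real product.

```python
# Sketch: tiered alerting that treats business-critical paths more strictly.
# Path names and thresholds are illustrative assumptions.

CRITICAL_PATHS = {"/checkout", "/signup"}

def alert_level(endpoint, error_rate, p95_ms):
    """Decide how loudly to alert based on where the problem is."""
    if endpoint in CRITICAL_PATHS:
        # Revenue-critical: page immediately on even modest degradation.
        if error_rate > 0.01 or p95_ms > 2000:
            return "page-on-call"
    else:
        # Elsewhere we can tolerate small dips; file a ticket on real trouble.
        if error_rate > 0.05 or p95_ms > 8000:
            return "ticket"
    return "ok"
```

A 2% error rate on a help page becomes a ticket for tomorrow; the same rate on checkout wakes someone up.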
Knowledge Is the Antidote to Risk
No matter what you do, you’ll never eliminate everything risky about changes that you undertake. If you find yourself in a position where you’re overhauling your architecture, it means that you are, by definition, not in a good place. Scrambling to get yourself into a better position will force you to take risks.
Accept that. Let it wash over you and make you shudder a bit, and make peace with it. And then realize that, while you can’t eliminate the risk, you can certainly minimize it.
Minimizing that risk and making things as safe as possible requires you to be resourceful. By all means, have a good plan, build yourself a nice test suite, create contingencies and do all of the things you would otherwise have done. But make sure that you also take advantage of APM to help you make things safer. If you never before thought of this as a resource at your disposal, understand that it is, and a powerful one at that.
Feature image via Pixabay.