LaunchDarkly sponsored this post.
At LaunchDarkly, we host a monthly Test in Production Meetup in the Bay Area where we ask the DevOps community to show how they test in production and share learnings and best practices around doing it safely. Testing in production isn’t something everyone is comfortable doing. There are a lot of risks involved in exposing new functionality to users, especially features you’re still validating. We really enjoy hearing from other organizations about how they test in production safely, what they’ve learned, best practices to live by, and new ways to automate processes.
We recently discussed configurations and feature flags with a panel of speakers. We questioned configurations in general: how much do you want to rely on a single source of state versus configs as code that you can test, roll back, and fix?
“I think at the end of the day, all of these things are about, maybe not configuration because configuration is just the state of the world, right? It is the truth that is out there. But both canaries and feature flags are just ways that you can gain confidence when there really isn’t any confidence,” said LaunchDarkly’s Principal Technical Account Manager and Chief of Staff, Tim Wong.
Before our discussion, however, the three panelists gave their own talks. Brandon Leach, senior engineering manager at Lookout, spoke about increasing deployment safety with metric-based canaries and how his organization used canaries to move towards continuous deployment. LaunchDarkly’s Tim Wong gave a talk about ways you can manage operational risk with feature flagging. And TR Jordan, head of product marketing at Slack and formerly of Turbine Labs, gave a talk about how his team tries to treat configs as code, phases rollouts to prevent downtime, and moves what it can into machine-driven state so it can focus on the important config changes.
Watch the video below for the panel discussion and learn more about how these organizations think about decreasing the risks involved with releasing continuously.
Here’s a transcript of our conversation:
Andrea Echstenkamper: The first thing I wanted to ask about is kind of the obvious one: we’re talking about three different things that have a lot in common, and the borders between them confuse a lot of people. Does someone want to take a shot at explaining how they think about the differences between configs, feature flags, and server canaries, or just canaries? It’s a difficult one, maybe.
TR Jordan: I’ll take a shot at it. I think what it boils down to, with all three of those, is: what are the tools in your basket that allow you to build confidence in whatever you’re trying to do? There are different layers of these configurations that you build on top of. Like, if you have a new feature you want to roll out, how should you protect that? Should you put a feature flag around it? It’s probably the wrong answer. So, I think that when you think about something like canaries, canaries are really one of the more fundamental things that you need in a system, because they’re a way to protect and build confidence in your code. At the end of the day, if you don’t have confidence in your code, then things aren’t going to work.
Feature flags, and configs on top of that, sort of work their way up the stack. Configuration in many cases is driving behavior at a level that’s less fundamental and feature flags are really at the top of that. So I think there’s a little bit of interplay between all of them, but it’s important to think about how they fit together in order to fit your toolset and fit your specific goals.
Tim Wong: I think at the end of the day, all of these things are about, maybe not configuration because configuration is just the state of the world, right? It is the truth that is out there. But both canaries and feature flags are just ways that you can gain confidence when there really isn’t any confidence. You’re just like, ah, I know before it was okay or good and I want to leverage my knowledge that the current state is good and then that future state is, I don’t know. Feature flags and canaries are both tactics that you can utilize to gain confidence by dipping your toe in. Oh, it’s too hot, don’t jump in versus like, it didn’t seem to work right quickly turn it off or let me roll it up one by one to one percent of the time. Those are both just ways to gain confidence.
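The “roll it up to one percent” tactic Wong mentions is usually implemented as deterministic bucketing: hash each user into a stable bucket and enable the flag only for buckets below the rollout percentage. Here is a minimal sketch in Python; the function names and the hand-rolled hashing are illustrative assumptions, and a real system would use a feature-flag SDK rather than this:

```python
import hashlib

def in_rollout(user_id: str, flag_name: str, percent: float) -> bool:
    """Deterministically bucket a user for a percentage rollout.

    Hashing user_id together with flag_name gives each flag its own
    independent bucketing, so the same users aren't always exposed first.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in 0..9999
    return bucket < percent * 100         # percent expressed as 0..100

# Ramp from 1% of users upward as confidence grows; the same user
# always gets the same answer for the same flag and percentage.
enabled = in_rollout("user-42", "new-checkout", percent=1.0)
```

Because the bucketing is stable, ramping from 1 to 10 to 100 percent only ever adds users to the enabled set; no one flips back and forth between states mid-rollout.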
Brandon Leach: The configuration piece is something we also do quite a bit. Testing whether or not a configuration works in an environment is pretty straightforward, right? You can deploy that code without giving it production-like traffic and test whether it works the same way it did in any development environment. Can it connect to the database? That’s an easy config change to test in production, just like you did in staging, without production load. I think it gets more interesting when you start thinking about where you’d do a production deployment canary versus some type of feature flag canary. We talk about this a lot: in what scenario would you do one or the other? I don’t think you do one or the other. There’s definitely room for both.
There are a lot of releases when you move to continuous delivery, right? And then continuous deployment. You might not be releasing a huge feature; it’s just something that’s been merged to master and is being deployed to production, and it might have some unforeseen consequences. Your deployment canary is definitely going to catch that. But if I have a huge feature that I want to be able to turn on and off, maybe for certain customers, and test in production, maybe you don’t want to go through the three-hour process that a production canary requires. I think there’s definitely room for all of them, right?
Echstenkamper: Definitely not a “choose one” scenario.
Leach: I think there’s room for all of them.
Echstenkamper: What are your thoughts on this concept of testing in production? When you hear that idea, what do you think about how it’s evolved?
Wong: Deep. I think one of the things that’s hard to articulate with this whole concept of DevOps, and thinking about the entire system, is that the act of putting a change out there is part of the planning. It’s not just: do I make change X? What’s the diff? Can I deploy it? It’s deployed. It’s also: do we canary this? What’s the best way of doing it? What day do we do it on? The process, who’s involved. If we’re going to do a war room, who am I calling? What is the process by which we select those people? It’s all of that. The idea of testing in production is the confidence that we’ve thought about those things, and the impact of this change is okay; we just follow the defined procedure we’ve set up. We did it already. Okay, fine, just hit the button, call the people, hit PagerDuty, and we’re good to go. We’re confident enough that it’s not a big deal.
Leach: For us, in the journey I talked about here, we got to continuous deployment. And then there was always still that manual button at the end, right? All of our tests had run: unit tests, functional tests, these massive integration tests that we spent all this time building. But at the end of the day, there was still that button asking, are we okay with this? I think that’s where testing in production comes in. Testing in production gave us the ability to say, you know what, we have these automated canaries; we’re okay just releasing it to production.
I got the feeling that, whatever the investment we made in all of those tests, we would still have had that manual button if we weren’t doing some type of test-in-production rollout of the deployment. So yeah.
Jordan: I have a love-hate relationship with the phrase test in production because I think-
Echstenkamper: I like the controversy.
Jordan: Because I think … I’ll start with the fun part. I think the phrase “test in production” puts a lot of people off doing it at all, and that’s a bad thing, because it’s easy to think that if you are testing in production you did not test elsewhere. And the phrase asks a very aggressive question: if you have a piece of software, and you have built an artifact, and you are considering pushing it into production, are you sure? Is it really good? Is it there? Can I press that manual button that we have created? For better or for worse, that sets up a culture of everyone implicitly asking that question at every deploy, which brings a lot of drama to the moment.
Leach: The funny thing is, that manual button, what I learned is that it was actually somebody looking at a bunch of graphs. It was somebody looking at a bunch of time series data and trying to figure out if there was any deviation in it based on the release. With an automated, metric-based canary, you can watch hundreds of metrics at once without somebody actually having to sit there and do that. So yeah, I agree with you that it puts off some people, because it makes them think we didn’t do enough testing before. But my position would be, I don’t think you can do enough testing before. I don’t think it’s actually possible, unless you spend tons and tons of money trying to emulate production, and I just don’t think that’s worth it.
Jordan: I absolutely agree. One of my coworkers used to work at a place where they spent 10 times as much on staging as they did on production. They were still riddled with bugs. Essentially, the reason he works with me and didn’t stay there is that they didn’t believe it was possible to test in production safely, so they tested poorly before production. They spent more money getting worse results, which was so deeply frustrating it drove away a bunch of engineers.
That’s why, for about six months, I replaced that phrase in my vocabulary with “verification in production,” which is a mouthful and bad, and it was a mistake. I think it’s really important to just admit to yourself that everyone is testing in production. If you can realize, “I’m testing in production poorly, aggressively, dangerously, with no regard for my users’ happiness,” and that that’s a bad state to be in, then you can have the conversation around: okay, can we automate some of these checks? Can we make something where we catch 90 or 95 or 98 percent of the problems and quickly turn them off with a feature flag? The more automated you can get that, the more confident you can get around the fact that you will never be fully confident, and that’s crucial to the whole idea.
Leach: From our standpoint, it was huge. You get to continuous delivery: an engineer merges his code into master and knows it’s going out. And then it was a huge hit on engineering efficiency that he would have to babysit that all the way through staging, all the way into production, and finally, when it got into production, validate that it was okay. For us, getting to continuous deployment was the actual motivator, not engineering efficiency.
Echstenkamper: It’s true. You’re going to be testing in production no matter what you do beforehand because there’s still nothing like production. You’re kind of starting to touch on continuous delivery and continuous deployment. Something we’re talking about recently at LaunchDarkly is this idea of progressive delivery, which is really after continuous delivery and deployment, doing something where you’re slowly rolling it out to one segmented group at a time. So, that’s the next stage. We might do a test in production on that.
I wondered, do any of you have questions for each other before we open it up to the audience? Did anything strike a nerve in each other’s thoughts?
Leach: I guess I have a question for you then. This is a question I’ve always kind of had about feature flags and stuff like that. How have you seen people manage those if statements in a way that’s sane in their code, so that you don’t end up with so many? It kind of ends up as a mess. Do you have frameworks for cleaning them out and maintaining code with all these if statements?
Wong: I think there are a lot of structural things you can do from a cultural standpoint. However, it has to be a top-line metric. You have to be aware: hey, we’re purposely implementing tech debt. If you don’t stay on top of it, it just proliferates, and you end up in a situation where you’re asking, is it even safe for me to retire this flag? And you do it and, oh, well, everything exploded; that’s bad.
Even at LaunchDarkly, sometimes it’s a, hey, let’s gather around: this flag, we should get rid of it. It’s been in there since, like, 2015. What happens if we turn it off? And then when we do it, it’s like, oh, that was not expected. But it definitely is something that ends up being talked about during engineering iterations. Okay, we shipped feature X: which ones are operational flags that we’re going to keep so that we can degrade gracefully, and which ones are just derpy debugging flags that should be retired, or should be operationalized into something other than what they are? Not something that was just in there for the purposes of shipping.
It is a discussion that has to be had at the spec level. When we spec a feature, it’s like what does it do and what business purpose does it, and how are you planning to implement that? Is it one flag, two flags, six flags, 20 flags? What systems does it touch? How are you going to propagate that behavior across multiple environments if it’s different flags and such? Like that’s a discussion that happens at ideation and spec. Features are not considered done until like those things are taken care of.
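One common way to keep those if statements retirable, in the spirit Wong describes, is to route every check through a central registry that records an owner, a creation date, and a purpose for each flag, so stale release flags can be listed mechanically instead of hunted through the codebase. A hypothetical sketch follows; the names and the 90-day threshold are illustrative assumptions, not LaunchDarkly’s implementation:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Flag:
    name: str
    owner: str
    created: date
    purpose: str          # e.g. "release", "ops", "experiment"
    enabled: bool = False

# A central registry makes stale release flags easy to find and retire,
# instead of scattering bare if-statements through the codebase.
REGISTRY: dict[str, Flag] = {}

def register(flag: Flag) -> None:
    REGISTRY[flag.name] = flag

def is_enabled(name: str) -> bool:
    # Unknown flags default to off, so retired checks fail safe.
    flag = REGISTRY.get(name)
    return flag.enabled if flag else False

def stale_release_flags(today: date, max_age_days: int = 90) -> list[str]:
    """Release flags older than max_age_days are candidates for removal;
    ops flags are designed to live forever and are never listed."""
    return [
        f.name for f in REGISTRY.values()
        if f.purpose == "release" and (today - f.created).days > max_age_days
    ]

register(Flag("new-checkout", "team-payments", date(2015, 6, 1), "release"))
register(Flag("degrade-search", "team-search", date(2016, 1, 1), "ops"))
print(stale_release_flags(date(2016, 6, 1)))  # only the release flag is stale
```

A list like this is what a “flag removal party” could work from: everything it returns is either deleted or reclassified as an operational flag.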
Leach: There’s a point at which you just rip them out, right? Once it’s just become part of the product, do you remove them? Do you know what I mean? Once the feature’s turned on and you know it’s going to stay on, do you go back and just take out the flag?
Wong: So we have a couple of classifications for flags, and we also have these flag removal parties where we’ll say, okay guys, Thursday we’re going to have a cleanup-a-thon to get rid of all the flags that don’t deserve to be around anymore. We end up tagging and filtering to understand which flags map to which functionality. I’m also an incident manager on call a lot. We have things like: okay, this flag degrades this core function. If this flag should turn on, it automatically pages certain people, so the maintainer of the feature gets paged along with the incident manager on duty, and all those details are captured in the audit log.
So we certainly have flags that live forever. They’re designed to live forever. They have defined behavior: hitting this button means that this system is degraded and you should see this effect, and that’s documented. We have runbooks on when to hit those flags. They’re part of the incident manager’s arsenal.
Leach: Like contingency flags that you turn and-
Wong: We do.
Leach: Okay, that’s interesting.
Wong: We do.
Leach: Never thought about that.
Jordan: One of the things that we’ve seen is, at my previous company, we classified flags in the same way that you’re describing, and my experience has been that the more complex the functionality a feature flag hides, the higher its value. The flag for “we’re going to roll out our new front end” is worth its weight in gold. The feature flag that changes the button to blue is a little less valuable. So the more you can find opportunities to take those low-level flags, the ones that seem kind of transient and don’t make a huge difference to user experience, and find ways to shove that functionality elsewhere, then that’s awesome.
One of the things that we see a lot is the “I want to do this release” feature flag. If you can push this up into NGINX, HAProxy, or Envoy and say, I’m going to switch traffic, and then I’m going to tear down the old deployment and no one cares about it and I never had to change code, that’s really valuable. Otherwise, you end up with a feature flag per deploy, which nobody wants.
Wong: Nobody wants that. Something we are playing with is some heuristics, because that’s a well-defined process, the “I want to do this deploy” flag. In that moment it’s a super high-value flag. But a year from now, that flag is garbage. I can’t turn it “on” or “off,” because on means status quo and off means completely busted, because we retired all that old stuff already. Unless you clean it up, you end up with a lot of those.
Echstenkamper: Ready for audience questions then?
Jordan: Sure. I’ll use feature flags as an example. I’ve seen some real janky systems that attach feature flags to users: in your auth DB, there is a column on every user that says what features this person should have enabled. In that case, you’re picking up some of the benefits of a platform like LaunchDarkly. But a big part of the difference is that application state isn’t changed by operators of the system; it’s changed by users of the system. And it’s fine if users change things and behavior changes, but that’s in the realm of unknown unknowns.
I think there are trade-offs here. If you think about the idea of like, configuration is something that you desire to be true. Moving stuff into application state admits that you may not know what should happen when you set every bit of configuration. It basically allows you to take parts of your system and say, I’m not going to trust that deployment is just something that happens. I’m going to have a dedicated set of services like Spinnaker that does my deployment for me. And Spinnaker has state and Spinnaker’s an application, and I may have to go debug Spinnaker occasionally and that’s fine, and I’m going to configure it to say, to have a certain process. The further you can push configuration towards human-readable and simple, the more you can essentially prove that your configuration is correct.
My view, which is a little more extreme than the talk itself, is that almost all configuration is bad because it’s untestable. You can’t do anything other than deploy it to production, hope it works, and throw some magic dust on top like incremental deployments. But the more you can minimize that state and turn it into code and logic that you can test, or state that you can fix and roll back, the better. Those are problems that we have great tools for solving and fixing. Config is just not a wonderfully solved problem.
Wong: It’s kind of weird to say, oh yeah, our database migration took like a month. That suggests, oh man, it took forever. What ended up happening was, since you’re writing to both databases at the same time, you’re staying on the old one. In your example, you’re doing Oracle to Postgres? So you’re going to be writing and reading from Oracle, which is the same state you’re in today, and you’re going to write one percent up to 100 percent of the same information to Postgres. And perhaps, I don’t know about your company, but scope creep happens, and I’m sure part of your migration is also going to be, oh, and we’re changing the schema, or we’re changing something.
And actually, it saved our bacon, because we didn’t realize that nulls are treated differently in the two different implementations. We immediately had metrics come out saying the error rate on this box is now huge, and we’d just flipped the flag, so we knew, okay, the old path is probably the safe code that we had. And then once we gained confidence that we were reading 100 percent from both, the application code was comparing every object as it came out of the database, and the boxes were not falling over, that’s when we started pulling back in production, after migrating all the data in.

But overall that process took like a month. So it was not a, all right, we’re going to do it as fast as we can in an hour of downtime. It wasn’t like that, no.
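The migration pattern Wong describes, write to both stores, keep reading from the old one, and compare results behind a flag, can be sketched roughly like this. The class and method names are hypothetical, and the in-memory dicts stand in for the two databases:

```python
import logging

class DualWriteStore:
    """Dual-write migration sketch: the old store stays the source of
    truth while read comparisons surface differences (e.g. null
    handling) as metrics and logs instead of user-facing bugs."""

    def __init__(self, old_db, new_db, compare_reads=False):
        self.old_db = old_db                # stands in for the legacy database
        self.new_db = new_db                # stands in for the new database
        self.compare_reads = compare_reads  # flipped via a feature flag
        self.mismatches = 0                 # drives the "error rate" metric

    def write(self, key, value):
        self.old_db[key] = value
        try:
            self.new_db[key] = value  # the new store must never break writes
        except Exception:
            logging.exception("dual-write to new store failed for %s", key)

    def read(self, key):
        value = self.old_db.get(key)  # old store remains authoritative
        if self.compare_reads and self.new_db.get(key) != value:
            self.mismatches += 1
            logging.error("read mismatch for key %s", key)
        return value
```

In a real rollout the comparison would be ramped via a percentage flag rather than a boolean, and cutting reads over to the new store would be its own flagged step; that gradual ramp is the month-long process described above.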