Controlling the Machines: Feature Flagging Meets AI
Have you ever stopped to consider how many movie plotlines could have been resolved with a feature flag? You probably haven't, but since I spend most of my time working on the different scenarios in which teams use feature flags to drive feature releases, it crosses my mind a lot. There are six Terminator movies, and if Cyberdyne had just feature-flagged Skynet, they could've killswitched the whole problem away! We could make the same analogy about The Matrix or any of a dozen other movies.
Cinema references aside, these controlled-release scenarios have real parallels in the technology space. Artificial intelligence is ushering in a period of rapid innovation in software. What started with OpenAI and GPT-3 has quickly accelerated to what seems like a new model being released every week.
We’ve watched GPT-3 move to GPT-3.5 and then to GPT-4. We’re seeing GPT-4’s 32K model emerge for consuming and interacting with larger bodies of content. We’ve watched the emergence of Llama from Meta, Claude from Anthropic and Bard from Google. And that’s just the text-based LLMs; new generative models are springing up for image creation, enhancement, document review and many other functions.
Furthermore, within each of these AI model domains, additional versions are being released as new capabilities are unlocked and trained in new ways. I can’t help but see the parallel to software development in the realm of AI models as well. These LLMs have their own software lifecycle as they are enhanced and shipped to users.
Each vendor has its own beta programs that enable segments of users for new models. Product management and engineering teams are evaluating the efficacy of these models against their predecessors and determining whether they are ready for production. New models are released in the same way you’d release a new piece of software, and along with that, there have been rollbacks of models that were already released.
LLMs as a Feature
Looking at the concept through that lens, it becomes easy to see the connection between AI models and the practice of feature flagging and feature management. We at LaunchDarkly talk a lot about controlling the experience of users, enabling things like beta programs or even robust context-based targeting with regard to features that are being released. The same concepts translate directly to the way users consume any AI model.
What if you wanted to enable basic GPT-3.5 access for the majority of your users, but your power users were entitled to leverage GPT-4, and your most advanced users were able to access the GPT-4-32K model that supports significantly longer character limits at a higher cost? Concepts like this are table stakes for feature flagging. Even Sam Altman at OpenAI talks about the availability of a killswitch concept that lives within GPT-4. Essentially, we’ve come full circle to the Terminator reference: he is advocating for a means to disable the model if things ever get too scary.
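To make that concrete, here is a rough sketch of what tiered model access could look like, written against the general shape of the LaunchDarkly SDK's `variation` call and the OpenAI chat completions API. The flag key `model-variant`, the token budgets, and the client wiring are illustrative assumptions, not the exact setup of any real application:

```python
# Sketch: serving different OpenAI models to different user tiers via a
# LaunchDarkly feature flag. Flag key and token budgets are hypothetical.

# Rough response-token budgets per model (illustrative numbers only).
TOKEN_BUDGETS = {
    "gpt-3.5-turbo": 1024,
    "gpt-4": 2048,
    "gpt-4-32k": 8192,
}

def max_tokens_for(model: str) -> int:
    """Give models with larger context windows a larger response budget."""
    return TOKEN_BUDGETS.get(model, 1024)

def complete(ld_client, context, openai_client, prompt: str):
    # The flag's targeting rules decide which tier this user falls into;
    # if the flag is missing or evaluation fails, fall back to GPT-3.5.
    model = ld_client.variation("model-variant", context, "gpt-3.5-turbo")
    return openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens_for(model),
    )
```

Because the model name comes from the flag evaluation rather than being hard-coded, moving a user between tiers (or rolling everyone back) is a targeting change in LaunchDarkly, not a code deploy.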
In this example, we’re getting the model from a LaunchDarkly feature flag, deciding what sort of token length to leverage based on the model selected and feeding that model into our OpenAI API call. This is a specific example leveraging the OpenAI API, but the same concept would translate to using something like Vercel’s AI package, which allows a more seamless transition between different types of AI models.
Within the application itself, once you log in, you’re presented with the option to opt in to a new model as needed, as well as to opt out and return to the default model.
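One way to wire up that opt-in is to record the user's choice as an attribute on their evaluation context, so a flag targeting rule (for example, one matching `model_opt_in == true`) serves them the new model. This is a simplified stand-in for a real SDK context; the attribute name is an assumption:

```python
# Sketch: tracking a per-user model opt-in as a context attribute that a
# feature flag targeting rule can match on. Attribute name is hypothetical.
from dataclasses import dataclass, field

@dataclass
class UserPrefs:
    key: str
    attributes: dict = field(default_factory=dict)

    def opt_in(self) -> None:
        # A targeting rule like `model_opt_in == true` would then serve
        # this user the new model variation.
        self.attributes["model_opt_in"] = True

    def opt_out(self) -> None:
        # Removing the attribute returns the user to the default rule.
        self.attributes.pop("model_opt_in", None)
```

In a real integration these attributes would live on the SDK's context object rather than a local dataclass, but the principle is the same: the opt-in toggles targeting data, and the flag does the rest.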
Measuring the Model
As these models mature, we’ll want more ways to measure how effective they are against different vendors and model types. We’ll have to consider questions such as:
- How long does a model take to return a valid response?
- How often is a model returning correct information versus a hallucination?
- How can we visualize this performance with data and use it to understand which model is the right one for which situation?
- What about when we want to serve the new model to 50% of our users to evaluate it against the current one?
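A starting point for those last two questions is to split traffic between models and record latency per variation, so the data can be compared side by side. This sketch uses a crude deterministic split and an in-memory store; in practice a feature flag's percentage rollout would do the bucketing and the samples would flow to a metrics or experimentation backend:

```python
# Sketch: a 50/50 model split with per-model latency measurement.
# The bucketing and in-memory storage are simplifications.
import random
import time
from collections import defaultdict

# Latency samples keyed by model name.
latencies = defaultdict(list)

def pick_variant(user_key: str) -> str:
    # Seeding with the user key makes the split deterministic per user,
    # mimicking how a flag rollout hashes the context key.
    random.seed(user_key)
    return "gpt-4" if random.random() < 0.5 else "gpt-3.5-turbo"

def timed_call(model: str, fn, *args):
    # Wrap any model call and record how long it took for that model.
    start = time.perf_counter()
    result = fn(*args)
    latencies[model].append(time.perf_counter() - start)
    return result
```

With samples accumulating per model, answering "how long does each model take?" becomes a simple aggregation over `latencies`, and the 50/50 split gives both models comparable traffic.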
Software is in a constant state of evolution. We’ve become accustomed to this in our space, but it’s striking how much of it still relies on the same core principles. The software delivery lifecycle is still a real thing: code is still shipped to a destination to run and released to users to consume. AI is no different.
As the LLM space becomes more commoditized, with multiple vendors offering unique experiences, the tie-ins to concepts like CI/CD, feature flagging and software releases are only going to grow. The way organizations integrate AI into their products, and ultimately switch between models to gain better efficiency, is going to become a practice the software delivery space will need to adopt.
At LaunchDarkly Galaxy 23, our user conference, I’ll be walking through a hands-on example of these concepts, using LaunchDarkly to control AI availability in an application and showing live what this looks like in a product. With any luck, we’ll build a solid foundation for establishing a bit more control over the machines and protecting ourselves from the ultimate buggy code: the kind that results in the machines taking control. At minimum, I’ll at least show you how to write in a killswitch. =)
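That killswitch can be as small as a single boolean flag checked before any model call. A minimal sketch, where the flag key is hypothetical and `ld_client` is anything exposing a LaunchDarkly-style `variation` method:

```python
# Sketch: a boolean killswitch flag gating all AI functionality.
# Flag key is hypothetical; defaulting to False means the AI stays
# off whenever the flag can't be evaluated.

def ai_enabled(ld_client, context) -> bool:
    return ld_client.variation("ai-killswitch", context, False)

def guarded_complete(ld_client, context, call_model, prompt: str):
    # Check the killswitch before every model invocation; flipping the
    # flag off disables AI everywhere without a deploy.
    if not ai_enabled(ld_client, context):
        return "AI features are currently disabled."
    return call_model(prompt)
```

Flipping that one flag off turns the feature dark across the whole application instantly, which is exactly the control Cyberdyne never had.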