How to Rewrite Your Bedrock Application While Remaining Operational
In January 2016, Chargify decided to do a code rewrite. Our application was six years old. We felt that we had learned enough — and were constrained enough by the existing architecture — that by starting fresh we would actually come out ahead.
I like to believe that we were not naive. We’d done our research and were aware of the warnings. We knew Joel Spolsky counts a full rewrite among the “Things You Should Never Do.” But there were also voices like David Heinemeier Hansson’s, suggesting there’s a time and a place for the rewrite. That viewpoint resonated with us.
So, we split our team into two. One half of the team was assigned to maintaining “Chargify Classic” because it was our revenue generator and a great application that wasn’t going away anytime soon. The other half of the team embarked on the new application, code-named “Comet.”
The Big Rewrite
A few months into Comet, Chargify entered into acquisition talks with Scaleworks. The partners at Scaleworks rightfully identified Comet as one of the biggest risks to the business, and we mutually decided to put the rewrite on the shelf and refocus solely on Classic.
We had made good progress on Comet and had the chance to explore some really great ideas, but it would have almost surely taken longer than we hoped. We were only four months along, and this was already becoming obvious.
Why do rewrites almost always take longer than everyone expects? I believe it’s because of something I call “bedrock code.” Every application is built upon foundational code that we often take for granted. It’s old code that doesn’t change much and undergirds much of your architecture. Bedrock code is both a blessing and a curse. It’s often the reason you feel constrained, since it sets a direction and affects the code and the decisions that come after it.
Your bedrock code was written a long time ago, either by developers who are no longer with your company or by less experienced versions of your current developers. Just like geological bedrock, it’s hard to change or move. But, also like geological bedrock, it’s massive and important. And, it often doesn’t really need to change.
Which code is bedrock code?
- Your user and authentication system is bedrock code. You may not like the way it works, but you don’t want to have to change it or rewrite it just so you can make the high-value change that drove the rewrite in the first place.
- Your system of “observers” is bedrock code. Whether you’ve been intentional about your observers or they’re implicit, almost every app has some form of them — they ensure that when “A” happens, then “B” over in this other place should also happen.
These aren’t interconnections you want to have to think about, because you’ve probably forgotten most of them. When you do miss one, though, your users will probably notice immediately.
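To make the idea concrete, here is a minimal sketch of an explicit observer registry. All names are hypothetical, not Chargify’s code; the point is only that “when A happens, B should also happen” connections live in one discoverable place instead of being scattered implicitly through the app.

```python
from collections import defaultdict

class Observers:
    """A tiny event registry: handlers subscribe to an event name."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        # Register `handler` to run whenever `event` is emitted.
        self._handlers[event].append(handler)

    def emit(self, event, payload):
        # Fire `event`, invoking every registered handler in order.
        for handler in self._handlers[event]:
            handler(payload)

observers = Observers()
log = []
# When a subscription is canceled ("A"), other things ("B") must happen too.
observers.on("subscription.canceled", lambda sub: log.append(f"stop billing {sub}"))
observers.on("subscription.canceled", lambda sub: log.append(f"notify {sub}"))
observers.emit("subscription.canceled", "sub_42")
```

Whether your app uses a framework’s observer mechanism or ad-hoc callbacks, the failure mode is the same: a missed registration means a missed side effect.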
The nice thing about bedrock is that it’s usually covered by a layer of fertile ground that can be easily changed. You can bulldoze it flat or pile it up. You can plant things in it that will grow bigger and better over time. This fertile ground is where your team is building most of the time, and probably where you want them to stay. Even if you’re feeling constrained and want to rewrite, it’s possible to use the fertile ground to achieve the same results as a rewrite, and the rest of this article introduces some strategies to do just that.
Changing the Wheels on a Moving Bus
At Chargify, the Comet project energized us because it showed us what was possible. It allowed us the freedom to break our existing conventions to discover more sustainable and user-centric ways of doing things. I knew we needed to get some of those ideas into Classic. So, instead of a rewrite, we decided to change the wheels on the bus while it was moving.
Such a project requires more discipline and constraint-driven work than a rewrite, but it lets you stay in the fertile ground rather than getting mired in bedrock. Compared to a rewrite, you’re more likely to finish this kind of project successfully, and in a timeframe you can actually estimate.
We followed a few simple strategies for success:
1. Double writing: compute and store data the “old way” and the “new way” simultaneously
2. Double checking: compare the results of your before and after, based on double writing
3. Slow rollout: don’t expose the rewritten portions to everyone all at once
For Chargify, one of the fundamental changes we wanted to make was to get away from a “rolling balance” transaction system and move to a more flexible invoice system for bill presentment. When I started building Chargify in 2009, one of the perceived “most important” concepts was that of the “current balance due” for a subscription. So, I devised a system where every balance-changing entry — called a transaction in our application — stored two important pieces of data:
- The balance delta (an increase or decrease to the balance)
- The current balance
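The two pieces of data above can be sketched as a minimal single-entry ledger. This is illustrative only (field names are made up, amounts are in cents), but it shows both the strength and the weakness described next: entries are immutable, so a mistake can only be fixed by appending a correcting entry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen: ledger entries are immutable once written
class Transaction:
    memo: str
    delta: int     # balance delta in cents (+charge, -payment/credit)
    balance: int   # current balance *after* applying the delta

class Ledger:
    def __init__(self):
        self.entries = []

    def record(self, memo, delta):
        prev = self.entries[-1].balance if self.entries else 0
        entry = Transaction(memo, delta, prev + delta)
        self.entries.append(entry)
        return entry

ledger = Ledger()
ledger.record("monthly charge", 5000)
ledger.record("payment", -5000)
ledger.record("accidental overcharge", 700)          # the mistake...
correction = ledger.record("correcting entry", -700)  # ...fixed by appending
```

The history is technically correct, but the customer now sees four entries where they expected a single corrected bill.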
This is essentially the idea of a financial journal or ledger, and has been around in accounting for hundreds of years. It’s a solid choice for accurate data, but it’s not always the best choice for helping customers understand their SaaS bill. (And, unfortunately, I naively built a single-entry ledger instead of a double-entry ledger, which imposes limits and makes reporting difficult. But that is a topic for another time…)
The problem with ledgers is that entries are, by definition, immutable. If you make a mistake or need to make a change, you must make another entry to “correct” the balance. You end up with a technically correct history of events, but a potentially confusing view for the customer. SaaS customers think in terms of “this month’s bill” and “last month’s bill.” If you need to correct last month’s bill for them (maybe you accidentally overcharged them), they want to clearly see that bill changed, not that you tacked another journal entry onto the end of a list that includes this month’s bill. For us, the transaction system was an impediment to human understanding. But it was also bedrock in our system — everything depended on it.
I would wager that this is common in most applications as they grow — the data you are storing is important but insufficient. And, just storing additional data misses the mark — you want to capture entirely different data or arrange it in entirely new ways. For this scenario, I suggest double writing.
Double writing works like this:
- Don’t change anything about the data you’re already storing — just keep doing what you’re already doing. (The goal is to not break anything that already exists!)
- Also, decide what data it is that you really want. Design that data (schema, collection methods, etc) exactly the way you want it, as if the old system doesn’t even exist.
- Finally, work out how to store both the old data and the new data simultaneously.
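The three steps above can be sketched as a single write path that performs both writes. Lists stand in for database tables, and all names are hypothetical; the essential property is that the old write is byte-for-byte what it always was, while the new write captures the data in the shape you actually want.

```python
old_transactions = []   # stands in for the legacy transactions table
new_invoice_lines = []  # stands in for the new invoice-line table

def record_charge(subscription_id, amount, description):
    # 1. Old write: unchanged behavior, so nothing that exists can break.
    prev = old_transactions[-1]["balance"] if old_transactions else 0
    old_transactions.append({
        "subscription_id": subscription_id,
        "delta": amount,
        "balance": prev + amount,
    })
    # 2. New write: designed as if the old system didn't exist,
    #    and richer than the old row.
    new_invoice_lines.append({
        "subscription_id": subscription_id,
        "amount": amount,
        "description": description,
    })

record_charge("sub_1", 2500, "Pro plan, March")
```

Because the old path is untouched, this change is low-risk even before any code reads the new data.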
Usually, your new data can’t be derived statically from the old data; if it could, you would have already done that. More often, there is key moment-in-time information that affects your new data and can’t be easily reconstructed after the fact.
An example from our case is pricing information. Our old ledger entry would accurately reflect the total amount charged for an item, but what if that item price was actually composed of tiered pricing data? How much did each tier cost that day and for that user? Our new data squirrels away this metadata for each “line item.” Where possible, our new line item data is tied to the old transaction data, but it is usually much richer and more accurate, telling a more complete story.
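Here is an illustrative sketch of that moment-in-time metadata, with made-up tier prices. The old ledger would store only the total; the new line item also records how each tier contributed, which would be hard to reconstruct later if the tier prices ever change.

```python
def price_with_tiers(quantity, tiers):
    """tiers: list of (units_in_tier, unit_price_cents), in order."""
    breakdown, remaining = [], quantity
    for units, unit_price in tiers:
        used = min(remaining, units)
        if used:
            breakdown.append({"units": used,
                              "unit_price": unit_price,
                              "subtotal": used * unit_price})
        remaining -= used
    return breakdown

# First 10 units at 10 cents each, the next 90 at 8 cents each.
tiers = [(10, 10), (90, 8)]
breakdown = price_with_tiers(20, tiers)
total = sum(part["subtotal"] for part in breakdown)

# The old data keeps only `total`; the new line item keeps both.
line_item = {"total": total, "tier_breakdown": breakdown}
```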
What is the best way to capture your new data?
For us, we wanted to change as little as possible about our existing code paths. Instead, we “tapped in” at specific places in the existing code and built up an in-memory “registry” of data. By contrast, the old code paths essentially use the database as the “registry,” a common problem in apps built iteratively by many developers. These code paths save and re-save the same row of data multiple times as the logic unfolds.
As inefficient as this is, I see it happen often in mature applications as new features are tacked on over time. As tempting as it is to change, it is often risky and difficult… so, we’re taking the approach now of being as efficient as possible with the new data only. In the next section, I’ll show how the new data will afford us an opportunity to come back and improve the old code paths later.
Once the data registry is built via the old code paths — sometimes across classes or very different parts of the application — the new data is persisted as efficiently as possible (i.e. in as few database writes as possible). We do our writes “in-band,” although you could consider doing them out-of-band (i.e. in a background job) to reduce the impact on performance.
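A hedged sketch of that registry pattern, with hypothetical names: the tap points do no I/O, they only accumulate data in memory, and a single flush at the end persists everything in one write.

```python
class InvoiceRegistry:
    """Accumulates line items in memory; persists once at the end."""
    def __init__(self):
        self.lines = []
        self.writes = 0   # counts how many "database writes" we perform

    def tap(self, description, amount):
        # Called from inside existing code paths; no I/O happens here.
        self.lines.append({"description": description, "amount": amount})

    def flush(self):
        # One persistence step, instead of a save per intermediate state.
        self.writes += 1
        return {"lines": self.lines,
                "total": sum(line["amount"] for line in self.lines)}

registry = InvoiceRegistry()
registry.tap("base plan", 4900)      # tapped in the billing code path
registry.tap("metered usage", 1250)  # tapped in a different class entirely
invoice = registry.flush()           # a single write for the whole invoice
```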
In the end, what you should have is code that does exactly what it always did, but with the addition of new data, possibly in a new format, that sits next to your old data. This gives you insights you didn’t have before. It is beneficial to write and store the data even if you haven’t yet written code to use it, as the next sections will show.
Believe it or not, you are going to make mistakes with your double writing! No matter how many tests you write, you’re going to encounter scenarios in production that you didn’t account for in your double write strategy. The good news is that there is a scalable way to catch these mistakes: find some way to compare or correlate your old data with your new data.
For us, it was fairly easy to find the point of comparison: for every subscription in our system, the (old-data) rolling current balance can be compared to the (new-data) calculated balance of unpaid invoices. When there is a mismatch, we know we’ve missed a scenario in our hook point or data translation.
We set up a daily job to look for these discrepancies and open an issue whenever one is found.
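The heart of that daily job can be sketched as a pure comparison function. The data shapes and names here are illustrative: for each subscription, the old rolling balance is checked against the sum of unpaid invoices in the new data, and any mismatch becomes an issue for the team.

```python
def find_discrepancies(rolling_balances, invoices):
    """rolling_balances: {subscription_id: cents} from the old data.
    invoices: dicts with subscription_id, amount, and paid, from the new data.
    Returns one issue per subscription whose balances disagree."""
    issues = []
    for sub_id, old_balance in rolling_balances.items():
        new_balance = sum(inv["amount"] for inv in invoices
                          if inv["subscription_id"] == sub_id and not inv["paid"])
        if old_balance != new_balance:
            issues.append({"subscription_id": sub_id,
                           "old": old_balance, "new": new_balance})
    return issues

rolling = {"sub_1": 0, "sub_2": 700, "sub_3": 100}
invoices = [
    {"subscription_id": "sub_1", "amount": 5000, "paid": True},
    {"subscription_id": "sub_2", "amount": 700, "paid": False},
    # sub_3 has no invoice at all: a missed scenario in the double write.
]
issues = find_discrepancies(rolling, invoices)
```

Run on a schedule, this turns silent data drift into a ticket the same day it happens.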
This is also a common strategy for refactoring code paths. GitHub’s Scientist library lets you run refactored code alongside the old code and compare the outcomes. The only difference is that double writing focuses on refactoring data rather than code.
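A tiny sketch in the spirit of Scientist (not its actual API, which is a Ruby library): run both code paths, always return the trusted result, and record any disagreement for investigation.

```python
mismatches = []  # collected disagreements between old and new code paths

def experiment(name, use, try_):
    """Run the old path (`use`) and the new path (`try_`) side by side.
    Callers always receive the old, battle-tested answer."""
    control = use()
    candidate = try_()
    if control != candidate:
        mismatches.append({"experiment": name,
                           "control": control,
                           "candidate": candidate})
    return control

result = experiment("balance", lambda: 4200, lambda: 4200)  # paths agree
experiment("balance", lambda: 4200, lambda: 4199)           # paths disagree
```

In production you would also guard against the candidate raising exceptions and measure both timings, as Scientist does.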
Moving all of your users from something old and stable to something new and unstable is a risky proposition. It’s better to make moves like this slowly. Double writing and double checking give you a wonderful platform for a slow rollout. In fact, you can start double writing data before you even start to use it for anything except the double checking.
Once you’re making use of the new data, double writing dictates that the system continue to work in the presence of the old data. This has allowed Chargify to add our rich new Relationship Invoicing features (such as Customer Hierarchies and WhoPays) in a piecemeal fashion, without having to change everything all at once. Some features (such as presenting the bill to the user) are updated to use the new invoices, while other features (such as our dunning system that attempts to automatically collect on past due accounts) are left to use the old transactions. Since the numbers match, we are able to wait to update the dunning code later, making the development of the new features less imposing since not every area of the codebase needs to be changed.
I also recommend allowing users to opt-in to new features enabled by your double writing. This gives you a slowly expanding set of data for the double-checking system. Rather than flooding the double-checking system and overwhelming your team with fixes, bugs become more manageable and fewer users are adversely affected by them.
Bear in mind that, at some point, you’ll probably need to write a translation that converts historical data to the new format. At that point, you’ll need to decide: how much fidelity do I need in the historical data? As mentioned, it is often difficult to reconstruct history and get the same full-fidelity new data you’re getting through double writing. For us, the tradeoff is that older invoices will be accurate but offer less of the insight and richness of their newer counterparts created through double writing.
At Chargify, our rollout plan for Relationship Invoicing is a four-phase plan:
- First, new users can opt-in to the new data/features (results in a trickle of users)
- Later, new users get the new data/features by default (results in a steady stream of users)
- Later yet, existing users can choose to opt-in to the new data/features (requires translating the historical data, results in a high flow of users)
- Lastly, existing users are all given the new data/features (opens the floodgate)
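The four phases above can be gated by a single check. This is a hedged sketch with hypothetical phase numbers and field names; in practice the flag would live in a feature-flag system rather than a module constant.

```python
ROLLOUT_PHASE = 2  # 1..4, advanced as confidence in the new data grows

def uses_new_invoicing(user):
    is_new = user["signed_up_after_launch"]
    opted_in = user["opted_in"]
    if ROLLOUT_PHASE == 1:   # phase 1: new users may opt in (a trickle)
        return is_new and opted_in
    if ROLLOUT_PHASE == 2:   # phase 2: new users by default (steady stream)
        return is_new
    if ROLLOUT_PHASE == 3:   # phase 3: existing users may opt in (high flow)
        return is_new or opted_in
    return True              # phase 4: everyone (the floodgate)

veteran = {"signed_up_after_launch": False, "opted_in": False}
newcomer = {"signed_up_after_launch": True, "opted_in": False}
```

Advancing a single phase value widens the audience without touching the feature code itself.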
Be aware that, until you get to step three or four, there are going to be users who wonder why they don’t have access to “the new stuff.” We’ve definitely experienced that, and it can be discouraging for both the users and your team supporting them. My advice is to clearly communicate your reasoning: you value correctness and reliability for the install base. You’ve earned their trust, and this is one of the ways you retain it. We believe that it is better to have a slow rollout plan that works smoothly than a fast plan that harms users and destroys trust.
Changing the wheels on the bus while in motion requires a careful plan, but it can be done. And, it doesn’t require stopping the bus entirely, which most users will appreciate.
By employing double writing, double checking, and a slow rollout, you can change fundamental things about your application without needing to rewrite. There will be times when you wish you had just rewritten the whole thing, but stay strong! Think of all the code you’re not rewriting because you stayed in the fertile ground. And think of the safety nets these strategies give you.
For Chargify, the biggest ideas from Comet have landed using a team of developers that was half the size of the rewrite team. On top of that, I’d estimate we’ve changed the wheels in less than half the time of a rewrite. That’s a result that should give confidence to anyone considering these strategies in lieu of a rewrite.