When Losing $10,000 to Cryptojacking Is a Good Investment
When we received an abnormally high bill from Heroku several months ago, I immediately knew the root cause: Someone had exploited our architecture and hijacked our compute resources. I could guess what had happened in detail because we had known long before that such exploitation was possible. And we still permitted it. Not only that, faced with similar choices, I, as the chief technology officer, would make the same decisions again.
To understand the “how” and the “why,” we first need to talk about Wilco’s architecture. We’re a young company, just over a year old, that wants to enable developers to acquire and practice skills. Our uniqueness is in the way we do it: a full-blown simulation of a workplace.
Wilco users’ experience closely resembles a job in a tech company. They have a GitHub repository to work on, a Slack-like corporate messenger (with simulated coworkers) and a production environment: data, users, load — the whole nine yards.
In practice, this is very challenging. GitHub, Slack and other SaaS tools are built to be used by teams. In our case, we needed to create an isolated mini-organization for every user. Let’s see how we do it and why it was exploitable.
How Wilco Works
Every Wilco user gets their own production instance. They have a URL for a frontend and a backend that can communicate with each other, as well as a database instance that saves their data. This way, users can test their changes (using CI/CD), and we can simulate things like performance issues by sending requests that mimic traffic.
It’s imperative that our users’ production environments aren’t black boxes. We need to give them visibility into what’s going on, to debug problems and configure things — like they would on a real job.
The Race to MVP
The “original sin” was our desire to get Wilco’s MVP ready as soon as possible and get it into the hands of real users. Before investing the time and money to build a proper, scalable solution, we needed to test a lot of assumptions about whether there’s even demand for a product like ours. We also needed the ability to develop and kill features quickly and cheaply while retaining the flexibility to double down on something if it’s proven successful.
The easiest way was to connect each repo to two Heroku apps, one for the frontend and one for the backend. That way, we were able to create one organization but give each user access to only their apps. We’d push the code to the right apps, and Heroku would handle the rest.
In addition to being efficient, it was also cost-effective. We got a database for free using Heroku’s free tier (RIP) and could pay only for the time users utilized the resources. So if a user spent only a few hours a month with their apps, we’d pay just for those few hours.
You might have already noticed a problem with our plan. Heroku wasn’t built for what we were doing, so scaling it would be difficult. We knew about the limitations and even mapped them:
- Heroku doesn’t have application-level authorization and roles. If you’re invited to an app, there are no limitations on what you can do with the app.
- There’s a limit on the number of applications that can be created under the same account — 200 apps. It means we can only support 100 users (each has a backend and a frontend app) under one organization.
- The sophisticated use cases devs are used to aren’t possible. Microservices, queues, cache and staged rollout are either a downright no-go or very complicated because Heroku was not built for these kinds of scenarios.
For each of the risks, we wanted to understand the impact:
- No app-level authorization. Users can play with the configuration if they have this kind of control. It’s possible to upgrade app dynos or add add-ons that will cost us money. Unfortunately, Heroku doesn’t support putting limitations on an app’s spend, or have notifications when a threshold is exceeded, so we’d need to constantly keep an eye on this. That said, it didn’t break the siloing — users could only affect their own apps.
- The 200 apps limit. We’d hit a wall for every 100 users, but as a short-term fix, we could manually create a few more organizations.
- Use-case simplicity. Not a problem; we can focus on the easier use cases for now.
So what’s the worst-case scenario? A big bill from Heroku, and users receiving an error when trying to create an app. Not great, but not horrible. The alternative, building something from scratch, seemed worse.
Succeed to Fail > Fail to Succeed
How surprised are you that all of the bad scenarios we outlined materialized during the first three months of Wilco’s existence? First came a partner who wanted quests that were way more sophisticated than we could provide, with multiple servers. Then, one of our campaigns went viral, so we needed to cap it after a few minutes when we got to the Heroku app limit. The cherry on top was a user who figured out that they could upgrade their dyno and add add-ons to misuse our compute resources, with us footing the $10,000 bill.
Were we sad? Of course not! This was a massive success! We got traction and a lot of valuable feedback without having to build a lot of our own infrastructure. Heroku, to its credit, was kind enough to waive a small portion of the bill. If we had to build everything upfront, the cost of experimentation would be much greater than this bill.
The main downside, if you can call it that, of our approach is that we relied for a few months on what we knew was a temporary solution. But there’s a ray of light here as well: we had time. While building the Heroku-based Wilco, we were already planning our homegrown solution and scoping the resources required to build it.
By the time we received the bill, we were already deep into building our solution. Most of the work was done without any sense of urgency, as things were working relatively well.
In startups, time is one of the most important things we need to optimize. We can’t waste time building scaffolding for things that might not be used by anyone. Sometimes there’s no choice but to roll the dice with a calculated risk.
While it’s hard to build a solution that you know will not scale, and might cost a pretty penny, consider the time you’ll save in the long run and the money this time costs. We did, which is why I do not regret even for a second getting that bill.