The Cloud Native Computing Foundation sponsored this post.
TransferWise launched in 2011 and, as with most companies that started 5+ years ago, we began with a monolithic application. However, we scaled quickly, and things soon became more complicated. Hundreds of thousands of customers turned into millions of customers. Our product was evolving, fast. Our code base had to keep up.
I’ll explore our journey from humble beginnings, through to complex system changes and how we’re now preparing for the future. I’ll also share a few challenges and learnings along the way.
At some point around 2015, we found we had a growing ring of microservices. So far, so good. It was fairly easy to run our web app plus two to three services we needed on a laptop.
However, we were growing fast (we were about 80 engineers at this point, we’re now 300!) and we reached a point where we needed to make changes in several services with dozens of dependencies. This made local development no longer feasible.
As a result, our development teams turned to our staging environment to test changes. But as this was simply a single environment shared by multiple teams, it became a huge source of instability and frustration.
When It’s Not So Simple Anymore
The staging environment situation was exacerbated by the fact it was used by our external partners (e.g. banks and other payout users) to test their integrations with TransferWise. They were understandably getting frustrated by the frequent instability of our system.
Our platform team made a key decision to create a separate environment where our partners, and eventually any user of our open API, could experiment without being at the mercy of a dozen development teams.
As we were speccing the new environment, we realized that we shouldn’t just build one. We felt it would be better to make the process fully automated, so we were able to spin up an arbitrary number of environments that could be used by our growing number of development teams and external partners.
Terraform Definition of a Standard Environment
We called the result custom environments. Here’s how it worked:
- Create a pull-request in our Terraform repository, adding a definition like the one in the above image and optionally customising the default service set.
- Once it’s in, trigger a Terraform run, which provisions the following on AWS:
  - The database instances for the main app and the services;
  - The security groups, controlling both inter-environment and outside traffic;
  - VMs for Consul (more on that later), Kafka, Zookeeper and other platform dependencies;
  - Finally, dozens of service VMs (using Spot instances to save costs).
- Once provisioning completes, a configuration run:
  - Applies our standard VM hardening to all the machines in the environment;
  - Adds the SSH keys for the owner team to all the machines;
  - Creates service users, installs dependencies like the JDK;
  - Sets up the platform services like Consul and Kafka;
  - Sets up the service nodes and deploys the services.
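To make the provisioning order concrete, here is a minimal sketch in Python rather than Terraform. The `CustomEnvironment` class, the service names and the resource labels are all hypothetical and only illustrate the shape of a definition with an optional service-set override; they are not taken from our actual Terraform repository.

```python
from dataclasses import dataclass

# Hypothetical default service set; the real list is not given in this post.
DEFAULT_SERVICES = ("web-app", "pricing", "payout")
# Platform dependencies named in the list above.
PLATFORM_NODES = ("consul", "kafka", "zookeeper")


@dataclass
class CustomEnvironment:
    name: str
    owner_team: str
    services: tuple = DEFAULT_SERVICES  # optionally customised per environment

    def provisioning_plan(self):
        """Return the resources to create, in the order described above."""
        plan = [f"db:{s}" for s in self.services]          # database instances
        plan.append(f"security-group:{self.name}")         # traffic control
        plan += [f"vm:{n}" for n in PLATFORM_NODES]        # platform VMs
        plan += [f"spot-vm:{s}" for s in self.services]    # service VMs on Spot
        return plan


env = CustomEnvironment(name="yuriy", owner_team="platform", services=("pricing",))
for resource in env.provisioning_plan():
    print(resource)
```

Even for a single service the plan contains six resources, which hints at how quickly a full environment grows.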
At this point, your own environment is ready to use. Awesome.
From Laptop to the Cloud: The Magic of a Hybrid Environment
This new environment was proving useful already. But it lacked an important feature. When a developer changed a service, it often wasn’t enough for their service to make requests to the environment; the other services in the environment also needed to call their service. Allowing a service running in the cloud to find another service running on a laptop, many kilometers away, was a non-trivial task.
Of course, we could build an artifact locally and deploy it to the environment, but this would add a lot of extra hassle (and no one is a fan of remote debugging), so we knew we had to figure out a better way.
EC2 Nodes and Laptops Co-Existing in a Custom Environment (Screenshot from Consul UI)
HashiCorp Consul serves three goals in a custom environment:
- The key/value store augments the standard staging service configuration, allowing us to override environment-specific parameters like the database location and the generated credentials;
- We mapped Consul’s DNS resolution to one of our internal domains so you can access your service using a friendly pricing.service.yuriy.twi.se URL rather than fishing for an IP address;
- Finally, we could easily register a laptop as a service node, so once I stop the pricing service in the cloud on the screenshot above, pricing.service.yuriy.twi.se will resolve to my laptop and the other services will come here, allowing me to test my pricing service change end-to-end.
The “connect your laptop to the cloud” workflow was particularly useful for our front-end developers at the time, as it allowed our still very big and resource-intensive web app to sit in the cloud and go to the laptop for the front-end code.
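The laptop registration can be sketched roughly as follows. This is an illustration under assumptions, not our actual tooling: the helper names, the IP addresses and the port are made up, and the code only builds the request body for Consul's `/v1/agent/service/register` endpoint without sending it. The `resolve` function is a toy model of how a name ends up pointing at the laptop once the cloud instance is stopped.

```python
import json


def consul_dns_name(service, environment, base_domain="twi.se"):
    """Build the friendly name, e.g. pricing.service.yuriy.twi.se."""
    return f"{service}.service.{environment}.{base_domain}"


def laptop_registration(service, laptop_ip, port):
    """Request body for Consul's PUT /v1/agent/service/register."""
    return {
        "Name": service,
        "ID": f"{service}-laptop",  # hypothetical ID scheme
        "Address": laptop_ip,
        "Port": port,
    }


def resolve(instances):
    """Toy model: only instances that are still running are returned."""
    return [i["Address"] for i in instances if i["running"]]


# Hypothetical addresses: one cloud node (stopped) and one laptop.
instances = [
    {"Address": "10.0.1.15", "running": False},   # cloud instance, stopped
    {"Address": "192.168.0.42", "running": True},  # my laptop
]

print(consul_dns_name("pricing", "yuriy"))  # pricing.service.yuriy.twi.se
print(json.dumps(laptop_registration("pricing", "192.168.0.42", 8080)))
print(resolve(instances))
```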
⚖ What Worked and What Didn’t
Custom environments were a great enabler for our development teams, allowing them to get their own version of TransferWise up and running, quickly.
The sandbox became a great tool to enable our partners, developers and customers to try the TransferWise API. Uptake among our development teams quickly surpassed all expectations from the platform team.
However, with over 100 environments and 4,500+ running EC2 instances, the issues in our approach became evident.
- As more and more services were added to a custom environment definition, provisioning and configuring times went up. As all the environments were using a shared inventory in AWX, only one configuration or deployment could run at a time, forming long queues;
- We ran into all sorts of AWS limits (EC2 instances, security groups, RDS instances), so we had to create support cases for limit increases on a regular basis. At some point, we used up the whole Frankfurt AWS region spot instance capacity, forcing us to retreat to Ireland, where many more resources were available;
- As the provisioning time went up, our teams started to keep their environments on at all times, contrary to our initial disposable premise. Teams would then have to keep track of several dozen virtual machines in their environment, dealing with spot instance termination, disk space issues, and all sorts of infrastructure management they were unprepared to face. This increased our AWS bill considerably;
- Some services contact external rate-limited resources. For example, provisioning a hundred rate service instances would quickly drain the number of requests we could make to the rate provider, blocking all the work and potentially breaking the production rate updates as well. To solve this, we added a condition to fall back to the rate service instance in the staging environment, but this required non-trivial adjustments to our initial self-contained design and contributed to the environment instability even more.
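The staging fallback from the last point can be pictured as a small routing rule. The sketch below is hypothetical: the `SHARED_FROM_STAGING` set, the `service_host` helper, and the assumption that staging services are reachable under the same naming scheme are all illustrative, not a description of how the condition was really implemented.

```python
# Hypothetical: services that call rate-limited external providers and
# therefore should not be duplicated in every custom environment.
SHARED_FROM_STAGING = {"rate"}


def service_host(service, environment, base_domain="twi.se"):
    """Pick the environment that should serve a given service."""
    # Rate-limited services are always served by the staging instance.
    target = "staging" if service in SHARED_FROM_STAGING else environment
    return f"{service}.service.{target}.{base_domain}"


print(service_host("pricing", "yuriy"))  # pricing.service.yuriy.twi.se
print(service_host("rate", "yuriy"))     # rate.service.staging.twi.se
```

The trade-off is visible even in this toy version: every environment now depends on staging being healthy, which is exactly the coupling the self-contained design was meant to avoid.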
Take Two: Kubernetes Custom Environments
Our platform team realized our current approach wouldn’t scale and that we needed to take action.
After gathering feedback from the development teams, going through usage data and studying alternatives, we identified the design goals for our next iteration:
- Instant creation and destruction time;
- Only the minimal set of services you need, with the ability to easily add or remove a service or fall back to a staging service;
- No pull requests or dedicated button pushers for end users (if you want a new environment, you don’t need anyone else to approve it or do anything to enable you);
- Familiar tooling for deployment.
Luckily, around the same time, our Kubernetes setup started to really take off, making the answer to most of our design goals very simple. So how did it tick the boxes?
✅ Creation/destruction time: Spinning up a new pod is super-fast, and cluster auto scaling helps to upscale/downscale worker nodes as necessary.
✅ Self-healing: A native feature of Kubernetes. Should a pod or even a worker node die, the control plane will take care of it without any human intervention.
✅ Minimal set of services/no pull requests: Not directly a Kubernetes benefit, but with a chatbot we developed it’s simple to just ask to add or remove the USD-related set of services, for example.
✅ Familiar tooling: As our K8s migration continues, our development teams learn and embrace kubectl. We also ported Octopus (our internal release management tool) to work in the new custom environments, unifying the deployment process in production, staging, or any custom environment.
The new custom environments live in a new separate cluster next to the staging ones. The creation is simple:
Three minutes and we’re ready to go! An impressive change, particularly when you compare it to the hour it took to set up an old custom environment (and there were even times when developers struggled for several days to get up and running).
Once it receives the command, the bot spins up new namespaces in the cluster, applies the staging manifests with configurable custom environments overrides and creates the Deployment object for the services. Instead of using separate DB instances per environment, we use a shared pool of beefy RDS instances, allowing for much quicker setup (and teardown).
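As a rough illustration of the "staging manifests plus overrides" step, the sketch below builds a Namespace manifest and derives a per-environment Deployment from a staging one. The function names, the example manifest and the override values are assumptions made for this example; the bot's real code is not shown in this post.

```python
import copy


def namespace_manifest(env_name):
    """Minimal Kubernetes Namespace object for a custom environment."""
    return {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": env_name}}


def deep_merge(base, override):
    """Return a copy of base with override applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def deployment_for(env_name, staging_manifest, overrides):
    """Apply custom-environment overrides and retarget the namespace."""
    manifest = deep_merge(staging_manifest, overrides)
    manifest.setdefault("metadata", {})["namespace"] = env_name
    return manifest


# Hypothetical staging manifest, trimmed to the fields used here.
staging = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "pricing", "namespace": "staging"},
    "spec": {"replicas": 3, "selector": {"matchLabels": {"app": "pricing"}}},
}

custom = deployment_for("yuriy", staging, {"spec": {"replicas": 1}})
print(namespace_manifest("yuriy"))
print(custom["metadata"]["namespace"], custom["spec"]["replicas"])
```

Merging into a copy keeps the staging manifest untouched, so the same source of truth can be reused for every environment.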
Some of the services are still in the process of migrating. If we don’t have a Docker image for a service, we use a generic image, which gets passed the service name and downloads the artifact on startup.
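The generic-image fallback could look something like this. The image names, the `SERVICE_NAME` environment variable and the `container_spec` helper are hypothetical; only the idea (use the service's own image if one exists, otherwise a generic image that is told which artifact to download) comes from the description above.

```python
# Hypothetical generic image that downloads the service artifact on startup.
GENERIC_IMAGE = "generic-service-runner:latest"


def container_spec(service, images):
    """Build a Kubernetes container spec, falling back to the generic image."""
    image = images.get(service)
    if image is not None:
        return {"name": service, "image": image}
    # No dedicated Docker image yet: pass the service name to the generic one.
    return {
        "name": service,
        "image": GENERIC_IMAGE,
        "env": [{"name": "SERVICE_NAME", "value": service}],
    }


# Hypothetical registry contents: pricing is migrated, payout is not.
known_images = {"pricing": "registry.example/pricing:1.0"}

print(container_spec("pricing", known_images))
print(container_spec("payout", known_images))
```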
Since the first iteration of custom environments, we’ve rolled out Envoy everywhere, so the Consul setup is no longer needed as all of the services use uniform Envoy URLs. To connect your laptop to the environment, you simply run an Envoy container locally and tell it your environment name and the service name you’re running.
The Kubernetes custom environments are still in alpha stage, but we can already see how much faster, more efficient and more cost-effective they’re poised to become compared to our first iteration.
We never like to stay in one place for long and I’m sure the new custom environment setup will grow and change with time, as well as our production stack. As a company, we’re in the same boat.
As we’re beginning to explore global distribution and running parts of TransferWise closer to our customers, we’ll need to reflect those changes in the development environments as well.
Luckily, Kubernetes provides a solid foundation for us to build on and we look forward to the future challenges with enthusiasm.
To learn more about containerized infrastructure and cloud native technologies, consider coming to KubeCon + CloudNativeCon, May 20-23 in Barcelona.
Feature image via Pixabay.