Techniques to Avoid Cloud Lock-in
Every cloud provider has marquee services that attract companies and developers to build on its platform. These flagship services work nicely with other services on the same platform but often limit interoperability with other public clouds, creating cloud vendor lock-in. There is a case to be made for embracing lock-in: it lets a company boost productivity and deliver value to its users faster.
At Render, we are building a new cloud platform bootstrapped over multiple public clouds, with plans to add on-premises workloads, so it's essential that we avoid tying ourselves to a single provider. In this post, we discuss some of the key technical decisions we've made to stay provider-agnostic and to set ourselves up for a hybrid cloud future.
Infrastructure as Code
Infrastructure as Code (IaC) is a requirement at most software companies today. It's a cornerstone of any technology stack and cumbersome to change once a choice has been made. Popular options include AWS CloudFormation, Terraform, Pulumi, Chef, and Ansible.
AWS CloudFormation only works for companies that are all-in on Amazon Web Services. Terraform is popular with many organizations but requires learning a new domain-specific language. If you'd rather use a language you already know, Pulumi (Node.js, Go, Python, .NET Core), Chef (Ruby), or Ansible (Python) might be a better fit. Ultimately, we ended up using both Terraform and Ansible for their mature ecosystems and broad cloud provider support: Ansible is our tool of choice for configuring machine images, while Terraform works well for provisioning infrastructure components and configuring networking across multiple public clouds.
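As an illustrative sketch (the regions and the project ID are hypothetical), a single Terraform configuration can declare providers for multiple clouds side by side, so the same plan/apply workflow provisions resources everywhere:

```hcl
# Both providers live in one configuration; each resource targets
# its own cloud, but state and workflow stay unified.
provider "aws" {
  region = "us-west-2"
}

provider "google" {
  project = "example-project" # hypothetical project ID
  region  = "europe-west3"
}

# An AWS VPC and a GCP network, provisioned with the same
# `terraform plan` / `terraform apply` workflow.
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "google_compute_network" "main" {
  name                    = "main-network"
  auto_create_subnetworks = false
}
```

Keeping cloud-specific resources behind a shared workflow like this means adding a new provider is an incremental change rather than a rewrite.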
Configuration and Secrets
Every production application needs access to configuration variables and secrets that are best stored in a purpose-built, encrypted, and readily accessible location. Cloud providers offer API-driven products that make it easy to securely store and access this data: AWS Secrets Manager, AWS SSM Parameter Store, and Google Cloud Secret Manager all free users from having to manage the underlying storage and cryptography. However, API access to these services is always gated by provider-specific IAM credentials, which cannot be ported across clouds.
Our configuration and secrets management solution had to give us full control over our data, work across all major cloud providers, and scale easily as the company grows. Access to source code that had already been professionally audited was also essential. HashiCorp Vault ended up satisfying all of our constraints, and as an added bonus it was relatively easy to set up and manage.
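As a hedged illustration (the secrets path and service name are hypothetical), access in Vault is scoped with small declarative policies like the one below. Because Vault policies are deny-by-default, nothing outside the listed path is reachable:

```hcl
# Hypothetical Vault policy: grant a service read-only access to its
# own secrets under the KV v2 secrets engine. Everything else is
# denied by default.
path "secret/data/billing/*" {
  capabilities = ["read"]
}
```

The same policy file works identically whether Vault is running on AWS, GCP, or on-premises, which is exactly the portability the provider-native secrets managers lack.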
Kubernetes
Kubernetes can be prohibitively complex, but it provides useful abstractions that unify server and container orchestration across public clouds and private data centers. Our team had prior experience with it and picked it over other orchestrators for its rapidly growing community and pace of development, despite its shortcomings.
Early on our focus was on getting to market as soon as possible, so we decided to use a managed Kubernetes offering. However, as we’ve grown to serve billions of requests every month we’ve also run into multiple limitations and bugs in managed solutions across multiple clouds. The lack of access and visibility into the control plane ultimately made it clear we had outgrown our initial setup and needed to manage our own Kubernetes clusters. At the same time, it was important for us to have the same management primitives for Kubernetes across all our clusters, which is of course impossible when using managed Kubernetes from different cloud providers. The launch of Render’s Frankfurt hosting region was a big milestone — not only did it transform Render into a multiregion and multicloud platform, but it also helped us build expertise in managing and administering Kubernetes from the ground up.
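Self-managing clusters makes it possible to bootstrap every control plane from one declarative file. A minimal sketch, assuming a kubeadm-based setup (the version and subnets here are hypothetical values):

```yaml
# Hypothetical kubeadm ClusterConfiguration: the same file can
# bootstrap control planes on any cloud or on-premises host,
# keeping cluster management primitives identical everywhere.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: "v1.27.4"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
```

With managed Kubernetes, each provider controls these knobs differently (or not at all); owning the control plane makes them uniform across regions and clouds.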
It might appear that we’ve avoided cloud lock-in by embracing Kubernetes lock-in. But this is where our UX decisions help: we’ve chosen to avoid becoming yet another managed Kubernetes platform and instead have focused entirely on making Render a UX-focused platform without exposing Kubernetes to our customers. In doing this, we preserve the optionality of migrating to in-house or third-party orchestration tools that best suit our users’ needs at any given time.
Message Queues
Adding new components to a distributed system leads to a combinatorial growth in point-to-point integrations and can quickly become a management nightmare. Message queues provide an elegant solution by giving new services a single integration point through which to communicate with all existing and future services. Public clouds create lock-in through default integrations with their proprietary queueing services. For example, Google offers native integrations between BigQuery and Pub/Sub, while AWS makes it incredibly easy to tie SQS to Lambda, RDS, Redshift, and other AWS components.
Our solution to messaging lock-in is simple: we use self-hosted Redis Pub/Sub along with the excellent open source machinery project, which provides a Go queue abstraction over Redis that can be swapped for another OSS queue without changing application code. This approach has scaled to processing over 100 million events every day, and we didn't have to change a line of code when deploying message queues to a new cloud and region.
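The shape of that abstraction can be sketched in Go. This is a minimal in-memory stand-in, not the actual machinery API: application code depends only on the `Broker` interface, so a Redis-backed implementation can be substituted without touching callers.

```go
package main

import "fmt"

// Broker is a hypothetical minimal queue abstraction. Callers depend
// only on this interface, never on the backing queue technology.
type Broker interface {
	Publish(topic, msg string)
	Subscribe(topic string, handler func(string))
}

// memoryBroker is an in-process stand-in used here so the sketch
// runs without a Redis server; a production implementation would
// wrap Redis Pub/Sub (or another OSS queue) behind the same interface.
type memoryBroker struct {
	handlers map[string][]func(string)
}

func newMemoryBroker() *memoryBroker {
	return &memoryBroker{handlers: make(map[string][]func(string))}
}

func (b *memoryBroker) Subscribe(topic string, handler func(string)) {
	b.handlers[topic] = append(b.handlers[topic], handler)
}

func (b *memoryBroker) Publish(topic, msg string) {
	// Deliver the message to every handler registered for the topic.
	for _, h := range b.handlers[topic] {
		h(msg)
	}
}

func main() {
	// Swapping in a Redis-backed Broker changes only this one line.
	var broker Broker = newMemoryBroker()
	broker.Subscribe("deploys", func(msg string) {
		fmt.Println("received:", msg)
	})
	broker.Publish("deploys", "service-a deployed")
}
```

Because only the constructor names a concrete backend, moving queues to a new cloud or region is an operational change, not an application change.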
A combination of world-class open source projects and improved hybrid cloud support in major cloud platforms has made avoiding cloud lock-in easier than ever before for modern cloud platforms like Render. Avoiding lock-in does increase engineering investment, but in our experience, the peace of mind and the ability to use the best tool for the job makes the tradeoff well worth it.
Amazon Web Services is a sponsor of The New Stack.
Featured image via Pixabay.