
How Kubernetes Saved Optoro from the Cloud That Cost too Much

25 May 2020 9:00am, by Zach Dunn

Kubernetes is highly available, allows for the deployment of containerized applications, is extensible, and has matured to the point where its benefits are increasingly relevant and measurable. It lets you deliver the scale, stability and efficiency that customers expect. It also enables you to break things quickly so you can innovate, make mistakes, and find a way back.

In this article, I will explain why we moved Optoro’s Ruby on Rails stack from an Amazon Web Services Virtual Private Cloud (VPC) to an on-premise Joyent Smart Data Center (SDC) installation, and how this path led the company straight into containerization and the benefits of the open source Kubernetes container orchestration engine.

Why We Left the Cloud

Zach Dunn
Zach Dunn is the Senior Director of Platform Operations and CISO at Optoro where he helps teams build and deliver modern software for enterprise clients. He has spent most of his tech career in roles that loosely resemble production infrastructure, with responsibilities such as hot aisle drudgery, professional nerd herder and budget owner. Once recognized as an individual, these days he is commonly referred to as Arabella's Dad or on occasion William's Dad.

Most companies believe that the cloud is the future. We started with EC2-Classic in AWS, where everything was wild and open. Optoro was cloud native before the term became cool, and it stayed that way after I joined, when we moved the company to AWS VPCs in 2015. The entire infrastructure was built on Ruby on Rails, supported by a series of relational databases and backend applications.

Then it got expensive.

As a fast-growing company, we wanted to give our developers the ability to build and try new things. As a nascent startup, however, we had to keep cash at the forefront of our decision-making. We were spending a lot of money just to run our servers, and because we started out in EC2, it was a classic VM architecture. We couldn’t really scale our application up and down a lot, and many of our integrations ran 24 hours a day, which wasn’t ideal. We were listing inventory, processing requests from warehouses and handhelds, but it wasn’t very cyclical.

We also ran into cases where we needed a little more headroom than the dataset implied. My classic example is our Redis queues, where we ran slightly larger machines than we needed because we were running on shared hardware at Amazon. We had to hedge our bets, so we paid more than we needed to. A lot of the bursting capabilities weren’t that useful for us, and we were doing a lot of network transactions and disk I/O. Every month, 20-30 percent of our bill was just I/O, and those bills were getting bigger and bigger.

It needed to change. We couldn’t offset the I/O, and even doubling down on spend, we realized that we could run this more cheaply on-premise. Optoro was in a steady-state in terms of costs — the APIs and databases were never powered down, so costs would increase or decrease with cloud expansion and contraction. By moving into the data center, we could level-set these costs and gain more predictability around financial management. So instead of going the cost optimization route, we pulled up, pulled the lever and punched out.

Why the Data Center Migration Worked

When we first moved from AWS to on-premise, the first iteration used Joyent SDC, an open source Infrastructure-as-a-Service (IaaS) platform, which we built and managed with HashiCorp’s Terraform. When we were inside AWS, we were already leaning toward the Consul and HashiCorp stack, so that didn’t change when we moved into our own data center. We learned several lessons and gained several benefits along the road of the migration.

For example, it’s essential to know where your constraints are. When we looked at our application, it didn’t make sense to run it on rented hardware. We got down to what it would take to migrate our application. When you work your way back on-premise, you can build better systems as you iterate the process. When you understand the shortcomings of your platform, you understand where the strengths lie.

The benefits of the migration were:

  • We’d been squeezing as much performance as possible out of machines in the cloud, but these constraints went away with our migration to bare metal.
  • Capacity doesn’t cost more if we use it. In fact, we’re incentivized to use it because we’re getting value out of the tin we’ve paid for. This is our first-generation tech with 32 cores, 256GB RAM, SSDs – our machines delivered far more capacity and performance than the cloud.
  • We continued to deploy images and manage our instances with Terraform, just as in AWS.
  • Our direct-attached storage meant no contention around I/O any longer.
  • We had measurable cost savings that took us by surprise.
  • We can isolate workloads more easily because we control the metal, and our performance has improved markedly.
  • By being more performant, we bought more capacity through the efficiency of the platform.

We sat down and did the math with our board of directors and our financial analyst. We modeled the move to the data center to estimate how much we’d save over a set number of years. Price stability is a big deal for us, as is predictability. We wanted to make the case that the data center was a better fit. As a startup, we didn’t want to put down a lot of cash on something new in order to get a lower run rate.

When we ran the numbers, we worked out that we were looking at 60 percent savings — we weren’t trying to ballpark or estimate opportunity cost. These were the hard costs, and they made a lot of sense to the board and to us. The board knew what we were doing, and we had their buy-in, which was critical. Change is difficult and buy-in is essential.

They trusted us to pull it off, and we did. About two-and-a-half years later, one of them asked my boss about the model we had presented. They wanted to know if it worked and if the costs were what we thought they would be. We dug it all back out, and we realized that we’d been wrong. We’d saved more.
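The kind of model described above, upfront spend traded against a lower monthly run rate, can be sketched in a few lines of Python. This is only an illustration and not Optoro’s actual model: the CostModel class, its field names and every figure in the usage example are hypothetical placeholders, and a real model would also account for things like hardware refresh cycles and staffing.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CostModel:
    """Hypothetical inputs; none of these figures come from the article."""
    cloud_monthly: float   # current monthly cloud bill
    onprem_upfront: float  # one-time hardware and setup spend
    onprem_monthly: float  # recurring data center run rate


def cumulative_costs(model: CostModel, months: int) -> Tuple[float, float]:
    """Return (cloud_total, onprem_total) after the given number of months."""
    cloud_total = model.cloud_monthly * months
    onprem_total = model.onprem_upfront + model.onprem_monthly * months
    return cloud_total, onprem_total


def savings_fraction(model: CostModel, months: int) -> float:
    """Fraction of the cloud spend saved by running on-premise instead."""
    cloud_total, onprem_total = cumulative_costs(model, months)
    return (cloud_total - onprem_total) / cloud_total


def break_even_month(model: CostModel) -> Optional[int]:
    """First whole month at which on-premise is no more expensive than cloud.

    Returns None when the on-premise run rate is not lower than the cloud
    bill, because the upfront spend is then never recovered.
    """
    monthly_saving = model.cloud_monthly - model.onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(model.onprem_upfront / monthly_saving)


if __name__ == "__main__":
    # Placeholder numbers chosen purely for illustration.
    example = CostModel(cloud_monthly=50_000,
                        onprem_upfront=400_000,
                        onprem_monthly=30_000)
    print(break_even_month(example))
    print(round(savings_fraction(example, 36), 3))
```

Keeping the upfront spend separate from the run rate mirrors the tension in the text: the lower monthly cost only pays off after the break-even month, which is why the time horizon of the model matters as much as the headline savings figure.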

Why Data Center Migration Might Not Be for You

Migration from cloud to on-premise isn’t for everyone. You can’t give up the idea that you’re still writing infrastructure code. You’re still going to have generations of servers you’re going to build, and you need to be able to expand, replace and re-architect as you go. There is a risk you can paint yourself into a corner.

How Kubernetes Changed the Future

We’d been through two migrations: from Amazon EC2-Classic to Amazon VPCs, and then to on-premise with the Joyent SDC. Joyent created a Docker API that could be scheduled across the data center, and right there, our containerization journey began. We started to convert our estate of VMs into Docker containers, and in 2018 we began to experiment with Kubernetes. That marked our third migration, this one to Kubernetes on metal.

We were already mimicking a Platform-as-a-Service (PaaS) approach with Consul running on Joyent’s SDC using Terraform for provisioning. Containerization was already part of our trajectory, so it made sense to move from emulating a PaaS to adopting a platform in earnest. Kubernetes had matured and had more momentum since our first investigations. We felt we could learn from the experiences of the industry. Our senior engineers conducted several sprints testing Minikube locally to see how easily Kubernetes would integrate with existing infrastructure tools and applications.

We also knew we didn’t want to absorb additional costs. We needed a business case and we couldn’t find one for OpenShift, GKE or EKS. We wanted to allow our developers to keep managing their clusters directly. This is when we found Rancher. Its research into the business case for container adoption helped us to understand more about the technology and resulted in a successful PoC. Since then we have steadily migrated our services: we now run 13 of our 42 services in production, and the goal is to move the entire infrastructure across this year.

Looking forward, we decided not to refresh our existing Joyent platform. Instead, we bolted our Kubernetes clusters onto the side of our existing infrastructure. That lets us bring in as much capacity as we need, using Kubernetes to bring nodes up and down. Just as important, if we need more capacity, or even to triple the size of our data center, we need only expand the clusters in the cloud. Then if we know that’s going to be our steady state, we can buy back our baseload and optimize for cost.
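The baseload idea can be illustrated with a small Python sketch: owned nodes are paid for around the clock, while burst nodes in the cloud are paid for only in the hours they run. The function names, the demand series and both hourly rates are hypothetical assumptions made for this example and do not describe Optoro’s real capacity planning.

```python
from typing import List, Tuple


def split_capacity(hourly_demand: List[int], baseload: int) -> Tuple[List[int], List[int]]:
    """Split each hour's node demand into owned baseload and rented burst."""
    owned = [min(demand, baseload) for demand in hourly_demand]
    burst = [max(demand - baseload, 0) for demand in hourly_demand]
    return owned, burst


def blended_cost(hourly_demand: List[int], baseload: int,
                 owned_rate: float, burst_rate: float) -> float:
    """Owned nodes cost money every hour, used or not; burst nodes only when used."""
    _, burst = split_capacity(hourly_demand, baseload)
    owned_cost = baseload * owned_rate * len(hourly_demand)
    burst_cost = sum(burst) * burst_rate
    return owned_cost + burst_cost


def cheapest_baseload(hourly_demand: List[int],
                      owned_rate: float, burst_rate: float) -> int:
    """Try every baseload size from zero to peak demand and keep the cheapest."""
    candidates = range(0, max(hourly_demand) + 1)
    return min(candidates,
               key=lambda b: blended_cost(hourly_demand, b, owned_rate, burst_rate))


if __name__ == "__main__":
    # Placeholder demand (nodes needed per hour) and per-node hourly rates.
    demand = [10, 10, 12, 30, 10, 11]
    best = cheapest_baseload(demand, owned_rate=1, burst_rate=4)
    print(best, blended_cost(demand, best, owned_rate=1, burst_rate=4))
```

With a mostly flat demand curve and one short spike, the cheapest baseload sits near the steady level rather than the peak, which is the point of buying the baseload and renting the rest.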

The benefits have been extensive:

  • We can spin up environments to try out new ideas and tear them down when something breaks. This enables us to know we are truly building declarative infrastructure.
  • We have the flexibility to extend our network of US data centers on demand.
  • Our Quality Assurance team appreciates the granular visibility of the cluster, which allows them to check permissions, redeploy applications and view and monitor individual logs.

Conclusion

My team is unconstrained and can bring in new tools without the friction that comes with most commercial platforms. We can be creative and do more without being tied down by a closed environment. We can also change strategy quickly, which makes a big difference in our dynamic organization. There were lessons learned, and the move to Kubernetes wasn’t without its challenges, but it makes our lives easier. That’s the beauty of Kubernetes.

Feature image via Pixabay.
