Platform Engineering Helps a Scale-up Tame DevOps Complexity
Going from startup to scale-up is a great moment for any tech company. It means you have strong customer traction and enough proof of value to expand your reach to new markets and verticals.
But it also means it’s time to scale up your technology, often in the cloud. And that isn’t easy.
Capillary Technologies, which builds Software as a Service (SaaS) products within the customer loyalty and engagement domain, saw its customers increase in number from 100 to 250. It started experiencing the typical scale-up growing pains, Piyush Kumar, the company’s CTO, told The New Stack.
As Capillary’s team grew significantly, so did its DevOps complexity. Read on to see if these challenges ring true for you and how Capillary Technologies leveraged Facets.cloud self-service infrastructure management and adopted platform engineering to boost developer productivity and deliver value to end customers faster.
DevOps Doesn’t Scale by Itself
When Kumar joined Capillary as a principal architect in 2016, the company’s presence was growing in India, Southeast Asia and the Middle East, while starting to gain traction in China. But when it looked to go further, this company built on Amazon Web Services (AWS) started hitting some common roadblocks in the cloud.
“The ratio of number of developers to the number of people in our DevOps infrastructure team was starting to get skewed,” Kumar said. “That meant that the number of requests going in from the engineers to the DevOps teams was growing, so the operations tickets were basically growing, and our response times were beginning to slow down.”
Toward the end of 2019, Capillary started to expand to new markets and cloud regions in the U.S. and Europe. These opportunities also presented challenges.
“Newer regions essentially meant spinning off the entire software, infrastructure, monitoring, everything else in a different region,” he said.
Launching in new regions requires organizations to adhere to data sovereignty and data localization laws.
As these launches occurred, Capillary’s infrastructure was in a semi-automated mode. “When you’re in that mode, there are things that are automated and then there are quite a few things that are not. So you don’t have enough visibility into your overall environment stack,” said Kumar.
New regions brought a lot of surprises — the DevOps team had to grow to manage the new environments, and had to meet the new demands of the growing customer base, product portfolio and required number of infrastructure components.
At the same time, Capillary grew from about 100 to 250 engineers.
“We didn’t want stability to start to take a hit, because we now needed to release across multiple environments,” Kumar said. In short, he noted, “more than linear scaling was needed to manage all of this.”
The Cloud Native Complexity Problem
A lot of platform engineering initiatives are sparked by struggles with disparate dev and ops tooling. This was not the case at Capillary, which has always had centrally managed infrastructure.
This is why, to battle this complexity at scale, the team members logically tried to increase the automation coverage of their infrastructure. But they found themselves stuck in a constant game of catch-up.
“So we tried to continue to automate more and more, and it continued as a team, where you would do more and then you will realize that there is more to be done, so it felt like a constant battle because that landscape kept growing,” Kumar said.
“In six months, whatever we went ahead and automated, we basically carried newer debt, so there was more to be automated.”
For instance, they adopted the open source database MongoDB to bring new infrastructure, storage and database capabilities into the Capillary ecosystem. The DevOps team soon realized that they couldn’t easily automate everything, from launching in new regions to monitoring, backups, upgrades, patches and restoration.
By the time the Capillary teams automated whatever they could, they had also adopted Apache Kafka for real-time data streaming and Amazon EMR to run and scale workloads, which they then also tried to automate.
Capillary’s teams had gone the open source route to avoid vendor lock-in. But whether they went open source or proprietary, they realized the complexity of the cloud native landscape meant a lot of stitching automation toolchains together.
To tackle this, they needed:
- Something that would make the overall infrastructure and deployment architecture more uniform, more visible and 100% automated, from build to deploy.
- To move developers from relying on the DevOps team to provisioning infrastructure in a self-service way. This includes documentation uniformity to create a single source of truth.
- A tool to manage the environment, infrastructure and deployment.
The solution Capillary sought, Kumar said, would allow users to “go ahead and create a document. You would say that this is my source of truth. And now I go ahead and do all of this in this way, and I do it uniformly all the time.”
In short, he wondered, “Is this something that a software could translate in terms of managing your environment, infrastructure, deployments, everything?”
Building an Infrastructure Blueprint
A lot of companies kick off their adoption of platform engineering with a journey of discovery. They literally ask themselves: what technology do we have and who owns what?
In late 2020, Capillary began partnering with Facets to co-build a solution to help answer this question. Capillary chose Facets in part because it automated the cataloging of applications, databases, caches, queues and storage across the infrastructure, as well as the interdependencies among them. This cataloging helped to create a deployment blueprint of how architecture should look in an environment.
“Once you have a single blueprint, then whatever it is you do downstream in terms of launching your infrastructure, in terms of running your applications, in terms of monitoring and managing, everything becomes a downstream activity from there,” Kumar said.
“This essentially is the piece which brings in good visibility and a standardized structure of how your blueprint would look like for your entire environment and applications.”
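To make the idea of a blueprint with interdependencies concrete, here is a minimal sketch in Python. It is not Facets’ actual format or API: the resource names, the `kind` and `depends_on` fields, and the functions `launch_order` and `plan_environment` are all hypothetical, chosen only to show how a single catalog can drive the same launch plan in any region.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical blueprint: every resource declares what it depends on.
# These names and fields are illustrative, not Capillary's or Facets' schema.
BLUEPRINT = {
    "orders-db": {"kind": "database", "depends_on": []},
    "session-cache": {"kind": "cache", "depends_on": []},
    "events-queue": {"kind": "queue", "depends_on": []},
    "loyalty-api": {
        "kind": "application",
        "depends_on": ["orders-db", "session-cache", "events-queue"],
    },
    "reporting-worker": {
        "kind": "application",
        "depends_on": ["events-queue", "orders-db"],
    },
}


def launch_order(blueprint):
    """Return resource names ordered so that dependencies come first."""
    graph = {name: set(spec["depends_on"]) for name, spec in blueprint.items()}
    return list(TopologicalSorter(graph).static_order())


def plan_environment(blueprint, region):
    """Derive a launch plan for one region from the shared blueprint."""
    return [
        f"{region}: launch {blueprint[name]['kind']} {name}"
        for name in launch_order(blueprint)
    ]


if __name__ == "__main__":
    for step in plan_environment(BLUEPRINT, "eu-west-1"):
        print(step)
```

Because the plan is computed from the blueprint rather than written by hand, calling `plan_environment` with a different region yields the same steps in a valid order, which is the “downstream activity” Kumar describes.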
Another reason Capillary went with Facets is that it was running 10 environments globally, three for testing and the rest in production. With that many environments, the whole process of migrating to Facets took four to five months, including verifying that all existing data had migrated.
The teams specifically spent about three months moving the testing environments to ensure that everything worked perfectly. The production environments, Kumar said, were much faster to move.
By mid-2021, Kumar’s team had witnessed some clear results:
Operations Tickets Down by 95%.
“What we’ve been able to do with Facets is that we have created a self-service environment where, as a developer, if you have to create a new application, you go ahead and add it into that catalog,” Kumar said. “Somebody in your team, like your lead or architect, will go ahead and approve that. And then it gets launched on its own. There is no involvement required from the DevOps team.”
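The flow Kumar describes (a developer adds an application to the catalog, a lead or architect approves it, and it launches on its own) can be sketched as a small state machine. This is an illustrative Python sketch only: the `Catalog` and `AppRequest` classes, the status names and the people named in the usage example are assumptions, not the Facets interface.

```python
from dataclasses import dataclass, field


@dataclass
class AppRequest:
    name: str
    requested_by: str
    status: str = "pending"  # hypothetical states: pending -> launched


@dataclass
class Catalog:
    approvers: set  # leads or architects allowed to approve requests
    requests: dict = field(default_factory=dict)

    def add(self, name, developer):
        """A developer registers a new application in the catalog."""
        request = AppRequest(name=name, requested_by=developer)
        self.requests[name] = request
        return request

    def approve(self, name, approver):
        """Approval by a lead or architect triggers the launch automatically."""
        if approver not in self.approvers:
            raise PermissionError(f"{approver} is not allowed to approve")
        request = self.requests[name]
        # Stand-in for the automated launch; no ticket to a DevOps team.
        request.status = "launched"
        return request


if __name__ == "__main__":
    catalog = Catalog(approvers={"lead-asha"})
    catalog.add("coupon-service", developer="dev-ravi")
    print(catalog.approve("coupon-service", approver="lead-asha").status)
```

The point of the sketch is that the only human step is the approval inside the developer’s own team; everything after it is automated.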
The DevOps teams were no longer involved in day-to-day software launches. Now they were able to run about 15 environments across two product stacks with a six-member DevOps team.
In fact, Capillary renamed its DevOps team “SRE and developer experience,” pivoting to site reliability engineering and creating solutions to enable its developers.
Overall Uptime Increased from 99.8% to 99.99%.
“Our environment stability has basically taken a massive movement forward,” Kumar said. “Our environments are monitored continuously. Anything that you are seeing as a blip will basically get alerted. Your backups, your fallbacks, they are all pretty standardized.”
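To put those percentages in perspective, the short calculation below converts an uptime figure into allowed downtime. It assumes the figures are measured over a full year of 365 days, which the article does not state; the function name `allowed_downtime_hours` is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an uptime percentage over a period."""
    return (100 - uptime_percent) / 100 * period_hours


if __name__ == "__main__":
    before = allowed_downtime_hours(99.8)
    after = allowed_downtime_hours(99.99)
    print(f"99.8%  uptime: about {before:.1f} hours of downtime per year")
    print(f"99.99% uptime: about {after * 60:.0f} minutes of downtime per year")
```

Under that assumption, 99.8% uptime allows roughly 17.5 hours of downtime a year, while 99.99% allows a little under 53 minutes, about a twentyfold reduction.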
A 20% Increase in Developer Productivity.
“The biggest thing that has happened is that the queue time or the wait time on the DevOps team is gone,” Kumar said.
There’s also now uniformity across engineering operations, including logs and monitoring, which further increases developer productivity.
“And because our releases are completely automated, the monitoring of releases is completely automated,” Kumar said.
This has meant that over the last two years, the Capillary team has gone from releasing every two weeks to now releasing daily. Plus they’ve moved into an automated, unattended release mode with verifications. Now, said Kumar, “In case something is broken, you will get an immediate alert on that to go ahead and attend.”
The Capillary engineering team continues to grow as it adds new products, the CTO said, and to become more efficient. In 2016, it took 64 developer weeks to launch an environment. Now, it takes just eight developer weeks, including all verifications and stabilization.
Using the blueprint the company created with Facets, he said, the users have to define how a new environment “will handle this kind of workload and hence, this is the kind of capacity that is required. And so once you set that up, the environment launch is all automated. So you save a lot of time on that.”
Earlier this year, Capillary acquired another tech company, which required the launch of a new developer environment. The engineering team was able to define the blueprint within Facets and launch a new environment in two and a half weeks.
Greater Visibility of Infrastructure Costs.
Finally, three to four years ago, Kumar could only monitor infrastructure costs through post-mortem analysis, which delayed responses and let costs leak. Now, he said, Facets has helped with auditing and given the company more visibility into how it uses its infrastructure and where it is over-provisioning.