Atlassian is known for keeping software developers and operations people organized. Traditionally, the company has focused on the collaboration and information management tools that track bugs, customer complaints, documentation, and source code. But like any good software tools provider, Atlassian is using all of the hottest new technology behind the scenes.
That includes Kubernetes, which Atlassian has been using extensively in its internal build systems. That has led to a sticking point with the popular container orchestration platform, however: new nodes aren’t coming online fast enough for Atlassian’s tastes.
As would be expected from a company heavily promoting DevOps tools, Atlassian’s internal build process is intense and designed for speed. By throwing large numbers of nodes at a build, Atlassian can shorten feedback loops for developers, allowing them to keep their heads down and continue working while a build is running.
Unfortunately, some of those builds require hundreds of servers, spun up for only a short period of time. The team was seeing three- to four-minute windows between when a set of nodes was requested and when the auto-scaling algorithms inside Kubernetes were able to catch up to the demand, observed Corey Johnston, Atlassian’s senior team leader for the Kubernetes platform team.
The problem lies inside the Kubernetes auto-scaling routines. These systems are designed to be general-purpose, accommodating a host of scaling needs. Thus, each time more nodes are needed, the auto-scaler inside Kubernetes goes through an elaborate decision-making process, calculating exactly what could be needed and what past usage would indicate.
The result of this mechanical thinking was a slow node provisioning process once the node count hit multiple hundreds. Atlassian took a long time to consider the problem and to make its own decision: should it add the optimizations it needed to the Kubernetes mainline, or simply build its own project?
In the end, it chose the latter. The team at Atlassian found that their real problem was that the Kubernetes auto-scaler was more sophisticated than their needs required: the Atlassian team had discrete, large-scale needs that negated the usefulness of the internal Kubernetes decision-making trees. Thus, Escalator was born.
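The contrast the team describes can be illustrated with a sketch. Instead of simulating pod placement the way the stock auto-scaler does, a threshold-based approach compares requested capacity to provisioned capacity and scales in a single step. The function name, inputs, and threshold below are illustrative assumptions, not Escalator’s actual code or configuration.

```go
package main

import (
	"fmt"
	"math"
)

// decideScaleUp is a hypothetical sketch of threshold-based scaling:
// if requested capacity pushes utilization over thresholdPct, return the
// number of nodes to add so utilization falls back at or under the threshold.
func decideScaleUp(requestedCPU, perNodeCPU float64, nodes int, thresholdPct float64) int {
	capacity := perNodeCPU * float64(nodes)
	if capacity > 0 && requestedCPU/capacity*100 <= thresholdPct {
		return 0 // under threshold: nothing to do
	}
	// Smallest node count whose capacity keeps utilization at or below the threshold.
	needed := int(math.Ceil(requestedCPU / (perNodeCPU * thresholdPct / 100)))
	if needed <= nodes {
		return 0
	}
	return needed - nodes
}

func main() {
	// 90 cores requested across 8 ten-core nodes, targeting 70% utilization:
	// 13 nodes are needed, so 5 are added in one decision.
	fmt.Println(decideScaleUp(90, 10, 8, 70)) // prints 5
}
```

The decision is a single arithmetic step rather than a tree of placement simulations, which is the kind of simplification that suits large, bursty CI/CD workloads.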
Escalator is an Apache-licensed auto-scaler for Kubernetes. “Our workloads here are a lot of CI/CD workloads which are going through our clusters. The first workload we moved to Escalator was our build engineering for building all our products here. When we onboarded those guys, that’s when we saw our cluster node count going from single nodes to hundreds of nodes per cluster at that scale where we experienced these problems,” said Johnston.
The move to Escalator brought big time and money savings for the Atlassian team, said Johnston. “You’ve got the situation when your peak is over, and you’re moving into a trough period: a bunch of extra capacity you’re paying for which is essentially unused. The old auto-scaler was taking up to 14 hours in some cases to fully offline virtual machines. At our scale, that can represent serious financial costs. Escalator replaces that. It makes decisions about what is required to support the user workloads and adjusts accordingly,” said Johnston.
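The scale-down side Johnston describes can be sketched the same way: after the peak, work out how many nodes are surplus against the utilization target, but cap removals per reconciliation cycle so capacity drains quickly without destabilizing running work. Again, the names, numbers, and rate cap here are illustrative assumptions rather than Escalator’s actual configuration.

```go
package main

import (
	"fmt"
	"math"
)

// nodesToRemove is a hypothetical sketch of post-peak scale-down: return how
// many surplus nodes to mark for removal this cycle, keeping utilization at or
// below targetPct on the remaining nodes and removing at most maxPerCycle.
func nodesToRemove(requestedCPU, perNodeCPU float64, nodes int, targetPct float64, maxPerCycle int) int {
	// Smallest node count that still keeps utilization at or below the target.
	keep := int(math.Ceil(requestedCPU / (perNodeCPU * targetPct / 100)))
	if keep < 1 {
		keep = 1 // always retain at least one node
	}
	surplus := nodes - keep
	if surplus <= 0 {
		return 0
	}
	if surplus > maxPerCycle {
		return maxPerCycle // rate-limit removals per cycle
	}
	return surplus
}

func main() {
	// After the peak: 60 cores requested across 40 ten-core nodes, a 70%
	// target, and at most 5 removals per cycle. 31 nodes are surplus, so
	// 5 are drained this cycle and the rest on subsequent cycles.
	fmt.Println(nodesToRemove(60, 10, 40, 70, 5)) // prints 5
}
```

Repeating a cheap decision like this every cycle is what lets idle capacity come offline in minutes rather than the hours the old auto-scaler took.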
Feature image via Pixabay.