A Look at Slack’s New GitOps-Based Build Platform
It was a necessary move: Slack is growing at a rapid pace, with revenue roughly doubling every year since 2014. When a company grows that fast, what once worked may no longer be the best solution. In this case, it was Slack's build platform.
A recent blog post written by a team of Slack engineers went into great detail on this topic.
Slack has used Jenkins as its build platform since its early days. Allowing each team to create its own customized Jenkins cluster, known as a "Snowflake Cluster," was a solid idea in 2014. But hyper-growth increased product and service dependencies on Jenkins, and different teams started using Jenkins for their own unique needs: plugins, credentials, security practices, backup strategies, managing jobs, upgrading packages, and so on.
In plain terms, there were enough "Snowflake Clusters" to cause an avalanche of complications, considering that each unique cluster had its own ecosystem rich with plugins to upgrade, vulnerabilities to deal with, and processes to manage.
There. were. challenges. A long list of challenges. And while every company has a long list of technical challenges unique to it, overall Slack's list read similarly to the universal reasons companies decide to modernize their tech: the code as it stood was effective but not optimal for the future, and it led to a loss in productivity. And there was technical debt. No one ever wants technical debt.
Though the system wasn’t optimal, a complete rewrite wasn’t needed. The goals of the modernization were to fix key issues, modernize deployed clusters, and standardize the Jenkins inventory.
At a high level, the Build team would provide a platform for “build as a service” with enough knobs for customization of Jenkins clusters.
Where to Start?
Slack did what we all do… they researched what large-scale companies were using for their build systems. Slack engineers also had the opportunity to meet with multiple companies to discuss their build systems, and these meetings helped them learn from, and where possible replicate, other build systems.
As someone who reads many engineering blog posts every week, I see the same build system requirements come up again and again.
The following is an incomplete list of features and concepts that Slack implemented (the team’s post is much more comprehensive):
Stateless and Immutable CI Service: Separating the business logic from the underlying build infrastructure made the CI service stateless. This led to quicker, safer building and deployment of build infrastructure, opened the door to shift-left strategies, and improved maintainability. All build-related scripts were moved to a repo independent of the one where the business logic resided. The team used Kubernetes to run the Jenkins services, which addressed immutable infrastructure, efficient resource utilization, and high availability. Every service was built from scratch, eliminating residual state.
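The stateless, built-from-scratch idea can be pictured with a minimal sketch (not Slack's actual code; the manifest fields and registry name are hypothetical): if a Jenkins controller's Kubernetes Deployment is rendered purely from declarative inputs, rebuilding it always yields the identical result, so no residual state survives a rebuild.

```python
# Illustrative sketch: a CI controller's deployment manifest is a pure
# function of declarative inputs -- same inputs, same manifest, every time.

def render_jenkins_deployment(cluster: str, image_tag: str, replicas: int = 1) -> dict:
    """Render a Kubernetes Deployment manifest for a Jenkins controller."""
    labels = {"app": "jenkins", "cluster": cluster}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"jenkins-{cluster}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "jenkins",
                        # Pinned, immutable image: an upgrade means a new tag,
                        # never an in-place mutation of a running controller.
                        "image": f"example.registry/jenkins:{image_tag}",
                    }]
                },
            },
        },
    }
```

Because nothing is mutated in place, recreating a cluster is just re-rendering and re-applying this manifest.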
Security Operations as part of the Service Deployment Pipeline: Obvious for many reasons in today’s never-ending blast of cyberattacks. Slack instituted identity and access management (IAM) and role-based access control (RBAC) policies per cluster. Vulnerability scanning takes place each time the Jenkins service is built.
More shift-left to avoid finding issues later: Testing is definitely the move here, and this one specifically is coming up more and more. It is always better to find bugs in development than to find them in production.
Slack used a blanket test cluster and pre-staging area to test small- and large-impact changes to the CI system before they hit the rest of the staging environments. This also allowed high-risk changes to bake for an extended period before being pushed to production. The additional testing led to better developer productivity and an improved user experience, similar to results reported in other shift-left articles.
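A staged rollout with bake time can be sketched as a simple promotion gate. This is a hypothetical illustration, not Slack's implementation; the stage names, bake durations, and the high-risk multiplier are all assumptions:

```python
from datetime import datetime, timedelta

# Hypothetical pipeline stages and per-stage bake times (assumed values).
STAGES = ["test", "pre-staging", "staging", "production"]
BAKE_TIME = {
    "test": timedelta(hours=4),
    "pre-staging": timedelta(days=1),
    "staging": timedelta(days=2),
}

def can_promote(stage: str, deployed_at: datetime, now: datetime,
                high_risk: bool = False) -> bool:
    """A change may move to the next stage only after baking long enough;
    high-risk changes (assumed here) must bake three times as long."""
    if stage == STAGES[-1]:
        return False  # nothing beyond production
    required = BAKE_TIME[stage] * (3 if high_risk else 1)
    return now - deployed_at >= required
```

The gate makes the "bake in for an extended time period" idea mechanical: promotion is a function of elapsed time and risk, not a manual judgment call.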
In Addition to New Features, the Clusters Had to Be Standardized
Standardization in this case meant that a single fix could be applied uniformly across the Jenkins inventory. For this, Slack used CasC, Jenkins' Configuration as Code plugin.
Central storage ensured all Jenkins instances used the same plugins, avoiding snowflaking. This allowed for automatic upgrades and eliminated both the need for manual intervention and version incompatibilities.
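One way to picture this standardization (a hypothetical sketch, not Slack's tooling; the plugin names and version numbers are made up) is a drift check comparing each cluster's installed plugins against a single central list:

```python
# Hypothetical central plugin list: the one source every cluster must match.
CENTRAL_PLUGINS = {"git": "5.2", "configuration-as-code": "1.55"}

def find_drift(cluster_plugins: dict) -> dict:
    """Return, per cluster, the plugins that deviate from the central list:
    wrong versions, missing plugins, or unapproved extras (snowflakes)."""
    drift = {}
    for cluster, plugins in cluster_plugins.items():
        problems = []
        for name, version in CENTRAL_PLUGINS.items():
            if plugins.get(name) != version:
                problems.append(name)  # missing or stale
        problems.extend(sorted(set(plugins) - set(CENTRAL_PLUGINS)))  # extras
        if problems:
            drift[cluster] = problems
    return drift
```

With a check like this in the pipeline, a version bump in the central list propagates to the whole fleet, and any cluster quietly growing its own plugin set is flagged immediately.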
GitOps Style Management
Git became the single source of truth, and nothing was built or run on the Jenkins controllers themselves; this was enforced with GitOps. Configurations were managed through templates, making it easy for users to create clusters and to change configurations by reusing existing ones.
The entire build infrastructure could be recreated from scratch with the exact same result every time as all infrastructure operations came from Git using the GitOps model.
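Template-based configuration of this kind can be sketched as a deep merge: every cluster starts from a shared base held in Git, and a small per-cluster override file is layered on top. This is an illustrative sketch under assumed keys, not Slack's actual schema:

```python
import copy

# Hypothetical shared base template, versioned in Git as the source of truth.
BASE_TEMPLATE = {
    "executors": 2,
    "plugins": ["git", "configuration-as-code"],
    "security": {"rbac": True, "vuln_scan": True},
}

def render_cluster_config(overrides: dict) -> dict:
    """Merge per-cluster overrides onto the shared base, one level deep,
    without mutating the base template itself."""
    config = copy.deepcopy(BASE_TEMPLATE)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key] = {**config[key], **value}  # merge nested sections
        else:
            config[key] = value  # scalar/list overrides replace outright
    return config
```

Creating a cluster then amounts to committing a small overrides file, e.g. `render_cluster_config({"executors": 8})`, and because the base never mutates, every cluster can be regenerated from Git with the same result.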
Debugging was aided by enabling metrics, logging, and tracing on each cluster. Credentials could now be reused on applicable clusters. Upgrading the Jenkins operating system, packages, and plugins was quick because everything was contained in a Dockerfile.
The features listed here, along with others included in the original article, make up the diagram in that post representing the flow of the new build system.
The Build team managed the systems in the build platform infrastructure, while the remaining systems were managed by service-owner teams using the build platform.
Challenges and Conclusion
As we mentioned, there was a long list of challenges. But overall, the modernization effort led to a lot of learning, teaching, debugging, and top-notch documentation writing. The results graphed in the original post show the process was well worth the struggle, though.
Individual services were built and deployed quickly, safely, and securely. Time to address security vulnerabilities went down, and standardization of the Jenkins inventory reduced the multiple code paths required to maintain the fleet. Infrastructure changes could be rolled out quickly, and rolled back just as quickly if required.
The migration started slowly with a few existing production build clusters. New clusters are now being built with the new system, which is what's helping improve the delivery timelines. The migration of all services into the new build system is underway, and new features are being added.
Tech changes on a dime, there is a difference between operational and optimal, and sometimes it's just time to dive in and make the change.