The Hidden Pain of DIY On-Premises K8s-Based Software Distribution
This is part of a series of contributed articles leading up to KubeCon + CloudNativeCon in October.
Let’s explore the experience of companies trying to build their own software distribution tooling. This hypothetical scenario is based on a Software-as-a-Service (SaaS) company or a traditional on-premises software company that is delivering its app to customer Kubernetes (K8s) environments in the cloud for the first time. Think of it as a composite of many people’s experiences. We hope you don’t make the same mistakes!
A Timeline of Hope and Pain
Day 0 — The sales or product team asks engineering simple-sounding questions: “Can we deliver our SaaS application into our customer’s self-hosted Kubernetes environments?” or “Now that we’ve modernized and containerized our application, can we distribute it to customer-managed clusters in the cloud?” Either way, what they are really saying is, “Our prospects keep asking us to do this, and we’re leaving money on the table every time we say ‘no.’”
Day 1 — How hard can it be? The lead engineer spends a couple weekends hacking out a rough solution, very excited to build something new. It seems to be fairly straightforward to refactor the app to work in any AWS or customer-hosted environment, right? We could use Terraform, maybe.
Day 30 — The field engineers deliver the app to their first customer-hosted K8s cluster running in an AWS virtual private cloud (VPC). The proof-of-concept (POC) installation doesn’t go as smoothly as hoped, but after a couple of escalations to engineering and some patience from the customer, they finally get the app deployed. High fives!
Day 45 — The lead engineer has shipped several updates and changes to the new “on-premises” K8s installer to make it work. A production install is started in a different environment, but it’s not working the same way, and no one is quite sure why. More and more engineering time is being spent on Zoom with the customer, whose frustration is steadily growing. Other modernization, innovation and/or backlog work is starting to take priority, and this project is starting to look a lot more complicated than expected. The sales team is getting a bit nervous about their account and escalating to management.
Day 60 — The project is no longer fun and continues to suck time and people. The Terraform scripts are failing security reviews at some companies. The lead engineer asks the manager to get them off this ASAP because they are burning out. The company doesn’t want to halt the project because product and sales are close to closing this customer. There are a surprising number of on-premises and K8s cluster-based opportunities in the pipeline, and in this economy, the vice president of sales doesn’t want to turn away any revenue. The head of engineering begrudgingly assigns more engineers to work on the on-premises installer project, delaying the schedule for other planned app features and innovations.
Day 180 — A lot has gone on in the past four months. New customers are running the installer, but each one has a slightly different environment and installation requirements. A few examples:
- While the first customer accepted the Ubuntu-based installer, the next customer wanted a RHEL installer. So the team spent two weeks building a second package and designing a CI/CD pipeline to build and test it in parallel with the Ubuntu-based package.
- Two government and financial services customers needed air gap installers. Engineers decided this is too much effort with everything else going on. This represents a substantial hit to the revenue stream that drove the idea in the first place.
Day 270 — With mixed failures and successes, the on-premises K8s install initiative carries on in fits and starts. More issues keep popping up. The install success rate is hovering around 50%: half of the attempted installs end with the customer fed up and losing trust. Other customers and prospects keep asking for it, and a number of big accounts are now deployed with it, so it seems impossible to turn back, but the quagmire is getting deeper:
- One customer runs into some common vulnerabilities and exposures (CVEs), which block an install, and it’s an all-hands-on-deck late-night scramble to patch the vulnerabilities and get everything stable again.
- Several customers have now (auto-)upgraded their Linux operating systems, which unfortunately broke the app packages, requiring rework and updates to the installer. It looks like this will happen at least once a quarter.
- Mysterious storage and networking failures have required more than 10 hours of hands-on troubleshooting across several weeks.
- The first customer to install has yet to upgrade their installer and is at risk due to unpatched bugs, which were fixed long ago in newer versions. Because the first version was not built with a self-serve upgrade path in mind, engineers spend another 10+ hours helping the customer perform a very manual migration to the latest version of the tool.
- Despite management efforts to bring in other team members to the project, the lead engineer who built v1 is still constantly pulled into on-premises install support escalations.
- One end customer had modified the base image for Ubuntu to change the names of all the default network interfaces. More mysterious network issues cause problems until this change is discovered.
- In environments where the customer brings their own Kubernetes cluster, the team encounters 10 different flavors of Kubernetes ingress that need to be supported by the application configuration. Every single one takes hours to fix and takes time away from other engineering work.
- Several end customers need enterprise long-term support (LTS) versions, which creates internal chaos and more firefighting. There’s a need to hire and train a lot of support engineers on Kubernetes or just keep escalating to engineering.
Day 360 — One year in, the engineering team, exasperated and burnt out, holds another all-hands-on-deck meeting to reset and decide what to do. Everyone dreads doing a rotation on the on-premises installer team; some people actively seek to get off the team. A few veteran engineers sit permanently on the team because they understand that without them, a big source of revenue would be in jeopardy. Engineering and product leadership agree to deemphasize new feature work to give the team up to 50% of their time for three months to invest in the install tooling. While they’re at it, engineering agrees to spend significant time developing the air gap installer that more and more customers are requesting. The team develops a wishlist for everything they’d want:
- Set up CI/CD and automated testing for all releases of the application in all supported environments.
- Convert the ragtag collection of hard-to-maintain Bash scripts used to collect diagnostic info into a CLI tool that can be delivered with the installer. Consolidate them into a framework that allows field engineers to contribute to the list of information that gets collected. Stretch goal: Package the internal scripts used to analyze these log bundles for common errors into a tool that end customers can run in their own environment.
- Design so that the team can centralize on one architecture and install method, and solutions architects working with customers don’t need to hack a bunch of strange custom configurations for specific customer environments.
- Give customers the option to bring an external database instead of using a datastore embedded in the application. This should help address some of the catastrophic failures in storage and networking.
- Offer snapshot and restore functionality that will work in the majority of customer environments, relying on a hunch that this will include SSH File Transfer Protocol (SFTP), Network File System (NFS), storage area network (SAN) and maybe others. Do some discovery with the product team and several key customers to scope it out.
- Automate scanning for CVEs in all code and enforce a policy of not shipping a release without patching all CVEs for which a patch is available.
- Invest time in ensuring that the build/test process for developers in local environments can be shortened from 10+ minutes to under 30 seconds.
- Automate testing for all installer versions on a quickly growing multidimensional support matrix of OS versions, Kubernetes versions, add-ons, cloud providers and other dimensions.
- Build a specific “area of responsibility” for a product team to ensure that they can support new versions of operating systems within 30 days of release.
- Adopt an aggressive policy of deprecating old versions to reduce the total number of things that need to be maintained and patched.
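To make the scale of that wishlist concrete, here is a minimal Python sketch of two of its items: enumerating the multidimensional support matrix (minus deprecated versions) and applying the "no release with a patchable CVE" policy. It is an illustration only, not any team's actual tooling. The function names, the `Finding` fields and the version strings are all hypothetical placeholders, and a real pipeline would feed in results from whatever scanner and test runner the team already uses.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One vulnerability scanner result. Field names are hypothetical."""
    cve_id: str
    package: str
    fixed_version: Optional[str]  # None means no patch has been published yet


def support_matrix(dimensions, deprecated=frozenset()):
    """Return every combination of the given dimensions, minus deprecated ones.

    `dimensions` maps a dimension name (e.g. "os") to its supported values.
    `deprecated` holds (dimension, value) pairs that are no longer tested.
    """
    names = list(dimensions)
    combos = []
    for values in product(*(dimensions[name] for name in names)):
        combo = dict(zip(names, values))
        if any((name, value) in deprecated for name, value in combo.items()):
            continue  # skip anything that includes a deprecated version
        combos.append(combo)
    return combos


def release_blockers(findings):
    """Apply the policy: a CVE blocks the release only if a patch is available."""
    return [f for f in findings if f.fixed_version is not None]


if __name__ == "__main__":
    # Placeholder values; the real matrix would include add-ons and more.
    dims = {
        "os": ["ubuntu-22.04", "rhel-9"],
        "kubernetes": ["1.27", "1.28", "1.29"],
        "cloud": ["aws", "on-prem"],
    }
    print(len(support_matrix(dims)))  # 2 * 3 * 2 = 12 environments to test
    trimmed = support_matrix(dims, deprecated={("kubernetes", "1.27")})
    print(len(trimmed))  # 2 * 2 * 2 = 8 after deprecating one version
```

Even with three small dimensions, every new supported value multiplies the number of environments to test, which is why the deprecation policy and the automated matrix testing appear on the same list.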
Day 390 — The team is making progress, and even the lead engineers who built v1 are engaged again. A few improvements are made and momentum is building, but there’s still so much to do. The most knowledgeable people are still getting pulled into many support escalations with existing and new customers.
Day 480 — The three-month sprint has now sprawled out to six months. With half the team still improving the build/test/distribute/support platform for on-premises installs, app feature development is still behind pace. Work on an air gap installer has not even reached the prototype phase. With half the backend team focused on infrastructure-flavored tasks, frontend engineers staffed to work on a SaaS application or other modernization efforts are consistently running out of things to do. Disillusioned and completely burned out, the two engineers who built v1 of the installer and have the deepest knowledge of the project leave to join small startups founded by former colleagues. This sets back the team even further.
Some might read this and conclude that distributing software to customer-managed on-premises K8s and private cloud environments simply isn’t worth the pain. But 80% of all software spending still goes to applications that aren’t pure SaaS, and most organizations now expect applications to be K8s-friendly. We’re seeing a looming trend of application boomerangs from the cloud for reasons of security, compliance, performance and cost. There’s got to be a better way to solve the hard problems outlined above and still increase your addressable market!