“By three methods we may learn wisdom: First, by reflection, which is noblest; second, by imitation, which is easiest; and third by experience, which is the bitterest.” — Confucius (551-479 B.C.)
We have been running Kubernetes in production for some time now, and with large companies — global traffic, Black Friday and Christmas sales, stringent security guidelines and multi-country teams. We manage those clusters 24/7 armed with pagers at the ready for every single one of them. We are the SRE team for our enterprise customers.
That puts us in a unique spot to share a handful of our trials and tribulations, some learned the bitter way, some by standing on the shoulders of giants, and others through reflection from the get-go. We’ve been there, done that and got the T-shirt. This will be fun.
Security: RBAC and PSP and Network Policies
Security is a large focus for us. We co-wrote the Kubernetes CIS Benchmark and by now comply (important: where it makes sense!) with other CIS Benchmarks as well (AWS and Docker). We enabled RBAC by default in our clusters, in some cases already with Kubernetes 1.6 and finally with Kubernetes 1.7. As we host all infrastructure components in containers and many of them, like Calico Networking, run as pods managed by Kubernetes, the old adage of “turn off privileged containers” is not feasible for us (as it won’t be for most serious production setups out there).
To get around that problem, we combined RBAC and Pod Security Policies (PSP) to mitigate the issue by restricting the use of extended security contexts. We thought this was a logical solution until we tried it out. We found minor bugs (PSP was still in alpha) and just about no pre-made policies in place, which meant that pretty much nobody could possibly have been using this. That’s the price of being cutting edge. Still, it worked out pretty well for us, and the help of key people in the community was amazing.
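To make the RBAC-plus-PSP combination concrete, here is a minimal sketch that builds two manifests as Python dictionaries: a policy that forbids privileged pods, and a ClusterRole that lets its subjects `use` that policy. This is an illustration and not our actual policy set: the names `restricted` and `psp-restricted-user` and the chosen volume list are assumptions. It uses the `policy/v1beta1` API group, whereas older clusters served PSP under `extensions/v1beta1`; PSP was later removed from Kubernetes entirely in 1.25.

```python
import json


def restricted_psp(name="restricted"):
    """Build a PodSecurityPolicy manifest that forbids privileged pods.

    The policy name and the allowed volume types are hypothetical choices.
    """
    return {
        "apiVersion": "policy/v1beta1",
        "kind": "PodSecurityPolicy",
        "metadata": {"name": name},
        "spec": {
            "privileged": False,                # no privileged containers
            "allowPrivilegeEscalation": False,  # no setuid-style escalation
            "hostNetwork": False,
            "hostPID": False,
            "hostIPC": False,
            "runAsUser": {"rule": "MustRunAsNonRoot"},
            "seLinux": {"rule": "RunAsAny"},
            "supplementalGroups": {"rule": "RunAsAny"},
            "fsGroup": {"rule": "RunAsAny"},
            "volumes": ["configMap", "secret", "emptyDir",
                        "persistentVolumeClaim"],
        },
    }


def psp_user_role(psp_name="restricted"):
    """Build a ClusterRole granting the 'use' verb on one named policy."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": f"psp-{psp_name}-user"},
        "rules": [{
            "apiGroups": ["policy"],
            "resources": ["podsecuritypolicies"],
            "resourceNames": [psp_name],
            "verbs": ["use"],
        }],
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be applied as-is.
    for manifest in (restricted_psp(), psp_user_role()):
        print(json.dumps(manifest, indent=2))
```

Infrastructure pods that really need privileges would get a second, more permissive policy bound only to their own service accounts, while everyone else is bound to the restricted one.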
We managed to get it done eventually, and from that point on, every cluster has been fully locked down. We even have customers using Calico Network Policies to limit egress traffic to NFS shares on premises on a per namespace basis for different teams.
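As a rough illustration of per-namespace egress control, the sketch below builds a policy that allows pods in one namespace to talk only to a single NFS server. It uses the standard Kubernetes `NetworkPolicy` API (which Calico enforces) and not Calico's own policy resources, and it reflects one possible reading of the setup described above. The namespace `team-a`, the address `10.0.50.10/32` and the policy name are hypothetical.

```python
import json


def nfs_only_egress(namespace, nfs_cidr, nfs_port=2049):
    """Build a NetworkPolicy restricting a namespace's egress to one NFS CIDR.

    All argument values are supplied by the caller; nothing here is taken
    from a real cluster.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "egress-nfs-only", "namespace": namespace},
        "spec": {
            # An empty podSelector matches every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{"ipBlock": {"cidr": nfs_cidr}}],
                "ports": [{"protocol": "TCP", "port": nfs_port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical team namespace and on-premises NFS server address.
    print(json.dumps(nfs_only_egress("team-a", "10.0.50.10/32"), indent=2))
```

Note that a policy this strict also blocks DNS lookups from those pods, so a real setup would normally add an egress rule for the cluster DNS service.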
We have also come to partly hate security scanners. All the cluster etcds are fully encrypted but running on unencrypted EBS volumes. Theoretically, that’s no problem, as all the content is encrypted. But the security scanners will notice the unencrypted EBS and from then on they will say that the setup is insecure until you fix it. This is one of the annoying little problems. And if you know etcd, you know how cumbersome a live migration of it can be.
Updates: Don’t Touch! It’s Working!
We are treating infrastructure as code and that means that we have a full CI/CD setup to manage the infrastructure components, especially the ones we build ourselves. And we actually have a system of patches and minor and major updates with different rules on when we can deploy them. But with time we learned that a big part of it is making sure we are actually allowed to deploy them.
There is even a part in our contracts by now that says that our management is no longer included if your clusters have not been updated for more than three months. This was suggested by a customer who can now tell their teams that they need to upgrade because otherwise, it will get expensive. It works. A lot better than anything else, because otherwise teams just say: But it’s working, why touch it?
General Electric CEO Jack Welch once said, “I’ve always believed that when the rate of change inside an institution becomes slower than the rate of change outside, the end is in sight. The only question is when.”
And that really is the problem. We see most of our production problems in older clusters, mainly because each production incident in any customer cluster leads to a full #postmortem until the root cause is fixed and rolled out to all customer clusters. We make sure, damn sure, that we have a single product, and customers love us for that, as they see fewer problems. It also means, though, that we need to get people to upgrade.
We are also working closely with one of our big corporate customers on tooling to make deployments and upgrades simpler for distributed corporate teams. Watch out for this one on our blog.
Installation: AWS in a Few Hours, On-Premises in $var
“Trying to predict the future is like trying to drive down a country road at night with no lights while looking out the back window.” — Peter F. Drucker
We can bootstrap an entire Giant Swarm installation in two hours on AWS and close to that on Azure, which will likely end up being the same eventually, as we recently launched our production-ready Kubernetes on Azure. From there, our customers can create as many clusters as they want. Our forte is that we can do the same thing on-premises, but sadly we still have to say that an installation on-premises takes, um, time. I tend to joke that it takes anything between six hours and six months, both of them highly unlikely.
The problem with on-premises environments is that they are all different. How does the VPN work? What about the jump host? Are the servers networked together correctly? Do we have iLO access? On-premises has real value for many of our customers, but it is totally different and not as easy to use. This is especially true if you do not integrate a load balancer API (can you even get your own F5 Networks BIG-IP space for your own rules?) and have no properly standardized and Kubernetes-supported storage solution. We have iSCSI connected, which has a few stability issues with Docker, and NFS, which does not always perform that well. And we are looking at EMC Isilon and REX-Ray or Rook, but none of them is perfect yet. It’s much nicer on a cloud provider that just comes with a good dynamic provisioner upstream.
Load Tests and Killing AWS
We obviously run load tests with customers, and we push them to do so. But on one occasion we started freaking out, because when the load test came in, nothing worked anymore. It was total death of the cluster. Our customer’s team was still used to the old way of working, though, and just handed the news that the load test had failed over the wall and said they would try again.
We were at a loss as to how to fix it, until we dug down deeper. First of all, the load test was misconfigured and did not produce 100,000 concurrent users. Instead, it produced 100,000 concurrent users per location, resulting in 300,000 concurrent users due to three origin locations being used. Then we dug even deeper and found out that the EC2 machines used in the tiny cluster had 1Gbit network cards and after a few seconds they became saturated and died. Traffic never arrived at the cluster level.
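The arithmetic behind that failure is easy to sketch. The snippet below multiplies the per-location user count by the number of origin locations and then divides the nominal NIC capacity across all users. The node count is an assumption for illustration (the story only says the cluster was tiny), and the calculation ignores protocol overhead.

```python
def total_concurrent_users(users_per_location, locations):
    """Users actually generated when the count is applied per origin location."""
    return users_per_location * locations


def per_user_budget_bits(nic_gbit, nodes, users):
    """Nominal bits per second available to each user across all node NICs."""
    total_bits_per_second = nic_gbit * 1_000_000_000 * nodes
    return total_bits_per_second / users


if __name__ == "__main__":
    users = total_concurrent_users(100_000, 3)
    print(f"concurrent users: {users}")
    # Hypothetical: a single 1 Gbit node receiving all of the traffic.
    budget = per_user_budget_bits(nic_gbit=1, nodes=1, users=users)
    print(f"per-user budget: {budget / 1000:.1f} kbit/s")
```

Running a check like this against the intended test plan before the test starts makes it obvious whether the network cards, not the cluster, will be the first thing to give out.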
After scaling the cluster up it worked great.
Authentication — From Certificates over OIDC to Azure AD
“Once upon a time there was ___. Every day, ___. One day ___. Because of that, ___. Because of that, ___. Until finally ___.” — Pixar Storytelling Rules #4
Our setups default to certificate authentication automated by our API and cert-operator and based on Vault in the backend. However, Active Directory is something that a lot of corporations use these days and it is good. We wanted to move a customer to be able to authenticate through that.
As we wanted to integrate through the Kubernetes-supported OIDC interface, we first needed the customer to upgrade to a more recent version of ADFS, as older versions do not offer any OIDC endpoints. However, as we started testing, we ran into problems with self-signed certificates, so we started testing with Dex in between. After some back and forth we got that working, but we still had problems getting extended claims/scopes from the OIDC endpoint. And as the UX side of Dex is not a solved thing yet (if you are not running Tectonic, that is), at some point we decided to try a third path: Azure AD.
The customer had a sync between their on-premises ADFS and Azure AD. By talking to our friends at Microsoft (it’s great to have them directly in our Slack) and checking some upstream docs we found that to be a solution worth trying. With some help from Microsoft to solve some open questions from the upstream documentation, we got it working quickly and from then on, all of the customer’s clusters come pre-set with Azure AD authentication.
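For readers who have not wired up OIDC before, the integration point on the Kubernetes side is a handful of `kube-apiserver` flags. The sketch below assembles them from a few inputs. The flag names are the real upstream ones, but the helper function, the issuer URL, the client ID and the CA path are hypothetical placeholders and not the customer's actual values.

```python
def oidc_apiserver_flags(issuer_url, client_id, username_claim="sub",
                         groups_claim=None, ca_file=None):
    """Assemble kube-apiserver OIDC flags from the given settings.

    This helper is illustrative only; it is not part of any Kubernetes tool.
    """
    # kube-apiserver only accepts issuer URLs with the https scheme.
    if not issuer_url.startswith("https://"):
        raise ValueError("OIDC issuer URL must use https")
    flags = [
        f"--oidc-issuer-url={issuer_url}",
        f"--oidc-client-id={client_id}",
        f"--oidc-username-claim={username_claim}",
    ]
    if groups_claim:
        # Maps a token claim to Kubernetes groups for use in RBAC bindings.
        flags.append(f"--oidc-groups-claim={groups_claim}")
    if ca_file:
        # Needed when the issuer's certificate is not signed by a trusted CA,
        # for example with self-signed certificates.
        flags.append(f"--oidc-ca-file={ca_file}")
    return flags


if __name__ == "__main__":
    # Placeholder values for illustration.
    for flag in oidc_apiserver_flags("https://issuer.example.com/",
                                     "kubernetes",
                                     groups_claim="groups",
                                     ca_file="/etc/kubernetes/ssl/oidc-ca.pem"):
        print(flag)
```

The groups claim is what makes the integration useful in practice, since it lets RBAC role bindings refer to directory groups instead of individual users.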
As these stories show, there are lots of challenges, both technical and organizational, that need to be overcome on the road to production with Kubernetes; some come early, some later, some at scale and some under resource pressure. If we only had simple supplier-to-customer relationships with our clients, it would never look like this. We are happy to instead be able to foster real relationships and a partnership with our customers that helps us to grow together.
We hope that by sharing our experiences here, we can at least partly extend this relationship to you. Next time you meet us, hit us up for more war stories from the trenches of Kubernetes production.
Oliver Thylmann will be co-presenting with Adidas on “The Enterprise’s New Shoes — The Journey of Adidas to a Global Kubernetes Rollout” at KubeCon + CloudNativeCon EU, May 2-4, 2018 in Copenhagen, Denmark.
This article was contributed by Giant Swarm on behalf of KubeCon + CloudNativeCon Europe, a sponsor of The New Stack.
Feature image via Pixabay.