More Problems with GitOps — and How to Fix Them
Codefresh sponsored this post.
GitOps is an emerging way to manage the actual state of systems through definitions of the desired state stored in Git and executed by Kubernetes. But while GitOps as an idea is great, we are not even close to having that idea be useful in a practical sense. There is still a lot of work to be done.
In a previous post, I explored a number of initial issues around the emerging practice of GitOps — namely that it is misunderstood, that it is too often thought of as only a way to manage Kubernetes deployments, and that GitOps tools are not promoting GitOps practices.
Now, we’ll take a look at a number of additional issues: that GitOps principles often cannot even be applied to GitOps tools themselves, that we do not have tools that reflect changes happening inside clusters in Git, and that observability remains immature.
Let’s jump down into the rabbit hole…
We’re Often Not Even Able to Apply GitOps Principles on GitOps Tools
It’s a chicken and egg problem. We need a chicken to make eggs, but we cannot have a chicken without an egg. The same is true for GitOps. We need tools that will help us apply GitOps, but how do we apply GitOps principles on GitOps tools? If, for example, we pick Argo CD to manage our applications based on GitOps principles, we have to ask how we will manage Argo CD itself. We are told that we shouldn’t execute commands like kubectl apply manually, yet we have to deploy Argo CD itself. Even if we ignore that part and say that the initial installation is an exception, how are we supposed to manage upgrades and maintenance of Argo CD? Now, if you dig through the documentation, you will find vague instructions to install it manually, export the resources running inside the cluster into YAML files, store them in Git, and tell Argo CD to use them as yet another app. That might allow Argo CD to manage itself, but… Come on! I do not need to tell you how silly it is to deploy something inside a cluster and then start exporting that something into YAML files.
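To make the "yet another app" idea concrete, here is a minimal sketch in Python that builds an Argo CD Application manifest pointing Argo CD at its own installation manifests. The repository URL, the path, and the function name are hypothetical placeholders. The field names follow the Argo CD Application custom resource as I understand it, so check them against the Argo CD documentation before relying on this.

```python
import json


def argocd_self_app(repo_url: str, path: str, revision: str = "HEAD") -> dict:
    """Build an Argo CD Application that tracks Argo CD's own manifests.

    repo_url and path are hypothetical: they stand for wherever the
    exported (or hand-written) Argo CD manifests live in Git.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": "argocd", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                "targetRevision": revision,
            },
            "destination": {
                # In-cluster API server address used by Argo CD for "this cluster".
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            # Let Argo CD converge its own resources without manual syncs.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    app = argocd_self_app("https://example.com/org/platform.git", "argocd")
    # kubectl accepts JSON as well as YAML, so this output can be stored in Git.
    print(json.dumps(app, indent=2))
```

Even with a definition like this stored in Git, something still has to apply it the first time, which is exactly the chicken and egg problem described above.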
Yet, the situation with Argo CD is one of the better ones. If we check the instructions for most of the other tools, the problem only gets worse.
Now we are getting to the part that potentially breaks GitOps and makes it even dangerous to use.
We Do NOT Have The Tools That Reflect Changes Happening Inside Clusters in Git
Our systems are dynamic. They are changing the desired state all the time, and we do not yet have tools that reflect changes happening inside clusters in Git. Confused? Let me give you an example or two.
Flagger allows us to define (almost) everything we need in a few lines of YAML that can be stored in a Git repo and deployed and managed by Flux or Argo CD. Whenever we push a change to Git, those tools will make sure that the actual state changes. So far, so good.
One of the best things about Flagger is that it will create a lot of resources for us. It will create Deployments, Services, and other “core” Kubernetes resources. If, for example, we are using Istio, it will also create VirtualServices and other components required for our app to work correctly. That’s great, because it simplifies a lot of our work. Instead of writing hundreds of lines of YAML, we can get away with a minimal definition — usually measured in tens of lines. However, the actual state is not converged into the desired one. Git is not the single source of truth, because what is running in a cluster is very different from what was defined as a Flagger resource. Nevertheless, we can skip over that and say that we are indeed defining the desired state, but only in a different and more compact format. The real issue is different.
Flagger will roll out our application to a fraction of users, start monitoring metrics, and decide whether to roll forward or backward. If everything goes as planned, it will eventually roll out a new release to all the users. If something is off, it will roll back. That’s great. It gives us safety. That’s why we love canary deployments. However, that produces a drift that is not reconcilable.
When an automated rollback happens, the desired state in Git still says that the new release should be running in the cluster, while the actual state is the previous release. If Flagger were applying GitOps principles, it would NOT roll back automatically. It would push a change to the Git repository. That change would set the tag in the app definition back to whatever was there before the attempt to roll out the new release. That would be picked up by Flux, Argo CD, or another similar tool, which would initiate the process of rolling back by effectively rolling forward, but to the previous release. Maybe it should revert the commit that defined the new state that has to be rolled back. Or, perhaps, it should not do any of those things, but instead notify some common interface so that other tools could do those things. The design is debatable, but the process is not, at least where GitOps is concerned.
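The first option above, rolling back by rolling forward to the previous tag, can be sketched in a few lines of Python. This is not how Flagger works today. It is a hypothetical illustration that only computes the new file content and a commit message. The function names, the image name, and the manifest snippet are all assumptions, and committing and pushing the result is left out.

```python
import re


def set_image_tag(manifest: str, image: str, tag: str) -> str:
    """Return the manifest text with every `image: <image>:<old>` set to `tag`."""
    pattern = re.compile(
        r"^(?P<prefix>[ \t]*(?:-[ \t]*)?image:[ \t]*" + re.escape(image) + r"):\S+[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: f"{m.group('prefix')}:{tag}", manifest)


def rollback_change(manifest: str, image: str, failed_tag: str, previous_tag: str):
    """Express a rollback as a new desired state plus a commit message.

    A hypothetical GitOps-friendly controller would commit and push this
    instead of changing the cluster directly.
    """
    new_manifest = set_image_tag(manifest, image, previous_tag)
    message = f"Roll back {image} from {failed_tag} to {previous_tag}"
    return new_manifest, message


if __name__ == "__main__":
    desired = (
        "containers:\n"
        "  - name: app\n"
        "    image: example/app:1.2.0\n"
    )
    new_desired, msg = rollback_change(desired, "example/app", "1.2.0", "1.1.0")
    print(msg)
    print(new_desired)
```

With this shape, Git would record the rollback as just another change to the desired state, and Flux or Argo CD would do the actual converging.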
GitOps forces us to define the desired state before some automated processes converge the actual state into whatever the new desire is. Changing the actual state without defining it as the desired state first and storing the changes in Git is a big no-no. Yet, Flagger does just that. No matter how great it is in what it does, it is by no means applying GitOps.
Now, that does not mean in any form or way that Flagger is not a great tool. It is amazing. Nevertheless, it is marketing itself as a GitOps tool without really applying the principles it promotes.
Argo Rollouts Suffers From Similar Issues as Flagger
If we are using Istio, Argo Rollouts requires us to define all the resources. It does not create them for us. We still need to define the Istio VirtualService and other resources on top of the typical Kubernetes ones. From the perspective of the person who writes and manages those definitions, it is more complicated than Flagger. On the other hand, it is more GitOps-friendly. There is less “magic” involved, which gives us more control over the desired state. Argo CD has fewer issues converging the actual into the desired state.
Nevertheless, Argo Rollouts does modify weights at runtime, so there is an inevitable drift that cannot be reconciled. However, that drift is a temporary difference between the two states. While some changes to the actual state (e.g., horizontal scaling) will almost certainly never be reflected in the desired state, it is conceivable that progressive delivery tools could feed the changes to weights back to Git and let the tools in charge of deployments apply them. Such actions raise some questions, especially around performance. Nevertheless, there is undoubtedly a middle road we could take, even if it stops short of making these tools fully GitOps-compliant.
If we move to the more significant problem of rollbacks, the issue becomes as complicated with Argo Rollouts as with Flagger. When a rollback happens, it is automated and the desired state stored in Git will not change. From that moment on, according to Git, we are running the new release, while the old release is running in the cluster. If we update any aspect of the definition of the application besides the release tag, the system will try to roll out the same release that was rolled back. We’ll get into a mess with unpredictable outcomes.
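The sequence just described can be shown with a deliberately simplified simulation. Everything here (the AppState type, the tags, the replica counts, and the three functions) is hypothetical. Real controllers do far more than this, but the ordering of events is the point.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AppState:
    tag: str       # release (image tag)
    replicas: int  # stands in for any other aspect of the definition


def drift(desired: AppState, actual: AppState) -> dict:
    """Return the fields where the actual state differs from the desired one."""
    want, have = asdict(desired), asdict(actual)
    return {k: (want[k], have[k]) for k in want if want[k] != have[k]}


def automated_rollback(actual: AppState, previous_tag: str) -> AppState:
    """What the progressive delivery tool does: change the cluster, not Git."""
    return replace(actual, tag=previous_tag)


def sync(desired: AppState, actual: AppState) -> AppState:
    """What the GitOps tool does: make the cluster match Git."""
    return desired


if __name__ == "__main__":
    git = AppState(tag="2.0.0", replicas=3)        # desired state in Git
    cluster = sync(git, AppState("1.9.0", 3))      # new release rolled out
    cluster = automated_rollback(cluster, "1.9.0") # analysis failed
    print(drift(git, cluster))  # Git and the cluster disagree on the tag

    git = replace(git, replicas=5)  # unrelated change pushed to Git
    cluster = sync(git, cluster)
    print(cluster.tag)  # the release that was rolled back is deployed again
```

Because the rollback never reached Git, the next unrelated change brings the failed release straight back.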
So, both tools are failing to apply GitOps principles, except that Argo Rollouts is aware of it (intentionally or unintentionally) and is, at least, attempting to improve. Also, due to it having less “magic,” it is closer to being GitOps-friendly — since it forces us to be more explicit. Still, those are shades of gray rather than real differences.
The major differentiator is that you will not find in Argo Rollouts documentation that it is a GitOps tool. Argo CD has GitOps all over the place, but Argo Rollouts doesn’t. Flagger, on the other hand, has the following sentence on the home screen of its documentation: “You can build fully automated GitOps pipelines for canary deployments with Flagger and FluxCD.”
So, if both are failing to adhere to GitOps principles, one of them is at least not claiming that it does.
Let’s move into observability.
Observability Is Immature
This might be one of the main pain points of GitOps: observability is immature.
Tools like Argo CD do show us what the current state is and what the difference is compared to the previous one. They might add a link to the commit that initiated the change of the actual state, and that’s more or less it. Where are the pull requests that were used to create the actual state? Where are the issues (JIRA, GitHub, etc.) that made us change the state in the first place? Where is all the other information we might need?
Now, you might say that we do not need all those things in one place. We can go from one tool to another and find all the data we need. That’s true, but I am not an archeologist (I was, but that’s a different story). I do not want to dig for hours to determine what caused the changes to the actual state, and who did what and why.
To make things more complicated, observability of the actual state is not even the main issue. The desired state is where everything falls apart.
To begin with, Git is not designed to provide that type of observability. If I want to see the previous desired state, I might need to go through many pull requests and commits. Sure, when looking at a single pull request in which only the tag of the image used in a deployment of the new release has changed, things look easy and straightforward. But that is not the real world. Big systems are complex. The desired state is changing all the time. One minute, one team might express the desire to add an app to the preview environment; the next, someone might want a new release in staging; a few minutes later, others might want yet another preview application, while (in parallel) the desired state of production might be changing.
All of that is great when everything works like a Swiss clock. But when something fails — and I assure you that it will — finding out who wanted what by looking at the pull requests and the commits is anything but easy. Try jumping from one repo to another, switching branches, digging through pull requests and commits, and do all that in a bigger organization with hundreds or even thousands of engineers constantly changing the desired and, indirectly, the actual state. All I can say is that it is neither pretty nor efficient.
So, we need a way to visualize the actual and desired state, backed with the ability to travel through time and see what is and what was.
Yes, we need a good way to visualize both the actual and the desired state. But there’s more. We need to combine them. We need to be able to see what should be (the desired state), what is (the actual state), both now and in the past. We need all that, combined with all of the relevant information — like pull requests, issues, etc. However, even all of that is not enough.
GitOps is a set of principles: everything defined as code, code stored in Git, Git holding the desired state, machines converging the actual into the desired state, and so on. But it does not stand a chance alone. It is part of a “bigger machine,” which we currently call continuous delivery (CD).
I’ll get to the GitOps issues related to CD in the next post. What matters is that the information from CD pipelines must also be included in GitOps observability. We need to know which pipeline builds contributed to the current or the past states.
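The combined, time-traveling view described in this section can be sketched as a small data structure. This is a hypothetical Python illustration, not an existing tool. The type names, the integer timestamps, and the free-form source field (which could hold a commit, a pull request, an issue key, or a pipeline build) are all assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StateChange:
    at: int      # timestamp, e.g. seconds since epoch
    state: dict  # the state that became effective at that time
    source: str  # commit, pull request, issue, or pipeline build


@dataclass
class Timeline:
    changes: list = field(default_factory=list)

    def record(self, change: StateChange) -> None:
        """Insert a change while keeping the list ordered by time."""
        idx = bisect_right([c.at for c in self.changes], change.at)
        self.changes.insert(idx, change)

    def as_of(self, at: int) -> Optional[StateChange]:
        """Return the latest change at or before `at`, if any."""
        idx = bisect_right([c.at for c in self.changes], at)
        return self.changes[idx - 1] if idx else None


def snapshot(desired: Timeline, actual: Timeline, at: int) -> dict:
    """What should be, what is, and whether they agree, at a point in time."""
    d, a = desired.as_of(at), actual.as_of(at)
    in_sync = d is not None and a is not None and d.state == a.state
    return {"desired": d, "actual": a, "in_sync": in_sync}


if __name__ == "__main__":
    desired, actual = Timeline(), Timeline()
    desired.record(StateChange(100, {"tag": "1.0"}, "PR: initial release"))
    actual.record(StateChange(110, {"tag": "1.0"}, "pipeline build A"))
    desired.record(StateChange(200, {"tag": "2.0"}, "PR: new release"))
    actual.record(StateChange(230, {"tag": "2.0"}, "pipeline build B"))
    actual.record(StateChange(300, {"tag": "1.0"}, "automated rollback"))
    for t in (150, 210, 250, 310):
        print(t, snapshot(desired, actual, t)["in_sync"])
```

The sketch keeps the two timelines separate on purpose: the desired state comes from Git, the actual state comes from the cluster, and the value is in being able to line them up at any moment together with the reason for each change.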
In the next and final post, I’ll describe a number of additional issues around GitOps, including:
- There are no well-established patterns.
- The connection between Continuous Delivery and GitOps is not yet well established.
- Running GitOps at scale is challenging.
- Managing secrets is a big issue.