Upgrading Istio without Downtime
As of this writing, Istio is unveiling version 1.11, so it’s a good time to talk about a method for upgrading Istio without downtime. If you have been working with Istio for some time and have it deployed to your environments, you might be running a significantly older version. As of Aug 24, Istio 1.9 is no longer supported. This means that older versions no longer receive the critical patches and updates that help keep you secure. Solo.io does support older versions of Istio via Gloo Mesh, with long-term support (LTS) for the current release and the previous four releases.
Nevertheless, upgrading is a good idea, so let’s pay down some of your technical debt and upgrade your Istio deployment so your applications can take advantage of the latest features and stay secure. This blog will explain an architecture and process that will help you get your Istio deployment back into compliance and set you up for easier upgrades in the future.
Upgrading One Version at a Time
Istio recommends that you upgrade one minor version at a time. The exception is 1.8, from which you can skip 1.9 and go directly to 1.10. This means that if you are still on Istio 1.6, the recommended path is three upgrades to reach 1.10 (1.6→1.7→1.8→1.10). With the architecture laid out below, you may be able to skip these intermediate versions, because you can test the old and new versions side by side. This might save you a number of hours and allow you to catch up safely and more efficiently.
The Two Failure Modes for Istio
First, let’s talk about the two main ways an Istio outage can affect your workloads.
The first failure mode is the loss of configuration propagation to the Istio sidecars. If your istio-agent sidecars lose the ability to communicate with Istiod, or are incompatible with the configuration being sent, your workloads will not be able to join or communicate with the mesh. This can even affect existing workloads: endpoint discovery will be out of date, so they may try to reach workloads that no longer exist. New workloads will not be able to join at all and will remain down until the issue is resolved. To avoid this type of outage, it is recommended that istio-agents match the version of the control plane (Istiod). It also makes sense that, during an upgrade, the existing control plane deployment remains in place rather than being upgraded directly. A blue/green deployment of the control plane is therefore a desirable step toward upgrading Istio without downtime.
The second failure mode, and often the more critical, is loss of traffic flow through the ingress gateway. Unlike the loss of the control plane, an outage in the ingress gateway has an immediate impact on your end users. Since the gateway is on the critical path for traffic, extra care should be taken when upgrading it. That includes being able to fall back to the existing gateway if the upgrade fails. This is why it’s also recommended to do blue/green ingress gateway deployments. One way to do this is with an external Kubernetes LoadBalanced service whose selector picks either the “blue” or the “green” ingress gateway for traffic flow.
An Architecture for Upgrading Istio without Downtime
Building on the mitigations for the two failure modes, we can show how some of the newer Istio features help with deploying and upgrading Istio without downtime. This solution relies heavily on the Istio Canary Deployment feature. Introduced in 1.6, it allows us to deploy multiple versions of the Istio control plane side by side and migrate workloads between them. We can also use the same mechanism to canary deploy ingress gateways, using our own managed LoadBalanced service.
The current best approach is to use the Istio operator to deploy Istio components with the IstioOperator configuration. Because of compatibility problems between operator versions, we have to deploy a new operator for every version upgrade. Given this constraint, it is probably just as easy to deploy Istio via Helm, and we may recommend that in the future. (It’s not being recommended today because of the convenience that the IstioOperator CRD offers over the traditional Helm values file.) Once we have the operators deployed, we can deploy our multiple IstioOperator configurations (one for Istiod, one for each gateway). Example configurations of each are shown below.
Here’s an example Istiod deployment with a revision label. It would be deployed by the 1-9-7 Istio operator.
# Traffic management feature
# Istio Gateway feature
# Disable gateways deployments because they will be in separate IstioOperator configs
- name: istio-ingressgateway
- name: istio-eastwestgateway
- name: istio-egressgateway
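Filled out, a revisioned control plane configuration along these lines might look like the following sketch. The resource name, namespace, profile, and the placement of the east-west gateway under ingressGateways are assumptions for illustration, not the exact configuration used in this post.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane-1-9-7   # hypothetical resource name
  namespace: istio-system           # assumed install namespace
spec:
  profile: minimal                  # assumption: install Istiod only
  revision: 1-9-7                   # revision that workloads select via istio.io/rev
  components:
    # Disable gateway deployments because they live in separate IstioOperator configs
    ingressGateways:
    - name: istio-ingressgateway
      enabled: false
    - name: istio-eastwestgateway   # assumed to be listed as an ingress gateway
      enabled: false
    egressGateways:
    - name: istio-egressgateway
      enabled: false
```

Keeping the revision in the resource name makes it easier to see which control planes are installed side by side.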
Here’s an example gateway deployment with a custom LoadBalanced service. We can use the service selector to blue/green future versions of gateway deployments. Note: we must change the ingress gateway’s own service to a ClusterIP service so it will not create its own load balancer.
# select the 1-9-7 revision
- name: status-port
- name: http2
- name: https
- name: tcp
- name: tls
- name: istio-ingressgateway-1-9-7
# Since we created our own LoadBalanced service, tell istio to create a ClusterIP service for this gateway
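As a fuller sketch of the same idea, the pair of resources might look roughly like this. The istio-gateways namespace, the label keys, the empty profile, and the port numbers (taken from Istio's default ingress gateway) are assumptions, so adjust them to your environment.

```yaml
# Externally managed LoadBalancer Service; its selector decides which
# gateway revision receives traffic.
apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: istio-gateways          # hypothetical namespace
spec:
  type: LoadBalancer
  selector:
    istio: ingressgateway
    version: 1-9-7                   # select the 1-9-7 revision; change this to cut over
  ports:
  - name: status-port
    port: 15021
    targetPort: 15021
  - name: http2
    port: 80
    targetPort: 8080
  - name: https
    port: 443
    targetPort: 8443
  - name: tcp
    port: 31400
    targetPort: 31400
  - name: tls
    port: 15443
    targetPort: 15443
---
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-ingressgateway-1-9-7
  namespace: istio-gateways          # hypothetical namespace
spec:
  profile: empty                     # assumption: deploy only the gateway from this config
  revision: 1-9-7
  components:
    ingressGateways:
    - name: istio-ingressgateway-1-9-7
      namespace: istio-gateways
      enabled: true
      label:
        istio: ingressgateway
        version: 1-9-7               # matched by the Service selector above
      k8s:
        service:
          # We manage our own LoadBalancer Service, so keep this one internal
          type: ClusterIP
```

With this layout, cutting over to a new gateway (or falling back to the old one) is a one-line change to the version value in the Service selector.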
How to Migrate Your Sidecars
Once your new Istio control plane is deployed, you can migrate your application workloads to the new Istiod deployment. If you are already using revisions and the revision label, it should be as simple as updating the namespace label to istio.io/rev=<new_revision>. Then, you will need to recreate your pods to get the updated proxy sidecars.
Here’s an example rolling restart command to update the sidecar version.
kubectl rollout restart deployment/nginx -n nginx
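If you manage namespaces declaratively, the label change that precedes the restart could be sketched as a manifest like the one below. The nginx namespace matches the command above, while the revision value 1-10-3 is a hypothetical placeholder for whatever revision your new control plane uses.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nginx
  labels:
    # Points sidecar injection at the new control plane revision
    istio.io/rev: 1-10-3   # hypothetical new revision
```

Changing the label alone does not touch running pods. They keep their old sidecars until they are recreated, which is why the rolling restart is still needed.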
Migrating to Revisions
If you are not currently using revisions or canary deployments, it is still easy, and recommended, to migrate to them. The pattern for migration is strikingly similar to upgrading between versions. We recommend deploying the same Istio version next to your existing deployment, but with the added revision label. Then you can migrate your application sidecars at your leisure by removing the istio-injection=enabled label from each namespace and adding the new revision label. The old label has to be removed because istio-injection takes precedence over istio.io/rev.
Migrating the gateways may prove to be more difficult, however. If your current Istio deployment owns the LoadBalanced service, you will have to take extra care when removing the existing infrastructure. It may be easier in some cases to migrate to a new LoadBalanced service.
Try Out Our ‘Upgrading Istio without Downtime’ Demo!
If you are interested in trying this for yourself, we have created a lab to test it. We deploy two versions of Istio and migrate the Bookinfo application while sending traffic to it. Once complete, we take a look at that traffic to make sure the migration worked without issue.
Of course, you don’t have to go it alone! Solo offers enterprise production long-term support (LTS) for Istio and fixes CVEs quickly with patches and backports.