My Istiod Pod Can’t Communicate with the Kubernetes API Server!
A few days ago, I published a blog post on whether a network cache-based identity could be mistaken, where I introduced an error scenario that caused a Kubernetes pod’s identity to be mistaken, thus granting unauthorized access. In this post, I would like to demonstrate, using the exact same scenario, how to leverage defense in depth with Cilium and the Istio service mesh to prevent such unauthorized service access.
In this experiment, you’ll set up a Kubernetes kind cluster, deploy v1 and v2 of the client applications (“sleep”) and v1 and v2 of the server applications (“helloworld”), along with the v1 network policy that allows ONLY the v1 client to call the v1 server, and the v2 network policy that allows ONLY the v2 client to call the v2 server.
You’ll also set up Istio authorization policies to allow ONLY the v1 client to call the v1 server and ONLY the v2 client to call the v2 server. You’ll first observe the network policies enforced as expected. Then you’ll trigger an error scenario, scale the client pods up and down, and observe a v2 client bypass the v1 L4 network policy yet fail the Istio RBAC check. Let us get started!
Setting up the Environment
Refer to the setup instructions in this blog to set up your kind cluster and Cilium CNI. Download the latest stable istioctl, install the minimal profile and scale up the Istiod deployment to three replicas:
istioctl install --set profile=minimal -y
kubectl scale deploy istiod -n istio-system --replicas=3
Deploy the applications and network policies
Label the default namespace for sidecar injection:
kubectl label namespace default istio-injection=enabled
Clone the repo, then deploy the sleep and helloworld deployments, along with the v1 and v2 CiliumNetworkPolicy resources.
kubectl apply -f ./yamls
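For orientation, a v1 network policy of the kind described above might look roughly like the sketch below. The policy name and the pod labels (app and version) are assumptions made for illustration only; the manifests in the repo are authoritative and may differ.

```yaml
# Hypothetical sketch: allow ingress to the v1 server only from the v1 client.
# The name and all labels below are assumed, not taken from the repo.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: helloworld-v1        # assumed name
  namespace: default
spec:
  endpointSelector:          # pods this policy protects
    matchLabels:
      app: helloworld        # assumed label
      version: v1            # assumed label
  ingress:
  - fromEndpoints:           # only endpoints with these labels may connect
    - matchLabels:
        app: sleep           # assumed label
        version: v1          # assumed label
```

The v2 policy would mirror this with the v2 labels.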
Apply the simple Istio PeerAuthentication resource below to only allow strict mTLS traffic in the default namespace:
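A minimal sketch of such a resource, assuming it is named default and using the v1beta1 API version (newer Istio releases also serve security.istio.io/v1), could be:

```yaml
# Sketch: require mutual TLS for all workloads in the default namespace.
# The resource name is an assumption.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT   # reject plaintext traffic to sidecar-injected workloads
```

Because no selector is set, the policy applies to every workload in the namespace.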
Apply the simple Istio AuthorizationPolicy resource below to allow nothing in the default namespace, following the zero-trust best practice of trusting nothing by default and then explicitly allowing access as needed:
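One common way to express this, sketched here with an assumed name of allow-nothing, is an AuthorizationPolicy with an empty spec. Its action defaults to ALLOW and it has no rules, so no request matches and everything in the namespace is denied:

```yaml
# Sketch: deny all requests to workloads in the default namespace.
# The resource name is an assumption.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-nothing
  namespace: default
spec:
  {}   # ALLOW policy with no rules: nothing is allowed
```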
Apply the Istio authorization policy below to allow sleep-v1 to call helloworld-v1 using the GET method:
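As a sketch of what such a policy could look like, the example below assumes that the v1 client runs under a service account named sleep-v1, that the mesh uses the default cluster.local trust domain, and that the v1 server pods carry the labels app: helloworld and version: v1. None of these names are confirmed by this post, so adjust them to match the manifests in the repo.

```yaml
# Sketch: allow only the sleep-v1 identity to issue GET requests to helloworld-v1.
# The policy name, workload labels, and service account are assumptions.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-get-helloworld-v1   # assumed name
  namespace: default
spec:
  selector:
    matchLabels:
      app: helloworld             # assumed label
      version: v1                 # assumed label
  action: ALLOW
  rules:
  - from:
    - source:
        # mTLS identity derived from the client's service account (assumed)
        principals: ["cluster.local/ns/default/sa/sleep-v1"]
    to:
    - operation:
        methods: ["GET"]
```

Matching on principals relies on mutual TLS, because the identity is read from the client certificate. This is why the check is independent of the IP-to-identity cache that the network policy depends on.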
Apply the Istio authorization policy below to allow sleep-v2 to call helloworld-v2 using the GET method:
Assuming all of your sleep and helloworld pods are up and running, you can call helloworld-v1 from the sleep-v1 pod and helloworld-v2 from the sleep-v2 pod:
You’ll get output like the following, where only sleep-v1 can call helloworld-v1 and only sleep-v2 can call helloworld-v2. When sleep-v2 calls helloworld-v1, a “connection failed” error is displayed because the v1 network policy is properly enforced:
With the above applications, network policies, and authorization policies deployed, the network policies are effective in most cases, so sleep-v2 cannot call helloworld-v1 successfully. Let us trigger a similar error scenario in which the node where the helloworld-v1 pod runs can’t communicate with the Kubernetes API server. In my environment, that kind-worker node runs one Cilium pod and one Istiod pod:
Trigger the error as in the previous post:
Run the Test!
If you are not familiar with the test, refer to the Review the test script section. Simply execute run-test.sh to run the test. You may observe that a few sleep-v2 pods take 30 seconds or so to reach the Running status. This is because when the istio-proxy container starts, its pilot-agent sends a Certificate Signing Request (CSR) to Istiod (which serves as the CA in my test), authenticating with the service account token that is provisioned by Kubernetes and mounted into the pod.
If the CSR happens to be sent to the Istiod pod that can’t communicate with the Kubernetes API server, that Istiod can’t validate the service account token and therefore won’t process the request. This is where retries come to the rescue: the pilot-agent in the istio-proxy container is intelligent enough to retry the CSR against a different Istiod.
Shortly, you’ll see that all pods have reached the Running status.
You’ll also observe that a sleep-v2 pod with the mistaken Cilium identity is found, but it still can NOT call helloworld-v1 successfully:
Note that the “RBAC: access denied” error comes from Istio: the authorization policies are enforced by helloworld-v1’s istio-proxy container. If you recall, the error was different earlier, when the Cilium network policy was properly enforced and did NOT allow sleep-v2 to call helloworld-v1:
Take a look at the short video to watch me run the above steps in my test environment with Cilium and Istio:
Even when one of the Istiod pods cannot communicate with the Kubernetes API server, your application identity (based on cryptographic primitives) continues to be properly generated from its Kubernetes service account token via CSRs, and the Istio authorization policies continue to be enforced. This reinforces my earlier recommendation to use a defense-in-depth approach along with a zero-trust model in your security architecture, so that you are well prepared for various error scenarios.