Envoy and the Programmable Edge: Edge Proxies and the Developer Experience
At the inaugural EnvoyCon, which ran alongside KubeCon+CloudNativeCon in Seattle last December, several large organizations, including eBay, Pinterest and Groupon, discussed how they have recently begun using Envoy as an edge proxy. Moving away from hardware-based load balancers and other edge appliances towards the software-based “programmable edge” provided by the Envoy proxy clearly has many benefits, particularly in regard to dynamism and automation. However, one of the core challenges presented was the need to create an effective control plane that integrates well with the existing engineering workflow or developer experience. This article explores this challenge in more depth.
At the Leading “Edge” of Developer Experience
In a previous The New Stack article, “Kubernetes and PaaS: The Force of Developer Experience and Workflow,” I summarized some of the recent conversations within the Datawire team and with their community and customers. I argued that engineering organizations need to pay more attention to creating an effective engineering workflow (often referred to as “developer experience”) rather than simply building a platform on Kubernetes and letting that dictate the workflow. There is currently a lot of interesting tooling evolving within this space, such as Garden, Tilt, and Skaffold (to mention just a few), and I’ll hopefully focus on these in a future article. However, for the main thrust of this piece, I want to focus on how having a programmable edge impacts developer experience.
Historically, because edge appliances were hardware-based, they were typically the responsibility of the operations team, and sometimes of a specific network operations center (NOC) or edge team with the relevant vendor skills and certifications, for example in technologies created by the likes of F5, Cisco and Citrix. When a development team wanted to deploy a new domain, TLS certificate, or firewall rule, this typically involved the creation of a ticket within an issue tracking system.
I remember doing just this in several consulting gigs in the pre-DevOps days where we were deploying greenfield monolithic Java applications hosted on WebLogic. At the time making requests to the operations team wasn’t a big deal for the development team, as we were only deploying one new domain (with catch-all endpoints being routed to a single WebLogic instance) and a single product. We could schedule the necessary load balancer and firewall modifications weeks in advance, and as we were doing all of our quality assurance (QA) in staging environments (with a “hardening sprint” pre-release!) we didn’t need much control in the way of releasing our application — the domain and associated config would simply be activated after-hours on our chosen release date.
Modern Workflows Make the Edge More Dynamic
Now, fast forward 15 years and not only has the deployment technology changed but (perhaps more importantly) so have business requirements and the associated software architectures. This obviously has a knock-on effect on the engineering requirements for interacting with the edge.
Modern product-focused (micro)service development teams also now want access to the edge, and this access will typically be more dynamic than before. For example, developers will want to configure routing for a new API, or to test or release a new service via traffic shadowing or canary releasing. Here, control will typically be decentralized as product teams are working independently (loosely coupled) from each other, and also “high touch,” as developers want to continually tweak traffic routing based on a scheduled incremental rollout or from observable metrics (or alerts).
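To make the idea of canary releasing concrete, here is a minimal Python sketch of weight-based traffic splitting. It is an illustration only, not how Envoy or Ambassador implement it: the function name `pick_backend`, the two service names and the hashing scheme are all assumptions made for this example.

```python
import hashlib


def pick_backend(request_id: str, canary_weight: int,
                 stable: str = "my-service",
                 canary: str = "my-service-canary") -> str:
    """Send roughly `canary_weight` percent of requests to the canary.

    Hashing the request ID (rather than drawing a random number) keeps the
    decision deterministic: the same ID always lands on the same backend.
    """
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_weight else stable


# An incremental rollout is then just a sequence of weight changes.
for weight in (0, 10, 50, 100):
    hits = sum(pick_backend(f"req-{i}", weight) == "my-service-canary"
               for i in range(10_000))
    print(f"weight={weight:>3}% -> {hits / 100:.1f}% of requests hit the canary")
```

The “high touch” aspect shows up in the loop at the bottom: the only thing a product team changes between rollout steps is a single number, which is why they want to own that setting themselves.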
Providing a control plane for the operations team is equally vital. Any communication originating from outside your trusted network can be from a bad actor, whether the harm is intentional (e.g. cybercriminals) or accidental (e.g. a broken client library within a mobile app), and therefore you must defend against this. Here the operations team will specify sensible system defaults, and also adapt these in real time based on external events. In addition to rate limiting, you probably also want the ability to configure global and API-specific load shedding (for example, if the backend services or data stores become overwhelmed) and to implement DDoS protection, which may also be time- or geography-specific.
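As a rough illustration of the kind of safeguard an operations team might configure, the following Python sketch implements a token bucket, one common rate limiting algorithm. It is a generic example rather than a description of Envoy's or Ambassador's rate limiting; the class name `TokenBucket` and its parameters are assumptions made for this sketch.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a proxy would typically answer with HTTP 429 here


# Demonstration with a fake clock: a burst of three requests, then one a second later.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(3)])
fake_now[0] = 1.0
print(bucket.allow())
```

A global default and an API-specific override could then be modeled as two buckets with different `rate` and `capacity` values, with the operations team owning the former.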
Exploring the Ambassador Control Plane
As mentioned above, a centralized operations or SRE team may want to specify globally sensible defaults and safeguards for all ingress traffic. However, the (multiple) decentralized product development teams now also working at the front lines and releasing functionality will want fine-grained control for their services in isolation, and potentially (if they are embracing the “freedom and responsibility” model) the ability to override global safeguards locally.
A conscious choice made by the Ambassador community was that the primary persona targeted by the Ambassador control plane is the developer or application engineer, and therefore the focus of the control plane was on decentralized configuration. Ambassador was built to be Kubernetes-specific, and so a logical choice was to specify edge configuration close to the Kubernetes Service specifications that were contained within YAML files and loaded into Kubernetes via kubectl.
Options for specifying Ambassador configuration included using the Kubernetes Ingress object, writing custom Kubernetes annotations or defining Custom Resource Definitions (CRDs). Ultimately annotations were chosen, as they were simple and presented a minimal learning curve for the end-user. Using Ingress may have appeared to be the most obvious first choice, but unfortunately, the specification for Ingress has been stuck in perpetual beta, and other than the “lowest common denominator” functionality for managing ingress traffic, not much else has been agreed upon.
An example of an Ambassador annotation that demonstrates simple endpoint-to-service routing on a Kubernetes Service can be seen here:
- protocol: TCP
The configuration within the getambassador.io/config annotation should be relatively self-explanatory to anyone who has configured an edge proxy, reverse proxy or API gateway before. Traffic sent to the prefix endpoint will be “mapped” or routed to the my-service Kubernetes service. As this article is primarily focused on the design and implementation of Ambassador, we won’t cover all of the functionality that can be configured, such as advanced routing (including traffic shadowing), canarying (with integration with Prometheus for monitoring) and rate limiting.
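For readers who want to see the shape of such a manifest end to end, here is a rough Python sketch that builds an annotated Service as a dictionary and prints it as JSON, which kubectl accepts as well as YAML. Treat the details as assumptions for illustration: the mapping name, the `/my-service/` prefix, the selector and the port numbers are invented, and the `ambassador/v0` Mapping fields are assumed to match the pre-0.50 annotation format.

```python
import json

# Ambassador configuration travels as YAML text inside the annotation value.
# The name and prefix below are illustrative assumptions.
MAPPING_YAML = """\
---
apiVersion: ambassador/v0
kind: Mapping
name: my_service_mapping
prefix: /my-service/
service: my-service
"""


def service_manifest(name: str, mapping_yaml: str,
                     port: int = 80, target_port: int = 9376) -> dict:
    """Return a Kubernetes Service manifest carrying Ambassador config."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {"getambassador.io/config": mapping_yaml},
        },
        "spec": {
            "selector": {"app": name},  # hypothetical label selector
            "ports": [
                {"protocol": "TCP", "port": port, "targetPort": target_port},
            ],
        },
    }


manifest = service_manifest("my-service", MAPPING_YAML)
print(json.dumps(manifest, indent=2))
```

The point of the sketch is the co-location: the routing rule lives in the same file as the Service it routes to, so the team that owns the service also owns its edge configuration.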
Although Ambassador is focused on the developer persona, there is also extensive support for operators, and centralized configuration can be specified for authentication, TLS/SNI, tracing and service mesh integration.
Ambassador and Developer Workflow with GitOps
In regard to incorporating the creation and updating of Ambassador ingress/edge configuration into your developer workflow, I am a big fan of “GitOps,” which is the name given by the Weaveworks team for how they use developer tooling to drive operations and to implement continuous delivery. GitOps is implemented by using the Git distributed version control system (DVCS) as a single source of truth for declarative infrastructure and applications. Every developer within a team can issue pull requests against a Git repository, and when merged, a “diff and sync” tool detects a difference between the intended and actual state of the system. Tooling can then be triggered to update and synchronize the infrastructure to the intended state.
The Datawire interpretation of the guidelines for Weaveworks’ implementation of GitOps, which uses containers and Kubernetes for deployment, includes:
- Everything within the software system that can be described as code must be stored in Git: By using Git as the source of truth, it is possible to observe a cluster and compare it with the desired state. The goal is to describe and version control all aspects of a system: code, configuration, monitoring/alerting and, in the case of Ambassador, routing, security policies, rate limiting etc.
- The “kubectl” Kubernetes CLI tool should not be used directly: As a general rule, it is not a good practice to deploy directly to the cluster using kubectl (in the same regard as it is not recommended to manually deploy locally built binaries to production).
  - The Weaveworks team argues that many people let their CI tool drive deployment, and by doing this they are not practicing good separation of concerns.
  - Deploying all changes (code and config) via a pipeline allows verification and validation; for example, a pipeline can check for potential route naming collisions or an invalid security policy.
- Automate the “diff and sync” of the codified required state within Git and the associated actual state of the system: As soon as the continually executed “diff” process detects either that an automated process has merged an engineer’s changeset or that the cluster state has deviated from the current specification, a “sync” should be triggered to converge the actual state to what is specified within the Git-based single source of truth.
  - Weaveworks uses a Kubernetes controller that follows an “operator pattern”: By extending the functionality offered by Kubernetes, using a custom controller that follows the operator pattern, the cluster can be configured to always stay in sync with the Git-based “source of truth”.
  - The Weaveworks team uses “diff” and “sync” tools such as the open source kubediff, as well as internal tools like “terradiff” and “ansiblediff” (for Terraform and Ansible, respectively), that compare the intended cluster state with the actual state.
  - The AppDirect engineering team writes Ambassador configuration within each team’s Kubernetes service YAML manifests. These are stored in Git and follow the same review/approval process as any other code unit, and the CD pipeline listens for changes to the Git repo and applies the diff to Kubernetes.
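The “diff and sync” loop described in the list above can be sketched in a few lines of Python. This is a simplified model, not the behavior of kubediff or any other tool mentioned: the desired and actual states are plain dictionaries keyed by resource name, and the function names `diff` and `sync` are assumptions made for this example.

```python
def diff(desired: dict, actual: dict) -> dict:
    """Compare the desired state (from Git) with the actual state (from the cluster)."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(name for name in desired.keys() & actual.keys()
                         if desired[name] != actual[name]),
    }


def sync(desired: dict, actual: dict) -> dict:
    """Converge `actual` towards `desired` and return the changes that were applied."""
    changes = diff(desired, actual)
    for name in changes["create"] + changes["update"]:
        actual[name] = desired[name]   # stand-in for applying a manifest
    for name in changes["delete"]:
        del actual[name]               # stand-in for deleting a resource
    return changes


git_state = {"my-service": {"prefix": "/my-service/"},
             "orders": {"prefix": "/orders/"}}
cluster_state = {"my-service": {"prefix": "/old/"},
                 "legacy": {"prefix": "/legacy/"}}

print(sync(git_state, cluster_state))   # first run applies the differences
print(diff(git_state, cluster_state))   # second comparison finds nothing to do
```

Because the comparison runs in both directions, the same loop catches a merged changeset (Git moved ahead of the cluster) and manual drift (the cluster moved away from Git).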
As all of the Ambassador configuration is described via Service annotations in Kubernetes YAML files, it is very easy to implement a “GitOps” style workflow — in fact, if a team is already following this way of working for deploying applications and configurations, no additional machinery or set up should be required.
When engineering teams began discussing with Datawire how to integrate Ambassador configuration into a GitOps workflow, a couple of issues did repeatedly appear: first, as Envoy had evolved and begun offering a more feature-rich “v2” config, many engineers wanted access to this; and second, as Ambassador configurations had got more complicated and were being deployed at a larger scale, engineers required additional validation and a method to support configuration updates when running under heavy load.
Evolving Ambassador to v0.50: Envoy v2 and ADS
In consultation with the Ambassador community, the Datawire team undertook a redesign of the internals of Ambassador in 2018. This was driven by two key goals. First, we wanted to integrate Envoy’s v2 configuration format, which would enable the support of features such as securing multiple domains hosted on a single IP via Server Name Indication (SNI), improved endpoint/service-specific rate limiting (using request label metadata) and gRPC authentication APIs. Second, we also wanted to do much more robust semantic validation of Envoy configuration due to its increasing complexity, particularly when operating with large-scale application deployments.
The latest release of Ambassador 0.50 has been fundamentally re-architected to address these issues. The internal class hierarchy within Ambassador was made to more closely mirror the separation of concerns between the Ambassador configuration resources, a multipass compiler-inspired generation of an Intermediate Representation (IR), and the Envoy configuration resources. Core parts of Ambassador were also redesigned to facilitate contributions from the community outside Datawire.
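The idea of compiling user-facing configuration into an Intermediate Representation before emitting proxy configuration can be sketched as two small passes. This Python example is purely illustrative and does not reflect Ambassador's actual class hierarchy: `IRMapping`, `to_ir`, `to_envoy_routes`, the cluster naming scheme and the longest-prefix ordering are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IRMapping:
    """Intermediate Representation of one route, independent of Envoy's schema."""
    name: str
    prefix: str
    cluster: str


def to_ir(resources: List[Dict]) -> List[IRMapping]:
    """Pass 1: turn user-facing Mapping resources into IR objects."""
    ir = []
    for res in resources:
        if res.get("kind") != "Mapping":
            continue  # other resource kinds are out of scope for this sketch
        cluster = "cluster_" + res["service"].replace("-", "_")
        ir.append(IRMapping(res["name"], res["prefix"], cluster))
    # Longest prefix first, so that more specific routes are matched first.
    return sorted(ir, key=lambda m: len(m.prefix), reverse=True)


def to_envoy_routes(ir: List[IRMapping]) -> List[Dict]:
    """Pass 2: emit Envoy-style route entries from the IR."""
    return [{"match": {"prefix": m.prefix}, "route": {"cluster": m.cluster}}
            for m in ir]


resources = [
    {"kind": "Mapping", "name": "svc", "prefix": "/my-service/", "service": "my-service"},
    {"kind": "Mapping", "name": "svc_admin", "prefix": "/my-service/admin/", "service": "admin"},
]
for route in to_envoy_routes(to_ir(resources)):
    print(route)
```

The design benefit is that only the second pass knows about the proxy's configuration schema, so a change in the output format should not ripple back into how users write their configuration.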
We decided to take this approach for two reasons. First, Envoy Proxy is a very fast-moving project, and we realized that we needed an approach where a seemingly minor Envoy configuration change didn’t result in days of reengineering within Ambassador. In addition, we wanted to be able to provide semantic verification of configuration before loading this into Envoy.
The second point is particularly relevant to the theme of this article. When designing a control plane for the edge, even though the vast majority of interaction may be decentralized (via product teams), the resulting data plane updates will in effect be “centralized” at the edge. It’s worth noting that this isn’t something that you would typically see with a service-to-service data plane, as changes are often localized to the data plane proxy running as a sidecar to a service (or services). Extra effort has to be put into validating changes being made to an edge proxy (or load-balanced fleet of edge proxies), as an invalid configuration could potentially break all ingress traffic. As all of this configuration is effectively happening at a global level at the edge, the edge proxies must also be capable of very rapidly actioning any changes specified via the control plane.
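One way to picture this extra validation effort is a gate that checks a candidate configuration and keeps serving the last known-good version if anything is wrong. The following Python sketch shows the pattern under stated assumptions: the specific checks, the `validate` function and the `EdgeConfig` class are hypothetical and are not taken from Ambassador.

```python
def validate(mappings):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen_names = set()
    for m in mappings:
        name = m.get("name", "")
        label = name or "<unnamed>"
        if not name:
            problems.append("mapping without a name")
        elif name in seen_names:
            problems.append(f"duplicate mapping name: {name}")
        seen_names.add(name)
        if not m.get("prefix", "").startswith("/"):
            problems.append(f"{label}: prefix must start with '/'")
        if not m.get("service"):
            problems.append(f"{label}: no target service")
    return problems


class EdgeConfig:
    """Holds the last known-good configuration for the edge proxies."""

    def __init__(self):
        self.active = []

    def apply(self, candidate):
        problems = validate(candidate)
        if problems:
            return problems          # reject: the active config is untouched
        self.active = list(candidate)
        return []


edge = EdgeConfig()
good = [{"name": "svc", "prefix": "/my-service/", "service": "my-service"}]
bad = good + [{"name": "svc", "prefix": "orders/", "service": "orders"}]
print(edge.apply(good))   # accepted
print(edge.apply(bad))    # rejected with a list of problems
print(edge.active)        # still the good configuration
```

Rejecting the whole candidate, rather than applying the valid parts, is a deliberate choice in this sketch: a single team's mistake then cannot partially change routing for everyone else at the shared edge.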
We also switched the Ambassador internals to use Envoy’s v2 Aggregated Discovery Service (ADS) APIs to load configuration into the Envoy process, instead of relying on the previous approach of using a hot restart. This completely eliminated the requirement for a restart on configuration changes, which we found could lead to dropped connections under high load or with long-lived connections, such as gRPC streams or WebSockets.
The new internal Ambassador configuration process now looked something like this:
Moving to a “programmable edge” is beneficial, but you will need to adapt your developer experience or “DevEx” in order to fully take advantage of this new technology. Modern architectures and technologies like microservices and containers allow engineers to build and release functionality quickly, but the supporting underlying infrastructure also needs to adapt.
In particular, modern engineering workflows make the edge of your network more dynamic: rapidly changing business functionality is exposed here by independent, decentralized, product-focused teams, and external threats such as man-in-the-middle attacks or DDoS need to be mitigated by centralized operations teams. I believe that using tooling like Ambassador, which acts as an edge-focused control plane for the Envoy Proxy, in tandem with new workflow approaches like GitOps, will go some way toward addressing some of the challenges discussed.