How Container-Based Architectures Require Different Networking
With the rise of so-called mode two applications, we see significantly different approaches to software architecture. Gartner defines mode two applications as exploratory and non-linear, focused on agility and speed. To achieve this level of agility, the underlying architecture needs to be dramatically different.
The base principles of these architectures are converging toward a collection of microservices deployed in (Docker) containers. The containers are small, and agile teams update them constantly, which demands a more dynamic architecture for continuous deployment, management, and networking infrastructure.
Scale in a container-based software world leads to new requirements. Where enterprise solutions used to run in a handful of virtual machines, in a container world the number of microservice instances can grow dramatically.
Today we are only at the beginning of this revolution. Microservice architectures do not appear overnight, since they require significant rewrites of enterprise solutions. New solutions, however, will without any doubt be based on microservices.
Large enterprises have a ton of legacy applications that will be reborn in the cloud over time using these techniques; the benefits of a container-based architecture are simply too attractive to resist. Not only does it reduce the risk of software development, but the shorter time to value for new capabilities will also bring significant business value to end users.
A typical large enterprise has between 500 and 10,000 applications (I’ve even seen extremes of nearly 100,000). Imagine that over time all these applications are rewritten as microservices running in containers. Each application will probably comprise 10 to 100 different microservices, and each microservice will have many instances running in containers. This will lead to enterprise environments running 100,000+ containers over time. Again, this will not happen overnight, but 10 years from now it might be a reality.
Microservices tend to be highly dynamic, scaling up and down depending on the load on the system, and are updated so frequently that each instance is short-lived. Network traffic between microservices (east-west traffic) tends to be significantly larger in volume than north-south traffic (from the requestor to the landing service), because most requests hop across multiple microservices before the response is sent back to the requestor.
Containers are also ideal for deploying microservices-based applications in the public cloud. They provide great flexibility by hiding physical infrastructure and operating system requirements from the application, which makes them ideal for running workloads both on-prem and in public clouds.
The short lifespan of containers and frequent updates to the applications hosted inside them make provisioning microservices at scale a challenge. This is where Kubernetes comes in. Kubernetes has a thriving open source community with thousands of contributors and is a primary project within the Cloud Native Computing Foundation (CNCF). It provides a cloud-agnostic solution to automate deployment, scaling, and management of containerized applications across on-prem and public clouds, and in doing so it also addresses concerns around cloud vendor lock-in.
Migrating a traditional three-tier application to microservices is not an easy task and is best done in phases to ease the migration.
Application Deployment Journey – Three-Tier to Microservices
A traditional application is deployed as three tiers: a presentation tier, a business logic tier, and a data tier, with the individual tiers talking to each other via a load balancer. The three-tier architecture is simple to deploy but rigid in its design when it comes to supporting continuous delivery of new capabilities. Microservices architecture increases operational complexity but is flexible in its design, enabling continuous integration and delivery of new capabilities. In a three-tier architecture, networking and security policies are administered from a central place, typically a load balancer or a firewall. The distributed nature of microservices architecture, on the other hand, makes administering networking and security policies a lot harder than in a monolith architecture. Another difference worth noting is that a three-tier architecture follows traditional, server-side load balancing, where the client knows of only one destination. Microservices architecture, on the other hand, follows client-side load balancing, where the client knows of multiple destinations and load balancing happens closer to the client.
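The difference can be illustrated with a minimal Python sketch of client-side load balancing, where the client itself holds the list of destinations (the endpoint addresses are hypothetical):

```python
import itertools

class ClientSideBalancer:
    """Minimal sketch: the client holds the endpoint list and rotates
    through it (round-robin) instead of sending every request to a
    single server-side load balancer address."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def pick(self):
        # Each call returns the next endpoint in rotation.
        return next(self._cycle)

balancer = ClientSideBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
picks = [balancer.pick() for _ in range(4)]  # wraps around after the third pick
```

In server-side load balancing, the same client would know only one address, the load balancer's, and the rotation would happen behind it.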
It is important to account for these differences as you plan your transition. Here is a step-by-step guide for ensuring minimal disruption as you redeploy your three-tier application as microservices:
Stage I) Migrate to Hairpin Architecture
In a typical three-tier architecture, a load balancer is the entry point to your application. Having all traffic come through a single entry point makes it easier to manage network and security policies from a central place. In the first phase of migration, it is best to retain this simplicity and instead focus on the harder task of decomposing the app into modular microservices. Once the app is decomposed into modular services, you should hairpin them through the central load balancer. The central load balancer, in this case, could be the same hardware or software appliance that already functions as the N-S entry point for all applications. The main advantage of this approach is that it retains the simplicity of a three-tier traffic flow for both North-South (N-S) and East-West (E-W) communication, which simplifies things a lot from a security point of view. One drawback is that it introduces an additional hop in the network, and thereby adds latency to E-W traffic.
As a gradual next step in the transition, you should consider separating the N-S traffic from E-W by moving the E-W traffic to a dedicated E-W load balancer. In addition, you could further simplify E-W load balancing by decomposing it into one E-W load balancer per application, as shown below with the purple and green apps. This gives Dev teams the flexibility to package and deploy a load balancer as part of their application bundle.
For organizations that are early in their DevOps journey, this deployment architecture enables a clean separation of roles and responsibilities between traditional IT and Dev teams. It allows IT to retain control of the central load balancer, and in doing so remain responsible for managing security and compliance for N-S traffic coming into the data center, while at the same time giving Dev more freedom and greater control over their service. There are other benefits to this model as well: it minimizes coordination overhead between Dev teams because each application comes bundled with its own mini load balancer, it limits the blast radius of a failure to a single application, it aids faster troubleshooting by decentralizing fault domains, and it enables clean separation of logs, configuration, and security policies by individual application.
Stage II) Use Combination of Hairpin and Mesh
Next, you move the traffic with simple load balancing needs away from the central load balancer, but you keep the traffic with application-aware routing and security needs hairpinned through the central load balancer. In this mode, all traffic with simple load balancing needs is load balanced through the default proxy that comes with Kubernetes. For instance, applications that need SSL, content inspection by a Web Application Firewall, or rich content routing policies could continue to hairpin through the central load balancer, while all other traffic goes directly to the target service. Another point worth noting is that with this change, all E-W traffic with simple load balancing needs starts to shift from server-side to client-side load balancing. The main advantage of this step is that it reduces the latency and throughput overhead caused by the hairpinning model of Stage I. One drawback is that administering security is not as simple as in a three-tier architecture: security moves lower in the application stack, closer to each service, and the distributed nature of microservices architecture makes managing it a challenge.
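The traffic split in this stage amounts to a simple routing policy. Here is a hedged sketch in Python; the flag names and the load balancer address are made up for illustration:

```python
# Hypothetical policy: traffic that needs rich L7 treatment (SSL
# termination, WAF inspection, content routing) keeps hairpinning
# through the central load balancer; everything else goes directly
# to the target service via the default Kubernetes proxy.
CENTRAL_LB = "central-lb.internal"

def next_hop(request):
    """Return where this request should be sent next."""
    needs_central = any(request.get(flag) for flag in ("ssl", "waf", "content_routing"))
    return CENTRAL_LB if needs_central else request["target_service"]
```

For example, `next_hop({"target_service": "cart", "waf": True})` routes via the central load balancer, while `next_hop({"target_service": "cart"})` goes straight to the service.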
Stage III) Final Destination with Full Mesh or Service Mesh
The final step in the journey is a full mesh architecture. In this stage, an intelligent content-aware proxy with rich API management, API security, and traffic management capabilities runs on each Kubernetes node and manages all E-W traffic. At the end of Stage III, the E-W traffic that started out in traditional server-side load balancing mode has moved over completely to client-side load balancing. The N-S traffic coming into the data center still follows the traditional server-side load-balancing paradigm. The biggest advantage of this architecture is that it makes the overall network efficient by eliminating the hairpinning overhead of Stages I and II.
A slightly advanced form of full mesh architecture is Service Mesh. It provides an infrastructure layer that abstracts common networking and security requirements away from the service and absorbs them into the mesh infrastructure. A few examples of such requirements are authentication, tracing, circuit breakers, and visibility.
At the end of Stage III, you will have successfully redeployed a monolith application as a distributed, cloud-native microservices architecture deployed in containers and managed by Kubernetes. Enterprise environments that run a large number of containers need a different approach to load balancing, security, visibility, monitoring, and management of microservices.
Challenges of Managing Microservices and Containers at Scale
Microservices architecture consists of many small, modular, loosely coupled, and distributed services. Network operators have to contend with a large number of containers talking to each other in complex patterns. Almost by definition, the number of services, the number of container instances, the scale of deployment, and the rate of change are all much higher in a microservices architecture than in a traditional three-tier architecture.
Here are some aspects to consider about operational complexity:
- There are a lot more service instances to load balance,
- The short lifespan of service instances makes health checking these ephemeral containers a challenge,
- The sheer number of service instances calls for a high degree of automation to manage rolling upgrades at scale; manual or semi-manual processes don’t work when dealing with containers at scale,
- Frequent application updates by autonomous teams in a microservices architecture call for new tooling for canary testing and blue-green deployments to minimize downtime,
- Compared to a monolith, microservices applications are required to have higher tolerance to failures through application-level timeouts, retries, and back-offs. This design pattern induces a new behavior in the network that one needs to account for in planning, e.g., a rapid increase in retries when a service slows down,
- The increased number of cross-connects between services significantly increases the volume of logs and metrics,
- Unlike monolith architectures, where modules communicate with each other within a single image through in-process calls or IPC, with microservices these calls are made over the network. More services and service instances potentially mean more failure points and more operational complexity.
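To make the automation point above concrete, the core of a rolling upgrade is just batching: never take more than a fixed number of instances out of rotation at once. A minimal sketch, where the batch size is an illustrative assumption:

```python
def rolling_batches(instances, max_unavailable=2):
    """Split instances into upgrade batches so that at most
    `max_unavailable` instances are out of rotation at any time."""
    return [instances[i:i + max_unavailable]
            for i in range(0, len(instances), max_unavailable)]
```

Kubernetes Deployments automate exactly this kind of batching, plus health checking of each batch before moving on, via their rolling-update strategy.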
Visibility and Analytics
Visibility and troubleshooting are difficult in any distributed system, and microservices are no different. This section explores operational, troubleshooting, and visibility challenges that are specific to microservices architectures.
The distributed nature of microservices increases failure points in the system at all levels: network, hardware, and application. More moving parts do not necessarily mean more failures, but they certainly increase operational complexity. To help prepare for failure events, you need a good visibility and analytics solution in place.
In a microservices architecture a user request typically fans out into multiple API calls to other services, which in turn can potentially call into more nested services. This fan-out model brings out some interesting problems:
- It becomes difficult to understand the journey of an API call through the infrastructure. For instance, it becomes hard to answer questions such as: how many dependencies does a service have, are dependency services running hot, what does the call chain of an API look like, how deep does a request go, and at what level does it fail?
- A service that has poor latency 1 percent of the time might look good in isolation, but this can still be problematic in aggregate. For instance, a user-facing top-level API that calls into many downstream services will hit someone’s slow path (the 1 percent case) most of the time, leading to poor user experience.
- The large volume of API logs in a microservices environment makes it difficult to spot inefficient API calls. For instance, a poorly implemented list-traversal routine that fans out into many API calls to the same service could easily benefit from batching those calls and caching the response.
- Getting to the bottom of a failing top-level API that started behaving differently due to unrelated changes in a downstream API, such as the addition or removal of a field unrelated to the top-level API, becomes challenging.
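The aggregate effect of the 1-percent slow path above is easy to quantify; assuming independent downstream calls:

```python
def p_hits_slow_path(fanout, p_slow_call=0.01):
    """Probability that at least one of `fanout` independent downstream
    calls lands on a service's slow path: 1 - P(every call is fast)."""
    return 1 - (1 - p_slow_call) ** fanout
```

A single call is slow only 1 percent of the time, yet a request fanning out to 100 such calls hits some slow path roughly 63 percent of the time, which is why tail latency dominates user experience at scale.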
Team structure and dynamics in a microservices architecture also matter. Service teams work independently with little or no coordination, which leads to siloed teams. This can cause subtle problems typically not seen with monolith architectures:
- Teams working on microservices are required to handle failures gracefully through request timeouts, retries, and back-offs. Oftentimes service teams unintentionally pick similar values for timeouts and retries, which can have fatal side effects under failure conditions. For example, a service that has gone down can get bombarded by a flood of requests on its way back up, causing it to go down again, with this shutdown-restart cycle repeating itself. This can happen if the timeouts and retries in caller services are predictable rather than randomized, or if a caller service cannot distinguish between genuine and transient failures.
- Teams move independently, at their own pace, and with little to no coordination. This means a caller has no control over the target service: a team could be canary testing a new feature in production, a team could be reverting to an older version due to a regression, or Kubernetes itself could be autoscaling container instances to meet varying load. In this environment, ensuring a consistent SLA across all teams can become challenging, especially if you do not have a consistent dashboard for measuring and comparing metrics across all services.
- There needs to be good accounting and tracking of the logs generated by service teams. Without proper guidelines and auditing, a runaway service could unintentionally generate verbose logs, leading to a new set of scaling challenges around storing, searching, and indexing the data at the log aggregation layer.
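The retry-synchronization problem in the first bullet is commonly mitigated with jittered exponential backoff. A sketch, with illustrative constants:

```python
import random

def backoff_schedule(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """'Full jitter' backoff: each retry waits a random delay between 0
    and min(cap, base * 2**attempt), so callers that failed at the same
    moment do not retry in lockstep and re-overload the recovering service."""
    return [rng() * min(cap, base * (2 ** attempt)) for attempt in range(attempts)]
```

With a deterministic `rng` the schedule reduces to plain exponential backoff; the randomness is precisely what de-synchronizes the callers.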
Furthermore, common design patterns in a microservices architecture give rise to interesting problems of their own:
- For planning and reliability reasons, services in a microservices architecture are encouraged to fail fast, as doing so is better than taking on more load and degrading the experience for everyone. The problem is that there isn’t a one-size-fits-all answer here: every service is different, and a good solution should dynamically engage circuit breakers and service limits based on history, current load, resource type, and the calling service.
- The eventually consistent paradigm followed by microservices applications, where services start out with divergent views but eventually converge on a consistent view, makes application behavior inconsistent during the settling period.
- Canary releasing is a commonly used technique in microservices for introducing new updates to a subset of users before opening them up to the broader user population. By comparing the new version side by side with the older version, the canary approach quickly uncovers problems in the new version. The challenge, however, is that without a good analytics framework in place, it is hard to do this before-vs-after comparison to assess the risk of a canary release.
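The fail-fast behavior in the first bullet is usually packaged as a circuit breaker. A minimal sketch with a fixed failure threshold; a production breaker would also half-open after a cool-down period:

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; once open, calls
    fail immediately instead of adding load to a struggling service."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success closes the breaker again
        return result
```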
Clearly, visibility is a challenge when dealing with many microservices. One way to get more insight is to embed a unique request ID in user-facing requests, have the downstream services carry it through the call chain, and then mine the logged data for insights. Given these challenges, there is a need for a solution that is data-driven and understands microservices design patterns: one that can spot problems proactively, reduce triaging time when things do go wrong, and recommend corrective actions to restore the service to a stable state.
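The request-ID idea above can be sketched in a few lines; the `x-request-id` header name is a common convention, not a standard:

```python
import uuid

def with_request_id(headers):
    """Reuse the caller's request ID if present; otherwise mint one at
    the edge. Downstream services pass the same headers along, so every
    log line in the fan-out carries the same ID and the full call chain
    can be stitched back together from logs."""
    headers = dict(headers)  # do not mutate the caller's copy
    headers.setdefault("x-request-id", str(uuid.uuid4()))
    return headers

def log_line(headers, message):
    # Prefix every log line with the propagated request ID.
    return f'request_id={headers["x-request-id"]} {message}'
```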
Managing security in this dynamic environment can also be quite challenging. In a monolith architecture, application modules are bundled together in a single image and typically communicate with each other over in-process calls, loopback, local sockets, and shared memory. A major drawback of the monolith approach is that the entire system becomes vulnerable if one of the modules is breached. Ensuring security and compliance for a microservices architecture is just as difficult: the architecture consists of a large number of services accessible over the network via APIs, which significantly increases the attack surface, as an attacker can now devise network-port-based or API-based attacks. The ephemeral nature of service instances also makes it difficult to monitor containers, apply security policies, and do forensics on security breaches. Given how easy it is to package and deploy containers by pulling together base images from different sources, security in a container and microservices architecture needs to be looked at end-to-end:
- Right from building, scanning, digitally signing a container image to deploying it,
- It must be baked into the CI/CD pipeline,
- It should account for hardening of the OS,
- It should include network and application level segmentation,
- It should include authentication and authorization mechanisms to ensure callers are who they claim to be,
- Services should talk to each other over a secure communication channel.
In general, because services are independent (developed by different teams, in different programming languages, with different technology building blocks), ensuring a consistent security posture across all services in a microservices architecture is a challenging problem.
As enterprises bet on containers and microservices for improving speed and agility of application delivery, there is a need for an enterprise-grade solution that truly understands microservices design patterns to assist infrastructure operators on security and compliance, and provide data-driven insights and actions.