3 Observability Best Practices for Cloud Native App Security
Why is observability important for better security?
Observability, especially in the context of cloud native applications, is important for several reasons. First and foremost is security. By design, cloud native applications rely on multiple, dynamic, distributed and highly ephemeral components or microservices, with each microservice operating and scaling independently to deliver the application functionality.
In this type of microservices-based architecture, observability and metrics provide security insights that enable teams to identify and mitigate zero-day threats through the detection of anomalies in microservices metrics, such as traffic flow, process calls, system calls and more. Using machine learning (ML) and heuristic analysis, security teams can identify abnormal behavior and issue alerts.
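As a rough illustration of the heuristic side of this idea, the sketch below flags metric samples that sit far from a learned baseline using a simple z-score. The function name, the sample request rates and the threshold of 3.0 are all assumptions made for this example. A production tool would use richer models and many more signals.

```python
from statistics import mean, stdev

def find_anomalies(baseline, recent, threshold=3.0):
    """Return (index, value, z_score) for recent samples far from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: treat any deviation as anomalous.
        return [(i, v, float("inf")) for i, v in enumerate(recent) if v != mu]
    anomalies = []
    for i, value in enumerate(recent):
        z = (value - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, value, z))
    return anomalies

# Hypothetical requests-per-second samples for one microservice.
baseline_rps = [100, 104, 98, 101, 97, 103, 99, 102]
recent_rps = [101, 96, 250, 100]

anomalies = find_anomalies(baseline_rps, recent_rps)
for index, value, z in anomalies:
    print(f"sample {index}: {value} rps is {z:.1f} standard deviations from baseline")
```

Only the spike to 250 requests per second is reported here. The other recent samples fall within normal variation of the baseline.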
Observability also enables security teams to visualize the blast radius in the event of a breach. Using this information, teams can apply mitigating controls, such as security policy updates, to isolate the breached microservice and thereby limit exposure.
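One way to picture the blast radius is as graph reachability: starting from the breached microservice, follow every connection that current policy allows. The sketch below assumes a hypothetical service graph stored as a dictionary of allowed outbound connections. The service names and the policy change are invented for illustration.

```python
from collections import deque

def blast_radius(service_graph, breached):
    """Return every service reachable from the breached one via allowed connections."""
    reachable = set()
    queue = deque([breached])
    while queue:
        current = queue.popleft()
        for neighbor in service_graph.get(current, ()):
            if neighbor != breached and neighbor not in reachable:
                reachable.add(neighbor)
                queue.append(neighbor)
    return reachable

# Hypothetical allowed connections: caller -> services it may call.
service_graph = {
    "frontend": ["cart", "catalog"],
    "cart": ["payments", "redis"],
    "catalog": ["catalog-db"],
    "payments": ["payments-db"],
}

exposed = blast_radius(service_graph, "cart")

# Simulate a policy update that stops "cart" from reaching "payments".
restricted_graph = dict(service_graph, cart=["redis"])
exposed_after_policy = blast_radius(restricted_graph, "cart")

print(sorted(exposed))
print(sorted(exposed_after_policy))
```

Comparing the two results shows how a single policy update shrinks the set of services an attacker could reach from the breached workload.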
Finally, observability helps DevOps teams maintain quality of service by identifying service failures and performance hotspots, and by supporting detailed investigation with capabilities such as packet capture and distributed tracing.
DevOps and SRE teams today are being overwhelmed by an enormous amount of data from multiple, disparate systems that monitor infrastructure and service layers.
To troubleshoot microservices issues, someone needs to stitch all this data together. To use the data, teams also need to understand the monitoring systems at different levels of the stack. As a result, teams spend a tremendous amount of time troubleshooting microservices issues.
Given the volume of data generated and the complexity of microservices deployments in the cloud, it is extremely difficult to diagnose and troubleshoot issues manually. Teams are not only overwhelmed by the data. Orchestrators like Kubernetes also introduce a layer of abstraction on top of your hosts, VMs and containers. All data you collect needs to be enriched with Kubernetes context in order to be useful.
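To make "enriched with Kubernetes context" more concrete, here is a minimal sketch that joins raw flow records (IP addresses only) with a pod inventory so each record names the namespace and workload involved. The record fields, IP addresses and pod names are hypothetical. In practice the inventory would come from the Kubernetes API rather than a hard-coded dictionary.

```python
def enrich_flows(flows, pods_by_ip):
    """Attach namespace/pod names to flow records that only carry IP addresses."""
    enriched = []
    for flow in flows:
        record = dict(flow)  # copy so the raw record is left untouched
        for side in ("src", "dst"):
            pod = pods_by_ip.get(flow[f"{side}_ip"])
            if pod is None:
                # Not a known pod IP, e.g. traffic from outside the cluster.
                record[f"{side}_workload"] = "unknown"
            else:
                record[f"{side}_workload"] = f"{pod['namespace']}/{pod['name']}"
        enriched.append(record)
    return enriched

# Hypothetical pod inventory keyed by pod IP.
pods_by_ip = {
    "10.0.1.12": {"namespace": "shop", "name": "cart-7d9f"},
    "10.0.2.40": {"namespace": "shop", "name": "payments-5c2b"},
}

# Hypothetical raw flow records as a low-level sensor might report them.
flows = [
    {"src_ip": "10.0.1.12", "dst_ip": "10.0.2.40", "dst_port": 8443, "bytes": 2048},
    {"src_ip": "203.0.113.9", "dst_ip": "10.0.1.12", "dst_port": 443, "bytes": 512},
]

for record in enrich_flows(flows, pods_by_ip):
    print(record["src_workload"], "->", record["dst_workload"], record["dst_port"])
```

Because pod IPs are reused as workloads scale up and down, a real implementation would also need to match each flow against the inventory as it stood at the time of the flow.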
Observability Best Practices
Here are three best practices for maintaining and improving observability:
Telemetry Collection
Your observability tool should be distributed and Kubernetes native, should support sensors across all layers (L3–L7) and should collect telemetry data from various sensors in your cluster. It should also collect information about Kubernetes infrastructure (for example, DNS and API server logs) and Kubernetes activity (Kubernetes audit logs) in the context of deployments and services.
Analytics and Visibility
Tools must provide visualizations, such as a service graph, Kubernetes platform view or application views, that are specific to Kubernetes operations. In addition to visualizations, tools should leverage machine learning techniques for baselining and reporting anomalies.
Security and Troubleshooting Applications
To troubleshoot applications, look for an observability tool that supports distributed tracing. Advanced machine learning techniques also help you understand Kubernetes cluster behavior, which in turn lets you anticipate security and performance concerns.
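As a small illustration of how distributed tracing helps locate a hotspot, the sketch below takes the spans of one trace and computes each span's self time, meaning its duration minus the time spent in its direct children. The span fields and values are hypothetical, and the calculation assumes child calls run one after another rather than in parallel.

```python
from collections import defaultdict

def self_times(spans):
    """Map span_id -> time spent in the span itself, excluding direct children."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    return {s["span_id"]: s["duration_ms"] - child_total[s["span_id"]] for s in spans}

# Hypothetical spans from a single request as it crosses four services.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend", "duration_ms": 480.0},
    {"span_id": "b", "parent_id": "a", "service": "cart", "duration_ms": 450.0},
    {"span_id": "c", "parent_id": "b", "service": "payments", "duration_ms": 60.0},
    {"span_id": "d", "parent_id": "b", "service": "redis", "duration_ms": 15.0},
]

own = self_times(spans)
hotspot = max(spans, key=lambda s: own[s["span_id"]])
print(f"hotspot: {hotspot['service']} ({own[hotspot['span_id']]:.0f} ms of its own time)")
```

The frontend span is the longest overall, but almost all of its time is spent waiting on the cart service, which is where the investigation should start.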
Tooling Solutions for Maintaining Observability
There are excellent open source and commercial tools for maintaining observability. Some open source tools for cloud native applications include:
While open source tools are a great way to start your monitoring and observability journey, they have their limitations. Commercial tools for cloud native application observability offer advanced features that go beyond what open source tools can offer. I recommend looking for a tool that offers as many of the following features as possible:
- Big-picture visualization: A topological representation of traffic flow and policy that shows how workloads within the cluster are communicating, and across which namespaces. Bonus points if the tool provides advanced capabilities to filter resources, save views and troubleshoot service issues.
- Dashboards: For example, a DNS dashboard and an L7 dashboard. A DNS dashboard should help accelerate DNS-related troubleshooting and problem resolution in Kubernetes environments by providing DNS-specific metrics. An L7 dashboard should provide a high-level view of HTTP communication across the cluster, with summaries of top URLs, request duration, response codes and volumetric data for each service.
- Dynamic packet capture: The tool should provide a way to capture packets from a specific pod or collection of pods with specified packet sizes and duration, in order to troubleshoot performance hotspots and connectivity issues faster.
- Application-level observability: You want a centralized, all-encompassing view of service-to-service traffic in the Kubernetes cluster to detect anomalous behavior like attempts to access applications or restricted URLs and scans for particular URLs.
- Unified controls: Ideally, the tool should offer a single, unified management plane that provides a centralized point of control for unified security and observability on multiple clouds, clusters and distros. This would enable you to monitor and observe across environments with a single pane of glass.
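To give a sense of what sits behind the L7 dashboard described in the list above, the following sketch aggregates HTTP log entries into per-service summaries of top URLs, response codes, average request duration and total bytes. The log field names and sample entries are assumptions for illustration and do not reflect any particular tool's log format.

```python
from collections import Counter, defaultdict

def summarize_http(logs, top_n=3):
    """Aggregate HTTP log entries into a per-service summary."""
    per_service = defaultdict(
        lambda: {"urls": Counter(), "codes": Counter(), "durations": [], "bytes": 0}
    )
    for entry in logs:
        stats = per_service[entry["service"]]
        stats["urls"][entry["url"]] += 1
        stats["codes"][entry["status"]] += 1
        stats["durations"].append(entry["duration_ms"])
        stats["bytes"] += entry["bytes"]

    summary = {}
    for service, stats in per_service.items():
        summary[service] = {
            "top_urls": stats["urls"].most_common(top_n),
            "response_codes": dict(stats["codes"]),
            "avg_duration_ms": sum(stats["durations"]) / len(stats["durations"]),
            "total_bytes": stats["bytes"],
        }
    return summary

# Hypothetical L7 log entries.
logs = [
    {"service": "cart", "url": "/cart", "status": 200, "duration_ms": 20.0, "bytes": 500},
    {"service": "cart", "url": "/cart", "status": 200, "duration_ms": 40.0, "bytes": 700},
    {"service": "cart", "url": "/checkout", "status": 500, "duration_ms": 120.0, "bytes": 300},
    {"service": "catalog", "url": "/items", "status": 200, "duration_ms": 10.0, "bytes": 900},
]

summary = summarize_http(logs)
for service, stats in summary.items():
    print(service, stats)
```

The same per-service structure could also feed the application-level checks mentioned above, for instance by counting requests to restricted URLs.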