Apache SkyWalking: Observing the Heterogeneous Stack at Scale
Tetrate sponsored this post.
The observability problem for modern DevOps is familiar: As enterprises move to microservices, containerization, multilanguage RPC frameworks and service meshes, there’s an increasing need for users to understand a highly complex, distributed architecture and the dependencies between applications. Apache SkyWalking, an application performance monitor (APM) and observability platform, is an open source project that addresses this need — with or without a service mesh.
Like other observability tools, Apache SkyWalking allows system administrators to track system health and understand what’s going on among many interdependent services. The internal system of a large-scale enterprise will often have scores of subsystems running hundreds of services and thousands of instances. SkyWalking is built to help operations and maintenance teams identify why and where a request is slow, to alert them to anomalous system performance, to provide apples-to-apples, language-agnostic metrics across apps, and to monitor overall system health efficiently.
Support for heterogeneous environments distinguishes SkyWalking. It provides a single platform for collection, aggregation and domain-specific querying, with agents for different systems and the option to integrate a service mesh. Organizations might opt to use SkyWalking so that they can maintain consistency and use the same APM system for traditional and cloud native architectures.
Apache SkyWalking was started by Sheng Wu as a personal project and has grown meteorically since then, with 375 contributors today. It was named after a literal “observability platform,” the glass bridge Skywalk at Grand Canyon West, which provides a bird’s-eye view of the natural landmark.
SkyWalking began as a training project to help colleagues understand the problems that arise in a distributed system. It has since evolved from a pure tracing system into a full-featured APM system and observability analysis platform, aimed at microservices and distributed services running in large enterprises. SkyWalking is a top-level Apache project, and it monitors large-scale distributed systems at organizations that include Alibaba, Huawei, Tencent, Baidu, China Telecom, and various banks and insurance companies. In many cases, it collects and analyzes billions of traces, along with metrics, per day.
“SkyWalking guarantees availability under high-load conditions in production,” says Sheng Wu. “Its users are looking for regular processing power at the level of tens of billions, lightweight process, pluggability, and easy customization.”
SkyWalking’s functionalities fall under the “three pillars” of observability: metrics, logs, and tracing. Fundamentally, SkyWalking is an APM tool dedicated to application performance — allowing development, operations and maintenance teams to understand the relationships between their systems and their operations in practice.
Metrics give you aggregated data on application performance — for example, the number of services, average response time, throughput, etc. You can also add a custom-defined metric to the SkyWalking UI, based on individual business requirements. Logs provide a record of events or error messages. Tracing shows you event behavior over time, so that you can track a request from start to finish and identify system defects and errors.
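To make the metrics idea concrete, here is a minimal sketch of rolling raw request records up into per-service figures such as average response time and throughput. The `Call` record, the `aggregate` function and the sample services are all hypothetical names chosen for illustration; this is not how SkyWalking itself computes metrics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    """One handled request (hypothetical record, for illustration only)."""
    service: str        # service that handled the request
    latency_ms: float   # time taken to respond
    is_error: bool      # whether the request failed


def aggregate(calls, window_seconds):
    """Roll raw calls up into per-service metrics for one time window."""
    buckets = defaultdict(list)
    for call in calls:
        buckets[call.service].append(call)

    metrics = {}
    for service, items in buckets.items():
        count = len(items)
        metrics[service] = {
            "avg_response_ms": sum(c.latency_ms for c in items) / count,
            "throughput_per_min": count * 60 / window_seconds,
            "error_rate": sum(c.is_error for c in items) / count,
        }
    return metrics


calls = [
    Call("checkout", 120.0, False),
    Call("checkout", 80.0, False),
    Call("checkout", 400.0, True),
    Call("inventory", 30.0, False),
]
metrics = aggregate(calls, window_seconds=60)
print(metrics["checkout"]["avg_response_ms"])  # 200.0
```

A custom business metric would follow the same shape: another key computed from the records in each bucket.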
SkyWalking’s distributed topology maps use STAM (Streaming Topology Analysis Method) to analyze topology from traces, displaying relationships that can’t be pulled from simple metrics SDKs. Used in a service mesh, SkyWalking can support observability with Envoy’s Access Log Service (ALS), the proxy extension that emits detailed access logs of all requests going through Envoy. SkyWalking gives you various means of making such data useful and actionable: a list view of latency bar graphs to quickly view slow points in the system, alarms triggered by a user-specified service-level objective (SLO) threshold, or a topology diagram to locate the boundaries of a performance issue, to name just a few examples.
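As a rough illustration of why traces can reveal a topology that plain metrics cannot, the sketch below derives caller-to-callee service edges from a batch of spans by joining each span to its parent. This is a naive parent-child join written for this article, not STAM itself, and the `Span` fields and service names are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical shape)."""
    span_id: str
    parent_id: Optional[str]  # None for the entry span of a trace
    service: str


def service_edges(spans):
    """Count caller -> callee service pairs found in a batch of spans."""
    by_id = {s.span_id: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.parent_id)
        # Only cross-service calls become edges in the topology.
        if parent is not None and parent.service != span.service:
            edges[(parent.service, span.service)] += 1
    return edges


spans = [
    Span("a", None, "gateway"),
    Span("b", "a", "checkout"),
    Span("c", "b", "checkout"),   # internal call, produces no edge
    Span("d", "b", "inventory"),
    Span("e", "b", "payment"),
]
edges = service_edges(spans)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

Each edge carries the caller and the callee together, which is the relationship a per-service counter alone would not expose.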
SkyWalking’s architecture includes four key components:
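- The agent; that is, a language agent, or the protocol of another project, that provides metrics and tracing data.
- The Observability Analysis Platform (OAP); a highly modularized and lightweight analysis program, consisting of a receiver and kernels for stream-processing and queries.
- The storage; a pluggable backend for the collected data, with implementations such as Elasticsearch, H2 and MySQL.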
- A UI module to query and display data through the standard GraphQL protocol.
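Because the UI talks to the backend over GraphQL, any client can in principle do the same. The sketch below only builds the JSON body of a GraphQL-over-HTTP request and does not send it. The query text, its field names and the time strings are hypothetical placeholders and are not taken from SkyWalking's published schema, so check the project's query protocol before relying on them.

```python
import json

# Hypothetical query: the field names below are illustrative only and
# are not taken from SkyWalking's published schema.
QUERY = """
query ($start: String!, $end: String!) {
  services(start: $start, end: $end) {
    name
  }
}
"""


def build_request(query, variables):
    """Return the JSON body for a GraphQL-over-HTTP POST request."""
    return json.dumps({"query": query, "variables": variables})


body = build_request(
    QUERY,
    {"start": "2020-07-01T12:00", "end": "2020-07-01T13:00"},
)
decoded = json.loads(body)
print(sorted(decoded))  # ['query', 'variables']
```

Keeping the query text separate from its variables is the usual GraphQL convention, and it lets the same query be reused across time ranges.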
Recent SkyWalking updates have focused on making the project increasingly lightweight, pluggable and customizable, with robust visualizations and a wider reach that now covers monitoring of its own performance and, most recently, browser data.
For mesh adopters, Apache SkyWalking integrates with Istio and Envoy and comes built into the service mesh management platform Tetrate Service Bridge.