SkyWalking: APM for the Heterogeneous New Stack

6 May 2019 3:00am, by

SkyWalking, one of the Apache Software Foundation’s newest Top-Level Projects, has expanded from a pure tracing system to an observability analysis platform and application performance management/monitoring system.

It began four years ago when Sheng Wu and his colleagues were trying to build large, distributed systems for China Unicom along with seven other vendors.

“We had bad experiences there. Sometimes it was impossible to identify which vendor’s system caused a failure. So I decided to build SkyWalking as a training project. We used distributed tracing to guide newcomers and colleagues to understand the problems that arise when you build a distributed system,” said Wu, now project founder and vice president.

The trickiest problems system administrators face arise among numerous, interdependent services: identifying why and where a request is slow, distinguishing normal from deviant system performance, comparing apples-to-apples metrics across apps regardless of programming language, and attaining a complete and meaningful view of performance, he said.

Apache SkyWalking is an application performance monitor (APM) tool that provides an automatic, efficient way to instrument microservices, cloud-native and container-based applications.

The project entered the Apache Incubator in December 2017. It has grown from six to eight volunteers to more than 100 source contributors. It’s used by more than 70 companies, including Alibaba, autohome.com, China Eastern Airlines, China Merchants Bank, Huawei, Sinolink Securities and WeBank.

SkyWalking is a different breed of performance management system, according to Wu.

“It provides a holistic platform for collection, aggregation and domain specific query system. It also is truly heterogeneous, that it not only has agents for different systems, it also seamlessly blends service mesh in,” he said.

He pointed out these features:

  • A polyglot agent-based instrumentation mechanism. Tools that focus solely on distributed tracing usually don’t provide agents. SkyWalking provides agents for multiple languages, with auto-instrumentation supported in Java, .NET and Node.js.
  • Performance: Its CPU impact on the monitored application is less than 10%, even when an instance handles just over 5,000 transactions (or requests) per second. With overhead this low, 100% trace sampling is feasible in production environments.
  • Observability for distributed systems based on traditional, agent-based and service mesh architectures, with consistent analysis and visualization.
  • Topology and dependency analysis without sampling.
  • Easy operation and maintenance, handled directly by SkyWalking’s own clusters, without reliance on big data technology.
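
One way to picture what a topology map is derived from: if every cross-service call is recorded rather than sampled, the dependency graph is simply a count of calls per caller/callee pair. The sketch below is an illustration under that assumption, not SkyWalking’s implementation; `CallSpan` and `TopologySketch` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical record of one cross-service call; not a SkyWalking type. */
final class CallSpan {
    final String callerService;
    final String calleeService;

    CallSpan(String callerService, String calleeService) {
        this.callerService = callerService;
        this.calleeService = calleeService;
    }
}

final class TopologySketch {
    /** Counts calls per "caller -> callee" edge over every span passed in (no sampling). */
    static Map<String, Integer> buildEdges(List<CallSpan> spans) {
        Map<String, Integer> edges = new TreeMap<>(); // TreeMap keeps edges in sorted order
        for (CallSpan span : spans) {
            String edge = span.callerService + " -> " + span.calleeService;
            edges.merge(edge, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return edges;
    }

    public static void main(String[] args) {
        // Made-up service names, purely for illustration.
        List<CallSpan> spans = Arrays.asList(
                new CallSpan("gateway", "orders"),
                new CallSpan("orders", "inventory"),
                new CallSpan("gateway", "orders"),
                new CallSpan("orders", "payments"));
        buildEdges(spans).forEach((edge, calls) -> System.out.println(edge + " : " + calls));
    }
}
```

With the four made-up spans in `main`, the result has three edges, and the `gateway -> orders` edge carries a count of 2. Because no calls are dropped, rarely used dependencies still show up as edges.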

“SkyWalking is one of the only open source tracing systems where usability and user interface have been a focus, something missing in most open source projects,” said Jonah Kowall, chief technology officer at Kentik, and former research vice president at Gartner. “Making tracing and APM more easily used by developers and operations team is a key goal which makes Apache SkyWalking a project to watch.”

SkyWalking accepts data from both tracing and metric sources and integrates with service mesh platforms like Istio and Envoy, Wu said. Users don’t just collect data and get alerted to problems; they get a complete and meaningful map of their entire distributed system, with actionable insights.

“In modern, cloud-native and container-based deployment environments, observability comes first because it’s critical to have a complete view of performance. Our visualization tools, such as a topology map, trace tree and customizable dashboard, help end users really understand the distributed system and maintain high performance,” he said.

It presents both tracing and metrics for traditional and modern services in a way that reduces debugging and resolution time for users. Log collection and analysis are part of the 2019 roadmap.

Many APM products use config-based instrumentation, which can present burdensome challenges such as library dependency conflicts or labor-intensive API upgrades, he said.

SkyWalking’s Java agent uses a Java runtime code manipulation mechanism and provides a set of APIs for defining instrumentation plugins for dozens of supported frameworks.
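
To make the idea of runtime code manipulation concrete, here is a minimal sketch of the JVM’s standard agent hook, `java.lang.instrument`. It is not SkyWalking’s agent or its plugin API: the class names `SketchAgent` and `PrefixTransformer`, and the use of the agent argument as a package prefix, are assumptions made for illustration. The transformer only records which classes it would instrument and leaves their bytecode untouched.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Hypothetical agent entry point (not SkyWalking code). When the JVM is started
 * with -javaagent, it calls premain() before the application's main().
 */
final class SketchAgent {
    /** Names of classes the transformer matched; stands in for real instrumentation. */
    static final List<String> MATCHED = new CopyOnWriteArrayList<>();

    public static void premain(String agentArgs, Instrumentation inst) {
        // Assumption for this sketch: the agent argument is a package prefix to watch.
        // With no argument the prefix is empty, so every class matches.
        String prefix = (agentArgs == null ? "" : agentArgs).replace('.', '/');
        inst.addTransformer(new PrefixTransformer(prefix));
    }
}

/** Sees each class as it is loaded and decides whether it would be instrumented. */
final class PrefixTransformer implements ClassFileTransformer {
    private final String internalPrefix;

    PrefixTransformer(String internalPrefix) {
        this.internalPrefix = internalPrefix;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        // className uses '/' separators (e.g. "com/example/Foo") and can be null.
        if (className != null && className.startsWith(internalPrefix)) {
            SketchAgent.MATCHED.add(className);
            // A real agent would rewrite classfileBuffer here, for example to wrap
            // methods with tracing code, and return the modified bytes.
        }
        return null; // null tells the JVM to keep the original bytecode
    }
}
```

Packaged in a jar whose manifest sets `Premain-Class: SketchAgent`, this would be attached with something like `java -javaagent:sketch-agent.jar=com.example -jar app.jar` (jar names are hypothetical). The application code itself needs no changes, which is the appeal of agent-based instrumentation.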

SkyWalking works with service meshes, and the project added features in its latest 6.0.0 release to support Istio’s Mixer. In the near future, it will support service mesh observability directly from Envoy. Wu and Lizan Zhou, founding engineer at Tetrate, will talk about this at KubeCon China in June.

The project’s roadmap is focused on four areas, Wu said:

  • Providing agents/SDKs for more languages in addition to the current Java, .NET, Node.js and PHP
  • Deepening integration with service mesh, Istio, Envoy and Kubernetes ecosystems
  • Supporting log collection, persistence, and connection with metrics and tracing
  • Continually evolving to provide users with a comprehensive observability stack for traditional and modern workloads.

“I hear regularly from users that observability is the most important feature they’re getting out of their service mesh,” said Zack Butcher, core contributor to Istio. “By integrating Apache SkyWalking with Istio, the SkyWalking team has brought their incredible tools for deeply understanding system behavior to the mesh. We’ve already seen great results, and I can’t wait to see what further insights users unlock using Apache SkyWalking together with Istio to observe and manage their deployments.”

Feature image: “Comic Daredevil Bello Nock – 2013 Royal Melbourne Show” by Chris Phutully. Licensed under CC BY-SA 2.0.