Q&A: Ben Sigelman on the Emergence of “Deep Systems” in Microservices
By breaking a monolithic software architecture into discrete, modular chunks, microservices have recently become a popular answer to IT challenges, promising greater software agility, application scalability, and team autonomy. But with the benefits of this novel, service-oriented software architecture come new challenges. As the rise of microservices has produced mind-bogglingly multilayered structures, there’s been a corresponding need to understand, track, and monitor how these discrete, distributed elements interact with one another within increasingly “deep systems.”
To get a better grasp of the emergence and impact of “deep systems,” we caught up with Ben Sigelman, CEO and co-founder of LightStep, which offers observability solutions for deep, multilayered systems. In addition to co-creating the OpenTracing and OpenTelemetry OSS projects, Sigelman previously worked at Google, where he deployed Dapper, its distributed tracing system, and launched Monarch, its high-availability time-series collection, storage, and analysis platform.
What are “deep systems” and how are they related to the emergence of microservices? How is this situation different than the past?
Our industry adopted microservices in order to ship quality software faster. With hundreds of developers working in a single business unit, we needed to create separately managed units — “microservices” — that could be developed, deployed, and operated with autonomy and independence by small teams.
But these microservices are not actually independent, of course: they rely on other microservices, and other microservices rely on them. The depth of an architecture is the number of independently managed layers in the end-to-end application stack, including microservices, monoliths, and managed cloud services. “Deep systems” are production architectures with four or more of these independently managed layers.
Now, in a certain sense, stacks have been “deep” since the invention of the function call in the 1940s. What’s different about deep systems is that each layer in these modern, distributed stacks is developed, deployed, and operated by a distinct development team. When requests cross boundaries between layers and teams, conventional tooling breaks down completely and investigations falter: this is why deep systems are highly correlated with catastrophic on-call shifts, performance mysteries, unexplained regressions, inter-team finger-pointing, and an overarching lack of confidence that decelerates feature velocity — and, ultimately, innovation.
How does this current “deepness” of systems affect control and responsibility?
This is a really important point. I’ve addressed it in a blog post and various talks (e.g., this one at QCon and this one from the Systems@Scale conference), but it’s easier to illustrate than to describe in words alone:
Stress can be defined, concisely, as “responsibility without control.” Seen in that light, it’s easy to understand why deep systems are so stressful for DevOps teams. By design, in a microservices architecture, each team controls only its own service, and yet it is ultimately responsible, and held accountable, for the performance and reliability of everything it depends on in the deep system. When there are many layers of independently managed services downstream, what one controls and what one is responsible for diverge rapidly, and that’s a huge problem.
Why do we need to rethink the “three pillars” of observability?
Microservices observability is rife with misleading and/or misguided advice. Perhaps most prominently, the so-called “three pillars of observability” — traces, metrics, and logs — are just the raw input data for an observability solution. At best, they are “the three pillars of telemetry,” or, more cynically, “the three products we’ve acquired by M&A and now need to find a way to market.” In any case, it is a major mistake to structure an observability strategy around traces, metrics, and logs as distinct product capabilities: the only sensible way to build up an observability practice is around actual use cases and workflows. While traces, metrics, and logs all have their place in these workflows, treating them as separate capabilities means that you will have (at least) three tabs open during releases, incidents, and performance investigations, and further means that you will lose context as you switch from one to the other (and back again, and again).
The so-called “three pillars of observability” — traces, metrics, and logs — are just the raw input data for an observability solution — Ben Sigelman
Gather your traces, metrics, and logs using portable, high-performance instrumentation. I’m partial to OpenTelemetry or its predecessor projects, OpenTracing and OpenCensus (which have now merged to form OpenTelemetry, to be clear), but anything vendor-neutral will do. Then structure observability around the three critical use cases:
- Deploying new service versions (i.e., innovating),
- Reducing MTTR (i.e., enforcing SLOs),
- Improving steady-state performance (i.e., improving SLOs).
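The shape of such portable instrumentation is easy to sketch. The toy recorder below is a hypothetical stand-in for a vendor-neutral API like OpenTelemetry’s, not its actual interface: each operation is wrapped in a span that records a name, a duration, and tags, and the same call sites can then feed whichever backend consumes the span data.

```python
import time
from contextlib import contextmanager

# Toy, vendor-neutral span recorder -- a stand-in illustrating the shape of
# portable instrumentation, not the real OpenTelemetry API.
RECORDED_SPANS = []

@contextmanager
def span(name, **tags):
    """Record the wall-clock duration and tags of one operation."""
    start = time.time()
    try:
        yield tags
    finally:
        RECORDED_SPANS.append({
            "name": name,
            "duration_s": time.time() - start,
            "tags": tags,
        })

# Usage: instrument once at the call site; the recorded spans stay portable.
with span("checkout.charge_card", service="checkout", version="v1.4.2"):
    pass  # business logic would go here
```

Because nothing in the call site names a particular vendor, swapping the backend means changing only where `RECORDED_SPANS` is shipped, which is the property the answer above argues for.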
It’s rare to find vendors that build directly to these use cases. LightStep is one, though to be clear I have some confirmation bias as the founding CEO! But these are the things that actually matter in observability — don’t get dragged into a confusing evaluation based on raw data types; they won’t deliver value on their own.
So what are some of the solutions you are proposing, and how do they work?
For deep systems, the difficult problems typically involve interactions between multiple services communicating across multiple independently managed layers of the distributed stack. Traces are the only type of telemetry that models these multiservice, multilayer dependencies, and so tracing must form the backbone of observability in deep systems.
For instance, let’s say you build, monitor, and maintain Service A. If Service A depends on Service Z — perhaps through several intermediaries — and Service Z pushes a bad release, that will likely wreak havoc on the performance and reliability of Service A and everything in between. The right approach here is to build a model of the application from the perspective of Service A, and to take snapshots of that model before, during, and after things like Service Z’s hypothetical bad release above. By assembling thousands of traces in each snapshot, an observability solution can find extremely strong statistical evidence that the regression in Service A’s behavior is due to the change in the version tag in Service Z; and, further, can correlate the negative change to other metrics and logs in Service Z, both before and in the midst of the bad release.
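A minimal sketch of that statistical approach, using entirely hypothetical service names and synthetic data rather than LightStep’s actual implementation: if each trace records Service A’s end-to-end latency alongside the version tag of the downstream Service Z that handled the request, grouping latencies by that tag makes the correlation with Z’s bad release stand out.

```python
from statistics import mean

# Synthetic traces: each records Service A's end-to-end latency (ms) and the
# version of downstream Service Z observed on that request. The v42 rollout
# is the hypothetical bad release.
traces = (
    [{"a_latency_ms": 100 + i % 5, "z_version": "v41"} for i in range(50)] +
    [{"a_latency_ms": 900 + i % 5, "z_version": "v42"} for i in range(50)]
)

def latency_by_version(traces):
    """Group Service A latencies by the downstream Service Z version tag."""
    groups = {}
    for t in traces:
        groups.setdefault(t["z_version"], []).append(t["a_latency_ms"])
    return {version: mean(samples) for version, samples in groups.items()}

stats = latency_by_version(traces)
# The version tag whose traces carry the worst latency is the prime suspect.
suspect = max(stats, key=stats.get)
```

With thousands of traces per snapshot instead of a hundred, the same grouping extends naturally to any tag — host, region, customer — which is what turns raw traces into statistical evidence.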
What distinguishes your approach from similar competitors?
Historically, most other approaches did not incorporate distributed tracing data in the first place — as such, they have almost no way to analyze or represent the elaborate dependencies between services in deep systems. In recent years, metrics- or logging-oriented products have thrown in distributed traces “on the side,” typically as individual data-points that can be inspected manually in a trace visualizer.
This blunt, simplistic approach can be effective in identifying some limited number of egregious problems, but complex issues in production are more subtle. LightStep’s approach is unique in that the individual traces are sampled and analyzed in order to address specific high-value questions: “What went wrong during this release?”, “Why has performance degraded over the past quarter?”, “Why did my pager just go off?!”
For instance, one of our customers recently experienced a sudden regression in the performance of a particular backend, deep in their stack — it turns out that the underlying issue was that one of their 100,000 customers changed their traffic pattern by 2000x. This was obvious within seconds after looking at aggregate trace statistics, though they estimated it would have taken days just looking at logs, metrics, or even individual traces on their own.
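That kind of finding falls out of simple aggregation over trace tags. As an illustrative sketch with synthetic data (not the customer’s actual workload or LightStep’s internals), counting requests per customer tag and comparing against an assumed baseline rate surfaces the outlier immediately:

```python
from collections import Counter

# Synthetic span stream: each request carries the tag of the customer that
# sent it. One customer ("cust-042") has multiplied its traffic; the rest
# are steady.
requests = ["cust-001"] * 10 + ["cust-002"] * 12 + ["cust-042"] * 20000

BASELINE_PER_CUSTOMER = 10  # assumed historical request count per window

counts = Counter(requests)
spikes = {
    customer: n / BASELINE_PER_CUSTOMER
    for customer, n in counts.items()
    if n / BASELINE_PER_CUSTOMER >= 100  # flag customers far above baseline
}
```

Scanning raw logs for the same signal means reconstructing per-customer rates by hand across services; aggregated trace tags make it a one-line group-by.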
This is all possible because LightStep’s Satellite architecture grants our product access to about 100x more data than a conventional SaaS solution at the same (or lower) cost. With so much more data, and colocated storage and compute, we extract more context about deep systems; this is why we have earned the trust of progressive customers like Lyft, GitHub, Twilio, UnderArmour, and many more.
In the future, what direction do you see the industry going toward? What future steps is your company working on at the moment?
Our industry is still wrapping its head around the full operational and managerial implications of deep systems. Soon, we will recognize that high-quality observability is an absolute prerequisite for high-velocity development in deep systems, and that observability strategy must center around key use cases, not just around telemetry. The OpenTelemetry project will also become ubiquitous across vendors, cloud providers, and critical OSS infrastructure software. This will be a good thing for developers and operators, as high-quality telemetry will be performant, vendor-neutral, and “on-by-default.”
Furthermore, consumers of observability technology will rightfully demand pricing units that are predictable, controllable, and proportional to the scope of the challenges those solutions address – today, pricing in conventional solutions is tied to data volumes that are both unpredictable and poorly correlated with value.
As far as LightStep is concerned, we will continue to do what’s made us successful to date: listening carefully to our customers and prospects as we build our simple, opinionated observability product for deep systems. Thanks to a recent focus on automated telemetry-gathering and single-service analytical features, individual developers can now adopt LightStep in minutes and immediately take greater control over performance and reliability for their own services, particularly when deploying new versions or reasoning about downstream dependencies. Going forward, LightStep will ingest more and more forms of telemetry and continue to build new product around use cases across the entire software lifecycle — in this way, our product can be powerful without being overwhelming.