Q&A: Epsagon Brings Automated Distributed Tracing to Microservices, Serverless

With the move to a distributed, microservice software architecture, there can be more to take into consideration than you might first realize. For example, the logging and application performance management (APM) tools that you have become accustomed to using may not serve your purposes anymore. With a monolithic application, a traditional log can help with troubleshooting, but distributed architectures and innumerable ephemeral microservices could mean that you have countless logs to pore through to find out what’s going wrong in your application. Add to that the multiplied complexity that can come with adopting serverless functions, and legacy logging tools and APMs may no longer be able to provide any insight at all.
Instead, distributed architectures can require new approaches, such as distributed tracing, which can provide insight into what’s happening in software that consists of many interacting parts, rather than a single monolith. We caught up with Nitzan Shapira, co-founder and CEO at Epsagon, a company that provides automated tracing for cloud microservices, to talk about the state of modern application development and troubleshooting and the move from monoliths to microservices, and to find out how Epsagon makes it all a bit easier for modern developers.
Everything you hear these days is about the move from monolith to microservices — how are organizations making this move?
Usually, it’s not a one-step move. There are two parts to it: building new microservice applications and migrating existing applications. Writing new applications is much easier, of course. In many cases, they just decide that from now on they will do microservices on the cloud. They often get assistance from an AWS Solution Architect, for example, or from a system integrator, who comes in and does a lot of the implementation. As for the migration, there is really no easy way to migrate a legacy app to microservices; it’s more like writing it from scratch and connecting it to the existing legacy apps. It’s a lot of work.
What challenges do companies face when moving from monoliths to microservices and distributed architectures?
Many. The first challenge is the development itself, because their existing software is not necessarily built for microservices, containers, or serverless. It may be written in different programming languages, some of them outdated, so they have to write a lot of things from scratch. For example, they might have used Java in the past, and now they want to use something like Node.js or Go. They might have to learn it from scratch. Then, the deployment and CI/CD pipeline are usually more advanced than their existing methods.
The biggest challenge we are seeing is visibility: monitoring and troubleshooting, especially in production. Most of them were using a mix of tools, usually an APM or infrastructure monitoring tool and a log aggregation tool, which worked fine for a monolithic app. But suddenly, when they go into microservices, the data that comes from those tools doesn’t tell the story of a distributed system. In a microservices environment, those solutions don’t provide a good understanding of what’s going on in production. Logs are just not the right tool to understand a distributed system, since there is no correlation between them. Organizations sometimes realize this only after they are in production, and it becomes a barrier to running a microservices architecture there. Monitoring and troubleshooting are the top issues we’re seeing.
Who’s making the move to microservices? We always hear examples from leading-edge technology companies, but are you seeing more traditional companies making this transition and if so, how?
Yeah, we see many traditional companies, actually. It can be retail or insurance or anything like that, and suddenly they understand they have to be more digital. For example, in the eCommerce space, everyone is going digital and online and then they want to do it using proper technologies. In these cases, it’s actually much more interesting because they get a lot of assistance from the cloud providers. The result is actually a very complicated app, very modern — sometimes more than you see in a tech company, because they do it right from day one. They can just start from scratch with an entirely new architecture in the cloud. These apps are well-designed, but also very complicated for an enterprise that is not used to working in such environments.
Speaking of the move from monolithic to microservices, how do organizations traditionally monitor and troubleshoot and how does it change when you make the move to microservices?
What we’re seeing today is that, first of all, there is no more separation between IT Operations, DevOps, and developers. Developers have to be involved much more because every problem is more complicated than ever. It doesn’t actually make sense to separate logs and tracing because there is a lot of back and forth that’s happening between tools and between departments. It’s a huge inefficiency. The logs on their own are not enough to understand the system because the data flow is very hard to comprehend when your architecture contains dozens or hundreds of nodes. You have to correlate it. Organizations realize this very quickly, and that’s why technologies like distributed tracing began to emerge. They are meant to overcome the fact that everything is uncorrelated and instead create a trace between the services.
Now, however, if you look at tracing solutions, they are usually separated from the metrics and the log solutions, which still creates a problem because teams have to jump back and forth between different tools. At Epsagon, for example, we are overcoming this problem by taking the approach that distributed tracing is fundamental for a distributed app. Our approach is not to use logs as the main source of truth, but instead simply give access to the existing logs. We have built something we call “payload visibility,” which captures the data passed between the services, then stores and indexes it. It allows very rapid troubleshooting. Usually, people using Epsagon don’t rely on their log aggregation services anymore.
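To make the correlation idea in this answer concrete, here is a minimal Python sketch. It is not Epsagon's implementation; the three services, the `LOGS` list, and the `trace_id` field are all hypothetical names invented for illustration. It only shows why passing one shared identifier along a request lets otherwise unrelated log lines be grouped back into a single story.

```python
import uuid
from collections import defaultdict

# Hypothetical in-memory log sink shared by all "services" in this sketch.
LOGS = []

def log(service, message, trace_id):
    # Every record carries the trace ID of the request that produced it.
    LOGS.append({"service": service, "message": message, "trace_id": trace_id})

def checkout_service(order_id, trace_id=None):
    # The entry-point service creates a trace ID if the caller did not supply one.
    trace_id = trace_id or uuid.uuid4().hex
    log("checkout", f"received order {order_id}", trace_id)
    payment_service(order_id, trace_id)
    return trace_id

def payment_service(order_id, trace_id):
    # Downstream services reuse the ID they were handed instead of making a new one.
    log("payment", f"charging order {order_id}", trace_id)
    inventory_service(order_id, trace_id)

def inventory_service(order_id, trace_id):
    log("inventory", f"reserving stock for order {order_id}", trace_id)

def group_by_trace(records):
    # Correlate records from different services by their shared trace ID.
    grouped = defaultdict(list)
    for record in records:
        grouped[record["trace_id"]].append(record["service"])
    return dict(grouped)

first = checkout_service(1)
second = checkout_service(2)
traces = group_by_trace(LOGS)
```

Without the shared ID, the six log lines above would be interleaved with no way to tell which order each one belonged to. With it, each request can be read as one ordered path through the services.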
So when we’re talking about microservices, we often think about containers and Kubernetes. Where does serverless fit into all of this?
Serverless is a type of technology that can be used as part of this microservices story. It’s the cutting edge. It’s still considered to be very advanced and obviously, it’s not yet everywhere. If a company decides to go to the cloud, then in many cases they will choose to try and go serverless as much as they can because it just makes more sense. They don’t have to manage an infrastructure and it’s very cheap. Eventually, it’s going to be a mix of serverless, containers, and traditional VMs whenever it’s needed because it’s not possible to write everything from scratch. Serverless requires a lot of design and thinking behind it.
What challenges does serverless present for logging and tracing and how do you guys overcome them with Epsagon?
The first challenge is that everything becomes much more distributed, at least an order of magnitude more, because every function is much smaller and you have many more of them. The architecture is much more complicated. If you use logs, you’re probably going to be lost very quickly. That’s where distributed tracing is far more critical. If you look at some of our case studies, you can see visual maps that show the complexity and how difficult it is to understand what’s going on without them. Another issue is the fact that there is no place to install an agent. Traditional APMs rely on agents, but you need to install them somewhere. Agents do different things that cannot be done inside a serverless function. In order to do tracing, you need to do instrumentation, and you need to be inside the code and to do it without an agent. It’s a technical challenge.
At Epsagon, for example, from day one we had to build everything agentless, using just a code library that can be part of any kind of service. This is an advantage because today we can provide the same experience for serverless, containers, and traditional VMs.
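As a rough illustration of what instrumenting from inside the code, with no agent, can look like, the Python sketch below wraps a function handler in a decorator that records an event for each invocation. This is an assumption-laden example, not Epsagon's library: `traced`, `EVENTS`, and the handler are hypothetical, and a real library would send events to a backend rather than append them to a list.

```python
import functools
import time

# Hypothetical collector; a real library would ship these events to a backend.
EVENTS = []

def traced(name):
    """Wrap a handler so each call records its duration and any error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            event = {"name": name, "error": None}
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                event["error"] = repr(exc)
                raise
            finally:
                event["duration_s"] = time.perf_counter() - start
                EVENTS.append(event)
        return wrapper
    return decorator

@traced("resize-image")
def handler(event, context):
    # Stand-in for a serverless function body.
    return {"status": 200, "width": event["width"] // 2}

result = handler({"width": 640}, None)
```

Because the wrapper lives in the same process as the function, nothing has to be installed on a host, which is the property that matters when there is no host to install on.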
Is this something that developers need to think about as they’re building the application or could they apply it retroactively?
We are under the assumption that our customers are already running in production, possibly thousands of microservices. Everything we do is automated. If you have an existing app, Epsagon can be injected automatically into your existing services. Then, all of the tracing and the instrumentation happens completely automatically, including connecting to your cloud account or Kubernetes cluster to generate a bunch of metrics and insights as a first step. Everything we do is under the assumption that the application is already running in production at very high scale. So, the answer is no, you don’t have to make a decision in advance before using Epsagon.
Along with considerations for languages and architectures, are there organizational changes that need to accompany the move from monolith to microservices to make all this work better?
Yes. I think the main big change that has to happen is the fact that developers and operations are no longer separated. It’s a big change, but it’s just the way things work today. It means that developers are much more involved in what happens after the code is deployed. It’s not the operations team’s problem anymore. The owner of the application is becoming the developer and not necessarily operations. It’s something that most organizations are still trying to figure out. If you think about stuff like managed services and serverless, the operations team sometimes isn’t even aware that a new service was just pushed to production. Somebody has to know about it. If it’s not operations, then it has to be the developers.
Previously Epsagon focused on cost monitoring as a differentiator but less so now. We’re wondering what you see as your competitive edge in the monitoring space now?
Our differentiation has always been the fact that we can do automated distributed tracing, which is extremely difficult to do. There is no solution today that does automated distributed tracing for the type of applications that we do. If you have a cloud-based container app today, you have no solution that will provide you with this kind of tracing automatically unless you want to spend weeks or months implementing something like OpenTracing on your own, which we use under the hood, but we implement for you. That’s a big difference.
The differentiation on top of that is the fact that we consolidate many tools to provide an efficient experience for rapid troubleshooting. Payload visibility means you have access to your data at any given time with no sampling. That’s super powerful for troubleshooting. Our customers usually start with a mix of tools, and when they move to Epsagon they stop using that mix of tools and use just Epsagon. We give them the opportunity to ask very sophisticated questions about what happened and find the solution right away so they can fix a problem within a few minutes instead of hours or days. That has always been the differentiation.
Some of the cool stuff we did is introducing support for special services such as AWS AppSync, which is a highly popular service in AWS that has no troubleshooting solution today. Epsagon is the only provider today to do that. There is no solution today that is fundamentally based on distributed tracing and does it automatically as we do.
How does Epsagon do what it does? How does it automatically discover all of these services in a preexisting architecture?
It’s all based on automated code instrumentation. Our code library automatically goes inside an existing app. It can be done using an injection, a CI/CD plugin, or just by adding a few lines of code at the top. Just follow the instructions and once the library is there, the code is automatically instrumented. We already support the top five programming languages. Once the code is instrumented, events are extracted at runtime and sent back to our SaaS platform. That’s where we have the engine to be able to connect all those events together to create distributed traces across a very, very big variety of frameworks and technologies. It can be a container publishing a message to Kafka, written in Java, going to an Express Node.js container and then triggering a Lambda function. All of those things will be connected automatically from end to end.
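The step of connecting events into an end-to-end trace can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a description of Epsagon's engine: the event fields (`trace_id`, `span_id`, `parent_id`, `service`) and the sample values are hypothetical, loosely modeled on the parent and child span idea used in tracing standards such as OpenTracing.

```python
from collections import defaultdict

# Hypothetical events, as they might arrive (out of order) from instrumented services.
events = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "node-express"},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "java-producer"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "service": "lambda"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "kafka"},
]

def build_trace(events, trace_id):
    """Return (depth, service) pairs for one trace, ordered from root to leaf."""
    spans = [e for e in events if e["trace_id"] == trace_id]
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span  # The span with no parent started the request.
        else:
            children[span["parent_id"]].append(span)

    ordered = []

    def walk(span, depth):
        # Depth-first walk that follows parent -> child links.
        ordered.append((depth, span["service"]))
        for child in children[span["span_id"]]:
            walk(child, depth + 1)

    if root is not None:
        walk(root, 0)
    return ordered

trace = build_trace(events, "t1")
for depth, service in trace:
    print("  " * depth + service)
```

The events arrive in arbitrary order, yet the parent links are enough to recover the path from the Java producer through Kafka and the Node.js container to the Lambda function.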