LogDNA is a sponsor of The New Stack. This article will appear in the forthcoming e-book, Cloud Native Observability for DevOps Teams.
For the folks at LogDNA, DevOps is all about empathy, in getting beyond the “hot potato” mindset of not worrying about something because it is not your problem, and instead getting the whole crew working together on a shared challenge. LogDNA’s own roots are in solving a problem that traditional log management software couldn’t — ingesting unstructured logs and structuring them so that the developer or administrator didn’t have to do it by hand.
The company has taken this holistic approach to its own product, building out a platform that gives both the developer and the site reliability engineer (SRE) access to the information they need to build and maintain software.
The New Stack interviewed longtime LogDNA System Architect Ryan Staatz, as well as company co-founder Lee Liu, about how a company can embrace DevOps best practices even as it dramatically increases its size, the importance of observability for DevOps, and how the best observability tools are made for multiple types of users.
The interview has been edited for brevity and clarity.
The New Stack: What is your definition of DevOps?
Staatz: DevOps is a nebulous term. It’s somewhere between developer and operations. If you look at Wikipedia, it’ll say it’s a very specific set of methodologies and practices, which is the formal definition. But like most things, once the industry gets ahold of it, it takes on a definition of its own.
And, at least for me, DevOps is figuring out how you build that cohesive lifecycle for the lifespan of the application, where you go from developing in code to testing, to pushing it out to production, and getting feedback, that whole loop. I feel like a lot of that loop has gaps, depending on what stage of maturity your organization is in.
When you’re really small, and you only have a few people, it’s not hard to have one or two people just own the entire lifecycle. Communication is really fast. As you grow, you have more teams and more people available to you to run faster. But now gaps start appearing in your lifecycle: where an app sits after the developer hands it off, or between releases, or maybe you don’t have a release team at all.
DevOps is the idea that you’re owning those gaps, you want that whole lifecycle to work. It’s this idea that development and operations, while they’re largely seen as separate entities doing separate things, have some of the same goals. So, we should all be in this together.
What are some of the gaps between Dev and Ops in many organizations?
Staatz: The age-old gripe that I hear about Dev and Ops teams is that it is this hot potato that keeps getting tossed over the fence. That’s an unhealthy interaction that you want to avoid. In DevOps, you have a shared goal that you’re working towards.
The attitude that you’ll see in some places is, “How do I make this not my problem?” And that’s, I think, just an unfortunate part of being at a company that may not have all the resources set aside or may not have all the right processes in place. And no company is perfect.
Some of it just has to do with the fact that as you grow as a company, you have more strictly defined roles. And when you get to areas that are somewhat shared it can be harder to basically say, “Well, who owns that?” Well, the answer is, nobody really.
There is no magic fix, you know? It’s usually more along the lines of encouraging people to think about what other teams are doing, think about why they’re doing it. Understand their goals, maybe look at their tools. So, it takes a little bit of effort, honestly.
Liu: One of the challenges we’ve seen come up when Dev and Ops teams aren’t aligned is risk aversion. In the airline industry, you can see this problem exacerbated over decades. You look into a cockpit, and you see things that are really old, like a computer that looks like it belonged on my dad’s desk, right? Everything else about the planes is very modern, like the seatback TVs and the like. But the actual technology seems really old.
And I think that could be one of the things that comes out of risk aversion, where the Ops team got burned one too many times. So now there’s PTSD that’s developed over the years. And the Ops team is like, “Just don’t make too many big changes, and then things will be more stable.”
What the airline industry faces is that when airplanes crash, people die. So, they are not eager to change the software. Software may have been in use since 1990, but it doesn’t need to change. It will fly the plane and it will land the plane. That’s all it needs to do. It doesn’t need to be modernized.
What ends up happening is now [you’re] using 20-year-old technology, and I don’t think that is the right approach for software startups at least. There is always risk in deploying new code for sure. But it’s up to us using the tools — especially observability tools — to help facilitate that when problems happen, we know how to fix them.
What could observability bring to DevOps?
Staatz: Part of the whole lifecycle is figuring out where things are breaking down, right? Conversely, where in this lifecycle can I improve things, so things don’t break down?
At the end of the day, observability is all about getting details around what’s going on.
There are lots of different tools out there that help do different things, and often in different parts of that lifecycle. It can be specific to some section of your infrastructure. It might be from something in your application code that is sending data somewhere. You can have things that can be logged. It could be metrics. It could be something that keeps an eye on the state of your environment.
“One of the best things you can do is be a good Samaritan toward other people who are either developing code or maintaining your code.”
—Ryan Staatz, system architect, LogDNA
And so, there becomes this sort of interesting intersection, like we talked about earlier, where different teams overlap in certain spaces, and it becomes unclear as to who owns what in those spaces. If you have observability tools that help both of those teams achieve their goals, and hopefully their goals are somewhat aligned, then that can be a huge help.
We have a product feature we developed called Kubernetes Enrichment. It displays information about the state of the Kubernetes cluster relevant to that log line in that application, to give you a sense of whether something is going wrong in the environment at that time. And seeing that information is really helpful for two different teams, this overlap of development and infrastructure folks. And so it helps clarify an area that’s somewhat vague: what might be going on at that point in time.
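The core idea of pairing a log line with cluster state can be sketched roughly as follows. This is an illustrative assumption, not LogDNA’s actual implementation or schema; the function, field names, and sample data are all hypothetical:

```python
# Hypothetical sketch: attach a snapshot of a pod's Kubernetes state to an
# application log line, so developers and infrastructure folks see the same
# context. Field names here are illustrative, not LogDNA's schema.

def enrich_log_line(line: dict, pod_state: dict) -> dict:
    """Merge pod-level context (phase, restarts, node) into one log line."""
    state = pod_state.get(line["pod"], {})
    return {
        **line,
        "k8s": {
            "phase": state.get("phase", "Unknown"),
            "restarts": state.get("restarts", 0),
            "node": state.get("node", "unknown"),
        },
    }

# A snapshot of cluster state, e.g. as polled from the Kubernetes API.
pod_state = {
    "checkout-7d9f": {"phase": "CrashLoopBackOff", "restarts": 12, "node": "node-3"},
}

line = {"pod": "checkout-7d9f", "msg": "payment timeout", "ts": "2021-06-01T12:00:00Z"}
enriched = enrich_log_line(line, pod_state)
print(enriched["k8s"]["phase"])  # CrashLoopBackOff
```

With this kind of join, a developer reading the “payment timeout” message also sees that the pod was crash-looping at that moment, without having to ask an SRE to check the environment.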
What would somebody do if they didn’t have Kubernetes Enrichment? How would they get their Kubernetes logs? Do they even have access to that information?
Staatz: There are some complications that are sort of exacerbated by that gap that I talked about earlier. There are a lot of environments, especially production, that many developers may not be allowed to access, so getting the information about those pods might involve logging into another tool, assuming you have that set up. In our case, that’d be something like Sysdig. But it’s limited in the types of metrics that are useful for developers.
What you normally do without any tooling is, you just basically say, “Hey, SRE friend, could you go look at the environment for me?” And at that point, it might be too late to figure out what’s going on. So that’s a lot of steps that need to happen across teams, which often can be difficult.
That kind of gets into the whole question of, if you have other tooling, does the developer know how to use it? If it’s not generally used by developers, this gets back to the same problem: how much of this is individual motivation (“I need to really dig deep and do more”), how much is organizational processes, and how much is just choosing the right tool?
How are the observability demands of developers different from those of system administrators or site reliability engineers (SREs)?
Liu: I would say that, at least from a logging perspective, we try to make a tool that can be used by both parties. It’s never perfect. And so, you can only cater so much to one audience versus another, because they’re fundamentally looking for different things, and they understand different contexts.
Developers don’t really care about other people’s apps. They care about the ones they’ve written and the ones they need to debug. They understand their app much more intimately than they would know about what else is running on the system and plaguing the system that their app is running on.
The system administrators and SREs are looking to make sure that the system as a whole is stable. It’s more of a macro-versus-micro level. Everything needs to behave cohesively in order for this to work.
The tools that we can build can try to have different things so that everyone has a piece of data they’re looking for. But it’s definitely challenging to have tools that work for both parties.
What challenges does Kubernetes present in terms of observability?
Staatz: Kubernetes did a lot of cool stuff: a lot of built-in tooling, a lot of CLI things that are wonderful, even a dashboard you can use. That being said, tracking down metrics and logs around microservices that run on hundreds of nodes can be really hard, even if you are a Kubernetes whiz and you know all the label selectors and logs.
Kubernetes is fairly new, and it will mature over time, but it can be hard to track everything down. And not everyone wants to use command-line interface tools constantly even if they are technical. Kubectl is great, but at a certain point, you want to go to a tool that’s a bit more user-friendly.
Lee, what was the technical issue that spurred you and Chris Nguyen to focus LogDNA on logging? What was frustrating you about other logging tools?
Liu: We built our entire backend on Elasticsearch. Something that I did not like about Elasticsearch — though it wasn’t actually Elasticsearch itself — was the ingestion portion. Elastic used Logstash to ingest the logs. The problem with Logstash was you need to tell it what type of logs you’re ingesting so it can do the regex filtering and to enrich data and all that stuff.
If you have one app, that’s not hard, right? But if you also use Mongo or Redis, Logstash becomes a little bit hard to maintain because you have different log sources. So, we wrote our own ingestor that would basically take the incoming log data, auto-detect what type of log it was, and auto-parse it.
It didn’t matter what kind of logs we sent it: it would handle some things generically, and some things very specifically. If it detected a web log, it would do things specific to web logs. If it looked like a MongoDB log, it would do things that are MongoDB-related.
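The auto-detection Liu describes can be sketched roughly like this. The patterns, type names, and `detect_log_type` function are illustrative assumptions for a minimal version of the idea, not LogDNA’s actual ingestor:

```python
import json
import re

# Ordered list of (type name, pattern) pairs; first match wins.
# These patterns are simplified assumptions for illustration.
PATTERNS = [
    # Common web access-log shape: dotted-quad IP, then a quoted HTTP method.
    ("weblog", re.compile(r'^\d{1,3}(\.\d{1,3}){3} .*"(GET|POST|PUT|DELETE) ')),
    # MongoDB-style text log: ISO timestamp, severity letter, component name.
    ("mongodb", re.compile(r"^\d{4}-\d{2}-\d{2}T\S+ [IWEF] (NETWORK|STORAGE|COMMAND|REPL) ")),
]

def detect_log_type(line: str) -> str:
    """Return a best-guess source type for one raw log line."""
    # Structured JSON logs parse directly; no regex needed.
    try:
        json.loads(line)
        return "json"
    except ValueError:
        pass
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "generic"

print(detect_log_type('127.0.0.1 - - [01/Jun/2021] "GET /health HTTP/1.1" 200 2'))  # weblog
print(detect_log_type("2021-06-01T12:00:00.000+0000 I NETWORK  [conn1] end connection"))  # mongodb
```

Once a line is typed this way, the ingestor can route it to a source-specific parser, which is the step Logstash forced users to configure by hand per log source.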
For some of the other companies we talked to about Elasticsearch, that was their struggle: the lack of auto-detection. They had to do all those things manually. And they didn’t want to do it manually.
Why is cross-team empathy so important when there’s not a dedicated DevOps person or DevOps team for a company? And how can logs help with that?
Staatz: Some of the challenges that I face now are not so much solving technical problems, but more along the lines of having to solve technical problems organizationally. It’s a different set of problems, because I can’t simply go in and fix the product. That’s not how that works. So, a lot of it is figuring out which team owns what.
And as you start going into this, you get into this sort of routine of asking, “Hey, what are you guys working on?” You inadvertently develop empathy for these different teams as you try and get your work done, because, you know, maybe the work encompasses more than just one team at this point.
Production is a big, scary place that you just ship your apps to, right? And you’re like, “Oh, my app goes there. And I hope it works. Magic, right?” Well, it’s not magic. There’s somebody on the other end that has to clean up your mess or gets you to clean up your mess if there’s a problem. And I know, it’s kind of a negative way of thinking about it. But one of the best things you can do is be a good Samaritan toward other people who are either developing code or maintaining your code.
And as you become a bit more experienced as a developer, you start thinking about some of the same things that infrastructure deals with every day. And so having logs and observability that cover both of those things, and can be collaborated on by both, can go a long way. Anything that can help build those bridges and can build that trust and keep people on the same page, it’s great.
MongoDB and Redis Labs are sponsors of The New Stack.