Cloud Native Observability: Fighting Rising Costs, Incidents

Observability is never easy, but once you start deploying on multiple clouds, it can grow into a much bigger and tougher job.
The industry-wide problem is twofold, according to Martin Mao, CEO and co-founder of Chronosphere. First, Mao said on this episode of The New Stack Makers podcast, customer-facing incidents are on the rise, with engineers spending an average of 10 hours a week on debugging, according to some studies.
Plus, “the tools that we use cost a lot more, and they’re actually less effective at the job that they’re trying to do,” Mao said, adding, “From a ROI equation perspective, both sides of that equation are getting worse. And that’s actually a really big issue for the industry right now.”
In this episode of Makers, Mao talked to Heather Joslyn, of TNS, about the challenges of observability in cloud and multicloud environments. The discussion addressed how to help teams prioritize what matters most when they get alerts, and how people are using observability to help them “shift left” and take on more responsibility for security at the development stage.
How Observability Can Serve Developers
When organizations begin deploying on distributed cloud architectures, they not only see the volume of data they’re producing increase, but they’re also usually asked to make a cultural change to DevOps, which demands greater accountability from developers.
“Historically, developers have not had to operate their software in production, there’s been a centralized operations team,” Mao said. “These days, you may have [a site reliability engineering] team or a platform engineering team. But those teams are not really responsible for operating software, they’re responsible for setting up the frameworks and the tools and providing the best practices.”
In the cloud native world, he added, “each developer really has to take that responsibility into their own hands… the average developer doesn’t just have to write the piece of software. Now they have to know how to operate it and monitor it in production, they have to know how to secure it, they have to know how to deploy it and learn everything about the CI/CD.”
The problem is, Mao said, many observability tools were built for centralized operations teams.
Cloud native observability tooling should help developers run, maintain and operate the software in production, he said. But it could potentially do more than that.
“One of the things that we’re finding is that observability can play a really good role in even helping decode the fairly advanced environment that a developer is running in,” Mao said. Using such tooling along with data, he said, could help developers answer questions like, “Where does my piece of software run? How does it run? What are its actual dependencies? Because that’s often something that you may not know.”
“There’s actually a lot that the observability tooling can be doing to sort of inform the developer [and give them] all that information in, ideally, a really simple way.”
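The episode doesn’t name a specific toolchain for surfacing that context, but as a minimal sketch of how “where does my software run, and what does it depend on?” can be attached to telemetry, here is a hedged example using the OpenTelemetry Python SDK. The service name, region and downstream dependency are illustrative assumptions, not anything Mao prescribes.

```python
# Sketch: attaching runtime context and dependency information to telemetry
# with the OpenTelemetry Python SDK. Names below are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes describe where this service runs.
resource = Resource.create({
    "service.name": "checkout",             # hypothetical service
    "deployment.environment": "production",
    "cloud.provider": "aws",
    "cloud.region": "us-east-1",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")

# Span attributes record what the service depends on for a given operation.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("peer.service", "payments-api")  # downstream dependency
    # ... call the payments API here ...
```

With attributes like these in place, an observability backend can answer the questions Mao raises — which environment and region a service runs in, and which services it actually calls — without the developer having to reconstruct that picture by hand.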
To help teams sift through all the information they get from observability efforts and contain costs, Chronosphere has created a vendor-neutral framework called the “Observability Data Optimization Cycle.”
The first step the framework recommends is to establish centralized governance that sets budgets for all teams producing data.
“You really want to perhaps optimize all your data, so you can get the value out of it,” Mao said, “but without paying the whole cost. So we created this framework for companies to think about it and really apply a lot of the FinOps concepts to the observability space.”
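The episode doesn’t go into how such budgets are enforced, but as a rough, hypothetical illustration of the FinOps-style idea of per-team budgets for observability data, a centralized governance check might look something like the sketch below. The team names and limits are made up for the example.

```python
# Hypothetical illustration of centralized governance: per-team budgets for
# observability data volume (here, active time series), checked periodically.
# Team names and numbers are invented for the example.

TEAM_BUDGETS = {  # maximum active time series each team may produce
    "checkout": 500_000,
    "search": 250_000,
    "platform": 1_000_000,
}

def check_budgets(observed: dict[str, int]) -> list[str]:
    """Return warnings for teams exceeding (or missing) their budget."""
    warnings = []
    for team, used in observed.items():
        budget = TEAM_BUDGETS.get(team)
        if budget is None:
            warnings.append(f"{team}: no budget assigned (unknown team)")
        elif used > budget:
            pct = 100 * used / budget
            warnings.append(
                f"{team}: {used:,} active series is {pct:.0f}% of its {budget:,} budget"
            )
    return warnings

if __name__ == "__main__":
    # In practice, observed counts would come from the observability
    # backend's own usage metrics rather than being hard-coded.
    for warning in check_budgets({"checkout": 620_000, "search": 180_000}):
        print(warning)
```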
Check out the full episode for more on meeting the challenges of cloud native observability.
If you want more, here’s another episode with Chronosphere: Chronosphere Nudges Observability Standards Toward Maturity