Technology Decisions for a Successful Observability Strategy
This is Part 2 of a two-part series on how observability design and architecture need to start with the fundamentals: keeping people and process at the center of it all. In Part 1, we discussed how and why people and processes are keys to achieving significant change and progress with any IT and engineering function. In Part 2, we cover the technology aspect of observability design and architecture, and explore “build vs. buy” and how to plan for failure.
“Build vs. buy” is a common debate in the observability field. My first question to anyone participating in this discussion is, “How much engineering time do you have to support a build-your-own solution, and how does that compare with the cost of paying for a commercial tool?”
After all, time is the greatest limiting factor for any enterprise. You have to ask yourself, do you spend your engineering time building and maintaining an internal tool or focusing on providing more value to the business? If you are spending too much time on an internal tool, then you really need to consider buying a tool that allows you to move your engineering time to customer-focused work.
The other factor to consider is enterprise complexity. If your enterprise has standard, stable logging, you might not need all the options a commercial tool would provide.
I have never had the luxury of low complexity and standard formats combined with plenty of time to engineer a build-your-own solution. I wanted my teams to spend time building business-focused tools, such as ML-powered service monitoring and advanced dashboards, and training users to take full advantage of our toolset, instead of building custom extensions and patching ELK or Kafka. I prefer a mix of commercial tools such as Cribl LogStream and Splunk to form the basis of my observability pipeline. (I look for the best mix of capabilities and ease of use.) In my 20 years of working with observability tools, LogStream and Splunk Enterprise have been the two biggest disruptors in the observability space. Both tools brought enormous value and genuinely new functionality to the space. Best of all, they made what was previously hard much easier.
Minimizing business-as-usual admin time is also critical to keeping as much engineering time as possible focused on solving business challenges. Every process should be scalable, and configuration and code should live in a version control system so they can be managed at scale. If you do have extra engineering time, then building your own observability platform is awesome (and something I wish I could do, too!). But it is important to be honest with yourself about your team’s capabilities when you make the “build vs. buy” decision.
Plan for Failure
My hardware design focuses on scale and high availability. Everything fails; it’s a matter of when it will fail, not if it will fail. It is important to plan for failure and to know how your systems will behave when it happens.
Some suggestions to help plan for failure:
Separate your agent-based and agentless logging into separate tiers per data center. In the Splunk world, this tier would be a Splunk Heavy Forwarder; in the Elastic world, it would be Logstash. I will call it logging middleware for the purpose of this article.
For each data center, have at least one cluster of servers that will consume your agentless logging coming over Syslog, HEC, and HTTPS. Be sure to use load balancers to spread the load and scale your hardware to handle the spike in traffic. I prefer to use Cribl LogStream to act as the middleware. It has a rich UI, and makes setup and scaling very easy. The cost of LogStream is more than covered by lower license costs and less engineering time due to automated workflows and pre-built sources and destinations.
Using the same approach with agent-based monitoring, you can set up a cluster of middleware servers using Cribl LogStream. You can also set up all of your logging agents, regardless of whether you’re using Splunk Universal Forwarder or Elastic Beats, to log to all the middleware servers, all the time. A full mesh architecture like this provides automated failover: if one middleware server goes down, the agents keep sending to the others. Even better, set up your logging agents to log to the middleware cluster in the other data center or cloud region too. Some environments will not allow unrestricted traffic across a WAN due to network constraints, but it is something to consider. Everything you can do to provide failover options means your systems are more likely to work when you need them to work the most.
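To make the failover idea concrete, here is a minimal Python sketch of how an agent that knows about every middleware server might choose where to send. It is an illustration only, not how Splunk Universal Forwarder, Elastic Beats, or LogStream implement failover internally; in practice you would configure this behavior in the agent rather than write it yourself. The names Endpoint, failover_order, and pick_target, the hostnames, and the is_healthy callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Endpoint:
    host: str        # hypothetical middleware hostname
    datacenter: str  # hypothetical data center label


def failover_order(endpoints: Iterable[Endpoint], local_dc: str) -> List[Endpoint]:
    """Order endpoints so the local data center is tried before remote ones.

    The relative order inside each group is preserved.
    """
    eps = list(endpoints)
    local = [e for e in eps if e.datacenter == local_dc]
    remote = [e for e in eps if e.datacenter != local_dc]
    return local + remote


def pick_target(
    endpoints: Iterable[Endpoint],
    local_dc: str,
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the first healthy endpoint, or None if every endpoint is down."""
    for endpoint in failover_order(endpoints, local_dc):
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    mesh = [
        Endpoint("mw1.dc1.example", "dc1"),
        Endpoint("mw2.dc1.example", "dc1"),
        Endpoint("mw1.dc2.example", "dc2"),
    ]
    # Assumption for the demo: both dc1 middleware servers are down.
    down = {"mw1.dc1.example", "mw2.dc1.example"}
    target = pick_target(mesh, "dc1", lambda e: e.host not in down)
    print(target.host if target else "no healthy middleware")
```

The point of the sketch is the ordering: the agent prefers its own data center, and only crosses the WAN when the local cluster is unavailable, which keeps cross-site traffic low in normal operation while still giving you a path out during an outage.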
Note About Capacity Management
In terms of capacity management, do not place your observability systems in the typical utilization model centered around average CPU and memory consumption. You need your observability platform to scale to handle the spike in traffic that only occurs when the environment is under pressure and failing, such as when your firewalls are running out of CPU and logging gigabytes of data per second. Size for that peak, not for the average: you need your observability platform to work when everything is failing.
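A rough worked example shows how far apart the two sizing models can land. The Python sketch below assumes a simple model in which each middleware node sustains a fixed ingest rate and you add a fixed headroom percentage; the function name nodes_needed and every number in it are illustrative assumptions, not measurements from any real environment.

```python
import math


def nodes_needed(rate_gb_per_s: float, per_node_gb_per_s: float, headroom: float) -> int:
    """Nodes required to ingest rate_gb_per_s with the given fractional headroom.

    Assumes each node sustains per_node_gb_per_s and load spreads evenly.
    """
    if per_node_gb_per_s <= 0:
        raise ValueError("per_node_gb_per_s must be positive")
    if rate_gb_per_s < 0 or headroom < 0:
        raise ValueError("rate and headroom must be non-negative")
    return math.ceil(rate_gb_per_s * (1 + headroom) / per_node_gb_per_s)


if __name__ == "__main__":
    # Illustrative assumptions only.
    average_rate = 0.2   # GB/s on a normal day
    incident_rate = 2.0  # GB/s while the environment is failing
    per_node = 0.25      # GB/s a single middleware node can sustain
    headroom = 0.3       # 30% safety margin

    print("sized for average:", nodes_needed(average_rate, per_node, headroom))
    print("sized for incident peak:", nodes_needed(incident_rate, per_node, headroom))
```

With these made-up inputs, sizing for the average calls for 2 nodes while sizing for the incident peak calls for 11. A cluster sized by the first number would fall over at exactly the moment you most need the data.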
Bottom Line and Final Tips
Observability architecture and design are about more than software and servers. They’re about working with people and building processes to create a lasting, repeatable pattern. Here’s what’s worked for me so far:
1. Join forces with your governance and security teams to make clear rules for issues like formats and retention.
2. Make sure senior leadership understands the need to advocate for these rules to apply to all verticals across the enterprise and not just the operations vertical.
3. Put in time with the solution and software architects to get involved early with new application platforms and logging.
4. Plan for failure by breaking up workflows into separate clusters, and consider allowing traffic from data center to data center.
These are a few foundational ideas to get you started down the road to a successful observability strategy.