Wringing Value out of First-Mile Observability for Cloud Apps

We are living through a data renaissance right now.
Every aspect of business, and by extension our lives, is awash in data. Every transaction and message, along with the underlying compute and storage workloads in the cloud, generates new sources of data.
All this data is essential for scaling application performance and maintaining the integrity and scalability of our critical business systems.
Observability data (logs, metrics and traces) are even more important for cloud native applications, where the inner workings are short-lived microservices and containers. However, as the world moves to more cloud native infrastructure, we are starting to hit observability exhaustion.
We use only a small portion of the observability data we ingest, yet we incur massive cloud egress and storage bills for moving and holding all of it. In the end, we are gaining fewer insights from our massive troves of collected data.
If observability data continues to grow, and it represents the most valuable information we have about our critical infrastructure, then we need to rethink how we understand, refine and use it.
First-Mile Observability
Observability has always focused on the “last mile”: the databases, data lakes, index stores and warehouses where data comes to rest. Value is readily apparent in these backends because users can query these stores to explore how their applications are behaving; they can visualize historical information with dashboards, charts and graphs, or even run analytics and machine learning to identify trends and make predictions.
In the past few years, we’ve begun to see that focus broaden to also include intermediate or “middle-mile” markers, where data streams through layers such as Apache Kafka, pub/sub, event buses and message queues.
These middle layers serve to give users more control over how asynchronous data is consumed within the enterprise as well as provide visibility into data in motion. However, this doesn’t help us solve the present issues with observability exhaustion.
Now we need to continue shifting left to focus on the first mile of observability, where data is collected from its point of origin and sent to all of the above streaming queues and backends.
First-mile observability can generate immense value for the business: unifying data from applications, infrastructure and network devices; reducing vendor dependencies and costs; and adding reliability, parsing and enrichment to expedite analysis.
More Agility, Less Lock-in
While this might be your first time hearing about first-mile observability, you most likely have systems in place to collect data — generally vendor-provided agents. These agents handle much of the automatic collection, parsing, enrichment and forwarding to provided backends. However, they come with a trade-off in that all your data is locked into specific pipelines and backends.
Vendor-supplied collectors are tuned to route data to proprietary queues and last-mile data repositories within the same vendor’s product suite. That’s fine if your entire application lives inside such a walled garden, but in today’s API-driven, multi-enterprise workflows, there is a very strong motivation for thinking outside a single pipe.
And while having an intermediate event messaging system could help here, many proprietary agents do not support sending data to that system. If your team goes for a DIY approach, implementing custom features or changes for those agents, you may not be able to receive future updates or support from the vendor.
In contributing to the open source Fluentd and Fluent Bit projects over the last few years, I’ve seen users leverage these projects to completely own the first mile of their observability. These projects are open source and vendor-neutral, meaning you can choose to send data to an observability solution like Splunk, Elastic, or Datadog, or send it to a middle-mile tier like Kafka.
Conserving Cloud Costs
This freedom and flexibility lead to the first reason many companies seek to get ahead of first-mile observability — to reduce the high costs of pouring raw or poorly filtered data into expensive backends.
For example, by owning your first-mile observability strategy, you can choose which data flows where with full knowledge of how much you may pay for that data. Additionally, you can filter or throw away spurious or bad data. Why pay for empty log messages when they provide no value to the observability practitioner?
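As a rough illustration of discarding empty messages before they reach a paid backend, here is a minimal Python sketch. The record shape (a dict with a `message` field) and the function name `drop_empty` are assumptions made for this example, not part of any particular agent's API; a real collector would typically do this through its own filter configuration.

```python
from typing import Iterable, Iterator


def drop_empty(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only records whose 'message' field carries text.

    The 'message' key is a hypothetical field name chosen for this sketch.
    """
    for record in records:
        message = record.get("message")
        # Skip records with a missing, non-string, or whitespace-only message.
        if isinstance(message, str) and message.strip():
            yield record


records = [
    {"host": "web-1", "message": "GET /health 200"},
    {"host": "web-1", "message": ""},
    {"host": "web-2", "message": "   "},
    {"host": "web-2"},
]
kept = list(drop_empty(records))
```

Only the first record survives here; the other three would have been billed for without telling anyone anything.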
Another advantage is that if you are streaming high-volume log data from hundreds of systems, you can choose the backend best suited to each stream instead of defaulting to more costly high-availability storage backends.
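One way to picture this kind of routing is a small lookup from stream type to destination. The sketch below is illustrative only: the backend names, the `stream` tag and the routing table are all hypothetical, and a production collector would express the same idea through its own tag-matching rules.

```python
from collections import defaultdict

# Hypothetical backend names and routing rules, for illustration only.
ROUTES = {
    "audit": "search_index",    # queried often, so the higher cost may be justified
    "debug": "object_storage",  # high volume, rarely read, cheap to hold
}
DEFAULT_BACKEND = "object_storage"


def route(records):
    """Group records by destination backend based on their 'stream' tag."""
    batches = defaultdict(list)
    for record in records:
        backend = ROUTES.get(record.get("stream"), DEFAULT_BACKEND)
        batches[backend].append(record)
    return dict(batches)


batches = route([
    {"stream": "audit", "message": "user login"},
    {"stream": "debug", "message": "cache miss"},
    {"stream": "metrics", "message": "cpu=0.4"},
])
```

Anything without an explicit rule falls through to the cheaper default, so the expensive backend only receives what was deliberately sent there.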
Enriching Data for Context with Performance
Analytics and machine learning applications generate some of the most processor- and storage-intensive workloads in the cloud today.
In such scenarios, we want to shift the enrichment of observability data as far left as possible. Correlating metadata right at the source provides real-time context and reduces query times for analytics. Early and consistent labeling improves the training speed of machine learning routines, which in turn reduces AI inference times.
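To make enrichment at the source concrete, the following Python sketch attaches host, timestamp and a few static labels to each record as it is collected. The field names (`host`, `collected_at`, `service`, `region`) and the label values are assumptions for illustration; real deployments would usually pull them from the environment or from orchestrator metadata.

```python
import socket
from datetime import datetime, timezone


def enrich(record: dict, static_labels: dict) -> dict:
    """Return a copy of the record with source metadata attached."""
    enriched = dict(record)
    enriched.setdefault("host", socket.gethostname())
    enriched.setdefault("collected_at", datetime.now(timezone.utc).isoformat())
    # Labels already present on the record win over the static defaults.
    for key, value in static_labels.items():
        enriched.setdefault(key, value)
    return enriched


labels = {"service": "checkout", "region": "us-east-1"}  # hypothetical values
event = enrich({"message": "payment accepted", "region": "eu-west-1"}, labels)
```

Because the labels travel with the record from the first hop, every downstream queue and backend sees the same context without having to join it in later.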
Unifying the format of first-mile observability, rather than waiting to normalize it in middle tiers or at its last-mile destination, gives us two main advantages. One, it makes the data useful in all segments of our observability journey, and two, it makes the final data structures useful for more application types and easier to leverage for almost any type of work.
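As a sketch of what unifying the format can look like, the Python below maps two differently shaped inputs, a JSON line and a plain-text line, onto one common structure. The target fields (`source`, `level`, `message`), the alternate key names and the plain-text layout are assumptions chosen for this example rather than any standard schema.

```python
import json
import re

# Hypothetical plain-text layout: "<LEVEL> <message>"
TEXT_LINE = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def normalize(raw: str, source: str) -> dict:
    """Map a raw log line to one common shape: source, level, message."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        # Accept a couple of common key spellings for the same concept.
        level = str(parsed.get("level") or parsed.get("severity") or "info")
        message = str(parsed.get("message") or parsed.get("msg") or "")
    else:
        match = TEXT_LINE.match(raw)
        level = match.group("level") if match else "info"
        message = match.group("message") if match else raw
    return {"source": source, "level": level.lower(), "message": message}


a = normalize('{"severity": "WARN", "msg": "disk at 91%"}', "app")
b = normalize("ERROR upstream timed out", "proxy")
```

Both results now share the same keys, so a middle-tier queue or a last-mile store can treat them identically.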
Where to Start?
One of the easiest paths toward understanding the untapped value and beating observability exhaustion is simply doing an inventory of all the agents you have installed on your servers.
How many of them are performing proprietary actions that lock you into a data backend? How many of them are collecting non-tunable data? And are there opportunities to replace some of them with vendor-neutral or open source options?
There’s great potential for value with first-mile observability, so don’t let it pass you by.