How to Overcome the Top 3 Modern Monitoring Challenges

28 Jul 2020 8:54am, by Augustine Mathew

New Relic sponsored this post.

Augustine is a Principal Product Manager at New Relic with over two decades of experience in the software industry leading data products and platforms. He is passionate about enabling organizations to make data-driven decisions and to deliver software faster and cheaper.

Your business relies on delivering best-in-class customer experiences, and that requires delivering software faster and more cheaply, at scale, with continuous improvement built in. To get there, your teams must move beyond just “monitoring” performance and instead shift to understanding why your systems are performing as they are. To achieve this level of understanding, you need the ability to collect and analyze your operational data across your entire technology stack. In other words, you need full-stack observability. Only then can you start to optimize your systems and reduce Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR).
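As a rough illustration of what these two metrics measure, here is a minimal Python sketch that computes them from a list of incident records. The Incident fields and the sample timestamps are hypothetical, and the sketch assumes both metrics are measured from the moment the fault began; some teams measure time to resolution from the moment of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout, for illustration only.
    started: datetime   # when the fault began
    detected: datetime  # when monitoring (or a person) noticed it
    resolved: datetime  # when service was restored


def mean_duration(durations):
    """Average an iterable of timedeltas."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents):
    """Mean time from the start of a fault to its detection."""
    return mean_duration(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time from the start of a fault to its resolution (assumed definition)."""
    return mean_duration(i.resolved - i.started for i in incidents)


# Made-up sample data: two incidents on the same day.
incidents = [
    Incident(datetime(2020, 7, 1, 10, 0), datetime(2020, 7, 1, 10, 10), datetime(2020, 7, 1, 11, 0)),
    Incident(datetime(2020, 7, 1, 12, 0), datetime(2020, 7, 1, 12, 20), datetime(2020, 7, 1, 12, 40)),
]
print(mttd(incidents))  # 0:15:00
print(mttr(incidents))  # 0:50:00
```

Tracking both numbers over time shows whether slow detection or slow resolution contributes more to customer-facing downtime.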

In this article, we will discuss how to overcome the top three monitoring challenges, so that your business can achieve the observability necessary to stay ahead of your competition.

1. Fragmentation of Data

The monitoring tool landscape is vast. There are many great point solutions — including open source tools for myriad use cases — but operating multiple, fragmented point solutions with disparate data stores leads to data silos, which makes it extremely difficult (sometimes even impossible) to correlate data. This, in turn, increases the time to detect and resolve problems.

Let’s take Prometheus as an example. If you’re one of the many companies operating Kubernetes, using Prometheus is a no-brainer. As the de facto metric tool for Kubernetes, it enjoys a vibrant open source community. But you also need to monitor other systems, your applications, and the performance of your customers’ experience, and each scenario produces different types of telemetry data that require different monitoring tools. Commercial monitoring solutions specialize in only a subset of environments (cloud, on-premises, hybrid) and, often, in only a subset of telemetry data (metrics, events, logs, or traces). Similarly, popular open source solution stacks like Elastic Stack, TICK (Telegraf, InfluxDB, Chronograf and Kapacitor), and TIG (Telegraf, InfluxDB, and Grafana) tend to be best for either logs or metrics.

To understand how your entire system is performing, you’re probably using a combination of multiple tools, each of which requires learning a new UI, a new query language and a new operating model. That creates cross-team dependencies that delay the resolution of customer-impacting issues. Unfortunately, these delays lower customer satisfaction, revenue, and even employee morale.

See All of Your Data in One Place

Best-in-class observability solutions enable you to collect and analyze all of your telemetry data in one place, removing data silos to reduce MTTD and MTTR. With all of your data in one place, you’re able to derive context-aware insights and correlate your data to understand why your systems are performing the way they are. Correlating transactional (sales, user conversions, etc.) and operational data shortens MTTD and enables end-to-end digital experience and business performance monitoring.

Returning to our Prometheus example, companies can send Prometheus metric data to a central data platform that consolidates it with other telemetry data from their entire technology stack. Some solutions available today even let users create their own custom insights, and support for common query languages allows users to operate within a paradigm that they’re already familiar with. The good news is that, although still rare, these types of solutions are beginning to emerge.

2. More Tools Mean More Toil

Open source software monitoring tools need to be self-hosted, either on-premises or in a public cloud environment. This requires organizations to procure and install hardware (if on-premises) and to provision, configure and upgrade their compute resources (CPU, memory, storage), networking and software. This could mean dealing with multiple suppliers and contracts, and possibly capital expenditures (CAPEX). Managing such environments at scale (horizontal scaling, forecasting demand, managing spend, etc.) is challenging. Companies have to hire and train experts for the upkeep of these solutions. On top of that, companies have to secure these systems and meet compliance and service-level objectives (SLOs) for the company and its teams.

For example, operating Prometheus at scale requires complex setup, including federation and replicas for sharding and high availability. In addition, in order to overcome Prometheus data retention limitations, users have to set up and operate additional tools such as Cortex or Thanos. Given that the charting capabilities of Prometheus are rudimentary, most users choose to deploy Grafana to build dashboards. The consequence of these bolt-on solutions is increased complexity, additional resource requirements, increased training requirements and reduced security compliance, all of which increase costs. These additional tools also expand the attack surface and, as with any open source software, carry added security risks and vulnerabilities stemming from custom software development, gaps in test/QA coverage, and a lack of product management.

Research by Gartner states:

“…nearly every open-source adopter expects cost savings when compared to either homegrown or licensed proprietary third-party solutions. However, Gartner’s research has consistently shown that open-source efforts do not always result in cost saving. This outcome hinges on many factors but historically, more often than not, open-source investments (over 50%) have not yielded considerable TCO advantages over other alternatives.”

A scalable and reliable self-hosted deployment of monitoring tools often requires up to 6 months and many experts with different skill sets to operationalize (build, deploy, configure, maintain). Often, the deployment lead time, maintenance overhead and opportunity costs result in a TCO that exceeds other approaches. If teams spend less time and resources on managing and maintaining their monitoring tools, they are able to spend more time innovating and creating better products for their customers.

DIY Is Not Worth Your Time

Software-as-a-Service (SaaS) monitoring and observability solutions scale on demand, eliminating the burden of predicting that demand, provisioning resources and operating deployment tools. By removing the burdens of acquiring, setting up and operating additional systems, and of training employees to run them, you can rapidly accelerate your ability to deliver software. Further, by reducing CAPEX spent on unnecessary infrastructure, and by not having to source and hire the specialists to operate such systems, you can instead invest that money back into your business.

Commercial observability solutions are built to scale across data ingestion, data management, storage options and performance; the best of them return responses in milliseconds while searching through terabytes of data. The real performance at scale comes from innovative approaches such as distributed queries and parallel execution. For example, best-in-class solutions provide fast and flexible data pipelines, allowing highly available data ingestion, streaming, and transformation at scale. Distributed caching and storage enable efficient querying by breaking down jobs into smaller chunks and executing them in parallel and close to the target. In addition, fully managed solutions offer out-of-the-box compliance and service-level agreements (SLAs) that help companies meet their SLOs.
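The idea of breaking a query into chunks and executing them in parallel can be sketched in a few lines of Python. This is only a toy illustration, not how any particular vendor implements it: the in-memory store, the function names and the chunk size are all hypothetical, and local threads stand in for what would be separate storage nodes in a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(start, end, chunk_size):
    """Split the half-open time range [start, end) into consecutive sub-ranges."""
    return [(lo, min(lo + chunk_size, end)) for lo in range(start, end, chunk_size)]


def query_chunk(store, lo, hi):
    """Partial aggregate for one chunk: (sum, count) of values with lo <= ts < hi."""
    values = [value for ts, value in store if lo <= ts < hi]
    return sum(values), len(values)


def parallel_average(store, start, end, chunk_size, workers=4):
    """Scatter the query across chunks, then gather and merge the partial results."""
    chunks = split_into_chunks(start, end, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: query_chunk(store, c[0], c[1]), chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


# Hypothetical store: (timestamp, value) pairs, one per second.
store = [(ts, float(ts)) for ts in range(100)]
print(parallel_average(store, 0, 100, chunk_size=25))  # 49.5
```

Each worker returns a small partial result (a sum and a count) rather than raw data, which is what makes it cheap to merge results from many chunks.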

3. High Costs Cause Blind Spots

Managing multiple point solution monitoring tools, including self-hosted open source tools, leads to higher costs. The unfortunate impact is that you have to decide how much of your technology stack, and what types of data and sampling rates, you can afford to instrument. This often leads to monitoring only a subset of your data or, perhaps, monitoring only production environments at the expense of your pre-production environments. If all of your development takes place in pre-production environments, yet you lack the ability to observe the performance of those environments, an unnecessary number of bugs will ultimately propagate into production, increasing customer issues and downtime. Compounding the issue, less frequent data sampling leaves users without the data necessary to effectively troubleshoot issues or audit transactions, resulting in revenue loss or penalties.
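To see how a lower sampling rate creates a blind spot, consider this small Python sketch. The latency series is made up: a steady 50 ms with a single 20-second spike to 900 ms. The sample function is a hypothetical stand-in for a collector that records one point per interval.

```python
def sample(series, interval):
    """Keep every `interval`-th point, as a collector with that period would."""
    return series[::interval]


# Made-up data: one latency reading per second for ten minutes.
latency_ms = [50] * 600
for second in range(130, 150):  # a 20-second spike
    latency_ms[second] = 900

print(max(sample(latency_ms, 10)))  # 900: a 10-second interval catches the spike
print(max(sample(latency_ms, 60)))  # 50: a 60-second interval misses it entirely
```

With the coarser interval, the spike never reaches the data store, so there is nothing to alert on and nothing to troubleshoot afterwards.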

Instrument Everything

Commercial SaaS monitoring solution providers, with their purpose-built platforms and availability of pre-built applications, data collectors, exporters, scrapers, etc., can often enable companies to achieve value much faster (one-sixth of the deployment lead time) and with fewer employees to manage their solutions (one-fourth the resources). To instrument everything, companies also need solutions that offer affordable pricing, even at high data rates (data points per minute) and high cardinality (the number of unique time series). Pricing models that offer pennies per GB for ingested telemetry data allow companies to affordably capture full-stack data across development, staging and production environments. These models further enable multiple software deployment and testing strategies, such as canary, blue/green and red/black deployments, synthetic testing, etc. With affordable pricing and flexible data retention options, companies can perform period-over-period and seasonal comparisons in order to plan and manage their software better, and to meet data governance and compliance requirements.
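A back-of-the-envelope calculation shows how data rate and cardinality drive ingest volume, and therefore cost, under per-GB pricing. Every number in this Python sketch is a hypothetical assumption chosen for illustration (100,000 time series, one point per minute, 100 bytes per point, $0.25 per GB); none of it reflects any vendor's actual pricing.

```python
def monthly_ingest_gb(time_series, points_per_minute, bytes_per_point, days=30):
    """Estimate monthly ingest volume in GB (1 GB = 1e9 bytes)."""
    points = time_series * points_per_minute * 60 * 24 * days
    return points * bytes_per_point / 1e9


def monthly_cost(gb, price_per_gb):
    """Cost under a simple per-GB ingest pricing model."""
    return gb * price_per_gb


# Hypothetical workload and price.
gb = monthly_ingest_gb(time_series=100_000, points_per_minute=1, bytes_per_point=100)
print(gb)                      # 432.0
print(monthly_cost(gb, 0.25))  # 108.0
```

Because volume grows linearly with both cardinality and data rate, doubling either one doubles the bill, which is why the per-GB price matters so much when the goal is to instrument every environment.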

Companies should look for observability solutions that can collect data using popular open source agents, and support native instrumentation so that they can achieve greater coverage for observed data. For example, New Relic has a wide array of plug-ins for popular agents and Software Development Kits for OpenTelemetry standards and protocols that allow its customers to ingest data from virtually any source.

Choosing the Right Solution for Your Business

Full-stack observability and monitoring are necessary for companies to succeed in today’s hyper-competitive markets. Organizations must analyze the total costs and benefits of building and managing their own monitoring practices, making sure they encompass all parts of their stack. Teams should feel empowered to instrument everything without exceeding their budget in order to avoid performance blind spots. The right observability solution should allow teams to see all of their data in one place so they can create a full picture of their ecosystem. They should choose a solution that can ingest data from any source (open source, on-premises, or in the cloud), can easily scale to the demands of the business, and provides value for the bottom line. This often leads to teams choosing robust SaaS observability platforms over simple monitoring tools. These SaaS platforms better equip their users to gain full visibility into their software architecture, removing hours of toil and ultimately maximizing customer satisfaction and revenue.

Popular platforms like New Relic continue to innovate their enterprise-grade observability solutions, providing full-stack visibility with one place to analyze your entire stack. With a focus on interoperability and support for open standards and OSS tools, New Relic’s platform serves a broad set of needs for all your operational data. The platform also provides advanced artificial intelligence capabilities to automate incident response, management, and remediation to mitigate a major source of lost IT productivity among companies.

Observability and monitoring stack by New Relic

Building effective monitoring practices requires thoughtful decisions about which capabilities are necessary to deliver the best value for the business. In essence, teams need to choose the solution that meets their unique needs.

Feature image via Pixabay.