What Does Effective Cloud Monitoring Look Like?

Before the move to the cloud, and before containers and virtual machines took over everything, application performance monitoring (APM) was much simpler. You could track your application on a specific machine or processor and watch its behavior in different conditions. Releases were spaced far enough apart that you had time to instrument the code to provide the necessary debug information. Monitoring at five-minute intervals gave you what you needed most of the time, and the volume of data was at a human scale that you could realistically collect and analyze.
What we did then doesn’t give us the information we need now. Most or even all of your infrastructure is outside of your control. Cloud providers offer limited visibility into performance, with many metrics gathered at infrequent intervals, ranging from one minute to as much as 15 minutes. Code is distributed across hundreds or thousands of systems in a dynamic and elastic environment, making it difficult to monitor your processes and their dependencies. Workloads spin up and down frequently, and containers and microservices slice processes and methods into smaller and smaller pieces, resulting in very short-lived objects that frequently change location. DevOps processes, agile development, and web deployment speed up release cycles, increasing the possibility of errors and leaving little time to add or update instrumentation. And the sheer volume of monitoring data, which is now truly big data, exceeds human comprehension, requiring new analytics and machine learning.
A New Paradigm for Cloud Monitoring
Changes in the operating environment call for changes to application performance monitoring to better support the needs of cloud monitoring. The old hardware constructs of servers, processors, disks, memory, and other specific physical attributes are much less relevant. Instead, modern APM tools must focus on logical concepts such as processes and transactions, and capture highly dynamic relationships and dependencies. Watching a few servers or applications isn’t enough for DevOps to function effectively. Today’s APM tools need the scale to watch all business-critical applications and all transactions so that the team can quickly detect and resolve issues. Sampling at one-minute or five-minute intervals misses far too much. We need high-definition data capture and granularity. The resulting big data sets demand sophisticated analytics and artificial intelligence to support our own analysis and decision-making. Finally, the tools must be cloud-aware and available via SaaS so that they integrate easily with your applications and infrastructure from day one.
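To make the sampling point concrete, here is a minimal Python sketch. The response times are invented for illustration, and sample_every is a hypothetical helper rather than part of any APM product. It shows how a one-minute poller can miss a 20-second spike entirely, while the five-minute average only hints that something happened.

```python
from statistics import mean

def sample_every(values, step):
    """Keep one reading per `step` seconds, as a coarse poller would."""
    return values[::step]

# Hypothetical per-second response times (ms): a steady 100 ms with a
# 20-second spike to 900 ms starting at second 130.
per_second = [100] * 300
for t in range(130, 150):
    per_second[t] = 900

coarse = sample_every(per_second, 60)    # one reading per minute

worst_coarse = max(coarse)               # worst value the poller saw
worst_fine = max(per_second)             # worst value at 1 s granularity
five_min_avg = round(mean(per_second), 1)

print(worst_coarse, worst_fine, five_min_avg)
```

In this made-up series the poller reads seconds 0, 60, 120, 180, and 240, so every coarse reading lands outside the spike.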
Avoid Blind Spots
For me, the first priority of a modern cloud monitoring strategy is actually monitoring everything that is important, including the microservices, APIs, infrastructure, network paths, and end-user experience. Unfortunately, most organizations only monitor a fraction of their business-critical applications and only a subset of the components of these applications, resulting in far too many blind spots. The adoption of Docker containers, microservices, and other components makes tracking everything increasingly important, since objects move around frequently. Sampled transaction metrics are likely to miss many short-lived objects completely.
Since you cannot possibly instrument everything by hand, advanced cloud monitoring host agents need to run at the operating-system level, automatically discovering and instrumenting all processes, regardless of whether they are in a virtual machine, a container, or a microservice. Java and .NET agents spin up and down right along with the VMs (either built with your image or with the Docker host) so that you get statistics whenever and wherever your application is running. Network details, such as round-trip times, throughput, and retransmission rates, are needed to help you quickly distinguish between network and server issues. JavaScript snippets and page tags, automatically deployed to your web pages, give information on page load times and AJAX requests for detailed insights into end-user experience.
Big Data Approach to Cloud Monitoring
A big data approach to cloud monitoring must support both the collection and the analysis of a large volume and variety of data. Collecting all of this data means that the resulting data set is extremely large, so an intelligent cloud monitoring agent compresses, buffers, and streams the data back to a NoSQL backend. There is a lot of duplication in this data stream, which makes compression very effective at reducing the network load. On the backend, traditional databases are unsuited to this volume and speed of data collection, so a multithreaded and multi-queued approach is necessary to keep up. In the rare event that the data exceeds the capacity of either the sending or the receiving system, intelligent throttling selectively omits the least important data. Note that metadata should also be collected (for example, information on the location, the user, or the device), as it is extremely valuable when troubleshooting complex problems.
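The buffer, throttle, and compress steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular agent works: AgentBuffer, its capacity, and the numeric priorities are all hypothetical, and zlib stands in for whatever compression a real agent uses.

```python
import json
import zlib

class AgentBuffer:
    """Buffers records, drops the least important when full, compresses batches."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []  # list of (priority, record) pairs

    def add(self, record, priority):
        """Queue a record; a higher priority means more important."""
        self.records.append((priority, record))
        if len(self.records) > self.capacity:
            # Throttle: discard the lowest-priority record (oldest on ties).
            lowest = min(range(len(self.records)),
                         key=lambda i: self.records[i][0])
            del self.records[lowest]

    def flush(self):
        """Return the batch as compressed JSON bytes and clear the buffer."""
        payload = json.dumps([rec for _, rec in self.records]).encode("utf-8")
        self.records = []
        return zlib.compress(payload)

buf = AgentBuffer(capacity=3)
buf.add({"metric": "cpu", "value": 0.42}, priority=1)
buf.add({"metric": "error", "value": 1}, priority=9)
buf.add({"metric": "cpu", "value": 0.43}, priority=1)
buf.add({"metric": "latency_ms", "value": 870}, priority=5)  # triggers throttling

batch = buf.flush()
restored = json.loads(zlib.decompress(batch))
print([r["metric"] for r in restored])
```

A real agent would stream each batch to the backend instead of decompressing it locally; the round trip here only shows that the oldest low-priority reading was the one dropped.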
Concerned that the volume, speed, and variety of data will overwhelm any monitoring solution, slowing down applications with an unacceptable performance burden or even causing them to crash under the load? Efficient containerized agents and cloud-based APM solutions provide the necessary scalability to collect and process petabytes of data. Investigating a specific issue or trouble ticket with the actual transaction data and application stack details leads to faster diagnosis and resolution than working from an average of users’ response times based on a sampled subset. One reason is that when all the necessary data is already available, there is no need to form a hypothesis and then wait while additional metrics are collected.
Advanced Analytics and Machine Learning
The volume of data in this containerized and elastic environment is beyond human comprehension and exceeds the capacity of spreadsheets and traditional analysis tools. Pattern recognition, correlation analysis, and anomaly detection can help DevOps teams identify early warning signs, find related transactions, and often fix emerging problems before they are reported by users.
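As one simple illustration of anomaly detection, the sketch below flags readings that sit far outside a trailing window of recent values. The function name, window size, threshold, and latency figures are all assumptions chosen for the example; production tools use considerably more sophisticated models.

```python
from statistics import mean, stdev

def find_anomalies(values, window=10, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical latencies in ms with a single spike at index 12.
latencies = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98,
             101, 99, 450, 100, 102]
flagged = find_anomalies(latencies)
print(flagged)
```

Note that once the spike enters the trailing window it inflates the standard deviation, so a detector this simple becomes less sensitive for the next few readings.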
By applying analytics, advanced APM tools can re-assemble transaction fragments using cues such as user session identifiers and communication protocols and visualize relationships into logical groups, making it easier to trace performance dependencies. Detailed reports can then enable you to drill down into specific methods, networks, and machines, leading the team to quickly isolate and remediate issues.
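Re-assembling fragments by session identifier can be sketched as a group-and-sort step. The field names (session_id, ts, tier, ms) and the sample fragments are hypothetical; real tools also rely on protocol-level cues that this example leaves out.

```python
from collections import defaultdict

def reassemble(fragments):
    """Group fragments by session ID and order each group by timestamp."""
    by_session = defaultdict(list)
    for frag in fragments:
        by_session[frag["session_id"]].append(frag)
    return {
        sid: sorted(frags, key=lambda f: f["ts"])
        for sid, frags in by_session.items()
    }

# Hypothetical fragments arriving out of order from different tiers.
fragments = [
    {"session_id": "s1", "ts": 3, "tier": "database", "ms": 40},
    {"session_id": "s2", "ts": 1, "tier": "web", "ms": 12},
    {"session_id": "s1", "ts": 1, "tier": "web", "ms": 15},
    {"session_id": "s1", "ts": 2, "tier": "api", "ms": 25},
]

traces = reassemble(fragments)
path = [f["tier"] for f in traces["s1"]]
total_ms = sum(f["ms"] for f in traces["s1"])
print(path, total_ms)
```

With the fragments grouped this way, the per-tier timings for a single user transaction can be read in order, which is the starting point for tracing a performance dependency.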
Machine learning is also valuable when analyzing APM big data. It can look for signatures that are indicative of performance problems and surface potential causes. This type of insight accelerates problem resolution and is continually enriched as the learning sets grow.
Trust but Verify
Virtual private clouds (VPCs) and other services available from leading providers like Amazon Web Services (AWS) and Microsoft Azure become another resource to monitor. Full visibility into user experience, transaction details, and resource availability is necessary to ensure end-to-end transactions are performing as expected. Cloud monitoring augments vendor-provided metrics with end-user-experience details to quickly distinguish between issues on user devices, cloud servers, and network infrastructure, so that you can address the right problem.
As you shift to cloud services, your teams will also need to monitor service-level agreements (SLAs) and optimize your usage commitments. Cloud providers’ SLAs stop at their cloud’s edge, and their performance and availability metrics don’t represent the actual experience of your users. That’s why it’s important to measure the performance of cloud-delivered apps as they render on the screen from each user’s perspective, in the context of the overall business workflow, severity, relative importance, geography, and many other classifications. It’s also helpful to compare performance between apps hosted on different cloud providers and your own data center, as well as before and after upgrades, to get the best possible results.
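Measuring from the user’s side and slicing by a classification such as geography might look like the following sketch. The region labels, render times, and the 2,000 ms target are invented for illustration, and sla_compliance is a hypothetical helper, not a provider API.

```python
from collections import defaultdict

def sla_compliance(measurements, target_ms):
    """Return, per region, the fraction of user-side render times
    that met `target_ms`."""
    met = defaultdict(int)
    total = defaultdict(int)
    for region, render_ms in measurements:
        total[region] += 1
        if render_ms <= target_ms:
            met[region] += 1
    return {region: met[region] / total[region] for region in total}

# Hypothetical (region, render time in ms) pairs collected in browsers.
measurements = [
    ("eu", 800), ("eu", 2400), ("us", 900),
    ("us", 1100), ("us", 700), ("eu", 1200),
]

compliance = sla_compliance(measurements, target_ms=2000)
print(compliance)
```

In this made-up data one of the three "eu" page loads misses the target while every "us" load meets it, the kind of regional gap that a provider-side availability figure would not reveal.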
The Essential Elements for Cloud Monitoring Success
Without a doubt, performance monitoring is much more challenging than it was in the days of three-tier architecture. Older APM tools and techniques, which evolved from physical concepts of servers and processors, struggle to trace the modern logical and virtual constructs between your application infrastructure and tens or hundreds of thousands of web and mobile clients. Effective cloud monitoring demands a new strategy, one that can watch all running instances, including virtual servers, containers, and microservices, and can see into the cloud instance itself. The scale of big data calls for a new category of APM tools that support cloud monitoring with the elasticity and capacity to collect data on all users, transactions, and methods. Look for these essential elements to bring your application performance monitoring into the cloud era and to help you deliver the performance and reliability that your customers demand and that your organization needs to be successful.