5 Steps to Ensure Your Microservices Are Running Optimally
These days, it seems like everyone is into microservices, and monolithic architectures are slowly fading into obscurity.
Trends come and go, of course, and the attention they get is often exaggerated and doesn’t reflect what’s really going on. With microservices, though, there seems to be more consensus that the trend is here to stay. It makes sense. Conceptually, microservices extend the same principles that engineers have employed for decades.
Once you do commit to a microservices architecture, here are five challenges to keep in mind, each with a way forward, so you can run it successfully.
The Flip Side of Microservices
Separation of Concerns (SoC), a design principle stating that software should be built with distinct sections determined by “concern” or overall function, has been employed for more than 30 years to dictate how technology should be built. In monolithic applications, it is reflected in the separation of presentation, business and data layers in typical 3-tier architecture.
Microservices take this concept and flip it on its head. They take the same application and separate it in such a way that the application’s singular codebase can be broken up and deployed separately. The benefits are huge but they come at a price, usually reflected in higher operations costs in terms of both time and money. Aside from the enormous upfront investment that comes with transitioning an existing application to containers, maintaining that application creates new challenges.
Challenge #1: As if Monitoring a Monolith Wasn’t Hard Enough
While monolithic applications have their own challenges, the process for rolling back a “bad” release in a monolith is fairly straightforward. In a containerized application, things are much more complicated. Whether you’re gradually breaking down a monolithic app to microservices or building a new system from scratch, you now have more services to monitor. Each of these will likely:
● Use different technologies and/or languages
● Live on a different machine and/or container
● Be containerized and orchestrated using Kubernetes (K8s) or a similar technology
With this, the system becomes highly fragmented and a stronger need arises for centralized monitoring. Sadly, this also means that there’s a lot more to monitor. Where there was once a single monolithic process, there could now be dozens of containerized processes running across different regions, and sometimes even different clouds. This means there is no longer a single set of Ops metrics to rule them all with which IT/Ops teams can assess the general uptime of an application. Instead, teams must now deal with a deluge of hundreds (and even thousands) of metrics, events and alert types from which they need to separate signal from noise.
The way forward: DevOps monitoring needs to move from a flat data model to a hierarchical model where a set of high-level system and business KPIs can be observed at all times. At the slightest deviation, teams must be able to drill into the metric hierarchy to see which microservices the disturbance in the force is originating from, and from there into the actual containers that are failing. This most likely requires a retooling of the DevOps toolchain from both a data storage and a visualization standpoint. Open source time series DBs and tools such as Prometheus and Grafana 7.0 make this a very achievable goal.
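To make the hierarchical idea concrete, here is a minimal sketch in plain Python. It is not how Prometheus or Grafana model data; the MetricNode class, the drill_down function, the service and pod names, and the 5% threshold are all assumptions made up for illustration. A parent's KPI is simply taken as the worst error rate beneath it, which is one of many possible roll-up rules.

```python
from dataclasses import dataclass, field


@dataclass
class MetricNode:
    """One level in the hierarchy: system, service, or container (hypothetical)."""
    name: str
    error_rate: float = 0.0  # only meaningful on leaves; parents are computed
    children: list["MetricNode"] = field(default_factory=list)

    def value(self) -> float:
        # Assumed roll-up rule: a parent reports the worst error rate below it.
        if not self.children:
            return self.error_rate
        return max(child.value() for child in self.children)


def drill_down(node: MetricNode, threshold: float) -> list[str]:
    """Follow breaching branches from the top-level KPI down to failing leaves."""
    if node.value() <= threshold:
        return []
    if not node.children:
        return [node.name]
    paths = []
    for child in node.children:
        for path in drill_down(child, threshold):
            paths.append(f"{node.name}/{path}")
    return paths


# Illustrative hierarchy: system -> microservices -> containers.
system = MetricNode("shop", children=[
    MetricNode("checkout", children=[
        MetricNode("checkout-pod-1", error_rate=0.01),
        MetricNode("checkout-pod-2", error_rate=0.27),
    ]),
    MetricNode("catalog", children=[
        MetricNode("catalog-pod-1", error_rate=0.00),
    ]),
])

failing = drill_down(system, threshold=0.05)
print(failing)
```

The point of the sketch is the shape of the workflow: watch one number at the top, and only walk down the tree when it deviates.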
Challenge #2: Logging Across Services
When talking about monitoring an application, one of the first things to come up is logs, logs, logs. The IT equivalent of carbon emissions, GBs of unstructured text are generated by servers on a daily basis, culminating in overflowing hard drives and crazy ingestion, storage and tooling costs. Even with a monolithic architecture, your logs are probably already causing your engineers some headaches.
With microservices, logs become even more scattered. A simple user transaction can now pass through many services, all of which have their own logging framework. To troubleshoot an issue, you’ll have to pull out all the different logs from all the services that the transaction could have passed through to understand what went wrong.
The way forward: The key challenge here is understanding how a single transaction “flows” between the different services. Achieving this requires a massive reworking of how a traditional monolith would normally log all of the events during the execution of a sequential transaction. While many frameworks have come out to help developers deal with this (we especially like Jaeger’s approach), moving to asynchronous, trace-driven logging is still a herculean effort for enterprises looking to refactor monoliths into microservices.
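One common way to follow a transaction across services is to stamp every log line with a shared trace ID that travels with the request. The sketch below uses only the Python standard library and is not Jaeger's API. The X-Trace-Id header name, the helper functions and the service names are assumptions chosen for illustration, and the two "services" run in one process purely to show the idea.

```python
import contextvars
import logging
import uuid

# Holds the ID of the transaction currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def build_logger(service: str) -> logging.Logger:
    """Create (once) a logger whose output includes the trace ID."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(name)s trace=%(trace_id)s %(message)s"))
        handler.addFilter(TraceIdFilter())
        logger.addHandler(handler)
    return logger


def outgoing_headers() -> dict[str, str]:
    """Headers a service would attach when calling the next service."""
    return {"X-Trace-Id": trace_id_var.get()}  # hypothetical header name


def handle_request(service: str, headers: dict[str, str]) -> str:
    """Reuse the caller's trace ID, or start a new one at the edge."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    build_logger(service).info("handling request")
    return trace_id


# Simulate one transaction passing through two services.
first = handle_request("orders", {})
second = handle_request("payments", outgoing_headers())
```

With the same ID on both log lines, a single search in a centralized log store can reassemble the transaction without guessing which services it touched.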
Challenge #3: Deploying One Service Can Break Another
A key assumption in the monolithic world is that all code is deployed at the same time, which means the timeframe in which an application is at its most vulnerable is a known and relatively short period of time (i.e. the first 24-48 hours post-deployment). In the world of microservices, this assumption no longer holds true: as microservices are inherently intertwined, a breaking change in one can cause behavior or performance issues that will only manifest in another. The challenge is that the other dev team, whose microservice is now failing, wasn’t expecting a break in their code. This can lead both to unexpected instability of the application as a whole and to some organizational friction. While microservice architectures may have made the process of deploying code easier, they have actually made the process of verifying code behavior post-deployment harder.
The way forward: the organization must create shared release calendars and allocate the resources for closely testing and monitoring the behavior of the application as a whole whenever related microservices are deployed. Deploying new versions of microservices without cross-team coordination is as successful a recipe for trouble as avocado with toast.
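Knowing which services count as "related" is the first step in that coordination. As a rough sketch, assuming you can describe which service calls which (the CALLS map and service names below are invented for illustration), a short graph walk can list every upstream service worth watching after a given deployment.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "web": ["orders", "catalog"],
    "orders": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def affected_by(deployed: str, calls: dict[str, list[str]]) -> list[str]:
    """Return every service that directly or indirectly calls `deployed`."""
    # Invert the graph so we can walk from a service to its callers.
    callers: dict[str, list[str]] = {name: [] for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    # Breadth-first walk over callers, visiting each service once.
    seen: set[str] = set()
    queue = deque([deployed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


watch_list = affected_by("inventory", CALLS)
print(watch_list)  # ['catalog', 'orders', 'web']
```

A list like this could feed the shared release calendar: the teams that own those services are the ones who should know a deployment is coming.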
Challenge #4: Finding Root Cause of Issues
At this point, you’ve nailed down the problematic services and pulled out all the data there is to pull, including stack traces and some variable values from the logs. You probably also have some kind of APM solution like New Relic, AppDynamics or Dynatrace. From there, you’ll get some additional data about unusually high processing times for some of the related methods. But… what about… the root cause of the issue?
The first few bits of variable data you (hopefully) get from the log most likely won’t be the ones that move the needle. They’re usually more like breadcrumbs leading in the direction of the next clue and not much further. At this point, we need to do what we can to uncover more of the “magic” under the hood of our application. This would traditionally entail emitting detailed information about the state of each failed transaction (i.e. exactly why it failed). The challenge here is that this requires a tremendous amount of foresight by developers to know what information they would need in order to troubleshoot an issue ahead of time.
The way forward: When the root cause of an error spans multiple microservices, it’s critical to have a centralized root cause detection methodology in place. Teams must consider which pieces of information would be needed to diagnose future issues, and at what level of logging they should be emitted to account for both performance and security considerations. This is a tall mountain to climb, and one that never ends.
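As a small illustration of weighing diagnostic value against security, the sketch below builds one structured record per failed operation and masks fields that should never reach a log. The field names, the SENSITIVE_KEYS set and the failure_record helper are assumptions for the example, not a standard; a real system would decide these per service.

```python
import json

# Illustrative list of fields that must not appear in logs.
SENSITIVE_KEYS = {"password", "card_number", "token"}


def redact(state: dict) -> dict:
    """Mask sensitive values while keeping the keys visible for debugging."""
    return {key: "***" if key in SENSITIVE_KEYS else value
            for key, value in state.items()}


def failure_record(service: str, operation: str,
                   error: Exception, state: dict) -> str:
    """Serialize the state of a failed operation as one JSON log line."""
    return json.dumps({
        "service": service,
        "operation": operation,
        "error_type": type(error).__name__,
        "error": str(error),
        "state": redact(state),
    }, sort_keys=True, default=str)


# Simulated failure inside a hypothetical payments service.
try:
    raise ValueError("amount must be positive")
except ValueError as exc:
    record = failure_record(
        "payments", "charge", exc,
        {"amount": -5, "card_number": "4111111111111111"},
    )

print(record)
```

Emitting the same record shape from every service is what makes centralized root cause analysis practical, since the tooling can then query one schema instead of many.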
Challenge #5: Version Management
Another issue, one that we brought up before but we think is worth highlighting, is the transition from a layered model in the typical monolithic architecture to a graph model with microservices. As more than 80% of an application’s code is usually third-party, managing how third-party code is shared across a company’s different microservices becomes a critical element in avoiding the ever-dreaded “dependency hell.”
Consider a situation where some teams are using version X.Y of a third-party component or shared utility (which virtually all companies have) and others are using version X.Z. This increases the risk of critical software issues arising from incompatibility between versions, or from slight changes in behavior between versions, which can give rise to the most idiosyncratic and painful software bugs to troubleshoot.
And all this before we remind ourselves of the security issues stemming from any one of the microservices using an older, more vulnerable version of third-party code — a hacker’s dream. Allowing different teams to manage their dependencies in siloed repos might have been feasible (if not recommended) in a monolithic world. In a microservice architecture, it is an absolute no-no.
The way forward: companies must invest in redesigning their build processes to leverage centralized artifact repositories (Artifactory would be one) for both third-party and shared utility code. Teams should only be allowed to store their own code in their individual repos.
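Even before a centralized repository is in place, it can help to see how far versions have already drifted. The sketch below assumes each service's pinned dependencies have been collected into one dictionary; the PINNED data, package names and find_drift function are hypothetical and not tied to Artifactory or any particular build tool.

```python
from collections import defaultdict

# Hypothetical pinned dependencies, gathered from each service's manifest.
PINNED = {
    "orders": {"http-client": "2.4.0", "json-utils": "1.1.0"},
    "payments": {"http-client": "2.7.1", "json-utils": "1.1.0"},
    "catalog": {"http-client": "2.4.0"},
}


def find_drift(pinned: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return packages pinned at more than one version, with who uses which."""
    versions: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for service, deps in pinned.items():
        for package, version in deps.items():
            versions[package][version].append(service)
    # Keep only packages where services disagree on the version.
    return {
        package: {version: sorted(services)
                  for version, services in by_version.items()}
        for package, by_version in versions.items()
        if len(by_version) > 1
    }


drift = find_drift(PINNED)
print(drift)
```

Run as part of a build, a check like this could fail the pipeline whenever a shared package appears at more than one version.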
As with most advancements in the tech industry, microservices take a familiar concept and flip it on its head. They rethink the way that large-scale applications should be designed, built and maintained. With them come many benefits, but also new challenges. When we look at these five main challenges together, we can see that they all stem from the same idea. The bottom line is that adopting a new technology like microservices demands both a rethinking and retooling of how code is built, deployed and observed. The prizes are big — but so are the risks.