Monitoring, Microservices, Self-Healing and the Connection to “AntiFragile” Systems

7 Sep 2014 9:32am, by

Nassim Nicholas Taleb’s bestseller Antifragile has been a source of inspiration for practitioners in various fields, software included. It opens a new paradigm for software systems and invites new possibilities. For some, it’s a new buzzword in town; for others, it’s a process of re-examining some of the fundamental assumptions we make while developing software. Anti-fragility is about how systems respond to chaos, which makes it especially relevant in today’s world. It goes beyond making something hardened and strong: an anti-fragile thing gets better under stress. Building things that thrive on pressure, failure and chaos, however, is not a radical new idea. The human body is a good example of an anti-fragile system at work. The body evolves through failures and, over time, adapts to operate under stress and chaos. Let us look at some examples of resilience, and how software systems have been built so far to deliver it:

  1. Adding redundancy across all layers of the software system is commonplace among practitioners. The redundant instances are either always active or stay dormant until the original master fails. A lot has been said, documented and built around this kind of redundancy, and it is a key item in a practitioner’s handbook.
  2. Generating failure situations to measure and identify problem areas in a system is another common approach. GameDay exercises, which originated at Amazon, purposefully inject failures into critical systems to help identify problems. This helps in fixing flaws and, over time, improves the overall stability of the system. Netflix’s Simian Army is an interesting project designed to generate failures and help isolate a system’s weaknesses.
  3. Graceful degradation keeps the impact as small as possible when a failure happens. A common technique is to run a web application in read-only mode, so that users retain some level of interaction and a complete outage is avoided. This lets applications confine the failure to the features that get degraded instead of the entire system.
  4. Retry logic and circuit breakers are patterns used to build resilient systems. They matter most when a system is composed of distributed services and its features depend on the calls between them.
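To make the last pattern concrete, here is a minimal Python sketch of retry logic wrapped around a circuit breaker. It is only an illustration under stated assumptions: the names `CircuitBreaker`, `CircuitOpenError` and `call_with_retry`, as well as the thresholds and delays, are hypothetical and not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker; thresholds are illustrative."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func through the breaker with exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker gave up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The retry handles short, transient faults, while the breaker stops callers from piling more load onto a dependency that is clearly down. A caller would wrap each remote call, for example `call_with_retry(breaker, fetch_profile)`, where `fetch_profile` stands in for any function that talks to another service.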

Overall, the key to building resilience into a system is admitting that failure is commonplace. However, the approach that resilience engineering offers only creates robust systems that can survive in the face of disaster and limit the area of damage. Anti-fragile concepts take this idea of robust systems to a whole new level. Anti-fragility requires the system not just to survive failures, but to use them to its advantage by becoming more powerful and stable. The anti-fragile system feeds on failure events and adapts to them over time. This creates a completely new class of software systems that, among other things, must accomplish the following:

  1. Monitoring all aspects of the system, including the infrastructure, and using that data to learn over time and become self-operating. This means minimal human intervention is required for the system to make decisions, especially with respect to survival.
  2. Delegating system operations across multiple independent services that can dynamically scale up or down depending on the scenario. This means the system should be able to control its own growth as it is used over time. That requires built-in support for understanding which services need more attention and how to provision more resources for them.
  3. Treating failure events as expected inputs to the system. The system is designed to take corrective action in failure scenarios on its own. This means the system must be self-aware and able to detect or predict abnormalities. Predicting potential failures is vital for this system to thrive, as it helps the system prepare for the inevitable.
  4. Deriving the system’s complexity from simple fundamental concepts. This means that at each layer of the system, simple concepts and rules are defined. This allows for broader abstractions with which complicated systems can be built in a simple way. We can treat each of these simple concepts as a service accomplishing only a very specific task.
  5. Staying agnostic to where the system runs and flourishes. This means it is portable enough to adapt quickly to a new environment and take control of it to sustain itself. This is important for the system to embrace failures, as it needs to adapt quickly to take advantage of the supporting environment.
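As a rough illustration of the first three points, the Python sketch below shows a monitor that learns a baseline from recent metric samples, flags abnormal values, and feeds a scaling decision without human intervention. Everything here is an assumption made for the example: the names `SelfHealingMonitor` and `desired_replicas`, the use of latency as the metric, and the simple mean and standard deviation rule are hypothetical stand-ins for whatever a real system would use.

```python
from collections import deque
from statistics import mean, pstdev


class SelfHealingMonitor:
    """Hypothetical monitor: learns a baseline and flags abnormal samples."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # rolling window of normal samples
        self.threshold = threshold           # allowed deviation, in std devs
        self.min_samples = min_samples       # samples needed before judging

    def observe(self, value):
        """Record a metric sample; return True if it looks abnormal."""
        abnormal = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = pstdev(self.samples)
            # Guard against a zero spread when all samples are identical.
            abnormal = abs(value - baseline) > self.threshold * max(spread, 1e-9)
        if not abnormal:
            # Only normal samples update the baseline, so an outage is not
            # learned as the new normal.
            self.samples.append(value)
        return abnormal


def desired_replicas(current, latency_ms, monitor, max_replicas=10):
    """Scale out by one replica when latency is abnormally high."""
    if monitor.observe(latency_ms) and latency_ms > mean(monitor.samples):
        return min(current + 1, max_replicas)
    return current
```

In this sketch a control loop would call `desired_replicas` on every new latency reading and hand the result to the platform's own scaling mechanism. A truly anti-fragile system would go further and adjust its thresholds and responses based on what past failures taught it; the sketch only covers the detect-and-react step.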

The future lies in allowing applications to decide for themselves how to deal with failure, chaos and uncertainty. This requires us to build intelligence into the system, intelligence that is ever growing and continually learning about its surroundings. That will make systems self-aware and truly anti-fragile.

Vivek Juneja is an engineer, based out of Seoul, and focused on cloud services and mobility. He currently works as a Solutions Architect at Symphony Teleca, and is a co-founder of the Amazon Web Services User Group in Bangalore. He started working with cloud platforms in 2008, and was an early adopter of AWS and Eucalyptus. He is also a technology evangelist and speaks at various technology conferences in India. He writes @ www.cloudgeek.in and www.vivekjuneja.in and loves grooming technology communities. You can also reach him by email: vivekjuneja@gmail.com

Feature image via Flickr Creative Commons

 
