ChaosNative sponsored this post.
HaloDoc is the most popular all-around healthcare application in Indonesia. A rapidly-growing startup founded in 2016, our mission is to simplify and bring quality healthcare across Indonesia. We partner with more than 4,000 pharmacies in over 100 cities to bring medicine to people’s doorsteps. Recently, we launched a premium appointment service that partners with more than 500 hospitals, allowing patients to book a doctor’s appointment inside our application.
A quick reading of this profile will give you a hint about the mission-critical nature of our service.
Background: Service Infrastructure at HaloDoc
The platform is composed of several microservices hosted across hybrid infrastructure elements, mainly on a managed Kubernetes cloud, with an intricately designed communication framework. We also leverage AWS cloud services such as RDS, Lambda and S3, and consume a significant suite of open source tooling, especially from the Cloud Native Computing Foundation landscape, to support the core services.
As the architect and manager of site reliability engineering (SRE) at HaloDoc, ensuring smooth functioning of these services is my core responsibility. In this post, I’d like to provide a quick snapshot of why and how we use chaos engineering as one of the means to maintain resilience.
While operating a platform of such scale and churn (newer services are onboarded quite frequently), one is bound to encounter some jittery situations. We had a few incidents with newly added services going down that, despite being immediately mitigated, caused concern for our team. In a system with the kind of dependencies we had, it was necessary to test and measure service availability across a host of failure scenarios. This needed to be done before going live, and occasionally after it, albeit in a more controlled manner. Such testing and measurement would complement a well-oiled QA process, with its comprehensive automated test suites and periodic performance testing/analysis, to make the platform robust.
An encouraging leadership group, flexible engineering team and a strong observability infrastructure enabled us to begin the practice of chaos engineering without much cultural friction.
Choosing the Chaos Platform
In keeping with our open source culture and affiliation, around January 2021, we started looking for an open source chaos engineering solution that would meet the following criteria:
- Kubernetes native: HaloDoc uses Kubernetes as the underlying platform for a majority of the business services, including hosting tools that operate and manage observability across our fleet of clusters. We needed a chaos tool that could be deployed and managed on ARM64 (AWS Graviton)-based Kubernetes and that let us express a chaos test in Kubernetes’ own language, resource YAML.
- Multiple fault types with extensibility: Considering that the microservices span several frameworks and languages (Java, Python, C++, Golang), it was necessary to subject them to varied service-level faults. Add to that the hybrid nature of the infrastructure (varied AWS services), and the need to target non-Kubernetes entities like cloud instances, disks, etc., becomes clear. Furthermore, we were looking for a chaos platform that could help application developers build their own faults, integrate them into the suite and have them orchestrated in a similar fashion to the native faults.
- Chaos scenario definition: We needed a way to define a full-fledged scenario that combined faults with some custom validation, depending on the use case, as the chaos tests were expected to run in an automated fashion after the initial experimentation/establishing test fit. HaloDoc also uses a variety of synthetic load tools mapped to families of microservices in its test environment that we wanted to leverage as part of a chaos experiment, to make it more effective and derive greater confidence.
- Security features: The tiered-staging environments at HaloDoc are multiuser, shared environments accessed by dedicated service owners and SRE teams, with frequent upgrades to the applications. We needed a tool with the ability to isolate the chaos view for respective teams, with admin controls in place for the possible blast radius. This came in addition to the standard security considerations around running third-party containers.
- Observability hooks: As an organization that has invested heavily in observability, both for monitoring application/infrastructure behavior (the stack includes New Relic, Prometheus, Grafana, Elasticsearch, etc.) and for reporting and analysis (we use Allure for test reports and Lighthouse for service analytics), we expected the chaos framework to provide us with enough data to ingest in terms of logs, metrics and events.
- Community support: This is an important consideration when choosing between open source solutions. We were looking for a strong community around the tool, with approachable maintainers who could see reason in our issues and enhancements while keeping a welcoming environment for users like us to contribute back.
We finally settled on LitmusChaos, which met the criteria we were looking for to a great extent, while having a roadmap and release cadence that aligned well with our needs and pace. It also had some other interesting features we have started using since, such as GitOps support. We ended up contributing toward better user experience (Litmus dashboard) and improved security in the platform.
How We Practice Chaos Today
Our initial efforts with Litmus involved manually creating ChaosEngine custom resources that targeted application pods to verify behavior. This, in itself, proved beneficial, unearthing some interesting application behavior in the development environment. Eventually, the experiments were crafted with the right validations using Litmus’s probe feature and stitched together to form Chaos Workflow resources that could be invoked programmatically.
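As an illustration of that first step, here is a minimal sketch of a ChaosEngine with an HTTP probe, built as a Python dictionary and printed as JSON (which kubectl also accepts). The field names follow the Litmus v1alpha1 ChaosEngine schema as commonly documented and should be verified against the installed Litmus version. The service name, namespace, label, service account and health URL are hypothetical placeholders, not HaloDoc's actual configuration.

```python
import json


def build_chaos_engine(app_name, namespace, health_url, duration_s=30):
    """Return a ChaosEngine manifest (as a dict) for a pod-delete fault.

    All argument values used below are illustrative placeholders.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": f"{app_name}-pod-delete", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Selects the application under test by namespace, label and kind.
            "appinfo": {
                "appns": namespace,
                "applabel": f"app={app_name}",
                "appkind": "deployment",
            },
            # Assumed service account name; it must exist with suitable RBAC.
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                {
                                    "name": "TOTAL_CHAOS_DURATION",
                                    "value": str(duration_s),
                                }
                            ]
                        },
                        # The probe encodes the validation: the health
                        # endpoint must keep answering 200 during the fault.
                        "probe": [
                            {
                                "name": "check-service-health",
                                "type": "httpProbe",
                                "mode": "Continuous",
                                "httpProbe/inputs": {
                                    "url": health_url,
                                    "method": {
                                        "get": {
                                            "criteria": "==",
                                            "responseCode": "200",
                                        }
                                    },
                                },
                                "runProperties": {
                                    "probeTimeout": 5,
                                    "interval": 2,
                                    "retry": 1,
                                },
                            }
                        ],
                    },
                }
            ],
        },
    }


engine = build_chaos_engine(
    "orders", "staging", "http://orders.staging.svc:8080/health"
)
manifest_json = json.dumps(engine, indent=2)
print(manifest_json)
```

Generating the manifest from a function rather than hand-editing YAML is only one option. It makes it easier to stamp out the same fault and probe for many services before they are combined into a workflow.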
Today, these chaos workflows, stored in a dedicated git repository, are mapped to respective application services via a subscription mechanism and are triggered upon app upgrade via the Litmus event-tracker service residing on the staging cluster. Interestingly, we soon learned that automated chaos as part of continuous deployment seems to be a trend that is picking up in the ecosystem, and we were not alone in thinking this way.
While the chaos experiments on staging are used as a gating mechanism for deployment into production, the team at HaloDoc believes firmly in the merits of testing in production. We use the Scheduled Chaos capability of Litmus to conduct “automated game days” in the production environment, with a mapping between fault type and load conditions that we devised based on usage and traffic patterns.
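The gating idea can be sketched in a few lines. This is a hypothetical illustration rather than HaloDoc's actual pipeline logic: it assumes each staging experiment yields a verdict string such as "Pass" or "Fail" (the values Litmus reports in its chaos results, to the best of our knowledge), and the experiment names are made up.

```python
def can_promote(verdicts):
    """Return True only if at least one experiment ran and every one passed.

    `verdicts` maps an experiment name to its verdict string. An empty
    mapping blocks promotion, since no evidence is not a passing result.
    """
    return bool(verdicts) and all(v == "Pass" for v in verdicts.values())


# Hypothetical results from a staging run.
staging_run = {"pod-delete": "Pass", "pod-network-latency": "Fail"}
clean_run = {"pod-delete": "Pass", "pod-network-latency": "Pass"}

print(can_promote(staging_run))
print(can_promote(clean_run))
```

Treating an empty result set as a failure is a deliberate choice in this sketch: a deployment should not pass the gate merely because the chaos workflow never ran.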
The results of these experiments are fed into a data lake for further analysis by the development teams. Reports from ChaosCenter, the control plane component of Litmus, especially those comparing the resilience scores of scenarios, are also leveraged for high-level views.
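For readers unfamiliar with the resilience score, the sketch below shows one way such a score can be computed: a weighted average of each experiment's probe success percentage. This reflects our understanding of how Litmus describes the score and should be checked against the Litmus documentation; the weights and percentages are invented for illustration.

```python
def resilience_score(experiments):
    """Weighted average of probe success percentages.

    `experiments` is a list of (weight, probe_success_percentage) tuples,
    where the percentage is between 0 and 100.
    """
    total_weight = sum(weight for weight, _ in experiments)
    if total_weight == 0:
        raise ValueError("at least one experiment needs a non-zero weight")
    weighted_sum = sum(weight * pct for weight, pct in experiments)
    return weighted_sum / total_weight


# Three hypothetical experiments in one scenario: (weight, success %).
score = resilience_score([(10, 100.0), (5, 50.0), (5, 0.0)])
print(score)  # 62.5
```

Because the weights express how much each fault matters to the service, comparing scores between two runs of the same scenario shows whether a release made the service more or less tolerant of the faults that matter most.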
The personnel involved in creating/maintaining and tracking the chaos tests on staging are largely developers and extended tech teams belonging to the different verticals, while the game days are exclusively carried out by the members of the SRE team.
The upgrades of the chaos microservices on the clusters are carried out in much the same fashion as other tooling, with the application undergoing standard scans and checks in our GitLab pipelines.
KPIs for Resilience Engineering
Since our chaos journey is in its fledgling stages, the evaluation criteria are not fully crystallized yet. Having said that, efforts must be measured to yield the right benefits! Here are some of the important top-level metrics that we are keeping tabs on, a few of which have shown the right curve/trend as our chaos practice has picked up.
- MTTR: Many application services have been configured to self-heal, but the mean time to recover (MTTR) tells us how efficient that mechanism is. This is tracked for all target microservices (application-under-test or AUT in Litmus parlance) using both native Litmus probes and custom checks.
- Error budgets: As a fast-paced startup with nearly daily upgrades into prod, we need to embrace risk. Chaos experiment results with failed verdicts are taken into account while calculating error budgets, the latter being tracked at different levels of service.
- SLOs: The steady-state hypothesis accompanying an experiment may directly relate to a service-level objective or contribute to it. The SLOs are tracked for upkeep while the degree of failure is increased via the workflow feature.
- Tickets/outages: Outage alerts and submitted incident tickets are another business-level metric, tracked to gauge the efficacy of the chaos practice.
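Two of the metrics above, MTTR and error budgets, come down to simple arithmetic. The sketch below is a generic illustration rather than HaloDoc's actual calculation: it assumes recovery times are available as pairs of timestamps (fault injected, service recovered) and that the error budget is request-based, with requests that failed during chaos runs counted as budget spent. All numbers are made up.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average recovery duration.

    `incidents` is a list of (fault_injected_at, recovered_at) datetimes.
    """
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    With an SLO of 0.999, 0.1% of requests are allowed to fail.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


t0 = datetime(2021, 6, 1, 10, 0, 0)
t1 = datetime(2021, 6, 2, 15, 30, 0)
mttr = mean_time_to_recover([
    (t0, t0 + timedelta(seconds=40)),
    (t1, t1 + timedelta(seconds=80)),
])
print(mttr.total_seconds())  # 60.0

# 250 failed requests out of 1,000,000 against a 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(round(remaining, 4))
```

A falling MTTR across successive chaos runs suggests the self-healing mechanisms are improving, while the remaining budget indicates how much more risk (including further experiments) a service can absorb in the current window.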
Current Impact and Looking Ahead
Thus far, chaos engineering has helped on both counts: verifying expected behavior and learning about new behaviors. Looking ahead, we expect to add more scenarios targeting different aspects of system behavior. We expect the chaos practice and culture to grow stronger, with dedicated resources allocated across verticals.
From a technical standpoint, we are working with the Litmus team on baking in improved authentication (G Suite integration) and secrets management (Vault) mechanisms to widen the platform’s reach in the organization.
Featured image provided by ChaosNative.