Gremlin’s chaos engineering platform is now available on the Amazon Web Services CloudFormation Public Registry. This makes it easy for AWS customers and Gremlin users to discover, deploy, and manage Gremlin agents across their AWS infrastructure, specifically their Amazon EKS clusters.
Reliability is paramount when running workloads in the cloud. Even in a fully managed cloud environment, a wide range of failure modes can still cause outages. These outages can cost customer trust, revenue, and valuable engineering time spent on troubleshooting and incident response. Reliability is so important that it’s one of the pillars of the AWS Well-Architected Framework. With Gremlin and CloudFormation Public Registry, you can validate the resilience of your AWS deployments against a variety of failure modes.
Installing the Gremlin agent enables you to run targeted experiments on your EKS workloads, such as:
- Testing the configuration of auto-scaling groups (ASGs) by simulating heavy traffic.
- Validating region failover and disaster recovery by simulating Availability Zone or region outages.
- Validating CloudWatch configurations and alerts.
- Ensuring that containerized workloads, Kubernetes resources, and distributed services can automatically recover from failure.
In this tutorial, we’ll show you how to use CloudFormation Public Registry to deploy Gremlin and validate that you can run experiments on your cluster. You’ll create an IAM role for CloudFormation, deploy an Amazon EKS cluster, activate the Gremlin extension in CloudFormation, and finally deploy the agent to your cluster.
How It Works
The Gremlin agent is an executable that orchestrates experiments on a host. On Kubernetes clusters, it is deployed as a DaemonSet, which means an instance of the agent is automatically deployed to each node in the cluster. The agent detects the name of the host, its status (active or idle), AWS-specific metadata such as Availability Zone and Region, and Kubernetes resources (such as Deployments, Pods, and DaemonSets). You can then use this information to target a specific resource, or a set of resources, when running an experiment using the Gremlin web app, API, or CLI. The Gremlin agent can also detect processes running on your hosts, which can be targeted using the Service Discovery feature.
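To make the targeting idea concrete, here is a minimal Python sketch of selecting hosts by the metadata the agent reports. It is only an illustration: the `Host` class and `select_targets` function are hypothetical names invented for this example and are not part of the Gremlin API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    """Hypothetical record of the metadata an agent reports for one node."""
    name: str
    status: str  # "active" or "idle"
    zone: str    # Availability Zone
    region: str


def select_targets(hosts: List[Host], **criteria: str) -> List[Host]:
    """Return the hosts whose attributes match every given criterion."""
    return [
        host for host in hosts
        if all(getattr(host, key) == value for key, value in criteria.items())
    ]


# Example inventory with made-up node names.
inventory = [
    Host("node-a", "active", "us-east-1a", "us-east-1"),
    Host("node-b", "idle", "us-east-1b", "us-east-1"),
]
zone_a_targets = select_targets(inventory, zone="us-east-1a")
```

Filtering on `zone` in this way mirrors how an experiment could be scoped to a single Availability Zone to rehearse a zone outage.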
CloudFormation Public Registry uses the Gremlin Helm chart to deploy the Gremlin agent. You don’t need to be familiar with Helm to follow this tutorial, unless you want to configure the chart yourself.
Step 1: Create an IAM Role for CloudFormation
Our first step is to create an IAM (Identity and Access Management) role for CloudFormation, which will give it the necessary permissions. A template is available here. Running this template will generate an ARN (Amazon Resource Name), which you will need for the following steps.
Next, enable the AWSQS::EKS::Cluster extension. Navigate to the CloudFormation registry, select public extensions, then search for “AWSQS::EKS::Cluster”. Click Activate, and when prompted for an execution role ARN, use the ARN created for your IAM role.
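If you prefer to script this activation rather than use the console, the following is a rough sketch using boto3’s CloudFormation client. It assumes boto3 is installed and AWS credentials are configured. The function names are our own, and the lookup only reads the first page of `list_types` results, which we assume is enough when filtering by the full type name.

```python
from typing import Optional


def find_public_type_arn(list_types_response: dict, type_name: str) -> Optional[str]:
    """Return the TypeArn of the summary whose TypeName matches exactly, else None."""
    for summary in list_types_response.get("TypeSummaries", []):
        if summary.get("TypeName") == type_name:
            return summary.get("TypeArn")
    return None


def activate_extension(type_name: str, execution_role_arn: str) -> str:
    """Look up a third-party public extension and activate it in the current Region."""
    import boto3  # imported lazily so the helper above works without boto3

    cfn = boto3.client("cloudformation")
    response = cfn.list_types(
        Visibility="PUBLIC",
        Type="RESOURCE",
        Filters={"Category": "THIRD_PARTY", "TypeNamePrefix": type_name},
    )
    public_arn = find_public_type_arn(response, type_name)
    if public_arn is None:
        raise LookupError(f"No public extension named {type_name!r} was found")
    result = cfn.activate_type(
        PublicTypeArn=public_arn,
        ExecutionRoleArn=execution_role_arn,
    )
    return result["Arn"]  # ARN of the extension as activated in this account
```

Calling `activate_extension("AWSQS::EKS::Cluster", role_arn)`, where `role_arn` is the ARN from your IAM role, would be the scripted counterpart of clicking Activate.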
Step 2: Deploy an Amazon EKS Cluster
Next, you’ll need to give CloudFormation access to the Kubernetes API for your cluster. You can deploy a new cluster using this template, or you can manually add the IAM execution role to an existing cluster to grant access. You can find additional instructions in our GitHub repository.
Step 3: Activate the Gremlin Extension
Now that your cluster is running and CloudFormation has access to the Kubernetes API, the next step is to activate the Gremlin extension. Navigate to the CloudFormation Registry. Under “Publisher”, switch to “Third Party” and search for “Gremlin”.
Leave the details with their default settings, but for the execution role ARN, enter the ARN that you generated in Step 1. Then, press “Activate extension”.
Step 4: Deploy the Gremlin Agent
The last step is to deploy the Gremlin agent. This extension uses the Gremlin Helm chart, which is configured using a YAML template. As part of the Gremlin agent installation, you’ll need to authenticate it with your Gremlin account using your Gremlin team ID and either secret-based authentication or certificate-based authentication. For this tutorial, we’ll use secret-based authentication. You’ll also need to provide a name for the cluster: this name will be used to identify the cluster in the Gremlin web app.
You can use the YAML below as a template. Replace the following values:
- <YOUR-GREMLIN-TEAM-ID>: The unique ID for your Gremlin team.
- <A-NAME-FOR-YOUR-EKS-CLUSTER>: A unique name for your EKS cluster. You’ll use this to identify your cluster in the Gremlin web app and for selecting experiment targets.
- <YOUR-GREMLIN-TEAM-SECRET>: Your Gremlin team secret (see https://www.gremlin.com/docs/infrastructure-layer/authentication/#create-a-secret).
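If you would rather fill in these placeholders with a script than by hand, a small Python helper along these lines could do it and also catch any placeholder you forgot. The `fill_placeholders` name and the `sample` text are hypothetical. The sample is a stand-in with made-up keys, not the actual Gremlin template.

```python
import re

# Matches angle-bracket placeholders such as <YOUR-GREMLIN-TEAM-ID>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z-]*>")


def fill_placeholders(template: str, values: dict) -> str:
    """Replace each <NAME> with its value and fail if any placeholder is left."""
    filled = template
    for name, value in values.items():
        filled = filled.replace(f"<{name}>", value)
    leftover = PLACEHOLDER.findall(filled)
    if leftover:
        raise ValueError(f"Unreplaced placeholders: {sorted(set(leftover))}")
    return filled


# Stand-in text only; the keys below are invented for illustration.
sample = (
    "team: <YOUR-GREMLIN-TEAM-ID>\n"
    "cluster: <A-NAME-FOR-YOUR-EKS-CLUSTER>\n"
    "secret: <YOUR-GREMLIN-TEAM-SECRET>\n"
)
filled_sample = fill_placeholders(sample, {
    "YOUR-GREMLIN-TEAM-ID": "example-team-id",
    "A-NAME-FOR-YOUR-EKS-CLUSTER": "example-cluster",
    "YOUR-GREMLIN-TEAM-SECRET": "example-secret",
})
```

Raising an error on leftover placeholders keeps a half-edited template from reaching CloudFormation. In practice, you would read the team secret from a secure source rather than hard-coding it.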
In the AWS CloudFormation console, create a new stack using this template and enter a name for the stack. Create the stack, then monitor the Events tab. Once the stack is deployed, you will see an event with the status ‘CREATE_COMPLETE’.
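The same create-and-wait flow can be sketched with boto3 instead of the console. This is an outline under assumptions: boto3 is installed, credentials are configured, and the template needs no extra parameters or capabilities. The function names here are our own.

```python
def stack_status(describe_stacks_response: dict) -> str:
    """Extract the status of the first stack in a describe_stacks response."""
    return describe_stacks_response["Stacks"][0]["StackStatus"]


def create_and_wait(stack_name: str, template_body: str) -> str:
    """Create a stack, block until creation finishes, and return its status."""
    import boto3  # imported lazily so stack_status works without boto3

    cfn = boto3.client("cloudformation")
    cfn.create_stack(StackName=stack_name, TemplateBody=template_body)
    # The waiter polls the stack and raises an error if creation fails.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_status(cfn.describe_stacks(StackName=stack_name))
```

If creation succeeds, `create_and_wait` should return `CREATE_COMPLETE`, the same status shown in the Events tab.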
You can verify that Gremlin was successfully deployed by logging into the Gremlin web app, clicking Clients, and selecting the Kubernetes tab. You will see your cluster listed by name, along with its namespaces. You can now run experiments by clicking the Attack cluster button.
To learn more about Gremlin’s CloudFormation integration, visit our GitHub repository.
Amazon Web Services is a sponsor of The New Stack.