
10 Kubernetes Best Practices in DevOps — without ChatGPT

When it comes to Kubernetes best practices, generative AI still has much to learn and should not be seen as a silver bullet. Human knowledge still outpaces AI.
Oct 5th, 2023 6:58am by

Since the introduction of ChatGPT, the chatbot has been used globally for a variety of use cases. Of interest to many is how AI compares to other offerings and search engines when it comes to providing useful and accurate answers to specific queries.

With this in mind, we conducted an experiment with ChatGPT on the topic of Kubernetes to determine where its answers were correct and where they were questionable. The results suggest that humans and their expertise remain irreplaceable.

The following are 10 current best practices for using Kubernetes in DevOps, drawn from firsthand human experience rather than written by an AI.

1. The Right Pod-to-Node Ratio Is Crucial

The key to using Kubernetes lies in using different node types based on workload requirements, such as CPU or memory optimization. Properly aligning containers with the CPU-to-memory ratio of nodes allows organizations to optimize resource usage.

However, finding the right number of pods per node is a balancing act that considers the varying consumption patterns of individual applications or services. Distributing the load across nodes using techniques like pod topology spread constraints or pod anti-affinity optimizes resource usage and adjusts to changing workload intensities.
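As a minimal sketch of the spreading technique mentioned above, the following Python builds the pod spec fragment for a topology spread constraint and prints it as JSON. The `app: web` label and the helper name `spread_constraint` are hypothetical; the field names follow the Kubernetes pod spec, and the right `maxSkew` depends on the workload.

```python
import json


def spread_constraint(app_label: str, max_skew: int = 1) -> dict:
    """Build one topology spread constraint that balances pods across nodes."""
    return {
        # Maximum allowed difference in matching pod count between any two nodes.
        "maxSkew": max_skew,
        # Spread per node; a zone label could be used instead to spread per zone.
        "topologyKey": "kubernetes.io/hostname",
        # Prefer balance but still schedule when it cannot be achieved.
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


# Hypothetical workload label, used only for illustration.
pod_spec_fragment = {"topologySpreadConstraints": [spread_constraint("web")]}
print(json.dumps(pod_spec_fragment, indent=2))
```

Setting `whenUnsatisfiable` to `DoNotSchedule` would make the constraint strict, at the cost of leaving pods pending when no node satisfies it.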

2. Securing the Kubernetes Control Plane

Monitoring the Kubernetes control plane is critical, especially with managed Kubernetes services. Even though cloud providers run the control plane reliably, it has limits that must be recognized. A slow control plane can significantly affect cluster behavior, including scheduling, upgrades and scaling operations. Excessive load on a managed control plane can lead to catastrophic crashes, so it’s essential to monitor and manage the control plane to keep it from becoming overloaded.
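One way to make "slow control plane" measurable is to track a high percentile of API request latency. The sketch below is an assumption-laden illustration: the latency samples are made up, the function names are hypothetical, and the 1-second threshold is a placeholder to tune for your cluster. In practice the samples would come from your monitoring system.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def control_plane_is_slow(latencies_s, threshold_s=1.0):
    """Flag the control plane when p99 API latency exceeds the threshold."""
    return percentile(latencies_s, 99) > threshold_s


# Made-up latency samples in seconds, for illustration only.
healthy = [0.05] * 98 + [0.4, 2.5]
overloaded = [0.05] * 90 + [2.0] * 10

print(control_plane_is_slow(healthy))
print(control_plane_is_slow(overloaded))
```

A percentile is used rather than an average so that a single outlier does not trigger the alert, while a sustained tail of slow requests does.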

3. Optimizing Application Uptime

Prioritizing critical services optimizes application uptime. Pod priorities and quality of service identify high-priority applications that are always on; understanding priority levels allows optimization of stability and performance. Simultaneously, pod anti-affinity prevents multiple replicas of the same service from being deployed on the same node. This avoids a single point of failure — meaning if there are issues on one node, the other replicas remain unaffected. Additionally, setting up specific node pools for mission-critical applications is beneficial. For instance, a separate node pool for ingress pods and other crucial services like Prometheus can significantly enhance service stability and end-user experience.
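The two mechanisms described above can be sketched as plain Python dictionaries that mirror the corresponding Kubernetes manifests. The class name `business-critical`, the priority value and the `app: ingress` label are hypothetical choices for illustration.

```python
import json

# A PriorityClass that critical workloads can reference by name.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "business-critical"},  # hypothetical name
    "value": 1000000,  # higher values are scheduled first; the number is an assumption
    "globalDefault": False,
    "description": "Services that must stay up under resource pressure.",
}


def anti_affinity(app_label: str) -> dict:
    """Require that replicas carrying the same app label land on different nodes."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


# Pod spec fragment combining the priority class with anti-affinity.
pod_spec_fragment = {
    "priorityClassName": priority_class["metadata"]["name"],
    "affinity": anti_affinity("ingress"),
}
print(json.dumps(pod_spec_fragment, indent=2))
```

The `required...` form is strict: with more replicas than nodes, the extra replicas stay pending. The `preferredDuringSchedulingIgnoredDuringExecution` form trades that guarantee for schedulability.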

4. Scaling Planning

Companies need to be prepared to handle large deployments and add capacity without causing negative impacts, ideally without forcing existing systems to grow. Automatic cluster scaling in managed services can help, but it’s essential to know the limits of cluster size. A typical cluster may include around 100 nodes; once that limit is reached, set up another cluster rather than expanding the existing one. Both vertical and horizontal application scaling need to be considered. The key is to strike the right balance so that resources are well utilized without being strained. Horizontal scaling, which replicates workloads, is generally preferred, with the caveat that additional replicas might affect database connections and storage space.
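For horizontal scaling, the Kubernetes Horizontal Pod Autoscaler documentation describes the core calculation as the current replica count scaled by the ratio of the observed metric to the target metric, rounded up. A minimal sketch of that arithmetic follows; the clamping bounds and sample numbers are assumptions, and the real autoscaler also applies a tolerance and stabilization behavior that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Compute ceil(current_replicas * current_metric / target_metric), clamped to bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% average CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 replicas at 30% average CPU against a 60% target: scale in.
print(desired_replicas(4, 30, 60))
# A large spike is capped by max_replicas.
print(desired_replicas(8, 200, 50))
```

The `max_replicas` cap is where the caveat about database connections applies: each extra replica may open its own connection pool, so the cap should reflect what downstream systems can absorb.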

5. Prepare for Failures

Planning for failures has become commonplace across various aspects of application infrastructure. To ensure preparedness, develop playbooks that cover various failure scenarios, including application failures, node failures and cluster failures. Implementing strategies like high-availability application pods and pod anti-affinity helps ensure coverage in case of failures.

Every company needs a detailed disaster recovery plan for cluster outages and should practice it regularly. Controlled and gradual deployment during recovery helps avoid overloading resources.
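A controlled, gradual redeployment can be sketched as restoring workloads in small batches, most critical first. The workload names, priority numbers and batch size below are hypothetical; a real playbook would also wait for each batch to become healthy before starting the next.

```python
def recovery_batches(workloads, batch_size):
    """Yield workload names in batches, highest priority first.

    `workloads` is a list of (name, priority) tuples; a larger priority
    means the workload is more critical.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # sorted() is stable, so workloads with equal priority keep their input order.
    ordered = sorted(workloads, key=lambda w: -w[1])
    for start in range(0, len(ordered), batch_size):
        yield [name for name, _ in ordered[start:start + batch_size]]


# Hypothetical workloads for illustration.
workloads = [("reports", 1), ("ingress", 3), ("api", 2), ("batch", 1)]

for number, batch in enumerate(recovery_batches(workloads, batch_size=2), start=1):
    print(f"batch {number}: {batch}")
```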

6. Securing the Software Supply Chain

The software supply chain is consistently vulnerable to errors and malicious actors. Controlling every step of the pipeline and not relying on external tools and providers without carefully vetting their trustworthiness is essential. Maintaining control over external sources includes measures like scanning binary files sourced from remote repositories and validating them with a software composition analysis (SCA) solution. Teams should also apply quality and security checks throughout the pipeline to ensure higher trust both from users and within the pipeline itself, guaranteeing higher quality of delivered software.
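One small piece of this control is refusing any downloaded binary whose digest does not match a pinned value. The sketch below shows only that checksum step using Python's standard library; it is not a substitute for a software composition analysis tool, and the artifact bytes are a stand-in for a real download.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def artifact_matches(data: bytes, expected_hex: str) -> bool:
    """Check the artifact against a pinned digest using a constant-time comparison."""
    return hmac.compare_digest(sha256_hex(data), expected_hex.lower())


# Stand-in for a binary fetched from a remote repository.
artifact = b"example artifact contents"
# In practice the pinned digest comes from a trusted source, not from the download itself.
pinned_digest = sha256_hex(artifact)

print(artifact_matches(artifact, pinned_digest))
print(artifact_matches(artifact + b" tampered", pinned_digest))
```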

7. Runtime Security

Using admission controllers to enforce rules, such as blocking the deployment of blacklisted versions, contributes to runtime security. Tools like OPA Gatekeeper assist in enforcing policies that might, for example, only allow controlled container registries for deployment.
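The registry rule can be illustrated in plain Python. This is not Gatekeeper policy code (Gatekeeper policies are written in Rego); it only sketches the decision such a policy would make. The registry name `registry.example.com` is hypothetical, and the parsing follows the common container image convention that the first path component is a registry only if it contains a dot or colon or is `localhost`.

```python
def image_registry(image: str) -> str:
    """Return the registry host of an image reference, defaulting to docker.io."""
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    # No explicit registry component, so the image resolves to Docker Hub.
    return "docker.io"


# Hypothetical allowlist of controlled registries.
ALLOWED_REGISTRIES = frozenset({"registry.example.com"})


def is_allowed(image: str, allowed=ALLOWED_REGISTRIES) -> bool:
    """Admit the image only if it comes from an allowlisted registry."""
    return image_registry(image) in allowed


for image in ("registry.example.com/team/app:1.2", "nginx:1.25", "localhost:5000/app"):
    print(image, "->", "admit" if is_allowed(image) else "reject")
```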

Role-based access control is also recommended to secure access to Kubernetes clusters, and other runtime protection solutions can detect and address risks in real time. Namespace isolation and network policies help block lateral movement and protect workloads within namespaces. Consider running critical applications on isolated nodes to mitigate the risk of container escape scenarios.
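A common starting point for namespace isolation is a default-deny network policy, to which explicit allow rules are then added. The sketch below builds such a manifest as a Python dictionary; the namespace `payments` and the policy name are hypothetical, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that blocks all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies traffic in both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(default_deny_policy("payments"), indent=2))
```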

8. Securing the Entire Environment

To secure your environment, assume that the network is constantly under attack. Auditing tools are recommended to detect suspicious activity in clusters and infrastructure, as are runtime protection measures with full transparency and workload controls.

Best-of-breed tools are helpful, but a robust incident response team with a clear playbook for alerts or suspicious activity is essential. Similar to disaster recovery, regular exercises and practices are necessary here as well. Many companies also employ bug bounties or external researchers to attempt compromising the system to uncover vulnerabilities. The external perspective and objective investigation can yield valuable insights.

9. Continuous Learning

Continuous learning is crucial when evolving systems and processes. It involves collecting historical performance data, assessing it and acting on the findings. Small, continuous improvements are the norm, because what was relevant in the past might no longer be.

Proactively monitoring performance data can help identify memory or CPU leaks in your services or performance glitches in third-party tools. Actively evaluating data for trends and anomalies improves system understanding and performance. Proactive monitoring and evaluation lead to more effective results than reacting to real-time alerts.
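Trend evaluation of this kind can be as simple as fitting a line to historical memory samples and flagging steady growth. The sketch below uses an ordinary least-squares slope over evenly spaced samples; the sample values, the function names and the growth threshold are assumptions for illustration.

```python
def trend_slope(values):
    """Least-squares slope of values taken at evenly spaced intervals."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(memory_mib, min_growth_per_sample):
    """Flag a service whose memory grows at least min_growth_per_sample on average."""
    return trend_slope(memory_mib) >= min_growth_per_sample


# Made-up hourly memory readings in MiB.
steady = [100, 102, 99, 101, 100]
growing = [100, 110, 120, 130]

print(looks_like_leak(steady, min_growth_per_sample=5))
print(looks_like_leak(growing, min_growth_per_sample=5))
```

A fitted slope smooths over short spikes that would trip a simple threshold alert, which is the difference between evaluating trends and reacting to real-time alerts.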

10. Humans Are the Weakest Link

Automation minimizes human involvement wherever possible, and sometimes that’s a good thing. Humans are the weakest link when it comes to security. Explore a range of available automation solutions and find the best fit for your processes and definitions.

GitOps is a popular approach to move changes from development to production, providing a familiar contract and interface for managing configuration changes. A similar approach uses multiple repositories for different types of configurations, but it’s important to maintain a clear separation between development, staging and production environments, even if they should resemble each other.
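One way to keep environments separate yet similar is a shared base configuration with small per-environment overrides. The sketch below shows that idea with a recursive merge; the keys, values and environment names are hypothetical, and real GitOps setups typically express the same pattern through their own overlay or templating tooling.

```python
import copy


def merge(base: dict, override: dict) -> dict:
    """Return a new dict where override values replace or extend base values."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Shared settings; every environment starts from these.
BASE = {"replicas": 2, "resources": {"cpu": "250m", "memory": "256Mi"}}

# Each environment declares only what differs from the base.
OVERLAYS = {
    "development": {"replicas": 1},
    "staging": {},
    "production": {"replicas": 6, "resources": {"memory": "1Gi"}},
}

rendered = {env: merge(BASE, overlay) for env, overlay in OVERLAYS.items()}
for env, config in rendered.items():
    print(env, config)
```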

Looking to the Future

AI-driven solutions hold much promise for the future, as they help reduce operational complexity and automate tasks related to environment management, deployments and troubleshooting. Nevertheless, human judgment is irreplaceable and should always be considered. Today’s AI engines rely on public knowledge, which might contain inaccurate, outdated or irrelevant information, ultimately leading to incorrect answers or recommendations. At the end of the day, common sense and acknowledging the limits of AI are of utmost importance.


TNS owner Insight Partners is an investor in: Pragma.