Best Practices for Running Confluent Kafka

With over 80% of Fortune 100 companies using Kafka, it’s safe to describe it as the industry-standard solution for streaming data between applications. Major enterprises like Netflix and Google, as well as small startups and individual projects, rely on Kafka to broker messages, track website activity, monitor operational data, aggregate logs, process data streams and more. Companies use Kafka to stream events such as video metadata updates, real-time vehicle telemetry and financial transactions.
Notably, Kafka enables organizations to integrate large portfolios of applications and microservices by serving as a message broker, making it an ideal solution for Kubernetes-based infrastructures. But as a distributed system that interacts with all of these disparate microservices, Kafka can be difficult to manage on its own.
Kafka and Confluent
Because of this complexity, vendors like Confluent provide an enterprise-ready Kafka distribution and managed service. Rather than build their own connectors, stream processing, data governance and security, disaster recovery and other capabilities and components, enterprise teams use Confluent to hit the ground running.
Of course, adopting any managed platform means giving up some control and flexibility compared to a DIY implementation. Organizations that want maximum flexibility, have a unique application ecosystem and/or have a technically strong team with sufficient resources may prefer to implement and manage Kafka themselves.
However, for the vast majority of organizations, the time, money and effort saved by using Confluent far outweigh any loss of operational control.
Key Pitfalls to Avoid When Working with Confluent Kafka
Treating Kafka as a Stateful, Persistent Database
While Kafka does stream real-time events into an immutable, replayable ledger, that ledger should not be treated as a stateful database. Confluent Kafka lacks features that purpose-built databases normally provide, such as complex query capabilities and the disaster-recovery guarantees that minimize downtime and maximize recoverability.
To avoid this, developers should use the outbox pattern: business data and the corresponding outgoing events are written to a persistent, relational database such as PostgreSQL within the same transaction, and a relay process then publishes those events to Kafka. This way, the database remains the source of truth and Kafka topics can be recreated and recovered at any time, enabling high resiliency and scalability and reducing the overall operational cost of managing Kafka.
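As a rough illustration, here is a minimal Python sketch of the two halves of the pattern, using the open source confluent-kafka and psycopg2 client libraries. The `orders` and `outbox` tables, column names and connection details are assumptions made for the example, not a prescribed schema.

```python
# Sketch of the outbox pattern (illustrative names and schema).
import json
import psycopg2
from confluent_kafka import Producer

pg = psycopg2.connect("dbname=orders user=app")          # assumed connection string
producer = Producer({"bootstrap.servers": "localhost:9092"})

def place_order(order_id, payload):
    """Write the business row and the outgoing event in ONE transaction."""
    with pg:
        with pg.cursor() as cur:
            cur.execute("INSERT INTO orders (id, body) VALUES (%s, %s)",
                        (order_id, json.dumps(payload)))
            cur.execute("INSERT INTO outbox (topic, key, value, published) "
                        "VALUES (%s, %s, %s, false)",
                        ("orders", str(order_id), json.dumps(payload)))

def relay_outbox():
    """Separate process: publish unsent outbox rows to Kafka, then mark them sent."""
    with pg:
        with pg.cursor() as cur:
            cur.execute("SELECT id, topic, key, value FROM outbox "
                        "WHERE NOT published ORDER BY id")
            for row_id, topic, key, value in cur.fetchall():
                producer.produce(topic, key=key, value=value)
                cur.execute("UPDATE outbox SET published = true WHERE id = %s",
                            (row_id,))
    producer.flush()
```

Because every event also lives in PostgreSQL, the relay can be re-run at any time to rebuild a topic after a failure.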
Relying on One Kafka Cluster
You’re using Kafka to unite your disparate microservices and applications, so it makes sense to keep it as centralized as possible, right? Unfortunately, doing so puts your entire ecosystem at risk should something happen to the cluster you use to host your single Kafka instance.
Instead, it’s a better practice to use Kubernetes to create multiple, smaller Kafka clusters that are each tailored to their specific environments. Then, using Kubernetes’ native capabilities and Confluent’s managed services, automate as much of the deployment as possible. This way, your system becomes more distributed without ratcheting up the complexity of your environment, and you retain the resiliency that comes with distributed systems.
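The application side of this setup can stay simple: each smaller cluster exposes its own bootstrap address, and clients read it from configuration that Kubernetes injects (for example, via a ConfigMap-backed environment variable) rather than hard-coding one central cluster. A minimal Python sketch, with the environment variable and topic names assumed for illustration:

```python
# Minimal sketch: the same application code targets whichever Kafka cluster
# its environment provides; Kubernetes injects the bootstrap address.
import os
from confluent_kafka import Producer

# Assumed environment variables, e.g. populated from a per-environment ConfigMap.
bootstrap = os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
client_id = os.environ.get("KAFKA_CLIENT_ID", "orders-service")

producer = Producer({
    "bootstrap.servers": bootstrap,
    "client.id": client_id,
})

producer.produce("orders", key="42", value=b'{"status": "created"}')
producer.flush()
```

The same container image can then be deployed against the dev, staging and production clusters with nothing but configuration changing between them.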
Being Too Liberal When Accepting and Processing Kafka Data
Some messages will inevitably be malformed or invalid and cause unexpected behavior when they are processed. In the effort to automate everything, some developers might try to implement a “best-effort” model that processes these messages anyway, but that can lead to bigger problems in other systems.
Instead, developers should use a dead letter queue: a dedicated topic that holds invalid messages or those that failed processing. Kafka Connect supports this natively through its errors.deadletterqueue.topic.name setting, and consumer applications can implement the same pattern themselves. Developers can configure alerts to fire when messages land in that topic, enabling support teams to address invalid messages manually. Even though manual intervention should be minimized where possible, it’s better to let human eyes vet invalid messages than to pass them on to other systems and potentially break something.
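For plain consumer applications (outside Kafka Connect), the pattern can be sketched by hand: wrap processing in error handling and forward anything that can’t be handled to a dead-letter topic instead of guessing. The Python sketch below assumes hypothetical topic names and a placeholder process_message function:

```python
# Sketch of a consumer-side dead letter queue (illustrative topic/function names).
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-processor",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["orders"])

def process_message(raw):
    """Placeholder for real business logic; raises on invalid input."""
    return json.loads(raw)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process_message(msg.value())
    except Exception:
        # Don't attempt a "best effort" fix: park the message for humans to inspect.
        producer.produce("orders.dlq", key=msg.key(), value=msg.value())
        producer.flush()
    consumer.commit(message=msg)
```

An alert on the lag or message count of the orders.dlq topic then tells the support team exactly when manual intervention is needed.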
Underestimating Kafka’s Storage Sensitivity
Unfortunately, Kafka behaves quite poorly when its underlying storage system isn’t performing well or has spotty availability. When data isn’t immediately available or highly consistent, Kafka’s own workloads and operations can get jammed up. Even when managed by Confluent, Kafka still needs extremely performant storage to function properly.
To get a reliable underlying storage infrastructure, Confluent Kafka users commonly turn to Amazon’s Elastic Block Store (EBS). The issues with EBS, however, are well known. While highly performant, an EBS volume is limited to a single Elastic Compute Cloud (EC2) instance in a single Availability Zone (AZ) and is only available on Amazon Web Services (AWS). As a result, organizations looking to scale and use multiple Kafka clusters across their ecosystem will need to provision multiple EBS volumes across multiple AZs and must be prepared to intervene manually to preserve availability during an AZ outage.
Fortunately, this scalability challenge is a perfect use case for Ondat.
Ondat’s Approach to Storage for Confluent Kafka
What Confluent Kafka needs to mitigate its storage challenges is a data layer. Ondat serves as a highly performant, encrypted and available data layer that pools storage across nodes. As a result, Confluent Kafka instances can access shared storage resources even if they’re distributed across multiple clusters and availability zones.
If an enterprise decides to use Ondat with EC2 instance storage, for example, it can leverage the guidance above to run a maximum-performance Kafka cluster at minimal cost. Even if it is using EBS, it can enable automated replication using Ondat’s delta sync technology to ensure availability across AZ failures. The same application configuration can be used with any infrastructure, and Ondat provides the same features on any platform, from on-premises through to cloud. Thus, enterprises gain a more cost- and resource-efficient way to consume storage services.
This approach to storage meshes nicely with Confluent’s application-level enhancements for Kafka, such as Confluent Replicator, which enables quick replication of topics from one Kafka cluster to another. With Ondat serving as the underlying data plane across Kafka clusters, enterprises gain best-in-class disaster recovery regardless of their underlying storage infrastructure. They also benefit from a single interface for data that enhances availability and ensures consistency and accessibility across all teams.
Sign up for Ondat’s new community edition, which provides unlimited nodes and up to 1TiB per cluster for free! For more information on how to get started, visit our docs site.