Stream Data Across Multiple Regions and Clouds with Kafka
Multicluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception as businesses aim for uptime and reliability. In this article, I’ll dive into several scenarios that may require multicluster solutions and showcase real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka.
Apache Kafka is a distributed data streaming platform that handles failures, like issues with a disk or network, automatically to avoid downtime or data loss. Nevertheless, Kafka is often deployed across data centers or clouds to survive the outage of one data center. Let’s explore the use cases, each with its trade-offs and concrete real-world examples.
1. Disaster Recovery Between Regions
Critical business transactions require failover and recovery in the case of a disaster such as the outage of a data center or cloud region. Data is replicated in real time between two independent Kafka clusters in separate data centers, cloud regions or even two cloud providers. Active-active and active-passive architectures are possible. Usually, applications switch to another cluster if a disaster strikes. Business continuity is ensured.
The biggest trade-off is that the replication between the clusters happens asynchronously. Hence, a few messages might be lost. If you need zero data loss, there is a more advanced (and complex) option: stretched clusters.
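The failover described above usually happens on the application side: clients switch their bootstrap servers from the primary to the secondary cluster. The following is a minimal sketch of that idea; the cluster addresses, the health signal and the config values are illustrative assumptions, not part of any specific product.

```python
# Hypothetical application-side failover between two independent Kafka
# clusters (active-passive). Cluster addresses are illustrative.

PRIMARY = "kafka-us-east.example.com:9092"    # active cluster
SECONDARY = "kafka-us-west.example.com:9092"  # passive replica

def pick_bootstrap_servers(primary_reachable: bool) -> str:
    """Return the bootstrap servers the client should connect to.

    Because replication between the clusters is asynchronous, failing
    over to SECONDARY may lose the last few not-yet-replicated messages.
    """
    return PRIMARY if primary_reachable else SECONDARY

def producer_config(primary_reachable: bool) -> dict:
    # Keys follow the standard Kafka client configuration naming.
    return {
        "bootstrap.servers": pick_bootstrap_servers(primary_reachable),
        "acks": "all",               # wait for all in-sync replicas locally
        "enable.idempotence": True,  # avoid duplicates on producer retries
    }
```

How the "primary_reachable" signal is derived (health checks, metadata timeouts, an external service) is deployment-specific and out of scope here.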
2. Stretched Clusters for Zero Downtime and Zero Data Loss
A stretched Kafka cluster operates as a single deployment across different data centers or cloud regions. The benefits are zero downtime and zero data loss even in the case of disaster. This architecture is compliant with the most challenging legal and business requirements.
However, there are significant disadvantages and requirements to using this architecture, so I’d only recommend it if there is no other way:
- Very good and stable latency is required between the regions
- Operation is much more complex than a local Kafka cluster in a single region
- Additional features, like choosing which data to replicate synchronously (such as critical payment data) vs. asynchronously (like non-critical log data), are usually needed and only available in commercial platforms
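The sync-vs.-async choice in the last bullet can be sketched as a per-topic policy mapped to producer guarantees. The topic names and the policy table below are assumptions for illustration; the actual cross-region mechanism is a feature of commercial platforms, as noted above.

```python
# Illustrative per-topic replication policy in a stretched cluster:
# critical topics replicate synchronously across regions, non-critical
# topics asynchronously. Topic names are hypothetical.

REPLICATION_POLICY = {
    "payments": "synchronous",   # zero data loss required
    "app-logs": "asynchronous",  # a few lost messages are acceptable
}

def producer_acks(topic: str) -> str:
    """Map a topic's replication policy to the producer 'acks' setting."""
    if REPLICATION_POLICY.get(topic) == "synchronous":
        return "all"  # wait for every in-sync replica, across regions
    return "1"        # leader-only ack: lower latency, weaker guarantee
```

Waiting for cross-region replicas on every write is exactly why stable inter-region latency is a hard requirement for this architecture.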
3. Hybrid Integration Between Data Center and Public Cloud
The Kafka cluster in the data center connects to existing legacy applications like a database, mainframe or on-premises ERP system. The Kafka cluster in the cloud connects to SaaS offerings, cloud native microservices, analytics platforms, etc.
With true decoupling and automatic backpressure handling, Kafka acts not just as a messaging platform, but also as an event store.
Replication between two or more Kafka clusters is set up via a Kafka-native replication tool, with Kafka acting as the single source of truth. This creates many benefits:
- Avoiding a spaghetti architecture with many point-to-point integrations
- The heart of the integration is real-time, reliable, and scalable
- Guaranteed ordering (per partition) even across the data center and cloud
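A Kafka-native replication setup like the one above can be expressed, for example, with open source MirrorMaker 2, which ships with Apache Kafka. The sketch below shows a minimal hybrid configuration; the cluster aliases and hostnames are illustrative assumptions.

```properties
# Minimal MirrorMaker 2 configuration (connect-mirror-maker.properties)
# replicating all topics from an on-premises cluster to a cloud cluster.
# Hostnames are placeholders.
clusters = onprem, cloud
onprem.bootstrap.servers = kafka.dc.example.com:9092
cloud.bootstrap.servers = kafka.cloud.example.com:9092

# Enable one-directional replication: on-premises -> cloud
onprem->cloud.enabled = true
onprem->cloud.topics = .*
```

By default, MirrorMaker 2 prefixes replicated topics with the source cluster alias (e.g. `onprem.orders` in the cloud cluster), which keeps the replication direction visible to downstream consumers.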
Real-world example: Siemens AG (Berlin and Munich) is a global technology powerhouse that has stood for engineering excellence, innovation, quality, reliability and internationality for more than 170 years. Siemens connected its SAP system to Kafka on-premises. It improved the business processes and integration workflow from daily or weekly batches to real-time communication by optimizing the SAP integration. Siemens later migrated from self-managed on-premises Kafka to Confluent Cloud via Confluent Replicator. Integrating Salesforce via Kafka Connect was the first step of Siemens’ cloud strategy. More and more projects and applications join the data streaming journey as it is easy to tap into the event stream and connect it to other tools, APIs and SaaS products after the initial streaming pipeline is built.
4. Edge Computing and Aggregation in the Data Center or Public Cloud
Each edge site (retail store, factory, etc.) operates a small Kafka cluster (sometimes just a single node without high availability) for edge operations like pre-processing and filtering or advanced analytics with stream processing. The curated data is ingested into the large Kafka cluster in the data center or cloud where the integration with the rest of the IT infrastructure runs, like the data warehouse and the data lake.
This architecture has a few benefits compared to choosing a different technology at the edge:
- The core is real time, scalable, and reliable even end-to-end across edge sites and the cloud
- The same technology, APIs, development tools and vendor are used for edge deployments, cloud deployments and the replication between them. This usually enables better end-to-end service-level agreements (SLAs), cost efficiency and time to market.
- Disconnected and air-gapped environments can be used for safety- or privacy-critical use cases while analytics operates in elastic and more flexible cloud infrastructure
Real-world example: A major cruise line implemented one of the most famous use cases for Kafka at the edge. Each cruise ship has a Kafka cluster running locally because of bad and costly connectivity to the internet. Use cases include payment processing, loyalty information, customer recommendations, etc. When back in the harbor with a stable internet connection for a few hours, relevant data is replicated to a large Kafka cluster for big data analytics and other use cases.
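The pre-processing and filtering at the edge described above can be sketched as a small curation step that keeps only the events the central cluster needs. The event fields and the relevance rule below are illustrative assumptions, not details from the cruise line example.

```python
# Minimal sketch of edge-side pre-processing: filter and tag events
# locally, so only curated data is forwarded to the large central Kafka
# cluster once connectivity is available. Event shapes are hypothetical.

def curate(events: list[dict]) -> list[dict]:
    """Keep only events the central analytics platform needs."""
    curated = []
    for event in events:
        if event.get("type") in {"payment", "loyalty"}:  # drop local noise
            curated.append({**event, "source": "edge"})  # tag the origin
    return curated

# In a real deployment, the curated batch would be produced to the
# central cluster when the ship (or store, or factory) is back online.
```

Running this step at the edge reduces the volume replicated over the expensive, intermittent link, which is the whole point of the aggregation architecture.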
5. Migration from Self-Managed Kafka to a Fully Managed Cloud Service
While many multicluster Kafka deployments run long-term for hybrid integrations or disaster recovery, some use cases only require two clusters for a planned infrastructure migration with a final cutover. Two common scenarios are the migration from open source Kafka to a commercial vendor or cloud service, or the move from on-premises infrastructure to the public cloud.
6. Multiple Kafka Clusters Are the Norm, Not an Exception!
This article showed various architectures and use cases for multiple Kafka clusters. All alternatives have trade-offs regarding efforts, cost and risks. Make sure to begin the evaluation with the requirements for your service-level agreements (downtime, data loss, compliance, security) before digging deeper into the potential deployment options. Many projects are multiyear journeys. Kafka allows you to connect legacy and cloud native applications with any kind of protocol (Kafka, message queue, file, database, etc.) or communication paradigm (real-time, batch, request-response) and progress at your own pace. The heart of the infrastructure and data replication is real time, scalable and reliable.
Many tools exist on the market for the replication between Kafka clusters: MirrorMaker 2 is part of open source Apache Kafka. More advanced commercial tools bring additional benefits. For instance, Confluent Cluster Linking leverages the native Kafka protocol for the replication. This makes operations much easier and less costly, and provides more capabilities for critical scenarios like failover in case of a disaster or for common security requirements like initiating the connection from the source site.
No matter if you choose open source, a commercial platform or a cloud service, make sure to understand the trade-offs between the different architectures and products. And be aware that even the best technologies alone do not make a critical multicluster Kafka project successful. Get help from trusted experts who do similar projects on a daily basis to understand all the best practices and trade-offs.