How to Get Started with Data Streaming
Streaming data is everywhere, and today’s developer needs to learn how to build systems and applications that can ingest, process and act on data that’s continuously generated in real time.
From millions of connected devices in our homes and businesses to IT infrastructure, developers are tapping into an endless number of data streams to solve operational challenges and build new products, which means the learning opportunities are endless too.
While the term “data streaming” can apply to a host of technologies such as RabbitMQ, Apache Storm and Apache Spark, one of the most widely adopted is Apache Kafka.
In the 12 years since this event-streaming platform was made open source, developers have used Kafka to build applications that transformed their respective categories.
Think Uber, Netflix or PayPal. Data-streaming developers within these kinds of organizations and across the open source community have helped build applications with up-to-the-minute location tracking, personalized content feeds and real-time payment transactions.
These real-time capabilities have become so embedded into our daily lives that we now take them for granted and expect them. But before bringing those capabilities to life, developers first had to understand the platform’s decoupled publish-subscribe messaging pattern and how to best take advantage of it.
Understand How Kafka Works to Explore New Use Cases
Apache Kafka can record, store, share and transform continuous streams of data in real time. Each time data is generated and sent to Kafka, this “event” or “message” is recorded in a sequential log through publish-subscribe messaging.
While that’s true of many traditional messaging brokers like Apache ActiveMQ, Kafka is designed for both persistent storage and horizontal scalability.
When client applications generate and publish messages to Kafka, they’re called producers. As the data is stored, it’s automatically organized into defined “topics” and partitioned across multiple nodes or brokers. Client applications acting as “consumers” can then subscribe to data streams from specific topics.
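To make the vocabulary concrete, here is a minimal in-memory sketch of a topic split into partitions. It is a toy model, not the Kafka client API: the `ToyTopic` class, the topic name and the use of `crc32` for partitioning are all assumptions made for illustration (Kafka's own partitioner uses a different hash function).

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hash the key so messages with the same key land in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key, value):
        """Append a message and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Return every message in one partition from the given offset onward."""
        return self.partitions[partition][offset:]


# Hypothetical topic and keys, chosen only for the example.
topic = ToyTopic("truck-locations")
p1, o1 = topic.produce("truck-7", "lat=40.7,lon=-74.0")
p2, o2 = topic.produce("truck-7", "lat=40.8,lon=-74.1")
topic.produce("truck-9", "lat=34.0,lon=-118.2")

# Both truck-7 messages share a partition, so their relative order is preserved.
print(p1 == p2, o1, o2)
```

Keying messages this way is what lets a consumer see every update for one truck in the order it was produced, even though the topic as a whole is spread across several logs.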
These functionalities make Kafka ideal for use cases that require real-time ingestion and processing of large volumes of data, such as logistics, retail inventory management, IT monitoring and threat detection.
Because the two types of clients, producers and consumers, operate independently, organizations can use Kafka to decentralize their data management model. This decentralization has the potential to free Kafka users from information silos, giving developers and other data-streaming practitioners access to shareable streams of data from across their organizations.
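The independence of producers and consumers can be sketched in a few lines. This is a simplified model under stated assumptions: `ToyLog` and `ToyConsumer` are hypothetical names, and a real consumer tracks its position per partition rather than with a single integer.

```python
class ToyLog:
    """A single append-only log shared by every reader."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class ToyConsumer:
    """Each consumer keeps its own read position, so readers never interfere."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


log = ToyLog()
for reading in ["cpu=41", "cpu=97", "cpu=55"]:
    log.append(reading)  # the producer does not know who will read this

dashboard = ToyConsumer(log)
alerting = ToyConsumer(log)

dashboard_batch = dashboard.poll()              # reads everything available
alerting_first = alerting.poll(max_records=1)   # a slower reader, one record at a time
log.append("cpu=60")                            # producing continues regardless
dashboard_next = dashboard.poll()               # picks up only the new record
```

Because reading does not remove anything from the log, a new team can attach another consumer later and start from offset 0 without asking the producer to change anything.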
Technical Skills Every Data-Streaming Developer Needs
At the heart of it, a developer’s job is to solve problems with code. Kafka simply provides a new platform for solving user and business challenges. In some ways, the approach to problem-solving that data streaming requires is simpler than traditional object-oriented programming. But that doesn’t mean it won’t take time and effort to learn the basics and then master them.
In many Kafka use cases, downstream applications and systems rely on the data streams they subscribe to in order to initiate processes, screen for changes in the world around us and trigger planned reactions to specific scenarios.
But the only way Kafka can enable these possibilities is if developers can bring data in and out of the platform. That’s not possible without the broader ecosystem of tools like Kafka Connect, Kafka Streams and ksqlDB.
For developers first learning how to use the data-streaming platform, Kafka Connect should be their initial focus. This data integration framework allows developers to connect Kafka with other systems, applications and databases that generate and store data through connectors.
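Connectors are usually configured rather than coded. As a rough sketch, the snippet below builds the kind of JSON document a Kafka Connect worker accepts on its REST interface (`POST /connectors`). The connector name, file path and topic are placeholder assumptions; `FileStreamSourceConnector` is the demo connector that ships with Apache Kafka and is meant for experimentation, not production.

```python
import json

# Hypothetical values: the name, file path and topic are placeholders.
connector = {
    "name": "local-file-source",
    "config": {
        # Demo connector bundled with Apache Kafka; reads lines from a file.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/sensor-readings.txt",
        "topic": "sensor-readings",
    },
}

# Serialize to the JSON body that would be submitted to a Connect worker.
payload = json.dumps(connector, indent=2)
print(payload)
```

Swapping in a database or SaaS connector mostly means changing `connector.class` and supplying that connector's own settings, which are documented by whoever publishes it.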
Once developers have mastered creating data pipelines with Kafka, they’ll be ready to begin exploring stream processing, which unlocks a host of operational and analytics use cases and the ability to create reusable data products.
Exploring What You Can Build with Kafka
Stream processing means ingesting and processing streaming data, all in real time. Transforming data in flight not only allows applications to react to the most recent, relevant information, but often also allows you to turn that data into a more consumable, shareable format.
As a database purpose-built for stream processing, ksqlDB allows developers to build pipelines that transform data as it’s ingested, and push the resulting streaming data into new topics after processing. Multiple applications and systems can then consume the transformed data in real time.
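The idea of transforming data in flight and publishing the result as a new stream can be illustrated without any Kafka tooling. The sketch below is a conceptual analogue in plain Python, not ksqlDB syntax; the event fields, the `transform` function and the threshold are assumptions chosen for the example.

```python
import json

# Hypothetical raw events, as they might arrive on a source topic.
raw_orders = [
    '{"order_id": 1, "amount_cents": 1250, "country": "us"}',
    '{"order_id": 2, "amount_cents": 99, "country": "de"}',
    '{"order_id": 3, "amount_cents": 40000, "country": "us"}',
]


def transform(raw_events, min_amount=10.0):
    """Parse, filter and reshape events one at a time, as they arrive."""
    for raw in raw_events:
        event = json.loads(raw)
        amount = event["amount_cents"] / 100
        if amount < min_amount:
            continue  # drop small orders, much like a WHERE clause would
        yield {
            "order_id": event["order_id"],
            "amount": amount,
            "country": event["country"].upper(),
        }


# Stand-in for the "new topic" that downstream consumers would subscribe to.
large_orders = list(transform(raw_orders))
print(large_orders)
```

In ksqlDB the same filter-and-reshape step would be written as a persistent SQL query over a stream, with the platform handling the continuous reading and writing.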
One of the most common processing use cases is change data capture (CDC), a data integration approach that calculates the change or delta in data so that information can be acted on. Applications like Netflix, Airbnb and Uber have used CDC to synchronize data across multiple systems without impacting performance or accuracy.
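A small sketch can show what a "delta" looks like as a stream of events. Note the simplifying assumption: this version compares two snapshots of a table, whereas log-based CDC tools typically read the database's transaction log instead. The function name, event shape and sample rows are all hypothetical.

```python
def capture_changes(before, after):
    """Compare two snapshots keyed by primary key and emit one event per change."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            events.append({"op": "insert", "key": key, "after": after[key]})
        elif key not in after:
            events.append({"op": "delete", "key": key, "before": before[key]})
        elif before[key] != after[key]:
            events.append(
                {"op": "update", "key": key, "before": before[key], "after": after[key]}
            )
    return events


# Hypothetical inventory table at two points in time.
snapshot_1 = {1: {"sku": "mug", "stock": 5}, 2: {"sku": "pen", "stock": 40}}
snapshot_2 = {1: {"sku": "mug", "stock": 3}, 3: {"sku": "cap", "stock": 12}}

changes = capture_changes(snapshot_1, snapshot_2)
for change in changes:
    print(change["op"], change["key"])
```

Each event carries enough context for a downstream system to apply the same change to its own copy of the data, which is how CDC keeps multiple systems in sync.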
Although CDC can be achieved in other ways — for example with Debezium or Apache Flink — many organizations use ksqlDB to further enrich and transform the delta generated with CDC, stream the transformed data in real time and enable multiple downstream use cases as a result.
On the other hand, Kafka Streams is a client library that simplifies the way developers write client-side applications that stream data to and consume data from Kafka.
By learning how to use both these tools alongside Kafka and its connectors, developers can go from building “dumb” pipelines to stream-processing pipelines that transform data in real time, making it ready for streaming applications to act on.
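One thing that often separates a stream-processing pipeline from a "dumb" one is state: the processor remembers something about earlier events. The sketch below is a plain-Python illustration of a running count per key, not the Kafka Streams API (which is a Java library); the function name and sample data are assumptions.

```python
from collections import Counter


def running_counts(events):
    """Keep a count per key and emit the updated count after every event."""
    counts = Counter()
    for key in events:
        counts[key] += 1
        yield key, counts[key]


# Hypothetical stream of page-view events, keyed by page name.
page_views = ["home", "pricing", "home", "docs", "home"]
updates = list(running_counts(page_views))
print(updates)
```

Every incoming event produces an updated result, so a dashboard or alerting application consuming `updates` always sees the latest count without re-reading the whole history.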
Choosing Your First Kafka Project
The most successful developers are often the ones who feel inspired by the possibilities of what they’re creating. It doesn’t need to be complex or “innovative”; it just needs to be something that you’ll look forward to achieving. As you prepare to start your first Kafka project, whether you want to use streaming data from a video game, develop your own plant-monitoring system or create a market screener, you should write down the following to take full advantage of the platform:
- The source of streaming data you want to use.
- Two or three problems you can solve or use cases you can implement using your chosen data stream.
- The name of a learning partner or mentor you can contact to brainstorm and debug your code out loud. (In a pinch, rubber duck debugging is always an option, but it pays to have someone help you unpack the thought process behind your real-time problem-solving.)
While developers can build their own connectors to bring data in and out of Kafka, there is a wealth of open source and managed connectors that they can take advantage of. Many of the most common databases and business systems like PostgreSQL, Oracle, Snowflake and MongoDB already have connectors available.
Developers learning Kafka at work need to learn how to build data pipelines with connectors to quickly bring the data they work with every day into Kafka clusters. Those learning Kafka on their own can find public streaming data sets through free APIs.
- Find a client library for your preferred language. Java and Scala are the most common, thanks to client libraries like Kafka Streams, but there are plenty of options for other programming languages such as C/C++ or Python.
- Gain a deep understanding of why Kafka’s immutable data structures are useful.
- Update the data modeling knowledge you gained from relational databases so you can use Schema Registry effectively. It serves as a storage layer for the schemas that describe the data in your Kafka topics.
- Brush up on your SQL syntax to prepare to use Kafka’s interactive SQL engine for stream processing, ksqlDB.
When you’re ready to start, create your first cluster, and then build an end-to-end pipeline with some simple data. Once you’ve learned to store data in Kafka and read it back — ideally using live, real-time data — you’ll be ready to begin exploring more complex use cases that leverage stream processing.
Set Yourself Up for Long-Term Success When Learning Kafka
As with any other developer skill set, the learning never really ends when it comes to Kafka. Even more important than learning the technologies and frameworks in the ecosystem is learning to problem-solve with a data-streaming mindset. Instead of thinking of data as finite “sets of sets,” the way it’s stored in relational databases, you’ll have to learn how to work with data stored as immutable, append-only logs.
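One way to build that mindset is to see how a familiar "current state" table can be derived from a log of events. The following is a minimal sketch, assuming a hypothetical log of account deposits and withdrawals; the `replay` function and the data are illustrative only.

```python
# An append-only log: entries are only ever added, never edited in place.
# (A tuple is used so the sample log itself is immutable.)
account_log = (
    ("alice", +100),
    ("bob", +50),
    ("alice", -30),
    ("bob", +20),
)


def replay(log, up_to=None):
    """Fold the log into balances, optionally stopping at an earlier offset."""
    balances = {}
    for account, delta in log[:up_to]:
        balances[account] = balances.get(account, 0) + delta
    return balances


current = replay(account_log)                  # state after every event
as_of_offset_2 = replay(account_log, up_to=2)  # state after the first two events
print(current, as_of_offset_2)
```

Because nothing in the log is overwritten, the same data answers both "what is the balance now?" and "what was it earlier?", which a table that only stores the latest value cannot do on its own.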
Connecting with other developers on the same journey — or ones who have been in your shoes before — allows you to learn from others’ mistakes and discover solutions you might never have considered.
Not only is Kafka one of the most active open source projects, but businesses in sectors like financial services, tech, logistics and manufacturing are doubling down on their investment in data streaming. With so many companies and industries standardizing on Kafka as the de facto solution for data streaming, there’s a robust community for newcomers to join and learn alongside.
Developers who invest time in learning how to solve these kinds of impactful use cases will have a wealth of job opportunities, which means more interesting problems to solve and space to grow their skills and careers.