From Apache Flink to GenAI: 5 Data Engineering Predictions
It’s always challenging to predict the future, but there’s at least one certainty for data engineers in 2024: Their work will continue to be highly valued.
The rapid growth of generative AI and the ongoing shift from batch to stream processing are among the trends that will keep data engineers busy next year.
Here are five predictions for how the data engineering landscape will progress in the year ahead.
1. GenAI will become commoditized and embedded in multiple applications
It seems unthinkable that a technology as powerful as GenAI could be commoditized so soon, but in 2024, this will start to happen. LLMs and other foundation models are already becoming easier to train and fine-tune. Next year, enterprises will start to embed GenAI into more of their applications.
A year ago, only a handful of LLMs were available, and these were extremely large and costly to train and operate. There are now many LLMs to choose from, including some that are smaller and trained for specific applications, like software development, as well as open source options that can be readily adapted.
To be useful for businesses, LLM-powered applications must be well-contextualized with relevant and accurate internal data. The availability of smaller, application-specific LLMs makes it easier for companies to train models on their internal data and run them in their own secure cloud environments, which is often critical to meeting security needs.
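One common way to contextualize an LLM-powered application with internal data, separate from training a model on that data, is to retrieve relevant internal documents at query time and place them in the prompt. The sketch below illustrates only that retrieval-and-prompt step, using naive keyword overlap. The documents and the names `INTERNAL_DOCS`, `retrieve` and `build_prompt` are hypothetical, and no real LLM API is called.

```python
import re

# Hypothetical internal knowledge base; the contents are illustrative only.
INTERNAL_DOCS = {
    "returns-policy": "Customers may return hardware within 30 days of delivery.",
    "support-hours": "Support is staffed on weekdays from 9am to 5pm.",
    "shipping": "Orders ship from the regional warehouse within two business days.",
}


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs, top_k=1):
    """Return the ids of the top_k documents sharing the most tokens with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        docs.items(),
        key=lambda item: len(question_tokens & tokenize(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


def build_prompt(question, docs, top_k=1):
    """Assemble a prompt that grounds the question in the retrieved internal documents."""
    doc_ids = retrieve(question, docs, top_k)
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("When can customers return hardware?", INTERNAL_DOCS))
```

A production system would likely replace the keyword overlap with embedding-based search, but the shape stays the same: select relevant internal data, then hand it to the model along with the question.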
As a result, in 2024 we’ll see more businesses embedding GenAI capabilities in the internal and external applications they build. This will boost productivity and provide much richer customer experiences.
2. Data governance will ‘shift left’ as companies collect more data for GenAI
As businesses collect larger volumes of data for their AI initiatives, they must add a governance layer to make the data useful. It’s much easier and more efficient to add governance when data is produced, and we will see data governance “shift left” next year to accommodate this need.
Governance investments are critical as they ensure data is reliable and can be made available quickly for use in applications. This governance includes recording the provenance of data, ensuring it is accurate, adding metadata to make it easier to work with and including it in a secure catalog so others know it’s available.
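The governance steps described above can be sketched as a check that runs at the moment data is produced. This is a minimal illustration rather than any particular product's API: the schema in `REQUIRED_FIELDS`, the in-memory `Catalog` class and the `produce` function are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed schema for a hypothetical "orders" dataset.
REQUIRED_FIELDS = {"order_id": str, "amount": float}


@dataclass
class GovernedRecord:
    payload: dict
    source: str        # provenance: which system produced the record
    produced_at: str   # provenance: when it was produced (UTC, ISO 8601)
    tags: list = field(default_factory=list)  # descriptive metadata


class Catalog:
    """Hypothetical in-memory stand-in for a secure data catalog."""

    def __init__(self):
        self.entries = {}

    def register(self, dataset, source, schema):
        # Record where the dataset comes from and what its fields look like.
        self.entries[dataset] = {
            "source": source,
            "schema": {name: typ.__name__ for name, typ in schema.items()},
        }


def validate(payload, schema):
    """Return a list of problems; an empty list means the payload is valid."""
    errors = []
    for name, typ in schema.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], typ):
            errors.append(f"wrong type for {name}")
    return errors


def produce(payload, source, catalog, dataset="orders"):
    """Validate, annotate and catalog a record before it goes downstream."""
    errors = validate(payload, REQUIRED_FIELDS)
    if errors:
        # Reject bad data at the source instead of cleaning it up later.
        raise ValueError("; ".join(errors))
    catalog.register(dataset, source, REQUIRED_FIELDS)
    return GovernedRecord(
        payload=payload,
        source=source,
        produced_at=datetime.now(timezone.utc).isoformat(),
        tags=["validated"],
    )
```

Calling `produce({"order_id": "A1", "amount": 9.5}, "checkout-service", Catalog())` returns a record carrying its provenance, while a payload missing `amount` raises `ValueError` before it ever reaches storage.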
Storing unstructured and ungoverned data in a data lake makes it easier to save everything, but it becomes progressively more expensive to use any of this data. Companies must work smarter and shift processing to the left as much as possible.
This has several benefits. Adding governance sooner means the data is available more quickly, so developers can work with more timely data. It also allows an organization to discard data without future value, reducing storage costs and liability. In 2024, more companies will recognize these benefits and apply data governance earlier.
3. Apache Flink adoption will accelerate beyond software engineers, cementing its position as the de facto standard for stream processing
Historically, complexity has held back the adoption of stream processing. It must become simpler to use before more people can see its full benefits.
In 2023, we saw several Flink as a Service (FaaS) offerings come to market, and next year, we’ll see more customers gravitate to these services as a path to reducing stream processing complexity. The overall developer tooling and experience will be transformed, and application and pipeline development will benefit from a cleaner integration within the software development life cycle.
Flink’s ecosystem of users will continue to diversify beyond software developers as data teams and business operations recognize the value of moving workloads upstream. We have seen more users wanting to query their streams in real time. With the introduction of a new Java Database Connectivity (JDBC) driver, we will see even more new systems and users connect to Flink for the first time.
4. Apache Flink 2.0 will embrace cloud native principles and eliminate the boundaries between batch and stream processing
Flink 2.0, expected in late 2024, is a big focus for the Flink community. Next year, Flink will continue modernizing and becoming more lightweight by embracing cloud native principles, such as disaggregated persistence layers. We can also expect the boundaries between batch and stream processing to disappear as systems will automatically choose the best mode.
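To make the idea of a disappearing batch/stream boundary concrete, here is a small sketch in plain Python. It is not Flink's API. One aggregation is defined once, and a hypothetical `run` function picks the execution mode from the input: a bounded source yields a single final result, while an unbounded one yields a result after every event. Treating any sized collection as bounded is an assumption made for the example.

```python
from collections import Counter
from collections.abc import Sized


def running_counts(events):
    """Yield a snapshot of per-key counts after each event."""
    counts = Counter()
    for key in events:
        counts[key] += 1
        yield dict(counts)


def run(source):
    """Run the same aggregation in batch or streaming mode, chosen from the input."""
    updates = running_counts(source)
    if isinstance(source, Sized):
        # Bounded input: drain every update and keep only the final result.
        final = {}
        for final in updates:
            pass
        return "batch", final
    # Unbounded input: hand back the lazy stream of incremental results.
    return "streaming", updates


def click_stream():
    """Hypothetical unbounded source, truncated here so the example ends."""
    yield from ["home", "cart", "home"]


if __name__ == "__main__":
    print(run(["home", "cart", "home"]))
    mode, updates = run(click_stream())
    print(mode, list(updates))
```

The caller writes the aggregation once and never names a mode; only the shape of the input differs between the two runs.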
In addition, the integration and synergy between Flink and Apache Kafka will continue to strengthen. Distributed transaction enhancements will enable more mission-critical use cases.
With serverless as the new benchmark for stream processing services, developers will be able to spend more time building real-time stream processing applications and less time managing Flink.
5. Data as a product will go mainstream as governance tools evolve
Until recently, only large companies had the expertise and resources to create reusable data assets that could be repurposed easily across different teams and applications. Thanks to advancements in the governance products required to build these assets, in 2024, more companies will be able to create reusable data products, greatly improving efficiency and accelerating data innovation.
Multiple teams can benefit from having access to the same data to build a service or application. However, this data must be presented in a way that is secure, well-contextualized and understandable for users who weren’t involved in its production. As data moves farther away from its initial source, it gets harder, and therefore more expensive, to determine and provide this contextual information. Starting the data governance process at the source is not only less expensive but also a better way to understand the data’s source and how it’s schematized.
New data governance capabilities that are pre-built into products such as cloud data warehouses, databases and other data infrastructure services can help to meet these needs. That means that developers no longer need to manually build the infrastructure to create and share reusable data products.
As a result, reusable data products will no longer be restricted to companies with large data engineering teams. With more companies building reusable data products, in 2024, developers will increase the value of their data and spend more time building innovative data applications and services.
Unlocking Greater Data Value in 2024
Data is the key driver for innovation in business today, and these predictions should be a good indicator of where many data engineers will focus their energy in 2024. GenAI is the newest kid on the block, but data streaming and stream processing remain equally critical as businesses try to unlock even more value from their data. In this rapidly shifting landscape, data engineers will be the main architects of change, and their expertise and creativity will shape the data infrastructures of tomorrow.