The Future of Data Engineering
In our previous post, “The Data Engineering Megatrend: A Brief History,” we looked at the history of data engineering. Specifically, we looked at how IT organizations went from being the keepers of all the important data 20 years ago to being considered a blocker for other teams. Frank Slootman, who is now the CEO of Snowflake, described this dynamic bluntly: “IT leaders bore me.” That sentiment has been changing, though, especially over the last five years, even if the pull isn’t felt by every company yet.
But what about the next five to 10 years? As the data engineering megatrend impacts companies across industries, what will the big changes be in the field of data and for the role of data engineer specifically?
Data Roles Will Get a Board-Level Seat
There will certainly be changes in data engineering “on the ground,” but one trend that will shape organizations in a significant way over the next 10 years will be the increasing value and responsibility of data executives.
In the past, many data-specific roles were broken out by department: Head of Analytics, Head of Data Science, Head of Data Engineering, etc. Increasingly, though, data-specific roles are entering the C-suite and have a presence in the boardroom. Over the last five years, the term Chief Data Officer has become popular. Searching LinkedIn for the term produces over 10,000 results (with many more if we include variations).
Leadership and the decisions made by leadership shape organizations. As we see more and more data roles in the boardroom, the data function (and data itself) will increasingly become a first-class citizen and a key consideration in all decision making. Where data was once important but not critical, it is now treated as a required business function for modern organizations.
In full fruition, this will shape organizations around data. Data engineering and related functions will be strategically positioned to accelerate everything happening at the company (as opposed to answering requests from “internal customers”), with the goal of leveraging data as a key competitive advantage.
As the lines between data science and data infrastructure blur, Data/ML engineering roles will replace data science roles as the most sought-after hires.
A recent blog and Hacker News thread, “We Don’t Need Data Scientists, We Need Data Engineers,” reinforces the fact that data engineering has become cool again. No one would deny the power of data science when leveraged well, but companies are realizing that the more pressing need is to solve more fundamental issues around collecting, cleaning, storing and analyzing data — before they can do anything with the data.
In fact, as data science becomes more prevalent, the need for data engineering will significantly increase — both for scale and for velocity. In a recent Data Stack Show podcast episode, one of the hosts asked Arian Osman, senior data scientist at Homesnap, about how much work they had to do on the data they used to build models. His excitement about a robust data engineering function was palpable:
“Most of the work is done by the data engineering teams…thankfully, I don’t have to touch very much of anything. And when I first started at Homesnap [and] I looked at the database, I was just amazed. I mean, I was just amazed at the architecture of the database and the normalization that was used and the partitions that were used. So, we have a great team at Homesnap when it comes to getting that data and cleaning it as much as we can.”
Advanced companies may require data science-heavy machine learning (ML) and a tight alignment between data science and engineering. Stepping back, though, the reality for many companies and business applications is that they only need basic machine learning, not things like advanced neural networks. Basic machine learning skills can be picked up by developers and engineers, and we are already seeing this change begin to take place with the rise of “ML Engineering” roles — where people are conversant in designing ML algorithms, as well as training them on real data and deploying them in production.
The underlying technological force behind this trend in Data Engineering / Data Science roles is the blurring of the lines between data infrastructure and data science. For some time it has been common for data scientists to work in data engineering because of the hygiene needs mentioned above, but increasingly we are seeing process and technology on the infrastructure side deliver data science and ML products (Pachyderm is an example of “data science as infrastructure”).
Dedicated Data Engineering Support for Every Team
As mentioned above, data and data functions will become first-class citizens. In the organization and day-to-day operations of businesses, we’ve already seen the beginning of centralized data engineering teams that provide data products and services to other parts of the organization.
The most advanced companies, though, are going beyond the concept of a shared service center and are proactively creating dedicated resources for individual teams. This takes the term “data-driven” to another level. Instead of using data to perform existing tasks with more velocity and impact, teams at companies like Mattermost are designing and rebuilding initiatives, tactics, systems and processes in partnership with data engineering.
Instead of asking, “How can we use data to make this better?” teams are partnering with data engineering to ask, “How can our data and data systems shape the way we think about solving this problem?”
Over the next 10 years, this strategic collaboration will be the standard in business operations and organization structure.
An Increased Number of ‘Unicorns’ Solving Data Problems
As the megatrend hits its stride, the software industry will rise to the occasion (as it always has). When we think about unicorns, the examples that come to mind are companies like Databricks and Snowflake, which have built multi-billion dollar businesses solving hard problems around data processing and storage.
In the last five years, as the nascent data engineering megatrend has started to become mainstream, we’ve seen the first crop of unicorns that were early movers in the space. Companies like Segment (acquired for $3.2 billion by Twilio) and Fivetran (valued at $1 billion and growing) have built huge businesses around the collection of data. Other companies like dbt or Looker (via LookML) have built significant businesses around the processing of data.
Over the next 10 years, the number of companies in the data space will only grow, and at an accelerating pace, in response to the immense demand generated by all of the previous points in this post.
Technology for Moving Data Will Become Commoditized
For many companies today, moving data is still a non-trivial problem, which is why the companies mentioned above were able to build multibillion-dollar enterprises. Still, they achieved less than 10% market penetration!
Many companies today have to make some sort of sacrifice in building pipelines, whether that’s eating costs internally if they build themselves, or navigating the process of vendor selection, new technologies, etc. In 10 years, though, and likely much sooner, there will be standard playbooks, tools, and architectures for building and connecting data pipelines within a company.
In addition to mass adoption, cost and competition will also drive commoditization. First, the hard cost of moving data is decreasing. Second, more companies are moving to an “owned data” infrastructure, so that they can decrease the number of data silos and decrease the cost of storing copies of their data across multiple vendors.
Competition will decrease the premium that companies can charge simply for solving a hard problem. Before, moving data from X to Y was hard and there weren’t many options, so businesses were willing to pay more to solve that pain point. Increasingly, though, there are more options and those options will become more cost-effective.
Instead of buying point solutions to solve acute pain points, businesses will have the luxury of solving those pain points as a standard part of architecting their data pipelines and data stack.
Real-Time (and Near Real-time) Infrastructure Will Become Standard
As many vendors as there are in the Customer Data Platform and Customer Data Pipeline spaces, there are actually very few who enable real-time use cases out of the box.
Customer Data Platforms generally excel at things like customer profiles and customer journey activity, not the pipelines that get the data into the system. Customer Data Pipelines, on the other hand, are a non-trivial technology to build and commercialize at scale (remember the multibillion-dollar enterprises mentioned above that have achieved less than 10% market penetration).
Because real-time pipelines are still nascent, many companies build their own solutions — which requires significant effort to develop and maintain.
Over the next 10 years, though, as more companies enter the customer data infrastructure market and build their products on modern cloud technology, real-time pipelines will be used by most companies and difficult challenges like real-time personalization will be turnkey.
The Future Looks Good for Data Engineers and Businesses That Value Data
These trends are good news for both data engineers and the companies that employ them. As data becomes more valued inside of businesses and the technology makes what were once challenging problems easy to solve, data engineers will be able to spend more time adding strategic value instead of trying to make the data and plumbing work.
The same is true for businesses — resources that were once spent on building and maintaining customer data infrastructure will be focused on building better products and services.