Data engineering as a distinct field, one whose practitioners share a cohesive identity as data engineers, is fairly new. So new, in fact, that many people don’t seem to understand exactly what data engineering is and is not, or where the borders lie between data engineering, data science and software engineering.
“When you look at some job descriptions, a lot of times you’ll see that they want a data engineer, but when you read the details, the company is actually looking for someone who specializes in machine learning, or someone who has a background in data science or someone who’s an analyst or a visualization engineer,” explained Robbie Smith, senior data engineer at Guild Education. “In a lot of job descriptions, data engineering is conflated with other data-related professions.”
Perhaps ironically, data engineering, as a profession, has more in common with software engineering than with data science. Most data engineers started as software engineers, and there’s a fairly broad overlap in the skill sets used for data engineering and software engineering.
“I see data engineering as a subcategory of software engineering,” explained Luke Feeney, co-founder and chief operating officer at TerminusDB. “In most shops, we see that the data engineers are writing Python scripts to get their data from point A to point B. These are people who are coders.”
This sentiment was echoed by Smith, who described his own career trajectory as starting out as a software engineer before specializing in data engineering when offered the chance to do so at a new job.
What about Data Science?
In fact, the largest misconception about data engineering is that it is closely related to data science. The two disciplines are related, but in the same way goats are related to grass, not the way goats are related to sheep. Data engineers build the pipelines that data scientists depend on, but the two professions are very different. Whereas there is a large overlap in skill sets between software engineering and data engineering, the skill sets and career paths of a data engineer and a data scientist are quite different.
Data engineers are responsible for building a beautiful data pipeline that works every time, is under revision control and is structured and orderly. Data scientists are trying to make sense of that data: to understand why anyone should be moving it around in the first place and to use it for business reasons. They generally have Ph.D.s in statistics and approach their work like scientists; they want to run experiments, not write code.
So Data Engineers are Code Slingers?
Andrew Stevenson, chief technology officer at lenses.io, thinks organizations should value data engineers who can do more than create sleek pipelines, but admits that this is how many organizations see them. “I used to see great data engineers who best understood business requirements being muscled out in an organization because they didn’t adopt the latest open-source, bleeding-edge technologies,” he said. “Many of these big data projects failed.”
This is something that could be said about software engineers as well: The best software engineers will understand not just the technical requirements for a particular piece of software, but also what business outcome the organization is hoping to achieve.
This does not mean that data engineers are low-level grunts. “I think there’s this impression that it’s kind of a crude task,” Feeney said, about building a data pipeline. “There’s nothing further from the truth. If the data pipeline doesn’t work and you’re building a data-intensive application or running a series of experiments, then everything falls apart.” Getting high-quality data out of transactional systems is challenging. “We work with a lot of incredibly talented data engineering teams that are faced with shocking challenges beating databases into submission so that the data comes out in a usable format,” Feeney said.
Data scientists depend on data engineers to get high-quality data so that the experiments aren’t plagued by ‘garbage in, garbage out’ problems. “We’ve seen cases where the data science team thinks they’ve had some amazing breakthrough,” Feeney said. “Then the data engineering team tells them, no, actually somebody changed the way we record this on the 14th of June. There’s nothing there.” So getting the pipeline right is incredibly important.
The Future of Data Engineering
“The businesses that understand their data and how that data can inform their business are going to be the ones that are successful,” Smith said, about why he thinks that data engineering as a specialty will only expand. “Companies will need good data management strategies, and they will need more people to specialize in these underlying systems.”
Beyond the fact that companies are likely going to rely even more on data pipelines in the future, a couple of buzzwords come up when talking about the future of data engineering. The first is DataOps: embracing as much automation in the data pipeline as possible and allowing data engineers to focus less on low-level coding and more on creating tooling that lets data scientists and business experts self-serve as much as possible. Stevenson sees this as the future of data engineering: data engineers who act more as technology advisors than as writers of Python scripts.
There’s also the idea of data mesh: increasingly embedding data engineers into business teams, so that a data engineer isn’t just moving data from point A to point B but rather is part of the conversation about how specific types of data sets can be used, what needs to happen to the data to make it usable and what kinds of business use cases the data can drive. “I’m trying to provide tools for domain-driven decentralization so that you have data owners, data producers and data engineers working within specific domains, then cooperating to make that data available as a product,” Feeney said.
Data engineering is also relevant to the ongoing conversation about data privacy. “I’m in the European Union, so GDPR is a big issue,” Feeney said. “And that’s a data engineering challenge.” In many cases, it might involve getting certain data out of a database and to a business owner while stripping all the personally identifiable information out. “There’s a lot of really interesting work going on there around data privacy and how you can take control of your personal data.”
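To make the privacy task concrete, here is a minimal Python sketch of pulling rows out of a database and dropping the personally identifiable columns before handing the result to a business owner. Everything in it is an assumption made for illustration, not something the interviewees described: the `orders` table, its column names, the `PII_COLUMNS` set and the helper functions are all hypothetical, and an in-memory SQLite database stands in for a real transactional system.

```python
import sqlite3

# Hypothetical list of columns treated as personally identifiable information.
PII_COLUMNS = {"name", "email", "phone"}


def export_without_pii(conn, table, pii_columns=PII_COLUMNS):
    """Return every row of `table` as a dict, omitting the PII columns.

    `table` is interpolated into the query, so it must come from trusted
    code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    keep = [i for i, col in enumerate(columns) if col not in pii_columns]
    return [{columns[i]: row[i] for i in keep} for row in cursor.fetchall()]


def build_demo_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(id INTEGER, name TEXT, email TEXT, country TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Ada Example", "ada@example.com", "IE", 42.5),
            (2, "Bob Example", "bob@example.com", "DE", 17.0),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in export_without_pii(build_demo_db(), "orders"):
        print(row)
```

Dropping columns is only the simplest case. Free-text fields, or combinations of seemingly harmless columns, can still identify a person, so a sketch like this should not be read as sufficient for GDPR compliance on its own.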