Leaky Data Pipelines: Uncovering the Hidden Security Risks
Security professionals know that their overall security posture is only as strong as its weakest link. While IT teams improved remote access infrastructure and beefed up IAM services during the pandemic, internally developed data pipelines are now one of the weak links that can lead to significant data breaches and business disruption. Data pipelines are increasingly important for running a business because they move and process sensitive information, and any interruption or breach of that data can have a significant business impact.
We’ve seen self-hosted Apache Airflow pipelines expose thousands of credentials because they were misconfigured, contained hardcoded passwords, and were not current on security updates. A pipeline may not store sensitive data, but it will have access to sensitive data for backups, data syncs, and other data movement tasks. If an attacker gains access to a pipeline, they can start monitoring or extracting data without the security team even noticing.
Customer and public-facing systems receive the majority of security attention, including security audits, pen tests, and design reviews, because they are highly visible and critical for companies. Data pipelines are less visible but equally important, and they are often overlooked. Unless a company applies the same scrutiny to its pipelines as to its production systems, it will be at risk. As we’ve seen from SolarWinds to Log4j to 3CX, supply chain attacks have been increasing so quickly that the White House issued an Executive Order.
Hackers may find that internally developed data pipelines are particularly attractive due to their access to sensitive data. In addition, these internal pipelines are often maintained outside of normal engineering processes, and internal security teams may not even be aware of the risk, much less have the tools to monitor data leakage from an internal pipeline.
When data pipelines are maintained outside of software engineering teams by analytics staff or IT support teams (or worse, as “shadow IT” deployments), different budgets, priorities, and procedures can lead to gaps in securing the entire infrastructure. Pipelines maintained by IT generalists carry a higher risk that a misconfigured setting or hard-coded password will lead to an exposure. And because a pipeline’s normal job is to move data, it is harder to detect a bad actor who is monitoring the data flow and siphoning off information over time.
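As one illustration of catching hard-coded passwords before they reach a deployed pipeline, the sketch below scans pipeline source text for quoted literals assigned to suspicious-looking names. The pattern and the function name find_hardcoded_secrets are assumptions made for this example; a dedicated secret-scanning tool would use a much larger rule set.

```python
import re

# Hypothetical pattern for illustration; real scanners use far larger rule sets.
SECRET_PATTERN = re.compile(
    r"""(?ix)                                            # ignore case, verbose
    (\w*(?:password|passwd|secret|api_key|token)\w*)     # suspicious name
    \s*[:=]\s*                                           # assignment separator
    (['"])(?P<value>[^'"]+)\2                            # non-empty quoted literal
    """
)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, variable_name) for each suspicious assignment."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        if match:
            findings.append((number, match.group(1)))
    return findings
```

A check like this only flags literals; values read from an environment variable or a secrets manager are not matched, which is the behavior you would want to encourage.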
To help prevent these under-the-radar types of pipeline hacks, companies should apply the same security practices for their data pipelines as leading SaaS companies use to protect customer data. A “set it and forget it” mentality is obsolete in a connected world.
Unfortunately, the amount of work needed to secure in-house data streams and on-premises servers is about the same for a 50-person company as it is for a 5,000-person company, which means most 50-person companies won’t make the effort unless they use a SaaS approach. The benefits of a managed pipeline include automated patching, encrypted storage of credentials, ongoing penetration testing, and multiple compliance audits.
One of the first steps is to use the security-checking and configuration-management tools that cloud providers offer, such as AWS Trusted Advisor, or third-party cloud security posture management tools. These tools can check for common mistakes and verify compliance with industry best practices. Companies should also schedule regular audits and reviews to catch any misconfigurations that arise, along with annual pen tests. Using a managed pipeline can reduce management overhead and ensure thorough testing.
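To make the idea of an automated configuration check concrete, here is a minimal sketch that audits an INI-style pipeline config against a short list of rules. The section names, setting names, and rules are hypothetical and do not correspond to any particular product; they would need to be mapped to whatever options a given orchestrator actually exposes.

```python
import configparser

# Hypothetical setting names for a generic self-hosted pipeline.
# Each rule: (section, key, expected value, why it matters).
RULES = [
    ("webserver", "auth_enabled", "true", "admin UI should require login"),
    ("webserver", "expose_config", "false", "config page can leak secrets"),
    ("core", "tls_enabled", "true", "traffic should be encrypted in transit"),
]


def audit_config(text: str) -> list[str]:
    """Return a readable finding for every rule the config violates."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    findings = []
    for section, key, expected, reason in RULES:
        # A missing section or key is reported as "<unset>" and treated as a failure.
        actual = parser.get(section, key, fallback="<unset>").strip().lower()
        if actual != expected:
            findings.append(f"[{section}] {key}={actual}: {reason}")
    return findings
```

Running a check like this on a schedule, rather than once at deployment, is what catches settings that drift after the initial setup.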
Staying up to date on security patches and updates is always crucial, but out-of-date software can be easier to miss if it’s only running on internal systems. The recently announced Supply-chain Levels for Software Artifacts (SLSA) Version 1.0 should help protect software code from tampering and facilitate secure development practices. Regardless, given the volume of supply-chain attacks, constant monitoring and patching are necessary to address vulnerabilities in both external and internal services.
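A simple way to keep internal systems from falling behind is to compare what is installed against a list of minimum patched versions. The sketch below assumes purely numeric dotted versions and a hand-maintained minimums table; the package names used with it are placeholders, and a real deployment would more likely rely on a vulnerability feed or a dependency scanner.

```python
from importlib import metadata


def parse_version(version: str) -> tuple[int, ...]:
    """Parse a purely numeric dotted version such as '2.7.3' (an assumption)."""
    return tuple(int(part) for part in version.split("."))


def installed_versions(names: list[str]) -> dict[str, str]:
    """Look up installed versions, skipping packages that are not present."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return versions


def find_outdated(installed: dict[str, str], minimums: dict[str, str]) -> list[str]:
    """Return the names of packages installed below their minimum patched version."""
    return sorted(
        name
        for name, minimum in minimums.items()
        if name in installed
        and parse_version(installed[name]) < parse_version(minimum)
    )
```

For example, find_outdated(installed_versions(list(minimums)), minimums) could run as a nightly job on the pipeline host and raise a ticket whenever the result is not empty.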
Another overlooked aspect of self-hosted pipelines is their architecture. Secure architectures define one primary function per server or container. PCI DSS and NIST SP 800-53 both require this as part of secure configuration and hardening. Most self-hosted pipelines, however, run as a single server or container to simplify management. This is a risk: if one component of the service (e.g. Log4j) has a vulnerability, an attacker can easily move anywhere in the system. A secure pipeline will separate the web application’s administrative interface from the database storing sync settings, and both from the workers or connectors syncing data. With proper isolation, administrators can harden the system by limiting what each component can do and raise alerts when anomalies occur.
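The isolation described above can be thought of as a per-component allow-list plus an alert on anything outside it. The following is a minimal sketch of that idea; the component names, action names, and the authorize function are all hypothetical, and in practice the same limits would be enforced with network policies, database grants, and container permissions rather than application code.

```python
# Hypothetical component and action names, for illustration only.
ALLOWED_ACTIONS = {
    "admin_web": {"read_settings", "write_settings"},
    "settings_db": {"serve_queries"},
    "sync_worker": {"read_settings", "read_source", "write_destination"},
}


def authorize(component: str, action: str, alerts: list[str]) -> bool:
    """Allow only actions on the component's list; record an alert otherwise."""
    allowed = action in ALLOWED_ACTIONS.get(component, set())
    if not allowed:
        # An out-of-policy request is the anomaly worth surfacing to the security team.
        alerts.append(f"ALERT: {component} attempted {action}")
    return allowed
```

With this split, a compromised admin interface that tries to read source data directly is both blocked and reported, instead of silently succeeding as it would on a single all-in-one server.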
IT security professionals need to recognize the critical importance of securing data pipelines in a data-driven corporate environment. By utilizing comprehensive security practices, capitalizing on managed pipeline solutions, and implementing robust architectural controls, organizations can safeguard their data pipelines effectively. By doing so, companies will significantly enhance their overall security posture and protect sensitive information from potential breaches and disruptions.