Test Data? Get Real

2 Jan 2019 6:00am, by

Karun Bakshi
Karun Bakshi is vice president of product marketing at Delphix. Bakshi loves imagining, building and talking about software-driven innovation. He has spent his career in software in nearly every capacity, including engineering, product management, evangelism, partnerships, business development and product marketing, at companies including Lockheed Martin, Oracle, Microsoft, Pivotal and Delphix. Whether it's discussing corner cases of algorithms or go-to-market strategy, he's game if you are.

It’s a story we’ve seen time and again: software tends to fail when it does not accurately account for reality. We saw it nearly two decades ago with the Y2K scare, and we saw it earlier this year when the New York Stock Exchange had to suspend trading in stocks priced at four digits. These are tales of data-related defects, in which unanticipated incoming data exercises the software in unexpected ways and the system breaks down. Such seemingly small defects are often incredibly costly and surprisingly common. In fact, I’m sure most organizations have dealt with the fallout from a data-related defect in some form or another.

Whereas those examples are well known for their failure to model (future) reality, a much more common and mundane scenario these days is the failure to robustly manage the dynamic complexity of data states that can exist in a software system over time. That new customer field you added to your application was probably well tested in isolation. But have you fully tested how it interacts with modules you wrote earlier, or with subsystems developed by other teams, in all the dynamic richness of real life? Chances are, no.

Collectively, we’ve gotten into a bad habit of using synthetic data which, by definition, is unrealistic. The result is a lot of poorly built, fragile applications that don’t accurately reflect reality. A single oversight can cost a company substantial money, time, credibility, opportunity and users.

Synthetic data works well if you don’t have access to real data (e.g. prior to initial launch, or for an installed app) or if the system is simple enough that its various states are easily understood and handled. Today, however, most apps are SaaS, and even simple apps are quite complex because they interact with multiple backend systems. So production data, a more accurate reflection of reality, is readily available. Nevertheless, many of us feel compelled to keep using unrealistic, synthetic data. For most of us, it seems there is little choice: time, security and technical constraints often make unrealistic data a necessity.

How Did We Get Here? Synthetic Data’s Slippery Slope

With the world caught up in the race to Digital Transformation, speed to market, and consequently agile development, has become paramount. Starting with synthetic data for testing makes sense when you build a new app or a new feature. But as the app becomes more complex, our testing approach remains stagnant. It’s easy to build simple test cases with synthetic data, but relevant test data for more complex tests is time-consuming and painstaking to construct synthetically. One of the most common reasons we’ve come to rely on unrealistic data is the constant need to make up time in the development cycle.

The other piece of the puzzle is security and data privacy. If you’re building something that requires sensitive information as part of the development cycle, using real user data can be incredibly powerful when it comes to modeling customers’ needs and behaviors. But in today’s age of data breaches, we can’t sacrifice consumer privacy to leverage it.

Facebook’s Cambridge Analytica scandal is a stark reminder that data privacy is of paramount importance and that lapses carry significant consequences. So directly visible personally identifiable information (PII), protected health information (PHI) and other sensitive information cannot be part of the standard testing modus operandi. And so we settle for synthetic data as a proxy for sensitive user information.

If we want to build dependable, scalable, high-performance applications, the days of cutting corners and faking data are over. Modern development requires realistic test data to be delivered with speed and security. I know what you’re thinking: that’s easier said than done. But it is possible to deliver speed and maintain data privacy with realistic data today, and it’s the way the future is moving.

So, What’s the Alternative? A Reality Check

The alternative to synthetic data is real data, meaning production data. Many organizations turn to copies of production data. However, creating a copy of production data can be a frustratingly slow process when it means filing an IT ticket behind a three-week backlog. Creating an obfuscated copy of production data that hides sensitive data while preserving business value (e.g. a social security number still has nine digits, referential integrity is maintained, etc.) can be similarly time-consuming, and the delays add up. Moreover, most organizations do these activities in manual, ad hoc ways that are fraught with errors and delays, which limits their ability to do them frequently and consistently.
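
To make the idea of an obfuscated copy concrete, here is a minimal sketch in Python of what "a social security number still has nine digits, referential integrity is maintained" can look like in practice. It is an illustration under stated assumptions, not how any particular product does it: the key, the `mask_ssn` and `mask_rows` helpers, and the `customers` and `orders` tables are all hypothetical. The masking is keyed and deterministic, so the same input always produces the same masked value, which is what keeps joins between tables working.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would live in a secrets store.
SECRET_KEY = b"demo-only-key"


def mask_ssn(ssn: str, key: bytes = SECRET_KEY) -> str:
    """Replace every digit with a keyed pseudo-random digit, keeping the layout.

    The same input and key always give the same output, so a value masked
    in one table still matches the same value masked in another table.
    """
    digits = "".join(c for c in ssn if c.isdigit())
    digest = hmac.new(key, digits.encode(), hashlib.sha256).digest()
    out = []
    i = 0
    for c in ssn:
        if c.isdigit():
            # Map one digest byte to one digit (slightly biased, fine for a sketch).
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(c)  # keep separators such as "-" untouched
    return "".join(out)


def mask_rows(rows, column):
    """Return copies of the rows with one column masked."""
    return [{**row, column: mask_ssn(row[column])} for row in rows]


# Hypothetical production-like tables that share the SSN as a key.
customers = [
    {"ssn": "123-45-6789", "name": "Ada"},
    {"ssn": "987-65-4321", "name": "Grace"},
]
orders = [
    {"customer_ssn": "123-45-6789", "total": 42.50},
    {"customer_ssn": "987-65-4321", "total": 17.25},
    {"customer_ssn": "123-45-6789", "total": 8.00},
]

masked_customers = mask_rows(customers, "ssn")
masked_orders = mask_rows(orders, "customer_ssn")

# Every masked order still points at a masked customer.
known = {c["ssn"] for c in masked_customers}
print(all(o["customer_ssn"] in known for o in masked_orders))
```

A sketch like this covers only one column type and makes no claim about resisting a determined attacker. The point is that format and relationships can survive masking, so tests still exercise realistic shapes of data.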

Rising to the challenge, the DataOps movement has emerged to bring discipline and efficiency to the flow of data in the modern enterprise. DataOps is as much a cultural movement bridging the needs of data consumers and data managers as it is a technology play. A DataOps practice and platform can bring speed, security, consistency and automation to the provisioning of (test) data across the enterprise.

A mature DataOps approach will comprise several key elements. It should seamlessly integrate and scale with the heterogeneous enterprise IT landscape and work with all relevant data wherever it exists (SQL/NoSQL, cloud/on-prem, etc.). It should facilitate data capture, processing and delivery in a form that consumers can use on a day-to-day basis with minimal overhead or delay. And, finally, it must proactively identify and mitigate risk as data flows across the enterprise.

With DataOps, production data delivery can be automated to accelerate application development and testing, delivering both speed and security. Without that, teams are left using stale or high-risk datasets, or waiting on provisioning and refreshes.

Through secure, self-service and automated access to data, devs can get data when and where they need it, accelerating their workflows and helping them build dependable, scalable, high-performance applications. Integrated with modern CI/CD pipelines, DataOps automation can remove one of the few remaining sources of friction in software delivery: data environment provisioning.

Data-related defects are far too common and costly in the modern enterprise. It’s time to stop cutting corners and leave “fake data” behind for good. It will take some work to get there, but it is possible to achieve speed and security when accessing realistic production data. When you do, data flows easily and securely, and great things happen.

Feature image via Pixabay.
