Quite frequently in The New Stack, you’ve seen us discuss how continuous integration and continuous delivery (CI/CD) help developers automate the process of testing and staging applications. But what if these developers are automating the building of applications designed to utilize massive customer databases?
Think about this more deeply for a moment: borrowing snippets of real customer data from a production database to test a database-driven application in a real-world situation is, at the very least, probably non-compliant; it may also be unethical, and depending on your employer or client, quite possibly illegal. Generating synthetic data (“Doe, John Q,” “1000 Liberty Lane,” etc.) by hand is too time-consuming a process, especially when the volume of data an application uses at any one time is critical to the realism of the test.
If infrequently run tests that depend on homemade fake data are automated on that same infrequent schedule, then it’s no longer truly continuous integration. It’s simulated continuity.
“Companies do lots of workarounds, to try to do as best they can without virtual data,” said Dan Graves, the vice president of product management for a company called Delphix.
Virtual data is not fake data. In the open source community, multiple ongoing projects are dedicated to producing APIs that retrieve fake data suitable for application testing. For variable-length strings, for example, JSONPlaceholder can be set to produce strings of meaningless Latin according to specific parameters. A Node.js module called Faker produces false records that help build tables of personal data. MockAPI can generate not only false records but fake relations between them, for relational database testing. And if all you need is the data without the methodology, there’s the website generatedata.com.
Does the production of fake data introduce performance risks, especially in “big data” environments where volume may impact parallelism? In Sharing Big Data Safely: Managing Data Security, an O’Reilly book published last December, authors Ted Dunning and Dr. Ellen Friedman go so far as to suggest testing the fake data itself before authorizing its use in application testing.
In citing a particular case study, Friedman and Dunning wrote, “Once it was established inside the security perimeter that a particular version of synthetic data matched KPIs [key performance indicators] sufficiently well in the context of the current model training algorithms, the fake data was then used outside the security barrier to build new and improved versions of the model of interest.”
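The book’s exact procedure isn’t reproduced here, but the principle in that quote — validate synthetic data against KPIs inside the perimeter before releasing it outside — can be sketched as a simple statistical comparison. The KPI (a mean), the tolerance, and the sample values below are all hypothetical:

```python
# Sketch: vetting synthetic data against a KPI computed on the real data
# before approving it for use outside the security perimeter. The KPI
# (mean value), the 5% tolerance, and the data are hypothetical.
from statistics import mean

def kpi_matches(real_values, synthetic_values, tolerance=0.05):
    """Approve synthetic data only if its KPI is within `tolerance`
    (relative) of the same KPI computed on the real data."""
    real_kpi = mean(real_values)
    synth_kpi = mean(synthetic_values)
    return abs(synth_kpi - real_kpi) / abs(real_kpi) <= tolerance

real = [102.0, 98.5, 101.3, 99.9]
good_synth = [100.1, 100.9, 99.2, 101.0]   # tracks the real KPI
bad_synth = [10.0, 12.0, 9.5, 11.1]        # clearly off

print(kpi_matches(real, good_synth))  # True
print(kpi_matches(real, bad_synth))   # False
```

A real vetting pass would compare many KPIs at once, in the context of the model-training algorithms the authors describe, but the gate is the same shape: synthetic data earns trust by matching measurements, not by construction.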
All this performance tuning of pig-latin phrases in real-world simulations is typically avoided, Graves told us, by carving small wedges of real customer data out of active databases. But that introduces a host of problems, the biggest of which is not as obvious as the one you’ve just thought of.
“If the actual application has a big, ten-terabyte database,” he told us, “they may be running with a hundredth of that, so that they can refresh it in 30 minutes instead of a day. But the problem, obviously, is that it’s therefore not that representative of production. And they’re going to find data errors.”
Never Play with a Full Deck
But even if the segmentation process only consumes a half-hour of testing time, he said, the fact that it has to be done at all may constrain the total amount of time during an average week that organizations devote to automated regression tests, even if they’re baked into what purports to be a CI/CD process. “At the end of a run, can you run it again two minutes later? Or a day later? If it’s a day later, it means you’re only actually running seven hours of tests a week, even though it’s only an hour to run one test.”
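Graves’ arithmetic is easy to verify: a one-hour test gated by a day-long data refresh yields one run per day, or seven hours of actual testing per week. A small sketch of that throughput math (the 168-hour week and the refresh times are the only inputs):

```python
# Sketch of the throughput math in Graves' example: each test run must
# wait for a data refresh before the next can start. Numbers come from
# the quote above; the function itself is an illustration.
def test_hours_per_week(test_hours, refresh_hours):
    """Hours of actual testing in a 168-hour week when every run
    is followed by a mandatory data refresh."""
    cycle = test_hours + refresh_hours
    runs_per_week = 168 // cycle  # only whole runs count
    return runs_per_week * test_hours

print(test_hours_per_week(1, 23))   # day-long refresh: 7 hours/week
print(test_hours_per_week(1, 0.5))  # 30-minute refresh: far more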
Graves’ suggested solution, as you may have already guessed, is version 5.0 of Delphix’s data operations software, which became generally available in April. Tucked deep beneath Delphix 5’s marketing message of “driving business agility” is a service called selective data distribution. If a database were a deck of cards, imagine this as a way to shuffle the deck while removing some of the cards that would complete a full house or a straight flush — the items that would make a record identifiable.
As Graves described it, the process “de-identifies” data, in such a way as to satisfy two main objectives: maintaining compliance with regulations such as SOX and HIPAA, and reducing security risk. “A lot of the data breaches you’ve seen over the last few years,” he explained, “came from attacking the non-production systems — dev, test, and recording systems — because they’re typically less well-defended.”
The Delphix profiling system scans through HIPAA-regulated data, identifies fields that require masking, and complies with policies and instructions regarding who should have access to that data. The results are sanitized, full-volume copies containing non-identifiable records. What’s more, since a real-world database schema produces unions and joins of records from multiple sources (for HIPAA-regulated systems, Graves said, numbering in the dozens), each false record must follow the same relational pattern. This way, whenever a real person is replaced with a pseudonym, the same pseudonym applies in all circumstances.
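Delphix does not publish its masking algorithms, but the consistency requirement Graves describes — the same real identity always maps to the same pseudonym, so joins across tables still line up after masking — can be sketched with a keyed hash. This is a minimal illustration of the principle, not Delphix’s method; the secret key is a placeholder.

```python
# Sketch: consistent pseudonymization with a keyed hash (HMAC-SHA256).
# The same input always yields the same pseudonym, so masked tables
# still join correctly. This illustrates the principle Graves
# describes; it is NOT Delphix's actual algorithm.
import hmac
import hashlib

SECRET_KEY = b"masking-key-kept-inside-the-perimeter"  # hypothetical

def pseudonym(real_value):
    """Map a real identifier to a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, real_value.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return "user_" + digest[:12]

# The same person gets the same pseudonym in every table...
print(pseudonym("Doe, John Q") == pseudonym("Doe, John Q"))  # True
# ...while different people stay distinct.
print(pseudonym("Doe, John Q") == pseudonym("Roe, Jane R"))  # False
```

Keeping the key inside the security perimeter is what makes the mapping non-reversible to anyone working with the masked copies; production masking tools also preserve field formats (names that look like names), which a raw hash does not.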
“There’s a lot of science in the world of data masking,” he remarked, “to make sure that, while we are anonymizing information, we’re not destroying its validity for running analytics or doing a QA test. We spend a lot of time tuning these algorithms, to preserve what needs to be preserved.”
Feature Image: A 2010 bridge construction collapse in Canberra, Australia, by Flickr user Richard, licensed under Creative Commons.