Machine learning has a data quality problem. Bad data can work its way through the entire modeling process before anyone notices it’s faulty, forcing a laborious debugging process. And while there are various tools to check data as it enters ML operations, there are few frameworks out there to standardize data validation across an entire system, or company.
“In true Python spirit, it’s the Wild West, so you just got to do stuff yourself,” Niels Bantilan, a machine learning software engineer at ML software provider Union.ai, said about ML data validation during a talk at the Linux Foundation’s Open Source Summit, being held this week in Seattle.
Bantilan introduced two open source programs that he created that can help root out bad data before it is used in production, as well as standardize the process of data validation.
Commercially supported by Union.ai, Flyte is a Kubernetes-friendly, DAG-based data pipelining framework that can type check material that has been ingested as pandas DataFrames. Pandera builds on this framework by providing additional statistical validation checks against the data, allowing an organization to build out a data schema that embeds domain knowledge about acceptable data ranges and types.
Used together, these programs can validate data as correct, raising alerts at runtime when validation fails. In machine learning, type safety is vitally important, if for no other reason than it can save considerable time and resources. An organization could base its business decisions on faulty data, and it can be a real chore to step back through an unannotated ML process to find out what went wrong.
Much of the data used today in ML is encoded in pandas DataFrames, which are basically tables of imported data with little or no additional context. Python, however, is a dynamically typed language: it does not check what type of data is assigned to a variable until the code actually runs.
On its own, Python cannot flag when, say, a string is inadvertently entered as a value instead of an integer; such a mistake only surfaces as an error at runtime. And even if all the values are strings, applying Python’s math operators to them can lead to undesired results, Bantilan said.
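A two-line illustration of the pitfall (the variable names here are hypothetical, not from the talk):

```python
# A value that should be the integer 10 arrives as the string "10".
price = "10"
quantity = 3

# Python raises no type error here: * applied to a string means
# repetition, so the total is silently wrong rather than failing fast.
total = price * quantity
print(total)  # "101010", not 30
```

Nothing in the language stops this code from running, which is exactly the gap that a type-checking layer in the pipeline is meant to close.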
A strongly typed “data lineage tracking platform,” Flyte can, among other things, perform type checking before production runs, ensuring, for example, that only integers land in an integer column.
With Flyte, the ML engineer writes tasks that preprocess data. Each task is a decorated Python function run in its own container. Tasks can be chained together into workflows, with the input and output of each task clearly defined.
With Flyte schemas, you can build a fully type-safe DAG ML workflow, which helps ensure that the data used is correct.
“This is a great feature to have. Because now that you have type information, you basically have function types for your functions. And your function now can be analyzed to see what is a valid set of operations. So you can assess your workflow for validity just on the basis of the allowed input and output types,” Bantilan explained.
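The idea behind that quote can be shown without Flyte at all, using nothing but Python’s own type hints (a minimal sketch with made-up step names):

```python
from typing import get_type_hints

def extract() -> list[float]:
    return [1.0, 2.0, 3.0]

def aggregate(values: list[float]) -> float:
    return sum(values)

# Wiring check: the output type of one step must match the input type
# of the next, and this can be verified before any data is processed.
out_type = get_type_hints(extract)["return"]
in_type = get_type_hints(aggregate)["values"]
print(out_type == in_type)  # True
```

This is the analysis a typed workflow engine performs across an entire DAG: if the types at each edge line up, the composition is valid by construction.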
The resulting Python code can be run locally and deployed into a production environment. Flyte itself can be installed with pip.
Beyond Type Safety
Pandera is a statistical typing and data testing tool that can be integrated with Flyte to validate additional properties beyond data types, in effect adding guardrails to a data processing pipeline.
Statistical typing specifies the properties of collections of data points. For instance, if you already know the expected range of input values, you can check that the data falls within that range. You can match values against a regular expression, or check that nulls do not appear too many times. You can check a column for uniqueness, or for monotonicity (are the values increasing or decreasing?).
With data testing, Pandera can validate both the live data coming in and the functions handling that data. You can encode assumptions about DataFrames as schemas, which can be used as Python type annotations and checked at function-call time.
“You can easily integrate DataFrame types with your pipelines and get informative errors if something goes wrong,” Bantilan said.