Apache Daffodil Tackles Problems of Universal Data Interchange
Daffodil, one of the latest projects to achieve Apache Software Foundation’s top-level status, isn’t some flowery new programming language, but an implementation of the Data Format Description Language (DFDL) to convert between fixed-format data and XML/JSON.
In distributed computing, for software and hardware to work together, they must be able to read and write data in a variety of formats. DFDL is an open standard developed by the Open Grid Forum. It is not itself a data format, but a way of describing the attributes of any data format, enabling universal data interchange.
“Despite the fact that the world has lots of data formats, people continue to invent new ones,” said Mike Beckerle, vice president of Apache Daffodil.
People talk about legacy data formats “as if there was something bad about them,” he said. “The legacy data formats that have stuck around and are still in heavy use are, in some sense, widely successful. And so accommodating them more affordably and in a standardized way is very important.”
The project has a long history. Beckerle said he started working on DFDL, a way to describe general text and binary data in a standard way, back in the early 2000s, although that wasn’t his day job. He and others teamed up with a project created in 2009 at the University of Illinois National Center for Supercomputing Applications, and the Daffodil project joined the Apache Incubator in 2017.
DFDL is used to describe both textual and binary data formats — scientific and numeric, legacy and modern, commercial record-oriented, and many industry and military standards. The Department of Defense and IBM were two major backers early in the project, Beckerle said.
DFDL uses a subset of W3C XML Schema to describe the logical structure of the data, with annotations added within the schema to describe its physical representation.
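As a rough illustration of that idea, a simple comma-delimited record might be described along these lines (a simplified sketch, not a production schema — the `record`, `name`, and `age` element names are invented here, and a real DFDL schema typically references a complete set of `dfdl:format` default properties rather than the minimal annotation shown):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
  <!-- Physical-representation defaults: text data, delimited fields -->
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format encoding="UTF-8" lengthKind="delimited"/>
    </xs:appinfo>
  </xs:annotation>
  <!-- Logical structure: a record with two fields, separated by commas -->
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence dfdl:separator=",">
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The ordinary XML Schema parts define what the data means; the `dfdl:` annotations define how those same values are laid out in the raw bytes.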
DFDL can be used to describe legacy data files, to simplify transfer of data across domains without requiring global standard formats, or to allow third-party tools to easily access multiple formats. DFDL also can be a powerful tool for supporting backward compatibility as formats evolve, according to the documentation.
It enables developers to use XML or JSON to consume, inspect, and manipulate fixed-format data without having to know all the details of the data structure, Beckerle said. Daffodil can also be used to reverse that process to return the data back to its original format.
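The round trip Beckerle describes can be sketched generically. This is not Daffodil’s actual API — the record layout and field names below are invented for illustration — but it shows the key property: parsing lifts fixed-format bytes into a structured, JSON-friendly form, and unparsing is its exact inverse.

```python
import json
import struct

# Toy fixed binary record: 10-byte name, 32-bit count, 32-bit float price.
# (Layout invented for illustration; Daffodil derives this from a DFDL schema.)
RECORD = struct.Struct(">10s i f")

def parse(data: bytes) -> dict:
    """Lift fixed-format bytes into a structured, JSON-friendly dict."""
    name, count, price = RECORD.unpack(data)
    return {"name": name.rstrip(b"\x00").decode("ascii"),
            "count": count,
            "price": price}

def unparse(fields: dict) -> bytes:
    """Inverse of parse: reproduce the original fixed-format bytes."""
    return RECORD.pack(fields["name"].encode("ascii").ljust(10, b"\x00"),
                       fields["count"],
                       fields["price"])

original = RECORD.pack(b"widget\x00\x00\x00\x00", 42, 1.5)
fields = parse(original)
print(json.dumps(fields))
assert unparse(fields) == original  # round trip restores the exact bytes
```

A consumer can now inspect or manipulate `fields` as ordinary JSON without knowing the byte layout, then hand the result back to `unparse` to regenerate the native format.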
Daffodil is in use at many large organizations, including DARPA, GE Research, the Naval Postgraduate School, Owl Cyber Defense, Perspecta Labs, and Raytheon BBN Technologies.
Startup companies that get acquired by the likes of Oracle, IBM, and SAP typically bring in data in ad hoc ways; the acquiring company then ends up with a portfolio of products using myriad data formats, Beckerle explained.
“In most cases, they’re not libraries, you have to buy this product and somehow deploy it,” he said. “The contribution of DFDL and the Apache Daffodil implementation is to provide a standard … an open source implementation, so as to make this constant proliferation of these things stop.”
Apache Daffodil can be embedded into data ingestion tools to understand data formats, without those tools having to develop their own approaches to understanding them. It can also be used for data-directed routing, in which understanding the data determines where it is routed.
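Data-directed routing can be sketched as follows (a conceptual sketch, not Daffodil code — the two hand-written parsers and the route names are invented stand-ins for DFDL schemas and real destinations): try each known format in turn, and send the message wherever the matching format says it belongs.

```python
import json

def parse_json_record(data: bytes) -> dict:
    return json.loads(data)

def parse_csv_record(data: bytes) -> dict:
    name, age = data.decode("ascii").strip().split(",")
    return {"name": name, "age": int(age)}

# (format name, parser, destination) — in a Daffodil deployment, each parser
# would instead be a compiled DFDL schema.
ROUTES = [
    ("json", parse_json_record, "json-ingest"),
    ("csv", parse_csv_record, "csv-ingest"),
]

def route(data: bytes):
    for fmt, parser, destination in ROUTES:
        try:
            record = parser(data)
        except (ValueError, UnicodeDecodeError):
            continue  # data does not match this format; try the next
        return destination, record
    return "quarantine", None  # no known format matched

print(route(b'{"name": "ada", "age": 36}'))  # routed to json-ingest
print(route(b"ada,36\n"))                    # routed to csv-ingest
```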
It has Java and Scala APIs, and provides Apache NiFi processors for parsing and unparsing NiFi FlowFiles.
As part of its path forward, the project has undertaken what Beckerle describes as an “ambitious project” to make Daffodil faster — to be able to generate programs written in C “that are as tight and fast as what you could write by hand.”
It’s also working to make Daffodil easier to use, including improving its data debugger.
Beckerle, by day principal engineer at Owl Cyber Defense, explained how his company uses Daffodil:
“You have a firewall between an untrusted network like the internet, and a secure network, where your servers and databases are, and you have data coming across that firewall. So most people are familiar with the notion of a whitelist in the firewall [where] only certain kinds of data are allowed across. And so the question is: How does a firewall know that a given piece of data has magic numbers and the right file extension and so forth?
So the way you do that is proof by construction — the data comes in, you rip it apart using Daffodil and a DFDL schema that describes that data, break it down into this data structure … inspect all that, then you put it all back together using the unparsing capability of Daffodil, which is the inverse process of parsing. Now you definitely know, by construction, it is what it says it is. And if it can’t survive that process, well, then it wasn’t valid data.”
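That proof-by-construction check can be sketched in miniature. The magic-number-plus-length format below is invented for illustration; in the real deployment, Daffodil and a DFDL schema stand in for the hand-written `parse` and `unparse`.

```python
import struct

# Hypothetical whitelisted format: 4-byte magic, 16-bit payload length, payload.
HEADER = struct.Struct(">4s H")

def parse(data: bytes) -> dict:
    """Rip the data apart; fail if it doesn't match the declared format."""
    magic, length = HEADER.unpack(data[:HEADER.size])
    payload = data[HEADER.size:]
    if magic != b"DATA" or len(payload) != length:
        raise ValueError("does not match the declared format")
    return {"magic": magic, "length": length, "payload": payload}

def unparse(rec: dict) -> bytes:
    """Put it back together from the parsed fields alone."""
    return HEADER.pack(rec["magic"], rec["length"]) + rec["payload"]

def allow_across(data: bytes) -> bool:
    try:
        rec = parse(data)            # break it down into a data structure
    except (ValueError, struct.error):
        return False                 # couldn't survive parsing: reject
    # ...inspect the parsed fields here (whitelist checks)...
    return unparse(rec) == data      # rebuilt copy must match byte-for-byte

good = b"DATA" + struct.pack(">H", 5) + b"hello"
print(allow_across(good), allow_across(b"JUNKxxxx"))  # True False
```

Data that reaches the secure side is, by construction, data that a known schema fully accounts for; anything the parse/unparse cycle cannot reproduce is rejected.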