Y’all Against My Lingo? Why Everyone Hates on YAML
We’ve always needed software configuration files, especially when tuning code to work for our purposes. Back when I ran the tech side of a UK national ISP, much of my day was spent building and managing all kinds of config files, for everything from PCs to routers, with configurations stored in source control systems.
Some were easy to use, others obtuse and complex, where a simple missed character could take out email service for thousands of users. Remember the old sysadmin joke, that the only way to get a working sendmail configuration was to have a cat walk across a keyboard?
That was thirty years ago. Surely the industry should have moved on by now.
We still seem to have all the same problems with writing and managing configuration files, whether they’re designed to be human-readable or left for the machines. In fact, things seem to have become worse, with complex node-based configurations written in what is almost, but not quite, a set of key-value pairs. And what should we blame this on?
YAML. Originally named “Yet Another Markup Language,” its name has morphed with its role into the official “YAML Ain’t Markup Language,” though most of us haven’t read the memo. That name change was meant to reflect its role as a data serialization language, not as an HTML-like markup language. Instead, it’s been designed to be a JSON-like language that uses markup to embody data structures and their content. Or at least that’s what it’s intended to be.
In practice, however, it’s clearly not. Perhaps the biggest problem with YAML is that it’s everywhere. You must have seen the meme based on an old cartoon: a man is with a fortune teller who is looking into a crystal ball. She looks at him in horror, saying “YAML, I see so much YAML.”
If I Had a Hammer
When ubiquity is a joke, it’s time to worry. It means that the world has a hammer and everything is a nail, and that people have noticed what’s happening and realized that they don’t like it.
That ubiquity has pushed YAML right across the spectrum of applications and services, targeting everything from Kubernetes to consumer applications. YAML might work well for managing enterprise services, where we can write our own custom configurators that output ready-to-use files, but is it suitable for mom-and-pop and their nascent smart home? Friends rant about configuring the popular Home Assistant IoT hub, trying to remember the syntax for each type of device, and managing the various config files they need — often with basic text editors on Raspberry Pi single-board computers.
Why Is YAML so Complex?
One of the biggest issues with YAML is formatting. It’s strict and, as a result, illogical at first glance, much like making lists in Markdown. White space shouldn’t need to matter in today’s world. I learned (and used) FORTRAN many, many years ago and even then, counting spaces and tabs was tedious and problematic. YAML’s block formatting is at best rigid, especially when you have to remember to terminate a line with a space. Things get more complex when nodes in a YAML document have complex mappings, and content quickly becomes hard to read.
Part of the problem is perhaps that we’re still using many of the same tools I was using to write FORTRAN back in the early 1990s. Basic editors like vi and vim remain the main editors on devices like the Pi, and there’s not really any way to use them to force language-specific formatting rules. They’re simple, quick editors, and that makes it possible to make trivial YAML errors that are hard to debug, as each edit cycle means restarting processes and waiting to see whether programs crash or your configuration does what you intended.
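One way to shorten that edit cycle is to check indentation before restarting anything. What follows is only a minimal sketch in Python, not a YAML parser: the function name find_indent_problems, the sample config, and the two-space width are all assumptions made for illustration. The tab check reflects a real YAML rule (tabs are not allowed for indentation); the multiple-of-two check is just a house-style assumption.

```python
def find_indent_problems(text, width=2):
    """Return (line_number, message) pairs for indentation that commonly trips up YAML."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank lines and comment lines don't affect structure
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            # YAML does not allow tab characters for indentation.
            problems.append((number, "tab used for indentation"))
        elif len(indent) % width != 0:
            # Assumed house style: indent in multiples of `width` spaces.
            problems.append(
                (number, f"indent of {len(indent)} is not a multiple of {width}")
            )
    return problems


# Hypothetical config fragment with one tab indent and one three-space indent.
sample = "server:\n  host: example.org\n\tport: 25\n   relay: true\n"
for number, message in find_indent_problems(sample):
    print(number, message)
```

A check like this catches only the most mechanical mistakes. It says nothing about whether the document means what you intended.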
Things get more complicated with string data in YAML. You don’t have to put quotes around strings, and it’s possible for a string to get confused with an intrinsic value. This has become known as the Norway Problem: when using two-character country codes, Norway’s code, NO, is evaluated as a Boolean and set to false by many parsers if not explicitly quoted.
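The Norway Problem can be sketched without a YAML library at all. The Python below is an illustration, not a real parser: resolve_plain_scalar is a hypothetical helper that mimics how an unquoted value would be typed, using the Boolean spellings listed in the YAML 1.1 bool type and the narrower set in the YAML 1.2 core schema. Real parsers differ in which of these spellings they honor.

```python
import re

# Boolean spellings listed in the YAML 1.1 "bool" type definition.
YAML11_BOOL = re.compile(
    r"^(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)
YAML11_TRUE = {"y", "yes", "true", "on"}

# The YAML 1.2 core schema narrows this to true/false spellings only.
YAML12_BOOL = re.compile(r"^(?:true|True|TRUE|false|False|FALSE)$")


def resolve_plain_scalar(text, version="1.1"):
    """Return a bool if an unquoted scalar would resolve to one, else the string."""
    pattern = YAML11_BOOL if version == "1.1" else YAML12_BOOL
    if pattern.match(text):
        return text.lower() in YAML11_TRUE
    return text


countries = ["GB", "SE", "NO", "DK"]
as_yaml11 = [resolve_plain_scalar(c, "1.1") for c in countries]
as_yaml12 = [resolve_plain_scalar(c, "1.2") for c in countries]
print(as_yaml11)  # ['GB', 'SE', False, 'DK']
print(as_yaml12)  # ['GB', 'SE', 'NO', 'DK']
```

Quoting the value sidesteps the issue in either version, because a quoted scalar is never matched against these patterns.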
YAML is intentionally complex. Its specifications take years to write and fill entire books, with a lack of test suites to help parser designers. That leaves many parsers lagging the specification, so you’re left using trial and error to determine what YAML documents and formats work for you. And that leaves a worrying gap between authoring tools and parsers: what if your syntax highlighting parses a different version of the YAML specification from the parser in the application you just installed?
There’s a bigger question here: should data serialization languages be used for configuration files? You could argue that they make it easier for applications to read data, parsing the basic primitives used by both YAML and JSON, but what’s good for machines is often bad for people. YAML constructs are often complex, and there’s the risk that different libraries may parse them differently — whether reading or generating YAML for you.
Getting parsers to agree is critical. It’s the heart of specifications like XML and SGML, where parser design goes hand in hand with the underlying language grammar. YAML’s deliberately informal approach makes it impossible to have a standard parser. You only have to look at the errors that show up on the YAML test matrix, which itself is based on only one version of the specification, while a significant number of processors (many of which are still in use in old and unmaintained code) still use older versions of the YAML specification.
The result is a specification and a tool that’s drowning under the weight of its own complexity. Even JSON is simpler to use and parse — and its specification is only a few pages long. That’s not to say there aren’t good points to YAML. Describing configurations as node trees makes sense, even if it is hard to visualize.
In the decade or so it took to go from YAML 1.2 to its 1.2.2 revision, the industry has moved on. We’re thinking in terms of platform engineering, and while configuration as code is more important than ever, there are new entrants with newer ways of thinking that make more sense when managing complex distributed systems.
So, what do I prefer? I’ve become a convert to procedural approaches to code-based configuration. Languages like Azure’s Bicep and platforms like Pulumi’s make much more sense to me. I can use familiar constructs to build the infrastructures I need, and at the same time put in place and manage the platforms my code needs. Sure, they might compile my infrastructure code down to declarative descriptions in JSON and YAML, but I don’t need to worry about the outputs. I even get to use flow control and loops.
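To make the idea concrete, here is a minimal Python sketch of code that expands into a declarative description. It does not use Bicep or Pulumi and is not their API: the environment names, the build_services function, and the output shape are all invented for illustration, and the standard json module stands in for whatever format a real tool would emit.

```python
import json

# Hypothetical environments and replica counts, invented for this illustration.
ENVIRONMENTS = {"dev": 1, "staging": 2, "prod": 4}


def build_services(environments):
    """Use an ordinary loop and a conditional to expand a compact description."""
    services = []
    for name, replicas in environments.items():
        service = {"name": f"web-{name}", "replicas": replicas}
        if name == "prod":
            service["alerts"] = True  # flow control: only production gets alerts
        services.append(service)
    return {"services": services}


config = build_services(ENVIRONMENTS)
# The declarative output is generated, so nobody has to write it by hand.
print(json.dumps(config, indent=2))
```

Adding a fourth environment means adding one dictionary entry rather than copying and editing another block of configuration.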
As Aaron Kao, Vice President of marketing at Pulumi, notes, it’s also a world where automation is increasingly important. We’re building multicloud services, where engineering teams need to control much more than before, and where we need to be able to let teams build what they want, how they want, while keeping them within limits by providing guardrails and guidelines, something that’s nearly impossible to do in YAML. We need to be able to manage tens, hundreds, even thousands of configurations, and controlling that amount of YAML is near impossible.
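A guardrail in code can be as simple as a function that every generated configuration has to pass. The Python sketch below is a hypothetical example, not any vendor’s policy engine: the limits, the region names, and check_guardrails itself are assumptions chosen only to show the shape of the idea.

```python
# Hypothetical limits, chosen only for illustration.
MAX_REPLICAS = 10
ALLOWED_REGIONS = {"region-a", "region-b"}


def check_guardrails(service):
    """Return a list of human-readable violations for one service definition."""
    violations = []
    name = service.get("name", "<unnamed>")
    if service.get("replicas", 0) > MAX_REPLICAS:
        violations.append(f"{name}: replicas exceed {MAX_REPLICAS}")
    if service.get("region") not in ALLOWED_REGIONS:
        violations.append(f"{name}: region {service.get('region')!r} is not allowed")
    return violations


# Two made-up services: one within limits, one breaking both rules.
services = [
    {"name": "web", "replicas": 3, "region": "region-a"},
    {"name": "batch", "replicas": 50, "region": "region-z"},
]
report = [v for s in services for v in check_guardrails(s)]
for line in report:
    print(line)
```

Because the check is ordinary code, the same function can run over ten configurations or ten thousand.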
Of course, that all leads to another argument that general-purpose data description languages like YAML are simply too high-level to deliver the results we want. Instead, we should be thinking about working with more focused domain-specific languages, able to encompass the idiosyncrasies and specific requirements of the platforms they target.
Perhaps we’ll be better off in a world where there’s not one hammer and one nail, but a toolbox that contains the right tool for the job you want to do. Maybe it’s one where we get down to the nitty-gritty of specialized tools, or maybe it’s one where we simply write code and let it compile to manage the command lines, the APIs, and, yes, the configuration files we need.
But I shouldn’t leave things on a low note. After all, there’s one good thing to say about YAML: it’s still better than working with sendmail.conf.