Buggy Software Foils Dreams of Infrastructure-as-Code

Nov 10th, 2016 11:51am by

Many of us think of infrastructure-as-code as something of a given, or at least a possibility of how things could be done.

But at OSCON Europe, Joe Damato, founder and CEO of hosting provider Packagecloud, discussed some of the challenges and pitfalls of trying to manage infrastructure software programmatically.

“People are using infrastructure as code today … with varying levels of success and in many cases great pain,” he said.

Often, these challenges stem from the many possible combinations of languages and specific hardware in use.

In the mid-2000s, AMD Opteron Revision E and pre-release Revision F processors had a hardware bug in the way they handled atomic instructions. This type of bug can be disastrous for some types of applications, though it is possible to build software that protects against hardware bugs. Damato said that “MySQL is pretty dope … MySQL detected that bug and guarded against a hardware error corrupting your database.” He wrote a blog post if you want to learn more about this issue in all of its gory detail.

“Different languages have different tradeoffs,” Damato said. Some languages like Assembly, C, and C++ are more difficult, but they are less abstract and can be used defensively. He went on to say that “some languages are perceived as easy, but are terribly difficult,” like Ruby, Perl and Bash. “You must be an expert in C to write good, fast Ruby,” he said.

He offered a story to illustrate this issue that involved two particularly insidious bugs in the Ruby VM implementation related to MRI [Matz’s Ruby Interpreter] segfaults and MRI threading. The segfault bug is a result of a design error in the way garbage collection and the object allocator work, and the catch is that because this is buried within the VM, there is nothing you can do about this when you are programming in Ruby. The threading bug resulted in the wrong set of system calls being used, which caused severe performance issues.

The moral of this story is that “your code does things outside of your reference frame … unless you’ve read every line all the way down (you haven’t),” Damato said.

These sorts of issues can cause things to fail pretty spectacularly. For example, Damato recounted that “an MRI bug once made Puppet peg CPU usage,” which caused a single Puppet run to take over 20 minutes instead of a few minutes. He said that the only way to resolve this issue is to fix the underlying bug in Ruby, which he speculated is why Puppet has said it is rebuilding its client-side technology stack.

Damato said that Chef had a similar issue, in which “Ohai crashes on Solaris 11, Ubuntu 12.04” because of a garbage collection bug in the Ruby VM. The workaround was to disable the garbage collector, do the work, and then re-enable the garbage collector. In other words, “the workaround is to disable a major feature of the language.”
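The shape of that workaround (pause the collector, do the work, switch the collector back on) can be sketched in a few lines. This is only an illustrative analog in Python, not Chef's actual Ruby code: the helper name gc_paused and the function collect_system_facts are hypothetical stand-ins, and Python's gc module controls only its cyclic collector, so the comparison with Ruby's garbage collector is loose.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily switch off Python's cyclic garbage collector."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the collector only if it was running before the pause.
        if was_enabled:
            gc.enable()


def collect_system_facts():
    # Hypothetical stand-in for the allocation-heavy work; not Ohai's real code.
    return {"platform": "example", "cpus": 4}


with gc_paused():
    facts = collect_system_facts()
```

The try/finally matters here: if the work raises an exception, the collector is still restored, so the "disable a major feature of the language" state does not leak into the rest of the program.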

The upshot of issues like these is that they make the idea of programmatically changing your infrastructure extremely difficult, if not outright impossible.

“We won’t be able to have truly reproducible infrastructure until we figure out better ways of building computer systems,” he said. “We need to be more honest and responsible about our choices and analysis of technology.”

Another roadblock: “It’s impossible to install a program securely on most Linuxes,” he charged.

For instance, the combination of the yum installer and the GPG encryption tool “doesn’t work most of the time and is nearly impossible to get it working.”

He said that yum needs the pygpgme package for GPG verification, but when that package isn’t installed, yum skips the verification and fails silently, as if nothing bad had happened. Verifying the repository metadata additionally requires the repo_gpgcheck option, which the vast majority of people don’t use because it is usually disabled by default. As for GPG V3 signatures, he showed that they are incredibly complicated and said that most people just don’t have the time to learn all of this and figure out how to make it work securely.
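To make the two settings concrete, here is a minimal sketch that scans the text of a yum .repo file and reports which of gpgcheck (package signatures) and repo_gpgcheck (repository metadata signatures) are not switched on. The repository name, URLs and the function unverified_settings are invented for illustration. The sketch also assumes that a missing option means "disabled," which is a simplification, since real yum can inherit defaults from its main configuration.

```python
import configparser

# Hypothetical repository definition; the name and URLs are placeholders.
SAMPLE_REPO = """\
[example]
name=Example repository
baseurl=https://repo.example.invalid/el/7/$basearch
gpgcheck=1
repo_gpgcheck=0
gpgkey=https://repo.example.invalid/gpgkey
"""


def unverified_settings(repo_text):
    """Return (section, option) pairs whose signature check is not enabled."""
    # interpolation=None keeps configparser from trying to expand '%' in URLs.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    problems = []
    for section in parser.sections():
        for option in ("gpgcheck", "repo_gpgcheck"):
            # Assumption: an absent option counts as disabled.
            if not parser.getboolean(section, option, fallback=False):
                problems.append((section, option))
    return problems


print(unverified_settings(SAMPLE_REPO))  # [('example', 'repo_gpgcheck')]
```

In the sample, packages are checked but the repository metadata is not, which is the gap Damato described.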

This isn’t just a yum issue. Damato gave the same summary for the APT packaging tool and GPG, a combo that also “doesn’t work most of the time and is nearly impossible to get it working.” Either debsigs or dpkg-sig can be used to sign deb packages, but the two don’t understand each other’s signatures, so you need to make sure you always use the same one. However, verification of these signatures is usually disabled by default, and Ubuntu and Debian repositories don’t GPG-sign individual packages anyway, so he described it as being a bit pointless.

Damato talks about a few other issues with both APT and yum:

  • “Both are vulnerable to replay attacks.”
  • “Neither deal with key revocation.”
  • “Both are vulnerable to several GPG related attacks.”

A big part of the issue is that critical things are built on top of layers and layers of scripts and other code that no one fully understands from top to bottom.

“Huge companies making billions of dollars on top of these software systems should take the initiative to invest in making them better,” Damato concluded. “We haven’t found the ‘answer’ yet. What we have is better than what we had, but we need to think bigger.”

Disclaimer: Dawn Foster worked at Puppet from 2012 through 2015 and currently owns stock in Puppet.

Feature image by Wisconsin Department of Natural Resources, CC BY-ND 2.0 license.
