The Silent Semantic Failures That Break Distributed Systems

It’s hard enough debugging complex distributed systems when things go wrong, but it’s even harder when they don’t. And it’s these silent errors that sysadmins need to watch out for.
A group of Johns Hopkins researchers, studying nine widely used distributed systems, came away with a dozen “informative findings,” or potential gotchas that ops folks may want to consider.
They also built a tool that “automatically infers semantic rules from past failures and enforces the rules at runtime to detect new failures,” according to a paper authored by Chang Lou, Yuzhuo Jing, and Peng Huang that accompanied a talk given at the USENIX Symposium on Operating Systems Design and Implementation, held last week in Carlsbad, California.
If you’re interested in correctness of distributed systems, you’ll likely enjoy “Demystifying and Checking Silent Semantic Violations in Large Distributed Systems” https://t.co/9ipf0ue211 from folks at JHU.
— Marc Brooker (@MarcJBrooker) July 18, 2022
Most failures in distributed systems are obvious: a crash, a timeout, an error code, an exception. Those can be remedied. But there is another group of errors that may cause no immediate disruption and instead quietly screw things up for the end user. For instance, a distributed file system may not make as many copies of data as it was directed to.
These are errors that violate the semantics of the system, in the researchers’ language. They can arise from improperly implemented API calls and public methods, buggy command-line interface commands or even incorrect configuration settings. And they can be devastating. A component that doesn’t do what is expected of it can break all the components that rely on it. “Since the violation is silent, the damage exacerbates over time,” the researchers write. The longer the corrupt file system runs, the more corrupt files it generates.
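To make the idea of a runtime-enforced semantic rule concrete, here is a minimal, purely hypothetical Go sketch (the `FileStore` type and its replication rule are our illustration, not the researchers’ actual tool): after writes complete, an assertion checks that the store really made as many copies as it was configured to, surfacing the violation instead of letting it silently compound.

```go
package main

import "fmt"

// FileStore is a hypothetical store configured to keep
// ReplicationFactor copies of every object.
type FileStore struct {
	ReplicationFactor int
	replicas          map[string]int // key -> copies actually written
}

// Put records how many copies were really made; a buggy implementation
// might silently make fewer than configured.
func (fs *FileStore) Put(key string, copiesMade int) {
	fs.replicas[key] = copiesMade
}

// CheckReplication is the runtime semantic rule: every stored key must
// have at least ReplicationFactor copies. Violations are reported
// immediately instead of accumulating silently.
func (fs *FileStore) CheckReplication() []string {
	var violations []string
	for key, n := range fs.replicas {
		if n < fs.ReplicationFactor {
			violations = append(violations,
				fmt.Sprintf("%s: %d of %d copies written", key, n, fs.ReplicationFactor))
		}
	}
	return violations
}

func main() {
	fs := &FileStore{ReplicationFactor: 3, replicas: map[string]int{}}
	fs.Put("a.txt", 3) // semantics honored
	fs.Put("b.txt", 1) // silent violation: too few copies
	for _, v := range fs.CheckReplication() {
		fmt.Println("semantic violation:", v)
	}
	// prints: semantic violation: b.txt: 1 of 3 copies written
}
```

The point of the sketch is the shape of the check, not the rule itself: no crash or exception ever fires, so only an explicit assertion on the system’s promised behavior can catch the failure.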
Most distributed systems worth their salt are tested regularly (and maybe even chaos tested as well), so many admins assume their systems are largely free of such errors. The researchers found that this is not the case, however: silent errors haunted 36% of the systems they studied, even the well-tested ones. And the more features are added to a system, the more likely such errors are to build up. Often the failure lies within a single component, and routine maintenance operations can bring it out.
For a full rundown of what could go wrong, check out the paper.
This Week in Programming
- Do We Need Another Unsafe Programming Language? This week, a Google engineer introduced a new project to build a programming language from scratch, called Carbon. Just as Microsoft built TypeScript to patch JavaScript’s shortcomings and Kotlin was created to streamline Java, Carbon could serve as a successor language to C++, one that offers developers an easy jumping-off point to a newer language that addresses modern development concerns such as memory safety and generics. In the discussion that followed, many wondered whether we need an entirely new language to address these concerns. While the Carbon engineers promise features that will ensure memory safety, those features are not in place yet. And this is a sticking point for some, including Swift co-designer Doug Gregor, who wrote on Twitter, “It takes an enormous amount of effort to bring a new language into the world and make it useful, to port code, reimplement core libraries. If you aren’t getting safety out of it, why incur that cost? Is the end result actually better, or just more pretty?”
I don’t think any programming language unveiled in 2022 should lack memory safety. We have to move on from the “it must be as fast as unsafe C” mindset, because the engineering cost of unsafe-by-default is so very high. Nice syntax and whiz-bang features don’t make up for it.
— Doug Gregor (@dgregor79) July 21, 2022
- Engineer Uses Machine Learning to Predict the Location of His Cat: If machine learning continues to struggle with predicting human behavior, it may have better luck predicting the actions of other creatures. As captured by TowardsDataScience, Welsh engineer Simon Aubury built an ML-based system to predict the location of his cat, Snowy, as the feline moves in and around the house throughout the day. He used a Bluetooth tracker and six receiving nodes planted about the home (Snowy was, helpfully, “ambivalent to data privacy concerns”) to capture hourly updates of Snowy’s whereabouts. Home Assistant captured temperature, humidity and rainfall observations, and that info was loaded into a database and transformed using dbt. He created a model with Python’s scikit-learn and built the app using Streamlit. The app predicts the location of Snowy given the time of day and the average indoor and outdoor temperatures for that time period.
- Golang Finally Gets a Proper Memory Model: As InfoWorld reported, the Go programming language’s memory model has been revised to align with the memory models used by C, C++, Java, JavaScript and Swift. Until now, Go had only mapped “sequentially consistent atomics” and lacked descriptions of some synchronization operations, such as atomic commits. The new model, debuting in Golang 1.19, “specifies conditions under which reads of a variable in one goroutine can be guaranteed to observe values produced by writes to the same variable in a different goroutine.” The new model “gives a more formal overall description of the Go memory model, adding a more formal description of multiword competing states, runtime.SetFinalizer, more sync types, atomic operations, and compiler optimizations,” adds the SoByte dev news site.