The Silent Semantic Failures That Break Distributed Systems
It’s hard enough debugging complex distributed systems when things go wrong, but it’s even harder when they don’t. And it’s these silent errors sysadmins need to watch out for.
A group of Johns Hopkins researchers, studying nine widely-used distributed systems, came away with a dozen “informative findings,” or potential gotchas that ops folk may want to consider.
They also built a tool that “automatically infers semantic rules from past failures and enforces the rules at runtime to detect new failures,” according to a paper authored by Chang Lou, Yuzhuo Jing, and Peng Huang, accompanying a talk given at the USENIX Symposium on Operating Systems Design and Implementation, held last week in Carlsbad, California.
If you’re interested in correctness of distributed systems, you’ll likely enjoy “Demystifying and Checking Silent Semantic Violations in Large Distributed Systems” https://t.co/9ipf0ue211 from folks at JHU.
— Marc Brooker (@MarcJBrooker) July 18, 2022
Most failures in distributed systems are obvious: crash, timeout, error code, exception. Those can be remedied. But there is another group of errors that may not cause immediate disruption, but instead, screw things up for the end user. For instance, a distributed file system may not make as many copies of data as it was directed to.
These are errors that violate the semantics of the system, in the researchers’ language. They can arise from improperly-implemented API calls and public methods, buggy command line interface commands or even incorrect configuration settings. And they can be devastating. A component that doesn’t do what is expected of it can break all the components that rely on it. “Since the violation is silent, the damage exacerbates over time,” the researchers write. The longer the corrupt file system runs, the more corrupt files it generates.
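To make the idea of a silent semantic violation concrete, here is a minimal Python sketch of the replication example. It is not the researchers’ tool: the rule is written by hand rather than inferred from past failures, and `ToyFileSystem`, `check_replication` and `checked_write` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    """Hypothetical in-memory stand-in for a distributed file system."""
    nodes: int = 3
    buggy: bool = False
    # Maps a file path to the set of node ids that hold a copy.
    replicas: dict = field(default_factory=dict)

    def write(self, path: str, replication: int) -> None:
        # The buggy variant silently stores one copy fewer than requested
        # and still returns normally: no crash, no error code, no exception.
        copies = replication - 1 if self.buggy else replication
        self.replicas[path] = set(range(min(copies, self.nodes)))


def check_replication(fs: ToyFileSystem, path: str, expected: int):
    """Semantic rule (hand-written assumption): a written file has
    exactly the number of copies that the caller asked for."""
    actual = len(fs.replicas.get(path, ()))
    if actual != expected:
        return f"violation: {path} has {actual} replicas, expected {expected}"
    return None


def checked_write(fs: ToyFileSystem, path: str, replication: int):
    """Perform the write, then enforce the rule at runtime."""
    fs.write(path, replication)
    return check_replication(fs, path, replication)


if __name__ == "__main__":
    print(checked_write(ToyFileSystem(), "/logs/a", 3))
    print(checked_write(ToyFileSystem(buggy=True), "/logs/a", 3))
```

The buggy write returns without complaint, so nothing upstream notices. Only the explicit check against the requested replication factor surfaces the problem.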
Most distributed systems worth their salt are tested regularly (and maybe even chaos tested as well), so many admins feel their systems are largely free from such errors. The researchers found that this is not the case, however. They found silent errors haunting 36% of the systems they studied, even the well-tested ones. The more features are added to a system, the more likely such errors are to build up. Often the failure lies within a single component. Routine maintenance operations can often bring these errors out.
For a full rundown of what could go wrong, check out the paper.
There were a few red flags in my Etsy interview which I was not conditioned to notice or emphasize at the time. It was scheduled for 3PM, but didn’t start until 3:30 because my interviewer was “really hung over.” 10/
— Dan McKinley (@mcfunley) July 21, 2022
This Week in Programming
I don’t think any programming language unveiled in 2022 should lack memory safety. We have to move on from the “it must be as fast as unsafe C” mindset, because the engineering cost of unsafe-by-default is so very high. Nice syntax and whiz-bang features don’t make up for it.
— Doug Gregor (@dgregor79) July 21, 2022
- Engineer Uses Machine Learning to Predict the Location of His Cat: If machine learning continues to struggle with predicting human behavior, then it may have better luck predicting the actions of other creatures. As captured by TowardsDataScience, Welsh engineer Simon Aubury built an ML-based system to predict the location of his cat, Snowy, as the feline moves in and around the house through the day. He used a Bluetooth tracker and six receiving nodes planted about the home (Snowy was, helpfully, “ambivalent to data privacy concerns”) to capture hourly updates of Snowy’s whereabouts. Home Assistant captured temperature, humidity and rainfall observations, and moved that info into a database using dbt. He created a model with Python’s scikit-learn and built the app using Streamlit. The app would predict the location of Snowy, given the time of day and the average indoor and outdoor temperatures of that time period.