Facebook has finally run into the sorts of problems typical enterprises are used to: legacy hardware. For Facebook, being so large as to handle over 2.2 billion users means that by the time a new data center has been rolled out, the last one is out of date. With so many machines around the world built to various specifications, the teams at Facebook were encountering a great deal of painful out-of-memory errors (OOMs).
The problem was that while some machines could handle the growing Facebook stream of applications, others weren’t up to the task. Picking those specific machines out of the crowd and avoiding them wasn’t an option, as Facebook treats its servers like cattle, not like pets. The company has come up with a few solutions to this issue, however, and at the end of last week, the company released a link to a related project on GitHub.
While the actual art of Location-Aware Distribution, or LAD, as Facebook calls it, is still being formalized, the system is in use currently at the company. It takes the form of two components: a proxy agent installed on every machine, and a distribution server. The proxies contain configuration files which can be distributed laterally across nodes in a peer-to-peer fashion. The proxy nodes themselves arrange in a tree-based peer-to-peer grouping structure, allowing updates to trickle down through only the machines that need them.
Facebook built this new model after learning a lesson from Apache ZooKeeper: Separate metadata updates and distribution from content distribution. That’s not to say that this new peer-to-peer model is all honey and roses, of course. The Facebook team behind this work quickly learned that debugging such a system is quite difficult.
According to Facebook, “LAD is currently being deployed into production as the data distribution framework for our configuration management system. We are also evaluating other applications for large-scale content distribution.”
The social media giant has also been doing work at the kernel level, with the newly released oomd, the user-space out-of-memory killer. The traditional OOM killer built into Linux isn’t quite situationally aware enough to handle the problems Facebook was facing. Traditionally, the OOM killer in Linux will simply cut in and kill random processes it feels are infringing on the kernel’s RAM when things get tight. Facebook wanted a way to make this process friendlier, and more aware of the processes it was trying to kill. From Facebook’s blog posting on oomd, the team states that the project was built on top of two other major developments in the Linux community:
- Pressure Stall Information: PSI is a new utility, currently pending upstream integration, developed by Facebook engineer Johannes Weiner. PSI tracks three major system resources — CPU, memory, and I/O — and provides a canonical view into how the usage of these resources changes over time. PSI provides quantifiable measurements of overall workload performance by reporting lost wall time due to resource shortages. When PSI is deployed in production, we find that its metrics act as a barometer of impending resource shortage, allowing userspace to take proactive steps as needed.
- Cgroup2: This successor to cgroup is a Linux kernel mechanism that organizes processes hierarchically and distributes system resources along the hierarchy in a controlled and configurable manner. Oomd makes use of cgroup2’s sophisticated accounting mechanisms to ensure that each workload is behaving appropriately. For example, cgroup2 reports accurate resource consumption for each workload as well as process metadata. Cgroup2 also has a PSI interface that oomd uses.