Etcd is a distributed key-value store for shared configuration and service discovery. Google's Kubernetes, Cloud Foundry and Red Hat all use etcd.
As Blake Mizerany of CoreOS wrote for The New Stack, “Etcd serves as the backbone of distributed systems by providing a canonical hub for cluster coordination and state management — the system's source of truth. While etcd was built specifically for clusters running CoreOS, etcd works on a variety of operating systems including OS X, Linux and BSD.”
The etcd service is used in Tectonic, a new platform from CoreOS that combines its portfolio and Kubernetes — Google’s open source project for managing containerized applications. At CoreOS Fest last week, Intel announced they were developing a Tectonic stack. Intel will make it commercially available through Redapt and Supermicro when CoreOS makes the Tectonic stack generally available.
Understanding the Problem
To make the Tectonic stack a reality, Intel had to understand a particular scaling problem that the Kubernetes community had commented on: Kubernetes had scale limits.
How they solved the problem shows the way new technologies are not always the answers to the challenges posed by distributed architectures. Sometimes, it’s the technologies that were developed decades ago that make the difference.
Google developed Kubernetes, an orchestration system, basing it on Borg — its own container system that manages just about everything at Google. Tectonic is a new service developed by CoreOS that combines its OS and components with Kubernetes. The combination is intended to make Google's style of infrastructure available to any enterprise customer that needs to scale its operations.
CoreOS, which last month received $12 million from Google Ventures, has made its mark with a container-centric Linux distribution for large-scale server deployments and its secure, distributed platform for auto-updating servers. Much as the Chrome browser updates automatically, so does CoreOS for Linux server deployments. It’s used by companies such as MemSQL, Rackspace and Atlassian.
Etcd uses the Raft consensus algorithm. The Raft consensus algorithm web page describes consensus as a fundamental problem in fault-tolerant distributed systems: multiple servers must agree on values, and once a decision is made about a value, the decision is final:
Typical consensus algorithms make progress when any majority of their servers are available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).
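The majority rule quoted above is simple arithmetic: a cluster of n servers needs n/2 + 1 of them to agree, so it can tolerate the failure of the remainder. A minimal sketch of that math (the function names here are my own, not from etcd):

```go
package main

import "fmt"

// quorum returns the minimum number of servers that must agree
// for a cluster of size n to make progress under majority consensus.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many servers can fail while the
// cluster still retains a quorum.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5, 9} {
		fmt.Printf("cluster of %d: quorum %d, tolerates %d failures\n",
			n, quorum(n), faultTolerance(n))
	}
	// A cluster of 5 has a quorum of 3 and tolerates 2 failures,
	// matching the example in the quote above.
}
```

Note that adding servers raises fault tolerance but also raises the quorum size, which is one reason small odd-sized clusters are preferred.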
A paper written by the Raft creators provides a more detailed analysis of the algorithm.
In etcd, the Raft consensus algorithm is most efficient in small clusters — between three and nine peers. For clusters larger than nine peers, etcd selects a subset of instances to participate in the algorithm in order to keep it efficient.
According to CoreOS, when writing to etcd, peers redirect the write to the leader of the cluster, which immediately replicates it back to the peers:
A write is only considered successful when a majority of the peers acknowledge the write.
As described by CoreOS, that means in a cluster of five peers, a write operation is only as fast as the third fastest machine. Leaders are elected by a majority of the active peers before cluster operations can continue. With this process, write performance becomes an issue at scale: the peers must acknowledge the leader before a cluster operation can continue, which is costly in high-latency environments, such as a cluster spanning multiple data centers.
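The “third fastest machine” observation follows directly from the quorum rule: a write commits when the quorum-th fastest peer has acknowledged it. A hedged sketch (the latency figures and function name are illustrative, not measurements):

```go
package main

import (
	"fmt"
	"sort"
)

// commitLatency returns when a write is considered successful:
// the leader needs acknowledgments from a majority of peers, so the
// write commits once the quorum-th fastest peer has responded.
func commitLatency(ackMillis []int) int {
	sorted := append([]int(nil), ackMillis...)
	sort.Ints(sorted)
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// Hypothetical ack latencies (ms) for a five-peer cluster where
	// one peer sits in a distant data center.
	acks := []int{2, 5, 9, 40, 120}
	fmt.Println(commitLatency(acks)) // commits at the third-fastest peer: 9
}
```

The slowest peers drop out of the critical path entirely, which is why a single distant replica does not slow writes — but a majority of slow disks or links does.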
Intel’s team under Nicholas Weaver, director of emerging technologies, looked to the Raft protocol as the first place they might find a bottleneck.
They discovered in the etcd code that every inbound entry to a follower was written to disk before being acknowledged to the leader. Weaver, who has a background in storage, looked at the volume of entries and recognized how expensive things could get at scale. The likely reason: a requirement for what looked like “stable storage.”
The concept of stable storage runs deep in tech history. According to Wikipedia, “stable storage is a classification of computer data storage technology that guarantees atomicity for any given write operation and allows software to be written that is robust against some hardware and power failures. To be considered atomic, upon reading back a just written-to portion of the disk, the storage subsystem must return either the write data or the data that was on that portion of the disk before the write operation.”
Stable storage offers a view back to the late 1980s, when demand for mirrored data grew, driven largely by the cost of the mainframes, the massive computers of the day. Again from Wikipedia, the RAID controller served as a way to implement the disk-writing algorithms that allowed disks to act as a means of stable storage.
But herein lies a problem with the requirement to acknowledge all the writes that comes with the Raft protocol. That’s where DRAM enters the picture, and more specifically, Asynchronous DRAM Refresh (ADR), a feature that, according to Intel, automatically flushes memory controller buffers into system memory and places the DDR into self-refresh mode in the event of a power failure.
To reduce the latency impact of storing to disk, Weaver’s team looked to buffering as a means to absorb the writes and sync them to disk periodically, rather than for each entry.
Tradeoffs? They knew memory buffers would help, but there would be potential difficulties with smaller clusters if they violated the stable storage requirement.
Instead, they turned to Intel’s silicon architects to ask about features available in the Xeon line. After they described the core problem, it turned out it had already been solved in other areas with ADR. After some work to prove out Linux support for the feature, they were confident they had a best-of-both-worlds solution.
And it worked. As Weaver detailed in his CoreOS Fest discussion, the response time proved stable. ADR can grab a section of memory, persist it to disk on power failure and restore it to the buffer when power returns. It provides the ability to make small (<100MB) segments of memory “stable” enough for Raft log entries, which means no battery-backed memory is needed. It can be orchestrated using Linux or Windows OS libraries. ADR allows software to define the target memory region and determine where to recover, and it can be exposed directly into libraries for runtimes like Golang. And it uses silicon features that are accessible on current Intel servers.
Using the new capability, Intel did its work with Raft and etcd in the lab. Testing a five-node etcd cluster, they found the maximum write rate without ADR was about 4,000 to 5,000 writes per second. With ADR, etcd could handle about 10,000 writes per second, essentially doubling the throughput.
What This All Means
Weaver and his team have demonstrated how the state of a single machine has much in common with distributed systems. The difference comes in how the state gets managed, and in some respects it comes down to what gets monitored. Until recently, the most valuable tools monitored the machines that existed in client/server environments.
I talked about this topic last week with SignalFx Founder and CEO Karthik Rau. In our conversation we discussed how the behavior of containers requires people to collect and analyze data so that applications work correctly across clusters.
That, more than anything, means a change in how people communicate. Apps are at the center of the universe. Increasingly, the compute will swarm to the data, automated and orchestrated through microservices architectures. The services will consist of disposable containers that are portable, connected to git environments and the rest. The way containers are programmed and behave on these architectures means people will need different communication patterns themselves. That’s a business issue that cannot be solved with org charts. The answers to questions about speed and latency will surface in the data.
CoreOS, Rackspace, Red Hat and SignalFx are sponsors of The New Stack.
Feature image via Flickr Creative Commons.