Making Bare Metal Fly with Tinkerbell Provisioning and Management
Equinix sponsored this post.
A few years ago, everyone was heading to the public cloud with abandon. And with good reason: Amazon Web Services started a revolution by making it possible to build, deploy and operate applications at any scale using declarative APIs, and on a “pay as you go” model, no less! But what we’ve learned over the past few years is that cloud doesn’t work for all use cases, and we’re witnessing a rise in hybrid cloud and a growing focus on the edge.
As hybrid multicloud becomes the architecture of choice for digital businesses, there are challenges to address along with the opportunities. These include the push for increased performance and lower latency, along with improved security and lower costs.
With these factors in mind — and bolstered by cloud native and open source concepts — bare metal is rising to the top of the infrastructure conversation in a way I haven’t seen in a decade.
Why Bare Metal?
Bare metal is just a fancy name for dedicated, physical infrastructure. And if dedicated infrastructure had a reputation in the 2010s, it was for being CAPEX heavy, slow to deploy, difficult to manage, and inflexible to operate. Why choose metal over a seemingly more straightforward (omnipresent) public cloud option?
In a word: control.
Bare metal offers the ultimate in control. Depending on how nerdy you want to get, with bare metal you can control just about every aspect of your infrastructure: the hardware specifications, of course, but also the tenancy and all aspects of the software running on the machine. Think beyond the virtualization layer and consider elements like the BIOS, SGX, and SR-IOV.
Most people invest in bare metal to drive two outcomes: lower price and better performance. Single tenancy is a big driver of each of these, resulting in a more stable application that can consume 100% of the hardware resources. Of course, this kind of optimization only becomes necessary once your use case reaches some reasonable scale, but when it does, wow, it can really have an impact.
Increasingly, leading companies are turning to bare metal for even more distinct advantages, especially around the latest and greatest advancements in silicon. From new chips and ML cards to SmartNICs and faster memory, having access to cutting-edge technology can drive incredible benefits. Just thinking about my new MacBook with optimized Apple Silicon, it’s easy for me to see why having more control over your computers can pay off, especially at scale.
Deploying Physical Hardware Is Hard
Back in the day, installing software on my home PC involved a stack of floppy disks (or CDs) and a few hours of spare time. That wasn’t fun when I had to do it on one computer, but it’s even more painful to think about how hard it would be with a few hundred servers in a data center. No wonder we were all so anxious to move to the cloud!
Unfortunately, the technology stack for managing physical servers hasn’t improved all that much over the last few decades.
Let’s just think about getting an operating system on a server, and leave aside all the other pesky bits — like firmware management and getting IP addresses. The first bit of technology you’ll need (DHCP) was introduced in 1993… the same year that Intel shipped its first Pentium chip and Bill Clinton became President in the U.S. The second (PXE, or a Preboot Execution Environment) was introduced in the late 1990s.
It goes to show how long we’ve been using the same old technologies for managing servers.
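To make those two technologies a little more concrete: a network boot hinges on a handful of DHCP options that tell the booting machine where to fetch its bootloader. The Python sketch below packs and unpacks options in the code/length/value layout that DHCP uses, with option 53 (message type), option 66 (TFTP server name) and option 67 (bootfile name). The function names, the address and the bootfile name are illustrative assumptions, and a real DHCP packet carries a fixed header in front of these options that is left out here.

```python
from typing import Dict

OPT_PAD = 0            # single-byte padding option
OPT_MESSAGE_TYPE = 53  # DHCP message type (2 = OFFER)
OPT_TFTP_SERVER = 66   # TFTP server name
OPT_BOOTFILE = 67      # bootfile name
OPT_END = 255          # marks the end of the options field


def encode_option(code: int, value: bytes) -> bytes:
    """Encode one option as: code byte, length byte, value bytes."""
    if not 0 < code < 255:
        raise ValueError("option code must be between 1 and 254")
    if len(value) > 255:
        raise ValueError("option value is limited to 255 bytes")
    return bytes([code, len(value)]) + value


def build_pxe_offer_options(tftp_server: str, bootfile: str) -> bytes:
    """Hypothetical helper: the options an OFFER might carry for a network boot."""
    return (
        encode_option(OPT_MESSAGE_TYPE, bytes([2]))
        + encode_option(OPT_TFTP_SERVER, tftp_server.encode("ascii"))
        + encode_option(OPT_BOOTFILE, bootfile.encode("ascii"))
        + bytes([OPT_END])
    )


def decode_options(data: bytes) -> Dict[int, bytes]:
    """Walk the options field and return a mapping of code to raw value."""
    options: Dict[int, bytes] = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option header")
        length = data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option value")
        options[code] = value
        i += 2 + length
    return options


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; the bootfile name is an example.
    packed = build_pxe_offer_options("192.0.2.10", "undionly.kpxe")
    for code, value in decode_options(packed).items():
        print(code, value)
```

Even in this toy form, it is easy to see where fragility creeps in: one wrong bootfile name or an unreachable TFTP server, and the machine never gets past its firmware.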
Making Hardware Feel Like Cloud
It was always Packet’s (now Equinix Metal) vision to help more people take advantage of physical infrastructure. A big part of the story was based on a future that they believed would be built on increasingly diverse hardware, in vastly more locations. The main engineering challenge was to make it 100% automated, so that a developer — or more likely the software she wrote — could consume it.
Basically, Packet wanted to provide a “cloud-like” experience for physical infrastructure, no matter what it looked like, where it lived, or who owned it. Heterogeneity meant that the team needed a really flexible and extensible way to normalize and automate the entire lifecycle of physical hardware. That’s how Tinkerbell and its related components were born (to run Packet and now Equinix’s bare metal as a service), and why we announced it last year in The New Stack and then open sourced the project in May of 2020.
The goals are simple: bring a modern, cloud native approach to lifecycling diverse physical infrastructure at scale. By modernizing existing technologies, and exposing them through an API, Tinkerbell helps a new generation of developers unlock the value of hardware programmatically.
Why is this Still a Problem?
I’ve always believed that before you can automate or improve something, you need to understand how it works. While making servers turn on and off isn’t rocket science, it is fairly niche knowledge.
Alex Ellis does a great job of breaking down the key terms and processes in a TNS post from last April: “Bare Metal in a Cloud Native World.” Pay special attention to DHCP, TFTP and PXE. While the process is not complex on its face, in reality it is fragile for a few key reasons.
First is the chaos introduced through the variety of hardware. A lot of challenges can be mitigated by working with the same exact systems. Diversity is what makes this complex. Each change in hardware components or configuration (let alone firmware, operating systems and other software aspects) brings a new “workflow” to the situation.
Second, unlike software-based environments, we are interacting with physical systems. Power cables come unplugged, disk drives fail, and pushing the “reset” button is hard to do when you’re a few thousand miles away. This is a paradigm shift that asks developers to “expect the unexpected” and write software in a more failure-ready way.
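One common way to write software in that failure-ready style is to retry flaky operations with an increasing delay between attempts. The Python sketch below is a generic illustration of that idea and is not taken from Tinkerbell; the `retry` function and its parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as err:  # a real caller would catch narrower errors
            last_error = err
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before trying again.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A power-on request to a machine thousands of miles away could be wrapped in `retry(...)`, so that a brief network blip does not fail the whole provisioning run. Passing in `sleep` as a parameter keeps the function easy to test without real waiting.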
How Tinkerbell Works
Tinkerbell improves upon this state of affairs, mainly by bringing a cloud native approach to foundational technologies like DHCP — while adding a workflow engine that helps to embrace and normalize diverse hardware.
There are five components to Tinkerbell, which automatically deploy during setup:
- Boots – Boots reimagines, in Go, the components needed to “boot” a server remotely: primarily DHCP and PXE, but also TFTP and iPXE.
- OSIE – OSIE stands for Operating System Installation Environment. It consists of an Alpine Linux based netboot image, which fetches a prebuilt Ubuntu container that does the actual installation of an operating system.
- Tink – Tink is a workflow engine that consists of a server and a CLI, which communicate over gRPC. The CLI is used to create a workflow and its building blocks: templates and targeted hardware. Basically, once OSIE is running, it reaches out to Tink to see what it needs to do. The Tink server will then send a declarative YAML file, telling it what to run.
- Hegel – The problem with brand new servers is that when they come “alive” they don’t know who they are, what else they can talk to, or even what their IP addresses are. Hegel is a simple metadata service that provides this information back to the server once it comes up.
- PBnJ – Power and Boot commands are handled by the “PBnJ” service, which talks to the BMCs (Baseboard Management Controllers) to perform these critical actions.
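The relationship between templates, targeted hardware and workflows can be sketched in a few lines of Python. This is a toy in-memory model and not the Tink API: the class names, field names, container image names and the use of a MAC address as the hardware key are all assumptions made for illustration, and the real services talk over gRPC rather than through direct method calls.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Action:
    """One step of a template (hypothetical fields)."""
    name: str
    image: str       # container image that performs the step
    timeout_s: int


@dataclass(frozen=True)
class Template:
    """A named, ordered list of actions."""
    name: str
    actions: Tuple[Action, ...]


@dataclass
class WorkflowServer:
    """Toy stand-in for a workflow server; state lives in memory."""
    templates: Dict[str, Template] = field(default_factory=dict)
    workflows: Dict[str, str] = field(default_factory=dict)  # MAC -> template

    def register_template(self, template: Template) -> None:
        self.templates[template.name] = template

    def create_workflow(self, mac: str, template_name: str) -> None:
        """Pair a template with the hardware it targets."""
        if template_name not in self.templates:
            raise KeyError(f"unknown template: {template_name}")
        self.workflows[mac.lower()] = template_name

    def actions_for(self, mac: str) -> List[Action]:
        """What a freshly booted machine would be told to run, in order."""
        template_name = self.workflows.get(mac.lower())
        if template_name is None:
            return []
        return list(self.templates[template_name].actions)


if __name__ == "__main__":
    server = WorkflowServer()
    server.register_template(
        Template(
            name="install-os",
            actions=(
                Action("wipe-disk", "example/wipe-disk", 300),
                Action("write-image", "example/write-image", 900),
            ),
        )
    )
    server.create_workflow("AA:BB:CC:DD:EE:FF", "install-os")
    for action in server.actions_for("aa:bb:cc:dd:ee:ff"):
        print(action.name, action.image)
```

Keeping the template separate from the hardware it targets is what lets one set of steps be reused across many machines, while a different hardware type simply gets paired with a different template.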
These five microservices work together to provision a bare-metal machine. To see a deep-dive, check out this live stream with Gianluca Arbezzano and Jeremy Tanner.
Entering the Sandbox
Tinkerbell recently entered the Cloud Native Computing Foundation as a Sandbox Project. This is an important move for the project, as it expands the community and shows our commitment to governance and transparency. We hope and expect that this will invite more developers and companies to lean into the project, adding their use cases, questions and contributions along the way.
But no need to wait! Tinkerbell is all open source today, and while the project is still fairly young, the community around it is strong. In addition to the upstream version, Tinkerbell also has a sandbox for you to play with. It’s built using Docker components and starts everything up for you as a full service.
Need help or have a question? Join the Tinkerbell Slack to geek out with the Equinix Metal team and other contributors, or give us some feedback on what you’d like added by submitting a proposal.
Amazon Web Services and the Cloud Native Computing Foundation are sponsors of The New Stack.
Feature image via Pixabay.