Performant and Programmable Telco Networking with eBPF
To keep the world connected, telecommunication networks demand performance and programmability to meet customers when and where they are, from streaming the winning goal of the World Cup to coordinating responses to the latest natural disaster.
When switchboards were still run by human operators, telco companies relied on custom hardware: “black boxes” from vendors that provided the speed the network needed. These black boxes determined the performance of the network, which also made performance dependent on where they were physically deployed.
As telcos moved from traditional phone calls to additional services like messaging and mobile data, the demands on the network pushed the boundaries of what was possible. Network Functions Virtualization (NFV) sought to allow telcos to use “white box” commodity hardware to scale out throughput and increase flexibility.
Technologies like the Data Plane Development Kit (DPDK) were developed to accelerate networking and bring it closer to the performance of custom hardware. By reimplementing large portions of the Linux networking stack in userspace, DPDK achieved better throughput and lower packet-processing latency, and made it easier to add functionality than trying to get it upstreamed into the kernel. However, the two trends were at odds: the flexibility of commodity hardware and the performance of kernel bypass pulled in opposite directions.
The performance of the boxes came to depend on per-host fine-tuning, which is hard to maintain at scale without absolutely uniform servers; yet absolute uniformity is itself impossible at scale, creating an operational nightmare. Telcos were left with the worst of both worlds: cloud complexity without cloud benefits like flexibility and scalability.
Now, with consumers online 24/7 and every company moving to the cloud to deliver their services, what telcos need to deliver and what they can rely upon have drastically changed yet again. The rise of cloud native approaches to building scalable distributed systems has put the industry at a critical juncture: it needs cloud native infrastructure but is still stuck with the baggage of NFV, DPDK, and related technologies.
The next generation of telco networking needs to weave together performance, flexibility, and operational scalability, which raises the question of whether we can finally deliver the vision of performant and programmable networks everywhere. Learning from these past technology transitions, we can see that the key is high-performance technology that is actually available everywhere. Enter eBPF.
eBPF is a Linux kernel technology with roots in networking. It allows users to programmatically extend the functionality of the Linux kernel in a safe, performant, and integrated manner. Better still, eBPF is part of the Linux kernel itself, so it is available anywhere a reasonably modern kernel (4.18 and above) is running.
Instead of re-routing traffic to wherever functionality happens to be implemented, eBPF enables efficient packet processing everywhere. eBPF is already transforming telco networks because it provides flexibility for integrating different protocols like SCTP, programmability for reducing operational complexity through techniques like NAT46/64 and SRv6, performant load balancing through XDP, and complete observability to see where the bottlenecks are.
With eBPF, telcos may finally be able to deliver the performant, programmable networking that they have been driving toward for so long without getting tangled in an operational nightmare. Telco vendors, as critical players in the value chain, now have a unique opportunity and a network-level toolset to modernize their Network Functions and make them suitable for the cloud native world with eBPF.
Bringing Network Function Virtualization up to Speed
As telcos began the transition from specialized hardware boxes to VNFs running on top of “white box” commodity hardware, their performance and operational considerations needed to change to address this new paradigm.
With dedicated devices like packet processors, DSPs, and FPGAs, key network performance characteristics like throughput (bits and packets per second), latency, and jitter were more or less consistent and predictable. When relinquishing the performance advantages of “bare metal” and dedicated devices, telcos still needed to keep the performance characteristics of the network up to speed — even when scaling horizontally — to keep pace with demand and customer expectations.
Telcos now needed compensatory measures to address these crucial networking parameters. The Data Plane Development Kit (DPDK) was created as a collection of building blocks for composing high-performance data processing, with a strong focus on network packet processing.
Single Root I/O Virtualization (SR-IOV) allowed for “splitting” a single PCI device into multiple virtual functions. Thus, instead of sharing a common device, multiple processes could each get their own mini version of the original PCI device. Direct PCI assignment enabled detaching a PCI device from the kernel so that a process (or virtual machine, or container) could operate it directly.
Combining these three concepts allowed for building powerful data planes that resembled the operation of a traditional packet switch. On top of that, combining them with the pre-existing control planes from vendors created a virtualized router (a VNF). These became (and still are) a huge topic in conferences, talks, and discussions.
Provisioning efficient networking for KVM virtual machines became a hot area, including special functionality in KVM itself, QEMU, libvirt, and finally OpenStack (through Neutron ML2 plugins). The OPNFV project was also spun up to explore performance in the virtualized world with theoretical references, measurements, and test suites.
NFV Operations Still Running on Spreadsheets
While these technologies have been effective in bringing the performance of NFV closer to bare metal, they also impose significant operational burdens. Provisioning efficient networking for virtual machines requires meticulous orchestration and management of resources at scale.
More often than not, this “orchestration” is still not automated and remains a very labor-intensive process. It is not uncommon to see Excel sheets mapping PCI device addresses to server names. During normal operations this is a manual and error-prone task; during a network incident it becomes a hopeless maze of cells and rows standing between operators and restored connectivity.
Attempting to automate these details spawned projects like Open Network Automation Platform (ONAP) and Airship but still left the network performance dependent on which server was running what software. And even with these projects, there are too many details to orchestrate when done at a telco scale across thousands of locations.
Trading performance for operational complexity left telcos between a rock and a hard place: they needed to scale their networks to keep up with demand, but scaling out impacted performance and introduced additional operational complexity. Most VNFs were poorly ported versions of black-box software that was never designed to run in the cloud.
Vendors were caught flat-footed, and rewriting the entire stack from scratch was too expensive. A performant and programmable solution was needed across the network, one that worked wherever customers and coverage needed to go.
NFV Collides with Cloud Native
At this crucial inflection point, another technological revolution was happening in the IT world with the rise of cloud and cloud native computing. Containers and container orchestration were popping up faster than you could say Kubernetes, and workload IP addresses were changing faster than people switching apps on their phones.
In contrast to OpenStack, Kubernetes came from the world of hyperscale cloud providers and had fundamentally different assumptions about how the network would look than the telco world was used to. Instead of complex topologies and multiple interfaces, Kubernetes comes with a flat network where every pod should be able to communicate with every other pod without NAT.
Trying to mesh this model with the expectations of telco networks led to the creation of projects like Multus and an SR-IOV container network interface. However, rather than accelerating the transformation of telco networks, these technologies merely replicated the NFV model — PCI device IDs, CPU pinning, and the like — in the cloud native world, thus missing the benefits of both. To really go cloud native at their scale, telcos needed a way to decouple and abstract their workloads from the hardware details.
Enter eBPF — Accelerating and Simplifying Networks Everywhere
Telcos now need the performance of bare metal while being decoupled from the underlying hardware details, and they need it in an increasingly dynamic world without increasing operational complexity.
Rather than requiring changes to the Linux kernel itself, eBPF can modify and accelerate networking while remaining seamlessly integrated with the Linux networking stack; best of all, it is already part of the kernel and available everywhere. With eBPF, telcos can achieve a versatile and highly efficient implementation of packet processing, improving throughput and performance without the operational overhead of figuring out where it is available.
By remaining part of the kernel and integrating with it, rather than bypassing it as DPDK did, or pinning pods to specific nodes with specific PCI devices pinned to specific CPUs, eBPF can take advantage of the existing kernel networking stack while also modifying, accelerating, and making it more efficient when needed.
Because eBPF is part of the Linux kernel, it is already available everywhere, allowing telcos to commoditize network scale-out while still keeping solid performance. In the cloud native era, eBPF offers a promising avenue to enhance networking capabilities while streamlining operations for telcos, resulting in a more manageable and scalable networking environment. It also offers an avenue for telco vendors to modernize parts of their network functions that are hardcoded to SR-IOV, DPDK, or even particular hardware, enabling them to work without having to worry about the underlying infrastructure.
Observability for Day and Decade 2
Since networks are always on and can take decades to retire, telcos also have to think about the day 2 and decade 2 operational concerns of their infrastructure. Physical boxes and VNFs had tools that worked for static, hardware-dependent environments, but those tools can no longer keep pace in the cloud native world. With Kubernetes, infrastructure becomes much more dynamic and unpredictable, so having a good operations and observability story goes from a “nice to have” to mandatory.
In contrast to previous tools, eBPF is part of the kernel itself, which means that no modifications to applications or infrastructure are needed to get good system, network, and protocol observability.
Rather than having to ask each vendor to instrument or modify their network functions, eBPF provides telcos complete visibility, massively lowering the operational hurdle to get started. And with eBPF everywhere, observability now comes out of the box, even for legacy systems that can’t or won’t be updated. With eBPF, telcos can gradually retire the complex proprietary network and protocol tracing systems that are still significant operational cost drivers.
For telco networks, eBPF can provide the panacea of improved performance, simplified operations, and complete visibility that cloud native demands while still supporting existing systems.
Telco Networking in the Real World with eBPF
If eBPF seems too good to be true, let’s look at a few examples of how it is transforming networks in the real world today: integrating different protocols, supporting dual stack and IPv6, and increasing load balancing performance.
As a networking technology, eBPF is well-positioned to support telco workloads because it works at the packet rather than the protocol level. Cilium, a networking project based on eBPF, was easily able to add support for SCTP, and eBPF can do full GTP/GPRS protocol parsing even though the Linux kernel does not fully understand the protocol itself.
The world is also in transition from IPv4 to IPv6. Once again, by understanding the packets, eBPF is able to translate seamlessly between IPv4 and IPv6. This also allows it to support advanced telco networking topologies like SRv6, enabling telcos to add value to their network. By processing packets in the kernel rather than transferring the information to user space, eBPF also offers low-CPU-cost observability.
Finally, with XDP, which processes packets before they even hit the networking stack, eBPF is able to provide extremely high-performance load balancing. By switching to eBPF, Seznam was able to reduce CPU consumption by 72x, and Cloudflare does all of its DDoS protection with eBPF, allowing it to drop over 8 million packets per second with a single server. eBPF running on SmartNICs/DPUs also allows them to be programmed in a standardized way rather than being tied to a specific vendor interface. eBPF is transforming networks today, not just a promise for the future.
Performant and Programmable Telco Networking with eBPF
For telcos seeking to enhance their networking capabilities and streamline operations for the cloud native era and beyond, embracing eBPF presents a compelling opportunity. It offers versatile and efficient packet processing, decoupled from hardware-specific details while integrating seamlessly with the Linux networking stack. Since eBPF is already available in their networks through the Linux kernel, telcos can leverage it today rather than search through spreadsheets to find out which server it is available on. They can improve throughput, reduce operational burden, and enhance visibility and observability, and that is just the start.