
Conductor: Why We Migrated from Kubernetes to Nomad

23 Aug 2021 12:00pm, by Jonathan Cross and Carlos Robles

Conductor Technologies started its digital modernization journey with managed Kubernetes on a single cloud vendor. As our business continued to expand with multicloud offerings, we began to outgrow the existing platform and eventually migrated to HashiCorp Nomad. Here is the story of how and why we made the decision, and our experience with Nomad.

About Conductor

Jonathan Cross
Jonathan is lead software engineer at Conductor Technologies, working on backend with a concentration on rendering across multiple clouds.

As one of the largest cloud-based visual-effects rendering platforms, Conductor provides fast, secure, and efficient one-click service for a variety of movie studios, boutique VFX shops and independent artists. The Conductor rendering platform has been used to produce scenes for “Deadpool,” “Game of Thrones,” “Allied,” “Star Trek Beyond,” “The Walk,” “Pirates of the Caribbean: Dead Men Tell No Tales” and many other well-known movies, television shows and commercials. The visual effects (VFX) workflow is similar to software development, and just as tools like CircleCI automate and streamline software development processes, we help artists orchestrate and schedule their rendering tasks with a simple experience so they can focus on creative work instead of dealing with complex infrastructure technologies.

Platform Re-Architect with Managed Kubernetes

Carlos Robles
Carlos is senior DevOps engineer at Conductor Technologies, ensuring that the company builds and scales in a consistent, repeatable and secure way.

We decided to adopt Kubernetes when the company started to transform its legacy system running on virtual machines to a cloud native platform. We chose managed Kubernetes on the Google Kubernetes Engine (GKE) because we were primarily running on Google Cloud, and as a two-person platform team, we wanted to offload daily operational tasks to a managed service.

GKE is, in our opinion, the best managed Kubernetes service. It worked very well for our small- to medium-size deployment. We also gained substantial experience and knowledge through the rearchitecting journey building on GKE. These lessons helped us quickly adapt our platform to other orchestrators.

However, as our business grew and the platform reached a large scale with thousands of cloud instances, we started running into a series of issues. It became clear that we were outgrowing GKE for batch-related work, though we still use it to deliver our services. We eventually moved our core rendering applications to HashiCorp Nomad.

Reasons Behind the Switch

The key reasons we moved to Nomad include scalability, resource utilization, driver support and scheduling throughput. Let’s take a closer look at each one:

Scalability

We are heavy batch job users. The first major issue we ran into related to this job type was the GKE autoscaler. As customers’ workloads increased, we started to have incidents where pending jobs were piling up exponentially, but nothing was scaled up. After examining the Kubernetes source code, we realized that the default Kubernetes autoscaler is not designed for batch jobs, which typically have a low tolerance for delay. We also had no control over when the autoscaler started removing instances. The scale-down delay was set to 10 minutes as a static configuration, and the accumulated idle time increased our infrastructure cost because we could not rapidly scale down once there was nothing left to work on. (Companies such as Atlassian have run into similar problems with Kubernetes’ autoscaler and ended up creating their own tools as a workaround.)

We also discovered that the Kubernetes job controller, a supervisor for pods carrying out batch processes, was unreliable. The system would lose track of jobs and be in the wrong state. (Other batch users have experienced similar issues, as noted in this 2018 KubeCon talk.)

And there was another scalability issue. On the control plane side, we had no visibility into the size of the GKE clusters’ control plane (notably, GKE has recently begun providing logs with Stackdriver). As load increased, GKE would automatically scale up the control plane instances to handle more requests, but these upgrades often led to long wait times. During that time, even though existing pods kept running, we were blocked from making major changes to the cluster or submitting new jobs. The wait time slowed our business throughput. (Other users have reported similar experiences.)

All these issues eventually convinced us to develop our own in-house autoscaler.
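The article does not describe how the in-house autoscaler works. As a rough illustration of the two behaviors the section asks for (scaling up directly from the pending queue, and scaling down after a configurable idle period rather than a fixed 10 minutes), a minimal sketch might look like the following. Every name and default here (`ClusterState`, `desired_instances`, `idle_timeout_s`, and so on) is hypothetical and not taken from Conductor's code.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    pending_tasks: int    # tasks waiting for a slot
    running_tasks: int    # tasks currently executing
    instances: int        # instances currently in the group
    idle_seconds: float   # how long the cluster has had no work at all


def desired_instances(state: ClusterState,
                      tasks_per_instance: int = 1,
                      max_instances: int = 1000,
                      idle_timeout_s: float = 60.0) -> int:
    """Return a target instance count for a batch workload (illustrative only)."""
    demand = state.pending_tasks + state.running_tasks
    if demand > 0:
        # Size the cluster from total demand so a growing queue
        # translates into a larger target right away.
        needed = math.ceil(demand / tasks_per_instance)
        return min(needed, max_instances)
    # No work left: hold the current size until the idle timeout expires,
    # then release everything.
    if state.idle_seconds >= idle_timeout_s:
        return 0
    return state.instances


if __name__ == "__main__":
    print(desired_instances(ClusterState(500, 0, 0, 0.0)))
    print(desired_instances(ClusterState(0, 0, 20, 90.0)))
```

The point of the sketch is that both the scale-up trigger and the scale-down delay are ordinary parameters the operator controls, which is the control the section says was missing.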

Resource Utilization

GKE maintains a fairly large footprint on each node for running system-level jobs. While the overhead might not be obvious on large nodes, it consumes a significant portion of the total compute resources on small ones. On our small nodes, such as 2-core or 4-core machines, it was particularly painful to size customer workloads with only 60% to 70% of resources actually available. We started seeing a lot of out-of-memory alerts, even when nodes were not truly running out of memory. As our clusters scaled, this led to significant waste from resources reserved by the platform itself.

Driver Support

Our platform needs to support a broad range of rendering solutions and tools. Much GPU-based software requires the most up-to-date drivers, but we could not get the latest GPU capabilities through GKE because of outdated drivers. We also have Windows-based applications, and Kubernetes’ support for Windows Server containers is still primitive. We spent too much time finding workarounds for placing custom images. This eventually became a deal breaker for our GKE deployment.

Scheduling Throughput

We run a multitenant system to allow different customers to run their rendering workloads simultaneously. The deployment requests are submitted in rapid succession to accelerate processing. However, GKE limits API requests to 600 per minute, which slowed down our scheduling throughput.
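To see why a 600-requests-per-minute cap matters for batch submission, here is a back-of-the-envelope sketch. It assumes the quota behaves like a steady rate and that each task costs exactly one API request; both are simplifying assumptions, and the function name is hypothetical.

```python
def min_submit_seconds(num_requests: int, limit_per_minute: int = 600) -> float:
    """Lower bound on the time needed to issue num_requests under a per-minute cap.

    Assumes the cap is enforced as a steady rate (no bursting).
    """
    if limit_per_minute <= 0:
        raise ValueError("limit_per_minute must be positive")
    return num_requests * 60.0 / limit_per_minute


if __name__ == "__main__":
    # A 1,000-task job, one request per task, at 600 requests per minute.
    print(min_submit_seconds(1000))
```

Under these assumptions, a single 1,000-task job needs at least 100 seconds just to be submitted, and in a multitenant system that budget is shared by every customer submitting at the same time.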

The Turning Point — Mental Overhead

While we were running GKE clusters, we were also managing legacy VM-based applications on Google Cloud in the traditional way and had started expanding to Amazon Web Services. The effort required to manage GKE was so great that we did not even consider EKS at the time, since it would have required more hands-on work and additional operational cost. We chose AWS Batch instead for its simplicity. But because each platform presented a particular set of constraints and challenges, it soon became a daily headache to figure out different workarounds.

As the business continued to grow, the mental overhead of wrestling with multiple managed services became overwhelming. While we were excited to go multicloud and support more customers, it became clear that we needed to consolidate our deployment tools and workflows to scale efficiently. This led us to HashiCorp Nomad, an open source orchestrator with a different approach.

A Refreshing Experience with Nomad

Coming from the Kubernetes world, Nomad’s operational simplicity was a refreshing alternative. We were able to spin up a Nomad cluster over a weekend and move it to production in just two weeks. Customers were moved to Nomad production clusters in a month. This is the first time we’ve been able to deploy a new technology into production in only a few weeks, have it work and work well. The benefits we got include:

Faster Throughput

Nomad’s scheduling capability is so good that it actually caused an unexpected issue for us. Prior to Nomad, the time between stopping one job and starting the next one was long enough for us to perform a series of cleanup tasks. With Nomad, the lag was just 2-3 seconds, so fast that we had to rewrite our code to work more quickly.

Reduced Idle Time

Prior to using Nomad, cluster upgrade and management were operationally heavy tasks. In our experience, since the upgrade process could easily take hours, it was often better to destroy an entire Kubernetes cluster and rebuild it from scratch rather than run an in-place upgrade. And that doesn’t consider the even longer idle time when we rolled out a cluster in a different region.

With Nomad, we now have multiple options to upgrade the server groups, such as rolling updates with zero downtime. Since the Nomad agent is a single binary with one configuration file, we can easily pre-bake it into node images, which makes the Nomad client immediately available to schedule once we spin up a new cloud instance. This cuts down on the idle time that was previously spent downloading, configuring and waiting for the nodes to join the cluster.

Flexible Workload Support

We now have the flexibility and full control to support custom images required by Conductor’s rendering software. We can also orchestrate our legacy applications, Windows-based applications and containerized jobs with a single workflow. Nomad’s flexible and customizable device plugins let us enable more features from our state-of-the-art rendering software.

Cloud Agnostic 

Nomad is deployed and operated the same way in on-premises data centers and in public clouds. This lets us implement a unified orchestration solution for our enterprise customers in their own data centers or in any of the clouds that we support. Having a unified, single orchestrator drastically reduces our cognitive load when maintaining and troubleshooting multiple systems, tools and processes.

Benchmarking Nomad vs. GKE

To better understand the performance gain Nomad delivered, we performed a standard-case benchmark comparing Nomad against GKE. The test does not provide an absolute measure of speed, but it helped us gauge the solutions’ relative performance for Conductor’s use cases.

We used the canonical BMW car Blender demo by Mike Pan. The rest of the benchmark includes:

  • The same job with 1,000 frames (tasks).
  • The same instance type: n1-standard-4 on Google Compute Engine-managed instance groups.
  • The same in-house autoscaler implementation with the same settings for timings, intervals, cooldowns, metadata checks and constraints.

The following charts show the best of three runs on each orchestrator, with the standard deviation where applicable.

Submit to First Start (Minutes)

This metric measures the time from a Conductor job entering “pending” to the first confirmed render agent start, when the job becomes “running.” GKE takes multiple steps to create new instances via user data that we can’t control. That extends the wait time, which increases our costs since we don’t start billing our clients until our rendering agent has started. In contrast, Nomad’s machine image is customizable in our CI/CD process with Packer, resulting in 63% faster run results:

Orchestrator    Best Run Result    Standard Deviation
Nomad           1:59 minutes       15 seconds
GKE             3:14 minutes       17 seconds
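The 63% figure appears to correspond to the ratio of the two best-run times, meaning GKE's best run took about 63% longer than Nomad's. That reading is an inference from the numbers in the table, not something the article states. A short sketch of the arithmetic, with hypothetical helper names, shows both ways of expressing the gap:

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '1:59' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def compare(faster: str, slower: str) -> dict:
    """Express the gap between two durations as percentages."""
    f, s = to_seconds(faster), to_seconds(slower)
    return {
        # How much longer the slower run took, relative to the faster one.
        "slower_takes_longer_pct": (s - f) / f * 100,
        # How much time the faster run saved, relative to the slower one.
        "faster_saves_pct": (s - f) / s * 100,
    }


if __name__ == "__main__":
    result = compare("1:59", "3:14")
    print(round(result["slower_takes_longer_pct"]))  # 63
    print(round(result["faster_saves_pct"]))         # 39
```

Applying the same function to the time-to-completion results reported later (22:10 versus 31:29) gives about 42% by the first measure, which matches the figure quoted there.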

Average Task Completion (Minutes)

This metric shows the average time from Conductor “start” to “finish” per task or frame. Nomad performs slightly better due to being able to acquire more CPU and memory when creating the cgroup. GKE’s percentage-based reservations block us from using some of the machine resources, creating a slightly longer average runtime. We believe a self-managed Kubernetes cluster would match Nomad’s results.

Orchestrator    Best Run Result    Standard Deviation
Nomad           2.49 minutes       N/A
GKE             2.73 minutes       N/A

Time to Completion (Minutes)

This metric covers the total time from a Conductor job entering “pending” to the completion of all 1,000 frames (tasks). With the accumulated user-data overhead on hundreds of instances, GKE takes considerably longer to finish the entire job. Nomad, on the other hand, benefits from being pre-baked into the machine image, and Nomad’s batch scheduler is optimized to rank instances rapidly using the power of two choices described in Berkeley’s Sparrow scheduler. Together that leads to a 42% faster best run result:

Orchestrator    Best Run Result    Standard Deviation
Nomad           22:10 minutes      48 seconds
GKE             31:29 minutes      4:01 minutes
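For readers unfamiliar with the power of two choices mentioned above, the general idea is to sample two candidate nodes at random and place the task on the better of the two, instead of scoring every node in the cluster. The sketch below illustrates that generic technique only. It is not Nomad's scheduler code, and using a simple task count as the load measure is an assumption made for illustration.

```python
import random
from typing import Dict, List


def pick_node(load: Dict[str, int], rng: random.Random, choices: int = 2) -> str:
    """Sample `choices` nodes at random and return the least-loaded one."""
    if not load:
        raise ValueError("no nodes available")
    candidates = rng.sample(list(load), k=min(choices, len(load)))
    return min(candidates, key=lambda node: load[node])


def place_tasks(nodes: List[str], num_tasks: int, seed: int = 0) -> Dict[str, int]:
    """Place tasks one at a time and return the resulting per-node task counts."""
    rng = random.Random(seed)
    load = {node: 0 for node in nodes}
    for _ in range(num_tasks):
        load[pick_node(load, rng)] += 1
    return load


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(100)]
    load = place_tasks(nodes, 1000)
    print(min(load.values()), max(load.values()))
```

Each placement inspects only two nodes, so the cost per decision stays constant as the cluster grows, while the comparison between the two candidates keeps the load far more even than purely random placement would.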

Peak Instance Count

Our final metric tracks the peak instance count of the cluster while running this job. Combined with the total runtime of the benchmark, a smaller cluster that finishes faster significantly reduces overhead costs for us and our customers. Nomad consistently required fewer instances than GKE to complete the same tasks, with a standard deviation of about 20 instances. While GKE’s best result was close to Nomad’s, its standard deviation was almost five times larger.

Orchestrator    Best Run Result (instances)    Standard Deviation (instances)
Nomad           435                            20.22
GKE             469                            100.43

Nomad vs. Managed Kubernetes

You might assume that with Conductor’s lean operations team, the default system migration or modernization solution would be to rely on managed Kubernetes to eliminate as much operational overhead as possible.

But with the open source version of Nomad, we were able to engineer our own solution that replicates the managed-service experience and easily scales to multiple clouds and on-premises data centers, makes better use of resources, gets better driver support and enables higher scheduling throughput — all without introducing additional complexities as we grow. For example, we recently set a new internal record with 275,000 concurrent cores and 4,000 instances in a single region on Nomad. That compares to our previous record of some 150,000 concurrent cores and 2,500 instances.

Specifically, we’ve been able to confidently move into more regions around the world, which we had been hesitant to do with GKE. For example, we operated in just two regions with GKE on Google Cloud, but even with our small team, we are currently in five regions using Nomad on Google Cloud. That gives us better availability for our customers’ workloads. Even more important, both Conductor and our customers are enjoying significant cost savings by reducing the overall time to completion of rendering projects.

For more information, watch our session on Making Movie Magic With Nomad at HashiConf Europe 2021.

Lead image supplied by Conductor.