Breakdown: The Kubernetes-Run AI Video Generation Pipeline for NIUS.TV
NIUS.TV was a non-linear TV news aggregator for mobile devices that used AI to convert articles on topics that interested our users into short-form videos. The project ran from October 2019 to April 2021, though early experiments dated back to May 2018. We produced around 100 stories during this period, and more than 500,000 people watched our videos on social media.
This post covers the motivations and decisions behind the backend architecture that supported the video generation pipeline.
Advances in deep learning have become increasingly impressive, and they allowed us to create our unique synthesized stories with a modest, dynamic Kubernetes (K8s) setup.
Instead of researching a novel architecture to solve the text-to-video synthesis problem end-to-end (which can be prohibitively expensive and typically requires a very special kind of skill), we broke the video generation problem into granular components we call “steps.”
We took advantage of robust, off-the-shelf AI systems that are publicly available, documented, and tested, and relied on containerization technology to stitch everything together.
First, the rendering technology. Not too long ago, the first version of pix2pix (2016) created images at a resolution of 256×256. Today, NVIDIA’s pix2pixHD (2018) can output images up to 2048×1024, with a great level of detail, after a few epochs of training.
Projects like NVIDIA’s vid2vid (2018) extend the idea of image synthesis to video, which adds new difficulties such as longer training sessions, higher memory requirements, and the need for multiple GPUs for training (DGX-1).
Here is a list of AI systems we used:
| Name | Description | DL framework | Inference time | GPU | Execution model | Release date | Publisher |
| pix2pixHD | Image to image synthesis | PyTorch 1.6.0 | | Yes | Job | 2018 | NVIDIA |
| gentle | Speech to phonemes | N/A | | No | HTTP API | 2015 | lowerquality |
| g2p | Last-mile phonemes | TensorFlow 1.3.0 | | No | Job | 2018 | Kyubyong Park |
| Dlib | Facial landmarks | N/A | | No | Job | 2002 | Davis King |
| tacotron2 | Speech to waveform | PyTorch 1.0 | | Yes | Job | 2018 | |
| waveglow | Waveform to wav | PyTorch 1.0 | | Yes | Job | 2018 | NVIDIA |
Let’s take a high-level view of the steps in the video generation pipeline:
This is what creating a video looked like:
1. Using a webApp, a member of our staff wrote a news story and selected the right images for it. Once the story was saved, a message was published to a pubsub queue.
2. Cluster-manager, which monitored the pubsub queue, saw the new message and created a k8s cluster. Once the k8s cluster was ready, it launched jobs-manager and gentle.
3. Jobs-manager then pulled the message from the queue and launched the init-story k8s batch job.
4. and 5. Init-story pulled all the motion graphics assets and news images from Google Cloud Storage (GCS) and stored them on a LocalSSD, and the rest of the pipeline continued…
6. The final video was then stored in GCS.
If new stories were created while the Kubernetes cluster was running, a new flow for each story started from point 3 in the list above and ran in parallel with the other flows, as depicted in this image:
As noted in the architecture diagrams above and the AI systems table, each AI system has different dependencies and execution models, and some require GPUs.
The steps in the pipeline also needed coordination, and we needed to scale capacity based on how many stories were in the queue.
Let’s expand more on each of these challenges.
Although some AI systems share the same deep learning (DL) frameworks and Python libraries, others require specific versions and build instructions. This fragmentation makes it challenging to build a single container that supports all AI systems at the same time.
We used Docker containers to isolate each AI system. An advantage of this approach is that we could swap an AI system for a new one without breaking the others.
Our videos needed to look sharp.
The video generation pipeline output PNG files of approximately 2 MB per frame at a resolution of 2048×1024, at 60 frames per second. Since our videos were typically around 35 seconds long (about 2,100 frames), this created roughly 4.2 GB of data in just a few steps down the pipe. The final output of a typical video generation was approximately 16 GB.
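The arithmetic behind the 4.2 GB figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants come from the numbers quoted above, the function names are made up for illustration, and decimal units (1 GB = 1000 MB) are assumed.

```python
# Figures quoted in the text: ~2 MB per PNG frame, 60 fps, ~35 s per video.
FRAME_SIZE_MB = 2
FRAMES_PER_SECOND = 60
VIDEO_LENGTH_S = 35


def frames_per_video(length_s=VIDEO_LENGTH_S, fps=FRAMES_PER_SECOND):
    """Number of frames a single video needs."""
    return length_s * fps


def frame_data_gb(length_s=VIDEO_LENGTH_S, fps=FRAMES_PER_SECOND,
                  frame_mb=FRAME_SIZE_MB):
    """Approximate size of one full set of frames, in decimal gigabytes."""
    return frames_per_video(length_s, fps) * frame_mb / 1000


print(frames_per_video())  # 2100
print(frame_data_gb())     # 4.2
```

Every step that writes a full set of frames adds roughly this much again, which is why fast shared storage between steps mattered.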
To allow quick sharing of assets between steps, we exposed LocalSSDs as k8s volumes mounted into the pods at every step in the pipeline.
One limitation is that because LocalSSDs can only be attached to one k8s node at a time, the story assets and their respective video generation steps are tied (affinity) to the same k8s node. This means that videos started on one k8s node cannot be completed on a different k8s node.
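One way to express this pinning is a pod spec that combines a `hostPath` volume with a `nodeSelector` on the node's hostname label. The sketch below builds such a spec as a plain Python dictionary. It is an illustration, not our actual manifest: the function name, the `/assets` mount path, the volume name, and the `/mnt/disks/ssd0` host path are all assumptions.

```python
def pod_spec_for_step(step_name, image, node_name, ssd_path="/mnt/disks/ssd0"):
    """Build a pod spec that mounts the node's local SSD and stays on that node.

    ssd_path is an assumed mount point for the local SSD on the node.
    """
    return {
        "restartPolicy": "Never",
        # Pin the pod to the node that already holds this story's assets.
        "nodeSelector": {"kubernetes.io/hostname": node_name},
        "containers": [
            {
                "name": step_name,
                "image": image,
                # Hypothetical in-container path where steps read and write assets.
                "volumeMounts": [{"name": "story-assets", "mountPath": "/assets"}],
            }
        ],
        # Expose the node-local disk to the container as a hostPath volume.
        "volumes": [{"name": "story-assets", "hostPath": {"path": ssd_path}}],
    }


spec = pod_spec_for_step("pix2pixhd", "example/pix2pixhd:latest", "node-a")
print(spec["nodeSelector"])
```

Because every step of a story carries the same `nodeSelector`, the scheduler has no freedom to move a half-finished video to another node, which is exactly the limitation described above.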
To coordinate steps in the pipeline, we created a job-manager component (running as a k8s deployment) that listens for messages created by the webApp and other steps.
Messages contained information about the video being produced and the step to be executed. Depending on the message, the job-manager scheduled a batch job in k8s.
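The translation from a message to a batch job could look roughly like the following. This is a minimal sketch under stated assumptions: the message schema (`story_id` and `step` fields), the image registry names, and the function name are hypothetical, and the manifest is returned as a dictionary rather than submitted to a cluster.

```python
import json


def job_manifest_from_message(raw_message, images):
    """Turn a queue message into a Kubernetes batch/v1 Job manifest.

    The message format (JSON with "story_id" and "step") is an assumption.
    """
    msg = json.loads(raw_message)
    story_id, step = msg["story_id"], msg["step"]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{step}-{story_id}",
            "labels": {"story": story_id, "step": step},
        },
        "spec": {
            "backoffLimit": 0,  # do not retry a failed step automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": step,
                            "image": images[step],  # one image per step
                            "env": [{"name": "STORY_ID", "value": story_id}],
                        }
                    ],
                }
            },
        },
    }


raw = json.dumps({"story_id": "story-42", "step": "init-story"})
manifest = job_manifest_from_message(raw, {"init-story": "example/init-story:latest"})
print(manifest["metadata"]["name"])  # init-story-story-42
```

Keeping one image per step in a lookup table mirrors the container isolation described earlier: replacing an AI system only means pointing its step at a different image.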
In this event-driven setup, steps ran independently of each other, asynchronously. Naturally, some steps ran slower than others. For example, as detailed in the table below, pix2pixHD takes longer than the other steps and requires a full GPU, potentially creating a bottleneck for other steps that also need GPU time.
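In Kubernetes, a container asks for a GPU through the `nvidia.com/gpu` resource limit, and the scheduler then decides which pending job gets the device. The sketch below derives that limit from the GPU column of the AI systems table. The helper name and the choice of exactly one GPU per job are assumptions made for illustration.

```python
# Systems marked "Yes" in the GPU column of the AI systems table.
GPU_SYSTEMS = {"pix2pixHD", "tacotron2", "waveglow"}


def container_resources(system_name):
    """Return the resources block for a step's container.

    GPU steps request one whole GPU (an assumption); other steps
    request nothing special and can be packed freely by the scheduler.
    """
    if system_name in GPU_SYSTEMS:
        return {"limits": {"nvidia.com/gpu": 1}}
    return {}


for name in ("pix2pixHD", "gentle", "waveglow"):
    print(name, container_resources(name))
```

With limits like these in place, a long pix2pixHD job holding a GPU simply leaves other GPU steps pending until a device frees up or another GPU node is added.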
We left resource allocation, prioritization, and elastic scaling to k8s.
Scalability and Elasticity
At NIUS.TV, we produced our videos in batches during the day, making it unnecessary to run a Kubernetes cluster all day long. Also, our batches of tens of videos needed to be produced as fast as possible, as the window for social media sharing is time-sensitive.
For this purpose, cluster-manager spun a k8s cluster up and down based on the number of messages in the queue. If the number of messages was greater than zero, cluster-manager provisioned a k8s cluster. If the queue was empty, it deleted the cluster.
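The decision rule itself is small enough to write as a pure function. The following is a sketch of that rule only: the names are hypothetical, and the actual calls that create or delete a cluster, as well as any check for work still in flight, are deliberately left out.

```python
from enum import Enum


class Action(Enum):
    CREATE = "create"
    DELETE = "delete"
    NOOP = "noop"


def decide(pending_messages, cluster_exists):
    """Pick the next cluster action from the queue depth.

    Messages waiting and no cluster: create one.
    Empty queue and a cluster running: delete it.
    Anything else: leave things as they are.
    """
    if pending_messages > 0 and not cluster_exists:
        return Action.CREATE
    if pending_messages == 0 and cluster_exists:
        return Action.DELETE
    return Action.NOOP


print(decide(3, False))  # Action.CREATE
print(decide(0, True))   # Action.DELETE
print(decide(3, True))   # Action.NOOP
```

Running a function like this in a polling loop keeps the cluster lifetime tied to the batch, so no nodes are billed while the queue is empty.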
We used k8s autoscaling policies to scale the number of k8s nodes based on the number of stories we wished to produce simultaneously.
The cluster-manager ran outside k8s on its own VM.