Meta’s MultiRay, an ML Platform for Running Large Foundational Models

Training an AI model is no small feat. Specialized teams need a huge amount of data to train a large model to do a very specific task, say reading millions of posts to learn how to identify harmful speech. Helpful, incredibly so. But the task is also expensive and limited in scope, and with each team bearing the cost of training its own model, it’s easy to see how spending can spiral out of control. Costs this large can keep the most state-of-the-art AI models out of production code.
There’s no way to get around the expensive, data-heavy computations of understanding content with AI. The machine has to learn. But where and how that learning takes place can change. Social media conglomerate Meta has developed a new platform for running state-of-the-art AI models that does just that. MultiRay’s primary aim is to democratize access to large foundational models at Meta.
Developed as part of Meta’s push to make its AI systems more efficient, MultiRay uses large, universal foundational ML models that are trained to perform well across a diverse set of tasks and domains. The foundational models are optimized for functionality across a variety of tasks, including similarity and classification. Multiple specialized, smaller models can now run off of the output (known as an embedding) of the universal model.
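As a rough sketch of the pattern, with hypothetical names and sizes (this is not MultiRay’s actual architecture): one large, shared encoder computes an embedding once, and several small task-specific heads consume it.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one large shared encoder, many small task heads.
# All names and dimensions are hypothetical, not MultiRay's real architecture.

class FoundationEncoder(nn.Module):
    """Stand-in for a large universal model mapping raw input to an embedding."""
    def __init__(self, vocab_size=50_000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return hidden.mean(dim=1)  # pool to one embedding vector per input

class TaskHead(nn.Module):
    """Small per-team model that consumes the shared embedding."""
    def __init__(self, dim=1024, num_classes=2):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, embedding):
        return self.classifier(embedding)

foundation = FoundationEncoder().eval()        # trained once, served centrally
harmful_speech_head = TaskHead(num_classes=2)  # cheap, team-owned models
topic_head = TaskHead(num_classes=50)

tokens = torch.randint(0, 50_000, (4, 128))    # a batch of tokenized posts
with torch.no_grad():
    emb = foundation(tokens)                   # the expensive step, done once
print(harmful_speech_head(emb).shape, topic_head(emb).shape)
```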
With the bulk of the computation centralized, Meta was able to purchase the more cutting-edge accelerators (specialized hardware) needed for the expensive computations. Software development also benefits: development teams can now quickly iterate on and improve their ML models.
Currently, MultiRay powers over 125 use cases across Meta and it supports up to 20 million queries per second (QPS) while serving 800 billion queries per day.
MultiRay’s Modalities
MultiRay’s first model (in production since 2020), TextRay, focuses on text-understanding applications and can perform tasks ranging from detecting inauthentic content to improving users’ search experiences.
Building off of TextRay, the second model, PostRay, joins text and image understanding: to truly understand a post, which can include images, video, and text, a system needs the capacity to analyze each modality individually and in the context of the others.
Before PostRay, this functionality required combining several different models, which consumed too many compute and power resources to actually bring the ML models into production.
PostRay models are complex to train, deploy, and maintain because they incorporate advanced research from multiple fields, but they only need to be trained once. PostRay has several use cases across Meta, including the topic classification used for Reels.
How MultiRay Works
MultiRay centralizes execution on accelerators and uses a cache to save on recomputation costs.
MultiRay’s large foundational models return a point in a high-dimensional vector space that represents the input. This point is the “embedding,” a more ML-friendly version of the original input. Rather than processing the raw input (the text and images), task-specific models can consume the embedding from MultiRay, which is simpler to handle.
The embeddings themselves are huge: at many kilobytes each, they are often much larger than the inputs they represent.
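Back-of-the-envelope arithmetic makes the size gap concrete (the dimensions below are assumptions for illustration, not MultiRay’s):

```python
# Illustrative only: a short post versus a hypothetical 1024-dim float32 embedding.
post = "an example post " * 16        # ~256 bytes of raw UTF-8 text
embedding_dim = 1024                  # assumed model dimension
embedding_bytes = embedding_dim * 4   # float32 = 4 bytes per element

print(len(post.encode("utf-8")))      # 256
print(embedding_bytes)                # 4096, i.e. 4 KB per embedding
```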
Why Centralize?
Software perspective: Meta’s blog post characterizes the decentralized, per-team workflow as burdened by the creation, maintenance, and upkeep of individual models, along with the difficulty of applying sophisticated optimization techniques. The centralized workflow alleviates most of that, leaving teams free to focus on developing and iterating on their task-specific models.
Hardware perspective: Large models and tight latency constraints are very demanding on graphics processing units (GPUs), the accelerators used for MultiRay. The centralized model allows top-shelf GPUs to be shared across teams rather than each team provisioning its own.
MultiRay’s Cache
Each layer of the multilayered cache trades hit rate against speed. The layers start with a fast but small per-host local cache in the RAM of every MultiRay server and end with a slower but much larger globally distributed cache in flash memory. Because cache storage is finite, results cannot be kept indefinitely.
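A minimal sketch of such a layered lookup, under assumed names and policies (the real system backs a per-host RAM tier with a globally distributed flash tier and tuned TTLs and eviction):

```python
import time

class TwoTierCache:
    """Sketch of a layered embedding cache: a small, fast local tier in front
    of a large, slower global tier. Structure and policies are assumptions."""

    def __init__(self, local_capacity=1_000, ttl_seconds=3_600):
        self.local = {}              # stand-in for per-host RAM
        self.global_tier = {}        # stand-in for distributed flash
        self.local_capacity = local_capacity
        self.ttl = ttl_seconds

    def get(self, key, compute_fn):
        now = time.time()
        for tier in (self.local, self.global_tier):
            entry = tier.get(key)
            if entry is not None and now - entry[1] < self.ttl:
                return entry[0]      # hit: skip the accelerator entirely
        value = compute_fn(key)      # miss: expensive recompute on a GPU
        self.global_tier[key] = (value, now)
        if len(self.local) < self.local_capacity:
            self.local[key] = (value, now)
        return value
```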
MultiRay measures request patterns across clients to determine the best cache settings (size, time-to-live, update policies) for reducing the cost of the service. For example, Meta uses the measured data to simulate the energy required under various cache-lifetime settings, trading the cost of recomputing a request on the accelerators against serving it from the cache. This feedback loop lets Meta improve MultiRay’s efficiency even as client behavior constantly changes.
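The simulation described above might look something like this in miniature, with made-up costs and a toy trace standing in for measured production data:

```python
def simulate_ttl(trace, ttl, recompute_cost=1.0, serve_cost=0.01):
    """Replay a (timestamp, key) request trace under a candidate TTL and
    total the cost: a full recompute on each miss, near-free on each hit.
    Cost units and the trace are made up for illustration."""
    last_computed = {}
    total = 0.0
    for ts, key in trace:
        if key in last_computed and ts - last_computed[key] < ttl:
            total += serve_cost          # served from cache
        else:
            total += recompute_cost      # recomputed on the accelerators
            last_computed[key] = ts
    return total

trace = [(0, "a"), (10, "a"), (30, "b"), (4_000, "a")]
for ttl in (60, 3_600, 86_400):          # candidate lifetimes in seconds
    print(ttl, simulate_ttl(trace, ttl))
```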
The Challenges of a Centralized Service
Some of the challenges already solved for other large-scale systems (e.g., databases), such as client management, quotas, and cost attribution, had to be adapted for the AI domain. Query size and cache hit rate both affect the energy required to process queries, so quotas are more complex. Another challenge is that the expenses accrued while building these models only make sense if the models are used, a moving target given continuous innovation in new model architectures and heavy investment in model refreshes and training flows.
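A toy cost model shows why flat per-query quotas fall short; the formula and constants here are assumptions for illustration, not Meta’s accounting:

```python
def effective_cost(num_queries, avg_tokens, hit_rate,
                   cost_per_token=1.0, cache_serve_cost=0.5):
    """Charge a client for the accelerator work it actually causes:
    only misses trigger recomputation, and larger queries cost more.
    All constants here are illustrative assumptions."""
    misses = num_queries * (1 - hit_rate)
    hits = num_queries * hit_rate
    return misses * avg_tokens * cost_per_token + hits * cache_serve_cost

# Two clients with identical query counts, wildly different true costs.
print(effective_cost(1_000_000, avg_tokens=32, hit_rate=0.9))
print(effective_cost(1_000_000, avg_tokens=512, hit_rate=0.2))
```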
Additional Learning
MultiRay has become a sandbox for Meta’s ML and systems specialists to contribute key optimizations that support the broader PyTorch and accelerator ecosystem. MultiRay was the first large use case to deploy PyTorch’s Better Transformer in production at Meta, bringing significant capacity savings with no impact on quality.
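For reference, the Better Transformer fastpath is available in stock PyTorch (1.12+) through nn.TransformerEncoder; this is a generic example, not MultiRay’s code:

```python
import torch
import torch.nn as nn

# Generic Better Transformer example (PyTorch 1.12+), not MultiRay's code.
# In eval mode under inference_mode, supported encoder configs take a fused
# fastpath; enable_nested_tensor lets the encoder skip work on padding.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6,
                                enable_nested_tensor=True).eval()

src = torch.rand(32, 128, 512)              # (batch, seq, dim)
padding = torch.zeros(32, 128, dtype=torch.bool)
padding[:, 100:] = True                     # last 28 positions are padding

with torch.inference_mode():
    out = encoder(src, src_key_padding_mask=padding)
print(out.shape)                            # torch.Size([32, 128, 512])
```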
The research below, from Meta’s Fundamental AI Research (FAIR) team, led to MultiRay’s development.
- Unsupervised cross-lingual representation learning at scale — where researchers first demonstrated that multilingual modeling can be done without sacrificing per-language performance.
- General purpose text embeddings from pre-trained language models for scalable inference — where researchers demonstrate a solution for NLP in which multiple tasks are performed on the same text using large-scale pre-trained models at a fraction of the compute cost.
- Multiscale vision transformers and Masked autoencoders as spatiotemporal learners — foundational research pointing toward how MultiRay can be applied to video-related tasks in the future.