Is It Too Early to Leverage AI for WebAssembly?
AI and its application to IT, software development, and operations are just beginning to take hold, portending profound implications and disruptions for how humans’ roles will evolve, both in the near and long term.
On a smaller scale, WebAssembly represents a technology that is generating significant hype while demonstrating its viability. However, a successful business model adoption has yet to be realized, mainly due to a lack of standardization for the final endpoint. Meanwhile, at least one vendor, Fermyon, believes that applying AI to WebAssembly is not premature at this stage.
So, how can AI potentially help Wasm’s development and adoption, and is it too early to determine? Angel M De Miguel Meana, a staff engineer at VMware’s Office of the CTO, noted that in the year since the introduction of ChatGPT brought AI to the forefront of software development, the AI ecosystem has evolved drastically. Meanwhile, “WebAssembly provides a solid base to run inference not only on the server, but in many different environments like browsers and IoT devices,” De Miguel Meana said. “By moving these workloads to end-user devices, it removes the latency and avoids sending data to a centralized server, while being able to work on the type of heterogeneous devices often found at the edge… Since the Wasm ecosystem is still emerging, integrating AI in early stages will help to push new and existing AI related standards. It is a symbiotic relationship.”
“We started Fermyon with the goal of building a next-wave serverless platform. AI is very clearly part of this next wave. In our industry, we frequently see revolutionary technologies grow up together: Java and the web, cloud and microservices, Docker and Kubernetes,” Matt Butcher, co-founder and CEO of Fermyon Technologies, told The New Stack. “WebAssembly and AI are such a perfect pairing. I see them growing up (and growing old) together.”
“Baking” AI models, such as LLMs [large language models] or transformers, into the WebAssembly runtime is the logical next step to accelerate the adoption of WebAssembly, Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. Much like calling a database service via an API, compiled WebAssembly apps (binaries) could then send their API requests to the WebAssembly runtime, which in turn would relay each call to the AI model and pipe the model’s response back to the originator, Volk said.
“These API requests will become very powerful once we have a common component model (CCM) that provides developers with one standardized API that they can use to access databases, AI models, GPUs, messaging, authentication, etc. The CCM would then let developers write the same code to talk to an AI model (e.g. GPT or Llama) on any kind of server in the data center, cloud or even at edge locations, as long as this server has sufficient hardware resources available,” Volk said. “This all boils down to the key question of when industry players will agree on a CCM. In the meantime, WebAssembly clouds such as Fermyon can leverage WebAssembly to make AI models portable and scalable within their own cloud infrastructure where they do not need a CCM and pass on some of the savings to the customer.”
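The relay pattern Volk describes can be illustrated with a minimal Python sketch. It is not Fermyon's or any runtime's actual API: `HostRuntime`, the `"ai.infer"` interface name, and both backend functions are hypothetical names invented for illustration. The point is only that the guest code stays identical while the host decides which model answers.

```python
from typing import Callable, Dict


class HostRuntime:
    """Toy stand-in for a Wasm runtime that relays guest calls to host services."""

    def __init__(self) -> None:
        # Maps an interface name (e.g. "ai.infer") to the host-side handler.
        self._services: Dict[str, Callable[[str], str]] = {}

    def register(self, interface: str, handler: Callable[[str], str]) -> None:
        self._services[interface] = handler

    def call(self, interface: str, request: str) -> str:
        # The guest never sees which backend answers; it only names the interface.
        if interface not in self._services:
            raise LookupError(f"no host service registered for {interface!r}")
        return self._services[interface](request)


def guest_app(runtime: HostRuntime, text: str) -> str:
    """Guest code: identical regardless of which model the host wires in."""
    return runtime.call("ai.infer", f"Summarize: {text}")


# Two hypothetical backends; real ones would invoke an actual model.
def datacenter_model(prompt: str) -> str:
    return f"[datacenter] {prompt.upper()}"


def edge_model(prompt: str) -> str:
    return f"[edge] {prompt.lower()}"


datacenter = HostRuntime()
datacenter.register("ai.infer", datacenter_model)
edge = HostRuntime()
edge.register("ai.infer", edge_model)

print(guest_app(datacenter, "Wasm and AI"))
print(guest_app(edge, "Wasm and AI"))
```

Here `guest_app` plays the role of the compiled Wasm binary: it is written once, and only the host-side registration differs between the data center and the edge.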
Solving the Problem
As Butcher noted, developers tasked with building and running enterprise AI apps on LLMs like LLaMA2 face a 100x compute expense for access to GPUs, at $32 per instance-hour and upwards. Alternatively, they can use on-demand services, but then they experience abysmal startup times. This makes it impractical to deliver enterprise AI apps affordably.
Fermyon Serverless AI has solved this problem by offering sub-second cold start times, more than 100x faster than other on-demand AI infrastructure services, Butcher said. This “breakthrough” is made possible by the serverless WebAssembly technology powering Fermyon Cloud, which is architected for sub-millisecond cold starts and high-volume time-slicing of compute instances, an approach that has proven to alter compute densities by a factor of 30, he said. Extending this runtime profile to GPUs makes Fermyon Cloud the fastest AI inferencing infrastructure service, Butcher said.
Such an inference service is “very interesting” as the typical WebAssembly app consists of only a few megabytes, while AI models are a lot larger than that, Volk said. This means they would not be able to start up quite as fast as traditional WebAssembly apps. “I assume that Fermyon has figured out how to use time slicing for providing GPU access to WebAssembly apps so that all of these apps can get the GPU resources they need by reserving a few of these time slices via their WebAssembly runtime,” Volk said. “This would mean that a very large number of apps could share a small number of expensive GPUs to serve their users on-demand. This is a little bit like a time-share, but without being forced to come to the lunchtime presentation.”
So, how would the user interact with Serverless AI? With Fermyon’s Serverless AI, there are no REST APIs or external services to call; the capability is built into Fermyon’s Spin for local use and into Fermyon Cloud, Butcher explained. “Anywhere in your code, you can simply pass a prompt into Serverless AI and get back a response. In this first beta, we’re including LLaMA2’s chat model and the recently announced Code Llama code-generating model,” Butcher said. “So, whether you’re summarizing text, implementing your own chatbot, or writing a backend code generator, Serverless AI has you covered. Our goal is to make AI so easy that developers can right away begin leveraging it to build a new and jaw-dropping class of serverless apps.”
Using WebAssembly to run workloads, it is possible to use Fermyon Serverless AI to assign a “fraction of a GPU” to a user application “just in time” to execute an AI operation, Fermyon CTO and co-founder Radu Matei wrote in a blog post. “When the operation is complete, we assign that fraction of the GPU to another application from the queue,” Matei wrote. “And because the startup time in Fermyon Cloud is milliseconds, that’s how fast we can switch between user applications that are assigned to a GPU. If all GPU fractions are busy crunching data, we queue the incoming application until the next one is available.”
This has two big implications, Matei wrote. First, users don’t have to wait for a virtual machine or container to start and for a GPU to be attached to it. Also, “we can achieve significantly higher resource utilization and efficiency for our infrastructure,” Matei wrote.
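The queueing behavior Matei describes can be approximated with a toy simulation. This is a sketch under stated assumptions, not Fermyon's scheduler: the class `GpuFractionScheduler`, its methods, and the first-in, first-out queue policy are all hypothetical choices made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class GpuFractionScheduler:
    """Toy model: a fixed number of GPU fractions shared by queued apps."""

    def __init__(self, fractions: int) -> None:
        self.free: List[int] = list(range(fractions))  # idle fraction ids
        self.running: Dict[str, int] = {}              # app -> fraction id
        self.waiting: Deque[str] = deque()             # apps queued for a fraction

    def submit(self, app: str) -> Optional[int]:
        """Assign a fraction just in time, or queue the app if all are busy."""
        if self.free:
            fraction = self.free.pop(0)
            self.running[app] = fraction
            return fraction
        self.waiting.append(app)
        return None

    def complete(self, app: str) -> Optional[Tuple[str, int]]:
        """Release the app's fraction and hand it to the next queued app."""
        fraction = self.running.pop(app)
        if self.waiting:
            next_app = self.waiting.popleft()
            self.running[next_app] = fraction
            return next_app, fraction
        self.free.append(fraction)
        return None


scheduler = GpuFractionScheduler(fractions=2)
print(scheduler.submit("app-a"))    # gets fraction 0
print(scheduler.submit("app-b"))    # gets fraction 1
print(scheduler.submit("app-c"))    # all busy: queued, returns None
print(scheduler.complete("app-a"))  # fraction 0 passes to app-c
```

The sketch leaves out what makes the real service hard, namely switching applications on a GPU within milliseconds; it only shows the bookkeeping of assigning, releasing, and queueing.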
According to Fermyon, specific features of Serverless AI include:
- This is a developer tool and hosted service for enterprises building serverless applications that include AI inferencing using open source LLMs.
- Thanks to our core WebAssembly technology, our cold startup times are 100x faster than competing offerings, cutting down from minutes to under a second. This allows us to execute hundreds of applications in the same amount of time (and with the same hardware) that today’s services use to run one.
- We provide a local development experience for building and running AI apps with Spin and then deploying them into Fermyon Cloud for high performance at a fraction of the cost of other solutions.
- Fermyon Cloud uses AI-grade GPUs to process each request. Because of our fast startups and efficient time-sharing, we can share a single GPU across hundreds of apps.
- We’re launching the free tier private beta.
However, there is certainly a way to go before Wasm and AI concurrently reach their potential. Michael Yuan, CEO and co-founder of Second State and WasmEdge, a runtime project for Wasm, discussed some of the work in progress with De Miguel Meana during their talk “Getting Started with AI and WebAssembly” at WasmCon 2023.
“There’s a lot of ecosystem work that needs to be done in this space [of AI and Wasm]. For instance, having inferences alone is not sufficient,” Yuan said. “The million-dollar question right now is, when you have an image and a piece of text, how do you convert that into a series of numbers, and then after the inference, how do you convert those numbers back into a usable format?”
Preprocessing and post-processing are among Python’s greatest strengths today, thanks to the availability of numerous libraries for these tasks, Yuan said. Incorporating these preprocessing and post-processing functions into Rust functions would be beneficial, but it requires more effort from the community to support additional modules. “There is a lot of potential for growth in this ecosystem,” Yuan said.
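To make the preprocessing and post-processing steps Yuan refers to concrete, here is a minimal, dependency-free Python sketch. Everything in it is an assumption for illustration: the tiny vocabulary, the two labels, and `fake_inference`, which stands in for a real model and simply counts words.

```python
import math
from typing import Dict, List

# Hypothetical toy vocabulary and labels; real models ship their own tokenizer.
VOCAB: Dict[str, int] = {"<unk>": 0, "wasm": 1, "is": 2, "fast": 3, "slow": 4}
LABELS: List[str] = ["negative", "positive"]


def preprocess(text: str) -> List[int]:
    """Turn text into the series of numbers a model consumes."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def fake_inference(token_ids: List[int]) -> List[float]:
    """Stand-in for the model: returns one raw score (logit) per label."""
    positive = float(token_ids.count(VOCAB["fast"]))
    negative = float(token_ids.count(VOCAB["slow"]))
    return [negative, positive]


def postprocess(logits: List[float]) -> str:
    """Turn raw scores back into a usable answer via softmax and argmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return LABELS[probs.index(max(probs))]


ids = preprocess("Wasm is fast")
print(ids)                               # [1, 2, 3]
print(postprocess(fake_inference(ids)))  # positive
```

In practice these two steps are where Python libraries do most of the work today, which is the gap Yuan says the Rust and Wasm ecosystem still needs to close.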