Demo: Use WebAssembly to Run LLMs on Your Own Device with WasmEdge

In this demo of WasmEdge, Second State's Michael Yuan shows how to create a lightweight execution environment for running large language models.
Jan 19th, 2024 6:46am

At November’s KubeCon+CloudNativeCon North America, Michael Yuan, co-founder of Second State (which offers Wasm for cloud native environments) and maintainer of the CNCF project WasmEdge, showed The New Stack correspondent B. Cameron Gain how WasmEdge works for an episode of TNS Demos.

Yuan showed how the open source WasmEdge runtime can use WebAssembly to run a large language model on your own device, whether it's a Mac, a laptop, or an edge device like a Raspberry Pi. With a lightweight execution environment, large language models can run efficiently on such disparate device types.

“ChatGPT wouldn’t be able to run in these environments, but with such a lightweight [runtime] like WasmEdge, you are able to run it,” Yuan said.

Python, commonly associated with all things ML, is not part of the equation. “Why not use Python? To do large language inference with Python, you need a whole PyTorch and GPU driver and all that stuff installed,” Yuan said. “That stuff is like, three gigabytes. I dare not install it on my computer.”

Python code is not designed for portability: running the LLM on a different computer with a different GPU means “you have to start all over again,” Yuan said. “The Wasm runtime is a virtual machine, like the JVM. So, it provides cross-platform compatibility, and it’s not just across CPUs but also across GPUs.”

Python is also an interpreted language, and it is comparatively slow: when using Python for machine learning, the user must rely on an underlying C-based library like PyTorch “to actually do the work,” Yuan said. “So, with Wasm, we use a bunch of more C-like languages, like Rust, to bridge the gap.”

Small Steps

As Yuan showed, there are only three steps. The first step is installing WasmEdge with a single command, shown below. (The conference's internet connection was slow, so the download took a minute during the demo; once WasmEdge is installed, you won't need an internet connection to run the model.)
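
The transcript doesn't reproduce the command itself; a representative one-liner, taken from the WasmEdge/LlamaEdge quickstart documentation rather than from the demo, installs the runtime along with the WASI-NN GGML plugin used for LLM inference:

```bash
# Install WasmEdge plus the WASI-NN GGML plugin needed for LLM inference
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh \
  | bash -s -- --plugin wasi_nn-ggml
```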

The second step is to download a large language model; the demo used one of the Llama models, and the documentation has further details. A representative command is shown below.
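
The exact model file isn't named in the transcript; as an illustration, the LlamaEdge quickstart downloads a quantized Llama 2 chat model in GGUF format from Hugging Face (the repository and file name below come from that quickstart, not from the demo):

```bash
# Download a quantized (GGUF) Llama 2 7B chat model from Hugging Face
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
```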

Lastly, the third step is simply cutting and pasting the command that runs the Wasm application, as sketched below.
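
In the LlamaEdge quickstart this is a prebuilt llama-chat.wasm application that runs under WasmEdge with the model preloaded through WASI-NN; the commands below are an illustration from that documentation, not the literal ones from the demo:

```bash
# Fetch the portable chat application (a single .wasm file)
curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm

# Run it under WasmEdge, preloading the GGUF model under the name "default"
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:llama-2-7b-chat.Q5_K_M.gguf \
  llama-chat.wasm -p llama-2-chat
```

The same .wasm file runs unchanged on a Mac, a Linux laptop or a Raspberry Pi, with WasmEdge selecting an appropriate CPU or GPU backend at runtime, which is the cross-platform point Yuan makes above.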

Where do you find LLMs to plug into WasmEdge? Yuan recommends the Hugging Face repository, where thousands of models can be downloaded. “Some of them can generate a SQL query, some of them can generate code, some of them can answer all kinds of different questions, so you can download the model that you like, and put it in WasmEdge,” Yuan said.
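
Swapping in a different model is then just a matter of pointing the runtime at a different GGUF file. A hypothetical example (the Hugging Face repository, file name and prompt-template flag here are illustrative, not from the demo):

```bash
# Hypothetical: use a code-generation model instead of the chat model
curl -LO https://huggingface.co/second-state/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q5_K_M.gguf
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:codellama-7b-instruct.Q5_K_M.gguf \
  llama-chat.wasm -p codellama-instruct
```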
