The Building Blocks of LLMs: Vectors, Tokens and Embeddings

Understanding vectors, tokens and embeddings is fundamental to grokking how large language models process language.
Feb 8th, 2024 7:32am by

When you work with large language models (LLMs), you often come across the terms “vectors,” “tokens” and “embeddings.” It’s important to understand these concepts thoroughly before you start building chatbots and AI assistants. With multimodal approaches gaining ground, the same concepts also apply beyond text, to models that interpret images and video.

The objective of this tutorial is to introduce you to these core concepts through simple, straightforward examples and code snippets.

Vectors: The Language of Machines

Vectors play a crucial role in the functioning of LLMs and generative AI. To understand their significance, it’s essential to grasp what vectors are and how they are generated and utilized in LLMs.

In mathematics and physics, a vector is an object that has both magnitude and direction. It can be represented geometrically as a directed line segment, where the length of the line indicates the magnitude and the arrow points in the direction of the vector. Vectors are fundamental in representing quantities that can’t be fully described by a single number, such as force, velocity or displacement.

In the realm of LLMs, vectors are used to represent text or data in a numerical form that the model can understand and process. This representation is known as an embedding. Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences or even entire documents. The process of converting text into embeddings allows LLMs to perform various natural language processing tasks, such as text generation, sentiment analysis and more.

Simply put, a vector is a one-dimensional array of numbers.

Since machines only understand numbers, data such as text and images is converted into vectors. Neural networks and transformer architectures operate on these numeric arrays (and on their higher-dimensional generalization, tensors) rather than on raw text.

Operations on vectors, such as the dot product, help us measure how similar or different two vectors are. At a high level, this forms the basis for performing similarity search on vectors stored in memory or in specialized vector databases.
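To make the similarity idea concrete, here is a minimal sketch in plain Python. The sample vectors and the names `dot`, `cosine_similarity` and `most_similar` are illustrative assumptions, not part of any library; production systems would typically delegate this work to a numerical library or a vector database.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalized by both magnitudes; ranges from -1 to 1."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


def most_similar(query, stored):
    """Return the key of the stored vector most similar to the query."""
    return max(stored, key=lambda name: cosine_similarity(query, stored[name]))


# Hypothetical three-dimensional vectors held in memory.
stored_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.7, 0.7, 0.0],
}
query_vector = [0.9, 0.1, 0.0]

print("Closest stored vector:", most_similar(query_vector, stored_vectors))
```

Cosine similarity is used here instead of the raw dot product so that the comparison depends on direction only, not on vector length.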

The code snippet below introduces the basic idea of a vector. As you can see, it is a simple one-dimensional array:
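A minimal sketch, assuming a plain Python list stands in for the array (a numerical library such as NumPy is the more common choice in practice); the values are arbitrary:

```python
# A vector represented as a one-dimensional list of numbers.
vector = [1, 2, 3, 4, 5, 6]

print("Vector:", vector)
print("Number of elements:", len(vector))
```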

While the vector shown above has no association with text, it does convey the idea. Tokens, which we explore in the next section, are the mechanism to represent text in vectors.

Tokens: The Building Blocks of LLMs

Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word (subword), or even a character — depending on the tokenization process.

When text is passed through a tokenizer, it encodes the input based on a specific scheme and emits a sequence of integer token IDs, a vector of numbers that the LLM can process. The encoding scheme is highly dependent on the LLM. Depending on that scheme, the tokenizer may map a whole word, or only part of a word, to a single ID. Passing the IDs back through the tokenizer’s decoder translates them into text again.
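The round trip from text to token IDs and back can be sketched with a toy word-level tokenizer. The vocabulary, the `<unk>` fallback and the function names below are hypothetical; real LLM tokenizers use learned subword schemes with vocabularies of tens of thousands of entries.

```python
# Hypothetical toy vocabulary mapping each known word to an integer ID.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
ID_TO_WORD = {token_id: word for word, token_id in VOCAB.items()}


def encode(text):
    """Split on whitespace and map each word to its ID (0 if unknown)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def decode(token_ids):
    """Map each ID back to its word and join the words with spaces."""
    return " ".join(ID_TO_WORD[token_id] for token_id in token_ids)


token_ids = encode("The cat sat on the mat")
print(token_ids)          # [1, 2, 3, 4, 1, 5]
print(decode(token_ids))  # the cat sat on the mat
```

Words missing from the vocabulary collapse to `<unk>` here, which is one reason real tokenizers fall back to subwords or bytes instead of whole words.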

It’s common to refer to the context length of LLMs as one of the key differentiating factors. Technically, it is the maximum number of tokens the LLM can handle in a single request, typically counting both the input it accepts and the output it generates. The tokenizer is responsible for encoding the prompt (input) into tokens and decoding the response (output) back into text.
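As a rough illustration of how context length constrains a request, the sketch below computes how many tokens remain for the response once the prompt is counted. It assumes the prompt and the response share a single window, which is typical but not universal. The helper name `remaining_output_budget` is hypothetical, and the 4,096-token figure is Llama 2’s context length; other models differ.

```python
def remaining_output_budget(context_window, prompt_tokens):
    """Tokens left for generation when prompt and response share one window."""
    if prompt_tokens > context_window:
        raise ValueError("prompt does not fit in the context window")
    return context_window - prompt_tokens


LLAMA_2_CONTEXT_WINDOW = 4096  # Llama 2's context length, in tokens

print(remaining_output_budget(LLAMA_2_CONTEXT_WINDOW, prompt_tokens=1000))  # 3096
```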

Tokens are the representations of text in the form of a vector.

The code snippets below show how text is converted into tokens for an open model like Llama 2 and a commercial model such as GPT-4. These are based on the transformers module from Hugging Face and tiktoken from OpenAI.
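A sketch of both approaches, assuming the `transformers` and `tiktoken` packages are installed. The Llama 2 checkpoint name is an assumption: the `meta-llama/Llama-2-7b-hf` repository on Hugging Face is gated, so loading it requires accepting Meta’s license and authenticating first. The function names are illustrative, and the imports sit inside the functions so that either half can be used without the other library.

```python
def llama2_tokens(text, model_name="meta-llama/Llama-2-7b-hf"):
    """Return (token IDs, decoded text) using a Hugging Face tokenizer."""
    from transformers import AutoTokenizer  # pip install transformers

    # Downloads the tokenizer files on first use (gated repo, see above).
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    token_ids = tokenizer.encode(text)
    decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
    return token_ids, decoded


def gpt4_tokens(text, model_name="gpt-4"):
    """Return (token IDs, decoded text) using OpenAI's tiktoken."""
    import tiktoken  # pip install tiktoken

    # Looks up the encoding that matches the named OpenAI model.
    encoding = tiktoken.encoding_for_model(model_name)
    token_ids = encoding.encode(text)
    decoded = encoding.decode(token_ids)
    return token_ids, decoded
```

Calling either function with the same sentence returns a list of integers plus the text recovered from them. The two lists generally differ in both values and length, because each model ships its own vocabulary.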

So, the key takeaway is that tokens are vectors based on a specific tokenizer.

Embeddings: The Semantic Space

If tokens are vector representations of text, embeddings are tokens with semantic context. They represent the meaning and context of the text. Whereas tokens are encoded and decoded by a tokenizer, an embedding model is responsible for generating text embeddings in the form of a vector. Embeddings are what allow LLMs to understand the context, nuance and subtle meanings of words and phrases. They are the result of the model learning from vast amounts of text data, and they encode not just the identity of a token but also its relationships with other tokens.

Embeddings are the foundational aspect of LLMs.

Through embeddings, LLMs achieve a deep understanding of language, enabling tasks like sentiment analysis, text summarization and question answering with nuanced comprehension and generation capabilities. They are the entry point to the LLM, but they are also used outside of the LLM to convert text into vectors while retaining the semantic context. When text is passed through an embedding model, a vector is produced that contains the embeddings. Below are examples from an open source embedding model, sentence-transformers/all-MiniLM-L6-v2, as well as OpenAI’s model, text-embedding-3-small.
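A sketch of both models, assuming the `sentence-transformers` and `openai` (version 1.x client) packages are installed and, for the OpenAI call, that an API key is available in the `OPENAI_API_KEY` environment variable. The function names are illustrative, and the imports sit inside the functions so that either half can be used on its own.

```python
def minilm_embedding(text, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Embed text locally with an open source sentence-transformers model."""
    # pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads weights on first use
    # encode() returns a NumPy array; this model produces 384 dimensions.
    return model.encode(text).tolist()


def openai_embedding(text, model_name="text-embedding-3-small"):
    """Embed text through OpenAI's embeddings API."""
    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model_name, input=text)
    # This model returns 1,536 dimensions by default.
    return response.data[0].embedding
```

Both functions return a list of floats for a given string. Vectors from the two models are not comparable with each other, since each model defines its own semantic space and dimensionality.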

Comparison and Interaction

Tokens vs. Vectors: Tokens are the linguistic units, while vectors are the mathematical representations of these units. Every token is mapped to a vector in the LLM’s processing pipeline.

Vectors vs. Embeddings: All embeddings are vectors, but not all vectors are embeddings. Embeddings are vectors that have been specifically trained to capture deep semantic relationships.

Tokens and Embeddings: The transition from tokens to embeddings represents the movement from a discrete representation of language to a nuanced, continuous and contextually aware semantic space.
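The hand-off from tokens to vectors can be pictured as a table lookup: each token ID selects one row of an embedding table. The tiny table below is a hypothetical stand-in with made-up values; in a real model the table is learned during training and each row has hundreds or thousands of dimensions. The later layers that make the representation context-aware are not shown.

```python
# Hypothetical embedding table: one row (vector) per token ID.
EMBEDDING_TABLE = [
    [0.0, 0.0, 0.0, 0.0],    # token ID 0
    [0.2, -0.1, 0.4, 0.3],   # token ID 1
    [0.9, 0.1, -0.5, 0.2],   # token ID 2
    [-0.3, 0.8, 0.1, -0.6],  # token ID 3
]


def embed(token_ids):
    """Replace each discrete token ID with its continuous vector."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


sample_ids = [1, 2, 1]
vectors = embed(sample_ids)
for token_id, vector in zip(sample_ids, vectors):
    print(token_id, vector)
```

At this stage the same token ID always yields the same vector, wherever it appears in the input.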

Understanding vectors, tokens and embeddings is fundamental to grasping how LLMs process language. Tokens serve as the basic data units, vectors provide a mathematical framework for machine processing, and embeddings bring depth and understanding, enabling LLMs to perform tasks with human-like versatility and accuracy. Together, these components form the backbone of LLM technology, enabling the sophisticated language models that power today’s AI applications.
