Fine Tuning Isn’t the Hammer for Every Generative AI Nail
Large language models (LLMs) are pretty amazing at creating new content based on their immense knowledge, but they’re far from perfect. LLMs don’t always return an answer that is useful, or even accurate. This is because LLMs are “stateless”: they don’t store new data after training, so they can’t answer prompts with information that they weren’t trained on.
This fact doesn’t stop them from trying, however, and this can result in what is known as a “hallucination,” when an LLM fills in the gaps of its knowledge by confidently making up answers that sound plausible but are incorrect.
There are two ways to get information into an LLM to help prevent these hallucinations and ensure that the model provides accurate, relevant answers. You can train the model with the data, either when the model is built or by “fine-tuning” it after the fact. Or you can do it at runtime — at the “time of inference.”
When supplying information at runtime, which is known as retrieval augmented generation (RAG), the model is provided with additional data or context in real time from other sources, most often a database that can store vectors.
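To make “supplying information at runtime” concrete, here is a minimal sketch of the final step of RAG: folding retrieved text into the prompt before it reaches the model. The function name `build_prompt`, the prompt wording and the sample passages are all assumptions made for illustration; a real application would get the passages from a retrieval step and would likely use an orchestration library rather than hand-built strings.

```python
def build_prompt(question: str, retrieved: list[str]) -> str:
    """Assemble a prompt that carries retrieved context alongside the question."""
    context = "\n".join(f"- {passage}" for passage in retrieved)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical passages that a retrieval step might return at request time.
passages = ["Ate ramen on Monday.", "Ate tacos on Tuesday."]
prompt = build_prompt("Where should I eat lunch today?", passages)
print(prompt)
```

The model itself is unchanged here; only the text it is asked to read differs from request to request.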
Let’s compare the two methods and clarify how to choose between fine-tuning and RAG.
Fine-Tuning Is Fine for Some Tasks
Adapting a pre-trained model to do a specific task or solve a specific problem requires introducing smaller, more specific data sets. This is sometimes referred to as “retraining” an LLM.
Here’s an oversimplified example: Let’s say you are using an LLM in an application that might recommend where to eat lunch. The model might be trained on the array of restaurant options within 20 miles of you, but it probably doesn’t “know” what you’ve eaten over the past week.
You could fine-tune the model with your recent meals so that it can ensure a certain amount of variety when it makes lunch recommendations (check out this article for a detailed explanation of different fine-tuning techniques). But how often would you have to do that to ensure the tuned model is up to date: Monthly? Weekly? Every day? The prospect of ensuring the recency of an LLM’s training is daunting at best, and in many cases impossible.
This would be a real problem with an AI-driven e-commerce app built on an LLM. What if a product sells out a few minutes before a customer tries to purchase it? Even daily fine-tuning wouldn’t happen often or fast enough to prevent a glitch that could result in a bad customer experience, and fine-tuning that frequently probably wouldn’t be feasible anyway.
In the lunch recommendation example, what if the LLM is ignorant of any dietary restrictions, health concerns or medical issues you might have? Would you consider fine-tuning it with your personal health information?
If you have any concerns about your privacy, the short answer is: probably not. That’s because LLMs are leaky: any data put into an LLM can be output by an LLM when answering a prompt. No promise from any model provider can guarantee that it’s safe to use personally identifiable information (PII) when training a model. For this reason and others, training LLMs with PII has already come under regulatory scrutiny.
It’s important to recognize that there are situations where fine-tuning is appropriate; retraining embedding models is one. An embedding model takes semantic inputs and generates vectors to represent the inputs. If an embedding model doesn’t recognize or understand a word, then that word will map into a vector that isn’t related to the meaning of that word.
For example, if an embedding model was built before the arrival of social media platform TikTok, it might not recognize the word and might instead create a vector that is associated with “clocks” or “sounds” (or maybe even Disney characters). Arming the embedding model with the correct meaning of TikTok would require fine-tuning.
RAG for Speed and Security
There are many differences between RAG and fine-tuning, but the most significant one might be that RAG doesn’t alter or introduce new data directly into an LLM. Rather, it draws on data securely stored elsewhere to help the model produce relevant, accurate responses to a query.
In a RAG system, documents are stored in a vector database. Because this data is stored as vectors, it relies on the same underlying principles as LLMs, which turn prompts into vectors. Stored as vectors, data becomes a reservoir of knowledge that is instantly findable and accessible, greatly expanding an LLM’s inherent ability to produce answers related to a given query.
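As a rough illustration of how vector lookup works, the sketch below ranks a handful of stored documents by cosine similarity to a query vector. The three-dimensional vectors are made up by hand for the example; in practice an embedding model would produce them, with hundreds or thousands of dimensions, and a vector database would handle the search at scale.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy store of (text, vector) pairs; the vectors are invented for illustration.
store = [
    ("Thai place two blocks away", [0.9, 0.1, 0.0]),
    ("Quarterly tax filing guide", [0.0, 0.2, 0.9]),
    ("Taco truck on Main Street", [0.8, 0.3, 0.1]),
]


def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# A made-up query vector standing in for an embedded lunch question.
results = top_k([1.0, 0.2, 0.0])
print(results)
```

With these invented vectors, the two restaurant entries rank above the tax guide, which is the behavior a retrieval step relies on.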
To fine-tune an LLM, a sufficient amount of information has to be collected first. As an analogy, consider the human mind. If you’re having a conversation, your mind is constantly summarizing and storing those summaries in your memory. At some point, though, that information gleaned from conversations will reach a critical mass and might change the way you think about things or make decisions.
If your mind is the vector database, RAG gathers and makes available all that data in real time; fine-tuning is how you update the LLM once certain conclusions can be drawn from a mass of accumulated data.
When the data is in a vector database, it can be instantly updated and available for querying, avoiding the need for fine-tuning to “batch up the data.”
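The sold-out product scenario shows why this matters. In the sketch below, a plain dictionary stands in for the database (an assumption made to keep the example self-contained; the SKU, product name and helper functions are hypothetical). The point is only that a write is visible to the very next read, with no training job in between.

```python
# A dictionary stands in for the database in this sketch.
inventory = {"sku-123": {"name": "Trail shoes", "in_stock": True}}


def upsert(sku: str, record: dict) -> None:
    """Insert or overwrite a record; later reads see the new value."""
    inventory[sku] = record


def context_for(sku: str) -> str:
    """Build the context sentence that would be passed to the model."""
    item = inventory[sku]
    status = "in stock" if item["in_stock"] else "sold out"
    return f"{item['name']} is currently {status}."


before = context_for("sku-123")
# The product sells out; the store is updated in place.
upsert("sku-123", {"name": "Trail shoes", "in_stock": False})
after = context_for("sku-123")
print(before)
print(after)
```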
RAG also offers fine-grained access control by giving developers the ability to determine precisely what is queried. For example, with a chatbot, the developer can add a query filter so that the only personal data retrieved belongs to the person asking for it.
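One way such a filter could look is sketched here, assuming each stored record carries an owner identifier. The `Record` class, the `owner_id` field and the sample data are hypothetical; real vector databases typically expose this as a metadata filter on the query rather than a Python list comprehension.

```python
from dataclasses import dataclass


@dataclass
class Record:
    owner_id: str  # hypothetical field identifying whose data this is
    text: str


records = [
    Record("alice", "Alice is allergic to peanuts."),
    Record("bob", "Bob avoids gluten."),
    Record("alice", "Alice ate sushi yesterday."),
]


def retrieve_for_user(user_id: str, candidates: list[Record]) -> list[str]:
    """Return only the texts owned by the requesting user."""
    return [r.text for r in candidates if r.owner_id == user_id]


alice_context = retrieve_for_user("alice", records)
print(alice_context)
```

Because the filter runs before anything is handed to the model, another user’s data never enters the prompt in the first place.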
RAG is made possible by orchestration tools like LangChain or LlamaIndex, which act as a sort of bridge that connects (or “chains”) models to a wide variety of proprietary and public data.
Fine-tuning is the right choice when a vector embedding model needs to be updated to produce more accurate vectors. It might also be sufficient when updating a model with publicly accessible data that isn’t likely to change or need updates often. Fine-tuning a model so that it uses the right corporate vernacular is one example; the new data is not likely to be time-sensitive, and the training dataset is most likely limited. You might also use it when a certain task requires a degree of accuracy that justifies throwing a lot of resources at it (Bloomberg did this when it built BloombergGPT earlier this year).
But when data needs to be protected, private and up-to-the-second accurate, it needs to be in a database, not in the model.
The reality is that AI in production ends up being personalized AI. It’s a company like Skypoint AI making medical recommendations. It’s Macquarie Bank giving you financial planning advice. It’s Priceline helping you plan your travel based on your travel preferences.
Under no circumstances should a model be fine-tuned with an electronic medical record or a bank account or your personal travel history.
Making RAG a part of an AI architecture extends the capability of the LLM by securely providing it with relevant, proprietary knowledge exactly when it’s required; the corpus of knowledge that’s accessible to the model in a vector database can be expanded and updated without expensive model-training jobs.