How Perplexity’s Online LLM Was Inspired by FreshLLMs Paper

We dig into the technology behind Perplexity’s Copilot, which was inspired by the FreshLLMs paper that proposed search engine-augmented LLMs.
Jan 24th, 2024 4:00am

Perplexity has been making waves since its appearance at the AWS re:Invent keynote in December 2023. Intrigued by the approach, I signed up for Copilot when it launched. Of the many AI assistants I have access to, I have found Perplexity’s Copilot to be the most useful and functional, because it offers the best of both worlds: generative AI and conventional search. I soon replaced my default search engine with its search companion.

Perplexity user interface

Now let’s understand the technology behind Perplexity AI’s Copilot.

Currently, large language models (LLMs) face two major challenges: obsolete data and hallucinations. Because a foundation model’s knowledge is cut off at the date its pre-training dataset was assembled, it cannot respond with the most recent information. And even the most capable models tend to make up answers, leading to hallucinations.

The first problem, a lack of access to the latest data, can be addressed by performing a web search and feeding the output to the LLM to help it make informed decisions. This can be accomplished by integrating APIs such as SerpAPI, which provides programmatic access to Google Search. Each time a prompt is sent, the LLM decides whether it needs access to the web and invokes the search API if required. The scraped content from multiple sources is then summarized and added as context to the prompt, enabling the LLM to return a useful and meaningful response.
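
To make the pattern concrete, here is a minimal sketch in Python, assuming the `google-search-results` SerpAPI client and the `openai` client. The `answer_with_search` helper and the prompt wording are illustrative, and the step where the model first decides whether a search is needed is omitted for brevity:

```python
# A minimal sketch of search-augmented prompting (not Perplexity's
# actual implementation). Assumes the google-search-results package
# for SerpAPI and the openai client, with API keys in the environment.
import os

from openai import OpenAI
from serpapi import GoogleSearch

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_search(question: str) -> str:
    # Query Google programmatically via SerpAPI.
    results = GoogleSearch(
        {"q": question, "api_key": os.environ["SERPAPI_API_KEY"]}
    ).get_dict()

    # Collect snippets from the top organic hits; a production system
    # would scrape and summarize the full pages from multiple sources.
    snippets = [
        r["snippet"]
        for r in results.get("organic_results", [])[:5]
        if "snippet" in r
    ]
    context = "\n".join(f"- {s}" for s in snippets)

    # Add the search output as context so the model answers from fresh data.
    prompt = (
        "Answer the question using the search results below.\n\n"
        f"Search results:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```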

The second problem, hallucination, can be addressed through a proven technique called retrieval augmented generation, or RAG. Unlike the previous approach, which makes a dynamic call to a search API, RAG retrieves data from a well-known data store, such as an externally maintained vector database or full-text search index.

It’s important to note that the first approach works best when the context is built from publicly available data. If you are building a Q&A or summarization application over data that is internal and private to your organization, RAG is the ideal solution.
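
For illustration, here is a minimal RAG sketch in which a plain in-memory list stands in for the vector database. The sample documents, the `answer_with_rag` helper and the choice of embedding model are all assumptions made for the example:

```python
# A minimal RAG sketch (illustrative, not any vendor's implementation).
# An in-memory list stands in for a vector database; assumes the
# openai client with OPENAI_API_KEY set.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


# Index private, internal documents ahead of time (illustrative data).
documents = [
    "Q3 revenue grew 12% quarter over quarter.",
    "The on-call rotation changes every Monday at 09:00 UTC.",
]
index = [(doc, embed(doc)) for doc in documents]


def answer_with_rag(question: str, k: int = 2) -> str:
    q = embed(question)

    # Retrieve the k most similar documents by cosine similarity.
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

    top = sorted(index, key=lambda item: cosine(item[1]), reverse=True)[:k]
    context = "\n".join(doc for doc, _ in top)

    prompt = (
        f"Context:\n{context}\n\n"
        f"Answer from the context only.\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```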

Perplexity AI relies primarily on the search engine-based approach for its Copilot. For use cases that need access to private data, it offers an OpenAI-compatible API that can be used with RAG.

FreshLLMs: Bringing Current Data to LLMs

Perplexity AI was inspired by the mechanism explained in the paper FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation, which proposed search engine-augmented LLMs. Similar to how RAG injects context into the prompt, FreshLLMs advocates injecting a summary of the top search hits, sorted by publication date. Beyond adding context, the paper also proposes few-shot prompting, which teaches the LLM how to respond based on a few examples.
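
The freshness-aware step fits in a few lines. The sketch below, with made-up hit data, shows the idea of ordering retrieved hits by publication date before building the prompt context:

```python
# Sketch of the freshness step FreshLLMs proposes: order search hits
# by publication date so the newest evidence forms the context.
# The hit data here is made up for illustration.
from datetime import date

hits = [
    {"snippet": "Old coverage of the topic.", "published": date(2022, 5, 1)},
    {"snippet": "Yesterday's update on the topic.", "published": date(2024, 1, 23)},
    {"snippet": "Coverage from last year.", "published": date(2023, 3, 14)},
]

# Most recent first; summarize the top hits into the prompt context.
fresh_first = sorted(hits, key=lambda h: h["published"], reverse=True)
context = "\n".join(h["snippet"] for h in fresh_first[:2])
print(context)
```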

The FreshLLMs paper classifies questions into four categories:

  1. Never-changing, where the answer almost never changes.
  2. Slow-changing answers that may change over the course of several years.
  3. Fast-changing answers, such as flight status and weather, which may change multiple times.
  4. False-premise, where the questions are factually incorrect and need to be rebutted.

The authors of the paper created a dataset of 600 questions divided into the above categories. Called the FreshQA benchmark, it tests a model’s ability to answer questions accurately, with more than 50,000 human judgments used to assess factual correctness. The evaluation uses two modes: RELAXED, which focuses on the correctness of the main answer, and STRICT, which requires that every claim in the response be factual and current. The study highlights the limitations of LLMs, especially with rapidly changing information and false-premise questions, and finds that simply increasing model size doesn’t guarantee better performance. It concludes that FreshQA presents a significant challenge for LLMs, indicating a need for further advancement.

The study found that pre-trained LLMs such as T5, PaLM, GPT-3.5 and GPT-4 struggled on the FreshQA dataset: response accuracy ranged from 0.8% to 32.0% under STRICT and from 0.8% to 46.4% under RELAXED. The STRICT evaluation, which requires all information to be factual and current, caused a significant drop in accuracy for models like GPT-3.5 and GPT-4, primarily due to their inability to access real-time information, resulting in outdated or refused answers. PaLM also saw a notable accuracy decrease under STRICT, often due to response artifacts and hallucinations. Conversely, FLAN-PaLM and Codex performed better, showing minimal hallucination thanks to their more concise and direct responses.

The authors also experimented with a technique called FRESHPROMPT, which introduces contextually relevant and up-to-date information from a search engine into a pre-trained LLM. Given a question, the method queries a search engine with it and retrieves all of the results, including the answer box, organic results and other useful information, such as the knowledge graph, questions and answers from crowdsourced QA platforms, and related questions that search users also ask. This evidence is then used, via few-shot prompting, to teach the LLM to reason over the retrieved material, improving the model’s ability to provide accurate and current responses.
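
Here is a hedged sketch of how such a FRESHPROMPT-style prompt might be assembled. The field names, the demonstration and the exact format are assumptions based on the paper’s description, not its verbatim template:

```python
# A sketch of FRESHPROMPT-style prompt assembly, based on the paper's
# description rather than its exact template. Evidence is ordered so
# the most recent item sits closest to the question, and a few-shot
# demonstration shows the model how to reason over the evidence.
# The demonstration below is illustrative.
FEW_SHOT_DEMO = """\
query: Who is the CEO of Twitter?
evidence: [1] (2023-06-05) example-news.com - Linda Yaccarino took over as CEO of Twitter.
answer: Linda Yaccarino
"""


def fresh_prompt(question: str, evidences: list[dict]) -> str:
    # evidences: [{"date": "YYYY-MM-DD", "source": ..., "snippet": ...}, ...]
    # drawn from the answer box, organic results, knowledge graph, etc.
    ordered = sorted(evidences, key=lambda e: e["date"])  # freshest last
    lines = [
        f"[{i + 1}] ({e['date']}) {e['source']} - {e['snippet']}"
        for i, e in enumerate(ordered)
    ]
    return (
        FEW_SHOT_DEMO
        + "\nquery: " + question
        + "\nevidence:\n" + "\n".join(lines)
        + "\nanswer:"
    )
```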

How Perplexity AI Implemented the Idea of FreshLLMs

Perplexity AI has built two online LLMs, pplx-7b-online and pplx-70b-online, which can access real-time information from the internet, enabling them to provide up-to-date and accurate responses. They combine open source base models, in-house search technology and fine-tuning to use information from the web effectively. They are designed to overcome the limitations of offline LLMs by answering time-sensitive queries with the most relevant and valuable information. The models are publicly accessible via an API, allowing developers to integrate the technology into their applications and websites.

The model pplx-7b-online is based on mistral-7b, while pplx-70b-online is built on top of the llama2-70b base model. Both have been fine-tuned to use snippets from the web effectively in their responses. According to Perplexity, it curates large, diverse, high-quality training sets through in-house data contractors to ensure strong performance in terms of helpfulness, factuality and freshness. The models also undergo regular fine-tuning to continually improve their performance. These efforts enable them to provide accurate, up-to-date and contextually relevant responses by leveraging real-time information from the internet.

Beyond the freshness of the responses, Perplexity AI also works to ensure that the models deliver helpful and factually accurate answers.

Recently, Perplexity AI announced the availability of an API for its online models as well as other models such as mixtral-8x7b-instruct, llama-2-70b-chat and codellama-34b-instruct. Pro subscribers of Perplexity Copilot get a $5 credit to use the API.
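
Because the API is OpenAI-compatible, calling one of the online models should look roughly like the sketch below. The base URL reflects Perplexity’s documentation at the time of writing; verify it against the current docs before relying on it:

```python
# A minimal sketch of calling Perplexity's OpenAI-compatible API with
# the openai client. The base URL reflects Perplexity's docs at the
# time of writing; check the current documentation before relying on it.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PPLX_API_KEY",  # from your Perplexity account settings
    base_url="https://api.perplexity.ai",
)

response = client.chat.completions.create(
    model="pplx-7b-online",  # online model with live web access
    messages=[{"role": "user", "content": "What is the latest LLM news?"}],
)
print(response.choices[0].message.content)
```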

In my next article, I will walk you through a tutorial on how to build applications based on Perplexity AI’s API. Stay tuned.
