
How Built an AI Search Using an LLM Gateway

How the tech news aggregator created an AI search, including using an LLM Gateway to streamline and optimize interactions with LLMs.
Nov 7th, 2023 9:00am

Today we are launching the beta version of our new AI-powered Search, which aims to deliver concise, accurate answers to your technical queries.

In this article, we’ll walk through what it takes to build a system like this at scale.

The Workflow

In essence, this use case resembles other LLM question-answering scenarios. Once the user submits a question, we need to generate a list of candidates from our database, extract the content, build a prompt and send it to an LLM to get the answer. Below you can see a simplified workflow for our search:

simplified workflow for AI search

  1. Query Generation: The journey begins when a user submits a question. We then turn it into a search engine query by removing irrelevant words and focusing on the user’s intent.
  2. Candidate Generation: The query is immediately sent to a web search API like those from Google or Bing. The API searches the web and returns the candidates in a matter of milliseconds.
  3. Scraping: The selected web pages undergo a scraping process. Here, the focus is on concurrency and caching, essential for real-time data extraction and content accumulation, optimizing the system’s performance and efficiency. The content is needed to build the final prompt and answer the question.
  4. Context Building: With the extracted content, building a prompt for the LLM follows. The process is intricate due to token limitations inherent in LLMs. Strategies such as ranking documents or sections by vector similarity to the question come into play, aiming for an optimal balance between detail and conciseness in the generated context. Every unnecessary token may cause higher latency and cost, and every missing token may cause hallucinations, so finding the balance is hard. Evaluate different strategies to find the best match for your use case.
  5. Answer Generation: The resulting prompt is sent to an LLM. Multiple models are evaluated to ascertain suitability based on several factors such as response accuracy, cost, and performance. The objective is to select a model that aligns with the quality expectations and constraints of the system.
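The five steps above can be sketched as a simple pipeline. This is a minimal illustration, not our production code: the helper names and the search, scraping, ranking and LLM calls are all stand-ins you would swap for real components.

```python
# Minimal sketch of the search workflow; every callable passed in is a
# stand-in for a real component (search API, scraper, ranker, LLM client).

def generate_query(question: str) -> str:
    # 1. Query generation: strip filler words, keep the user's intent.
    stopwords = {"please", "how", "do", "i", "the", "a", "an"}
    return " ".join(w for w in question.split() if w.lower() not in stopwords)

def answer(question: str, search_api, scrape, rank, llm,
           token_budget: int = 3000) -> str:
    query = generate_query(question)
    candidates = search_api(query)                    # 2. Candidate generation
    documents = [scrape(url) for url in candidates]   # 3. Scraping
    # 4. Context building: take ranked sections until the token budget runs out.
    context, used = [], 0
    for section, tokens in rank(documents, question):
        if used + tokens > token_budget:
            break
        context.append(section)
        used += tokens
    joined = "\n".join(context)
    prompt = f"Context:\n{joined}\n\nQuestion: {question}"
    return llm(prompt)                                # 5. Answer generation
```

In production each of these steps runs with concurrency, caching and tracing around it, but the data flow is the same.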

LLM Gateway

We have integrated LLMs across various facets of our product, starting with search and extending to areas like post-enrichment and recommendation systems. We found that GPT-3.5 Turbo models work best for our case, with a good balance of accuracy and cost. Recognizing the transformative potential of LLMs, we introduced a centralized service: LLM Gateway (internal codename Bragi).

It’s a service that streamlines and optimizes our interactions with LLMs. Written in Python, the Gateway acts as a conduit, offering foundational LLM building blocks such as chat, completion, and embedding functionalities. It also provides higher-order pipelines — prebuilt scenarios that enhance efficiency, like post-enrichment.
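As an illustration, a gateway like this might expose a small, provider-agnostic surface. The class and method names below are hypothetical, not Bragi’s actual API; the point is that callers see chat, completion and embedding primitives while model selection and logging stay inside the service.

```python
from dataclasses import dataclass, field

@dataclass
class LLMGateway:
    """Hypothetical facade over one or more LLM providers."""
    default_model: str = "gpt-3.5-turbo"
    log: list = field(default_factory=list)  # input/output stored for fine-tuning

    def _call(self, kind: str, payload: str) -> str:
        # A real gateway would select a provider and model here, send the
        # request, and serialize the response; this stub only records the call.
        response = f"<{kind} via {self.default_model}>"
        self.log.append((kind, payload, response))
        return response

    def chat(self, messages: list) -> str:
        return self._call("chat", str(messages))

    def completion(self, prompt: str) -> str:
        return self._call("completion", prompt)

    def embedding(self, text: str) -> str:
        return self._call("embedding", text)
```

Because every request passes through one choke point, swapping models or adding cost accounting becomes a change in one service rather than across every team’s codebase.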

The LLM Gateway simplifies the complexity involved in interacting with LLMs. It autonomously selects the appropriate model, crafts prompts, and serializes responses from the LLM, ensuring a smooth, coherent workflow. Centralizing this expertise within a single service frees the broader team from the nuanced intricacies of LLMs, allowing for a focused, specialized approach to managing and optimizing LLM interactions.

We chose gRPC as the communication protocol; we were drawn to its native streaming support and strict contract protocols, essential for maintaining production stability and integrity. The LLM Gateway is more than an access point — it also stores the input and output data, enabling us to fine-tune models, optimize performance, and manage costs effectively.

The Orchestrator

At this point, the workflow is clear and we have a service that can satisfy our LLM requirements. The last piece is an orchestrator: a service able to execute the workflow from start to finish and integrate everything. Our standard backend language is Go, and given the efficiency this service needs, it was an easy decision for us.

The internal codename is Magni (Bragi’s brother, for the Norse myth fans). A crucial aspect of this service is full traceability and debugging of requests. Given a search ID, we can see exactly the input and output of every step in the workflow, allowing us to improve the process and fix bugs. It also exposes an API for our application to submit feedback (upvote/downvote) per answer, retrieve a user’s history, and handle other application-related queries.
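The real orchestrator is written in Go; the Python sketch below only illustrates the per-step tracing idea: every step’s input, output and latency is recorded under a single search ID so a request can be replayed and debugged end to end. All names here are hypothetical.

```python
import time
import uuid

class SearchTrace:
    """Hypothetical per-search trace keyed by a unique search ID."""

    def __init__(self):
        self.search_id = str(uuid.uuid4())
        self.steps = []

    def record(self, name, fn, *args):
        """Run one workflow step and record its input, output and latency."""
        start = time.monotonic()
        output = fn(*args)
        self.steps.append({
            "step": name,
            "input": args,
            "output": output,
            "ms": (time.monotonic() - start) * 1000,
        })
        return output
```

With this shape, answering “what did step 3 see for search X?” is a lookup on the stored trace rather than log archaeology.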

Other Challenges

Building a system like this, especially at our scale, comes with its own set of hurdles. Here’s a look at some of the main challenges we encountered and how we tackled them:

  • Stateful services: Both the LLM Gateway and the orchestrator maintain a connection until a search is completed, making them stateful services. We had to be careful, ensuring that connections weren’t lost during service shutdowns and that the workload was spread evenly across different replicas to avoid any interruptions in the search process.
  • Cost analysis: Working with LLMs can get costly. We kept a close eye on expenses to avoid going over budget. An initial alpha stage, open to a limited number of users, helped us gauge usage patterns and costs, allowing us to make necessary adjustments before a broader rollout. (I suffered a minor heart attack when, for about 20 minutes, I miscalculated the cost at 1,000x what it actually was; let this serve as a warning that pricing is per thousand tokens!)
  • Performance: Given the complexity of the workflow and the need for real-time results, performance was a top priority. We focused on optimizing the system for speed and efficiency, using strategies like concurrency and caching, ensuring that reusable components (like scraped pages) were efficiently managed to speed up the search process.
  • Prompt engineering: Crafting the perfect prompt was challenging. We invested time in finding a prompt that resonated with our desired tone, conciseness, and accuracy. Through head-to-head evaluations and utilizing specialized tools, we were able to refine our approach and select the most effective prompts for our use case.
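The cost pitfall above is easy to reproduce: prices are quoted per 1,000 tokens, so forgetting the division inflates an estimate a thousand-fold. A back-of-the-envelope helper makes the unit explicit (the rates below are illustrative, not current pricing):

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_price_per_1k: float,
                  completion_price_per_1k: float) -> float:
    """Cost in dollars. Note: prices are quoted PER THOUSAND tokens."""
    return (prompt_tokens / 1000) * prompt_price_per_1k \
         + (completion_tokens / 1000) * completion_price_per_1k

# A search with 3,000 prompt tokens and 500 completion tokens
# at illustrative rates of $0.0015 / $0.002 per 1k tokens:
cost = estimate_cost(3000, 500, 0.0015, 0.002)  # 0.0045 + 0.001 = 0.0055
```

Dropping the two `/ 1000` divisions yields $5.50 instead of $0.0055 per search, which is exactly the x1000 scare described above.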


Building this AI Search has been a journey filled with technical challenges and learning. From managing stateful services to fine-tuning prompts, every step was crucial in developing a tool that effectively serves the developer community with accurate and immediate search answers.

We invite you to see it in action. Your feedback will be instrumental in refining and enhancing the tool further. Sign up for the waiting list here.
