
Getting a basic Retrieval-Augmented Generation (RAG) prototype running is often the easy part. The real challenge? Turning that simple “naive” RAG – the kind you build following initial tutorials – into a robust, production-ready system that delivers relevant results quickly and reliably.
This article is a field report documenting my journey tackling exactly that challenge. Forget best-practice theory; these are notes from the trenches, sharing the practical hurdles encountered and the specific areas where the common baseline setup often falls short under real-world constraints.
Just so we’re aligned, here’s the “naive” baseline RAG I started with, which often proves insufficient (a minimal code sketch follows the list):
- Chunking: Simple fixed-size cuts, minimal overlap. Just chopping up text.
- Embedding: Off-the-shelf models, no specific tuning.
- Retrieval: Basic vector search (brute-force or default ANN), taking raw top-k.
- Prompting: Retrieved context jammed directly into the LLM prompt.
- Constraint Awareness: Little thought given to local models, performance limits, or specific language needs.
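To make that baseline concrete, here’s a rough sketch of what such a naive pipeline might look like. This is an illustration under assumptions, not my exact project code: the model name, chunk sizes, and prompt template are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking: slice every `size` characters with a small
    # overlap, with no regard for sentence or paragraph boundaries.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf embedding model

documents = ["...your raw documents go here..."]
chunks = [c for doc in documents for c in chunk_text(doc)]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    # Brute-force cosine similarity against every chunk, raw top-k, no re-ranking.
    q = model.encode(query, normalize_embeddings=True)
    scores = chunk_vecs @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "some user question"
context = "\n".join(retrieve(question))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"  # jammed straight into the LLM
```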
While this naive setup can produce some output, its cracks appear fast in real applications. You immediately face issues with result relevance, latency, scalability, and robustness – falling far short of production expectations.
That’s the gap we’ll explore. This report digs into the crucial areas where this baseline fails and shares the interventions I experimented with – from optimizing retrieval speed to improving semantic relevance and adapting to constraints – aiming for a RAG system you can actually depend on.
Consideration #1: Retrieval Performance – Speed & Precision Are Non-Negotiable
Alright, so as I mentioned, one of the first walls you hit with a basic RAG setup is retrieval performance. Standard vector similarity search (like brute-force checking everything) might be okay when you’re just playing around with a tiny dataset, but it becomes a massive bottleneck once your data grows or you need quick responses. This is a serious problem because production systems demand both low latency and high relevance. You can’t have users waiting forever, and you can’t serve them junk.
This is where more sophisticated Approximate Nearest Neighbor (ANN) algorithms come into play. In my experiments, I focused specifically on implementing HNSW (Hierarchical Navigable Small World). Why HNSW? Because it generally offers a really good trade-off between search speed and the accuracy of the results.
The Basic Idea Behind HNSW
At its core, HNSW organizes your data vectors into a multi-layered graph structure. Picture a stack of graphs:
- Top Layers (Sparse): These have fewer connections, but the connections span longer distances across the vector space. Think of them as express highways for quickly navigating to the general vicinity of your target.
- Bottom Layers (Dense): These have lots of connections between nodes that are close neighbors. They allow for more precise, fine-grained searching once you’re in the right neighborhood.
This layered structure lets HNSW find likely nearest neighbors efficiently without having to compare your query vector against every single data point.

How Does the Search Actually Work?
Conceptually, searching in HNSW goes something like this:
- Enter from the Top: The search starts at an entry point in the highest (most sparse) layer.
- Greedy Navigation: From that entry point, the algorithm “greedily” moves to the neighboring node in that layer that’s closest to the query vector. It keeps doing this until it can’t find any neighbor closer than its current spot in that layer.
- Drop Down a Layer: Once it finds the best candidate in a layer, the algorithm moves down to the corresponding node (or its closest representation) in the layer below.
- Repeat: It repeats the greedy navigation (Step 2) and dropping down (Step 3) process until it reaches the very bottom layer (Layer 0, the densest one).
- The Result: The closest node found in that bottom layer is considered the best candidate for the nearest neighbor to the query vector. This process can be adapted to find the k nearest neighbors (k-NN).
Building the Graph – Indexing
This HNSW graph gets built during the indexing phase. When you add a new vector, it finds its nearest neighbors in the existing graph and creates connections. Each new node is also probabilistically assigned to a certain number of upper layers, which is how the hierarchy forms.
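To ground the mechanics above, here’s a minimal sketch of building and querying an HNSW index with the hnswlib library. The dimensionality, random vectors, and parameters (M, ef_construction, ef) are illustrative assumptions, not the settings from my experiments:

```python
import hnswlib
import numpy as np

dim, n = 384, 10_000
vectors = np.random.rand(n, dim).astype(np.float32)  # stand-in for your chunk embeddings

# Indexing: each inserted vector gets linked to nearby nodes and is
# probabilistically assigned to upper layers, which builds the hierarchy.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))

# Search: ef controls how many candidates the layer-by-layer greedy descent
# keeps around (higher = more accurate, slower).
index.set_ef(64)
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels[0])  # ids of the (approximate) 5 nearest vectors
```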
Implementation Results & Field Insights
In my project, implementing HNSW genuinely delivered a significant speed-up in retrieval compared to the initial brute-force baseline; response times improved dramatically.
So, the insight here holds true: efficient ANN search algorithms like HNSW are crucial for any serious RAG system. They aren’t just a “nice-to-have” optimization; they often form the essential foundation for achieving acceptable performance in real-world applications, especially as your data volume increases. If your retrieval is slow, the whole user experience suffers.
Consideration #2: Semantic Quality – Embedding Strategy is Key for Relevance
The second major headache that often pops up with a baseline RAG setup is the quality of the embeddings themselves. This directly impacts how relevant the retrieved results are, and here’s why:
First off, that naive fixed-size chunking strategy? It’s often pretty crude. Sometimes it just slices right through the middle of an important sentence or idea. The result is that a single chunk might not contain a complete thought, or worse, it becomes ambiguous. When you turn that into an embedding vector, its semantic representation isn’t exactly optimal. It doesn’t fully capture the meaning.
Second, even good, general-purpose embedding models aren’t necessarily the perfect fit for the specific task you have in mind. If your RAG is focused on answering questions (Q&A), embeddings derived from plain text chunks might not “connect” perfectly, semantically speaking, with the vector representing the user’s question.
And let’s be real: in a production setting, the quality of the LLM’s final answer leans heavily on how relevant the context you feed it is. If the context is semantically off-base or just plain wrong, the LLM is likely to spit out an unsatisfying answer. The old “garbage-in, garbage-out” principle applies strongly here.
Recognizing this issue, my next experiments focused on the strategy for representing the text before the embedding process. Instead of just embedding raw text chunks directly, I tried a different approach: reformatting the text chunks into a Question-Answer (QnA) format first.
The hypothesis was this: By transforming pieces of information into relevant question-and-answer pairs, the resulting embedding vectors might align better semantically with user queries, which are usually questions themselves. The hope was that this would increase the chances of retrieving the chunk most suited to actually answering the user’s specific question.
Implementation & Observations: The process roughly involved taking each original text chunk and trying to generate a few potential questions that could be answered by that chunk. The chunk itself then acted as the answer. These QnA pairs were then embedded. Now, actually generating these QnA pairs is a challenge in itself – you could do it manually, semi-automatically, or even use another model entirely.
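As a rough illustration of that flow, here’s a sketch under a couple of assumptions: generate_questions is a stand-in for whatever manual, heuristic, or LLM-based generation step you use, and the embedding model name is just an example of an off-the-shelf bi-encoder.

```python
from sentence_transformers import SentenceTransformer

def generate_questions(chunk: str) -> list[str]:
    # Placeholder: in practice this step is manual, heuristic, or handled by
    # another model; a trivial template stands in so the sketch runs end to end.
    return [f"What does the following passage explain: {chunk[:60]}...?"]

text_chunks = [
    "HNSW builds a multi-layer graph over vectors for fast approximate search.",
    "Cross-encoders score a (query, document) pair jointly for precise ranking.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed local bi-encoder

qna_records = []
for chunk in text_chunks:
    for question in generate_questions(chunk):
        qna_records.append({"question": question, "answer": chunk})

# Embed the question side; at query time the user's question is matched against
# these question embeddings and the paired chunk is returned as context.
question_embeddings = model.encode([r["question"] for r in qna_records])
```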
Qualitatively, based on the trials I ran, this QnA formatting strategy did seem to yield more relevant retrieval results for direct, question-style queries compared to just using embeddings from the standard fixed-size chunks. That said, it’s important to note this was an initial observation, and the world of embedding and chunking techniques is vast (as you might have heard me say before, “way too many techniques!” – and it’s true!). There are many other strategies out there, like sentence-window retrieval, semantic chunking, using metadata filters, and so on.
Field Insight: The key takeaway here is that embedding quality isn’t just about picking the fanciest embedding model. How you prepare and represent your text before you embed it (your chunking and formatting strategy) turns out to have a massive influence on the relevance of your retrieval results. This is one of those crucial “knobs” you really need to tune seriously if you want to improve your RAG’s quality. Often, it requires more task-specific experimentation with your data and use case than simply swapping out one embedding model for another.
Consideration #3: Dealing with Reality – Adapting to Models & Languages
Here’s something that often gets glossed over in basic RAG examples or simple tutorials: real-world constraints. You don’t always get to use the absolute latest, greatest State-of-the-Art (SOTA) models. You don’t always have beefy, top-tier hardware. And you often have to cater to very specific user needs, like their language. The naive approach typically just pretends these issues don’t exist.
In my RAG experiments, I bumped up against two significant constraints:
- Local Embedding Model: For practical reasons (could be budget, privacy concerns, or just wanting more control), I opted to run the embedding model locally on my own machine. The consequence? The model I could realistically run probably wasn’t the absolute peak performer compared to some giant models requiring massive resources. This directly impacts the quality of the embedding vectors generated (like I mentioned earlier, the results sometimes felt a bit “meh”).
- User Query Language: The target users or the data I was working with were primarily in Indonesian. Meanwhile, many embedding models (especially easily accessible or open-source ones) are heavily trained on English data and thus perform best in English or have an inherent English bias. This creates a potential mismatch between the language of the query and the embedding model’s “strengths.”
You simply can’t ignore constraints like these if you want to build something that actually works in the real world. There’s no point designing a super-sophisticated RAG system on paper if you can’t implement it or if it doesn’t fit the users’ needs.
Investigation & Pragmatic Solutions
Facing these limitations, I had to look for pragmatic solutions or workarounds. To tackle the language issue and the potential for better model performance in English, the step I tried was preprocessing the user queries: specifically, translating the query from Indonesian to English before performing the vector search.
Why do this? The hope was that by “meeting the model halfway” – aligning the query language with the language where the embedding model was likely stronger – the relevance of the vector search results could improve, even if the embedding model itself wasn’t SOTA. It’s definitely not the most theoretically elegant solution, but it felt like a necessary adaptive step given the situation.
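Here’s a minimal sketch of that preprocessing step. The deep-translator call and the embedding model name are assumptions for illustration; any machine-translation service, or even a quick LLM prompt, could fill the same role.

```python
from deep_translator import GoogleTranslator
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed English-leaning local model

def embed_query(query_id: str):
    # Translate the Indonesian query to English first, so it lands closer to the
    # part of the vector space the embedding model handles best.
    query_en = GoogleTranslator(source="id", target="en").translate(query_id)
    return model.encode(query_en, normalize_embeddings=True)

vector = embed_query("Bagaimana cara kerja HNSW?")  # "How does HNSW work?"
```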
Observations
Qualitatively, adding this translation step did seem to help improve the retrieval results for Indonesian queries, compared to searching directly with the original Indonesian query using the chosen embedding model. Of course, this adds a little bit of latency because of the extra translation step, but that became a trade-off made to chase better relevance with the available resources.
Field Insight
This gets to a crucial part of real-world engineering. Building a production-ready RAG system isn’t just about chasing the fanciest tech; it’s also about being clever and adapting to the constraints you actually have. Sometimes you need to get creative and find workarounds or “good enough” patches that fit your specific situation. The ability to design a system that’s robust despite resource or model limitations is a critical skill. You work with what you’ve got.
Consideration #4: Context Quality Control – The Crucial Role of Re-ranking with Cross-Encoders
So, we’ve tackled retrieval speed using HNSW. That’s great for getting potential candidates quickly. But, as we’ve touched upon, those initial top-k results from the ANN search aren’t always the most relevant, nor are they necessarily in the best order. This is where the next gap appears – another hurdle to overcome for production-level quality.
Here’s the Problem: The initial retrieval step using ANN (like HNSW) typically relies on vector embeddings generated by what’s called a Bi-Encoder.
- Bi-Encoder: This type of model works by encoding the query and the documents into vector representations separately. So, you get one vector for the query, and separate vectors for each document in your index. The big advantage? It’s fast. Encoding documents can be done offline, and the query encoding is quick, allowing for efficient similarity searches across potentially millions of documents in a vector database. This is what enables that initial fast retrieval. However, because the query and documents are processed independently, the bi-encoder doesn’t really capture the deep, nuanced interaction or meaning between the specific query and the content of a document. It’s basically measuring a ‘rough’ similarity between standalone vectors.
The Consequence? Your initial retrieval results might have good recall (they probably contain the relevant stuff somewhere in the list), but the precision and ranking can be sub-optimal. You might get noisy results, or documents that seem superficially similar but don’t actually answer the query well, creeping into your top slots.
The Solution: A Re-ranking Stage Using Cross-Encoders
To fix this, we need an extra layer of quality control: a re-ranking stage. And this is where Cross-Encoder models often shine.
- Cross-Encoder: Unlike a bi-encoder, this model takes the query and a single document as a combined input pair: (query, document). It then analyzes the interaction between the words in the query and the words in the document directly. The result? The cross-encoder can produce a much more accurate relevance score because it captures deeper semantic relationships and contextual nuances.
But, there’s a catch. Because it has to process each (query, document) pair individually, a cross-encoder is way slower and more computationally intensive than a bi-encoder. You absolutely couldn’t use it to search through millions of documents from the get-go.

[Figure: Bi-encoder vs Cross-encoder]
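To make the difference tangible, here’s a small sketch scoring the same (query, document) pair both ways with the sentence-transformers library; both model names are illustrative assumptions.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "How does HNSW speed up vector search?"
doc = "HNSW builds a layered graph so the search can skip most comparisons."

# Bi-encoder: query and document are encoded independently, then compared.
bi = SentenceTransformer("all-MiniLM-L6-v2")                  # assumed model name
rough_sim = util.cos_sim(bi.encode(query), bi.encode(doc))

# Cross-encoder: the (query, document) pair is scored jointly in a single pass.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model name
fine_score = cross.predict([(query, doc)])

print(rough_sim.item(), fine_score[0])
```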
How it’s Applied in RAG:
The most common strategy combines the strengths of both (a sketch follows the list):
- Fast Initial Retrieval: Use the Bi-Encoder + ANN (like HNSW) to quickly retrieve a larger set of initial candidates (say, N=50 or N=100). Speed and recall are the priorities here.
- Precise Re-ranking: Feed these N candidates (along with the original query) into a Cross-Encoder. The cross-encoder scores each one for relevance much more accurately.
- Final Selection: Take the top-k (say, k=3 or 5) results from the re-ranked list. These are the high-precision context snippets you’ll actually feed to the LLM.

[Figure: Implementation of Bi-Encoder and Cross-Encoder in RAG]
Observations and Field Insights:
Now, even though I didn’t get around to fully implementing a cross-encoder in this specific round of experiments, understanding this fundamental difference between bi-encoders and cross-encoders is critical. It clearly explains why re-ranking isn’t just some optional add-on; it’s a vital mechanism for significantly boosting the precision of the context given to the LLM.
So, the insight becomes even stronger: relying solely on the initial bi-encoder/ANN retrieval often isn’t good enough for production quality. A re-ranking step, ideally using a cross-encoder, acts as a high-precision filter. It ensures only the most relevant information actually reaches the LLM, making it a super important consideration for any serious RAG design aimed at accurate final answers.
Conclusion: Field Notes for Building More Resilient RAG Systems
So, what’s the main takeaway from all this RAG tinkering? If there’s one core message, it’s this: while getting a basic RAG up and running is often surprisingly easy, the path to a system that’s genuinely robust and ready for production demands serious attention to the nitty-gritty technical details. That initial prototype often just isn’t tough enough to handle real-world challenges.
Based on my direct experience wrestling with this stuff, here are a few key takeaways worth keeping in mind if you’re looking to elevate your RAG system beyond “it technically works”:
- Fast Retrieval is Mandatory: Efficient ANN algorithms like HNSW aren’t luxuries anymore; they’re often a baseline requirement to prevent your system from feeling sluggish, especially as your data scales up. Slow retrieval kills the user experience.
- Meaning Matters in Embeddings: Embedding quality isn’t just about the model you choose. How you prepare your text beforehand (your chunking and formatting strategy – like the QnA experiment) significantly impacts the relevance of what you retrieve.
- Embrace Real-World Limits: Production systems have to live with constraints (local models, specific languages, hardware limits). Sometimes pragmatic workarounds (like query translation) are the necessary, practical solutions you need to implement.
- The Final Precision Filter: Re-ranking (ideally with a cross-encoder) is super important. It acts as that final quality control step to ensure only the most relevant context actually makes it to the LLM, directly improving the accuracy of the final output.
Building a good RAG system turns out to be an iterative process. It’s not just about coding; it involves a lot of experimenting, tuning, and cleverly finding solutions within the constraints you’re given. Hopefully, these notes from the field offer some practical perspective for others out there who are also deep in the RAG trenches.
Code & Further Discussion:
Curious about the implementation details? You can find the code from these experiments in my GitHub repository:
https://github.com/eferist/RAG-Kitiran-Project
Feel free to check it out!