The RAG Pipeline
Mar 25, 2026
In This Chapter
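- How a RAG pipeline splits into offline indexing and online query-time retrieval and generation
- The trade-offs that matter in practice: recall vs precision, quality vs latency, freshness vs maintenance cost
- The interview framing that makes the pipeline easier to explain and debug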
What a RAG Pipeline Actually Is
A RAG system has one job: fetch the right context before asking the model to answer.
That sounds simple, but in practice it means you are running two separate pipelines:
OFFLINE / INDEXING
documents -> chunk -> embed -> store
ONLINE / QUERY TIME
question -> embed -> retrieve -> build context -> generate
If a candidate cannot explain those two phases clearly, they usually do not really understand RAG yet.
In interviews, the point is not to list the stages but to explain where the main trade-offs and failure modes live.
Phase 1: Indexing
The indexing pipeline prepares your data before any user asks a question.
- Break source documents into chunks
- Convert each chunk into an embedding
- Store the chunk text, metadata, and embedding in a retrieval system
This phase runs when documents are added or updated, not on every query.
The main goal is to make later retrieval accurate and cheap.
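The three indexing steps can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `chunk_text`, `toy_embed`, `build_index`, and `IndexedChunk` are hypothetical names, the "embedding" is a normalized bag of words standing in for a real embedding model, and a plain list stands in for the retrieval system.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    text: str
    metadata: dict
    embedding: dict  # sparse vector: token -> weight


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list:
    """Split text into overlapping word windows (assumes overlap < max_words)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text: str) -> dict:
    """Hypothetical stand-in for an embedding model: unit-length bag of words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def build_index(documents: dict) -> list:
    """Run chunk -> embed -> store for every document; a list plays the store."""
    index = []
    for doc_id, text in documents.items():
        for position, chunk in enumerate(chunk_text(text)):
            metadata = {"doc_id": doc_id, "position": position}
            index.append(IndexedChunk(chunk, metadata, toy_embed(chunk)))
    return index


index = build_index({"faq": "Refunds are processed within five days."})
print(len(index), index[0].metadata)
```

Storing the chunk text and metadata next to the embedding is what later allows filtering and prompt construction without going back to the source documents.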
Phase 2: Query-Time Retrieval and Generation
At query time, the system does the following:
- Embed the user question with the same embedding model used during indexing
- Retrieve candidate chunks from the vector store
- Optionally rerank or filter those chunks
- Build a prompt using the selected context
- Ask the LLM to answer using that context
This phase runs on every user request, so latency matters much more here.
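A rough sketch of the query-time steps, under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the index is a small in-memory list, and the final LLM call is left out because it depends on the client library in use. Reranking and filtering are also omitted here.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> dict:
    """Hypothetical stand-in for the embedding model used at indexing time."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def similarity(a: dict, b: dict) -> float:
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def retrieve(question: str, index: list, k: int = 2) -> list:
    """Return the k (score, text) pairs most similar to the question."""
    query_vec = toy_embed(question)  # same embedding function as the index
    scored = [(similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, chunks: list) -> str:
    """Assemble numbered context chunks and the question into one prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# A tiny in-memory index of (text, embedding) pairs, built ahead of time.
texts = [
    "Refunds are processed within five days.",
    "The office is closed on public holidays.",
]
index = [(text, toy_embed(text)) for text in texts]

top = retrieve("When are refunds processed?", index, k=1)
prompt = build_prompt("When are refunds processed?", [text for _, text in top])
# The prompt would now be sent to whichever LLM client the system uses.
```

Everything in this path runs per request, which is why the expensive work (chunking and embedding the documents) is done ahead of time and only the question is embedded here.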
Where RAG Pipelines Usually Fail
Most RAG systems do not fail because the LLM is weak. They fail because one stage in the pipeline is poorly designed.
| Stage | Typical failure |
|---|---|
| Chunking | Chunks are too large, too small, or cut in the wrong places |
| Embedding | Weak model, or different models used for indexing and querying |
| Retrieval | Relevant chunks are not returned in the top-k |
| Context building | Good chunks are retrieved but assembled poorly |
| Generation | The model ignores context or over-generalizes |
That is why interviewers often ask you to debug the pipeline step by step rather than discuss RAG in abstract terms.
Most production failures originate in retrieval, freshness, or context assembly before the model itself is at fault.
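One way to debug step by step is to take questions whose relevant chunk is already known and check where that chunk drops out. The sketch below assumes you log, per question, the retrieved top-k IDs and the IDs that actually reached the prompt; `locate_failure` and the trace layout are hypothetical, not part of any library.

```python
from collections import Counter


def locate_failure(gold_id, retrieved_ids, context_ids, answer_correct):
    """Name the first stage at which the known-relevant chunk was lost."""
    if gold_id not in retrieved_ids:
        # Could also be a chunking or embedding problem upstream of retrieval.
        return "retrieval_or_upstream"
    if gold_id not in context_ids:
        return "context_building"
    if not answer_correct:
        return "generation"
    return "ok"


# Hypothetical traces: (gold chunk, retrieved top-k, chunks in prompt, correct?)
traces = [
    ("c1", ["c1", "c4"], ["c1", "c4"], True),
    ("c2", ["c5", "c6"], ["c5", "c6"], False),
    ("c3", ["c3", "c7"], ["c7"], False),
    ("c8", ["c8"], ["c8"], False),
]
summary = Counter(locate_failure(*trace) for trace in traces)
print(summary)
```

Only the last trace points at the model: the right chunk was retrieved and placed in the prompt, yet the answer was still wrong.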
A Good Mental Model
Think of the pipeline as a narrowing funnel:
all documents
-> candidate chunks
-> top-k retrieved chunks
-> final context window
-> grounded answer
Each stage should reduce noise without dropping useful information.
If you lose relevant information too early, generation cannot recover it later.
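The last narrowing step, from top-k chunks to the final context window, can be sketched as a simple budgeted selection. This is an illustration under assumptions: `pack_context` is a hypothetical helper, and word count is used as a crude proxy for the token count a real system would measure with its model's tokenizer.

```python
def pack_context(ranked_chunks: list, max_words: int) -> list:
    """Keep chunks in rank order while they still fit the word budget."""
    selected = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > max_words:
            continue  # too big for the remaining budget; try the next chunk
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c", "d e f g", "h"]
context = pack_context(ranked, max_words=4)
print(context)  # ['a b c', 'h']
```

Skipping an oversized chunk instead of stopping is a design choice: it fills the window more completely, at the risk of including a lower-ranked chunk while a higher-ranked one is left out.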
Design Trade-Offs
Three trade-offs show up everywhere in RAG pipeline design:
- Recall vs precision: retrieve more chunks to avoid missing answers, or fewer chunks to reduce noise
- Quality vs latency: reranking and query rewriting improve accuracy but add cost and delay
- Freshness vs maintenance cost: more frequent re-indexing keeps knowledge current but increases operational work
These trade-offs matter more in interviews than memorizing tool names.
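The recall versus precision trade-off is easy to see with a small worked example. The ranking and relevance labels below are made up for illustration, and `precision_recall_at_k` is a hypothetical helper rather than a library function.

```python
def precision_recall_at_k(ranked_ids: list, relevant_ids: set, k: int):
    """Precision and recall of the first k retrieved chunk IDs."""
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Made-up ranking for one question; three chunks are actually relevant.
ranked = ["c1", "c7", "c3", "c9", "c4", "c2"]
relevant = {"c1", "c2", "c3"}

for k in (1, 3, 6):
    precision, recall = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={precision:.2f} recall={recall:.2f}")
```

In this example, raising k from 1 to 6 lifts recall from 0.33 to 1.00 while precision falls from 1.00 to 0.50: more of the answer is available, but half of the context is noise.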
Another useful framing is:
- better retrieval quality often means more moving parts
- simpler pipelines are easier to operate, but usually less precise
That is why production RAG is a systems problem, not just a prompt problem.
What This Article Does Not Cover
This article is the overview. The details belong elsewhere:
- chunking strategy
- embeddings
- vector databases
- metadata filtering
- retrieval optimization
- evaluation
If one article tries to fully teach all of those, it stops being a pipeline overview and becomes repetitive.
Key Questions
Q: Walk me through a RAG pipeline from documents to final answer.
A RAG pipeline has two phases. In the offline indexing phase, documents are chunked, embedded, and stored in a retrieval system. In the online query phase, the user's question is embedded, relevant chunks are retrieved, optional filtering or reranking is applied, a context window is built from the selected chunks, and the LLM generates an answer grounded in that context.
Q: Why do people separate indexing from query-time retrieval?
Indexing is expensive but infrequent. Query-time retrieval must be fast because it runs on every request. Separating them lets you precompute embeddings and store them once, instead of repeating heavy preprocessing for every user question.
Q: Which stage matters most in a RAG pipeline?
Retrieval quality usually matters most. If the system retrieves irrelevant or incomplete chunks, even a strong LLM cannot produce a correct grounded answer. Good generation cannot fix bad retrieval.
Q: Why does a RAG pipeline need more than just vector search?
Vector search gives you candidate chunks, but production systems often also need metadata filtering, reranking, thresholding, and context construction. Otherwise the system either returns too much noise or misses the most useful evidence.
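As a rough sketch of those extra steps, the function below applies a metadata filter, a score threshold, and a final cut to a list of vector-search candidates. The names (`Candidate`, `refine`) and the metadata fields are hypothetical. A real system would often push the metadata filter into the vector store query and replace the plain sort with a reranker's scores.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: float  # similarity score from vector search
    metadata: dict


def refine(candidates: list, required: dict, min_score: float, top_n: int) -> list:
    """Filter by metadata and score, then keep the top_n best candidates."""
    kept = [
        c for c in candidates
        if c.score >= min_score
        and all(c.metadata.get(key) == value for key, value in required.items())
    ]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_n]


candidates = [
    Candidate("2024 refund policy", 0.82, {"year": 2024, "lang": "en"}),
    Candidate("2021 refund policy", 0.85, {"year": 2021, "lang": "en"}),
    Candidate("2024 travel policy", 0.30, {"year": 2024, "lang": "en"}),
]
selected = refine(candidates, required={"year": 2024}, min_score=0.5, top_n=2)
print([c.text for c in selected])  # ['2024 refund policy']
```

The highest-scoring candidate is rejected because it comes from the wrong year, which is the kind of error pure vector similarity cannot catch.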
Q: What trade-offs matter most in a RAG pipeline interview answer?
The most important ones are recall versus precision, quality versus latency, and freshness versus maintenance cost. A strong answer should show that improving one part of the pipeline often makes another part more expensive or more complex.