Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers naive RAG, summary indexing, re-ranking, and production evaluation metrics.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, achieving up to 81% accuracy.
Claude excels at general knowledge tasks, but when you need answers rooted in your internal documentation, customer support articles, or proprietary research, a standard LLM prompt often falls short. This is where Retrieval Augmented Generation (RAG) becomes your most powerful tool.
RAG enables Claude to search your knowledge base, retrieve the most relevant chunks, and generate answers grounded in those retrieved documents. In this guide, we'll build a RAG system from scratch using Claude, Voyage AI embeddings, and an in-memory vector store. We'll then go beyond "vibes-based" evaluation and show you how to measure and improve your pipeline with concrete metrics.
By the end, you'll understand how to move from a basic "naive RAG" setup to an advanced system using summary indexing and re-ranking — boosting end-to-end accuracy from 71% to 81%.
What You'll Need
Before we begin, set up your environment with these libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
You'll also need API keys from Anthropic and Voyage AI. Store them as environment variables:
import os
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"
Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG pipeline. This is often called "naive RAG" — it works, but it has clear limitations.
Step 1: Chunk Your Documents
We'll split documents by headings. Each chunk contains the content under a single subheading:
def chunk_by_headings(text):
    chunks = []
    current_heading = None
    current_content = []
    for line in text.split('\n'):
        if line.startswith('##') or line.startswith('###'):
            if current_heading:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content)
                })
            current_heading = line
            current_content = []
        else:
            current_content.append(line)
    if current_heading:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content)
        })
    return chunks
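To see the chunker in action, you might run it over one of your markdown files. The file path below is a placeholder; point it at your own documentation:

# Hypothetical path -- use your own markdown docs here
with open("docs/claude_documentation.md") as f:
    doc_text = f.read()

chunks = chunk_by_headings(doc_text)
print(f"Created {len(chunks)} chunks; first heading: {chunks[0]['heading']}")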
Step 2: Embed and Store
Use Voyage AI to generate embeddings for each chunk, then store them in an in-memory vector database:
import voyageai

vo = voyageai.Client()

# Generate embeddings for all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple dict (in production, use Pinecone, Weaviate, etc.)
vector_db = {}
for i, chunk in enumerate(chunks):
    vector_db[i] = {
        'text': chunk['content'],
        'embedding': embeddings[i]
    }
Step 3: Retrieve and Generate
When a user asks a question, embed the query, find the most similar chunks using cosine similarity, and pass them to Claude:
import numpy as np
from anthropic import Anthropic

client = Anthropic()

def retrieve(query, top_k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for idx, doc in vector_db.items():
        # Cosine similarity between the query and each stored chunk
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        scores.append((idx, similarity))
    scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [idx for idx, _ in scores[:top_k]]
    return [vector_db[idx]['text'] for idx in top_indices]

def answer_question(query):
    chunks = retrieve(query)
    context = "\n\n---\n\n".join(chunks)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based only on the provided context.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text
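At this point you can already ask questions end to end. A quick smoke test might look like this; the question is illustrative, so substitute one your knowledge base actually covers:

# Illustrative question -- replace with one your documents can answer
print(answer_question("What models are available through the Anthropic API?"))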
This basic pipeline works, but it has a clear weakness: it matches the query embedding directly against raw chunk text. If a relevant chunk phrases the information very differently from the query, or buries it among unrelated details, its embedding can land far from the query's and the chunk will be missed.
Building a Robust Evaluation System
To improve your RAG system, you need to measure it. We'll evaluate two things separately:
- Retrieval quality — How well does the system find relevant chunks?
- End-to-end accuracy — Does Claude produce the correct final answer?
Creating an Evaluation Dataset
We synthetically generated 100 test samples. Each sample contains:
- A question
- A list of "golden" chunk IDs that contain the answer
- A correct answer string
import json

with open("evaluation/docs_evaluation_dataset.json") as f:
    eval_data = json.load(f)

# Preview the first sample
print(eval_data[0])
{
    "question": "What is the max token limit for Claude 3 Opus?",
    "relevant_chunks": [12, 45],
    "correct_answer": "200,000 tokens"
}
Key Retrieval Metrics
We'll track four retrieval metrics:
| Metric | What It Measures | Formula |
|---|---|---|
| Precision | Of the chunks we retrieved, how many were relevant? | TP / (TP + FP) |
| Recall | Of all relevant chunks, how many did we retrieve? | TP / (TP + FN) |
| F1 Score | Harmonic mean of precision and recall | 2 × P × R / (P + R) |
| MRR | How high did the first relevant chunk rank? | 1 / rank_first_relevant |
def calculate_metrics(retrieved_chunks, relevant_chunks):
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    true_positives = len(retrieved_set & relevant_set)

    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    # MRR: reciprocal rank of the first relevant chunk (0 if none retrieved)
    for rank, chunk in enumerate(retrieved_chunks, 1):
        if chunk in relevant_set:
            mrr = 1 / rank
            break
    else:
        mrr = 0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mrr": mrr
    }
End-to-End Accuracy
For the final answer, we use Claude itself to judge correctness:
def evaluate_answer(question, generated_answer, correct_answer):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        system="You are an evaluator. Respond with exactly 'CORRECT' or 'INCORRECT'.",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nCorrect answer: {correct_answer}\nGenerated answer: {generated_answer}\n\nIs the generated answer correct?"
        }]
    )
    return response.content[0].text.strip() == "CORRECT"
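With both pieces in place, you can score the whole pipeline over the evaluation set. The loop below is a minimal sketch; it assumes a retrieve_ids helper, a small variation of retrieve that returns chunk indices instead of text so they can be compared against the golden chunk IDs:

def retrieve_ids(query, top_k=3):
    # Same as retrieve(), but returns chunk indices for comparison
    # against the golden chunk IDs in the evaluation set.
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for idx, doc in vector_db.items():
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        scores.append((idx, similarity))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [idx for idx, _ in scores[:top_k]]

retrieval_metrics = []
correct = 0
for sample in eval_data:
    retrieved_ids = retrieve_ids(sample["question"])
    retrieval_metrics.append(calculate_metrics(retrieved_ids, sample["relevant_chunks"]))
    generated = answer_question(sample["question"])
    if evaluate_answer(sample["question"], generated, sample["correct_answer"]):
        correct += 1

averages = {key: sum(m[key] for m in retrieval_metrics) / len(retrieval_metrics)
            for key in ("precision", "recall", "f1", "mrr")}
print(averages)
print(f"End-to-end accuracy: {correct / len(eval_data):.0%}")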
Level 2: Summary Indexing
The first improvement is summary indexing. Instead of only storing raw chunks, we also generate and store a one-sentence summary of each chunk. During retrieval, we compare the query against the summaries first, then fetch the full chunks.
def generate_summary(chunk_text):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system="Summarize the following text in one sentence.",
        messages=[{"role": "user", "content": chunk_text}]
    )
    return response.content[0].text

# During indexing, attach a summary to every chunk
for chunk in chunks:
    chunk['summary'] = generate_summary(chunk['content'])

# During retrieval, embed the summaries instead of the full text
summary_embeddings = vo.embed([c['summary'] for c in chunks], model="voyage-2").embeddings
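Retrieval then works much like before, except the query is compared against the summary embeddings while the full chunk text is what gets returned to Claude. A minimal sketch, assuming summary_embeddings is ordered the same way as chunks:

def retrieve_by_summary(query, top_k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for i, emb in enumerate(summary_embeddings):
        # Match the query against the chunk's summary...
        similarity = np.dot(query_embedding, emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(emb)
        )
        scores.append((i, similarity))
    scores.sort(key=lambda x: x[1], reverse=True)
    # ...but return the full chunk text for answer generation
    return [chunks[i]['content'] for i, _ in scores[:top_k]]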
This simple change improved our recall from 0.66 to 0.69 — we were now finding relevant chunks even when the query used different wording.
Level 3: Summary Indexing + Re-Ranking
The final optimization is re-ranking. After retrieving the top 10 chunks by summary similarity, we use Claude to score each chunk's relevance to the query, then keep only the top 3:
def rerank(query, chunks, top_k=3):
    scored_chunks = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            system="Rate relevance from 0 to 10. Only output the number.",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\nChunk: {chunk[:500]}\n\nRelevance score:"
            }]
        )
        score = float(response.content[0].text.strip())
        scored_chunks.append((chunk, score))
    scored_chunks.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored_chunks[:top_k]]
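Putting the pieces together, the Level 3 pipeline casts a wide net by summary similarity and then lets the re-ranker narrow it down. A sketch, reusing the retrieve_by_summary helper from the previous section:

def answer_question_v3(query):
    candidates = retrieve_by_summary(query, top_k=10)  # wide net by summary similarity
    best_chunks = rerank(query, candidates, top_k=3)   # keep only the most relevant
    context = "\n\n---\n\n".join(best_chunks)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based only on the provided context.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text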
Re-ranking dramatically improved our Mean Reciprocal Rank (MRR) from 0.74 to 0.87 — the first relevant chunk was now almost always at position 1.
Results Summary
Here's how the metrics improved across our three levels:
| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Precision | 0.43 | 0.44 | 0.44 |
| Recall | 0.66 | 0.69 | 0.69 |
| F1 Score | 0.52 | 0.54 | 0.54 |
| MRR | 0.74 | 0.78 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
Key Takeaways
- Evaluate retrieval and generation separately. A perfect retrieval system is useless if Claude can't synthesize the answer, and a perfect generator is useless if it never sees the right context. Measure both.
- Summary indexing boosts recall. By matching queries against concise summaries rather than raw text, you capture semantically related chunks that direct matching against the full chunk text would miss.
- Re-ranking dramatically improves MRR. Using Claude to score relevance after initial retrieval ensures the most useful chunks appear first, which improves final answer quality.
- Start simple, then optimize. Begin with basic RAG, establish your baseline metrics, then add complexity only where you see clear gaps.
- Use synthetic evaluation datasets. Generate 50-200 question-answer pairs from your own documents. This gives you a reliable benchmark without manual labeling; a generation sketch follows below.
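One way to bootstrap such a dataset is to have Claude write a question-answer pair for each chunk. The prompt and JSON shape below are illustrative assumptions, not the exact recipe used for the dataset in this guide:

import json

def generate_qa_pair(chunk_id, chunk_text):
    # Ask Claude to write one question this chunk can answer, plus the answer
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        system=('Write one question answerable from the given text and its answer. '
                'Respond only with JSON: {"question": "...", "correct_answer": "..."}'),
        messages=[{"role": "user", "content": chunk_text}]
    )
    pair = json.loads(response.content[0].text)
    pair["relevant_chunks"] = [chunk_id]
    return pair

# Generate pairs for (a sample of) your chunks, then review them by hand
eval_data = [generate_qa_pair(i, c['content']) for i, c in enumerate(chunks[:100])]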