# Building a Production-Grade RAG System with Claude: From Naive to Optimized
Learn to build, evaluate, and optimize a Retrieval Augmented Generation system with Claude. Covers basic RAG, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced optimizations. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy, then improve results with summary indexing and re-ranking.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your specific business context. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your own documents.
In this guide, we'll walk through building a RAG system using the Claude Documentation as our knowledge base. We'll start with a basic implementation, then show you how to measure performance properly, and finally apply advanced techniques that improved our end-to-end accuracy from 71% to 81%.
## What You'll Learn
- How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
- How to build a robust evaluation suite with 5 key metrics
- How to implement summary indexing for better retrieval
- How to use Claude as a re-ranker to improve result quality
## Prerequisites
You'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, and `scikit-learn`
## Level 1: Basic RAG (Naive Approach)
Let's start with the simplest possible RAG implementation. This is often called "Naive RAG" in the industry, and it involves three steps:
- Chunk documents by heading (each subheading becomes a chunk; see the sketch below)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
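Step 1 depends on your source format. Here's a minimal sketch of heading-based chunking for markdown files; the function name, the regex, and the `heading`/`source`/`content` fields are illustrative assumptions, not a fixed API:

```python
import re
from typing import Dict, List

def chunk_by_heading(markdown_text: str, source: str) -> List[Dict[str, str]]:
    """Split a markdown document into one chunk per heading (illustrative sketch)."""
    chunks: List[Dict[str, str]] = []
    current_heading = None
    current_lines: List[str] = []

    def flush():
        # Emit the section collected so far as one chunk
        if current_heading is not None:
            chunks.append({
                'heading': current_heading,
                'source': source,
                'content': current_heading + "\n" + "\n".join(current_lines).strip(),
            })

    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):  # any markdown heading starts a new chunk
            flush()
            current_heading = line.strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return chunks
```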
### Setting Up the Vector Database
For this example, we'll use an in-memory vector database. In production, you'd want a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc['content'] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Search for relevant documents using cosine similarity."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        similarities = [
            np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
            for emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
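A quick usage sketch, assuming the `VOYAGE_API_KEY` environment variable is set and using two placeholder chunks:

```python
import os

db = InMemoryVectorDB(api_key=os.environ["VOYAGE_API_KEY"])
db.add_documents([
    {'heading': '## Rate limits', 'content': '## Rate limits\n...'},
    {'heading': '## Streaming', 'content': '## Streaming\n...'},
])

# Returns the top-2 chunks ranked by cosine similarity to the query embedding
for chunk in db.search("How many requests per minute can I make?", top_k=2):
    print(chunk['heading'])
```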
### Building the RAG Pipeline
Now let's create the full pipeline that retrieves documents and generates answers with Claude:
```python
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.anthropic = Anthropic(api_key=anthropic_key)
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)

    def query(self, question: str) -> str:
        # 1. Retrieve relevant chunks
        chunks = self.vector_db.search(question, top_k=3)

        # 2. Build context from retrieved chunks
        context = "\n\n".join([chunk['content'] for chunk in chunks])

        # 3. Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system="You are a helpful assistant. Answer the question based on the provided context.",
            messages=[
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        return response.content[0].text
```
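And a minimal end-to-end sketch; `docs` is assumed to be the list of heading-based chunks prepared earlier, and the API keys are read from the environment:

```python
import os

rag = BasicRAG(
    anthropic_key=os.environ["ANTHROPIC_API_KEY"],
    voyage_key=os.environ["VOYAGE_API_KEY"],
)
rag.vector_db.add_documents(docs)  # docs: heading-based chunks of the knowledge base

print(rag.query("How do I stream a response from Claude?"))
```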
This works, but how well does it actually perform? That's where evaluation comes in.
## Building an Evaluation System
"Vibes-based" evaluation won't cut it for production systems. You need to measure two things independently:
- Retrieval performance – How well does your system find relevant documents?
- End-to-end performance – Does Claude actually answer correctly?
### Creating a Test Dataset
We synthetically generated 100 evaluation samples. Each sample contains:
- A question
- The chunks that are relevant to that question (ground truth)
- A correct answer
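Concretely, a sample can be a small dict; the field names here are assumptions chosen to match the evaluation code below:

```python
# One synthetic evaluation sample (field names and values are illustrative)
eval_sample = {
    'question': "How do I limit the number of output tokens?",
    'relevant_chunks': [
        {'heading': '## Max tokens', 'content': '## Max tokens\n...'},  # ground-truth chunks
    ],
    'correct_answer': "...",  # reference answer used for end-to-end grading
}
```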
### The Five Key Metrics
#### 1. Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means you're not wasting Claude's context window on irrelevant information.
#### 2. Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
#### 3. F1 Score
The harmonic mean of precision and recall, giving you a balanced view of retrieval quality.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
#### 4. Mean Reciprocal Rank (MRR)
MRR measures how high the first relevant result appears in your retrieval list. If the first relevant chunk is ranked #1, that's perfect. If it's #3, that's worse.
```python
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
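For example, if the first relevant chunk shows up in position 2, the reciprocal rank is 0.5; averaging reciprocal ranks over the whole dataset gives the MRR:

```python
# Toy check using string IDs instead of chunk dicts
retrieved = ["chunk_b", "chunk_a", "chunk_c"]
relevant = ["chunk_a"]
assert calculate_mrr(retrieved, relevant) == 0.5
```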
#### 5. End-to-End Accuracy
This is the ultimate metric: does Claude's final answer match the expected answer? This requires human or LLM-as-judge evaluation.
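One way to implement the LLM-as-judge option is to ask Claude to grade each answer against the reference; the prompt and the PASS/FAIL convention below are our own assumptions, not a standard API:

```python
def judge_answer(question, generated_answer, correct_answer, anthropic_client):
    """Use Claude as a judge: return True if the generated answer matches the reference."""
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        system=(
            "You are grading a RAG system. Compare the generated answer to the reference answer. "
            "Reply with exactly PASS if they agree on the substance, otherwise FAIL."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Reference answer: {correct_answer}\n\n"
                f"Generated answer: {generated_answer}"
            )
        }]
    )
    return response.content[0].text.strip().upper().startswith("PASS")
```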
### Running the Evaluation
```python
def evaluate_retrieval(rag_system, eval_dataset):
    results = []
    for sample in eval_dataset:
        retrieved = rag_system.vector_db.search(sample['question'], top_k=3)
        relevant = sample['relevant_chunks']

        # Count overlap with a membership check (chunk dicts aren't hashable, so no set())
        true_positives = len([chunk for chunk in retrieved if chunk in relevant])

        precision = true_positives / len(retrieved)
        recall = true_positives / len(relevant)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        mrr = calculate_mrr(retrieved, relevant)

        results.append({
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'mrr': mrr
        })
    return results
```
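To turn the per-sample results into the averages reported below, aggregating with pandas (already in the prerequisites) is enough; `rag` and `eval_dataset` are assumed to be defined as above:

```python
import pandas as pd

results = evaluate_retrieval(rag, eval_dataset)
df = pd.DataFrame(results)
print(df[['precision', 'recall', 'f1', 'mrr']].mean())  # one average per metric
```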
## Level 2: Summary Indexing
Our basic RAG had a problem: chunks were too granular. A chunk about "API Rate Limits" might not contain the phrase "how many requests per minute," even though it's the right answer.
Summary indexing solves this by creating a separate index of chunk summaries. When a query comes in, you search the summary index first, then retrieve the full chunks.

```python
def create_summary_index(documents, anthropic_client):
    summary_index = []
    for doc in documents:
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            system="Summarize this document chunk in 1-2 sentences.",
            messages=[{"role": "user", "content": doc['content']}]
        )
        summary_index.append({
            'summary': response.content[0].text,
            'original': doc
        })
    return summary_index
```
Now when searching, we first find relevant summaries, then retrieve the corresponding full chunks. This improved our recall from 0.66 to 0.69.
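A minimal sketch of that lookup step, assuming the summaries are loaded into a second InMemoryVectorDB whose entries keep a pointer back to the original chunk (the helper names are ours):

```python
def build_summary_db(summary_index, voyage_key):
    """Embed the summaries; each entry keeps a pointer back to its original chunk."""
    db = InMemoryVectorDB(api_key=voyage_key)
    db.add_documents([
        {'content': entry['summary'], 'original': entry['original']}
        for entry in summary_index
    ])
    return db

def search_via_summaries(summary_db, query, top_k=3):
    """Search over summaries, then return the corresponding full chunks."""
    return [hit['original'] for hit in summary_db.search(query, top_k=top_k)]
```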
## Level 3: Summary Indexing + Re-Ranking
The final optimization is re-ranking. After retrieving candidates with cosine similarity, we use Claude to re-rank them based on actual relevance to the query.
```python
def rerank_with_claude(query, candidates, anthropic_client):
    # Ask Claude to score each candidate's relevance
    scores = []
    for chunk in candidates:
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=50,
            system="Rate the relevance of this chunk to the query from 0-10. Return only the number.",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\n\nChunk: {chunk['content']}"
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            score = 0.0
        scores.append(score)

    # Sort by score descending; the key avoids comparing the chunk dicts themselves
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]
    return ranked
```
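In practice you over-retrieve a wider candidate set with embeddings and then let the re-ranker pick the final few. A sketch of that wiring, reusing the `search_via_summaries` helper from the previous section (the candidate counts are illustrative):

```python
def search_with_rerank(summary_db, query, anthropic_client, candidates_k=10, final_k=3):
    """Over-retrieve with embeddings, then let Claude order the candidates."""
    candidates = search_via_summaries(summary_db, query, top_k=candidates_k)
    ranked = rerank_with_claude(query, candidates, anthropic_client)
    return ranked[:final_k]
```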
This dramatically improved our MRR from 0.74 to 0.87, meaning the most relevant chunk almost always appears first.
## Results Summary
Here's what we achieved with each optimization:
| Metric | Basic RAG | + Summary Index | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
## Key Takeaways
- Evaluate retrieval and generation separately. You can't improve what you don't measure. Use precision, recall, F1, and MRR for retrieval, and accuracy or LLM-as-judge for end-to-end performance.
- Summary indexing improves recall. By creating searchable summaries, you help the retrieval system find relevant chunks even when the query doesn't match the exact wording.
- Re-ranking with Claude significantly boosts MRR. Using Claude to re-rank candidates ensures the most relevant information appears first, which improves final answer quality.
- Start simple, then optimize. A basic RAG pipeline works surprisingly well. Only add complexity (summary indexing, re-ranking) when you've measured the baseline and identified specific weaknesses.
- Watch your rate limits. Full evaluations can be token-intensive. Consider sampling your dataset or using a smaller model for intermediate steps.