Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks through building a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement chunking, embedding, retrieval, and evaluation metrics like Precision, Recall, F1, and MRR. Advanced techniques include summary indexing and Claude-powered re-ranking to boost end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is a cornerstone of many enterprise AI applications. While Claude excels at general knowledge tasks, it needs RAG to answer questions specific to your business context, whether that's internal documentation, customer support knowledge bases, or financial analysis reports.
In this guide, we'll build a RAG system using Claude and Voyage AI embeddings, then systematically improve it through evaluation-driven optimization. We'll move beyond "vibes-based" testing and implement proper metrics that production systems demand.
Why RAG Matters for Claude Users
Claude's training data has a cutoff date, and it doesn't know your company's internal documents. RAG bridges this gap by:
- Retrieving relevant chunks from your knowledge base
- Injecting them into Claude's context window
- Enabling accurate, grounded answers to domain-specific questions
Prerequisites and Setup
Before diving in, you'll need:
- Anthropic API key for Claude access
- Voyage AI API key for high-quality embeddings
- Python environment with these libraries:
```bash
# Core dependencies
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
Initialize Your Vector Database
For this guide, we'll use an in-memory vector store. In production, consider hosted solutions like Pinecone, Weaviate, or Chroma.
```python
import numpy as np
from typing import List, Dict, Tuple

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: List[str], embeddings: List[List[float]]):
        self.documents.extend(docs)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: List[float], k: int = 3) -> List[Tuple[str, float]]:
        # Cosine similarity search
        similarities = []
        for doc_emb in self.embeddings:
            sim = np.dot(query_embedding, doc_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
            )
            similarities.append(sim)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [(self.documents[i], similarities[i]) for i in top_indices]
```
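The per-document loop above is easy to follow but slow for large collections. As a rough sketch of the same cosine-similarity search in vectorized form (the function name `cosine_top_k` and the toy vectors are illustrative assumptions, not part of the original store):

```python
import numpy as np

def cosine_top_k(query, matrix, k=3):
    """Return (row_index, similarity) pairs for the k rows most similar to `query`."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(matrix, dtype=float)
    # Normalize once so a single matrix-vector product yields cosine similarities.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q
    top = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy 2-D vectors standing in for real embeddings (illustrative values only).
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = cosine_top_k([1.0, 0.1], docs, k=2)
print(result)
```

The ranking is the same as the loop version; only the arithmetic is batched. Zero-length vectors would still need separate handling in either form.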
Level 1: Basic RAG Pipeline
Let's start with what the industry calls "Naive RAG." This three-step process is the foundation:
- Chunk documents by heading (each subheading becomes a chunk)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks via cosine similarity
Implementation
```python
import voyageai
from anthropic import Anthropic

# Initialize clients
vo = voyageai.Client(api_key="your-voyage-api-key")
claude = Anthropic(api_key="your-anthropic-api-key")

def chunk_document(text: str) -> List[str]:
    """Split document by headings (## or ###)"""
    chunks = []
    current_chunk = []
    for line in text.split('\n'):
        # startswith('##') also matches '###' and deeper headings
        if line.startswith('##'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

def basic_rag(query: str, vector_db: InMemoryVectorDB) -> str:
    # Step 1: Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Step 2: Retrieve relevant chunks
    retrieved = vector_db.search(query_embedding, k=3)
    context = "\n\n---\n\n".join([doc for doc, _ in retrieved])

    # Step 3: Generate answer with Claude
    # Model IDs change over time; substitute one currently available to your account.
    response = claude.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```
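`basic_rag` assumes the vector store has already been populated. A minimal sketch of that indexing step follows, batching chunks before embedding them. The names `index_chunks` and `embed_fn` and the default batch size of 64 are assumptions for illustration. With Voyage AI, `embed_fn` could wrap `vo.embed(batch, model="voyage-2").embeddings`, and the right batch size depends on your provider's limits.

```python
from typing import Callable, List, Sequence

def index_chunks(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,  # assumed default; check your embedding provider's limit
) -> List[List[float]]:
    """Embed chunks in batches and return one embedding per chunk, in order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input text")
        embeddings.extend(vectors)
    return embeddings

# Stand-in embedder so the sketch runs without an API key (not a real model).
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[float(len(t)), 1.0] for t in texts]

chunks = ["## Setup\nInstall the SDK.", "## Usage\nCall the API."]
vectors = index_chunks(chunks, fake_embed, batch_size=1)
print(len(vectors))
```

The returned list lines up with `chunks`, so both can be passed to `vector_db.add_documents(chunks, vectors)`.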
Building an Evaluation System
This is where most RAG tutorials fall short. We need to measure:
- Retrieval performance (is the system finding the right chunks?)
- End-to-end accuracy (is Claude giving correct answers?)
Creating a Test Dataset
Generate 100+ test samples, each containing:
- A question
- Ground-truth relevant chunks
- A correct answer
```python
# Example test sample structure
test_sample = {
    "question": "How do I set up rate limiting in Claude API?",
    "relevant_chunks": [
        "Rate limits are applied per API key...",
        "To increase your rate limit, contact..."
    ],
    "correct_answer": "Rate limits are configured per API key..."
}
```
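Given samples in this shape, end-to-end accuracy is the fraction of questions answered correctly. The sketch below shows one way to compute it. `end_to_end_accuracy`, `answer_fn`, and `grade_fn` are hypothetical names. In practice `answer_fn` might wrap the RAG pipeline and `grade_fn` might be a Claude-based grader or a human review step. The exact-match grader here exists only so the example runs offline.

```python
from typing import Callable, Dict, List

def end_to_end_accuracy(
    test_data: List[Dict],
    answer_fn: Callable[[str], str],
    grade_fn: Callable[[str, str, str], bool],
) -> float:
    """Fraction of samples whose generated answer is graded correct."""
    if not test_data:
        return 0.0
    correct = 0
    for sample in test_data:
        answer = answer_fn(sample["question"])
        if grade_fn(sample["question"], sample["correct_answer"], answer):
            correct += 1
    return correct / len(test_data)

# Stand-ins for the pipeline and the grader (illustrative only).
samples = [
    {"question": "q1", "correct_answer": "a1"},
    {"question": "q2", "correct_answer": "a2"},
]
canned = {"q1": "a1", "q2": "wrong"}
accuracy = end_to_end_accuracy(
    samples,
    answer_fn=lambda q: canned[q],
    grade_fn=lambda q, expected, got: expected == got,
)
print(accuracy)
```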
Key Metrics Explained
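#### Precision
What it measures: Of all chunks retrieved, how many were actually relevant?

$$\text{Precision} = \frac{\text{True Positives}}{\text{Total Retrieved}}$$

- High precision = fewer irrelevant chunks
- Our system retrieves at least 3 chunks per query, so precision drops whenever fewer than 3 chunks are relevant
#### Recall
What it measures: Of all relevant chunks, how many were retrieved?

$$\text{Recall} = \frac{\text{True Positives}}{\text{Total Relevant}}$$

- Critical for ensuring Claude has all necessary information
- Low recall means missing important context
#### F1 Score
What it measures: The harmonic mean of precision and recall, a single number that balances the two.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$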
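As a worked example of the three formulas, suppose a query retrieves three chunks and two of them are relevant, with no relevant chunk missed. The helper name `retrieval_metrics` and the chunk labels are illustrative.

```python
def retrieval_metrics(retrieved, relevant):
    """Return (precision, recall, f1) for a single query."""
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

retrieved = ["chunk_a", "chunk_b", "chunk_c"]
relevant = ["chunk_a", "chunk_c"]
precision, recall, f1 = retrieval_metrics(retrieved, relevant)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # 0.67 1.00 0.80
```

Precision is 2/3 because one of the three retrieved chunks is noise, recall is 1.0 because both relevant chunks were found, and F1 lands between them at 0.8.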
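#### Mean Reciprocal Rank (MRR)
What it measures: How high did the first relevant chunk rank?
```python
def calculate_mrr(retrieved_chunks, relevant_chunks):
    """Reciprocal rank of the first relevant chunk for one query.

    Average this value over all queries to get MRR.
    """
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1 / (i + 1)
    return 0.0
```
- An MRR of 0.87 means the first relevant chunk usually ranks first: 1/0.87 ≈ 1.15 is roughly the harmonic mean of its rank across queries, not the arithmetic average position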
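To make the "mean" in MRR concrete, here is a small sketch that averages reciprocal ranks over several queries. The function names and toy chunk labels are assumptions for illustration.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for position, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """`runs` is a list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["a", "b", "c"], ["a"]),  # first hit at rank 1 -> 1.0
    (["x", "b", "c"], ["b"]),  # first hit at rank 2 -> 0.5
    (["x", "y", "z"], ["q"]),  # no hit -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(mrr)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Queries with no relevant chunk in the results contribute zero, so MRR penalizes complete misses as well as poor ordering.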
Running Evaluations
```python
def evaluate_retrieval(test_data, vector_db):
    results = []
    for sample in test_data:
        query_emb = vo.embed([sample["question"]], model="voyage-2").embeddings[0]
        retrieved = vector_db.search(query_emb, k=3)
        retrieved_texts = [doc for doc, _ in retrieved]

        # Calculate metrics (chunks are matched by exact text)
        relevant = sample["relevant_chunks"]
        true_positives = len(set(retrieved_texts) & set(relevant))
        precision = true_positives / len(retrieved_texts) if retrieved_texts else 0
        recall = true_positives / len(relevant) if relevant else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        results.append({
            "precision": precision,
            "recall": recall,
            "f1": f1
        })
    return results
```
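The evaluation returns one dictionary per query, while the figures usually reported are averages. A minimal sketch of that aggregation step follows. `summarize` and the sample values are illustrative, not results from the guide.

```python
from statistics import mean

def summarize(results):
    """Average each metric across a list of per-query result dicts."""
    if not results:
        return {}
    keys = results[0].keys()
    return {f"avg_{key}": mean(r[key] for r in results) for key in keys}

# Made-up per-query results, only to show the shape of the output.
results = [
    {"precision": 1.0, "recall": 0.5, "f1": 2 / 3},
    {"precision": 0.0, "recall": 0.0, "f1": 0.0},
]
summary = summarize(results)
print(summary)
```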
Level 2: Summary Indexing
Basic chunking misses relationships between sections. Summary indexing creates hierarchical representations:
```python
def create_summary_index(chunks: List[str]) -> Dict[str, List[str]]:
    """Create summaries for groups of related chunks.

    Returns a mapping from each summary to the chunks it covers.
    """
    summary_index = {}
    # Group every 3 consecutive chunks
    for i in range(0, len(chunks), 3):
        chunk_group = chunks[i:i+3]
        combined = "\n\n".join(chunk_group)

        # Use Claude to summarize
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this documentation section:\n\n{combined}"
            }]
        )
        summary_index[response.content[0].text] = chunk_group
    return summary_index
```
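The guide does not show how the summary index is queried. One plausible approach, sketched here as an assumption, is two-stage retrieval: match the query against the summaries, then return the chunks behind the best-matching summaries. `search_summary_index`, `embed_fn`, and the keyword-counting `fake_embed` are hypothetical. A real pipeline would use an embedding model and cache the summary embeddings rather than recomputing them per query.

```python
import numpy as np
from typing import Callable, Dict, List

def search_summary_index(
    query: str,
    summary_index: Dict[str, List[str]],
    embed_fn: Callable[[List[str]], List[List[float]]],
    top_groups: int = 1,
) -> List[str]:
    """Return the chunks belonging to the summaries most similar to the query."""
    summaries = list(summary_index.keys())
    if not summaries:
        return []
    vectors = np.asarray(embed_fn(summaries), dtype=float)
    query_vec = np.asarray(embed_fn([query])[0], dtype=float)
    # Cosine similarity between the query and every summary.
    sims = vectors @ query_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(sims)[::-1][:top_groups]
    chunks: List[str] = []
    for i in best:
        chunks.extend(summary_index[summaries[i]])
    return chunks

# Toy embedder based on keyword counts (stand-in for a real embedding model).
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[t.lower().count("rate") + 0.1, t.lower().count("auth") + 0.1] for t in texts]

index = {
    "Summary about rate limits": ["chunk r1", "chunk r2"],
    "Summary about authentication": ["chunk a1"],
}
hits = search_summary_index("How do rate limits work?", index, fake_embed)
print(hits)
```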
Level 3: Adding Re-Ranking with Claude
Re-ranking reorders the retrieved candidates so that the most relevant chunk lands first, which is exactly what MRR measures. After initial retrieval, use Claude to score relevance:
```python
def rerank_with_claude(query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance"""
    scored_chunks = []
    for chunk in candidates:
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this text to the question?\n\nQuestion: {query}\n\nText: {chunk}\n\nAnswer only with a number."
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            score = 0
        scored_chunks.append((chunk, score))

    # Sort by score descending
    scored_chunks.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored_chunks[:top_k]]
```
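The `float(...)` parse above falls back to a score of 0 whenever the reply contains anything besides a bare number, such as "Score: 8". A more forgiving parser might look like the following sketch. `parse_score` is a hypothetical helper, and clamping to the 0-10 range is a design assumption.

```python
import re

def parse_score(text: str, low: float = 0.0, high: float = 10.0) -> float:
    """Extract the first number from a model reply and clamp it to [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return low  # no number found: treat as least relevant
    return max(low, min(high, float(match.group())))

print(parse_score("8"))              # 8.0
print(parse_score("Score: 7.5/10"))  # 7.5
print(parse_score("not relevant"))   # 0.0
print(parse_score("15"))             # 10.0
```

Taking the first number is a heuristic. It works for replies like "7.5/10" but would misread a reply that mentions another number before the score.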
Results: The Impact of Optimization
After implementing summary indexing and re-ranking, here are the improvements over basic RAG:
| Metric | Basic RAG | Optimized | Relative Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |
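The improvement column is relative change, not an absolute difference: end-to-end accuracy rises by 10 percentage points, which is 14.1% relative to the 71% baseline. The arithmetic can be checked with a short script (the helper name `relative_improvement` is illustrative; the numbers are taken from the table):

```python
def relative_improvement(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100

rows = {
    "Avg Precision": (0.43, 0.44),
    "Avg Recall": (0.66, 0.69),
    "Avg F1 Score": (0.52, 0.54),
    "Avg MRR": (0.74, 0.87),
    "End-to-End Accuracy": (71, 81),
}
improvements = {name: round(relative_improvement(b, a), 1) for name, (b, a) in rows.items()}
for name, pct in improvements.items():
    print(f"{name}: +{pct}%")
```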
Production Considerations
- Rate Limits: Full evaluations can hit API limits. Use Tier 2+ accounts for extensive testing.
- Token Budget: Summary indexing and re-ranking consume additional tokens. Balance cost vs. quality.
- Vector Database: Move from in-memory to hosted solutions (Pinecone, Weaviate) for production.
- Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your content.
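On the rate-limit point, evaluation loops that make hundreds of API calls generally benefit from retrying with exponential backoff. The following is a generic sketch, not tied to any SDK. `call_with_backoff` and its parameters are hypothetical, and `retry_on=(Exception,)` is a deliberately broad placeholder that should be narrowed to the rate-limit exception raised by your client library.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# Simulated flaky call: fails twice, then succeeds (illustrative only).
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []
result = call_with_backoff(flaky, base_delay=0.01, sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In production the default `time.sleep` applies.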
Key Takeaways
- Evaluate retrieval and generation separately to identify where your RAG system needs improvement
- MRR was the retrieval metric that moved most in these results (0.74 to 0.87), alongside the gain in end-to-end accuracy from 71% to 81%, so watch it closely when tuning retrieval
- Summary indexing helps Claude understand document structure and relationships between sections
- Re-ranking with Claude improves the ordering of retrieved chunks without changing your embedding pipeline
- Start simple, measure everything, then optimize: basic RAG works, but systematic evaluation reveals where to invest your optimization efforts