Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide teaches you to build a production-ready RAG system with Claude, covering basic setup with Voyage AI embeddings, creating an evaluation suite with precision/recall/F1 metrics, and advanced optimization techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities into your specific business context. Claude excels at general knowledge tasks, but it has not seen your internal documents, customer support history, or proprietary research. RAG closes that gap by retrieving the relevant passages at query time and placing them in the prompt.
In this guide, we'll walk through building a complete RAG system using Claude and Voyage AI embeddings, then systematically improve it using evaluation-driven development. We'll cover three levels of sophistication:
- Basic RAG - Simple chunking, embedding, and retrieval
- Summary Indexing - Adding document summaries for better context
- Re-ranking - Using Claude to improve result ordering
Prerequisites and Setup
Before diving in, you'll need:
- An Anthropic API key for Claude
- A Voyage AI API key for embeddings
- Python 3.8+ with basic data science libraries
Installing Dependencies
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
Initializing Your Vector Database
For this guide, we'll use an in-memory vector store. In production, consider managed solutions like Pinecone, Weaviate, or pgvector.
import voyageai
from anthropic import Anthropic
import numpy as np
from typing import List, Dict, Any


class InMemoryVectorDB:
    def __init__(self, voyage_client):
        self.documents = []
        self.embeddings = []
        self.voyage = voyage_client

    def add_documents(self, texts: List[str]):
        """Add documents and their embeddings to the store."""
        response = self.voyage.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(texts)

    def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        """Retrieve top-k documents by cosine similarity."""
        query_embedding = self.voyage.embed([query], model="voyage-2").embeddings[0]
        # Compute cosine similarities
        similarities = [
            np.dot(query_embedding, doc_emb) /
            (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in self.embeddings
        ]
        # Get top-k indices, highest similarity first
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [
            {"text": self.documents[i], "score": similarities[i]}
            for i in top_indices
        ]


# Initialize clients
voyage_client = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
anthropic_client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
db = InMemoryVectorDB(voyage_client)
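The per-document similarity loop in `search` can also be written as a single matrix operation. The following is a minimal sketch on hand-made 3-dimensional vectors rather than real Voyage embeddings; `top_k_cosine` is a hypothetical helper name and the vectors are assumed values chosen only for illustration.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Normalize once so that a plain dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy "embeddings" (assumed values, not real model output).
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
results = top_k_cosine([1.0, 0.1, 0.0], docs, k=2)
print(results)
```

With a few thousand chunks either form is fast enough; the vectorized version mainly matters as the store grows.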
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG" - a straightforward three-step process:
- Chunk documents by heading or section
- Embed each chunk using Voyage AI
- Retrieve relevant chunks via cosine similarity
Implementing the Basic Pipeline
def chunk_document(text: str, heading_pattern: str = "## ") -> List[str]:
    """Split document by headings for semantic chunks."""
    chunks = []
    current_chunk = []
    for line in text.split("\n"):
        # A new heading closes the current chunk and starts the next one
        if line.startswith(heading_pattern) and current_chunk:
            chunks.append("\n".join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append("\n".join(current_chunk))
    return chunks


def basic_rag(query: str, db: InMemoryVectorDB, k: int = 3) -> str:
    """Basic RAG: retrieve chunks and generate answer."""
    # Step 1: Retrieve relevant chunks
    results = db.search(query, k=k)
    context = "\n\n".join([r["text"] for r in results])
    # Step 2: Generate answer with Claude
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
    response = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need quantitative metrics to measure and improve your RAG system. Let's build a proper evaluation suite.
Creating a Test Dataset
Generate a synthetic evaluation dataset with 100+ samples. Each sample should include:
- A question
- The correct answer
- The relevant document chunks
import json


def create_evaluation_sample(question: str, answer: str, relevant_chunks: List[str]) -> Dict:
    return {
        "question": question,
        "expected_answer": answer,
        "relevant_chunks": relevant_chunks
    }


# Load or generate your dataset
with open("evaluation_dataset.json") as f:
    evaluation_data = json.load(f)
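Before running an evaluation, it can help to confirm that every sample carries the three fields listed above. A minimal sketch follows; `validate_samples` is a hypothetical helper, and the sample questions are invented purely for illustration.

```python
import json
from typing import Dict, List

REQUIRED_FIELDS = ("question", "expected_answer", "relevant_chunks")


def validate_samples(samples: List[Dict]) -> List[str]:
    """Return readable problem descriptions; an empty list means the data looks usable."""
    problems = []
    for idx, sample in enumerate(samples):
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            problems.append(f"sample {idx}: missing {missing}")
            continue
        chunks = sample["relevant_chunks"]
        if not isinstance(chunks, list) or not chunks:
            problems.append(f"sample {idx}: relevant_chunks must be a non-empty list")
    return problems


# Illustrative data only; real samples come from your own documents.
raw = json.dumps([
    {"question": "What is the refund window?",
     "expected_answer": "30 days",
     "relevant_chunks": ["## Refunds\nRefunds are accepted within 30 days."]},
    {"question": "Who approves expenses?", "expected_answer": "The team lead"},
])
problems = validate_samples(json.loads(raw))
print(problems)
```

A sample without `relevant_chunks` would otherwise score a recall of 0 for reasons unrelated to retrieval quality, so catching it early keeps the metrics meaningful.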
Key Metrics Explained
#### Retrieval Metrics
Precision measures how many retrieved chunks are actually relevant:
$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Retrieved}|}$$
High precision means fewer false positives - you're not wasting Claude's context window on irrelevant information.
Recall measures how many relevant chunks you successfully retrieved:
$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Relevant}|}$$
High recall ensures Claude has all the information it needs.
F1 Score is the harmonic mean of precision and recall:
$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Mean Reciprocal Rank (MRR) measures how early the first relevant result appears:
$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
Here $N$ is the number of evaluation queries and $\text{rank}_i$ is the position of the first relevant chunk in the results for query $i$. A query with no relevant chunk in its results contributes 0 to the sum.
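To make the retrieval formulas concrete, here is a small worked example for a single query. It is a sketch with made-up chunk labels (`A` through `D`); `retrieval_metrics` is a hypothetical helper name.

```python
def retrieval_metrics(retrieved, relevant):
    """Compute precision, recall, F1, and reciprocal rank for one query."""
    relevant = set(relevant)
    hits = set(retrieved) & relevant
    precision = len(hits) / len(set(retrieved)) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    # Reciprocal rank: 1 / position of the first relevant result (1-indexed).
    rr = 0.0
    for position, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            rr = 1 / position
            break
    return precision, recall, f1, rr


# Three chunks retrieved in ranked order; two chunks are truly relevant.
precision, recall, f1, rr = retrieval_metrics(["A", "B", "C"], {"B", "D"})
print(precision, recall, f1, rr)
```

One of three retrieved chunks is relevant (precision 1/3), one of two relevant chunks was found (recall 1/2), F1 works out to 0.4, and the reciprocal rank is 1/2 because the first relevant chunk sits in second position. MRR is the average of that last value over all queries.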
#### End-to-End Metric
Accuracy measures whether Claude's final answer is correct given the retrieved context.
Implementing the Evaluation
def evaluate_retrieval(db: InMemoryVectorDB, eval_data: List[Dict], k: int = 3):
    """Evaluate retrieval performance."""
    precisions, recalls, f1s, mrrs = [], [], [], []
    for sample in eval_data:
        query = sample["question"]
        relevant = set(sample["relevant_chunks"])
        # Retrieve chunks
        results = db.search(query, k=k)
        retrieved = {r["text"] for r in results}
        # Calculate metrics
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(relevant) if relevant else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        # MRR: find first relevant result (stays 0 if none is retrieved)
        mrr = 0
        for i, r in enumerate(results):
            if r["text"] in relevant:
                mrr = 1 / (i + 1)
                break
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
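End-to-end accuracy, described under the metrics above, can be scored with a similar loop. The sketch below is one possible shape, not the guide's own grader: `evaluate_end_to_end`, `answer_fn`, and `grade_fn` are hypothetical names, and the stand-in functions exist only so the example runs without API calls. In practice `answer_fn` might wrap a RAG pipeline and `grade_fn` might ask Claude whether the generated answer matches the expected one.

```python
from typing import Callable, Dict, List


def evaluate_end_to_end(
    eval_data: List[Dict],
    answer_fn: Callable[[str], str],
    grade_fn: Callable[[str, str, str], bool],
) -> float:
    """Fraction of questions whose generated answer is graded correct."""
    if not eval_data:
        return 0.0
    correct = 0
    for sample in eval_data:
        generated = answer_fn(sample["question"])
        if grade_fn(sample["question"], sample["expected_answer"], generated):
            correct += 1
    return correct / len(eval_data)


# Offline stand-ins (assumptions for illustration, not real graders).
def fake_answer(question: str) -> str:
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "Unknown"


def substring_grader(question: str, expected: str, generated: str) -> bool:
    return expected.lower() in generated.lower()


data = [
    {"question": "What is the refund window?", "expected_answer": "30 days"},
    {"question": "Who approves expenses?", "expected_answer": "The team lead"},
]
accuracy = evaluate_end_to_end(data, fake_answer, substring_grader)
print(accuracy)
```

Keeping the grader pluggable makes it easy to start with a cheap string check and switch to a model-based judge once the pipeline is stable.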
Level 2: Summary Indexing
Basic chunking loses the forest for the trees. Summary indexing adds a high-level overview of each document section, improving retrieval for questions that require synthesis.
How Summary Indexing Works
- For each document chunk, generate a summary using Claude
- Store both the original chunk and its summary
- When searching, match the query against the stored summary-plus-chunk text, so a hit on the summary also returns the full chunk
def generate_summary(chunk: str, anthropic_client: Anthropic) -> str:
    """Generate a concise summary of a document chunk."""
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize this in 1-2 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text


def build_summary_index(chunks: List[str], anthropic_client: Anthropic, voyage_client) -> InMemoryVectorDB:
    """Build a vector index with summaries."""
    summary_db = InMemoryVectorDB(voyage_client)
    for chunk in chunks:
        summary = generate_summary(chunk, anthropic_client)
        # Store summary + chunk together so both are embedded and returned
        summary_db.add_documents([f"Summary: {summary}\n\nFull: {chunk}"])
    return summary_db
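The function above embeds each summary together with its chunk. An alternative worth testing is to embed only the summaries and keep a pointer back to the full chunk, so queries are matched against summaries and the untouched chunk is what gets returned. The sketch below illustrates that variant under stated assumptions: `SummaryIndex` and `toy_embed` are hypothetical names, and the keyword-count embedding is a stand-in so the example runs offline (a real system would call the embedding API instead).

```python
import math
from typing import Callable, List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SummaryIndex:
    """Embeds summaries only; returns the full chunk each summary points to."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]]):
        self.embed_fn = embed_fn
        self.summary_vectors: List[Sequence[float]] = []
        self.chunks: List[str] = []

    def add(self, summary: str, chunk: str) -> None:
        self.summary_vectors.append(self.embed_fn(summary))
        self.chunks.append(chunk)

    def search(self, query: str, k: int = 3) -> List[str]:
        q = self.embed_fn(query)
        scored = [(_cosine(q, v), i) for i, v in enumerate(self.summary_vectors)]
        scored.sort(reverse=True)  # highest similarity first
        return [self.chunks[i] for _, i in scored[:k]]


# Toy embedding: counts of a few keywords (an assumption for offline use).
VOCAB = ("refund", "shipping", "password")


def toy_embed(text: str) -> List[float]:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]


index = SummaryIndex(toy_embed)
index.add("Explains the refund policy.", "## Refunds\nReturn items within 30 days.")
index.add("Covers shipping times.", "## Shipping\nOrders ship in 2 business days.")
top = index.search("How do I get a refund?", k=1)
print(top)
```

A side benefit of this variant is that returned texts are the original chunks, so they can be compared directly against the `relevant_chunks` in an evaluation set.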
Level 3: Re-ranking with Claude
Even with good embeddings, the top-k results aren't always optimally ordered. Re-ranking uses Claude to evaluate and reorder retrieved chunks based on relevance to the specific question.
Implementing Re-ranking
import re


def rerank_chunks(query: str, chunks: List[str], anthropic_client: Anthropic, top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare chunks for evaluation
    chunk_text = "\n\n---\n\n".join([
        f"Chunk {i+1}: {chunk}" for i, chunk in enumerate(chunks)
    ])
    prompt = f"""Given the question below, rank these chunks by relevance (most relevant first).

Question: {query}

{chunk_text}

Return the chunk numbers in order of relevance, like: 3, 1, 2"""
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ranking defensively: keep only valid, unique chunk numbers
    ranking = []
    for match in re.findall(r"\d+", response.content[0].text):
        idx = int(match) - 1
        if 0 <= idx < len(chunks) and idx not in ranking:
            ranking.append(idx)
    # Fall back to the original retrieval order for anything Claude omitted
    ranking += [i for i in range(len(chunks)) if i not in ranking]
    return [chunks[i] for i in ranking[:top_k]]


def advanced_rag(query: str, db: InMemoryVectorDB, anthropic_client: Anthropic, k: int = 5) -> str:
    """Advanced RAG with re-ranking."""
    # Retrieve more chunks than needed
    results = db.search(query, k=k)
    chunks = [r["text"] for r in results]
    # Re-rank with Claude
    top_chunks = rerank_chunks(query, chunks, anthropic_client, top_k=3)
    context = "\n\n".join(top_chunks)
    # Generate final answer
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
    response = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Results and Performance Gains
After implementing these optimizations, the largest gains were in MRR and end-to-end accuracy, while precision, recall, and F1 moved only slightly:
| Metric | Basic RAG | Advanced RAG | Relative improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2% |
| Avg Recall | 0.66 | 0.69 | +5% |
| Avg F1 Score | 0.52 | 0.54 | +4% |
| Avg MRR | 0.74 | 0.87 | +18% |
| End-to-End Accuracy | 71% | 81% | +14% (10 points) |
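The last column of the table appears to report relative change rather than percentage points; that reading is an inference from the numbers, which the short check below reproduces. The values are copied from the table, and `relative_gain_pct` is a hypothetical helper name.

```python
# Values copied from the results table: (basic, advanced).
results = {
    "Avg Precision": (0.43, 0.44),
    "Avg Recall": (0.66, 0.69),
    "Avg F1 Score": (0.52, 0.54),
    "Avg MRR": (0.74, 0.87),
    "End-to-End Accuracy": (71, 81),
}


def relative_gain_pct(before, after):
    """Relative change as a percentage of the starting value."""
    return (after - before) / before * 100


gains = {name: round(relative_gain_pct(b, a)) for name, (b, a) in results.items()}
print(gains)
```

Note that the accuracy gain is 10 percentage points in absolute terms, which corresponds to roughly 14% relative to the 71% baseline.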
Production Considerations
- Rate Limits: Full evaluations can hit API rate limits. Consider using Tier 2+ accounts or running evaluations incrementally.
- Cost Management: Summary indexing and re-ranking add token costs. Balance improvement against budget.
- Vector Database: For production, use managed solutions like Pinecone, Weaviate, or pgvector instead of in-memory stores.
- Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your document structure.
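For the chunk-size experiments mentioned above, a size-bounded splitter is a convenient starting point. The following is a rough sketch, assuming whitespace-separated words as a stand-in for tokens (a real tokenizer would give different counts); `split_by_token_budget` and its `overlap` parameter are hypothetical and not part of the pipeline shown earlier.

```python
from typing import List


def split_by_token_budget(text: str, max_tokens: int = 256, overlap: int = 32) -> List[str]:
    """Split text into windows of at most max_tokens words, sharing `overlap` words."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = split_by_token_budget(sample, max_tokens=4, overlap=1)
print(chunks)
```

Running the same evaluation suite over indexes built with different `max_tokens` values gives a direct comparison of how chunk size affects retrieval metrics for your documents.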
Key Takeaways
- Evaluate systematically: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy to identify bottlenecks in your RAG pipeline.
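- Summary indexing improves context: Adding document summaries helps Claude understand the big picture before diving into details. In this evaluation, average recall rose from 0.66 to 0.69 (about 5% relative) between the basic and advanced pipelines.
- Re-ranking with Claude boosts relevance: Using Claude to reorder retrieved chunks puts the most relevant information first. In this evaluation, MRR rose from 0.74 to 0.87 (about 18% relative) between the basic and advanced pipelines.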
- Start simple, then optimize: Begin with basic RAG, establish your evaluation baseline, then incrementally add sophistication.
- Monitor costs vs. benefits: Advanced techniques like summary generation and re-ranking add token costs. Measure whether the accuracy gains justify the expense for your use case.