Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics, with concrete code examples.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.
In this guide, we'll walk through building a RAG system using Claude and Voyage AI embeddings, using the Claude documentation as our knowledge base. We'll start with a basic implementation, then show you how to measure performance properly, and finally introduce advanced techniques that can significantly boost your results.
What You'll Learn
- How to set up a basic RAG pipeline with Claude
- How to build a robust evaluation suite for retrieval and end-to-end performance
- How to implement summary indexing for better context capture
- How to use re-ranking to improve answer quality
Prerequisites
You'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, and `scikit-learn` installed
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG" — a straightforward three-step process:
- Chunk your documents by headings
- Embed each chunk using Voyage AI
- Retrieve the most relevant chunks using cosine similarity
```python
import voyageai
from anthropic import Anthropic
import numpy as np

# Initialize clients
vo = voyageai.Client(api_key="your-voyage-api-key")
anthropic = Anthropic(api_key="your-anthropic-api-key")


class BasicRAG:
    def __init__(self, documents):
        self.documents = documents
        self.embeddings = self._embed_documents(documents)

    def _embed_documents(self, docs):
        result = vo.embed(docs, model="voyage-2")
        return np.array(result.embeddings)

    def retrieve(self, query, k=3):
        query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
        # Voyage embeddings are normalized, so the dot product acts as cosine similarity
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]

    def answer(self, query):
        chunks = self.retrieve(query)
        context = "\n\n".join(chunks)
        response = anthropic.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Answer the question based on this context:\n\n{context}\n\nQuestion: {query}"
            }]
        )
        return response.content[0].text
```
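A minimal usage sketch, assuming your documentation chunks already live in per-heading text files (the directory name and the sample question are placeholders):

```python
from pathlib import Path

# Load pre-chunked documentation files (hypothetical layout: one chunk per .txt file)
chunks = [p.read_text() for p in sorted(Path("docs_chunks").glob("*.txt"))]

rag = BasicRAG(chunks)
print(rag.answer("How do I set up streaming with Claude?"))
```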
This works, but how do you know if it's working well? That's where evaluation comes in.
Building an Evaluation System
Don't rely on "vibes" — measure your RAG system properly. The key insight is to evaluate retrieval and end-to-end performance separately.
Creating an Evaluation Dataset
Generate a synthetic evaluation set with 100+ samples. Each sample should include:
- A question
- The correct chunks (ground truth for retrieval)
- A correct answer (ground truth for end-to-end)
```python
# Example evaluation sample structure
eval_sample = {
    "question": "How do I set up streaming with Claude?",
    "relevant_chunks": ["chunk_42.txt", "chunk_43.txt"],
    "correct_answer": "To set up streaming, use the stream=True parameter..."
}
```
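One common way to bootstrap such a dataset is to have Claude write a question and answer for each chunk and record that chunk as the retrieval ground truth. A rough sketch of that approach, not the guide's exact method (the prompt and chunk-ID convention are assumptions):

```python
import json

def generate_eval_sample(chunk_id, chunk_text):
    """Ask Claude for a question/answer pair grounded in a single chunk."""
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Write one question that can be answered from the text below, plus its answer. "
                "Respond with only JSON containing the keys 'question' and 'answer'.\n\n"
                f"{chunk_text}"
            )
        }]
    )
    # In practice, validate the output, since models occasionally return malformed JSON
    qa = json.loads(response.content[0].text)
    return {
        "question": qa["question"],
        "relevant_chunks": [chunk_id],
        "correct_answer": qa["answer"],
    }
```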
Key Metrics
#### Retrieval Metrics
Precision — Of the chunks retrieved, how many are relevant?
Precision = True Positives / Total Retrieved
High precision means fewer irrelevant chunks cluttering the context.
Recall — Of all relevant chunks, how many did we retrieve?
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
F1 Score — Harmonic mean of precision and recall.
F1 = (2 × Precision × Recall) / (Precision + Recall)
Mean Reciprocal Rank (MRR) — How high is the first relevant chunk in the results? Crucial for question-answering where one good chunk might be enough.
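To make these numbers concrete, here is a toy example with invented chunk IDs (not from the guide's evaluation set):

```python
retrieved = ["chunk_3", "chunk_1", "chunk_7"]       # what the retriever returned, in rank order
relevant = {"chunk_1", "chunk_2"}                   # ground-truth chunks for the question

tp = len(set(retrieved) & relevant)                 # 1 true positive
precision = tp / len(retrieved)                     # 1/3 ≈ 0.33
recall = tp / len(relevant)                         # 1/2 = 0.50
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.40
mrr = next((1.0 / rank for rank, c in enumerate(retrieved, 1) if c in relevant), 0.0)  # 0.50
```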
#### End-to-End Metric
Accuracy — Does Claude produce the correct answer? This is the ultimate test of your system.
Implementing the Evaluation
```python
def evaluate_retrieval(rag_system, eval_dataset):
    precisions, recalls, mrrs = [], [], []

    for sample in eval_dataset:
        # Retrieved items must use the same identifiers as the ground-truth chunks
        retrieved = rag_system.retrieve(sample["question"])
        relevant = sample["relevant_chunks"]

        # Calculate precision and recall
        true_positives = len(set(retrieved) & set(relevant))
        precision = true_positives / len(retrieved)
        recall = true_positives / len(relevant)

        # MRR: reciprocal rank of the first relevant result (0 if none was retrieved)
        mrr = 0.0
        for rank, chunk in enumerate(retrieved, 1):
            if chunk in relevant:
                mrr = 1.0 / rank
                break

        precisions.append(precision)
        recalls.append(recall)
        mrrs.append(mrr)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean([2 * p * r / (p + r) if p + r > 0 else 0
                           for p, r in zip(precisions, recalls)]),
        "avg_mrr": np.mean(mrrs),
    }
```
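The function above only measures retrieval. For end-to-end accuracy you also need to judge whether Claude's generated answer matches the reference answer; exact string comparison is too brittle, so one common approach is to let a model grade the answers. A minimal sketch of that idea, using an assumed grading prompt rather than the guide's exact implementation:

```python
def evaluate_end_to_end(rag_system, eval_dataset):
    """Grade each generated answer against the reference answer with Claude."""
    correct = 0
    for sample in eval_dataset:
        generated = rag_system.answer(sample["question"])
        response = anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {sample['question']}\n"
                    f"Reference answer: {sample['correct_answer']}\n"
                    f"Candidate answer: {generated}\n\n"
                    "Does the candidate answer convey the same information as the "
                    "reference answer? Reply with exactly one word: CORRECT or INCORRECT."
                )
            }]
        )
        verdict = response.content[0].text.strip().upper()
        if verdict.startswith("CORRECT"):
            correct += 1
    return correct / len(eval_dataset)
```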
Level 2: Summary Indexing
Basic chunking by headings misses the bigger picture. Summary indexing creates an additional index where each chunk is paired with a summary of its broader context.
```python
def create_summary_index(documents, chunk_size=3):
    """Create summaries for groups of chunks."""
    summary_index = []

    for i in range(0, len(documents), chunk_size):
        group = documents[i:i + chunk_size]
        combined = "\n".join(group)

        response = anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this content in 2-3 sentences:\n\n{combined}"
            }]
        )
        summary = response.content[0].text

        # Store both summary and original chunks
        summary_index.append({
            "summary": summary,
            "chunks": group,
            "embedding": vo.embed([summary], model="voyage-2").embeddings[0]
        })

    return summary_index
```
This improves recall by helping the retriever find relevant content even when the query doesn't match exact keywords in the chunk.
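The index becomes useful at query time: embed the question, score it against the summary embeddings, and return the original chunks from the best-matching groups. A minimal retrieval helper along those lines (this function is an illustration, not code from the original guide):

```python
def retrieve_via_summaries(query, summary_index, k=2):
    """Match the query against group summaries, then expand to the original chunks."""
    query_embedding = np.array(vo.embed([query], model="voyage-2").embeddings[0])
    scored = sorted(
        summary_index,
        key=lambda entry: float(np.dot(np.array(entry["embedding"]), query_embedding)),
        reverse=True,
    )
    chunks = []
    for entry in scored[:k]:
        chunks.extend(entry["chunks"])
    return chunks
```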
Level 3: Summary Indexing + Re-Ranking
The most advanced technique combines summary indexing with re-ranking. After retrieving candidates, use Claude to re-rank them by relevance to the query.
```python
import re

def rerank_with_claude(query, candidates, top_k=3):
    """Use Claude to re-rank retrieved chunks."""
    prompt = f"""Given the question: "{query}"
Rate each chunk from 1-10 for relevance (10 = most relevant).
Return only the chunk indices sorted by relevance, highest first.
Chunks:
"""
    for i, chunk in enumerate(candidates):
        prompt += f"\n[{i}]: {chunk[:200]}..."

    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the re-ranked indices out of the response and keep the top_k chunks
    # (In production, use structured output instead of free-text parsing)
    ranked_indices = list(dict.fromkeys(
        int(m) for m in re.findall(r"\d+", response.content[0].text)
        if int(m) < len(candidates)
    ))
    return [candidates[i] for i in ranked_indices[:top_k]]
```
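Putting the pieces together, the Level 3 flow is: retrieve a generous candidate set through the summary index, re-rank it, and pass only the top chunks to Claude. A sketch of how the pieces above might be wired up (the over-retrieval factor and prompt wording are assumptions):

```python
def answer_advanced(query, summary_index, top_k=3):
    # Over-retrieve via the summary index, then let the re-ranker narrow it down
    candidates = retrieve_via_summaries(query, summary_index, k=top_k * 2)
    best_chunks = rerank_with_claude(query, candidates, top_k=top_k)
    context = "\n\n".join(best_chunks)

    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer the question based on this context:\n\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```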
Results: What You Can Expect
With these improvements, we saw the following results on the Claude documentation evaluation set:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Vector Database: For production, replace the in-memory store with Pinecone, Weaviate, or pgvector (see the sketch after this list)
- Rate Limits: Full evaluations can hit rate limits — consider Tier 2+ or sample-based testing
- Cost: Summary indexing and re-ranking add token usage; optimize by caching summaries
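On the vector-database point, here is a rough idea of what swapping the in-memory store for pgvector could look like. The connection string, table schema, and 1024-dimension column (assumed to match voyage-2's embedding size) are illustrative assumptions:

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Assumes a table like: CREATE TABLE chunks (id serial, content text, embedding vector(1024))
conn = psycopg2.connect("postgresql://localhost/rag_db")  # hypothetical connection string
register_vector(conn)  # lets psycopg2 pass numpy arrays as pgvector values

def retrieve_pgvector(query, k=3):
    """Nearest-neighbour search using pgvector's cosine-distance operator (<=>)."""
    query_embedding = np.array(vo.embed([query], model="voyage-2").embeddings[0])
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (query_embedding, k),
        )
        return [row[0] for row in cur.fetchall()]
```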
Key Takeaways
- Evaluate retrieval and end-to-end performance separately — This lets you pinpoint whether issues are in finding information or in answering with it.
- Summary indexing improves recall by capturing the broader context around individual chunks, helping Claude find relevant information even with imperfect queries.
- Re-ranking with Claude significantly boosts MRR — Getting the most relevant chunk to the top of the context window improves answer quality dramatically.
- Start simple, measure, then optimize — A basic RAG pipeline can be surprisingly effective. Use data to decide where to invest in improvements.
- Your evaluation dataset is your most important asset — Invest time in creating high-quality, representative test samples that reflect real user queries.