# Building Production-Ready RAG Systems with Claude: From Basic to Advanced
A practical guide to building RAG systems with Claude, covering basic setup with Voyage AI embeddings, building an evaluation suite with precision/recall/F1 metrics, and advanced optimization techniques like summary indexing and re-ranking to boost end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. While Claude excels at general knowledge tasks, it needs RAG to answer questions specific to your business context—whether that's internal documentation, customer support histories, or proprietary research.
In this guide, we'll walk through building a complete RAG system using Claude and Voyage AI embeddings, then systematically improve it using advanced techniques. By the end, you'll understand not just how to build RAG, but how to measure and optimize it for production.
## Why RAG Matters for Claude Users
Enterprises are increasingly building RAG applications to:
- Power customer support with product documentation
- Enable Q&A over internal company documents
- Accelerate financial and legal analysis
- Create knowledge assistants for specialized domains
## Level 1: Building a Basic RAG Pipeline
Let's start with what's often called "Naive RAG"—a straightforward three-step process:
- Chunk your documents into manageable pieces
- Embed each chunk into a vector representation
- Retrieve the most relevant chunks for a given query
### Setting Up Your Environment
First, install the required libraries:
```bash
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
You'll need API keys from both Anthropic and Voyage AI.
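One way to keep the keys out of source code is to read them from environment variables and fail early when one is missing. The sketch below assumes the names `ANTHROPIC_API_KEY` and `VOYAGE_API_KEY`, which are the conventional names for the two SDKs; `load_api_keys` is a hypothetical helper, not part of either library.

```python
import os


def load_api_keys(env=os.environ):
    """Return (anthropic_key, voyage_key), raising early if either is unset."""
    # Assumed variable names; adjust them if your deployment uses different ones.
    names = ("ANTHROPIC_API_KEY", "VOYAGE_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[names[0]], env[names[1]]
```

The returned pair can be passed straight to the `anthropic_key` and `voyage_key` arguments used later in this guide.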
### Creating a Vector Database Class
For this example, we'll use an in-memory vector database. In production, you'd want a hosted solution like Pinecone or Weaviate.
```python
import numpy as np
from typing import List, Dict, Any


class InMemoryVectorDB:
    """Minimal in-memory vector store with brute-force cosine-similarity search."""

    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add_document(self, vector: List[float], metadata: Dict[str, Any]):
        self.vectors.append(np.array(vector))
        self.metadata.append(metadata)

    def search(self, query_vector: List[float], top_k: int = 3) -> List[Dict[str, Any]]:
        query_vec = np.array(query_vector)
        # Cosine similarity between the query and every stored vector
        similarities = [
            np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
            for doc_vec in self.vectors
        ]
        # Indices of the top_k highest scores, best match first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [
            {"metadata": self.metadata[i], "score": similarities[i]}
            for i in top_indices
        ]
```
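To see what the cosine-similarity search is doing, here is a dependency-free sketch of the same ranking on two-dimensional toy vectors. The function names and the sample vectors are illustrative assumptions, and the zero-vector guard is an addition that the class above does not have.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat zero vectors as unrelated instead of dividing by zero
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(cosine_similarity(query, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy "embeddings": the query points almost along the first axis
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], docs, k=2)
# Document 0 ranks first, then document 2 (the diagonal vector)
```

Because cosine similarity ignores vector length, only direction matters here: the diagonal vector outranks the second axis even though both have a non-zero overlap with the query.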
### Implementing the Basic RAG Pipeline
```python
import voyageai
from anthropic import Anthropic
from typing import Dict, List


class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.anthropic = Anthropic(api_key=anthropic_key)
        self.voyage = voyageai.Client(api_key=voyage_key)
        self.vector_db = InMemoryVectorDB()

    def index_documents(self, documents: List[Dict[str, str]]):
        """Chunk and index documents by heading."""
        for doc in documents:
            # Simple chunking: split by headings; the helper is expected to
            # return a list of dicts with a "text" key
            chunks = self._chunk_by_heading(doc["content"])
            for chunk in chunks:
                embedding = self.voyage.embed(
                    [chunk["text"]],
                    model="voyage-2"
                ).embeddings[0]
                self.vector_db.add_document(
                    embedding,
                    {"source": doc["title"], "text": chunk["text"]}
                )

    def query(self, question: str, top_k: int = 3) -> str:
        # Embed the question
        query_embedding = self.voyage.embed(
            [question],
            model="voyage-2"
        ).embeddings[0]

        # Retrieve relevant chunks
        results = self.vector_db.search(query_embedding, top_k=top_k)

        # Build context from retrieved chunks
        context = "\n\n".join([
            r["metadata"]["text"] for r in results
        ])

        # Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text
```
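`index_documents` calls a `_chunk_by_heading` helper and reads `chunk["text"]` from each result. A minimal sketch of such a helper follows, assuming the documents are Markdown with ATX headings (`#` through `######`). The standalone function name, the regex, and the extra `"heading"` key are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumption: a heading is a line starting with 1-6 '#' characters and a space
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(content: str) -> List[Dict[str, str]]:
    """Split Markdown text into one chunk per heading, keeping the heading line."""
    chunks: List[Dict[str, str]] = []
    heading, lines = "", []

    def flush():
        # Store the chunk collected so far, skipping empty ones
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in content.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(1).strip(), [line]
        else:
            lines.append(line)
    flush()
    return chunks


sample = "Intro text\n# Rate Limits\nLimits apply per key.\n## Retries\nBack off and retry."
sections = chunk_by_heading(sample)
# Three chunks: the untitled intro, "Rate Limits", and "Retries"
```

This version treats any line that starts with `#` as a heading, so a `# comment` inside a fenced code sample would also start a new chunk. A production chunker would need to track fences.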
## Building a Robust Evaluation System
This is where most RAG tutorials stop—but it's where the real work begins. To build production-quality RAG, you need to measure two things independently:
- Retrieval performance: How well does your system find relevant chunks?
- End-to-end performance: How well does Claude answer questions given those chunks?
### Creating a Synthetic Evaluation Dataset
We'll generate 100 evaluation samples, each containing:
- A question
- The correct chunks (ground truth)
- A correct answer
```json
{
  "question": "How do I use Claude's system prompt to control output format?",
  "relevant_chunks": [
    "System prompts allow you to set the behavior and output format...",
    "You can specify JSON output by including 'Respond in JSON'..."
  ],
  "correct_answer": "To control output format, use a system prompt that specifies..."
}
```
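Before running an evaluation it can help to check that every sample has this shape, since a sample with no ground-truth chunks would quietly skew recall. The loader below is a sketch that assumes the dataset is stored as a single JSON array; `load_eval_dataset` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = {"question", "relevant_chunks", "correct_answer"}


def load_eval_dataset(raw_json: str):
    """Parse a JSON array of samples and verify each has the expected fields."""
    samples = json.loads(raw_json)
    for n, sample in enumerate(samples):
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            raise ValueError(f"sample {n} is missing {sorted(missing)}")
        if not sample["relevant_chunks"]:
            raise ValueError(f"sample {n} has no ground-truth chunks")
    return samples
```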
### Key Retrieval Metrics
#### Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means your system isn't returning irrelevant chunks. Low precision means Claude has to sort through noise.
#### Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs. Low recall means you're missing important context.
#### F1 Score
The harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
#### Mean Reciprocal Rank (MRR)
MRR measures how high the first relevant chunk appears in your results: each query scores 1 / rank of its first relevant chunk (0 if none is retrieved), and MRR is the average over all queries. This matters because chunks that appear early in the context tend to get more of Claude's attention.
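A small worked example may make the four definitions concrete. The chunk names are made up: three chunks are retrieved, two chunks are relevant, and only one of the retrieved chunks (in second position) is relevant.

```python
retrieved = ["chunk_a", "chunk_b", "chunk_c"]  # ranked, best match first
relevant = {"chunk_b", "chunk_d"}              # ground truth for this query

true_positives = sum(1 for chunk in retrieved if chunk in relevant)  # 1

precision = true_positives / len(retrieved)  # 1/3: one of three retrieved is relevant
recall = true_positives / len(relevant)      # 1/2: one of two relevant was found
f1 = 2 * precision * recall / (precision + recall)  # 0.4, up to float rounding

# Reciprocal rank: 1 / position of the first relevant chunk, or 0 if none appears
reciprocal_rank = next(
    (1 / rank for rank, chunk in enumerate(retrieved, start=1) if chunk in relevant),
    0.0,
)  # 1/2, because chunk_b sits in position 2
```

MRR is the mean of `reciprocal_rank` over every query in the evaluation set.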
### Implementing the Evaluation
```python
import numpy as np


def evaluate_retrieval(rag_system, eval_dataset):
    """Average precision, recall, F1, and MRR of top-3 retrieval over the dataset."""
    metrics = {"precision": [], "recall": [], "f1": [], "mrr": []}
    for sample in eval_dataset:
        query_embedding = rag_system.voyage.embed(
            [sample["question"]],
            model="voyage-2"
        ).embeddings[0]
        results = rag_system.vector_db.search(query_embedding, top_k=3)
        retrieved_texts = [r["metadata"]["text"] for r in results]

        # Calculate metrics (a chunk counts as relevant only on an exact text match)
        tp = len(set(retrieved_texts) & set(sample["relevant_chunks"]))
        precision = tp / len(retrieved_texts) if retrieved_texts else 0
        recall = tp / len(sample["relevant_chunks"]) if sample["relevant_chunks"] else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of first relevant chunk
        mrr = 0
        for i, text in enumerate(retrieved_texts):
            if text in sample["relevant_chunks"]:
                mrr = 1 / (i + 1)
                break

        metrics["precision"].append(precision)
        metrics["recall"].append(recall)
        metrics["f1"].append(f1)
        metrics["mrr"].append(mrr)
    return {k: np.mean(v) for k, v in metrics.items()}
```
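`evaluate_retrieval` covers the retrieval half only. For the end-to-end half, one possible harness is sketched below. It is an assumption about how such a score could be computed, not a description of how the numbers in this guide were produced. `answer_fn` would be something like `rag_system.query`, and `judge_fn` is a hypothetical callable (for example a Claude-based grader or a human label lookup) that decides whether a generated answer matches the reference.

```python
from typing import Callable, Dict, List


def end_to_end_accuracy(
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
    eval_dataset: List[Dict],
) -> float:
    """Fraction of samples whose generated answer the judge accepts."""
    if not eval_dataset:
        return 0.0
    correct = 0
    for sample in eval_dataset:
        generated = answer_fn(sample["question"])
        # judge_fn receives (question, reference answer, generated answer)
        if judge_fn(sample["question"], sample["correct_answer"], generated):
            correct += 1
    return correct / len(eval_dataset)
```

Keeping the judge as a parameter lets you start with a strict string comparison in tests and swap in a model-based grader later without touching the loop.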
## Level 2: Summary Indexing
Basic chunking by heading has a fundamental problem: it loses the broader context. A chunk about "API Rate Limits" might not mention it's part of the "Getting Started" guide, which is crucial context.
Summary indexing solves this by creating a summary of each document section and embedding it together with the section's text:
```python
class SummaryIndexRAG(BasicRAG):
    def index_documents(self, documents):
        for doc in documents:
            # _extract_sections is expected to return dicts with a "text" key
            sections = self._extract_sections(doc["content"])
            for section in sections:
                # Generate a summary of the section
                summary = self.anthropic.messages.create(
                    model="claude-3-haiku-20240307",
                    max_tokens=200,
                    messages=[{
                        "role": "user",
                        "content": f"Summarize this section in 2-3 sentences:\n\n{section['text']}"
                    }]
                ).content[0].text

                # Embed the summary + chunk together
                enhanced_chunk = f"Section Summary: {summary}\n\nContent: {section['text']}"
                embedding = self.voyage.embed(
                    [enhanced_chunk],
                    model="voyage-2"
                ).embeddings[0]

                # Store the original text (not the enhanced text) for use as context
                self.vector_db.add_document(
                    embedding,
                    {"text": section["text"], "summary": summary}
                )
```
## Level 3: Adding Re-Ranking
Even with summary indexing, your top-3 retrieved chunks might not be in the optimal order. Re-ranking uses Claude to intelligently reorder results:
```python
class ReRankRAG(SummaryIndexRAG):
    def query(self, question: str, top_k: int = 10) -> str:
        # Retrieve more candidates initially
        query_embedding = self.voyage.embed([question], model="voyage-2").embeddings[0]
        candidates = self.vector_db.search(query_embedding, top_k=top_k)

        # Re-rank with Claude
        ranked = self._rerank_with_claude(question, candidates)

        # Take top 3 after re-ranking
        top_chunks = ranked[:3]
        context = "\n\n".join([c["metadata"]["text"] for c in top_chunks])

        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text

    def _rerank_with_claude(self, question: str, candidates: List[Dict]) -> List[Dict]:
        # Show Claude a numbered, truncated preview of each candidate
        chunks_text = "\n---\n".join([
            f"Chunk {i+1}: {c['metadata']['text'][:200]}..."
            for i, c in enumerate(candidates)
        ])
        response = self.anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Given the question: '{question}'\n\nRank these chunks by relevance (most relevant first). Return only the chunk numbers in order, comma-separated.\n\n{chunks_text}"
            }]
        )

        # Parse the ranked order (1-based chunk numbers -> 0-based indices)
        ranked_indices = [
            int(x.strip()) - 1
            for x in response.content[0].text.split(",")
            if x.strip().isdigit()
        ]

        # Drop out-of-range and repeated indices; a reply of "0" would otherwise
        # become index -1 and silently select the last candidate
        seen = set()
        ranked = []
        for i in ranked_indices:
            if 0 <= i < len(candidates) and i not in seen:
                seen.add(i)
                ranked.append(candidates[i])

        # Fall back to the vector-search order if nothing could be parsed
        return ranked or candidates
```
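Model replies do not always follow the requested format exactly ("Chunk 3, Chunk 1" instead of "3, 1", for example), so the parsing step is worth isolating and testing on its own. The function below is a sketch of a more tolerant parser. `parse_ranking` is a hypothetical name, and appending unranked chunks in their original order is a design assumption rather than something the class above does.

```python
import re
from typing import List


def parse_ranking(reply: str, n_candidates: int) -> List[int]:
    """Turn a reply such as '3, 1, 2' into zero-based candidate indices."""
    order: List[int] = []
    for token in re.findall(r"\d+", reply):
        index = int(token) - 1  # the prompt numbers chunks from 1
        if 0 <= index < n_candidates and index not in order:
            order.append(index)
    # Assumption: chunks the model left out keep their vector-search order at the end
    order.extend(i for i in range(n_candidates) if i not in order)
    return order


ranking = parse_ranking("Chunk 3, Chunk 1, Chunk 2", 3)
# ranking is [2, 0, 1]
```

Pulling every run of digits out of the reply is deliberately loose: a preamble like "Here are the top 10" would contribute a stray number, so it still pays to ask the model for numbers only.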
## Results: The Impact of Each Improvement
Here's what we achieved by layering these techniques:
| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Precision | 0.43 | 0.44 | 0.44 |
| Recall | 0.66 | 0.68 | 0.69 |
| F1 Score | 0.52 | 0.53 | 0.54 |
| MRR | 0.74 | 0.82 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
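- MRR improvement (0.74 → 0.87): Summary indexing accounts for the first step (0.74 → 0.82) and re-ranking for the second (0.82 → 0.87), so the first relevant chunk appears closer to the top of the results.
- End-to-end accuracy (71% → 81%): Each stage added 5 percentage points, even though precision, recall, and F1 moved by only 0.01 to 0.03. Better ordering of the retrieved chunks translates into better answers.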
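When reading the table, it helps to keep relative gains and percentage-point gains apart. The short calculation below uses only the MRR and accuracy figures from the table; `relative_gain` is an illustrative helper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement from before to after, as a percentage."""
    return (after - before) / before * 100


mrr = {"basic": 0.74, "summary": 0.82, "rerank": 0.87}
accuracy = {"basic": 71, "summary": 76, "rerank": 81}  # percent

total_mrr_gain = relative_gain(mrr["basic"], mrr["rerank"])     # about 17.6%
rerank_mrr_gain = relative_gain(mrr["summary"], mrr["rerank"])  # about 6.1%
accuracy_points = accuracy["rerank"] - accuracy["basic"]        # 10 percentage points
```

The overall MRR gain is roughly 18% relative to the baseline, of which the re-ranking stage alone contributes about 6%.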
## Key Takeaways
- Evaluate retrieval and generation separately – Don't just trust "vibes." Use precision, recall, F1, and MRR to measure retrieval quality independently from answer quality.
- Summary indexing preserves context – By embedding summaries alongside chunks, you help the retrieval system understand the broader context of each piece of content.
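- Re-ranking with Claude improves ordering – Even a lightweight model like Claude 3 Haiku can reorder search results effectively. In our results, re-ranking raised MRR from 0.82 to 0.87 and end-to-end accuracy from 76% to 81%. Together with summary indexing, MRR rose about 18% (0.74 → 0.87) and accuracy rose 10 percentage points.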
- Start simple, then optimize – Build a basic RAG pipeline first, establish your baseline metrics, then layer improvements. This approach ensures you're actually making progress, not just adding complexity.
- Watch your rate limits – Full evaluation runs can consume significant tokens. Consider running smaller evaluation sets during development and scaling up for production testing.
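For the smaller development runs mentioned in the last takeaway, a fixed-seed random subset keeps results comparable from one run to the next. This is a minimal sketch; `sample_eval_subset` and its default seed are assumptions.

```python
import random


def sample_eval_subset(eval_dataset, size, seed=0):
    """Return a reproducible random subset of the evaluation samples."""
    if size >= len(eval_dataset):
        return list(eval_dataset)
    # A dedicated Random instance avoids disturbing the global random state
    return random.Random(seed).sample(eval_dataset, size)
```

Calling it with the same `seed` always yields the same samples, so a metric change between two development runs reflects the system rather than the draw.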