Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques for production-grade performance.
This guide teaches you to build production-ready RAG systems with Claude, covering basic setup with Voyage AI embeddings, comprehensive evaluation using precision/recall/MRR metrics, and advanced optimization techniques like summary indexing and re-ranking to boost end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your proprietary data. While Claude excels at general knowledge tasks, it can't know your internal documentation, customer support history, or proprietary research. RAG bridges this gap by dynamically retrieving relevant information from your knowledge base and injecting it into Claude's context window.
In this guide, we'll walk through building a production-grade RAG system using the Anthropic Cookbook's reference implementation. We'll start with a basic "naive" RAG pipeline, then systematically improve it using advanced techniques like summary indexing and re-ranking. Along the way, we'll build a proper evaluation framework—because without measurement, you're just guessing.
Understanding the RAG Architecture
Before diving into code, let's understand the three core components of any RAG system:
- Ingestion Pipeline: Chunks documents, generates embeddings, and stores them in a vector database
- Retrieval System: Takes a user query, embeds it, and finds the most semantically similar document chunks
- Generation System: Passes retrieved chunks to Claude along with the original query for answer generation
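To make the three components concrete before we bring in real services, here is a minimal sketch of the same flow with no external dependencies. The bag-of-words `toy_embed` function is only a stand-in for a real embedding model, and `ingest`, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration, not part of the reference implementation.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks: List[str]) -> List[Tuple[str, Counter]]:
    """Ingestion: embed every chunk and keep (text, vector) pairs as the index."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index: List[Tuple[str, Counter]], query: str, k: int = 2) -> List[str]:
    """Retrieval: embed the query and return the k most similar chunks."""
    query_vector = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(chunks: List[str], question: str) -> str:
    """Generation input: the prompt that would be sent to the model."""
    context = "\n\n---\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


index = ingest([
    "Refunds are processed within 5 business days.",
    "API keys can be rotated from the account settings page.",
    "The free plan includes 1,000 requests per month.",
])
question = "How do I rotate my API keys?"
top = retrieve(index, question, k=1)
prompt = build_prompt(top, question)
print(prompt)
```

A real system swaps `toy_embed` for an embedding API and sends `prompt` to Claude, but the division of labor between the three stages stays the same.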
Level 1: Building a Basic RAG Pipeline
Let's start with what the industry calls "Naive RAG"—a straightforward implementation that gets the job done but has plenty of room for improvement.
Setup and Dependencies
First, install the required libraries:
pip install anthropic voyageai pandas numpy scikit-learn matplotlib
You'll need API keys from both Anthropic and Voyage AI. Voyage AI provides embedding models designed for retrieval tasks, which is what we use here to embed both documents and queries.
Initializing the Vector Database
For this example, we'll use an in-memory vector store. In production, you'd likely use Pinecone, Weaviate, or another hosted solution.
import voyageai
import numpy as np
from typing import List, Dict, Tuple


class InMemoryVectorDB:
    """Minimal in-memory vector store backed by Voyage AI embeddings."""

    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Embed each document's "content" field and store it with the document."""
        texts = [doc["content"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 3) -> List[Tuple[Dict[str, str], float]]:
        """Retrieve the top-k most similar documents along with their scores."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Score every stored embedding against the query with a dot product
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        # argsort is ascending, so take the last k indices and reverse them
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [(self.documents[i], similarities[i]) for i in top_indices]
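One detail worth checking: `search` ranks by raw dot product, while the pipeline is described in terms of cosine similarity. The two give the same ranking only when the vectors have unit length. Voyage AI's documentation describes its embeddings as normalized, but treat that as an assumption to verify for whichever model you use. The sketch below illustrates the difference with plain Python lists and made-up two-dimensional vectors.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: List[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a: List[float]) -> List[float]:
    length = norm(a)
    return [x / length for x in a]


query = normalize([3.0, 4.0])
doc = normalize([1.0, 2.0])
long_doc = [10.0, 20.0]  # same direction as doc, but not normalized

# For unit vectors, dot product and cosine similarity agree.
print(dot(query, doc), cosine(query, doc))

# For an unnormalized vector, the dot product is inflated by its length,
# while cosine similarity is unchanged.
print(dot(query, long_doc), cosine(query, long_doc))
```

If you switch to an embedding model that does not return unit vectors, normalizing at ingestion time keeps the dot-product search equivalent to cosine similarity.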
The Basic RAG Pipeline
Our naive approach follows three steps:
- Chunk documents by heading—each section becomes a separate chunk
- Embed each chunk using Voyage AI's embedding model
- Retrieve top-k chunks using cosine similarity and pass them to Claude
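The first step, chunking by heading, can be sketched as follows. This is a minimal illustration that assumes the source documents are markdown with `#`-style headings. The function name `chunk_by_heading`, the `heading` field, the `chunk-N` id scheme, and the `Introduction` label for text before the first heading are all hypothetical choices; only the `id` and `content` keys mirror what the rest of this guide expects.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_by_heading(markdown_text: str) -> List[Dict[str, str]]:
    """Split a markdown document into one chunk per heading."""
    sections = []
    current = ["Introduction", []]  # [heading, body lines] for text before any heading
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section
            sections.append(current)
            current = [match.group(2).strip(), []]
        else:
            current[1].append(line)
    sections.append(current)

    chunks = []
    for heading, body in sections:
        content = "\n".join(body).strip()
        if content:  # skip sections with no body text
            chunks.append({
                "id": f"chunk-{len(chunks)}",
                "heading": heading,
                "content": content,
            })
    return chunks


doc = """# Getting Started
Install the SDK with pip.

## Authentication
Set your API key in the environment.

## Rate Limits
"""
chunks = chunk_by_heading(doc)
for chunk in chunks:
    print(chunk["id"], "|", chunk["heading"], "|", chunk["content"])
```

Note that this simple splitter would also treat a `#` comment inside a fenced code sample as a heading, so real documentation usually needs a proper markdown parser.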
from anthropic import Anthropic


class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(voyage_key)
        self.llm = Anthropic(api_key=anthropic_key)

    def query(self, question: str) -> str:
        # Step 1: Retrieve relevant chunks
        retrieved = self.vector_db.search(question, k=3)
        context = "\n\n---\n\n".join([doc["content"] for doc, _ in retrieved])
        # Step 2: Generate answer with Claude
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text
This works, but it has limitations. Heading-based chunks don't always match the shape of a user's question, retrieval quality is inconsistent, and so far we have no way to measure performance.
Building a Comprehensive Evaluation System
"Vibes-based" evaluation won't cut it for production systems. We need quantitative metrics that measure both retrieval quality and end-to-end answer accuracy.
Creating a Synthetic Evaluation Dataset
The Anthropic Cookbook provides a dataset of 100 samples, each containing:
- A question
- Ground-truth relevant chunks
- A correct answer
import json

with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first sample
print(json.dumps(eval_data[0], indent=2))
Retrieval Metrics
We evaluate retrieval quality using four standard metrics:
Precision: Of the chunks we retrieved, how many were actually relevant?
Precision = |Retrieved ∩ Correct| / |Retrieved|
Recall: Of all the correct chunks that exist, how many did we retrieve?
Recall = |Retrieved ∩ Correct| / |Correct|
F1 Score: The harmonic mean of precision and recall.
F1 = 2 × Precision × Recall / (Precision + Recall)
Mean Reciprocal Rank (MRR): How early in the results does the first relevant chunk appear? For each query, the reciprocal rank is 1 divided by the position of the first relevant chunk (or 0 if none was retrieved), and MRR is the average across all queries. This matters because only the top-ranked chunks are passed to Claude, so a relevant chunk that ranks too low never reaches the prompt.
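To see how these definitions play out on a single query, here is a small worked example. The chunk ids are made up, and `retrieval_metrics` is a hypothetical helper that assumes the retrieved list contains no duplicates.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], correct: Set[str]) -> Dict[str, float]:
    """Compute precision, recall, F1 and reciprocal rank for one query."""
    hits = [chunk_id for chunk_id in retrieved if chunk_id in correct]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    # Reciprocal rank: 1 / position of the first relevant chunk, else 0
    reciprocal_rank = 0.0
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in correct:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
    }


# Three chunks retrieved in ranked order; two chunks are actually relevant.
metrics = retrieval_metrics(retrieved=["c7", "c2", "c9"], correct={"c2", "c4"})
print(metrics)
```

Here one of the three retrieved chunks is relevant (precision 1/3), one of the two relevant chunks was found (recall 1/2), and the first hit sits at rank 2 (reciprocal rank 1/2).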
End-to-End Accuracy
This measures whether Claude's final answer is correct given the retrieved context. It's the ultimate test—even perfect retrieval is useless if Claude can't synthesize the information correctly.
def evaluate_retrieval(rag_system, eval_data):
    """Evaluate retrieval metrics across the dataset"""
    precisions, recalls, f1s, mrrs = [], [], [], []
    for sample in eval_data:
        retrieved = rag_system.vector_db.search(sample["question"], k=3)
        retrieved_ids = {doc["id"] for doc, _ in retrieved}
        correct_ids = set(sample["relevant_chunk_ids"])
        # Calculate metrics (guarding against empty sets)
        true_positives = len(retrieved_ids & correct_ids)
        precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
        recall = true_positives / len(correct_ids) if correct_ids else 0.0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        # MRR: reciprocal rank of first relevant result
        for rank, (doc, _) in enumerate(retrieved, 1):
            if doc["id"] in correct_ids:
                mrr = 1.0 / rank
                break
        else:
            # No relevant chunk was retrieved for this sample
            mrr = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
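End-to-end accuracy needs its own loop, because it judges the generated answer and not the retrieved chunks. The sketch below shows one possible shape for that loop. It is not the Cookbook's implementation: `evaluate_end_to_end`, the `correct_answer` field name, and both callables are assumptions. In practice `answer_fn` would wrap your RAG system's query method and `judge_fn` would typically be a model-graded comparison, but simple stand-ins are used here so the example runs offline.

```python
from typing import Callable, Dict, List


def evaluate_end_to_end(
    eval_data: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Return the fraction of samples whose generated answer is judged correct."""
    if not eval_data:
        return 0.0
    correct = 0
    for sample in eval_data:
        generated = answer_fn(sample["question"])
        # judge_fn receives (question, reference answer, generated answer)
        if judge_fn(sample["question"], sample["correct_answer"], generated):
            correct += 1
    return correct / len(eval_data)


# Stand-ins for illustration only
canned_answers = {
    "What is the refund window?": "Refunds are available for 30 days.",
    "Which plan includes SSO?": "SSO is included in every plan.",
}
samples = [
    {"question": "What is the refund window?", "correct_answer": "30 days"},
    {"question": "Which plan includes SSO?", "correct_answer": "Enterprise"},
]


def fake_answer(question: str) -> str:
    return canned_answers[question]


def substring_judge(question: str, reference: str, generated: str) -> bool:
    return reference.lower() in generated.lower()


accuracy = evaluate_end_to_end(samples, fake_answer, substring_judge)
print(f"End-to-end accuracy: {accuracy:.0%}")
```

Keeping the answer and judge functions injectable makes it easy to rerun the same evaluation as the pipeline changes at each level.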
Level 2: Summary Indexing
Our basic RAG has a fundamental problem: heading-based chunks don't reliably line up with what a query is asking about. A single section might mix several distinct concepts, and relevant information might span multiple sections.
Summary indexing solves this by creating higher-level summaries of document sections. Instead of retrieving raw chunks, we retrieve summaries that provide broader context.

def create_summary_index(documents, llm_client):
    """Create summary embeddings for document sections"""
    summaries = []
    for doc in documents:
        response = llm_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this section in 2-3 sentences:\n\n{doc['content']}"
            }]
        )
        summaries.append({
            "id": doc["id"],
            "summary": response.content[0].text,
            "original_content": doc["content"]
        })
    return summaries
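During retrieval, we embed the query and search against the summary embeddings. Once we find relevant summaries, we pass the corresponding full chunks to Claude as context. Because a summary captures the essence of a longer passage, a query can match a section even when its wording differs from the raw text. In the results reported at the end of this guide, average recall rises from 0.66 to 0.69 for the optimized pipeline.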
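The retrieval step just described can be sketched as follows. The entries use the same `id`, `summary`, and `original_content` keys that `create_summary_index` produces, but everything else is an assumption made for illustration: `build_summary_index` and `search_summary_index` are hypothetical names, and the bag-of-words `toy_embed` stands in for a real embedding model so the example runs without API keys.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_summary_index(
    summaries: List[Dict[str, str]], embed_fn: Callable[[str], Counter]
) -> List[Tuple[Counter, Dict[str, str]]]:
    """Embed each summary once, keeping the full entry next to its vector."""
    return [(embed_fn(entry["summary"]), entry) for entry in summaries]


def search_summary_index(
    index: List[Tuple[Counter, Dict[str, str]]],
    query: str,
    embed_fn: Callable[[str], Counter],
    k: int = 3,
) -> List[str]:
    """Match the query against summaries, but return the original full text."""
    query_vector = embed_fn(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vector, pair[0]), reverse=True)
    return [entry["original_content"] for _, entry in ranked[:k]]


summaries = [
    {
        "id": "billing",
        "summary": "How invoices and refunds are handled.",
        "original_content": "Full text of the billing section.",
    },
    {
        "id": "auth",
        "summary": "How to create and rotate API keys.",
        "original_content": "Full text of the authentication section.",
    },
]
index = build_summary_index(summaries, toy_embed)
context_chunks = search_summary_index(index, "rotate API keys", toy_embed, k=1)
print(context_chunks)
```

The key design point is that similarity is computed on the summary while the prompt receives the original content, so Claude still sees the full detail.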
Level 3: Adding Re-Ranking
Even with summary indexing, our initial retrieval might miss the mark. Re-ranking adds a second stage: after retrieving top-k candidates, we use Claude to score and reorder them based on relevance to the query.
def rerank_with_claude(query: str, candidates: List[Dict], llm_client) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance"""
    scored = []
    for chunk in candidates:
        response = llm_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the query?\n\nQuery: {query}\n\nChunk: {chunk['content'][:500]}\n\nRelevance score (just the number):"
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            # The model returned something other than a bare number
            score = 0.0
        scored.append((chunk, score))
    # Sort by score descending
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored]
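Re-ranking targets MRR: the first relevant chunk should appear earlier in the results. In the results below, average MRR rises from 0.74 to 0.87 for the optimized pipeline. Ordering matters most when you retrieve a larger candidate set and keep only the top few chunks for the prompt, since anything ranked below the cutoff never reaches Claude.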
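Two practical details of re-ranking are worth sketching: model replies are not always a bare number, and re-ranking usually ends by truncating to the chunks you will actually send. The helper names below (`parse_relevance_score`, `rerank`) and the injected `score_fn` are hypothetical. In a real system `score_fn` would make the Claude call, while a fake one is used here so the example runs offline.

```python
import re
from typing import Callable, Dict, List


def parse_relevance_score(text: str, default: float = 0.0) -> float:
    """Extract the first number from a model reply and clamp it to [0, 10]."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if not match:
        return default
    return min(max(float(match.group()), 0.0), 10.0)


def rerank(
    query: str,
    candidates: List[Dict[str, str]],
    score_fn: Callable[[str, str], str],
    top_n: int = 3,
) -> List[Dict[str, str]]:
    """Score every candidate, sort best-first, and keep the top_n chunks."""
    scored = [
        (parse_relevance_score(score_fn(query, chunk["content"])), chunk)
        for chunk in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def fake_score_fn(query: str, content: str) -> str:
    """Stand-in for a model call: replies in two different formats."""
    return "9" if "refund" in content.lower() else "Score: 2 out of 10"


candidates = [
    {"id": "shipping", "content": "Orders ship within two days."},
    {"id": "refunds", "content": "Refund requests are reviewed weekly."},
]
reranked = rerank("How do refunds work?", candidates, fake_score_fn, top_n=1)
print([chunk["id"] for chunk in reranked])
```

Falling back to a default score keeps one malformed reply from aborting the whole re-ranking pass.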
Results: The Impact of Optimization
After implementing summary indexing and re-ranking, precision, recall, and F1 improve only slightly, while MRR and end-to-end accuracy improve substantially:
| Metric | Basic RAG | Optimized RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
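To put the table in perspective, it helps to look at relative as well as absolute gains. The short calculation below uses only the numbers from the table, with end-to-end accuracy written as a fraction. `compute_gains` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (basic, optimized) pairs copied from the results table
results: Dict[str, Tuple[float, float]] = {
    "Avg Precision": (0.43, 0.44),
    "Avg Recall": (0.66, 0.69),
    "Avg F1 Score": (0.52, 0.54),
    "Avg MRR": (0.74, 0.87),
    "End-to-End Accuracy": (0.71, 0.81),
}


def compute_gains(table: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return (absolute gain, relative gain) for each metric."""
    gains = {}
    for metric, (basic, optimized) in table.items():
        absolute = optimized - basic
        gains[metric] = (absolute, absolute / basic)
    return gains


gains = compute_gains(results)
for metric, (absolute, relative) in gains.items():
    print(f"{metric:<20} +{absolute:.2f} absolute, +{relative:.1%} relative")
```

MRR improves by roughly 18% relative and end-to-end accuracy by roughly 14%, while the set-based metrics move by less than 5%. In other words, the optimized pipeline retrieves a similar set of chunks but orders them better.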
Production Considerations
When moving to production, consider these additional factors:
- Rate Limits: Full evaluations can hit API rate limits. Use Tier 2+ accounts and consider sampling your evaluation dataset.
- Chunking Strategy: Experiment with different chunk sizes and overlap. The optimal size depends on your document structure.
- Embedding Model: Voyage AI's voyage-2 is excellent, but test alternatives like text-embedding-3-small from OpenAI.
- Caching: Cache embeddings for frequently accessed documents to reduce API calls and latency.
- Monitoring: Log retrieval metrics in production to detect degradation over time.
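As one illustration of the caching point above, here is a minimal sketch of an embedding cache keyed by a hash of the text. `CachedEmbedder` is a hypothetical wrapper, and `embed_fn` stands for whatever batch embedding call you use. A fake one is injected here so the example runs offline. A production version would likely persist the cache and include the model name in the key.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wraps a batch embedding function and skips texts it has already seen."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}
        self.api_calls = 0  # number of times embed_fn was actually invoked

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # De-duplicate while preserving order, then keep only uncached texts
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self._cache]
        if missing:
            self.api_calls += 1
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(t)] for t in texts]


sent_batches: List[List[str]] = []


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for an embedding API: records each batch it receives."""
    sent_batches.append(list(texts))
    return [[float(len(t))] for t in texts]


embedder = CachedEmbedder(fake_embed)
first = embedder.embed(["a", "bb"])
second = embedder.embed(["bb", "ccc"])  # only "ccc" is new
third = embedder.embed(["a", "ccc"])    # fully served from the cache
print(embedder.api_calls, sent_batches)
```

The same wrapper also helps with rate limits during evaluation, since repeated runs over an unchanged dataset stop re-embedding the same questions.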
Key Takeaways
- Measure what matters: Build separate evaluation pipelines for retrieval quality (precision, recall, F1, MRR) and end-to-end accuracy. Without metrics, you can't optimize.
- Summary indexing beats raw chunking: Searching over summary-level embeddings improves recall by capturing the essence of longer passages, though the gain in our results is modest (0.66 to 0.69).
- Re-ranking is worth the latency: A second-stage re-ranking pass with Claude dramatically improves MRR, ensuring the most relevant context appears first.
- Optimize iteratively: Start with basic RAG, measure baseline performance, then systematically apply improvements. The 10-point accuracy gain from 71% to 81% came from targeted, measurable changes.
- Synthetic evaluation datasets are powerful: Generate challenging test cases that require multi-chunk synthesis to truly stress-test your system.