Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your unique business context. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude tap into your proprietary data to deliver accurate, context-aware answers.
In this guide, we'll walk through building and optimizing a RAG system using Claude and the Anthropic documentation as our knowledge base. You'll learn how to move from a basic "naive RAG" implementation to an advanced system that achieves measurable improvements in retrieval quality and end-to-end accuracy.
What You'll Learn
- How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
- How to build a robust evaluation suite with production-grade metrics
- How to implement summary indexing for better retrieval coverage
- How to use Claude as a re-ranker to improve result relevance
- How to measure and optimize precision, recall, F1, MRR, and end-to-end accuracy
Prerequisites
Before diving in, make sure you have:
- An Anthropic API key for accessing Claude
- A Voyage AI API key for generating embeddings
- Python 3.8+ installed
- Basic familiarity with Python and vector databases
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
Level 1: Basic RAG Pipeline
Let's start with a simple "naive RAG" implementation. This three-step process forms the foundation of any RAG system:
- Chunk documents by heading (each subheading becomes a separate chunk; a chunking sketch follows this list)
- Embed each chunk using Voyage AI's embedding model
- Retrieve relevant chunks using cosine similarity when a query comes in
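Here's one way the chunking step might look for markdown sources; a minimal sketch, assuming the documents are markdown strings and using a simplified heading regex (the chunk_by_heading name is illustrative):

import re
from typing import Dict, List

def chunk_by_heading(markdown_text: str) -> List[Dict[str, str]]:
    # Split at the start of each markdown heading; each section becomes a chunk
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [{"content": s.strip()} for s in sections if s.strip()]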
Setting Up the Vector Database
For this example, we'll use an in-memory vector store. In production, you'd likely use a dedicated vector database such as Pinecone, Weaviate, or Chroma.
import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        # Embed all chunks in a single batch call
        texts = [doc["content"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Voyage embeddings are normalized to unit length, so the dot
        # product is equivalent to cosine similarity
        scores = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        # Indices of the k highest scores, best match first
        top_indices = np.argsort(scores)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
Implementing the Basic RAG Query
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, vector_db, anthropic_api_key: str):
        self.vector_db = vector_db
        self.anthropic = Anthropic(api_key=anthropic_api_key)

    def query(self, question: str) -> str:
        # Step 1: Retrieve relevant chunks
        chunks = self.vector_db.search(question, k=3)

        # Step 2: Build context from chunks
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # Step 3: Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1024,
            system="You are a helpful assistant. Answer the question based on the provided context.",
            messages=[
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
            ],
        )
        return response.content[0].text
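To tie the pieces together, a minimal usage sketch; the environment variable names, docs file path, and sample question are illustrative:

import os

# Build the index from the docs (path is illustrative)
db = InMemoryVectorDB(api_key=os.environ["VOYAGE_API_KEY"])
db.add_documents(chunk_by_heading(open("anthropic_docs.md").read()))

rag = BasicRAG(db, anthropic_api_key=os.environ["ANTHROPIC_API_KEY"])
print(rag.query("What is the maximum context window for Claude 3 Opus?"))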
Building an Evaluation System
"Vibes-based" evaluation won't cut it in production. You need objective metrics to measure and improve your RAG system. We'll evaluate two independent components:
- Retrieval performance: How well does the system find relevant chunks?
- End-to-end performance: How accurate are the final answers?
Creating a Synthetic Evaluation Dataset
Generate 100+ test samples, each containing:
- A question
- The correct answer
- The relevant document chunks that should be retrieved
import json

# Example evaluation sample
example_sample = {
    "question": "What is the maximum context window for Claude 3 Opus?",
    "correct_answer": "Claude 3 Opus supports up to 200,000 tokens of context.",
    "relevant_chunks": [
        "Claude 3 Opus features a 200,000 token context window...",
        "The context window allows processing large documents..."
    ]
}
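One way to bootstrap such a dataset is to have Claude write a question and answer for each chunk; a rough sketch, where the prompt wording, model choice, and generate_eval_sample name are all assumptions:

def generate_eval_sample(client, chunk: str) -> dict:
    prompt = (
        "Write one question answerable using only the passage below, plus its "
        "answer, as JSON with keys 'question' and 'correct_answer'.\n\n"
        f"Passage:\n{chunk}"
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes Claude returns bare JSON; add validation in practice
    sample = json.loads(response.content[0].text)
    sample["relevant_chunks"] = [chunk]
    return sample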
Key Retrieval Metrics
#### Precision

Precision measures how many of the retrieved chunks are actually relevant. High precision means fewer false positives.
Precision = |Retrieved ∩ Correct| / |Retrieved|
#### Recall

Recall measures how many of the relevant chunks were retrieved. High recall means you're not missing important information.
Recall = |Retrieved ∩ Correct| / |Correct|
#### F1 Score

The harmonic mean of precision and recall, giving a balanced view of retrieval quality.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
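All three metrics are a few lines of set arithmetic per evaluation sample; a minimal sketch:

def retrieval_metrics(retrieved_chunks, correct_chunks):
    retrieved, correct = set(retrieved_chunks), set(correct_chunks)
    hits = len(retrieved & correct)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1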
#### Mean Reciprocal Rank (MRR)

MRR evaluates how early the first relevant chunk appears in your results. A high MRR means users see relevant information quickly.
def calculate_mrr(retrieved_chunks, correct_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in correct_chunks:
            # Reciprocal rank of the first relevant chunk (1-indexed)
            return 1.0 / (i + 1)
    return 0.0
End-to-End Accuracy
This measures whether Claude's final answer is correct. You can use LLM-as-judge or manual evaluation.
def evaluate_end_to_end(rag_system, eval_dataset):
    correct = 0
    for sample in eval_dataset:
        answer = rag_system.query(sample["question"])
        # Use Claude to judge correctness (judge_answer is sketched below)
        judgment = judge_answer(answer, sample["correct_answer"])
        if judgment == "correct":
            correct += 1
    return correct / len(eval_dataset)
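The judge_answer helper isn't shown above; one possible LLM-as-judge implementation, where the prompt wording and model choice are assumptions:

from anthropic import Anthropic

judge_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(answer: str, correct_answer: str) -> str:
    response = judge_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Reference answer:\n{correct_answer}\n\n"
                f"Candidate answer:\n{answer}\n\n"
                "Reply with exactly one word, correct or incorrect, judging "
                "whether the candidate conveys the same facts as the reference."
            ),
        }],
    )
    return response.content[0].text.strip().lower()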
Level 2: Summary Indexing
Basic chunking often misses the forest for the trees. Summary indexing adds a high-level overview of each document section to improve retrieval.
def create_summary_index(documents):
    summary_db = InMemoryVectorDB(api_key=VOYAGE_API_KEY)
    for doc in documents:
        # Generate a summary using Claude (one possible generate_summary
        # is sketched below)
        summary = generate_summary(doc["content"])
        # Store both the summary and original content
        summary_db.add_documents([
            {"content": summary, "type": "summary", "original": doc},
            {"content": doc["content"], "type": "full", "original": doc},
        ])
    return summary_db
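generate_summary isn't defined above; one possible implementation, where the prompt wording and model choice are assumptions:

summary_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_summary(text: str) -> str:
    response = summary_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize the following documentation section in 2-3 sentences:\n\n{text}",
        }],
    )
    return response.content[0].text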
When a query comes in, search both summaries and full chunks. This improves recall by helping the system find relevant documents even when the query doesn't match specific chunk text.
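A minimal sketch of that combined lookup, deduplicating hits back to their original documents (the search_with_summaries name and over-retrieval factor are illustrative):

def search_with_summaries(summary_db, query: str, k: int = 3):
    # Over-retrieve: a summary hit and a full-text hit may point to the
    # same underlying document
    hits = summary_db.search(query, k=2 * k)
    results, seen = [], set()
    for hit in hits:
        doc = hit["original"]
        key = doc["content"]  # use a stable document ID in practice
        if key not in seen:
            seen.add(key)
            results.append(doc)
        if len(results) == k:
            break
    return results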
Level 3: Re-Ranking with Claude
Re-ranking takes the top-k results from your initial retrieval and uses Claude to reorder them by relevance. This dramatically improves MRR.
def rerank_with_claude(client: Anthropic, query: str, candidates: List[Dict]) -> List[Dict]:
    passages = "\n".join(
        f"{i + 1}. {c['content']}" for i, c in enumerate(candidates)
    )
    prompt = f"""Given the query: "{query}"

Rank the following passages by relevance (most relevant first):

{passages}

Return the indices in order of relevance, separated by commas."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    # Parse the comma-separated 1-based indices and reorder the candidates;
    # in production, validate this output in case Claude adds extra text
    indices = [int(i.strip()) - 1 for i in response.content[0].text.split(",")]
    return [candidates[i] for i in indices]
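In practice you'd over-retrieve and let the re-ranker choose the final context; a short usage sketch built on the pieces above, with illustrative k values:

client = Anthropic()

question = "What is the maximum context window for Claude 3 Opus?"
candidates = db.search(question, k=10)  # wide initial retrieval
top_chunks = rerank_with_claude(client, question, candidates)[:3]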
Results: Before and After
After implementing summary indexing and re-ranking, here's the improvement over the basic RAG pipeline:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate limits: Full evaluations can hit API rate limits. Consider running smaller eval sets or using Tier 2+ accounts.
- Cost management: Summary indexing and re-ranking add token costs. Balance improvements against budget.
- Vector database: For production, use a hosted vector DB with proper indexing and scaling.
- Evaluation dataset: Maintain a diverse, evolving eval set that reflects real user queries.
Key Takeaways
- Evaluate retrieval and generation separately to identify where your RAG system needs improvement. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing boosts recall by adding high-level document overviews that catch queries missed by chunk-level search.
- Re-ranking with Claude dramatically improves MRR, ensuring the most relevant information appears first in your results.
- Start simple, then iterate — a basic RAG pipeline can be surprisingly effective, and targeted improvements (like re-ranking) often yield the biggest gains.
- Build a synthetic evaluation dataset early in your development process. It's the foundation for objective, reproducible improvements.