Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide teaches you to build a RAG system with Claude, from basic setup to advanced optimization. You'll learn to evaluate retrieval performance using precision, recall, F1, and MRR metrics, then improve accuracy from 71% to 81% using summary indexing and re-ranking techniques.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG enables Claude to answer questions about your unique business context with high accuracy.
In this guide, we'll walk through building and optimizing a RAG system using Claude and Voyage AI embeddings, using the Claude documentation as our knowledge base. You'll learn how to:
- Set up a basic RAG pipeline
- Build a robust evaluation system
- Implement advanced techniques like summary indexing and re-ranking
- Achieve measurable improvements in retrieval and end-to-end accuracy
Understanding the RAG Architecture
Before diving into code, let's understand what makes RAG tick. A RAG system has three core components:
- Ingestion Pipeline: Chunks documents, generates embeddings, and stores them in a vector database
- Retrieval System: Finds relevant chunks for a given query using semantic similarity
- Generation System: Feeds retrieved context to Claude to generate accurate answers
Level 1: Building a Basic RAG System
Let's start with what's often called "Naive RAG": a simple but functional implementation.
Setup and Dependencies
First, install the required libraries:
```bash
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
You'll need API keys from both Anthropic and Voyage AI.
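One way to supply the keys without hard-coding them is to read them from environment variables. The sketch below assumes the variable names `ANTHROPIC_API_KEY` and `VOYAGE_API_KEY`; the helper `require_env` is a hypothetical name introduced here for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this guide")
    return value


# Usage (uncomment once both variables are set in your shell):
# anthropic_api_key = require_env("ANTHROPIC_API_KEY")
# voyage_api_key = require_env("VOYAGE_API_KEY")
```

Failing early with a named variable tends to be easier to debug than an authentication error halfway through ingestion.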
Initialize the Vector Database
For this example, we'll use an in-memory vector database. In production, you'd want a hosted solution like Pinecone or Weaviate.
```python
import voyageai
import numpy as np
from typing import List, Dict


class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings"""
        texts = [doc['content'] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Search for similar documents using cosine similarity"""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]

        # Calculate cosine similarity between the query and every stored document
        similarities = []
        for doc_embedding in self.embeddings:
            similarity = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append(similarity)

        # Get top-k results (argsort is ascending, so reverse the last k)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
```
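The similarity search can be tried without any API calls. The following is a minimal sketch, assuming the embeddings are already available as plain lists of floats; `top_k_cosine` is a hypothetical helper name, and it computes all similarities in one vectorized step instead of a Python loop.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return the indices of the k rows of doc_matrix most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the two norms
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(scores)[-k:][::-1].tolist()


# Toy 2-dimensional "embeddings", made up for illustration
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_cosine([1.0, 0.1], docs, k=2))  # [0, 2]
```

The query points almost along the first axis, so document 0 ranks first and the diagonal document 2 ranks second.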
The Basic RAG Pipeline
Now let's implement the three-step pipeline:
```python
from typing import List, Dict

from anthropic import Anthropic


class BasicRAG:
    def __init__(self, vector_db, anthropic_api_key: str):
        self.vector_db = vector_db
        self.anthropic = Anthropic(api_key=anthropic_api_key)

    def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
        """Chunk documents by heading"""
        chunks = []
        for doc in documents:
            # Split by headings (## or ###)
            sections = doc['content'].split('\n##')
            for section in sections:
                if section.strip():
                    # The first line of a section is its heading; drop leftover '#' markers
                    first_line = section.split('\n')[0]
                    chunks.append({
                        'content': section.strip(),
                        'source': doc.get('source', ''),
                        'heading': first_line.lstrip('#').strip() if '\n' in section else ''
                    })
        return chunks

    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve relevant chunks"""
        return self.vector_db.search(query, k=k)

    def generate(self, query: str, context: List[Dict]) -> str:
        """Generate answer using Claude"""
        context_text = "\n\n".join([c['content'] for c in context])
        prompt = f"""Based on the following context, answer the question accurately.
If the context doesn't contain enough information, say so.

Context:
{context_text}

Question: {query}

Answer:"""
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
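Heading-based chunking is easy to test in isolation. Here is a minimal standalone sketch, assuming markdown input where sections start with `## ` or `### `; `chunk_by_heading` is a hypothetical function name, and unlike a plain string split it keeps the heading markers inside each chunk's content.

```python
import re
from typing import Dict, List


def chunk_by_heading(text: str, source: str = "") -> List[Dict[str, str]]:
    """Split markdown text into one chunk per level-2 or level-3 heading."""
    # Split just before any line that starts with "## " or "### ";
    # the lookahead keeps the heading line inside the chunk it introduces.
    sections = re.split(r"(?m)^(?=#{2,3} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        first_line = section.split("\n", 1)[0]
        heading = first_line.lstrip("#").strip() if first_line.startswith("#") else ""
        chunks.append({"content": section, "source": source, "heading": heading})
    return chunks


sample = "Intro text\n## Streaming\nUse stream=True.\n### Details\nMore."
for chunk in chunk_by_heading(sample, source="docs/streaming.md"):
    print(repr(chunk["heading"]), "->", repr(chunk["content"]))
```

Text before the first heading becomes its own chunk with an empty heading, so nothing is silently dropped.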
Building a Robust Evaluation System
This is where most RAG tutorials stop, but it's where the real work begins. You can't improve what you can't measure.
Creating an Evaluation Dataset
We need three things for each test case:
- A question
- The correct chunks (ground truth)
- A correct answer
```python
# Example evaluation dataset structure
eval_dataset = [
    {
        "question": "How do I stream Claude's responses?",
        "relevant_chunks": ["chunk_1_id", "chunk_5_id"],
        "correct_answer": "You can stream Claude's responses by setting stream=True in the API call..."
    },
    # ... 97 more samples
]
```
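Because the retrieval metrics compare chunk IDs, a typo in the ground truth quietly lowers every score. A quick sanity check before running the evaluation can catch that. The sketch below is an assumption-laden helper: `validate_eval_dataset` and `known_chunk_ids` are hypothetical names, and it only checks the three fields listed above.

```python
from typing import Dict, List, Set

REQUIRED_KEYS = {"question", "relevant_chunks", "correct_answer"}


def validate_eval_dataset(dataset: List[Dict], known_chunk_ids: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the dataset looks usable."""
    problems = []
    for i, sample in enumerate(dataset):
        missing = REQUIRED_KEYS - set(sample)
        if missing:
            problems.append(f"sample {i}: missing keys {sorted(missing)}")
            continue
        if not sample["relevant_chunks"]:
            problems.append(f"sample {i}: no relevant chunks listed")
        # Ground-truth IDs must match IDs that actually exist in the index
        unknown = set(sample["relevant_chunks"]) - known_chunk_ids
        if unknown:
            problems.append(f"sample {i}: unknown chunk ids {sorted(unknown)}")
    return problems
```

Calling it with the set of IDs produced at ingestion time returns one message per broken sample, which is usually quicker to act on than a mysteriously low recall.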
Key Metrics Explained
#### Retrieval Metrics
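Precision: Of the chunks we retrieved, how many were relevant?

$$\text{Precision} = \frac{\text{True Positives}}{\text{Total Retrieved}}$$

Recall: Of all relevant chunks, how many did we retrieve?

$$\text{Recall} = \frac{\text{True Positives}}{\text{Total Relevant}}$$

F1 Score: Harmonic mean of precision and recall.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Mean Reciprocal Rank (MRR): How high did the first relevant result appear? For each query, take the reciprocal of the rank of the first relevant result (0 if no relevant chunk was retrieved), then average over all $N$ queries.

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$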
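A worked example for a single query makes these definitions concrete. This is a minimal sketch with made-up chunk IDs; `retrieval_metrics` is a hypothetical helper name.

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Compute precision, recall, F1, and reciprocal rank for one query."""
    relevant = set(relevant_ids)
    hits = len(set(retrieved_ids) & relevant)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Reciprocal rank: 1 / position of the first relevant result, 0 if there is none
    reciprocal_rank = 0.0
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            reciprocal_rank = 1 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": reciprocal_rank}


# Three chunks retrieved, two chunks relevant, one overlap ("c1") at rank 2
metrics = retrieval_metrics(["c7", "c1", "c9"], ["c1", "c5"])
print(metrics)  # precision 1/3, recall 1/2, F1 0.4, reciprocal rank 0.5
```

Averaging the `rr` value over every query in the evaluation set gives the MRR.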
#### End-to-End Metric
Accuracy: Did Claude's answer match the expected answer?
Implementing the Evaluation
```python
import numpy as np


def evaluate_retrieval(rag_system, eval_dataset):
    """Evaluate retrieval performance"""
    results = {
        'precision': [],
        'recall': [],
        'f1': [],
        'mrr': []
    }
    for sample in eval_dataset:
        retrieved = rag_system.retrieve(sample['question'])
        # Assumes every chunk carries an 'id' that matches the IDs in relevant_chunks
        retrieved_ids = [r['id'] for r in retrieved]
        relevant_ids = sample['relevant_chunks']

        # Calculate metrics
        true_positives = len(set(retrieved_ids) & set(relevant_ids))
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(relevant_ids) if relevant_ids else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: find first relevant result
        mrr = 0
        for i, rid in enumerate(retrieved_ids):
            if rid in relevant_ids:
                mrr = 1 / (i + 1)
                break

        results['precision'].append(precision)
        results['recall'].append(recall)
        results['f1'].append(f1)
        results['mrr'].append(mrr)

    # Average each metric over all samples
    return {k: np.mean(v) for k, v in results.items()}
```
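End-to-end accuracy needs a way to decide whether an answer "matches" the expected one, and this guide does not prescribe a grading method. The sketch below keeps the grader pluggable: `evaluate_end_to_end`, `answer_fn`, and `grade_fn` are hypothetical names, and the substring grader is only a crude stand-in (a model-based judge is one alternative you might swap in).

```python
from typing import Callable, Dict, List


def evaluate_end_to_end(
    answer_fn: Callable[[str], str],
    grade_fn: Callable[[str, str], bool],
    eval_dataset: List[Dict],
) -> float:
    """Return the fraction of questions whose generated answer is graded correct."""
    if not eval_dataset:
        return 0.0
    correct = 0
    for sample in eval_dataset:
        answer = answer_fn(sample["question"])
        if grade_fn(answer, sample["correct_answer"]):
            correct += 1
    return correct / len(eval_dataset)


def contains_expected(answer: str, expected: str) -> bool:
    """Crude grader: the expected text must appear verbatim (case-insensitive)."""
    return expected.lower() in answer.lower()


# Stubbed answers stand in for retrieve + generate so the example runs offline
canned = {"How do I stream?": "Set stream=True in the API call.", "What is RAG?": "Not sure."}
dataset = [
    {"question": "How do I stream?", "correct_answer": "stream=True"},
    {"question": "What is RAG?", "correct_answer": "Retrieval Augmented Generation"},
]
print(evaluate_end_to_end(canned.get, contains_expected, dataset))  # 0.5
```

In a real run, `answer_fn` would wrap the system's retrieve and generate steps for a single question.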
Level 2: Summary Indexing
Basic RAG has a fundamental problem: chunks often lack context. A chunk about "rate limits" might not mention it's about Claude's API. Summary indexing solves this by creating a summary for each chunk and using it for retrieval.
```python
class SummaryIndexRAG(BasicRAG):
    def __init__(self, vector_db, anthropic_api_key: str):
        super().__init__(vector_db, anthropic_api_key)
        self.chunks = []  # full chunks; a chunk's position in this list is its chunk_id

    def generate_summary(self, chunk: Dict) -> str:
        """Generate a summary of the chunk using Claude"""
        prompt = f"""Summarize the following text in 1-2 sentences,
focusing on what questions this text can answer:

{chunk['content']}

Summary:"""
        response = self.anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def index_chunks(self, chunks: List[Dict]):
        """Embed one summary per chunk, keeping a pointer back to the full chunk"""
        start = len(self.chunks)
        self.chunks.extend(chunks)
        self.vector_db.add_documents([
            {'content': self.generate_summary(chunk), 'chunk_id': start + i}
            for i, chunk in enumerate(chunks)
        ])

    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        # Search over summaries first
        summary_results = self.vector_db.search(query, k=k)
        # Then map each matching summary back to its full chunk
        return [self.chunks[r['chunk_id']] for r in summary_results]
```
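The core idea, matching the query against summaries while returning the full chunks, can be shown without any API calls. The following toy sketch swaps the embedding model for a bag-of-words vector, which is purely a stand-in; `ToySummaryIndex`, `bow`, and `cosine` are hypothetical names, and the chunks and summaries are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def bow(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToySummaryIndex:
    def __init__(self):
        self.chunks: List[Dict] = []
        self.summary_vectors: List[Counter] = []

    def add(self, chunk: Dict, summary: str) -> None:
        # The summary is what gets "embedded"; the chunk is what gets returned
        self.chunks.append(chunk)
        self.summary_vectors.append(bow(summary))

    def search(self, query: str, k: int = 1) -> List[Dict]:
        q = bow(query)
        ranked = sorted(
            range(len(self.chunks)),
            key=lambda i: cosine(q, self.summary_vectors[i]),
            reverse=True,
        )
        return [self.chunks[i] for i in ranked[:k]]


index = ToySummaryIndex()
index.add({"content": "Requests are capped per minute."}, "Claude API rate limits")
index.add({"content": "Pass stream=True to the request."}, "How to stream Claude responses")
print(index.search("rate limits for the claude api")[0]["content"])
```

The first chunk never mentions "Claude" or "API", yet it is retrieved because its summary does, which is the gap summary indexing is meant to close.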
Level 3: Adding Re-Ranking
Re-ranking adds a second, more careful pass over the retrieved candidates. Instead of relying solely on embedding similarity, we use Claude to re-rank the retrieved chunks based on actual relevance to the query.
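```python
import re


class ReRankRAG(SummaryIndexRAG):
    def rerank(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
        """Use Claude to re-rank chunks by relevance"""
        chunks_text = "\n---\n".join([
            f"Chunk {i}: {c['content']}"
            for i, c in enumerate(chunks)
        ])
        prompt = f"""Given the query: "{query}"

Rank the following chunks by their relevance to the query:

{chunks_text}

Return only the chunk numbers as a comma-separated list, most relevant first:"""
        response = self.anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        # Parse the response to get ordered indices, skipping duplicates
        # and any number that does not refer to a real chunk
        order = []
        for token in re.findall(r'\d+', response.content[0].text):
            idx = int(token)
            if idx < len(chunks) and idx not in order:
                order.append(idx)
        # Chunks the model did not mention keep their original order at the end
        order += [i for i in range(len(chunks)) if i not in order]
        # Then return chunks in that order
        reordered_chunks = [chunks[i] for i in order]
        return reordered_chunks[:k]
```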
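Re-ranking only helps if it sees more candidates than it finally returns. One way to wire the two stages together is sketched below with pluggable functions, so it runs without API calls; `retrieve_and_rerank`, `search_fn`, `rerank_fn`, and the `overfetch` factor are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List


def retrieve_and_rerank(
    query: str,
    search_fn: Callable[[str, int], List[Dict]],
    rerank_fn: Callable[[str, List[Dict], int], List[Dict]],
    k: int = 3,
    overfetch: int = 3,
) -> List[Dict]:
    """Fetch k * overfetch candidates cheaply, then let the re-ranker pick the final k."""
    candidates = search_fn(query, k * overfetch)
    if len(candidates) <= 1:
        # Nothing to reorder, so skip the extra model call
        return candidates[:k]
    return rerank_fn(query, candidates, k)


# Stubs standing in for the vector search and the Claude-based re-ranker
corpus = [{"content": f"chunk {i}", "score": s} for i, s in enumerate([0.2, 0.9, 0.5, 0.7])]


def stub_search(query: str, n: int) -> List[Dict]:
    return corpus[:n]


def stub_rerank(query: str, chunks: List[Dict], k: int) -> List[Dict]:
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]


top = retrieve_and_rerank("any query", stub_search, stub_rerank, k=2, overfetch=2)
print([c["content"] for c in top])  # ['chunk 1', 'chunk 3']
```

The over-fetch factor trades cost for quality: more candidates give the re-ranker more to work with but lengthen the prompt.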
Results: The Impact of Optimization
After implementing these techniques, here's what we achieved:
| Metric | Basic RAG | Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.68 | 0.69 |
| Avg F1 Score | 0.52 | 0.53 | 0.54 |
| Avg MRR | 0.74 | 0.82 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
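- MRR improvement: Summary indexing raised MRR from 0.74 to 0.82, and re-ranking pushed relevant results higher still, to 0.87
- End-to-end accuracy: Better retrieval led to better answers, with accuracy rising from 71% to 81% even though precision, recall, and F1 moved by only 0.01 to 0.03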
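To compare the metrics on a common scale, it can help to look at relative gains between the first and last columns. The snippet below simply recomputes them from the table values; `relative_gain` is a hypothetical helper name.

```python
# (Basic RAG, Summary Indexing, + Re-Ranking), copied from the table above
results = {
    "Avg Precision": (0.43, 0.44, 0.44),
    "Avg Recall": (0.66, 0.68, 0.69),
    "Avg F1 Score": (0.52, 0.53, 0.54),
    "Avg MRR": (0.74, 0.82, 0.87),
    "End-to-End Accuracy": (0.71, 0.76, 0.81),
}


def relative_gain(before: float, after: float) -> float:
    """Relative change from before to after, e.g. 0.10 means +10%."""
    return (after - before) / before


for metric, (basic, _summary, reranked) in results.items():
    gain = relative_gain(basic, reranked)
    print(f"{metric}: {basic:.2f} -> {reranked:.2f} ({gain:+.1%})")
```

MRR and end-to-end accuracy show the largest relative gains, while precision, recall, and F1 change by only a few percent.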
Production Considerations
When moving to production, consider:
- Vector Database: Use Pinecone, Weaviate, or Qdrant for persistence and scaling
- Chunking Strategy: Experiment with different sizes (256-1024 tokens) and overlap
- Caching: Cache embeddings and common queries to reduce API costs
- Monitoring: Track retrieval metrics in production to catch degradation
- Rate Limits: Be aware of API rate limits, especially during evaluation
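As one illustration of the caching point, here is a minimal in-memory sketch that avoids re-embedding text it has already seen. `EmbeddingCache` and `embed_fn` are hypothetical names; `embed_fn` is assumed to take a list of strings and return one vector per string, and a production setup would likely persist the store.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Wrap an embedding function so repeated texts are only embedded once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self.embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so the key size stays constant for long chunks
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect unseen texts once each, preserving their first-seen order
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._store))
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]


# Fake embedder that records its calls, so the saving is visible offline
calls = []


def fake_embed(texts: List[str]) -> List[List[float]]:
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed)
cache.embed(["a", "bb", "a"])
cache.embed(["bb", "ccc"])
print(calls)  # [['a', 'bb'], ['ccc']]
```

Only three texts reach the embedder across the two calls, even though five were requested.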
Key Takeaways
- Measure separately, optimize together: Evaluate retrieval and generation independently to identify bottlenecks
- MRR matters most: For RAG, getting the right chunk to the top is more important than perfect recall
- Summary indexing adds context: A summary tells the retriever what each chunk is about, which raised MRR from 0.74 to 0.82 in this evaluation
- Re-ranking is worth the cost: Using Claude to re-rank even a small set of candidates raised end-to-end accuracy from 76% to 81% in this evaluation
- Start simple, iterate: A basic RAG system with good evaluation beats a complex system with no metrics