Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn how to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques for production-grade performance.
This guide teaches you to build a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement vector search, create evaluation suites, and apply techniques like summary indexing and re-ranking, which together improved end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities into your specific business domain. While Claude excels at general knowledge tasks, it needs RAG to answer questions about your internal documentation, customer support history, or proprietary data.
In this comprehensive guide, we'll walk through building a production-grade RAG system using Claude, Voyage AI embeddings, and systematic evaluation. We'll start with a basic implementation and progressively optimize it using advanced techniques that improved end-to-end accuracy from 71% to 81% in production testing.
Understanding the RAG Architecture
Before diving into code, let's understand what makes RAG tick. A RAG system has three core components:
- Ingestion Pipeline: Chunks your documents, generates embeddings, and stores them in a vector database
- Retrieval System: Finds the most relevant document chunks for a given query
- Generation System: Feeds retrieved context to Claude to produce accurate answers
Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
You'll need API keys from both Anthropic and Voyage AI. Set them as environment variables:
import os
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"
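Hardcoding keys in a script makes them easy to leak into version control, so in practice you would export them in your shell and only verify them in Python. A minimal sketch, assuming both clients read the variable names shown above; `missing_api_keys` is a hypothetical helper, not part of either SDK:

```python
import os

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "VOYAGE_API_KEY")

def missing_api_keys(env=None):
    """Return the names of required keys that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling `missing_api_keys()` before creating the clients gives a clear message about which key is unset instead of an authentication error on the first request.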
Level 1: Building a Basic RAG Pipeline
Let's start with what the industry calls "Naive RAG" – a straightforward implementation that gets the job done but has room for improvement.
Step 1: Document Chunking
We'll chunk documents by headings, keeping content from each subheading together:
def chunk_document_by_headings(text):
    """Split document into chunks based on markdown headings (## and deeper)."""
    chunks = []
    current_heading = None
    current_content = []

    for line in text.split('\n'):
        # startswith('##') also matches '###' and deeper headings
        if line.startswith('##'):
            if current_heading:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content)
                })
            # Text that appears before the first heading is discarded here
            current_heading = line
            current_content = []
        else:
            current_content.append(line)

    # Don't forget the last chunk
    if current_heading:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content)
        })
    return chunks
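The evaluation code later in this guide looks up `chunk['id']`, but the chunker only produces `heading` and `content`. One way to close that gap is to derive an ID from each heading. This is a sketch under an assumed ID scheme (lowercase words joined by underscores); `slugify_heading` and `assign_chunk_ids` are hypothetical helpers introduced here for illustration:

```python
import re

def slugify_heading(heading):
    """Turn '## Rate Limiting: Intro' into 'rate_limiting_intro'."""
    text = heading.lstrip('#').strip().lower()
    return re.sub(r'[^a-z0-9]+', '_', text).strip('_')

def assign_chunk_ids(chunks):
    """Add an 'id' key to each chunk, de-duplicating repeated headings."""
    seen = {}
    for chunk in chunks:
        base = slugify_heading(chunk['heading']) or 'section'
        seen[base] = seen.get(base, 0) + 1
        # Second and later occurrences get a numeric suffix: setup, setup_2, ...
        chunk['id'] = base if seen[base] == 1 else f"{base}_{seen[base]}"
    return chunks
```

Whatever scheme you choose, the IDs in your evaluation dataset have to be written against it, so settle on it before labeling ground-truth chunks.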
Step 2: Generate Embeddings
Using Voyage AI's embedding model:
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def generate_embeddings(chunks):
    """Generate embeddings for each chunk"""
    texts = [chunk['content'] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    for i, chunk in enumerate(chunks):
        chunk['embedding'] = embeddings[i]
    return chunks
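Sending every chunk in a single request can exceed the embedding API's per-request limits on larger corpora. The sketch below batches the input; it takes the embedding call as a parameter so it can be tried without network access. `embed_in_batches` is a hypothetical helper, and the default `batch_size=128` is an assumption to check against Voyage AI's current limits:

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches, preserving input order.

    embed_fn takes a list of strings and returns one vector per string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings
```

With the client above, `embed_fn` could be `lambda batch: vo.embed(batch, model="voyage-2").embeddings`.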
Step 3: In-Memory Vector Database
For this guide, we'll use a simple in-memory store. In production, consider Pinecone, Weaviate, or Chroma:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class InMemoryVectorDB:
    def __init__(self):
        self.chunks = []

    def add_chunks(self, chunks):
        self.chunks.extend(chunks)

    def search(self, query_embedding, top_k=3):
        """Find top_k most similar chunks using cosine similarity"""
        similarities = []
        for chunk in self.chunks:
            similarity = cosine_similarity(query_embedding, chunk['embedding'])
            similarities.append(similarity)

        # Indices of the top_k results, highest similarity first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]
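The per-chunk loop is fine for a few hundred chunks. For larger collections, the same cosine search can be expressed as one matrix product. This is a sketch of that alternative, not the implementation used above; `top_k_cosine` is a hypothetical name, and it assumes no embedding is an all-zero vector:

```python
import numpy as np

def top_k_cosine(query_embedding, chunk_embeddings, top_k=3):
    """Return (indices, scores) of the top_k rows most similar to the query."""
    matrix = np.asarray(chunk_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Normalise rows and query so the dot product equals cosine similarity
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = matrix_norm @ query_norm
    order = np.argsort(scores)[::-1][:top_k]
    return order.tolist(), scores[order].tolist()
```

The returned indices can be used to look up the matching entries in the chunk list, and the scores are handy for logging or for applying a minimum-similarity cutoff.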
Step 4: Query Pipeline
Now let's tie it all together with Claude:
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rag_query(query, vector_db, top_k=3):
    # 1. Embed the query with the same model used for the chunks
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # 2. Retrieve relevant chunks
    relevant_chunks = vector_db.search(query_embedding, top_k=top_k)

    # 3. Build context from chunks
    context = "\n\n---\n\n".join([
        f"From section '{chunk['heading']}':\n{chunk['content']}"
        for chunk in relevant_chunks
    ])

    # 4. Generate answer with Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Based on the following documentation, answer the question.\n\nDocumentation:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
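Keeping prompt construction in its own function makes it easy to inspect and test without calling the API. The sketch below is one possible variant of step 3, wrapping each chunk in XML-style tags and telling the model what to do when the documentation lacks the answer. The tag names and the instruction wording are assumptions, and `build_rag_prompt` is a hypothetical helper:

```python
def build_rag_prompt(query, chunks):
    """Assemble a prompt that wraps each retrieved chunk in XML-style tags."""
    documents = "\n".join(
        f'<document index="{i}">\n'
        f"<source>{chunk['heading']}</source>\n"
        f"<content>\n{chunk['content']}\n</content>\n"
        f"</document>"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the documentation below. "
        "If the documentation does not contain the answer, say so.\n\n"
        f"<documents>\n{documents}\n</documents>\n\n"
        f"Question: {query}"
    )
```

The returned string can be passed as the `content` of the user message in `rag_query`. Whether this format beats the plain one is something to confirm with your evaluation set rather than assume.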
Building a Robust Evaluation System
This is where most RAG tutorials stop – but it's where the real work begins. You can't improve what you can't measure.
Creating an Evaluation Dataset
We need three things for each test case:
- A question
- The correct chunks (ground truth for retrieval)
- A correct answer (ground truth for generation)
evaluation_data = [
    {
        "question": "How do I set up rate limiting in Claude?",
        "relevant_chunks": ["rate_limiting_intro", "rate_limit_config"],
        "correct_answer": "You can set up rate limiting by configuring..."
    },
    # ... 97 more samples
]
Key Metrics Explained
Retrieval Metrics (measure your search quality):
- Precision: Of all chunks retrieved, how many were relevant?
  - Precision = True Positives / Total Retrieved
  - High precision = fewer irrelevant results
- Recall: Of all relevant chunks, how many did we retrieve?
  - Recall = True Positives / Total Relevant
  - High recall = we're not missing important information
- F1 Score: Harmonic mean of precision and recall
  - F1 = 2 × (Precision × Recall) / (Precision + Recall)
- Mean Reciprocal Rank (MRR): How early does the first relevant result appear? For each query, take 1 / (rank of the first relevant chunk), or 0 if none is retrieved, then average over all queries.
End-to-End Metric (measures answer quality):
- Accuracy: Does Claude produce the correct answer given the retrieved context?
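To make the retrieval definitions concrete, here is a worked example for a single query. It is a sketch: `retrieval_metrics` is a hypothetical helper, and the chunk IDs other than the two from the sample dataset entry are made up for illustration:

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Compute precision, recall, F1 and reciprocal rank for a single query."""
    relevant = set(relevant_ids)
    hits = len(set(retrieved_ids) & relevant)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    # Reciprocal rank: 1 / position of the first relevant result (0 if none)
    reciprocal_rank = 0.0
    for position, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            reciprocal_rank = 1 / position
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": reciprocal_rank}

# Three chunks retrieved; two relevant chunks exist; one is found, at rank 2
example = retrieval_metrics(
    retrieved_ids=["pricing_overview", "rate_limit_config", "auth_setup"],
    relevant_ids=["rate_limiting_intro", "rate_limit_config"],
)
```

Here precision is 1/3 (one of three retrieved chunks is relevant), recall is 1/2 (one of two relevant chunks was found), F1 is 0.4, and the reciprocal rank is 1/2 because the first relevant chunk sits in second position.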
Implementing the Evaluation
def evaluate_retrieval(vector_db, eval_data):
    """Average retrieval metrics over the evaluation set.

    Assumes every stored chunk has an 'id' key that matches the IDs
    listed under 'relevant_chunks' in eval_data.
    """
    metrics = {
        'precision': [],
        'recall': [],
        'f1': [],
        'mrr': []
    }
    for item in eval_data:
        # Embed with the same model that was used to index the chunks
        query_embedding = vo.embed([item['question']], model="voyage-2").embeddings[0]
        retrieved = vector_db.search(query_embedding, top_k=3)
        retrieved_ids = [chunk['id'] for chunk in retrieved]
        relevant_ids = item['relevant_chunks']

        # Calculate metrics
        true_positives = len(set(retrieved_ids) & set(relevant_ids))
        precision = true_positives / len(retrieved_ids)
        recall = true_positives / len(relevant_ids)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of first relevant result
        mrr = 0
        for i, chunk_id in enumerate(retrieved_ids):
            if chunk_id in relevant_ids:
                mrr = 1 / (i + 1)
                break

        metrics['precision'].append(precision)
        metrics['recall'].append(recall)
        metrics['f1'].append(f1)
        metrics['mrr'].append(mrr)

    return {k: np.mean(v) for k, v in metrics.items()}
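The function above covers retrieval only. End-to-end accuracy additionally needs a way to generate an answer and a way to decide whether it matches the reference. The sketch below leaves both as parameters, since the guide does not prescribe a grading method; `evaluate_end_to_end`, `answer_fn`, and `judge_fn` are hypothetical names:

```python
def evaluate_end_to_end(eval_data, answer_fn, judge_fn):
    """Return the fraction of questions whose generated answer is judged correct.

    answer_fn(question) -> generated answer string
    judge_fn(question, generated, expected) -> True if the answer is acceptable
    """
    if not eval_data:
        return 0.0
    correct = 0
    for item in eval_data:
        generated = answer_fn(item["question"])
        if judge_fn(item["question"], generated, item["correct_answer"]):
            correct += 1
    return correct / len(eval_data)
```

In this setup `answer_fn` could wrap `rag_query` with a fixed vector database, and `judge_fn` could be a manual review step or a separate model call that compares the two answers. Reporting this number next to the retrieval metrics shows whether a failure came from missing context or from the generation step.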
Level 2: Summary Indexing
Basic RAG struggles when answers span multiple chunks. Summary indexing solves this by creating higher-level summaries that capture cross-cutting concepts.
def create_summary_index(chunks):
    """Create summary embeddings for groups of related chunks"""
    summaries = []

    # Group chunks by topic (simplified - use clustering in production)
    for i in range(0, len(chunks), 3):
        group = chunks[i:i+3]
        combined = " ".join([c['content'] for c in group])

        # Generate summary using Claude
        summary = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this documentation section:\n\n{combined}"
            }]
        ).content[0].text

        # Embed the summary with the same model used for queries
        summary_embedding = vo.embed([summary], model="voyage-2").embeddings[0]

        summaries.append({
            'summary': summary,
            'embedding': summary_embedding,
            'source_chunks': group
        })
    return summaries
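The index above still has to be queried. One plausible way, sketched here as an assumption rather than the guide's own method, is to rank the summaries against the query embedding and then hand the underlying chunks of the best matches to the generation step. `search_summary_index` is a hypothetical helper that expects entries shaped like those returned by `create_summary_index`:

```python
import numpy as np

def search_summary_index(query_embedding, summaries, top_k=1):
    """Rank summaries by cosine similarity and return their source chunks."""
    query = np.asarray(query_embedding, dtype=float)
    scored = []
    for entry in summaries:
        vector = np.asarray(entry['embedding'], dtype=float)
        score = float(np.dot(query, vector) / (np.linalg.norm(query) * np.linalg.norm(vector)))
        scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Expand the best-matching summaries back into their underlying chunks
    chunks = []
    for _, entry in scored[:top_k]:
        chunks.extend(entry['source_chunks'])
    return chunks
```

Because each summary covers a group of chunks, a single match brings in neighbouring content that a chunk-level search might have ranked too low to retrieve.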
Level 3: Re-Ranking with Claude
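Re-ranking targets MRR by having Claude score each retrieved chunk for relevance before answer generation, so the best matches move to the top of the context. It pays off most when the initial search returns more candidates than the top_k you keep: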
def rerank_with_claude(query, chunks, top_k=3):
    """Use Claude to re-rank retrieved chunks by relevance"""
    chunk_scores = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the question?\n\nQuestion: {query}\n\nChunk: {chunk['content'][:500]}\n\nAnswer with just a number."
            }]
        )
        try:
            score = int(response.content[0].text.strip())
        except ValueError:
            score = 5  # Default score if parsing fails
        chunk_scores.append((score, chunk))

    # Sort by score descending and return top_k
    chunk_scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in chunk_scores[:top_k]]
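The `int(...)` call above falls back to the default whenever the model adds anything around the number, such as "8/10" or "Score: 7". A slightly more forgiving parser is sketched below; `parse_relevance_score` is a hypothetical helper, and taking the first integer in the reply is an assumption about how the model phrases its answer:

```python
import re

def parse_relevance_score(text, default=5, low=0, high=10):
    """Extract the first integer from a model reply and clamp it to [low, high]."""
    match = re.search(r'\d+', text)
    if match is None:
        return default
    return max(low, min(high, int(match.group())))
```

Swapping this in for the `try`/`except` block keeps more of the model's judgments and means the neutral default is only used when the reply contains no number at all.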
Performance Results
After adding summary indexing and re-ranking on top of the basic pipeline, here are the improvements we observed:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
When moving to production, consider these additional optimizations:
- Hybrid Search: Combine semantic search with keyword matching for better recall
- Caching: Cache frequent queries and their results
- Monitoring: Track retrieval metrics in production to catch degradation
- A/B Testing: Test different chunking strategies and embedding models
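For the hybrid search item, one common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). This technique is not part of the pipeline built above; it is shown as a sketch of how the two result lists could be merged. `reciprocal_rank_fusion` is a hypothetical helper, and `k=60` is the conventional default for the smoothing constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into a single ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties alphabetically for determinism
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

RRF only needs rank positions, so it avoids having to put cosine similarities and keyword scores on a common scale. Chunks that appear in both lists rise above chunks that appear in only one.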
Key Takeaways
- Start simple, measure everything: Build a basic RAG pipeline first, then establish rigorous evaluation metrics before optimizing
- Separate retrieval from generation metrics: You need to know whether failures come from missing context or poor reasoning
- Summary indexing bridges the gap: When answers span multiple chunks, summary-level retrieval captures the big picture
- Re-ranking with Claude boosts MRR significantly: Having Claude score relevance before answering puts the best chunks first and improves accuracy, at the cost of one extra model call per candidate chunk
- Production RAG requires continuous evaluation: Metrics like precision and recall should be monitored in production to catch data drift and degradation