# Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide teaches you to build a production-ready RAG system with Claude, covering basic setup with Voyage AI embeddings, a comprehensive evaluation framework with 5 key metrics, and advanced optimization techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is a cornerstone of many enterprise AI applications. While Claude excels at general knowledge tasks, it needs RAG to answer questions specific to your business context, whether that's internal documentation, customer support knowledge bases, or proprietary research.
In this guide, we'll walk through building a RAG system using Claude and Voyage AI embeddings, then systematically improve it using advanced techniques. We'll use the Claude documentation as our knowledge base, but the principles apply to any domain.
## Why RAG Matters for Claude Users
Claude's training data has a cutoff date, and it doesn't know your internal documents. RAG bridges this gap by:
- Grounding responses in your verified content
- Reducing hallucinations by providing relevant context
- Enabling domain-specific Q&A without fine-tuning
- Maintaining data freshness as your knowledge base evolves
## Setting Up Your RAG Environment
### Required Libraries
```bash
# Core dependencies
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
### API Key Configuration
```python
import os
from anthropic import Anthropic
import voyageai

# Set your API keys
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"

# Initialize clients (both read their key from the environment)
anthropic_client = Anthropic()
vo_client = voyageai.Client()
```
### Building a Vector Database Class
For this guide, we'll use an in-memory vector store. In production, consider a dedicated vector database such as Pinecone, Weaviate, or Chroma.
```python
import numpy as np
from typing import Dict, List, Optional, Tuple


class InMemoryVectorDB:
    """Minimal in-memory vector store with cosine-similarity search."""

    def __init__(self):
        self.documents = []
        self.embeddings = []
        self.metadata = []

    def add_documents(self, texts: List[str], embeddings: List[List[float]],
                      metadata: Optional[List[Dict]] = None):
        self.documents.extend(texts)
        self.embeddings.extend(embeddings)
        if metadata:
            self.metadata.extend(metadata)
        else:
            # One fresh dict per document; [{}] * n would share a single dict
            self.metadata.extend({} for _ in texts)

    def search(self, query_embedding: List[float], k: int = 5) -> List[Tuple[str, float, Dict]]:
        if not self.embeddings:
            return []
        # Cosine similarity: normalize both sides, then take dot products
        query_norm = np.array(query_embedding) / np.linalg.norm(query_embedding)
        doc_matrix = np.array(self.embeddings)
        doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
        similarities = np.dot(doc_norms, query_norm)
        # Indices of the k highest similarities, best first
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [
            (self.documents[idx], float(similarities[idx]), self.metadata[idx])
            for idx in top_indices
        ]
```
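To make the ranking step concrete, here is a minimal sketch of the same cosine-similarity search in plain Python on three toy 2-D vectors. The vectors and the helper name `cosine_top_k` are illustrative assumptions, not part of the pipeline in this guide.

```python
import math
from typing import List, Tuple


def cosine_top_k(query: List[float], docs: List[List[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k most similar vectors."""
    def normalize(v: List[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = normalize(query)
    scores = []
    for idx, doc in enumerate(docs):
        d = normalize(doc)
        scores.append((idx, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy vectors (hypothetical): doc 0 points the same way as the query,
# doc 1 is orthogonal to it, and doc 2 sits in between.
toy_docs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
top = cosine_top_k([1.0, 0.0], toy_docs, k=2)
print(top)
```

Doc 0 ranks first even though it is twice as long as the query vector, because normalization removes magnitude and leaves only direction.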
## Level 1: Basic RAG Pipeline
This is the "naive RAG" approach that many tutorials start with. It works, but has significant limitations.
### Step 1: Chunk Your Documents
```python
def chunk_documents(documents: List[Dict]) -> List[Dict]:
    """Split documents by headings for meaningful chunks."""
    chunks = []
    for doc in documents:
        # Split by markdown headings
        sections = doc['content'].split('\n## ')
        for section in sections:
            if section.strip():
                chunks.append({
                    'text': section.strip(),
                    'source': doc['source'],
                    'heading': section.split('\n')[0] if '\n' in section else ''
                })
    return chunks
```
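A quick way to see what heading-based splitting produces is to run it on a tiny document. The sketch below uses a simplified variant of the splitter (the helper name `split_by_h2` and the sample text are assumptions made for illustration).

```python
from typing import Dict, List


def split_by_h2(content: str) -> List[Dict[str, str]]:
    """Split markdown on level-2 headings, keeping the first line as the heading."""
    chunks = []
    for section in content.split("\n## "):
        section = section.strip()
        if section:
            chunks.append({"heading": section.split("\n", 1)[0], "text": section})
    return chunks


# Hypothetical document used only for illustration
sample = "# Guide\nIntro text.\n## Setup\nInstall things.\n## Usage\nRun it."
demo_chunks = split_by_h2(sample)
for chunk in demo_chunks:
    print(repr(chunk["heading"]), "->", repr(chunk["text"]))
```

Two details are worth noticing: the first chunk keeps its `# ` title line, while later chunks lose the `## ` prefix because `split` consumes the delimiter. Deeper headings such as `###` do not match the delimiter, so subsections stay inside their parent chunk.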
### Step 2: Embed and Index
```python
def embed_and_index(chunks: List[Dict], vector_db: InMemoryVectorDB):
    """Embed chunks and add to vector database."""
    texts = [chunk['text'] for chunk in chunks]
    # Generate embeddings using Voyage AI
    response = vo_client.embed(
        texts,
        model="voyage-2",
        input_type="document"
    )
    vector_db.add_documents(
        texts=texts,
        embeddings=response.embeddings,
        metadata=[{'source': c['source'], 'heading': c['heading']} for c in chunks]
    )
```
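Embedding APIs generally cap how many texts (and tokens) a single request may carry, so a large corpus usually has to be sent in batches. The following is a minimal sketch of that idea; `embed_in_batches`, the `embed_fn` parameter, and the batch size of 64 are assumptions chosen for illustration, so check your provider's documented limits before picking a value.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Call embed_fn on successive slices of texts and concatenate the results."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder for demonstration: one number per text (its length)
call_count = 0

def fake_embed(batch: List[str]) -> List[List[float]]:
    global call_count
    call_count += 1
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors, call_count)
```

With the real client, `embed_fn` could wrap the call used in this guide, for example `lambda batch: vo_client.embed(batch, model="voyage-2", input_type="document").embeddings`.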
### Step 3: Retrieve and Generate
```python
def rag_query(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    """Full RAG pipeline: retrieve context, then generate answer."""
    # Embed the query
    query_embedding = vo_client.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    # Retrieve relevant chunks
    results = vector_db.search(query_embedding, k=k)
    context = "\n\n---\n\n".join([doc for doc, _, _ in results])
    # Generate answer with Claude
    response = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain enough information, say so.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text
```
## Building a Robust Evaluation System
Most RAG tutorials skip evaluation, but it's critical for production systems. We'll measure two things independently:
- Retrieval Performance: How well does our system find relevant documents?
- End-to-End Performance: How well does Claude answer questions given the retrieved context?
### Creating an Evaluation Dataset
We synthetically generated 100 test samples, each containing:
- A question
- Ground truth relevant chunks
- A correct answer
```python
import json

# Load evaluation dataset
with open('evaluation/docs_evaluation_dataset.json', 'r') as f:
    eval_data = json.load(f)

# Preview
print(f"Total samples: {len(eval_data)}")
print(f"Sample question: {eval_data[0]['question']}")
print(f"Relevant chunks: {len(eval_data[0]['relevant_chunks'])}")
```
### Key Metrics Explained
#### Precision
What it measures: Of all chunks retrieved, how many were actually relevant?
```python
def calculate_precision(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    if len(retrieved_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)
```
#### Recall
What it measures: Of all relevant chunks, how many did we retrieve?
```python
def calculate_recall(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    if len(relevant_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
#### F1 Score
What it measures: Harmonic mean of precision and recall.
```python
def calculate_f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
```
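A small worked example may help tie precision, recall, and F1 together. The chunk identifiers below are made up for illustration; the arithmetic follows the definitions given for the three metrics.

```python
retrieved = ["chunk_a", "chunk_b", "chunk_c"]   # what the retriever returned (k = 3)
relevant = ["chunk_a", "chunk_d"]               # ground-truth chunks for the question

hits = set(retrieved) & set(relevant)
precision = len(hits) / len(set(retrieved))     # 1 of 3 retrieved chunks is relevant
recall = len(hits) / len(set(relevant))         # 1 of 2 relevant chunks was retrieved
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.33 recall=0.50 f1=0.40
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than an arithmetic mean would.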
#### Mean Reciprocal Rank (MRR)
What it measures: How early in the results does the first relevant chunk appear?
```python
def calculate_mrr(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    relevant_set = set(relevant_chunks)
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_set:
            return 1.0 / (i + 1)
    return 0.0
```
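The function above scores a single query; the "mean" in MRR comes from averaging those per-query scores over the whole evaluation set. Here is a minimal sketch with three hypothetical queries (the helper name `reciprocal_rank` and the data are assumptions for illustration).

```python
from typing import List


def reciprocal_rank(retrieved: List[str], relevant: List[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for position, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            return 1.0 / position
    return 0.0


# Three hypothetical queries: first hit at rank 1, at rank 2, and no hit at all
runs = [
    (["a", "b", "c"], ["a"]),
    (["x", "b", "c"], ["b"]),
    (["x", "y", "z"], ["q"]),
]
scores = [reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs]
mrr = sum(scores) / len(scores)
print(scores, mrr)  # prints: [1.0, 0.5, 0.0] 0.5
```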
#### End-to-End Accuracy
What it measures: Does Claude's final answer match the ground truth?
```python
def calculate_e2e_accuracy(generated_answer: str, correct_answer: str) -> bool:
    # Use Claude to judge if answers are semantically equivalent
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        system="You are an answer evaluator. Respond with only 'YES' or 'NO'.",
        messages=[{
            "role": "user",
            "content": f"Does this answer:\n'{generated_answer}'\n\nCorrectly answer the question? Correct answer is:\n'{correct_answer}'"
        }]
    )
    # startswith tolerates trailing punctuation such as "YES."
    return response.content[0].text.strip().upper().startswith("YES")
```
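To report retrieval metrics for the whole dataset, the per-question values need to be averaged. The sketch below shows one way to do that; `evaluate_retrieval` and `retrieve_fn` are hypothetical names, and it assumes that whatever `retrieve_fn` returns can be compared directly with the entries in each sample's `relevant_chunks`.

```python
from statistics import mean
from typing import Callable, Dict, List


def evaluate_retrieval(
    eval_data: List[Dict],
    retrieve_fn: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Average precision, recall, F1, and reciprocal rank over an evaluation set."""
    precisions, recalls, f1s, rrs = [], [], [], []
    for item in eval_data:
        retrieved = retrieve_fn(item["question"])
        relevant = set(item["relevant_chunks"])
        hits = set(retrieved) & relevant
        p = len(hits) / len(set(retrieved)) if retrieved else 0.0
        r = len(hits) / len(relevant) if relevant else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        # Reciprocal rank of the first relevant chunk (0.0 when there is none)
        rrs.append(next((1.0 / (i + 1) for i, c in enumerate(retrieved) if c in relevant), 0.0))
    return {
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
        "mrr": mean(rrs),
    }


# Tiny stand-in dataset and retriever, for illustration only
toy_eval = [
    {"question": "q1", "relevant_chunks": ["a", "b"]},
    {"question": "q2", "relevant_chunks": ["c"]},
]
toy_index = {"q1": ["a", "x"], "q2": ["y", "c"]}
report = evaluate_retrieval(toy_eval, lambda q: toy_index[q])
print(report)
```

Keeping the retriever behind a plain callable makes it easy to run the same loop against the basic pipeline and each later optimization, so the numbers stay comparable.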
## Level 2: Summary Indexing
Basic RAG fails when a single chunk doesn't contain enough context. Summary indexing creates higher-level summaries that capture the "big picture."
```python
def create_summary_index(chunks: List[Dict], group_size: int = 5) -> List[Dict]:
    """Group chunks and create summaries for each group."""
    summary_index = []
    for i in range(0, len(chunks), group_size):
        group = chunks[i:i+group_size]
        combined_text = "\n\n".join([c['text'] for c in group])
        # Generate summary using Claude
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=512,
            system="Summarize the following text, preserving key information and relationships between concepts.",
            messages=[{"role": "user", "content": combined_text}]
        )
        summary_index.append({
            'summary': response.content[0].text,
            'original_chunks': group,
            'chunk_indices': list(range(i, min(i+group_size, len(chunks))))
        })
    return summary_index
```
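For the summaries to be searchable, they need their own embeddings plus a pointer back to the group they describe. The following sketch shows one possible way to prepare those records; the function name `prepare_summary_records`, the `embed_fn` parameter, and the `summary_idx` metadata key are assumptions introduced here, not names defined by this guide.

```python
from typing import Callable, Dict, List, Tuple


def prepare_summary_records(
    summary_index: List[Dict],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> Tuple[List[str], List[List[float]], List[Dict]]:
    """Build texts, embeddings, and metadata for a summary-level vector store."""
    texts = [entry["summary"] for entry in summary_index]
    embeddings = embed_fn(texts)
    # summary_idx (a hypothetical key) points back into summary_index so a
    # search hit can be expanded into its original chunks.
    metadata = [{"summary_idx": i} for i in range(len(summary_index))]
    return texts, embeddings, metadata


# Stand-in data and embedder, for illustration only
toy_summaries = [
    {"summary": "Covers setup.", "original_chunks": [{"text": "install"}, {"text": "configure"}]},
    {"summary": "Covers usage.", "original_chunks": [{"text": "run"}]},
]
texts, embeddings, metadata = prepare_summary_records(
    toy_summaries, lambda batch: [[float(len(t))] for t in batch]
)
hit = metadata[1]  # pretend the second summary was the best match
expanded = [c["text"] for c in toy_summaries[hit["summary_idx"]]["original_chunks"]]
print(texts, metadata, expanded)
```

The returned triple matches the arguments of `InMemoryVectorDB.add_documents`, so it could be loaded into a second store dedicated to summaries.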
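### Hybrid Retrieval Strategy
```python
def hybrid_retrieve(query: str, vector_db: InMemoryVectorDB, summary_vector_db: InMemoryVectorDB,
                    summary_index: List[Dict], k: int = 3) -> List[str]:
    """Retrieve from both chunk-level and summary-level indices.

    Assumes summary_vector_db holds the embedded summaries and that each was
    added with metadata {'summary_idx': i} pointing into summary_index.
    """
    # Embed the query once and reuse it for both searches
    query_embedding = vo_client.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    # Get chunk-level results
    chunk_results = vector_db.search(query_embedding, k=k)
    # Get summary-level results
    summary_results = summary_vector_db.search(query_embedding, k=2)
    # Combine and deduplicate, chunk-level hits first
    all_chunks = []
    seen = set()
    for doc, _, _ in chunk_results:
        if doc not in seen:
            all_chunks.append(doc)
            seen.add(doc)
    # Expand each matching summary back into its original chunks
    for _, _, meta in summary_results:
        for chunk in summary_index[meta['summary_idx']]['original_chunks']:
            if chunk['text'] not in seen:
                all_chunks.append(chunk['text'])
                seen.add(chunk['text'])
    # Return every candidate: truncating to k here would discard all
    # summary-derived chunks whenever the chunk-level search already returns k
    return all_chunks
```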
## Level 3: Re-Ranking with Claude
Re-ranking uses Claude to evaluate the relevance of retrieved chunks before generating the final answer.
```python
def rerank_with_claude(query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare chunks for evaluation
    chunk_text = "\n\n---\n\n".join([
        f"CHUNK {i+1}:\n{chunk}"
        for i, chunk in enumerate(candidates)
    ])
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=256,
        system="You are a relevance evaluator. Rank the following chunks by their relevance to the query. Return the chunk numbers in order of relevance, separated by commas.",
        messages=[{
            "role": "user",
            "content": f"Query: {query}\n\n{chunk_text}\n\nReturn the chunk numbers ranked by relevance (most relevant first):"
        }]
    )
    # Parse the ranked chunk numbers, skipping duplicates and out-of-range values
    ranked_indices = []
    for x in response.content[0].text.split(','):
        x = x.strip()
        if x.isdigit() and 0 <= int(x) - 1 < len(candidates) and int(x) - 1 not in ranked_indices:
            ranked_indices.append(int(x) - 1)
    # Fall back to the original retrieval order if nothing could be parsed
    if not ranked_indices:
        return candidates[:top_k]
    return [candidates[i] for i in ranked_indices[:top_k]]
```
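Model replies do not always come back as a clean comma-separated list, so a more tolerant parser can be useful. The sketch below is one possible approach using a regular expression; `parse_ranking` is a hypothetical helper, and the sample replies are invented to show the kinds of variation it handles.

```python
import re
from typing import List


def parse_ranking(reply: str, num_candidates: int) -> List[int]:
    """Extract zero-based chunk indices from a free-form ranking reply."""
    indices: List[int] = []
    for match in re.findall(r"\d+", reply):
        idx = int(match) - 1  # chunks are numbered from 1 in the prompt
        if 0 <= idx < num_candidates and idx not in indices:
            indices.append(idx)
    # Append anything the model left out so no candidate is silently dropped
    missing = [i for i in range(num_candidates) if i not in indices]
    return indices + missing


clean = parse_ranking("2, 1, 3", 3)
labelled = parse_ranking("CHUNK 3, CHUNK 1", 3)
messy = parse_ranking("Ranking: 2 > 7 > 2 > 1", 3)
print(clean, labelled, messy)  # prints: [1, 0, 2] [2, 0, 1] [1, 0, 2]
```

Appending the unranked candidates at the end is a design choice: it keeps the output length stable, at the cost of treating omitted chunks as least relevant.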
## Results: Before and After
After implementing summary indexing and re-ranking, the largest gains were in MRR (0.74 to 0.87) and end-to-end accuracy (71% to 81%). Precision, recall, and F1 improved only slightly:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
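When reporting these gains, it helps to distinguish absolute from relative change. The short calculation below uses only the MRR and end-to-end accuracy values from the table; the variable names are arbitrary.

```python
basic = {"mrr": 0.74, "e2e_accuracy": 71.0}      # values from the results table
advanced = {"mrr": 0.87, "e2e_accuracy": 81.0}

gains = {}
for name in basic:
    absolute = advanced[name] - basic[name]
    relative = absolute / basic[name] * 100
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
# prints:
# mrr: +0.13 absolute, +17.6% relative
# e2e_accuracy: +10.00 absolute, +14.1% relative
```

So the accuracy improvement is 10 percentage points, which corresponds to roughly a 14% relative improvement over the basic pipeline.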
## Production Considerations
- Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access for large-scale testing.
- Cost Management: Summary indexing and re-ranking add token costs. Balance quality improvements against budget.
- Vector Database: Replace the in-memory store with Pinecone, Weaviate, or Chroma for production.
- Chunking Strategy: Experiment with different chunk sizes and overlap percentages.
- Caching: Cache embeddings and common queries to reduce API calls.
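As an illustration of the caching point, here is a minimal sketch of an in-memory embedding cache keyed by a hash of the text. `EmbeddingCache` and its `embed_fn` parameter are hypothetical names; a production version would likely persist the store and include the model name and input type in the key.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by a hash of the text so repeated inputs skip the API."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Only send texts we have not seen before (deduplicated, order kept)
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._store))
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]


# Stand-in embedder that records which texts it was asked to embed
sent: List[str] = []

def fake_embed(batch: List[str]) -> List[List[float]]:
    sent.extend(batch)
    return [[float(len(t))] for t in batch]


cache = EmbeddingCache(fake_embed)
first = cache.embed(["hello", "world", "hello"])
second = cache.embed(["hello", "new text"])
print(first, second, sent)
```

In this run the stand-in embedder receives only three texts in total, even though five were requested across the two calls.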
## Key Takeaways
- Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Basic RAG often fails at retrieval, not generation.
- Summary indexing bridges the gap between granular chunks and high-level concepts, improving recall for complex questions.
- Re-ranking with Claude aims to put the most relevant context first in your prompt. In our results, MRR rose from 0.74 to 0.87 once summary indexing and re-ranking were both in place.
- End-to-end accuracy improved by 10 percentage points (71% → 81%) through these optimizations, suggesting that better retrieval translates into better answer quality.
- Build your evaluation dataset first before optimizing. Without ground truth data, you're optimizing blindly.