Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide shows you how to build a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.
What You'll Learn
By the end of this guide, you will be able to:
- Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques including summary indexing and re-ranking with Claude
Across the three pipeline levels in this guide, our optimizations improved every evaluation metric, from the basic pipeline to the final one:
- Avg Precision: 0.43 → 0.44
- Avg Recall: 0.66 → 0.69
- Avg F1 Score: 0.52 → 0.54
- Avg Mean Reciprocal Rank (MRR): 0.74 → 0.87
- End-to-End Accuracy: 71% → 81%
Prerequisites and Setup
Before diving in, you'll need:
- API keys from Anthropic and Voyage AI
- Python 3.8+ environment
- Required libraries:
`anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
Initialize a Vector DB Class
In this example, we're using an in-memory vector DB. For production, consider a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        """Retrieve top-k documents by cosine similarity."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        scores = [self._cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[-k:][::-1]
        return [self.documents[i] for i in top_indices]

    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
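The similarity search above can be exercised without any API calls. Here is a minimal sketch of the same top-k cosine retrieval over precomputed toy embeddings (the 3-dimensional vectors are stand-ins for real Voyage AI embeddings, which have far more dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_emb, doc_embs, k=2):
    # Score every document embedding against the query, return top-k indices
    scores = [cosine_similarity(query_emb, emb) for emb in doc_embs]
    return [int(i) for i in np.argsort(scores)[-k:][::-1]]

# Toy 3-dimensional "embeddings"
docs = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.9, 0.1, 0.0])]
query = np.array([1.0, 0.0, 0.1])

print(top_k(query, docs, k=2))  # [0, 2] -- docs 0 and 2 point the same way as the query
```

This is exactly what `search` does internally once the embeddings exist; the only production difference is that the vectors come from `client.embed`.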
Level 1: Basic RAG (Naive RAG)
A basic RAG pipeline includes three steps:
- Chunk documents by heading, so each chunk contains only the content under a single subheading
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
```python
import anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)
        self.llm = anthropic.Anthropic(api_key=anthropic_key)

    def answer_query(self, query: str) -> str:
        # Retrieve relevant chunks
        chunks = self.vector_db.search(query, k=3)
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # Generate answer with Claude
        prompt = f"""Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
Building an Evaluation System
When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. We synthetically generated an evaluation dataset of 100 samples, each containing:
- A question
- Relevant chunks (ground truth)
- A correct answer
Key Metrics Explained
#### Retrieval Metrics
**Precision** measures the proportion of retrieved chunks that are actually relevant. High precision means fewer false positives.

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

**Recall** measures completeness: how many of the correct chunks were retrieved. High recall ensures the LLM has all the information it needs.

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
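As a quick worked example (toy chunk IDs, not from the actual eval set): suppose the retriever returns 3 chunks and the ground-truth set contains 4 relevant chunks, 2 of which were retrieved.

```python
retrieved = ["c1", "c2", "c3"]      # chunks returned by the retriever, in rank order
correct = {"c2", "c3", "c7", "c9"}  # ground-truth relevant chunks

true_positives = len(set(retrieved) & correct)  # c2 and c3 overlap -> 2
precision = true_positives / len(retrieved)     # 2 / 3
recall = true_positives / len(correct)          # 2 / 4

print(round(precision, 2), recall)  # 0.67 0.5
```

Note the tension: retrieving more chunks can only raise recall, but it tends to lower precision, which is why both are tracked.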
**F1 Score** is the harmonic mean of precision and recall. **Mean Reciprocal Rank (MRR)** evaluates how early the first relevant chunk appears in the results. This is crucial because Claude has limited context and may not process all retrieved chunks equally.

#### End-to-End Metric
**End-to-End Accuracy** measures whether Claude's final answer is correct given the retrieved context.

```python
import numpy as np

def evaluate_retrieval(rag_system, eval_dataset):
    """Evaluate retrieval performance."""
    precisions, recalls, f1s, mrrs = [], [], [], []
    for item in eval_dataset:
        query = item["question"]
        correct_chunks = set(item["relevant_chunks"])

        # Retrieve chunks
        retrieved = rag_system.vector_db.search(query, k=3)
        retrieved_ids = set([doc["id"] for doc in retrieved])

        # Calculate metrics
        true_positives = len(retrieved_ids & correct_chunks)
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(correct_chunks) if correct_chunks else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of the first relevant chunk
        mrr = 0
        for rank, doc in enumerate(retrieved, 1):
            if doc["id"] in correct_chunks:
                mrr = 1 / rank
                break

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
```
Level 2: Summary Indexing
Basic RAG struggles when relevant information is spread across multiple chunks. Summary indexing addresses this by creating summary-level embeddings that capture the essence of larger document sections.
```python
def create_summary_index(documents, llm_client):
    """Create summary embeddings for document sections."""
    summary_index = []
    for doc_section in documents:
        # Generate a summary of the section
        prompt = f"Summarize the following content in 2-3 sentences:\n\n{doc_section['content']}"
        response = llm_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}]
        )
        summary = response.content[0].text

        # Embed the summary instead of raw content
        summary_index.append({
            "summary": summary,
            "original_content": doc_section["content"],
            "id": doc_section["id"]
        })
    return summary_index
```
This technique improved recall by capturing broader context, allowing Claude to retrieve more relevant information even when the exact query terms don't appear in the target chunk.
Level 3: Summary Indexing + Re-Ranking
The final level combines summary indexing with re-ranking using Claude. After initial retrieval, Claude re-ranks the chunks based on relevance to the query.
```python
def rerank_with_claude(query, retrieved_chunks, llm_client):
    """Re-rank retrieved chunks using Claude."""
    # Prepare chunks for re-ranking
    chunks_text = ""
    for i, chunk in enumerate(retrieved_chunks):
        chunks_text += f"[{i+1}] {chunk['content'][:500]}...\n\n"

    prompt = f"""Given the query: "{query}"

Rank the following chunks from most relevant (1) to least relevant ({len(retrieved_chunks)}).
Return only the ranked list of numbers, separated by commas.

{chunks_text}
"""
    response = llm_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the comma-separated ranking (assumes well-formed model output)
    ranking = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [retrieved_chunks[i] for i in ranking]
```
Re-ranking dramatically improved MRR from 0.74 to 0.87, ensuring the most relevant context appears first in Claude's context window.
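The comma-separated ranking that Claude returns is the fragile step in this pipeline. One way to harden it (an illustrative helper, not part of the original code) is a parser that drops invalid or duplicate entries and falls back to the original order when the output is malformed:

```python
def parse_ranking(text: str, n: int) -> list:
    """Parse a comma-separated ranking like '3, 1, 2' into zero-based indices.

    Falls back to the original order if the model's output is malformed.
    """
    try:
        ranks = [int(tok.strip()) - 1 for tok in text.split(",")]
    except ValueError:
        return list(range(n))  # non-numeric output: keep original order
    # Keep only valid, unique indices, then append any the model omitted
    seen = []
    for r in ranks:
        if 0 <= r < n and r not in seen:
            seen.append(r)
    seen.extend(i for i in range(n) if i not in seen)
    return seen

print(parse_ranking("3, 1, 2", 3))   # [2, 0, 1]
print(parse_ranking("garbage", 3))   # [0, 1, 2]
```

Dropping in a guard like this keeps a single malformed completion from raising an `IndexError` mid-pipeline.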
Putting It All Together: The Optimized Pipeline
```python
class OptimizedRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)
        self.llm = anthropic.Anthropic(api_key=anthropic_key)
        self.summary_index = None

    def initialize_with_summaries(self, documents):
        self.summary_index = create_summary_index(documents, self.llm)
        # Add summary embeddings to the vector DB
        for item in self.summary_index:
            self.vector_db.add_documents([{
                "id": item["id"],
                "content": item["summary"]
            }])

    def answer_query(self, query: str) -> str:
        # Retrieve using the summary index
        initial_chunks = self.vector_db.search(query, k=5)

        # Map back to original content
        original_chunks = []
        for chunk in initial_chunks:
            for item in self.summary_index:
                if item["id"] == chunk["id"]:
                    original_chunks.append({
                        "content": item["original_content"],
                        "id": item["id"]
                    })
                    break

        # Re-rank with Claude
        reranked = rerank_with_claude(query, original_chunks, self.llm)
        context = "\n\n".join([chunk["content"] for chunk in reranked[:3]])

        # Generate the final answer
        prompt = f"""Based on the following context, answer the user's question.

Context:
{context}

Question: {query}

Answer:"""
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
Performance Comparison
| Metric | Basic RAG | Summary Indexing | Summary + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.68 | 0.69 |
| Avg F1 Score | 0.52 | 0.53 | 0.54 |
| Avg MRR | 0.74 | 0.80 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
Best Practices for Production RAG
- Evaluate retrieval and generation separately – This helps you identify where the bottleneck is
- Use a diverse evaluation dataset – Include questions that require single-chunk, multi-chunk, and edge-case reasoning
- Monitor MRR closely – It directly impacts how well Claude can use the retrieved context
- Consider chunk overlap – Overlapping chunks can improve recall at the cost of more tokens
- Test with different embedding models – Voyage AI, OpenAI, and Cohere all offer strong options
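The chunk-overlap point above can be illustrated with a simple character-based chunker (a toy sketch; production systems typically split on headings or tokens, as this guide does):

```python
def chunk_with_overlap(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(chr(65 + i % 26) for i in range(250))  # 250 chars of dummy text
chunks = chunk_with_overlap(doc, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90]
```

Because each chunk repeats the tail of its predecessor, a sentence that straddles a boundary still appears whole in at least one chunk, which is where the recall gain comes from; the cost is that the repeated characters are embedded and retrieved twice.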
Key Takeaways
- Start simple, then optimize: A basic RAG pipeline works for many use cases. Add complexity (summary indexing, re-ranking) only when metrics show room for improvement.
- Measure what matters: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy. This pinpoints whether the issue is retrieval or generation.
- Re-ranking with Claude significantly improves MRR: From 0.74 to 0.87 in our tests, ensuring the most relevant context appears first in Claude's context window.
- Summary indexing boosts recall: By capturing broader document context, you retrieve more relevant information even when exact query terms are missing.
- Production RAG requires continuous evaluation: Your evaluation dataset should evolve with your use case, and metrics should be tracked over time to catch regressions.