# Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for deploying Claude in enterprise contexts. While Claude excels at general knowledge tasks, it needs access to your specific business data—internal documentation, customer support articles, or proprietary research—to answer domain-specific questions accurately.
In this guide, you'll learn how to build, evaluate, and optimize a RAG system using Claude and Voyage AI embeddings. We'll start with a basic implementation and progressively add advanced techniques that measurably improve performance.
## What You'll Build
By the end of this guide, you'll have:
- A working RAG pipeline using Claude and an in-memory vector database
- A robust evaluation suite that measures retrieval and end-to-end performance independently
- Advanced techniques including summary indexing and re-ranking
- Concrete metrics showing end-to-end accuracy improving from 71% to 81%
## Prerequisites
Before diving in, make sure you have:
- API keys from Anthropic and Voyage AI
- Python 3.8+ installed
- Basic familiarity with Python and the Claude API
## Setting Up Your Environment
First, install the required libraries:

```bash
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
Next, initialize your API clients:

```python
import anthropic
import voyageai

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
```
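In practice, avoid hardcoding keys in source. Both SDKs fall back to environment variables when no key is passed (this assumes `ANTHROPIC_API_KEY` and `VOYAGE_API_KEY` are set in your shell):

```python
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
vo = voyageai.Client()          # reads VOYAGE_API_KEY from the environment
```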
## Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG implementation. This "naive" approach has three steps:
- Chunk documents by heading (each subheading becomes a chunk)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
### Creating a Vector Database
For this example, we'll use an in-memory vector database. In production, consider hosted solutions like Pinecone or Weaviate.
```python
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add_document(self, embedding: List[float], metadata: Dict):
        self.vectors.append(embedding)
        self.metadata.append(metadata)

    def search(self, query_embedding: List[float], top_k: int = 3) -> List[Dict]:
        # Cosine similarity between the query and every stored vector
        similarities = [
            np.dot(query_embedding, vec) / (np.linalg.norm(query_embedding) * np.linalg.norm(vec))
            for vec in self.vectors
        ]
        # Indices of the top_k most similar vectors, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.metadata[i] for i in top_indices]
```
### Chunking and Embedding

```python
import re
from typing import List

def chunk_document(text: str) -> List[str]:
    """Split a document by headings (## or ###)."""
    chunks = re.split(r'(?=^#{2,3}\s)', text, flags=re.MULTILINE)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

def embed_chunks(chunks: List[str]) -> List[List[float]]:
    """Embed chunks using Voyage AI."""
    response = vo.embed(chunks, model="voyage-2")
    return response.embeddings
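```

To finish the Level 1 indexing step, wire these together before querying. A minimal sketch, assuming `docs` is your list of raw markdown documents:

```python
vector_db = InMemoryVectorDB()

for doc in docs:  # docs: list of raw markdown strings (assumed)
    chunks = chunk_document(doc)
    for chunk, embedding in zip(chunks, embed_chunks(chunks)):
        vector_db.add_document(embedding, {"text": chunk})
```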
### Building the RAG Pipeline

```python
def rag_query(query: str, vector_db: InMemoryVectorDB, top_k: int = 3) -> str:
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Retrieve relevant chunks
    retrieved_chunks = vector_db.search(query_embedding, top_k=top_k)
    context = "\n\n".join([chunk["text"] for chunk in retrieved_chunks])

    # Generate answer with Claude
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question:\n\nContext: {context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
## Building an Evaluation System
"Vibes-based" evaluation won't cut it for production systems. You need objective metrics. We'll evaluate two independent components:
- Retrieval performance – How well does the system find relevant chunks?
- End-to-end performance – How accurate are the final answers?
### Creating an Evaluation Dataset
Generate a synthetic dataset with 100 samples, each containing:
- A question
- Relevant chunks (ground truth)
- A correct answer
```python
import json

# Load the pre-generated evaluation dataset
with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first sample
print(json.dumps(eval_data[0], indent=2))
```
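Each sample looks roughly like this (the field names below are illustrative assumptions; substitute whatever your dataset actually uses):

```python
sample = {
    "question": "How do I rotate an API key?",
    "correct_chunks": ["## API keys\nTo rotate a key, ..."],  # ground-truth relevant chunks
    "correct_answer": "Open Settings > API Keys and click Rotate.",
}
```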
### Retrieval Metrics
#### Precision
Precision measures how many retrieved chunks are actually relevant:
Precision = True Positives / Total Retrieved
High precision means fewer irrelevant chunks reach Claude. Note that because we always retrieve at least 3 chunks, precision is capped whenever fewer than 3 of the available chunks are truly relevant.
#### Recall
Recall measures how many relevant chunks we successfully retrieved:
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
#### F1 Score
The harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
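All three are easy to compute per query. A minimal sketch, treating chunks as comparable strings:

```python
from typing import Dict, List

def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    # Harmonic mean; 0 when both precision and recall are 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```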
#### Mean Reciprocal Rank (MRR)
MRR measures how early the first relevant chunk appears in the results:
```python
from typing import List

def calculate_mrr(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    # Reciprocal rank of the first relevant chunk; 0 if none was retrieved
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
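For example, if the first relevant chunk shows up at rank 2, the query contributes a reciprocal rank of 0.5; MRR is the mean of this value across all queries:

```python
calculate_mrr(["chunk_b", "chunk_a", "chunk_c"], relevant_chunks=["chunk_a"])  # -> 0.5
```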
### End-to-End Accuracy
This measures whether Claude's final answer is correct. Use LLM-as-judge or human evaluation:
```python
def evaluate_answer(question: str, answer: str, correct_answer: str) -> bool:
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=5,  # enough room for a one-word verdict
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nCorrect Answer: {correct_answer}\nModel Answer: {answer}\n\nIs the model answer correct? Answer only 'yes' or 'no'."
        }]
    )
    # Normalize the verdict before checking it
    return response.content[0].text.strip().lower().startswith("yes")
```
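Putting the pieces together, here's a sketch of the accuracy loop over the full dataset (it assumes the illustrative field names shown earlier):

```python
def run_evaluation(eval_data, vector_db: InMemoryVectorDB) -> float:
    correct = 0
    for sample in eval_data:
        answer = rag_query(sample["question"], vector_db)
        if evaluate_answer(sample["question"], answer, sample["correct_answer"]):
            correct += 1
    return correct / len(eval_data)

print(f"End-to-end accuracy: {run_evaluation(eval_data, vector_db):.0%}")
```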
## Level 2: Summary Indexing
Basic RAG often misses context that spans multiple chunks. Summary indexing solves this by creating condensed representations of document sections.
### How It Works
- For each document section, generate a summary using Claude
- Index both the original chunk and its summary
- Retrieve using the summary for better semantic matching
```python
def generate_summary(chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize the following text in 2-3 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text

def index_with_summaries(chunks: List[str], vector_db: InMemoryVectorDB):
    for chunk in chunks:
        summary = generate_summary(chunk)
        # Embed the summary together with the original chunk
        combined_text = f"{summary}\n\n{chunk}"
        embedding = vo.embed([combined_text], model="voyage-2").embeddings[0]
        vector_db.add_document(embedding, {"text": chunk, "summary": summary})
```
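A variant worth testing: embed the summary alone for matching, while still returning the full chunk as context. A minimal sketch:

```python
def index_summary_only(chunks: List[str], vector_db: InMemoryVectorDB):
    for chunk in chunks:
        summary = generate_summary(chunk)
        # Match on the summary's embedding, but keep the full chunk as the payload
        embedding = vo.embed([summary], model="voyage-2").embeddings[0]
        vector_db.add_document(embedding, {"text": chunk, "summary": summary})
```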
## Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second scoring pass that dramatically improves MRR: after initial retrieval, use Claude to score and reorder the candidates.
### Implementing Re-Ranking

```python
def rerank_with_claude(query: str, candidates: List[Dict]) -> List[Dict]:
    # Present each candidate with its index so Claude can refer to it
    candidate_text = "\n".join([
        f"[{i}] {c['text'][:200]}" for i, c in enumerate(candidates)
    ])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Query: {query}\n\nCandidates:\n{candidate_text}\n\nRank these candidates by relevance. Return the indices in order of relevance, most relevant first."
        }]
    )
    # Parse the ranked indices, skipping duplicates and out-of-range values
    ranked_indices = []
    for token in response.content[0].text.replace(",", " ").split():
        token = token.strip("[].")
        if token.isdigit() and int(token) < len(candidates) and int(token) not in ranked_indices:
            ranked_indices.append(int(token))
    return [candidates[i] for i in ranked_indices]
```
```python
def advanced_rag_query(query: str, vector_db: InMemoryVectorDB) -> str:
    # Initial retrieval
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    initial_results = vector_db.search(query_embedding, top_k=10)

    # Re-rank
    reranked_results = rerank_with_claude(query, initial_results)
    top_results = reranked_results[:3]

    # Generate answer
    context = "\n\n".join([chunk["text"] for chunk in top_results])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
## Results: Measurable Improvements
After implementing these techniques, here are the performance gains:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
## Production Considerations
- Rate limits: Full evaluations may hit rate limits unless you're at Tier 2 or above. Consider sampling your evaluation dataset.
- Vector database: For production, use a hosted solution like Pinecone, Weaviate, or Chroma.
- Chunking strategy: Experiment with different chunk sizes and overlap strategies; a sliding-window sketch follows this list.
- Embedding model: Voyage AI's `voyage-2` works well, but test with your specific domain.
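As one example of an overlap strategy, here's a minimal sliding-window chunker; the 800-character window and 200-character overlap are illustrative starting points, not tuned values:

```python
from typing import List

def sliding_window_chunks(text: str, window: int = 800, overlap: int = 200) -> List[str]:
    step = window - overlap
    # Each chunk shares `overlap` characters with its predecessor
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```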
## Key Takeaways
- Evaluate retrieval and generation separately – This lets you pinpoint where improvements are needed, whether in finding relevant chunks or in answer quality.
- Summary indexing improves semantic matching – By indexing both summaries and original chunks, you capture context that naive chunking misses.
- Re-ranking dramatically improves MRR – Adding a Claude-powered re-ranking step after initial retrieval ensures the most relevant chunks appear first, boosting MRR from 0.74 to 0.87.
- End-to-end accuracy gains are real – Advanced RAG techniques improved accuracy from 71% to 81% in our tests, a meaningful improvement for production systems.
- Start simple, then iterate – Begin with basic RAG, establish your evaluation metrics, then layer on advanced techniques. Measure each change to confirm improvement.