Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic pipelines, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to set up retrieval, measure performance with precision/recall/F1/MRR, and improve end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your proprietary data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.
In this guide, we'll walk through building a complete RAG system using Claude, Voyage AI embeddings, and an in-memory vector store. We'll start with a basic pipeline, then show you how to measure performance systematically, and finally implement advanced techniques that boost end-to-end accuracy from 71% to 81%.
Why RAG Matters for Claude Users
Claude excels at general knowledge tasks, but it can't know your internal documentation, product manuals, or proprietary research. RAG bridges this gap by:
- Grounding answers in your verified content
- Reducing hallucinations by constraining Claude to retrieved context
- Enabling domain-specific queries without fine-tuning
- Keeping knowledge current by updating your document store
Setting Up Your RAG Environment
First, install the required libraries:
```bash
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
You'll need API keys from Anthropic and Voyage AI. Set them as environment variables:
```python
import os

os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"
```
Initialize a Vector Database
For this guide, we'll use an in-memory vector store. In production, consider hosted solutions like Pinecone, Weaviate, or MongoDB Atlas.
```python
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add(self, vector: List[float], metadata: Dict):
        self.vectors.append(vector)
        self.metadata.append(metadata)

    def search(self, query_vector: List[float], k: int = 3) -> List[Dict]:
        # Rank stored vectors by cosine similarity to the query
        similarities = [
            np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
            for vec in self.vectors
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.metadata[i] for i in top_indices]
```
Level 1: Basic RAG Pipeline
A basic RAG system (often called "Naive RAG") follows three steps:
- Chunk documents by heading or logical sections
- Embed each chunk using a high-quality embedding model
- Retrieve relevant chunks via cosine similarity and feed them to Claude
Chunking Strategy
```python
def chunk_by_headings(text: str) -> List[Dict]:
    """Split document by markdown headings, preserving context."""
    chunks = []
    current_heading = "Introduction"
    current_content = []
    for line in text.split("\n"):
        if line.startswith("##"):
            # Flush the previous section before starting a new one
            if current_content:
                chunks.append({
                    "heading": current_heading,
                    "content": "\n".join(current_content)
                })
            current_heading = line.strip("# ").strip()
            current_content = []
        else:
            current_content.append(line)
    # Don't forget the final section
    if current_content:
        chunks.append({
            "heading": current_heading,
            "content": "\n".join(current_content)
        })
    return chunks
```
Embedding and Retrieval
```python
import voyageai

vo = voyageai.Client()

def embed_chunks(chunks: List[Dict]) -> List[List[float]]:
    texts = [chunk["content"] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    return embeddings

def retrieve(query: str, db: InMemoryVectorDB, k: int = 3) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    return db.search(query_embedding, k=k)
```
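To connect chunking, embedding, and retrieval, the chunks still need to be loaded into the vector store. Here is a minimal indexing sketch, assuming `document_text` holds your raw document (the variable name is just a placeholder):

```python
# Indexing sketch: embed each chunk and store it with its metadata.
# `document_text` is a placeholder for your raw markdown document.
db = InMemoryVectorDB()
chunks = chunk_by_headings(document_text)
embeddings = embed_chunks(chunks)

for chunk, embedding in zip(chunks, embeddings):
    db.add(embedding, {"heading": chunk["heading"], "content": chunk["content"]})
```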
Generating Answers with Claude
```python
from anthropic import Anthropic

client = Anthropic()

def answer_with_claude(query: str, context_chunks: List[Dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {chunk['heading']}\n{chunk['content']}"
        for chunk in context_chunks
    ])
    prompt = f"""Answer the question based on the provided context. If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
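Putting the pieces together, a query against the basic pipeline looks roughly like this (a sketch reusing the `db` populated above; the example question is arbitrary):

```python
# End-to-end query over the basic pipeline (sketch; `db` is the store indexed above)
query = "How do I set up rate limiting in Claude?"
top_chunks = retrieve(query, db, k=3)
answer = answer_with_claude(query, top_chunks)
print(answer)
```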
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need systematic metrics that measure both retrieval quality and end-to-end performance.
Creating an Evaluation Dataset
Generate a synthetic dataset with 100+ samples, each containing:
- A question
- Ground-truth relevant chunks
- A correct answer
```json
{
  "question": "How do I set up rate limiting in Claude?",
  "relevant_chunks": ["rate_limiting.md", "api_basics.md"],
  "correct_answer": "Rate limiting is configured via the Anthropic console..."
}
```
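One way to bootstrap such a dataset is to have Claude write a question and answer for each chunk, using the source chunk as the ground-truth relevant chunk. The sketch below is one possible recipe, not a prescribed one; the prompt wording and output parsing are assumptions:

```python
def generate_eval_sample(chunk: Dict) -> Dict:
    # Ask Claude to write a question that this specific chunk can answer,
    # plus a reference answer on a line prefixed with "ANSWER:".
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Write one specific question that the following documentation section "
                "answers, then the answer on a new line prefixed with 'ANSWER:'.\n\n"
                + chunk["content"]
            )
        }]
    )
    text = response.content[0].text
    question, _, answer = text.partition("ANSWER:")
    return {
        "question": question.strip(),
        "relevant_chunks": [chunk["heading"]],
        "correct_answer": answer.strip()
    }
```

Spot-check a sample of the generated questions by hand; synthetic data is only useful if it resembles what your users actually ask.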
Retrieval Metrics
#### Precision
What it measures: Of all chunks retrieved, how many were actually relevant?

```python
def precision(retrieved: List[str], relevant: List[str]) -> float:
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    if len(retrieved_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)
```
Interpretation: High precision means your system isn't wasting Claude's context window on irrelevant information.
#### Recall
What it measures: Of all relevant chunks, how many did we retrieve?

```python
def recall(retrieved: List[str], relevant: List[str]) -> float:
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    if len(relevant_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
Interpretation: High recall ensures Claude has all the information it needs to answer correctly.
#### F1 Score
The harmonic mean of precision and recall:
```python
def f1_score(prec: float, rec: float) -> float:
    if prec + rec == 0:
        return 0.0
    return 2 * (prec * rec) / (prec + rec)
```
#### Mean Reciprocal Rank (MRR)
Measures how early the first relevant chunk appears in your results:
```python
def mrr(retrieved: List[str], relevant: List[str]) -> float:
    for i, chunk in enumerate(retrieved):
        if chunk in relevant:
            return 1.0 / (i + 1)
    return 0.0
```
Why MRR matters: If the first relevant chunk is at position 3, Claude has to wade through two irrelevant chunks first, increasing the chance of confusion.
End-to-End Accuracy
This measures whether Claude's final answer is correct, using a judge LLM or human evaluation:
```python
def evaluate_answer(question: str, generated: str, correct: str) -> bool:
    prompt = f"""Does the following answer correctly address the question?

Question: {question}

Generated Answer: {generated}

Correct Answer: {correct}

Answer YES or NO:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return "YES" in response.content[0].text
```
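Tying the metrics together, a full evaluation pass is just a loop over the dataset. A rough sketch, assuming retrieved chunks and the ground-truth `relevant_chunks` use the same identifiers (headings or filenames):

```python
def run_evaluation(dataset: List[Dict], db: InMemoryVectorDB, k: int = 3) -> Dict:
    # Aggregate retrieval metrics and end-to-end accuracy over the eval set.
    results = {"precision": [], "recall": [], "f1": [], "mrr": [], "accuracy": []}
    for sample in dataset:
        retrieved_chunks = retrieve(sample["question"], db, k=k)
        retrieved_ids = [c["heading"] for c in retrieved_chunks]
        relevant_ids = sample["relevant_chunks"]

        p = precision(retrieved_ids, relevant_ids)
        r = recall(retrieved_ids, relevant_ids)
        results["precision"].append(p)
        results["recall"].append(r)
        results["f1"].append(f1_score(p, r))
        results["mrr"].append(mrr(retrieved_ids, relevant_ids))

        # End-to-end: generate an answer and judge it against the reference
        answer = answer_with_claude(sample["question"], retrieved_chunks)
        results["accuracy"].append(
            evaluate_answer(sample["question"], answer, sample["correct_answer"])
        )
    return {name: sum(vals) / len(vals) for name, vals in results.items()}
```

Run this once on the basic pipeline to establish a baseline, then re-run it after each change so you can attribute gains (or regressions) to specific techniques.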
Level 2: Summary Indexing
Basic RAG retrieves individual chunks, but sometimes the answer requires synthesizing information across multiple sections. Summary indexing addresses this by creating higher-level summaries that capture cross-chunk context.
```python
def create_summary_index(chunks: List[Dict], window_size: int = 3) -> List[Dict]:
    """Create sliding window summaries over consecutive chunks."""
    summaries = []
    for i in range(len(chunks) - window_size + 1):
        window = chunks[i:i + window_size]
        combined = "\n".join([c["content"] for c in window])
        # Use Claude to generate a concise summary of the window
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize the following text in 2-3 sentences:\n\n{combined}"
            }]
        )
        summaries.append({
            "summary": response.content[0].text,
            "source_chunks": [c["heading"] for c in window],
            "original_content": combined
        })
    return summaries
```
Embed and index both the original chunks and the summaries. When retrieving, search across both indexes and merge results.
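Here is a sketch of that merge step, assuming a second `InMemoryVectorDB` for the summaries; summary entries are stored in the same shape as chunks (`heading`/`content` keys) so the rest of the pipeline can reuse them:

```python
# Index summaries alongside the original chunks (sketch; key names are assumptions).
summary_db = InMemoryVectorDB()
summaries = create_summary_index(chunks)
summary_embeddings = vo.embed([s["summary"] for s in summaries], model="voyage-2").embeddings
for s, emb in zip(summaries, summary_embeddings):
    # Store summaries in the same shape as chunks so downstream code can reuse them
    summary_db.add(emb, {"heading": " / ".join(s["source_chunks"]), "content": s["original_content"]})

def retrieve_with_summaries(query: str, k: int = 3) -> List[Dict]:
    # Search both indexes, then merge while dropping duplicates by heading
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    candidates = db.search(query_embedding, k=k) + summary_db.search(query_embedding, k=k)
    merged, seen = [], set()
    for item in candidates:
        if item["heading"] not in seen:
            seen.add(item["heading"])
            merged.append(item)
    return merged[:2 * k]
```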
Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second stage to your retrieval pipeline. After the initial retrieval, Claude re-orders the chunks by relevance to the specific query.
```python
import re

def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    chunks_text = "\n\n".join([
        f"[{i+1}] {chunk['heading']}\n{chunk['content'][:500]}"
        for i, chunk in enumerate(candidates)
    ])
    prompt = f"""Given the question below, rank the following chunks by relevance (most relevant first).
Return only the numbers in order, comma-separated.

Question: {query}

Chunks:
{chunks_text}

Ranking:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ranking, ignoring any out-of-range numbers Claude might return
    indices = [int(x) - 1 for x in re.findall(r"\d+", response.content[0].text)]
    indices = [i for i in indices if 0 <= i < len(candidates)]
    return [candidates[i] for i in indices[:top_k]]
```
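In the optimized pipeline you over-retrieve, re-rank, and pass only the top few chunks to Claude. A sketch reusing `retrieve_with_summaries` from the earlier sketch (the candidate counts are assumptions worth tuning):

```python
def answer_optimized(query: str, top_k: int = 3) -> str:
    # Over-retrieve, re-rank with Claude, then answer from the top-ranked chunks only.
    candidates = retrieve_with_summaries(query, k=5)   # up to 10 candidates across both indexes
    best_chunks = rerank_with_claude(query, candidates, top_k=top_k)
    return answer_with_claude(query, best_chunks)
```

The re-ranking call adds one cheap Haiku request per query; in exchange, the most relevant chunk is far more likely to appear first in the context window, which is what drives the MRR improvement below.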
Performance Gains
With these optimizations, here's what you can expect:
| Metric | Basic RAG | Optimized RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access or run smaller eval sets.
- Token Budget: Summary indexing and re-ranking add token costs. Benchmark to ensure ROI.
- Vector Database: For production, use a hosted vector DB with built-in indexing and scaling.
- Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your content.
- Embedding Model: Voyage AI's `voyage-2` works well, but test alternatives like `text-embedding-3-small`.
Key Takeaways
- Start with basic RAG, then systematically measure retrieval quality using precision, recall, F1, and MRR before optimizing.
- Summary indexing captures cross-chunk context, improving recall for questions that require synthesis.
- Re-ranking with Claude dramatically improves MRR, ensuring the most relevant information appears first in the context window.
- Evaluate retrieval and end-to-end performance separately to identify where your pipeline needs improvement.
- Expect 10-15% accuracy gains from advanced techniques, but always benchmark against your specific use case and data.