Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide teaches you to build a RAG system with Claude, from basic implementation to advanced optimization. You'll learn to set up vector search, create evaluation metrics (precision, recall, F1, MRR), and improve performance through summary indexing and re-ranking—achieving up to 81% end-to-end accuracy.
Retrieval Augmented Generation (RAG) underpins many enterprise AI applications. Claude handles general knowledge tasks well, but to answer questions about your business it needs access to your own context: internal documents, customer support knowledge bases, or proprietary data.
In this guide, we'll walk through building a production-quality RAG system using Claude and Voyage AI embeddings. We'll start with a basic implementation, then systematically improve it using evaluation-driven optimization. By the end, you'll understand how to achieve significant performance gains: our optimized system improved end-to-end accuracy from 71% to 81%.
Understanding the RAG Pipeline
A RAG system works in three stages:
- Ingestion: Chunk and embed your documents into a vector database
- Retrieval: Find the most relevant chunks for a user query
- Generation: Feed retrieved context to Claude to produce an answer
Level 1: Basic RAG Implementation
Setup and Dependencies
First, install the required libraries:
```bash
pip install anthropic voyageai pandas numpy scikit-learn matplotlib
```
You'll need API keys from Anthropic and Voyage AI. Set them as environment variables:
```python
import os

os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"
```
Building the Vector Database
For this example, we'll use an in-memory vector store. In production, consider solutions like Pinecone, Weaviate, or pgvector.
```python
from typing import Dict, List

import numpy as np
import voyageai


class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings"""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve top-k most similar documents"""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Dot product; this equals cosine similarity only for unit-length
        # vectors, which we assume the embedding model returns.
        similarities = np.dot(np.asarray(self.embeddings), query_embedding)
        # argsort is ascending: take the last k indices and reverse them
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
```
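The `search` method uses a dot product as a stand-in for cosine similarity, which is only valid when the vectors have unit length. The following dependency-free sketch normalizes explicitly, so it works for vectors of any length. It runs on hand-made toy vectors rather than real embeddings, and `cosine_similarity` and `top_k_cosine` are hypothetical helper names, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_cosine(doc_vectors, query_vector, k=3):
    """Return the indices of the k vectors most similar to the query."""
    scores = [cosine_similarity(vec, query_vector) for vec in doc_vectors]
    # Sort indices by score, highest first, and keep the first k
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D vectors standing in for embeddings; note the differing lengths
toy_docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_ranking = top_k_cosine(toy_docs, [1.0, 0.2], k=2)  # -> [0, 2]
```

Because every vector is divided by its norm, the long vector `[3.0, 3.0]` does not outrank `[1.0, 0.0]` simply by being larger.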
Chunking Strategy
A naive approach chunks documents by heading:
```python
from typing import Dict, List


def chunk_by_heading(document: str) -> List[Dict[str, str]]:
    """Split document by markdown headings"""
    chunks = []
    current_heading = "Introduction"
    current_content = []

    def flush():
        # Save the section collected so far, skipping blank sections
        content = "\n".join(current_content).strip()
        if content:
            chunks.append({"heading": current_heading, "content": content})

    for line in document.split("\n"):
        if line.startswith("##"):
            flush()
            # Drop only the leading '#' markers, then surrounding whitespace
            current_heading = line.lstrip("#").strip()
            current_content = []
        else:
            current_content.append(line)

    # Don't forget the last section
    flush()
    return chunks
```
The Complete RAG Pipeline
```python
from typing import List

from anthropic import Anthropic


class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(voyage_key)
        self.llm = Anthropic(api_key=anthropic_key)

    def ingest(self, documents: List[str]):
        """Process and store documents"""
        all_chunks = []
        for doc in documents:
            chunks = chunk_by_heading(doc)
            all_chunks.extend(chunks)
        self.vector_db.add_documents(all_chunks)

    def query(self, question: str) -> str:
        """Answer a question using RAG"""
        # Retrieve relevant chunks
        relevant_chunks = self.vector_db.search(question, k=3)

        # Build context
        context = "\n\n---\n\n".join([
            chunk["content"] for chunk in relevant_chunks
        ])

        # Generate answer with Claude
        response = self.llm.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""Based on the following context, answer the question.

Context:
{context}

Question: {question}

Provide a clear, accurate answer based only on the context provided."""
            }]
        )
        return response.content[0].text
```
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need quantitative metrics. Let's build a robust evaluation suite.
Creating a Test Dataset
Generate 100+ test samples with:
- A question
- Ground truth relevant chunks
- A correct answer
```python
# Example test sample structure
test_sample = {
    "question": "What is the maximum context window for Claude 3 Opus?",
    "relevant_chunks": [
        "Claude 3 Opus supports a 200K token context window...",
        "The Claude 3 family offers different context windows..."
    ],
    "correct_answer": "Claude 3 Opus supports up to 200K tokens..."
}
```
Key Metrics Explained
#### Retrieval Metrics
Precision: Of the chunks we retrieved, how many were actually relevant?
Precision = |Retrieved ∩ Relevant| / |Retrieved|
Recall: Of all relevant chunks, how many did we retrieve?
Recall = |Retrieved ∩ Relevant| / |Relevant|
F1 Score: Harmonic mean of precision and recall
F1 = 2 × (Precision × Recall) / (Precision + Recall)
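The three formulas above translate directly into set operations. This is a minimal sketch for a single query, assuming chunks are compared by a hashable identifier (the `chunk_a`-style IDs are made up for illustration); `retrieval_scores` is a hypothetical helper name.

```python
def retrieval_scores(retrieved, relevant):
    """Return (precision, recall, f1) for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets to avoid division by zero
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # With no hits, precision and recall are both zero, so F1 is zero too
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# One of three retrieved chunks is relevant; one of two relevant chunks was found
precision, recall, f1 = retrieval_scores(
    retrieved=["chunk_a", "chunk_b", "chunk_c"],
    relevant=["chunk_a", "chunk_d"],
)
# precision = 1/3, recall = 1/2, f1 = 0.4
```

To score a whole test set, compute these per query and average each metric across queries.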
Mean Reciprocal Rank (MRR): How high did the first relevant result rank?
```python
def calculate_mrr(retrieved_chunks, relevant_chunks):
    """Reciprocal rank for one query; average over all queries to get MRR"""
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
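The function above scores a single query, while the "mean" in MRR comes from averaging that score over the whole test set. A small sketch of the averaging step, using toy single-letter chunk IDs; `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical names chosen for this example.

```python
def reciprocal_rank(retrieved_chunks, relevant_chunks):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk in enumerate(retrieved_chunks, start=1):
        if chunk in relevant_chunks:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in results) / len(results)


toy_results = [
    (["a", "b", "c"], {"a"}),  # first relevant chunk at rank 1 -> 1.0
    (["a", "b", "c"], {"b"}),  # first relevant chunk at rank 2 -> 0.5
    (["a", "b", "c"], {"z"}),  # no relevant chunk retrieved   -> 0.0
]
toy_mrr = mean_reciprocal_rank(toy_results)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```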
#### End-to-End Metrics
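Accuracy: Does Claude's answer match the ground truth?
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def evaluate_answer(generated_answer: str, correct_answer: str) -> bool:
    """Use Claude to judge if answers are semantically equivalent"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,  # required by the API; the verdict is a single word
        messages=[{
            "role": "user",
            "content": f"""Are these two answers equivalent?

Answer 1: {generated_answer}

Answer 2: {correct_answer}

Respond with only 'YES' or 'NO'."""
        }]
    )
    # Tolerate differences in case and trailing punctuation in the verdict
    return response.content[0].text.strip().upper().startswith("YES")
```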
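End-to-end accuracy is then the fraction of test samples whose generated answer the judge accepts. The sketch below keeps the answering and judging steps pluggable so it can run offline: `end_to_end_accuracy` is a hypothetical helper, and `stub_answer` and `stub_judge` are stand-ins for a real RAG query function and an LLM judge.

```python
from typing import Callable, Dict, List


def end_to_end_accuracy(
    samples: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Fraction of samples where judge_fn accepts the generated answer."""
    if not samples:
        return 0.0
    correct = sum(
        1
        for sample in samples
        if judge_fn(answer_fn(sample["question"]), sample["correct_answer"])
    )
    return correct / len(samples)


# Stand-ins so the example runs without API calls (assumptions, not real components)
def stub_answer(question: str) -> str:
    return {"q1": "200K tokens", "q2": "unknown"}.get(question, "")


def stub_judge(generated: str, correct: str) -> bool:
    return generated.strip().lower() == correct.strip().lower()


stub_samples = [
    {"question": "q1", "correct_answer": "200K tokens"},
    {"question": "q2", "correct_answer": "Paris"},
]
stub_accuracy = end_to_end_accuracy(stub_samples, stub_answer, stub_judge)  # 0.5
```

In a real run, `answer_fn` would be the pipeline's query method and `judge_fn` the Claude-based judge.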
Level 2: Summary Indexing
Basic chunking loses context. Summary indexing creates a two-tier retrieval system:
- Summary chunks: High-level overviews for initial retrieval
- Detail chunks: Full content for answer generation
```python
from typing import Dict, List

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def create_summary_index(chunks: List[Dict]) -> List[Dict]:
    """Create summary-level representations"""
    summary_chunks = []
    for chunk in chunks:
        # Use Claude to generate a concise summary
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,  # required by the API; enough for 1-2 sentences
            messages=[{
                "role": "user",
                "content": f"Summarize this text in 1-2 sentences:\n\n{chunk['content']}"
            }]
        )
        summary_chunks.append({
            "summary": response.content[0].text,
            "original_content": chunk["content"],
            "heading": chunk["heading"]
        })
    return summary_chunks
```
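One way to use such an index is to match the query against the summaries and then hand the full original content to the generation step. The sketch below illustrates that two-tier lookup under a loud assumption: `toy_embed` is a word-count stand-in for a real embedding model, used only so the example runs offline, and `search_summaries` is a hypothetical helper name.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_summaries(query, summary_chunks, k=1):
    """Rank chunks by query-to-summary similarity; return their full content."""
    query_vec = toy_embed(query)
    ranked = sorted(
        summary_chunks,
        key=lambda chunk: cosine(query_vec, toy_embed(chunk["summary"])),
        reverse=True,
    )
    return [chunk["original_content"] for chunk in ranked[:k]]


toy_index = [
    {"summary": "how to rotate api keys",
     "original_content": "Full section text on key rotation.",
     "heading": "Security"},
    {"summary": "pricing tiers and billing",
     "original_content": "Full section text on billing.",
     "heading": "Pricing"},
]
top_content = search_summaries("rotate my api keys", toy_index, k=1)
# -> ["Full section text on key rotation."]
```

The match happens on the short summary, but the caller receives the detailed text, which is what the generation prompt needs.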
This improved our recall from 0.66 to 0.69 and F1 from 0.52 to 0.54.
Level 3: Adding Re-Ranking
Re-ranking refines initial retrieval results using Claude's understanding of relevance:
```python
from typing import Dict, List

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance"""
    # Prepare chunks for ranking
    chunks_text = "\n\n".join([
        f"[{i}] {chunk['content']}"
        for i, chunk in enumerate(candidates)
    ])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=100,  # required by the API; the reply is a short index list
        messages=[{
            "role": "user",
            "content": f"""Given this query: "{query}"

Rank these chunks by relevance (most relevant first).
Return only the indices in order, comma-separated.

Chunks:
{chunks_text}"""
        }]
    )
    # Parse ranked indices, ignoring anything that is not a valid index
    parts = [part.strip() for part in response.content[0].text.split(",")]
    ranked_indices = [
        int(part) for part in parts
        if part.isdigit() and int(part) < len(candidates)
    ]
    return [candidates[i] for i in ranked_indices[:top_k]]
```
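Model replies do not always follow the requested format exactly, so the parsing step deserves some defensiveness. Here is a sketch of a more tolerant parser, assuming the reply contains the indices somewhere as plain digits; `parse_ranked_indices` is a hypothetical helper, and a reply with unrelated numbers in it (for example "Top 3:") would still be misread.

```python
import re


def parse_ranked_indices(text, num_candidates):
    """Extract a full ranking of candidate indices from a free-form reply."""
    ranking = []
    for token in re.findall(r"\d+", text):
        idx = int(token)
        # Skip out-of-range indices and duplicates
        if idx < num_candidates and idx not in ranking:
            ranking.append(idx)
    # Append any candidates the reply omitted, keeping their original order
    ranking.extend(i for i in range(num_candidates) if i not in ranking)
    return ranking


parsed = parse_ranked_indices("Ranking: 2, 0, 7, 2", num_candidates=3)  # -> [2, 0, 1]
```

Falling back to the original retrieval order for missing indices means a malformed reply degrades to the un-reranked results instead of raising an exception.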
Re-ranking raised MRR from 0.74 to 0.87, meaning the first relevant chunk now usually appears at or near the top of the results.
Performance Results
Here's what our optimizations achieved:
| Metric | Basic RAG | Optimized RAG |
|---|---|---|
| Precision | 0.43 | 0.44 |
| Recall | 0.66 | 0.69 |
| F1 Score | 0.52 | 0.54 |
| MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
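To put the table in perspective, the short script below computes the relative gain for each metric and checks that the reported F1 values are consistent with the reported precision and recall. The numbers are copied from the table; note that a per-query averaged F1 need not equal the F1 of averaged precision and recall in general, so the check is only a sanity test.

```python
# (basic, optimized) values copied from the results table
results = {
    "Precision": (0.43, 0.44),
    "Recall": (0.66, 0.69),
    "F1 Score": (0.52, 0.54),
    "MRR": (0.74, 0.87),
    "End-to-End Accuracy": (0.71, 0.81),
}


def relative_gain(before, after):
    """Percentage improvement relative to the starting value."""
    return (after - before) / before * 100


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in results.items()}
f1_basic = round(f1_from(0.43, 0.66), 2)      # 0.52, matches the table
f1_optimized = round(f1_from(0.44, 0.69), 2)  # 0.54, matches the table

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The largest relative gains are in MRR (about 17.6%) and end-to-end accuracy (about 14.1%), while precision, recall, and F1 each move by less than 5%.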
Production Considerations
- Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access or sample your evaluation set.
- Vector Database: Move from in-memory to Pinecone, Weaviate, or pgvector for production.
- Chunk Size: Experiment with chunk sizes (256-1024 tokens) based on your document structure.
- Embedding Model: This guide uses Voyage AI's voyage-2; benchmark alternatives against your own evaluation set before committing to one.
- Caching: Cache embeddings and common queries to reduce API costs.
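As an illustration of the caching point, the sketch below wraps any batch embedding function with an in-memory cache keyed by a hash of the text. `CachedEmbedder` is a hypothetical class, and `fake_embed` is a stand-in for a real embedding call, used so the example runs offline.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEmbedder:
    """Wrap a batch embedding function so each distinct text is embedded once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self.embed_fn = embed_fn
        self.cache: Dict[str, List[float]] = {}
        self.api_calls = 0  # counts calls made to the wrapped function

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Deduplicate while preserving order, then keep only uncached texts
        missing = [t for t in dict.fromkeys(texts) if self._key(t) not in self.cache]
        if missing:
            self.api_calls += 1
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.cache[self._key(text)] = vector
        return [self.cache[self._key(t)] for t in texts]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedding: a 1-D vector holding the text length."""
    return [[float(len(t))] for t in texts]


embedder = CachedEmbedder(fake_embed)
first = embedder.embed(["alpha", "beta"])         # one call, two texts embedded
second = embedder.embed(["alpha", "pi", "alpha"])  # one call, only "pi" is new
```

A production version would likely persist the cache to disk or a key-value store and include the model name in the key, so that switching embedding models does not return stale vectors.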
Key Takeaways
- Evaluate systematically: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy to identify bottlenecks in your RAG pipeline.
- Summary indexing improves recall: Creating two-tier representations helps retrieve relevant content even when queries don't match exact phrasing.
- Re-ranking with Claude gave the largest MRR gain: Using Claude to re-rank initial results raised MRR from 0.74 to 0.87, which helps put the most relevant context first in the generation step.
- Start simple, optimize iteratively: In our evaluation, the basic RAG pipeline reached 71% accuracy, and targeted improvements raised it to 81%.
- Build for production from day one: Consider rate limits, vector database choices, and caching strategies early in development.