# Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn how to build and optimize a Retrieval Augmented Generation (RAG) system with Claude, including evaluation metrics, summary indexing, and re-ranking techniques for production use.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to set up retrieval, evaluate performance with precision, recall, and F1 metrics, and measurably improve accuracy (in our evaluation, end-to-end accuracy climbs from 71% to 81%).
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using Claude Documentation as our knowledge base.
## What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques like summary indexing and re-ranking with Claude
## Prerequisites and Setup
Before we begin, you'll need:
- API keys from Anthropic and Voyage AI
- Python 3.8+ installed
- Basic familiarity with Python and API calls
### Required Libraries
```bash
# Install the required packages
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
### Initialize Your Clients
```python
import anthropic
import voyageai

# Initialize Claude client
claude_client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Initialize Voyage AI client for embeddings
voyage_client = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
```
## Setting Up an In-Memory Vector Database
For this guide, we'll use an in-memory vector database. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, embedding: List[float], metadata: Dict[str, Any] = None):
        self.documents.append({
            "text": text,
            "metadata": metadata or {}
        })
        self.embeddings.append(embedding)

    def search(self, query_embedding: List[float], top_k: int = 5) -> List[Dict[str, Any]]:
        # Cosine similarity search
        similarities = []
        for doc_embedding in self.embeddings:
            sim = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append(sim)
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
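As a quick sanity check, here's a toy usage of the class with hand-written 3-dimensional vectors (real Voyage AI embeddings have many more dimensions):

```python
# Toy vectors only; real embeddings come from voyage_client.embed
db = InMemoryVectorDB()
db.add_document("Claude supports tool use.", [0.9, 0.1, 0.0], {"heading": "Tools"})
db.add_document("Voyage AI provides embeddings.", [0.1, 0.9, 0.0], {"heading": "Embeddings"})

# A query vector close to the first document should rank it first
results = db.search([0.85, 0.15, 0.0], top_k=1)
print(results[0]["text"])  # -> "Claude supports tool use."
```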
## Level 1: Basic RAG Pipeline
Let's start with a basic RAG pipeline, sometimes called "Naive RAG." This involves three steps:
- Chunk documents by heading, so each chunk contains only the content under a single subheading
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
### Step 1: Chunk Your Documents
```python
def chunk_by_headings(document_text: str) -> List[Dict[str, str]]:
    """Split document into chunks based on headings."""
    chunks = []
    lines = document_text.split('\n')
    current_heading = "Introduction"
    current_content = []
    for line in lines:
        if line.startswith('#'):  # Markdown heading
            if current_content:
                chunks.append({
                    "heading": current_heading,
                    "content": '\n'.join(current_content).strip()
                })
            current_heading = line.lstrip('#').strip()
            current_content = []
        else:
            current_content.append(line)
    # Don't forget the last chunk
    if current_content:
        chunks.append({
            "heading": current_heading,
            "content": '\n'.join(current_content).strip()
        })
    return chunks
```
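For example, running the chunker on a small made-up markdown string yields one chunk per heading:

```python
sample = "# Setup\nInstall the SDK.\n# Usage\nCall the client."
for chunk in chunk_by_headings(sample):
    print(chunk["heading"], "->", chunk["content"])
# Setup -> Install the SDK.
# Usage -> Call the client.
```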
### Step 2: Embed and Store
```python
def build_vector_db(chunks: List[Dict[str, str]]) -> InMemoryVectorDB:
    db = InMemoryVectorDB()
    for chunk in chunks:
        # Generate embedding using Voyage AI
        response = voyage_client.embed(
            texts=[chunk["content"]],
            model="voyage-2"
        )
        embedding = response.embeddings[0]
        db.add_document(
            text=chunk["content"],
            embedding=embedding,
            metadata={"heading": chunk["heading"]}
        )
    return db
```
### Step 3: Retrieve and Generate
```python
def rag_query(db: InMemoryVectorDB, query: str, top_k: int = 3) -> str:
    # Embed the query
    response = voyage_client.embed(
        texts=[query],
        model="voyage-2"
    )
    query_embedding = response.embeddings[0]

    # Retrieve relevant chunks
    retrieved_chunks = db.search(query_embedding, top_k=top_k)

    # Build context from retrieved chunks
    context = "\n\n".join([chunk["text"] for chunk in retrieved_chunks])

    # Generate answer using Claude
    prompt = f"""Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
    response = claude_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
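Putting the three steps together looks like this. The file path and question below are placeholders; point them at your own corpus:

```python
# Hypothetical end-to-end run over a local copy of the docs (placeholder path)
with open("claude_docs.md") as f:
    chunks = chunk_by_headings(f.read())

db = build_vector_db(chunks)
answer = rag_query(db, "How do I count tokens before sending a request?")
print(answer)
```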
## Building an Evaluation System
When evaluating RAG applications, it's critical to measure the performance of the retrieval system and the end-to-end system separately. We'll use a synthetic evaluation dataset with 100 samples, each containing:
- A question
- Relevant chunks (ground truth)
- A correct answer
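For concreteness, one sample might look like the dictionary below. The field names and values are illustrative, not a required schema:

```python
# Illustrative evaluation sample; adapt the schema to your own ground truth
eval_sample = {
    "question": "How do I handle rate limit errors from the API?",
    "relevant_chunks": ["Rate limits", "Error handling"],  # ground-truth headings
    "correct_answer": "Catch the rate limit error and retry with exponential backoff."
}
```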
### Key Metrics
#### Retrieval Metrics
Precision measures how many of the retrieved chunks are actually relevant:

Precision = True Positives / Total Retrieved

High precision means fewer irrelevant chunks are being retrieved.

Recall measures how many of the relevant chunks were retrieved:

Recall = True Positives / Total Relevant

High recall means you're capturing most of the necessary information.

F1 Score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
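A minimal sketch of these three metrics in code, assuming each chunk is identified by a hashable key such as its heading (our convention, not prescribed by the guide):

```python
def retrieval_metrics(retrieved: list, relevant: list) -> dict:
    # Compare chunks as sets of identifiers (e.g., headings)
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```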
Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the results:
```python
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
#### End-to-End Metric
End-to-End Accuracy measures whether the final answer is correct, considering both retrieval and generation quality.
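One common way to score this automatically is to use Claude itself as a grader. Below is a minimal sketch, assuming the evaluation samples described above; the grading prompt is our own wording, not a prescribed recipe:

```python
def grade_answer(question: str, generated: str, correct: str) -> bool:
    # Ask Claude to judge the candidate answer against the reference answer
    prompt = f"""Question: {question}

Reference answer: {correct}

Candidate answer: {generated}

Does the candidate answer convey the same key information as the reference answer?
Reply with only "yes" or "no"."""
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().lower().startswith("yes")
```

## Level 2: Summary Indexing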
A major improvement over basic RAG is summary indexing. Instead of embedding and matching raw chunks directly, you:
- Generate a summary of each chunk using Claude
- Store both the summary and the original chunk
- Retrieve based on summary similarity, then return the full chunk
```python
def generate_summary(chunk_text: str) -> str:
    prompt = f"Summarize the following text in 2-3 sentences:\n\n{chunk_text}"
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
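To complete the picture, here's one way to wire the summaries into the vector store from earlier: embed the summary for retrieval, but keep the full chunk as the stored text. This is a sketch built on the `InMemoryVectorDB` and clients defined above:

```python
def build_summary_index(chunks: List[Dict[str, str]]) -> InMemoryVectorDB:
    db = InMemoryVectorDB()
    for chunk in chunks:
        summary = generate_summary(chunk["content"])
        # Embed the summary, not the raw chunk
        response = voyage_client.embed(texts=[summary], model="voyage-2")
        db.add_document(
            text=chunk["content"],  # store the full chunk for generation
            embedding=response.embeddings[0],
            metadata={"heading": chunk["heading"], "summary": summary}
        )
    return db
```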
## Level 3: Summary Indexing + Re-Ranking
The most advanced approach combines summary indexing with re-ranking. After initial retrieval, you use Claude to re-rank the results based on relevance to the query.
```python
import re

def rerank_with_claude(query: str, chunks: List[Dict[str, str]], top_k: int = 3) -> List[Dict[str, str]]:
    # Ask Claude to rank chunks by relevance, showing a truncated preview of each
    chunks_text = "\n---\n".join([
        f"Chunk {i+1}: {chunk['text'][:200]}..."
        for i, chunk in enumerate(chunks)
    ])
    prompt = f"""Given the query: "{query}"

Rank the following chunks by relevance (most relevant first):

{chunks_text}

Return the chunk numbers in order of relevance, separated by commas."""
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ranked chunk numbers, discarding anything out of range
    indices = [int(x) - 1 for x in re.findall(r'\d+', response.content[0].text)]
    indices = [i for i in indices if 0 <= i < len(chunks)]
    return [chunks[i] for i in indices[:top_k]]
```
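Composing the pieces gives the Level 3 pipeline: over-retrieve candidates from the summary index, then let Claude re-rank before generating. This is a sketch; the candidate count of 10 is our choice, not a tuned value:

```python
def advanced_rag_query(db: InMemoryVectorDB, query: str) -> str:
    # Over-retrieve candidates from the summary index, then re-rank down to 3
    response = voyage_client.embed(texts=[query], model="voyage-2")
    candidates = db.search(response.embeddings[0], top_k=10)
    top_chunks = rerank_with_claude(query, candidates, top_k=3)

    context = "\n\n".join(chunk["text"] for chunk in top_chunks)
    prompt = (
        "Based on the following context, answer the user's question.\n"
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )
    response = claude_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```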
## Performance Gains
On our 100-sample evaluation set, these targeted improvements yield clear gains, with the largest jumps in MRR and end-to-end accuracy:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
## Key Takeaways
- Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing improves retrieval quality by matching queries against concise summaries rather than raw chunks, leading to better semantic alignment.
- Re-ranking with Claude significantly boosts MRR by ensuring the most relevant chunks appear first, which matters most when only the top few chunks fit into the generation prompt.
- Start simple, then iterate — a basic RAG pipeline can be surprisingly effective. Only add complexity (summary indexing, re-ranking) when metrics show it's needed.
- Use high-quality embeddings from providers like Voyage AI to ensure your retrieval foundation is solid before optimizing other components.