# Building and Optimizing RAG Systems with Claude: A Practical Guide
This guide teaches you to build a Claude RAG system using Voyage AI embeddings, create robust evaluations, and implement advanced techniques like summary indexing and re-ranking to improve answer accuracy from 71% to 81%.
Claude excels at general tasks but may struggle with domain-specific queries about your business context. Retrieval Augmented Generation (RAG) solves this by enabling Claude to access your internal knowledge bases, documents, and support materials. Enterprises use RAG applications for customer support, internal Q&A systems, financial analysis, legal research, and more.
In this guide, we'll walk through building and optimizing a RAG system using Claude Documentation as our knowledge base. We'll demonstrate how to achieve measurable performance improvements—increasing end-to-end accuracy from 71% to 81% through targeted optimizations.
## Prerequisites and Setup
Before we begin, make sure you have API keys for Anthropic and Voyage AI, then complete the following setup:
- Install the required libraries:

```python
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```

- Import libraries:

```python
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json
import matplotlib.pyplot as plt
```

- Initialize API clients:

```python
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGEAI_KEY")
```
## Level 1: Building a Basic RAG System
A basic RAG pipeline (sometimes called "Naive RAG") consists of four core steps:
### 1. Document Chunking
Divide your documents into manageable pieces. For documentation, chunking by heading works well:
```python
def chunk_by_heading(document_text):
    """
    Simple chunking function that splits documents by headings
    """
    chunks = []
    current_chunk = ""
    for line in document_text.split('\n'):
        if line.startswith('#'):  # Markdown heading starts a new chunk
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = line + "\n"
        else:
            current_chunk += line + "\n"
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
```
### 2. Embedding Generation
Convert chunks into vector embeddings using Voyage AI:
```python
def embed_chunks(chunks):
    """
    Generate embeddings for document chunks
    """
    # Voyage AI embeddings are optimized for retrieval
    result = vo.embed(
        texts=chunks,
        model="voyage-2",
        input_type="document"
    )
    return result.embeddings
```
### 3. Retrieval with Cosine Similarity
When a query comes in, embed it and find the most similar chunks:
```python
class InMemoryVectorDB:
    """
    Simple in-memory vector database for demonstration.
    For production, consider hosted solutions like Pinecone or Weaviate.
    """
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add_documents(self, chunks, embeddings):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding, k=3):
        """
        Retrieve the top k most similar chunks using cosine similarity
        """
        similarities = cosine_similarity(
            [query_embedding],
            self.embeddings
        )[0]
        # Get indices of the top k most similar chunks
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.chunks[i] for i in top_indices]

# Initialize and populate the vector database
db = InMemoryVectorDB()
chunks = chunk_by_heading(your_document_text)
embeddings = embed_chunks(chunks)
db.add_documents(chunks, embeddings)
```
### 4. Query Processing
Combine retrieval with Claude for final answers:
```python
def query_rag_system(query, db, k=3):
    """
    Complete RAG query pipeline
    """
    # Embed the query
    query_embedding = vo.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    # Retrieve relevant chunks
    relevant_chunks = db.search(query_embedding, k=k)

    # Build context for Claude
    context = "\n\n".join(relevant_chunks)

    # Query Claude with the retrieved context
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
        }]
    )
    return response.content[0].text, relevant_chunks
```
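With the pipeline in place, a quick sanity check looks like this (the example question is illustrative; any question about your indexed documentation works):

```python
# Ask a question against the populated vector database
answer, sources = query_rag_system(
    "How do I set the max_tokens parameter?",  # illustrative question
    db,
    k=3
)
print(answer)
print(f"Answer grounded in {len(sources)} retrieved chunks")
```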
## Building a Robust Evaluation System
Moving beyond "vibes-based" evaluation is crucial for production RAG systems. We need to measure both retrieval performance and end-to-end accuracy independently.
### Creating an Evaluation Dataset
For this guide, we use a synthetic dataset of 100 samples containing:
- Questions
- Relevant document chunks (ground truth for retrieval)
- Correct answers (ground truth for end-to-end evaluation)
```python
# Load evaluation dataset
with open('evaluation/docs_evaluation_dataset.json', 'r') as f:
    eval_dataset = json.load(f)

# Preview a sample
sample = eval_dataset[0]
print(f"Question: {sample['question']}")
print(f"Relevant chunks: {len(sample['relevant_chunks'])}")
print(f"Correct answer: {sample['correct_answer'][:100]}...")
```
### Key Evaluation Metrics
#### Retrieval Metrics
- Precision: Proportion of retrieved chunks that are actually relevant
```python
def calculate_precision(retrieved_chunks, relevant_chunks):
    true_positives = len(set(retrieved_chunks) & set(relevant_chunks))
    total_retrieved = len(retrieved_chunks)
    return true_positives / total_retrieved if total_retrieved > 0 else 0
```
- Recall: Proportion of all relevant chunks that were retrieved
```python
def calculate_recall(retrieved_chunks, relevant_chunks):
    true_positives = len(set(retrieved_chunks) & set(relevant_chunks))
    total_relevant = len(relevant_chunks)
    return true_positives / total_relevant if total_relevant > 0 else 0
```
- F1 Score: Harmonic mean of precision and recall
```python
def calculate_f1(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)
```
- Mean Reciprocal Rank (MRR): Measures how high the first relevant result appears
```python
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks, 1):
        if chunk in relevant_chunks:
            return 1 / i
    return 0
```
#### End-to-End Accuracy
Measure whether Claude's final answer matches the ground truth:
```python
def evaluate_answer(claude_answer, correct_answer):
    """
    Simple evaluation - for production, consider more sophisticated methods
    like using Claude to evaluate answer quality
    """
    # This is a simplified version
    claude_lower = claude_answer.lower()
    correct_lower = correct_answer.lower()
    # Check for key information presence
    key_terms = extract_key_terms(correct_answer)
    matches = sum(1 for term in key_terms if term in claude_lower)
    return matches / len(key_terms) if key_terms else 0
```
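`evaluate_answer` relies on an `extract_key_terms` helper that this guide does not define. The sketch below is one possible stand-in, followed by a loop that runs the full evaluation over the dataset. Both are illustrative assumptions, and the retrieval metrics assume the retrieved chunk texts match the ground-truth chunk texts exactly.

```python
import re

def extract_key_terms(text, min_length=4):
    """
    Hypothetical helper: pull distinctive lowercase terms out of the
    ground-truth answer, skipping very short and very generic words.
    """
    stopwords = {"that", "with", "this", "from", "have", "will", "your", "when"}
    words = re.findall(r"[a-zA-Z_][a-zA-Z0-9_-]+", text.lower())
    return [w for w in words if len(w) >= min_length and w not in stopwords]

# Run retrieval and end-to-end evaluation over the whole dataset
precisions, recalls, f1s, mrrs, answer_scores = [], [], [], [], []
for sample in eval_dataset:
    answer, retrieved = query_rag_system(sample["question"], db, k=3)
    p = calculate_precision(retrieved, sample["relevant_chunks"])
    r = calculate_recall(retrieved, sample["relevant_chunks"])
    precisions.append(p)
    recalls.append(r)
    f1s.append(calculate_f1(p, r))
    mrrs.append(calculate_mrr(retrieved, sample["relevant_chunks"]))
    answer_scores.append(evaluate_answer(answer, sample["correct_answer"]))

print(f"Avg precision:    {np.mean(precisions):.2f}")
print(f"Avg recall:       {np.mean(recalls):.2f}")
print(f"Avg F1:           {np.mean(f1s):.2f}")
print(f"Avg MRR:          {np.mean(mrrs):.2f}")
print(f"Avg answer score: {np.mean(answer_scores):.2f}")
```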
## Level 2: Summary Indexing
Basic RAG struggles with queries requiring synthesis across multiple chunks. Summary indexing helps by creating hierarchical representations:
```python
def create_summary_index(chunks):
    """
    Create summary embeddings for groups of related chunks
    """
    # Group chunks by topic/section
    grouped_chunks = group_by_topic(chunks)
    summaries = []
    for group in grouped_chunks:
        # Use Claude to create a summary of the group
        group_text = "\n\n".join(group)
        summary = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Summarize the following documents:\n\n{group_text}"
            }]
        ).content[0].text
        summaries.append({
            "summary": summary,
            "chunks": group,
            "embedding": vo.embed(
                [summary], model="voyage-2", input_type="document"
            ).embeddings[0]
        })
    return summaries
```
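`group_by_topic` is left undefined above. A minimal stand-in, assuming consecutive documentation chunks tend to cover related topics, could be as simple as fixed-size batches (this helper is an assumption, not part of the original code):

```python
def group_by_topic(chunks, group_size=5):
    """
    Hypothetical grouping helper: batch consecutive chunks together.
    Consecutive chunks from the same document usually cover related
    topics, so fixed-size batches are a reasonable first approximation.
    """
    return [chunks[i:i + group_size] for i in range(0, len(chunks), group_size)]
```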
How it works:
- First, retrieve the most relevant summaries for the query
- Then, retrieve chunks from the most relevant summary groups
- This provides better context for multi-chunk queries (see the sketch below)
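A minimal sketch of that two-stage retrieval, assuming the `summaries` structure returned by `create_summary_index`, might look like this (the function name and the flattening strategy are illustrative):

```python
def search_summary_index(query, summaries, top_groups=2, k=3):
    """
    Two-stage retrieval sketch: rank summary groups first, then pool
    the chunks from the best groups and rank those against the query.
    """
    query_embedding = vo.embed(
        [query], model="voyage-2", input_type="query"
    ).embeddings[0]

    # Stage 1: find the most relevant summary groups
    summary_embeddings = [s["embedding"] for s in summaries]
    group_scores = cosine_similarity([query_embedding], summary_embeddings)[0]
    best_groups = np.argsort(group_scores)[-top_groups:][::-1]

    # Stage 2: rank the chunks within those groups against the query
    candidate_chunks = [c for i in best_groups for c in summaries[i]["chunks"]]
    chunk_embeddings = vo.embed(
        candidate_chunks, model="voyage-2", input_type="document"
    ).embeddings
    chunk_scores = cosine_similarity([query_embedding], chunk_embeddings)[0]
    top_indices = np.argsort(chunk_scores)[-k:][::-1]
    return [candidate_chunks[i] for i in top_indices]
```

Re-embedding the candidate chunks at query time keeps the sketch short; in practice you would store each chunk's embedding alongside its group when building the index.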
## Level 3: Summary Indexing with Re-Ranking
Add a re-ranking step using Claude to improve retrieval quality:
```python
def rerank_with_claude(query, retrieved_chunks):
    """
    Use Claude to re-rank retrieved chunks by relevance
    """
    chunk_list = "\n\n".join([
        f"Chunk {i+1}: {chunk[:200]}..."
        for i, chunk in enumerate(retrieved_chunks)
    ])

    prompt = f"""Rank these document chunks by relevance to the query.

Query: {query}

Chunks:
{chunk_list}

Return only the numbers in order of relevance (most relevant first)."""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse Claude's ranking
    ranked_indices = parse_ranking(response.content[0].text)

    # Reorder chunks based on Claude's ranking
    return [retrieved_chunks[i] for i in ranked_indices]
```
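`parse_ranking` is not shown in this guide. A minimal sketch, assuming Claude replies with chunk numbers such as `3, 1, 2`, could look like this (the helper and its error handling are assumptions):

```python
import re

def parse_ranking(ranking_text):
    """
    Hypothetical parser: extract 1-based chunk numbers from Claude's reply
    and convert them to 0-based indices, dropping duplicates.
    In practice you would also clamp the indices to the number of chunks
    and fall back to the original order if the reply is unparseable.
    """
    indices = []
    for match in re.findall(r"\d+", ranking_text):
        idx = int(match) - 1
        if idx >= 0 and idx not in indices:
            indices.append(idx)
    return indices
```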
## Performance Improvements
Through these optimizations, we achieved significant gains:
| Metric | Basic RAG | Optimized RAG | Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |
## Production Considerations
- Vector Database Selection: For production, use a hosted vector database like Pinecone or Weaviate, or pgvector if you already run Postgres
- Chunking Strategy: Experiment with different chunk sizes and overlap based on your content
- Embedding Models: Voyage AI works well, but also consider OpenAI, Cohere, or open-source alternatives
- Caching: Implement caching for frequent queries to reduce costs and latency (a minimal sketch follows this list)
- Monitoring: Track retrieval metrics and answer quality in production
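As a hedged illustration of the caching point above, here is a minimal in-process cache keyed on the normalized query, assuming the `query_rag_system` function defined earlier and a static document index; production systems typically reach for Redis or a similar store with a TTL:

```python
# Simple in-process cache keyed on the normalized query and k.
# Assumes the document index is static for the lifetime of the process.
_answer_cache = {}

def cached_query(query, db, k=3):
    key = (query.strip().lower(), k)
    if key not in _answer_cache:
        _answer_cache[key] = query_rag_system(query, db, k=k)
    return _answer_cache[key]
```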
## Key Takeaways
- Start with Basic RAG: Implement a simple pipeline with document chunking, embedding, and cosine similarity retrieval before adding complexity.
- Build Robust Evaluations: Move beyond subjective assessment by measuring precision, recall, F1, MRR, and end-to-end accuracy with a proper evaluation dataset.
- Use Summary Indexing for Complex Queries: When questions require synthesis across multiple documents, hierarchical summary indexing significantly improves retrieval quality.
- Implement Re-Ranking: Add a Claude-powered re-ranking step to refine retrieval results, improving MRR by 17.6% in our tests.
- Measure and Iterate: Continuously evaluate your system and implement targeted improvements—our optimizations increased end-to-end accuracy from 71% to 81%.