# Building Better RAG Systems with Claude: A Practical Guide to Evaluation and Optimization
This guide teaches you how to build and optimize RAG systems with Claude, covering basic implementation, comprehensive evaluation metrics, and advanced techniques like summary indexing and re-ranking to significantly improve accuracy from 71% to 81%.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more.
In this guide, we'll demonstrate how to build and optimize a RAG system using practical techniques that helped achieve significant performance gains:
- End-to-End Accuracy: 71% → 81%
- Mean Reciprocal Rank (MRR): 0.74 → 0.87
- F1 Score: 0.52 → 0.54
## Prerequisites and Setup

Before we begin, you'll need API keys for Anthropic and Voyage AI, plus a few Python libraries:
```python
# Install required libraries
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```

```python
# Import libraries
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize clients
client = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyageai-key")
```
## Level 1: Building a Basic RAG System

Let's start with a basic RAG pipeline, sometimes called "Naive RAG". This approach consists of four fundamental steps:
### 1. Document Chunking

Split documents by heading so that each chunk contains only the content under a single subheading. This preserves semantic boundaries and improves retrieval quality.
```python
def chunk_by_heading(document_text):
    """
    Simple chunking function that splits documents by headings.
    """
    chunks = []
    current_chunk = ""
    for line in document_text.split('\n'):
        if line.startswith('## '):  # Heading detection
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = line + "\n"
        else:
            current_chunk += line + "\n"
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
```
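To make the chunking behavior concrete, here's a quick check on a small made-up markdown document:

```python
# Hypothetical two-section document for illustration
sample_doc = """## Returns Policy
Items can be returned within 30 days of purchase.

## Shipping
Standard shipping takes 3-5 business days."""

chunks = chunk_by_heading(sample_doc)
print(len(chunks))  # 2
print(chunks[0])    # "## Returns Policy\nItems can be returned within 30 days of purchase."
```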
### 2. Embedding Generation

Use Voyage AI to generate embeddings for each document chunk:
```python
def embed_chunks(chunks):
    """
    Generate embeddings for document chunks.
    """
    embeddings = vo.embed(
        chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return embeddings
```
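One practical note before indexing a large corpus: embedding APIs limit how many texts you can send per request, so it's worth batching. A minimal sketch follows; the batch size of 64 is a conservative, illustrative choice, not a documented Voyage AI limit:

```python
def embed_chunks_batched(chunks, batch_size=64):
    """
    Embed chunks in batches to stay under per-request API limits.
    The batch size here is an illustrative, conservative choice.
    """
    embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        embeddings.extend(
            vo.embed(batch, model="voyage-2", input_type="document").embeddings
        )
    return embeddings
```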
### 3. Retrieval with Cosine Similarity

Store the chunk embeddings in a simple vector store and retrieve the chunks whose embeddings are most similar to the query embedding:
```python
class InMemoryVectorDB:
    """
    Simple in-memory vector database for demonstration.
    """
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add_documents(self, chunks, embeddings):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding, k=3):
        """
        Retrieve top-k most similar chunks using cosine similarity.
        """
        similarities = cosine_similarity(
            [query_embedding],
            self.embeddings
        )[0]
        # Get indices of top-k results, most similar first
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.chunks[i] for i in top_indices]
```
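Putting the first three steps together, here is a minimal wiring example; it reuses the hypothetical `sample_doc` from the chunking example above:

```python
# Chunk, embed, and index a document
chunks = chunk_by_heading(sample_doc)
embeddings = embed_chunks(chunks)

vector_db = InMemoryVectorDB()
vector_db.add_documents(chunks, embeddings)

# Embed a query and retrieve the single best-matching chunk
query_embedding = vo.embed(
    ["How long do I have to return an item?"],
    model="voyage-2",
    input_type="query"
).embeddings[0]
print(vector_db.search(query_embedding, k=1))
```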
### 4. Query Processing

Finally, tie the pieces together: embed the query, retrieve the most relevant chunks, and have Claude answer using them as context:
```python
def basic_rag_query(query, vector_db):
    """
    Complete RAG pipeline for a single query.
    """
    # Embed the query
    query_embedding = vo.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    # Retrieve relevant chunks
    retrieved_chunks = vector_db.search(query_embedding, k=3)

    # Build context
    context = "\n\n".join(retrieved_chunks)

    # Generate response with Claude
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.
Context:
{context}

Question: {query}"""
        }]
    )
    return response.content[0].text, retrieved_chunks
```
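With everything wired up, a single call runs the whole pipeline (the query continues the hypothetical example):

```python
answer, sources = basic_rag_query("How long do I have to return an item?", vector_db)
print(answer)
print(f"Answer was grounded in {len(sources)} retrieved chunks")
```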
## Building a Robust Evaluation System

When evaluating RAG applications, it's critical to measure the retrieval system and the end-to-end system separately. We'll use five key metrics:
### Retrieval Metrics

#### 1. Precision

Precision represents the proportion of retrieved chunks that are actually relevant.
```python
def calculate_precision(retrieved_chunks, correct_chunks):
    """
    Calculate precision: TP / total retrieved.
    """
    retrieved_set = set(retrieved_chunks)
    correct_set = set(correct_chunks)
    true_positives = len(retrieved_set.intersection(correct_set))
    total_retrieved = len(retrieved_set)
    return true_positives / total_retrieved if total_retrieved > 0 else 0
```
#### 2. Recall

Recall measures the completeness of our retrieval system: the proportion of relevant chunks that were actually retrieved.
```python
def calculate_recall(retrieved_chunks, correct_chunks):
    """
    Calculate recall: TP / total correct.
    """
    retrieved_set = set(retrieved_chunks)
    correct_set = set(correct_chunks)
    true_positives = len(retrieved_set.intersection(correct_set))
    total_correct = len(correct_set)
    return true_positives / total_correct if total_correct > 0 else 0
```
#### 3. F1 Score

F1 Score is the harmonic mean of precision and recall.
```python
def calculate_f1(precision, recall):
    """
    Calculate F1 score: 2 * (precision * recall) / (precision + recall).
    """
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)
```
#### 4. Mean Reciprocal Rank (MRR)

MRR measures how high the first relevant document appears in the results.
```python
def calculate_mrr(retrieved_chunks, correct_chunks):
    """
    Calculate MRR: 1 / rank of the first relevant document.
    """
    for rank, chunk in enumerate(retrieved_chunks, 1):
        if chunk in correct_chunks:
            return 1 / rank
    return 0
```
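With the four retrieval metrics defined, averaging them over an evaluation set is mechanical. Here's a minimal sketch; the `eval_set` format (a list of dicts pairing a query with its known-correct chunks) is an assumption for illustration, not a fixed schema:

```python
def evaluate_retrieval(eval_set, vector_db, k=3):
    """
    Average precision, recall, F1, and MRR over an evaluation set.
    Assumes eval_set is a list of {"query": str, "correct_chunks": list[str]}.
    """
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "mrr": 0.0}
    for example in eval_set:
        query_embedding = vo.embed(
            [example["query"]], model="voyage-2", input_type="query"
        ).embeddings[0]
        retrieved = vector_db.search(query_embedding, k=k)

        precision = calculate_precision(retrieved, example["correct_chunks"])
        recall = calculate_recall(retrieved, example["correct_chunks"])
        totals["precision"] += precision
        totals["recall"] += recall
        totals["f1"] += calculate_f1(precision, recall)
        totals["mrr"] += calculate_mrr(retrieved, example["correct_chunks"])

    return {metric: value / len(eval_set) for metric, value in totals.items()}
```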
### End-to-End Accuracy
This measures whether Claude provides the correct answer based on retrieved context.
```python
def evaluate_end_to_end(query, retrieved_chunks, expected_answer):
    """
    Evaluate whether Claude generates the correct answer.
    """
    context = "\n\n".join(retrieved_chunks)
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on the context, answer the question.
Just answer the question, no additional commentary.
Context:
{context}

Question: {query}"""
        }]
    )
    generated_answer = response.content[0].text.strip()
    # Exact string comparison is brittle; see the LLM-graded alternative below
    return generated_answer.lower() == expected_answer.lower()
```
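Exact string matching penalizes correct answers that are merely phrased differently. A common alternative is to let a model grade the answer instead. Here's a minimal sketch of that idea; the grading prompt and the choice of claude-3-haiku-20240307 as grader are illustrative, not prescriptive:

```python
def grade_answer_with_claude(query, generated_answer, expected_answer):
    """
    Ask Claude whether the generated answer matches the expected one.
    Returns True if the grader replies CORRECT.
    """
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"""Question: {query}
Expected answer: {expected_answer}
Generated answer: {generated_answer}

Does the generated answer convey the same information as the expected answer?
Reply with exactly one word: CORRECT or INCORRECT."""
        }]
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```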
## Level 2: Summary Indexing
Summary indexing creates concise summaries of document chunks that can be searched first, reducing token usage and improving retrieval quality.
```python
def create_chunk_summary(chunk):
    """
    Use Claude to create a concise summary of a document chunk.
    """
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"Create a concise summary of this document chunk:\n\n{chunk}"
        }]
    )
    return response.content[0].text
```
```python
class SummaryVectorDB(InMemoryVectorDB):
    """
    Enhanced vector DB with summary indexing.
    """
    def __init__(self):
        super().__init__()
        self.summaries = []
        self.summary_embeddings = []

    def add_documents(self, chunks, embeddings):
        super().add_documents(chunks, embeddings)
        # Summarize each new chunk with Claude
        for chunk in chunks:
            summary = create_chunk_summary(chunk)
            self.summaries.append(summary)
        # Re-embed all summaries (fine for a demo; batch and cache in production)
        self.summary_embeddings = vo.embed(
            self.summaries,
            model="voyage-2",
            input_type="document"
        ).embeddings

    def search_with_summaries(self, query_embedding, k=3):
        """
        Search in summary space, then return the corresponding full chunks.
        """
        similarities = cosine_similarity(
            [query_embedding],
            self.summary_embeddings
        )[0]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.chunks[i] for i in top_indices]
```
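Swapping the summary index into the pipeline only changes how documents are ingested and searched. Continuing the hypothetical example from Level 1:

```python
summary_db = SummaryVectorDB()
summary_db.add_documents(chunks, embeddings)  # summaries are generated on ingest

# Same query embedding as before, now matched against summaries
print(summary_db.search_with_summaries(query_embedding, k=1))
```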
## Level 3: Summary Indexing with Re-Ranking
Re-ranking uses Claude to re-order retrieved documents based on their relevance to the specific query.
```python
def rerank_with_claude(query, retrieved_chunks):
    """
    Use Claude to re-rank retrieved chunks by relevance.
    """
    chunk_list = "\n".join([
        f"{i+1}. {chunk[:200]}..." for i, chunk in enumerate(retrieved_chunks)
    ])
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Re-rank these document chunks by relevance to this query: '{query}'
Return only the numbers in order of relevance, most relevant first, one per line.
Chunks:
{chunk_list}"""
        }]
    )
    # Parse Claude's response to get re-ranked indices.
    # This is a simplified version - you'd want more robust parsing.
    ranked_indices = []
    for line in response.content[0].text.split('\n'):
        if line.strip().isdigit():
            idx = int(line.strip()) - 1
            if 0 <= idx < len(retrieved_chunks) and idx not in ranked_indices:
                ranked_indices.append(idx)
    # Fall back to the original order for anything Claude didn't rank
    for idx in range(len(retrieved_chunks)):
        if idx not in ranked_indices:
            ranked_indices.append(idx)
    return [retrieved_chunks[i] for i in ranked_indices]
```
```python
def advanced_rag_query(query, vector_db):
    """
    Complete RAG pipeline with summary indexing and re-ranking.
    """
    # Embed the query
    query_embedding = vo.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    # Retrieve a wider candidate set using summaries
    retrieved_chunks = vector_db.search_with_summaries(query_embedding, k=5)

    # Re-rank with Claude and keep the top 3
    reranked_chunks = rerank_with_claude(query, retrieved_chunks)
    final_chunks = reranked_chunks[:3]

    # Generate response
    context = "\n\n".join(final_chunks)
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text, final_chunks
```
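Since both pipelines share the same call signature, a side-by-side check on a single query is straightforward (continuing the running hypothetical example):

```python
question = "How long do I have to return an item?"

basic_answer, _ = basic_rag_query(question, vector_db)
advanced_answer, _ = advanced_rag_query(question, summary_db)

print("Basic RAG:   ", basic_answer)
print("Advanced RAG:", advanced_answer)
```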
## Performance Comparison
Through these targeted improvements, we achieved significant performance gains:
| Metric | Basic RAG | With Summary Indexing & Re-ranking | Relative Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |
## Key Takeaways
- Separate Your Evaluations: Always evaluate retrieval performance and end-to-end accuracy independently. This helps identify whether issues are in retrieval or generation.
- MRR Matters: Mean Reciprocal Rank (MRR) is particularly important for RAG systems because having the most relevant document appear first significantly improves answer quality.
- Summary Indexing Reduces Noise: Creating concise summaries of document chunks and searching in summary space first can improve retrieval quality while reducing token usage.
- Re-Ranking Adds Precision: Using Claude to re-rank initially retrieved documents based on query-specific relevance can significantly improve the quality of context provided to the final generation step.
- Start Simple, Then Optimize: Begin with a basic RAG implementation, establish your evaluation baseline, then systematically implement optimizations like summary indexing and re-ranking while measuring their impact.