# Building and Optimizing RAG Systems with Claude: A Practical Guide

This guide teaches you to build a RAG system using Claude, evaluate its performance with precision, recall, and accuracy metrics, and implement advanced optimizations like summary indexing and re-ranking that improve end-to-end accuracy from 71% to 81%.
Claude excels at general tasks but may struggle with domain-specific queries about your business context. Retrieval Augmented Generation (RAG) solves this by enabling Claude to access your internal knowledge bases, documents, and support materials. Enterprises use RAG to enhance customer support, analyze financial/legal documents, and answer internal questions.
In this guide, we'll walk through building a production-ready RAG system using Claude Documentation as our knowledge base, complete with evaluation frameworks and optimization techniques.
## Why RAG Matters for Claude Users
RAG bridges the gap between Claude's general knowledge and your specific domain expertise. Instead of retraining models or fine-tuning, you can dynamically retrieve relevant information from your documents and feed it to Claude as context. This approach is:
- Cost-effective: No model retraining required
- Updatable: Simply add new documents to your knowledge base
- Transparent: You can trace answers back to source materials
- Accurate: Reduces hallucinations by grounding responses in your data
## Prerequisites and Setup
Before building your RAG system, you'll need:
- API Keys: an Anthropic API key (for Claude) and a Voyage AI API key (for embeddings)
- Required Libraries:

```python
# Install required packages
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```

```python
# Import libraries
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json
from typing import List, Dict, Tuple
```

- Initialize Clients:

```python
# Initialize API clients
client = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyageai-key")
```
```python
# Simple in-memory vector database class
class VectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []
        self.metadata = []

    def add_document(self, text: str, metadata: dict = None):
        """Add document and generate embedding"""
        # The Voyage client expects a list of texts, even for a single document
        embedding = vo.embed([text], model="voyage-2").embeddings[0]
        self.documents.append(text)
        self.embeddings.append(embedding)
        self.metadata.append(metadata or {})

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float, dict]]:
        """Search for similar documents"""
        query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        indices = np.argsort(similarities)[::-1][:k]
        return [(self.documents[i], similarities[i], self.metadata[i])
                for i in indices]
```
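The core of the search step is just cosine similarity plus an `argsort`. Here is a minimal, dependency-light sketch of that ranking logic on toy 3-dimensional vectors — `cosine_rank` is an illustrative helper, not part of any library, and real embeddings would of course be much higher-dimensional:

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs, k=3):
    """Rank document vectors by cosine similarity to the query vector."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    # Sort descending and keep the indices of the top-k documents
    return np.argsort(sims)[::-1][:k]

# Toy "embeddings": doc 0 points the same way as the query, doc 2 almost does
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]
print(cosine_rank(query, docs, k=2))  # doc 0 first, then doc 2
```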
## Level 1: Building a Basic RAG Pipeline

A basic ("Naive") RAG pipeline consists of three steps:

### 1. Document Chunking

Chunk documents by logical sections (like headings) to maintain context:
```python
def chunk_by_heading(document_text: str) -> List[Dict]:
    """Simple chunking by heading sections"""
    chunks = []
    current_chunk = {"heading": "", "content": ""}
    for line in document_text.split('\n'):
        if line.startswith('#'):  # Markdown heading
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {"heading": line.strip('# '), "content": ""}
        else:
            current_chunk["content"] += line + "\n"
    if current_chunk["content"]:
        chunks.append(current_chunk)
    return chunks
```
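To see the chunker's output shape, here is a self-contained run on a toy markdown snippet (the function is repeated verbatim so the example executes on its own):

```python
from typing import List, Dict

def chunk_by_heading(document_text: str) -> List[Dict]:
    """Simple chunking by heading sections"""
    chunks = []
    current_chunk = {"heading": "", "content": ""}
    for line in document_text.split('\n'):
        if line.startswith('#'):  # Markdown heading
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {"heading": line.strip('# '), "content": ""}
        else:
            current_chunk["content"] += line + "\n"
    if current_chunk["content"]:
        chunks.append(current_chunk)
    return chunks

sample = "# Rate Limits\nRequests are capped per minute.\n# Errors\n429 means you were throttled."
chunks = chunk_by_heading(sample)
print([c["heading"] for c in chunks])  # ['Rate Limits', 'Errors']
```

Each chunk carries its heading alongside its body text, which is what lets the indexing step below prepend the heading for extra context.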
### 2. Embedding Generation
Generate embeddings for each chunk using Voyage AI:
```python
def index_documents(documents: List[Dict]) -> VectorDB:
    """Index documents in vector database"""
    db = VectorDB()
    for doc in documents:
        # Combine heading and content for better context
        text = f"{doc['heading']}\n{doc['content']}"
        db.add_document(text, {"heading": doc["heading"]})
    return db
```
### 3. Query and Response Generation
Retrieve relevant chunks and generate answers with Claude:
```python
def basic_rag_query(db: VectorDB, query: str) -> str:
    """Execute RAG query with Claude"""
    # Retrieve relevant chunks
    results = db.search(query, k=3)

    # Build context
    context = "\n\n".join([f"## {res[2].get('heading', '')}\n{res[0]}"
                           for res in results])

    # Generate response with Claude
    prompt = f"""You are a helpful assistant answering questions based on the provided context.

Context:
{context}

Question: {query}

Answer based only on the context provided. If the answer isn't in the context, say "I don't have enough information to answer that question."

Answer:"""

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
## Building an Evaluation System
Moving beyond "vibes-based" evaluation is crucial for production RAG systems. We need to measure both retrieval performance and end-to-end accuracy.
### Creating an Evaluation Dataset
Create a synthetic dataset with:
- Questions
- Relevant document chunks (expected retrieval)
- Correct answers
```python
# Example evaluation dataset structure
evaluation_data = [
    {
        "question": "How do I set up API rate limits?",
        "relevant_chunks": ["chunk_id_1", "chunk_id_2"],
        "correct_answer": "API rate limits are configured in the dashboard..."
    }
    # ... 99 more samples
]
```
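With a dataset in this shape, the scoring loop is a straightforward comparison of retrieved chunk IDs against the expected ones. Here is a self-contained sketch using hard-coded mock retrieval results in place of real vector-DB output; the questions and chunk IDs are illustrative:

```python
evaluation_data = [
    {"question": "How do I set up API rate limits?",
     "relevant_chunks": ["chunk_1", "chunk_2"]},
    {"question": "How do I rotate my API key?",
     "relevant_chunks": ["chunk_7"]},
]
# Stand-ins for db.search() output: the chunk IDs each query actually retrieved
mock_retrievals = [["chunk_1", "chunk_9", "chunk_2"], ["chunk_3", "chunk_7", "chunk_8"]]

precisions, recalls = [], []
for sample, retrieved in zip(evaluation_data, mock_retrievals):
    relevant = set(sample["relevant_chunks"])
    hits = len(set(retrieved) & relevant)       # true positives
    precisions.append(hits / len(retrieved))    # how much of what we fetched was useful
    recalls.append(hits / len(relevant))        # how much of what we needed was fetched

print(round(sum(precisions) / len(precisions), 3))  # 0.5
print(round(sum(recalls) / len(recalls), 3))        # 1.0
```

In a real run, `mock_retrievals` would come from calling `db.search()` on each question and collecting the chunk IDs of the results.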
### Key Evaluation Metrics

#### Retrieval Metrics:

- Precision: Proportion of retrieved chunks that are relevant

```python
def calculate_precision(retrieved: List[str], relevant: List[str]) -> float:
    relevant_set = set(relevant)
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set.intersection(relevant_set))
    return true_positives / len(retrieved_set) if retrieved_set else 0
```

- Recall: Proportion of relevant chunks that were retrieved

```python
def calculate_recall(retrieved: List[str], relevant: List[str]) -> float:
    relevant_set = set(relevant)
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set.intersection(relevant_set))
    return true_positives / len(relevant_set) if relevant_set else 0
```
- F1 Score: Harmonic mean of precision and recall
- Mean Reciprocal Rank (MRR): Measures how high the first relevant result appears
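The last two metrics can be sketched in a few lines. `f1_score` and `mean_reciprocal_rank` are illustrative helpers that mirror the standard definitions: F1 is the harmonic mean of precision and recall, and MRR averages 1/rank of the first relevant result across queries:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(retrieved_lists, relevant_lists) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 if none found)."""
    scores = []
    for retrieved, relevant in zip(retrieved_lists, relevant_lists):
        relevant_set = set(relevant)
        rr = 0.0
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant_set:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores)

print(f1_score(0.5, 1.0))  # ≈ 0.667
# First query: first relevant hit at rank 2 → 0.5; second query: rank 1 → 1.0
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [["b"], ["x"]]))  # 0.75
```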
## Level 2: Summary Indexing
Basic RAG can miss broader context. Summary indexing adds hierarchical structure:
```python
def create_summary_index(documents: List[Dict]) -> Tuple[VectorDB, VectorDB]:
    """Create two-level index: summaries and detailed chunks"""
    summary_db = VectorDB()
    detail_db = VectorDB()
    for doc in documents:
        # Create summary using Claude
        summary_prompt = f"Summarize this document section in 2-3 sentences:\n\n{doc['content']}"
        summary_response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        summary = summary_response.content[0].text

        # Index summary (assumes each doc dict carries a unique "id" field)
        summary_db.add_document(
            f"{doc['heading']}\n{summary}",
            {"type": "summary", "doc_id": doc["id"]}
        )

        # Index detailed content
        detail_db.add_document(
            f"{doc['heading']}\n{doc['content']}",
            {"type": "detail", "doc_id": doc["id"]}
        )
    return summary_db, detail_db
```
Two-stage retrieval process:
- First search the summary index to identify relevant topics
- Then retrieve detailed chunks from those topics
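The two-stage flow can be sketched without a real vector store. Assume the summary search has already returned a ranked list of `doc_id`s; the second stage then restricts detail retrieval to chunks belonging to those documents. `two_stage_filter`, `summary_hits`, and `detail_chunks` are hypothetical stand-ins for the two databases' search results:

```python
from typing import List, Dict

def two_stage_filter(summary_hits: List[str], detail_chunks: List[Dict],
                     top_docs: int = 2) -> List[Dict]:
    """Keep only detail chunks whose doc_id appears in the top summary hits."""
    allowed = set(summary_hits[:top_docs])
    return [chunk for chunk in detail_chunks if chunk["doc_id"] in allowed]

summary_hits = ["doc_a", "doc_c", "doc_b"]  # ranked doc_ids from the summary index
detail_chunks = [
    {"doc_id": "doc_a", "text": "rate limit details"},
    {"doc_id": "doc_b", "text": "auth details"},
    {"doc_id": "doc_c", "text": "error code details"},
]
filtered = two_stage_filter(summary_hits, detail_chunks)
print([c["doc_id"] for c in filtered])  # ['doc_a', 'doc_c']
```

In a full implementation, the detail-stage search would still rank the surviving chunks by similarity to the query; the filter only narrows the candidate pool.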
## Level 3: Summary Indexing with Re-Ranking
Add a re-ranking step using Claude to improve result ordering:
```python
def rerank_with_claude(query: str, candidates: List[Tuple[str, float, dict]]) -> List[Tuple[str, float, dict]]:
    """Use Claude to re-rank retrieved documents"""
    if len(candidates) <= 1:
        return candidates

    # Prepare documents for ranking
    docs_text = "\n\n".join([
        f"[Document {i+1}]\n{candidates[i][0]}"
        for i in range(len(candidates))
    ])

    ranking_prompt = f"""Rank these documents by relevance to the query.

Query: {query}

Documents:
{docs_text}

Return ONLY a comma-separated list of document numbers in order of relevance (most relevant first).
Example: 3,1,2"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": ranking_prompt}]
    )

    # Parse ranking and reorder
    ranking = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [candidates[i] for i in ranking]
```
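Even with a "return ONLY the numbers" instruction, the model's reply may contain stray text, duplicates, or out-of-range numbers, and a naive `int(x)` parse will then crash. A defensive parser is cheap insurance; `parse_ranking` below is a hypothetical helper, not part of any API:

```python
import re
from typing import List

def parse_ranking(reply: str, n_docs: int) -> List[int]:
    """Extract 0-based document indices from a model reply like '3, 1, 2'.

    Ignores out-of-range or duplicate numbers and appends any documents
    the model omitted, so the result is always a permutation of range(n_docs).
    """
    order = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1  # model speaks in 1-based document numbers
        if 0 <= idx < n_docs and idx not in order:
            order.append(idx)
    # Fall back to original order for anything the model left out
    order.extend(i for i in range(n_docs) if i not in order)
    return order

print(parse_ranking("Ranking: 3, 1, 2", 3))  # [2, 0, 1]
print(parse_ranking("2", 3))                 # [1, 0, 2]
```

Swapping this in for the bare `int(x.strip())` parse keeps a malformed reply from taking down the whole query.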
## Performance Improvements
Through these optimizations, we achieved significant gains:
| Metric | Basic RAG | Optimized RAG | Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |
## Production Considerations
- Vector Database Choice: For production, use hosted solutions like Pinecone, Weaviate, or pgvector instead of in-memory storage.
- Chunking Strategy: Experiment with different chunk sizes (200-1000 tokens) and overlap strategies.
- Embedding Models: Test different models (Voyage, OpenAI, Cohere) for your specific domain.
- Hybrid Search: Combine semantic search with keyword matching for better recall.
- Caching: Cache embeddings and frequent queries to reduce costs and latency.
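For embedding caching specifically, even a minimal in-process memoization layer avoids re-embedding identical texts. A sketch, where `embed_fn` is a stand-in for the real embedding API call:

```python
class EmbeddingCache:
    """Memoize embeddings so repeated texts don't trigger repeat API calls."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # stand-in for the real embedding API call
        self.store = {}
        self.misses = 0

    def get(self, text: str):
        if text not in self.store:
            self.misses += 1
            self.store[text] = self.embed_fn(text)
        return self.store[text]

# Fake embedder for demonstration; real code would call the embeddings API
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("hello")
cache.get("hello")   # second call is served from the cache
print(cache.misses)  # 1
```

In production you would back the dictionary with a persistent store (e.g. Redis or a database table keyed by a hash of the text) so the cache survives restarts.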
## Key Takeaways
- Start with Basic RAG: Implement a simple pipeline first (chunk → embed → retrieve) before adding complexity.
- Measure Systematically: Use precision, recall, F1, MRR, and end-to-end accuracy metrics—don't rely on subjective evaluation.
- Optimize Retrieval First: Poor retrieval can't be fixed by the LLM. Focus on getting the right documents before improving answer generation.
- Add Hierarchy with Summaries: Summary indexing helps Claude understand broader context and improves retrieval of related information.
- Re-rank with Claude: Use Claude's understanding of relevance to improve the order of retrieved documents, significantly boosting MRR and final answer quality.