
Building Better RAG Systems with Claude: A Practical Guide to Evaluation and Optimization

Learn how to build, evaluate, and optimize Retrieval Augmented Generation systems with Claude AI. This guide covers retrieval metrics, summary indexing, and re-ranking techniques to improve accuracy.

Quick Answer

This guide teaches you how to build and optimize RAG systems with Claude, covering a basic implementation, comprehensive evaluation metrics, and advanced techniques such as summary indexing and re-ranking that raise end-to-end accuracy from 71% to 81%.

Tags: RAG, Claude API, Evaluation, Vector Search, Document Retrieval


Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.

Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more.

In this guide, we'll demonstrate how to build and optimize a RAG system using practical techniques that produced the following performance gains:

  • End-to-End Accuracy: 71% → 81%
  • Mean Reciprocal Rank (MRR): 0.74 → 0.87
  • F1 Score: 0.52 → 0.54

Prerequisites and Setup

Before we begin, you'll need:

  • API Keys: Get keys from Anthropic and Voyage AI
  • Python Libraries: Install the required packages
# Install required libraries
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn

# Import libraries
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize clients
client = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyageai-key")

Level 1: Building a Basic RAG System

Let's start with a basic RAG pipeline, sometimes called 'Naive RAG'. This approach has four fundamental steps:

1. Document Chunking

Split documents at their headings so each chunk contains only the content under a single subheading. This preserves semantic boundaries and improves retrieval quality.

def chunk_by_heading(document_text):
    """
    Simple chunking function that splits documents by headings
    """
    chunks = []
    current_chunk = ""
    
    for line in document_text.split('\n'):
        if line.startswith('## '):  # Heading detection
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = line + "\n"
        else:
            current_chunk += line + "\n"
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks
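
As a quick sanity check, a document with two `## ` headings splits into two chunks (the sample text below is purely illustrative):

# Illustrative sample document with two "## " headings
sample_doc = (
    "## Refund Policy\n"
    "Refunds are issued within 30 days of purchase.\n"
    "## Shipping\n"
    "Orders ship within 2-5 business days."
)

chunks = chunk_by_heading(sample_doc)
print(len(chunks))  # 2
print(chunks[0])    # "## Refund Policy\nRefunds are issued within 30 days of purchase."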

2. Embedding Generation

Use Voyage AI to generate embeddings for each document chunk:

def embed_chunks(chunks):
    """
    Generate embeddings for document chunks
    """
    embeddings = vo.embed(
        chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return embeddings

3. Retrieval with Cosine Similarity

class InMemoryVectorDB:
    """
    Simple in-memory vector database for demonstration
    """
    def __init__(self):
        self.chunks = []
        self.embeddings = []
    
    def add_documents(self, chunks, embeddings):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)
    
    def search(self, query_embedding, k=3):
        """
        Retrieve top-k most similar chunks using cosine similarity
        """
        similarities = cosine_similarity(
            [query_embedding],
            self.embeddings
        )[0]
        
        # Get indices of top-k results
        top_indices = np.argsort(similarities)[-k:][::-1]
        
        return [self.chunks[i] for i in top_indices]

4. Query Processing

def basic_rag_query(query, vector_db):
    """
    Complete RAG pipeline for a single query
    """
    # Embed the query
    query_embedding = vo.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    
    # Retrieve relevant chunks
    retrieved_chunks = vector_db.search(query_embedding, k=3)
    
    # Build context
    context = "\n\n".join(retrieved_chunks)
    
    # Generate response with Claude
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.
            
            Context:\n{context}\n\nQuestion: {query}"""
        }]
    )
    
    return response.content[0].text, retrieved_chunks
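
To tie the four steps together, indexing and querying might look like the sketch below; `documents` stands in for your own corpus (here we reuse the illustrative `sample_doc` from above):

# Build the index from a (hypothetical) document collection
documents = [sample_doc]  # substitute your own document strings

vector_db = InMemoryVectorDB()
for doc in documents:
    doc_chunks = chunk_by_heading(doc)
    doc_embeddings = embed_chunks(doc_chunks)
    vector_db.add_documents(doc_chunks, doc_embeddings)

# Run a query through the full pipeline
answer, sources = basic_rag_query("How long do refunds take?", vector_db)
print(answer)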

Building a Robust Evaluation System

When evaluating RAG applications, it's critical to evaluate the performance of the retrieval system and end-to-end system separately. We'll use five key metrics:

Retrieval Metrics

#### 1. Precision

Precision represents the proportion of retrieved chunks that are actually relevant.

def calculate_precision(retrieved_chunks, correct_chunks):
    """
    Calculate precision: TP / Total Retrieved
    """
    retrieved_set = set(retrieved_chunks)
    correct_set = set(correct_chunks)
    
    true_positives = len(retrieved_set.intersection(correct_set))
    total_retrieved = len(retrieved_set)
    
    return true_positives / total_retrieved if total_retrieved > 0 else 0

#### 2. Recall

Recall measures the completeness of retrieval: the proportion of relevant chunks that were actually retrieved.

def calculate_recall(retrieved_chunks, correct_chunks):
    """
    Calculate recall: TP / Total Correct
    """
    retrieved_set = set(retrieved_chunks)
    correct_set = set(correct_chunks)
    
    true_positives = len(retrieved_set.intersection(correct_set))
    total_correct = len(correct_set)
    
    return true_positives / total_correct if total_correct > 0 else 0

#### 3. F1 Score

The F1 score is the harmonic mean of precision and recall.

def calculate_f1(precision, recall):
    """
    Calculate F1 score: 2 * (precision * recall) / (precision + recall)
    """
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)

#### 4. Mean Reciprocal Rank (MRR)

MRR measures how high the first relevant document appears in the results. For example, if the first relevant chunk appears at rank 2, that query's reciprocal rank is 1/2 = 0.5.

def calculate_mrr(retrieved_chunks, correct_chunks):
    """
    Calculate MRR: 1 / rank of first relevant document
    """
    for i, chunk in enumerate(retrieved_chunks, 1):
        if chunk in correct_chunks:
            return 1 / i
    return 0

End-to-End Accuracy

This measures whether Claude provides the correct answer based on retrieved context.

def evaluate_end_to_end(query, retrieved_chunks, expected_answer):
    """
    Evaluate if Claude generates the correct answer
    """
    context = "\n\n".join(retrieved_chunks)
    
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on the context, answer the question. 
            Just answer the question, no additional commentary.
            
            Context:\n{context}\n\nQuestion: {query}"""
        }]
    )
    
    generated_answer = response.content[0].text.strip()
    # Compare with expected answer (you might want more sophisticated comparison)
    return generated_answer.lower() == expected_answer.lower()
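
Reporting averages like the ones later in this guide requires a loop over a labeled eval set. Below is a minimal sketch; the item fields (`query`, `correct_chunks`, `expected_answer`) are our own naming, and you would build such a set by hand-labeling queries against your corpus:

def run_evaluation(eval_set, vector_db):
    """
    Compute averaged retrieval and end-to-end metrics over a labeled eval set.
    Each item: {"query": str, "correct_chunks": list of str, "expected_answer": str}
    """
    results = []
    for item in eval_set:
        # Retrieve with the same settings as the production pipeline (k=3)
        query_embedding = vo.embed(
            [item["query"]], model="voyage-2", input_type="query"
        ).embeddings[0]
        retrieved = vector_db.search(query_embedding, k=3)

        precision = calculate_precision(retrieved, item["correct_chunks"])
        recall = calculate_recall(retrieved, item["correct_chunks"])
        results.append({
            "precision": precision,
            "recall": recall,
            "f1": calculate_f1(precision, recall),
            "mrr": calculate_mrr(retrieved, item["correct_chunks"]),
            "accuracy": evaluate_end_to_end(
                item["query"], retrieved, item["expected_answer"]
            ),
        })

    # Average each metric across all queries
    return pd.DataFrame(results).mean()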

Level 2: Summary Indexing

Summary indexing creates concise summaries of document chunks that can be searched first, reducing token usage and improving retrieval quality.

def create_chunk_summary(chunk):
    """
    Use Claude to create a concise summary of a document chunk
    """
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"Create a concise summary of this document chunk:\n\n{chunk}"
        }]
    )
    return response.content[0].text

class SummaryVectorDB(InMemoryVectorDB):
    """
    Enhanced vector DB with summary indexing
    """
    def __init__(self):
        super().__init__()
        self.summaries = []
        self.summary_embeddings = []

    def add_documents(self, chunks, embeddings):
        super().add_documents(chunks, embeddings)
        # Create summaries
        for chunk in chunks:
            summary = create_chunk_summary(chunk)
            self.summaries.append(summary)
        # Embed summaries
        summary_embeddings = vo.embed(
            self.summaries,
            model="voyage-2",
            input_type="document"
        ).embeddings
        self.summary_embeddings = summary_embeddings

    def search_with_summaries(self, query_embedding, k=3):
        """
        Search using summaries first, then retrieve full chunks
        """
        # Search in summary space
        similarities = cosine_similarity(
            [query_embedding],
            self.summary_embeddings
        )[0]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.chunks[i] for i in top_indices]
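
Usage mirrors the basic database, with the caveat that indexing now makes one Claude call per chunk, so it is slower and consumes tokens up front (this snippet reuses `doc_chunks` and `doc_embeddings` from the earlier example):

# Index with summaries (one Claude call per chunk at indexing time)
summary_db = SummaryVectorDB()
summary_db.add_documents(doc_chunks, doc_embeddings)

# Queries are matched against the summary embeddings, not the raw chunks
query_embedding = vo.embed(
    ["How long do refunds take?"], model="voyage-2", input_type="query"
).embeddings[0]
top_chunks = summary_db.search_with_summaries(query_embedding, k=3)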

Level 3: Summary Indexing with Re-Ranking

Re-ranking uses Claude to re-order retrieved documents based on their relevance to the specific query.

def rerank_with_claude(query, retrieved_chunks):
    """
    Use Claude to re-rank retrieved chunks by relevance
    """
    chunk_list = "\n".join([
        f"{i+1}. {chunk[:200]}..." for i, chunk in enumerate(retrieved_chunks)
    ])
    
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Re-rank these document chunks by relevance to this query: '{query}'
            
            Return only the numbers in order of relevance, most relevant first.
            
            Chunks:\n{chunk_list}"""
        }]
    )
    
    # Parse Claude's response to get re-ranked indices
    # This is a simplified version - you'd want more robust parsing
    ranked_text = response.content[0].text
    ranked_indices = []
    
    for line in ranked_text.split('\n'):
        if line.strip().isdigit():
            idx = int(line.strip()) - 1
            if 0 <= idx < len(retrieved_chunks):
                ranked_indices.append(idx)
    
    # Return re-ranked chunks
    return [retrieved_chunks[i] for i in ranked_indices]
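
The parser above only matches outputs where each rank sits alone on its own line; a reply like "3, 1, 2" would yield an empty list, and any chunk Claude omits is silently dropped. One way to harden it (a sketch, not the only option) is to extract every number with a regex and append unranked chunks at the end:

import re

def parse_ranking(ranked_text, n_chunks):
    """
    Extract chunk indices from Claude's ranking output, tolerating
    formats like "3, 1, 2" or one rank per line, and keep any chunks
    the model omitted so nothing is silently dropped.
    """
    seen = []
    for match in re.findall(r"\d+", ranked_text):
        idx = int(match) - 1
        if 0 <= idx < n_chunks and idx not in seen:
            seen.append(idx)
    # Append omitted chunks in their original order
    seen.extend(i for i in range(n_chunks) if i not in seen)
    return seen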

def advanced_rag_query(query, vector_db):
    """
    Complete RAG pipeline with summary indexing and re-ranking
    """
    # Embed the query
    query_embedding = vo.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    # Retrieve using summaries
    retrieved_chunks = vector_db.search_with_summaries(query_embedding, k=5)

    # Re-rank with Claude
    reranked_chunks = rerank_with_claude(query, retrieved_chunks)

    # Take top 3 after re-ranking
    final_chunks = reranked_chunks[:3]

    # Generate response
    context = "\n\n".join(final_chunks)
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )

    return response.content[0].text, final_chunks

Performance Comparison

Through these targeted improvements, we achieved significant performance gains:

| Metric | Basic RAG | With Summary Indexing & Re-ranking | Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |

Note on Evaluations: The evaluations in this guide mirror a production evaluation system. Keep in mind they can take time to run, and you may encounter rate limits unless you're in Tier 2 and above. Consider skipping full end-to-end evaluations if you're conserving token usage.

Key Takeaways

  • Separate Your Evaluations: Always evaluate retrieval performance and end-to-end accuracy independently. This helps identify whether issues are in retrieval or generation.
  • MRR Matters: Mean Reciprocal Rank (MRR) is particularly important for RAG systems because having the most relevant document appear first significantly improves answer quality.
  • Summary Indexing Reduces Noise: Creating concise summaries of document chunks and searching in summary space first can improve retrieval quality while reducing token usage.
  • Re-Ranking Adds Precision: Using Claude to re-rank initially retrieved documents based on query-specific relevance can significantly improve the quality of context provided to the final generation step.
  • Start Simple, Then Optimize: Begin with a basic RAG implementation, establish your evaluation baseline, then systematically implement optimizations like summary indexing and re-ranking while measuring their impact.

By following this guide, you can build RAG systems that effectively leverage Claude's capabilities while ensuring they're grounded in your specific domain knowledge. Remember to continuously evaluate and iterate based on your specific use case and data characteristics.