2026-04-19

Building and Optimizing RAG Systems with Claude: A Practical Guide

Learn how to implement and optimize Retrieval Augmented Generation (RAG) with Claude AI. This guide covers basic setup, evaluation metrics, and advanced techniques to improve accuracy from 71% to 81%.

Quick Answer

This guide teaches you to build a Claude RAG system using Voyage AI embeddings, create robust evaluations, and implement advanced techniques like summary indexing and re-ranking to improve answer accuracy from 71% to 81%.

Tags: RAG · Claude API · Vector Databases · Evaluation · Embeddings


Claude excels at general tasks but may struggle with domain-specific queries about your business context. Retrieval Augmented Generation (RAG) solves this by enabling Claude to access your internal knowledge bases, documents, and support materials. Enterprises use RAG applications for customer support, internal Q&A systems, financial analysis, legal research, and more.

In this guide, we'll walk through building and optimizing a RAG system using Claude Documentation as our knowledge base. We'll demonstrate how to achieve measurable performance improvements—increasing end-to-end accuracy from 71% to 81% through targeted optimizations.

Prerequisites and Setup

Before we begin, ensure you have the following:

  • API Keys: Obtain keys from Anthropic and Voyage AI
  • Python Libraries: Install required packages
# Install required libraries
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn

# Import libraries
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json
import matplotlib.pyplot as plt
  • Initialize Clients:
# Initialize API clients
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGEAI_KEY")

Level 1: Building a Basic RAG System

A basic RAG pipeline (sometimes called "Naive RAG") consists of three core steps:

1. Document Chunking

Divide your documents into manageable pieces. For documentation, chunking by heading works well:

def chunk_by_heading(document_text):
    """
    Simple chunking function that splits documents by headings
    """
    chunks = []
    current_chunk = ""
    
    for line in document_text.split('\n'):
        if line.startswith('#'):  # Markdown heading
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = line + "\n"
        else:
            current_chunk += line + "\n"
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

2. Embedding Generation

Convert chunks into vector embeddings using Voyage AI:

def embed_chunks(chunks):
    """
    Generate embeddings for document chunks
    """
    # Voyage AI embeddings are optimized for retrieval
    result = vo.embed(
        texts=chunks,
        model="voyage-2",
        input_type="document"
    )
    return result.embeddings

3. Retrieval with Cosine Similarity

When a query comes in, embed it and find the most similar chunks:

class InMemoryVectorDB:
    """
    Simple in-memory vector database for demonstration
    For production, consider hosted solutions like Pinecone or Weaviate
    """
    def __init__(self):
        self.chunks = []
        self.embeddings = []
    
    def add_documents(self, chunks, embeddings):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)
    
    def search(self, query_embedding, k=3):
        """
        Retrieve top k most similar chunks using cosine similarity
        """
        similarities = cosine_similarity(
            [query_embedding], 
            self.embeddings
        )[0]
        
        # Get indices of top k most similar chunks
        top_indices = np.argsort(similarities)[-k:][::-1]
        
        return [self.chunks[i] for i in top_indices]

# Initialize and populate the vector database
db = InMemoryVectorDB()
chunks = chunk_by_heading(your_document_text)
embeddings = embed_chunks(chunks)
db.add_documents(chunks, embeddings)

4. Query Processing

Combine retrieval with Claude for final answers:

def query_rag_system(query, db, k=3):
    """
    Complete RAG query pipeline
    """
    # Embed the query
    query_embedding = vo.embed(
        [query], 
        model="voyage-2", 
        input_type="query"
    ).embeddings[0]
    
    # Retrieve relevant chunks
    relevant_chunks = db.search(query_embedding, k=k)
    
    # Build context for Claude
    context = "\n\n".join(relevant_chunks)
    
    # Query Claude with retrieved context
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.
            
            Context:
            {context}
            
            Question: {query}
            
            Answer:"""
        }]
    )
    
    return response.content[0].text, relevant_chunks

Building a Robust Evaluation System

Moving beyond "vibes-based" evaluation is crucial for production RAG systems. We need to measure both retrieval performance and end-to-end accuracy independently.

Creating an Evaluation Dataset

For this guide, we use a synthetic dataset of 100 samples containing:

  • Questions
  • Relevant document chunks (ground truth for retrieval)
  • Correct answers (ground truth for end-to-end evaluation)
# Load evaluation dataset
with open('evaluation/docs_evaluation_dataset.json', 'r') as f:
    eval_dataset = json.load(f)

# Preview a sample
sample = eval_dataset[0]
print(f"Question: {sample['question']}")
print(f"Relevant chunks: {len(sample['relevant_chunks'])}")
print(f"Correct answer: {sample['correct_answer'][:100]}...")

Key Evaluation Metrics

#### Retrieval Metrics

  • Precision: Proportion of retrieved chunks that are actually relevant
def calculate_precision(retrieved_chunks, relevant_chunks):
       true_positives = len(set(retrieved_chunks) & set(relevant_chunks))
       total_retrieved = len(retrieved_chunks)
       return true_positives / total_retrieved if total_retrieved > 0 else 0
  • Recall: Proportion of all relevant chunks that were retrieved
def calculate_recall(retrieved_chunks, relevant_chunks):
       true_positives = len(set(retrieved_chunks) & set(relevant_chunks))
       total_relevant = len(relevant_chunks)
       return true_positives / total_relevant if total_relevant > 0 else 0
  • F1 Score: Harmonic mean of precision and recall
def calculate_f1(precision, recall):
       if precision + recall == 0:
           return 0
       return 2 * (precision * recall) / (precision + recall)
  • Mean Reciprocal Rank (MRR): Measures how high the first relevant result appears
def calculate_mrr(retrieved_chunks, relevant_chunks):
       for i, chunk in enumerate(retrieved_chunks, 1):
           if chunk in relevant_chunks:
               return 1 / i
       return 0
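The four metrics above can be combined into a single pass over the evaluation set. A minimal sketch, assuming each sample dict carries the `question` and `relevant_chunks` keys shown earlier, and taking any `retrieve` callable (e.g. a thin wrapper around `db.search`) as an argument:

```python
def evaluate_retrieval(samples, retrieve, k=3):
    """
    Average precision, recall, F1, and MRR over an evaluation set.
    `retrieve` maps a question string to a list of retrieved chunks.
    """
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "mrr": 0.0}
    for sample in samples:
        retrieved = retrieve(sample["question"])[:k]
        relevant = set(sample["relevant_chunks"])
        tp = len(set(retrieved) & relevant)
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Reciprocal rank of the first relevant hit, 0 if none retrieved
        mrr = next((1 / rank for rank, chunk in enumerate(retrieved, 1)
                    if chunk in relevant), 0.0)
        for name, value in (("precision", precision), ("recall", recall),
                            ("f1", f1), ("mrr", mrr)):
            totals[name] += value
    n = len(samples)
    return {name: total / n for name, total in totals.items()}
```

Running this before and after each optimization is what makes the improvement numbers later in this guide comparable.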

#### End-to-End Accuracy

Measure whether Claude's final answer matches the ground truth:

def evaluate_answer(claude_answer, correct_answer):
    """
    Simple evaluation - for production, consider more sophisticated methods
    like using Claude to evaluate answer quality
    """
    # This is a simplified version
    claude_lower = claude_answer.lower()
    correct_lower = correct_answer.lower()
    
    # Check for key information presence
    key_terms = extract_key_terms(correct_answer)
    matches = sum(1 for term in key_terms if term in claude_lower)
    
    return matches / len(key_terms) if key_terms else 0
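`extract_key_terms` is left undefined above. One minimal stand-in (an illustrative assumption, not the guide's actual helper) keeps distinct lower-cased words above a length threshold, minus common stopwords:

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for",
             "with", "that", "this", "on", "as", "are", "be", "it"}

def extract_key_terms(text, min_length=4):
    """
    Naive key-term extraction: distinct lower-cased alphabetic words
    of at least `min_length` characters, excluding common stopwords.
    """
    words = re.findall(r"[a-zA-Z]+", text.lower())
    terms = []
    for word in words:
        if len(word) >= min_length and word not in STOPWORDS and word not in terms:
            terms.append(word)
    return terms
```

For production, an LLM-as-judge comparison is usually more reliable than term overlap.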

Level 2: Summary Indexing

Basic RAG struggles with queries requiring synthesis across multiple chunks. Summary indexing helps by creating hierarchical representations:

def create_summary_index(chunks):
    """
    Create summary embeddings for groups of related chunks
    """
    # Group chunks by topic/section
    grouped_chunks = group_by_topic(chunks)
    
    summaries = []
    for group in grouped_chunks:
        # Join outside the f-string: backslashes inside f-string
        # expressions are a SyntaxError before Python 3.12
        group_text = "\n\n".join(group)
        # Use Claude to create a summary of the group
        summary = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Summarize the following documents:\n\n{group_text}"
            }]
        ).content[0].text
        
        summaries.append({
            "summary": summary,
            "chunks": group,
            "embedding": vo.embed([summary], model="voyage-2").embeddings[0]
        })
    
    return summaries

How it works:
  • First, retrieve relevant summaries
  • Then, retrieve chunks from the most relevant summary groups
  • This provides better context for multi-chunk queries
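The `group_by_topic` helper used above is also left undefined. A simple stand-in (an assumption for illustration) groups consecutive chunks, starting a new group at each top-level heading or when a group grows too large:

```python
def group_by_topic(chunks, max_group_size=5):
    """
    Group consecutive chunks, starting a new group at each top-level
    (single '#') markdown heading or when a group hits max_group_size.
    """
    groups = []
    current = []
    for chunk in chunks:
        first_line = chunk.split("\n", 1)[0]
        # "# Title" is top-level; "## Subtitle" is not
        is_top_level = (first_line.startswith("# ")
                        and not first_line.startswith("##"))
        if current and (is_top_level or len(current) >= max_group_size):
            groups.append(current)
            current = []
        current.append(chunk)
    if current:
        groups.append(current)
    return groups
```

In a real system you might instead cluster chunk embeddings, but heading structure is often good enough for documentation.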

Level 3: Summary Indexing with Re-Ranking

Add a re-ranking step using Claude to improve retrieval quality:

def rerank_with_claude(query, retrieved_chunks):
    """
    Use Claude to re-rank retrieved chunks by relevance
    """
    chunk_list = "\n\n".join([
        f"Chunk {i+1}: {chunk[:200]}..." 
        for i, chunk in enumerate(retrieved_chunks)
    ])
    
    prompt = f"""Rank these document chunks by relevance to the query.
    
    Query: {query}
    
    Chunks:
    {chunk_list}
    
    Return only the numbers in order of relevance (most relevant first)."""
    
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse Claude's ranking
    ranked_indices = parse_ranking(response.content[0].text)
    
    # Reorder chunks based on Claude's ranking
    return [retrieved_chunks[i] for i in ranked_indices]
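`parse_ranking` is not defined above; a simple stand-in (an assumption) pulls the numbers out of Claude's free-text reply in order of appearance and converts them to de-duplicated, zero-based indices:

```python
import re

def parse_ranking(ranking_text, num_chunks=None):
    """
    Extract chunk numbers from a free-text ranking like "3, 1, 2".
    Returns zero-based indices, de-duplicated, in order of appearance.
    """
    indices = []
    for match in re.findall(r"\d+", ranking_text):
        idx = int(match) - 1  # the prompt numbers chunks from 1
        if idx >= 0 and idx not in indices:
            if num_chunks is None or idx < num_chunks:
                indices.append(idx)
    return indices
```

Defensive parsing matters here: the model may echo words like "Ranking:" or repeat a number, and a bad index would raise an exception when reordering chunks.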

Performance Improvements

Through these optimizations, we achieved significant gains:

| Metric | Basic RAG | Optimized RAG | Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |

Note on Evaluations: Running full evaluations can be time-consuming and may hit rate limits unless you're in usage tier 2 or above. Consider running evaluations on a subset if you need to conserve tokens.

Production Considerations

  • Vector Database Selection: For production, use hosted solutions like Pinecone, Weaviate, or pgvector
  • Chunking Strategy: Experiment with different chunk sizes and overlap based on your content
  • Embedding Models: Voyage AI works well, but also consider OpenAI, Cohere, or open-source alternatives
  • Caching: Implement caching for frequent queries to reduce costs and latency
  • Monitoring: Track retrieval metrics and answer quality in production
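As a concrete example of the caching point, a hypothetical wrapper (names are illustrative, not a real library API) memoizes query embeddings so repeated questions skip the embedding call entirely:

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=1024):
    """
    Wrap an embedding function with an LRU cache. `embed_fn` stands in
    for a call like vo.embed; the result is returned as a tuple because
    lru_cache requires hashable values for safe reuse.
    """
    @lru_cache(maxsize=maxsize)
    def cached(text):
        return tuple(embed_fn(text))
    return cached
```

The same idea extends to caching full (query, answer) pairs for exact-repeat questions, which saves both embedding and generation costs.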

Key Takeaways

  • Start with Basic RAG: Implement a simple pipeline with document chunking, embedding, and cosine similarity retrieval before adding complexity.
  • Build Robust Evaluations: Move beyond subjective assessment by measuring precision, recall, F1, MRR, and end-to-end accuracy with a proper evaluation dataset.
  • Use Summary Indexing for Complex Queries: When questions require synthesis across multiple documents, hierarchical summary indexing significantly improves retrieval quality.
  • Implement Re-Ranking: Add a Claude-powered re-ranking step to refine retrieval results, improving MRR by 17.6% in our tests.
  • Measure and Iterate: Continuously evaluate your system and implement targeted improvements—our optimizations increased end-to-end accuracy from 71% to 81%.
By following this guide, you can build a production-ready RAG system with Claude that delivers accurate, context-aware answers specific to your domain. Remember to start simple, measure rigorously, and implement optimizations based on your specific performance metrics.