Claude Guide · 2026-05-05

Building Production-Ready RAG Systems with Claude: From Basics to Advanced Optimization

Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from basic setup to advanced optimization using summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, then apply techniques that boosted accuracy from 71% to 81%.

Tags: RAG · Claude · Retrieval Augmented Generation · Evaluation · Voyage AI

Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.

Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.

What You'll Learn

  • Setting up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
  • Building a robust evaluation suite that measures retrieval and end-to-end performance independently
  • Implementing advanced techniques including summary indexing and re-ranking with Claude

Prerequisites and Setup

Before diving in, you'll need:

# Required libraries
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Initialize a Vector DB Class

For this guide, we'll use an in-memory vector database. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.

import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add(self, vector: List[float], metadata: Dict[str, Any]):
        self.vectors.append(vector)
        self.metadata.append(metadata)

    def search(self, query_vector: List[float], k: int = 3) -> List[Dict[str, Any]]:
        # Cosine similarity search
        similarities = []
        for vec in self.vectors:
            sim = np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
            similarities.append(sim)
        top_k = np.argsort(similarities)[-k:][::-1]
        return [self.metadata[i] for i in top_k]
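
To see the class in action, here is a small toy demo (the class is restated compactly so the snippet runs standalone; the example vectors and metadata are illustrative only):

```python
import numpy as np

class InMemoryVectorDB:  # same API as the class above, restated for a standalone demo
    def __init__(self):
        self.vectors, self.metadata = [], []

    def add(self, vector, metadata):
        self.vectors.append(vector)
        self.metadata.append(metadata)

    def search(self, query_vector, k=3):
        # Cosine similarity against every stored vector
        sims = [np.dot(query_vector, v) / (np.linalg.norm(query_vector) * np.linalg.norm(v))
                for v in self.vectors]
        top_k = np.argsort(sims)[-k:][::-1]
        return [self.metadata[i] for i in top_k]

db = InMemoryVectorDB()
db.add([1.0, 0.0], {"text": "pricing docs"})
db.add([0.0, 1.0], {"text": "API reference"})
db.add([0.9, 0.1], {"text": "billing FAQ"})

results = db.search([1.0, 0.05], k=2)
print([r["text"] for r in results])  # → ['pricing docs', 'billing FAQ']
```

In production you would store embeddings from a real model rather than hand-written 2-D vectors, but the search path is identical.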

Level 1: Basic RAG (Naive RAG)

Let's start with a basic RAG pipeline. This approach, sometimes called "Naive RAG," follows three simple steps:

  • Chunk documents by heading, so each chunk contains only the content under a single subheading
  • Embed each chunk using Voyage AI's embedding model
  • Retrieve using cosine similarity to find relevant chunks for a query
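
The first step can be sketched as a simple markdown heading splitter. This is a hypothetical helper, not code from the original pipeline; real documentation may need smarter splitting (and note this sketch drops any text before the first heading):

```python
import re
from typing import List, Dict

def chunk_by_heading(markdown_text: str) -> List[Dict[str, str]]:
    """Split a markdown document into one chunk per heading (illustrative sketch)."""
    chunks = []
    current_heading, current_lines = None, []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):        # any markdown heading starts a new chunk
            if current_heading is not None:
                chunks.append({"heading": current_heading,
                               "text": "\n".join(current_lines).strip()})
            current_heading = line.lstrip("#").strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_heading is not None:             # flush the final chunk
        chunks.append({"heading": current_heading,
                       "text": "\n".join(current_lines).strip()})
    return chunks

doc = "# Intro\nClaude basics.\n## Auth\nUse an API key."
chunks = chunk_by_heading(doc)
# chunks[0] == {"heading": "Intro", "text": "Claude basics."}
```
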

import voyageai

# Initialize the Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

def embed_chunks(chunks: List[str]) -> List[List[float]]:
    """Embed a list of text chunks using Voyage AI."""
    result = vo.embed(chunks, model="voyage-2")
    return result.embeddings

def basic_rag(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    """Basic RAG pipeline: embed query, retrieve chunks, generate answer."""
    # Step 1: Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Step 2: Retrieve top-k chunks
    retrieved_chunks = vector_db.search(query_embedding, k=k)

    # Step 3: Generate answer with Claude
    context = "\n\n".join([chunk["text"] for chunk in retrieved_chunks])
    # (Claude API call would go here)
    return context

Building an Evaluation System

When evaluating RAG applications, it's critical to evaluate the performance of the retrieval system and end-to-end system separately. This allows you to pinpoint where improvements are needed.

Creating an Evaluation Dataset

We synthetically generated an evaluation dataset consisting of 100 samples. Each sample includes:

  • A question
  • Relevant chunks from our docs (the ground truth for retrieval)
  • A correct answer (the ground truth for generation)

This is a challenging dataset: some questions require synthesis across multiple chunks, so your system must retrieve more than one relevant chunk at a time.
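
One evaluation sample might look like the following. The field names here are assumptions for illustration, not the actual dataset schema:

```python
# One illustrative evaluation sample (field names are assumptions, not the real schema)
sample = {
    "question": "How do I set the maximum output length for a Claude request?",
    "correct_chunks": [             # ground truth for retrieval
        "Messages API: request parameters",
        "Messages API: output limits",
    ],
    "correct_answer": "Set the max_tokens parameter on the request.",  # ground truth for generation
}
```
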

Key Metrics

We evaluate our system based on five key metrics:

#### Retrieval Metrics

Precision measures the proportion of retrieved chunks that are actually relevant. It answers: "Of the chunks we retrieved, how many were correct?"

$$\text{Precision} = \frac{\text{True Positives}}{\text{Total Retrieved}} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall measures completeness: "Of all the correct chunks that exist, how many did we retrieve?"

$$\text{Recall} = \frac{\text{True Positives}}{\text{Total Correct}} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score is the harmonic mean of precision and recall, balancing the two. Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the ranked results. This matters because chunks ranked earlier are more likely to fit within Claude's limited context window.
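
In the same notation as the formulas above, these two metrics are:

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

where $Q$ is the set of evaluation queries and $\text{rank}_i$ is the position of the first relevant chunk retrieved for query $i$ (the term is taken as 0 when no relevant chunk is retrieved).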

#### End-to-End Metric

End-to-End Accuracy measures whether Claude's final answer is correct given the retrieved context.

Implementing the Evaluation

def evaluate_retrieval(retrieved_chunks, relevant_chunks):
    """Calculate retrieval metrics."""
    # Convert to sets for comparison
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)

    # True positives
    tp = len(retrieved_set & relevant_set)

    # Precision
    precision = tp / len(retrieved_set) if retrieved_set else 0

    # Recall
    recall = tp / len(relevant_set) if relevant_set else 0

    # F1
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1

def calculate_mrr(retrieved_chunks, relevant_chunks):
    """Calculate Mean Reciprocal Rank."""
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
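
To report the averaged metrics from the results table, you loop these two functions over the evaluation set. A minimal sketch with toy data (the metric functions are restated compactly so the snippet runs on its own):

```python
def evaluate_retrieval(retrieved, relevant):
    r, g = set(retrieved), set(relevant)
    tp = len(r & g)
    precision = tp / len(r) if r else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def calculate_mrr(retrieved, relevant):
    for i, chunk in enumerate(retrieved):
        if chunk in relevant:
            return 1.0 / (i + 1)
    return 0.0

def avg(xs):
    return sum(xs) / len(xs)

# Two toy samples: (retrieved chunk ids, ground-truth chunk ids)
dataset = [
    (["a", "b", "c"], ["a", "d"]),   # first hit at rank 1
    (["x", "y", "z"], ["y"]),        # first hit at rank 2
]

precisions, recalls, f1s, mrrs = [], [], [], []
for retrieved, relevant in dataset:
    p, r, f1 = evaluate_retrieval(retrieved, relevant)
    precisions.append(p)
    recalls.append(r)
    f1s.append(f1)
    mrrs.append(calculate_mrr(retrieved, relevant))

print(avg(precisions), avg(recalls), avg(f1s), avg(mrrs))
# → 0.333..., 0.75, 0.45, 0.75
```
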

Level 2: Summary Indexing

Basic RAG struggles when information is spread across multiple chunks. Summary indexing addresses this by creating higher-level summaries of document sections.

def create_summary_index(chunks: List[Dict]) -> List[Dict]:
    """Create summary embeddings for groups of related chunks."""
    summary_chunks = []
    
    # Group chunks by section (e.g., by parent heading)
    sections = group_by_section(chunks)
    
    for section in sections:
        # Combine chunk texts
        combined_text = " ".join([c["text"] for c in section])
        
        # Create summary using Claude
        summary = claude_summarize(combined_text)
        
        # Embed the summary
        summary_embedding = vo.embed([summary], model="voyage-2").embeddings[0]
        
        summary_chunks.append({
            "embedding": summary_embedding,
            "summary": summary,
            "original_chunks": section
        })
    
    return summary_chunks
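
The `group_by_section` helper above is left undefined. One possible sketch, assuming each chunk dict carries a "section" key recording its parent heading (an assumption; the real grouping depends on how you chunked):

```python
from collections import defaultdict
from typing import Dict, List

def group_by_section(chunks: List[Dict]) -> List[List[Dict]]:
    """Group chunks that share a parent section.

    Assumes each chunk dict carries a "section" key (e.g. its parent
    heading); adapt the grouping key to your own chunk metadata.
    """
    sections = defaultdict(list)
    for chunk in chunks:
        sections[chunk["section"]].append(chunk)
    return list(sections.values())

chunks = [
    {"section": "Auth", "text": "Use an API key."},
    {"section": "Auth", "text": "Keys are scoped per workspace."},
    {"section": "Errors", "text": "429 means you are rate limited."},
]
print([len(g) for g in group_by_section(chunks)])  # → [2, 1]
```
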

Level 3: Summary Indexing + Re-Ranking

The most advanced approach combines summary indexing with re-ranking. After initial retrieval, Claude re-ranks the results to ensure the most relevant chunks are used for generation.

def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare the re-ranking prompt
    prompt = f"""Given the query: "{query}"

Rate each of the following chunks on a scale of 0-10 for relevance to the query:

""" for i, chunk in enumerate(candidates): prompt += f"Chunk {i+1}: {chunk['text'][:500]}...\n\n" prompt += "Return only the chunk numbers sorted by relevance (most relevant first)." # Get Claude's re-ranking response = claude_complete(prompt) # Parse and reorder ordered_indices = parse_ranking(response) return [candidates[i] for i in ordered_indices[:top_k]]

Results and Performance Gains

Through these targeted improvements, we achieved significant performance gains:

| Metric | Basic RAG | Optimized RAG |
| --- | --- | --- |
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |

The most dramatic improvement was in Mean Reciprocal Rank (0.74 → 0.87), showing that relevant chunks appeared much earlier in the retrieval results. End-to-end accuracy improved by 10 percentage points.

Best Practices for Production RAG

  • Evaluate retrieval and generation separately – This helps you identify whether failures come from missing context or poor reasoning.
  • Use high-quality embeddings – Voyage AI's embeddings significantly outperform basic alternatives for domain-specific content.
  • Implement re-ranking – Even simple re-ranking with Claude can dramatically improve result quality.
  • Consider summary indexing – For documents with information spread across sections, summary indexing captures the big picture.
  • Monitor token usage – Full evaluations can hit rate limits. Consider sampling or using smaller evaluation sets during development.

Key Takeaways

  • Basic RAG is just the starting point – A simple chunk-and-retrieve approach leaves significant performance on the table. Advanced techniques like summary indexing and re-ranking can boost end-to-end accuracy by 10+ percentage points.
  • Evaluate retrieval and generation independently – This diagnostic approach lets you pinpoint exactly where your RAG system is failing, whether it's missing relevant context or misinterpreting the information it has.
  • Mean Reciprocal Rank matters most – The position of relevant chunks in your retrieval results directly impacts Claude's ability to generate accurate answers. Re-ranking with Claude itself is a powerful optimization.
  • Summary indexing bridges information gaps – When answers require synthesizing information across multiple document sections, summary-level embeddings capture the relationships that basic chunking misses.
  • Production RAG requires systematic optimization – Moving from "vibes-based" evaluation to metric-driven improvement is essential for building reliable enterprise applications.