BeClaude
GuideBeginnerBest Practices2026-05-15

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide teaches you to build a production-ready RAG system with Claude, covering basic setup with Voyage AI embeddings, creating an evaluation suite with precision/recall/F1 metrics, and advanced optimization techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.

RAGClaude APIVoyage AIVector SearchEvaluation

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities into your specific business context. While Claude excels at general knowledge tasks, it needs RAG to answer questions about your internal documents, customer support history, or proprietary research.

In this guide, we'll walk through building a complete RAG system using Claude and Voyage AI embeddings, then systematically improve it using evaluation-driven development. We'll cover three levels of sophistication:

  • Basic RAG - Simple chunking, embedding, and retrieval
  • Summary Indexing - Adding document summaries for better context
  • Re-ranking - Using Claude to improve result ordering
By the end, you'll have a production-ready approach that improved our metrics from 71% to 81% end-to-end accuracy.

Prerequisites and Setup

Before diving in, you'll need:

Installing Dependencies

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Initializing Your Vector Database

For this guide, we'll use an in-memory vector store. In production, consider managed solutions like Pinecone, Weaviate, or pgvector.

import voyageai
from anthropic import Anthropic
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB: def __init__(self, voyage_client): self.documents = [] self.embeddings = [] self.voyage = voyage_client def add_documents(self, texts: List[str]): """Add documents and their embeddings to the store.""" response = self.voyage.embed(texts, model="voyage-2") self.embeddings.extend(response.embeddings) self.documents.extend(texts) def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]: """Retrieve top-k documents by cosine similarity.""" query_embedding = self.voyage.embed([query], model="voyage-2").embeddings[0] # Compute cosine similarities similarities = [ np.dot(query_embedding, doc_emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)) for doc_emb in self.embeddings ] # Get top-k indices top_indices = np.argsort(similarities)[-k:][::-1] return [ {"text": self.documents[i], "score": similarities[i]} for i in top_indices ]

Initialize clients

voyage_client = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY") anthropic_client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY") db = InMemoryVectorDB(voyage_client)

Level 1: Basic RAG Pipeline

Let's start with what's often called "Naive RAG" - a straightforward three-step process:

  • Chunk documents by heading or section
  • Embed each chunk using Voyage AI
  • Retrieve relevant chunks via cosine similarity

Implementing the Basic Pipeline

def chunk_document(text: str, heading_pattern: str = "## ") -> List[str]:
    """Split document by headings for semantic chunks."""
    chunks = []
    current_chunk = []
    
    for line in text.split("\n"):
        if line.startswith(heading_pattern) and current_chunk:
            chunks.append("\n".join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    
    if current_chunk:
        chunks.append("\n".join(current_chunk))
    
    return chunks

def basic_rag(query: str, db: InMemoryVectorDB, k: int = 3) -> str: """Basic RAG: retrieve chunks and generate answer.""" # Step 1: Retrieve relevant chunks results = db.search(query, k=k) context = "\n\n".join([r["text"] for r in results]) # Step 2: Generate answer with Claude prompt = f"""Based on the following context, answer the question.

Context: {context}

Question: {query}

Answer:""" response = anthropic_client.messages.create( model="claude-3-sonnet-20240229", max_tokens=500, messages=[{"role": "user", "content": prompt}] ) return response.content[0].text

Building an Evaluation System

"Vibes-based" evaluation won't cut it for production. You need quantitative metrics to measure and improve your RAG system. Let's build a proper evaluation suite.

Creating a Test Dataset

Generate a synthetic evaluation dataset with 100+ samples. Each sample should include:

  • A question
  • The correct answer
  • The relevant document chunks
import json

def create_evaluation_sample(question: str, answer: str, relevant_chunks: List[str]) -> Dict: return { "question": question, "expected_answer": answer, "relevant_chunks": relevant_chunks }

Load or generate your dataset

evaluation_data = json.load(open("evaluation_dataset.json"))

Key Metrics Explained

#### Retrieval Metrics

Precision measures how many retrieved chunks are actually relevant:

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Retrieved}|}$$

High precision means fewer false positives - you're not wasting Claude's context window on irrelevant information.

Recall measures how many relevant chunks you successfully retrieved:

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Relevant}|}$$

High recall ensures Claude has all the information it needs.

F1 Score is the harmonic mean of precision and recall:

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Mean Reciprocal Rank (MRR) measures how early the first relevant result appears:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

#### End-to-End Metric

Accuracy measures whether Claude's final answer is correct given the retrieved context.

Implementing the Evaluation

def evaluate_retrieval(db: InMemoryVectorDB, eval_data: List[Dict], k: int = 3):
    """Evaluate retrieval performance."""
    precisions, recalls, f1s, mrrs = [], [], [], []
    
    for sample in eval_data:
        query = sample["question"]
        relevant = set(sample["relevant_chunks"])
        
        # Retrieve chunks
        results = db.search(query, k=k)
        retrieved = set([r["text"] for r in results])
        
        # Calculate metrics
        true_positives = len(retrieved & relevant)
        
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(relevant) if relevant else 0
        f1 = 2  (precision  recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: find first relevant result
        mrr = 0
        for i, r in enumerate(results):
            if r["text"] in relevant:
                mrr = 1 / (i + 1)
                break
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }

Level 2: Summary Indexing

Basic chunking loses the forest for the trees. Summary indexing adds a high-level overview of each document section, improving retrieval for questions that require synthesis.

How Summary Indexing Works

  • For each document chunk, generate a summary using Claude
  • Store both the original chunk and its summary
  • When searching, match against summaries first, then retrieve full chunks
def generate_summary(chunk: str, anthropic_client: Anthropic) -> str:
    """Generate a concise summary of a document chunk."""
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user", 
            "content": f"Summarize this in 1-2 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text

def build_summary_index(chunks: List[str], anthropic_client: Anthropic, voyage_client) -> InMemoryVectorDB: """Build a vector index with summaries.""" summary_db = InMemoryVectorDB(voyage_client) for chunk in chunks: summary = generate_summary(chunk, anthropic_client) # Store summary + chunk for retrieval summary_db.add_documents([f"Summary: {summary}\n\nFull: {chunk}"]) return summary_db

Level 3: Re-ranking with Claude

Even with good embeddings, the top-k results aren't always optimally ordered. Re-ranking uses Claude to evaluate and reorder retrieved chunks based on relevance to the specific question.

Implementing Re-ranking

def rerank_chunks(query: str, chunks: List[str], anthropic_client: Anthropic, top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare chunks for evaluation
    chunk_text = "\n\n---\n\n".join([
        f"Chunk {i+1}: {chunk}" for i, chunk in enumerate(chunks)
    ])
    
    prompt = f"""Given the question below, rank these chunks by relevance (most relevant first).

Question: {query}

{chunk_text}

Return the chunk numbers in order of relevance, like: 3, 1, 2""" response = anthropic_client.messages.create( model="claude-3-haiku-20240307", max_tokens=50, messages=[{"role": "user", "content": prompt}] ) # Parse the ranking ranking = [int(x.strip()) - 1 for x in response.content[0].text.split(",")] ranked_chunks = [chunks[i] for i in ranking[:top_k]] return ranked_chunks

def advanced_rag(query: str, db: InMemoryVectorDB, anthropic_client: Anthropic, k: int = 5) -> str: """Advanced RAG with re-ranking.""" # Retrieve more chunks than needed results = db.search(query, k=k) chunks = [r["text"] for r in results] # Re-rank with Claude top_chunks = rerank_chunks(query, chunks, anthropic_client, top_k=3) context = "\n\n".join(top_chunks) # Generate final answer prompt = f"""Based on the following context, answer the question.

Context: {context}

Question: {query}

Answer:""" response = anthropic_client.messages.create( model="claude-3-sonnet-20240229", max_tokens=500, messages=[{"role": "user", "content": prompt}] ) return response.content[0].text

Results and Performance Gains

After implementing these optimizations, we achieved significant improvements:

MetricBasic RAGAdvanced RAGImprovement
Avg Precision0.430.44+2%
Avg Recall0.660.69+5%
Avg F1 Score0.520.54+4%
Avg MRR0.740.87+18%
End-to-End Accuracy71%81%+14%
The most dramatic improvement came in MRR (Mean Reciprocal Rank), showing that re-ranking effectively pushes the most relevant chunks to the top of Claude's context window.

Production Considerations

  • Rate Limits: Full evaluations can hit API rate limits. Consider using Tier 2+ accounts or running evaluations incrementally.
  • Cost Management: Summary indexing and re-ranking add token costs. Balance improvement against budget.
  • Vector Database: For production, use managed solutions like Pinecone, Weaviate, or pgvector instead of in-memory stores.
  • Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your document structure.

Key Takeaways

  • Evaluate systematically: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy to identify bottlenecks in your RAG pipeline.
  • Summary indexing improves context: Adding document summaries helps Claude understand the big picture before diving into details, improving recall by 5%.
  • Re-ranking with Claude boosts relevance: Using Claude to reorder retrieved chunks improved MRR by 18%, ensuring the most relevant information appears first.
  • Start simple, then optimize: Begin with basic RAG, establish your evaluation baseline, then incrementally add sophistication.
  • Monitor costs vs. benefits: Advanced techniques like summary generation and re-ranking add token costs. Measure whether the accuracy gains justify the expense for your use case.