Guide | Beginner | Best Practices | 2026-05-15

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

A practical guide to building RAG systems with Claude, covering basic setup with Voyage AI embeddings, building an evaluation suite with precision/recall/F1 metrics, and advanced optimization techniques like summary indexing and re-ranking to boost end-to-end accuracy from 71% to 81%.

RAG, Retrieval Augmented Generation, Vector Search, Claude API, Evaluation

Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. While Claude excels at general knowledge tasks, it needs RAG to answer questions specific to your business context—whether that's internal documentation, customer support histories, or proprietary research.

In this guide, we'll walk through building a complete RAG system using Claude and Voyage AI embeddings, then systematically improve it using advanced techniques. By the end, you'll understand not just how to build RAG, but how to measure and optimize it for production.

Why RAG Matters for Claude Users

Enterprises are increasingly building RAG applications to:

Power customer support with product documentation
Enable Q&A over internal company documents
Accelerate financial and legal analysis
Create knowledge assistants for specialized domains

Without RAG, Claude can only answer based on its training data. With RAG, you can ground responses in your specific, up-to-date information.

Level 1: Building a Basic RAG Pipeline

Let's start with what's often called "Naive RAG"—a straightforward three-step process:

Chunk your documents into manageable pieces
Embed each chunk into a vector representation
Retrieve the most relevant chunks for a given query

Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

You'll need API keys from both Anthropic and Voyage AI.

Creating a Vector Database Class

For this example, we'll use an in-memory vector database. In production, you'd want a hosted solution like Pinecone or Weaviate.

import numpy as np
from typing import List, Dict, Any


class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []
    
    def add_document(self, vector: List[float], metadata: Dict[str, Any]):
        self.vectors.append(np.array(vector))
        self.metadata.append(metadata)
    
    def search(self, query_vector: List[float], top_k: int = 3) -> List[Dict[str, Any]]:
        query_vec = np.array(query_vector)
        similarities = [
            np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
            for doc_vec in self.vectors
        ]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [
            {"metadata": self.metadata[i], "score": similarities[i]}
            for i in top_indices
        ]
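The per-document loop in `search` is easy to follow but recomputes norms on every query. As a minimal sketch, assuming all vectors fit in one NumPy matrix, the same cosine ranking can be done in a single matrix product. The function name `cosine_top_k` and the toy vectors are illustrative and not part of the class above.

```python
import numpy as np


def cosine_top_k(doc_matrix: np.ndarray, query: np.ndarray, top_k: int = 3):
    """Return (row index, cosine score) pairs for the top_k most similar rows."""
    # Normalize rows and the query so a plain dot product equals cosine similarity.
    # The clip/max guards avoid division by zero for all-zero vectors.
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    unit_docs = doc_matrix / np.clip(doc_norms, 1e-12, None)
    unit_query = query / max(np.linalg.norm(query), 1e-12)

    scores = unit_docs @ unit_query
    # argsort is ascending, so take the tail and reverse it for best-first order.
    order = np.argsort(scores)[-top_k:][::-1]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D "embeddings": the query points mostly along the first axis.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = cosine_top_k(docs, np.array([1.0, 0.1]), top_k=2)
print(hits)
```

Normalizing document vectors once at indexing time, rather than per query, would be a natural next step if the index grows.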

Implementing the Basic RAG Pipeline

import voyageai
from anthropic import Anthropic
from typing import List, Dict


class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.anthropic = Anthropic(api_key=anthropic_key)
        self.voyage = voyageai.Client(api_key=voyage_key)
        self.vector_db = InMemoryVectorDB()
    
    def index_documents(self, documents: List[Dict[str, str]]):
        """Chunk and index documents by heading."""
        for doc in documents:
            # Simple chunking: split by headings
            chunks = self._chunk_by_heading(doc["content"])
            for chunk in chunks:
                embedding = self.voyage.embed(
                    [chunk["text"]], 
                    model="voyage-2"
                ).embeddings[0]
                self.vector_db.add_document(
                    embedding,
                    {"source": doc["title"], "text": chunk["text"]}
                )
    
    def query(self, question: str, top_k: int = 3) -> str:
        # Embed the question
        query_embedding = self.voyage.embed(
            [question], 
            model="voyage-2"
        ).embeddings[0]
        
        # Retrieve relevant chunks
        results = self.vector_db.search(query_embedding, top_k=top_k)
        
        # Build context from retrieved chunks
        context = "\n\n".join([
            r["metadata"]["text"] for r in results
        ])
        
        # Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text
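The class calls `self._chunk_by_heading`, which is not shown. The following is one possible sketch of that helper, written as a standalone function and assuming the documents are Markdown with `#`-style headings. Both the heading format and the returned `heading` key are assumptions. Only the `text` key is required by `index_documents`.

```python
import re
from typing import Dict, List


def chunk_by_heading(content: str) -> List[Dict[str, str]]:
    """Split Markdown text into one chunk per heading (hypothetical helper)."""
    chunks: List[Dict[str, str]] = []
    heading = ""            # heading of the chunk currently being collected
    lines: List[str] = []   # lines of the chunk currently being collected

    def flush() -> None:
        # Store the collected lines as a chunk, skipping empty ones.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in content.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()                     # close the previous chunk
            heading = match.group(1).strip()
            lines = [line]              # keep the heading line inside the chunk
        else:
            lines.append(line)
    flush()                             # close the final chunk
    return chunks


sample = "Intro text\n# Setup\nInstall it.\n## Keys\nGet keys."
for chunk in chunk_by_heading(sample):
    print(repr(chunk["heading"]), "->", repr(chunk["text"]))
```

Note that this simple pattern would also split on `#` comment lines inside code samples, so real documentation may need a fence-aware splitter.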

Building a Robust Evaluation System

This is where most RAG tutorials stop—but it's where the real work begins. To build production-quality RAG, you need to measure two things independently:

Retrieval performance: How well does your system find relevant chunks?
End-to-end performance: How well does Claude answer questions given those chunks?

Creating a Synthetic Evaluation Dataset

We'll generate 100 evaluation samples, each containing:

A question
The correct chunks (ground truth)
A correct answer

{
  "question": "How do I use Claude's system prompt to control output format?",
  "relevant_chunks": [
    "System prompts allow you to set the behavior and output format...",
    "You can specify JSON output by including 'Respond in JSON'..."
  ],
  "correct_answer": "To control output format, use a system prompt that specifies..."
}

Key Retrieval Metrics

Precision

Precision answers: "Of the chunks we retrieved, how many were actually relevant?"

Precision = True Positives / Total Retrieved

High precision means your system isn't returning irrelevant chunks. Low precision means Claude has to sort through noise.

Recall

Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"

Recall = True Positives / Total Relevant

High recall ensures Claude has all the information it needs. Low recall means you're missing important context.

F1 Score

The harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Mean Reciprocal Rank (MRR)

MRR measures how high the first relevant chunk appears in your results. This matters because Claude pays more attention to early chunks.
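As a worked example of the four metrics, the sketch below scores a single query by hand. The chunk IDs (`c2`, `c7`, and so on) and the function name `retrieval_metrics` are made up for illustration. The reciprocal rank shown here is the per-query value that MRR averages over the whole dataset.

```python
from typing import Dict, List


def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    """Precision, recall, F1, and reciprocal rank for one query."""
    relevant_set = set(relevant)
    tp = len(set(retrieved) & relevant_set)  # relevant chunks we actually retrieved

    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Reciprocal rank: 1 / position of the first relevant chunk, 0 if none found.
    reciprocal_rank = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            reciprocal_rank = 1 / rank
            break

    return {"precision": precision, "recall": recall, "f1": f1, "rr": reciprocal_rank}


# Three chunks retrieved, two relevant chunks exist, one overlap at rank 2.
scores = retrieval_metrics(retrieved=["c7", "c2", "c9"], relevant=["c2", "c4"])
print(scores)
```

Here one of three retrieved chunks is relevant (precision 1/3), one of two relevant chunks was found (recall 1/2), and the first hit sits at rank 2 (reciprocal rank 1/2).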

Implementing the Evaluation

def evaluate_retrieval(rag_system, eval_dataset):
    metrics = {"precision": [], "recall": [], "f1": [], "mrr": []}
    
    for sample in eval_dataset:
        query_embedding = rag_system.voyage.embed(
            [sample["question"]], 
            model="voyage-2"
        ).embeddings[0]
        
        results = rag_system.vector_db.search(query_embedding, top_k=3)
        retrieved_texts = [r["metadata"]["text"] for r in results]
        
        # Calculate metrics
        tp = len(set(retrieved_texts) & set(sample["relevant_chunks"]))
        
        precision = tp / len(retrieved_texts) if retrieved_texts else 0
        recall = tp / len(sample["relevant_chunks"]) if sample["relevant_chunks"] else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: reciprocal rank of first relevant chunk
        mrr = 0
        for i, text in enumerate(retrieved_texts):
            if text in sample["relevant_chunks"]:
                mrr = 1 / (i + 1)
                break
        
        metrics["precision"].append(precision)
        metrics["recall"].append(recall)
        metrics["f1"].append(f1)
        metrics["mrr"].append(mrr)
    
    return {k: np.mean(v) for k, v in metrics.items()}
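The function above covers retrieval only. This guide does not show how end-to-end accuracy is scored, so the following is a sketch under assumptions: `answer_fn` stands in for something like `rag_system.query`, and `judge_fn` is a hypothetical grader (a human check, string match, or a model-based comparison) that returns True when the generated answer matches the reference.

```python
from typing import Callable, Dict, List


def evaluate_end_to_end(
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
    eval_dataset: List[Dict],
) -> float:
    """Fraction of questions whose generated answer the judge accepts."""
    if not eval_dataset:
        return 0.0
    correct = 0
    for sample in eval_dataset:
        generated = answer_fn(sample["question"])
        # judge_fn receives (question, reference answer, generated answer).
        if judge_fn(sample["question"], sample["correct_answer"], generated):
            correct += 1
    return correct / len(eval_dataset)


# Stub components so the sketch runs without any API calls.
toy_dataset = [
    {"question": "q1", "correct_answer": "use a system prompt"},
    {"question": "q2", "correct_answer": "set max_tokens"},
]
canned_answers = {"q1": "You should use a system prompt.", "q2": "Not sure."}


def substring_judge(question: str, reference: str, generated: str) -> bool:
    # Deliberately crude: accept if the reference text appears in the answer.
    return reference.lower() in generated.lower()


accuracy = evaluate_end_to_end(canned_answers.get, substring_judge, toy_dataset)
print(f"End-to-end accuracy: {accuracy:.0%}")
```

Keeping the judge injectable makes it possible to swap a cheap check during development for a stricter grader later without touching the loop.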

Level 2: Summary Indexing

Basic chunking by heading has a fundamental problem: it loses the broader context. A chunk about "API Rate Limits" might not mention it's part of the "Getting Started" guide, which is crucial context.

Summary indexing solves this by creating a summary of each document section and including it with every chunk:

class SummaryIndexRAG(BasicRAG):
    def index_documents(self, documents):
        for doc in documents:
            sections = self._extract_sections(doc["content"])
            for section in sections:
                # Generate a summary of the section
                summary = self.anthropic.messages.create(
                    model="claude-3-haiku-20240307",
                    max_tokens=200,
                    messages=[{
                        "role": "user",
                        "content": f"Summarize this section in 2-3 sentences:\n\n{section['text']}"
                    }]
                ).content[0].text
                
                # Embed the summary + chunk together
                enhanced_chunk = f"Section Summary: {summary}\n\nContent: {section['text']}"
                embedding = self.voyage.embed(
                    [enhanced_chunk],
                    model="voyage-2"
                ).embeddings[0]
                
                self.vector_db.add_document(
                    embedding,
                    # Keep the source title, as BasicRAG does, so answers can be traced back
                    {"source": doc["title"], "text": section["text"], "summary": summary}
                )

Level 3: Adding Re-Ranking

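Even with summary indexing, your top-3 retrieved chunks might not be in the optimal order, and a relevant chunk can sit just outside the top 3. Re-ranking retrieves a larger candidate set (10 chunks here), asks Claude to order the candidates by relevance, and keeps the top 3: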

class ReRankRAG(SummaryIndexRAG):
    def query(self, question: str, top_k: int = 10) -> str:
        # Retrieve more candidates initially
        query_embedding = self.voyage.embed([question], model="voyage-2").embeddings[0]
        candidates = self.vector_db.search(query_embedding, top_k=top_k)
        
        # Re-rank with Claude
        ranked = self._rerank_with_claude(question, candidates)
        
        # Take top 3 after re-ranking
        top_chunks = ranked[:3]
        context = "\n\n".join([c["metadata"]["text"] for c in top_chunks])
        
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text
    
    def _rerank_with_claude(self, question: str, candidates: List[Dict]) -> List[Dict]:
        chunks_text = "\n---\n".join([
            f"Chunk {i+1}: {c['metadata']['text'][:200]}..."
            for i, c in enumerate(candidates)
        ])
        
        response = self.anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"Given the question: '{question}'\n\nRank these chunks by relevance (most relevant first). Return only the chunk numbers in order, comma-separated.\n\n{chunks_text}"
            }]
        )
        
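        # Parse the ranked order, skipping tokens that are not valid chunk numbers
        ranked_indices = []
        for x in response.content[0].text.split(","):
            x = x.strip()
            if x.isdigit():
                i = int(x) - 1
                # Reject 0 (which would become index -1) and repeated numbers
                if 0 <= i < len(candidates) and i not in ranked_indices:
                    ranked_indices.append(i)
        
        # Fall back to the original retrieval order if nothing could be parsed
        if not ranked_indices:
            return candidates
        return [candidates[i] for i in ranked_indices]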
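Model replies do not always follow the requested "comma-separated numbers" format exactly. As a sketch, the parsing step can be pulled into a small pure function that is easy to unit test. The name `parse_ranking` and the choice to backfill omitted candidates in their original order are assumptions, not part of the class above.

```python
import re
from typing import List


def parse_ranking(reply: str, num_candidates: int) -> List[int]:
    """Turn a reply such as '3, 1, 2' into zero-based candidate indices."""
    order: List[int] = []
    # Pull out every run of digits, so "Chunk 3, 1" and "3,1" both work.
    for token in re.findall(r"\d+", reply):
        index = int(token) - 1  # the prompt numbers chunks from 1
        if 0 <= index < num_candidates and index not in order:
            order.append(index)

    # Backfill anything the model omitted, preserving the retrieval order,
    # so a malformed reply degrades to plain vector search instead of failing.
    missing = [i for i in range(num_candidates) if i not in order]
    return order + missing


print(parse_ranking("Chunk 3, 1, 3, 0, 9", num_candidates=4))
print(parse_ranking("no idea", num_candidates=3))
```

One caveat: any other number in the reply (for example inside an explanation) would also be read as a chunk number, so the prompt should still ask for numbers only.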

Results: The Impact of Each Improvement

Here's what we achieved by layering these techniques:

Metric	Basic RAG	+ Summary Indexing	+ Re-Ranking
Precision	0.43	0.44	0.44
Recall	0.66	0.68	0.69
F1 Score	0.52	0.53	0.54
MRR	0.74	0.82	0.87
End-to-End Accuracy	71%	76%	81%

The biggest wins came from:

MRR improvement (0.74 → 0.87): Summary indexing (0.74 → 0.82) and re-ranking (0.82 → 0.87) each move the first relevant chunk closer to the top of the results
End-to-end accuracy (71% → 81%): Better retrieval directly leads to better answers

Key Takeaways

Evaluate retrieval and generation separately – Don't just trust "vibes." Use precision, recall, F1, and MRR to measure retrieval quality independently from answer quality.

Summary indexing preserves context – By embedding summaries alongside chunks, you help the retrieval system understand the broader context of each piece of content.

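Re-ranking with Claude improves result ordering – Even a lightweight model like Claude 3 Haiku can reorder search results. In the results above, re-ranking raised MRR from 0.82 to 0.87 and end-to-end accuracy from 76% to 81%. Combined with summary indexing, MRR rose about 18% over the baseline (0.74 → 0.87) and end-to-end accuracy rose by 10 percentage points.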

Start simple, then optimize – Build a basic RAG pipeline first, establish your baseline metrics, then layer improvements. This approach ensures you're actually making progress, not just adding complexity.

Watch your rate limits – Full evaluation runs can consume significant tokens. Consider running smaller evaluation sets during development and scaling up for production testing.
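One way to run smaller evaluation sets during development is to draw a fixed, seeded subset so that scores stay comparable between runs. This is a minimal sketch. The function name `sample_eval_subset` and the subset size of 20 are illustrative choices.

```python
import random
from typing import Dict, List


def sample_eval_subset(eval_dataset: List[Dict], size: int, seed: int = 0) -> List[Dict]:
    """Return a reproducible random subset of the evaluation samples."""
    if size >= len(eval_dataset):
        return list(eval_dataset)
    # A dedicated Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return rng.sample(eval_dataset, size)


# Placeholder dataset standing in for the 100 generated samples.
full_dataset = [{"question": f"question {i}"} for i in range(100)]
dev_subset = sample_eval_subset(full_dataset, size=20)
print(len(dev_subset), "samples selected for development runs")
```

Because the seed is fixed, a change in metrics between two development runs reflects a change in the pipeline rather than a different draw of questions.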