BeClaude
Guide · Beginner · Best Practices · 2026-05-12

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide teaches you to build a RAG system with Claude, from basic setup to advanced optimization. You'll learn to evaluate retrieval performance using precision, recall, F1, and MRR metrics, then improve accuracy from 71% to 81% using summary indexing and re-ranking techniques.

RAG · Claude API · Vector Databases · Evaluation · Voyage AI

Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG enables Claude to answer questions about your unique business context with high accuracy.

In this guide, we'll walk through building and optimizing a RAG system using Claude and Voyage AI embeddings, using the Claude documentation as our knowledge base. You'll learn how to:

  • Set up a basic RAG pipeline
  • Build a robust evaluation system
  • Implement advanced techniques like summary indexing and re-ranking
  • Achieve measurable improvements in retrieval and end-to-end accuracy

Understanding the RAG Architecture

Before diving into code, let's understand what makes RAG tick. A RAG system has three core components:

  • Ingestion Pipeline: Chunks documents, generates embeddings, and stores them in a vector database
  • Retrieval System: Finds relevant chunks for a given query using semantic similarity
  • Generation System: Feeds retrieved context to Claude to generate accurate answers

The magic happens when these components work together. Claude can then answer questions that require specific knowledge it wasn't trained on, while grounding its responses in your actual documents.

Level 1: Building a Basic RAG System

Let's start with what's often called "Naive RAG" — a simple but functional implementation.

Setup and Dependencies

First, install the required libraries:

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

You'll need API keys from both Anthropic and Voyage AI.

Initialize the Vector Database

For this example, we'll use an in-memory vector database. In production, you'd want a hosted solution like Pinecone or Weaviate.

import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings"""
        texts = [doc['content'] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Search for similar documents using cosine similarity"""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]

        # Calculate cosine similarity
        similarities = []
        for doc_embedding in self.embeddings:
            similarity = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append(similarity)

        # Get top-k results (argsort is ascending, so take the tail and reverse it)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
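
The per-document loop in `search` can also be written as a single matrix operation, which tends to matter once the index holds more than a few thousand vectors. The sketch below is a standalone illustration that uses small hand-made vectors instead of Voyage AI embeddings; `cosine_top_k` is a hypothetical helper name, not part of the class above.

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=3):
    """Return the indices of the k rows of doc_matrix most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so take the last k and reverse for best-first order
    return np.argsort(sims)[-k:][::-1].tolist()

# Toy 2-D "embeddings", made up for illustration
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = cosine_top_k([1.0, 0.1], docs, k=2)
print(top)
```

The query points almost entirely along the first axis, so the first document ranks highest, followed by the diagonal one.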

The Basic RAG Pipeline

Now let's implement the three-step pipeline:

from anthropic import Anthropic

class BasicRAG:
    def __init__(self, vector_db, anthropic_api_key: str):
        self.vector_db = vector_db
        self.anthropic = Anthropic(api_key=anthropic_api_key)

    def chunk_documents(self, documents: List[Dict]) -> List[Dict]:
        """Chunk documents by heading"""
        chunks = []
        for doc in documents:
            # Split by headings (## or ###)
            sections = doc['content'].split('\n##')
            for section in sections:
                if section.strip():
                    chunks.append({
                        # Sequential ID (an assumed scheme) so evaluation can
                        # match retrieved chunks against ground-truth IDs
                        'id': f"chunk_{len(chunks)}",
                        'content': section.strip(),
                        'source': doc.get('source', ''),
                        'heading': section.split('\n')[0].lstrip('#').strip() if '\n' in section else ''
                    })
        return chunks

    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve relevant chunks"""
        return self.vector_db.search(query, k=k)

    def generate(self, query: str, context: List[Dict]) -> str:
        """Generate answer using Claude"""
        context_text = "\n\n".join([c['content'] for c in context])

        prompt = f"""Based on the following context, answer the question accurately.
If the context doesn't contain enough information, say so.

Context:
{context_text}

Question: {query}

Answer:"""

        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",  # substitute any current Claude model ID
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
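
To see what heading-based chunking produces, here is a standalone sketch that runs without any API keys. It is a variant of the approach above, not the same code: it splits with a regular expression so the heading markers stay attached to their sections. The function name `chunk_by_heading` and the sample document are made up for illustration.

```python
import re

def chunk_by_heading(markdown_text, source=""):
    """Split markdown into one chunk per ## or ### section."""
    # Split just before any line starting with "## " or "### ";
    # the lookahead keeps the heading inside its own section.
    sections = re.split(r'(?m)^(?=#{2,3} )', markdown_text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        first_line = section.split('\n')[0]
        # Text before the first heading has no heading of its own
        heading = first_line.lstrip('#').strip() if first_line.startswith('#') else ''
        chunks.append({'content': section, 'source': source, 'heading': heading})
    return chunks

doc = "Intro text\n## Streaming\nSet stream=True.\n### Errors\nHandle timeouts.\n"
chunks = chunk_by_heading(doc, source="docs/streaming.md")
headings = [c['heading'] for c in chunks]
print(headings)
```

Each chunk carries its heading and source, which is useful later when you need to show citations or debug a bad retrieval.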

Building a Robust Evaluation System

This is where most RAG tutorials stop, but it's where the real work begins. You can't improve what you can't measure.

Creating an Evaluation Dataset

We need three things for each test case:

  • A question
  • The correct chunks (ground truth)
  • A correct answer

# Example evaluation dataset structure
eval_dataset = [
    {
        "question": "How do I stream Claude's responses?",
        "relevant_chunks": ["chunk_1_id", "chunk_5_id"],
        "correct_answer": "You can stream Claude's responses by setting stream=True in the API call..."
    },
    # ... 97 more samples
]

Key Metrics Explained

#### Retrieval Metrics

Precision: Of the chunks we retrieved, how many were relevant?
Precision = True Positives / Total Retrieved

Recall: Of all relevant chunks, how many did we retrieve?
Recall = True Positives / Total Relevant

F1 Score: Harmonic mean of precision and recall
F1 = 2 * (Precision * Recall) / (Precision + Recall)

Mean Reciprocal Rank (MRR): How high did the first relevant result appear, averaged over all queries?
Reciprocal Rank = 1 / rank_of_first_relevant_result (0 if no relevant result was retrieved)
MRR = mean of Reciprocal Rank across all test queries
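
A worked example can make these definitions concrete. The sketch below computes all four values for one hypothetical query; the chunk IDs are invented, and `retrieval_metrics` is an illustrative helper name.

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Precision, recall, F1, and reciprocal rank for a single query."""
    relevant = set(relevant_ids)
    true_positives = len(set(retrieved_ids) & relevant)
    precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Reciprocal rank: 1 / position of the first relevant hit, or 0 if there is none
    rr = next((1 / (i + 1) for i, rid in enumerate(retrieved_ids) if rid in relevant), 0.0)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'rr': rr}

# Three chunks retrieved, two chunks relevant, one overlap sitting at rank 2
m = retrieval_metrics(['c7', 'c1', 'c9'], ['c1', 'c5'])
print(m)
```

Here one of three retrieved chunks is relevant (precision 1/3), one of two relevant chunks was found (recall 1/2), F1 works out to 0.4, and the reciprocal rank is 1/2 because the first relevant chunk sits in second place. Averaging the reciprocal rank over every query in the dataset gives MRR.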

#### End-to-End Metric

Accuracy: Did Claude's answer match the expected answer?
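
How an answer gets judged as "matching" is left open here. One common option is to have a model grade each generated answer against the expected one. The sketch below assumes that design but keeps the grader pluggable: `judge` is any callable returning True or False, and both `fake_answer` and `substring_judge` are stand-ins invented so the example runs without API calls.

```python
from typing import Callable, Dict, List

def evaluate_end_to_end(
    answer_fn: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
    eval_dataset: List[Dict],
) -> float:
    """Fraction of questions whose generated answer the judge accepts."""
    if not eval_dataset:
        return 0.0
    correct = 0
    for sample in eval_dataset:
        generated = answer_fn(sample['question'])
        if judge(sample['question'], sample['correct_answer'], generated):
            correct += 1
    return correct / len(eval_dataset)

def fake_answer(question: str) -> str:
    # Stand-in for retrieve + generate
    return "Set stream=True in the API call." if "stream" in question else "I don't know."

def substring_judge(question: str, expected: str, generated: str) -> bool:
    # Crude placeholder; a model-based grader would go here in practice
    return expected in generated

dataset = [
    {"question": "How do I stream Claude's responses?", "correct_answer": "stream=True"},
    {"question": "Where are rate limits documented?", "correct_answer": "the rate limits page"},
]
accuracy = evaluate_end_to_end(fake_answer, substring_judge, dataset)
print(accuracy)
```

Keeping the judge separate from the loop makes it easy to swap a cheap string check for a model-based grader without touching the rest of the evaluation.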

Implementing the Evaluation

def evaluate_retrieval(rag_system, eval_dataset):
    """Evaluate retrieval performance"""
    results = {
        'precision': [],
        'recall': [],
        'f1': [],
        'mrr': []
    }
    
    for sample in eval_dataset:
        retrieved = rag_system.retrieve(sample['question'])
        retrieved_ids = [r['id'] for r in retrieved]
        relevant_ids = sample['relevant_chunks']
        
        # Calculate metrics
        true_positives = len(set(retrieved_ids) & set(relevant_ids))
        
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(relevant_ids) if relevant_ids else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: find first relevant result
        mrr = 0
        for i, rid in enumerate(retrieved_ids):
            if rid in relevant_ids:
                mrr = 1 / (i + 1)
                break
        
        results['precision'].append(precision)
        results['recall'].append(recall)
        results['f1'].append(f1)
        results['mrr'].append(mrr)
    
    return {k: np.mean(v) for k, v in results.items()}

Level 2: Summary Indexing

Basic RAG has a fundamental problem: chunks often lack context. A chunk about "rate limits" might not mention it's about Claude's API. Summary indexing solves this by creating a summary for each chunk and using it for retrieval.

class SummaryIndexRAG(BasicRAG):
    def __init__(self, vector_db, anthropic_api_key: str):
        super().__init__(vector_db, anthropic_api_key)
        self.chunks = []  # full chunks, looked up by position after a summary match
    
    def generate_summary(self, chunk: Dict) -> str:
        """Generate a summary of the chunk using Claude"""
        prompt = f"""Summarize the following text in 1-2 sentences, 
        focusing on what questions this text can answer:
        
        {chunk['content']}
        
        Summary:"""
        
        response = self.anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
    
    def index_chunks(self, chunks: List[Dict]):
        """Embed one summary per chunk; each entry points back to its full chunk"""
        start = len(self.chunks)
        self.chunks.extend(chunks)
        self.vector_db.add_documents([
            {'content': self.generate_summary(chunk), 'chunk_id': start + i}
            for i, chunk in enumerate(chunks)
        ])
    
    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        # Search over the summaries first
        summary_results = self.vector_db.search(query, k=k)
        
        # Then map each matching summary back to its full chunk
        return [self.chunks[r['chunk_id']] for r in summary_results]
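
The key idea is that the index stores summaries while the generator receives full chunks, with a `chunk_id` linking the two. The offline sketch below shows that mapping under loud assumptions: the summaries are canned strings rather than Claude output, and `word_overlap_search` is a toy stand-in for embedding search. All names and sample texts are hypothetical.

```python
from typing import Callable, Dict, List

def build_summary_index(chunks: List[Dict], summarize: Callable[[Dict], str]) -> List[Dict]:
    """One index entry per chunk: the summary is searched, chunk_id points back."""
    return [{'content': summarize(chunk), 'chunk_id': i} for i, chunk in enumerate(chunks)]

def word_overlap_search(query: str, entries: List[Dict], k: int = 1) -> List[Dict]:
    """Toy stand-in for embedding search: rank entries by shared lowercase words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        entries,
        key=lambda e: len(query_words & set(e['content'].lower().split())),
        reverse=True,
    )
    return ranked[:k]

chunks = [
    {'content': "Requests above the limit return a 429 error.", 'heading': "Limits"},
    {'content': "Pass stream=True to receive tokens as they arrive.", 'heading': "Streaming"},
]
# Canned summaries; in the real pipeline these would come from generate_summary
canned = {
    "Limits": "Explains Claude API rate limits and what happens when you exceed them.",
    "Streaming": "Explains how to stream responses from the Claude API.",
}
index = build_summary_index(chunks, lambda chunk: canned[chunk['heading']])

hits = word_overlap_search("What are the Claude API rate limits?", index, k=1)
retrieved = [chunks[hit['chunk_id']] for hit in hits]
print(retrieved[0]['heading'])
```

Note that the raw "Limits" chunk never mentions Claude or the API, so the query shares almost no vocabulary with it. The summary supplies that missing context, and the lookup by `chunk_id` still hands the original text to the generator.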

Level 3: Adding Re-Ranking

Re-ranking is the secret weapon of production RAG systems. Instead of relying solely on embedding similarity, we use Claude to re-rank the retrieved chunks based on actual relevance to the query.

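import re

class ReRankRAG(SummaryIndexRAG):
    def rerank(self, query: str, chunks: List[Dict], k: int = 3) -> List[Dict]:
        """Use Claude to re-rank chunks by relevance"""
        chunks_text = "\n---\n".join([
            f"Chunk {i}: {c['content']}" 
            for i, c in enumerate(chunks)
        ])
        
        prompt = f"""Given the query: "{query}"
        
        Judge how relevant each chunk below is to the query:
        
        {chunks_text}
        
        Return only the chunk numbers as a comma-separated list, most relevant first:"""
        
        response = self.anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        
        # Parse the chunk numbers, skipping duplicates and out-of-range values
        order = []
        for match in re.findall(r'\d+', response.content[0].text):
            idx = int(match)
            if idx < len(chunks) and idx not in order:
                order.append(idx)
        
        # Keep the original order for any chunk the response left out
        order += [i for i in range(len(chunks)) if i not in order]
        return [chunks[i] for i in order[:k]]
    
    def retrieve(self, query: str, k: int = 3) -> List[Dict]:
        # Fetch a wider candidate pool, then let Claude pick the top k.
        # The 3x pool size is an illustrative choice, not a tuned value.
        candidates = super().retrieve(query, k=k * 3)
        return self.rerank(query, candidates, k=k)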

Results: The Impact of Optimization

After implementing these techniques, here's what we achieved:

| Metric | Basic RAG | Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.68 | 0.69 |
| Avg F1 Score | 0.52 | 0.53 | 0.54 |
| Avg MRR | 0.74 | 0.82 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |

The biggest wins came from:
  • MRR improvement: Re-ranking pushed relevant results higher
  • End-to-end accuracy: Better retrieval led to better answers

Production Considerations

When moving to production, consider:

  • Vector Database: Use Pinecone, Weaviate, or Qdrant for persistence and scaling
  • Chunking Strategy: Experiment with different sizes (256-1024 tokens) and overlap
  • Caching: Cache embeddings and common queries to reduce API costs
  • Monitoring: Track retrieval metrics in production to catch degradation
  • Rate Limits: Be aware of API rate limits, especially during evaluation
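
As one illustration of the caching point, the sketch below wraps an embedding function so that each distinct text is embedded only once. It is a minimal in-memory version under stated assumptions: `EmbeddingCache` is a hypothetical class, the cache lives in a plain dict, and `fake_embed` stands in for a real embedding call.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Wrap an embedding function so each distinct text is embedded only once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self.embed_fn = embed_fn
        self.store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the content so long texts make compact, stable keys
        return hashlib.sha256(text.encode('utf-8')).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect texts not seen before, de-duplicated, preserving order
        missing = []
        for text in texts:
            if self._key(text) not in self.store and text not in missing:
                missing.append(text)
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.store[self._key(text)] = vector
        return [self.store[self._key(text)] for text in texts]

# Fake embedder that records how many texts each call had to embed
calls = []
def fake_embed(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed)
first = cache.embed(["rate limits", "streaming", "rate limits"])
second = cache.embed(["streaming", "tool use"])
print(calls)
```

With the Voyage AI client used earlier, `embed_fn` could be something like `lambda texts: client.embed(texts, model="voyage-2").embeddings`. A production setup would likely persist the store on disk or in a key-value service and include the model name in the cache key.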

Key Takeaways

  • Measure separately, optimize together: Evaluate retrieval and generation independently to identify bottlenecks
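  • MRR matters most: In this evaluation, MRR rose from 0.74 to 0.87 while recall barely moved (0.66 to 0.69), and end-to-end accuracy rose with it. Getting the right chunk to the top mattered more than retrieving every relevant chunk
  • Summary indexing adds context: Each summary states what its chunk is about, so the retriever can match queries that the raw chunk text alone would miss
  • Re-ranking is worth the cost: Using Claude to re-rank even a small set of candidates raised end-to-end accuracy from 76% to 81% here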
  • Start simple, iterate: A basic RAG system with good evaluation beats a complex system with no metrics