BeClaude
Guide · Beginner · Best Practices · 2026-05-15

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks through building a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement chunking, embedding, retrieval, and evaluation metrics like Precision, Recall, F1, and MRR. Advanced techniques include summary indexing and Claude-powered re-ranking to boost end-to-end accuracy from 71% to 81%.

Tags: RAG, Claude API, Vector Search, Evaluation, Voyage AI

Retrieval Augmented Generation (RAG) is a cornerstone of many enterprise AI applications. While Claude excels at general knowledge tasks, it needs RAG to answer questions specific to your business context, whether that's internal documentation, customer support knowledge bases, or financial analysis reports.

In this guide, we'll build a RAG system using Claude and Voyage AI embeddings, then systematically improve it through evaluation-driven optimization. We'll move beyond "vibes-based" testing and implement proper metrics that production systems demand.

Why RAG Matters for Claude Users

Claude's training data has a cutoff date, and it doesn't know your company's internal documents. RAG bridges this gap by:

  • Retrieving relevant chunks from your knowledge base
  • Injecting them into Claude's context window
  • Enabling accurate, grounded answers to domain-specific questions

Enterprise teams use RAG for customer support automation, internal Q&A systems, financial analysis, and legal document review. The key challenge? Building a RAG system that actually works in production.

Prerequisites and Setup

Before diving in, you'll need:

  • Anthropic API key for Claude access
  • Voyage AI API key for high-quality embeddings
  • Python environment with these libraries:
# Core dependencies
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Initialize Your Vector Database

For this guide, we'll use an in-memory vector store. In production, consider a dedicated vector database such as Pinecone, Weaviate, or Chroma.

import numpy as np
from typing import List, Dict, Tuple

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: List[str], embeddings: List[List[float]]):
        self.documents.extend(docs)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: List[float], k: int = 3) -> List[Tuple[str, float]]:
        # Cosine similarity search
        similarities = []
        for doc_emb in self.embeddings:
            sim = np.dot(query_embedding, doc_emb) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb)
            )
            similarities.append(sim)
        # Indices of the k highest similarities, best first
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [(self.documents[i], similarities[i]) for i in top_indices]
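The per-document loop above is easy to read but slows down as the store grows. As a rough sketch of one alternative, the same cosine ranking can be computed with a single matrix-vector product. The function name `cosine_top_k` and the toy vectors are made up for illustration, and the code assumes no embedding is all zeros.

```python
import numpy as np

def cosine_top_k(query, matrix, k=3):
    """Return (row_index, similarity) pairs for the k rows most similar to query."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(matrix, dtype=float)
    # Normalize once so a plain dot product equals cosine similarity.
    # Assumption: no vector has zero length.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q
    k = min(k, len(sims))
    top = np.argsort(sims)[::-1][:k]  # highest similarity first
    return [(int(i), float(sims[i])) for i in top]

# Toy 2-D "embeddings" standing in for real ones
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = cosine_top_k([1.0, 0.1], doc_vectors, k=2)
```

Here the query points almost along the first axis, so row 0 ranks first and the diagonal row 2 second.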

Level 1: Basic RAG Pipeline

Let's start with what the industry calls "Naive RAG." This three-step process is the foundation:

  • Chunk documents by heading (each subheading becomes a chunk)
  • Embed each chunk using Voyage AI
  • Retrieve relevant chunks via cosine similarity

Implementation

import voyageai
from anthropic import Anthropic

# Initialize clients
vo = voyageai.Client(api_key="your-voyage-api-key")
claude = Anthropic(api_key="your-anthropic-api-key")

def chunk_document(text: str) -> List[str]:
    """Split document by headings (## or ###)"""
    chunks = []
    current_chunk = []
    for line in text.split('\n'):
        # '##' also matches '###' and deeper headings
        if line.startswith('##'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

def basic_rag(query: str, vector_db: InMemoryVectorDB) -> str:
    # Step 1: Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Step 2: Retrieve relevant chunks
    retrieved = vector_db.search(query_embedding, k=3)
    context = "\n\n---\n\n".join([doc for doc, _ in retrieved])

    # Step 3: Generate answer with Claude
    response = claude.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
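The pipeline above assumes the vector store is already populated. One possible way to fill it is sketched below: chunks are embedded in batches and handed to the store. The helper name `build_index`, its parameters, and the stand-in embedder are illustrative assumptions, not part of the original guide. In this guide's setup, `embed_fn` could wrap `vo.embed(texts, model="voyage-2").embeddings` and `add_fn` could be `vector_db.add_documents`.

```python
def build_index(chunks, embed_fn, add_fn, batch_size=64):
    """Embed chunks in batches and pass each batch to the vector store.

    embed_fn: takes a list of strings, returns one embedding per string.
    add_fn:   takes (docs, embeddings), e.g. a store's add_documents method.
    Both are hypothetical hooks so the sketch runs without API keys.
    """
    total = 0
    for start in range(0, len(chunks), batch_size):
        batch = list(chunks[start:start + batch_size])
        embeddings = embed_fn(batch)
        if len(embeddings) != len(batch):
            raise ValueError("embed_fn must return one embedding per chunk")
        add_fn(batch, embeddings)
        total += len(batch)
    return total

# Demo with a stand-in embedder (text length as a fake 2-D vector)
stored_docs, stored_embs = [], []

def fake_embed(texts):
    return [[float(len(t)), 1.0] for t in texts]

def fake_add(docs, embs):
    stored_docs.extend(docs)
    stored_embs.extend(embs)

indexed = build_index(
    ["## A\nalpha", "## B\nbeta beta", "## C\ngamma"],
    fake_embed,
    fake_add,
    batch_size=2,
)
```

Batching keeps the number of embedding requests low, which matters once a knowledge base has thousands of chunks.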

Building an Evaluation System

This is where most RAG tutorials fall short. We need to measure:

  • Retrieval performance (is the system finding the right chunks?)
  • End-to-end accuracy (is Claude giving correct answers?)

Creating a Test Dataset

Generate 100+ test samples, each containing:

  • A question
  • Ground-truth relevant chunks
  • A correct answer
# Example test sample structure
test_sample = {
    "question": "How do I set up rate limiting in Claude API?",
    "relevant_chunks": [
        "Rate limits are applied per API key...",
        "To increase your rate limit, contact..."
    ],
    "correct_answer": "Rate limits are configured per API key..."
}
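Because the retrieval metrics later compare chunk text by exact match, a ground-truth chunk that doesn't appear verbatim in the index can never count as a hit. A small sanity check along these lines may catch that early. The function name `validate_test_data` and the sample data are hypothetical.

```python
def validate_test_data(test_data, indexed_chunks):
    """Return a list of problems found; an empty list means the data looks usable."""
    known = set(indexed_chunks)
    problems = []
    for n, sample in enumerate(test_data):
        # Every sample needs all three fields, and none may be empty
        for key in ("question", "relevant_chunks", "correct_answer"):
            if not sample.get(key):
                problems.append(f"sample {n}: missing or empty '{key}'")
        # Ground-truth chunks must match indexed chunks exactly
        for chunk in sample.get("relevant_chunks") or []:
            if chunk not in known:
                problems.append(f"sample {n}: relevant chunk not found in index")
    return problems

indexed_chunks = [
    "Rate limits are applied per API key.",
    "Streaming returns tokens incrementally.",
]
test_data = [
    {
        "question": "How are rate limits applied?",
        "relevant_chunks": ["Rate limits are applied per API key."],
        "correct_answer": "Per API key.",
    },
    {
        "question": "How does streaming work?",
        "relevant_chunks": ["Streaming sends tokens..."],  # truncated, will not match
        "correct_answer": "",
    },
]
problems = validate_test_data(test_data, indexed_chunks)
```

In this demo the second sample is flagged twice: once for the empty answer and once for the truncated chunk.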

Key Metrics Explained

#### Precision

What it measures: Of all chunks retrieved, how many were actually relevant?

Precision = True Positives / Total Retrieved

  • High precision = fewer irrelevant chunks
  • Our system retrieves at least 3 chunks per query, so precision drops whenever fewer than 3 of them are truly relevant

#### Recall

What it measures: Of all relevant chunks, how many did we retrieve?

Recall = True Positives / Total Relevant

  • Critical for ensuring Claude has all necessary information
  • Low recall means missing important context

#### F1 Score

What it measures: The harmonic mean of precision and recall

F1 = 2 * (Precision * Recall) / (Precision + Recall)
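To make the three retrieval metrics concrete, here is a small worked example for a single query. The chunk IDs are made up, and `precision_recall_f1` is an illustrative helper rather than code from the guide.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query using exact-match sets."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 3 chunks retrieved, 2 chunks are relevant, 1 overlaps:
# precision = 1/3, recall = 1/2, F1 = 2 * (1/6) / (5/6) = 0.4
precision, recall, f1 = precision_recall_f1(["c1", "c4", "c7"], ["c1", "c2"])
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.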

#### Mean Reciprocal Rank (MRR)

What it measures: How high did the first relevant chunk rank?

def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1 / (i + 1)
    return 0
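  • The function above returns the reciprocal rank for a single query; MRR is the mean of that value across all test queries
  • An MRR of 0.87 corresponds to the first relevant chunk sitting at a harmonic-mean rank of about 1.15 (1 / 0.87)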
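As a sketch of the averaging step, the snippet below computes the reciprocal rank for each query and then takes the mean. The names `reciprocal_rank` and `mean_reciprocal_rank` and the toy runs are illustrative.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for position, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    (["a", "b", "c"], ["a"]),  # first hit at rank 1 -> 1.0
    (["a", "b", "c"], ["b"]),  # first hit at rank 2 -> 0.5
    (["a", "b", "c"], ["z"]),  # no hit             -> 0.0
]
mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```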
#### End-to-End Accuracy

What it measures: Does Claude's final answer match the ground truth?
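The guide doesn't spell out how "match" is decided, so the following is only one possible shape for the measurement. It loops over the test set, generates an answer, and asks a pluggable judge whether it agrees with the ground truth. Both `end_to_end_accuracy` and the crude `normalized_match` judge are assumptions for illustration. In practice the judge might be a call to Claude that compares the two answers.

```python
def end_to_end_accuracy(test_data, answer_fn, judge_fn):
    """Fraction of samples whose generated answer the judge accepts.

    answer_fn: question -> generated answer (e.g. a RAG pipeline).
    judge_fn:  (question, generated, expected) -> bool.
    """
    if not test_data:
        return 0.0
    correct = 0
    for sample in test_data:
        generated = answer_fn(sample["question"])
        if judge_fn(sample["question"], generated, sample["correct_answer"]):
            correct += 1
    return correct / len(test_data)

def normalized_match(question, generated, expected):
    """Crude stand-in judge: case- and whitespace-insensitive containment."""
    def norm(text):
        return " ".join(text.lower().split())
    return norm(expected) in norm(generated)

# Demo with canned answers instead of live API calls
samples = [
    {"question": "q1", "correct_answer": "per API key"},
    {"question": "q2", "correct_answer": "contact sales"},
]
canned = {"q1": "Limits apply  per api key.", "q2": "I don't know."}
accuracy = end_to_end_accuracy(samples, canned.get, normalized_match)
```

Keeping the judge as a parameter makes it possible to start with a cheap string check and switch to a stricter grader later without touching the loop.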

Running Evaluations

def evaluate_retrieval(test_data, vector_db):
    results = []
    for sample in test_data:
        query_emb = vo.embed([sample["question"]], model="voyage-2").embeddings[0]
        retrieved = vector_db.search(query_emb, k=3)
        retrieved_texts = [doc for doc, _ in retrieved]
        
        # Calculate metrics
        relevant = sample["relevant_chunks"]
        true_positives = len(set(retrieved_texts) & set(relevant))
        
        precision = true_positives / len(retrieved_texts)
        recall = true_positives / len(relevant)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        results.append({
            "precision": precision,
            "recall": recall,
            "f1": f1
        })
    
    return results
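The function returns one dictionary per test sample, so a final averaging step is needed to get headline numbers. A minimal sketch follows; `summarize_results` and the sample values are illustrative.

```python
from statistics import mean

def summarize_results(results):
    """Average each metric across all per-sample result dictionaries."""
    if not results:
        return {}
    return {f"avg_{key}": mean(r[key] for r in results) for key in results[0]}

sample_results = [
    {"precision": 1 / 3, "recall": 0.5, "f1": 0.4},
    {"precision": 2 / 3, "recall": 1.0, "f1": 0.8},
]
summary = summarize_results(sample_results)
# avg_precision is about 0.5, avg_recall 0.75, avg_f1 about 0.6
```

Note that the average F1 is the mean of per-sample F1 scores, which generally differs from the F1 computed from average precision and average recall.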

Level 2: Summary Indexing

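Basic chunking misses relationships between sections. Summary indexing adds a second, coarser layer on top of the chunks: Claude summarizes each group of adjacent chunks, and each summary points back to the chunks it covers: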

def create_summary_index(chunks: List[str]) -> Dict[str, List[str]]:
    """Create summaries for groups of related chunks"""
    summary_index = {}
    
    for i in range(0, len(chunks), 3):
        chunk_group = chunks[i:i+3]
        combined = "\n\n".join(chunk_group)
        
        # Use Claude to summarize
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this documentation section:\n\n{combined}"
            }]
        )
        
        summary_index[response.content[0].text] = chunk_group
    
    return summary_index
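The guide doesn't show how the summary index is queried. One plausible approach, sketched here under that assumption, is a two-stage lookup: rank the summaries against the query, then expand the best-matching summaries into their underlying chunks. `search_summary_index`, the toy vocabulary, and `toy_embed` are all hypothetical. With real embeddings, `embed_fn` would wrap the embedding client instead.

```python
import numpy as np

def search_summary_index(query, summary_index, embed_fn, top_groups=1):
    """Rank summaries by cosine similarity to the query, then return their chunks."""
    summaries = list(summary_index)
    vectors = np.asarray(embed_fn(summaries), dtype=float)
    q = np.asarray(embed_fn([query])[0], dtype=float)
    # Small epsilon avoids division by zero for all-zero vectors
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-12
    sims = vectors @ q / norms
    best = np.argsort(sims)[::-1][:top_groups]
    chunks = []
    for i in best:
        chunks.extend(summary_index[summaries[int(i)]])
    return chunks

# Stand-in embedder: keyword counts over a tiny vocabulary
VOCAB = ["rate", "limit", "stream", "token"]

def toy_embed(texts):
    return [[float(t.lower().count(word)) for word in VOCAB] for t in texts]

index = {
    "Covers rate limit tiers and how to raise a limit.": ["chunk A1", "chunk A2"],
    "Explains how to stream responses token by token.": ["chunk B1"],
}
hits = search_summary_index("What is my rate limit?", index, toy_embed)
```

The query shares vocabulary only with the first summary, so its two chunks come back together even if one of them alone would not have matched well.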

Level 3: Adding Re-Ranking with Claude

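Re-ranking changes the order of retrieved chunks, which is exactly what MRR measures. After initial retrieval, use Claude to score each candidate's relevance and keep the highest-scoring chunks: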

def rerank_with_claude(query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance"""
    scored_chunks = []
    
    for chunk in candidates:
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this text to the question?\n\nQuestion: {query}\n\nText: {chunk}\n\nAnswer only with a number."
            }]
        )
        
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            score = 0
        
        scored_chunks.append((chunk, score))
    
    # Sort by score descending
    scored_chunks.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored_chunks[:top_k]]
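Calling `float()` on the raw reply works only when the model returns nothing but a number. A reply such as "Score: 8" would fall through to 0. A slightly more forgiving parser, sketched below as an assumption rather than the guide's method, pulls out the first number it finds and clamps it to the expected range. `parse_relevance_score` is an illustrative name.

```python
import re

def parse_relevance_score(reply, low=0.0, high=10.0):
    """Extract the first number in reply and clamp it to [low, high].

    Returns low when the reply contains no number at all.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return low
    return max(low, min(high, float(match.group())))

scores = [
    parse_relevance_score("8"),
    parse_relevance_score("Score: 7.5/10"),
    parse_relevance_score("not relevant"),
    parse_relevance_score("15"),
]
```

This takes the first number in the text, so a reply that restates the scale before giving its score would be misread. Keeping the "Answer only with a number" instruction in the prompt remains the main safeguard.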

Results: The Impact of Optimization

After implementing summary indexing and re-ranking, here are the improvements over basic RAG:

| Metric | Basic RAG | Optimized | Relative Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |

The largest gains came in MRR (+17.6%) and end-to-end accuracy (+14.1% relative, or 10 percentage points), suggesting that better retrieval ordering carries through to Claude's answer quality.
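The improvement figures are relative changes, computed as (optimized − basic) / basic. The short check below reproduces them from the before and after values reported above. `relative_improvement` is an illustrative helper name.

```python
def relative_improvement(before, after):
    """Percentage change from before to after, relative to before."""
    return (after - before) / before * 100

reported = {
    "Avg Precision": (0.43, 0.44),
    "Avg Recall": (0.66, 0.69),
    "Avg F1 Score": (0.52, 0.54),
    "Avg MRR": (0.74, 0.87),
    "End-to-End Accuracy": (71, 81),
}
improvements = {
    name: round(relative_improvement(before, after), 1)
    for name, (before, after) in reported.items()
}
```

For end-to-end accuracy, the 10-point gain from 71% to 81% works out to 10 / 71, or about 14.1% in relative terms.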

Production Considerations

  • Rate Limits: Full evaluations can hit API limits. Use Tier 2+ accounts for extensive testing.
  • Token Budget: Summary indexing and re-ranking consume additional tokens. Balance cost vs. quality.
  • Vector Database: Move from in-memory to hosted solutions (Pinecone, Weaviate) for production.
  • Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your content.
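For the rate-limit point, one common mitigation is to retry failed calls with exponential backoff. The wrapper below is a generic sketch: `call_with_backoff` and its parameters are made up for illustration, and the demo simulates failures instead of calling any API. With the Anthropic SDK, `retry_on` might be set to `(anthropic.RateLimitError,)`.

```python
import random
import time

def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Delays grow as base_delay * 2**(attempt - 1), plus up to 10% jitter.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))

# Demo: a call that fails twice, then succeeds; sleep is stubbed to record delays
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

delays = []
result = call_with_backoff(flaky_call, retry_on=(RuntimeError,), sleep=delays.append)
```

Passing `sleep` as a parameter keeps the demo instant and makes the wrapper easy to test. Real usage would leave it at the default.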

Key Takeaways

  • Evaluate retrieval and generation separately to identify where your RAG system needs improvement
  • MRR is your most actionable metric for retrieval optimization—small improvements here compound into large end-to-end gains
  • Summary indexing helps Claude understand document structure and relationships between sections
  • Re-ranking with Claude dramatically improves retrieval quality without changing your embedding pipeline
  • Start simple, measure everything, then optimize—basic RAG works, but systematic evaluation reveals where to invest your optimization efforts