BeClaude | Guide · Beginner · 2026-05-06

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic pipelines, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to set up retrieval, measure performance with precision/recall/F1/MRR, and improve end-to-end accuracy from 71% to 81%.

RAG · Claude · Retrieval Augmented Generation · Evaluation · Voyage AI


Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your proprietary data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.

In this guide, we'll walk through building a complete RAG system using Claude, Voyage AI embeddings, and an in-memory vector store. We'll start with a basic pipeline, then show you how to measure performance systematically, and finally implement advanced techniques that boost end-to-end accuracy from 71% to 81%.

Why RAG Matters for Claude Users

Claude excels at general knowledge tasks, but it can't know your internal documentation, product manuals, or proprietary research. RAG bridges this gap by:

  • Grounding answers in your verified content
  • Reducing hallucinations by constraining Claude to retrieved context
  • Enabling domain-specific queries without fine-tuning
  • Keeping knowledge current by updating your document store

Setting Up Your RAG Environment

First, install the required libraries:

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

You'll need API keys from Anthropic and Voyage AI. Set them as environment variables:

import os
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"

Initialize a Vector Database

For this guide, we'll use an in-memory vector store. In production, consider hosted solutions like Pinecone, Weaviate, or MongoDB Atlas.

import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add(self, vector: List[float], metadata: Dict):
        self.vectors.append(vector)
        self.metadata.append(metadata)

    def search(self, query_vector: List[float], k: int = 3) -> List[Dict]:
        similarities = [
            np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
            for vec in self.vectors
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.metadata[i] for i in top_indices]
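To confirm the store behaves as expected, here is a quick sanity check with toy three-dimensional vectors (the values and metadata are made up for illustration):

# Toy vectors: cosine similarity should rank "Rate limits" first for a query near [1, 0, 0]
db = InMemoryVectorDB()
db.add([1.0, 0.0, 0.0], {"heading": "Rate limits"})
db.add([0.0, 1.0, 0.0], {"heading": "Authentication"})

print(db.search([0.9, 0.1, 0.0], k=1))  # [{'heading': 'Rate limits'}]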

Level 1: Basic RAG Pipeline

A basic RAG system (often called "Naive RAG") follows three steps:

  • Chunk documents by heading or logical sections
  • Embed each chunk using a high-quality embedding model
  • Retrieve relevant chunks via cosine similarity and feed them to Claude

Chunking Strategy

def chunk_by_headings(text: str) -> List[Dict]:
    """Split document by markdown headings, preserving context."""
    chunks = []
    current_heading = "Introduction"
    current_content = []
    
    for line in text.split("\n"):
        if line.startswith("##"):
            if current_content:
                chunks.append({
                    "heading": current_heading,
                    "content": "\n".join(current_content)
                })
            current_heading = line.strip("# ").strip()
            current_content = []
        else:
            current_content.append(line)
    
    if current_content:
        chunks.append({
            "heading": current_heading,
            "content": "\n".join(current_content)
        })
    return chunks

Embedding and Retrieval

import voyageai

vo = voyageai.Client()

def embed_chunks(chunks: List[Dict]) -> List[List[float]]:
    texts = [chunk["content"] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    return embeddings

def retrieve(query: str, db: InMemoryVectorDB, k: int = 3) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    return db.search(query_embedding, k=k)

Generating Answers with Claude

from anthropic import Anthropic

client = Anthropic()

def answer_with_claude(query: str, context_chunks: List[Dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {chunk['heading']}\n{chunk['content']}"
        for chunk in context_chunks
    ])
    prompt = f"""Answer the question based on the provided context. If the context doesn't contain enough information, say so.

Context: {context}

Question: {query}

Answer:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Building an Evaluation System

"Vibes-based" evaluation won't cut it for production. You need systematic metrics that measure both retrieval quality and end-to-end performance.

Creating an Evaluation Dataset

Generate a synthetic dataset with 100+ samples, each containing:

  • A question
  • Ground-truth relevant chunks
  • A correct answer

For example, one sample might look like:
{
  "question": "How do I set up rate limiting in Claude?",
  "relevant_chunks": ["rate_limiting.md", "api_basics.md"],
  "correct_answer": "Rate limiting is configured via the Anthropic console..."
}
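One way to bootstrap such a dataset is to have Claude draft a question and answer for each chunk, treating that chunk as the ground truth. A minimal sketch (the prompt wording and output parsing are illustrative, not a fixed recipe):

def generate_eval_samples(chunks: List[Dict]) -> List[Dict]:
    """Ask Claude for one question and answer per chunk; the chunk itself is the ground truth."""
    samples = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": "Write one factual question that can be answered using only the text below, "
                           "then the answer on a new line:\n\n" + chunk["content"]
            }]
        )
        lines = response.content[0].text.strip().split("\n", 1)
        samples.append({
            "question": lines[0].strip(),
            "relevant_chunks": [chunk["heading"]],
            "correct_answer": lines[1].strip() if len(lines) > 1 else ""
        })
    return samples

Hand-review the generated samples before trusting them; synthetic questions tend to mirror the chunk's wording, which makes retrieval look easier than real user queries.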

Retrieval Metrics

Precision

What it measures: Of all chunks retrieved, how many were actually relevant?
def precision(retrieved: List[str], relevant: List[str]) -> float:
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    if len(retrieved_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)
Interpretation: High precision means your system isn't wasting Claude's context window on irrelevant information.

Recall

What it measures: Of all relevant chunks, how many did we retrieve?
def recall(retrieved: List[str], relevant: List[str]) -> float:
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    if len(relevant_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
Interpretation: High recall ensures Claude has all the information it needs to answer correctly.

F1 Score

The harmonic mean of precision and recall:

def f1_score(prec: float, rec: float) -> float:
    if prec + rec == 0:
        return 0.0
    return 2 * (prec * rec) / (prec + rec)

Mean Reciprocal Rank (MRR)

Measures how early the first relevant chunk appears in your results:

def mrr(retrieved: List[str], relevant: List[str]) -> float:
    for i, chunk in enumerate(retrieved):
        if chunk in relevant:
            return 1.0 / (i + 1)
    return 0.0
Why MRR matters: If the first relevant chunk is at position 3, Claude has to wade through 2 irrelevant chunks first, increasing the chance of confusion.

End-to-End Accuracy

This measures whether Claude's final answer is correct, using a judge LLM or human evaluation:

def evaluate_answer(question: str, generated: str, correct: str) -> bool:
    prompt = f"""Does the following answer correctly address the question?

Question: {question}
Generated Answer: {generated}
Correct Answer: {correct}

Answer YES or NO:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return "YES" in response.content[0].text
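With the metrics defined, you can loop over the evaluation set and average each one. A minimal harness, assuming each sample follows the JSON format above and that relevant_chunks holds the same headings produced by chunk_by_headings:

def run_evaluation(dataset: List[Dict], db: InMemoryVectorDB, k: int = 3) -> Dict[str, float]:
    """Average retrieval metrics and end-to-end accuracy over the evaluation set."""
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "mrr": 0.0, "accuracy": 0.0}
    for sample in dataset:
        retrieved_chunks = retrieve(sample["question"], db, k=k)
        retrieved_ids = [chunk["heading"] for chunk in retrieved_chunks]
        relevant_ids = sample["relevant_chunks"]

        prec = precision(retrieved_ids, relevant_ids)
        rec = recall(retrieved_ids, relevant_ids)
        totals["precision"] += prec
        totals["recall"] += rec
        totals["f1"] += f1_score(prec, rec)
        totals["mrr"] += mrr(retrieved_ids, relevant_ids)

        generated = answer_with_claude(sample["question"], retrieved_chunks)
        totals["accuracy"] += evaluate_answer(sample["question"], generated, sample["correct_answer"])

    return {metric: value / len(dataset) for metric, value in totals.items()}

Running this over your evaluation set produces per-metric averages like those in the results table below; on small sets the numbers are noisy, so treat small differences with caution.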

Level 2: Summary Indexing

Basic RAG retrieves individual chunks, but sometimes the answer requires synthesizing information across multiple sections. Summary indexing addresses this by creating higher-level summaries that capture cross-chunk context.

def create_summary_index(chunks: List[Dict], window_size: int = 3) -> List[Dict]:
    """Create sliding window summaries over consecutive chunks."""
    summaries = []
    for i in range(len(chunks) - window_size + 1):
        window = chunks[i:i+window_size]
        combined = "\n".join([c["content"] for c in window])
        
        # Use Claude to generate a concise summary
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize the following text in 2-3 sentences:\n\n{combined}"
            }]
        )
        
        summaries.append({
            "summary": response.content[0].text,
            "source_chunks": [c["heading"] for c in window],
            "original_content": combined
        })
    return summaries

Embed and index both the original chunks and the summaries. When retrieving, search across both indexes and merge results.
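A sketch of that merged retrieval step, assuming the summaries were embedded into a second InMemoryVectorDB with each summary dict stored as metadata (the preference for chunk-level hits and the 2*k cap are illustrative choices):

def retrieve_with_summaries(query: str, chunk_db: InMemoryVectorDB,
                            summary_db: InMemoryVectorDB, k: int = 3) -> List[Dict]:
    """Search both indexes, expand summary hits back to full text, and deduplicate by heading."""
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    chunk_hits = chunk_db.search(query_embedding, k=k)
    summary_hits = summary_db.search(query_embedding, k=k)

    # Turn each summary hit into a pseudo-chunk Claude can read alongside regular chunks
    expanded = [
        {"heading": " + ".join(hit["source_chunks"]), "content": hit["original_content"]}
        for hit in summary_hits
    ]

    merged, seen = [], set()
    for candidate in chunk_hits + expanded:
        if candidate["heading"] not in seen:
            seen.add(candidate["heading"])
            merged.append(candidate)
    return merged[:2 * k]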

Level 3: Summary Indexing + Re-Ranking

Re-ranking adds a second stage to your retrieval pipeline. After the initial retrieval, Claude re-orders the chunks by relevance to the specific query.

def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    chunks_text = "\n\n".join([
        f"[{i+1}] {chunk['heading']}\n{chunk['content'][:500]}"
        for i, chunk in enumerate(candidates)
    ])
    
    prompt = f"""Given the question below, rank the following chunks by relevance (most relevant first).
Return only the numbers in order, comma-separated.

Question: {query}

Chunks: {chunks_text}

Ranking:""" response = client.messages.create( model="claude-3-haiku-20240307", max_tokens=100, messages=[{"role": "user", "content": prompt}] ) # Parse the ranking import re indices = [int(x) - 1 for x in re.findall(r"\d+", response.content[0].text)] return [candidates[i] for i in indices[:top_k]]

Performance Gains

With these optimizations, here's what you can expect:

Metric                 Basic RAG    Optimized RAG
Avg Precision          0.43         0.44
Avg Recall             0.66         0.69
Avg F1 Score           0.52         0.54
Avg MRR                0.74         0.87
End-to-End Accuracy    71%          81%

The biggest jump comes from MRR (0.74 → 0.87), meaning Claude gets relevant information earlier in the context window, leading to better answers.

Production Considerations

  • Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access or run smaller eval sets.
  • Token Budget: Summary indexing and re-ranking add token costs. Benchmark to ensure ROI.
  • Vector Database: For production, use a hosted vector DB with built-in indexing and scaling.
  • Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your content; a rough size-based chunker is sketched after this list.
  • Embedding Model: Voyage AI's voyage-2 works well, but test alternatives like text-embedding-3-small.
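If you want to try size-based chunking alongside the heading-based splitter, here is a rough sketch; the 4-characters-per-token ratio is an approximation rather than an exact tokenizer, and the defaults are starting points to tune:

def chunk_by_tokens(text: str, max_tokens: int = 512, overlap_tokens: int = 50) -> List[Dict]:
    """Fixed-size chunking with overlap, using a rough 4-characters-per-token heuristic."""
    max_chars = max_tokens * 4
    step = max(max_chars - overlap_tokens * 4, 1)  # always advance so the loop terminates
    chunks, start = [], 0
    while start < len(text):
        chunks.append({
            "heading": f"chunk_{len(chunks)}",  # synthetic heading; real headings aren't preserved here
            "content": text[start:start + max_chars]
        })
        start += step
    return chunks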

Key Takeaways

  • Start with basic RAG, then systematically measure retrieval quality using precision, recall, F1, and MRR before optimizing.
  • Summary indexing captures cross-chunk context, improving recall for questions that require synthesis.
  • Re-ranking with Claude dramatically improves MRR, ensuring the most relevant information appears first in the context window.
  • Evaluate retrieval and end-to-end performance separately to identify where your pipeline needs improvement.
  • Expect 10-15% accuracy gains from advanced techniques, but always benchmark against your specific use case and data.