Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic pipelines, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to set up retrieval, measure performance with precision/recall/F1/MRR, and improve end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your proprietary data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.
In this guide, we'll walk through building a complete RAG system using Claude, Voyage AI embeddings, and an in-memory vector store. We'll start with a basic pipeline, then show you how to measure performance systematically, and finally implement advanced techniques that boost end-to-end accuracy from 71% to 81%.
Why RAG Matters for Claude Users
Claude excels at general knowledge tasks, but it can't know your internal documentation, product manuals, or proprietary research. RAG bridges this gap by:
- Grounding answers in your verified content
- Reducing hallucinations by constraining Claude to retrieved context
- Enabling domain-specific queries without fine-tuning
- Keeping knowledge current by updating your document store
Setting Up Your RAG Environment
First, install the required libraries:
```bash
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
You'll need API keys from Anthropic and Voyage AI. Set them as environment variables:
```python
import os

os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"
```
Initialize a Vector Database
For this guide, we'll use an in-memory vector store. In production, consider hosted solutions like Pinecone, Weaviate, or MongoDB Atlas.
```python
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add(self, vector: List[float], metadata: Dict):
        self.vectors.append(vector)
        self.metadata.append(metadata)

    def search(self, query_vector: List[float], k: int = 3) -> List[Dict]:
        # Rank stored vectors by cosine similarity to the query
        similarities = [
            np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
            for vec in self.vectors
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.metadata[i] for i in top_indices]
```
Level 1: Basic RAG Pipeline
A basic RAG system (often called "Naive RAG") follows three steps:
- Chunk documents by heading or logical sections
- Embed each chunk using a high-quality embedding model
- Retrieve relevant chunks via cosine similarity and feed them to Claude
Chunking Strategy
```python
def chunk_by_headings(text: str) -> List[Dict]:
    """Split document by markdown headings, preserving context."""
    chunks = []
    current_heading = "Introduction"
    current_content = []
    for line in text.split("\n"):
        if line.startswith("##"):
            # Flush the previous section before starting a new one
            if current_content:
                chunks.append({
                    "heading": current_heading,
                    "content": "\n".join(current_content)
                })
            current_heading = line.strip("# ").strip()
            current_content = []
        else:
            current_content.append(line)
    # Don't forget the final section
    if current_content:
        chunks.append({
            "heading": current_heading,
            "content": "\n".join(current_content)
        })
    return chunks
```
Embedding and Retrieval
```python
import voyageai

vo = voyageai.Client()

def embed_chunks(chunks: List[Dict]) -> List[List[float]]:
    texts = [chunk["content"] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    return embeddings

def retrieve(query: str, db: InMemoryVectorDB, k: int = 3) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    return db.search(query_embedding, k=k)
```
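To connect chunking, embedding, and retrieval, the chunks still need to be loaded into the vector store. Here is a minimal indexing sketch, assuming `document_text` holds your raw document (the variable name is just a placeholder):

```python
# Indexing sketch: embed each chunk and store it with its metadata.
# `document_text` is a placeholder for your raw markdown document.
db = InMemoryVectorDB()
chunks = chunk_by_headings(document_text)
embeddings = embed_chunks(chunks)

for chunk, embedding in zip(chunks, embeddings):
    db.add(embedding, {"heading": chunk["heading"], "content": chunk["content"]})
```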
Generating Answers with Claude
```python
from anthropic import Anthropic

client = Anthropic()

def answer_with_claude(query: str, context_chunks: List[Dict]) -> str:
    context = "\n\n---\n\n".join([
        f"Source: {chunk['heading']}\n{chunk['content']}"
        for chunk in context_chunks
    ])
    prompt = f"""Answer the question based on the provided context. If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
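Putting the pieces together, a query against the basic pipeline looks roughly like this (a sketch reusing the `db` populated above; the example question is arbitrary):

```python
# End-to-end query over the basic pipeline (sketch; `db` is the store indexed above)
query = "How do I set up rate limiting in Claude?"
top_chunks = retrieve(query, db, k=3)
answer = answer_with_claude(query, top_chunks)
print(answer)
```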
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need systematic metrics that measure both retrieval quality and end-to-end performance.
Creating an Evaluation Dataset
Generate a synthetic dataset with 100+ samples, each containing:
- A question
- Ground-truth relevant chunks
- A correct answer
```json
{
  "question": "How do I set up rate limiting in Claude?",
  "relevant_chunks": ["rate_limiting.md", "api_basics.md"],
  "correct_answer": "Rate limiting is configured via the Anthropic console..."
}
```
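One way to bootstrap such a dataset is to have Claude write a question and answer for each chunk, using the source chunk as the ground-truth relevant chunk. The sketch below is one possible recipe, not a prescribed one; the prompt wording and output parsing are assumptions:

```python
def generate_eval_sample(chunk: Dict) -> Dict:
    # Ask Claude to write a question that this specific chunk can answer,
    # plus a reference answer on a line prefixed with "ANSWER:".
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Write one specific question that the following documentation section "
                "answers, then the answer on a new line prefixed with 'ANSWER:'.\n\n"
                + chunk["content"]
            )
        }]
    )
    text = response.content[0].text
    question, _, answer = text.partition("ANSWER:")
    return {
        "question": question.strip(),
        "relevant_chunks": [chunk["heading"]],
        "correct_answer": answer.strip()
    }
```

Spot-check a sample of the generated questions by hand; synthetic data is only useful if it resembles what your users actually ask.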
Retrieval Metrics
#### Precision
What it measures: Of all chunks retrieved, how many were actually relevant?

```python
def precision(retrieved: List[str], relevant: List[str]) -> float:
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    if len(retrieved_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)
```
Interpretation: High precision means your system isn't wasting Claude's context window on irrelevant information.
#### Recall
What it measures: Of all relevant chunks, how many did we retrieve?

```python
def recall(retrieved: List[str], relevant: List[str]) -> float:
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    if len(relevant_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```
Interpretation: High recall ensures Claude has all the information it needs to answer correctly.
#### F1 Score
The harmonic mean of precision and recall:
```python
def f1_score(prec: float, rec: float) -> float:
    if prec + rec == 0:
        return 0.0
    return 2 * (prec * rec) / (prec + rec)
```
#### Mean Reciprocal Rank (MRR)
Measures how early the first relevant chunk appears in your results:
```python
def mrr(retrieved: List[str], relevant: List[str]) -> float:
    for i, chunk in enumerate(retrieved):
        if chunk in relevant:
            return 1.0 / (i + 1)
    return 0.0
```
Why MRR matters: If the first relevant chunk is at position 3, Claude has to wade through two irrelevant chunks first, increasing the chance of confusion.
End-to-End Accuracy
This measures whether Claude's final answer is correct, using a judge LLM or human evaluation:
```python
def evaluate_answer(question: str, generated: str, correct: str) -> bool:
    prompt = f"""Does the following answer correctly address the question?

Question: {question}

Generated Answer: {generated}

Correct Answer: {correct}

Answer YES or NO:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return "YES" in response.content[0].text
```
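Tying the metrics together, a full evaluation pass is just a loop over the dataset. A rough sketch, assuming retrieved chunks and the ground-truth `relevant_chunks` use the same identifiers (headings or filenames):

```python
def run_evaluation(dataset: List[Dict], db: InMemoryVectorDB, k: int = 3) -> Dict:
    # Aggregate retrieval metrics and end-to-end accuracy over the eval set.
    results = {"precision": [], "recall": [], "f1": [], "mrr": [], "accuracy": []}
    for sample in dataset:
        retrieved_chunks = retrieve(sample["question"], db, k=k)
        retrieved_ids = [c["heading"] for c in retrieved_chunks]
        relevant_ids = sample["relevant_chunks"]

        p = precision(retrieved_ids, relevant_ids)
        r = recall(retrieved_ids, relevant_ids)
        results["precision"].append(p)
        results["recall"].append(r)
        results["f1"].append(f1_score(p, r))
        results["mrr"].append(mrr(retrieved_ids, relevant_ids))

        # End-to-end: generate an answer and judge it against the reference
        answer = answer_with_claude(sample["question"], retrieved_chunks)
        results["accuracy"].append(
            evaluate_answer(sample["question"], answer, sample["correct_answer"])
        )
    return {name: sum(vals) / len(vals) for name, vals in results.items()}
```

Run this once on the basic pipeline to establish a baseline, then re-run it after each change so you can attribute gains (or regressions) to specific techniques.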
Level 2: Summary Indexing
Basic RAG retrieves individual chunks, but sometimes the answer requires synthesizing information across multiple sections. Summary indexing addresses this by creating higher-level summaries that capture cross-chunk context.
```python
def create_summary_index(chunks: List[Dict], window_size: int = 3) -> List[Dict]:
    """Create sliding window summaries over consecutive chunks."""
    summaries = []
    for i in range(len(chunks) - window_size + 1):
        window = chunks[i:i + window_size]
        combined = "\n".join([c["content"] for c in window])
        # Use Claude to generate a concise summary of the window
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize the following text in 2-3 sentences:\n\n{combined}"
            }]
        )
        summaries.append({
            "summary": response.content[0].text,
            "source_chunks": [c["heading"] for c in window],
            "original_content": combined
        })
    return summaries
```
Embed and index both the original chunks and the summaries. When retrieving, search across both indexes and merge results.
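Here is a sketch of that merge step, assuming a second `InMemoryVectorDB` for the summaries; summary entries are stored in the same shape as chunks (`heading`/`content` keys) so the rest of the pipeline can reuse them:

```python
# Index summaries alongside the original chunks (sketch; key names are assumptions).
summary_db = InMemoryVectorDB()
summaries = create_summary_index(chunks)
summary_embeddings = vo.embed([s["summary"] for s in summaries], model="voyage-2").embeddings
for s, emb in zip(summaries, summary_embeddings):
    # Store summaries in the same shape as chunks so downstream code can reuse them
    summary_db.add(emb, {"heading": " / ".join(s["source_chunks"]), "content": s["original_content"]})

def retrieve_with_summaries(query: str, k: int = 3) -> List[Dict]:
    # Search both indexes, then merge while dropping duplicates by heading
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    candidates = db.search(query_embedding, k=k) + summary_db.search(query_embedding, k=k)
    merged, seen = [], set()
    for item in candidates:
        if item["heading"] not in seen:
            seen.add(item["heading"])
            merged.append(item)
    return merged[:2 * k]
```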
Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second stage to your retrieval pipeline. After the initial retrieval, Claude re-orders the chunks by relevance to the specific query.
```python
import re

def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    chunks_text = "\n\n".join([
        f"[{i+1}] {chunk['heading']}\n{chunk['content'][:500]}"
        for i, chunk in enumerate(candidates)
    ])
    prompt = f"""Given the question below, rank the following chunks by relevance (most relevant first).
Return only the numbers in order, comma-separated.

Question: {query}

Chunks:
{chunks_text}

Ranking:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ranking, ignoring any out-of-range numbers Claude might return
    indices = [int(x) - 1 for x in re.findall(r"\d+", response.content[0].text)]
    indices = [i for i in indices if 0 <= i < len(candidates)]
    return [candidates[i] for i in indices[:top_k]]
```
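In the optimized pipeline you over-retrieve, re-rank, and pass only the top few chunks to Claude. A sketch reusing `retrieve_with_summaries` from the earlier sketch (the candidate counts are assumptions worth tuning):

```python
def answer_optimized(query: str, top_k: int = 3) -> str:
    # Over-retrieve, re-rank with Claude, then answer from the top-ranked chunks only.
    candidates = retrieve_with_summaries(query, k=5)   # up to 10 candidates across both indexes
    best_chunks = rerank_with_claude(query, candidates, top_k=top_k)
    return answer_with_claude(query, best_chunks)
```

The re-ranking call adds one cheap Haiku request per query; in exchange, the most relevant chunk is far more likely to appear first in the context window, which is what drives the MRR improvement below.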
Performance Gains
With these optimizations, here's what you can expect:
| Metric | Basic RAG | Optimized RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access or run smaller eval sets.
- Token Budget: Summary indexing and re-ranking add token costs. Benchmark to ensure ROI.
- Vector Database: For production, use a hosted vector DB with built-in indexing and scaling.
- Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your content.
- Embedding Model: Voyage AI's `voyage-2` works well, but test alternatives like `text-embedding-3-small`.
Key Takeaways
- Start with basic RAG, then systematically measure retrieval quality using precision, recall, F1, and MRR before optimizing.
- Summary indexing captures cross-chunk context, improving recall for questions that require synthesis.
- Re-ranking with Claude dramatically improves MRR, ensuring the most relevant information appears first in the context window.
- Evaluate retrieval and end-to-end performance separately to identify where your pipeline needs improvement.
- Expect 10-15% accuracy gains from advanced techniques, but always benchmark against your specific use case and data.