Building a Production-Ready RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, and see how targeted improvements boosted accuracy from 71% to 81%.
Claude excels at general-purpose language tasks, but when you need answers grounded in your proprietary knowledge base—internal documentation, customer support articles, or financial reports—you need Retrieval Augmented Generation (RAG). RAG bridges the gap between Claude's broad capabilities and your specific domain context.
In this guide, we'll build a RAG system using Claude and Voyage AI embeddings, using the Claude Documentation as our knowledge base. We'll start with a basic "naive" pipeline, then layer in advanced techniques like summary indexing and re-ranking. Along the way, we'll build a proper evaluation suite to measure what matters.
By the end, you'll understand how to achieve significant performance gains: our final system improved end-to-end accuracy from 71% to 81%, with Mean Reciprocal Rank jumping from 0.74 to 0.87.
What You'll Need
Before we start, gather your tools:
- Anthropic API key – for accessing Claude
- Voyage AI API key – for generating high-quality embeddings
- Python environment with these libraries:
  - anthropic
  - voyageai
  - pandas, numpy, matplotlib, scikit-learn
```python
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
```
Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG pipeline. This is often called "Naive RAG" in the industry. It follows three steps:
- Chunk documents – Split your knowledge base into manageable pieces. Here, we chunk by heading, keeping content under each subheading together.
- Embed each chunk – Use Voyage AI to convert text chunks into vector embeddings.
- Retrieve by cosine similarity – When a query comes in, embed it, find the most similar chunks, and feed them to Claude as context.
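As a minimal sketch of step 1, here is a heading-based chunker. The `##`-heading splitting rule is an illustrative assumption, not necessarily the exact rule used for the Claude Documentation:

```python
import re

def chunk_by_heading(markdown_text):
    """Split a markdown document into chunks, one per heading section."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # Start a new chunk whenever we hit a markdown heading line.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Everything under a given heading stays together, which preserves local context for the embedding step.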
Initialize an In-Memory Vector Database
For this example, we'll use a simple in-memory vector store. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
class SimpleVectorDB:
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add_chunk(self, text, embedding):
        self.chunks.append(text)
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.chunks[i], similarities[i]) for i in top_indices]
```
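To see the store in action without any API calls, here is a toy example with hand-written 2-D vectors standing in for real Voyage AI embeddings (the class is repeated so the snippet runs standalone):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class SimpleVectorDB:  # same class as above, repeated for a standalone demo
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add_chunk(self, text, embedding):
        self.chunks.append(text)
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.chunks[i], similarities[i]) for i in top_indices]

db = SimpleVectorDB()
db.add_chunk("Rate limits are per-minute.", [1.0, 0.0])
db.add_chunk("Error codes start at 400.", [0.0, 1.0])

# A query vector close to the first chunk's embedding ranks it first.
results = db.search([0.9, 0.1], top_k=2)
print(results[0][0])  # → Rate limits are per-minute.
```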
The Basic RAG Loop
```python
def basic_rag(query, vector_db, voyage_client, claude_client):
    # Step 1: Embed the query with Voyage AI
    query_embedding = voyage_client.embed(
        [query], model="voyage-2", input_type="query"
    ).embeddings[0]

    # Step 2: Retrieve relevant chunks
    retrieved = vector_db.search(query_embedding, top_k=3)
    context = "\n\n".join([chunk for chunk, _ in retrieved])

    # Step 3: Generate answer with Claude
    response = claude_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
This works, but how well? To answer that, we need an evaluation system.
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need quantitative metrics that measure both retrieval quality and end-to-end answer correctness.
The Evaluation Dataset
We synthetically generated 100 test samples, each containing:
- A question
- The correct chunks (ground truth) that should be retrieved
- A correct answer
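A single evaluation sample might look like the following. The field names and values are illustrative, not a required schema:

```python
sample = {
    "question": "How do I set a max_tokens limit in the Messages API?",
    "correct_chunk_ids": [12, 13],   # IDs of the ground-truth chunks
    "correct_answer": "Pass max_tokens in the request body.",
}
```

Keeping ground-truth chunk IDs alongside the answer lets you score retrieval and generation independently.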
Retrieval Metrics
#### Precision
What it measures: Of all chunks retrieved, how many were actually relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$
High precision means you're not wasting Claude's context window on irrelevant information.
#### Recall
What it measures: Of all relevant chunks in the database, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
High recall ensures Claude has all the information it needs.
#### F1 Score
The harmonic mean of precision and recall. Balances both concerns.
#### Mean Reciprocal Rank (MRR)
What it measures: How early in the retrieval results does the first relevant chunk appear?

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
MRR is critical for RAG because Claude's context window is limited—you want the most relevant information first.
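All four retrieval metrics can be computed directly from chunk-ID sets. A minimal sketch, assuming `retrieved_ids` is in rank order (MRR is then the mean of `rr` across all test samples):

```python
def retrieval_metrics(retrieved_ids, correct_ids):
    """Compute precision, recall, F1, and reciprocal rank for one sample."""
    retrieved, correct = set(retrieved_ids), set(correct_ids)
    hits = retrieved & correct
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Reciprocal rank of the first relevant chunk (ranks are 1-based).
    rr = 0.0
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in correct:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": rr}
```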
End-to-End Accuracy
This measures whether Claude's final answer is correct given the retrieved context. It's the ultimate test: does the system actually help users?
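One common way to score end-to-end accuracy is to have Claude itself grade each answer against the ground truth. A sketch of that idea; the prompt wording and the choice of `claude-3-haiku-20240307` as grader are assumptions, not necessarily the exact setup used here:

```python
def grade_answer(question, generated, reference, claude_client):
    """Return True if Claude judges the generated answer correct."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {generated}\n\n"
        "Does the candidate answer convey the same facts as the reference? "
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```

Accuracy is then the fraction of test samples graded correct.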
Level 2: Summary Indexing
Basic RAG has a problem: a single chunk might not contain enough context. For example, a chunk about "rate limits" might not mention that it's part of a larger section on "API best practices."
Summary indexing solves this by creating a secondary index of chunk summaries. When a query comes in, you first search the summary index to find the right neighborhood, then retrieve the full chunks.

```python
def build_summary_index(chunks, claude_client):
    summaries = []
    for chunk in chunks:
        summary = claude_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": f"Summarize this in 1-2 sentences: {chunk}"
            }]
        )
        summaries.append(summary.content[0].text)
    return summaries
```
Then, during retrieval:
- Embed the query and search the summary index.
- Retrieve the full chunks corresponding to the top summary matches.
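The two retrieval steps above can be sketched as one lookup function, assuming position `i` in the summary index corresponds to `full_chunks[i]` (the alignment is an assumption of this sketch):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def summary_search(query_embedding, summary_embeddings, full_chunks, top_k=3):
    """Search the summary index, but return the corresponding full chunks."""
    sims = cosine_similarity([query_embedding], summary_embeddings)[0]
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [full_chunks[i] for i in top_indices]
```

The summaries steer retrieval, but Claude still sees the full chunk text as context.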
Level 3: Summary Indexing + Re-Ranking
Even with summary indexing, the top-3 retrieved chunks might not be in the optimal order. Re-ranking uses Claude itself to reorder the retrieved chunks by relevance to the query.
```python
def rerank_chunks(query, chunks, claude_client):
    prompt = f"""Given the query: "{query}"
Rank the following chunks by relevance (most relevant first).
Return only the chunk numbers in order, separated by commas.

Chunks:
{chr(10).join([f'{i}: {chunk[:200]}...' for i, chunk in enumerate(chunks)])}
"""
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ordered indices, skipping anything that isn't a valid index
    ordered_indices = [
        int(x.strip()) for x in response.content[0].text.split(",")
        if x.strip().isdigit() and int(x.strip()) < len(chunks)
    ]
    return [chunks[i] for i in ordered_indices]
```
Re-ranking dramatically improved our MRR from 0.74 to 0.87, meaning the most relevant chunk almost always appears first.
Results at a Glance
| Metric | Basic RAG | Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 75% | 81% |
The gains came from two sources:
- Summary indexing improving recall (finding more relevant chunks)
- Re-ranking improving MRR (putting the best chunk first), which directly boosted end-to-end accuracy
Production Considerations
- Rate limits – Full evaluations can hit rate limits unless you're on Tier 2 or above. Consider running smaller eval sets during development.
- Vector database – Our in-memory DB is fine for prototyping. For production, use a scalable solution.
- Chunking strategy – Experiment with different chunk sizes and overlap. We found heading-based chunking worked well for documentation.
- Embedding model – Voyage AI provides domain-specific embeddings. Test different models for your use case.
Key Takeaways
- Evaluate retrieval and generation separately – Use precision, recall, F1, and MRR for retrieval; end-to-end accuracy for the full system.
- Summary indexing improves recall – By searching summaries first, you find relevant chunks that might be missed by embedding similarity alone.
- Re-ranking with Claude boosts MRR significantly – Putting the most relevant chunk first improves Claude's answers because it sees the best context immediately.
- Small improvements compound – A 0.13 increase in MRR translated to a 10% absolute gain in end-to-end accuracy.
- Build your evaluation dataset early – Synthetic data generation works well for initial development. Iterate on real user queries as you mature.