Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers naive RAG, summary indexing, re-ranking, and production evaluation metrics.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, achieving up to 81% accuracy.
Claude excels at general knowledge tasks, but when you need answers rooted in your internal documentation, customer support articles, or proprietary research, a standard LLM prompt often falls short. This is where Retrieval Augmented Generation (RAG) becomes your most powerful tool.
RAG enables Claude to search your knowledge base, retrieve the most relevant chunks, and generate answers grounded in those retrieved documents. In this guide, we'll build a RAG system from scratch using Claude, Voyage AI embeddings, and an in-memory vector store. We'll then go beyond "vibes-based" evaluation and show you how to measure and improve your pipeline with concrete metrics.
By the end, you'll understand how to move from a basic "naive RAG" setup to an advanced system using summary indexing and re-ranking — boosting end-to-end accuracy from 71% to 81%.
What You'll Need
Before we begin, set up your environment with these libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
You'll also need API keys from Anthropic and Voyage AI. Store them as environment variables:
import os
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"
Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG pipeline. This is often called "naive RAG" — it works, but it has clear limitations.
Step 1: Chunk Your Documents
We'll split documents by headings. Each chunk contains the content under a single subheading:
def chunk_by_headings(text):
    chunks = []
    current_heading = None
    current_content = []
    for line in text.split('\n'):
        if line.startswith('##') or line.startswith('###'):
            if current_heading:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content)
                })
            current_heading = line
            current_content = []
        else:
            current_content.append(line)
    if current_heading:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content)
        })
    return chunks
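To see the chunker in action, you might run it over one of your markdown files. The file path below is a placeholder; point it at your own documentation:

# Hypothetical path -- use your own markdown docs here
with open("docs/claude_documentation.md") as f:
    doc_text = f.read()

chunks = chunk_by_headings(doc_text)
print(f"Created {len(chunks)} chunks; first heading: {chunks[0]['heading']}")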
Step 2: Embed and Store
Use Voyage AI to generate embeddings for each chunk, then store them in an in-memory vector database:
import voyageai

vo = voyageai.Client()

# Generate embeddings for all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple dict (in production, use Pinecone, Weaviate, etc.)
vector_db = {}
for i, chunk in enumerate(chunks):
    vector_db[i] = {
        'text': chunk['content'],
        'embedding': embeddings[i]
    }
Step 3: Retrieve and Generate
When a user asks a question, embed the query, find the most similar chunks using cosine similarity, and pass them to Claude:
import numpy as np
from anthropic import Anthropic

client = Anthropic()

def retrieve(query, top_k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for idx, doc in vector_db.items():
        # Cosine similarity between the query and each stored chunk
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        scores.append((idx, similarity))
    scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [idx for idx, _ in scores[:top_k]]
    return [vector_db[idx]['text'] for idx in top_indices]

def answer_question(query):
    chunks = retrieve(query)
    context = "\n\n---\n\n".join(chunks)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based only on the provided context.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text
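At this point you can already ask questions end to end. A quick smoke test might look like this; the question is illustrative, so substitute one your knowledge base actually covers:

# Illustrative question -- replace with one your documents can answer
print(answer_question("What models are available through the Anthropic API?"))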
This basic pipeline works, but it has a clear weakness: it matches the query embedding directly against raw chunk text. If a relevant chunk phrases the information very differently from the query, or buries it among unrelated details, its embedding can land far from the query's and the chunk will be missed.
Building a Robust Evaluation System
To improve your RAG system, you need to measure it. We'll evaluate two things separately:
- Retrieval quality — How well does the system find relevant chunks?
- End-to-end accuracy — Does Claude produce the correct final answer?
Creating an Evaluation Dataset
We synthetically generated 100 test samples. Each sample contains:
- A question
- A list of "golden" chunk IDs that contain the answer
- A correct answer string
import json

with open("evaluation/docs_evaluation_dataset.json") as f:
    eval_data = json.load(f)

# Preview the first sample
print(eval_data[0])
{
    "question": "What is the max token limit for Claude 3 Opus?",
    "relevant_chunks": [12, 45],
    "correct_answer": "200,000 tokens"
}
Key Retrieval Metrics
We'll track four retrieval metrics:
| Metric | What It Measures | Formula |
|---|---|---|
| Precision | Of the chunks we retrieved, how many were relevant? | TP / (TP + FP) |
| Recall | Of all relevant chunks, how many did we retrieve? | TP / (TP + FN) |
| F1 Score | Harmonic mean of precision and recall | 2 × P × R / (P + R) |
| MRR | How high did the first relevant chunk rank? | 1 / rank_first_relevant |
def calculate_metrics(retrieved_chunks, relevant_chunks):
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    true_positives = len(retrieved_set & relevant_set)

    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    # MRR: reciprocal rank of the first relevant chunk (0 if none retrieved)
    for rank, chunk in enumerate(retrieved_chunks, 1):
        if chunk in relevant_set:
            mrr = 1 / rank
            break
    else:
        mrr = 0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mrr": mrr
    }
End-to-End Accuracy
For the final answer, we use Claude itself to judge correctness:
def evaluate_answer(question, generated_answer, correct_answer):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        system="You are an evaluator. Respond with exactly 'CORRECT' or 'INCORRECT'.",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nCorrect answer: {correct_answer}\nGenerated answer: {generated_answer}\n\nIs the generated answer correct?"
        }]
    )
    return response.content[0].text.strip() == "CORRECT"
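With both pieces in place, you can score the whole pipeline over the evaluation set. The loop below is a minimal sketch; it assumes a retrieve_ids helper, a small variation of retrieve that returns chunk indices instead of text so they can be compared against the golden chunk IDs:

def retrieve_ids(query, top_k=3):
    # Same as retrieve(), but returns chunk indices for comparison
    # against the golden chunk IDs in the evaluation set.
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for idx, doc in vector_db.items():
        similarity = np.dot(query_embedding, doc['embedding']) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(doc['embedding'])
        )
        scores.append((idx, similarity))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [idx for idx, _ in scores[:top_k]]

retrieval_metrics = []
correct = 0
for sample in eval_data:
    retrieved_ids = retrieve_ids(sample["question"])
    retrieval_metrics.append(calculate_metrics(retrieved_ids, sample["relevant_chunks"]))
    generated = answer_question(sample["question"])
    if evaluate_answer(sample["question"], generated, sample["correct_answer"]):
        correct += 1

averages = {key: sum(m[key] for m in retrieval_metrics) / len(retrieval_metrics)
            for key in ("precision", "recall", "f1", "mrr")}
print(averages)
print(f"End-to-end accuracy: {correct / len(eval_data):.0%}")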
Level 2: Summary Indexing
The first improvement is summary indexing. Instead of only storing raw chunks, we also generate and store a one-sentence summary of each chunk. During retrieval, we compare the query against the summaries first, then fetch the full chunks.
def generate_summary(chunk_text):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system="Summarize the following text in one sentence.",
        messages=[{"role": "user", "content": chunk_text}]
    )
    return response.content[0].text

# During indexing, attach a summary to every chunk
for chunk in chunks:
    chunk['summary'] = generate_summary(chunk['content'])

# During retrieval, embed the summaries instead of the full text
summary_embeddings = vo.embed([c['summary'] for c in chunks], model="voyage-2").embeddings
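Retrieval then works much like before, except the query is compared against the summary embeddings while the full chunk text is what gets returned to Claude. A minimal sketch, assuming summary_embeddings is ordered the same way as chunks:

def retrieve_by_summary(query, top_k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for i, emb in enumerate(summary_embeddings):
        # Match the query against the chunk's summary...
        similarity = np.dot(query_embedding, emb) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(emb)
        )
        scores.append((i, similarity))
    scores.sort(key=lambda x: x[1], reverse=True)
    # ...but return the full chunk text for answer generation
    return [chunks[i]['content'] for i, _ in scores[:top_k]]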
This simple change improved our recall from 0.66 to 0.69 — we were now finding relevant chunks even when the query used different wording.
Level 3: Summary Indexing + Re-Ranking
The final optimization is re-ranking. After retrieving the top 10 chunks by summary similarity, we use Claude to score each chunk's relevance to the query, then keep only the top 3:
def rerank(query, chunks, top_k=3):
    scored_chunks = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            system="Rate relevance from 0 to 10. Only output the number.",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\nChunk: {chunk[:500]}\n\nRelevance score:"
            }]
        )
        score = float(response.content[0].text.strip())
        scored_chunks.append((chunk, score))
    scored_chunks.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored_chunks[:top_k]]
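Putting the pieces together, the Level 3 pipeline casts a wide net by summary similarity and then lets the re-ranker narrow it down. A sketch, reusing the retrieve_by_summary helper from the previous section:

def answer_question_v3(query):
    candidates = retrieve_by_summary(query, top_k=10)  # wide net by summary similarity
    best_chunks = rerank(query, candidates, top_k=3)   # keep only the most relevant
    context = "\n\n---\n\n".join(best_chunks)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based only on the provided context.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text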
Re-ranking dramatically improved our Mean Reciprocal Rank (MRR) from 0.74 to 0.87 — the first relevant chunk was now almost always at position 1.
Results Summary
Here's how the metrics improved across our three levels:
| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Precision | 0.43 | 0.44 | 0.44 |
| Recall | 0.66 | 0.69 | 0.69 |
| F1 Score | 0.52 | 0.54 | 0.54 |
| MRR | 0.74 | 0.78 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
Key Takeaways
- Evaluate retrieval and generation separately. A perfect retrieval system is useless if Claude can't synthesize the answer, and a perfect generator is useless if it never sees the right context. Measure both.
- Summary indexing boosts recall. By matching queries against concise summaries rather than raw text, you capture semantically related chunks that direct matching against the full chunk text would miss.
- Re-ranking dramatically improves MRR. Using Claude to score relevance after initial retrieval ensures the most useful chunks appear first, which improves final answer quality.
- Start simple, then optimize. Begin with basic RAG, establish your baseline metrics, then add complexity only where you see clear gaps.
- Use synthetic evaluation datasets. Generate 50-200 question-answer pairs from your own documents. This gives you a reliable benchmark without manual labeling; a generation sketch follows below.
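One way to bootstrap such a dataset is to have Claude write a question-answer pair for each chunk. The prompt and JSON shape below are illustrative assumptions, not the exact recipe used for the dataset in this guide:

import json

def generate_qa_pair(chunk_id, chunk_text):
    # Ask Claude to write one question this chunk can answer, plus the answer
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        system=('Write one question answerable from the given text and its answer. '
                'Respond only with JSON: {"question": "...", "correct_answer": "..."}'),
        messages=[{"role": "user", "content": chunk_text}]
    )
    pair = json.loads(response.content[0].text)
    pair["relevant_chunks"] = [chunk_id]
    return pair

# Generate pairs for (a sample of) your chunks, then review them by hand
eval_data = [generate_qa_pair(i, c['content']) for i, c in enumerate(chunks[:100])]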