Building a Production-Ready RAG System with Claude: From Basic to Advanced
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.
In this guide, we'll walk through building a RAG system using the Claude documentation as our knowledge base. We'll start with a basic implementation, then show you how to evaluate it properly, and finally apply advanced techniques that boost end-to-end accuracy from 71% to 81%.
What You'll Learn
- How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
- How to build a robust evaluation suite with 5 key metrics
- How to implement summary indexing for better retrieval
- How to use Claude as a re-ranker to improve result quality
Prerequisites
You'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, and `scikit-learn` installed
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG." This is the simplest approach, but it's a solid foundation.
Step 1: Chunk Your Documents
We split documents by headings, keeping content from each subheading together. This creates natural, semantically coherent chunks.
```python
def chunk_by_headings(text):
    """Split document text by markdown headings."""
    chunks = []
    current_heading = "Introduction"
    current_content = []
    for line in text.split('\n'):
        if line.startswith('##') or line.startswith('###'):
            if current_content:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content).strip()
                })
            current_heading = line.strip('# ')
            current_content = []
        else:
            current_content.append(line)
    # Don't forget the last chunk
    if current_content:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content).strip()
        })
    return chunks
```
Step 2: Embed Each Chunk
We use Voyage AI's embedding model to convert each chunk into a vector representation.
```python
import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

def embed_chunks(chunks):
    """Generate embeddings for all chunks."""
    texts = [chunk['content'] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    for i, chunk in enumerate(chunks):
        chunk['embedding'] = embeddings[i]
    return chunks
```
Step 3: Retrieve and Answer
When a user asks a question, we embed their query, find the most similar chunks using cosine similarity, and pass them to Claude.
```python
import anthropic
import numpy as np
from typing import List, Dict

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query: str, chunks: List[Dict], top_k: int = 3):
    """Retrieve top-k most relevant chunks."""
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for chunk in chunks:
        score = cosine_similarity(query_embedding, chunk['embedding'])
        scores.append(score)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]

def answer_with_claude(query: str, context_chunks: List[Dict]):
    """Generate an answer using Claude with retrieved context."""
    context = "\n\n".join(c['content'] for c in context_chunks)
    prompt = f"""Answer the question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Building a Robust Evaluation System
Most RAG systems fail because they're evaluated on "vibes" rather than metrics. Let's fix that.
The Evaluation Dataset
We synthetically generated 100 test samples. Each sample contains:
- A question
- Relevant chunks (the ground truth for retrieval)
- A correct answer (the ground truth for end-to-end)
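Concretely, a single sample can be represented as a plain dictionary. The field names below are illustrative, not a fixed schema from this guide:

```python
# One illustrative evaluation sample. Ground-truth relevant chunks are
# identified by their headings, matching the chunker's output above.
sample = {
    "question": "How do I set the maximum response length?",
    "relevant_chunks": ["Step 3: Retrieve and Answer"],  # ground truth for retrieval
    "correct_answer": "Pass max_tokens when calling client.messages.create.",
}
```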
The Five Key Metrics
#### 1. Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means you're not wasting Claude's context window with irrelevant information.
#### 2. Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
#### 3. F1 Score
The harmonic mean of precision and recall. A balanced measure of retrieval quality.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
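All three metrics fall out of a simple set comparison between retrieved and ground-truth chunk headings. A minimal sketch:

```python
def retrieval_metrics(retrieved_headings, relevant_headings):
    """Compute precision, recall, and F1 over chunk headings."""
    retrieved = set(retrieved_headings)
    relevant = set(relevant_headings)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: 3 chunks retrieved, 2 relevant, 1 relevant chunk missed
p, r, f1 = retrieval_metrics(["A", "B", "C"], ["A", "B", "D"])
# p = r = f1 = 2/3
```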
#### 4. Mean Reciprocal Rank (MRR)
MRR measures how early the first relevant result appears. If the first relevant chunk is at position 1, the reciprocal rank is 1.0. At position 3, it's 0.33.
```python
def reciprocal_rank(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk['heading'] in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
#### 5. End-to-End Accuracy
This measures whether Claude's final answer is correct. You can use LLM-as-judge or manual evaluation.
```python
def evaluate_answer(question, generated_answer, correct_answer):
    prompt = f"""Determine if the generated answer correctly answers the question.

Question: {question}
Correct Answer: {correct_answer}
Generated Answer: {generated_answer}

Is the generated answer correct? Answer only 'yes' or 'no'."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().lower() == 'yes'
```
Level 2: Summary Indexing
Basic RAG retrieves raw chunks, but sometimes the most relevant chunk doesn't contain the exact keywords. Summary indexing solves this by creating a condensed version of each chunk and using that for retrieval.
```python
def create_summary(chunk_content):
    """Use Claude to summarize a chunk."""
    prompt = f"""Summarize the following text in 2-3 sentences, capturing the key information:

{chunk_content}

Summary:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Then, instead of embedding the raw chunk content, you embed the summary. This often improves recall because summaries are more semantically aligned with questions.
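Putting it together, a summary index stores the summary embedding for retrieval while keeping the raw content for the answer step. A sketch with the summarizer and embedder injected as parameters so the indexing logic stands alone; in practice they would wrap `create_summary` and `vo.embed`:

```python
def build_summary_index(chunks, summarize, embed):
    """Embed summaries for retrieval; keep raw content for generation.

    `summarize` and `embed` are injected callables; in a real pipeline
    they would wrap create_summary and the Voyage client from earlier.
    """
    summaries = [summarize(chunk["content"]) for chunk in chunks]
    vectors = embed(summaries)
    for chunk, summary, vector in zip(chunks, summaries, vectors):
        chunk["summary"] = summary
        chunk["embedding"] = vector  # retrieval matches against the summary
    return chunks

# Toy stand-ins for the real summarizer and embedder
chunks = [{"heading": "Rate Limits", "content": "Long text about rate limits..."}]
index = build_summary_index(
    chunks,
    summarize=lambda text: text[:20],
    embed=lambda texts: [[float(len(t))] for t in texts],
)
```

Because `content` is left untouched, the answer step still passes the full chunk text to Claude; only the retrieval side sees summaries.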
Level 3: Summary Indexing + Re-Ranking
This is where things get interesting. We combine summary indexing with a re-ranking step using Claude.
How Re-Ranking Works
1. Retrieve a larger set of candidates (e.g., the top 10 chunks)
2. Use Claude to score each chunk's relevance to the query
3. Keep only the top 3-5 most relevant chunks
```python
def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3):
    """Use Claude to re-rank retrieved chunks."""
    scored_chunks = []
    for chunk in candidates:
        prompt = f"""On a scale of 1-10, how relevant is the following text to answering the question?

Question: {query}

Text: {chunk['content']}

Relevance score (just the number):"""
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}]
        )
        try:
            score = int(response.content[0].text.strip())
        except ValueError:
            score = 5  # Default if parsing fails
        scored_chunks.append((score, chunk))
    # Sort by score descending and return the top_k chunks
    scored_chunks.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scored_chunks[:top_k]]
```
The Results
Here's what we achieved by combining summary indexing and re-ranking:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
Vector Database
For production, replace the in-memory store with a hosted vector database like Pinecone, Weaviate, or pgvector.
Rate Limits
The full evaluation suite can hit rate limits. If you're not on Tier 2+, consider running smaller subsets.
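A generic retry wrapper with exponential backoff keeps the evaluation loop resilient to 429s. This is a sketch, not part of any SDK; the `is_rate_limit` predicate is a hypothetical hook you would point at your client library's rate-limit exception:

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0, is_rate_limit=None):
    """Retry `call` with exponential backoff; re-raise non-rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            # Without a predicate we retry everything, which is too broad
            # for production but fine for an evaluation script.
            if is_rate_limit is not None and not is_rate_limit(exc):
                raise
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

With the Anthropic Python SDK, the predicate might check for its rate-limit exception, e.g. `is_rate_limit=lambda e: isinstance(e, anthropic.RateLimitError)`.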
Cost Optimization
- Use Claude Haiku for re-ranking (cheaper than Sonnet)
- Cache embeddings for static documents
- Batch API calls where possible
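For static documents, a small cache keyed by content hash ensures each chunk is embedded exactly once across runs. A minimal in-memory sketch (a production version would persist `store` to disk or a database):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        # e.g. embed_fn = lambda texts: vo.embed(texts, model="voyage-2").embeddings
        self.embed_fn = embed_fn
        self.store = {}

    def embed(self, texts):
        # Only send texts we haven't seen before to the embedding API
        missing = [t for t in texts if self._key(t) not in self.store]
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.store[self._key(text)] = vector
        return [self.store[self._key(t)] for t in texts]

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
```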
Key Takeaways
- Start simple, then optimize: A basic RAG pipeline works surprisingly well. Add complexity only when you have metrics to justify it.
- Evaluate retrieval and generation separately: Your RAG system is only as good as its weakest link. Measure both components independently.
- Summary indexing improves recall: By embedding summaries instead of raw chunks, you capture semantic relevance that keyword matching misses.
- Re-ranking with Claude boosts MRR significantly: Using Claude to score relevance after initial retrieval ensures the most relevant chunks appear first.
- End-to-end accuracy is the ultimate metric: All retrieval metrics are proxies. Always measure whether your system actually answers questions correctly.