# Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for deploying Claude in enterprise contexts. While Claude excels at general knowledge tasks, it needs access to your specific business data—internal documentation, customer support articles, or proprietary research—to answer domain-specific questions accurately.
In this guide, you'll learn how to build, evaluate, and optimize a RAG system using Claude and Voyage AI embeddings. We'll start with a basic implementation and progressively add advanced techniques that measurably improve performance.
## What You'll Build
By the end of this guide, you'll have:
- A working RAG pipeline using Claude and an in-memory vector database
- A robust evaluation suite that measures retrieval and end-to-end performance independently
- Advanced techniques including summary indexing and re-ranking
- Concrete metrics showing end-to-end accuracy improving from 71% to 81%
## Prerequisites
Before diving in, make sure you have:
- API keys from Anthropic and Voyage AI
- Python 3.8+ installed
- Basic familiarity with Python and the Claude API
## Setting Up Your Environment
First, install the required libraries:

```bash
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
Next, initialize your API clients:

```python
import anthropic
import voyageai

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
```
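In practice, avoid hardcoding keys in source. Both SDKs fall back to environment variables when no key is passed (this assumes `ANTHROPIC_API_KEY` and `VOYAGE_API_KEY` are set in your shell):

```python
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
vo = voyageai.Client()          # reads VOYAGE_API_KEY from the environment
```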
## Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG implementation. This "naive" approach has three steps:
- Chunk documents by heading (each subheading becomes a chunk)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
### Creating a Vector Database
For this example, we'll use an in-memory vector database. In production, consider hosted solutions like Pinecone or Weaviate.
```python
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add_document(self, embedding: List[float], metadata: Dict):
        self.vectors.append(embedding)
        self.metadata.append(metadata)

    def search(self, query_embedding: List[float], top_k: int = 3) -> List[Dict]:
        # Cosine similarity between the query and every stored vector
        similarities = [
            np.dot(query_embedding, vec) / (np.linalg.norm(query_embedding) * np.linalg.norm(vec))
            for vec in self.vectors
        ]
        # Indices of the top_k most similar vectors, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.metadata[i] for i in top_indices]
```
### Chunking and Embedding

```python
import re
from typing import List

def chunk_document(text: str) -> List[str]:
    """Split a document by headings (## or ###)."""
    chunks = re.split(r'(?=^#{2,3}\s)', text, flags=re.MULTILINE)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

def embed_chunks(chunks: List[str]) -> List[List[float]]:
    """Embed chunks using Voyage AI."""
    response = vo.embed(chunks, model="voyage-2")
    return response.embeddings
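```

To finish the Level 1 indexing step, wire these together before querying. A minimal sketch, assuming `docs` is your list of raw markdown documents:

```python
vector_db = InMemoryVectorDB()

for doc in docs:  # docs: list of raw markdown strings (assumed)
    chunks = chunk_document(doc)
    for chunk, embedding in zip(chunks, embed_chunks(chunks)):
        vector_db.add_document(embedding, {"text": chunk})
```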
### Building the RAG Pipeline

```python
def rag_query(query: str, vector_db: InMemoryVectorDB, top_k: int = 3) -> str:
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Retrieve relevant chunks
    retrieved_chunks = vector_db.search(query_embedding, top_k=top_k)
    context = "\n\n".join([chunk["text"] for chunk in retrieved_chunks])

    # Generate answer with Claude
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question:\n\nContext: {context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
## Building an Evaluation System
"Vibes-based" evaluation won't cut it for production systems. You need objective metrics. We'll evaluate two independent components:
- Retrieval performance – How well does the system find relevant chunks?
- End-to-end performance – How accurate are the final answers?
### Creating an Evaluation Dataset
Generate a synthetic dataset with 100 samples, each containing:
- A question
- Relevant chunks (ground truth)
- A correct answer
```python
import json

# Load the pre-generated evaluation dataset
with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first sample
print(json.dumps(eval_data[0], indent=2))
```
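Each sample looks roughly like this (the field names below are illustrative assumptions; substitute whatever your dataset actually uses):

```python
sample = {
    "question": "How do I rotate an API key?",
    "correct_chunks": ["## API keys\nTo rotate a key, ..."],  # ground-truth relevant chunks
    "correct_answer": "Open Settings > API Keys and click Rotate.",
}
```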
### Retrieval Metrics
#### Precision
Precision measures how many retrieved chunks are actually relevant:
Precision = True Positives / Total Retrieved
High precision means fewer irrelevant chunks reach Claude. Note that because we always retrieve at least 3 chunks, precision is capped whenever fewer than 3 of the available chunks are truly relevant.
#### Recall
Recall measures how many relevant chunks we successfully retrieved:
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
#### F1 Score
The harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
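All three are easy to compute per query. A minimal sketch, treating chunks as comparable strings:

```python
from typing import Dict, List

def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    # Harmonic mean; 0 when both precision and recall are 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```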
#### Mean Reciprocal Rank (MRR)
MRR measures how early the first relevant chunk appears in the results:
```python
from typing import List

def calculate_mrr(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    # Reciprocal rank of the first relevant chunk; 0 if none was retrieved
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
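For example, if the first relevant chunk shows up at rank 2, the query contributes a reciprocal rank of 0.5; MRR is the mean of this value across all queries:

```python
calculate_mrr(["chunk_b", "chunk_a", "chunk_c"], relevant_chunks=["chunk_a"])  # -> 0.5
```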
### End-to-End Accuracy
This measures whether Claude's final answer is correct. Use LLM-as-judge or human evaluation:
```python
def evaluate_answer(question: str, answer: str, correct_answer: str) -> bool:
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=5,  # enough room for a one-word verdict
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nCorrect Answer: {correct_answer}\nModel Answer: {answer}\n\nIs the model answer correct? Answer only 'yes' or 'no'."
        }]
    )
    # Normalize the verdict before checking it
    return response.content[0].text.strip().lower().startswith("yes")
```
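Putting the pieces together, here's a sketch of the accuracy loop over the full dataset (it assumes the illustrative field names shown earlier):

```python
def run_evaluation(eval_data, vector_db: InMemoryVectorDB) -> float:
    correct = 0
    for sample in eval_data:
        answer = rag_query(sample["question"], vector_db)
        if evaluate_answer(sample["question"], answer, sample["correct_answer"]):
            correct += 1
    return correct / len(eval_data)

print(f"End-to-end accuracy: {run_evaluation(eval_data, vector_db):.0%}")
```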
## Level 2: Summary Indexing
Basic RAG often misses context that spans multiple chunks. Summary indexing solves this by creating condensed representations of document sections.
### How It Works
- For each document section, generate a summary using Claude
- Index both the original chunk and its summary
- Retrieve using the summary for better semantic matching
```python
def generate_summary(chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize the following text in 2-3 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text

def index_with_summaries(chunks: List[str], vector_db: InMemoryVectorDB):
    for chunk in chunks:
        summary = generate_summary(chunk)
        # Embed the summary together with the original chunk
        combined_text = f"{summary}\n\n{chunk}"
        embedding = vo.embed([combined_text], model="voyage-2").embeddings[0]
        vector_db.add_document(embedding, {"text": chunk, "summary": summary})
```
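A variant worth testing: embed the summary alone for matching, while still returning the full chunk as context. A minimal sketch:

```python
def index_summary_only(chunks: List[str], vector_db: InMemoryVectorDB):
    for chunk in chunks:
        summary = generate_summary(chunk)
        # Match on the summary's embedding, but keep the full chunk as the payload
        embedding = vo.embed([summary], model="voyage-2").embeddings[0]
        vector_db.add_document(embedding, {"text": chunk, "summary": summary})
```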
## Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second scoring pass that dramatically improves MRR: after initial retrieval, use Claude to score and reorder the candidates.
### Implementing Re-Ranking

```python
def rerank_with_claude(query: str, candidates: List[Dict]) -> List[Dict]:
    # Present each candidate with its index so Claude can refer to it
    candidate_text = "\n".join([
        f"[{i}] {c['text'][:200]}" for i, c in enumerate(candidates)
    ])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Query: {query}\n\nCandidates:\n{candidate_text}\n\nRank these candidates by relevance. Return the indices in order of relevance, most relevant first."
        }]
    )
    # Parse the ranked indices, skipping duplicates and out-of-range values
    ranked_indices = []
    for token in response.content[0].text.replace(",", " ").split():
        token = token.strip("[].")
        if token.isdigit() and int(token) < len(candidates) and int(token) not in ranked_indices:
            ranked_indices.append(int(token))
    return [candidates[i] for i in ranked_indices]
```
```python
def advanced_rag_query(query: str, vector_db: InMemoryVectorDB) -> str:
    # Initial retrieval
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    initial_results = vector_db.search(query_embedding, top_k=10)

    # Re-rank
    reranked_results = rerank_with_claude(query, initial_results)
    top_results = reranked_results[:3]

    # Generate answer
    context = "\n\n".join([chunk["text"] for chunk in top_results])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
## Results: Measurable Improvements
After implementing these techniques, here are the performance gains:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
## Production Considerations
- Rate limits: Full evaluations may hit rate limits unless you're at Tier 2 or above. Consider sampling your evaluation dataset.
- Vector database: For production, use a hosted solution like Pinecone, Weaviate, or Chroma.
- Chunking strategy: Experiment with different chunk sizes and overlap strategies; a sliding-window sketch follows this list.
- Embedding model: Voyage AI's `voyage-2` works well, but test with your specific domain.
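As one example of an overlap strategy, here's a minimal sliding-window chunker; the 800-character window and 200-character overlap are illustrative starting points, not tuned values:

```python
from typing import List

def sliding_window_chunks(text: str, window: int = 800, overlap: int = 200) -> List[str]:
    step = window - overlap
    # Each chunk shares `overlap` characters with its predecessor
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```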
## Key Takeaways
- Evaluate retrieval and generation separately – This lets you pinpoint where improvements are needed, whether in finding relevant chunks or in answer quality.
- Summary indexing improves semantic matching – By indexing both summaries and original chunks, you capture context that naive chunking misses.
- Re-ranking dramatically improves MRR – Adding a Claude-powered re-ranking step after initial retrieval ensures the most relevant chunks appear first, boosting MRR from 0.74 to 0.87.
- End-to-end accuracy gains are real – Advanced RAG techniques improved accuracy from 71% to 81% in our tests, a meaningful improvement for production systems.
- Start simple, then iterate – Begin with basic RAG, establish your evaluation metrics, then layer on advanced techniques. Measure each change to confirm improvement.