
Building Production-Grade RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement chunking, embedding, retrieval, and evaluation using metrics like Precision, Recall, F1, and MRR. Advanced techniques like summary indexing and re-ranking boost end-to-end accuracy from 71% to 81%.

Tags: RAG, Claude, Evaluation, Embeddings, Vector Search


Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.

In this guide, we'll walk through building a complete RAG system using Claude, Voyage AI embeddings, and a robust evaluation framework. You'll learn not just how to build it, but how to measure and improve it systematically.

What You'll Learn

  • How to set up a basic RAG pipeline with Claude and Voyage AI
  • How to build a proper evaluation suite with meaningful metrics
  • Advanced techniques: summary indexing and re-ranking with Claude
  • How to achieve measurable improvements in retrieval and end-to-end accuracy

Prerequisites

You'll need:

  • An Anthropic API key
  • A Voyage AI API key (for embeddings)
  • A working Python 3 environment

Install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Level 1: Basic RAG (Naive RAG)

Let's start with the simplest possible RAG implementation. This is often called "Naive RAG" in the industry, and it follows three steps:

  • Chunk your documents by heading
  • Embed each chunk using Voyage AI
  • Retrieve the most relevant chunks using cosine similarity

Step 1: Initialize Your Vector Database

For this example, we'll use an in-memory vector store. In production, you'd likely use Pinecone, Weaviate, or another hosted solution.

import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[str], embeddings: List[List[float]]):
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: List[float], k: int = 3) -> List[Dict[str, Any]]:
        scores = [
            self._cosine_similarity(query_embedding, emb)
            for emb in self.embeddings
        ]
        top_indices = np.argsort(scores)[-k:][::-1]
        return [
            {"document": self.documents[i], "score": scores[i]}
            for i in top_indices
        ]

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Step 2: Chunk and Embed Documents

Chunking by heading is a simple but effective strategy. Each chunk contains the content under a single subheading.

import voyageai

vo = voyageai.Client(api_key="your-voyage-api-key")

def chunk_by_heading(text: str) -> List[str]:
    """Split text by markdown headings."""
    chunks = []
    current_chunk = []
    for line in text.split('\n'):
        if line.startswith('##') or line.startswith('###'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

def embed_documents(chunks: List[str]) -> List[List[float]]:
    """Embed chunks using Voyage AI."""
    result = vo.embed(chunks, model="voyage-2")
    return result.embeddings

Step 3: Retrieve and Answer

from anthropic import Anthropic

client = Anthropic(api_key="your-anthropic-api-key")

def answer_with_rag(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Retrieve relevant chunks
    results = vector_db.search(query_embedding, k=k)
    context = "\n\n".join([r["document"] for r in results])

    # Generate answer with Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
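
Wiring the pieces together might look like this (the document path and query string are placeholder assumptions, not part of a real project):

# Build the index once, then query it (paths and strings are illustrative)
doc_text = open("docs/guide.md").read()
chunks = chunk_by_heading(doc_text)

db = InMemoryVectorDB()
db.add_documents(chunks, embed_documents(chunks))

print(answer_with_rag("How do I configure rate limiting?", db))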

Building an Evaluation System

"Vibes-based" evaluation won't cut it for production. You need quantitative metrics. Here's how to build a proper evaluation suite.

Create a Synthetic Evaluation Dataset

Generate 100+ question-answer pairs with known relevant chunks. This is your ground truth.

[
  {
    "question": "How do I set up rate limiting in Claude?",
    "relevant_chunks": ["chunk_42", "chunk_43"],
    "correct_answer": "To set up rate limiting..."
  },
  ...
]
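
One way to produce these pairs is to have Claude write a question and answer for each chunk. The prompt wording and ID scheme below are illustrative assumptions, and the output should be spot-checked by hand:

import json

def generate_eval_pair(chunk_id: str, chunk: str) -> dict:
    # Ask Claude for a question/answer grounded in a single chunk (prompt is a sketch)
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Write one question that this text answers, then the answer.\n"
                       f"Respond only with JSON containing the keys \"question\" and \"answer\".\n\n{chunk}"
        }]
    )
    # Assumes the model returns raw JSON; add error handling for production use
    pair = json.loads(response.content[0].text)
    return {
        "question": pair["question"],
        "relevant_chunks": [chunk_id],
        "correct_answer": pair["answer"],
    }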

Key Metrics

#### Retrieval Metrics

Precision: Of the chunks we retrieved, how many were relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall: Of all relevant chunks, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score: Harmonic mean of precision and recall.

Mean Reciprocal Rank (MRR): How high the first relevant result appears; the average of 1/rank of the first relevant chunk across queries.
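
Written in the same notation as above:

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

where $\text{rank}_i$ is the position of the first relevant chunk for query $i$ and $|Q|$ is the number of evaluation queries.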

#### End-to-End Metric

Accuracy: Does Claude's final answer match the ground truth?
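
Free-form answers rarely match the ground truth verbatim, so one common approach (a sketch, not something prescribed by this guide) is to let Claude grade each answer against the reference:

def grade_answer(question: str, model_answer: str, correct_answer: str) -> bool:
    # Ask Claude for a binary judgment; the prompt wording is an assumption
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nReference answer: {correct_answer}\n\n"
                       f"Candidate answer: {model_answer}\n\n"
                       "Does the candidate answer convey the same information as the reference? "
                       "Reply with only YES or NO."
        }]
    )
    return response.content[0].text.strip().upper().startswith("YES")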

Implementing the Evaluation

def evaluate_retrieval(questions, ground_truth, vector_db, k=3):
    precisions, recalls, f1s, mrrs = [], [], [], []
    
    for q, gt in zip(questions, ground_truth):
        query_emb = vo.embed([q], model="voyage-2").embeddings[0]
        results = vector_db.search(query_emb, k=k)
        # Assumes each search result carries the chunk's ID; store IDs alongside documents in your vector DB
        retrieved_ids = [r["id"] for r in results]
        
        tp = len(set(retrieved_ids) & set(gt["relevant_chunks"]))
        
        precision = tp / k
        recall = tp / len(gt["relevant_chunks"]) if gt["relevant_chunks"] else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: reciprocal rank of first relevant result
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in gt["relevant_chunks"]:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
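
Running it against the synthetic dataset might look like this (the file name follows the JSON example above and is an assumption; vector_db is the index you built earlier):

import json

with open("eval_dataset.json") as f:
    eval_set = json.load(f)

questions = [item["question"] for item in eval_set]
metrics = evaluate_retrieval(questions, eval_set, vector_db, k=3)
print(metrics)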

Level 2: Summary Indexing

Basic RAG misses context when a question spans multiple chunks. Summary indexing addresses this by having Claude summarize each chunk and embedding the summary together with the original text, so each vector carries broader context.

def create_summary(chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this text in 2-3 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text

def embed_with_summary(chunks: List[str]) -> List[List[float]]:
    summaries = [create_summary(c) for c in chunks]
    combined = [f"{s}\n\n{c}" for s, c in zip(summaries, chunks)]
    return vo.embed(combined, model="voyage-2").embeddings

Level 3: Summary Indexing + Re-Ranking

Re-ranking with Claude dramatically improves MRR. After initial retrieval, Claude scores each chunk for relevance to the query.

def rerank_with_claude(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""
    Query: {query}
    
    For each chunk below, rate its relevance to the query on a scale of 1-5.
    Return only the scores as a comma-separated list.
    
    Chunks:
    {chr(10).join([f'{i+1}. {c[:200]}...' for i, c in enumerate(chunks)])}
    """
    
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    
    scores = [int(s.strip()) for s in response.content[0].text.split(",")]
    scored_chunks = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, s in scored_chunks[:top_k]]
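
One minimal way to wire this into the answering flow (a sketch that assumes the functions defined earlier in this guide) is to over-retrieve with embeddings and then let Claude narrow the set:

def answer_with_rerank(query: str, vector_db: InMemoryVectorDB, k_initial: int = 10, k_final: int = 3) -> str:
    # Over-retrieve with embeddings, then let Claude pick the best chunks
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    candidates = [r["document"] for r in vector_db.search(query_embedding, k=k_initial)]
    top_chunks = rerank_with_claude(query, candidates, top_k=k_final)
    context = "\n\n".join(top_chunks)

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text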

Results: Measurable Improvements

After implementing summary indexing and re-ranking, here are the improvements over basic RAG:

| Metric | Basic RAG | Advanced RAG |
| --- | --- | --- |
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |

The biggest win? MRR jumped from 0.74 to 0.87, meaning the first retrieved chunk is almost always relevant. This directly impacts end-to-end accuracy, which rose from 71% to 81%.

Production Considerations

  • Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ accounts or run partial evals.
  • Vector Database: In-memory works for prototyping. Use Pinecone, Weaviate, or pgvector for production.
  • Chunking Strategy: Experiment with overlap, semantic chunking, or recursive splitting (a simple overlap variant is sketched after this list).
  • Embedding Model: Voyage AI's voyage-2 is excellent, but test alternatives like text-embedding-3-small.
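
As a minimal sketch of the overlap idea (the chunk size and overlap values are arbitrary starting points, not recommendations from this guide):

from typing import List

def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    # Slide a fixed-size window over the text so neighboring chunks share context
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks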

Key Takeaways

  • Evaluate retrieval and generation separately to pinpoint bottlenecks in your RAG pipeline
  • Summary indexing improves recall by enriching chunk embeddings with broader context
  • Re-ranking with Claude dramatically boosts MRR, ensuring the most relevant chunk appears first
  • Start simple, measure everything — basic RAG gives a strong baseline, and targeted improvements yield measurable gains
  • End-to-end accuracy improved by 10 percentage points (71% → 81%) through these techniques, proving the value of systematic optimization