
Building a Production-Grade RAG System with Claude: From Basic to Advanced

Learn how to build and optimize a Retrieval Augmented Generation (RAG) system with Claude, including evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.

Tags: RAG, Claude, Retrieval Augmented Generation, Evaluation, Voyage AI


Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG enables Claude to answer domain-specific questions with accuracy and context.

In this guide, we'll walk through building and optimizing a RAG system using Claude and Voyage AI embeddings. We'll start with a basic pipeline, then layer in advanced techniques that measurably improve performance.

What You'll Learn

  • How to set up a basic RAG pipeline with Claude
  • How to build a robust evaluation suite (beyond "vibes")
  • How to implement summary indexing for better retrieval
  • How to use Claude as a re-ranker for improved precision
  • How to measure and optimize key metrics: Precision, Recall, F1, MRR, and End-to-End Accuracy

Prerequisites

To follow along, you'll need:

  • An Anthropic API key (for Claude)
  • A Voyage AI API key (for embeddings)
  • Python with the anthropic, voyageai, and numpy packages installed

Level 1: Basic RAG (Naive RAG)

Let's start with the simplest possible RAG implementation. This is often called "Naive RAG" in the industry, and it consists of three steps:

  • Chunk documents by heading (each chunk contains content from one subheading)
  • Embed each chunk using Voyage AI
  • Retrieve relevant chunks using cosine similarity

Setting Up the Vector Database

We'll use an in-memory vector database for simplicity. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.

import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        texts = [doc["text"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Dot-product scores; for normalized embeddings this is equivalent to cosine similarity
        similarities = [
            np.dot(query_embedding, doc_emb) for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [
            {**self.documents[i], "score": similarities[i]} for i in top_indices
        ]
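Populating and querying the index then looks like this. The documents and API key below are placeholders for illustration, not content from a real corpus:

docs = [
    {"id": "getting-started", "text": "Install the SDK with pip and set your API key as an environment variable."},
    {"id": "rate-limits", "text": "Requests are throttled per minute depending on your usage tier."},
]

vector_db = InMemoryVectorDB(api_key="your-voyage-key")
vector_db.add_documents(docs)
print(vector_db.search("How do I install the SDK?", k=1))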

The Basic RAG Query Function

Once your vector DB is populated, querying Claude with retrieved context is straightforward:

from anthropic import Anthropic

anthropic = Anthropic(api_key="your-anthropic-key")

def query_with_rag(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    # Step 1: Retrieve relevant chunks
    results = vector_db.search(query, k=k)
    context = "\n\n".join([r["text"] for r in results])

    # Step 2: Build prompt with context
    prompt = f"""Answer the question based on the following context.

Context: {context}

Question: {query}

Answer:"""

    # Step 3: Query Claude
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

This works, but how well does it actually perform? Let's find out.

Building an Evaluation System

To improve your RAG system, you need to measure it. We'll build an evaluation suite that separates retrieval performance from end-to-end answer quality.

Creating a Synthetic Evaluation Dataset

We'll generate 100 evaluation samples, each containing:

  • A question
  • Relevant chunks (ground truth for retrieval)
  • A correct answer (ground truth for end-to-end)

import json

# Load the evaluation dataset (pre-generated)
with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first sample
print(json.dumps(eval_data[0], indent=2))
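For reference, a single sample might look like the following. The "question" and "relevant_chunks" field names match what the evaluation code below expects; the answer field name and all values here are illustrative assumptions, not entries from the real dataset:

# Hypothetical sample; "question" and "relevant_chunks" match the evaluation code,
# "correct_answer" and the values are assumptions for illustration only
example_sample = {
    "question": "How do I authenticate API requests?",
    "relevant_chunks": ["authentication-overview", "api-keys"],
    "correct_answer": "Pass your API key in the request headers on every call."
}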

Defining Key Metrics

We'll track five metrics:

Precision

Precision answers: "Of the chunks we retrieved, how many were actually relevant?"

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

High precision means fewer false positives (irrelevant chunks).

Recall

Recall answers: "Of all the correct chunks that exist, how many did we retrieve?"

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

High recall means we're not missing important information.

F1 Score

The harmonic mean of precision and recall:

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Mean Reciprocal Rank (MRR)

MRR measures how early the first relevant chunk appears in your results. If the first relevant chunk is at position 1, the reciprocal rank is 1. If it's at position 3, it's 1/3.

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

End-to-End Accuracy

This measures whether Claude's final answer is correct, as judged by a human or an LLM evaluator.
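One common approach is to use Claude itself as the judge. Here is a minimal sketch; the prompt wording, the judge model choice, and the strict YES/NO parsing are assumptions, not the pipeline's actual grader.

def judge_answer(question, correct_answer, generated_answer, anthropic_client):
    # Ask Claude to compare the generated answer against the ground truth
    prompt = f"""Question: {question}

Ground-truth answer: {correct_answer}

Candidate answer: {generated_answer}

Does the candidate answer convey the same information as the ground-truth answer?
Reply with exactly one word: YES or NO."""
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().upper().startswith("YES")

End-to-end accuracy is then the fraction of evaluation samples for which the judge accepts Claude's answer.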

Implementing the Evaluation

def evaluate_retrieval(vector_db, eval_data, k=3):
    precisions = []
    recalls = []
    f1_scores = []
    reciprocal_ranks = []
    
    for sample in eval_data:
        query = sample["question"]
        correct_chunks = set(sample["relevant_chunks"])
        
        # Retrieve
        results = vector_db.search(query, k=k)
        retrieved_chunks = set([r["id"] for r in results])
        
        # Calculate metrics
        true_positives = len(retrieved_chunks & correct_chunks)
        precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0
        recall = true_positives / len(correct_chunks) if correct_chunks else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: find first relevant chunk
        for i, r in enumerate(results):
            if r["id"] in correct_chunks:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:
            reciprocal_ranks.append(0)
        
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1_scores),
        "avg_mrr": np.mean(reciprocal_ranks)
    }
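Running the evaluation over the whole dataset is then one call; the print formatting below is illustrative:

metrics = evaluate_retrieval(vector_db, eval_data, k=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")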

Level 2: Summary Indexing

Basic RAG struggles when a single chunk doesn't contain enough context. Summary indexing solves this by creating a "summary chunk" for each section that includes both the heading and a condensed version of the content.

How It Works

Instead of chunking by subheading only, we:

  • Create a summary of each major section using Claude
  • Store both the summary and the original chunks
  • Retrieve against summaries first, then use the full chunks as context

def create_summary_index(documents, anthropic_client):
    summary_index = []
    for doc in documents:
        prompt = f"""Summarize the following document section in 2-3 sentences:

{doc['text']}

Summary:"""
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )
        summary = response.content[0].text
        summary_index.append({
            "id": doc["id"],
            "summary": summary,
            "full_text": doc["text"]
        })
    return summary_index

This improved our recall from 0.66 to 0.69 and F1 from 0.52 to 0.54.

Level 3: Summary Indexing + Re-Ranking

Re-ranking takes the initial retrieval results and uses Claude to reorder them by relevance. This dramatically improves MRR.

Implementing a Re-Ranker

def rerank_with_claude(query, candidates, anthropic_client, top_k=3):
    # Build a prompt asking Claude to rank chunks by relevance
    chunks_text = "\n\n---\n\n".join([
        f"Chunk {i+1}: {c['text']}" 
        for i, c in enumerate(candidates)
    ])
    
    prompt = f"""Given the question below, rank the following chunks by relevance.
Return the chunk numbers in order of relevance (most relevant first).

Question: {query}

{chunks_text}

Ranked chunk numbers (most relevant first):"""

    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the response into zero-based indices
    ranked_indices = [
        int(x.strip()) - 1
        for x in response.content[0].text.split(",")
        if x.strip().isdigit()
    ]

    # Reorder candidates
    reranked = [candidates[i] for i in ranked_indices if i < len(candidates)]
    return reranked[:top_k]
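Re-ranking pays off when you over-retrieve a wider candidate set and let Claude narrow it down. A minimal end-to-end sketch; the candidate width of 10 is an assumption, not a tuned value:

def retrieve_and_rerank(query, vector_db, anthropic_client, candidate_k=10, top_k=3):
    # Over-retrieve, then let Claude reorder and trim the candidates
    candidates = vector_db.search(query, k=candidate_k)
    top_chunks = rerank_with_claude(query, candidates, anthropic_client, top_k=top_k)
    # Join the surviving chunks into the context passed to Claude for answering
    return "\n\n".join([c["text"] for c in top_chunks])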

This technique boosted MRR from 0.74 to 0.87 and end-to-end accuracy from 71% to 81%.

Results Summary

| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 71% | 81% |

Production Considerations

  • Rate Limits: Full evaluations may hit rate limits unless you're on Tier 2+. Consider running smaller subsets during development.
  • Vector Database: Use a hosted solution (Pinecone, Weaviate, Chroma) for production workloads.
  • Embedding Model: Voyage AI's voyage-2 is excellent, but experiment with other models for your domain.
  • Chunking Strategy: Experiment with different chunk sizes and overlap strategies.

Key Takeaways

  • Measure separately: Always evaluate retrieval performance independently from end-to-end answer quality. This helps you pinpoint where improvements are needed.
  • Start simple, then optimize: A basic RAG pipeline works surprisingly well. Add complexity (summary indexing, re-ranking) only when metrics show a clear need.
  • MRR matters most for user experience: Users care most about whether the first result is relevant. Re-ranking with Claude dramatically improves this metric.
  • Synthetic evaluation datasets are powerful: Generate 100-200 Q&A pairs with ground truth chunks and answers. This gives you a repeatable benchmark for measuring improvements.
  • Advanced techniques pay off: Summary indexing and re-ranking together improved end-to-end accuracy by 10 percentage points (from 71% to 81%).