Building a Production-Grade RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.
What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques including summary indexing and re-ranking with Claude
Prerequisites
Before diving in, you'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+
- Basic familiarity with Python and API usage
Required Libraries
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG" — a bare-bones approach that includes three steps:
- Chunk documents by heading (each chunk contains content from one subheading)
- Embed each chunk using Voyage AI embeddings
- Retrieve relevant chunks using cosine similarity
Initialize a Vector Database
For this example, we'll use an in-memory vector DB. For production, consider a hosted solution like Pinecone, Weaviate, or Chroma.
import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Embed documents and store them alongside their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(
            texts, model="voyage-2", input_type="document"
        ).embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve the top-k most similar documents."""
        query_embedding = self.client.embed(
            [query], model="voyage-2", input_type="query"
        ).embeddings[0]
        # Rank stored documents by dot-product similarity with the query
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
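The `search` method above scores with a raw dot product, which equals cosine similarity only when embeddings are unit-length. If you are unsure whether your embedding model normalizes its output, compute cosine similarity explicitly:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity that does not assume unit-length vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```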
Query Claude with Retrieved Context
from anthropic import Anthropic

anthropic = Anthropic(api_key="your-anthropic-api-key")

def answer_with_rag(query: str, context_chunks: List[str]) -> str:
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Building an Evaluation System
When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. This allows you to pinpoint where improvements are needed.
Creating an Evaluation Dataset
You'll need a dataset with:
- A question
- Relevant chunks (ground truth for retrieval)
- A correct answer (ground truth for end-to-end)
[
    {
        "question": "How do I set up rate limiting in Claude?",
        "relevant_chunks": ["chunk_1_content", "chunk_2_content"],
        "correct_answer": "To set up rate limiting..."
    }
]
Key Metrics
#### Retrieval Metrics
Precision measures the proportion of retrieved chunks that are actually relevant.
Precision = True Positives / Total Retrieved
Recall measures the completeness of retrieval — how many of the relevant chunks were retrieved.
Recall = True Positives / Total Relevant
F1 Score is the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
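Assuming retrieved and ground-truth chunks can be compared by exact content match, these three metrics can be computed together for a single query:

```python
from typing import Dict, List

def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for one query."""
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    # Harmonic mean of precision and recall (0.0 when both are 0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Average these per-query scores over your whole evaluation dataset to get the aggregate numbers reported later in this guide.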
Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the results.
def calculate_mrr(retrieved_chunks, relevant_chunks):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
#### End-to-End Metric
Accuracy measures whether the final answer is correct. This requires human evaluation or an LLM-as-judge.

def evaluate_accuracy(question, generated_answer, correct_answer):
    # Use Claude to judge whether the generated answer is correct
    prompt = f"""Question: {question}

Generated Answer: {generated_answer}

Correct Answer: {correct_answer}

Is the generated answer correct? Answer only 'yes' or 'no'."""
    # ... call Claude and parse response
Level 2: Summary Indexing
Basic RAG often fails when a question requires synthesizing information across multiple chunks. Summary indexing addresses this by creating condensed representations of document sections.
How It Works
- For each document chunk, generate a summary using Claude
- Store both the original chunk and its summary
- During retrieval, search against summaries first, then retrieve full chunks
def create_summary(chunk_content: str) -> str:
    prompt = f"Summarize the following text in 2-3 sentences:\n\n{chunk_content}"
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
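Step 3 above (search against summaries, then return the full chunks) can be sketched independently of any embedding provider. This is a simplified illustration, with the similarity function passed in as a parameter so you can plug in Voyage embeddings plus the dot product from earlier:

```python
from typing import Callable, List, Sequence

def search_via_summaries(
    query_embedding: Sequence[float],
    summary_embeddings: List[Sequence[float]],
    full_chunks: List[str],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    k: int = 3,
) -> List[str]:
    """Score summaries against the query, but return the matching full chunks."""
    # Positions i in summary_embeddings and full_chunks refer to the same chunk
    ranked = sorted(
        range(len(summary_embeddings)),
        key=lambda i: similarity(query_embedding, summary_embeddings[i]),
        reverse=True,
    )
    return [full_chunks[i] for i in ranked[:k]]
```

Keeping summaries and full chunks in parallel lists (or one record per chunk) is the key invariant: the summary is only a retrieval proxy, and the full chunk is what reaches Claude's context window.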
Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second stage to retrieval. After initial retrieval, Claude re-ranks the chunks by relevance to the specific query.
Implementation
def rerank_chunks(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""Given the query: "{query}"

Rank the following chunks by relevance (most relevant first).

Chunks:
"""
    for i, chunk in enumerate(chunks):
        prompt += f"\n[{i+1}] {chunk[:200]}..."
    prompt += "\n\nReturn the chunk numbers in order of relevance, comma-separated."
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the response into ordered 0-based indices
    indices = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [chunks[i] for i in indices[:top_k]]
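The parsing step above assumes Claude returns exactly a comma-separated list of numbers, which is optimistic in practice. A slightly more defensive parser (a sketch, not part of any official API) tolerates surrounding prose, duplicates, and out-of-range values:

```python
import re
from typing import List

def parse_ranking(response_text: str, num_chunks: int) -> List[int]:
    """Extract 1-based chunk numbers from a model response as 0-based indices.

    Ignores non-numeric text, duplicate numbers, and out-of-range values.
    """
    indices: List[int] = []
    for token in re.findall(r"\d+", response_text):
        idx = int(token) - 1
        if 0 <= idx < num_chunks and idx not in indices:
            indices.append(idx)
    return indices
```

Swapping this into `rerank_chunks` keeps a slightly off-format model response from raising a `ValueError` or `IndexError` mid-pipeline.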
Performance Gains
With summary indexing and re-ranking, you can expect improvements like:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Best Practices for Production RAG
- Chunk strategically: Experiment with chunk sizes (256-512 tokens often work well) and overlap
- Use dedicated embedding models: Voyage AI and Cohere offer purpose-built embeddings for RAG
- Implement caching: Cache embeddings and common queries to reduce latency and cost
- Monitor and iterate: Continuously evaluate your system and add edge cases to your test set
- Consider hybrid search: Combine semantic search with keyword matching for better recall
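For the hybrid-search suggestion, one simple way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is illustrative rather than prescriptive; the constant `k = 60` is the value commonly used in the RRF literature:

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse multiple ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works purely on ranks, it needs no score normalization between the semantic and keyword retrievers, which makes it an easy first hybrid baseline.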
Key Takeaways
- Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing improves recall by creating condensed representations that capture the essence of document sections, making retrieval more effective for complex queries.
- Re-ranking with Claude significantly boosts MRR by ensuring the most relevant chunks appear first, which improves the quality of the final answer.
- Start simple, then iterate — a basic RAG pipeline can be surprisingly effective. Add complexity like summary indexing and re-ranking only when evaluation shows they're needed.
- Build a robust evaluation dataset with diverse questions, including those requiring synthesis across multiple chunks, to stress-test your system.