
Building a Production-Grade RAG System with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.

Tags: RAG, Claude, Retrieval Augmented Generation, Evaluation, Voyage AI


Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.

Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.

What You'll Learn

By the end of this guide, you'll know how to:

  • Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
  • Build a robust evaluation suite that measures retrieval and end-to-end performance independently
  • Implement advanced techniques including summary indexing and re-ranking with Claude
Through these improvements, you can achieve significant performance gains. For example, one implementation saw End-to-End Accuracy jump from 71% to 81%, and Mean Reciprocal Rank (MRR) improve from 0.74 to 0.87.

Prerequisites

Before diving in, you'll need an Anthropic API key, a Voyage AI API key (both are used in the code below), and a Python environment with the libraries listed next.

Required Libraries

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Level 1: Basic RAG Pipeline

Let's start with what's often called "Naive RAG" — a bare-bones approach that includes three steps:

  • Chunk documents by heading (each chunk contains content from one subheading; a minimal chunking sketch follows this list)
  • Embed each chunk using Voyage AI embeddings
  • Retrieve relevant chunks using cosine similarity
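
Here's a minimal sketch of the first step: splitting a markdown document into one chunk per heading. The regex and the metadata fields are illustrative assumptions; adapt them to however your knowledge base is structured.

import re
from typing import List, Dict

def chunk_by_heading(markdown_text: str) -> List[Dict[str, str]]:
    """Split a markdown document into one chunk per heading (illustrative sketch)."""
    chunks = []
    current_heading = "Introduction"
    current_lines = []
    for line in markdown_text.splitlines():
        # Treat any markdown heading (#, ##, ###, ...) as a chunk boundary
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            if current_lines:
                chunks.append({"heading": current_heading, "content": "\n".join(current_lines).strip()})
            current_heading = match.group(1).strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        chunks.append({"heading": current_heading, "content": "\n".join(current_lines).strip()})
    return chunks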

Initialize a Vector Database

For this example, we'll use an in-memory vector DB. For production, consider a dedicated vector database such as Pinecone, Weaviate, or Chroma.

import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve top-k relevant documents."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Voyage embeddings are unit-normalized, so the dot product is cosine similarity
        similarities = [
            np.dot(query_embedding, doc_emb) for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
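
With the class in place, indexing and querying might look like the following. The API key and document contents are placeholders:

db = InMemoryVectorDB(api_key="your-voyage-api-key")

# Each document is a dict with at least a "content" field
db.add_documents([
    {"heading": "Rate limits", "content": "Claude API rate limits are applied per organization..."},
    {"heading": "Streaming", "content": "You can stream responses using server-sent events..."},
])

top_chunks = db.search("How do rate limits work?", k=2)
print([chunk["heading"] for chunk in top_chunks])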

Query Claude with Retrieved Context

from anthropic import Anthropic

anthropic = Anthropic(api_key="your-anthropic-api-key")

def answer_with_rag(query: str, context_chunks: List[str]) -> str:
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Answer the following question based on the provided context.

Context: {context}

Question: {query}

Answer:"""
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Building an Evaluation System

When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. This allows you to pinpoint where improvements are needed.

Creating an Evaluation Dataset

You'll need a dataset with:

  • A question
  • Relevant chunks (ground truth for retrieval)
  • A correct answer (ground truth for end-to-end)
Here's a sample structure:

[
  {
    "question": "How do I set up rate limiting in Claude?",
    "relevant_chunks": ["chunk_1_content", "chunk_2_content"],
    "correct_answer": "To set up rate limiting..."
  }
]

Key Metrics

#### Retrieval Metrics

Precision measures the proportion of retrieved chunks that are actually relevant.
Precision = True Positives / Total Retrieved
Recall measures the completeness of retrieval — how many of the relevant chunks were retrieved.
Recall = True Positives / Total Relevant
F1 Score is the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the results.
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
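
Precision, recall, and F1 can be computed per query and then averaged across the evaluation set. A minimal sketch, assuming chunks are compared by exact content match as in calculate_mrr above:

def calculate_retrieval_metrics(retrieved_chunks, relevant_chunks):
    """Compute precision, recall, and F1 for a single query."""
    true_positives = sum(1 for chunk in retrieved_chunks if chunk in relevant_chunks)
    precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0.0
    recall = true_positives / len(relevant_chunks) if relevant_chunks else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}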

#### End-to-End Metric

Accuracy measures whether the final answer is correct. This requires human or LLM-as-judge evaluation.
def evaluate_accuracy(question, generated_answer, correct_answer):
    # Use Claude to judge if the answer is correct
    prompt = f"""Question: {question}
Generated Answer: {generated_answer}
Correct Answer: {correct_answer}

Is the generated answer correct? Answer only 'yes' or 'no'."""
    # Call Claude and parse the response (any Claude model can act as the judge)
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().lower().startswith("yes")
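
To run the full evaluation, loop over the dataset, score retrieval and generation for each question, and average the results. A sketch, assuming the eval dataset follows the JSON structure above and the helpers defined in this section:

def run_evaluation(eval_dataset, db, k=3):
    """Score retrieval and end-to-end accuracy over the whole eval set."""
    results = []
    for example in eval_dataset:
        question = example["question"]
        retrieved_docs = db.search(question, k=k)
        retrieved = [doc["content"] for doc in retrieved_docs]

        metrics = calculate_retrieval_metrics(retrieved, example["relevant_chunks"])
        metrics["mrr"] = calculate_mrr(retrieved, example["relevant_chunks"])

        answer = answer_with_rag(question, retrieved)
        metrics["accurate"] = evaluate_accuracy(question, answer, example["correct_answer"])
        results.append(metrics)

    # Average each metric across the dataset (booleans average to an accuracy fraction)
    return {
        key: sum(r[key] for r in results) / len(results)
        for key in ["precision", "recall", "f1", "mrr", "accurate"]
    }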

Level 2: Summary Indexing

Basic RAG often fails when a question requires synthesizing information across multiple chunks. Summary indexing addresses this by creating condensed representations of document sections.

How It Works

  • For each document chunk, generate a summary using Claude
  • Store both the original chunk and its summary
  • During retrieval, search against summaries first, then retrieve full chunks
def create_summary(chunk_content: str) -> str:
    prompt = f"Summarize the following text in 2-3 sentences:\n\n{chunk_content}"
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
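
The retrieval side of summary indexing can reuse the same vector store: embed the summaries, search against them, and return the corresponding full chunks. A minimal sketch, assuming the InMemoryVectorDB class from Level 1; the build_summary_index and search_with_summaries helpers are illustrative names:

def build_summary_index(chunks: List[Dict[str, str]], api_key: str) -> InMemoryVectorDB:
    """Index summaries, but keep the full chunk alongside each one."""
    summary_db = InMemoryVectorDB(api_key=api_key)
    summary_docs = []
    for chunk in chunks:
        summary = create_summary(chunk["content"])
        # "content" (the summary) is what gets embedded; "full_chunk" is what we return
        summary_docs.append({"content": summary, "full_chunk": chunk["content"]})
    summary_db.add_documents(summary_docs)
    return summary_db

def search_with_summaries(summary_db: InMemoryVectorDB, query: str, k: int = 3) -> List[str]:
    """Search against summaries, then hand the full chunks to Claude."""
    hits = summary_db.search(query, k=k)
    return [hit["full_chunk"] for hit in hits]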

Level 3: Summary Indexing + Re-Ranking

Re-ranking adds a second stage to retrieval. After initial retrieval, Claude re-ranks the chunks by relevance to the specific query.

Implementation

def rerank_chunks(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""Given the query: "{query}"

Rank the following chunks by relevance (most relevant first).

Chunks:"""
    for i, chunk in enumerate(chunks):
        prompt += f"\n[{i+1}] {chunk[:200]}..."
    prompt += "\n\nReturn the chunk numbers in order of relevance, comma-separated."
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the response to get ordered indices
    indices = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [chunks[i] for i in indices[:top_k]]

Performance Gains

With summary indexing and re-ranking, you can expect improvements like:

| Metric | Basic RAG | Advanced RAG |
| --- | --- | --- |
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |

Best Practices for Production RAG

  • Chunk strategically: Experiment with chunk sizes (256-512 tokens often work well) and overlap
  • Use dedicated embedding models: Voyage AI and Cohere offer purpose-built embeddings for RAG
  • Implement caching: Cache embeddings and common queries to reduce latency and cost (see the embedding-cache sketch after this list)
  • Monitor and iterate: Continuously evaluate your system and add edge cases to your test set
  • Consider hybrid search: Combine semantic search with keyword matching for better recall
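
As an example of the caching point above, embeddings can be memoized so repeated or unchanged texts never hit the embedding API twice. A minimal in-process sketch (the CachedEmbedder class is illustrative; production systems would typically persist the cache):

import hashlib
from typing import List

import voyageai

class CachedEmbedder:
    """Wrap the Voyage client with a simple in-memory embedding cache."""
    def __init__(self, api_key: str, model: str = "voyage-2"):
        self.client = voyageai.Client(api_key=api_key)
        self.model = model
        self.cache = {}

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Only send texts we haven't embedded before
        missing = [t for t in texts if self._key(t) not in self.cache]
        if missing:
            new_embeddings = self.client.embed(missing, model=self.model).embeddings
            for text, emb in zip(missing, new_embeddings):
                self.cache[self._key(text)] = emb
        return [self.cache[self._key(t)] for t in texts]

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()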

Key Takeaways

  • Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
  • Summary indexing improves recall by creating condensed representations that capture the essence of document sections, making retrieval more effective for complex queries.
  • Re-ranking with Claude significantly boosts MRR by ensuring the most relevant chunks appear first, which improves the quality of the final answer.
  • Start simple, then iterate — a basic RAG pipeline can be surprisingly effective. Add complexity like summary indexing and re-ranking only when evaluation shows they're needed.
  • Build a robust evaluation dataset with diverse questions, including those requiring synthesis across multiple chunks, to stress-test your system.