Guide · 2026-05-05

Building Production-Grade RAG Systems with Claude: From Basic to Advanced

Learn how to build and optimize a Retrieval Augmented Generation (RAG) system with Claude, including evaluation metrics, summary indexing, and re-ranking techniques for production use.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to set up retrieval, evaluate performance with precision/recall/F1 metrics, and achieve significant accuracy improvements.

Tags: RAG, Claude, Retrieval Augmented Generation, Evaluation, Voyage AI


Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.

Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using Claude Documentation as our knowledge base.

What You'll Learn

By the end of this guide, you'll know how to:

  • Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
  • Build a robust evaluation suite that measures retrieval and end-to-end performance independently
  • Implement advanced techniques like summary indexing and re-ranking with Claude

Prerequisites and Setup

Before we begin, you'll need:

  • API keys from Anthropic and Voyage AI
  • Python 3.8+ installed
  • Basic familiarity with Python and API calls

Required Libraries

# Install the required packages
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Initialize Your Clients

import anthropic
import voyageai

# Initialize Claude client
claude_client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Initialize Voyage AI client for embeddings
voyage_client = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

Setting Up an In-Memory Vector Database

For this guide, we'll use an in-memory vector database. In production, you'd likely use a dedicated vector database such as Pinecone, Weaviate, or Chroma.

import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, embedding: List[float], metadata: Dict[str, Any] = None):
        self.documents.append({
            "text": text,
            "metadata": metadata or {}
        })
        self.embeddings.append(embedding)

    def search(self, query_embedding: List[float], top_k: int = 5) -> List[Dict[str, Any]]:
        # Cosine similarity search
        similarities = []
        for doc_embedding in self.embeddings:
            sim = np.dot(query_embedding, doc_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc_embedding)
            )
            similarities.append(sim)
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
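To sanity-check the search logic, here is a self-contained run of the same cosine-similarity ranking on toy 2-D vectors (toy data standing in for real embeddings, which have hundreds of dimensions):

```python
import numpy as np

def cosine_rank(query_embedding, embeddings, top_k=5):
    # Same scoring as InMemoryVectorDB.search: cosine similarity, highest first
    sims = [
        np.dot(query_embedding, e) / (np.linalg.norm(query_embedding) * np.linalg.norm(e))
        for e in embeddings
    ]
    return [int(i) for i in np.argsort(sims)[-top_k:][::-1]]

# Toy vectors: doc 0 points almost the same way as the query, doc 2 is nearly orthogonal
embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
order = cosine_rank([1.0, 0.1], embeddings, top_k=2)
print(order)  # [0, 1]
```

Note that cosine similarity ignores vector magnitude, which is why the normalization by both norms matters.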

Level 1: Basic RAG Pipeline

Let's start with a basic RAG pipeline, sometimes called "Naive RAG." This involves three steps:

  • Chunk documents by heading, so each chunk contains only the content under a single subheading
  • Embed each chunk using Voyage AI
  • Retrieve relevant chunks using cosine similarity

Step 1: Chunk Your Documents

def chunk_by_headings(document_text: str) -> List[Dict[str, str]]:
    """Split document into chunks based on headings."""
    chunks = []
    lines = document_text.split('\n')
    current_heading = "Introduction"
    current_content = []
    
    for line in lines:
        if line.startswith('#'):  # Markdown heading
            if current_content:
                chunks.append({
                    "heading": current_heading,
                    "content": '\n'.join(current_content).strip()
                })
            current_heading = line.lstrip('#').strip()
            current_content = []
        else:
            current_content.append(line)
    
    # Don't forget the last chunk
    if current_content:
        chunks.append({
            "heading": current_heading,
            "content": '\n'.join(current_content).strip()
        })
    
    return chunks
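A quick check of the chunker on a tiny markdown snippet (the function is repeated here in condensed form so the snippet runs standalone):

```python
def chunk_by_headings(document_text):
    # Split markdown text into (heading, content) chunks at each '#' heading
    chunks, heading, content = [], "Introduction", []
    for line in document_text.split('\n'):
        if line.startswith('#'):
            if content:
                chunks.append({"heading": heading, "content": '\n'.join(content).strip()})
            heading, content = line.lstrip('#').strip(), []
        else:
            content.append(line)
    if content:
        chunks.append({"heading": heading, "content": '\n'.join(content).strip()})
    return chunks

doc = "# Setup\nInstall the SDK.\n# Usage\nCall the client."
for c in chunk_by_headings(doc):
    print(c["heading"], "->", c["content"])
# Setup -> Install the SDK.
# Usage -> Call the client.
```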

Step 2: Embed and Store

def build_vector_db(chunks: List[Dict[str, str]]) -> InMemoryVectorDB:
    db = InMemoryVectorDB()
    
    for chunk in chunks:
        # Generate embedding using Voyage AI
        response = voyage_client.embed(
            texts=[chunk["content"]],
            model="voyage-2"
        )
        embedding = response.embeddings[0]
        
        db.add_document(
            text=chunk["content"],
            embedding=embedding,
            metadata={"heading": chunk["heading"]}
        )
    
    return db
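Embedding one chunk per request works but is slow for large corpora. Voyage's embed endpoint accepts a list of texts per call, so batching chunks cuts the request count substantially (the batch size of 128 here is an illustrative assumption; check your model's documented limit). A small batching helper:

```python
def batched(items, batch_size=128):
    # Yield consecutive slices of at most batch_size items
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage sketch with the voyage_client initialized earlier:
# for batch in batched([c["content"] for c in chunks]):
#     embeddings = voyage_client.embed(texts=batch, model="voyage-2").embeddings
print(list(batched(list(range(5)), batch_size=2)))  # [[0, 1], [2, 3], [4]]
```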

Step 3: Retrieve and Generate

def rag_query(db: InMemoryVectorDB, query: str, top_k: int = 3) -> str:
    # Embed the query
    response = voyage_client.embed(
        texts=[query],
        model="voyage-2"
    )
    query_embedding = response.embeddings[0]
    
    # Retrieve relevant chunks
    retrieved_chunks = db.search(query_embedding, top_k=top_k)
    
    # Build context from retrieved chunks
    context = "\n\n".join([chunk["text"] for chunk in retrieved_chunks])
    
    # Generate answer using Claude
    # Keep the prompt flush-left so the f-string doesn't embed leading indentation
    prompt = f"""Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
    
    response = claude_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

Building an Evaluation System

When evaluating RAG applications, it's critical to evaluate the performance of the retrieval system and end-to-end system separately. We'll use a synthetic evaluation dataset with 100 samples, each containing:

  • A question
  • Relevant chunks (ground truth)
  • A correct answer

Key Metrics

#### Retrieval Metrics

Precision measures how many of the retrieved chunks are actually relevant:
Precision = True Positives / Total Retrieved

High precision means fewer irrelevant chunks are being retrieved.

Recall measures how many of the relevant chunks were retrieved:
Recall = True Positives / Total Relevant

High recall means you're capturing most of the necessary information.

F1 Score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)

Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the results:
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
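Precision, recall, and F1 can be computed directly from the retrieved and ground-truth chunk identifiers, for example:

```python
def retrieval_metrics(retrieved, relevant):
    # Precision, recall, and F1 over chunk identifiers
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# 1 of 3 retrieved chunks is relevant; 1 of 2 relevant chunks was found
p, r, f1 = retrieval_metrics(retrieved=["a", "b", "c"], relevant=["a", "d"])
```

Averaging these per-question values over the whole evaluation set gives the aggregate numbers reported later in this guide.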

#### End-to-End Metric

End-to-End Accuracy measures whether the final answer is correct, considering both retrieval and generation quality.
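Once each answer has been graded correct or incorrect (grading is typically done by exact match or an LLM judge; that step is assumed here), the metric itself is a simple ratio:

```python
def end_to_end_accuracy(grades):
    # grades: one boolean per eval question (True = answer judged correct)
    return sum(grades) / len(grades) if grades else 0.0

print(end_to_end_accuracy([True, True, False, True]))  # 0.75
```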

Level 2: Summary Indexing

A major improvement over basic RAG is summary indexing. Instead of retrieving raw chunks, you:

  • Generate a summary of each chunk using Claude
  • Store both the summary and the original chunk
  • Retrieve based on summary similarity, then return the full chunk

def generate_summary(chunk_text: str) -> str:
    prompt = f"Summarize the following text in 2-3 sentences:\n\n{chunk_text}"
    
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text
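The key design point is what the index stores: you embed the summary but keep the full chunk, so retrieval matches against concise text while generation still sees the original. Below is a sketch of that layout with stub `summarize` and `embed` callables standing in for the Claude and Voyage calls above (the stubs are placeholders for illustration only):

```python
def build_summary_index(chunks, summarize, embed):
    # chunks: list of {"heading", "content"}; summarize/embed are injected
    # so the real Claude + Voyage calls can be swapped in
    index = []
    for chunk in chunks:
        summary = summarize(chunk["content"])
        index.append({
            "embedding": embed(summary),    # searched against
            "summary": summary,
            "full_text": chunk["content"],  # returned to the generator
        })
    return index

# Stub implementations: first sentence as "summary", text length as "embedding"
index = build_summary_index(
    [{"heading": "Setup", "content": "Install the SDK. Then set your API key."}],
    summarize=lambda text: text.split('.')[0] + ".",
    embed=lambda text: [float(len(text))],
)
print(index[0]["summary"])    # Install the SDK.
print(index[0]["full_text"])  # original chunk preserved
```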

Level 3: Summary Indexing + Re-Ranking

The most advanced approach combines summary indexing with re-ranking. After initial retrieval, you use Claude to re-rank the results based on relevance to the query.

import re

def rerank_with_claude(query: str, chunks: List[Dict[str, str]], top_k: int = 3) -> List[Dict[str, str]]:
    # Show Claude a truncated preview of each chunk and ask for a relevance ranking
    chunks_text = "\n---\n".join([
        f"Chunk {i+1}: {chunk['text'][:200]}..."
        for i, chunk in enumerate(chunks)
    ])

    prompt = f"""Given the query: "{query}"

Rank the following chunks by relevance (most relevant first):

{chunks_text}

Return the chunk numbers in order of relevance, separated by commas."""

    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the response into ordered 0-based indices
    indices = [int(x) - 1 for x in re.findall(r'\d+', response.content[0].text)]

    return [chunks[i] for i in indices[:top_k]]
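The regex parse is the fragile step: models sometimes repeat a number, invent an out-of-range one, or answer in prose. A slightly more defensive parser (a sketch) dedupes, drops invalid indices, and appends anything the model omitted so no chunk is silently lost:

```python
import re

def parse_ranking(response_text, num_chunks):
    # Extract 1-based chunk numbers, convert to 0-based, dedupe, drop out-of-range
    seen, order = set(), []
    for match in re.findall(r'\d+', response_text):
        idx = int(match) - 1
        if 0 <= idx < num_chunks and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Fall back to original order for any chunks the model didn't mention
    order += [i for i in range(num_chunks) if i not in seen]
    return order

print(parse_ranking("3, 1, 3, 7", num_chunks=4))  # [2, 0, 1, 3]
```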

Performance Gains

Through these targeted improvements, you can achieve significant performance gains:

| Metric | Basic RAG | Advanced RAG |
| --- | --- | --- |
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |

Key Takeaways

  • Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
  • Summary indexing improves retrieval quality by matching queries against concise summaries rather than raw chunks, leading to better semantic alignment.
  • Re-ranking with Claude significantly boosts MRR by ensuring the most relevant chunks appear first, which is critical for time-sensitive applications.
  • Start simple, then iterate — a basic RAG pipeline can be surprisingly effective. Only add complexity (summary indexing, re-ranking) when metrics show it's needed.
  • Use high-quality embeddings from providers like Voyage AI to ensure your retrieval foundation is solid before optimizing other components.