
Building a Production-Grade RAG System with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, and see how targeted improvements boosted accuracy from 71% to 81%.

Tags: RAG, Claude, Retrieval Augmented Generation, Evaluation, Voyage AI


Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.

Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.

We'll walk you through:

  • Setting up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
  • Building a robust evaluation suite that measures retrieval and end-to-end performance independently
  • Implementing advanced techniques including summary indexing and re-ranking with Claude

Through these targeted improvements, we achieved significant performance gains:

  • Avg Precision: 0.43 → 0.44
  • Avg Recall: 0.66 → 0.69
  • Avg F1 Score: 0.52 → 0.54
  • Avg Mean Reciprocal Rank (MRR): 0.74 → 0.87
  • End-to-End Accuracy: 71% → 81%

Prerequisites and Setup

Before diving in, you'll need:

  • API keys from Anthropic and Voyage AI
  • Python 3.8+
  • Required libraries: anthropic, voyageai, pandas, numpy, matplotlib, scikit-learn
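
You can install the required libraries with pip:

pip install anthropic voyageai pandas numpy matplotlib scikit-learn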

Initialize Your Environment

import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")

Create a Vector Database Class

For this guide, we'll use an in-memory vector database. In production, consider hosted solutions like Pinecone, Weaviate, or Chroma.

class InMemoryVectorDB:
    def __init__(self):
        self.vectors = []
        self.metadata = []
    
    def add(self, vector, metadata):
        self.vectors.append(vector)
        self.metadata.append(metadata)
    
    def search(self, query_vector, top_k=5):
        similarities = cosine_similarity([query_vector], self.vectors)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.metadata[i], similarities[i]) for i in top_indices]

Level 1: Basic RAG Pipeline

A basic RAG pipeline, sometimes called "Naive RAG," includes three steps:

  • Chunk documents by heading, so each chunk contains only the content under a single subheading
  • Embed each chunk using Voyage AI's embedding model
  • Retrieve relevant chunks using cosine similarity

Step 1: Chunk Your Documents

def chunk_by_headings(document):
    """Split document into chunks based on headings."""
    chunks = []
    current_heading = None
    current_content = []
    
    for line in document.split('\n'):
        if line.startswith('#'):
            if current_heading:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content)
                })
            current_heading = line
            current_content = []
        else:
            current_content.append(line)
    
    if current_heading:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content)
        })
    
    return chunks

Step 2: Embed and Index

def embed_and_index(chunks, vector_db):
    """Embed chunks and add to vector database."""
    for chunk in chunks:
        # Generate embedding using Voyage AI
        embedding = vo.embed(
            [chunk['content']],
            model="voyage-2"
        ).embeddings[0]
        
        # Add to vector DB
        vector_db.add(embedding, chunk)

Step 3: Retrieve and Generate

def retrieve_and_answer(query, vector_db, claude_client):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    
    # Retrieve top 3 relevant chunks
    results = vector_db.search(query_embedding, top_k=3)
    context = "\n\n".join([r[0]['content'] for r in results])
    
    # Generate answer with Claude
    response = claude_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    
    return response.content[0].text
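
Putting the basic pipeline together might look like the following sketch. Here, docs_text stands in for the raw Claude Documentation text, and the question is just an example query:

# Index the documentation once (docs_text is a placeholder for the raw text)
db = InMemoryVectorDB()
chunks = chunk_by_headings(docs_text)
embed_and_index(chunks, db)

# Answer a question against the indexed documentation
answer = retrieve_and_answer("How do I stream responses from Claude?", db, claude)
print(answer)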

Building an Evaluation System

When evaluating RAG applications, it's critical to evaluate the performance of the retrieval system and end-to-end system separately. We synthetically generated an evaluation dataset of 100 samples, each containing:

  • A question
  • Relevant chunks (ground truth for retrieval)
  • A correct answer (ground truth for generation)
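
For concreteness, one record in such a dataset might look like this sketch (the field names are illustrative rather than a required schema):

# A single synthetic evaluation sample (illustrative field names and values)
eval_sample = {
    "question": "Which embedding model does this guide use for chunks?",
    "relevant_chunks": ["# Embeddings"],   # ground truth for retrieval
    "correct_answer": "The guide embeds chunks with Voyage AI's voyage-2 model."
}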

Retrieval Metrics

Precision

Precision answers: "Of the chunks we retrieved, how many were correct?"

def precision(retrieved, relevant):
    """Calculate precision@k."""
    true_positives = len(set(retrieved) & set(relevant))
    return true_positives / len(retrieved) if retrieved else 0

Recall

Recall answers: "Of all the correct chunks, how many did we retrieve?"

def recall(retrieved, relevant):
    """Calculate recall@k."""
    true_positives = len(set(retrieved) & set(relevant))
    return true_positives / len(relevant) if relevant else 0

F1 Score

The F1 score is the harmonic mean of precision and recall.

def f1_score(prec, rec):
    """Calculate F1 score."""
    if prec + rec == 0:
        return 0
    return 2 * (prec * rec) / (prec + rec)

Mean Reciprocal Rank (MRR)

MRR measures how early the first relevant chunk appears in the results.

def mrr(retrieved, relevant):
    """Calculate Mean Reciprocal Rank."""
    for i, chunk in enumerate(retrieved):
        if chunk in relevant:
            return 1 / (i + 1)
    return 0

End-to-End Accuracy

This measures whether Claude's final answer is correct. Use Claude itself as a judge:

def evaluate_answer(question, generated_answer, correct_answer, claude_client):
    prompt = f"""
    Question: {question}
    Generated Answer: {generated_answer}
    Correct Answer: {correct_answer}
    
    Is the generated answer correct? Answer only 'YES' or 'NO'.
    """
    
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text.strip() == 'YES'
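
With the metrics and the judge in place, a full evaluation loop might look like the sketch below. It assumes the eval_sample format sketched earlier and identifies chunks by their headings; both of those choices are assumptions made for illustration.

def evaluate_rag(eval_dataset, vector_db, claude_client, top_k=3):
    """Run retrieval and end-to-end evaluation over the synthetic dataset."""
    metrics = {'precision': [], 'recall': [], 'f1': [], 'mrr': [], 'correct': []}

    for sample in eval_dataset:
        # Retrieval: embed the question and fetch the top-k chunk headings
        query_embedding = vo.embed([sample['question']], model="voyage-2").embeddings[0]
        results = vector_db.search(query_embedding, top_k=top_k)
        retrieved = [meta['heading'] for meta, _ in results]

        # Retrieval metrics against the ground-truth chunks
        p = precision(retrieved, sample['relevant_chunks'])
        r = recall(retrieved, sample['relevant_chunks'])
        metrics['precision'].append(p)
        metrics['recall'].append(r)
        metrics['f1'].append(f1_score(p, r))
        metrics['mrr'].append(mrr(retrieved, sample['relevant_chunks']))

        # End-to-end accuracy with Claude as judge
        answer = retrieve_and_answer(sample['question'], vector_db, claude_client)
        metrics['correct'].append(
            evaluate_answer(sample['question'], answer, sample['correct_answer'], claude_client)
        )

    # Average each metric across the dataset
    return {name: float(np.mean(values)) for name, values in metrics.items()}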

Level 2: Summary Indexing

Basic RAG often fails when a question requires synthesizing information across multiple chunks. Summary indexing addresses this by creating higher-level summaries that capture cross-chunk relationships.

def create_summary_index(chunks, claude_client):
    """Create summary embeddings for groups of related chunks."""
    summaries = []
    
    # Group chunks by topic (e.g., same section)
    for i in range(0, len(chunks), 3):
        group = chunks[i:i+3]
        combined = "\n\n".join([c['content'] for c in group])
        
        # Generate summary with Claude
        response = claude_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize the following content in 2-3 sentences:\n\n{combined}"
            }]
        )
        
        summary = response.content[0].text
        
        # Embed the summary
        embedding = vo.embed([summary], model="voyage-2").embeddings[0]
        summaries.append({
            'summary': summary,
            'embedding': embedding,
            'chunks': group
        })
    
    return summaries

When a query comes in, search both the chunk embeddings and summary embeddings, then merge results.
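
The advanced pipeline later in this guide calls two helpers, search_summaries and merge_candidates, for exactly this step. A minimal sketch of them might look like this (only the function names come from the pipeline code; the implementations here are assumptions):

def search_summaries(query_embedding, summary_index, top_k=3):
    """Find the summary groups most similar to the query embedding."""
    similarities = cosine_similarity(
        [query_embedding], [s['embedding'] for s in summary_index]
    )[0]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    # Return the chunks behind each matching summary, with the summary's score
    return [(chunk, similarities[i])
            for i in top_indices
            for chunk in summary_index[i]['chunks']]

def merge_candidates(chunk_results, summary_results):
    """Combine chunk-level and summary-level hits, dropping duplicates."""
    merged, seen = [], set()
    for chunk, _score in chunk_results + summary_results:
        key = chunk['heading']
        if key not in seen:
            seen.add(key)
            merged.append(chunk)
    return merged

Returning the underlying chunks rather than the summaries themselves keeps the downstream re-ranking and generation steps working with the original content.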

Level 3: Summary Indexing + Re-Ranking

Re-ranking with Claude adds an intelligent filtering step after initial retrieval. This dramatically improves MRR by ensuring the most relevant chunks appear first.

def rerank_with_claude(query, candidates, claude_client):
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare candidate list for Claude
    candidate_text = ""
    for i, c in enumerate(candidates):
        candidate_text += f"[{i}] {c['content'][:200]}...\n\n"
    
    prompt = f"""
    Query: {query}
    
    Candidates:
    {candidate_text}
    
    Rank the candidates from most relevant to least relevant.
    Return only the indices in order, comma-separated.
    Example: "3, 1, 0, 2"
    """
    
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    
    # Parse the ranked indices
    ranked_indices = [int(x.strip()) for x in response.content[0].text.split(',')]
    return [candidates[i] for i in ranked_indices]

Putting It All Together

def advanced_rag_pipeline(query, vector_db, summary_index, claude_client):
    # Step 1: Embed query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    
    # Step 2: Retrieve from both indexes
    chunk_results = vector_db.search(query_embedding, top_k=5)
    summary_results = search_summaries(query_embedding, summary_index, top_k=3)
    
    # Step 3: Merge candidates (remove duplicates)
    all_candidates = merge_candidates(chunk_results, summary_results)
    
    # Step 4: Re-rank with Claude
    reranked = rerank_with_claude(query, all_candidates, claude_client)
    
    # Step 5: Generate answer with top chunks
    top_context = "\n\n".join([c['content'] for c in reranked[:3]])
    
    response = claude_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Context: {top_context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    
    return response.content[0].text
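
Calling the full pipeline mirrors the basic setup, with the summary index built once up front. This sketch reuses the db and chunks from the earlier basic example, and the question is again just a placeholder:

# Build the summary index once, alongside the chunk-level index
summary_index = create_summary_index(chunks, claude)

# Ask a question through the advanced pipeline
answer = advanced_rag_pipeline(
    "How do I stream responses from Claude?", db, summary_index, claude
)
print(answer)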

Performance Results

After implementing these techniques, here's how our metrics improved:

Metric                  Basic RAG    Advanced RAG
Avg Precision           0.43         0.44
Avg Recall              0.66         0.69
Avg F1 Score            0.52         0.54
Avg MRR                 0.74         0.87
End-to-End Accuracy     71%          81%

The most dramatic improvement came in MRR (from 0.74 to 0.87), thanks to Claude's re-ranking capability. End-to-end accuracy jumped 10 percentage points, demonstrating that better retrieval directly leads to better answers.

Key Takeaways

  • Evaluate retrieval and generation separately: Use precision, recall, F1, and MRR for retrieval; use accuracy or Claude-as-judge for end-to-end performance. This helps you pinpoint where improvements are needed.
  • Summary indexing bridges the gap: When questions require synthesizing information across multiple chunks, summary embeddings capture higher-level relationships that individual chunk embeddings miss.
  • Re-ranking with Claude dramatically improves MRR: Using Claude to intelligently re-order retrieved chunks ensures the most relevant information appears first, which is critical for generating accurate answers.
  • Start simple, then iterate: Begin with a basic RAG pipeline, establish your evaluation metrics, then layer on advanced techniques. This approach ensures you can measure the impact of each improvement.
  • Consider production trade-offs: In-memory vector databases work for prototyping, but production systems should use hosted solutions. Also be mindful of rate limits and token usage when running full evaluations.