Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide teaches you to build a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, and see how the advanced techniques lift end-to-end accuracy from 71% to 81% on a synthetic evaluation set.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll walk through building and optimizing a RAG system using Claude, from a basic "naive" approach to advanced techniques that deliver measurable improvements.
What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG pipeline using embeddings and vector search
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques like summary indexing and re-ranking with Claude
- Quantify the gains: in our evaluation, these techniques improved MRR from 0.74 to 0.87 and end-to-end accuracy from 71% to 81%
Understanding the RAG Architecture
A basic RAG pipeline follows three steps:
- Chunk documents into manageable pieces (e.g., by heading)
- Embed each chunk using a vector embedding model
- Retrieve relevant chunks via cosine similarity to answer a query
Level 1: Basic RAG Setup
Prerequisites
You'll need:
- An Anthropic API key for Claude
- A Voyage AI API key for embeddings
- Python libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
Initialize Your Vector Database
For this example, we'll use an in-memory vector DB. For production, consider a hosted solution like Pinecone or Weaviate.
```python
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: List[str]):
        self.documents.extend(docs)
        response = self.client.embed(docs, model="voyage-2")
        self.embeddings.extend(response.embeddings)

    def search(self, query: str, top_k: int = 3) -> List[str]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Dot product equals cosine similarity when embeddings are unit-normalized
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
Chunking Strategy
A simple but effective strategy is to chunk documents by heading, keeping content from each subheading together. This preserves semantic boundaries.
```python
def chunk_by_heading(text: str) -> List[str]:
    chunks = []
    current_chunk = []
    for line in text.split('\n'):
        if line.startswith('#'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks
```
Basic RAG Query Function
```python
import anthropic

claude = anthropic.Anthropic(api_key="your-api-key")
db = InMemoryVectorDB(api_key="your-voyage-key")

def basic_rag(query: str) -> str:
    # Retrieve the most relevant chunks
    chunks = db.search(query, top_k=3)
    context = "\n\n---\n\n".join(chunks)

    # Generate an answer with Claude, grounded in the retrieved context
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context."
        }]
    )
    return response.content[0].text
```
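Before querying, the vector store has to be populated with your chunked documents. A minimal wiring step might look like the sketch below; `docs` is assumed to be a list of markdown documents loaded from your own knowledge base, and `all_chunks` is reused later when we build the summary index.

```python
# Hypothetical indexing step: `docs` is a list of markdown documents
# loaded from your knowledge base.
all_chunks = []
for doc in docs:
    all_chunks.extend(chunk_by_heading(doc))

# Embed and store every chunk in the in-memory vector DB
db.add_documents(all_chunks)

# Ask a question against the indexed documents
print(basic_rag("What is the refund policy for annual plans?"))
```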
Building an Evaluation System
To improve your RAG system, you must measure it. We'll evaluate two dimensions independently:
- Retrieval performance – How well does the system find relevant chunks?
- End-to-end accuracy – How well does Claude answer using those chunks?
Creating an Evaluation Dataset
We synthetically generated 100 samples, each containing:
- A question
- Ground-truth relevant chunks
- A correct answer
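The exact schema is up to you; a hypothetical sample might look like this (field names are illustrative, not prescribed):

```python
# Hypothetical evaluation sample
sample = {
    "question": "How do I reset my API key?",
    "correct_chunks": [
        "# API Keys\nYou can rotate or reset keys from the dashboard...",
    ],
    "correct_answer": "Go to the dashboard's API Keys page and click Reset.",
}
```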
Key Metrics Defined
#### Precision
Precision answers: "Of the chunks we retrieved, how many were relevant?"
$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$
High precision means fewer false positives.
#### Recall
Recall answers: "Of all the correct chunks, how many did we retrieve?"
$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
High recall means we're not missing important information.
#### F1 Score
The harmonic mean of precision and recall:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
#### Mean Reciprocal Rank (MRR)
MRR measures how early the first relevant chunk appears in the results. If the first relevant chunk is at position 1, the reciprocal rank is 1. At position 2, it's 1/2, and so on.
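Averaged over all $N$ evaluation queries:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

where $\text{rank}_i$ is the position of the first relevant chunk for query $i$, and the term is 0 if no relevant chunk is retrieved.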
#### End-to-End Accuracy
This measures whether Claude's final answer is correct, given the retrieved context.
Evaluation Code
```python
def evaluate_retrieval(queries, ground_truth_chunks, db, top_k=3):
    precisions, recalls, f1s, mrrs = [], [], [], []

    for query, correct_chunks in zip(queries, ground_truth_chunks):
        retrieved = db.search(query, top_k=top_k)

        # Precision, recall, and F1 over the retrieved set
        true_positives = len(set(retrieved) & set(correct_chunks))
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(correct_chunks) if correct_chunks else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of the first relevant chunk (0 if none retrieved)
        for rank, chunk in enumerate(retrieved, 1):
            if chunk in correct_chunks:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs),
    }
```
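The function above only covers retrieval. For end-to-end accuracy you also need to grade the final answers against the ground-truth answers; a common approach is to use a small model as the grader. Here's a minimal sketch, assuming an `evaluate_end_to_end` helper of our own design (the grading prompt is illustrative, not part of any library):

```python
def evaluate_end_to_end(rag_fn, queries, correct_answers, client) -> float:
    """Fraction of questions where the RAG answer matches the ground truth,
    as judged by Claude. `rag_fn` is any of basic_rag / summary_rag / advanced_rag."""
    correct = 0
    for query, expected in zip(queries, correct_answers):
        answer = rag_fn(query)
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {query}\n\nReference answer: {expected}\n\n"
                    f"Candidate answer: {answer}\n\n"
                    "Does the candidate answer convey the same information as the "
                    "reference answer? Reply with only YES or NO."
                )
            }]
        )
        if "YES" in response.content[0].text.strip().upper():
            correct += 1
    return correct / len(queries)
```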
Level 2: Summary Indexing
Basic RAG often fails when a single chunk doesn't contain enough context. Summary indexing addresses this by creating a secondary index of chunk summaries.
How It Works
- For each chunk, ask Claude to generate a concise summary
- Store both the summary and the full chunk
- At query time, search summaries first, then retrieve corresponding full chunks
```python
def generate_summary(chunk: str, client) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"Summarize this text in 1-2 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text
```
```python
# Build the summary index: one Claude-generated summary per chunk
summary_db = InMemoryVectorDB(api_key="your-voyage-key")
full_chunks = []

for chunk in all_chunks:
    summary = generate_summary(chunk, claude)
    summary_db.add_documents([summary])
    full_chunks.append(chunk)

def summary_rag(query: str) -> str:
    # Search over the summaries
    top_summaries = summary_db.search(query, top_k=3)

    # Map each summary back to its full chunk (assumes summaries are unique)
    indices = [summary_db.documents.index(s) for s in top_summaries]
    context = "\n\n---\n\n".join([full_chunks[i] for i in indices])

    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context."
        }]
    )
    return response.content[0].text
```
Level 3: Adding Re-Ranking
Re-ranking improves precision by having Claude score the relevance of retrieved chunks before generating an answer.
```python
def rerank_chunks(query: str, chunks: List[str], client) -> List[str]:
    scored_chunks = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the query?\n\nQuery: {query}\n\nChunk: {chunk}\n\nAnswer with just a number."
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            # Fall back to the lowest score if Claude doesn't return a bare number
            score = 0.0
        scored_chunks.append((score, chunk))

    # Highest-scoring chunks first
    scored_chunks.sort(reverse=True, key=lambda x: x[0])
    return [chunk for _, chunk in scored_chunks]
```
```python
def advanced_rag(query: str) -> str:
    # Retrieve a wider pool of candidates
    candidates = db.search(query, top_k=10)

    # Re-rank with Claude and keep the top 3
    top_chunks = rerank_chunks(query, candidates, claude)[:3]
    context = "\n\n---\n\n".join(top_chunks)

    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context."
        }]
    )
    return response.content[0].text
```
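With the evaluation helpers in place, comparing pipelines is straightforward. The sketch below assumes the `evaluate_end_to_end` helper from earlier and the `queries`, `ground_truth_chunks`, and `correct_answers` fields of the evaluation dataset described above:

```python
# Hypothetical comparison run over the evaluation dataset
retrieval_metrics = evaluate_retrieval(queries, ground_truth_chunks, db, top_k=3)
basic_accuracy = evaluate_end_to_end(basic_rag, queries, correct_answers, claude)
advanced_accuracy = evaluate_end_to_end(advanced_rag, queries, correct_answers, claude)

print(retrieval_metrics)
print(f"Basic RAG accuracy: {basic_accuracy:.0%}")
print(f"Advanced RAG accuracy: {advanced_accuracy:.0%}")
```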
Results: Measurable Improvements
After implementing summary indexing and re-ranking, we achieved:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate limits: Full evaluations may hit rate limits unless you're at Tier 2 or above on Anthropic's API
- Token usage: Summary indexing and re-ranking consume additional tokens; optimize by using Haiku for auxiliary tasks
- Vector database: For production, use a hosted vector DB with built-in indexing and scaling
- Evaluation dataset: Invest time in creating a high-quality evaluation set that reflects real user queries
Key Takeaways
- Measure separately: Always evaluate retrieval and end-to-end performance independently to identify bottlenecks
- Summary indexing improves recall: By searching summaries, you capture chunks that might be missed by keyword or embedding search alone
- Re-ranking boosts precision: Claude can effectively score relevance, pushing the most useful chunks to the top
- MRR is your early-warning metric: A low MRR means relevant chunks appear too far down the results, hurting answer quality
- Start simple, then optimize: Begin with basic RAG, establish baselines, then add complexity only where metrics show improvement