Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic pipelines, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for grounding Claude in your proprietary knowledge. Whether you're building a customer support bot, an internal Q&A system, or a document analysis tool, RAG lets Claude answer questions about your specific data without fine-tuning.
In this guide, we'll walk through building a RAG system using Claude, Voyage AI embeddings, and an in-memory vector database. We'll start with a basic pipeline, then show you how to measure performance properly, and finally implement advanced techniques that boost accuracy from 71% to 81%.
Understanding the RAG Architecture
A RAG system works in three stages:
- Indexing: Chunk your documents, embed each chunk, and store the embeddings in a vector database
- Retrieval: When a query comes in, embed it and find the most similar chunks
- Generation: Pass the retrieved chunks as context to Claude along with the user's question
Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
You'll need API keys from Anthropic and Voyage AI. Store them as environment variables:
import os
os.environ["ANTHROPIC_API_KEY"] = "your-key-here"
os.environ["VOYAGE_API_KEY"] = "your-key-here"
Initialize Your Vector Database
For this guide, we'll use an in-memory vector database. In production, consider hosted solutions like Pinecone, Weaviate, or Chroma.
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, Any]]):
        texts = [doc["text"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, top_k: int = 3) -> List[Dict[str, Any]]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        scores = [self._cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
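Here's a quick smoke test of the vector store. The sample documents and the query below are hypothetical placeholders, just to show the shape of the inputs and outputs; it assumes the API keys set earlier in your environment.

# Minimal usage sketch; the documents here are made-up placeholders
db = InMemoryVectorDB(api_key=os.environ["VOYAGE_API_KEY"])
db.add_documents([
    {"id": "doc_1_0", "text": "Claude supports streaming responses via the Messages API.", "source": "Streaming Guide"},
    {"id": "doc_2_0", "text": "Prompt caching stores large, reusable context between calls.", "source": "Caching Guide"},
])
results = db.search("How do I stream output?", top_k=1)
print(results[0]["source"])  # Expect the streaming chunk to rank first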
Level 1: Basic RAG Pipeline
Let's build a "naive" RAG pipeline. This is the simplest approach: chunk documents by paragraph, embed each chunk, and retrieve using cosine similarity.
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, api_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)
        self.llm = Anthropic(api_key=api_key)

    def index_documents(self, documents: List[Dict[str, str]]):
        """
        documents: list of dicts with 'id', 'title', 'content' keys
        """
        chunks = []
        for doc in documents:
            # Simple chunking by paragraph
            paragraphs = doc["content"].split("\n\n")
            for i, para in enumerate(paragraphs):
                if len(para.strip()) > 50:  # Skip very short chunks
                    chunks.append({
                        "id": f"{doc['id']}_{i}",
                        "text": para,
                        "source": doc["title"]
                    })
        self.vector_db.add_documents(chunks)

    def query(self, question: str) -> str:
        # Retrieve relevant chunks
        chunks = self.vector_db.search(question, top_k=3)
        context = "\n\n".join([c["text"] for c in chunks])
        # Generate answer with Claude
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Answer the question based on the provided context.

Context:
{context}

Question: {question}

Answer:"""
            }]
        )
        return response.content[0].text
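To sanity-check the pipeline end to end, index a couple of documents and ask a question. The document IDs, titles, and contents below are hypothetical; substitute your own corpus.

# Illustrative usage only; the documents and question are placeholders
rag = BasicRAG(api_key=os.environ["ANTHROPIC_API_KEY"], voyage_key=os.environ["VOYAGE_API_KEY"])
rag.index_documents([
    {"id": "doc_1", "title": "Streaming Guide",
     "content": "Overview of streaming.\n\nSet stream=True on the Messages API call to receive tokens as they are generated."},
    {"id": "doc_2", "title": "Caching Guide",
     "content": "Overview of caching.\n\nPrompt caching stores large, reusable context so repeated calls are cheaper and faster."},
])
print(rag.query("How do I stream responses from Claude?"))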
This works, but how well? Let's find out.
Building a Robust Evaluation System
Most RAG projects fail because teams rely on "vibes" instead of metrics. You need to evaluate two things independently:
- Retrieval quality: Are we finding the right chunks?
- End-to-end accuracy: Is Claude giving correct answers?
Creating an Evaluation Dataset
Generate a synthetic dataset with 100+ examples. Each example should have:
- A question
- The correct answer
- The IDs of relevant chunks
import json

eval_data = [
    {
        "question": "How do I stream responses from Claude?",
        "answer": "Use the 'stream' parameter set to True when calling the Messages API...",
        "relevant_chunks": ["doc_3_2", "doc_3_5"]
    },
    # ... 99 more examples
]

with open("evaluation_dataset.json", "w") as f:
    json.dump(eval_data, f, indent=2)
Key Metrics Explained
Precision: Of the chunks we retrieved, how many were actually relevant?
Precision = True Positives / Total Retrieved
Recall: Of all the relevant chunks that exist, how many did we retrieve?
Recall = True Positives / Total Relevant
F1 Score: Harmonic mean of precision and recall.
Mean Reciprocal Rank (MRR): How early did the first relevant chunk appear in our results?
End-to-End Accuracy: Did Claude's final answer match the expected answer?
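As a concrete, made-up example: suppose two relevant chunks exist for a query, we retrieve three chunks, and the single relevant one we found sits at rank 2. The arithmetic looks like this:

# Hypothetical numbers for illustration
retrieved = ["doc_1_0", "doc_3_2", "doc_2_1"]  # what the retriever returned, in rank order
relevant = ["doc_3_2", "doc_3_5"]              # ground-truth relevant chunks

tp = len(set(retrieved) & set(relevant))            # 1 true positive
precision = tp / len(retrieved)                     # 1/3 ≈ 0.33
recall = tp / len(relevant)                         # 1/2 = 0.50
f1 = 2 * precision * recall / (precision + recall)  # 0.40
mrr = 1 / 2                                         # first relevant chunk appears at rank 2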
Running the Evaluation
def evaluate_retrieval(rag_system, eval_data):
    precisions, recalls, f1s, mrrs = [], [], [], []
    for item in eval_data:
        retrieved = rag_system.vector_db.search(item["question"], top_k=3)
        retrieved_ids = [r["id"] for r in retrieved]
        relevant = item["relevant_chunks"]
        tp = len(set(retrieved_ids) & set(relevant))
        precision = tp / len(retrieved_ids)
        recall = tp / len(relevant)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        # MRR: reciprocal rank of first relevant chunk
        for rank, rid in enumerate(retrieved_ids, 1):
            if rid in relevant:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
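Retrieval metrics alone don't tell you whether Claude's final answers are correct. One common approach, sketched below, is to use Claude itself as a grader that compares each generated answer against the reference answer. The grading prompt, the CORRECT/INCORRECT convention, and the helper name are illustrative assumptions, not a fixed recipe.

def evaluate_end_to_end(rag_system, eval_data):
    # Grade each generated answer against the reference answer using Claude as a judge
    correct = 0
    for item in eval_data:
        generated = rag_system.query(item["question"])
        grading = rag_system.llm.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"""Reference answer: {item['answer']}

Candidate answer: {generated}

Does the candidate answer convey the same key information as the reference? Reply with only CORRECT or INCORRECT."""
            }]
        )
        # Token-level check so "INCORRECT" does not match as a substring of "CORRECT"
        if "CORRECT" in grading.content[0].text.upper().split():
            correct += 1
    return correct / len(eval_data)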
Level 2: Summary Indexing
Basic chunking loses context. A document about "Claude's safety features" might have chunks about "Constitutional AI" and "Red teaming" that are useless without each other.
Summary indexing solves this by creating a summary of each document and using it as an additional retrieval target.

class SummaryIndexRAG(BasicRAG):
    def index_documents(self, documents: List[Dict[str, str]]):
        # First, create summaries
        for doc in documents:
            response = self.llm.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=200,
                messages=[{
                    "role": "user",
                    "content": f"Summarize this document in 2-3 sentences:\n\n{doc['content']}"
                }]
            )
            summary = response.content[0].text
            # Add summary as a retrievable chunk
            self.vector_db.add_documents([{
                "id": f"{doc['id']}_summary",
                "text": summary,
                "source": doc["title"],
                "type": "summary"
            }])
        # Then add regular chunks
        super().index_documents(documents)
This improved recall from 0.66 to 0.69 in our tests. The summaries act as "table of contents" entries that help the retriever find the right document even when the exact wording doesn't match.
Level 3: Adding Re-Ranking
Retrieval gives you candidates; re-ranking ensures you only pass the best ones to Claude. This is critical because Claude's context window is valuable real estate.
class ReRankRAG(SummaryIndexRAG):
    def query(self, question: str) -> str:
        # Retrieve more candidates
        candidates = self.vector_db.search(question, top_k=10)
        # Use Claude to re-rank
        candidate_texts = "\n---\n".join([
            f"[{i+1}] {c['text']}" for i, c in enumerate(candidates)
        ])
        response = self.llm.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""Given this question: "{question}"

Rank these chunks by relevance (1=most relevant). Return only the numbers of the top 3, comma-separated.

{candidate_texts}"""
            }]
        )
        # Parse rankings and select top 3
        top_indices = [int(x.strip()) - 1 for x in response.content[0].text.split(",")[:3]]
        top_chunks = [candidates[i] for i in top_indices]
        context = "\n\n".join([c["text"] for c in top_chunks])
        final_response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion: {question}"
            }]
        )
        return final_response.content[0].text
Re-ranking boosted our MRR from 0.74 to 0.87 and end-to-end accuracy from 71% to 81%.
Performance Results
Here's what the improvements look like in practice:
| Metric | Basic RAG | +Summary Index | +Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.78 | 0.87 |
| End-to-End Accuracy | 71% | 75% | 81% |
Production Considerations
- Rate limits: Full evaluations can hit API limits. Use Tier 2+ accounts or run evaluations incrementally.
- Vector database: Switch to Pinecone, Weaviate, or Qdrant for production workloads.
- Chunking strategy: Experiment with semantic chunking (by sentence boundaries) vs. fixed-size chunks.
- Embedding model: Voyage AI's voyage-2 is excellent, but test voyage-law-2 for legal documents or voyage-code-2 for code.
- Caching: Cache embeddings for static documents to reduce API costs (see the sketch below).
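A simple way to implement the caching point above is to key embeddings by a hash of the chunk text, so unchanged documents are never re-embedded. This is a minimal sketch assuming a local JSON file as the cache; the file path and helper name are placeholders, and any key-value store would work.

import hashlib
import json
import os

CACHE_PATH = "embedding_cache.json"  # hypothetical cache location

def embed_with_cache(client, texts, model="voyage-2"):
    # Load the existing cache (text hash -> embedding), or start fresh
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    else:
        cache = {}
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [t for t, k in zip(texts, keys) if k not in cache]
    if missing:
        # Only texts we haven't seen before are sent to the embedding API
        response = client.embed(missing, model=model)
        for t, emb in zip(missing, response.embeddings):
            cache[hashlib.sha256(t.encode("utf-8")).hexdigest()] = emb
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return [cache[k] for k in keys]

You could drop this in place of the direct client.embed call inside InMemoryVectorDB.add_documents to avoid paying for the same embeddings twice.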
Key Takeaways
- Evaluate retrieval and generation separately to identify where your RAG system is failing. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing improves recall by creating high-level representations of documents that match broader queries.
- Re-ranking with Claude dramatically improves MRR and end-to-end accuracy by filtering out irrelevant chunks before generation.
- Start simple, measure everything, then optimize — a basic RAG pipeline with proper evaluation beats a complex system you can't debug.
- Your evaluation dataset is your most important asset — invest time in creating realistic, challenging questions that reflect actual user queries.