BeClaude
Guide · Beginner · Best Practices · 2026-05-22

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide teaches you to build a RAG system with Claude, from basic implementation to advanced optimization. You'll learn to set up vector search, create evaluation metrics (precision, recall, F1, MRR), and improve performance through summary indexing and re-ranking, which together raised end-to-end accuracy from 71% to 81% in our tests.

Tags: RAG, Claude API, Vector Search, Evaluation, Voyage AI

Retrieval Augmented Generation (RAG) is a cornerstone of many enterprise AI applications. While Claude excels at general knowledge tasks, it needs access to your specific business context (internal documents, customer support knowledge bases, or proprietary data) to answer questions about your organization accurately.

In this guide, we'll walk through building a production-quality RAG system using Claude and Voyage AI embeddings. We'll start with a basic implementation, then systematically improve it using evaluation-driven optimization. By the end, you'll understand how to achieve significant performance gains: our optimized system improved end-to-end accuracy from 71% to 81%.

Understanding the RAG Pipeline

A RAG system works in three stages:

  • Ingestion: Chunk and embed your documents into a vector database
  • Retrieval: Find the most relevant chunks for a user query
  • Generation: Feed retrieved context to Claude to produce an answer

Let's build each component, starting simple and adding sophistication.

Level 1: Basic RAG Implementation

Setup and Dependencies

First, install the required libraries:

pip install anthropic voyageai pandas numpy scikit-learn matplotlib

You'll need API keys from Anthropic and Voyage AI. Set them as environment variables:

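import os

# For quick local experiments only; in shared code, export the keys in your
# shell or a secrets manager instead of hardcoding them.
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"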
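
When the keys are exported outside the script, it helps to fail early with a clear message if one is missing. A minimal sketch, assuming the two variable names used above; `require_env` is a hypothetical helper and not part of either SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes both keys were exported before starting Python):
# anthropic_key = require_env("ANTHROPIC_API_KEY")
# voyage_key = require_env("VOYAGE_API_KEY")
```

Checking at startup surfaces a missing key immediately rather than as an authentication error on the first API call.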

Building the Vector Database

For this example, we'll use an in-memory vector store. In production, consider solutions like Pinecone, Weaviate, or pgvector.

import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings"""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve top-k most similar documents"""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Dot product equals cosine similarity only for unit-length vectors
        # (Voyage documents its embeddings as normalized to length 1)
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
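
The search step above treats a plain dot product as cosine similarity, which only holds when every vector has unit length. If you swap in an embedding model that does not normalize its output, the ranking can change. The sketch below is a standard-library illustration of explicit cosine ranking on toy vectors; `cosine_similarity` and `top_k_cosine` are hypothetical names, and a real system would use NumPy or the vector database's own scoring:

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_cosine(embeddings: List[Sequence[float]],
                 query_embedding: Sequence[float], k: int = 3) -> List[int]:
    """Return the indices of the k most similar embeddings, best first."""
    order = sorted(
        range(len(embeddings)),
        key=lambda i: cosine_similarity(embeddings[i], query_embedding),
        reverse=True,
    )
    return order[:k]


# Toy 2-D vectors: the second row points the same way as the query but is longer,
# so it ranks first once length is normalized away.
docs = [[1.0, 0.0], [0.0, 5.0], [1.0, 1.0]]
print(top_k_cosine(docs, [0.0, 1.0], k=2))  # [1, 2]
```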

Chunking Strategy

A naive approach chunks documents by heading:

def chunk_by_heading(document: str) -> List[Dict[str, str]]:
    """Split document by markdown headings"""
    chunks = []
    current_heading = "Introduction"
    current_content = []
    
    for line in document.split("\n"):
        if line.startswith("##"):
            if current_content:
                chunks.append({
                    "heading": current_heading,
                    "content": "\n".join(current_content)
                })
            current_heading = line.strip("# ").strip()
            current_content = []
        else:
            current_content.append(line)
    
    # Don't forget the last section
    if current_content:
        chunks.append({
            "heading": current_heading,
            "content": "\n".join(current_content)
        })
    
    return chunks

The Complete RAG Pipeline

from anthropic import Anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(voyage_key)
        self.llm = Anthropic(api_key=anthropic_key)

    def ingest(self, documents: List[str]):
        """Process and store documents"""
        all_chunks = []
        for doc in documents:
            chunks = chunk_by_heading(doc)
            all_chunks.extend(chunks)
        self.vector_db.add_documents(all_chunks)

    def query(self, question: str) -> str:
        """Answer a question using RAG"""
        # Retrieve relevant chunks
        relevant_chunks = self.vector_db.search(question, k=3)

        # Build context
        context = "\n\n---\n\n".join([
            chunk["content"] for chunk in relevant_chunks
        ])

        # Generate answer with Claude
        response = self.llm.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"""Based on the following context, answer the question.

Context: {context}

Question: {question}

Provide a clear, accurate answer based only on the context provided."""
            }]
        )

        return response.content[0].text
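
The pipeline above joins chunks with a `---` separator. One variation worth testing is to wrap each chunk in XML-style tags and include its heading, so the model can tell the sources apart. The sketch below is an assumption-laden illustration: `build_rag_prompt`, the tag names, and the fallback instruction are hypothetical and were not part of the evaluated system.

```python
from typing import Dict, List


def build_rag_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Assemble a prompt that labels each retrieved chunk with its heading."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        heading = chunk.get("heading", "")
        content = chunk["content"].strip()
        parts.append(
            f'<document index="{i}" heading="{heading}">\n{content}\n</document>'
        )
    documents = "\n".join(parts)
    return (
        "Answer the question using only the documents below. "
        "If they do not contain the answer, say so.\n\n"
        f"<documents>\n{documents}\n</documents>\n\n"
        f"Question: {question}"
    )
```

The returned string would replace the f-string passed as the user message content; whether it improves accuracy is something to confirm with the evaluation suite rather than assume.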

Building an Evaluation System

"Vibes-based" evaluation won't cut it for production. You need quantitative metrics. Let's build a robust evaluation suite.

Creating a Test Dataset

Create at least 100 test samples, each with:

  • A question
  • The ground-truth relevant chunks
  • A correct answer

# Example test sample structure
test_sample = {
    "question": "What is the maximum context window for Claude 3 Opus?",
    "relevant_chunks": [
        "Claude 3 Opus supports a 200K token context window...",
        "The Claude 3 family offers different context windows..."
    ],
    "correct_answer": "Claude 3 Opus supports up to 200K tokens..."
}

Key Metrics Explained

Retrieval Metrics

Precision: Of the chunks we retrieved, how many were actually relevant?
Precision = |Retrieved ∩ Relevant| / |Retrieved|
Recall: Of all relevant chunks, how many did we retrieve?
Recall = |Retrieved ∩ Relevant| / |Relevant|
F1 Score: Harmonic mean of precision and recall
F1 = 2 × (Precision × Recall) / (Precision + Recall)
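Mean Reciprocal Rank (MRR): How high did the first relevant result rank? The function below returns the reciprocal rank for a single query; MRR is the mean of that value over all test queries.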
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
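
The precision, recall, and F1 formulas above can be sketched the same way. This version assumes chunks are compared by exact text or by a unique identifier, and that each retrieved list contains no duplicates; `retrieval_metrics` and `average_metrics` are hypothetical helper names.

```python
from typing import Dict, List


def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    """Compute precision, recall, F1, and reciprocal rank for one query."""
    relevant_set = set(relevant)
    hits = len(set(retrieved) & relevant_set)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    reciprocal_rank = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            reciprocal_rank = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": reciprocal_rank}


def average_metrics(per_query: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all queries; the mean of 'rr' is the MRR."""
    if not per_query:
        return {}
    return {
        key: sum(m[key] for m in per_query) / len(per_query)
        for key in per_query[0]
    }
```

For example, retrieving `["b", "a", "c"]` when the relevant set is `{"a", "d"}` gives precision 1/3, recall 1/2, F1 0.4, and a reciprocal rank of 1/2, because the first relevant chunk sits in second place.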

End-to-End Metrics

Accuracy: Does Claude's answer match the ground truth?
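from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_answer(generated_answer: str, correct_answer: str) -> bool:
    """Use Claude to judge if answers are semantically equivalent"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,  # required by the Messages API; the reply is a single word
        messages=[{
            "role": "user",
            "content": f"""Are these two answers equivalent?

Answer 1: {generated_answer}
Answer 2: {correct_answer}

Respond with only 'YES' or 'NO'."""
        }]
    )
    # Tolerate case differences and trailing punctuation in the verdict
    return response.content[0].text.strip().upper().startswith("YES")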
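
To turn per-answer judgments into the end-to-end accuracy figure, run the judge over the whole test set. A minimal sketch, assuming test samples shaped like the example earlier in this section; `end_to_end_accuracy` is a hypothetical helper, and the answering and judging functions are passed in so the loop can be tested without API calls.

```python
from typing import Callable, Dict, List


def end_to_end_accuracy(
    samples: List[Dict],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Fraction of test samples whose generated answer the judge accepts."""
    if not samples:
        return 0.0
    correct = 0
    for sample in samples:
        generated = answer_fn(sample["question"])
        if judge_fn(generated, sample["correct_answer"]):
            correct += 1
    return correct / len(samples)
```

In the pipeline described here, `answer_fn` would be the RAG system's `query` method and `judge_fn` would be `evaluate_answer`.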

Level 2: Summary Indexing

Basic chunking loses context. Summary indexing creates a two-tier retrieval system:

  • Summary chunks: High-level overviews for initial retrieval
  • Detail chunks: Full content for answer generation

def create_summary_index(chunks: List[Dict]) -> List[Dict]:
    """Create summary-level representations"""
    summary_chunks = []
    
    for chunk in chunks:
        # Use Claude to generate a concise summary
        # (assumes a module-level `client = Anthropic()`)
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,  # required by the Messages API; enough for 1-2 sentences
            messages=[{
                "role": "user",
                "content": f"Summarize this text in 1-2 sentences:\n\n{chunk['content']}"
            }]
        )
        
        summary_chunks.append({
            "summary": response.content[0].text,
            "original_content": chunk["content"],
            "heading": chunk["heading"]
        })
    
    return summary_chunks

This improved our recall from 0.66 to 0.69 and F1 from 0.52 to 0.54.
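
The summary chunks carry `summary` and `original_content` keys rather than `content`, so they need their own search path. The sketch below shows one possible reading of the two-tier design: embed only the summaries, then return the full text of the best matches. It is not necessarily the configuration behind the reported numbers. `SummaryIndex` and `embed_fn` are hypothetical; `embed_fn` stands in for a call to your embedding provider and takes a list of strings, returning one vector per string.

```python
import math
from typing import Callable, Dict, List, Sequence

EmbedFn = Callable[[List[str]], List[Sequence[float]]]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SummaryIndex:
    """Search over summary embeddings, return the full original chunks."""

    def __init__(self, summary_chunks: List[Dict[str, str]], embed_fn: EmbedFn):
        self.chunks = summary_chunks
        self.embed_fn = embed_fn
        # Tier 1: only the short summaries are embedded
        self.vectors = embed_fn([c["summary"] for c in summary_chunks])

    def search(self, query: str, k: int = 3) -> List[str]:
        query_vector = self.embed_fn([query])[0]
        order = sorted(
            range(len(self.chunks)),
            key=lambda i: _cosine(self.vectors[i], query_vector),
            reverse=True,
        )
        # Tier 2: hand the full text to the generation step
        return [self.chunks[i]["original_content"] for i in order[:k]]
```

A common alternative is to embed the heading, summary, and content together; which variant retrieves better is a question for the evaluation suite.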

Level 3: Adding Re-Ranking

Re-ranking refines initial retrieval results using Claude's understanding of relevance:

def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance"""
    
    # Prepare chunks for ranking
    chunks_text = "\n\n".join([
        f"[{i}] {chunk['content']}" 
        for i, chunk in enumerate(candidates)
    ])
    
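    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=100,  # required by the Messages API; the reply is a short index list
        messages=[{
            "role": "user",
            "content": f"""Given this query: "{query}"

Rank these chunks by relevance (most relevant first). Return only the indices in order, comma-separated.

Chunks:
{chunks_text}"""
        }]
    )

    # Parse ranked indices, skipping anything that is not a valid, unseen index
    ranked_indices = []
    for token in response.content[0].text.split(","):
        token = token.strip()
        if token.isdigit() and int(token) < len(candidates) and int(token) not in ranked_indices:
            ranked_indices.append(int(token))

    return [candidates[i] for i in ranked_indices[:top_k]]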

Re-ranking improved MRR from 0.74 to 0.87, meaning the first relevant chunk now typically appears at or near the top of the results.
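
Re-ranking only helps if the re-ranker sees more candidates than it returns, so the usual pattern is to over-retrieve from the vector store and then trim. A minimal sketch of that wiring; `retrieve_then_rerank` is a hypothetical helper, and the default of 20 initial candidates is an assumption, not a value from the experiments above.

```python
from typing import Callable, Dict, List


def retrieve_then_rerank(
    query: str,
    search_fn: Callable[[str, int], List[Dict]],
    rerank_fn: Callable[[str, List[Dict], int], List[Dict]],
    initial_k: int = 20,
    final_k: int = 3,
) -> List[Dict]:
    """Fetch a wide candidate set, then let the re-ranker pick the best few."""
    candidates = search_fn(query, initial_k)
    if len(candidates) <= 1:
        return candidates[:final_k]
    reranked = rerank_fn(query, candidates, final_k)
    # Fall back to vector-search order if the re-ranker returns nothing usable
    return reranked if reranked else candidates[:final_k]
```

With the components from this guide, `search_fn` would be the vector database's `search` method and `rerank_fn` would be `rerank_with_claude`. The fallback keeps the pipeline answering even when the model's ranking output cannot be parsed.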

Performance Results

Here's what our optimizations achieved:

Metric                Basic RAG   Optimized RAG
Precision             0.43        0.44
Recall                0.66        0.69
F1 Score              0.52        0.54
MRR                   0.74        0.87
End-to-End Accuracy   71%         81%

Production Considerations

  • Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access or sample your evaluation set.
  • Vector Database: Move from in-memory to Pinecone, Weaviate, or pgvector for production.
  • Chunk Size: Experiment with chunk sizes (256-1024 tokens) based on your document structure.
  • Embedding Model: Voyage AI's voyage-2 offers excellent performance, but test alternatives.
  • Caching: Cache embeddings and common queries to reduce API costs.
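
For the chunk-size experiments mentioned above, a size-bounded splitter with overlap is a useful counterpart to heading-based chunking. The sketch below counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; use your model's tokenizer for exact budgets. `chunk_by_size` and its defaults are hypothetical.

```python
from typing import List


def chunk_by_size(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding some text twice.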

Key Takeaways

  • Evaluate systematically: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy to identify bottlenecks in your RAG pipeline.
  • Summary indexing improves recall: Creating two-tier representations helps retrieve relevant content even when queries don't match exact phrasing.
  • Re-ranking with Claude dramatically improves MRR: Using Claude to re-rank initial results ensures the most relevant context reaches the generation step.
  • Start simple, optimize iteratively: A basic RAG pipeline can achieve 71% accuracy; targeted improvements push it to 81%.
  • Build for production from day one: Consider rate limits, vector database choices, and caching strategies early in development.