BeClaude
Guide · Beginner · Best Practices · 2026-05-22

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide teaches you to build a production-ready RAG system with Claude, covering basic setup with Voyage AI embeddings, a comprehensive evaluation framework with 5 key metrics, and advanced optimization techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.

Tags: RAG, Claude API, Vector Search, Evaluation, Voyage AI

Retrieval Augmented Generation (RAG) underpins many enterprise AI applications. Claude handles general knowledge tasks well, but it needs retrieved context to answer questions specific to your business, whether that's internal documentation, customer support knowledge bases, or proprietary research.

In this guide, we'll walk through building a RAG system using Claude and Voyage AI embeddings, then systematically improve it using advanced techniques. We'll use the Claude documentation as our knowledge base, but the principles apply to any domain.

Why RAG Matters for Claude Users

Claude's training data has a cutoff date, and it doesn't know your internal documents. RAG bridges this gap by:

  • Grounding responses in your verified content
  • Reducing hallucinations by providing relevant context
  • Enabling domain-specific Q&A without fine-tuning
  • Maintaining data freshness as your knowledge base evolves

Setting Up Your RAG Environment

Required Libraries

# Core dependencies
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

API Key Configuration

import os
from anthropic import Anthropic
import voyageai

# Set your API keys
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"

# Initialize clients (both read their keys from the environment)
anthropic_client = Anthropic()
vo_client = voyageai.Client()

Building a Vector Database Class

For this guide, we'll use an in-memory vector store. In production, consider hosted solutions like Pinecone, Weaviate, or Chroma.

import numpy as np
from typing import List, Dict, Tuple

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []
        self.metadata = []

    def add_documents(self, texts: List[str], embeddings: List[List[float]], metadata: List[Dict] = None):
        self.documents.extend(texts)
        self.embeddings.extend(embeddings)
        if metadata:
            self.metadata.extend(metadata)
        else:
            # One separate dict per document ([{}] * n would share a single dict)
            self.metadata.extend({} for _ in texts)

    def search(self, query_embedding: List[float], k: int = 5) -> List[Tuple[str, float, Dict]]:
        # Cosine similarity search
        if not self.embeddings:
            return []
        query_norm = np.array(query_embedding) / np.linalg.norm(query_embedding)
        doc_matrix = np.array(self.embeddings)
        doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
        similarities = np.dot(doc_norms, query_norm)
        # Indices of the k highest similarities, best first
        top_indices = np.argsort(similarities)[-k:][::-1]
        results = []
        for idx in top_indices:
            results.append((
                self.documents[idx],
                float(similarities[idx]),
                self.metadata[idx]
            ))
        return results
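
To make the cosine-similarity search concrete, here is a minimal standalone sketch using hand-made 2-D vectors. The function name `cosine_top_k` and the toy vectors are illustrative assumptions, not part of the guide's code; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_top_k(query, docs, k=2):
    """Return (index, similarity) pairs for the k rows of docs most similar to query."""
    q = np.asarray(query, dtype=float)
    d = np.asarray(docs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q
    # Sort descending and keep the first k indices
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical 2-D "embeddings", chosen only to make the geometry visible
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = cosine_top_k([2.0, 0.0], docs, k=2)
```

Document 0 points in the same direction as the query, so it ranks first regardless of the query's length; document 2 sits at 45 degrees and ranks second.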

Level 1: Basic RAG Pipeline

This is the "naive RAG" approach that many tutorials start with. It works, but has significant limitations.

Step 1: Chunk Your Documents

def chunk_documents(documents: List[Dict]) -> List[Dict]:
    """Split documents by headings for meaningful chunks."""
    chunks = []
    for doc in documents:
        # Split by markdown headings
        sections = doc['content'].split('\n## ')
        for section in sections:
            if section.strip():
                chunks.append({
                    'text': section.strip(),
                    'source': doc['source'],
                    'heading': section.split('\n')[0] if '\n' in section else ''
                })
    return chunks
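
Heading-based splitting can leave some sections much longer than others. One possible follow-up, sketched here under the assumption that word counts are an acceptable proxy for length, is to split long sections into overlapping windows. The helper name `split_with_overlap` and its default sizes are hypothetical and would need tuning against your own evaluation set.

```python
from typing import List

def split_with_overlap(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of at most max_words words, sharing `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if len(words) <= max_words:
        # Short sections are returned unchanged (empty input yields no chunks)
        return [text.strip()] if words else []
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text
        if start + max_words >= len(words):
            break
    return windows

sample = " ".join(str(i) for i in range(10))
windows = split_with_overlap(sample, max_words=4, overlap=1)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.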

Step 2: Embed and Index

def embed_and_index(chunks: List[Dict], vector_db: InMemoryVectorDB):
    """Embed chunks and add to vector database."""
    texts = [chunk['text'] for chunk in chunks]
    
    # Generate embeddings using Voyage AI
    response = vo_client.embed(
        texts,
        model="voyage-2",
        input_type="document"
    )
    
    vector_db.add_documents(
        texts=texts,
        embeddings=response.embeddings,
        metadata=[{'source': c['source'], 'heading': c['heading']} for c in chunks]
    )

Step 3: Retrieve and Generate

def rag_query(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    """Full RAG pipeline: retrieve context, then generate answer."""
    # Embed the query
    query_embedding = vo_client.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    
    # Retrieve relevant chunks
    results = vector_db.search(query_embedding, k=k)
    context = "\n\n---\n\n".join([doc for doc, _, _ in results])
    
    # Generate answer with Claude
    response = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain enough information, say so.",
        messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )
    return response.content[0].text
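
The pipeline above joins only the chunk text and discards the stored metadata. If you want answers that can point back to their sources, one option is to label each chunk before building the prompt. This is a sketch, not the guide's method: `build_context` is a hypothetical helper and the label format is an arbitrary choice.

```python
from typing import Dict, List, Tuple

def build_context(results: List[Tuple[str, float, Dict]]) -> str:
    """Format (text, score, metadata) search results as numbered, source-labelled blocks."""
    blocks = []
    for i, (text, _score, meta) in enumerate(results, start=1):
        # Fall back to 'unknown' when a chunk was indexed without a source
        source = meta.get("source", "unknown")
        blocks.append(f"[{i}] (source: {source})\n{text}")
    return "\n\n---\n\n".join(blocks)
```

Inside `rag_query`, `context = build_context(results)` would replace the plain join.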

Building a Robust Evaluation System

Most RAG tutorials skip evaluation, but it's critical for production systems. We'll measure two things independently:

  • Retrieval Performance: How well does our system find relevant documents?
  • End-to-End Performance: How well does Claude answer questions given the retrieved context?

Creating an Evaluation Dataset

We synthetically generated 100 test samples, each containing:

  • A question
  • Ground truth relevant chunks
  • A correct answer

import json

# Load evaluation dataset
with open('evaluation/docs_evaluation_dataset.json', 'r') as f:
    eval_data = json.load(f)

# Preview
print(f"Total samples: {len(eval_data)}")
print(f"Sample question: {eval_data[0]['question']}")
print(f"Relevant chunks: {len(eval_data[0]['relevant_chunks'])}")
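
Before running metrics, it can help to check that every sample has the expected shape. The sketch below assumes each record is a dict with `question` and `relevant_chunks` (the keys used in the preview above) plus a `correct_answer` key, whose name is an assumption; adjust it to match your file.

```python
from typing import Dict, List

# 'correct_answer' is an assumed key name for the ground-truth answer
REQUIRED_KEYS = ("question", "relevant_chunks", "correct_answer")

def validate_sample(sample: Dict) -> List[str]:
    """Return a list of problems found in one evaluation record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in sample]
    # With no ground-truth chunks, recall for the sample is not meaningful
    if "relevant_chunks" in sample and not sample["relevant_chunks"]:
        problems.append("relevant_chunks is empty")
    return problems
```

Running it over the dataset, for example `[p for s in eval_data for p in validate_sample(s)]`, surfaces malformed records before they skew the averages.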

Key Metrics Explained

#### Precision

What it measures: Of all chunks retrieved, how many were actually relevant?

def calculate_precision(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    if len(retrieved_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)

#### Recall

What it measures: Of all relevant chunks, how many did we retrieve?

def calculate_recall(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    if len(relevant_set) == 0:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)

#### F1 Score

What it measures: Harmonic mean of precision and recall.

def calculate_f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

#### Mean Reciprocal Rank (MRR)

What it measures: How early in the results does the first relevant chunk appear?

def calculate_mrr(retrieved_chunks: List[str], relevant_chunks: List[str]) -> float:
    relevant_set = set(relevant_chunks)
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_set:
            return 1.0 / (i + 1)
    return 0.0
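
A small worked example can make the four retrieval metrics easier to compare. The sketch below recomputes them in one hypothetical helper, `retrieval_metrics`, for a single made-up query where three chunks are retrieved and two are relevant.

```python
from typing import Dict, List

def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    """Compute precision, recall, F1 and reciprocal rank for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # Reciprocal rank of the first relevant chunk, in retrieval order
    rr = 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant_set:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "mrr": rr}

# Made-up query: "b" is the only relevant chunk retrieved, and it sits at rank 2
scores = retrieval_metrics(["a", "b", "c"], ["b", "d"])
```

Here one of three retrieved chunks is relevant (precision 1/3), one of two relevant chunks was found (recall 1/2), F1 works out to 0.4, and the reciprocal rank is 1/2 because the first hit is in second position.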

#### End-to-End Accuracy

What it measures: Does Claude's final answer match the ground truth?

def calculate_e2e_accuracy(generated_answer: str, correct_answer: str) -> bool:
    # Use Claude to judge if answers are semantically equivalent
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        system="You are an answer evaluator. Respond with only 'YES' or 'NO'.",
        messages=[{
            "role": "user", 
            "content": f"Does this answer:\n'{generated_answer}'\n\nCorrectly answer the question? Correct answer is:\n'{correct_answer}'"
        }]
    )
    return response.content[0].text.strip().upper() == "YES"
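
The judge returns one verdict per sample, so the headline accuracy is the fraction of samples judged correct. A minimal sketch of that loop follows; `end_to_end_accuracy` is a hypothetical helper, the `correct_answer` key is an assumed field name, and both the answering and judging functions are passed in so the loop can be tested without API calls.

```python
from typing import Callable, Dict, List

def end_to_end_accuracy(
    eval_data: List[Dict],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Fraction of samples whose generated answer the judge accepts."""
    if not eval_data:
        return 0.0
    correct = 0
    for sample in eval_data:
        generated = answer_fn(sample["question"])
        # 'correct_answer' is an assumed key name for the ground truth
        if judge_fn(generated, sample["correct_answer"]):
            correct += 1
    return correct / len(eval_data)
```

With the functions from this guide, a call might look like `end_to_end_accuracy(eval_data, lambda q: rag_query(q, vector_db), calculate_e2e_accuracy)`.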

Level 2: Summary Indexing

Basic RAG fails when a single chunk doesn't contain enough context. Summary indexing creates higher-level summaries that capture the "big picture."

def create_summary_index(chunks: List[Dict], group_size: int = 5) -> List[Dict]:
    """Group chunks and create summaries for each group."""
    summary_index = []
    
    for i in range(0, len(chunks), group_size):
        group = chunks[i:i+group_size]
        combined_text = "\n\n".join([c['text'] for c in group])
        
        # Generate summary using Claude
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=512,
            system="Summarize the following text, preserving key information and relationships between concepts.",
            messages=[{"role": "user", "content": combined_text}]
        )
        
        summary_index.append({
            'summary': response.content[0].text,
            'original_chunks': group,
            'chunk_indices': list(range(i, min(i+group_size, len(chunks))))
        })
    
    return summary_index
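
To search the summaries, they need their own embeddings and their own vector store. The sketch below prepares the three lists that `add_documents` expects; `prepare_summary_entries` and the `summary_id` metadata field are hypothetical names, and the embedding call is injected as `embed_fn`.

```python
from typing import Callable, Dict, List, Tuple

def prepare_summary_entries(
    summary_index: List[Dict],
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> Tuple[List[str], List[List[float]], List[Dict]]:
    """Return (texts, embeddings, metadata) for the summaries in summary_index."""
    texts = [entry["summary"] for entry in summary_index]
    embeddings = embed_fn(texts) if texts else []
    # Keep a pointer back to the summary and the chunks it covers
    metadata = [
        {"summary_id": i, "chunk_indices": entry["chunk_indices"]}
        for i, entry in enumerate(summary_index)
    ]
    return texts, embeddings, metadata
```

The result can be loaded into a second store, for example `summary_vector_db = InMemoryVectorDB()` followed by `summary_vector_db.add_documents(texts, embeddings, metadata)`.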

Hybrid Retrieval Strategy

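def hybrid_retrieve(query: str, vector_db: InMemoryVectorDB, summary_vector_db: InMemoryVectorDB, summary_index: List[Dict], k: int = 3) -> List[str]:
    """Retrieve from both chunk-level and summary-level indices.

    Assumes summary_vector_db holds the embedded 'summary' texts from summary_index.
    """
    # Embed the query once and reuse it for both searches
    query_embedding = vo_client.embed(
        [query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    # Get chunk-level results
    chunk_results = vector_db.search(query_embedding, k=k)

    # Get summary-level results
    summary_results = summary_vector_db.search(query_embedding, k=2)

    # Map each summary text back to its entry in summary_index
    summary_by_text = {entry['summary']: entry for entry in summary_index}

    # Combine and deduplicate
    all_chunks = []
    seen = set()

    for doc, _, _ in chunk_results:
        if doc not in seen:
            all_chunks.append(doc)
            seen.add(doc)

    for summary_text, _, _ in summary_results:
        for chunk in summary_by_text[summary_text]['original_chunks']:
            if chunk['text'] not in seen:
                all_chunks.append(chunk['text'])
                seen.add(chunk['text'])

    # Return every candidate: truncating to k here would drop the
    # summary-derived chunks. Trim afterwards, e.g. with the re-ranker.
    return all_chunks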

Level 3: Re-Ranking with Claude

Re-ranking uses Claude to evaluate the relevance of retrieved chunks before generating the final answer.

def rerank_with_claude(query: str, candidates: List[str], top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare chunks for evaluation
    chunk_text = "\n\n---\n\n".join([
        f"CHUNK {i+1}:\n{chunk}" 
        for i, chunk in enumerate(candidates)
    ])
    
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=256,
        system="You are a relevance evaluator. Rank the following chunks by their relevance to the query. Return the chunk numbers in order of relevance, separated by commas.",
        messages=[{
            "role": "user",
            "content": f"Query: {query}\n\n{chunk_text}\n\nReturn the chunk numbers ranked by relevance (most relevant first):"
        }]
    )
    
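    # Parse the ranked chunk numbers, skipping duplicates and out-of-range values
    ranked_indices = []
    for x in response.content[0].text.split(','):
        x = x.strip()
        if x.isdigit():
            idx = int(x) - 1
            if 0 <= idx < len(candidates) and idx not in ranked_indices:
                ranked_indices.append(idx)

    return [candidates[i] for i in ranked_indices[:top_k]]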
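
Model replies do not always arrive as a clean comma-separated list; they may include labels such as "CHUNK 3" or omit some candidates. A more forgiving parser is sketched below. `parse_ranking` is a hypothetical helper: it picks up every integer in the reply, so a reply that mentions unrelated numbers (for example "the top 3 are ...") would still be misread.

```python
import re
from typing import List

def parse_ranking(reply: str, n_candidates: int) -> List[int]:
    """Turn a ranking reply into zero-based candidate indices, best first."""
    seen = set()
    order: List[int] = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1  # replies use 1-based chunk numbers
        if 0 <= idx < n_candidates and idx not in seen:
            seen.add(idx)
            order.append(idx)
    # Candidates the model did not mention keep their original order at the end
    order.extend(i for i in range(n_candidates) if i not in seen)
    return order
```

In `rerank_with_claude`, this could be used as `order = parse_ranking(response.content[0].text, len(candidates))` followed by `[candidates[i] for i in order[:top_k]]`, which also falls back to the original retrieval order when the reply contains no usable numbers.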

Results: Before and After

After implementing summary indexing and re-ranking, we achieved significant improvements:

| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |

The most dramatic improvement was in MRR (0.74 → 0.87), showing that re-ranking moves the most relevant chunk closer to the top of the results. Precision, recall, and F1 moved only slightly. End-to-end accuracy rose by 10 percentage points (71% → 81%), which suggests that better-ordered retrieval translates into better answers.
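
When reporting these numbers, it helps to separate the absolute change from the relative one. The short sketch below does the arithmetic for the two largest changes in the results; the helper name `improvement` is illustrative.

```python
from typing import Tuple

def improvement(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return absolute, relative

# End-to-end accuracy, in percent: 71 -> 81
acc_abs, acc_rel = improvement(71.0, 81.0)
# Average MRR: 0.74 -> 0.87
mrr_abs, mrr_rel = improvement(0.74, 0.87)
```

Accuracy gains 10 percentage points, which is about 14.1% relative to the baseline; MRR gains 0.13, about 17.6% relative.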

Production Considerations

  • Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ API access for large-scale testing.
  • Cost Management: Summary indexing and re-ranking add token costs. Balance quality improvements against budget.
  • Vector Database: Replace the in-memory store with Pinecone, Weaviate, or Chroma for production.
  • Chunking Strategy: Experiment with different chunk sizes and overlap percentages.
  • Caching: Cache embeddings and common queries to reduce API calls.
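
As one illustration of the caching point, the sketch below wraps an embedding function with an in-memory cache keyed by a hash of the text. `EmbeddingCache` is a hypothetical class, not part of any library; a production version would likely persist the store and bound its size. Because this guide embeds documents and queries with different `input_type` values, keeping one cache per input type avoids mixing the two.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Memoise an embedding function so repeated texts are embedded only once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect texts not yet cached, ignoring duplicates within this call
        missing: List[str] = []
        pending = set()
        for text in texts:
            key = self._key(text)
            if key not in self._store and key not in pending:
                missing.append(text)
                pending.add(key)
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(text)] for text in texts]
```

For document embeddings, the wrapped function could be `lambda batch: vo_client.embed(batch, model="voyage-2", input_type="document").embeddings`.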

Key Takeaways

  • Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Basic RAG often fails at retrieval, not generation.
  • Summary indexing bridges the gap between granular chunks and high-level concepts, improving recall for complex questions.
  • Re-ranking with Claude significantly improves MRR, ensuring the most relevant context appears first in your prompt.
  • End-to-end accuracy improved by 10 percentage points (71% → 81%) through these optimizations, suggesting that better retrieval improves answer quality.
  • Build your evaluation dataset first before optimizing. Without ground truth data, you're optimizing blindly.