
Building Production-Grade RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement chunking, embedding, retrieval, and evaluation using metrics like Precision, Recall, F1, and MRR. Advanced techniques like summary indexing and re-ranking boost end-to-end accuracy from 71% to 81%.

Tags: RAG, Claude, Evaluation, Embeddings, Vector Search


Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.

In this guide, we'll walk through building a complete RAG system using Claude, Voyage AI embeddings, and a robust evaluation framework. You'll learn not just how to build it, but how to measure and improve it systematically.

What You'll Learn

  • How to set up a basic RAG pipeline with Claude and Voyage AI
  • How to build a proper evaluation suite with meaningful metrics
  • Advanced techniques: summary indexing and re-ranking with Claude
  • How to achieve measurable improvements in retrieval and end-to-end accuracy

Prerequisites

You'll need:

  • An Anthropic API key
  • A Voyage AI API key (for embeddings)
  • A working Python 3 environment

Install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Level 1: Basic RAG (Naive RAG)

Let's start with the simplest possible RAG implementation. This is often called "Naive RAG" in the industry, and it follows three steps:

  • Chunk your documents by heading
  • Embed each chunk using Voyage AI
  • Retrieve the most relevant chunks using cosine similarity

Step 1: Initialize Your Vector Database

For this example, we'll use an in-memory vector store. In production, you'd likely use Pinecone, Weaviate, or another hosted solution.

import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[str], embeddings: List[List[float]]):
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: List[float], k: int = 3) -> List[Dict[str, Any]]:
        scores = [
            self._cosine_similarity(query_embedding, emb)
            for emb in self.embeddings
        ]
        top_indices = np.argsort(scores)[-k:][::-1]
        return [
            {"document": self.documents[i], "score": scores[i]}
            for i in top_indices
        ]

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Step 2: Chunk and Embed Documents

Chunking by heading is a simple but effective strategy. Each chunk contains the content under a single subheading.

import voyageai

vo = voyageai.Client(api_key="your-voyage-api-key")

def chunk_by_heading(text: str) -> List[str]:
    """Split text by markdown headings."""
    chunks = []
    current_chunk = []
    for line in text.split('\n'):
        if line.startswith('##') or line.startswith('###'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

def embed_documents(chunks: List[str]) -> List[List[float]]:
    """Embed chunks using Voyage AI."""
    result = vo.embed(chunks, model="voyage-2")
    return result.embeddings

Step 3: Retrieve and Answer

from anthropic import Anthropic

client = Anthropic(api_key="your-anthropic-api-key")

def answer_with_rag(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Retrieve relevant chunks
    results = vector_db.search(query_embedding, k=k)
    context = "\n\n".join([r["document"] for r in results])

    # Generate answer with Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
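
Wiring the pieces together might look like this (the document path and query string are placeholder assumptions, not part of a real project):

# Build the index once, then query it (paths and strings are illustrative)
doc_text = open("docs/guide.md").read()
chunks = chunk_by_heading(doc_text)

db = InMemoryVectorDB()
db.add_documents(chunks, embed_documents(chunks))

print(answer_with_rag("How do I configure rate limiting?", db))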

Building an Evaluation System

"Vibes-based" evaluation won't cut it for production. You need quantitative metrics. Here's how to build a proper evaluation suite.

Create a Synthetic Evaluation Dataset

Generate 100+ question-answer pairs with known relevant chunks. This is your ground truth.

[
  {
    "question": "How do I set up rate limiting in Claude?",
    "relevant_chunks": ["chunk_42", "chunk_43"],
    "correct_answer": "To set up rate limiting..."
  },
  ...
]
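
One way to produce these pairs is to have Claude write a question and answer for each chunk. The prompt wording and ID scheme below are illustrative assumptions, and the output should be spot-checked by hand:

import json

def generate_eval_pair(chunk_id: str, chunk: str) -> dict:
    # Ask Claude for a question/answer grounded in a single chunk (prompt is a sketch)
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"Write one question that this text answers, then the answer.\n"
                       f"Respond only with JSON containing the keys \"question\" and \"answer\".\n\n{chunk}"
        }]
    )
    # Assumes the model returns raw JSON; add error handling for production use
    pair = json.loads(response.content[0].text)
    return {
        "question": pair["question"],
        "relevant_chunks": [chunk_id],
        "correct_answer": pair["answer"],
    }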

Key Metrics

#### Retrieval Metrics

Precision: Of the chunks we retrieved, how many were relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall: Of all relevant chunks, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score: Harmonic mean of precision and recall.

Mean Reciprocal Rank (MRR): How high the first relevant result appears; the average of 1/rank of the first relevant chunk across queries.
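
Written in the same notation as above:

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

where $\text{rank}_i$ is the position of the first relevant chunk for query $i$ and $|Q|$ is the number of evaluation queries.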

#### End-to-End Metric

Accuracy: Does Claude's final answer match the ground truth?
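
Free-form answers rarely match the ground truth verbatim, so one common approach (a sketch, not something prescribed by this guide) is to let Claude grade each answer against the reference:

def grade_answer(question: str, model_answer: str, correct_answer: str) -> bool:
    # Ask Claude for a binary judgment; the prompt wording is an assumption
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nReference answer: {correct_answer}\n\n"
                       f"Candidate answer: {model_answer}\n\n"
                       "Does the candidate answer convey the same information as the reference? "
                       "Reply with only YES or NO."
        }]
    )
    return response.content[0].text.strip().upper().startswith("YES")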

Implementing the Evaluation

def evaluate_retrieval(questions, ground_truth, vector_db, k=3):
    precisions, recalls, f1s, mrrs = [], [], [], []
    
    for q, gt in zip(questions, ground_truth):
        query_emb = vo.embed([q], model="voyage-2").embeddings[0]
        results = vector_db.search(query_emb, k=k)
        # Assumes each search result carries the chunk's ID; store IDs alongside documents in your vector DB
        retrieved_ids = [r["id"] for r in results]
        
        tp = len(set(retrieved_ids) & set(gt["relevant_chunks"]))
        
        precision = tp / k
        recall = tp / len(gt["relevant_chunks"]) if gt["relevant_chunks"] else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: reciprocal rank of first relevant result
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in gt["relevant_chunks"]:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
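
Running it against the synthetic dataset might look like this (the file name follows the JSON example above and is an assumption; vector_db is the index you built earlier):

import json

with open("eval_dataset.json") as f:
    eval_set = json.load(f)

questions = [item["question"] for item in eval_set]
metrics = evaluate_retrieval(questions, eval_set, vector_db, k=3)
print(metrics)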

Level 2: Summary Indexing

Basic RAG misses context when a question spans multiple chunks. Summary indexing addresses this by having Claude summarize each chunk and embedding the summary together with the original text, so each vector carries broader context.

def create_summary(chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this text in 2-3 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text

def embed_with_summary(chunks: List[str]) -> List[List[float]]:
    summaries = [create_summary(c) for c in chunks]
    combined = [f"{s}\n\n{c}" for s, c in zip(summaries, chunks)]
    return vo.embed(combined, model="voyage-2").embeddings

Level 3: Summary Indexing + Re-Ranking

Re-ranking with Claude dramatically improves MRR. After initial retrieval, Claude scores each chunk for relevance to the query.

def rerank_with_claude(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""
    Query: {query}
    
    For each chunk below, rate its relevance to the query on a scale of 1-5.
    Return only the scores as a comma-separated list.
    
    Chunks:
    {chr(10).join([f'{i+1}. {c[:200]}...' for i, c in enumerate(chunks)])}
    """
    
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    
    scores = [int(s.strip()) for s in response.content[0].text.split(",")]
    scored_chunks = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, s in scored_chunks[:top_k]]
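
One minimal way to wire this into the answering flow (a sketch that assumes the functions defined earlier in this guide) is to over-retrieve with embeddings and then let Claude narrow the set:

def answer_with_rerank(query: str, vector_db: InMemoryVectorDB, k_initial: int = 10, k_final: int = 3) -> str:
    # Over-retrieve with embeddings, then let Claude pick the best chunks
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    candidates = [r["document"] for r in vector_db.search(query_embedding, k=k_initial)]
    top_chunks = rerank_with_claude(query, candidates, top_k=k_final)
    context = "\n\n".join(top_chunks)

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text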

Results: Measurable Improvements

After implementing summary indexing and re-ranking, here are the improvements over basic RAG:

| Metric | Basic RAG | Advanced RAG |
| --- | --- | --- |
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |

The biggest win? MRR jumped from 0.74 to 0.87, meaning the first retrieved chunk is almost always relevant. This directly impacts end-to-end accuracy, which rose from 71% to 81%.

Production Considerations

  • Rate Limits: Full evaluations can hit rate limits. Use Tier 2+ accounts or run partial evals.
  • Vector Database: In-memory works for prototyping. Use Pinecone, Weaviate, or pgvector for production.
  • Chunking Strategy: Experiment with overlap, semantic chunking, or recursive splitting (a simple overlap variant is sketched after this list).
  • Embedding Model: Voyage AI's voyage-2 is excellent, but test alternatives like text-embedding-3-small.
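
As a minimal sketch of the overlap idea (the chunk size and overlap values are arbitrary starting points, not recommendations from this guide):

from typing import List

def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    # Slide a fixed-size window over the text so neighboring chunks share context
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks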

Key Takeaways

  • Evaluate retrieval and generation separately to pinpoint bottlenecks in your RAG pipeline
  • Summary indexing improves recall by enriching chunk embeddings with broader context
  • Re-ranking with Claude dramatically boosts MRR, ensuring the most relevant chunk appears first
  • Start simple, measure everything — basic RAG gives a strong baseline, and targeted improvements yield measurable gains
  • End-to-end accuracy improved by 10 percentage points (71% → 81%) through these techniques, proving the value of systematic optimization