# Building a Production-Ready RAG System with Claude: From Basic to Advanced
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced optimizations like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, then apply techniques that boosted accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your unique business context. Whether you're building a customer support chatbot, an internal knowledge base Q&A system, or a financial analysis tool, RAG enables Claude to answer questions grounded in your own documents.
In this guide, we'll walk through building a RAG system using the Claude Documentation as our knowledge base. We'll start with a basic implementation, then show you how to measure performance objectively, and finally apply advanced techniques that improved our end-to-end accuracy from 71% to 81%.
## What You'll Need
Before we dive in, let's set up our environment. You'll need:
- An Anthropic API key for Claude
- A Voyage AI API key for embeddings
- Python libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
```python
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
```
## Initialize Your Vector Database
For this guide, we'll use an in-memory vector store. In production, you'd likely use a dedicated vector database like Pinecone, Weaviate, or Chroma.
```python
class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text, embedding):
        self.documents.append(text)
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        # Cosine similarity between the query and every stored embedding
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        # Indices of the top_k highest scores, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.documents[i], similarities[i]) for i in top_indices]
```
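To make the interface concrete, here's a minimal indexing sketch. The `docs` list is illustrative; in practice the documents come from your chunking step:

```python
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
vector_db = InMemoryVectorDB()

# Illustrative documents standing in for real chunks
docs = [
    "Claude supports tool use via the Messages API.",
    "Rate limits vary by usage tier.",
]

for doc in docs:
    embedding = vo.embed([doc], model="voyage-2").embeddings[0]
    vector_db.add_document(doc, embedding)
```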
## Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG pipeline. This is often called "Naive RAG" in the industry, and it follows three steps:
- Chunk documents by heading (each subheading becomes a chunk; see the sketch after this list)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
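The pipeline below implements steps 2 and 3. Step 1 isn't shown in code, so here's a minimal heading-based chunker as a sketch (the function name and the Markdown-heading regex are assumptions):

```python
import re

def chunk_by_headings(markdown_text):
    """Split a Markdown document so each heading starts a new chunk."""
    sections = re.split(r"\n(?=#{1,4} )", markdown_text)
    return [s.strip() for s in sections if s.strip()]
```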
```python
def basic_rag_pipeline(query, vector_db, top_k=3):
    # Step 1: Embed the query
    vo = voyageai.Client()
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Step 2: Retrieve relevant chunks
    retrieved_chunks = vector_db.search(query_embedding, top_k=top_k)

    # Step 3: Generate answer with Claude
    context = "\n\n".join([chunk for chunk, _ in retrieved_chunks])
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
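Usage is a single call (the question text is illustrative):

```python
answer = basic_rag_pipeline("How do I authenticate API requests?", vector_db)
print(answer)
```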
This works, but how do we know if it's working well? That's where evaluation comes in.
## Building an Evaluation System
"Vibes-based" evaluation won't cut it for production systems. We need objective metrics. Let's build an evaluation suite that measures two things independently:
- Retrieval performance – How good is our system at finding the right documents?
- End-to-end performance – How good are the final answers?

To measure both, we need a golden evaluation dataset where each entry contains:
- A question
- The correct chunks (ground truth relevant documents)
- A correct answer
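For illustration, one entry of such a dataset might look like this (the field names are an assumed schema, not a requirement):

```python
golden_example = {
    "question": "How do I authenticate requests to the Claude API?",
    "correct_chunks": ["getting-started#authentication"],  # ground-truth chunk IDs
    "correct_answer": "Pass your API key in the x-api-key header.",
}
```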
### Key Retrieval Metrics
#### Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means you're not flooding Claude with irrelevant information. Low precision means you're wasting context window space.
#### Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall means Claude has access to all the information it needs. Low recall means you're missing important context.
#### F1 Score
The harmonic mean of precision and recall, giving a balanced measure of retrieval quality:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
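Since all three metrics come from the same counts, here's a minimal sketch computing them from lists of chunk identifiers (the function name is illustrative):

```python
def retrieval_metrics(retrieved_chunks, correct_chunks):
    true_positives = len(set(retrieved_chunks) & set(correct_chunks))
    precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0.0
    recall = true_positives / len(correct_chunks) if correct_chunks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1
```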
#### Mean Reciprocal Rank (MRR)
MRR measures how high the first relevant result appears in your retrieval list. If the first relevant chunk is always at position 1, MRR is 1.0. If it's often at position 3, MRR drops.
```python
def calculate_mrr(retrieved_chunks, correct_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in correct_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
### End-to-End Accuracy
This measures whether Claude's final answer is correct. You can use an LLM-as-judge approach or exact-match comparison against a golden answer; here we use Claude Haiku as the judge:
```python
def evaluate_end_to_end(question, expected_answer, actual_answer):
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,  # enough room for a one-word verdict
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nExpected: {expected_answer}\nActual: {actual_answer}\n\nIs the actual answer correct? Answer only 'yes' or 'no'."
        }]
    )
    # startswith tolerates trailing punctuation like "Yes."
    return response.content[0].text.strip().lower().startswith('yes')
```
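Tying these together, a minimal evaluation loop over a golden dataset might look like this (it assumes the illustrative `golden_example` schema above and the `basic_rag_pipeline` from Level 1):

```python
def run_eval(golden_dataset, vector_db):
    correct = 0
    for example in golden_dataset:
        answer = basic_rag_pipeline(example["question"], vector_db)
        if evaluate_end_to_end(example["question"],
                               example["correct_answer"], answer):
            correct += 1
    return correct / len(golden_dataset)
```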
## Level 2: Summary Indexing
Our basic RAG has a problem: chunking by heading loses the broader context. A chunk about "rate limits" might not mention it's from the "API Reference" section, making it harder to retrieve for questions about API usage.
Summary indexing solves this by creating a short summary for each chunk and using that summary (along with the chunk) for retrieval.

```python
def create_summary(chunk_text):
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize this document chunk in 1-2 sentences:\n\n{chunk_text}"
        }]
    )
    return response.content[0].text
```
During indexing:
```python
for chunk in all_chunks:
    summary = create_summary(chunk)
    combined_text = f"[Summary]: {summary}\n[Content]: {chunk}"
    # Embed the summary plus content, but store the original chunk text
    embedding = vo.embed([combined_text], model="voyage-2").embeddings[0]
    vector_db.add_document(chunk, embedding)
```
This improved our recall from 0.66 to 0.69 and F1 from 0.52 to 0.54.
## Level 3: Summary Indexing + Re-Ranking
Even with better indexing, we might retrieve 10 chunks but only have room for 3 in Claude's context. Re-ranking uses Claude itself to select the most relevant chunks from an initial candidate set.
```python
def rerank_with_claude(query, candidates, top_k=3):
    client = anthropic.Anthropic()

    # Format candidates for Claude
    candidate_text = "\n\n".join([
        f"[Chunk {i+1}]: {chunk}"
        for i, chunk in enumerate(candidates)
    ])

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Given this query: '{query}'\n\nRank these chunks by relevance (most relevant first). Return only the chunk numbers in order, comma-separated.\n\n{candidate_text}"
        }]
    )

    # Parse the ranked indices, skipping any tokens that aren't valid chunk numbers
    ranked_indices = [
        int(x.strip()) - 1
        for x in response.content[0].text.split(',')
        if x.strip().isdigit() and 0 < int(x.strip()) <= len(candidates)
    ]
    return [candidates[i] for i in ranked_indices[:top_k]]
```
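Putting the two stages together, you retrieve a wider candidate set by cosine similarity, then re-rank down to the final context (the `top_k` values and the pre-computed `query_embedding` are illustrative):

```python
candidates = [chunk for chunk, _ in vector_db.search(query_embedding, top_k=10)]
top_chunks = rerank_with_claude(query, candidates, top_k=3)
```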
This technique dramatically improved our Mean Reciprocal Rank from 0.74 to 0.87 – meaning the most relevant chunk almost always appeared first.
## Results Summary
Here's what we achieved by layering these techniques:
| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 78% | 81% |
## Production Considerations
- Rate limits: Full evaluations can hit API rate limits. Consider using Tier 2+ accounts or sampling your eval set.
- Cost: Summary indexing and re-ranking add token costs. Balance improvements against your budget.
- Chunking strategy: Experiment with different chunk sizes and overlap. We found heading-based chunking with 200-500 token chunks works well (see the sketch after this list).
- Embedding model: Voyage AI's `voyage-2` is excellent, but test other models for your domain.
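As a rough sketch of size-based chunking with overlap, using whitespace-split words as a stand-in for real token counting (the function name and defaults are illustrative):

```python
def chunk_with_overlap(text, chunk_size=400, overlap=50):
    """Size-based chunking with overlap; words approximate tokens."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]
```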
## Key Takeaways
- Evaluate retrieval and generation separately – Don't rely on "vibes." Use precision, recall, F1, and MRR to measure retrieval quality, and a separate metric for end-to-end accuracy.
- Summary indexing improves recall – By enriching chunks with summaries, you make them more discoverable for semantic search.
- Re-ranking with Claude boosts MRR significantly – Using Claude to select the most relevant chunks from a candidate pool ensures the best information reaches your final prompt.
- Start simple, then iterate – Basic RAG works. Measure it, then apply targeted improvements based on where your metrics are weakest.
- End-to-end accuracy is the ultimate metric – All retrieval improvements should ultimately serve the goal of better answers. Our 10-point accuracy gain (71% → 81%) validated our approach.