Guide | Beginner | Best Practices | 2026-05-22

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn how to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques for production-grade performance.

Quick Answer

This guide teaches you to build a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement vector search, create evaluation suites, and apply techniques like summary indexing and re-ranking, which together raised end-to-end answer accuracy from 71% to 81% in our tests.

RAG, Claude API, Vector Databases, Evaluation, Embeddings

Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities into your specific business domain. While Claude excels at general knowledge tasks, it needs RAG to answer questions about your internal documentation, customer support history, or proprietary data.

In this comprehensive guide, we'll walk through building a production-grade RAG system using Claude, Voyage AI embeddings, and systematic evaluation. We'll start with a basic implementation and progressively optimize it using advanced techniques that improved end-to-end accuracy from 71% to 81% in production testing.

Understanding the RAG Architecture

Before diving into code, let's understand what makes RAG tick. A RAG system has three core components:

Ingestion Pipeline: Chunks your documents, generates embeddings, and stores them in a vector database
Retrieval System: Finds the most relevant document chunks for a given query
Generation System: Feeds retrieved context to Claude to produce accurate answers

The magic happens when these components work together seamlessly. Let's build them step by step.

Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

You'll need API keys from both Anthropic and Voyage AI. Set them as environment variables:

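import os

# Placeholder values for local experiments only. In real deployments, export
# these in your shell or a secrets manager and never commit keys to source control.
os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-key"
os.environ["VOYAGE_API_KEY"] = "your-voyage-key"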

Level 1: Building a Basic RAG Pipeline

Let's start with what the industry calls "Naive RAG" – a straightforward implementation that gets the job done but has room for improvement.

Step 1: Document Chunking

We'll chunk documents by headings, keeping content from each subheading together:

def chunk_document_by_headings(text):
    """Split document into chunks based on markdown headings"""
    chunks = []
    current_heading = None
    current_content = []
    
    for line in text.split('\n'):
        if line.startswith('##'):  # '##' also matches '###' and deeper headings
            if current_heading:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content)
                })
            current_heading = line
            current_content = []
        else:
            current_content.append(line)
    
    # Don't forget the last chunk
    if current_heading:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content)
        })
    
    return chunks
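The evaluation code later in this guide reads `chunk['id']`, which the chunker above does not set. Here is a minimal sketch of one way to derive stable IDs from headings. The helper name `add_chunk_ids` and the slug format are assumptions for illustration, not part of the original pipeline.

```python
import re

def add_chunk_ids(chunks):
    """Attach a slug-style 'id' to each chunk, derived from its heading.

    Hypothetical helper: the slug format is an illustrative choice.
    """
    seen = {}
    for chunk in chunks:
        # Strip leading '#' markers, lowercase, collapse non-alphanumerics to '_'
        title = chunk['heading'].lstrip('#').strip().lower()
        slug = re.sub(r'[^a-z0-9]+', '_', title).strip('_') or 'section'
        # Disambiguate repeated headings with a numeric suffix
        seen[slug] = seen.get(slug, 0) + 1
        chunk['id'] = slug if seen[slug] == 1 else f"{slug}_{seen[slug]}"
    return chunks

example = add_chunk_ids([
    {'heading': '## Rate Limiting Intro', 'content': '...'},
    {'heading': '### Rate Limit Config', 'content': '...'},
    {'heading': '## Rate Limiting Intro', 'content': '...'},
])
print([c['id'] for c in example])
# ['rate_limiting_intro', 'rate_limit_config', 'rate_limiting_intro_2']
```

IDs built this way stay the same across re-ingestion as long as headings do not change, which keeps hand-labeled evaluation data valid.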

Step 2: Generate Embeddings

Using Voyage AI's embedding model:

import voyageai

# Reads VOYAGE_API_KEY from the environment
vo = voyageai.Client()

def generate_embeddings(chunks):
    """Generate embeddings for each chunk"""
    texts = [chunk['content'] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    
    for i, chunk in enumerate(chunks):
        chunk['embedding'] = embeddings[i]
    
    return chunks
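For larger document sets, sending every chunk in one request may exceed the embedding API's per-request limits. The sketch below batches the calls. The `embed_fn` parameter and the default `batch_size=128` are assumptions; check the provider's documentation for the actual cap.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches and return vectors in input order.

    embed_fn is any callable mapping a list of strings to a list of vectors,
    for example: lambda batch: vo.embed(batch, model="voyage-2").embeddings
    batch_size=128 is an assumed limit, not a documented guarantee.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder so the sketch runs without an API key
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

Passing the embedder as a callable also makes the ingestion step easy to test offline.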

Step 3: In-Memory Vector Database

For this guide, we'll use a simple in-memory store. In production, consider Pinecone, Weaviate, or Chroma:

import numpy as np

class InMemoryVectorDB:
    def __init__(self):
        self.chunks = []
    
    def add_chunks(self, chunks):
        self.chunks.extend(chunks)
    
    def search(self, query_embedding, top_k=3):
        """Find top_k most similar chunks using cosine similarity"""
        similarities = []
        for chunk in self.chunks:
            similarity = cosine_similarity(query_embedding, chunk['embedding'])
            similarities.append(similarity)
        
        # Get indices of top_k results
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.chunks[i] for i in top_indices]


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
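The per-chunk loop above is easy to read but slows down as the store grows. As a rough sketch, the same cosine ranking can be computed in one matrix operation. The function name `top_k_cosine` is hypothetical, and the code assumes no embedding is the zero vector.

```python
import numpy as np

def top_k_cosine(query_embedding, chunk_embeddings, top_k=3):
    """Return (indices, scores) of the top_k rows most similar to the query."""
    matrix = np.asarray(chunk_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Normalize rows and query so the dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    # Sort descending by similarity and keep the first top_k positions
    order = np.argsort(scores)[::-1][:top_k]
    return order.tolist(), scores[order].tolist()

indices, scores = top_k_cosine(
    [1.0, 0.0],
    [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    top_k=2,
)
print(indices)  # [1, 2]
```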

Step 4: Query Pipeline

Now let's tie it all together with Claude:

from anthropic import Anthropic

# Reads ANTHROPIC_API_KEY from the environment
client = Anthropic()

def rag_query(query, vector_db, top_k=3):
    # 1. Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    
    # 2. Retrieve relevant chunks
    relevant_chunks = vector_db.search(query_embedding, top_k=top_k)
    
    # 3. Build context from chunks
    context = "\n\n---\n\n".join([
        f"From section '{chunk['heading']}':\n{chunk['content']}"
        for chunk in relevant_chunks
    ])
    
    # 4. Generate answer with Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # substitute any current Claude model ID
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Based on the following documentation, answer the question.\n\nDocumentation:\n{context}\n\nQuestion: {query}"
        }]
    )
    
    return response.content[0].text
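The prompt in step 4 works, but separating chunks with explicit tags and telling the model what to do when the context is insufficient can make answers easier to audit. The following is a sketch only: the helper name `build_rag_prompt`, the tag names, and the instruction wording are assumptions, not the prompt used to produce the results in this guide.

```python
def build_rag_prompt(query, chunks):
    """Assemble a prompt that wraps each retrieved chunk in <document> tags."""
    documents = "\n".join(
        f'<document index="{i}" heading="{chunk["heading"]}">\n'
        f'{chunk["content"]}\n</document>'
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the documentation below. "
        "If the documentation does not contain the answer, say so.\n\n"
        f"<documentation>\n{documents}\n</documentation>\n\n"
        f"Question: {query}"
    )

prompt = build_rag_prompt(
    "How do I set up rate limiting?",
    [{'heading': '## Rate Limiting', 'content': 'Configure limits per key.'}],
)
print(prompt)
```

The returned string can be passed as the `content` of the user message in `rag_query`.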

Building a Robust Evaluation System

This is where most RAG tutorials stop – but it's where the real work begins. You can't improve what you can't measure.

Creating an Evaluation Dataset

We need three things for each test case:

A question
The correct chunks (ground truth for retrieval)
A correct answer (ground truth for generation)

Here's how to structure it:

evaluation_data = [
    {
        "question": "How do I set up rate limiting in Claude?",
        "relevant_chunks": ["rate_limiting_intro", "rate_limit_config"],
        "correct_answer": "You can set up rate limiting by configuring..."
    },
    # ... 97 more samples
]

Key Metrics Explained

Retrieval Metrics (measure your search quality):

Precision: Of all chunks retrieved, how many were relevant?

- Formula: True Positives / Total Retrieved
- High precision = fewer irrelevant results

Recall: Of all relevant chunks, how many did we retrieve?

- Formula: True Positives / Total Relevant
- High recall = we're not missing important information

F1 Score: Harmonic mean of precision and recall

- Formula: 2 * (Precision * Recall) / (Precision + Recall)

Mean Reciprocal Rank (MRR): How early does the first relevant result appear?

- Formula: average over all queries of 1 / (rank of the first relevant result)
- Critical for user experience – users want answers fast

End-to-End Metric:

Accuracy: Does Claude produce the correct answer given the retrieved context?

- This is your ultimate business metric
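To make the retrieval metrics concrete, here is a small worked example for a single query. The chunk IDs are made up for illustration, and `retrieval_metrics` is a hypothetical helper name.

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    """Compute precision, recall, F1, and reciprocal rank for one query."""
    relevant = set(relevant_ids)
    hits = len(set(retrieved_ids) & relevant)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    # Reciprocal rank: 1 / position of the first relevant result, 0 if none
    reciprocal_rank = 0.0
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            reciprocal_rank = 1 / rank
            break
    return {'precision': precision, 'recall': recall,
            'f1': f1, 'reciprocal_rank': reciprocal_rank}

m = retrieval_metrics(
    retrieved_ids=["pricing", "rate_limit_config", "auth"],
    relevant_ids=["rate_limiting_intro", "rate_limit_config"],
)
print({k: round(v, 2) for k, v in m.items()})
# {'precision': 0.33, 'recall': 0.5, 'f1': 0.4, 'reciprocal_rank': 0.5}
```

One of three retrieved chunks is relevant (precision 1/3), one of two relevant chunks was found (recall 1/2), and the first hit sits at rank 2 (reciprocal rank 1/2). MRR is the mean of the reciprocal rank over all queries.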

Implementing the Evaluation

def evaluate_retrieval(vector_db, eval_data):
    metrics = {
        'precision': [],
        'recall': [],
        'f1': [],
        'mrr': []
    }
    
    for item in eval_data:
        # Use the same embedding model as at ingestion time so vectors are comparable
        query_embedding = vo.embed([item['question']], model="voyage-2").embeddings[0]
        retrieved = vector_db.search(query_embedding, top_k=3)
        
        # Assumes each stored chunk has an 'id' field matching the labels in eval_data
        retrieved_ids = [chunk['id'] for chunk in retrieved]
        relevant_ids = item['relevant_chunks']
        
        # Calculate metrics
        true_positives = len(set(retrieved_ids) & set(relevant_ids))
        
        precision = true_positives / len(retrieved_ids)
        recall = true_positives / len(relevant_ids)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: reciprocal rank of first relevant result
        mrr = 0
        for i, chunk_id in enumerate(retrieved_ids):
            if chunk_id in relevant_ids:
                mrr = 1 / (i + 1)
                break
        
        metrics['precision'].append(precision)
        metrics['recall'].append(recall)
        metrics['f1'].append(f1)
        metrics['mrr'].append(mrr)
    
    return {k: np.mean(v) for k, v in metrics.items()}
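The function above covers retrieval only. End-to-end accuracy also needs a generated answer and a way to grade it. The sketch below assumes two pluggable callables, `answer_fn` and `judge_fn`, both hypothetical names. In practice `answer_fn` could wrap `rag_query`, and `judge_fn` could be an LLM-based grader or a human label.

```python
def evaluate_end_to_end(eval_data, answer_fn, judge_fn):
    """Return the fraction of questions whose generated answer is judged correct.

    answer_fn(question) -> str
    judge_fn(question, generated_answer, reference_answer) -> bool
    Both callables are assumptions of this sketch.
    """
    if not eval_data:
        return 0.0
    correct = 0
    for item in eval_data:
        generated = answer_fn(item["question"])
        if judge_fn(item["question"], generated, item["correct_answer"]):
            correct += 1
    return correct / len(eval_data)

# Stand-ins so the sketch runs offline
canned = {"q1": "Use the config file.", "q2": "No idea."}
data = [
    {"question": "q1", "correct_answer": "config file"},
    {"question": "q2", "correct_answer": "API key"},
]
accuracy = evaluate_end_to_end(
    data,
    answer_fn=lambda q: canned[q],
    judge_fn=lambda q, generated, reference: reference in generated,
)
print(accuracy)  # 0.5
```

Substring matching is far too strict for real answers and is used here only to keep the example runnable.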

Level 2: Summary Indexing

Basic RAG struggles when answers span multiple chunks. Summary indexing solves this by creating higher-level summaries that capture cross-cutting concepts.

def create_summary_index(chunks):
    """Create summary embeddings for groups of related chunks"""
    summaries = []
    
    # Group chunks by topic (simplified - use clustering in production)
    for i in range(0, len(chunks), 3):
        group = chunks[i:i+3]
        combined = " ".join([c['content'] for c in group])
        
        # Generate summary using Claude
        summary = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this documentation section:\n\n{combined}"
            }]
        ).content[0].text
        
        # Embed the summary
        # Embed the summary with the same model used for the chunks
        summary_embedding = vo.embed([summary], model="voyage-2").embeddings[0]
        
        summaries.append({
            'summary': summary,
            'embedding': summary_embedding,
            'source_chunks': group
        })
    
    return summaries
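Once the summary index exists, a query can be matched against the summaries and then expanded to the chunks behind them. This sketch is one plausible way to do that lookup. The function name `search_summary_index` is an assumption, and the original pipeline may combine summary and chunk retrieval differently.

```python
import numpy as np

def search_summary_index(query_embedding, summaries, top_k=1):
    """Rank summaries by cosine similarity and return their source chunks."""
    query = np.asarray(query_embedding, dtype=float)
    scores = []
    for entry in summaries:
        vec = np.asarray(entry['embedding'], dtype=float)
        cosine = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append(float(cosine))
    # Indices of the top_k best-matching summaries, best first
    best = np.argsort(scores)[::-1][:top_k]
    chunks = []
    for i in best:
        chunks.extend(summaries[i]['source_chunks'])
    return chunks

toy_index = [
    {'summary': 'Billing', 'embedding': [1.0, 0.0],
     'source_chunks': [{'id': 'a'}, {'id': 'b'}]},
    {'summary': 'Rate limits', 'embedding': [0.0, 1.0],
     'source_chunks': [{'id': 'c'}]},
]
found = search_summary_index([0.1, 0.9], toy_index, top_k=1)
print([c['id'] for c in found])  # ['c']
```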

Level 3: Re-Ranking with Claude

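Re-ranking improves MRR by having Claude score each retrieved chunk for relevance before generating an answer. It only helps if the candidate pool is larger than the final result set, so retrieve more chunks than you need (for example, the top 10 from vector search) and keep the top_k highest-scoring ones: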

def rerank_with_claude(query, chunks, top_k=3):
    """Use Claude to re-rank retrieved chunks by relevance"""
    chunk_scores = []
    
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the question?\n\nQuestion: {query}\n\nChunk: {chunk['content'][:500]}\n\nAnswer with just a number."
            }]
        )
        
        try:
            score = int(response.content[0].text.strip())
        except ValueError:
            score = 5  # Default score if parsing fails
        
        chunk_scores.append((score, chunk))
    
    # Sort by score descending and return top_k
    chunk_scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in chunk_scores[:top_k]]
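The `int(...)` parse above falls back to the default whenever the model adds any extra text, such as "8/10". A slightly more tolerant parser is sketched below. The name `parse_relevance_score` and the clamping behavior are assumptions of this sketch.

```python
import re

def parse_relevance_score(text, default=5, low=0, high=10):
    """Extract the first integer from a model reply and clamp it to [low, high].

    Returns `default` when the reply contains no digits.
    """
    match = re.search(r'\d+', text)
    if match is None:
        return default
    return max(low, min(high, int(match.group())))

print(parse_relevance_score("8"))            # 8
print(parse_relevance_score("Score: 7/10"))  # 7
print(parse_relevance_score("n/a"))          # 5
print(parse_relevance_score("42"))           # 10
```

Because it takes the first integer it finds, this parser still relies on the prompt asking for a bare number.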

Performance Results

After adding summary indexing and re-ranking to the basic pipeline, here are the improvements we observed:

Metric	Basic RAG	Advanced RAG
Avg Precision	0.43	0.44
Avg Recall	0.66	0.69
Avg F1 Score	0.52	0.54
Avg MRR	0.74	0.87
End-to-End Accuracy	71%	81%

The largest improvement came in MRR (from 0.74 to 0.87), meaning the first relevant chunk appears noticeably earlier in the results. Precision, recall, and F1 moved only slightly. End-to-end accuracy rose by 10 percentage points (71% to 81%), which translates to noticeably more reliable answers.
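When reporting gains like these, it helps to distinguish absolute change from relative change. The short calculation below uses the figures from the table. The helper name is hypothetical.

```python
def absolute_and_relative_gain(before, after):
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return absolute, relative

mrr_abs, mrr_rel = absolute_and_relative_gain(0.74, 0.87)
acc_abs, acc_rel = absolute_and_relative_gain(71, 81)

print(round(mrr_abs, 2), round(mrr_rel, 1))  # 0.13 17.6
print(acc_abs, round(acc_rel, 1))            # 10 14.1
```

So the accuracy gain is 10 percentage points, or roughly 14% relative to the baseline.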

Production Considerations

When moving to production, consider these additional optimizations:

Hybrid Search: Combine semantic search with keyword matching for better recall
Caching: Cache frequent queries and their results
Monitoring: Track retrieval metrics in production to catch degradation
A/B Testing: Test different chunking strategies and embedding models
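For the hybrid search item, this guide does not prescribe a method for combining semantic and keyword results. One common option is reciprocal rank fusion (RRF), sketched below. The function name and the constant `k=60` are conventional choices assumed here, not values taken from the experiments above.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranked list.

    Each ID scores 1 / (k + rank) in every list where it appears;
    the per-list scores are summed and IDs are sorted by the total.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)

semantic_results = ["a", "b", "c"]  # e.g. from vector search
keyword_results = ["b", "d", "a"]   # e.g. from a keyword index
fused = reciprocal_rank_fusion([semantic_results, keyword_results])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF needs only rank positions, so it avoids having to put cosine similarities and keyword scores on a common scale.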

Key Takeaways

Start simple, measure everything: Build a basic RAG pipeline first, then establish rigorous evaluation metrics before optimizing
Separate retrieval from generation metrics: You need to know whether failures come from missing context or poor reasoning
Summary indexing bridges the gap: When answers span multiple chunks, summary-level retrieval captures the big picture
Re-ranking with Claude boosts MRR significantly: Having Claude score relevance before answering moves relevant chunks higher in the results and improves answer accuracy, at the cost of one extra model call per candidate chunk
Production RAG requires continuous evaluation: Metrics like precision and recall should be monitored in production to catch data drift and degradation