Guide | Beginner | Best Practices | 2026-05-13

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques for production-grade performance.

Quick Answer

This guide walks through building a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement chunking, embedding, retrieval, and evaluation pipelines, plus advanced techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.

Tags: RAG, Claude API, Vector Search, Evaluation, Production

Retrieval Augmented Generation (RAG) underpins many enterprise AI applications. Claude handles general knowledge tasks well, but it has no access to your internal documents, customer support data, or proprietary knowledge bases unless you supply them. RAG closes that gap by retrieving the relevant passages and placing them in the prompt.

In this guide, we'll build a production-grade RAG system using Claude and Voyage AI embeddings. We'll start with a basic implementation, then systematically improve it using advanced techniques that boosted our end-to-end accuracy from 71% to 81%.

Understanding the RAG Pipeline

A RAG system works in three stages:

Ingestion: Chunk and embed your documents into a vector database
Retrieval: Find relevant chunks for a user's query
Generation: Feed retrieved context to Claude for answer generation
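
The three stages above can be sketched end to end without any API calls. This is a toy illustration only: `toy_embed` and `toy_similarity` are hypothetical stand-ins (word counts instead of a real embedding model), and the generation stage stops at building the prompt that would be sent to Claude.

```python
import math
from collections import Counter
from typing import List

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def toy_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stage 1, ingestion: embed each chunk and keep it next to its text
chunks = [
    "Refunds are processed within five business days.",
    "Password resets require a verified email address.",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Stage 2, retrieval: rank chunks by similarity to the query
def retrieve(query: str, k: int = 1) -> List[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: toy_similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Stage 3, generation: assemble the prompt that would be sent to the model
def build_prompt(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Calling `build_prompt("How long do refunds take?")` places the refund chunk in the context because it is the only chunk sharing a word with the query.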

Let's build each component, starting simple and adding sophistication.

Level 1: Basic RAG Implementation

Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai pandas numpy scikit-learn matplotlib

Initialize your API clients:

import anthropic
import voyageai

# Initialize clients (placeholder keys; load real keys from environment variables or a secrets manager)
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")

Building a Simple Vector Database

For production, use a hosted vector database like Pinecone or Weaviate. For this guide, we'll use an in-memory implementation:

import numpy as np
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SimpleVectorDB:
    """In-memory vector store; relies on the global `vo` Voyage client."""

    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, metadata: Optional[Dict] = None):
        # One embedding call per document; batch the texts for larger corpora
        embedding = vo.embed([text], model="voyage-2").embeddings[0]
        self.documents.append({"text": text, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
        scores = [cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        # Indices of the k highest scores, best first
        top_indices = np.argsort(scores)[-k:][::-1]
        return [(self.documents[i]["text"], scores[i]) for i in top_indices]
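
Scoring one embedding at a time in a Python loop gets slow as the collection grows. As a rough sketch of an alternative, the search step can be done with a single matrix product. The function name `top_k_cosine` is hypothetical, and the sketch assumes every vector is non-zero.

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (row index, score) pairs for the k rows most similar to query_vec."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # One matrix-vector product scores every document at once
    scores = doc_matrix @ query_vec / (doc_norms * query_norm)
    k = min(k, len(scores))
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# Three toy 2-dimensional "embeddings"
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
results = top_k_cosine(np.array([1.0, 0.0]), docs, k=2)
```

Here `results` lists row 0 first (identical direction to the query) and row 2 second (45 degrees away).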

Chunking Strategy

Basic RAG chunks documents by heading:

def chunk_by_heading(document: str) -> List[str]:
    """Split document into chunks based on markdown headings."""
    chunks = []
    current_chunk = []
    
    for line in document.split('\n'):
        if line.startswith('##'):  # matches '##' and every deeper heading level
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    
    return chunks
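
If your documents also use top-level `#` headings, a regex-based variant may be a better fit. The following is a sketch, not a drop-in replacement: `chunk_by_any_heading` is a hypothetical name, and it will also split on `#` comment lines inside code samples, which you may need to guard against.

```python
import re
from typing import List

# A markdown heading: one to six '#' characters followed by whitespace
HEADING = re.compile(r"^#{1,6}\s")

def chunk_by_any_heading(document: str) -> List[str]:
    """Split a markdown document at every heading, whatever its level."""
    chunks, current = [], []
    for line in document.split("\n"):
        # A heading closes the chunk collected so far and starts a new one
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that were only whitespace
    return [c for c in chunks if c]

sample = "# Guide\nIntro text.\n## Setup\nInstall the package.\n## Usage\nRun the tool."
chunks = chunk_by_any_heading(sample)
```

For `sample`, this yields three chunks, each beginning with its own heading line.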

The Basic RAG Query Function

def basic_rag_query(query: str, vector_db: SimpleVectorDB) -> str:
    # Retrieve relevant chunks
    results = vector_db.search(query, k=3)
    context = "\n\n---\n\n".join([text for text, score in results])
    
    # Generate answer with Claude
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text

Building a Robust Evaluation System

Don't rely on "vibes" to evaluate your RAG system. We built a synthetic evaluation dataset with 100 samples, each containing:

A question
Relevant document chunks (ground truth)
A correct answer
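
One possible shape for a sample with these three fields is shown below. The class and field names are assumptions made for illustration, as is the idea of identifying ground-truth chunks by an ID string.

```python
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class EvalSample:
    """One evaluation record (hypothetical schema)."""
    question: str
    relevant_chunk_ids: List[str]  # ground-truth chunks for retrieval metrics
    correct_answer: str            # reference answer for end-to-end accuracy

sample = EvalSample(
    question="How long do refunds take?",
    relevant_chunk_ids=["billing#refunds"],
    correct_answer="Refunds are processed within five business days.",
)

# Plain dicts are convenient for saving the dataset as JSON
record = asdict(sample)
```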

Key Metrics

Retrieval Metrics

Precision: Of the chunks we retrieved, how many were relevant?

Precision = |Retrieved ∩ Correct| / |Retrieved|

Recall: Of all correct chunks, how many did we retrieve?

Recall = |Retrieved ∩ Correct| / |Correct|

F1 Score: Harmonic mean of precision and recall

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Mean Reciprocal Rank (MRR): How high did the first relevant result rank, averaged over all queries?

MRR = mean over all queries of (1 / rank_of_first_relevant_result)
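
The retrieval metrics translate directly into code. This sketch assumes chunks are identified by unique string IDs and that a query with no relevant result contributes a reciprocal rank of 0. The function names are illustrative.

```python
from typing import List, Sequence, Set, Tuple

def precision_recall_f1(retrieved: List[str], correct: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single query."""
    hits = len(set(retrieved) & correct)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved: List[str], correct: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in correct:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: Sequence[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, correct) pairs, one per query."""
    return sum(reciprocal_rank(r, c) for r, c in runs) / len(runs)

# One query: three chunks retrieved, two chunks are actually relevant
p, r, f1 = precision_recall_f1(["a", "b", "c"], {"b", "d"})
mrr = mean_reciprocal_rank([(["a", "b", "c"], {"b", "d"}), (["x"], {"x"})])
```

In the example, one of three retrieved chunks is relevant (precision 1/3), one of two relevant chunks was found (recall 1/2), and the first hit sits at rank 2 (reciprocal rank 1/2).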

End-to-End Metric

Accuracy: Does Claude's answer match the ground truth? We use Claude itself as a judge:

def evaluate_answer(question: str, generated: str, ground_truth: str) -> bool:
    prompt = f"""Question: {question}
Generated Answer: {generated}
Correct Answer: {ground_truth}
Does the generated answer agree with the correct answer? Answer only YES or NO."""
    
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().upper().startswith("YES")
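
Accuracy is then the fraction of samples the judge accepts. The sketch below keeps the judge pluggable so the aggregation can be checked offline. `exact_match_judge` is a hypothetical stand-in for the Claude-based judge, and `parse_verdict` shows one lenient way to read a YES/NO reply.

```python
from typing import Callable, Iterable, Tuple

# (question, generated answer, ground-truth answer)
Sample = Tuple[str, str, str]

def parse_verdict(raw: str) -> bool:
    """Accept any reply that starts with YES, ignoring case and whitespace."""
    return raw.strip().upper().startswith("YES")

def end_to_end_accuracy(samples: Iterable[Sample],
                        judge: Callable[[str, str, str], bool]) -> float:
    """Fraction of samples for which the judge returns True."""
    verdicts = [judge(q, g, t) for q, g, t in samples]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

def exact_match_judge(question: str, generated: str, ground_truth: str) -> bool:
    """Stand-in judge for offline testing: case-insensitive exact match."""
    return generated.strip().lower() == ground_truth.strip().lower()

samples = [
    ("How long do refunds take?", "Five business days", "five business days"),
    ("Is email verification required?", "No", "Yes"),
]
accuracy = end_to_end_accuracy(samples, exact_match_judge)
```

Swapping `exact_match_judge` for a model-backed judge leaves the aggregation unchanged.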

Level 2: Summary Indexing

Basic RAG loses context when chunks are too granular. Summary indexing has Claude write a short summary of each document and indexes that summary together with the text, so retrieval can match on what a document is about and not only on its exact wording:

def create_summary_index(documents: List[str]) -> SimpleVectorDB:
    db = SimpleVectorDB()

    for doc in documents:
        # Summarize the document (only the first 2,000 characters are sent)
        summary = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this document in 2-3 sentences:\n\n{doc[:2000]}"
            }]
        ).content[0].text

        # Embed the summary together with the full text so the summary
        # influences retrieval; keep both parts in metadata as well
        db.add_document(
            text=f"{summary}\n\n{doc}",
            metadata={"summary": summary, "original_text": doc}
        )

    return db
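
For a summary to affect retrieval, it has to be part of the text that gets embedded. One way to arrange that, sketched here with a pluggable summarizer so it runs offline, is to build a separate `embed_text` field per chunk. The record layout and the `first_sentence` stand-in summarizer are assumptions for illustration.

```python
from typing import Callable, Dict, List

def build_summary_records(chunks: List[str],
                          summarize: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair each chunk with a summary and the combined text to embed."""
    records = []
    for chunk in chunks:
        summary = summarize(chunk)
        records.append({
            "text": chunk,                          # returned to the model as context
            "summary": summary,                     # kept for inspection and debugging
            "embed_text": f"{summary}\n\n{chunk}",  # what the embedding model sees
        })
    return records

def first_sentence(text: str) -> str:
    """Hypothetical stand-in summarizer: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."

records = build_summary_records(
    ["Refunds take five days. Contact support for delays."],
    first_sentence,
)
```

In a real pipeline, `summarize` would call a model, and `embed_text` would be passed to the embedding client in place of the raw chunk.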

Level 3: Adding Re-Ranking

Re-ranking retrieves a larger candidate set, then has Claude order the candidates by relevance to the query. In our evaluation, adding it raised MRR from 0.80 to 0.87:

import re

def rerank_chunks(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""Query: {query}
Rank the chunks below by their relevance to the query.
Return only the chunk indices, most relevant first, separated by commas (for example: 2, 0, 1).
Chunks:
"""

    for i, chunk in enumerate(chunks):
        prompt += f"\n[{i}]: {chunk[:500]}..."

    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the ranked indices; tolerate commas or brackets, and drop
    # out-of-range or repeated indices
    ranked_indices = []
    for match in re.findall(r"\d+", response.content[0].text):
        i = int(match)
        if i < len(chunks) and i not in ranked_indices:
            ranked_indices.append(i)

    # Fall back to the original retrieval order if nothing usable came back
    if not ranked_indices:
        ranked_indices = list(range(len(chunks)))
    return [chunks[i] for i in ranked_indices[:top_k]]

Putting It All Together

def advanced_rag_query(query: str, vector_db: SimpleVectorDB) -> str:
    # Initial retrieval (get more candidates for re-ranking)
    initial_results = vector_db.search(query, k=10)
    initial_chunks = [text for text, score in initial_results]
    
    # Re-rank with Claude
    top_chunks = rerank_chunks(query, initial_chunks, top_k=3)
    context = "\n\n---\n\n".join(top_chunks)
    
    # Generate answer
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text

Results: The Impact of Each Improvement

Our systematic improvements yielded measurable gains:

Metric	Basic RAG	+Summary Index	+Re-Ranking
Avg Precision	0.43	0.43	0.44
Avg Recall	0.66	0.67	0.69
Avg F1 Score	0.52	0.53	0.54
Avg MRR	0.74	0.80	0.87
End-to-End Accuracy	71%	76%	81%

Production Considerations

Rate Limits: Full evaluations can hit API rate limits. Use Tier 2+ accounts for production workloads.
Token Budget: Summary indexing and re-ranking increase token usage. Monitor costs.
Vector Database: Replace the in-memory DB with Pinecone, Weaviate, or Qdrant for production.
Caching: Cache embeddings and common queries to reduce API calls.
Monitoring: Log all queries, retrievals, and generations for debugging and improvement.
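
As a sketch of the caching point, an embedding cache can wrap any embedding function and key results by a hash of the input text. `EmbeddingCache` and `fake_embed` are hypothetical names. The cache here is an in-process dict, so a production version would likely need persistent storage and an eviction policy.

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Memoize an embedding function by the SHA-256 hash of the input text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the wrapped function was actually called

    def embed(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding API call."""
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
cache.embed("refund policy")
cache.embed("refund policy")  # served from the cache, no second call
```

After the two calls above, `cache.misses` is 1 because the second lookup never reaches `fake_embed`.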

Key Takeaways

Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline
Summary indexing preserves document context; in our evaluation it raised MRR from 0.74 to 0.80 and accuracy from 71% to 76%, while recall moved only from 0.66 to 0.67
Re-ranking with Claude raised MRR further (from 0.80 to 0.87), making it more likely that the most relevant context reaches the model
End-to-end accuracy improved 10 percentage points (71% to 81%) through these optimizations
Build a synthetic evaluation dataset before optimizing—you can't improve what you can't measure