# Building and Optimizing RAG Systems with Claude: A Practical Guide

This guide teaches you to build a RAG system using Claude, evaluate its performance with precision, recall, and accuracy metrics, and implement advanced optimizations like summary indexing and re-ranking that improve end-to-end accuracy from 71% to 81%.
Claude excels at general tasks but may struggle with domain-specific queries about your business context. Retrieval Augmented Generation (RAG) solves this by enabling Claude to access your internal knowledge bases, documents, and support materials. Enterprises use RAG to enhance customer support, analyze financial/legal documents, and answer internal questions.
In this guide, we'll walk through building a production-ready RAG system using Claude Documentation as our knowledge base, complete with evaluation frameworks and optimization techniques.
## Why RAG Matters for Claude Users
RAG bridges the gap between Claude's general knowledge and your specific domain expertise. Instead of retraining models or fine-tuning, you can dynamically retrieve relevant information from your documents and feed it to Claude as context. This approach is:
- Cost-effective: No model retraining required
- Updatable: Simply add new documents to your knowledge base
- Transparent: You can trace answers back to source materials
- Accurate: Reduces hallucinations by grounding responses in your data
## Prerequisites and Setup
Before building your RAG system, you'll need:
- API Keys: an Anthropic API key (for Claude) and a Voyage AI API key (for embeddings)
- Required Libraries:

```python
# Install required packages
!pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```

```python
# Import libraries
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json
from typing import List, Dict, Tuple
```

- Initialize Clients:

```python
# Initialize API clients
client = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyageai-key")
```
```python
# Simple in-memory vector database class
class VectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []
        self.metadata = []

    def add_document(self, text: str, metadata: dict = None):
        """Add document and generate embedding"""
        # The Voyage client expects a list of texts, even for a single document
        embedding = vo.embed([text], model="voyage-2").embeddings[0]
        self.documents.append(text)
        self.embeddings.append(embedding)
        self.metadata.append(metadata or {})

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float, dict]]:
        """Search for similar documents"""
        query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        indices = np.argsort(similarities)[::-1][:k]
        return [(self.documents[i], similarities[i], self.metadata[i])
                for i in indices]
```
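The core of the search step is just cosine similarity plus an `argsort`. Here is a minimal, dependency-light sketch of that ranking logic on toy 3-dimensional vectors — `cosine_rank` is an illustrative helper, not part of any library, and real embeddings would of course be much higher-dimensional:

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs, k=3):
    """Rank document vectors by cosine similarity to the query vector."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    # Sort descending and keep the indices of the top-k documents
    return np.argsort(sims)[::-1][:k]

# Toy "embeddings": doc 0 points the same way as the query, doc 2 almost does
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]
print(cosine_rank(query, docs, k=2))  # doc 0 first, then doc 2
```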
## Level 1: Building a Basic RAG Pipeline

A basic ("Naive") RAG pipeline consists of three steps:

### 1. Document Chunking

Chunk documents by logical sections (like headings) to maintain context:
```python
def chunk_by_heading(document_text: str) -> List[Dict]:
    """Simple chunking by heading sections"""
    chunks = []
    current_chunk = {"heading": "", "content": ""}
    for line in document_text.split('\n'):
        if line.startswith('#'):  # Markdown heading
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {"heading": line.strip('# '), "content": ""}
        else:
            current_chunk["content"] += line + "\n"
    if current_chunk["content"]:
        chunks.append(current_chunk)
    return chunks
```
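To see the chunker's output shape, here is a self-contained run on a toy markdown snippet (the function is repeated verbatim so the example executes on its own):

```python
from typing import List, Dict

def chunk_by_heading(document_text: str) -> List[Dict]:
    """Simple chunking by heading sections"""
    chunks = []
    current_chunk = {"heading": "", "content": ""}
    for line in document_text.split('\n'):
        if line.startswith('#'):  # Markdown heading
            if current_chunk["content"]:
                chunks.append(current_chunk)
            current_chunk = {"heading": line.strip('# '), "content": ""}
        else:
            current_chunk["content"] += line + "\n"
    if current_chunk["content"]:
        chunks.append(current_chunk)
    return chunks

sample = "# Rate Limits\nRequests are capped per minute.\n# Errors\n429 means you were throttled."
chunks = chunk_by_heading(sample)
print([c["heading"] for c in chunks])  # ['Rate Limits', 'Errors']
```

Each chunk carries its heading alongside its body text, which is what lets the indexing step below prepend the heading for extra context.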
### 2. Embedding Generation
Generate embeddings for each chunk using Voyage AI:
```python
def index_documents(documents: List[Dict]) -> VectorDB:
    """Index documents in vector database"""
    db = VectorDB()
    for doc in documents:
        # Combine heading and content for better context
        text = f"{doc['heading']}\n{doc['content']}"
        db.add_document(text, {"heading": doc["heading"]})
    return db
```
### 3. Query and Response Generation
Retrieve relevant chunks and generate answers with Claude:
```python
def basic_rag_query(db: VectorDB, query: str) -> str:
    """Execute RAG query with Claude"""
    # Retrieve relevant chunks
    results = db.search(query, k=3)

    # Build context
    context = "\n\n".join([f"## {res[2].get('heading', '')}\n{res[0]}"
                           for res in results])

    # Generate response with Claude
    prompt = f"""You are a helpful assistant answering questions based on the provided context.

Context:
{context}

Question: {query}

Answer based only on the context provided. If the answer isn't in the context, say "I don't have enough information to answer that question."

Answer:"""

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
## Building an Evaluation System
Moving beyond "vibes-based" evaluation is crucial for production RAG systems. We need to measure both retrieval performance and end-to-end accuracy.
### Creating an Evaluation Dataset
Create a synthetic dataset with:
- Questions
- Relevant document chunks (expected retrieval)
- Correct answers
```python
# Example evaluation dataset structure
evaluation_data = [
    {
        "question": "How do I set up API rate limits?",
        "relevant_chunks": ["chunk_id_1", "chunk_id_2"],
        "correct_answer": "API rate limits are configured in the dashboard..."
    }
    # ... 99 more samples
]
```
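With a dataset in this shape, the scoring loop is a straightforward comparison of retrieved chunk IDs against the expected ones. Here is a self-contained sketch using hard-coded mock retrieval results in place of real vector-DB output; the questions and chunk IDs are illustrative:

```python
evaluation_data = [
    {"question": "How do I set up API rate limits?",
     "relevant_chunks": ["chunk_1", "chunk_2"]},
    {"question": "How do I rotate my API key?",
     "relevant_chunks": ["chunk_7"]},
]
# Stand-ins for db.search() output: the chunk IDs each query actually retrieved
mock_retrievals = [["chunk_1", "chunk_9", "chunk_2"], ["chunk_3", "chunk_7", "chunk_8"]]

precisions, recalls = [], []
for sample, retrieved in zip(evaluation_data, mock_retrievals):
    relevant = set(sample["relevant_chunks"])
    hits = len(set(retrieved) & relevant)       # true positives
    precisions.append(hits / len(retrieved))    # how much of what we fetched was useful
    recalls.append(hits / len(relevant))        # how much of what we needed was fetched

print(round(sum(precisions) / len(precisions), 3))  # 0.5
print(round(sum(recalls) / len(recalls), 3))        # 1.0
```

In a real run, `mock_retrievals` would come from calling `db.search()` on each question and collecting the chunk IDs of the results.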
### Key Evaluation Metrics

#### Retrieval Metrics:

- Precision: Proportion of retrieved chunks that are relevant

```python
def calculate_precision(retrieved: List[str], relevant: List[str]) -> float:
    relevant_set = set(relevant)
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set.intersection(relevant_set))
    return true_positives / len(retrieved_set) if retrieved_set else 0
```

- Recall: Proportion of relevant chunks that were retrieved

```python
def calculate_recall(retrieved: List[str], relevant: List[str]) -> float:
    relevant_set = set(relevant)
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set.intersection(relevant_set))
    return true_positives / len(relevant_set) if relevant_set else 0
```
- F1 Score: Harmonic mean of precision and recall
- Mean Reciprocal Rank (MRR): Measures how high the first relevant result appears
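The last two metrics can be sketched in a few lines. `f1_score` and `mean_reciprocal_rank` are illustrative helpers that mirror the standard definitions: F1 is the harmonic mean of precision and recall, and MRR averages 1/rank of the first relevant result across queries:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(retrieved_lists, relevant_lists) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 if none found)."""
    scores = []
    for retrieved, relevant in zip(retrieved_lists, relevant_lists):
        relevant_set = set(relevant)
        rr = 0.0
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant_set:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores)

print(f1_score(0.5, 1.0))  # ≈ 0.667
# First query: first relevant hit at rank 2 → 0.5; second query: rank 1 → 1.0
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [["b"], ["x"]]))  # 0.75
```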
## Level 2: Summary Indexing
Basic RAG can miss broader context. Summary indexing adds hierarchical structure:
```python
def create_summary_index(documents: List[Dict]) -> Tuple[VectorDB, VectorDB]:
    """Create two-level index: summaries and detailed chunks"""
    summary_db = VectorDB()
    detail_db = VectorDB()
    for doc in documents:
        # Create summary using Claude
        summary_prompt = f"Summarize this document section in 2-3 sentences:\n\n{doc['content']}"
        summary_response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        summary = summary_response.content[0].text

        # Index summary (assumes each doc dict carries a unique "id" field)
        summary_db.add_document(
            f"{doc['heading']}\n{summary}",
            {"type": "summary", "doc_id": doc["id"]}
        )

        # Index detailed content
        detail_db.add_document(
            f"{doc['heading']}\n{doc['content']}",
            {"type": "detail", "doc_id": doc["id"]}
        )
    return summary_db, detail_db
```
Two-stage retrieval process:
- First search the summary index to identify relevant topics
- Then retrieve detailed chunks from those topics
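The two-stage flow can be sketched without a real vector store. Assume the summary search has already returned a ranked list of `doc_id`s; the second stage then restricts detail retrieval to chunks belonging to those documents. `two_stage_filter`, `summary_hits`, and `detail_chunks` are hypothetical stand-ins for the two databases' search results:

```python
from typing import List, Dict

def two_stage_filter(summary_hits: List[str], detail_chunks: List[Dict],
                     top_docs: int = 2) -> List[Dict]:
    """Keep only detail chunks whose doc_id appears in the top summary hits."""
    allowed = set(summary_hits[:top_docs])
    return [chunk for chunk in detail_chunks if chunk["doc_id"] in allowed]

summary_hits = ["doc_a", "doc_c", "doc_b"]  # ranked doc_ids from the summary index
detail_chunks = [
    {"doc_id": "doc_a", "text": "rate limit details"},
    {"doc_id": "doc_b", "text": "auth details"},
    {"doc_id": "doc_c", "text": "error code details"},
]
filtered = two_stage_filter(summary_hits, detail_chunks)
print([c["doc_id"] for c in filtered])  # ['doc_a', 'doc_c']
```

In a full implementation, the detail-stage search would still rank the surviving chunks by similarity to the query; the filter only narrows the candidate pool.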
## Level 3: Summary Indexing with Re-Ranking
Add a re-ranking step using Claude to improve result ordering:
```python
def rerank_with_claude(query: str, candidates: List[Tuple[str, float, dict]]) -> List[Tuple[str, float, dict]]:
    """Use Claude to re-rank retrieved documents"""
    if len(candidates) <= 1:
        return candidates

    # Prepare documents for ranking
    docs_text = "\n\n".join([
        f"[Document {i+1}]\n{candidates[i][0]}"
        for i in range(len(candidates))
    ])

    ranking_prompt = f"""Rank these documents by relevance to the query.

Query: {query}

Documents:
{docs_text}

Return ONLY a comma-separated list of document numbers in order of relevance (most relevant first).
Example: 3,1,2"""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": ranking_prompt}]
    )

    # Parse ranking and reorder
    ranking = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [candidates[i] for i in ranking]
```
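Even with a "return ONLY the numbers" instruction, the model's reply may contain stray text, duplicates, or out-of-range numbers, and a naive `int(x)` parse will then crash. A defensive parser is cheap insurance; `parse_ranking` below is a hypothetical helper, not part of any API:

```python
import re
from typing import List

def parse_ranking(reply: str, n_docs: int) -> List[int]:
    """Extract 0-based document indices from a model reply like '3, 1, 2'.

    Ignores out-of-range or duplicate numbers and appends any documents
    the model omitted, so the result is always a permutation of range(n_docs).
    """
    order = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1  # model speaks in 1-based document numbers
        if 0 <= idx < n_docs and idx not in order:
            order.append(idx)
    # Fall back to original order for anything the model left out
    order.extend(i for i in range(n_docs) if i not in order)
    return order

print(parse_ranking("Ranking: 3, 1, 2", 3))  # [2, 0, 1]
print(parse_ranking("2", 3))                 # [1, 0, 2]
```

Swapping this in for the bare `int(x.strip())` parse keeps a malformed reply from taking down the whole query.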
## Performance Improvements
Through these optimizations, we achieved significant gains:
| Metric | Basic RAG | Optimized RAG | Improvement |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | +2.3% |
| Avg Recall | 0.66 | 0.69 | +4.5% |
| Avg F1 Score | 0.52 | 0.54 | +3.8% |
| Avg MRR | 0.74 | 0.87 | +17.6% |
| End-to-End Accuracy | 71% | 81% | +14.1% |
## Production Considerations
- Vector Database Choice: For production, use hosted solutions like Pinecone, Weaviate, or pgvector instead of in-memory storage.
- Chunking Strategy: Experiment with different chunk sizes (200-1000 tokens) and overlap strategies.
- Embedding Models: Test different models (Voyage, OpenAI, Cohere) for your specific domain.
- Hybrid Search: Combine semantic search with keyword matching for better recall.
- Caching: Cache embeddings and frequent queries to reduce costs and latency.
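For embedding caching specifically, even a minimal in-process memoization layer avoids re-embedding identical texts. A sketch, where `embed_fn` is a stand-in for the real embedding API call:

```python
class EmbeddingCache:
    """Memoize embeddings so repeated texts don't trigger repeat API calls."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # stand-in for the real embedding API call
        self.store = {}
        self.misses = 0

    def get(self, text: str):
        if text not in self.store:
            self.misses += 1
            self.store[text] = self.embed_fn(text)
        return self.store[text]

# Fake embedder for demonstration; real code would call the embeddings API
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("hello")
cache.get("hello")   # second call is served from the cache
print(cache.misses)  # 1
```

In production you would back the dictionary with a persistent store (e.g. Redis or a database table keyed by a hash of the text) so the cache survives restarts.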
## Key Takeaways
- Start with Basic RAG: Implement a simple pipeline first (chunk → embed → retrieve) before adding complexity.
- Measure Systematically: Use precision, recall, F1, MRR, and end-to-end accuracy metrics—don't rely on subjective evaluation.
- Optimize Retrieval First: Poor retrieval can't be fixed by the LLM. Focus on getting the right documents before improving answer generation.
- Add Hierarchy with Summaries: Summary indexing helps Claude understand broader context and improves retrieval of related information.
- Re-rank with Claude: Use Claude's understanding of relevance to improve the order of retrieved documents, significantly boosting MRR and final answer quality.