Building a Production-Grade RAG System with Claude: From Basic to Advanced
Learn how to build and optimize a Retrieval Augmented Generation (RAG) system with Claude, including evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG enables Claude to answer domain-specific questions with accuracy and context.
In this guide, we'll walk through building and optimizing a RAG system using Claude and Voyage AI embeddings. We'll start with a basic pipeline, then layer in advanced techniques that measurably improve performance.
What You'll Learn
- How to set up a basic RAG pipeline with Claude
- How to build a robust evaluation suite (beyond "vibes")
- How to implement summary indexing for better retrieval
- How to use Claude as a re-ranker for improved precision
- How to measure and optimize key metrics: Precision, Recall, F1, MRR, and End-to-End Accuracy
Prerequisites
To follow along, you'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, and `scikit-learn` installed
Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG implementation. This is often called "Naive RAG" in the industry, and it consists of three steps:
- Chunk documents by heading (each chunk contains content from one subheading)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
Setting Up the Vector Database
We'll use an in-memory vector database for simplicity. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        texts = [doc["text"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Voyage embeddings are length-normalized, so the dot product
        # is equivalent to cosine similarity here.
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [
            {**self.documents[i], "score": similarities[i]}
            for i in top_indices
        ]
```
The Basic RAG Query Function
Once your vector DB is populated, querying Claude with retrieved context is straightforward:
```python
from anthropic import Anthropic

anthropic = Anthropic(api_key="your-anthropic-key")

def query_with_rag(query: str, vector_db: InMemoryVectorDB, k: int = 3) -> str:
    # Step 1: Retrieve relevant chunks
    results = vector_db.search(query, k=k)
    context = "\n\n".join([r["text"] for r in results])

    # Step 2: Build prompt with context
    prompt = f"""Answer the question based on the following context.

Context:
{context}

Question: {query}

Answer:"""

    # Step 3: Query Claude
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
This works, but how well does it actually perform? Let's find out.
Building an Evaluation System
To improve your RAG system, you need to measure it. We'll build an evaluation suite that separates retrieval performance from end-to-end answer quality.
Creating a Synthetic Evaluation Dataset
We'll generate 100 evaluation samples, each containing:
- A question
- Relevant chunks (ground truth for retrieval)
- A correct answer (ground truth for end-to-end)
```python
import json

# Load the evaluation dataset (pre-generated)
with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first sample
print(json.dumps(eval_data[0], indent=2))
```
Defining Key Metrics
We'll track five metrics:
#### Precision

Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$
High precision means fewer false positives (irrelevant chunks).
#### Recall

Recall answers: "Of all the correct chunks that exist, how many did we retrieve?"
$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
High recall means we're not missing important information.
#### F1 Score

The harmonic mean of precision and recall:
$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
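These three formulas can be sanity-checked in a few lines of Python. The chunk ids below are made up purely for illustration:

```python
def retrieval_scores(retrieved, correct):
    """Compute precision, recall, and F1 for a single query."""
    retrieved, correct = set(retrieved), set(correct)
    tp = len(retrieved & correct)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(correct) if correct else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Retrieved {A, B, C} against ground truth {A, D}: one true positive,
# so precision = 1/3, recall = 1/2, and F1 = 0.4.
p, r, f1 = retrieval_scores({"A", "B", "C"}, {"A", "D"})
```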
#### Mean Reciprocal Rank (MRR)

MRR measures how early the first relevant chunk appears in your results. If the first relevant chunk is at position 1, the reciprocal rank is 1. If it's at position 3, it's 1/3.
$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
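A small helper makes the definition concrete (the ids and rankings below are invented for illustration):

```python
def mean_reciprocal_rank(ranked_results, correct_sets):
    """ranked_results: one ranked list of chunk ids per query.
    correct_sets: the set of relevant chunk ids per query."""
    rr = []
    for ranked, correct in zip(ranked_results, correct_sets):
        # Reciprocal rank of the first relevant hit; 0 if none appear.
        rr.append(next((1 / (i + 1) for i, cid in enumerate(ranked)
                        if cid in correct), 0.0))
    return sum(rr) / len(rr)

# Query 1: first relevant hit at rank 1 -> RR = 1.
# Query 2: first relevant hit at rank 3 -> RR = 1/3.
mrr = mean_reciprocal_rank([["a", "b"], ["x", "y", "a"]],
                           [{"a"}, {"a"}])
```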
#### End-to-End Accuracy

This measures whether Claude's final answer is correct, as judged by a human or an LLM evaluator.
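One way to automate the judging step is to have Claude itself grade each answer against the ground truth. The sketch below is illustrative rather than the original pipeline's code: the prompt wording and `judge_answer` helper are assumptions, and `client` stands for an `anthropic.Anthropic` instance like the one created earlier:

```python
JUDGE_PROMPT = """You are grading a RAG system's answer.

Question: {question}
Reference answer: {reference}
Model answer: {answer}

Reply with exactly one word: CORRECT or INCORRECT."""

def parse_verdict(text: str) -> bool:
    """Map the judge's one-word reply to a boolean."""
    return text.strip().upper().startswith("CORRECT")

def judge_answer(client, question: str, reference: str, answer: str) -> bool:
    """Ask Claude to grade a generated answer against the reference."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return parse_verdict(response.content[0].text)
```

End-to-end accuracy is then the fraction of evaluation samples for which `judge_answer` returns `True`.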
Implementing the Evaluation
```python
def evaluate_retrieval(vector_db, eval_data, k=3):
    precisions = []
    recalls = []
    f1_scores = []
    reciprocal_ranks = []

    for sample in eval_data:
        query = sample["question"]
        correct_chunks = set(sample["relevant_chunks"])

        # Retrieve
        results = vector_db.search(query, k=k)
        retrieved_chunks = set([r["id"] for r in results])

        # Calculate metrics
        true_positives = len(retrieved_chunks & correct_chunks)
        precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0
        recall = true_positives / len(correct_chunks) if correct_chunks else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of the first relevant chunk (0 if none retrieved)
        for i, r in enumerate(results):
            if r["id"] in correct_chunks:
                reciprocal_ranks.append(1 / (i + 1))
                break
        else:
            reciprocal_ranks.append(0)

        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1_scores),
        "avg_mrr": np.mean(reciprocal_ranks)
    }
```
Level 2: Summary Indexing
Basic RAG struggles when a single chunk doesn't contain enough context. Summary indexing solves this by creating a "summary chunk" for each section that includes both the heading and a condensed version of the content.
How It Works
Instead of chunking by subheading only, we:
- Create a summary of each major section using Claude
- Store both the summary and the original chunks
- Retrieve against summaries first, then use the full chunks as context
```python
def create_summary_index(documents, anthropic_client):
    summary_index = []
    for doc in documents:
        prompt = f"""Summarize the following document section in 2-3 sentences:

{doc['text']}

Summary:"""
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{"role": "user", "content": prompt}]
        )
        summary = response.content[0].text
        summary_index.append({
            "id": doc["id"],
            "summary": summary,
            "full_text": doc["text"]
        })
    return summary_index
```
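The retrieval side of this scheme works by embedding the summaries, searching against them, and substituting the stored full text at query time. One way to wire that up is sketched below; `search_fn` is a stand-in for something like `InMemoryVectorDB.search`, and the stub and sample index exist only so the example runs without API calls:

```python
def search_summaries(search_fn, summary_index, query, k=3):
    """Retrieve against summary embeddings, then swap in the full chunk text.

    search_fn: callable(query, k) -> list of {"id": ..., "score": ...}
    summary_index: output of create_summary_index above.
    """
    full_text = {entry["id"]: entry["full_text"] for entry in summary_index}
    hits = search_fn(query, k)
    return [{**hit, "text": full_text[hit["id"]]} for hit in hits]

# Demo with a stub search function (no API calls).
index = [{"id": 1, "summary": "s1", "full_text": "long section one"},
         {"id": 2, "summary": "s2", "full_text": "long section two"}]

def stub_search(query, k):
    return [{"id": 2, "score": 0.9}]

hits = search_summaries(stub_search, index, "anything", k=1)
```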
This improved our recall from 0.66 to 0.69 and F1 from 0.52 to 0.54.
Level 3: Summary Indexing + Re-Ranking
Re-ranking takes the initial retrieval results and uses Claude to reorder them by relevance. This dramatically improves MRR.
Implementing a Re-Ranker
```python
def rerank_with_claude(query, candidates, anthropic_client, top_k=3):
    # Build a prompt asking Claude to rank chunks by relevance
    chunks_text = "\n\n---\n\n".join([
        f"Chunk {i+1}: {c['text']}"
        for i, c in enumerate(candidates)
    ])
    prompt = f"""Given the question below, rank the following chunks by relevance.
Return the chunk numbers in order of relevance (most relevant first).

Question: {query}

{chunks_text}

Ranked chunk numbers (most relevant first):"""
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the response
    ranked_indices = [
        int(x.strip()) - 1
        for x in response.content[0].text.split(",")
        if x.strip().isdigit()
    ]
    # Reorder candidates
    reranked = [candidates[i] for i in ranked_indices if i < len(candidates)]
    return reranked[:top_k]
```
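The parsing step above assumes Claude returns a clean comma-separated list. In practice it is safer to extract digits defensively, drop duplicates, and fall back to the original order for anything the model failed to mention. The helper below is a hypothetical hardening of that step, not code from the original pipeline:

```python
import re
from typing import List

def parse_ranking(text: str, n_candidates: int) -> List[int]:
    """Extract 1-based chunk numbers from model output and return
    0-based indices: de-duplicated, out-of-range values dropped, and
    any unmentioned candidates appended in their original order."""
    seen = []
    for match in re.findall(r"\d+", text):
        idx = int(match) - 1
        if 0 <= idx < n_candidates and idx not in seen:
            seen.append(idx)
    seen.extend(i for i in range(n_candidates) if i not in seen)
    return seen

# "3, 1, 3" with 4 candidates -> [2, 0], then 1 and 3 appended.
order = parse_ranking("Ranked: 3, 1, 3", 4)
```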
This technique boosted MRR from 0.74 to 0.87 and end-to-end accuracy from 71% to 81%.
Results Summary
| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 71% | 81% |
Production Considerations
- Rate Limits: Full evaluations may hit rate limits unless you're on Tier 2+. Consider running smaller subsets during development.
- Vector Database: Use a hosted solution (Pinecone, Weaviate, Chroma) for production workloads.
- Embedding Model: Voyage AI's `voyage-2` is excellent, but experiment with other models for your domain.
- Chunking Strategy: Experiment with different chunk sizes and overlap strategies.
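As a starting point for chunking experiments, a simple sliding-window splitter with character-level overlap looks like this. The sizes are illustrative defaults, not values tuned in this guide:

```python
from typing import List

def chunk_with_overlap(text: str, chunk_size: int = 1000,
                       overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows with overlap, so
    sentences cut at a boundary also appear in the following chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
# -> ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```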
Key Takeaways
- Measure separately: Always evaluate retrieval performance independently from end-to-end answer quality. This helps you pinpoint where improvements are needed.
- Start simple, then optimize: A basic RAG pipeline works surprisingly well. Add complexity (summary indexing, re-ranking) only when metrics show a clear need.
- MRR matters most for user experience: Users care most about whether the first result is relevant. Re-ranking with Claude dramatically improves this metric.
- Synthetic evaluation datasets are powerful: Generate 100-200 Q&A pairs with ground truth chunks and answers. This gives you a repeatable benchmark for measuring improvements.
- Advanced techniques pay off: Summary indexing and re-ranking together improved end-to-end accuracy by 10 percentage points (from 71% to 81%).