Building a Production-Ready RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, and see how targeted improvements boosted accuracy from 71% to 81%.
Claude excels at general-purpose language tasks, but when you need answers grounded in your proprietary knowledge base—internal documentation, customer support articles, or financial reports—you need Retrieval Augmented Generation (RAG). RAG bridges the gap between Claude's broad capabilities and your specific domain context.
In this guide, we'll build a RAG system using Claude and Voyage AI embeddings, using the Claude Documentation as our knowledge base. We'll start with a basic "naive" pipeline, then layer in advanced techniques like summary indexing and re-ranking. Along the way, we'll build a proper evaluation suite to measure what matters.
By the end, you'll understand how to achieve significant performance gains: our final system improved end-to-end accuracy from 71% to 81%, with Mean Reciprocal Rank jumping from 0.74 to 0.87.
What You'll Need
Before we start, gather your tools:
- Anthropic API key – for accessing Claude
- Voyage AI API key – for generating high-quality embeddings
- Python environment with these libraries:
  - anthropic
  - voyageai
  - pandas, numpy, matplotlib, scikit-learn
```python
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
```
Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG pipeline. This is often called "Naive RAG" in the industry. It follows three steps:
- Chunk documents – Split your knowledge base into manageable pieces. Here, we chunk by heading, keeping content under each subheading together.
- Embed each chunk – Use Voyage AI to convert text chunks into vector embeddings.
- Retrieve by cosine similarity – When a query comes in, embed it, find the most similar chunks, and feed them to Claude as context.
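As a minimal sketch of step 1, here is a heading-based chunker. The `##`-heading splitting rule is an illustrative assumption, not necessarily the exact rule used for the Claude Documentation:

```python
import re

def chunk_by_heading(markdown_text):
    """Split a markdown document into chunks, one per heading section."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # Start a new chunk whenever we hit a markdown heading line.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Everything under a given heading stays together, which preserves local context for the embedding step.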
Initialize an In-Memory Vector Database
For this example, we'll use a simple in-memory vector store. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
class SimpleVectorDB:
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add_chunk(self, text, embedding):
        self.chunks.append(text)
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.chunks[i], similarities[i]) for i in top_indices]
```
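To see the store in action without any API calls, here is a toy example with hand-written 2-D vectors standing in for real Voyage AI embeddings (the class is repeated so the snippet runs standalone):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class SimpleVectorDB:  # same class as above, repeated for a standalone demo
    def __init__(self):
        self.chunks = []
        self.embeddings = []

    def add_chunk(self, text, embedding):
        self.chunks.append(text)
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.chunks[i], similarities[i]) for i in top_indices]

db = SimpleVectorDB()
db.add_chunk("Rate limits are per-minute.", [1.0, 0.0])
db.add_chunk("Error codes start at 400.", [0.0, 1.0])

# A query vector close to the first chunk's embedding ranks it first.
results = db.search([0.9, 0.1], top_k=2)
print(results[0][0])  # → Rate limits are per-minute.
```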
The Basic RAG Loop
```python
def basic_rag(query, vector_db, voyage_client, claude_client):
    # Step 1: Embed the query with Voyage AI
    query_embedding = voyage_client.embed(
        [query], model="voyage-2", input_type="query"
    ).embeddings[0]

    # Step 2: Retrieve relevant chunks
    retrieved = vector_db.search(query_embedding, top_k=3)
    context = "\n\n".join([chunk for chunk, _ in retrieved])

    # Step 3: Generate answer with Claude
    response = claude_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
This works, but how well? To answer that, we need an evaluation system.
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need quantitative metrics that measure both retrieval quality and end-to-end answer correctness.
The Evaluation Dataset
We synthetically generated 100 test samples, each containing:
- A question
- The correct chunks (ground truth) that should be retrieved
- A correct answer
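A single evaluation sample might look like the following. The field names and values are illustrative, not a required schema:

```python
sample = {
    "question": "How do I set a max_tokens limit in the Messages API?",
    "correct_chunk_ids": [12, 13],   # IDs of the ground-truth chunks
    "correct_answer": "Pass max_tokens in the request body.",
}
```

Keeping ground-truth chunk IDs alongside the answer lets you score retrieval and generation independently.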
Retrieval Metrics
#### Precision
What it measures: Of all chunks retrieved, how many were actually relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$
High precision means you're not wasting Claude's context window on irrelevant information.
#### Recall
What it measures: Of all relevant chunks in the database, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
High recall ensures Claude has all the information it needs.
#### F1 Score
The harmonic mean of precision and recall. Balances both concerns.
#### Mean Reciprocal Rank (MRR)
What it measures: How early in the retrieval results does the first relevant chunk appear?

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
MRR is critical for RAG because Claude's context window is limited—you want the most relevant information first.
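All four retrieval metrics can be computed directly from chunk-ID sets. A minimal sketch, assuming `retrieved_ids` is in rank order (MRR is then the mean of `rr` across all test samples):

```python
def retrieval_metrics(retrieved_ids, correct_ids):
    """Compute precision, recall, F1, and reciprocal rank for one sample."""
    retrieved, correct = set(retrieved_ids), set(correct_ids)
    hits = retrieved & correct
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Reciprocal rank of the first relevant chunk (ranks are 1-based).
    rr = 0.0
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in correct:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": rr}
```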
End-to-End Accuracy
This measures whether Claude's final answer is correct given the retrieved context. It's the ultimate test: does the system actually help users?
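One common way to score end-to-end accuracy is to have Claude itself grade each answer against the ground truth. A sketch of that idea; the prompt wording and the choice of `claude-3-haiku-20240307` as grader are assumptions, not necessarily the exact setup used here:

```python
def grade_answer(question, generated, reference, claude_client):
    """Return True if Claude judges the generated answer correct."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {generated}\n\n"
        "Does the candidate answer convey the same facts as the reference? "
        "Reply with exactly CORRECT or INCORRECT."
    )
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```

Accuracy is then the fraction of test samples graded correct.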
Level 2: Summary Indexing
Basic RAG has a problem: a single chunk might not contain enough context. For example, a chunk about "rate limits" might not mention that it's part of a larger section on "API best practices."
Summary indexing solves this by creating a secondary index of chunk summaries. When a query comes in, you first search the summary index to find the right neighborhood, then retrieve the full chunks.

```python
def build_summary_index(chunks, claude_client):
    summaries = []
    for chunk in chunks:
        summary = claude_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": f"Summarize this in 1-2 sentences: {chunk}"
            }]
        )
        summaries.append(summary.content[0].text)
    return summaries
```
Then, during retrieval:
- Embed the query and search the summary index.
- Retrieve the full chunks corresponding to the top summary matches.
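The two retrieval steps above can be sketched as one lookup function, assuming position `i` in the summary index corresponds to `full_chunks[i]` (the alignment is an assumption of this sketch):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def summary_search(query_embedding, summary_embeddings, full_chunks, top_k=3):
    """Search the summary index, but return the corresponding full chunks."""
    sims = cosine_similarity([query_embedding], summary_embeddings)[0]
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [full_chunks[i] for i in top_indices]
```

The summaries steer retrieval, but Claude still sees the full chunk text as context.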
Level 3: Summary Indexing + Re-Ranking
Even with summary indexing, the top-3 retrieved chunks might not be in the optimal order. Re-ranking uses Claude itself to reorder the retrieved chunks by relevance to the query.
```python
def rerank_chunks(query, chunks, claude_client):
    prompt = f"""Given the query: "{query}"
Rank the following chunks by relevance (most relevant first).
Return only the chunk numbers in order, separated by commas.

Chunks:
{chr(10).join([f'{i}: {chunk[:200]}...' for i, chunk in enumerate(chunks)])}
"""
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ordered indices, skipping anything that isn't a valid index
    ordered_indices = [
        int(x.strip()) for x in response.content[0].text.split(",")
        if x.strip().isdigit() and int(x.strip()) < len(chunks)
    ]
    return [chunks[i] for i in ordered_indices]
```
Re-ranking dramatically improved our MRR from 0.74 to 0.87, meaning the most relevant chunk almost always appears first.
Results at a Glance
| Metric | Basic RAG | Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 75% | 81% |
The gains came from two sources:
- Summary indexing improving recall (finding more relevant chunks)
- Re-ranking improving MRR (putting the best chunk first), which directly boosted end-to-end accuracy
Production Considerations
- Rate limits – Full evaluations can hit rate limits unless you're on Tier 2 or above. Consider running smaller eval sets during development.
- Vector database – Our in-memory DB is fine for prototyping. For production, use a scalable solution.
- Chunking strategy – Experiment with different chunk sizes and overlap. We found heading-based chunking worked well for documentation.
- Embedding model – Voyage AI provides domain-specific embeddings. Test different models for your use case.
Key Takeaways
- Evaluate retrieval and generation separately – Use precision, recall, F1, and MRR for retrieval; end-to-end accuracy for the full system.
- Summary indexing improves recall – By searching summaries first, you find relevant chunks that might be missed by embedding similarity alone.
- Re-ranking with Claude boosts MRR significantly – Putting the most relevant chunk first improves Claude's answers because it sees the best context immediately.
- Small improvements compound – A 0.13 increase in MRR translated to a 10% absolute gain in end-to-end accuracy.
- Build your evaluation dataset early – Synthetic data generation works well for initial development. Iterate on real user queries as you mature.