Building a Production-Grade RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics, with code examples throughout.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for grounding Claude in your specific business context. While Claude excels at general knowledge tasks, it can struggle with queries that require access to your internal documentation, customer support articles, or proprietary data. RAG bridges this gap by dynamically retrieving relevant information from your knowledge base and feeding it into Claude's context window.
In this guide, you'll learn how to build, evaluate, and optimize a RAG system using Claude and Voyage AI embeddings. We'll start with a basic "naive" RAG pipeline and progressively enhance it with advanced techniques like summary indexing and re-ranking. By the end, you'll have a practical understanding of how to achieve significant performance gains—we'll show you how to improve end-to-end accuracy from 71% to 81%.
What You'll Need
Before diving in, make sure you have:
- API keys from Anthropic and Voyage AI
- Python libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
```python
import anthropic
import voyageai
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
```
Level 1: Basic RAG Pipeline
Let's start with the simplest approach—often called "Naive RAG." This three-step pipeline is the foundation everything else builds on:
- Chunk documents by heading (each subheading becomes a separate chunk)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks via cosine similarity when a query comes in
Step 1: Initialize Your Vector Store
For this example, we'll use an in-memory vector database. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, doc_id, content, embedding):
        self.documents.append({"id": doc_id, "content": content})
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        # Score the query against every stored embedding and return the top_k documents
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
Step 2: Chunk and Embed Your Documents
```python
vo = voyageai.Client(api_key="your-voyage-api-key")

def chunk_by_headings(document_text):
    # Simple splitting by markdown headings
    chunks = []
    current_heading = "Introduction"
    current_content = []
    for line in document_text.split("\n"):
        if line.startswith("##"):
            if current_content:
                chunks.append({"heading": current_heading, "content": "\n".join(current_content)})
            current_heading = line.replace("##", "").strip()
            current_content = []
        else:
            current_content.append(line)
    if current_content:
        chunks.append({"heading": current_heading, "content": "\n".join(current_content)})
    return chunks
```
```python
# Embed each chunk and add it to the vector store
vector_db = InMemoryVectorDB()
chunks = chunk_by_headings(claude_docs)  # claude_docs holds your raw documentation text
for i, chunk in enumerate(chunks):
    embedding = vo.embed([chunk["content"]], model="voyage-2").embeddings[0]
    vector_db.add_document(f"chunk_{i}", chunk["content"], embedding)
```
Step 3: Retrieve and Answer
```python
def basic_rag(query):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Retrieve top 3 chunks
    retrieved = vector_db.search(query_embedding, top_k=3)
    context = "\n\n".join([doc["content"] for doc in retrieved])

    # Generate answer with Claude
    client = anthropic.Anthropic(api_key="your-anthropic-api-key")
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```
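With those three pieces in place, a query runs end to end in a single call. A quick usage example (the question here is just an illustrative placeholder):

```python
# Ask a question against the indexed docs; any question about your corpus works here
answer = basic_rag("What is the maximum context window size?")
print(answer)
```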
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production RAG. You need to measure two things independently:
- Retrieval performance: How well does your system find the right chunks?
- End-to-end performance: How well does Claude answer questions given those chunks?
The Evaluation Dataset
We'll use a synthetically generated dataset of 100 samples. Each sample contains:
- A question
- Relevant chunks (the ground truth documents that should be retrieved)
- A correct answer
```python
import json

with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first few samples
for sample in eval_data[:3]:
    print(f"Q: {sample['question']}")
    print(f"Relevant chunks: {len(sample['relevant_chunks'])}")
    print(f"Answer: {sample['answer'][:100]}...")
    print("---")
```
Key Metrics Explained
#### Precision
What it measures: Of all the chunks you retrieved, how many were actually relevant?

$$\text{Precision} = \frac{\text{True Positives}}{\text{Total Retrieved}}$$
- High precision = few irrelevant chunks in your results
- Low precision = Claude gets distracting noise
#### Recall
What it measures: Of all the relevant chunks in the knowledge base, how many did you actually retrieve?

$$\text{Recall} = \frac{\text{True Positives}}{\text{Total Relevant}}$$
- High recall = Claude has all the information it needs
- Low recall = Claude might miss critical context
#### F1 Score
What it measures: The harmonic mean of precision and recall, giving a balanced view of retrieval quality.

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
#### Mean Reciprocal Rank (MRR)
What it measures: How high up in your results does the first relevant chunk appear?

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
- MRR of 1.0 = first result is always relevant
- Critical for user-facing applications where top results matter most
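To make the formula concrete, here is a small sketch of computing MRR from a list of per-query ranks; the helper name and sample data are illustrative, not part of the original code:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    # first_relevant_ranks[i] is the 1-indexed position of the first relevant
    # chunk for query i (None if no relevant chunk was retrieved at all)
    reciprocal_ranks = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Three queries: first relevant chunk at rank 1, at rank 2, and never retrieved
print(mean_reciprocal_rank([1, 2, None]))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```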
#### End-to-End Accuracy
What it measures: The final test: does Claude give the correct answer? This is evaluated by comparing Claude's response to the ground-truth answer.
```python
def evaluate_retrieval(retrieved_chunks, relevant_chunks):
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return {"precision": precision, "recall": recall, "f1": f1}
```
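Retrieval metrics cover only half the picture; end-to-end accuracy still needs a grader. One common approach, sketched below assuming the `client`, `eval_data`, and `basic_rag` objects from earlier, is to have a small Claude model judge each generated answer against the ground truth. The grading prompt here is illustrative, not the exact one behind the reported numbers:

```python
def evaluate_end_to_end(eval_data, rag_fn):
    correct = 0
    for sample in eval_data:
        generated = rag_fn(sample["question"])
        # Ask Claude to judge whether the generated answer matches the ground truth
        grading = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {sample['question']}\n\n"
                    f"Ground truth answer: {sample['answer']}\n\n"
                    f"Candidate answer: {generated}\n\n"
                    "Does the candidate answer convey the same information as the ground truth? "
                    "Reply with only 'yes' or 'no'."
                )
            }]
        )
        if grading.content[0].text.strip().lower().startswith("yes"):
            correct += 1
    return correct / len(eval_data)

# accuracy = evaluate_end_to_end(eval_data, basic_rag)
```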
Level 2: Summary Indexing
Basic RAG has a problem: chunks are often too granular. A single chunk might not contain enough context for Claude to understand the full picture. Summary indexing solves this by creating higher-level summaries of document sections.
How It Works
Instead of embedding raw chunks, you:
- Group related chunks under their parent heading
- Generate a summary of each group using Claude
- Embed and index the summaries
- Retrieve summaries, then pass the full chunk content to Claude
```python
def generate_summary(chunks, heading):
    combined = "\n".join([c["content"] for c in chunks])
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize the following section '{heading}' in 2-3 sentences:\n\n{combined}"
        }]
    )
    return response.content[0].text
```
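The indexing loop below relies on a `group_by_heading` helper that the original snippet doesn't define. A minimal sketch, assuming each chunk dict carries the `heading` key produced by `chunk_by_headings`:

```python
from collections import defaultdict

def group_by_heading(chunks):
    # Collect chunks that share a heading so each section can be summarized as a unit
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["heading"]].append(chunk)
    return groups.items()  # yields (heading, [chunks]) pairs
```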
```python
# Build the summary index
summary_index = {}
for heading, group_chunks in group_by_heading(all_chunks):
    summary = generate_summary(group_chunks, heading)
    summary_embedding = vo.embed([summary], model="voyage-2").embeddings[0]
    summary_index[heading] = {
        "summary": summary,
        "embedding": summary_embedding,
        "chunks": group_chunks
    }
```
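Step 4, retrieving against the summaries and then handing Claude the full underlying chunks, looks roughly like this sketch (the function name is illustrative, and it assumes the `vo` client and `summary_index` built above):

```python
def retrieve_via_summaries(query, top_k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    # Score the query against each section summary
    scored = []
    for heading, entry in summary_index.items():
        similarity = cosine_similarity([query_embedding], [entry["embedding"]])[0][0]
        scored.append((similarity, entry))
    scored.sort(reverse=True, key=lambda x: x[0])
    # Return the full chunks under the best-matching summaries, not the summaries themselves
    results = []
    for _, entry in scored[:top_k]:
        results.extend(entry["chunks"])
    return results
```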
Why It Works
Summaries capture the essence of a section, making retrieval more accurate. When a query matches a summary, you retrieve all chunks under that heading—giving Claude richer context.
Result: Average recall improved from 0.66 to 0.69 in our tests.
Level 3: Summary Indexing + Re-Ranking
Re-ranking is the secret weapon for production RAG. After initial retrieval, you use Claude to re-rank the results based on relevance to the specific query.
The Re-Ranking Workflow
- Retrieve top 10 candidates using summary indexing
- For each candidate, ask Claude to score relevance (1-5) to the query
- Sort by score and take the top 3
```python
def rerank_with_claude(query, candidates, top_k=3):
    scored = []
    for candidate in candidates:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 1-5, how relevant is this document to the question?\n\nQuestion: {query}\n\nDocument: {candidate['content'][:500]}\n\nAnswer with only a number."
            }]
        )
        score = int(response.content[0].text.strip())
        scored.append((score, candidate))
    # Sort by score descending and take top_k
    scored.sort(reverse=True, key=lambda x: x[0])
    return [candidate for _, candidate in scored[:top_k]]
```
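Wired together with the summary-index retrieval from Level 2, the full Level 3 query path looks roughly like the sketch below. It reuses the illustrative `retrieve_via_summaries` helper from earlier and assumes the same `client`:

```python
def advanced_rag(query):
    # Cast a wide net first, then let Claude narrow it down
    candidates = retrieve_via_summaries(query, top_k=10)
    top_chunks = rerank_with_claude(query, candidates, top_k=3)
    context = "\n\n".join([doc["content"] for doc in top_chunks])
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```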
Why Re-Ranking Matters
Re-ranking dramatically improves Mean Reciprocal Rank (MRR)—the first retrieved chunk is far more likely to be relevant. In our tests, MRR jumped from 0.74 to 0.87.
Putting It All Together: Performance Gains
Here's what we achieved by layering these techniques:
| Metric | Basic RAG | Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.80 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
Production Considerations
- Rate limits: Full evaluation runs can hit API rate limits. Consider using Tier 2+ accounts or running evaluations incrementally.
- Token costs: Summary indexing and re-ranking add token usage. Balance quality gains against cost.
- Vector database: For production, use a hosted vector DB with built-in indexing and filtering.
- Chunking strategy: Experiment with different chunk sizes and overlap. There's no one-size-fits-all.
Key Takeaways
- Evaluate retrieval and generation separately to pinpoint where your RAG system needs improvement. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing bridges the granularity gap by grouping related chunks under high-level summaries, improving recall without sacrificing precision.
- Re-ranking with Claude is a powerful but often overlooked technique that significantly boosts MRR and end-to-end accuracy.
- Start simple, then iterate—a basic RAG pipeline can be surprisingly effective. Add complexity (summary indexing, re-ranking) only when you have data proving the need.
- Your evaluation dataset is your compass—invest time in creating a high-quality, representative set of questions and ground-truth answers. It will guide every optimization decision.