Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic pipelines, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for grounding Claude in your proprietary knowledge. Whether you're building a customer support bot, an internal Q&A system, or a document analysis tool, RAG lets Claude answer questions about your specific data without fine-tuning.
In this guide, we'll walk through building a RAG system using Claude, Voyage AI embeddings, and an in-memory vector database. We'll start with a basic pipeline, then show you how to measure performance properly, and finally implement advanced techniques that boost accuracy from 71% to 81%.
Understanding the RAG Architecture
A RAG system works in three stages:
- Indexing: Chunk your documents, embed each chunk, and store the embeddings in a vector database
- Retrieval: When a query comes in, embed it and find the most similar chunks
- Generation: Pass the retrieved chunks as context to Claude along with the user's question
Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
You'll need API keys from Anthropic and Voyage AI. Store them as environment variables:
import os
os.environ["ANTHROPIC_API_KEY"] = "your-key-here"
os.environ["VOYAGE_API_KEY"] = "your-key-here"
Initialize Your Vector Database
For this guide, we'll use an in-memory vector database. In production, consider hosted solutions like Pinecone, Weaviate, or Chroma.
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, Any]]):
        texts = [doc["text"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, top_k: int = 3) -> List[Dict[str, Any]]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        scores = [self._cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

    def _cosine_similarity(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
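Here's a quick smoke test of the vector store. The sample documents and the query below are hypothetical placeholders, just to show the shape of the inputs and outputs; it assumes the API keys set earlier in your environment.

# Minimal usage sketch; the documents here are made-up placeholders
db = InMemoryVectorDB(api_key=os.environ["VOYAGE_API_KEY"])
db.add_documents([
    {"id": "doc_1_0", "text": "Claude supports streaming responses via the Messages API.", "source": "Streaming Guide"},
    {"id": "doc_2_0", "text": "Prompt caching stores large, reusable context between calls.", "source": "Caching Guide"},
])
results = db.search("How do I stream output?", top_k=1)
print(results[0]["source"])  # Expect the streaming chunk to rank first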
Level 1: Basic RAG Pipeline
Let's build a "naive" RAG pipeline. This is the simplest approach: chunk documents by paragraph, embed each chunk, and retrieve using cosine similarity.
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, api_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)
        self.llm = Anthropic(api_key=api_key)

    def index_documents(self, documents: List[Dict[str, str]]):
        """
        documents: list of dicts with 'id', 'title', 'content' keys
        """
        chunks = []
        for doc in documents:
            # Simple chunking by paragraph
            paragraphs = doc["content"].split("\n\n")
            for i, para in enumerate(paragraphs):
                if len(para.strip()) > 50:  # Skip very short chunks
                    chunks.append({
                        "id": f"{doc['id']}_{i}",
                        "text": para,
                        "source": doc["title"]
                    })
        self.vector_db.add_documents(chunks)

    def query(self, question: str) -> str:
        # Retrieve relevant chunks
        chunks = self.vector_db.search(question, top_k=3)
        context = "\n\n".join([c["text"] for c in chunks])
        # Generate answer with Claude
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""Answer the question based on the provided context.

Context:
{context}

Question: {question}

Answer:"""
            }]
        )
        return response.content[0].text
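To sanity-check the pipeline end to end, index a couple of documents and ask a question. The document IDs, titles, and contents below are hypothetical; substitute your own corpus.

# Illustrative usage only; the documents and question are placeholders
rag = BasicRAG(api_key=os.environ["ANTHROPIC_API_KEY"], voyage_key=os.environ["VOYAGE_API_KEY"])
rag.index_documents([
    {"id": "doc_1", "title": "Streaming Guide",
     "content": "Overview of streaming.\n\nSet stream=True on the Messages API call to receive tokens as they are generated."},
    {"id": "doc_2", "title": "Caching Guide",
     "content": "Overview of caching.\n\nPrompt caching stores large, reusable context so repeated calls are cheaper and faster."},
])
print(rag.query("How do I stream responses from Claude?"))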
This works, but how well? Let's find out.
Building a Robust Evaluation System
Most RAG projects fail because teams rely on "vibes" instead of metrics. You need to evaluate two things independently:
- Retrieval quality: Are we finding the right chunks?
- End-to-end accuracy: Is Claude giving correct answers?
Creating an Evaluation Dataset
Generate a synthetic dataset with 100+ examples. Each example should have:
- A question
- The correct answer
- The IDs of relevant chunks
import json

eval_data = [
    {
        "question": "How do I stream responses from Claude?",
        "answer": "Use the 'stream' parameter set to True when calling the Messages API...",
        "relevant_chunks": ["doc_3_2", "doc_3_5"]
    },
    # ... 99 more examples
]

with open("evaluation_dataset.json", "w") as f:
    json.dump(eval_data, f, indent=2)
Key Metrics Explained
Precision: Of the chunks we retrieved, how many were actually relevant?
Precision = True Positives / Total Retrieved
Recall: Of all the relevant chunks that exist, how many did we retrieve?
Recall = True Positives / Total Relevant
F1 Score: Harmonic mean of precision and recall.
Mean Reciprocal Rank (MRR): How early did the first relevant chunk appear in our results?
End-to-End Accuracy: Did Claude's final answer match the expected answer?
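As a concrete, made-up example: suppose two relevant chunks exist for a query, we retrieve three chunks, and the single relevant one we found sits at rank 2. The arithmetic looks like this:

# Hypothetical numbers for illustration
retrieved = ["doc_1_0", "doc_3_2", "doc_2_1"]  # what the retriever returned, in rank order
relevant = ["doc_3_2", "doc_3_5"]              # ground-truth relevant chunks

tp = len(set(retrieved) & set(relevant))            # 1 true positive
precision = tp / len(retrieved)                     # 1/3 ≈ 0.33
recall = tp / len(relevant)                         # 1/2 = 0.50
f1 = 2 * precision * recall / (precision + recall)  # 0.40
mrr = 1 / 2                                         # first relevant chunk appears at rank 2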
Running the Evaluation
def evaluate_retrieval(rag_system, eval_data):
    precisions, recalls, f1s, mrrs = [], [], [], []
    for item in eval_data:
        retrieved = rag_system.vector_db.search(item["question"], top_k=3)
        retrieved_ids = [r["id"] for r in retrieved]
        relevant = item["relevant_chunks"]
        tp = len(set(retrieved_ids) & set(relevant))
        precision = tp / len(retrieved_ids)
        recall = tp / len(relevant)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        # MRR: reciprocal rank of first relevant chunk
        for rank, rid in enumerate(retrieved_ids, 1):
            if rid in relevant:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
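Retrieval metrics alone don't tell you whether Claude's final answers are correct. One common approach, sketched below, is to use Claude itself as a grader that compares each generated answer against the reference answer. The grading prompt, the CORRECT/INCORRECT convention, and the helper name are illustrative assumptions, not a fixed recipe.

def evaluate_end_to_end(rag_system, eval_data):
    # Grade each generated answer against the reference answer using Claude as a judge
    correct = 0
    for item in eval_data:
        generated = rag_system.query(item["question"])
        grading = rag_system.llm.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"""Reference answer: {item['answer']}

Candidate answer: {generated}

Does the candidate answer convey the same key information as the reference? Reply with only CORRECT or INCORRECT."""
            }]
        )
        # Token-level check so "INCORRECT" does not match as a substring of "CORRECT"
        if "CORRECT" in grading.content[0].text.upper().split():
            correct += 1
    return correct / len(eval_data)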
Level 2: Summary Indexing
Basic chunking loses context. A document about "Claude's safety features" might have chunks about "Constitutional AI" and "Red teaming" that are useless without each other.
Summary indexing solves this by creating a summary of each document and using it as an additional retrieval target.

class SummaryIndexRAG(BasicRAG):
    def index_documents(self, documents: List[Dict[str, str]]):
        # First, create summaries
        for doc in documents:
            response = self.llm.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=200,
                messages=[{
                    "role": "user",
                    "content": f"Summarize this document in 2-3 sentences:\n\n{doc['content']}"
                }]
            )
            summary = response.content[0].text
            # Add summary as a retrievable chunk
            self.vector_db.add_documents([{
                "id": f"{doc['id']}_summary",
                "text": summary,
                "source": doc["title"],
                "type": "summary"
            }])
        # Then add regular chunks
        super().index_documents(documents)
This improved recall from 0.66 to 0.69 in our tests. The summaries act as "table of contents" entries that help the retriever find the right document even when the exact wording doesn't match.
Level 3: Adding Re-Ranking
Retrieval gives you candidates; re-ranking ensures you only pass the best ones to Claude. This is critical because Claude's context window is valuable real estate.
class ReRankRAG(SummaryIndexRAG):
    def query(self, question: str) -> str:
        # Retrieve more candidates
        candidates = self.vector_db.search(question, top_k=10)
        # Use Claude to re-rank
        candidate_texts = "\n---\n".join([
            f"[{i+1}] {c['text']}" for i, c in enumerate(candidates)
        ])
        response = self.llm.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""Given this question: "{question}"

Rank these chunks by relevance (1=most relevant). Return only the numbers of the top 3, comma-separated.

{candidate_texts}"""
            }]
        )
        # Parse rankings and select top 3
        top_indices = [int(x.strip()) - 1 for x in response.content[0].text.split(",")[:3]]
        top_chunks = [candidates[i] for i in top_indices]
        context = "\n\n".join([c["text"] for c in top_chunks])
        final_response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Answer the question based on the context.\n\nContext:\n{context}\n\nQuestion: {question}"
            }]
        )
        return final_response.content[0].text
Re-ranking boosted our MRR from 0.74 to 0.87 and end-to-end accuracy from 71% to 81%.
Performance Results
Here's what the improvements look like in practice:
| Metric | Basic RAG | +Summary Index | +Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.78 | 0.87 |
| End-to-End Accuracy | 71% | 75% | 81% |
Production Considerations
- Rate limits: Full evaluations can hit API limits. Use Tier 2+ accounts or run evaluations incrementally.
- Vector database: Switch to Pinecone, Weaviate, or Qdrant for production workloads.
- Chunking strategy: Experiment with semantic chunking (by sentence boundaries) vs. fixed-size chunks.
- Embedding model: Voyage AI's voyage-2 is excellent, but test voyage-law-2 for legal documents or voyage-code-2 for code.
- Caching: Cache embeddings for static documents to reduce API costs (see the sketch below).
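A simple way to implement the caching point above is to key embeddings by a hash of the chunk text, so unchanged documents are never re-embedded. This is a minimal sketch assuming a local JSON file as the cache; the file path and helper name are placeholders, and any key-value store would work.

import hashlib
import json
import os

CACHE_PATH = "embedding_cache.json"  # hypothetical cache location

def embed_with_cache(client, texts, model="voyage-2"):
    # Load the existing cache (text hash -> embedding), or start fresh
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    else:
        cache = {}
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [t for t, k in zip(texts, keys) if k not in cache]
    if missing:
        # Only texts we haven't seen before are sent to the embedding API
        response = client.embed(missing, model=model)
        for t, emb in zip(missing, response.embeddings):
            cache[hashlib.sha256(t.encode("utf-8")).hexdigest()] = emb
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
    return [cache[k] for k in keys]

You could drop this in place of the direct client.embed call inside InMemoryVectorDB.add_documents to avoid paying for the same embeddings twice.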
Key Takeaways
- Evaluate retrieval and generation separately to identify where your RAG system is failing. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing improves recall by creating high-level representations of documents that match broader queries.
- Re-ranking with Claude dramatically improves MRR and end-to-end accuracy by filtering out irrelevant chunks before generation.
- Start simple, measure everything, then optimize — a basic RAG pipeline with proper evaluation beats a complex system you can't debug.
- Your evaluation dataset is your most important asset — invest time in creating realistic, challenging questions that reflect actual user queries.