# Building a Production-Ready RAG System with Claude: From Basic to Advanced
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced optimizations like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, then apply techniques that boosted accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your unique business context. Whether you're building a customer support chatbot, an internal knowledge base Q&A system, or a financial analysis tool, RAG enables Claude to answer questions grounded in your own documents.
In this guide, we'll walk through building a RAG system using the Claude Documentation as our knowledge base. We'll start with a basic implementation, then show you how to measure performance objectively, and finally apply advanced techniques that improved our end-to-end accuracy from 71% to 81%.
## What You'll Need
Before we dive in, let's set up our environment. You'll need:
- An Anthropic API key for Claude
- A Voyage AI API key for embeddings
- Python libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
```python
import anthropic
import voyageai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
```
## Initialize Your Vector Database
For this guide, we'll use an in-memory vector store. In production, you'd likely use a dedicated vector database like Pinecone, Weaviate, or Chroma.
```python
class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text, embedding):
        self.documents.append(text)
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        # Cosine similarity between the query and every stored embedding
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        # Indices of the top_k highest scores, best first
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.documents[i], similarities[i]) for i in top_indices]
```
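To make the interface concrete, here's a minimal indexing sketch. The `docs` list is illustrative; in practice the documents come from your chunking step:

```python
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
vector_db = InMemoryVectorDB()

# Illustrative documents standing in for real chunks
docs = [
    "Claude supports tool use via the Messages API.",
    "Rate limits vary by usage tier.",
]

for doc in docs:
    embedding = vo.embed([doc], model="voyage-2").embeddings[0]
    vector_db.add_document(doc, embedding)
```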
## Level 1: Basic RAG (Naive RAG)
Let's start with the simplest possible RAG pipeline. This is often called "Naive RAG" in the industry, and it follows three steps:
- Chunk documents by heading (each subheading becomes a chunk; see the sketch after this list)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
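The pipeline below implements steps 2 and 3. Step 1 isn't shown in code, so here's a minimal heading-based chunker as a sketch (the function name and the Markdown-heading regex are assumptions):

```python
import re

def chunk_by_headings(markdown_text):
    """Split a Markdown document so each heading starts a new chunk."""
    sections = re.split(r"\n(?=#{1,4} )", markdown_text)
    return [s.strip() for s in sections if s.strip()]
```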
```python
def basic_rag_pipeline(query, vector_db, top_k=3):
    # Step 1: Embed the query
    vo = voyageai.Client()
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Step 2: Retrieve relevant chunks
    retrieved_chunks = vector_db.search(query_embedding, top_k=top_k)

    # Step 3: Generate answer with Claude
    context = "\n\n".join([chunk for chunk, _ in retrieved_chunks])
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
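Usage is a single call (the question text is illustrative):

```python
answer = basic_rag_pipeline("How do I authenticate API requests?", vector_db)
print(answer)
```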
This works, but how do we know if it's working well? That's where evaluation comes in.
## Building an Evaluation System
"Vibes-based" evaluation won't cut it for production systems. We need objective metrics. Let's build an evaluation suite that measures two things independently:
- Retrieval performance – How good is our system at finding the right documents?
- End-to-end performance – How good are the final answers?

To measure both, we need a golden evaluation dataset where each entry contains:
- A question
- The correct chunks (ground truth relevant documents)
- A correct answer
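For illustration, one entry of such a dataset might look like this (the field names are an assumed schema, not a requirement):

```python
golden_example = {
    "question": "How do I authenticate requests to the Claude API?",
    "correct_chunks": ["getting-started#authentication"],  # ground-truth chunk IDs
    "correct_answer": "Pass your API key in the x-api-key header.",
}
```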
### Key Retrieval Metrics
#### Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means you're not flooding Claude with irrelevant information. Low precision means you're wasting context window space.
#### Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall means Claude has access to all the information it needs. Low recall means you're missing important context.
#### F1 Score
The harmonic mean of precision and recall, giving a balanced measure of retrieval quality:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
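Since all three metrics come from the same counts, here's a minimal sketch computing them from lists of chunk identifiers (the function name is illustrative):

```python
def retrieval_metrics(retrieved_chunks, correct_chunks):
    true_positives = len(set(retrieved_chunks) & set(correct_chunks))
    precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0.0
    recall = true_positives / len(correct_chunks) if correct_chunks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1
```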
#### Mean Reciprocal Rank (MRR)
MRR measures how high the first relevant result appears in your retrieval list. If the first relevant chunk is always at position 1, MRR is 1.0. If it's often at position 3, MRR drops.
```python
def calculate_mrr(retrieved_chunks, correct_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in correct_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
### End-to-End Accuracy
This measures whether Claude's final answer is correct. You can use an LLM-as-judge approach or exact-match comparison against a golden answer; here we use Claude Haiku as the judge:
```python
def evaluate_end_to_end(question, expected_answer, actual_answer):
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=5,  # enough room for a one-word verdict
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nExpected: {expected_answer}\nActual: {actual_answer}\n\nIs the actual answer correct? Answer only 'yes' or 'no'."
        }]
    )
    # startswith tolerates trailing punctuation like "Yes."
    return response.content[0].text.strip().lower().startswith('yes')
```
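Tying these together, a minimal evaluation loop over a golden dataset might look like this (it assumes the illustrative `golden_example` schema above and the `basic_rag_pipeline` from Level 1):

```python
def run_eval(golden_dataset, vector_db):
    correct = 0
    for example in golden_dataset:
        answer = basic_rag_pipeline(example["question"], vector_db)
        if evaluate_end_to_end(example["question"],
                               example["correct_answer"], answer):
            correct += 1
    return correct / len(golden_dataset)
```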
## Level 2: Summary Indexing
Our basic RAG has a problem: chunking by heading loses the broader context. A chunk about "rate limits" might not mention it's from the "API Reference" section, making it harder to retrieve for questions about API usage.
Summary indexing solves this by creating a short summary for each chunk and using that summary (along with the chunk) for retrieval.

```python
def create_summary(chunk_text):
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Summarize this document chunk in 1-2 sentences:\n\n{chunk_text}"
        }]
    )
    return response.content[0].text
```
During indexing:
```python
for chunk in all_chunks:
    summary = create_summary(chunk)
    combined_text = f"[Summary]: {summary}\n[Content]: {chunk}"
    # Embed the summary plus content, but store the original chunk text
    embedding = vo.embed([combined_text], model="voyage-2").embeddings[0]
    vector_db.add_document(chunk, embedding)
```
This improved our recall from 0.66 to 0.69 and F1 from 0.52 to 0.54.
## Level 3: Summary Indexing + Re-Ranking
Even with better indexing, we might retrieve 10 chunks but only have room for 3 in Claude's context. Re-ranking uses Claude itself to select the most relevant chunks from an initial candidate set.
```python
def rerank_with_claude(query, candidates, top_k=3):
    client = anthropic.Anthropic()

    # Format candidates for Claude
    candidate_text = "\n\n".join([
        f"[Chunk {i+1}]: {chunk}"
        for i, chunk in enumerate(candidates)
    ])

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Given this query: '{query}'\n\nRank these chunks by relevance (most relevant first). Return only the chunk numbers in order, comma-separated.\n\n{candidate_text}"
        }]
    )

    # Parse the ranked indices, skipping any tokens that aren't valid chunk numbers
    ranked_indices = [
        int(x.strip()) - 1
        for x in response.content[0].text.split(',')
        if x.strip().isdigit() and 0 < int(x.strip()) <= len(candidates)
    ]
    return [candidates[i] for i in ranked_indices[:top_k]]
```
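Putting the two stages together, you retrieve a wider candidate set by cosine similarity, then re-rank down to the final context (the `top_k` values and the pre-computed `query_embedding` are illustrative):

```python
candidates = [chunk for chunk, _ in vector_db.search(query_embedding, top_k=10)]
top_chunks = rerank_with_claude(query, candidates, top_k=3)
```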
This technique dramatically improved our Mean Reciprocal Rank from 0.74 to 0.87 – meaning the most relevant chunk almost always appeared first.
## Results Summary
Here's what we achieved by layering these techniques:
| Metric | Basic RAG | + Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 78% | 81% |
## Production Considerations
- Rate limits: Full evaluations can hit API rate limits. Consider using Tier 2+ accounts or sampling your eval set.
- Cost: Summary indexing and re-ranking add token costs. Balance improvements against your budget.
- Chunking strategy: Experiment with different chunk sizes and overlap. We found heading-based chunking with 200-500 token chunks works well (see the sketch after this list).
- Embedding model: Voyage AI's `voyage-2` is excellent, but test other models for your domain.
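As a rough sketch of size-based chunking with overlap, using whitespace-split words as a stand-in for real token counting (the function name and defaults are illustrative):

```python
def chunk_with_overlap(text, chunk_size=400, overlap=50):
    """Size-based chunking with overlap; words approximate tokens."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]
```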
## Key Takeaways
- Evaluate retrieval and generation separately – Don't rely on "vibes." Use precision, recall, F1, and MRR to measure retrieval quality, and a separate metric for end-to-end accuracy.
- Summary indexing improves recall – By enriching chunks with summaries, you make them more discoverable for semantic search.
- Re-ranking with Claude boosts MRR significantly – Using Claude to select the most relevant chunks from a candidate pool ensures the best information reaches your final prompt.
- Start simple, then iterate – Basic RAG works. Measure it, then apply targeted improvements based on where your metrics are weakest.
- End-to-end accuracy is the ultimate metric – All retrieval improvements should ultimately serve the goal of better answers. Our 10-point accuracy gain (71% → 81%) validated our approach.