Building a Production-Ready RAG System with Claude: From Basic to Advanced
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.
In this guide, we'll walk through building a RAG system using the Claude documentation as our knowledge base. We'll start with a basic implementation, then show you how to evaluate it properly, and finally apply advanced techniques that boost end-to-end accuracy from 71% to 81%.
What You'll Learn
- How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
- How to build a robust evaluation suite with 5 key metrics
- How to implement summary indexing for better retrieval
- How to use Claude as a re-ranker to improve result quality
Prerequisites
You'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, and `scikit-learn` installed
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG." This is the simplest approach, but it's a solid foundation.
Step 1: Chunk Your Documents
We split documents by headings, keeping content from each subheading together. This creates natural, semantically coherent chunks.
```python
def chunk_by_headings(text):
    """Split document text by markdown headings."""
    chunks = []
    current_heading = "Introduction"
    current_content = []
    for line in text.split('\n'):
        if line.startswith('##') or line.startswith('###'):
            if current_content:
                chunks.append({
                    'heading': current_heading,
                    'content': '\n'.join(current_content).strip()
                })
            current_heading = line.strip('# ')
            current_content = []
        else:
            current_content.append(line)
    # Don't forget the last chunk
    if current_content:
        chunks.append({
            'heading': current_heading,
            'content': '\n'.join(current_content).strip()
        })
    return chunks
```
Step 2: Embed Each Chunk
We use Voyage AI's embedding model to convert each chunk into a vector representation.
```python
import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

def embed_chunks(chunks):
    """Generate embeddings for all chunks."""
    texts = [chunk['content'] for chunk in chunks]
    embeddings = vo.embed(texts, model="voyage-2").embeddings
    for i, chunk in enumerate(chunks):
        chunk['embedding'] = embeddings[i]
    return chunks
```
Step 3: Retrieve and Answer
When a user asks a question, we embed their query, find the most similar chunks using cosine similarity, and pass them to Claude.
```python
import anthropic
import numpy as np
from typing import List, Dict

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query: str, chunks: List[Dict], top_k: int = 3):
    """Retrieve top-k most relevant chunks."""
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for chunk in chunks:
        score = cosine_similarity(query_embedding, chunk['embedding'])
        scores.append(score)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]

def answer_with_claude(query: str, context_chunks: List[Dict]):
    """Generate an answer using Claude with retrieved context."""
    context = "\n\n".join(c['content'] for c in context_chunks)
    prompt = f"""Answer the question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Building a Robust Evaluation System
Most RAG systems fail because they're evaluated on "vibes" rather than metrics. Let's fix that.
The Evaluation Dataset
We synthetically generated 100 test samples. Each sample contains:
- A question
- Relevant chunks (the ground truth for retrieval)
- A correct answer (the ground truth for end-to-end)
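Concretely, a single sample can be represented as a plain dictionary. The field names below are illustrative, not a fixed schema from this guide:

```python
# One illustrative evaluation sample. Ground-truth relevant chunks are
# identified by their headings, matching the chunker's output above.
sample = {
    "question": "How do I set the maximum response length?",
    "relevant_chunks": ["Step 3: Retrieve and Answer"],  # ground truth for retrieval
    "correct_answer": "Pass max_tokens when calling client.messages.create.",
}
```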
The Five Key Metrics
#### 1. Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means you're not wasting Claude's context window with irrelevant information.
#### 2. Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
#### 3. F1 Score
The harmonic mean of precision and recall. A balanced measure of retrieval quality.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
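All three metrics fall out of a simple set comparison between retrieved and ground-truth chunk headings. A minimal sketch:

```python
def retrieval_metrics(retrieved_headings, relevant_headings):
    """Compute precision, recall, and F1 over chunk headings."""
    retrieved = set(retrieved_headings)
    relevant = set(relevant_headings)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: 3 chunks retrieved, 2 relevant, 1 relevant chunk missed
p, r, f1 = retrieval_metrics(["A", "B", "C"], ["A", "B", "D"])
# p = r = f1 = 2/3
```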
#### 4. Mean Reciprocal Rank (MRR)
MRR measures how early the first relevant result appears. If the first relevant chunk is at position 1, the reciprocal rank is 1.0. At position 3, it's 0.33.
```python
def reciprocal_rank(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk['heading'] in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
#### 5. End-to-End Accuracy
This measures whether Claude's final answer is correct. You can use LLM-as-judge or manual evaluation.
```python
def evaluate_answer(question, generated_answer, correct_answer):
    prompt = f"""Determine if the generated answer correctly answers the question.

Question: {question}
Correct Answer: {correct_answer}
Generated Answer: {generated_answer}

Is the generated answer correct? Answer only 'yes' or 'no'."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().lower() == 'yes'
```
Level 2: Summary Indexing
Basic RAG retrieves raw chunks, but sometimes the most relevant chunk doesn't contain the exact keywords. Summary indexing solves this by creating a condensed version of each chunk and using that for retrieval.
```python
def create_summary(chunk_content):
    """Use Claude to summarize a chunk."""
    prompt = f"""Summarize the following text in 2-3 sentences, capturing the key information:

{chunk_content}

Summary:"""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Then, instead of embedding the raw chunk content, you embed the summary. This often improves recall because summaries are more semantically aligned with questions.
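Putting it together, a summary index stores the summary embedding for retrieval while keeping the raw content for the answer step. A sketch with the summarizer and embedder injected as parameters so the indexing logic stands alone; in practice they would wrap `create_summary` and `vo.embed`:

```python
def build_summary_index(chunks, summarize, embed):
    """Embed summaries for retrieval; keep raw content for generation.

    `summarize` and `embed` are injected callables; in a real pipeline
    they would wrap create_summary and the Voyage client from earlier.
    """
    summaries = [summarize(chunk["content"]) for chunk in chunks]
    vectors = embed(summaries)
    for chunk, summary, vector in zip(chunks, summaries, vectors):
        chunk["summary"] = summary
        chunk["embedding"] = vector  # retrieval matches against the summary
    return chunks

# Toy stand-ins for the real summarizer and embedder
chunks = [{"heading": "Rate Limits", "content": "Long text about rate limits..."}]
index = build_summary_index(
    chunks,
    summarize=lambda text: text[:20],
    embed=lambda texts: [[float(len(t))] for t in texts],
)
```

Because `content` is left untouched, the answer step still passes the full chunk text to Claude; only the retrieval side sees summaries.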
Level 3: Summary Indexing + Re-Ranking
This is where things get interesting. We combine summary indexing with a re-ranking step using Claude.
How Re-Ranking Works
1. Retrieve a larger set of candidates (e.g., the top 10 chunks)
2. Use Claude to score each chunk's relevance to the query
3. Keep only the top 3-5 most relevant chunks
```python
def rerank_with_claude(query: str, candidates: List[Dict], top_k: int = 3):
    """Use Claude to re-rank retrieved chunks."""
    scored_chunks = []
    for chunk in candidates:
        prompt = f"""On a scale of 1-10, how relevant is the following text to answering the question?

Question: {query}

Text: {chunk['content']}

Relevance score (just the number):"""
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{"role": "user", "content": prompt}]
        )
        try:
            score = int(response.content[0].text.strip())
        except ValueError:
            score = 5  # Default if parsing fails
        scored_chunks.append((score, chunk))
    # Sort by score descending and return the top_k chunks
    scored_chunks.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scored_chunks[:top_k]]
```
The Results
Here's what we achieved by combining summary indexing and re-ranking:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
Vector Database
For production, replace the in-memory store with a hosted vector database like Pinecone, Weaviate, or pgvector.
Rate Limits
The full evaluation suite can hit rate limits. If you're not on Tier 2+, consider running smaller subsets.
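A generic retry wrapper with exponential backoff keeps the evaluation loop resilient to 429s. This is a sketch, not part of any SDK; the `is_rate_limit` predicate is a hypothetical hook you would point at your client library's rate-limit exception:

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0, is_rate_limit=None):
    """Retry `call` with exponential backoff; re-raise non-rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            # Without a predicate we retry everything, which is too broad
            # for production but fine for an evaluation script.
            if is_rate_limit is not None and not is_rate_limit(exc):
                raise
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

With the Anthropic Python SDK, the predicate might check for its rate-limit exception, e.g. `is_rate_limit=lambda e: isinstance(e, anthropic.RateLimitError)`.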
Cost Optimization
- Use Claude Haiku for re-ranking (cheaper than Sonnet)
- Cache embeddings for static documents
- Batch API calls where possible
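For static documents, a small cache keyed by content hash ensures each chunk is embedded exactly once across runs. A minimal in-memory sketch (a production version would persist `store` to disk or a database):

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        # e.g. embed_fn = lambda texts: vo.embed(texts, model="voyage-2").embeddings
        self.embed_fn = embed_fn
        self.store = {}

    def embed(self, texts):
        # Only send texts we haven't seen before to the embedding API
        missing = [t for t in texts if self._key(t) not in self.store]
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.store[self._key(text)] = vector
        return [self.store[self._key(t)] for t in texts]

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
```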
Key Takeaways
- Start simple, then optimize: A basic RAG pipeline works surprisingly well. Add complexity only when you have metrics to justify it.
- Evaluate retrieval and generation separately: Your RAG system is only as good as its weakest link. Measure both components independently.
- Summary indexing improves recall: By embedding summaries instead of raw chunks, you capture semantic relevance that keyword matching misses.
- Re-ranking with Claude boosts MRR significantly: Using Claude to score relevance after initial retrieval ensures the most relevant chunks appear first.
- End-to-end accuracy is the ultimate metric: All retrieval metrics are proxies. Always measure whether your system actually answers questions correctly.