Building Production-Ready RAG Systems with Claude: From Basic to Advanced
This guide walks through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Claude excels at general knowledge tasks, but when you need answers grounded in your own documents—internal wikis, product manuals, or customer support logs—standard prompting falls short. That's where Retrieval Augmented Generation (RAG) comes in.
RAG lets Claude tap into your private knowledge bases, dramatically improving its ability to answer domain-specific questions. Enterprises are using RAG to power customer support bots, internal Q&A systems, financial analysis tools, and more.
In this guide, we'll build a RAG system using Claude and the Claude Documentation as our knowledge base. We'll start with a basic pipeline, then level up with advanced techniques that measurably improve performance.
What You'll Learn
- How to set up a basic RAG pipeline with Claude, Voyage AI embeddings, and an in-memory vector store
- How to build a robust evaluation suite that measures retrieval and end-to-end performance independently
- How to implement summary indexing and re-ranking to boost accuracy from 71% to 81%
Prerequisites
- Python 3.8+ with pip
- An Anthropic API key
- A Voyage AI API key (for embeddings)
Step 1: Basic RAG Setup
Let's start with what the industry calls "Naive RAG." It's simple but effective for many use cases.
Install Dependencies
```shell
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
Initialize the Vector Database
We'll use an in-memory vector DB for this example. For production, consider hosted solutions like Pinecone, Weaviate, or Chroma.
```python
import numpy as np
import voyageai
from typing import Dict, List

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve top-k documents by cosine similarity."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        similarities = [
            np.dot(query_embedding, doc_emb)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
Build the Basic RAG Pipeline
```python
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.anthropic = Anthropic(api_key=anthropic_key)
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)

    def answer(self, query: str) -> str:
        # 1. Retrieve relevant chunks
        chunks = self.vector_db.search(query, k=3)
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # 2. Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context above."
            }]
        )
        return response.content[0].text
```
Step 2: Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need metrics that tell you exactly where your system excels and where it falls short.
Create a Synthetic Evaluation Dataset
Generate 100 test samples, each containing:
- A question
- The correct answer
- The relevant document chunks (ground truth for retrieval)
```python
import json

# Example structure of one evaluation sample
eval_sample = {
    "question": "How do I set up streaming with Claude?",
    "answer": "To set up streaming, use the stream=True parameter...",
    "relevant_chunks": [
        "Streaming allows you to receive partial responses...",
        "Set stream=True in the Messages API call..."
    ]
}

# Save the full list of samples to a file
eval_samples = [eval_sample]  # in practice, the 100 generated samples
with open("evaluation_dataset.json", "w") as f:
    json.dump(eval_samples, f, indent=2)
```
Define Key Metrics
We evaluate two dimensions separately: retrieval quality and end-to-end accuracy.
#### Retrieval Metrics
Precision – Of the chunks retrieved, how many are relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall – Of all relevant chunks, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score – Harmonic mean of precision and recall.

Mean Reciprocal Rank (MRR) – How high does the first relevant chunk appear in the results? Critical for question answering, where one good chunk may be enough.

#### End-to-End Metric

Accuracy – Does Claude's final answer match the ground truth? Use another LLM call to judge correctness.

```python
def evaluate_retrieval(retrieved_chunks, relevant_chunks):
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1
```
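To make the formulas concrete, here is a small worked example with made-up chunk IDs, including a reciprocal-rank computation (MRR averages the reciprocal rank over all evaluation queries):

```python
# Hypothetical retrieval result: chunk IDs, ranked best-first.
retrieved = ["a", "b", "c"]   # what the system returned
relevant = {"b", "c", "d"}    # ground-truth relevant chunks

tp = len(set(retrieved) & relevant)                  # 2 ("b" and "c")
precision = tp / len(retrieved)                      # 2/3
recall = tp / len(relevant)                          # 2/3
f1 = 2 * precision * recall / (precision + recall)   # 2/3

# Reciprocal rank: 1 / (1-indexed position of the first relevant chunk).
# "a" is not relevant, "b" at rank 2 is, so RR = 1/2.
rr = next(1 / (i + 1) for i, c in enumerate(retrieved) if c in relevant)
print(round(precision, 3), round(recall, 3), round(rr, 3))  # → 0.667 0.667 0.5
```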
```python
def evaluate_end_to_end(question, predicted_answer, correct_answer):
    # Use Claude to judge correctness (client is an Anthropic() instance)
    prompt = f"""Question: {question}
Predicted Answer: {predicted_answer}
Correct Answer: {correct_answer}

Is the predicted answer correct? Answer only 'yes' or 'no'."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().lower() == "yes"
```
Step 3: Advanced Techniques
Level 2: Summary Indexing
Basic RAG chunks by heading, which can miss cross-chunk relationships. Summary indexing creates a separate index of chunk summaries, making retrieval more robust.
```python
def create_summary_index(documents):
    summaries = []
    for doc in documents:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this document in 1-2 sentences:\n\n{doc['content']}"
            }]
        )
        summaries.append({
            "original": doc,
            "summary": response.content[0].text
        })
    return summaries
```
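At query time, you search the summaries but hand Claude the original chunks. The sketch below shows that pattern with a stand-in scoring function (word overlap) in place of real embeddings; `word_overlap` and the toy data are illustrative assumptions, not part of the pipeline above:

```python
def search_summary_index(summaries, query, score_fn, k=1):
    # Rank entries by how well their SUMMARY matches the query,
    # but return the ORIGINAL full chunks for generation.
    ranked = sorted(summaries, key=lambda s: score_fn(query, s["summary"]), reverse=True)
    return [s["original"] for s in ranked[:k]]

# Stand-in scorer: shared-word count (a real system would compare embeddings).
def word_overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

index = [
    {"original": {"content": "Full text about streaming responses..."},
     "summary": "Explains streaming partial responses from the API"},
    {"original": {"content": "Full text about rate limits..."},
     "summary": "Describes rate limits and usage tiers"},
]
hits = search_summary_index(index, "how do rate limits work", word_overlap)
print(hits[0]["content"])  # → Full text about rate limits...
```

The key design point is the indirection: the concise summary is what gets matched, while the full chunk is what reaches Claude's context window.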
Level 3: Summary Indexing + Re-Ranking
Re-ranking uses Claude to reorder retrieved chunks by relevance to the query. This dramatically improves MRR.
```python
def rerank_with_claude(query, vector_db, top_k=3):
    # First pass: retrieve more chunks than needed
    initial_chunks = vector_db.search(query, k=10)

    # Second pass: ask Claude to rank them
    chunk_texts = [f"Chunk {i}: {c['content']}" for i, c in enumerate(initial_chunks)]
    joined_chunks = "\n".join(chunk_texts)
    prompt = f"""Query: {query}

Chunks:
{joined_chunks}

Rank the chunks by relevance to the query. Return the indices of the top {top_k} chunks, most relevant first, as a comma-separated list."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    indices = [int(i.strip()) for i in response.content[0].text.split(",")[:top_k]]
    return [initial_chunks[i] for i in indices]
```
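Model output is not guaranteed to be a clean comma-separated list, so production code should parse it defensively. One possible sketch (pulling integers out with a regex is an assumption here, not the only approach):

```python
import re

def parse_rank_indices(text, n_chunks, top_k):
    # Pull every integer out of the model's reply, keep only valid,
    # previously unseen chunk indices, and stop once we have top_k.
    indices = []
    for match in re.findall(r"\d+", text):
        i = int(match)
        if 0 <= i < n_chunks and i not in indices:
            indices.append(i)
        if len(indices) == top_k:
            break
    return indices

print(parse_rank_indices("Top chunks: 7, 2, and 7, then 99", n_chunks=10, top_k=3))  # → [7, 2]
```

Duplicates and out-of-range indices are dropped silently; returning fewer than `top_k` indices is preferable to raising an `IndexError` mid-pipeline.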
Results: Measurable Improvements
After implementing summary indexing and re-ranking, we saw significant gains:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate Limits: Full evaluations may hit rate limits below Tier 2. Consider sampling your dataset or running evaluations incrementally.
- Vector Database: For production, replace the in-memory DB with a hosted solution (Pinecone, Weaviate, Qdrant).
- Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) and overlap (10-20%).
- Embedding Model: Voyage AI embeddings are optimized for retrieval, but you can also use OpenAI's `text-embedding-3-small` or open-source models like `BAAI/bge-base-en-v1.5`.
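The chunking advice above can be sketched as a simple sliding-window splitter. This version counts whitespace-separated words as a rough token proxy, and the sizes are illustrative defaults, not tuned values:

```python
def chunk_text(text, chunk_size=256, overlap_ratio=0.15):
    # Split on whitespace as a cheap token approximation, then slide a
    # window of chunk_size words with ~15% overlap between windows.
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(600))
chunks = chunk_text(doc, chunk_size=256)
print(len(chunks))  # → 3
```

In practice you would chunk on token counts from your embedding model's tokenizer and prefer splitting at heading or paragraph boundaries, but the window-plus-overlap structure is the same.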
Key Takeaways
- Evaluate retrieval and generation separately. Use precision, recall, F1, and MRR for retrieval; use LLM-as-judge for end-to-end accuracy.
- Summary indexing improves recall by capturing document-level semantics that chunk-level indexing misses.
- Re-ranking with Claude boosts MRR significantly (from 0.74 to 0.87) by applying semantic understanding to initial retrieval results.
- Start simple, then iterate. A basic RAG pipeline can be surprisingly effective. Add complexity only when metrics show clear room for improvement.
- Synthetic evaluation datasets are powerful. Generate 50-100 samples covering your domain to get reliable performance signals without manual labeling.