Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide teaches you to build a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, and see how the advanced techniques lift end-to-end accuracy from 71% to 81% on a synthetic evaluation set.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll walk through building and optimizing a RAG system using Claude, from a basic "naive" approach to advanced techniques that deliver measurable improvements.
What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG pipeline using embeddings and vector search
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques like summary indexing and re-ranking with Claude
- Quantify the gains: in our evaluation, these techniques improved MRR from 0.74 to 0.87 and end-to-end accuracy from 71% to 81%
Understanding the RAG Architecture
A basic RAG pipeline follows three steps:
- Chunk documents into manageable pieces (e.g., by heading)
- Embed each chunk using a vector embedding model
- Retrieve relevant chunks via cosine similarity to answer a query
Level 1: Basic RAG Setup
Prerequisites
You'll need:
- An Anthropic API key for Claude
- A Voyage AI API key for embeddings
- Python libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
Initialize Your Vector Database
For this example, we'll use an in-memory vector DB. For production, consider a hosted solution like Pinecone or Weaviate.
```python
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: List[str]):
        self.documents.extend(docs)
        response = self.client.embed(docs, model="voyage-2")
        self.embeddings.extend(response.embeddings)

    def search(self, query: str, top_k: int = 3) -> List[str]:
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Dot product equals cosine similarity when embeddings are unit-normalized
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
Chunking Strategy
A simple but effective strategy is to chunk documents by heading, keeping content from each subheading together. This preserves semantic boundaries.
```python
def chunk_by_heading(text: str) -> List[str]:
    chunks = []
    current_chunk = []
    for line in text.split('\n'):
        if line.startswith('#'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks
```
Basic RAG Query Function
```python
import anthropic

claude = anthropic.Anthropic(api_key="your-api-key")
db = InMemoryVectorDB(api_key="your-voyage-key")

def basic_rag(query: str) -> str:
    # Retrieve the most relevant chunks
    chunks = db.search(query, top_k=3)
    context = "\n\n---\n\n".join(chunks)

    # Generate an answer with Claude, grounded in the retrieved context
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context."
        }]
    )
    return response.content[0].text
```
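Before querying, the vector store has to be populated with your chunked documents. A minimal wiring step might look like the sketch below; `docs` is assumed to be a list of markdown documents loaded from your own knowledge base, and `all_chunks` is reused later when we build the summary index.

```python
# Hypothetical indexing step: `docs` is a list of markdown documents
# loaded from your knowledge base.
all_chunks = []
for doc in docs:
    all_chunks.extend(chunk_by_heading(doc))

# Embed and store every chunk in the in-memory vector DB
db.add_documents(all_chunks)

# Ask a question against the indexed documents
print(basic_rag("What is the refund policy for annual plans?"))
```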
Building an Evaluation System
To improve your RAG system, you must measure it. We'll evaluate two dimensions independently:
- Retrieval performance – How well does the system find relevant chunks?
- End-to-end accuracy – How well does Claude answer using those chunks?
Creating an Evaluation Dataset
We synthetically generated 100 samples, each containing:
- A question
- Ground-truth relevant chunks
- A correct answer
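The exact schema is up to you; a hypothetical sample might look like this (field names are illustrative, not prescribed):

```python
# Hypothetical evaluation sample
sample = {
    "question": "How do I reset my API key?",
    "correct_chunks": [
        "# API Keys\nYou can rotate or reset keys from the dashboard...",
    ],
    "correct_answer": "Go to the dashboard's API Keys page and click Reset.",
}
```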
Key Metrics Defined
#### Precision
Precision answers: "Of the chunks we retrieved, how many were relevant?"
$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$
High precision means fewer false positives.
#### Recall
Recall answers: "Of all the correct chunks, how many did we retrieve?"
$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
High recall means we're not missing important information.
#### F1 Score
The harmonic mean of precision and recall:
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
#### Mean Reciprocal Rank (MRR)
MRR measures how early the first relevant chunk appears in the results. If the first relevant chunk is at position 1, the reciprocal rank is 1. At position 2, it's 1/2, and so on.
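Averaged over all $N$ evaluation queries:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$

where $\text{rank}_i$ is the position of the first relevant chunk for query $i$, and the term is 0 if no relevant chunk is retrieved.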
#### End-to-End Accuracy
This measures whether Claude's final answer is correct, given the retrieved context.
Evaluation Code
```python
def evaluate_retrieval(queries, ground_truth_chunks, db, top_k=3):
    precisions, recalls, f1s, mrrs = [], [], [], []

    for query, correct_chunks in zip(queries, ground_truth_chunks):
        retrieved = db.search(query, top_k=top_k)

        # Precision, recall, and F1 over the retrieved set
        true_positives = len(set(retrieved) & set(correct_chunks))
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(correct_chunks) if correct_chunks else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of the first relevant chunk (0 if none retrieved)
        for rank, chunk in enumerate(retrieved, 1):
            if chunk in correct_chunks:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs),
    }
```
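The function above only covers retrieval. For end-to-end accuracy you also need to grade the final answers against the ground-truth answers; a common approach is to use a small model as the grader. Here's a minimal sketch, assuming an `evaluate_end_to_end` helper of our own design (the grading prompt is illustrative, not part of any library):

```python
def evaluate_end_to_end(rag_fn, queries, correct_answers, client) -> float:
    """Fraction of questions where the RAG answer matches the ground truth,
    as judged by Claude. `rag_fn` is any of basic_rag / summary_rag / advanced_rag."""
    correct = 0
    for query, expected in zip(queries, correct_answers):
        answer = rag_fn(query)
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {query}\n\nReference answer: {expected}\n\n"
                    f"Candidate answer: {answer}\n\n"
                    "Does the candidate answer convey the same information as the "
                    "reference answer? Reply with only YES or NO."
                )
            }]
        )
        if "YES" in response.content[0].text.strip().upper():
            correct += 1
    return correct / len(queries)
```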
Level 2: Summary Indexing
Basic RAG often fails when a single chunk doesn't contain enough context. Summary indexing addresses this by creating a secondary index of chunk summaries.
How It Works
- For each chunk, ask Claude to generate a concise summary
- Store both the summary and the full chunk
- At query time, search summaries first, then retrieve corresponding full chunks
```python
def generate_summary(chunk: str, client) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"Summarize this text in 1-2 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text
```
```python
# Build the summary index: one Claude-generated summary per chunk
summary_db = InMemoryVectorDB(api_key="your-voyage-key")
full_chunks = []

for chunk in all_chunks:
    summary = generate_summary(chunk, claude)
    summary_db.add_documents([summary])
    full_chunks.append(chunk)

def summary_rag(query: str) -> str:
    # Search over the summaries
    top_summaries = summary_db.search(query, top_k=3)

    # Map each summary back to its full chunk (assumes summaries are unique)
    indices = [summary_db.documents.index(s) for s in top_summaries]
    context = "\n\n---\n\n".join([full_chunks[i] for i in indices])

    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context."
        }]
    )
    return response.content[0].text
```
Level 3: Adding Re-Ranking
Re-ranking improves precision by having Claude score the relevance of retrieved chunks before generating an answer.
```python
def rerank_chunks(query: str, chunks: List[str], client) -> List[str]:
    scored_chunks = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the query?\n\nQuery: {query}\n\nChunk: {chunk}\n\nAnswer with just a number."
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            # Fall back to the lowest score if Claude doesn't return a bare number
            score = 0.0
        scored_chunks.append((score, chunk))

    # Highest-scoring chunks first
    scored_chunks.sort(reverse=True, key=lambda x: x[0])
    return [chunk for _, chunk in scored_chunks]
```
```python
def advanced_rag(query: str) -> str:
    # Retrieve a wider pool of candidates
    candidates = db.search(query, top_k=10)

    # Re-rank with Claude and keep the top 3
    top_chunks = rerank_chunks(query, candidates, claude)[:3]
    context = "\n\n---\n\n".join(top_chunks)

    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context."
        }]
    )
    return response.content[0].text
```
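With the evaluation helpers in place, comparing pipelines is straightforward. The sketch below assumes the `evaluate_end_to_end` helper from earlier and the `queries`, `ground_truth_chunks`, and `correct_answers` fields of the evaluation dataset described above:

```python
# Hypothetical comparison run over the evaluation dataset
retrieval_metrics = evaluate_retrieval(queries, ground_truth_chunks, db, top_k=3)
basic_accuracy = evaluate_end_to_end(basic_rag, queries, correct_answers, claude)
advanced_accuracy = evaluate_end_to_end(advanced_rag, queries, correct_answers, claude)

print(retrieval_metrics)
print(f"Basic RAG accuracy: {basic_accuracy:.0%}")
print(f"Advanced RAG accuracy: {advanced_accuracy:.0%}")
```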
Results: Measurable Improvements
After implementing summary indexing and re-ranking, we achieved:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate limits: Full evaluations may hit rate limits unless you're at Tier 2 or above on Anthropic's API
- Token usage: Summary indexing and re-ranking consume additional tokens; optimize by using Haiku for auxiliary tasks
- Vector database: For production, use a hosted vector DB with built-in indexing and scaling
- Evaluation dataset: Invest time in creating a high-quality evaluation set that reflects real user queries
Key Takeaways
- Measure separately: Always evaluate retrieval and end-to-end performance independently to identify bottlenecks
- Summary indexing improves recall: By searching summaries, you capture chunks that might be missed by keyword or embedding search alone
- Re-ranking boosts precision: Claude can effectively score relevance, pushing the most useful chunks to the top
- MRR is your early-warning metric: A low MRR means relevant chunks appear too far down the results, hurting answer quality
- Start simple, then optimize: Begin with basic RAG, establish baselines, then add complexity only where metrics show improvement