Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques for production-grade performance.
This guide walks through building a RAG system with Claude, from basic setup to advanced optimization. You'll learn to implement chunking, embedding, retrieval, and evaluation pipelines, plus advanced techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.
Retrieval Augmented Generation (RAG) underpins many enterprise AI applications. Claude handles general-knowledge tasks well, but it cannot answer questions about your internal documents, customer support data, or proprietary knowledge bases unless that content is retrieved and supplied as context, which is what RAG does.
In this guide, we'll build a production-grade RAG system using Claude and Voyage AI embeddings. We'll start with a basic implementation, then systematically improve it using advanced techniques that boosted our end-to-end accuracy from 71% to 81%.
## Understanding the RAG Pipeline
A RAG system works in three stages:
- Ingestion: Chunk and embed your documents into a vector database
- Retrieval: Find relevant chunks for a user's query
- Generation: Feed retrieved context to Claude for answer generation
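To make the three stages concrete before any APIs are involved, here is a minimal offline sketch. The `toy_embed` function is a word-count stand-in invented for this illustration (it is not how Voyage AI embeddings work), the two sample chunks are made up, and `build_prompt` only assembles the text that would be sent to Claude.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, int]:
    """Hypothetical stand-in embedding: lowercase word counts."""
    return Counter(text.lower().split())


def toy_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Ingestion: chunk (here, one chunk per string) and embed
chunks = [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
]
index: List[Tuple[str, Dict[str, int]]] = [(c, toy_embed(c)) for c in chunks]


# 2. Retrieval: rank chunks by similarity to the query
def retrieve(query: str, k: int = 1) -> List[str]:
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: toy_similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# 3. Generation: assemble the prompt that would be sent to the model
def build_prompt(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("How long do refunds take?"))
```

The rest of this guide replaces each toy piece with a real one: Voyage AI embeddings for ingestion and retrieval, and a Claude call for generation.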
## Level 1: Basic RAG Implementation

### Setting Up Your Environment
First, install the required libraries:
```bash
pip install anthropic voyageai pandas numpy scikit-learn matplotlib
```
Initialize your API clients:
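```python
import anthropic
import voyageai

# Initialize clients
# (in real deployments, load keys from environment variables instead of hardcoding them)
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")
```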
### Building a Simple Vector Database
For production, use a hosted vector database like Pinecone or Weaviate. For this guide, we'll use an in-memory implementation:
```python
import numpy as np
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SimpleVectorDB:
    """Minimal in-memory vector store backed by Voyage AI embeddings."""

    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, metadata: Optional[Dict] = None):
        # One API call per document; batch the inputs for larger corpora
        embedding = vo.embed([text], model="voyage-2").embeddings[0]
        self.documents.append({"text": text, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
        scores = [cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        # argsort is ascending, so take the last k indices and reverse them
        top_indices = np.argsort(scores)[-k:][::-1]
        return [(self.documents[i]["text"], scores[i]) for i in top_indices]
```
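The `search` method computes one similarity at a time in a Python loop, which is fine for a demo but slows down as the corpus grows. A possible vectorized variant is sketched below. `top_k_cosine` is a hypothetical helper name, and the sketch assumes the embeddings have already been fetched and stacked into a 2-D NumPy array with no all-zero rows.

```python
import numpy as np
from typing import List, Tuple


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3) -> List[Tuple[int, float]]:
    """Return (row index, cosine similarity) for the k most similar rows."""
    # Normalize once so that a dot product equals cosine similarity
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    k = max(1, min(k, len(scores)))
    # argsort is ascending: take the last k and reverse for best-first order
    top = np.argsort(scores)[-k:][::-1]
    return [(int(i), float(scores[i])) for i in top]


docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = top_k_cosine(np.array([1.0, 0.1]), docs, k=2)
print(result)  # row 0 ranks first, then row 2
```

Storing the normalized matrix instead of recomputing it per query would save further work. A hosted vector database does the equivalent with approximate nearest-neighbor indexes.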
### Chunking Strategy
Basic RAG chunks documents by heading:
```python
def chunk_by_heading(document: str) -> List[str]:
    """Split document into chunks based on markdown headings."""
    chunks = []
    current_chunk = []
    for line in document.split('\n'):
        # '##' also matches deeper headings such as '###'
        if line.startswith('##'):
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks
```
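Since `add_document` accepts metadata, it can be handy to record which heading each chunk came from. The sketch below is a hypothetical variant of the chunker (the name `chunk_with_headings` and the dictionary keys are illustrative, not part of the original pipeline) together with a small made-up document that shows where the splits fall.

```python
from typing import Dict, List


def chunk_with_headings(document: str) -> List[Dict[str, str]]:
    """Split on markdown '##'/'###' headings, recording each chunk's heading."""
    chunks: List[Dict[str, str]] = []
    heading, lines = "", []
    for line in document.split("\n"):
        if line.startswith("##"):
            if lines:
                chunks.append({"heading": heading, "text": "\n".join(lines)})
            # Strip the leading '#' characters to get the bare heading text
            heading, lines = line.lstrip("#").strip(), [line]
        else:
            lines.append(line)
    if lines:
        chunks.append({"heading": heading, "text": "\n".join(lines)})
    return chunks


sample = "# Guide\nIntro text.\n## Setup\nInstall it.\n### Keys\nSet the key."
for chunk in chunk_with_headings(sample):
    print(repr(chunk["heading"]), "->", repr(chunk["text"]))
```

Text that appears before the first `##` heading (here the title and intro) ends up in a chunk with an empty heading, which mirrors how the heading-based chunker treats leading content.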
### The Basic RAG Query Function
```python
def basic_rag_query(query: str, vector_db: SimpleVectorDB) -> str:
    # Retrieve relevant chunks
    results = vector_db.search(query, k=3)
    context = "\n\n---\n\n".join(text for text, _score in results)

    # Generate answer with Claude
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```
## Building a Robust Evaluation System
Don't rely on "vibes" to evaluate your RAG system. We built a synthetic evaluation dataset with 100 samples, each containing:
- A question
- Relevant document chunks (ground truth)
- A correct answer
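One way to represent such a sample in code is a small dataclass plus a sanity check that runs before any evaluation. The guide does not prescribe a schema, so the field names, the `validate` helper, and the two example records below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalSample:
    """One evaluation item (hypothetical schema)."""
    question: str
    relevant_chunk_ids: List[str]  # ground-truth chunks for retrieval metrics
    correct_answer: str            # reference answer for end-to-end accuracy


def validate(samples: List[EvalSample]) -> List[str]:
    """Return human-readable problems found in the dataset."""
    problems = []
    for i, s in enumerate(samples):
        if not s.question.strip():
            problems.append(f"sample {i}: empty question")
        if not s.relevant_chunk_ids:
            problems.append(f"sample {i}: no ground-truth chunks")
        if not s.correct_answer.strip():
            problems.append(f"sample {i}: empty answer")
    return problems


# Made-up records, only to exercise the check
samples = [
    EvalSample("How do I rotate an API key?", ["security#keys"], "Use the key settings page."),
    EvalSample("What is the refund window?", [], "30 days."),
]
print(validate(samples))
```

A sample without ground-truth chunks would make recall undefined for that query, so it is worth catching before computing any metrics.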
### Key Metrics
#### Retrieval Metrics
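Precision: Of the chunks we retrieved, how many were relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall: Of all correct chunks, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score: Harmonic mean of precision and recall.

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Mean Reciprocal Rank (MRR): How high did the first relevant result rank? For each query $i$ in the evaluation set $Q$, take the reciprocal of the rank of the first relevant result, then average over all queries.

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$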
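The retrieval metrics above translate into a few lines of Python. This is a sketch under the assumption that chunks are identified by string IDs. The function names and the example IDs are illustrative, and a query with no relevant result among those retrieved is given a reciprocal rank of 0.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], correct: Set[str]) -> Dict[str, float]:
    """Precision, recall, F1, and reciprocal rank for a single query."""
    retrieved_set = set(retrieved)
    hits = retrieved_set & correct
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Reciprocal rank: 1 / (1-based position of the first relevant result)
    rr = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in correct:
            rr = 1.0 / rank
            break
    return {"precision": precision, "recall": recall, "f1": f1, "rr": rr}


def mean_reciprocal_rank(per_query: List[Dict[str, float]]) -> float:
    """Average the per-query reciprocal ranks."""
    return sum(m["rr"] for m in per_query) / len(per_query)


q1 = retrieval_metrics(["c3", "c1", "c9"], {"c1", "c2"})
q2 = retrieval_metrics(["c2"], {"c2"})
print(q1)
print(mean_reciprocal_rank([q1, q2]))
```

For the first query, one of three retrieved chunks is relevant (precision 1/3), one of two correct chunks was found (recall 1/2), and the first relevant chunk sits at rank 2 (reciprocal rank 1/2).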
#### End-to-End Metric
Accuracy: Does Claude's answer match the ground truth? We use Claude itself as a judge:

```python
def evaluate_answer(question: str, generated: str, ground_truth: str) -> bool:
    prompt = f"""Question: {question}
Generated Answer: {generated}
Correct Answer: {ground_truth}
Does the generated answer correctly address the question, judged against the correct answer? Answer only YES or NO."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    # Tolerate case differences and trailing punctuation such as "Yes."
    return response.content[0].text.strip().upper().startswith("YES")
```
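Turning individual verdicts into an end-to-end accuracy figure is a simple aggregation. In the sketch below the answering function and the judge are passed in as callables, so the loop can be exercised offline. `end_to_end_accuracy`, `fake_answer`, and `fake_judge` are hypothetical names, and the stubs and sample questions exist only for illustration.

```python
from typing import Callable, List, Tuple


def end_to_end_accuracy(
    samples: List[Tuple[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Fraction of (question, ground_truth) samples the judge accepts."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    for question, ground_truth in samples:
        generated = answer_fn(question)
        if judge_fn(question, generated, ground_truth):
            correct += 1
    return correct / len(samples)


# Offline stubs standing in for the RAG pipeline and the LLM judge
def fake_answer(question: str) -> str:
    return "Paris" if "France" in question else "unknown"


def fake_judge(question: str, generated: str, ground_truth: str) -> bool:
    return generated.strip().lower() == ground_truth.strip().lower()


samples = [("Capital of France?", "Paris"), ("Capital of Peru?", "Lima")]
accuracy = end_to_end_accuracy(samples, fake_answer, fake_judge)
print(f"{accuracy:.0%}")
```

In a real run, `answer_fn` would wrap the RAG query function (for example `lambda q: basic_rag_query(q, vector_db)`) and `judge_fn` would be `evaluate_answer`.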
## Level 2: Summary Indexing
Basic RAG loses context when chunks are too granular. Summary indexing enriches each entry with a short Claude-generated summary, so the embedded text carries higher-level context about the document it came from:

```python
def create_summary_index(documents: List[str]) -> SimpleVectorDB:
    db = SimpleVectorDB()
    for doc in documents:
        # Create a summary of the document (only the first 2,000 characters are sent)
        summary = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this document in 2-3 sentences:\n\n{doc[:2000]}"
            }]
        ).content[0].text
        # Embed the summary together with the full text so the summary
        # influences retrieval; keep it in metadata as well
        db.add_document(
            text=f"{summary}\n\n{doc}",
            metadata={"summary": summary}
        )
    return db
```
## Level 3: Adding Re-Ranking
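Re-ranking has Claude order the retrieved chunks by relevance before they reach the generation step. In our evaluation it raised MRR from 0.80 to 0.87:

```python
import re


def rerank_chunks(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""Query: {query}
Rank the chunks below by their relevance to the query.
Return only the chunk indices, most relevant first, separated by spaces.
Chunks:
"""
    for i, chunk in enumerate(chunks):
        prompt += f"\n[{i}]: {chunk[:500]}..."
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the ranked indices; the regex tolerates commas and brackets
    ranked_indices = []
    for match in re.findall(r"\d+", response.content[0].text):
        idx = int(match)
        # Skip out-of-range and repeated indices
        if idx < len(chunks) and idx not in ranked_indices:
            ranked_indices.append(idx)
    # Fall back to the original retrieval order if nothing could be parsed
    if not ranked_indices:
        ranked_indices = list(range(len(chunks)))
    return [chunks[i] for i in ranked_indices[:top_k]]
```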
## Putting It All Together
```python
def advanced_rag_query(query: str, vector_db: SimpleVectorDB) -> str:
    # Initial retrieval (get more candidates for re-ranking)
    initial_results = vector_db.search(query, k=10)
    initial_chunks = [text for text, _score in initial_results]

    # Re-rank with Claude
    top_chunks = rerank_chunks(query, initial_chunks, top_k=3)
    context = "\n\n---\n\n".join(top_chunks)

    # Generate answer
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```
## Results: The Impact of Each Improvement
Our systematic improvements yielded measurable gains:
| Metric | Basic RAG | +Summary Index | +Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.67 | 0.69 |
| Avg F1 Score | 0.52 | 0.53 | 0.54 |
| Avg MRR | 0.74 | 0.80 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
## Production Considerations
- Rate Limits: Full evaluations can hit API rate limits. Use Tier 2+ accounts for production workloads.
- Token Budget: Summary indexing and re-ranking increase token usage. Monitor costs.
- Vector Database: Replace the in-memory DB with Pinecone, Weaviate, or Qdrant for production.
- Caching: Cache embeddings and common queries to reduce API calls.
- Monitoring: Log all queries, retrievals, and generations for debugging and improvement.
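As an illustration of the caching point, the sketch below memoizes embeddings by content hash so that repeated texts do not trigger another API call. `EmbeddingCache` is a hypothetical helper, the store is a plain in-memory dictionary (a production system would more likely use a persistent store), and `fake_embed` is a stub so the example runs without network access.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by content hash (in-memory, hypothetical helper)."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.api_calls = 0  # number of times the wrapped function ran

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect uncached texts, preserving order and dropping duplicates
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._store))
        if missing:
            self.api_calls += 1
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]


# Stub embedder so the sketch runs offline
def fake_embed(texts: List[str]) -> List[List[float]]:
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed)
cache.embed(["alpha", "beta"])
cache.embed(["alpha", "gamma"])  # only "gamma" reaches the embedder
print(cache.api_calls)
```

With the Voyage client from earlier in this guide, `embed_fn` could be `lambda texts: vo.embed(texts, model="voyage-2").embeddings`. Batching all uncached texts into one call also reduces the number of requests counted against rate limits.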
## Key Takeaways
- Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline
- Summary indexing preserves document context. Recall barely moved (0.66 to 0.67), but MRR rose from 0.74 to 0.80 and end-to-end accuracy from 71% to 76%
- Re-ranking with Claude raised MRR further (from 0.80 to 0.87, up from 0.74 for basic RAG), so the most relevant context is more likely to reach the model first
- End-to-end accuracy improved 10 percentage points (71% to 81%) through these optimizations
- Build a synthetic evaluation dataset before optimizing—you can't improve what you can't measure