Building a Production-Grade RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.
What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques including summary indexing and re-ranking with Claude
Prerequisites
Before diving in, you'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+
- Basic familiarity with Python and API usage
Required Libraries
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG" — a bare-bones approach that includes three steps:
- Chunk documents by heading (each chunk contains content from one subheading)
- Embed each chunk using Voyage AI embeddings
- Retrieve relevant chunks using cosine similarity
Initialize a Vector Database
For this example, we'll use an in-memory vector DB. For production, consider a hosted solution like Pinecone, Weaviate, or Chroma.
import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Embed documents and store them alongside their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(
            texts, model="voyage-2", input_type="document"
        ).embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve the top-k most similar documents."""
        query_embedding = self.client.embed(
            [query], model="voyage-2", input_type="query"
        ).embeddings[0]
        # Rank stored documents by dot-product similarity with the query
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
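The `search` method above scores with a raw dot product, which equals cosine similarity only when embeddings are unit-length. If you are unsure whether your embedding model normalizes its output, compute cosine similarity explicitly:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity that does not assume unit-length vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```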
Query Claude with Retrieved Context
from anthropic import Anthropic

anthropic = Anthropic(api_key="your-anthropic-api-key")

def answer_with_rag(query: str, context_chunks: List[str]) -> str:
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Building an Evaluation System
When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. This allows you to pinpoint where improvements are needed.
Creating an Evaluation Dataset
You'll need a dataset with:
- A question
- Relevant chunks (ground truth for retrieval)
- A correct answer (ground truth for end-to-end)
[
    {
        "question": "How do I set up rate limiting in Claude?",
        "relevant_chunks": ["chunk_1_content", "chunk_2_content"],
        "correct_answer": "To set up rate limiting..."
    }
]
Key Metrics
#### Retrieval Metrics
Precision measures the proportion of retrieved chunks that are actually relevant.
Precision = True Positives / Total Retrieved
Recall measures the completeness of retrieval — how many of the relevant chunks were retrieved.
Recall = True Positives / Total Relevant
F1 Score is the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
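Assuming retrieved and ground-truth chunks can be compared by exact content match, these three metrics can be computed together for a single query:

```python
from typing import Dict, List

def retrieval_metrics(retrieved: List[str], relevant: List[str]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for one query."""
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    # Harmonic mean of precision and recall (0.0 when both are 0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Average these per-query scores over your whole evaluation dataset to get the aggregate numbers reported later in this guide.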
Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the results.
def calculate_mrr(retrieved_chunks, relevant_chunks):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
#### End-to-End Metric
Accuracy measures whether the final answer is correct. This requires human evaluation or an LLM-as-judge.

def evaluate_accuracy(question, generated_answer, correct_answer):
    # Use Claude to judge whether the generated answer is correct
    prompt = f"""Question: {question}

Generated Answer: {generated_answer}

Correct Answer: {correct_answer}

Is the generated answer correct? Answer only 'yes' or 'no'."""
    # ... call Claude and parse response
Level 2: Summary Indexing
Basic RAG often fails when a question requires synthesizing information across multiple chunks. Summary indexing addresses this by creating condensed representations of document sections.
How It Works
- For each document chunk, generate a summary using Claude
- Store both the original chunk and its summary
- During retrieval, search against summaries first, then retrieve full chunks
def create_summary(chunk_content: str) -> str:
    prompt = f"Summarize the following text in 2-3 sentences:\n\n{chunk_content}"
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
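Step 3 above (search against summaries, then return the full chunks) can be sketched independently of any embedding provider. This is a simplified illustration, with the similarity function passed in as a parameter so you can plug in Voyage embeddings plus the dot product from earlier:

```python
from typing import Callable, List, Sequence

def search_via_summaries(
    query_embedding: Sequence[float],
    summary_embeddings: List[Sequence[float]],
    full_chunks: List[str],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    k: int = 3,
) -> List[str]:
    """Score summaries against the query, but return the matching full chunks."""
    # Positions i in summary_embeddings and full_chunks refer to the same chunk
    ranked = sorted(
        range(len(summary_embeddings)),
        key=lambda i: similarity(query_embedding, summary_embeddings[i]),
        reverse=True,
    )
    return [full_chunks[i] for i in ranked[:k]]
```

Keeping summaries and full chunks in parallel lists (or one record per chunk) is the key invariant: the summary is only a retrieval proxy, and the full chunk is what reaches Claude's context window.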
Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second stage to retrieval. After initial retrieval, Claude re-ranks the chunks by relevance to the specific query.
Implementation
def rerank_chunks(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    prompt = f"""Given the query: "{query}"

Rank the following chunks by relevance (most relevant first).

Chunks:
"""
    for i, chunk in enumerate(chunks):
        prompt += f"\n[{i+1}] {chunk[:200]}..."
    prompt += "\n\nReturn the chunk numbers in order of relevance, comma-separated."
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the response into ordered 0-based indices
    indices = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [chunks[i] for i in indices[:top_k]]
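The parsing step above assumes Claude returns exactly a comma-separated list of numbers, which is optimistic in practice. A slightly more defensive parser (a sketch, not part of any official API) tolerates surrounding prose, duplicates, and out-of-range values:

```python
import re
from typing import List

def parse_ranking(response_text: str, num_chunks: int) -> List[int]:
    """Extract 1-based chunk numbers from a model response as 0-based indices.

    Ignores non-numeric text, duplicate numbers, and out-of-range values.
    """
    indices: List[int] = []
    for token in re.findall(r"\d+", response_text):
        idx = int(token) - 1
        if 0 <= idx < num_chunks and idx not in indices:
            indices.append(idx)
    return indices
```

Swapping this into `rerank_chunks` keeps a slightly off-format model response from raising a `ValueError` or `IndexError` mid-pipeline.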
Performance Gains
With summary indexing and re-ranking, you can expect improvements like:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Best Practices for Production RAG
- Chunk strategically: Experiment with chunk sizes (256-512 tokens often work well) and overlap
- Use dedicated embedding models: Voyage AI and Cohere offer purpose-built embeddings for RAG
- Implement caching: Cache embeddings and common queries to reduce latency and cost
- Monitor and iterate: Continuously evaluate your system and add edge cases to your test set
- Consider hybrid search: Combine semantic search with keyword matching for better recall
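For the hybrid-search suggestion, one simple way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is illustrative rather than prescriptive; the constant `k = 60` is the value commonly used in the RRF literature:

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse multiple ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works purely on ranks, it needs no score normalization between the semantic and keyword retrievers, which makes it an easy first hybrid baseline.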
Key Takeaways
- Evaluate retrieval and generation separately to identify bottlenecks in your RAG pipeline. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing improves recall by creating condensed representations that capture the essence of document sections, making retrieval more effective for complex queries.
- Re-ranking with Claude significantly boosts MRR by ensuring the most relevant chunks appear first, which improves the quality of the final answer.
- Start simple, then iterate — a basic RAG pipeline can be surprisingly effective. Add complexity like summary indexing and re-ranking only when evaluation shows they're needed.
- Build a robust evaluation dataset with diverse questions, including those requiring synthesis across multiple chunks, to stress-test your system.