How to Build a Contextual Retrieval System with Claude: A Practical Guide
Learn to reduce RAG retrieval failure rates by 35% using Contextual Embeddings and BM25 with Claude. Step-by-step guide with code examples, evaluation metrics, and cost optimization tips.
This guide shows you how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. You'll learn to add context to document chunks before embedding, use Contextual BM25, and apply reranking—all with practical code examples and cost-saving prompt caching techniques.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the surrounding context, leading to poor search results and inaccurate answers.
Contextual Retrieval solves this by adding relevant context to each chunk before embedding. The result? Anthropic's testing shows a 35% reduction in top-20-chunk retrieval failure rates across multiple datasets. In this guide, you'll learn how to implement this technique with Claude, complete with code examples and performance benchmarks.
What You'll Need
Prerequisites
- Intermediate Python skills
- Basic understanding of RAG and vector databases
- Command-line proficiency
System Requirements
- Python 3.8+
- 4GB+ RAM
- ~5–10 GB disk space for vector databases
- Docker (optional, for BM25 search)
API Keys
- Anthropic API key (free tier works)
- Voyage AI API key (for embeddings)
- Cohere API key (for reranking)
Step 1: Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere numpy pandas
Then initialize your clients:
import anthropic
import voyageai
import cohere
# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client("YOUR_COHERE_KEY")
Step 2: Building a Basic RAG Baseline
Before improving retrieval, you need a baseline. We'll use a dataset of 9 codebases (248 queries with golden chunks) to measure performance.
Load and Chunk Your Documents
import json

# Load pre-chunked codebase data
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Load evaluation queries
with open("data/evaluation_set.jsonl", "r") as f:
    eval_data = [json.loads(line) for line in f]
Create Embeddings and Index
# Generate embeddings for each chunk
chunk_texts = [chunk["text"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple vector index (use FAISS or Chroma for production)
import numpy as np

index = {}
for i, emb in enumerate(embeddings):
    index[i] = {
        "text": chunk_texts[i],
        "embedding": np.array(emb)
    }
Evaluate with Pass@k
We use Pass@k—does the golden chunk appear in the top-k results? Here's how to compute it:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for i, item in index.items():
        sim = cosine_similarity(query_emb, item["embedding"])
        scores.append((i, sim))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [index[idx]["text"] for idx, _ in scores[:k]]

# Evaluate
pass_at_10 = 0
for item in eval_data:
    results = retrieve(item["query"], k=10)
    if item["golden_chunk"] in results:
        pass_at_10 += 1
print(f"Baseline Pass@10: {pass_at_10/len(eval_data)*100:.1f}%")
Expect a baseline around 87%—good, but we can do better.
Step 3: Implementing Contextual Embeddings
Contextual Embeddings add surrounding context to each chunk before embedding. This prevents chunks from being retrieved out of context.
How It Works
For each chunk, you ask Claude to generate a concise context snippet that includes:
- The document title or section heading
- The preceding content summary
- The chunk's role in the overall document
Generate Context with Claude
def generate_chunk_context(chunk_text, surrounding_text, doc_title):
    prompt = f"""
Document: {doc_title}
Surrounding text: {surrounding_text}
Chunk: {chunk_text}

Provide a brief context (1-2 sentences) explaining what this chunk is about
and how it fits into the document. Focus on key entities, topics, and purpose.
"""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Optimize with Prompt Caching
Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing a shared prompt prefix across calls: the prefix is written to the cache once, then read at a much lower rate on every subsequent request. Note that cached prefixes must meet a minimum length (1,024 tokens for Haiku), so in practice you cache the full document text rather than a short instruction:
def generate_context_cached(chunks_batch, doc_title, doc_text):
    # The document text goes in the system prompt with cache_control, so it
    # is cached once and read cheaply on every later call in the batch.
    system_prompt = (
        f"You are a context generator for chunks from '{doc_title}'. "
        f"For each chunk, provide a 1-2 sentence context. "
        f"Here is the full document:\n\n{doc_text}"
    )
    contexts = []
    for chunk in chunks_batch:
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": f"Chunk: {chunk}"}],
            max_tokens=100
        )
        contexts.append(response.content[0].text)
    return contexts
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex.
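To see why caching matters at scale, here is a back-of-envelope cost estimate. The per-token rates below are illustrative placeholders, not current Anthropic pricing; substitute the published rates for your model.

```python
# Rough estimate of context-generation cost with and without prompt caching.
# Without caching, the shared document prefix is billed at the full input
# rate on every call; with caching, reads of the cached prefix are billed
# at the (much lower) cached-read rate.

def estimate_cost(num_chunks, doc_tokens, chunk_tokens, output_tokens,
                  input_rate, cached_rate, output_rate):
    """Return (uncached, cached) dollar cost for one context per chunk."""
    uncached = num_chunks * ((doc_tokens + chunk_tokens) * input_rate
                             + output_tokens * output_rate)
    cached = num_chunks * (doc_tokens * cached_rate
                           + chunk_tokens * input_rate
                           + output_tokens * output_rate)
    return uncached, cached

# Example: 1,000 chunks sharing an 8,000-token document prefix,
# with placeholder per-token rates.
uncached, cached = estimate_cost(
    num_chunks=1000, doc_tokens=8000, chunk_tokens=800, output_tokens=100,
    input_rate=0.25e-6, cached_rate=0.03e-6, output_rate=1.25e-6,
)
print(f"without caching: ${uncached:.2f}, with caching: ${cached:.2f}")
```

The savings grow with document length, since the document prefix dominates the input token count.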
Embed Contextual Chunks
contextual_chunks = []
for chunk in chunks:
    context = generate_chunk_context(
        chunk["text"],
        chunk.get("surrounding_text", ""),
        chunk.get("doc_title", "Unknown")
    )
    contextual_chunks.append(f"{context}\n\n{chunk['text']}")

# Re-embed with context
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings
Evaluate Again
Re-run the evaluation with your new contextual embeddings. With a baseline Pass@10 around 87%, a 35% reduction in retrieval failures cuts the failure rate from roughly 13% to about 8.5%, so expect Pass@10 in the low 90s.
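Since you will repeat this evaluation after each improvement (baseline, contextual embeddings, hybrid, reranked), it helps to factor it into a reusable helper. `retrieve_fn` is any callable mapping a query to a ranked list of chunk texts, so you can plug in `retrieve`, `hybrid_search`, or a reranked pipeline:

```python
def pass_at_k(eval_data, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(
        1 for item in eval_data
        if item["golden_chunk"] in retrieve_fn(item["query"])[:k]
    )
    return hits / len(eval_data)

# Toy sanity check with a fake retriever (no API calls needed):
toy_eval = [{"query": "q1", "golden_chunk": "a"},
            {"query": "q2", "golden_chunk": "z"}]
fake_retrieve = lambda q: ["a", "b", "c"]
print(pass_at_k(toy_eval, fake_retrieve, k=3))  # 0.5: only q1's chunk found
```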
Step 4: Adding Contextual BM25
BM25 is a text-based retrieval method that complements embeddings. By applying the same contextual prefix to BM25, you get Contextual BM25—a hybrid approach that captures both semantic and keyword matches.
Set Up BM25
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks
tokenized_corpus = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    vector_scores = [cosine_similarity(query_emb, emb) for emb in contextual_embeddings]
    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize and combine
    vector_scores = np.array(vector_scores) / max(vector_scores)
    bm25_scores = bm25_scores / max(bm25_scores)
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[-k:][::-1]
    return [chunks[i]["text"] for i in top_indices]
Hybrid search typically yields another 2–5% improvement over embeddings alone.
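The weighted blend above requires normalizing two very different score scales. Reciprocal rank fusion (RRF) is a common alternative that sidesteps normalization entirely by combining ranks instead of raw scores; a minimal sketch, with the conventional constant k=60:

```python
def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of doc ids via reciprocal rank fusion.

    `rankings` is a list of lists of doc ids, best-first. Each list
    contributes 1 / (k + rank) to a document's fused score, so documents
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: doc 2 ranks first in both lists, so it fuses to the top.
print(rrf_fuse([[2, 0, 1], [2, 1, 0]])[:1])  # [2]
```

RRF is robust when one retriever's scores are poorly calibrated, at the cost of discarding score magnitude information.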
Step 5: Reranking for Final Precision
Reranking reorders your top-k results using a more powerful model. Cohere's rerank API works well here:
def rerank(query, results, k=5):
    reranked = co.rerank(
        query=query,
        documents=results,
        top_n=k,
        model="rerank-english-v2.0"
    )
    # The rerank response returns indices into the input list, ordered by
    # relevance; map them back to the original documents.
    return [results[r.index] for r in reranked.results]

# Use in pipeline
def final_retrieve(query):
    initial_results = hybrid_search(query, k=20)  # Get more candidates
    return rerank(query, initial_results, k=5)    # Rerank to top 5
Reranking can push Pass@5 above 97% in many cases.
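With retrieval in place, the final step is handing the top chunks to Claude as grounding context. A minimal sketch: `final_retrieve` and `claude` are the objects defined earlier, and the prompt template is an illustrative assumption, not a fixed API.

```python
def build_rag_prompt(query, chunks):
    # Join retrieved chunks with a visible separator so the model can tell
    # where one passage ends and the next begins.
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer(query):
    # Retrieve, then generate: the standard RAG completion step.
    prompt = build_rag_prompt(query, final_retrieve(query))
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```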
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function (provided in the Anthropic cookbook) as a custom chunking option. The function adds context to each document chunk before storage.
Cost Management
- Use Claude 3 Haiku for context generation (fastest/cheapest)
- Enable prompt caching to reduce API calls by up to 90%
- Batch your context generation requests
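The batching point above can be sketched with a thread pool, since context-generation calls are I/O-bound. Here `context_fn` stands in for whatever generator you use (e.g. `generate_chunk_context` from Step 3); the worker count is an illustrative choice, and you should stay within your account's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_contexts_parallel(chunks, context_fn, max_workers=8):
    """Generate one context string per chunk, preserving input order."""
    def worker(chunk):
        return context_fn(
            chunk["text"],
            chunk.get("surrounding_text", ""),
            chunk.get("doc_title", "Unknown"),
        )
    # pool.map keeps results in the same order as the input chunks,
    # even though the underlying calls complete out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))
```

A simple stub makes the behavior easy to verify before wiring in real API calls: `generate_contexts_parallel([{"text": "t1"}], lambda t, s, d: f"ctx for {t}")` returns `["ctx for t1"]`.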
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by adding document context to each chunk before embedding, solving the lost-context problem that arises when documents are split into isolated chunks.
- Hybrid search (Contextual Embeddings + Contextual BM25) outperforms either method alone, capturing both semantic meaning and keyword precision.
- Reranking adds a final precision boost, pushing Pass@5 accuracy above 97% in many enterprise datasets.
- Prompt caching makes Contextual Retrieval cost-effective for production, reducing API overhead by reusing system prompts across batch operations.
- The technique works across cloud platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex—making it accessible regardless of your infrastructure.