Mastering Contextual Retrieval: Boost RAG Accuracy by 35% with Claude
Learn how to implement Contextual Embeddings and Contextual BM25 with Claude to dramatically improve RAG retrieval accuracy. Includes code examples, cost optimization with prompt caching, and AWS Bedrock deployment.
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% using Claude, Voyage AI, and prompt caching for cost efficiency.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering customer support bots, internal knowledge base Q&A, legal document analysis, and code generation. But there's a persistent problem: chunked documents lose context. When you split a 50-page technical manual into 500-character chunks, each chunk becomes an orphan—a fragment without its parent document's narrative.
Anthropic's research team discovered a powerful fix: Contextual Retrieval. By prepending relevant context to each chunk before embedding, they reduced top-20-chunk retrieval failure rates by an average of 35% across diverse datasets. This guide walks you through implementing this technique with Claude, including cost-saving strategies using prompt caching and deployment options for AWS Bedrock.
What You'll Build
By the end of this guide, you'll have:
- A basic RAG pipeline with baseline performance metrics
- A Contextual Embeddings system that adds chunk-specific context
- A Contextual BM25 hybrid search for even better retrieval
- A reranking layer to maximize accuracy
Prerequisites
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings
System Requirements:
- Python 3.8+
- Docker installed (optional, for BM25 search)
- 4GB+ available RAM
- ~5-10 GB disk space for vector databases
API Access:
- Anthropic API key
- Voyage AI API key
- Cohere API key (for reranking)
Time & Cost:
- Expected completion: 30-45 minutes
- API costs: ~$5-10 for the full dataset
Step 1: Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere pandas numpy rank-bm25
Initialize your clients:
import anthropic
import voyageai
import cohere
# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client(api_key="YOUR_COHERE_KEY")
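Hard-coded keys are easy to leak into version control. As a minimal sketch, assuming you prefer environment variables, the helper below reads each key and fails early with a clear message. The `require_env` helper and the variable names in `KEY_NAMES` are illustrative assumptions, not part of any SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this guide.")
    return value


# Hypothetical variable names; adjust them to match your own setup.
KEY_NAMES = {
    "anthropic": "ANTHROPIC_API_KEY",
    "voyage": "VOYAGE_API_KEY",
    "cohere": "COHERE_API_KEY",
}

# Usage (requires the SDKs from Step 1 and the variables above to be set):
# claude = anthropic.Anthropic(api_key=require_env(KEY_NAMES["anthropic"]))
# vo = voyageai.Client(api_key=require_env(KEY_NAMES["voyage"]))
# co = cohere.Client(api_key=require_env(KEY_NAMES["cohere"]))
```

Failing at startup, rather than on the first API call, keeps a missing key from surfacing halfway through an embedding run.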
Step 2: Building a Basic RAG Baseline
Before improving retrieval, establish a baseline. We'll use a dataset of 9 codebases (248 queries with known "golden chunks") and measure Pass@k—whether the correct chunk appears in the top-k results.
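import json

import numpy as np

# Load the pre-chunked dataset: a list of dicts, each with a "text" field
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

# Load the evaluation queries (one JSON object per line)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_queries = [json.loads(line) for line in f]

# Basic chunking (character-based split), for documents that are not pre-chunked
def basic_chunk(text, chunk_size=500, overlap=50):
    pieces = []
    for i in range(0, len(text), chunk_size - overlap):
        pieces.append(text[i:i + chunk_size])
    return pieces

# Embed the chunk texts using Voyage AI
# (large datasets may need to be sent in batches to stay within API limits)
chunk_embeddings = vo.embed(
    texts=[chunk["text"] for chunk in chunks],
    model="voyage-2",
    input_type="document"
).embeddings

# Cosine similarity between two embedding vectors
def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple cosine similarity search
def search(query, k=10):
    query_emb = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    similarities = [
        cosine_similarity(query_emb, chunk_emb)
        for chunk_emb in chunk_embeddings
    ]
    top_indices = sorted(
        range(len(similarities)),
        key=lambda i: similarities[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]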
Baseline Result: Pass@10 ≈ 87%—decent, but we can do better.
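The Pass@k metric from Step 2 can be sketched as follows, using the definition given there: a query passes if a correct chunk appears in the top-k results. The record shapes are assumptions for illustration: each query is taken to be a dict with `query` and `golden_ids` fields, and `search_fn` is taken to return a ranked list of chunk IDs. Your evaluation file may use different field names.

```python
from typing import Callable, Iterable, List, Set


def pass_at_k(
    queries: Iterable[dict],
    search_fn: Callable[[str, int], List[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one golden chunk ID in the top-k results.

    Assumed (hypothetical) shapes: each query is {"query": str, "golden_ids": [str, ...]}
    and search_fn(query, k) returns a ranked list of chunk IDs.
    """
    total = 0
    hits = 0
    for item in queries:
        golden: Set[str] = set(item["golden_ids"])
        retrieved = search_fn(item["query"], k)[:k]
        total += 1
        if golden.intersection(retrieved):
            hits += 1
    return hits / total if total else 0.0
```

Running the same function against each retrieval variant in this guide keeps the comparisons on an equal footing.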
Step 3: Implementing Contextual Embeddings
The core idea is simple: before embedding each chunk, prepend a short context that explains where the chunk comes from. This context is generated by Claude itself.
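To make the idea concrete, here is a small before-and-after illustration. The chunk, the file name, and the context text are all invented for this example; in the real pipeline the context comes from Claude.

```python
def contextualize(context: str, chunk_text: str) -> str:
    """Prepend a generated context to a chunk, separated by a blank line."""
    return f"{context.strip()}\n\n{chunk_text}"


# Hypothetical chunk and context, invented purely for illustration.
chunk_text = (
    "def refresh(self):\n"
    "    return self._client.post('/token', data=self._refresh_payload())"
)
context = (
    "This chunk is from auth/session.py in a hypothetical web service. "
    "It defines the method that exchanges a refresh token for a new access token."
)

print(contextualize(context, chunk_text))
```

On its own, the chunk never mentions authentication or sessions. With the context prepended, both the embedding model and a keyword index see those terms.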
def generate_chunk_context(chunk, document_title, surrounding_text):
    """Use Claude to generate context for a chunk."""
    prompt = f"""You are helping to improve a RAG system.
Document: {document_title}
Here is a chunk from this document:
<chunk>{chunk}</chunk>
Here is the surrounding text (100 chars before and after):
<context>{surrounding_text}</context>
Generate a brief context (2-3 sentences) that explains what this chunk is about
and how it fits into the larger document. Focus on:
- What topic or concept this chunk covers
- How it relates to adjacent content
- Any key entities or references
Return ONLY the context, no additional text."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Apply to all chunks
contextual_chunks = []
for chunk in chunks:
    context = generate_chunk_context(
        chunk["text"],
        chunk["document_title"],
        chunk["surrounding_text"]
    )
    contextual_chunks.append(f"{context}\n\n{chunk['text']}")

# Embed the contextualized chunks
contextual_embeddings = vo.embed(
    texts=contextual_chunks,
    model="voyage-2",
    input_type="document"
).embeddings
Result: Pass@10 jumps from ~87% to ~95%—a 62% reduction in retrieval failures.
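The 62% figure is a relative reduction in the failure rate, not a change in Pass@10 itself. A short worked calculation, using only the two Pass@10 values quoted above:

```python
def failure_reduction(pass_before: float, pass_after: float) -> float:
    """Relative reduction in failure rate when Pass@k rises from pass_before to pass_after."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    if fail_before <= 0:
        raise ValueError("pass_before must be below 1.0")
    return (fail_before - fail_after) / fail_before


# Failures drop from 13% to 5% of queries: (0.13 - 0.05) / 0.13
print(f"{failure_reduction(0.87, 0.95):.0%}")  # prints 62%
```

This is the same kind of measure as the 35% average quoted in the introduction, which refers to top-20 retrieval across several datasets rather than Pass@10 on this one.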
Cost Optimization with Prompt Caching
Generating context for thousands of chunks can be expensive. Claude's prompt caching feature dramatically reduces costs by reusing shared prompt prefixes.
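# With prompt caching, the system prompt is cached and reused across calls.
# Caching only takes effect when the marked prefix meets the model's minimum
# cacheable length, so the elided text below needs to hold substantial shared
# content; a one-line system prompt on its own is too short to be cached.
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=150,
    system=[{
        "type": "text",
        "text": "You are a context generator for a RAG system...",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": prompt}]
)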
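Caching pays off most when the shared prefix is large, which in practice means the full source document. The sketch below builds a request that marks the document block for caching and leaves the chunk in a separate, uncached block. It assumes the full document text is available as `document_text`, which the dataset fields used in this guide do not include. The `build_context_request` helper and its prompt wording are illustrative, not taken from the original notebook.

```python
from typing import Any, Dict


def build_context_request(document_text: str, chunk_text: str,
                          model: str = "claude-3-haiku-20240307") -> Dict[str, Any]:
    """Build keyword arguments for claude.messages.create with the document cached.

    The full document goes in its own content block marked with cache_control, so
    every chunk from the same document can reuse the cached prefix. Only the
    final block, which holds the chunk, differs between calls.
    """
    return {
        "model": model,
        "max_tokens": 150,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": (
                        "Here is a chunk from the document above:\n"
                        f"<chunk>\n{chunk_text}\n</chunk>\n"
                        "Give a brief context that situates this chunk within the "
                        "document. Return ONLY the context."
                    ),
                },
            ],
        }],
    }


# Usage (needs the `claude` client from Step 1 and a real document):
# response = claude.messages.create(**build_context_request(doc, chunk))
# print(response.usage.cache_creation_input_tokens,
#       response.usage.cache_read_input_tokens)
```

Processing all chunks of one document back to back keeps the cached prefix warm. The two usage fields in the commented lines show whether a call wrote to the cache or read from it.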
Because cache reads are billed at roughly a tenth of the normal input-token price, prompt caching can reduce the input cost of context generation by up to 90%, making Contextual Embeddings practical for production.
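A rough estimate shows where the "up to 90%" comes from. The sketch assumes the first call for a document writes the cache at 1.25x the base input price and every later call reads it at 0.10x. These multipliers reflect Anthropic's published pricing for the default cache at the time of writing, so treat them as adjustable assumptions. The estimate also assumes the cache does not expire between calls.

```python
def cached_input_cost_ratio(n_chunks: int,
                            write_multiplier: float = 1.25,
                            read_multiplier: float = 0.10) -> float:
    """Cost of the shared document tokens with caching, relative to no caching.

    Assumes the first call writes the cache, the remaining calls read it, and
    the cache never expires in between. Multipliers are assumptions to adjust.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    cached = write_multiplier + read_multiplier * (n_chunks - 1)
    uncached = float(n_chunks)
    return cached / uncached


for n in (1, 10, 100):
    print(n, round(cached_input_cost_ratio(n), 3))
```

With a single chunk, caching costs more than not caching. As the number of chunks per document grows, the ratio approaches 0.10, which is the 90% saving on the shared tokens. Output tokens and the uncached chunk text are billed as usual.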
Step 4: Contextual BM25 Hybrid Search
BM25 (a text-based retrieval algorithm) can also benefit from contextualized chunks. Combine it with vector search for a hybrid approach.
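For intuition about what BM25 computes, here is a from-scratch sketch of the scoring function. It uses the common defaults k1=1.5 and b=0.75 with a smoothed IDF. Library implementations such as rank_bm25 differ in details like IDF smoothing, so read this as an illustration of the idea rather than a drop-in replacement.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in corpus against the query with BM25."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (the "+1" variant), which is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches to score as high.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Rare terms get a high IDF and matches in short documents count for more. That is why exact identifiers such as function names or error codes, which embeddings can blur, tend to surface well under BM25.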
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

# Hybrid search: combine BM25 and vector scores
def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    vector_scores = [
        cosine_similarity(query_emb, emb)
        for emb in contextual_embeddings
    ]
    # BM25 search
    bm25_scores = bm25.get_scores(query.split())
    # Normalize by the maximum score, guarding against division by zero
    # when no chunk shares a term with the query
    max_v = max(vector_scores) or 1.0
    max_b = max(bm25_scores) or 1.0
    combined = [
        alpha * (v / max_v) + (1 - alpha) * (b / max_b)
        for v, b in zip(vector_scores, bm25_scores)
    ]
    top_indices = sorted(
        range(len(combined)),
        key=lambda i: combined[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]
Result: Hybrid Contextual Retrieval further improves Pass@10 by 2-3% over Contextual Embeddings alone.
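Dividing by the maximum score is sensitive to how each retriever's scores are distributed. A common alternative, shown here as a sketch and not as the method used above, is reciprocal rank fusion. It ignores raw scores and combines only the rank positions from each retriever. The constant `c=60` is a conventional default, and the function name and signature are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[Hashable]],
                           k: int = 10, c: int = 60) -> List[Hashable]:
    """Fuse several ranked lists of item IDs; each item earns 1 / (c + rank) per list."""
    fused: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            fused[item] += 1.0 / (c + rank)
    # Sort by fused score, highest first; ties keep first-seen order (sort is stable).
    return sorted(fused, key=fused.get, reverse=True)[:k]


vector_ranking = ["a", "b", "c"]  # hypothetical chunk IDs, best first
bm25_ranking = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))  # prints ['b', 'a', 'd', 'c']
```

Items that rank well in both lists rise to the top, while an item found by only one retriever still makes the fused list.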
Step 5: Adding a Reranking Layer
For maximum accuracy, add a Cohere reranker to reorder the top-20 results:
def rerank(query, candidates, top_k=10):
    # The reranker scores plain text, so pass each candidate's "text" field
    results = co.rerank(
        model="rerank-english-v2.0",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_k
    )
    # Map the reranked positions back to the original chunk dicts
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How does the authentication module handle token refresh?"
top_20 = hybrid_search(query, k=20)
final_results = rerank(query, top_20, top_k=10)
Final Result: Pass@10 reaches ~97%—near-perfect retrieval.
Deploying to AWS Bedrock
Anthropic provides a ready-to-deploy Lambda function for AWS Bedrock Knowledge Bases. The code is in contextual-rag-lambda-function/lambda_function.py. Deploy it as a custom chunking option:
- Create a Lambda function with the provided code
- Configure your Bedrock Knowledge Base to use it
- Select "Custom chunking" and point to your Lambda ARN
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by adding chunk-specific context before embedding, solving the "orphaned chunk" problem in traditional RAG.
- Prompt caching makes it cost-effective—Claude's ephemeral caching reduces API costs by up to 90% for context generation, making this practical for production.
- Hybrid search (Contextual Embeddings + Contextual BM25) outperforms either alone—combining semantic and keyword retrieval yields 2-3% additional improvement.
- Reranking adds the final polish—a Cohere reranker on top-20 results pushes Pass@10 to ~97%.
- AWS Bedrock deployment is straightforward—use the provided Lambda function for custom chunking in Bedrock Knowledge Bases.
Next Steps
- Read the full Anthropic blog post on Contextual Retrieval for more performance evaluations
- Experiment with different chunk sizes and overlap ratios
- Try Claude 3.5 Sonnet for context generation (higher quality, at a substantially higher per-token cost than Claude 3 Haiku)
- Explore the complete notebook at Anthropic's Cookbook