Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude
Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Includes code examples, cost optimization with prompt caching, and production deployment tips for Claude AI.
This guide shows you how to improve RAG retrieval accuracy by adding context to document chunks before embedding. Using Claude with Contextual Embeddings reduces top-20 retrieval failure by 35%, and combining with Contextual BM25 and reranking pushes Pass@10 from 87% to 95%.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering customer support bots, internal knowledge base Q&A, financial analysis tools, and code generation workflows. But there's a persistent problem: chunks lack context.
When you split a document into chunks for vector search, each chunk becomes an island. A chunk containing "the function returns True" loses the crucial context of which function and why. This leads to missed retrievals, hallucinated answers, and frustrated users.
In this guide, you'll learn how to implement Contextual Retrieval with Claude, optimize costs using prompt caching, and combine it with BM25 search and reranking for production-grade performance.
What You'll Build
By the end of this guide, you'll have a complete Contextual Retrieval pipeline that:
- Adds context to each chunk using Claude
- Embeds contextualized chunks for vector search
- Combines with Contextual BM25 for hybrid retrieval
- Reranks results for maximum accuracy
Prerequisites
Skills:
- Intermediate Python
- Basic RAG understanding
- Familiarity with embeddings and vector databases
Accounts:
- Anthropic API key (free tier works)
- Voyage AI API key (for embeddings)
- Cohere API key (for reranking)
Time and cost:
- 30-45 minutes to complete
- ~$5-10 in API costs for the full dataset
Step 1: Setting Up the Baseline RAG
First, let's establish a performance baseline with standard RAG.
import json
import voyageai
from anthropic import Anthropic

# Load your data
with open('data/codebase_chunks.json') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl') as f:
    # Skip blank lines so a trailing newline does not break json.loads
    eval_queries = [json.loads(line) for line in f if line.strip()]

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Embed all chunks (baseline - no context)
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Simple retrieval function
def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    # Score every chunk against the query (brute force, fine for small corpora)
    scores = [cosine_similarity(query_emb, emb) for emb in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = sum(ai * ai for ai in a) ** 0.5
    norm_b = sum(bi * bi for bi in b) ** 0.5
    return dot / (norm_a * norm_b)

# Evaluate Pass@10
correct = 0
for query in eval_queries:
    results = retrieve(query['query'], k=10)
    if query['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1
print(f"Baseline Pass@10: {correct/len(eval_queries)*100:.1f}%")
# Expected: ~87%
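The evaluation loop above is repeated later in this guide for each retrieval variant, so it can help to factor it into a helper. The following is a minimal sketch, not part of the original pipeline: `pass_at_k`, `fake_retrieve`, and the toy queries are hypothetical names and data introduced only for illustration, and the helper assumes each result is a dict with an `id` key, as in the code above.

```python
def pass_at_k(eval_queries, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    if not eval_queries:
        return 0.0
    hits = 0
    for item in eval_queries:
        results = retrieve_fn(item["query"], k)
        if item["golden_chunk_id"] in {r["id"] for r in results}:
            hits += 1
    return hits / len(eval_queries)


# Toy check with a fake retriever (hypothetical data, no API calls)
def fake_retrieve(query, k):
    ranked = {"q1": ["a", "b", "c"], "q2": ["c", "b", "a"]}[query]
    return [{"id": cid} for cid in ranked[:k]]


toy_queries = [
    {"query": "q1", "golden_chunk_id": "a"},
    {"query": "q2", "golden_chunk_id": "a"},
]
print(pass_at_k(toy_queries, fake_retrieve, k=1))  # 0.5
print(pass_at_k(toy_queries, fake_retrieve, k=3))  # 1.0
```

With the real pipeline, passing `retrieve` as `retrieve_fn` should reproduce the baseline number, since `retrieve(query, k)` has the same call shape.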
Step 2: Implementing Contextual Embeddings
The core idea is simple: for each chunk, ask Claude to generate a brief context that explains what this chunk is about, then prepend that context before embedding.
The Context Generation Prompt
CONTEXT_PROMPT = """You are helping to improve a retrieval system.
Given a document and a chunk from that document, write a brief context
(2-3 sentences) explaining what this chunk is about and how it relates
to the broader document. Focus on:
- What topic or concept this chunk covers
- How it connects to surrounding content
- Any important entities, functions, or terms mentioned
Document: {full_document}
Chunk: {chunk_text}
Context:"""
def generate_context(full_document, chunk_text):
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                full_document=full_document,
                chunk_text=chunk_text
            )
        }]
    )
    return response.content[0].text
Optimizing with Prompt Caching
Generating context chunk by chunk is expensive because every request resends the full document as input. Prompt caching makes this practical: the document is cached on the first request and reused by the requests for that document's remaining chunks.
def generate_context_cached(full_document, chunk_text):
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        system=[{
            "type": "text",
            "text": "You are helping to improve a retrieval system.",
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": [
                {
                    # Identical across all chunks of a document, so it can be cached
                    "type": "text",
                    "text": f"Document:\n{full_document}",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    # Changes on every request, so it stays outside the cached prefix
                    "type": "text",
                    "text": f"Chunk:\n{chunk_text}\n\nContext:"
                }
            ]
        }]
    )
    return response.content[0].text
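With prompt caching, the first request for a document writes the document prefix to the cache, and later requests for the same document read it back at a reduced input-token rate instead of paying full price each time. The chunk-specific tokens and the generated output are still billed normally. The cache is short-lived and prompts below a model-specific minimum length are not cached, so process all chunks of one document back to back. This typically reduces costs by 60-80% for large documents.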
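To see where savings of this size come from, here is a rough arithmetic sketch. It counts cost in units of one base-price input token. The multipliers are assumptions (a cache write billed at 1.25x and a cache read at 0.10x the base input price); check the current pricing page before relying on them. `caching_savings` and the token counts are hypothetical, and output tokens are ignored.

```python
def caching_savings(doc_tokens, chunk_tokens, n_chunks,
                    write_multiplier=1.25, read_multiplier=0.10):
    """Relative input-token cost saved by caching the document prefix.

    Costs are in units of one base-price input token.
    Both multipliers are assumptions, not values taken from this guide.
    """
    # Without caching: every request pays full price for document + chunk
    uncached = n_chunks * (doc_tokens + chunk_tokens)
    # With caching: one cache write, then (n_chunks - 1) cache reads of the document
    cached_doc = (doc_tokens * write_multiplier
                  + (n_chunks - 1) * doc_tokens * read_multiplier)
    # Chunk-specific tokens are never cached and are always billed at full price
    cached = cached_doc + n_chunks * chunk_tokens
    return 1 - cached / uncached


savings = caching_savings(doc_tokens=8000, chunk_tokens=200, n_chunks=20)
print(f"Input-token savings: {savings:.1%}")  # Input-token savings: 82.2%
```

Savings on the whole bill are lower than this input-only figure, because output tokens are billed the same either way. With a single chunk per document the result is negative, because the cache write costs more than a plain request. To confirm caching is active, inspect `response.usage.cache_creation_input_tokens` and `response.usage.cache_read_input_tokens` on the API response.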
Building the Contextual Embedding Pipeline
# Generate contextualized chunks.
# Assumes `documents` is a list of dicts, each holding the full 'text' and its
# 'chunks' (not defined earlier in this guide; adapt to how your data is stored).
contextualized_chunks = []
for doc in documents:
    full_text = doc['text']
    for chunk in doc['chunks']:
        context = generate_context_cached(full_text, chunk['text'])
        contextualized_text = f"{context}\n\n{chunk['text']}"
        contextualized_chunks.append({
            'id': chunk['id'],
            'text': contextualized_text,
            'original_chunk': chunk
        })

# Embed contextualized chunks
contextual_embeddings = vo.embed(
    [c['text'] for c in contextualized_chunks],
    model="voyage-2"
).embeddings

# Evaluate with the same retrieval function. retrieve() reads the module-level
# `embeddings` and `chunks`, so point both at the contextualized data first.
embeddings = contextual_embeddings
chunks = contextualized_chunks
correct = 0
for query in eval_queries:
    results = retrieve(query['query'], k=10)
    if query['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1
print(f"Contextual Embeddings Pass@10: {correct/len(eval_queries)*100:.1f}%")
# Expected: ~93-95%
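The embedding call above sends every chunk in a single request. Embedding APIs generally cap the number of inputs per request, so larger corpora usually need batching. The sketch below is one way to do that. `embed_in_batches` is a hypothetical helper, and the default batch size of 128 is an assumption rather than a documented limit, so check your provider's current limit.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Call embed_fn on slices of at most batch_size texts and concatenate the results.

    embed_fn takes a list of strings and returns one vector per string.
    batch_size=128 is an assumed default; set it to your provider's limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors


# Usage with the Voyage client from this guide (not executed here):
# contextual_embeddings = embed_in_batches(
#     [c['text'] for c in contextualized_chunks],
#     lambda batch: vo.embed(batch, model="voyage-2").embeddings,
# )
```

Because the helper preserves input order, the resulting list stays index-aligned with `contextualized_chunks`, which the retrieval functions rely on.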
Step 3: Adding Contextual BM25
Contextual BM25 applies the same idea to keyword search: build the BM25 index over the contextualized text (generated context plus original chunk), so that exact terms appearing only in the context also become searchable.
from rank_bm25 import BM25Okapi

# Build BM25 index with contextualized text
bm25 = BM25Okapi([c['text'].split() for c in contextualized_chunks])

def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    vector_scores = [cosine_similarity(query_emb, emb) for emb in contextual_embeddings]
    # BM25 search
    bm25_scores = bm25.get_scores(query.split())
    # Normalize and combine (alpha weights the vector score, 1 - alpha the BM25 score)
    vector_scores = normalize(vector_scores)
    bm25_scores = normalize(bm25_scores)
    combined = [alpha * v + (1 - alpha) * b for v, b in zip(vector_scores, bm25_scores)]
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    # Index into contextualized_chunks, which is aligned with both score lists
    return [contextualized_chunks[i] for i in top_indices]

def normalize(scores):
    # Min-max scale to [0, 1]; return zeros when all scores are equal
    # to avoid dividing by zero
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        return [0.0 for _ in scores]
    return [(s - min_s) / (max_s - min_s) for s in scores]
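The BM25 code above tokenizes with `str.split()`, which is case-sensitive and leaves punctuation attached to words, so a query for `true` would not match `True.` in a chunk. A slightly more forgiving tokenizer is sketched below. `tokenize` is a hypothetical helper, and the ASCII-only pattern is an assumption that suits code and English text but not other scripts.

```python
import re

# Runs of letters, digits, and underscores; keeps identifiers like parse_config whole
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit, or underscore."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


print(tokenize("The function parse_config() returns True."))
# ['the', 'function', 'parse_config', 'returns', 'true']
```

If you adopt it, use the same function on both sides: `tokenize(c['text'])` when building the index and `tokenize(query)` when calling `bm25.get_scores`.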
Hybrid search typically adds another 2-3 percentage points of Pass@10 over vector search alone.
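The weighted sum used in this step depends on score normalization and on tuning `alpha`. A common alternative, which this guide does not use, is Reciprocal Rank Fusion (RRF): it combines the two result lists by rank position only. The sketch below is a plain-Python illustration. `reciprocal_rank_fusion` and the example ids are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of ids; each list contributes 1 / (k + rank) per id."""
    fused = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by id for determinism
    ordered = sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item_id for item_id, _ in ordered[:top_n]]


vector_ranking = ["c1", "c2", "c3"]   # ids ordered by embedding similarity
bm25_ranking = ["c3", "c1", "c4"]     # ids ordered by BM25 score
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking], top_n=3))
# ['c1', 'c3', 'c2']
```

Ids that rank well in both lists rise to the top, and no score normalization is needed. Whether this beats the weighted sum on your data is something to measure with the same Pass@10 evaluation.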
Step 4: Reranking for Maximum Accuracy
For the final polish, add a reranking step using Cohere's reranker:
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def retrieve_with_rerank(query, k=10):
    # Get initial candidates (e.g., top 50)
    candidates = hybrid_search(query, k=50)
    # Rerank
    reranked = co.rerank(
        query=query,
        documents=[c['text'] for c in candidates],
        model="rerank-english-v2.0",
        top_n=k
    )
    # r.index refers to a position in the documents list passed above
    return [candidates[r.index] for r in reranked.results]
Reranking typically adds another 1-2 percentage points, pushing Pass@10 to ~95%.
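Pass@10 gains and "failure rate reduction" describe the same results from two directions, and they are easy to confuse. The failure rate is one minus the pass rate, and the reduction is measured relative to the baseline failure rate. A small worked example using the Pass@10 figures quoted in this guide (`relative_failure_reduction` is a hypothetical helper):

```python
def relative_failure_reduction(pass_before, pass_after):
    """Relative drop in failure rate (1 - pass rate) between two systems."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    if fail_before <= 0:
        raise ValueError("baseline has no failures to reduce")
    return (fail_before - fail_after) / fail_before


# Pass@10 of 87% -> 95% means the failure rate falls from 13% to 5%
print(f"{relative_failure_reduction(0.87, 0.95):.1%}")  # 61.5%
```

This figure is not directly comparable to the 35% quoted earlier, which refers to top-20 retrieval with Contextual Embeddings alone rather than Pass@10 for the full pipeline.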
Production Considerations
AWS Bedrock Integration
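If you're using AWS Bedrock Knowledge Bases, you can run the contextualization step in a Lambda function used for custom chunking. The handler below is a simplified sketch: it assumes the event carries the document text and its chunks directly, so adapt the input and output handling to the event format your Knowledge Base actually sends.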
# contextual-rag-lambda-function/lambda_function.py
def lambda_handler(event, context):
    """
    Custom chunking Lambda for Bedrock Knowledge Base.
    Expects event with 'document' and 'chunks' fields.
    Returns chunks with context prepended.
    """
    document = event['document']
    chunks = event['chunks']
    contextualized = []
    for chunk in chunks:
        # Named chunk_context so it does not shadow the Lambda `context` argument
        chunk_context = generate_context_cached(document, chunk['text'])
        contextualized.append({
            **chunk,
            'text': f"{chunk_context}\n\n{chunk['text']}"
        })
    return {'chunks': contextualized}
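Before deploying, it can be useful to exercise the handler logic locally without calling Claude. One way is to inject the context generator so a stub can stand in for `generate_context_cached`. The sketch below assumes the same simplified event shape as the handler above. `make_handler`, `stub_context`, and the sample event are hypothetical names and data, not part of any AWS or Anthropic API.

```python
def make_handler(context_fn):
    """Build a handler that uses context_fn(document, chunk_text) to contextualize chunks."""
    def handler(event, context=None):
        document = event["document"]
        contextualized = []
        for chunk in event["chunks"]:
            generated = context_fn(document, chunk["text"])
            # Copy the chunk and replace its text; the input event is left unmodified
            contextualized.append({**chunk, "text": f"{generated}\n\n{chunk['text']}"})
        return {"chunks": contextualized}
    return handler


# Stub in place of generate_context_cached so the check runs offline
def stub_context(document, chunk_text):
    return f"From a document of {len(document)} characters."


handler = make_handler(stub_context)
event = {
    "document": "def add(a, b):\n    return a + b\n",
    "chunks": [{"id": "c0", "text": "return a + b"}],
}
result = handler(event)
print(result["chunks"][0]["text"])
```

In the deployed function, `make_handler(generate_context_cached)` would yield a handler with the same behavior as the version above.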
Cost Management
| Technique | Cost Impact | Performance Gain |
|---|---|---|
| Contextual Embeddings | +$0.01-0.05/chunk | +6-8% Pass@10 |
| Prompt Caching | -60-80% context cost | Same performance |
| Contextual BM25 | Free (compute only) | +2-3% Pass@10 |
| Reranking | +$0.001/query | +1-2% Pass@10 |
Key Takeaways
- Contextual Embeddings reduce top-20 retrieval failure by 35% by adding document-level context to each chunk before embedding, solving the "chunk isolation" problem that plagues standard RAG.
- Prompt caching makes Contextual Retrieval production-ready by caching the full document across chunk requests, reducing costs by 60-80% compared to a naive implementation.
- Hybrid search with Contextual BM25 adds 2-3% improvement by combining semantic and keyword-based retrieval on the same contextualized chunks.
- Reranking provides the final polish (1-2% improvement) and is worth implementing for production systems where every retrieval matters.
- The technique works across platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex AI—making it accessible regardless of your cloud provider.