Contextual Retrieval: How to Reduce RAG Failure Rates by 35% with Claude
Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Includes code examples, cost optimization with prompt caching, and deployment tips for AWS Bedrock.
This guide shows you how to add relevant context to each document chunk before embedding, reducing top-20 retrieval failure rates by 35%. You'll implement Contextual Embeddings, Contextual BM25, and reranking using Claude, Voyage AI, and Cohere.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful.
Imagine a code chunk that says def calculate_total(): return subtotal + tax. Without knowing it's part of an e-commerce checkout module, that chunk is nearly useless for retrieval. Contextual Retrieval solves this by prepending a short, chunk-specific context before embedding.
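For example, the stored text for that chunk after contextualization might look like this (illustrative only, not actual model output):

This chunk is from the checkout module of an e-commerce codebase; it computes an order's final total from the cart subtotal and sales tax.

def calculate_total(): return subtotal + tax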
In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, Voyage AI, and Cohere. We'll walk through a complete pipeline, evaluate performance, and show how prompt caching makes this practical for production.
What You'll Need
Skills: Intermediate Python, basic RAG knowledge, familiarity with vector databases.
System: Python 3.8+, Docker (optional for BM25), 4GB+ RAM, ~5–10 GB disk space.
API Keys:
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key
1. Setup and Basic RAG Baseline
First, install the required libraries:
pip install anthropic voyageai cohere numpy pandas rank-bm25
Load the dataset (pre-chunked codebases from 9 repositories) and evaluation queries:
import json

with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]
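For reference, the rest of the code assumes records shaped roughly like this (field names inferred from how they are used below; golden_chunk_index is a hypothetical name for however your evaluation set marks the correct answer):

chunks[0]     # {'document': '<full source file>', 'content': '<one chunk of that file>', ...}
eval_data[0]  # {'query': '<natural-language question>', 'golden_chunk_index': 0, ...}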
We'll use Pass@k as our metric—it checks whether the correct "golden chunk" appears in the top-k retrieved results. Our baseline uses Voyage AI embeddings and cosine similarity search:
import numpy as np
import voyageai

vo = voyageai.Client(api_key='YOUR_VOYAGE_API_KEY')

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model='voyage-2').embeddings

# For each query, find the top-10 chunks
for query in eval_data:
    q_emb = vo.embed([query['query']], model='voyage-2').embeddings[0]
    scores = [cosine_similarity(q_emb, e) for e in embeddings]
    top_indices = np.argsort(scores)[-10:][::-1]
    # Check whether the golden chunk appears in the top 10
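To turn that loop into a number, here is a minimal Pass@k sketch. It assumes each evaluation record carries a golden_chunk_index field pointing at the correct chunk (a hypothetical name; adapt it to your dataset's actual schema):

def pass_at_k(eval_data, k=10):
    hits = 0
    for query in eval_data:
        q_emb = vo.embed([query['query']], model='voyage-2').embeddings[0]
        scores = [cosine_similarity(q_emb, e) for e in embeddings]
        top_k = np.argsort(scores)[-k:][::-1]
        # 'golden_chunk_index' is a hypothetical field name
        if query['golden_chunk_index'] in top_k:
            hits += 1
    return hits / len(eval_data)

print(f"Pass@10: {pass_at_k(eval_data):.1%}")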
Baseline Pass@10: ~87%. Not bad, but we can do better.
2. Contextual Embeddings: The Core Technique
The idea is simple: before embedding each chunk, ask Claude to generate a short context that explains what the chunk is about and where it fits in the larger document.
The Prompt
DOCUMENT_PROMPT = """<document>
{doc_content}
</document>"""

CHUNK_PROMPT = """Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context."""

The prompt is split into two constants so that the document portion can be cached and reused across every chunk of the same document, as shown next.
Implementation with Prompt Caching
Prompt caching dramatically reduces costs when generating context for thousands of chunks: the document goes into its own content block marked with cache_control, so it is written to the cache once and then read back cheaply for every chunk in that document.
import anthropic

client = anthropic.Anthropic(api_key='YOUR_ANTHROPIC_API_KEY')

def generate_context(document_text, chunk_text):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                # The document block is cached and reused across chunks;
                # only the chunk-specific block changes per call.
                {"type": "text",
                 "text": DOCUMENT_PROMPT.format(doc_content=document_text),
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text",
                 "text": CHUNK_PROMPT.format(chunk_content=chunk_text)},
            ],
        }],
    )
    return response.content[0].text
Cost Tip: With prompt caching, generating context for 1,000 chunks from the same document costs roughly $0.10 instead of $3.00+. Note that prompts below the minimum cacheable length (2,048 tokens for Claude 3 Haiku) are processed without caching, so very short documents won't see the savings.
Embed the Contextualized Chunks
contextualized_chunks = []
for chunk in chunks:
    context = generate_context(chunk['document'], chunk['content'])
    contextualized_chunks.append(f"{context}\n\n{chunk['content']}")

# Embed as before
ctx_embeddings = vo.embed(contextualized_chunks, model='voyage-2').embeddings
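One practical note: embedding APIs cap the number of inputs per request (Voyage's documented limit has been 128 texts per call; check the current docs). If your corpus is larger than that, a simple batching wrapper keeps the call above working:

def embed_in_batches(texts, batch_size=128):
    # Embed in API-sized batches and concatenate the results
    embs = []
    for i in range(0, len(texts), batch_size):
        embs.extend(vo.embed(texts[i:i + batch_size], model='voyage-2').embeddings)
    return embs

ctx_embeddings = embed_in_batches(contextualized_chunks)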
Result: Pass@10 jumps from ~87% to ~95%, cutting top-10 retrieval failures by more than half on this dataset. (The 35% figure in the title refers to top-20 failure rates on Anthropic's broader benchmark.)
3. Contextual BM25: Hybrid Search for Even Better Results
BM25 is a classic keyword-based retrieval method. By applying the same contextual prefix to BM25, we get Contextual BM25, which combines the best of semantic and keyword search.
Implementation
from rank_bm25 import BM25Okapi

# Tokenize the contextualized chunks for BM25
tokenized_corpus = [chunk.split() for chunk in contextualized_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def normalize(scores):
    # Min-max normalization so both score types share a 0-1 scale
    scores = np.array(scores)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

# Hybrid search: combine BM25 and embedding scores
def hybrid_search(query, alpha=0.5, k=10):
    # Embedding score
    q_emb = vo.embed([query], model='voyage-2').embeddings[0]
    emb_scores = [cosine_similarity(q_emb, e) for e in ctx_embeddings]
    # BM25 score
    bm25_scores = bm25.get_scores(query.split())
    # Normalize and combine, then return the top-k indices
    combined = alpha * normalize(emb_scores) + (1 - alpha) * normalize(bm25_scores)
    return np.argsort(combined)[-k:][::-1]
Contextual BM25 typically adds another 2–3% improvement over Contextual Embeddings alone.
4. Reranking for Maximum Precision
Finally, use Cohere's reranker to reorder the top-20 results from hybrid search:
import cohere

co = cohere.Client('YOUR_COHERE_API_KEY')

def rerank(query, candidates):
    results = co.rerank(
        model='rerank-english-v2.0',
        query=query,
        documents=candidates,
        top_n=10
    )
    # Each result carries the index of the candidate it refers to
    return [r.index for r in results.results]
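Putting the pieces together, a minimal end-to-end retrieval function might look like this. It reuses the helpers defined above, pulling 20 hybrid candidates and reranking them down to 10 as described:

def retrieve(query):
    # Hybrid search for the top-20 candidate indices
    top20 = hybrid_search(query, k=20)
    candidates = [contextualized_chunks[i] for i in top20]
    # Rerank the candidates, then map positions back to corpus indices
    reranked = rerank(query, candidates)
    return [top20[i] for i in reranked]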
Reranking pushes Pass@10 close to 98%.
5. Production Deployment on AWS Bedrock
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The code is available in the contextual-rag-lambda-function folder of the cookbook repository.
Key steps:
- Create a Lambda function using lambda_function.py
- Set it as a custom chunking strategy in your Bedrock Knowledge Base
- The function calls Claude (via Bedrock) to generate context for each chunk; a simplified handler sketch follows below
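As a rough sketch of that Lambda, here is a simplified handler. The Bedrock invocation uses the standard bedrock-runtime invoke_model call, but the event parsing is deliberately schematic: Bedrock's custom transformation contract actually passes S3 pointers to content batches, so treat the input and output handling below as placeholders and consult lambda_function.py in the cookbook for the real contract.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def situate_chunk(document_text, chunk_text):
    # Same contextualization prompt as before, sent through Bedrock
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [{
            "role": "user",
            "content": (
                f"<document>\n{document_text}\n</document>\n"
                "Here is the chunk we want to situate within the whole document:\n"
                f"<chunk>\n{chunk_text}\n</chunk>\n"
                "Please give a short succinct context to situate this chunk within "
                "the overall document for the purposes of improving search retrieval "
                "of the chunk. Answer only with the succinct context."
            ),
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

def lambda_handler(event, context):
    # Placeholder I/O: the real event carries S3 locations of content
    # batches per Bedrock's custom transformation contract.
    document = event["document"]           # hypothetical field
    contextualized = []
    for chunk in event["chunks"]:          # hypothetical field
        ctx = situate_chunk(document, chunk)
        contextualized.append(f"{ctx}\n\n{chunk}")
    return {"contextualized_chunks": contextualized}  # hypothetical shape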
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by prepending chunk-specific context before embedding. This is a simple, high-impact improvement for any RAG system.
- Prompt caching makes this cost-effective. Caching the whole document across chunk generations reduces API costs by 10–30x.
- Hybrid search with Contextual BM25 adds another 2–3% improvement. Combining semantic and keyword retrieval captures more relevant chunks.
- Reranking pushes accuracy to ~98% Pass@10. Use a dedicated reranker (like Cohere) as a final precision layer.
- Deployable on AWS Bedrock. The included Lambda function lets you use Contextual Retrieval as a custom chunking strategy in Bedrock Knowledge Bases.