Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI
This guide shows you how to implement Contextual Retrieval — adding full-document context to each chunk before embedding — to reduce retrieval failure rates by 35% in your RAG applications with Claude AI.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications — from customer support bots to internal knowledge base assistants. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful.
Enter Contextual Retrieval, a technique pioneered by Anthropic that adds relevant context to each chunk before embedding. The results speak for themselves: a 35% reduction in top-20 retrieval failure rates across diverse datasets.
In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, Voyage AI, and Cohere — and see measurable improvements in your RAG pipeline.
What You'll Need
Prerequisites
- Intermediate Python skills
- Basic understanding of RAG and vector databases
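- Docker installed (optional; only needed if you run BM25 on a separate search server rather than the in-memory rank_bm25 index used in this guide)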
API Keys
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key
Time & Cost
- Setup time: 30–45 minutes
- API costs: ~$5–10 for the full dataset
The Problem: Lost Context in Chunked Retrieval
Traditional RAG splits documents into fixed-size chunks. This works well for many cases, but consider this scenario:
A chunk contains the code snippet def calculate_interest(): — but without the surrounding context, the retriever can't tell if this is for a banking app, a savings calculator, or a loan amortization tool.
When chunks lack context, retrieval accuracy suffers. The retriever returns irrelevant results, and Claude's responses become less reliable.
Step 1: Contextual Embeddings
Contextual Embeddings solve this by prepending a concise, chunk-specific context to each chunk before generating the embedding vector. This context is generated by Claude itself, which understands the full document and can summarize what the chunk is about.
How It Works
- Split your documents into chunks (e.g., 512 characters)
- Generate context for each chunk using Claude (prompt: "What is this chunk about, given the full document?")
- Prepend context to the chunk text
- Embed the enriched chunk
- Retrieve using standard vector search
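The first step above can be sketched in a few lines. This is a minimal illustration, assuming fixed-size character windows; the function name `split_into_chunks` and the `overlap` parameter are hypothetical and not part of the original pipeline, and a production splitter would usually respect sentence or code-block boundaries.

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# A 1,100-character document yields two full chunks and one partial chunk.
chunks = split_into_chunks("x" * 1100)
print([len(c) for c in chunks])  # [512, 512, 76]
```

A small overlap can help when a sentence straddles a chunk boundary, at the cost of a few extra chunks to embed.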
Implementation
Here's how to generate context for each chunk using Claude:
import anthropic

# Reads the ANTHROPIC_API_KEY environment variable by default.
client = anthropic.Anthropic()


def generate_chunk_context(chunk_text: str, full_document: str) -> str:
    """Generate context for a single chunk using Claude."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system="You are a document analysis assistant. Your task is to generate a brief, specific context for a chunk of text based on the full document it comes from.",
        messages=[
            {
                "role": "user",
                "content": f"Here is the full document:\n\n<document>{full_document}</document>\n\nHere is the chunk we need context for:\n\n<chunk>{chunk_text}</chunk>\n\nGenerate a concise context (1-2 sentences) that explains what this chunk is about in the context of the full document. Focus on the subject matter, not the structure."
            }
        ]
    )
    # The first content block holds the generated context text.
    return response.content[0].text
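The later steps expect a list of records with `text` and `context` keys. One way to build it is sketched below, with the context generator passed in as a parameter so the code can run without an API key. The names `add_context_to_chunks` and `fake_context` are hypothetical; in a real run you would pass `generate_chunk_context` instead of the stand-in.

```python
from typing import Callable


def add_context_to_chunks(
    chunks: list[str],
    full_document: str,
    context_fn: Callable[[str, str], str],
) -> list[dict]:
    """Pair every chunk with its generated context.

    context_fn takes (chunk_text, full_document) and returns a context string.
    """
    return [
        {"text": chunk, "context": context_fn(chunk, full_document)}
        for chunk in chunks
    ]


# Offline stand-in for the Claude call (hypothetical helper for illustration).
def fake_context(chunk_text: str, full_document: str) -> str:
    return f"Excerpt from a {len(full_document)}-character document."


records = add_context_to_chunks(["alpha", "beta"], "alpha beta", fake_context)
print(records[0]["context"])  # Excerpt from a 10-character document.
```

Keeping the generator injectable also makes it easy to unit-test the pipeline or swap in a cheaper model later.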
Making It Production-Ready with Prompt Caching
Generating context for thousands of chunks can be expensive. Prompt caching reduces costs by caching the full document across multiple chunk requests.
# Enable prompt caching by marking the document block with a "cache_control" breakpoint.
# full_document and chunk_text are the same arguments used in generate_chunk_context.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": "You are a document analysis assistant.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # Identical across all chunks of one document, so it can be cached.
                    "type": "text",
                    "text": f"Full document:\n\n{full_document}",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    # Changes on every request, so it comes after the cache breakpoint.
                    "type": "text",
                    "text": f"Chunk:\n\n{chunk_text}\n\nGenerate context."
                }
            ]
        }
    ]
)
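With caching, the full document is written to the cache on the first request and read back on later ones. Cached input tokens are billed at a fraction of the normal input rate, so the cost of the document portion of each request can drop by up to 90%, provided the follow-up requests arrive before the cache entry expires. The chunk text and the generated output are still billed normally.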
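A back-of-the-envelope estimate shows where a figure like 90% comes from and why real savings land somewhat below it. The sketch assumes cache writes cost 1.25x and cache reads 0.10x the base input price; those multipliers, the function name `estimate_input_cost`, and the token counts are illustrative assumptions, so check current pricing before relying on them.

```python
def estimate_input_cost(
    doc_tokens: int,
    num_chunks: int,
    per_chunk_tokens: int,
    cache_write_multiplier: float = 1.25,  # assumed price of writing to the cache
    cache_read_multiplier: float = 0.10,   # assumed price of reading from the cache
) -> dict:
    """Compare input cost with and without caching.

    Costs are expressed in units of one base-priced input token.
    Assumes every request for a document arrives while the cache entry is alive.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Without caching, the full document is re-sent at full price for every chunk.
    uncached = num_chunks * (doc_tokens + per_chunk_tokens)
    cached = (
        doc_tokens * cache_write_multiplier                      # first request writes the cache
        + (num_chunks - 1) * doc_tokens * cache_read_multiplier  # later requests read it
        + num_chunks * per_chunk_tokens                          # chunk text is never cached
    )
    return {"uncached": uncached, "cached": cached, "savings": 1 - cached / uncached}


estimate = estimate_input_cost(doc_tokens=8000, num_chunks=40, per_chunk_tokens=200)
print(f"{estimate['savings']:.0%}")  # 85%
```

The savings approach 90% only when the document is large relative to each chunk and many chunks share one cache entry.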
Step 2: Embedding with Context
Once you have context for each chunk, prepend it to the chunk text before embedding:
import voyageai

vo = voyageai.Client(api_key="your-voyage-api-key")


def embed_with_context(chunks_with_context: list[dict]) -> list[list[float]]:
    """Embed chunks with their context prepended."""
    enriched_texts = [
        f"{chunk['context']}\n\n{chunk['text']}"
        for chunk in chunks_with_context
    ]
    response = vo.embed(
        texts=enriched_texts,
        model="voyage-3-large",
        input_type="document"
    )
    return response.embeddings
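Retrieval over these embeddings is ordinary vector search. As a minimal sketch, assuming the vectors fit in memory and no vector database is involved, the snippet below ranks chunks by cosine similarity to a query vector. The function names are hypothetical, and the query vector is assumed to come from the same embedding model (with Voyage, typically requested with `input_type="query"`).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_embedding: list[float],
    chunk_embeddings: list[list[float]],
    k: int = 20,
) -> list[tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the k most similar chunks."""
    scored = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(chunk_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors standing in for real embeddings.
hits = top_k_by_cosine([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
print([index for index, _ in hits])  # [1, 2]
```

A vector database performs the same ranking with an approximate index, which matters once the corpus grows beyond what a linear scan can handle.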
Step 3: Contextual BM25 for Hybrid Search
The same chunk-specific context can also improve BM25 (keyword-based) search. This creates a powerful hybrid search system:
- Vector search captures semantic meaning
- Contextual BM25 captures keyword relevance with enriched context
Implementation
from rank_bm25 import BM25Okapi


def build_contextual_bm25(chunks_with_context: list[dict]) -> BM25Okapi:
    """Build a BM25 index using context-enriched chunks."""
    enriched_texts = [
        f"{chunk['context']} {chunk['text']}"
        for chunk in chunks_with_context
    ]
    # Naive whitespace tokenization; queries must be tokenized the same way.
    tokenized_corpus = [text.split() for text in enriched_texts]
    return BM25Okapi(tokenized_corpus)
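The guide does not say how the vector and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used for the reported results. The function name, the constant `k=60`, and the toy chunk IDs are illustrative.

```python
def reciprocal_rank_fusion(
    rankings: list[list[int]],
    k: int = 60,
    top_n: int = 10,
) -> list[int]:
    """Merge several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked highly by several retrievers rise to the top.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


vector_ranking = [3, 1, 2]  # chunk IDs ordered by embedding similarity
bm25_ranking = [1, 4, 3]    # chunk IDs ordered by BM25 score
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))  # [1, 3, 4, 2]
```

RRF uses only ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.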
Step 4: Reranking for Final Precision
Even with Contextual Embeddings, the top-10 results may contain irrelevant chunks. Adding a reranker (like Cohere's) improves final accuracy:
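import cohere

co = cohere.Client("your-cohere-api-key")


def rerank_results(query: str, chunks: list[str], top_k: int = 5) -> list[dict]:
    """Rerank retrieved chunks using Cohere's reranker."""
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=chunks,
        top_n=top_k
    )
    # Each result carries the chunk's position in the input list and a relevance
    # score; convert to plain dicts so the return value matches the type hint.
    return [
        {"index": r.index, "text": chunks[r.index], "relevance_score": r.relevance_score}
        for r in response.results
    ]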
Performance Results
On a dataset of 9 codebases with 248 queries, Contextual Retrieval delivered:
| Method | Pass@10 | Gain over Basic RAG |
|---|---|---|
| Basic RAG | 87% | baseline |
| Contextual Embeddings | 95% | +8 points |
| Contextual Embeddings + BM25 | 97% | +10 points |
| Full Pipeline (with reranking) | 99% | +12 points |
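The exact definition of Pass@10 is not spelled out here. The sketch below assumes a query passes when at least one of its known-relevant ("golden") chunk IDs appears in the top k results; the function names and toy data are hypothetical. The second helper converts a change in pass rate into a relative reduction in failures, the form in which the 35% figure is quoted.

```python
def pass_at_k(retrieved: list[list[str]], golden: list[set[str]], k: int = 10) -> float:
    """Fraction of queries with at least one golden chunk in the top k results."""
    if len(retrieved) != len(golden):
        raise ValueError("retrieved and golden must have one entry per query")
    if not retrieved:
        raise ValueError("at least one query is required")
    passed = sum(
        1 for results, gold in zip(retrieved, golden)
        if gold.intersection(results[:k])
    )
    return passed / len(retrieved)


def relative_failure_reduction(baseline_pass: float, new_pass: float) -> float:
    """Share of baseline failures eliminated, given pass rates in [0, 1]."""
    baseline_fail = 1 - baseline_pass
    if baseline_fail <= 0:
        raise ValueError("baseline has no failures to reduce")
    return (baseline_fail - (1 - new_pass)) / baseline_fail


# Three toy queries: only the third has its golden chunk in the top 2.
retrieved = [["a", "b", "c"], ["d", "e"], ["f"]]
golden = [{"c"}, {"x"}, {"f"}]
print(pass_at_k(retrieved, golden, k=2))

# Moving from 87% to 95% is 8 points absolute, but a much larger relative cut in failures.
print(f"{relative_failure_reduction(0.87, 0.95):.0%}")  # 62%
```

Reporting both views avoids confusion: absolute points describe the pass rate, while relative reduction describes how many failures went away.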
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, Anthropic provides a Lambda function that adds context to each document during ingestion. Deploy the function from the contextual-rag-lambda-function directory and select it as a custom chunking option when configuring your knowledge base.
Cost Management
- Prompt caching is essential for large document sets
- Batch processing chunks reduces API calls
- Voyage AI offers competitive embedding pricing
- Cohere reranker is optional but adds ~$0.001 per query
Key Takeaways
- Contextual Embeddings reduce top-20 retrieval failures by 35%: adding full-document context to each chunk before embedding improves accuracy across diverse datasets.
- Prompt caching makes it practical: by caching the full document, you can generate context for thousands of chunks at a fraction of the uncached cost.
- Hybrid search with Contextual BM25 boosts results further: combining vector and keyword search with enriched context raised Pass@10 from 95% to 97% in the benchmark above.
- Reranking adds the final polish: on top of Contextual Embeddings and Contextual BM25, a reranker pushed Pass@10 from 97% to 99%.
- Production-ready on any cloud: the technique works on Anthropic's API, AWS Bedrock, and GCP Vertex AI with minimal customization.