Contextual Retrieval: Boosting RAG Performance with Claude and Contextual Embeddings
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% in RAG systems using Claude AI.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context, leading to poor search results.
Anthropic's Contextual Retrieval technique solves this by adding relevant context to each chunk before embedding. The results are impressive: a 35% reduction in retrieval failure rates across tested datasets. This guide walks you through implementing this technique with Claude AI, complete with code examples and production considerations.
What You'll Learn
- Why standard chunking fails and how Contextual Embeddings fix it
- How to implement Contextual Retrieval with Claude and Voyage AI
- How to use Contextual BM25 for hybrid search improvements
- How prompt caching makes this approach cost-effective at scale
- How to evaluate retrieval performance with Pass@k metrics
Prerequisites
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings
API Access:
- Anthropic API key (free tier sufficient)
- Voyage AI API key
- Cohere API key (for reranking)
Time and Cost:
- Setup and implementation: 30-45 minutes
- API costs: ~$5-10 for the full dataset
The Problem with Traditional Chunking
In standard RAG pipelines, documents are split into smaller chunks for efficient vector search. This works well when chunks are self-contained, but fails when:
- A chunk contains a function name like process_data() without explaining what it does
- A chunk references "the algorithm" without specifying which algorithm
- A chunk contains a code snippet without the function signature or imports
For example, consider this chunk in isolation:

def process():
    return transform(data, config)

Without context, the embedding for this chunk captures nothing about what transform does, what config contains, or what domain this code belongs to. A query like "How do I configure data transformation?" would likely miss this chunk entirely.
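To make the failure mode concrete, here is a minimal sketch of a naive fixed-size chunker. The helper name `split_into_chunks`, the sample document, and the window size are illustrative assumptions; real pipelines often split on tokens or syntax boundaries instead.

```python
def split_into_chunks(text: str, chunk_size: int = 80, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up document: the definitions of `config` and `data` sit far from `process()`.
document = (
    "# Data pipeline configuration\n"
    "config = load_config('pipeline.yaml')\n"
    "data = read_rows(config['source'])\n"
    "\n"
    "def process():\n"
    "    return transform(data, config)\n"
)

chunks = split_into_chunks(document, chunk_size=60)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

The windows are cut purely by position, so the chunk that ends up holding `process()` carries little or nothing about where `config` and `data` come from. That missing information is what the generated context is meant to restore.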
What Are Contextual Embeddings?
Contextual Embeddings solve this by prepending a short, chunk-specific context to each chunk before generating the embedding vector. This context is generated by Claude, which understands the full document and can summarize the chunk's relevance.
The process:
- Split your documents into chunks (as usual)
- For each chunk, ask Claude: "What context does a reader need to understand this chunk?"
- Prepend Claude's context to the chunk
- Embed the context-augmented chunk
- Store in your vector database
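The steps above can be sketched as a single loop. This is a minimal sketch, not a definitive implementation: `build_contextual_records` is a hypothetical helper, and the context generator and embedder are passed in as plain callables so the flow can be run offline with stand-ins before wiring in real API calls.

```python
from typing import Callable


def build_contextual_records(
    document: str,
    chunks: list[str],
    generate_context: Callable[[str, str], str],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Generate context, prepend it, and embed each chunk of one document."""
    records = []
    for index, chunk in enumerate(chunks):
        context = generate_context(chunk, document)  # ask the model for context
        augmented = f"{context}\n\n{chunk}"          # prepend context to the chunk
        records.append({
            "chunk_index": index,
            "text": chunk,
            "context": context,
            "embedding": embed(augmented),           # embed the augmented text
        })
    return records  # ready to be written to a vector database


# Stand-ins (assumptions) so the sketch runs without network access.
def fake_context(chunk: str, document: str) -> str:
    return f"From a document of {len(document)} characters."


def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


records = build_contextual_records("abc def", ["abc", "def"], fake_context, fake_embed)
```

Keeping the original chunk text and the generated context as separate fields makes it possible to reuse the same context later for keyword indexing.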
Implementation: Contextual Embeddings with Claude
Step 1: Generate Context for Each Chunk
Here's how to generate context using Claude's API:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_chunk_context(chunk_text, full_document):
    """Generate context for a single chunk using Claude."""
    response = client.messages.create(
        # The 20241022 date suffix belongs to the Claude 3.5 model family.
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[
            {
                "role": "user",
                "content": f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
            }
        ]
    )
    # The generated context is the text of the first content block.
    return response.content[0].text

Step 2: Create Contextual Embeddings
Once you have the context, prepend it to the chunk before embedding:
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def create_contextual_embedding(chunk_text, context):
    """Create an embedding for a context-augmented chunk."""
    augmented_text = f"{context}\n\n{chunk_text}"
    embedding = vo.embed(
        texts=[augmented_text],
        model="voyage-2"
    )
    return embedding.embeddings[0]

Step 3: Store and Retrieve
Store the contextual embeddings in your vector database (e.g., Pinecone, Weaviate, or Chroma). During retrieval, query as usual—the enriched embeddings will naturally match relevant queries better.
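For local experiments, the store-and-retrieve step can be approximated without a hosted database. The class below is a toy stand-in written for illustration only (the names `InMemoryVectorStore` and `cosine_similarity` are assumptions, not part of any vector database client); it ranks stored embeddings by cosine similarity to the query embedding.

```python
import math


class InMemoryVectorStore:
    """Toy stand-in for a vector database (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], dict]] = []

    def add(self, embedding: list[float], metadata: dict) -> None:
        self._items.append((embedding, metadata))

    def search(self, query_embedding: list[float], k: int = 5) -> list[tuple[float, dict]]:
        """Return the k stored items with the highest cosine similarity."""
        scored = [
            (cosine_similarity(query_embedding, emb), meta)
            for emb, meta in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Tiny made-up vectors; real embeddings have hundreds or thousands of dimensions.
store = InMemoryVectorStore()
store.add([1.0, 0.0], {"text": "chunk about configuration"})
store.add([0.0, 1.0], {"text": "chunk about logging"})
top = store.search([0.9, 0.1], k=1)
```

A linear scan like this is only reasonable for small corpora; a production system would rely on the approximate nearest-neighbor index of whichever vector database is in use.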
Making It Production-Ready with Prompt Caching
Generating context for every chunk can be expensive, because each call sends the full document along with a single chunk. For a codebase with 10,000 chunks, you'd send a full document 10,000 times. Prompt caching makes this practical.
Anthropic's prompt caching allows you to cache the full document and reference it across multiple context generation calls:
# full_document, chunk_1, and chunk_2 are assumed to be defined already.

# First call: cache the document
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": full_document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"<chunk>{chunk_1}</chunk>\n\nProvide context..."
        }
    ]
)

# Subsequent calls: use cached document
# The system block must be identical to the first call for the cache to be hit.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": full_document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"<chunk>{chunk_2}</chunk>\n\nProvide context..."
        }
    ]
)

Because later calls read the document from the cache instead of reprocessing it at full price, this reduces API costs by up to 90% for large document collections.
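To see where a figure like "up to 90%" can come from, here is a rough back-of-the-envelope sketch. The two price multipliers are assumptions based on typical prompt-caching pricing (a cache write costing somewhat more than normal input, a cache read costing a small fraction of it); check current pricing before relying on them. The function name `cached_input_cost_ratio` is hypothetical.

```python
def cached_input_cost_ratio(
    num_calls: int,
    write_multiplier: float = 1.25,  # assumed: cache write costs 25% more than base input
    read_multiplier: float = 0.10,   # assumed: cache read costs 10% of base input
) -> float:
    """Cost of the repeated document tokens with caching, relative to no caching.

    Assumes the first call writes the cache and every later call hits it,
    and ignores the uncached chunk tokens and the output tokens.
    """
    if num_calls < 1:
        raise ValueError("num_calls must be at least 1")
    cached = write_multiplier + (num_calls - 1) * read_multiplier
    uncached = float(num_calls)
    return cached / uncached


for n in (1, 10, 100):
    ratio = cached_input_cost_ratio(n)
    print(f"{n:>4} chunks per document: {1 - ratio:.1%} saved on document tokens")
```

Under these assumptions, a document with a single chunk costs slightly more with caching, and the saving only approaches 90% as the number of chunks per document grows. Cached entries also expire after a short time, so it helps to process all chunks of one document back to back.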
Contextual BM25: Hybrid Search Enhancement
The same chunk-specific context can also improve BM25 (keyword) search. Traditional BM25 struggles with chunks that lack distinctive keywords. By adding context, you introduce relevant terms that improve keyword matching.
from rank_bm25 import BM25Okapi

def create_contextual_bm25_index(chunks_with_context):
    """Create a BM25 index from context-augmented chunks."""
    # Tokenize the context-augmented chunks
    tokenized_chunks = [
        f"{ctx['context']} {ctx['text']}".split()
        for ctx in chunks_with_context
    ]
    return BM25Okapi(tokenized_chunks)

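One caveat with the plain `.split()` used above: whitespace splitting leaves punctuation attached to tokens, so a query term such as `transform` would not match a chunk token like `transform(data,`. A minimal sketch of a slightly more forgiving tokenizer follows; `tokenize` is a hypothetical helper, and whatever tokenizer is chosen should be applied to both the indexed chunks and the queries.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase, then keep runs of letters, digits, and underscores."""
    return re.findall(r"[a-z0-9_]+", text.lower())


line = "return transform(data, config)"
whitespace_tokens = line.split()
regex_tokens = tokenize(line)
print(whitespace_tokens)
print(regex_tokens)
```

Keeping underscores inside tokens preserves identifiers such as `process_data` as single terms, which tends to matter for code-heavy corpora.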
Combine contextual embeddings with contextual BM25 for hybrid search:
def hybrid_search(query, vector_db, bm25_index, alpha=0.5):
    """Combine vector and keyword search results."""
    # Vector search (vector_db.similarity_search is a placeholder interface)
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    vector_results = vector_db.similarity_search(query_embedding, k=20)

    # BM25 search: indices of the 20 highest-scoring chunks
    bm25_scores = bm25_index.get_scores(query.split())
    bm25_results = sorted(
        range(len(bm25_scores)),
        key=lambda i: bm25_scores[i],
        reverse=True
    )[:20]

    # Combine scores (implementation varies by vector DB)
    # alpha is intended to weight vector results against BM25 results here;
    # combined_results must be built in this step before it is returned.
    # ...
    return combined_results

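The combination step is left open above. One possible way to fill it in, shown here purely as an assumption and not as the method behind the reported results, is a weighted sum of min-max normalized scores, with `alpha` weighting the vector side. The helpers `min_max_normalize` and `combine_scores` and the chunk IDs are hypothetical; rank-based schemes such as reciprocal rank fusion are a common alternative.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def combine_scores(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    alpha: float = 0.5,
    k: int = 20,
) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores: alpha * vector + (1 - alpha) * BM25."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(bm25_scores)
    combined = {
        # A chunk missing from one result list contributes 0 for that side.
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in vec.keys() | kw.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up scores keyed by chunk ID.
vector_scores = {"chunk-a": 0.91, "chunk-b": 0.80, "chunk-c": 0.62}
bm25_scores = {"chunk-b": 7.4, "chunk-c": 2.1, "chunk-d": 5.0}
ranked = combine_scores(vector_scores, bm25_scores, alpha=0.5)
```

Normalizing first matters because cosine similarities and BM25 scores live on different scales; without it, the raw BM25 values would dominate the sum regardless of `alpha`.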
Performance Evaluation
Anthropic tested Contextual Retrieval on a dataset of 9 codebases with 248 queries. Results:
| Method | Pass@10 | Improvement over baseline |
|---|---|---|
| Basic RAG | ~87% | Baseline |
| Contextual Embeddings | ~95% | +8 percentage points |
| + Contextual BM25 | ~97% | +10 percentage points |
| + Reranking | ~98% | +11 percentage points |
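To measure the same kind of number on your own corpus, a Pass@k check can be sketched as follows. This sketch assumes Pass@k means the fraction of queries for which at least one known-relevant ("golden") chunk appears in the top k results; the exact scoring behind the figures above may differ. The function name and the sample IDs are hypothetical.

```python
def pass_at_k(
    retrieved: dict[str, list[str]],
    golden: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one golden chunk in the top k results."""
    if not golden:
        raise ValueError("no queries to evaluate")
    hits = 0
    for query_id, relevant_ids in golden.items():
        top_k = retrieved.get(query_id, [])[:k]
        if relevant_ids.intersection(top_k):
            hits += 1
    return hits / len(golden)


# Made-up data: ranked chunk IDs per query, and the chunk(s) that should be found.
retrieved = {
    "q1": ["c7", "c2", "c9"],
    "q2": ["c4", "c1", "c3"],
}
golden = {"q1": {"c2"}, "q2": {"c8"}}
score = pass_at_k(retrieved, golden, k=3)
```

Running the same function over the baseline pipeline and the contextual pipeline, with the same queries and golden chunks, gives a like-for-like comparison.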
Production Considerations
For AWS Bedrock Users
Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option in Bedrock Knowledge Bases. This allows you to use Contextual Retrieval without managing your own infrastructure.
Cost Optimization
- Prompt caching is essential for large document collections
- Batch context generation during off-peak hours
- Consider using smaller Claude models (Claude 3 Haiku) for context generation
- Cache results to avoid regenerating context for unchanged documents
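The last point, avoiding regeneration for unchanged documents, can be sketched with a content hash as the cache key. This is a minimal in-memory sketch under stated assumptions: `ContextCache` is a hypothetical helper, the generator is passed in as a callable, and a real deployment would persist the mapping (for example in a database or on disk).

```python
import hashlib
from typing import Callable


class ContextCache:
    """In-memory cache keyed by a hash of (document, chunk); hypothetical helper."""

    def __init__(self, generate_context: Callable[[str, str], str]) -> None:
        self._generate_context = generate_context
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(chunk_text: str, full_document: str) -> str:
        digest = hashlib.sha256()
        digest.update(full_document.encode("utf-8"))
        digest.update(b"\x00")  # separator between document and chunk
        digest.update(chunk_text.encode("utf-8"))
        return digest.hexdigest()

    def get(self, chunk_text: str, full_document: str) -> str:
        """Return cached context, generating it only on a cache miss."""
        key = self._key(chunk_text, full_document)
        if key not in self._store:
            self._store[key] = self._generate_context(chunk_text, full_document)
        return self._store[key]


calls = []


def fake_generate(chunk_text: str, full_document: str) -> str:
    # Stand-in for the real model call; records each invocation.
    calls.append(chunk_text)
    return f"context for {chunk_text!r}"


cache = ContextCache(fake_generate)
first = cache.get("chunk one", "full document v1")
second = cache.get("chunk one", "full document v1")  # served from the cache
third = cache.get("chunk one", "full document v2")   # document changed, regenerated
```

Hashing the full document along with the chunk means any edit to the document invalidates the context for all of its chunks, which is conservative but keeps stale context out of the index.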
When to Use Contextual Retrieval
Best for:
- Codebases with implicit dependencies
- Legal documents with cross-references
- Technical documentation with domain-specific terminology
- Any corpus where chunks lose meaning in isolation
Less beneficial for:
- Self-contained chunks (e.g., individual FAQ entries)
- Very short documents where the entire document fits in one chunk
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, which addresses the loss of surrounding context that traditional chunking causes
- Prompt caching makes this practical at scale—cache the full document once and reuse it across hundreds of context generation calls, reducing API costs by up to 90%
- Hybrid search with Contextual BM25 further improves results by combining semantic and keyword matching, each enriched with the same contextual information
- Production-ready on major platforms—Anthropic provides Lambda functions for AWS Bedrock, and the technique works on GCP Vertex AI with minimal customization
- Start with Pass@k evaluation—measure your baseline retrieval performance before and after implementing Contextual Retrieval to quantify the improvement in your specific use case