Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI Users
Learn how to improve RAG performance using Contextual Embeddings and BM25 with Claude AI. Step-by-step guide with code examples, evaluation metrics, and cost-saving tips.
This guide teaches you how to implement Contextual Retrieval—adding relevant context to document chunks before embedding—to reduce retrieval failure rates by up to 35% and improve Pass@10 accuracy from ~87% to ~95% in RAG systems using Claude AI.
Retrieval Augmented Generation (RAG) is a powerful pattern that enables Claude to answer questions using your internal knowledge bases, codebases, or any document corpus. However, traditional RAG systems often suffer from a fundamental problem: when documents are split into smaller chunks for efficient retrieval, individual chunks can lose their surrounding context, leading to poor retrieval accuracy.
In this guide, we'll explore Contextual Retrieval—a technique developed by Anthropic that significantly improves RAG performance by adding relevant context to each chunk before embedding. According to Anthropic's internal testing, this method reduces the top-20-chunk retrieval failure rate by an average of 35% across various data sources.
What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG pipeline with Claude
- Implement Contextual Embeddings to improve chunk quality
- Use Contextual BM25 for hybrid search
- Apply reranking to further boost performance
- Leverage prompt caching to manage costs
Prerequisites
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings
- Basic command-line proficiency
Environment:
- Python 3.8+
- Docker installed and running (optional, for BM25 search)
- 4GB+ available RAM
- ~5-10 GB disk space for vector databases
API Keys:
- Anthropic API key (free tier sufficient)
- Voyage AI API key (for embeddings)
- Cohere API key (for reranking)
Time & Cost:
- Expected completion: 30-45 minutes
- API costs: ~$5-10 to run through the full dataset
Step 1: Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere rank_bm25 numpy
Initialize your clients:
import anthropic
import voyageai
import cohere
# Initialize clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client(api_key="YOUR_COHERE_KEY")
Step 2: Understanding the Problem with Basic RAG
In a basic RAG setup, documents are split into chunks using simple character or token splitting. While this works for many applications, it creates a critical issue: individual chunks lack surrounding context.
Consider a codebase chunk containing just def calculate_total():. Without context, an embedding model might not understand this is part of a financial calculation function. The result? Poor retrieval when a user asks about "financial calculations."
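To make the problem concrete, here is a minimal sketch of the kind of naive fixed-size splitting a basic RAG pipeline might use. The split_into_chunks helper and the 800-character chunk size are illustrative assumptions, not part of Anthropic's recipe:

def split_into_chunks(document, chunk_size=800):
    """Naively split a document into fixed-size character chunks (illustrative only)."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

# A chunk like "def calculate_total():" carries no hint that it belongs to, say,
# a billing module, so its embedding is unlikely to match "financial calculations".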
Step 3: Implementing Contextual Embeddings
Contextual Embeddings solve this by prepending relevant context to each chunk before embedding. Here's how it works:
3.1 Generate Context for Each Chunk
Use Claude to generate context for each chunk. The prompt should include the full document and the specific chunk:
def generate_chunk_context(document, chunk):
    """Generate context for a single chunk using Claude."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[
            {
                "role": "user",
                "content": f"""<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context."""
            }
        ]
    )
    return response.content[0].text
3.2 Create Contextual Embeddings
Once you have the context, prepend it to the chunk before embedding:
def create_contextual_embedding(context, chunk):
    """Create an embedding for a chunk with its context."""
    contextual_chunk = f"{context}\n\n{chunk}"
    embedding = vo.embed(
        texts=[contextual_chunk],
        model="voyage-2"
    ).embeddings[0]
    return embedding
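Putting the two pieces together, indexing a whole document looks roughly like the sketch below. It reuses the split_into_chunks helper from earlier; index_document is an illustrative name, and you would swap in your own chunking strategy and vector store in practice.

def index_document(document):
    """Generate context for every chunk and embed the context-prepended chunks."""
    chunks = split_into_chunks(document)
    contextual_chunks, embeddings = [], []
    for chunk in chunks:
        context = generate_chunk_context(document, chunk)
        contextual_chunks.append(f"{context}\n\n{chunk}")
        embeddings.append(create_contextual_embedding(context, chunk))
    return chunks, contextual_chunks, embeddings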
3.3 Optimize Costs with Prompt Caching
Generating context for thousands of chunks can be expensive. Use prompt caching to reduce costs by up to 90%:
def generate_chunk_context_cached(document, chunk):
    """Generate context using prompt caching for efficiency."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": f"You are helping to situate chunks within this document:\n\n{document}",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"Here is the chunk: {chunk}"
            }
        ]
    )
    return response.content[0].text
Note: Prompt caching is currently available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
Step 4: Implementing Contextual BM25
Contextual BM25 extends the same idea to keyword-based search. Instead of using raw chunks, you use the same context-prepended chunks for BM25 indexing:
from rank_bm25 import BM25Okapi

def build_contextual_bm25_index(chunks_with_context):
    """Build a BM25 index using contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in chunks_with_context]
    bm25 = BM25Okapi(tokenized_chunks)
    return bm25

def search_contextual_bm25(bm25_index, query, top_k=10):
    """Search using contextual BM25."""
    tokenized_query = query.split()
    scores = bm25_index.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return top_indices
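A quick usage sketch, assuming contextual_chunks comes from the index_document helper sketched in Step 3 and the query is illustrative:

# Index the context-prepended chunks and run a keyword query
bm25 = build_contextual_bm25_index(contextual_chunks)
hits = search_contextual_bm25(bm25, "financial calculations", top_k=10)
print([contextual_chunks[i][:80] for i in hits])  # preview the top matches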
Step 5: Hybrid Search with Reranking
For best results, combine Contextual Embeddings and Contextual BM25, then rerank:
import numpy as np

def cosine_similarity(query_embedding, embedding_index):
    """Cosine similarity between a query vector and a matrix of chunk embeddings."""
    q = np.asarray(query_embedding)
    m = np.asarray(embedding_index)
    return (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))

def hybrid_search(query, embedding_index, bm25_index, alpha=0.5, top_k=20):
    """Perform hybrid search combining embeddings and BM25."""
    # Get embedding scores
    query_embedding = vo.embed(texts=[query], model="voyage-2").embeddings[0]
    emb_scores = cosine_similarity(query_embedding, embedding_index)
    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    # Combine scores (consider normalizing both score ranges first)
    combined_scores = alpha * emb_scores + (1 - alpha) * bm25_scores
    top_indices = sorted(range(len(combined_scores)),
                         key=lambda i: combined_scores[i],
                         reverse=True)[:top_k]
    return top_indices
def rerank_results(query, chunks, indices, top_k=10):
    """Rerank results using Cohere's reranker."""
    candidates = [chunks[i] for i in indices]
    reranked = co.rerank(
        model="rerank-english-v3.0",  # specify the rerank model explicitly
        query=query,
        documents=candidates,
        top_n=top_k
    )
    return [indices[r.index] for r in reranked.results]
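End to end, retrieval then looks something like the sketch below, where embeddings, contextual_chunks, and bm25 are the artifacts built in the earlier steps and the query is illustrative:

query = "How are financial totals calculated?"
candidate_indices = hybrid_search(query, embeddings, bm25, alpha=0.5, top_k=20)
final_indices = rerank_results(query, contextual_chunks, candidate_indices, top_k=10)
top_chunks = [contextual_chunks[i] for i in final_indices]  # pass these to Claude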
Step 6: Measuring Performance
Use Pass@k as your evaluation metric. This measures whether the "golden chunk" (the correct answer) appears in the top-k retrieved results:
def evaluate_pass_at_k(retrieval_results, golden_chunks, k=10):
    """Calculate Pass@k accuracy."""
    correct = 0
    for query_results, golden in zip(retrieval_results, golden_chunks):
        if golden in query_results[:k]:
            correct += 1
    return correct / len(retrieval_results)
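For example, with two evaluation queries, where each entry in retrieval_results is the ranked list of retrieved chunk IDs and golden_chunks holds the expected chunk per query (the IDs below are made up):

retrieval_results = [
    ["chunk_12", "chunk_3", "chunk_7"],   # ranked results for query 1
    ["chunk_9", "chunk_41", "chunk_2"],   # ranked results for query 2
]
golden_chunks = ["chunk_3", "chunk_50"]   # expected chunk per query
print(evaluate_pass_at_k(retrieval_results, golden_chunks, k=3))  # 0.5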
Anthropic's testing showed that Contextual Embeddings improved Pass@10 from ~87% to ~95% on a codebase dataset.
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock, you can deploy a Lambda function to add context to documents automatically. The Anthropic cookbook includes a contextual-rag-lambda-function directory with ready-to-use code. Deploy this Lambda and select it as a custom chunking option when configuring a Bedrock Knowledge Base.
Cost Management
- Use Claude 3 Haiku for context generation (fastest and cheapest)
- Leverage prompt caching to avoid reprocessing the full document for each chunk
- Batch your API calls where possible (see the sketch after this list)
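For the batching point, one option is Anthropic's Message Batches API. The sketch below assumes your version of the anthropic SDK exposes claude.messages.batches.create and that chunks comes from the indexing step; check the current API docs for the exact interface before relying on it.

# Submit all context-generation requests as one asynchronous batch
batch = claude.messages.batches.create(
    requests=[
        {
            "custom_id": f"chunk-{i}",
            "params": {
                "model": "claude-3-haiku-20240307",
                "max_tokens": 100,
                "messages": [{"role": "user", "content": f"Here is the chunk: {chunk}"}],
            },
        }
        for i, chunk in enumerate(chunks)
    ]
)
print(batch.id)  # poll this batch until processing ends, then fetch the results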
Key Takeaways
- Contextual Embeddings dramatically improve retrieval accuracy: By prepending context to each chunk before embedding, you can reduce retrieval failure rates by 35% and improve Pass@10 from ~87% to ~95%.
- Contextual BM25 boosts hybrid search: Applying the same context to BM25 indexing improves keyword-based retrieval, making hybrid search even more effective.
- Prompt caching makes it practical: Without caching, generating context for thousands of chunks would be cost-prohibitive. Prompt caching reduces costs by up to 90%.
- Reranking adds the final polish: Combining Contextual Embeddings, Contextual BM25, and a reranker creates a robust RAG pipeline that handles edge cases well.
- Production-ready on major platforms: The technique works on Anthropic's API, AWS Bedrock (via Lambda), and GCP Vertex AI, making it accessible for enterprise deployments.