Mastering Contextual Retrieval: Boost RAG Accuracy with Claude and Contextual Embeddings
Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. Step-by-step guide with code examples for Contextual Embeddings and BM25.
This guide teaches you how to implement Contextual Retrieval, a technique that adds relevant context to document chunks before embedding, to reduce retrieval failure rates by 35% using Claude, Voyage AI, and Cohere APIs.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But traditional RAG has a critical flaw: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context, leading to poor search results.
Enter Contextual Retrieval, a technique pioneered by Anthropic that adds relevant context to each chunk before embedding. The results speak for themselves: a 35% reduction in retrieval failure rates across diverse datasets.
In this guide, you'll learn how to implement Contextual Retrieval using Claude, Voyage AI embeddings, and Cohere reranking. We'll walk through building a complete system from scratch, with production-ready code and cost optimization strategies.
What You'll Build
By the end of this guide, you'll have a fully functional Contextual Retrieval system that:
- Improves Pass@10 accuracy from ~87% to ~95%
- Reduces retrieval failure rates by 35%
- Works with both embedding-based and BM25 search
- Includes reranking for maximum precision
Prerequisites
Skills:
- Intermediate Python (3.8+)
- Basic RAG understanding
- Familiarity with vector embeddings
API keys:
- Anthropic API key (~$5-10 total cost)
- Voyage AI API key
- Cohere API key
System:
- 4GB+ RAM
- 5-10GB disk space
- Docker (optional, for BM25)
1. Setting Up Your Environment
First, install the required libraries:
```bash
pip install anthropic voyageai cohere numpy pandas rank-bm25
```
Initialize your clients:
```python
import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="sk-ant-...")
vo = voyageai.Client(api_key="pa-...")
co = cohere.Client(api_key="...")
```
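Hardcoded keys are fine for a quick experiment, but a common alternative is to read them from environment variables and fail early when one is missing. The sketch below shows one way to do that. The helper name `require_env` and the variable names in the commented usage are assumptions for illustration, not something the SDKs require.

```python
import os

def require_env(names):
    """Return a dict of the requested environment variables.

    Raises RuntimeError listing every missing name, so a misconfigured
    environment is caught before any API call is made.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}

# Hypothetical variable names; adjust them to match your deployment.
# keys = require_env(["ANTHROPIC_API_KEY", "VOYAGE_API_KEY", "COHERE_API_KEY"])
# claude = anthropic.Anthropic(api_key=keys["ANTHROPIC_API_KEY"])
```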
2. The Problem: Context-Less Chunks
Traditional RAG splits documents into chunks like this:
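```python
def basic_chunk(text, chunk_size=500, overlap=50):
    """Simple character-based chunking"""
    if overlap >= chunk_size:
        # A non-positive step would make range() fail or never advance
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    # Each chunk starts (chunk_size - overlap) characters after the previous one
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks
```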
The problem? A chunk containing "def calculate_interest(principal, rate, time):" loses meaning when separated from the function's docstring and surrounding context.
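To make the problem concrete, the sketch below runs the same character-based chunker on a small, made-up source snippet. The sample text and the tiny chunk size are illustrative assumptions chosen so the effect is visible in a few lines.

```python
def basic_chunk(text, chunk_size=500, overlap=50):
    """Character-based chunking, repeated here so the example runs on its own."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Made-up sample: a module docstring followed by a function.
sample = (
    '"""Loan utilities for the billing service."""\n'
    "\n"
    "def calculate_interest(principal, rate, time):\n"
    "    return principal * rate * time\n"
)

chunks = basic_chunk(sample, chunk_size=60, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))

# Chunks that hold the formula, checked for any mention of the billing service
body_chunks = [c for c in chunks if "principal * rate" in c]
print(any("billing" in c for c in body_chunks))
```

The chunk that holds the formula carries no trace of the module it came from, which is exactly the information a retriever would need to match a query about billing.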
3. Implementing Contextual Embeddings
Contextual Embeddings solve this by prepending relevant context to each chunk before embedding:
```python
def generate_chunk_context(chunk, full_document, claude_client):
    """Generate context for a single chunk using Claude"""
    # The prompt text is left-aligned so no indentation leaks into the request
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # The generated context is the text of the first content block
    return response.content[0].text
```
Production Optimization with Prompt Caching
For large codebases, generating context for every chunk individually is expensive. Use prompt caching to reduce costs:
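```python
CONTEXT_INSTRUCTION = """Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def generate_contexts_with_caching(chunks, full_document, claude_client):
    """Generate contexts using prompt caching for efficiency"""
    # The document goes in a cached system block, so it must be identical
    # across calls. The first request writes the cache; later ones read it,
    # so no separate warm-up request is needed.
    # Documents below the model's minimum cacheable length are not cached.
    system = [{
        "type": "text",
        "text": f"<document>\n{full_document}\n</document>",
        "cache_control": {"type": "ephemeral"},
    }]
    contexts = []
    for chunk in chunks:
        response = claude_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            system=system,
            # Only the chunk-specific part varies between requests
            messages=[{"role": "user", "content": CONTEXT_INSTRUCTION.format(chunk=chunk)}],
        )
        contexts.append(response.content[0].text)
    return contexts
```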
Cost Savings: Prompt caching reduces API costs by ~70-80% for large document collections.
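To see where a saving of that size can come from, the sketch below does the arithmetic for a single document. Everything numeric in it is an assumption for illustration: the token counts are made up, and the per-million-token prices are placeholders that should be replaced with the current figures from Anthropic's pricing page. It also assumes all requests for a document land while the cache entry is still alive.

```python
def estimate_context_cost(doc_tokens, n_chunks, chunk_tokens, output_tokens,
                          input_price, output_price,
                          cache_write_price=None, cache_read_price=None):
    """Estimate the USD cost of generating one context per chunk for one document.

    Prices are USD per million tokens. If both cache prices are given, the
    document is assumed to be written to the cache once and read from it for
    every remaining chunk; otherwise it is billed as plain input every time.
    """
    per_million = 1_000_000
    # Chunk text plus instruction, and the generated context, are never cached
    uncached = n_chunks * (chunk_tokens * input_price + output_tokens * output_price)
    if cache_write_price is None or cache_read_price is None:
        document = n_chunks * doc_tokens * input_price
    else:
        document = doc_tokens * (cache_write_price + (n_chunks - 1) * cache_read_price)
    return (uncached + document) / per_million

# Assumed workload: an 8,000-token document split into 20 chunks
workload = dict(doc_tokens=8000, n_chunks=20, chunk_tokens=200, output_tokens=100)
# Assumed prices (USD per million tokens); verify before relying on them
base = estimate_context_cost(**workload, input_price=0.25, output_price=1.25)
cached = estimate_context_cost(**workload, input_price=0.25, output_price=1.25,
                               cache_write_price=0.30, cache_read_price=0.03)
print(f"without caching: ${base:.4f}, with caching: ${cached:.4f}")
print(f"saving: {1 - cached / base:.0%}")
```

Under these assumed numbers the saving comes out at roughly three quarters of the uncached cost. The ratio grows with the number of chunks per document, because the one-time cache write is spread over more reads.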
4. Building the Retrieval Pipeline
Step 1: Create Contextual Embeddings
```python
def create_contextual_embeddings(chunks, contexts):
    """Create embeddings for context-enriched chunks"""
    # Prepend each generated context to its chunk
    contextual_chunks = [
        f"{context}\n\n{chunk}"
        for context, chunk in zip(contexts, chunks)
    ]
    # Generate embeddings using Voyage AI (uses the module-level `vo` client)
    embeddings = vo.embed(
        texts=contextual_chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return contextual_chunks, embeddings
```
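Embedding APIs generally cap how many texts a single request may carry, so a large corpus usually has to be sent in batches. The helper below is a minimal sketch of that idea. `embed_in_batches` and its `embed_fn` parameter are hypothetical names, and the default batch size of 128 is an assumption to check against Voyage AI's current limits.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches and return one flat list of vectors.

    embed_fn is any callable that maps a list of strings to a list of
    vectors in the same order.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors

# With the Voyage client from earlier (not executed here):
# embeddings = embed_in_batches(
#     contextual_chunks,
#     lambda batch: vo.embed(texts=batch, model="voyage-2", input_type="document").embeddings,
# )
```

Passing the embedding call in as a function keeps the batching logic testable without network access.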
Step 2: Implement Hybrid Search with Contextual BM25
Combine embedding search with BM25 for better results:
```python
import numpy as np
from rank_bm25 import BM25Okapi

def _min_max(scores):
    """Scale scores to [0, 1]; all-equal scores map to zeros instead of dividing by zero"""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_search(query, embeddings, bm25, chunks, alpha=0.5, top_k=10):
    """Hybrid search combining embeddings and BM25"""
    # Embedding search
    query_embedding = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    # Dot product; equals cosine similarity when the vectors are unit-length
    embedding_scores = np.dot(np.asarray(embeddings), query_embedding)
    # BM25 search (tokenize the query the same way the index was built)
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize and combine: alpha weights embeddings, (1 - alpha) weights BM25
    hybrid_scores = alpha * _min_max(embedding_scores) + (1 - alpha) * _min_max(bm25_scores)
    # Return top-k results
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    return [chunks[i] for i in top_indices]
```
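The effect of `alpha` is easiest to see on toy numbers, without any API calls. The sketch below applies the same min-max normalization and weighted sum to made-up scores for three chunks. The score values and the helper names `min_max` and `fuse_scores` are assumptions for illustration.

```python
import numpy as np

def min_max(scores):
    """Scale scores to [0, 1]; all-equal scores map to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def fuse_scores(embedding_scores, bm25_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the embedding weight."""
    return alpha * min_max(embedding_scores) + (1 - alpha) * min_max(bm25_scores)

# Made-up scores: chunk 0 is slightly closer semantically,
# chunk 1 contains the exact keyword the query asks for.
embedding_scores = [0.82, 0.80, 0.40]
bm25_scores = [1.0, 6.0, 0.0]

balanced = np.argsort(fuse_scores(embedding_scores, bm25_scores, alpha=0.5))[::-1]
embedding_only = np.argsort(fuse_scores(embedding_scores, bm25_scores, alpha=1.0))[::-1]
print(balanced.tolist())        # [1, 0, 2]
print(embedding_only.tolist())  # [0, 1, 2]
```

With equal weights the keyword match lifts chunk 1 to the top; with `alpha=1.0` the ranking falls back to pure embedding order.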
Step 3: Add Reranking for Precision
```python
def rerank_results(query, candidates, cohere_client):
    """Rerank candidates using Cohere's rerank model"""
    results = cohere_client.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=5
    )
    # Each result carries the index of the candidate it refers to
    return [candidates[r.index] for r in results.results]
```
5. Complete Pipeline in Action
```python
def contextual_rag_pipeline(query, chunks, full_document):
    """Complete Contextual Retrieval pipeline"""
    # 1. Generate contexts
    contexts = generate_contexts_with_caching(chunks, full_document, claude)
    # 2. Create contextual embeddings
    contextual_chunks, embeddings = create_contextual_embeddings(chunks, contexts)
    # 3. Build BM25 index
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    # 4. Hybrid search
    candidates = hybrid_search(query, embeddings, bm25, contextual_chunks)
    # 5. Rerank
    final_results = rerank_results(query, candidates, co)
    return final_results

# Example usage
# codebase_chunks and full_codebase are assumed to be prepared beforehand
query = "How does the interest calculation function work?"
results = contextual_rag_pipeline(query, codebase_chunks, full_codebase)
print(f"Top result: {results[0]}")
```
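The pipeline above regenerates contexts, embeddings, and the BM25 index on every call, which is fine for a demo but wasteful when many queries hit the same corpus. One possible restructuring, sketched below, builds the index once and returns a function that only does the per-query work. `make_searcher`, `build_index`, and `search` are hypothetical names, and the two callables stand in for steps 1-3 and steps 4-5 of the pipeline.

```python
def make_searcher(chunks, full_document, build_index, search):
    """Build the index once and return a function that answers many queries.

    build_index(chunks, full_document) -> index   (expensive, runs once)
    search(query, index) -> results               (cheap, runs per query)
    """
    index = build_index(chunks, full_document)

    def run(query):
        return search(query, index)

    return run

# Hypothetical wiring with the functions from this guide (not executed here):
# def build_index(chunks, doc):
#     contexts = generate_contexts_with_caching(chunks, doc, claude)
#     texts, embeddings = create_contextual_embeddings(chunks, contexts)
#     return texts, embeddings, BM25Okapi([t.split() for t in texts])
#
# def search(query, index):
#     texts, embeddings, bm25 = index
#     return rerank_results(query, hybrid_search(query, embeddings, bm25, texts), co)
```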
Performance Results
On a dataset of 9 codebases with 248 queries:
| Method | Pass@10 | Improvement over baseline (percentage points) |
|---|---|---|
| Basic RAG | 87.1% | Baseline |
| Contextual Embeddings | 94.8% | +7.7 |
| Contextual Embeddings + BM25 | 96.2% | +9.1 |
| Full Pipeline (with reranking) | 97.5% | +10.4 |
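The guide does not spell out how Pass@10 is computed, so the sketch below assumes a common definition: a query counts as a pass if at least one of its known-correct ("golden") chunks appears among the top k retrieved results. The function name and the input shapes are assumptions for illustration.

```python
def pass_at_k(retrieved, golden, k=10):
    """Fraction of queries whose top-k results include at least one golden chunk.

    retrieved: list of ranked chunk-ID lists, one per query
    golden:    list of sets of correct chunk IDs, one per query
    """
    if len(retrieved) != len(golden):
        raise ValueError("retrieved and golden must have the same length")
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(retrieved, golden)
        if any(chunk_id in gold for chunk_id in ranked[:k])
    )
    return hits / len(retrieved)

# Toy evaluation with three queries
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
golden = [{"b"}, {"z"}, {"i"}]
print(pass_at_k(retrieved, golden, k=3))  # two of three queries pass
```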
Production Considerations
AWS Bedrock Integration
For AWS users, deploy a Lambda function for automatic contextual chunking:
```python
# lambda_function.py (simplified)
# Assumes `claude` and generate_contexts_with_caching are defined as shown earlier
def lambda_handler(event, context):
    """AWS Lambda for contextual chunking in Bedrock Knowledge Bases"""
    document = event['document']
    chunks = event['chunks']
    contexts = generate_contexts_with_caching(chunks, document, claude)
    # Return each chunk paired with its generated context
    return {
        'statusCode': 200,
        'chunks': [
            {'chunk': chunk, 'context': context}
            for chunk, context in zip(chunks, contexts)
        ]
    }
```
Cost Optimization Tips
- Use Claude Haiku for context generation (cheapest model)
- Batch context generation to minimize API calls
- Cache embeddings for static documents
- Set appropriate chunk sizes (500-1000 tokens recommended)
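The tip about caching embeddings for static documents can be sketched as a small wrapper keyed by a hash of the text, so unchanged chunks are never embedded twice. `EmbeddingCache` and its `embed_fn` parameter are hypothetical names, and the cache here lives in memory only. A real deployment would likely persist it and include the embedding model name in the key.

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed by a SHA-256 hash of the text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: list[str] -> list[vector]
        self._store = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Collect texts not seen before, deduplicated within this call
        missing = {}
        for text in texts:
            key = self._key(text)
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            # One request for all new texts; dicts preserve insertion order,
            # so keys and returned vectors line up
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing.keys(), vectors):
                self._store[key] = vector
        return [self._store[self._key(text)] for text in texts]

# With the Voyage client from earlier (not executed here):
# cache = EmbeddingCache(
#     lambda batch: vo.embed(texts=batch, model="voyage-2", input_type="document").embeddings
# )
# embeddings = cache.embed(contextual_chunks)
```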
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by adding document-level context to each chunk before embedding
- Prompt caching makes this practical by reducing API costs by 70-80% for large document collections
- Hybrid search (embeddings + BM25) outperforms either method alone, especially for codebases and technical documentation
- Reranking adds a further 1-2 percentage points of Pass@10 and is worth implementing for production systems
- The technique works across platforms—Anthropic API, AWS Bedrock, and GCP Vertex AI all support contextual retrieval with minor customization