Mastering Contextual Retrieval: Boost RAG Accuracy with Claude & Contextual Embeddings
Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. A practical guide with code examples for building production-ready RAG systems.
This guide shows how to implement Contextual Retrieval, a technique that adds relevant context to document chunks before embedding, using Claude, Voyage AI, and prompt caching. In Anthropic's tests it reduced the retrieval failure rate by 35%.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context, leading to poor search results and inaccurate answers.
Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.

In this guide, you'll learn how to build a Contextual Retrieval system using Claude, Voyage AI embeddings, and BM25 search, complete with code examples and production-ready optimization strategies.
What You'll Need
Prerequisites
- Intermediate Python knowledge
- Basic understanding of RAG and vector databases
- Familiarity with command-line tools
API Keys & Tools
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)
- Python 3.8+, Docker (optional for BM25), 4GB+ RAM
1. The Problem: Lost Context in Chunked Documents
Traditional RAG pipelines split documents into fixed-size chunks. While efficient, this approach creates a critical flaw:
A chunk about a Python function `calculate_interest()` might not mention that it belongs to a `BankAccount` class, so a query about "bank account interest calculation" may miss it entirely.
This is where Contextual Embeddings shine.
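To make the problem concrete, here is a minimal sketch that splits a toy source file into fixed-size chunks and checks whether the chunk holding `calculate_interest()` still mentions its class. The file contents, the `split_fixed` helper, and the 150-character chunk size are all assumptions made for illustration.

```python
# Toy source file; BankAccount and calculate_interest mirror the example above.
SOURCE = '''class BankAccount:
    """A simple savings account."""

    def __init__(self, balance, rate):
        self.balance = balance
        self.rate = rate

    def deposit(self, amount):
        self.balance += amount
        return self.balance

    def calculate_interest(self, years):
        return self.balance * self.rate * years
'''


def split_fixed(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = split_fixed(SOURCE, 150)

# Find the chunk that holds the method definition.
target = next(c for c in chunks if "def calculate_interest" in c)

# The class name only appears in the first chunk, so this prints False here.
print("BankAccount" in target)
```

A retriever that sees only `target` has no signal that the method relates to bank accounts, which is the gap the added context is meant to close.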
2. What Are Contextual Embeddings?
Contextual Embeddings add a short, chunk-specific context string to each chunk before embedding. This context explains what the chunk is about and where it fits in the broader document.
How it works:
- Split your document into chunks (e.g., 500 characters)
- For each chunk, use Claude to generate a 50–100 word context description
- Prepend this context to the chunk text
- Embed the combined text (context + chunk)
- Store in your vector database
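The steps above can be sketched end to end without any API calls. In this sketch the context generator is passed in as a function, so a stand-in can be used while prototyping; `split_into_chunks`, `contextualize`, and `fake_context` are hypothetical names, and the embedding and storage steps are left out.

```python
from typing import Callable, List


def split_into_chunks(text: str, size: int = 500) -> List[str]:
    """Step 1: split the document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def contextualize(
    document: str,
    chunks: List[str],
    context_fn: Callable[[str, str], str],
) -> List[str]:
    """Steps 2-3: generate a context per chunk and prepend it to the chunk."""
    enriched = []
    for chunk in chunks:
        context = context_fn(document, chunk)
        enriched.append(f"{context}\n\n{chunk}")
    return enriched


def fake_context(document: str, chunk: str) -> str:
    """Stand-in for the Claude call; a real context would describe the chunk."""
    return f"This chunk comes from a {len(document)}-character document."


doc = "x" * 1200
pieces = split_into_chunks(doc, 500)
enriched_pieces = contextualize(doc, pieces, fake_context)
print(len(pieces), enriched_pieces[0][:50])
```

Passing the context generator as an argument keeps the chunking logic testable and lets you swap in the real model call later.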
Why It Works So Well
- Semantic clarity: The embedding vector captures both the chunk's content and its role in the document
- Disambiguation: Two similar-looking code snippets from different classes become distinct
- Improved recall: Queries that reference high-level concepts now match relevant sub-chunks
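The recall point can be illustrated with a deliberately crude stand-in for embeddings: counting the query terms that also appear in the text. Real embedding models behave differently, so treat this only as intuition. The chunk, the context sentence, and the `shared_terms` helper are made up for this sketch.

```python
import re


def shared_terms(query: str, text: str) -> set:
    """Return the lowercase alphabetic tokens that query and text share."""
    def tokenize(s):
        return set(re.findall(r"[a-z]+", s.lower()))
    return tokenize(query) & tokenize(text)


query = "bank account interest calculation"

chunk = (
    "def calculate_interest(self, years):\n"
    "    return self.balance * self.rate * years"
)
# Hypothetical context of the kind an LLM might generate for this chunk.
context = (
    "This method belongs to the BankAccount class and computes "
    "interest on a bank account."
)

plain_overlap = shared_terms(query, chunk)
enriched_overlap = shared_terms(query, f"{context}\n\n{chunk}")

print(sorted(plain_overlap))     # ['interest']
print(sorted(enriched_overlap))  # ['account', 'bank', 'interest']
```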
3. Implementation: Building Contextual Retrieval with Claude
Let's walk through the implementation using a dataset of 9 codebases (248 queries with golden chunks).
Step 1: Generate Context for Each Chunk
```python
import anthropic

client = anthropic.Anthropic(api_key="YOUR_API_KEY")


def generate_chunk_context(document_text, chunk_text):
    """Generate context for a single chunk using Claude."""
    response = client.messages.create(
        # Claude 3.5 Sonnet; substitute whichever Claude model you have access to.
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        system=(
            "You are a document analyzer. Given a document and a chunk from it, "
            "provide a brief context (50-100 words) explaining what this chunk is about "
            "and how it relates to the broader document."
        ),
        messages=[
            {
                "role": "user",
                # Only the first 2,000 characters of the document are sent here
                # to keep the prompt short; later chunks may lose context.
                "content": f"Document:\n{document_text[:2000]}\n\n"
                           f"Chunk:\n{chunk_text}\n\n"
                           f"Provide context for this chunk:"
            }
        ]
    )
    return response.content[0].text
```
Step 2: Create Contextual Embeddings
```python
import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")


def create_contextual_embedding(chunk_text, context):
    """Create embedding from context + chunk."""
    enriched_text = f"{context}\n\n{chunk_text}"
    result = vo.embed([enriched_text], model="voyage-2")
    return result.embeddings[0]
```
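Embedding one chunk per request is slow for large corpora. As a rough sketch, the helper below groups texts into batches and delegates each batch to an embedding function you supply. `embed_in_batches` is a hypothetical name, and the default batch size of 128 is an assumption you should check against your provider's current limits.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,
) -> List[List[float]]:
    """Embed texts in order, sending at most `batch_size` texts per call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder for local testing: one number per text (its length).
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(texts, fake_embed, batch_size=2)
print(calls, vectors)
```

With the Voyage client from the step above, `embed_batch` could be something like `lambda batch: vo.embed(batch, model="voyage-2").embeddings`.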
Step 3: Build Your Retrieval Pipeline
```python
import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB with Voyage embeddings
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="contextual_rag",
    embedding_function=embedding_functions.VoyageAIEmbeddingFunction(
        api_key="YOUR_VOYAGE_API_KEY",
        model_name="voyage-2"
    )
)

# Add chunks with context.
# `chunks` and `contexts` are parallel lists produced in Steps 1 and 2.
for i, (chunk, context) in enumerate(zip(chunks, contexts)):
    enriched = f"{context}\n\n{chunk}"
    collection.add(
        documents=[enriched],
        ids=[f"chunk_{i}"],
        metadatas=[{"original_chunk": chunk, "context": context}]
    )
```
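Once the collection is populated, retrieval is a call such as `collection.query(query_texts=[question], n_results=20)`, which returns a dictionary of parallel lists with one inner list per query. The sketch below unpacks a result of that shape back into the original, context-free chunks stored in the metadata. The `unpack_query_result` helper and the sample dictionary are hand-written for illustration, not real output.

```python
def unpack_query_result(result):
    """Flatten a Chroma-style result for a single query.

    Returns (id, original_chunk, distance) tuples in ranked order.
    """
    ids = result["ids"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]
    return [
        (chunk_id, meta["original_chunk"], dist)
        for chunk_id, meta, dist in zip(ids, metadatas, distances)
    ]


# Hand-written sample in the shape a single-query result takes.
sample_result = {
    "ids": [["chunk_7", "chunk_2"]],
    "metadatas": [[
        {"original_chunk": "def calculate_interest(...)", "context": "..."},
        {"original_chunk": "def deposit(...)", "context": "..."},
    ]],
    "distances": [[0.12, 0.34]],
}

hits = unpack_query_result(sample_result)
for chunk_id, original, distance in hits:
    print(chunk_id, distance, original)
```

Returning the original chunk rather than the enriched text keeps the generated context out of the final prompt, if that is what you prefer.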
4. Optimizing Costs with Prompt Caching
Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing the document prefix across multiple context generation calls.
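```python
# Enable prompt caching for the document prefix
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=150,
    system=[
        {
            "type": "text",
            "text": "You are a document analyzer...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    # The document is identical across calls, so the cache
                    # breakpoint goes at the end of this block.
                    "type": "text",
                    "text": f"Document:\n{document_text}",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    # The chunk changes on every call, so it stays outside
                    # the cached prefix.
                    "type": "text",
                    "text": f"Chunk:\n{chunk_text}"
                }
            ]
        }
    ]
)
```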
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
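To get a feel for the savings, here is a back-of-the-envelope estimate of input-token cost with and without caching for one document. It assumes cache writes are billed at 1.25x the base input price and cache reads at 0.1x, and that every call after the first hits the cache. Check current pricing before relying on these multipliers. `relative_input_cost` and the token counts are hypothetical.

```python
def relative_input_cost(
    doc_tokens: int,
    chunk_tokens: int,
    n_chunks: int,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> float:
    """Return cached input cost divided by uncached input cost."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Without caching, every call pays full price for document + chunk.
    uncached = n_chunks * (doc_tokens + chunk_tokens)
    # With caching, the document is written once, then read on later calls;
    # the chunk itself is always billed at the base rate.
    cached = (
        doc_tokens * write_mult
        + (n_chunks - 1) * doc_tokens * read_mult
        + n_chunks * chunk_tokens
    )
    return cached / uncached


ratio = relative_input_cost(doc_tokens=8000, chunk_tokens=200, n_chunks=40)
print(f"Cached input cost is {ratio:.0%} of the uncached cost")
```

Under these assumptions, an 8,000-token document split into 40 chunks costs about 15% of the uncached input price. With a single chunk, caching costs slightly more because of the write premium.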
5. Going Further: Contextual BM25 & Hybrid Search
Context isn't just for embeddings. You can apply the same context to BM25 (keyword-based search) for even better results.
Contextual BM25 Implementation
```python
from rank_bm25 import BM25Okapi


def build_contextual_bm25(chunks, contexts):
    """Build BM25 index with contextualized chunks."""
    enriched_chunks = [f"{context}\n\n{chunk}" for chunk, context in zip(chunks, contexts)]
    tokenized = [chunk.split() for chunk in enriched_chunks]
    return BM25Okapi(tokenized)


# Hybrid search: combine embedding similarity + BM25 scores
bm25 = build_contextual_bm25(chunks, contexts)


def hybrid_search(query, embedding_results, bm25_results, alpha=0.5):
    """Combine embedding and BM25 scores.

    Both result lists hold (doc_id, score) pairs. The scores should be on
    comparable scales, otherwise the larger-valued source dominates.
    """
    combined_scores = {}
    for doc_id, score in embedding_results:
        combined_scores[doc_id] = alpha * score
    for doc_id, score in bm25_results:
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + (1 - alpha) * score
    return sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
```
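BM25 scores are unbounded while similarity scores usually fall in a narrow range, so a weighted sum of raw scores tends to be dominated by BM25. One common remedy, shown here as a sketch rather than the method used in Anthropic's tests, is to min-max normalize each result list before combining. `min_max_normalize`, `weighted_fusion`, and the sample scores are hypothetical.

```python
def min_max_normalize(results):
    """Scale the scores in a list of (doc_id, score) pairs to [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - lo) / (hi - lo)) for doc_id, score in results]


def weighted_fusion(embedding_results, bm25_results, alpha=0.5):
    """Combine two normalized result lists with weight alpha on embeddings."""
    combined = {}
    for doc_id, score in min_max_normalize(embedding_results):
        combined[doc_id] = alpha * score
    for doc_id, score in min_max_normalize(bm25_results):
        combined[doc_id] = combined.get(doc_id, 0.0) + (1 - alpha) * score
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)


embedding_results = [("a", 0.82), ("b", 0.80), ("c", 0.60)]  # similarities
bm25_results = [("b", 12.0), ("c", 9.0), ("d", 3.0)]         # raw BM25 scores

ranked = weighted_fusion(embedding_results, bm25_results)
print([doc_id for doc_id, _ in ranked])  # ['b', 'a', 'c', 'd']
```

With the raw scores combined directly, the same inputs would rank as b, c, d, a. The top embedding hit drops to last place only because BM25 values are numerically larger.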
6. Performance Results & Reranking
In Anthropic's tests with 9 codebases and 248 queries:
| Technique | Pass@10 | Improvement over baseline (percentage points) |
|---|---|---|
| Basic RAG | 87% | Baseline |
| Contextual Embeddings | 95% | +8 |
| Contextual Embeddings + BM25 | 97% | +10 |
| + Reranking (Cohere) | 98% | +11 |
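For readers who want to run a similar evaluation, here is a minimal sketch of a Pass@k metric. It assumes Pass@k is the fraction of a query's golden chunks that appear among the top k retrieved chunks, averaged over queries. The exact definition behind the table may differ, and the function names and sample IDs are hypothetical.

```python
def pass_at_k(retrieved_ids, golden_ids, k=10):
    """Fraction of golden chunk ids found in the top-k retrieved ids."""
    if not golden_ids:
        raise ValueError("golden_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    found = sum(1 for golden in golden_ids if golden in top_k)
    return found / len(golden_ids)


def mean_pass_at_k(runs, k=10):
    """Average pass_at_k over (retrieved_ids, golden_ids) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(pass_at_k(r, g, k) for r, g in runs) / len(runs)


runs = [
    (["c1", "c9", "c4"], ["c4"]),        # golden chunk retrieved
    (["c2", "c3", "c8"], ["c5"]),        # golden chunk missed
    (["c7", "c6", "c1"], ["c6", "c0"]),  # one of two golden chunks retrieved
]
score = mean_pass_at_k(runs, k=3)
print(f"Pass@3 = {score:.2f}")  # Pass@3 = 0.50
```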
Adding Reranking
```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")


def rerank_results(query, candidates, top_k=5):
    """Rerank retrieved chunks using Cohere's reranker."""
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k
    )
    return [(r.index, r.relevance_score) for r in results.results]
```
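`rerank_results` returns positions in the `candidates` list rather than the texts themselves, so a small mapping step is needed before building the final prompt. The sketch below does that and optionally drops low-scoring results. `select_reranked`, the `min_score` threshold, and the sample scores are assumptions for illustration.

```python
def select_reranked(candidates, ranked, min_score=0.0):
    """Map (index, score) pairs back to candidate texts, keeping rank order.

    Pairs scoring below `min_score` are dropped.
    """
    return [
        (candidates[index], score)
        for index, score in ranked
        if score >= min_score
    ]


candidates = ["alpha chunk", "beta chunk", "gamma chunk"]
# Hand-written stand-in for the output of rerank_results(query, candidates).
ranked = [(2, 0.91), (0, 0.40), (1, 0.05)]

selected = select_reranked(candidates, ranked, min_score=0.1)
print(selected)  # [('gamma chunk', 0.91), ('alpha chunk', 0.4)]
```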
7. Production Deployment on AWS Bedrock
For AWS users, Anthropic provides a Lambda function that implements contextual chunking for Bedrock Knowledge Bases. Deploy it as a custom chunking option:
- Deploy the Lambda using `contextual-rag-lambda-function/lambda_function.py`
- In Bedrock Knowledge Base creation, select "Custom chunking"
- Point to your deployed Lambda
- Your chunks are now automatically contextualized
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by enriching chunks with document-level context before embedding
- Combine with BM25 for hybrid search that leverages both semantic and keyword matching
- Use prompt caching to make context generation cost-effective at scale (caches document prefixes across chunks)
- Reranking adds roughly one more percentage point of Pass@10 in the results above, which is worth implementing for high-stakes applications
- Production-ready on AWS Bedrock via custom Lambda chunking—no need to rebuild your infrastructure