Contextual Retrieval: How to Supercharge RAG with Claude and Contextual Embeddings
Learn how to improve RAG accuracy by 35% using Contextual Embeddings and BM25 with Claude. A practical guide with code examples, cost-saving tips, and evaluation metrics.
This guide shows you how to add context to each document chunk before embedding, reducing retrieval failure rates by 35%. You'll implement Contextual Embeddings, Contextual BM25, and reranking using Claude, Voyage AI, and Cohere.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful. A chunk that says "the API key is invalid" is useless if you don't know it's referring to the authentication step of a payment gateway.
Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets. In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, and how prompt caching makes this approach practical and cost-effective.

What You'll Need
Before diving in, make sure you have:
- Python 3.8+ installed
- Docker (optional, for BM25 search)
- 4GB+ RAM and ~5-10 GB free disk space
- API keys for Anthropic, Voyage AI, and Cohere
- Basic familiarity with RAG, embeddings, and vector databases
1. Setting Up the Baseline: Basic RAG
We'll start by establishing a performance baseline using a dataset of 9 codebases, pre-chunked into smaller pieces. The evaluation set contains 248 queries, each with a "golden chunk"—the correct answer. Our metric is Pass@k, which checks whether the golden chunk appears in the top-k retrieved results.
First, install the required libraries:
```bash
pip install anthropic voyageai cohere rank_bm25 numpy
```
Load the dataset and create a simple vector store:
```python
import json
from voyageai import Client

# Load chunks and the evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize Voyage AI for embeddings
voyage_client = Client(api_key='your-voyage-api-key')

# Embed all chunks (simplified for illustration)
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = voyage_client.embed(chunk_texts, model='voyage-2').embeddings

# Store the embeddings in a vector DB (e.g., Chroma, Pinecone);
# for brevity, we'll use a simple cosine similarity search.
```
With basic RAG, our Pass@10 was ~87%. That's decent, but we can do better.
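To make the baseline concrete, the Pass@k metric described above can be sketched as a small helper. The `retrieve` callable and the `golden_chunk_id` field name are assumptions standing in for your own retrieval function and evaluation schema:

```python
def pass_at_k(eval_data, retrieve, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results.

    `retrieve(query, k)` is assumed to return a list of chunk IDs;
    each eval item is assumed to carry 'query' and 'golden_chunk_id' keys.
    """
    hits = 0
    for item in eval_data:
        top_ids = retrieve(item['query'], k)
        if item['golden_chunk_id'] in top_ids:
            hits += 1
    return hits / len(eval_data)
```

Running this over all 248 queries with your retrieval function gives the Pass@10 numbers reported throughout this guide.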
2. Contextual Embeddings: Adding Context Before Embedding
The core idea is simple: for each chunk, use Claude to generate a short piece of context that explains what the chunk is about, then prepend that context to the chunk text before embedding.
Why It Works
When you embed a chunk like "def authenticate(): ...", the vector captures only the code syntax. But if you first prepend "This function handles user authentication for the payment gateway API", the embedding becomes much richer and more retrievable for relevant queries.
Implementation with Claude and Prompt Caching
Here's how to generate context for each chunk using Claude:
```python
import anthropic

client = anthropic.Anthropic(api_key='your-anthropic-api-key')

def generate_context(chunk_text, full_document):
    """Generate context for a single chunk."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        system="You are a helpful assistant that provides context for document chunks.",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Apply to all chunks
for chunk in chunks:
    context = generate_context(chunk['text'], chunk['full_document'])
    chunk['contextual_text'] = f"{context}\n\n{chunk['text']}"
```
But wait—this is expensive! Generating context for thousands of chunks could cost a fortune. That's where prompt caching comes in.
Making It Practical with Prompt Caching
Prompt caching allows you to reuse the full document context across multiple chunk requests. Since the full document is the same for all chunks within a document, you cache it once and pay only for the chunk-specific tokens.
```python
# Enable prompt caching by adding the cache_control parameter
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    system=[{
        "type": "text",
        "text": "You are a helpful assistant that provides context for document chunks.",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"<document>{full_document}</document>",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"<chunk>{chunk_text}</chunk>\n\nPlease give a short succinct context..."
            }
        ]
    }]
)
```
With caching, the cost drops dramatically—often by 90% or more.
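A rough back-of-envelope calculation shows where the savings come from. The prices below are illustrative assumptions (per million input tokens), not guaranteed current Anthropic pricing; without caching, the full document is re-sent for every chunk, while with caching you pay one cache write and then discounted cached reads:

```python
def context_generation_cost(doc_tokens, n_chunks, chunk_tokens=300,
                            input_price=0.25, cache_read_price=0.03,
                            cache_write_price=0.30):
    """Rough cost (USD) of contextualizing one document's chunks.

    Prices are per million tokens and are illustrative assumptions,
    not guaranteed current pricing.
    """
    per_m = 1e6
    # Without caching: the full document is re-sent for every chunk
    no_cache = n_chunks * (doc_tokens + chunk_tokens) * input_price / per_m
    # With caching: one cache write, then cheap cached reads,
    # plus the chunk-specific tokens at the normal input price
    with_cache = (doc_tokens * cache_write_price
                  + (n_chunks - 1) * doc_tokens * cache_read_price
                  + n_chunks * chunk_tokens * input_price) / per_m
    return no_cache, with_cache
```

For an 8,000-token document split into 50 chunks, this estimate puts the cached cost at well under 20% of the uncached cost, consistent with the "90% or more" figure above when documents are large relative to their chunks.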
Results
After embedding the contextualized chunks and re-running our evaluation, Pass@10 improved from ~87% to ~95%, cutting the failure rate from roughly 13% to 5% on this dataset. (The 35% headline figure is the average reduction across diverse datasets.)
3. Contextual BM25: Hybrid Search Gets Smarter
BM25 is a classic keyword-based retrieval algorithm. By applying the same contextual prefix to chunks before indexing them with BM25, we get Contextual BM25. This hybrid approach (vector + keyword) captures both semantic meaning and exact term matching.
```python
import numpy as np
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_corpus = [chunk['contextual_text'].split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_corpus)

def cosine_similarity(query_vec, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    matrix, query_vec = np.asarray(matrix), np.asarray(query_vec)
    return matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))

def hybrid_search(query, top_k=10, alpha=0.5):
    # Vector search (all_embeddings is the matrix of contextual
    # chunk embeddings computed earlier)
    query_embedding = voyage_client.embed([query], model='voyage-2').embeddings[0]
    vector_scores = cosine_similarity(query_embedding, all_embeddings)
    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize both score ranges so the weighted sum is meaningful
    vector_scores = (vector_scores - vector_scores.min()) / (np.ptp(vector_scores) + 1e-9)
    bm25_scores = (bm25_scores - bm25_scores.min()) / (np.ptp(bm25_scores) + 1e-9)
    # Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined_scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]
```
Combining Contextual Embeddings with Contextual BM25 often yields the best results, especially for codebases where exact function names matter.
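Weighted score fusion requires the two score ranges to be comparable, which is why the sketch above normalizes them first. An alternative worth knowing is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores and needs no normalization at all; the `k=60` constant below is the conventional default, an assumption rather than something this guide tunes:

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=10):
    """Fuse multiple best-first ranked lists of document IDs via RRF.

    A document's fused score is the sum of 1 / (k + rank) over
    every list in which it appears.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]
```

You would pass it the chunk IDs from the vector ranking and the BM25 ranking; documents that rank highly in both lists float to the top.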
4. Reranking for Final Precision
Even with great retrieval, the top-10 results may contain irrelevant chunks. A reranker (like Cohere's) re-scores the top candidates for maximum precision.
```python
import cohere

co = cohere.Client('your-cohere-api-key')

def rerank(query, candidates, top_k=5):
    results = co.rerank(
        model='rerank-english-v2.0',
        query=query,
        documents=[c['contextual_text'] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How do I authenticate a user?"
top_chunks = hybrid_search(query, top_k=20)
final_chunks = rerank(query, top_chunks, top_k=5)
```
Reranking adds minimal latency but can push Pass@5 from 85% to 95%+.
5. Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The code is available in the contextual-rag-lambda-function folder of the cookbook. This lets you use Contextual Retrieval without changing your existing Bedrock pipeline.
Cost Management
- Prompt caching is your best friend. Use it aggressively.
- Claude 3 Haiku is fast and cheap for context generation.
- Batch your context generation to avoid rate limits.
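One way to batch context generation while staying under rate limits is a simple worker pool with a per-request pacing sleep. This is a minimal sketch, assuming a `generate_context(chunk_text, full_document)` function like the one shown earlier and a requests-per-minute budget you've chosen for your account tier:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_all_contexts(chunks, generate_context, max_workers=4,
                          requests_per_minute=50):
    """Contextualize every chunk concurrently with a crude rate limit.

    Each of the max_workers threads spaces its own requests so that the
    aggregate rate stays near requests_per_minute.
    """
    min_interval = 60.0 * max_workers / requests_per_minute

    def worker(chunk):
        start = time.monotonic()
        context = generate_context(chunk['text'], chunk['full_document'])
        chunk['contextual_text'] = f"{context}\n\n{chunk['text']}"
        # Sleep off any remaining share of this worker's rate budget
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        return chunk

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))
```

For production workloads you'd likely want retry-with-backoff on rate-limit errors as well, but this keeps steady-state throughput within budget.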
When to Use Contextual Retrieval
This technique shines when:
- Your chunks are small (under 500 tokens)
- Documents have a clear hierarchical structure (code files, legal contracts, technical manuals)
- Queries are specific and context-dependent
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by prepending relevant context to each chunk before embedding.
- Prompt caching makes Contextual Retrieval cost-effective, reducing API costs by up to 90% for large document sets.
- Combine Contextual Embeddings with Contextual BM25 for a hybrid search that captures both semantic meaning and exact keyword matches.
- Reranking further boosts precision, pushing Pass@5 scores above 95% in many cases.
- AWS Bedrock users can deploy a Lambda function to add context during ingestion, enabling this technique without architectural changes.