Mastering Contextual Retrieval: How to Supercharge RAG with Claude and Contextual Embeddings
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to dramatically improve RAG accuracy. You'll learn to reduce retrieval failure rates by 35% using Claude, Voyage AI, and BM25 search.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support chatbots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A chunk containing "the revenue increased by 20%" is useless if the system doesn't know which company or quarter it refers to.
Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets. In this guide, you'll learn how to implement this technique using Claude, Voyage AI embeddings, and BM25 search—with practical code you can adapt for production.
What You'll Build
By the end of this guide, you'll have built a complete Contextual Retrieval pipeline that:
- Uses Claude to generate context for each document chunk
- Embeds chunks with their context for more accurate vector search
- Combines contextual embeddings with contextual BM25 for hybrid search
- Optionally adds a reranking step for maximum precision
Prerequisites
Technical Skills:
- Intermediate Python
- Basic understanding of RAG and vector databases
- Command-line proficiency
Software and Hardware:
- Python 3.8+
- Docker (optional, for BM25)
- 4GB+ RAM, 5-10 GB disk space
API Keys:
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)
Time and Cost:
- ~30-45 minutes to complete
- ~$5-10 in API costs for the full dataset
1. Setting Up Your Environment
First, install the required libraries:
```shell
pip install anthropic voyageai cohere numpy pandas rank_bm25 scikit-learn
```
Initialize your clients and load the dataset:
```python
import json

import anthropic
import voyageai

# Initialize API clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")

# Load your dataset (example structure)
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

with open("data/evaluation_set.jsonl", "r") as f:
    eval_queries = [json.loads(line) for line in f]
```
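The loading code assumes each chunk records which document it came from and each evaluation query names the chunks a correct retrieval should return. The exact schema lives in the cookbook's data files; here is a hypothetical sketch of the shapes (field names are illustrative, so check them against your actual files):

```python
# Hypothetical sketch of the data shapes the loading code assumes.
example_chunks = [
    {
        "doc_id": "doc_001",            # which document the chunk came from
        "chunk_id": "doc_001_chunk_0",  # unique chunk identifier
        "text": "def calculate_metrics(data): ...",
    },
]

example_queries = [
    {
        "query": "How do I evaluate my classifier?",
        # chunks a correct retrieval should return
        "golden_chunk_ids": ["doc_001_chunk_0"],
    },
]

# Every chunk's doc_id should resolve to a full document,
# because context generation needs the whole document as input
example_documents = {"doc_001": "full source file contents ..."}
assert all(c["doc_id"] in example_documents for c in example_chunks)
```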
2. The Problem: Contextless Chunks
In traditional RAG, you split documents into chunks and embed each chunk independently. Consider this chunk from a codebase:
```python
def calculate_metrics(data):
    return precision_score(data.y_true, data.y_pred)
```
Without context, the embedding doesn't capture that this function is part of a classification evaluation module. A query like "How do I evaluate my classifier?" might miss this chunk entirely.
Contextual Embeddings fix this by asking Claude to generate a concise context for each chunk:
```python
def generate_chunk_context(chunk_text, full_document):
    """Use Claude to generate context for a chunk."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context."""
        }]
    )
    return response.content[0].text
```
3. Implementing Contextual Embeddings
Now, let's build the full pipeline. We'll process each chunk, generate its context, and embed the combined text:
```python
import numpy as np
from typing import Dict, List

def build_contextual_embeddings(chunks: List[Dict], documents: Dict[str, str]) -> np.ndarray:
    """Generate contextual embeddings for all chunks."""
    contextual_chunks = []
    for chunk in chunks:
        doc_id = chunk["doc_id"]
        full_doc = documents[doc_id]
        # Generate context using Claude
        context = generate_chunk_context(chunk["text"], full_doc)
        # Prepend context to chunk
        contextual_text = f"{context}\n\n{chunk['text']}"
        contextual_chunks.append(contextual_text)
    # Batch embed with Voyage AI
    embeddings = vo.embed(
        contextual_chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return np.array(embeddings)
```
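With the embedding matrix built, retrieval itself is a cosine-similarity lookup: embed the query with `input_type="query"` and rank chunks by similarity. A minimal pure-NumPy sketch (the helper name is ours, not from the cookbook):

```python
import numpy as np

def top_k_chunks(query_embedding, chunk_embeddings, k=10):
    """Return indices of the k most similar chunks by cosine similarity."""
    q = np.asarray(query_embedding, dtype=float)
    m = np.asarray(chunk_embeddings, dtype=float)
    # Normalize rows so a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    # Indices of the k highest-scoring chunks, best first
    return np.argsort(scores)[::-1][:k]

# Toy example: chunk 1 points the same way as the query, so it ranks first
emb = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k_chunks([0.6, 0.8], emb, k=2))  # → [1 2]
```

In production you would swap this for a vector database, but the ranking logic is the same.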
Performance Results
On a dataset of 9 codebases with 248 queries, Contextual Embeddings improved Pass@10 from roughly 87% to 95%. The 35% figure quoted earlier is Anthropic's reported average reduction in retrieval failure rates across a broader set of datasets.
4. Managing Costs with Prompt Caching
Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing the document prefix across multiple context-generation calls:
```python
# Use prompt caching for the full document: the cached system block is
# reused across every chunk's context-generation call for that document
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>{full_document}</document>",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": f"<chunk>{chunk_text}</chunk>\n\nPlease give a short succinct context..."
    }]
)
```
Note: Prompt caching is available on Anthropic's first-party API; check the provider documentation for current availability on AWS Bedrock and GCP Vertex AI.
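To see why caching matters, here is a back-of-envelope cost sketch. The prices are function parameters, not real rates, so plug in current numbers from your provider's pricing page; the cache-write surcharge is also simplified away:

```python
def context_generation_cost(doc_tokens, chunk_tokens, n_chunks,
                            input_price, cached_price):
    """Rough input-token cost of contextualizing one document's chunks.

    Without caching, the full document is re-sent for every chunk.
    With caching, the document is sent once and read from cache afterwards.
    Prices are per token and must be supplied by the caller (illustrative only).
    """
    uncached = n_chunks * (doc_tokens + chunk_tokens) * input_price
    cached = (doc_tokens * input_price                     # first call fills the cache (approx.)
              + (n_chunks - 1) * doc_tokens * cached_price  # later calls read it cheaply
              + n_chunks * chunk_tokens * input_price)      # chunks are always uncached
    return uncached, cached

# Hypothetical: 8k-token doc, 50 chunks of 200 tokens, cache reads at 10% of input price
uncached, cached = context_generation_cost(8000, 200, 50, 1.0, 0.1)
print(f"relative savings: {1 - cached / uncached:.0%}")  # → relative savings: 86%
```

The larger the document relative to its chunks, the bigger the win, since the repeated document prefix dominates input tokens.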
5. Contextual BM25: Hybrid Search
The same context you generated for embeddings can also improve BM25 (keyword-based) search. This creates a powerful hybrid system:
```python
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity

def build_contextual_bm25(contextual_chunks: List[str]):
    """Build a BM25 index from contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    return BM25Okapi(tokenized_chunks)

def hybrid_search(query: str, vector_embeddings, bm25_index, alpha=0.5):
    """Combine vector and BM25 scores."""
    # Vector similarity
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    vector_scores = cosine_similarity([query_embedding], vector_embeddings)[0]
    # BM25 scores
    bm25_scores = bm25_index.get_scores(query.split())
    # Min-max normalize each score set to [0, 1], then blend
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    # Chunk indices sorted by descending combined score
    return combined.argsort()[::-1]
```
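Min-max score blending is sensitive to outliers in either score distribution. Reciprocal rank fusion (RRF) is a common, scale-free alternative that combines the two result lists by rank position rather than raw score; a sketch, where the `k=60` constant is the conventional default rather than anything this guide prescribes:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple rankings (lists of chunk indices, best first) with RRF."""
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            # Each list contributes 1/(k + rank) for the items it ranks highly
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Chunk 2 sits near the top of both lists, so it wins overall
vector_ranking = [0, 2, 1, 3]
bm25_ranking = [2, 3, 0, 1]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))  # → [2, 0, 3, 1]
```

Because RRF only looks at rank positions, it needs no per-query normalization and is a reasonable drop-in replacement for the weighted-sum combination above.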
6. Adding Reranking for Maximum Precision
For production systems, add a reranking step using Cohere's rerank API:
```python
import cohere

co = cohere.Client("your-cohere-key")

def rerank_results(query: str, candidates: List[str], top_k: int = 10):
    """Rerank retrieved chunks for maximum relevance."""
    results = co.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    # Map the returned indices back to the original candidate strings
    return [candidates[r.index] for r in results.results]
```
7. Putting It All Together
Here's the complete pipeline:
```python
def contextual_rag_pipeline(query: str, chunks, documents, vector_embeddings, bm25_index):
    # 1. Hybrid retrieval over contextual embeddings + contextual BM25.
    #    Queries are searched as-is; contextualization happens at indexing time.
    top_indices = hybrid_search(query, vector_embeddings, bm25_index)
    retrieved_chunks = [chunks[i] for i in top_indices[:20]]
    # 2. Rerank the top candidates down to the most relevant few
    reranked = rerank_results(query, [c["text"] for c in retrieved_chunks], top_k=5)
    # 3. Generate an answer with Claude, grounded in the reranked chunks
    context = "\n\n".join(reranked)
    response = claude.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
Deployment Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context during chunking. The code is available in the contextual-rag-lambda-function directory of the cookbook repository. Configure it as a custom chunking option when creating your knowledge base.
Cost Optimization
- Use Claude 3 Haiku for context generation (fastest, cheapest)
- Batch your context generation calls
- Cache generated contexts in a database for reuse
- Use prompt caching to reduce token usage by up to 90%
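The "cache generated contexts" point can be as simple as a SQLite table keyed by a hash of the document and chunk, so re-indexing never pays for the same Claude call twice. A minimal sketch (the table name, keying scheme, and helper are illustrative, not from the cookbook):

```python
import hashlib
import sqlite3

def get_or_generate_context(conn, chunk_text, full_document, generate_fn):
    """Return a cached context if present, otherwise generate and store it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contexts (key TEXT PRIMARY KEY, context TEXT)"
    )
    key = hashlib.sha256((full_document + "\x00" + chunk_text).encode()).hexdigest()
    row = conn.execute("SELECT context FROM contexts WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]  # cache hit: no API call needed
    context = generate_fn(chunk_text, full_document)  # e.g. generate_chunk_context
    conn.execute("INSERT INTO contexts VALUES (?, ?)", (key, context))
    conn.commit()
    return context

# Demo with a stub generator that records how often it actually runs
calls = []
def stub_generate(chunk, doc):
    calls.append(chunk)
    return f"context for: {chunk}"

conn = sqlite3.connect(":memory:")
get_or_generate_context(conn, "chunk A", "doc", stub_generate)
get_or_generate_context(conn, "chunk A", "doc", stub_generate)  # served from cache
print(len(calls))  # → 1
```

Pointing `sqlite3.connect` at a file instead of `:memory:` makes the cache persist across indexing runs.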
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by prepending relevant context to each chunk before embedding, solving the "lost context" problem in traditional RAG
- Hybrid search with Contextual BM25 combines semantic and keyword matching for more robust retrieval—use both vector and BM25 scores
- Prompt caching makes this practical for production by dramatically reducing the cost of generating context for thousands of chunks
- Reranking adds a final precision boost—use Cohere's rerank API or Claude itself to reorder retrieved chunks before generation
- Start simple, then layer complexity: Begin with Contextual Embeddings alone, then add BM25 and reranking as needed for your use case