Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI
This guide walks you through implementing Contextual Retrieval—adding relevant context to each document chunk before embedding—to reduce retrieval failure rates by up to 35% with Claude AI. You'll learn setup, evaluation, and production deployment tips.
Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to answer questions based on your internal knowledge bases, codebases, or document repositories. However, traditional RAG systems often stumble when individual document chunks lack sufficient context—a problem that leads to missed retrievals and incomplete answers.
In this guide, we'll introduce Contextual Retrieval, a technique that dramatically improves retrieval accuracy by adding relevant context to each chunk before embedding. Based on Anthropic's research, this method reduces the top-20-chunk retrieval failure rate by an average of 35% across diverse datasets. We'll walk through implementation using Python, Claude, and supporting APIs, with practical code examples and evaluation metrics.
What You'll Learn
- How to set up a baseline RAG pipeline for evaluation
- The theory behind Contextual Embeddings and why they work
- Step-by-step implementation of Contextual Embeddings with prompt caching
- How to extend the technique to BM25 search (Contextual BM25)
- How to further boost performance with reranking
Prerequisites
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings

Environment:
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ RAM, ~5-10 GB disk space

API Keys:
- Anthropic API key (free tier works)
- Voyage AI API key for embeddings
- Cohere API key for reranking

Time and Cost:
- Setup: 30–45 minutes
- API costs: ~$5–10 for the full dataset
1. Setting Up the Baseline RAG Pipeline
Before improving retrieval, we need a baseline. We'll use a pre-chunked dataset of 9 codebases (248 queries with golden chunks) and evaluate using Pass@k—whether the correct chunk appears in the top-k retrieved results.
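Since we'll report Pass@10 for every variant in this guide, here's a minimal reusable helper (a sketch, assuming each evaluation entry has `query` and `golden_chunk_id` fields and each chunk has an `id`, matching the dataset used below):

```python
def pass_at_k(queries, search_fn, k=10):
    # Fraction of queries whose golden chunk appears in the top-k results
    hits = sum(
        any(r['id'] == q['golden_chunk_id'] for r in search_fn(q['query'], k=k))
        for q in queries
    )
    return hits / len(queries)
```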
Install Dependencies
```bash
pip install anthropic voyageai cohere numpy rank_bm25
```
Load and Prepare Data
```python
import json

# Load chunks and evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_queries = [json.loads(line) for line in f]
```
Create Embeddings and Vector Store
```python
import numpy as np
import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks (for large datasets, embed in batches;
# embedding APIs typically cap the number of texts per request)
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple in-memory matrix (use FAISS or Chroma for production)
embedding_matrix = np.array(embeddings)
```
Evaluate Baseline Performance
```python
def search(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(embedding_matrix, query_emb)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]

# Pass@10 evaluation: count queries whose golden chunk is retrieved
pass_at_10 = 0
for q in eval_queries:
    results = search(q['query'], k=10)
    golden_id = q['golden_chunk_id']
    if any(r['id'] == golden_id for r in results):
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.2%}")
```
Typical baseline: ~87% Pass@10.
2. Understanding Contextual Embeddings
The core problem: when you split a document into chunks, each chunk loses its surrounding context. A code snippet like def calculate_total(): might be meaningless without knowing it belongs to an invoice processing module.
- Original chunk: "def calculate_total(): return subtotal + tax"
- Contextual chunk: "This function is part of the Invoice class in the billing module. It calculates the total amount including tax. Code: def calculate_total(): return subtotal + tax"
Why It Works
- Semantic enrichment: The embedding captures both the chunk's content and its role in the larger document.
- Disambiguation: Similar chunks from different contexts become distinguishable.
- Improved matching: Queries that reference the broader topic now match more accurately.
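To see the effect concretely, you can embed both versions of the example chunk above and compare how each scores against a topical query (a toy sketch reusing the Voyage client from the baseline setup; actual scores will vary):

```python
# Compare query similarity for the raw vs. contextualized example chunk
query = "How does the billing module compute invoice totals?"
texts = [
    "def calculate_total(): return subtotal + tax",
    "This function is part of the Invoice class in the billing module. "
    "It calculates the total amount including tax. "
    "Code: def calculate_total(): return subtotal + tax",
]
raw, ctx, q = (np.array(e) for e in vo.embed(texts + [query], model="voyage-2").embeddings)
print("raw chunk score:       ", np.dot(raw, q))
print("contextual chunk score:", np.dot(ctx, q))  # usually higher for topical queries
```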
Managing Costs with Prompt Caching
Generating context for every chunk with Claude could be expensive. Prompt caching (available on Anthropic's first-party API) reduces costs by reusing the full document context across multiple chunk requests. The system prompt and document are cached once, and only the chunk-specific instruction changes.
```python
import anthropic

client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Full document text
document = """... (entire document content) ..."""

# Generate context for each chunk. The document is marked with
# cache_control so it is written to the cache on the first request
# and reused for every subsequent chunk from the same document.
contexts = []
for chunk in chunks:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system="You are a document context generator. Given a document and a chunk, provide a brief context (1-2 sentences) describing the chunk's role.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Document:\n{document}",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": f"Chunk: {chunk['content']}\n\nProvide context:",
                    },
                ],
            }
        ],
    )
    contexts.append(response.content[0].text)
```
Note: Prompt caching is available on Anthropic's API and coming soon to AWS Bedrock and GCP Vertex. For Bedrock, AWS provides a Lambda function for custom chunking (see contextual-rag-lambda-function/lambda_function.py in the cookbook).
3. Implementing Contextual Embeddings
Now let's implement the full pipeline.
Step 1: Generate Context for Each Chunk
```python
def generate_context(document, chunk, client):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system="Provide a concise context (1-2 sentences) describing this chunk's role in the document. Focus on what the chunk does and where it fits.",
        messages=[
            {
                "role": "user",
                "content": [
                    # Cache the document so repeated calls for the same
                    # document hit the cache instead of reprocessing it
                    {
                        "type": "text",
                        "text": f"Document:\n{document}",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {"type": "text", "text": f"Chunk: {chunk}\n\nContext:"},
                ],
            }
        ],
    )
    return response.content[0].text
```
Step 2: Create Contextual Embeddings
```python
contextual_chunks = []
for chunk in chunks:
    context = generate_context(document, chunk['content'], client)
    contextual_text = f"{context}\n\n{chunk['content']}"
    contextual_chunks.append(contextual_text)

# Embed the contextual chunks
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings
```
Step 3: Evaluate
```python
# Re-run evaluation with contextual embeddings
contextual_matrix = np.array(contextual_embeddings)

def contextual_search(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(contextual_matrix, query_emb)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]

# Evaluate Pass@10
pass_at_10_contextual = 0
for q in eval_queries:
    results = contextual_search(q['query'], k=10)
    if any(r['id'] == q['golden_chunk_id'] for r in results):
        pass_at_10_contextual += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10_contextual / len(eval_queries):.2%}")
```
Expected improvement: ~87% → ~95% Pass@10.
4. Contextual BM25: Hybrid Search
Contextual retrieval isn't limited to embeddings. You can apply the same context to BM25 (a keyword-based search algorithm) and combine it with embeddings for hybrid search.
Implementing Contextual BM25
```python
from rank_bm25 import BM25Okapi

# Tokenize the contextual chunks for keyword search
tokenized_contextual = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_contextual)

def bm25_search(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]
```
Hybrid Search (Embeddings + BM25)
```python
def hybrid_search(query, k=10, alpha=0.5):
    # Get scores from both methods
    emb_scores = np.dot(contextual_matrix, vo.embed([query], model="voyage-2").embeddings[0])
    bm25_scores = bm25.get_scores(query.split())

    # Min-max normalize both score ranges to [0, 1]
    emb_scores = (emb_scores - emb_scores.min()) / (emb_scores.max() - emb_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())

    # Weighted combination: alpha balances embeddings vs. BM25
    combined = alpha * emb_scores + (1 - alpha) * bm25_scores
    top_k_indices = np.argsort(combined)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]
```
Hybrid search often yields the best results, capturing both semantic and keyword matches.
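A quick way to tune the balance is to sweep alpha against the evaluation set, reusing the pass_at_k helper from the baseline section (a sketch; the best alpha depends on your data):

```python
# Sweep the embedding/BM25 weighting and measure Pass@10 for each setting
for alpha in (0.2, 0.5, 0.8):
    score = pass_at_k(
        eval_queries,
        lambda q, k, a=alpha: hybrid_search(q, k=k, alpha=a),
        k=10,
    )
    print(f"alpha={alpha}: Pass@10 = {score:.2%}")
```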
5. Improving Performance with Reranking
Reranking adds a final layer of precision. After retrieving top-k candidates, a cross-encoder model re-scores them based on deeper semantic relevance to the query.
```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_n=5):
    # Prepare documents for reranking
    docs = [c['content'] for c in candidates]
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n
    )
    # Map reranked positions back to the original candidate chunks
    return [candidates[r.index] for r in results.results]

# Use with hybrid search: over-retrieve (k=20), then rerank down to 5
query = eval_queries[0]['query']
initial_results = hybrid_search(query, k=20)
final_results = rerank(query, initial_results, top_n=5)
```
Reranking typically adds 2–5% to Pass@k and significantly improves user-perceived relevance.
Production Considerations
- Prompt Caching: Essential for cost-effective context generation at scale. Cache the full document and system prompt; only vary the chunk.
- AWS Bedrock Integration: Use the provided Lambda function (contextual-rag-lambda-function/lambda_function.py) as a custom chunking option in Bedrock Knowledge Bases.
- Vector Database Choice: For production, use FAISS, Pinecone, or Weaviate with proper indexing.
- Batch Processing: Generate contexts and embeddings in batches to reduce API calls (see the sketch after this list).
- Monitoring: Track Pass@k over time to detect drift.
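For the batch-processing point above, one simple approach is to group texts per request (a sketch; the 128-text batch size is an assumption, so check your embedding provider's documented limits):

```python
# Hypothetical batch size; consult your embedding provider's limits
BATCH_SIZE = 128

def embed_in_batches(texts, batch_size=BATCH_SIZE):
    # Group texts so each API call stays within per-request limits
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        all_embeddings.extend(vo.embed(batch, model="voyage-2").embeddings)
    return all_embeddings
```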
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by enriching chunks with surrounding document context before embedding.
- Prompt caching makes this technique cost-effective for production by caching the document and system prompt across multiple chunk requests.
- Contextual BM25 extends the same idea to keyword search, and hybrid search (embeddings + BM25) often yields the best results.
- Reranking adds a final precision layer, improving user-perceived relevance by 2–5%.
- The technique works across platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex—making it accessible for enterprise deployments.