Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI Users
Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25 with Claude AI. Includes code examples, evaluation metrics, and production tips.
This guide teaches you how to implement Contextual Retrieval (adding relevant context to document chunks before embedding) to reduce retrieval failure rates by an average of 35% and improve RAG accuracy with Claude AI.
Introduction
Retrieval Augmented Generation (RAG) is a powerful technique that enables Claude AI to answer questions using your internal knowledge bases, codebases, or any document corpus. However, traditional RAG systems often struggle when individual document chunks lack sufficient context—a problem that leads to missed retrievals and lower-quality answers.
Contextual Retrieval solves this by adding relevant context to each chunk before embedding. This simple but effective method improves the quality of each embedded chunk, allowing for more accurate retrieval and better overall performance. In tests across multiple data sources, Contextual Embeddings reduced the top-20-chunk retrieval failure rate by an average of 35%.

In this guide, you'll learn how to build and optimize a Contextual Retrieval system using Claude AI. We'll cover:
- Setting up a basic retrieval pipeline as a baseline
- Implementing Contextual Embeddings with prompt caching for cost efficiency
- Enhancing BM25 search with contextual information
- Improving results further with reranking
Prerequisites
Before starting, ensure you have:
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings
Environment:
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ available RAM
- ~5-10 GB disk space for vector databases

API Keys:
- Anthropic API key (free tier sufficient)
- Voyage AI API key
- Cohere API key (for reranking)

Time and Cost:
- Expected completion: 30-45 minutes
- API costs: ~$5-10 for the full dataset
Step 1: Setting Up the Basic RAG Pipeline
First, let's establish a baseline. We'll use a dataset of 9 codebases, pre-chunked using character splitting. The evaluation dataset contains 248 queries, each with a "golden chunk" that represents the correct answer.
import json
from typing import List, Dict

import numpy as np
import voyageai
from anthropic import Anthropic

# Load data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple retrieval function
def retrieve(query: str, top_k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    # Score the query against every chunk embedding
    scores = [cosine_similarity(query_embedding, emb) for emb in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [chunks[i] for i in top_indices]
Evaluation Metric: Pass@k
We'll use Pass@k to measure performance—whether the golden chunk appears in the first k retrieved documents. Our baseline Pass@10 is approximately 87%.
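A quick way to compute this metric with the retrieve function above is sketched below. The field names query and golden_chunk_id (and the chunk's id field) are assumptions about the evaluation file's schema; adjust them to match your data.

def pass_at_k(eval_set: List[Dict], k: int = 10) -> float:
    """Fraction of queries whose golden chunk appears in the top-k retrieved chunks."""
    hits = 0
    for item in eval_set:
        # 'query', 'golden_chunk_id', and 'id' are assumed field names
        retrieved = retrieve(item['query'], top_k=k)
        if any(c['id'] == item['golden_chunk_id'] for c in retrieved):
            hits += 1
    return hits / len(eval_set)

print(f"Pass@10: {pass_at_k(eval_data, k=10):.1%}")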
Step 2: Implementing Contextual Embeddings
Contextual Embeddings add relevant context to each chunk before embedding. This context typically includes:
- The document title or source
- Surrounding chunk summaries
- Key entities or concepts from the broader document
def generate_chunk_context(chunk: Dict, full_document: str) -> str:
    """Generate context for a chunk using Claude."""
    # Truncate the document to its first 2000 characters to keep the prompt small
    prompt = f"""Given the following chunk from a codebase document, provide a brief context (2-3 sentences) that explains what this chunk is about and how it fits into the larger document.

Full document context:
{full_document[:2000]}

Chunk content:
{chunk['content']}

Context:"""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Apply to all chunks (see the prompt-caching example below for cost efficiency)
contextual_chunks = []
for chunk in chunks[:10]:  # Example: first 10 chunks
    context = generate_chunk_context(chunk, chunk.get('document', ''))
    contextual_chunks.append({
        'original': chunk,
        'context': context,
        'contextual_content': f"{context}\n\n{chunk['content']}"
    })
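With context generated, the enriched text is embedded in place of the raw chunk text; the resulting contextual_embeddings are what the hybrid search in Step 3 scores against. A minimal sketch, reusing the same Voyage client as above:

# Embed the context-enriched chunks; these vectors replace the plain-chunk embeddings
contextual_texts = [c['contextual_content'] for c in contextual_chunks]
contextual_embeddings = vo.embed(contextual_texts, model="voyage-2").embeddings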
Why Prompt Caching Matters
Generating context for every chunk individually can be expensive, because the full document is resent for each chunk. Prompt caching (available on Anthropic's first-party API) dramatically reduces costs: the document is placed in a cached system prompt once, and every subsequent chunk from that document reads it from the cache at a fraction of the normal input-token price. This makes Contextual Embeddings practical for production.
# Example with prompt caching: the full document lives in a cached system prompt,
# so only the short per-chunk user message changes between calls
system_prompt = f"<document>\n{full_document}\n</document>"  # illustrative system prompt
chunk_prompt = f"Situate this chunk within the document above:\n\n{chunk['content']}"  # illustrative per-chunk prompt

response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=150,
    system=[{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": chunk_prompt}]
)
Performance Improvement: After implementing Contextual Embeddings, our Pass@10 improved from ~87% to ~95%—a significant reduction in retrieval failures.
Step 3: Contextual BM25
BM25 is a traditional keyword-based retrieval method that complements embedding-based search. By applying the same chunk-specific context to BM25, we can further improve hybrid search performance.
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_contextual = [chunk['contextual_content'].split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_contextual)
# Hybrid search: combine BM25 and embedding scores
def hybrid_search(query: str, top_k: int = 10, alpha: float = 0.5) -> List[Dict]:
    # Embedding score against the contextual embeddings
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(query_embedding, emb) for emb in contextual_embeddings]
    # BM25 score over the contextual chunks
    bm25_scores = bm25.get_scores(query.split())
    # Normalize both score lists and take a weighted combination
    combined = [
        alpha * (emb_scores[i] / max(emb_scores)) +
        (1 - alpha) * (bm25_scores[i] / max(bm25_scores))
        for i in range(len(contextual_chunks))
    ]
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:top_k]
    # Return the original chunk dicts for the best-scoring contextual chunks
    return [contextual_chunks[i]['original'] for i in top_indices]
Step 4: Reranking for Final Precision
Reranking adds a final layer of accuracy by using a cross-encoder model to reorder the top-k results. This step is especially useful when you need high precision.
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    # Cross-encoder reranking of the candidate chunks
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c['content'] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How does the authentication module handle JWT tokens?"
initial_results = hybrid_search(query, top_k=20)
final_results = rerank(query, initial_results, top_k=5)
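To close the loop from retrieval to generation, the reranked chunks can be passed to Claude as context for answering the query. A minimal sketch, reusing the Anthropic client from Step 1:

# Assemble the retrieved chunks into a single context block
context_block = "\n\n".join(c['content'] for c in final_results)

answer = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context_block}\n\n"
            f"Question: {query}"
        )
    }]
)
print(answer.content[0].text)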
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can implement Contextual Retrieval as a custom Lambda function for chunking. The AWS team provides a reference implementation in the contextual-rag-lambda-function directory.
# Lambda function skeleton (from contextual-rag-lambda-function/lambda_function.py)
def lambda_handler(event, context):
    # Extract document chunks from event
    chunks = event['chunks']
    # Generate context for each chunk using Claude
    # (generate_context wraps a Claude call like generate_chunk_context above)
    contextual_chunks = []
    for chunk in chunks:
        context = generate_context(chunk, event['document'])
        contextual_chunks.append({
            **chunk,
            'content': f"{context}\n\n{chunk['content']}"
        })
    return {'chunks': contextual_chunks}
Cost Optimization
- Prompt caching reduces context generation costs by up to 90%
- Batch chunks from the same document together so the cached document prefix is reused (see the sketch after this list)
- Use Claude Haiku for context generation (fastest/cheapest model)
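The batching pattern is sketched below: chunks are grouped by their source document so the document placed in the cached system prompt is written to the cache once, then read cheaply for every remaining chunk of that document. As in the earlier examples, this assumes each chunk carries its full source text in a document field; note also that very short documents may fall below Anthropic's minimum cacheable prompt length.

from collections import defaultdict

# Group chunks by their source document so the cached prefix is reused
chunks_by_doc = defaultdict(list)
for chunk in chunks:
    chunks_by_doc[chunk.get('document', '')].append(chunk)

for document, doc_chunks in chunks_by_doc.items():
    # The full document goes into the cached system prompt once per document
    cached_system = [{
        "type": "text",
        "text": f"<document>\n{document}\n</document>",
        "cache_control": {"type": "ephemeral"}
    }]
    for chunk in doc_chunks:
        # Only this short per-chunk message changes between calls
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            system=cached_system,
            messages=[{
                "role": "user",
                "content": f"Give a short context situating this chunk within the document:\n\n{chunk['content']}"
            }]
        )
        chunk['context'] = response.content[0].text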
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% on average by adding relevant context to each chunk before embedding, significantly improving RAG accuracy.
- Prompt caching makes Contextual Retrieval cost-effective for production by caching the full document once and reusing it across that document's chunks, reducing API costs by up to 90%.
- Hybrid search with Contextual BM25 combines the strengths of semantic and keyword-based retrieval, further improving performance over embeddings alone.
- Reranking adds final precision to your retrieval pipeline, ensuring the most relevant results appear at the top.
- AWS Bedrock Knowledge Bases support custom chunking via a Lambda function, making Contextual Retrieval deployable in enterprise environments.