Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI Users
Learn how to improve RAG performance using Contextual Embeddings and BM25 with Claude AI. Includes code examples, evaluation metrics, and production tips.
This guide shows you how to boost RAG accuracy by adding context to document chunks before embedding. You'll learn Contextual Embeddings, Contextual BM25, and reranking techniques to reduce retrieval failure rates by up to 35%.
Introduction
Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to answer questions using your internal knowledge bases, codebases, or document repositories. However, traditional RAG systems often stumble when individual document chunks lack sufficient context—a single code function or paragraph snippet may be meaningless on its own.
Contextual Retrieval solves this problem by enriching each chunk with relevant context before embedding. The result? More accurate retrieval, better answers, and a 35% reduction in top-20 retrieval failures across tested datasets.
In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude and supporting APIs. We'll walk through a complete pipeline using a dataset of 9 codebases, evaluate performance with Pass@k metrics, and show you how prompt caching makes this approach production-ready.
What You'll Need
Prerequisites
- Intermediate Python skills
- Basic understanding of RAG and vector databases
- Command-line proficiency
System Requirements
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ RAM
- 5–10 GB disk space for vector databases
API Keys
- Anthropic API key (free tier works)
- Voyage AI API key for embeddings
- Cohere API key for reranking
Time & Cost
- Setup: 30–45 minutes
- API costs: ~$5–10 for the full dataset
Step 1: Basic RAG Pipeline (Baseline)
Before improving retrieval, establish a baseline. We'll split documents into chunks, embed them, and measure Pass@10 performance.
import json
import numpy as np
from typing import List, Dict
import voyageai

# Initialize Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# Load pre-chunked dataset (from data/codebase_chunks.json)
# Each chunk has: id, content, source_file
with open("data/codebase_chunks.json") as f:
    chunks = json.load(f)

# Generate embeddings for all chunks
# (for large corpora, send embedding requests in batches)
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Build a simple vector store (using numpy for demo)
vector_store = np.array(embeddings)
def retrieve(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(vector_store, query_embedding)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]
Now evaluate with Pass@10: load evaluation_set.jsonl (which pairs each query with its golden chunk IDs) and check whether a golden chunk appears in the top 10 results.
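Here's a minimal evaluation sketch; the query and golden_chunk_ids field names are assumptions about the evaluation file's schema, so adjust them to match your data.

# Minimal Pass@10 evaluation (assumes each JSONL record has "query" and "golden_chunk_ids")
with open("evaluation_set.jsonl") as f:
    eval_set = [json.loads(line) for line in f]

hits = 0
for example in eval_set:
    top_ids = {c["id"] for c in retrieve(example["query"], k=10)}
    if any(gid in top_ids for gid in example["golden_chunk_ids"]):
        hits += 1

print(f"Pass@10: {hits / len(eval_set):.1%}")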
Expected baseline: ~87% Pass@10 on the codebase dataset.
Step 2: Contextual Embeddings
Contextual Embeddings prepend a chunk-specific context to each chunk before embedding. This context is generated by Claude, which understands the chunk's role in the broader document.
How It Works
- For each chunk, send the full document plus the chunk to Claude and ask for a short context that situates the chunk within the document (the exact prompt is shown in the implementation below).
- Prepend the generated context to the chunk text.
- Embed the enriched chunk.
- At query time, search against enriched embeddings.
Implementation
import anthropic
client = anthropic.Anthropic(api_key="your-anthropic-api-key")
def generate_chunk_context(document: str, chunk: str) -> str:
    """Generate context for a single chunk using Claude."""
    prompt = f"""<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
# Apply to all chunks; this assumes each chunk record also carries the
# full text of its parent document under a "document" key
enriched_chunks = []
for chunk in chunks:
    context = generate_chunk_context(chunk["document"], chunk["content"])
    enriched_text = f"{context}\n\n{chunk['content']}"
    enriched_chunks.append({**chunk, "enriched_text": enriched_text})
# Embed enriched chunks
enriched_embeddings = vo.embed(
    [c["enriched_text"] for c in enriched_chunks],
    model="voyage-2"
).embeddings
Performance Boost
After implementing Contextual Embeddings, re-run your evaluation. Expect Pass@10 to jump from ~87% to ~95%—a significant reduction in retrieval failures.
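To re-run the evaluation against the enriched index, simply rebuild the in-memory vector store; the retrieve function and Pass@10 loop from Step 1 then work unchanged.

# Swap the baseline embeddings for the enriched ones and reuse retrieve()
vector_store = np.array(enriched_embeddings)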
Step 3: Optimizing Costs with Prompt Caching
Generating context for every chunk can be expensive. Prompt caching reduces costs by reusing the document prefix across chunk requests.
# Enable prompt caching by marking the document as a cache control point
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    system=[
        {
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": f"<chunk>{chunk}</chunk>\n\nProvide context..."}]
)
With caching, you only pay the full document token cost once. Subsequent chunks reuse the cached prefix, slashing API costs by 50–80%.
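You can confirm caching is working by inspecting the usage block on each response; the Anthropic Messages API reports cache writes and reads separately.

# First request per document pays to write the cache; later chunks read it cheaply
print(response.usage.cache_creation_input_tokens)  # > 0 on the first chunk of a document
print(response.usage.cache_read_input_tokens)      # > 0 on subsequent chunks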
Note: Prompt caching is available on Anthropic's first-party API; check provider documentation for availability on AWS Bedrock and Google Cloud Vertex AI.
Step 4: Contextual BM25
BM25 is a keyword-based retrieval method that complements dense embeddings. Contextual BM25 applies the same chunk context to BM25 indexing.
Implementation
from rank_bm25 import BM25Okapi

# Tokenize enriched texts
enriched_texts = [c["enriched_text"] for c in enriched_chunks]
tokenized_corpus = [text.split() for text in enriched_texts]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_retrieve(query: str, k: int = 10) -> List[Dict]:
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]
# Hybrid search: combine dense + BM25 scores via simple rank-based weighting
chunks_by_id = {c["id"]: c for c in chunks}

def hybrid_retrieve(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    dense_results = retrieve(query, k * 2)  # Get more candidates
    bm25_results = bm25_retrieve(query, k * 2)

    # Combine scores (simplified): earlier ranks earn higher weight
    combined_scores = {}
    for i, chunk in enumerate(dense_results):
        combined_scores[chunk["id"]] = alpha * (1 - i / (k * 2))
    for i, chunk in enumerate(bm25_results):
        if chunk["id"] in combined_scores:
            combined_scores[chunk["id"]] += (1 - alpha) * (1 - i / (k * 2))
        else:
            combined_scores[chunk["id"]] = (1 - alpha) * (1 - i / (k * 2))

    sorted_ids = sorted(combined_scores, key=combined_scores.get, reverse=True)[:k]
    return [chunks_by_id[i] for i in sorted_ids]
Hybrid search with Contextual BM25 typically yields another 2–5% improvement in Pass@10.
Step 5: Reranking for Final Precision
Reranking applies a cross-encoder model to reorder the top-k results from your hybrid search. This adds a small latency cost but can push accuracy even higher.
import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["enriched_text"] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]
# Full pipeline
query = "How does the authentication module handle token refresh?"
candidates = hybrid_retrieve(query, k=20)
final_results = rerank(query, candidates, top_k=10)
Reranking can push Pass@10 beyond 97% on well-structured datasets.
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context during chunking. The Anthropic cookbook includes a ready-to-use Lambda in the contextual-rag-lambda-function folder. Configure it as a custom chunking option in your Bedrock Knowledge Base.
Latency vs. Accuracy Trade-offs
- Contextual Embeddings: Adds upfront processing time but zero query-time latency.
- Contextual BM25: Minimal query-time overhead.
- Reranking: Adds 100–500ms per query but delivers the highest accuracy.
Scaling
For large corpora (millions of chunks), consider:
- Batching context generation with Claude
- Using approximate nearest neighbor (ANN) indexes like FAISS (see the sketch after this list)
- Pre-computing BM25 indices
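As a minimal sketch of the ANN option, here's an HNSW index built with FAISS over the enriched embeddings. It assumes the faiss-cpu package is installed, and the graph parameter (32 links per node) is illustrative, not tuned.

# Build an approximate nearest neighbor index over the enriched embeddings
import faiss

emb = np.array(enriched_embeddings, dtype="float32")
index = faiss.IndexHNSWFlat(emb.shape[1], 32)  # HNSW graph with 32 links per node
index.add(emb)

# Query: embed, then search the index instead of the brute-force numpy store
query_emb = np.array(
    vo.embed(["How does the authentication module handle token refresh?"],
             model="voyage-2").embeddings,
    dtype="float32"
)
distances, indices = index.search(query_emb, 10)  # top-10 approximate matches
top_chunks = [chunks[i] for i in indices[0]]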
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by enriching chunks with document-level context before embedding.
- Prompt caching cuts costs by 50–80% when generating context for many chunks from the same document.
- Hybrid search (dense + BM25) outperforms either method alone—Contextual BM25 adds another 2–5% improvement.
- Reranking pushes accuracy to 97%+ but adds latency; use it when precision is critical.
- Production-ready on AWS Bedrock via a custom Lambda function for chunking—no vendor lock-in.