Contextual Retrieval: Supercharge Your RAG System with Claude and Context-Aware Chunks
Learn how to reduce retrieval failure by 35% using Contextual Embeddings and BM25 with Claude. A practical guide to building high-performance RAG systems.
This guide shows you how to improve RAG retrieval by adding context to each chunk before embedding. Using Contextual Embeddings and BM25, you can reduce top-20 retrieval failure by 35% and boost Pass@10 from 87% to 95%.
Introduction
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A. But traditional RAG has a blind spot: when you split documents into chunks, those chunks often lose the surrounding context. A code snippet like def process(): means nothing without knowing it's part of a payment processing module.
What You'll Need
Technical Skills
- Intermediate Python
- Basic RAG understanding
- Familiarity with vector databases
- Command-line basics
System & API Requirements
- Python 3.8+
- 4GB+ RAM, ~5-10 GB disk space
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)
Time & Cost
- Time: 30–45 minutes
- Cost: ~$5–10 for the full dataset
Step 1: Basic RAG Pipeline (Baseline)
Let's start with a simple RAG pipeline to establish a performance baseline. We'll use a pre-chunked dataset of 9 codebases (248 queries with golden chunks).
import json

import numpy as np
import voyageai
from anthropic import Anthropic

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Load chunks and queries
with open("data/codebase_chunks.json") as f:
    chunks = json.load(f)
with open("data/evaluation_set.jsonl") as f:
    queries = [json.loads(line) for line in f]

# Embed all chunks (for a large corpus, split this into batches to stay within
# the embedding API's per-request limits)
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings
# For each query, find the top-10 chunks by cosine similarity
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = [cosine_similarity(q_emb, e) for e in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]
# Evaluate Pass@10
pass_at_10 = 0
for q in queries:
    results = search(q["query"])
    if any(r["id"] == q["golden_chunk_id"] for r in results):
        pass_at_10 += 1
print(f"Baseline Pass@10: {pass_at_10/len(queries)*100:.1f}%")
# Expected: ~87%
Step 2: Contextual Embeddings
The Problem
When you split a document, each chunk loses its broader context. A chunk containing def calculate_tax(): from a financial report is ambiguous—is it for payroll, sales, or corporate tax? Without context, the embedding vector is less precise.
The Solution
Before embedding, prepend a short context snippet to each chunk. Claude generates this context using the full document and the chunk itself.

def generate_context(chunk_content, full_document):
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Then embed the augmented chunk:
augmented_chunks = []
for chunk in chunks:
    context = generate_context(chunk["content"], chunk["document"])
    augmented_text = f"{context}\n\n{chunk['content']}"
    augmented_chunks.append(augmented_text)

# Embed augmented chunks
contextual_embeddings = vo.embed(augmented_chunks, model="voyage-2").embeddings
Why Prompt Caching Matters
Generating context for thousands of chunks can be expensive. Prompt caching reduces cost by reusing the full document prefix across chunks from the same document.

# With prompt caching (Anthropic API)
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    system=[{"type": "text", "text": full_document, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": f"<chunk>{chunk_content}</chunk>..."}]
)
This reduces API costs by ~50–70% for large document sets.
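In practice, fold caching into the context-generation loop itself: cache hits only occur when requests for the same document arrive back-to-back, before the short-lived cache entry expires. Here is a minimal sketch under those assumptions, reusing the same chunks structure as above; the helper name generate_context_cached and its prompt wording are illustrative:

def generate_context_cached(chunk_content, full_document):
    # The full document sits in a cached system block, so repeated calls for the
    # same document reuse the prefix instead of re-processing it each time.
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        system=[{"type": "text", "text": full_document,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": (
            f"<chunk>{chunk_content}</chunk>\n"
            "Give a short succinct context situating this chunk within the document, "
            "for the purposes of search retrieval. Answer with the context only."
        )}]
    )
    return response.content[0].text

# Keep chunks from the same document together so consecutive calls share the cache
augmented_chunks = []
for chunk in sorted(chunks, key=lambda c: c["document"]):
    context = generate_context_cached(chunk["content"], chunk["document"])
    augmented_chunks.append(f"{context}\n\n{chunk['content']}")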
Performance Lift
After implementing Contextual Embeddings, re-run the evaluation:

# Same search function as before, but scoring against contextual_embeddings
def contextual_search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = [cosine_similarity(q_emb, e) for e in contextual_embeddings]
    return [chunks[i] for i in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]]

pass_at_10 = 0
for q in queries:
    results = contextual_search(q["query"])
    if any(r["id"] == q["golden_chunk_id"] for r in results):
        pass_at_10 += 1
print(f"Contextual Pass@10: {pass_at_10/len(queries)*100:.1f}%")
# Expected: ~95%
Result: Pass@10 jumps from ~87% to ~95%.
Step 3: Contextual BM25
BM25 is a keyword-based retrieval method. It benefits from context too. Use the same generated context to augment chunks for BM25 indexing.
from rank_bm25 import BM25Okapi

# Tokenize augmented chunks for BM25
tokenized_corpus = [augmented_text.split() for augmented_text in augmented_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_search(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]
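A quick sanity check with an illustrative query:

for hit in bm25_search("how are refunds processed in the payments module", k=5):
    print(hit["id"])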
Hybrid Search: Combine Embeddings + BM25
For best results, combine both methods. A simple approach is to blend normalized scores with a weight alpha, as shown below; reciprocal rank fusion (RRF), which fuses ranks rather than raw scores, is a common alternative (see the sketch after this function).
def normalize(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + 1e-9) for s in scores]

def hybrid_search(query, k=10, alpha=0.5):
    # Get scores from both methods
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(q_emb, e) for e in contextual_embeddings]
    bm25_scores = list(bm25.get_scores(query.split()))
    # Normalize to [0, 1] and combine with weight alpha
    combined = [alpha * e + (1 - alpha) * b
                for e, b in zip(normalize(emb_scores), normalize(bm25_scores))]
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]
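If you prefer RRF, here is a minimal sketch under the same setup; the constant 60 is the conventional smoothing term and the function name is illustrative:

def rrf_search(query, k=10, c=60):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(q_emb, e) for e in contextual_embeddings]
    bm25_scores = list(bm25.get_scores(query.split()))

    # Rank positions under each method (0 = best match)
    emb_order = sorted(range(len(chunks)), key=lambda i: emb_scores[i], reverse=True)
    bm25_order = sorted(range(len(chunks)), key=lambda i: bm25_scores[i], reverse=True)
    emb_rank = {idx: r for r, idx in enumerate(emb_order)}
    bm25_rank = {idx: r for r, idx in enumerate(bm25_order)}

    # RRF score: sum of 1 / (c + rank) over both retrievers
    fused = {i: 1 / (c + emb_rank[i]) + 1 / (c + bm25_rank[i]) for i in range(len(chunks))}
    top_indices = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top_indices]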
Step 4: Reranking for Final Precision
Even with 95% Pass@10, you can push further. Use Cohere's reranker to reorder the top-20 results:
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_k=10):
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["content"] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]
# Full pipeline
query = "How does the payment module handle refunds?"
initial_results = hybrid_search(query, k=20)
final_results = rerank(query, initial_results, top_k=10)
Reranking typically adds 2–5% to Pass@10, pushing it toward 97–99%.
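To confirm the lift on your own data, re-run the Pass@10 loop from Step 1 over the full retrieve-then-rerank pipeline; a minimal sketch:

pass_at_10 = 0
for q in queries:
    candidates = hybrid_search(q["query"], k=20)
    results = rerank(q["query"], candidates, top_k=10)
    if any(r["id"] == q["golden_chunk_id"] for r in results):
        pass_at_10 += 1
print(f"Reranked Pass@10: {pass_at_10/len(queries)*100:.1f}%")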
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, deploy the provided Lambda function (contextual-rag-lambda-function/lambda_function.py) as a custom chunking option. This automates context generation during ingestion.
Cost Optimization
- Prompt caching: Essential for large document sets
- Batch processing: Generate context in parallel for multiple chunks (see the sketch after this list)
- Model choice: Use Claude 3 Haiku for context generation (fast, cheap); use Sonnet/Opus for final answer generation
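As referenced above, here is a minimal sketch of parallel context generation with a thread pool; the worker count is an assumption to tune against your API rate limits:

from concurrent.futures import ThreadPoolExecutor

def augment(chunk):
    context = generate_context(chunk["content"], chunk["document"])
    return f"{context}\n\n{chunk['content']}"

# A modest pool keeps throughput up without tripping rate limits
with ThreadPoolExecutor(max_workers=8) as pool:
    augmented_chunks = list(pool.map(augment, chunks))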
Evaluation
Always measure Pass@k on your own dataset. The 35% failure reduction is an average; your mileage may vary. Build a golden dataset of at least 100 queries with known correct chunks.

Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by prepending document-level context to each chunk before embedding.
- Prompt caching makes this approach cost-effective—reusing the full document across chunks cuts API costs by 50–70%.
- Combine Contextual Embeddings with Contextual BM25 for hybrid search that outperforms either method alone.
- Reranking adds the final polish—a Cohere reranker on top-20 results can push Pass@10 beyond 97%.
- Production-ready on any cloud—the technique works on Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex AI.