Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude
Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG retrieval accuracy using Claude and Anthropic's prompt caching.
This guide teaches you how to implement Contextual Retrieval—a technique that adds chunk-specific context before embedding—to reduce RAG retrieval failure rates by 35% using Claude, Voyage AI, and prompt caching.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful.
Imagine searching through a codebase and finding a function called process_data(). Without knowing which module it belongs to or what data it expects, that chunk is nearly useless. Contextual Retrieval solves this by prepending relevant context to each chunk before embedding it.
In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, Voyage AI, and Anthropic's prompt caching. The results speak for themselves: a 35% reduction in top-20 retrieval failure rates across tested datasets.
What You'll Need
Prerequisites
- Intermediate Python skills
- Basic understanding of RAG and vector databases
- Docker installed (optional, for BM25 search)
- 4GB+ RAM and ~5-10GB disk space
API Keys
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)
1. Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere pandas numpy rank-bm25 chromadb
Initialize your clients:
import anthropic
import voyageai
# Initialize API clients
claude_client = anthropic.Anthropic(api_key="your-anthropic-key")
vo_client = voyageai.Client(api_key="your-voyage-key")
# Test connection
response = claude_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    messages=[{"role": "user", "content": "Hello"}]
)
print("Claude ready:", response.content[0].text)
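Hardcoding keys in source files makes them easy to leak. Below is a minimal sketch of reading them from environment variables instead. The helper name load_api_keys and the variable names ANTHROPIC_API_KEY, VOYAGE_API_KEY, and COHERE_API_KEY are assumptions chosen for illustration, not part of the setup above.

```python
import os

# Assumed variable names; rename to match your own deployment.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "VOYAGE_API_KEY", "COHERE_API_KEY")

def load_api_keys(env=None):
    """Return the required API keys from env (defaults to os.environ); raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

The returned dictionary can then feed the client constructors, for example anthropic.Anthropic(api_key=keys["ANTHROPIC_API_KEY"]).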
2. The Problem: Context-Starved Chunks
Traditional RAG splits documents into fixed-size chunks, embeds them, and stores them in a vector database. When a query comes in, it retrieves the most similar chunks. But consider this code chunk:
def calculate_metrics(data):
    return np.mean(data), np.std(data)
Without context, the retriever doesn't know:
- This is from a financial analysis module
- The data argument represents stock price arrays
- The function is used for risk assessment
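To make the failure mode concrete, here is a minimal sketch of the fixed-size splitting step described above. The function name split_into_chunks and the character-based window are assumptions for illustration; real pipelines often split on tokens or on syntactic boundaries instead.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # Each window starts `step` characters after the previous one
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Hypothetical document: the module description and the function end up in different chunks
document = (
    "Module risk_analysis: inputs are daily stock price arrays.\n"
    "def calculate_metrics(data):\n"
    "    return np.mean(data), np.std(data)\n"
)
for chunk in split_into_chunks(document, chunk_size=60, overlap=0):
    print(repr(chunk))
```

Because the splitter only looks at position, nothing guarantees that the sentence explaining a function travels with the function itself.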
3. Contextual Embeddings: The Fix
Contextual Embeddings solve this by asking Claude to generate a brief context for each chunk before embedding. Here's the prompt:
CONTEXT_PROMPT = """
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
"""
Implementation with Prompt Caching
Anthropic's prompt caching makes this practical by caching the full document prefix across multiple chunk requests:
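def generate_chunk_context(chunk_text, full_document, chunk_index):
    """Generate context for a single chunk using Claude with prompt caching."""
    # Split the template so the document part is byte-identical across requests.
    # Only an identical prefix can be served from the cache.
    head, sep, tail = CONTEXT_PROMPT.partition("</document>")
    document_block = (head + sep).replace("{{WHOLE_DOCUMENT}}", full_document)
    chunk_block = tail.replace("{{CHUNK_CONTENT}}", chunk_text)
    response = claude_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                # Cache breakpoint: everything up to the end of the document is cached
                {"type": "text", "text": document_block, "cache_control": {"type": "ephemeral"}},
                # Only this block changes from one chunk to the next
                {"type": "text", "text": chunk_block},
            ],
        }]
    )
    return response.content[0].text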
# Process all chunks with caching
full_doc = "..."  # Your full document
chunks = [...]  # Your pre-split chunks
contextual_chunks = []
for i, chunk in enumerate(chunks):
    context = generate_chunk_context(chunk, full_doc, i)
    contextual_chunks.append(f"{context}\n\n{chunk}")
Why prompt caching matters: Without caching, generating context for 1,000 chunks would cost ~$15. With caching, it drops to ~$2-3 because the full document is cached and only the chunk changes between requests.
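It is worth confirming that requests actually hit the cache. Messages API responses carry a usage object that reports cache_creation_input_tokens and cache_read_input_tokens alongside input_tokens. The sketch below sums those fields; summarize_cache_usage is a hypothetical helper, and the SimpleNamespace objects are stand-ins with made-up numbers rather than real API output. Collecting real values would require generate_chunk_context to return response.usage in addition to the text.

```python
from types import SimpleNamespace

def summarize_cache_usage(usages):
    """Sum cache-related token counts across a list of response.usage-like objects."""
    totals = {"cache_write": 0, "cache_read": 0, "uncached_input": 0}
    for usage in usages:
        # getattr with a default keeps this working if a field is absent or None
        totals["cache_write"] += getattr(usage, "cache_creation_input_tokens", 0) or 0
        totals["cache_read"] += getattr(usage, "cache_read_input_tokens", 0) or 0
        totals["uncached_input"] += getattr(usage, "input_tokens", 0) or 0
    return totals

# Stand-in usage objects: the first request writes the cache, the next two read it
example = [
    SimpleNamespace(cache_creation_input_tokens=5000, cache_read_input_tokens=0, input_tokens=120),
    SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=5000, input_tokens=130),
    SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=5000, input_tokens=110),
]
print(summarize_cache_usage(example))
```

If cache_read stays at zero across a whole document, the prefix is probably not identical between requests or is too short to be cached.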
4. Embedding and Storing Contextual Chunks
Now embed the contextualized chunks using Voyage AI:
def embed_chunks(chunks, batch_size=128):
    """Batch embed chunks using Voyage AI."""
    all_embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        result = vo_client.embed(
            texts=batch,
            model="voyage-2",
            input_type="document"
        )
        all_embeddings.extend(result.embeddings)
    return all_embeddings
# Generate embeddings for contextual chunks
contextual_embeddings = embed_chunks(contextual_chunks)
Store these in your vector database (e.g., Pinecone, Weaviate, or Chroma):
import chromadb
client = chromadb.Client()
collection = client.create_collection("contextual_rag")
# Add chunks with metadata
collection.add(
    embeddings=contextual_embeddings,
    documents=contextual_chunks,
    ids=[f"chunk_{i}" for i in range(len(contextual_chunks))],
    metadatas=[{"source": "codebase", "index": i} for i in range(len(contextual_chunks))]
)
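For intuition about what the vector database does at query time, here is a brute-force sketch in plain Python. It assumes cosine similarity as the metric and uses tiny made-up 2-dimensional vectors. A real store uses an approximate index and may default to a different distance, so treat this as an illustration rather than a description of Chroma's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_embedding, embeddings, k=3):
    """Return (index, similarity) pairs for the k most similar embeddings, best first."""
    scored = [(i, cosine_similarity(query_embedding, e)) for i, e in enumerate(embeddings)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for chunk embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_by_cosine([1.0, 0.1], toy_embeddings, k=2))
```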
5. Contextual BM25: Hybrid Search
BM25 is a text-based retrieval method that works well for exact keyword matches. You can apply the same contextual prefix to improve BM25 performance:
from rank_bm25 import BM25Okapi
def build_contextual_bm25(contextual_chunks):
    """Build BM25 index from contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    return bm25
# Search with contextual BM25
bm25 = build_contextual_bm25(contextual_chunks)
query = "stock volatility calculation"
query_tokens = query.split()
bm25_scores = bm25.get_scores(query_tokens)
top_indices = sorted(range(len(bm25_scores)),
                     key=lambda i: bm25_scores[i],
                     reverse=True)[:10]
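Whitespace splitting is case-sensitive and keeps punctuation attached, so a chunk token like "Volatility," would not match the query token "volatility". A possible refinement, sketched under the assumption that simple lowercased word tokens are good enough for your corpus, is a shared tokenizer. The name tokenize is hypothetical, and the same function would need to be applied to both the chunks and the query.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or underscores."""
    return re.findall(r"[a-z0-9_]+", text.lower())

print(tokenize("Stock Volatility: calculate_metrics(data)!"))
```

Keeping underscores inside tokens preserves identifiers such as calculate_metrics as a single searchable term, which matters for code-heavy corpora.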
Hybrid search combines vector similarity and BM25 scores:
def hybrid_search(query, vector_db, bm25, alpha=0.5):
    """Combine vector and BM25 scores."""
    # Get vector scores (embed() returns a result object; the vectors are in .embeddings)
    query_embedding = vo_client.embed([query], model="voyage-2", input_type="query").embeddings[0]
    vector_results = vector_db.query(query_embeddings=[query_embedding], n_results=20)
    # Get BM25 scores
    bm25_scores = bm25.get_scores(query.split())
    # BM25 scores are unbounded, so scale them by the maximum before mixing
    bm25_max = max(bm25_scores)
    if bm25_max <= 0:
        bm25_max = 1.0  # no query term matched; avoid dividing by zero
    # Normalize and combine
    combined_scores = {}
    for i in range(len(vector_results['ids'][0])):
        idx = int(vector_results['ids'][0][i].split('_')[1])
        # Convert distance to similarity; this assumes smaller distance means closer
        vector_score = 1 - vector_results['distances'][0][i]
        bm25_score = bm25_scores[idx] / bm25_max
        combined_scores[idx] = alpha * vector_score + (1 - alpha) * bm25_score
    # Return top k
    top_k = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:10]
    return [idx for idx, score in top_k]
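Weighted score mixing depends on the two score scales being comparable. A common alternative, which is a different technique from the weighted sum above, is reciprocal rank fusion: it ignores raw scores and combines only rank positions. The sketch below is illustrative; the function name and the toy rankings are assumptions, and k=60 is the constant conventionally used with this method.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of chunk ids; each list is ordered best-first."""
    fused = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ordered[:top_n]]

# Toy rankings of chunk indices from the two retrievers
vector_ranking = [3, 1, 7]
bm25_ranking = [1, 9, 3]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))  # [1, 3, 9, 7]
```

Chunks that appear near the top of both lists rise to the front, while chunks found by only one retriever still make the fused list.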
6. Measuring Performance: Pass@k
We evaluate using Pass@k—whether the correct chunk appears in the top-k results:
def evaluate_pass_at_k(retriever, queries, golden_chunks, k=10):
    """Calculate Pass@k metric."""
    successes = 0
    for query, golden in zip(queries, golden_chunks):
        results = retriever.search(query, k=k)
        if golden in results:
            successes += 1
    return successes / len(queries)
# Example results
print(f"Baseline Pass@10: {0.87:.2%}")  # ~87%
print(f"Contextual Embeddings Pass@10: {0.95:.2%}")  # ~95%
In tests across 9 codebases with 248 queries, Contextual Embeddings improved Pass@10 from ~87% to ~95%.
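Pass rates and failure rates describe the same result from opposite sides, and a modest change in pass rate can be a large relative change in failures. The sketch below works through the arithmetic for the approximate Pass@10 figures quoted above. The helper name is hypothetical, and because this uses a top-10 cutoff while the headline 35% figure refers to top-20 failures, the two numbers are not directly comparable.

```python
def relative_failure_reduction(baseline_pass, improved_pass):
    """Fraction by which the failure rate (1 - pass rate) shrinks."""
    baseline_fail = 1.0 - baseline_pass
    improved_fail = 1.0 - improved_pass
    if baseline_fail <= 0:
        raise ValueError("baseline has no failures to reduce")
    return (baseline_fail - improved_fail) / baseline_fail

# Failures drop from about 13% to about 5% of queries
print(f"{relative_failure_reduction(0.87, 0.95):.0%}")  # 62%
```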
7. Boosting Further with Reranking
For production systems, add a reranking step using Cohere:
import cohere
co_client = cohere.Client("your-cohere-key")
def rerank_results(query, candidates, top_k=5):
    """Rerank retrieved chunks using Cohere's reranker."""
    response = co_client.rerank(
        query=query,
        documents=candidates,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    # Each result carries the index of the candidate it refers to
    return [candidates[result.index] for result in response.results]
# Full pipeline
query = "How do we calculate portfolio risk?"
initial_indices = hybrid_search(query, collection, bm25, alpha=0.5)
# hybrid_search returns chunk indices; the reranker needs the chunk text
initial_results = [contextual_chunks[idx] for idx in initial_indices]
final_results = rerank_results(query, initial_results, top_k=5)
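The pattern here is to over-retrieve cheaply, then rescore the short list with a more expensive model. To try the control flow without an API key, the sketch below swaps in a toy word-overlap scorer. The names overlap_score and rerank_candidates are hypothetical, and the scorer is only a stand-in that says nothing about how Cohere's reranker scores documents.

```python
import re

def overlap_score(query, document):
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(re.findall(r"\w+", query.lower()))
    doc_words = set(re.findall(r"\w+", document.lower()))
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

def rerank_candidates(query, candidates, score_fn=overlap_score, top_k=2):
    """Score every candidate against the query and keep the top_k highest-scoring ones."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]

# Made-up candidate chunks standing in for first-stage retrieval output
candidates = [
    "Unrelated logging helper.",
    "Portfolio risk is calculated from the standard deviation of returns.",
    "We calculate risk metrics per portfolio.",
]
print(rerank_candidates("How do we calculate portfolio risk?", candidates))
```

Passing the scorer as a parameter keeps the pipeline shape identical when the stand-in is replaced by a call to a hosted reranker.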
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, deploy the provided Lambda function (contextual-rag-lambda-function/lambda_function.py) as a custom chunking option. This allows you to add context to each document chunk before it enters your knowledge base.
Cost Optimization
| Method | Cost per 1,000 chunks |
|---|---|
| Without caching | ~$15 |
| With prompt caching | ~$2-3 |
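The size of the saving depends on document length, chunk count, and current pricing, so it is worth estimating for your own workload. The sketch below is a rough input-cost model under stated assumptions: all requests for a document arrive while its cache entry is still alive, output tokens are ignored, and the prices and token counts in the example are placeholders rather than quoted rates. It is not meant to reproduce the figures in the table.

```python
def estimate_context_cost(doc_tokens, chunk_tokens, n_chunks,
                          input_price, cache_write_price, cache_read_price):
    """Estimated input cost in dollars for one document; prices are dollars per million tokens."""
    per_million = 1_000_000
    # Without caching: every request pays full price for document + chunk
    uncached = n_chunks * (doc_tokens + chunk_tokens) * input_price / per_million
    # With caching: the first request writes the document to the cache, the rest read it
    cached = (
        doc_tokens * cache_write_price
        + (n_chunks - 1) * doc_tokens * cache_read_price
        + n_chunks * chunk_tokens * input_price
    ) / per_million
    return uncached, cached

# Hypothetical workload and placeholder prices; substitute the current published rates
uncached, cached = estimate_context_cost(
    doc_tokens=8000, chunk_tokens=200, n_chunks=40,
    input_price=3.00, cache_write_price=3.75, cache_read_price=0.30,
)
print(f"without caching: ${uncached:.2f}, with caching: ${cached:.2f}")
```

The saving grows with the number of chunks per document, since the one-time cache write is amortized over more cheap cache reads.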
Key Takeaways
- Contextual Embeddings reduce top-20 retrieval failure rates by 35% by adding chunk-specific context before embedding, solving the "context-starved chunk" problem in traditional RAG.
- Prompt caching makes this practical for production, reducing costs by 80%+ by caching the full document across multiple chunk context generations.
- Contextual BM25 extends the same idea to text-based retrieval, enabling hybrid search that combines vector similarity and keyword matching for even better results.
- Reranking adds a final accuracy boost—use Cohere's reranker to refine top results after initial retrieval.
- Start with your evaluation set—measure Pass@k before and after implementing Contextual Retrieval to quantify the improvement for your specific use case.