Mastering Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude
Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Step-by-step guide with code examples, cost optimization tips, and production-ready strategies.
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35%. You'll learn Contextual Embeddings, Contextual BM25, and how to use prompt caching to keep costs practical.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to code analysis tools. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet that says def process() means nothing without knowing it's part of a payment processing module. A paragraph about "the merger" is useless if the chunk doesn't mention which companies are involved.
What You'll Build
By the end of this guide, you'll have:
- A basic RAG pipeline with performance baselines
- Contextual Embeddings implementation that boosts Pass@10 from ~87% to ~95%
- Contextual BM25 for hybrid search optimization
- A reranking layer for final precision
- Cost optimization strategies using prompt caching
Prerequisites
- Skills: Intermediate Python, basic RAG knowledge, familiarity with vector databases
- System: Python 3.8+, Docker (optional, for BM25), 4GB+ RAM, ~5-10GB disk space
- API keys:
  - Anthropic API key (free tier works)
  - Voyage AI API key for embeddings
  - Cohere API key for reranking
1. Setting Up Your Environment
First, install the required libraries:
```bash
pip install anthropic voyageai cohere rank_bm25 pandas numpy
```
Initialize your clients:
```python
import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")
co = cohere.Client(api_key="your-cohere-key")
```
For this guide, we'll use a dataset of 9 codebases with 248 queries, each containing a "golden chunk"—the correct document that should be retrieved. You can find the data at data/codebase_chunks.json and data/evaluation_set.jsonl.
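The exact file format isn't critical, but the code below assumes each chunk record carries its content, an id, and the full document it was cut from, and that each evaluation query names its golden chunk. Here is a sketch of the assumed shapes (field names inferred from how they are used later; adjust them to match your own data):

```python
# Assumed record shapes (illustrative values, not real data)
example_chunk = {
    "id": "chunk_0017",                        # unique chunk identifier
    "content": "def process(self, txn): ...",  # the chunk text itself
    "document": "<full source file text>",     # whole document the chunk came from
}
example_query = {
    "question": "Where are card payments processed?",
    "golden_chunk_id": "chunk_0017",           # the chunk that should be retrieved
}
```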
2. Building a Basic RAG Pipeline (Baseline)
Let's establish a performance baseline using standard chunking and embedding:
```python
import json
from typing import List, Dict

# Load your chunks
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Generate embeddings for each chunk
def embed_chunks(texts: List[str]) -> List[List[float]]:
    response = vo.embed(texts, model="voyage-2", input_type="document")
    return response.embeddings

chunk_embeddings = embed_chunks([c["content"] for c in chunks])

# Create a simple vector store (in-memory for demo)
vector_store = list(zip(chunks, chunk_embeddings))

# Search function
def search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    # Score each chunk by dot product (equivalent to cosine similarity for normalized embeddings)
    scores = []
    for chunk, emb in vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        scores.append((similarity, chunk))
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:k]]
```
Evaluate baseline performance using Pass@k (whether the golden chunk appears in the top-k results):
```python
def evaluate_pass_at_k(queries: List[Dict], k: int = 10, search_fn=search) -> float:
    correct = 0
    for query in queries:
        results = search_fn(query["question"], k=k)
        if query["golden_chunk_id"] in [r["id"] for r in results]:
            correct += 1
    return correct / len(queries)

# Load evaluation set
with open("data/evaluation_set.jsonl", "r") as f:
    eval_queries = [json.loads(line) for line in f]

baseline_pass_10 = evaluate_pass_at_k(eval_queries, k=10)
print(f"Baseline Pass@10: {baseline_pass_10:.2%}")
# Expected: ~87%
```
3. Implementing Contextual Embeddings
The core insight is simple: before embedding each chunk, prepend a short context snippet that explains what the chunk is about. You generate this context using Claude.
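Before writing any code, here is a toy before/after illustration; the prepended context line is made up for illustration, not real model output:

```
Original chunk:
    def process(self, transaction):
        ...

Contextualized chunk (what actually gets embedded):
    This chunk is the transaction handler of the PaymentProcessor class in
    the payments module; it validates and settles card payments.

    def process(self, transaction):
        ...
```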
Step 1: Generate Context for Each Chunk
```python
def generate_chunk_context(chunk: Dict, full_document: str) -> str:
    """Use Claude to generate context for a chunk."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk['content']}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Generate context for each chunk (this is the expensive part)
for chunk in chunks:
    chunk["context"] = generate_chunk_context(chunk, chunk["document"])
```
Step 2: Embed with Context
```python
# Create contextual chunks
contextual_chunks = [
    f"{chunk['context']}\n\n{chunk['content']}"
    for chunk in chunks
]

# Embed the contextualized versions
contextual_embeddings = embed_chunks(contextual_chunks)

# Rebuild vector store
contextual_vector_store = list(zip(chunks, contextual_embeddings))
```
Step 3: Evaluate Improvement
```python
def contextual_search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    scores = []
    for chunk, emb in contextual_vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        scores.append((similarity, chunk))
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:k]]

contextual_pass_10 = evaluate_pass_at_k(eval_queries, k=10, search_fn=contextual_search)
print(f"Contextual Pass@10: {contextual_pass_10:.2%}")
# Expected: ~95% (up from ~87%)
```
4. Cost Optimization with Prompt Caching
Generating context for every chunk can be expensive. Prompt caching cuts the cost by roughly 85%: the full document is written to the cache once, each subsequent per-chunk request reads it back at a steep discount, and you only pay full input price for the small part of the prompt that changes (the chunk itself):
```python
def generate_chunk_context_cached(chunk: Dict, full_document: str) -> str:
    """Use prompt caching to reduce costs."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": f"<document>{full_document}</document>",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{
            "role": "user",
            "content": f"<chunk>{chunk['content']}</chunk>\n\nGive succinct context for this chunk."
        }]
    )
    return response.content[0].text
```
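To sanity-check that the cache is actually being hit, inspect the usage metadata on a response for a single chunk. The field names below follow the current Anthropic SDK but treat them as an assumption and confirm against your installed version; note also that very short documents can fall under the minimum cacheable prompt length and never be cached:

```python
# Probe one chunk and inspect cache token counts on the response
sample = chunks[0]
probe = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>{sample['document']}</document>",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": f"<chunk>{sample['content']}</chunk>\n\nGive succinct context for this chunk.",
    }],
)
print("cache write tokens:", getattr(probe.usage, "cache_creation_input_tokens", None))
print("cache read tokens: ", getattr(probe.usage, "cache_read_input_tokens", None))
# The first request for a document should report cache writes; later requests for
# the same document (within the cache lifetime) should report cache reads.
```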
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
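To make the ~85% figure concrete, here is a rough back-of-the-envelope calculation. The per-token prices are illustrative assumptions (roughly Claude 3 Haiku list pricing at the time of writing); substitute current pricing and your own corpus statistics:

```python
# Illustrative cost estimate for contextualizing one large document.
# All prices are assumptions -- check current pricing before relying on them.
PRICE_INPUT = 0.25 / 1e6        # $ per regular input token (assumed)
PRICE_CACHE_WRITE = 0.30 / 1e6  # $ per token written to the cache (assumed)
PRICE_CACHE_READ = 0.03 / 1e6   # $ per cached token read back (assumed)
PRICE_OUTPUT = 1.25 / 1e6       # $ per output token (assumed)

doc_tokens = 60_000    # full document length
n_chunks = 75          # chunks cut from this document
chunk_tokens = 800     # tokens per chunk
context_tokens = 60    # generated context per chunk

# Without caching, the whole document is re-billed at full price for every chunk
no_cache = n_chunks * ((doc_tokens + chunk_tokens) * PRICE_INPUT
                       + context_tokens * PRICE_OUTPUT)

# With caching, the document is written once, then read back at a discount
with_cache = (doc_tokens * PRICE_CACHE_WRITE
              + (n_chunks - 1) * doc_tokens * PRICE_CACHE_READ
              + n_chunks * (chunk_tokens * PRICE_INPUT + context_tokens * PRICE_OUTPUT))

print(f"without caching: ${no_cache:.2f}   with caching: ${with_cache:.2f}")
print(f"savings: {1 - with_cache / no_cache:.0%}")  # roughly 85% with these numbers
```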
5. Contextual BM25: Hybrid Search
The same chunk context can improve BM25 (keyword-based) search. Combine it with embeddings for a hybrid approach:
```python
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_corpus = [contextual_chunk.split() for contextual_chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    # Get embedding scores
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    emb_scores = []
    for chunk, emb in contextual_vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        emb_scores.append(similarity)

    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Normalize and combine
    combined_scores = []
    for i in range(len(chunks)):
        normalized_emb = emb_scores[i] / max(emb_scores)
        normalized_bm25 = bm25_scores[i] / max(bm25_scores)
        combined = alpha * normalized_emb + (1 - alpha) * normalized_bm25
        combined_scores.append((combined, chunks[i]))

    combined_scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in combined_scores[:k]]
```
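You can measure the hybrid gain with the same evaluation helper, passing hybrid_search through the search_fn parameter added earlier:

```python
hybrid_pass_10 = evaluate_pass_at_k(eval_queries, k=10, search_fn=hybrid_search)
print(f"Hybrid Pass@10: {hybrid_pass_10:.2%}")
# Expected: ~96%
```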
6. Adding a Reranking Layer
For final precision, add a Cohere reranker:
```python
def rerank(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    # Prepare documents for reranking
    docs = [f"{c['context']}\n\n{c['content']}" for c in candidates]
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
def advanced_search(query: str) -> List[Dict]:
    # Step 1: Hybrid search for initial candidates
    candidates = hybrid_search(query, k=20)
    # Step 2: Rerank for precision
    final_results = rerank(query, candidates, top_k=5)
    return final_results
```
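Retrieval is only half of RAG. As a final step, here is a minimal sketch (the prompt wording is illustrative, not prescribed by the original guide) of handing the retrieved chunks to Claude to answer the user's question:

```python
def answer_question(query: str) -> str:
    # Retrieve the most relevant chunks, then ask Claude to answer from them only
    retrieved = advanced_search(query)
    context_block = "\n\n---\n\n".join(c["content"] for c in retrieved)
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Answer the question using only the context below.\n\n"
                f"<context>\n{context_block}\n</context>\n\n"
                f"Question: {query}"
            ),
        }],
    )
    return response.content[0].text
```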
Production Considerations
For AWS Bedrock Users
Anthropic and AWS have provided a Lambda function for Contextual Retrieval that integrates directly with Bedrock Knowledge Bases. You can find the code in the contextual-rag-lambda-function directory of the cookbook repository. Deploy this Lambda and select it as a custom chunking option when configuring your knowledge base.
Performance Summary
| Technique | Pass@10 | Improvement (percentage points) |
|---|---|---|
| Basic RAG | ~87% | Baseline |
| Contextual Embeddings | ~95% | +8 pts |
| + Contextual BM25 | ~96% | +9 pts |
| + Reranking | ~97% | +10 pts |
Key Takeaways
- Contextual Embeddings reduce retrieval failure rates by 35% by adding document-level context to each chunk before embedding, solving the "lost context" problem in traditional RAG.
- Prompt caching makes this practical by reducing the cost of generating context for thousands of chunks by approximately 85%.
- Contextual BM25 provides complementary improvements—combining it with contextual embeddings in a hybrid search yields the best results.
- A reranking layer adds final precision but comes with additional latency and cost; use it only when you need top-5 accuracy.
- AWS Bedrock users can deploy this as a Lambda function for seamless integration with existing knowledge bases, making production deployment straightforward.