Supercharge Your RAG Pipeline: A Practical Guide to Contextual Retrieval with Claude
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% in Claude-powered RAG systems, using prompt caching to keep costs practical.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering everything from customer support chatbots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet without its function name, a paragraph without its section header—these orphans lead to failed retrievals and poor Claude responses.
Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%. In this guide, you'll learn exactly how to implement it.
What You'll Build
By the end of this guide, you'll have a production-ready Contextual Retrieval pipeline that:
- Establishes a baseline RAG system for performance measurement
- Implements Contextual Embeddings to boost retrieval accuracy
- Adds Contextual BM25 for hybrid search improvements
- Applies reranking for the final polish
Prerequisites
Skills:
- Intermediate Python
- Basic RAG understanding
- Familiarity with vector databases and embeddings
System requirements:
- Python 3.8+
- Docker (optional, for BM25)
- 4GB+ RAM, ~5-10GB disk space
API keys:
- Anthropic API key
- Voyage AI API key (for embeddings)
- Cohere API key (for reranking)
Step 1: Setting Up the Basic RAG Pipeline
First, let's establish our baseline. We'll load the codebase chunks and evaluation dataset, then implement a simple retrieval system.
import json
from typing import Dict, List

import numpy as np
import voyageai

# Load data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize Voyage AI client
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
chunk_embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple cosine similarity search
def search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = [
        (i, cosine_similarity(query_embedding, chunk_embeddings[i]))
        for i in range(len(chunk_embeddings))
    ]
    top_k = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
    return [chunks[i] for i, _ in top_k]
Baseline Performance: With this basic setup, you'll likely see Pass@10 around 87%—meaning 13% of queries fail to retrieve the correct chunk in the top 10 results. Let's improve that.
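Pass@k can be read as the share of queries for which a correct chunk appears among the top k results. The sketch below shows one way to compute it. The function name and the (retrieved IDs, golden IDs) input layout are assumptions for illustration, since the schema of the evaluation file is not shown in this guide.

```python
from typing import Iterable, Sequence, Set, Tuple


def pass_at_k(results: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Fraction of queries with at least one golden chunk in the top-k results.

    `results` holds (retrieved_ids, golden_ids) pairs, one per query.
    The pair layout and the string IDs are assumptions for this sketch.
    """
    results = list(results)
    if not results:
        return 0.0
    hits = sum(
        1 for retrieved, golden in results
        if any(chunk_id in golden for chunk_id in retrieved[:k])
    )
    return hits / len(results)


# Two toy queries: the first finds its golden chunk at rank 3, the second never does.
sample = [
    (["a", "b", "c"], {"c"}),
    (["d", "e", "f"], {"z"}),
]
score = pass_at_k(sample, k=3)
```

Running the same function over every retrieval variant in this guide gives a like-for-like comparison against the baseline.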
Step 2: Implementing Contextual Embeddings
The core idea is simple: before embedding each chunk, prepend a short context snippet that explains where the chunk came from. For codebases, this context might include:
- The file path
- The function or class name
- A brief description of the surrounding module
def create_contextual_chunk(chunk: Dict) -> str:
    """Add context to a chunk before embedding."""
    context_parts = []
    if chunk.get('file_path'):
        context_parts.append(f"File: {chunk['file_path']}")
    if chunk.get('function_name'):
        context_parts.append(f"Function: {chunk['function_name']}")
    if chunk.get('class_name'):
        context_parts.append(f"Class: {chunk['class_name']}")
    if chunk.get('description'):
        context_parts.append(f"Description: {chunk['description']}")
    context_str = " | ".join(context_parts)
    # With no metadata available, return the bare content instead of an empty prefix
    if not context_str:
        return chunk['content']
    return f"{context_str}\n\n{chunk['content']}"

# Generate contextual chunks
contextual_chunks = [create_contextual_chunk(chunk) for chunk in chunks]

# Embed the contextual chunks
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings
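Embedding APIs commonly cap the number of texts per request, so a large chunk list may need to be sent in batches. The helper below is a minimal sketch of that idea. The name `embed_in_batches`, the injected `embed_fn` callable, and the default batch size of 128 are assumptions for illustration; check your provider's documentation for the real limit.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,  # assumed limit; check your provider's documentation
) -> List[List[float]]:
    """Embed texts in fixed-size batches and return embeddings in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder for a dry run: a one-dimensional "embedding" equal to text length.
fake_embed = lambda batch: [[float(len(t))] for t in batch]
demo = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

With the Voyage client from Step 1, `embed_fn` could be `lambda batch: vo.embed(batch, model="voyage-2").embeddings`.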
Why This Works
When you search for "how to handle authentication errors," a bare chunk containing raise AuthenticationError("Invalid token") might not rank highly. But with context like File: auth/handler.py | Function: validate_token | Description: Handles JWT token validation, the same chunk becomes highly relevant to authentication-related queries.
Managing Costs with Prompt Caching
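The obvious concern is cost. Contextual chunks are longer, and if you also call Claude with a large repeated prefix (for example, the source document or a long system prompt), input tokens add up quickly. Prompt caching addresses the Claude side of that cost: available on Anthropic's first-party API (and coming soon to AWS Bedrock and GCP Vertex), it lets you reuse the system prompt and context across multiple requests. It does not apply to embedding API calls.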
import anthropic

client = anthropic.Anthropic()

# Mark the system prompt as cacheable so later requests with the same prefix can reuse it.
# Prefixes shorter than the model's minimum cacheable length are processed without caching.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": "You are a retrieval assistant. Use the following context to answer questions.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What is the authentication flow?"
        }
    ]
)
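Prompt caching pays off most when many requests share a long prefix. One common pattern, sketched here, is to ask Claude for a short context per chunk while the full source document is sent as a cached block. The helper name `build_context_request`, the prompt wording, and the `max_tokens` value are assumptions for illustration. The function only builds the keyword arguments for `client.messages.create` and makes no network call.

```python
from typing import Any, Dict


def build_context_request(document: str, chunk: str,
                          model: str = "claude-3-5-sonnet-20241022") -> Dict[str, Any]:
    """Build kwargs for client.messages.create with the document block marked cacheable."""
    return {
        "model": model,
        "max_tokens": 200,  # assumed budget for a short context snippet
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        # Identical for every chunk of the same document, so it can be cached
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        # Changes per chunk, so it is placed after the cached block
                        "type": "text",
                        "text": (
                            f"<chunk>\n{chunk}\n</chunk>\n"
                            "Write a short context that situates this chunk "
                            "within the document to improve search retrieval. "
                            "Answer with the context only."
                        ),
                    },
                ],
            }
        ],
    }


request = build_context_request("def validate_token(): ...", "raise AuthenticationError")
```

Sending it would look like `client.messages.create(**request)`. The response's usage fields for cache creation and cache reads are a reasonable place to confirm that the cache is actually being hit.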
Performance Boost: Contextual Embeddings alone improved Pass@10 from ~87% to ~95% in Anthropic's tests. The failure rate falls from ~13% to ~5%, a relative reduction of about 62%.
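The 62% figure is a relative change in the failure rate, not a difference in percentage points. The short sketch below spells out that arithmetic; the function name is an assumption for illustration.

```python
def relative_failure_reduction(pass_before: float, pass_after: float) -> float:
    """Relative drop in failure rate when a pass rate improves."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    if fail_before <= 0:
        raise ValueError("pass_before must be below 1.0")
    return (fail_before - fail_after) / fail_before


# Pass@10 going from 87% to 95% means failures go from 13% to 5%.
reduction = relative_failure_reduction(0.87, 0.95)
print(f"{reduction:.0%}")
```

The same function applies to any pair of pass rates you measure on your own evaluation set.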
Step 3: Adding Contextual BM25
BM25 is a traditional text-search algorithm that works well for exact keyword matching. By applying the same contextual prefix to chunks before BM25 indexing, you get Contextual BM25—a powerful complement to your embedding-based search.
import numpy as np
from rank_bm25 import BM25Okapi
from typing import Dict, List

# Tokenize contextual chunks for BM25
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

def normalize(scores) -> np.ndarray:
    """Min-max scale scores to [0, 1] (an assumed choice) so both score types are comparable."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    """Combine embedding similarity and BM25 scores."""
    # Get embedding scores
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(query_embedding, e) for e in contextual_embeddings]
    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize and combine: alpha weights embeddings, (1 - alpha) weights BM25
    emb_scores = normalize(emb_scores)
    bm25_scores = normalize(bm25_scores)
    combined = [
        alpha * emb + (1 - alpha) * kw
        for emb, kw in zip(emb_scores, bm25_scores)
    ]
    top_indices = sorted(
        range(len(combined)),
        key=lambda i: combined[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]
Why Both? Embeddings capture semantic meaning ("how to fix login issues"), while BM25 excels at exact matches ("AuthenticationError"). Together, they cover more retrieval scenarios.
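To see how the weighting shifts results between the two signals, here is a small self-contained sketch with made-up scores for three chunks. The helper names `min_max` and `fuse` and all score values are assumptions for illustration, and min-max scaling is only one possible way to put the two score types on a common range.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Scale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(emb_scores: Sequence[float], bm25_scores: Sequence[float],
         alpha: float = 0.5) -> List[int]:
    """Return chunk indices ranked by alpha * emb + (1 - alpha) * bm25."""
    emb, kw = min_max(emb_scores), min_max(bm25_scores)
    combined = [alpha * e + (1 - alpha) * b for e, b in zip(emb, kw)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Made-up scores: chunk 0 wins on semantics, chunk 2 wins on keywords.
emb_scores = [0.82, 0.75, 0.60]
bm25_scores = [0.0, 1.5, 6.0]

semantic_heavy = fuse(emb_scores, bm25_scores, alpha=0.9)  # favors chunk 0
keyword_heavy = fuse(emb_scores, bm25_scores, alpha=0.1)   # favors chunk 2
```

Sweeping `alpha` against your evaluation set is a simple way to pick a weighting rather than assuming 0.5 is best.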
Step 4: Reranking for Precision
Even with hybrid search, your top-10 results might include near-misses. A reranker (like Cohere's) takes your top-20 results and re-scores them based on deeper semantic understanding.
import cohere
from typing import Dict, List

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    """Rerank candidates using Cohere's reranker."""
    candidate_texts = [c['content'] for c in candidates]
    response = co.rerank(
        query=query,
        documents=candidate_texts,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    # Each result carries the index of its document in the input list
    return [candidates[r.index] for r in response.results]
Final Pipeline:
- Retrieve top-20 using hybrid Contextual Embeddings + Contextual BM25
- Rerank to get the final top-10
- Pass to Claude for answer generation
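The three steps above can be wired together as a small orchestration function. This is a sketch under stated assumptions: `retrieve_then_rerank`, `build_prompt`, and the prompt layout are hypothetical names and choices, and the retriever and reranker are passed in as callables so the flow can be exercised without any API calls.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str, int], List[Dict]]
Reranker = Callable[[str, List[Dict], int], List[Dict]]


def retrieve_then_rerank(query: str, retriever: Retriever, reranker: Reranker,
                         retrieve_k: int = 20, final_k: int = 10) -> List[Dict]:
    """Retrieve a wide candidate set, then let the reranker pick the final results."""
    candidates = retriever(query, retrieve_k)
    if not candidates:
        return []
    return reranker(query, candidates, min(final_k, len(candidates)))


def build_prompt(query: str, chunks: List[Dict]) -> str:
    """Format the selected chunks and the question into a single prompt string."""
    context = "\n\n".join(c["content"] for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Dry run with stand-in components (no API calls).
fake_chunks = [{"content": f"chunk {i}"} for i in range(30)]
stub_retriever = lambda q, k: fake_chunks[:k]
stub_reranker = lambda q, cands, k: list(reversed(cands))[:k]

top = retrieve_then_rerank("auth flow?", stub_retriever, stub_reranker)
prompt = build_prompt("auth flow?", top)
```

With the functions from this guide, the call would be `retrieve_then_rerank(query, hybrid_search, rerank)`, and the resulting prompt string becomes the user message sent to Claude.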
AWS Bedrock Implementation
For AWS customers, Anthropic's team has provided a Lambda function that implements Contextual Retrieval as a custom chunking strategy for Bedrock Knowledge Bases. You'll find the code in the contextual-rag-lambda-function directory of the cookbook repository.
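# lambda_function.py (simplified)
def lambda_handler(event, context):
    chunk = event['chunk']
    # generate_context stands for the context-generation step (defined elsewhere).
    # A distinct variable name avoids shadowing the Lambda `context` argument.
    chunk_context = generate_context(chunk)
    contextual_chunk = f"{chunk_context}\n\n{chunk['content']}"
    return {
        'chunkId': chunk['id'],
        'content': contextual_chunk,
        'metadata': chunk.get('metadata', {})
    }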
Deploy this Lambda, select it as your custom chunking option when creating a Knowledge Base, and you're set.
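Before deploying, it can help to call the handler locally with a hand-written event. The sketch below does that with a stand-in `generate_context`. The event shape simply mirrors the simplified handler in this guide and is not the actual Bedrock event schema, and the metadata-based context is a hypothetical placeholder for the real generation step.

```python
from typing import Any, Dict


def generate_context(chunk: Dict[str, Any]) -> str:
    """Stand-in for the context-generation step (hypothetical for this sketch)."""
    file_path = chunk.get("metadata", {}).get("file_path", "unknown")
    return f"File: {file_path}"


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Prepend generated context to the chunk content, as in the simplified handler."""
    chunk = event["chunk"]
    chunk_context = generate_context(chunk)
    return {
        "chunkId": chunk["id"],
        "content": f"{chunk_context}\n\n{chunk['content']}",
        "metadata": chunk.get("metadata", {}),
    }


# Local invocation with a hand-written event; no AWS resources are involved.
sample_event = {
    "chunk": {
        "id": "c1",
        "content": "raise AuthenticationError",
        "metadata": {"file_path": "auth/handler.py"},
    }
}
result = lambda_handler(sample_event, None)
```

A check like this only confirms the string assembly; the real event format should be taken from the Lambda code in the cookbook repository.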
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by prepending relevant context (file path, function name, description) to each chunk before embedding
- Pair with Contextual BM25 for hybrid search that combines semantic understanding with exact keyword matching
- Use prompt caching to manage the token cost of Claude requests that share a long prefix, such as the context passed alongside each query (available on Anthropic's API, coming to Bedrock and Vertex)
- Add a reranker (like Cohere) as a final precision layer to eliminate near-misses from your top-k results
- Start simple: even basic context (just the file path) can improve retrieval over bare chunks, so measure it against your baseline before adding more