Enhancing RAG with Contextual Retrieval: A Practical Guide to Smarter Document Chunking
Learn how to improve RAG performance by adding context to document chunks before embedding. This guide covers setup, implementation, and optimization of Contextual Embeddings and Contextual BM25 with Claude and Anthropic's ecosystem, reducing retrieval failure rates by up to 35%.
Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to tap into your internal knowledge bases, code repositories, and document libraries. But traditional RAG has a blind spot: when you split documents into chunks for embedding, those chunks often lose the surrounding context that makes them meaningful. A chunk that reads "the function returns True" is useless without knowing which function or what condition it checks.
Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. This guide walks you through implementing Contextual Embeddings and Contextual BM25, showing how to reduce retrieval failure rates by up to 35%, all using Anthropic's ecosystem, including Claude and prompt caching to keep costs manageable.
What You'll Learn
- How to set up a basic RAG pipeline as a baseline
- What Contextual Embeddings are and why they work
- How to implement Contextual Embeddings with Claude and Voyage AI
- How to combine Contextual Embeddings with Contextual BM25 for hybrid search
- How to further improve results with reranking
Prerequisites
Before diving in, make sure you have:
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG and vector databases
- Familiarity with command-line tools

Environment:
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ RAM, ~5-10 GB disk space

API Keys:
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)

Time & Cost:
- Setup: 30-45 minutes
- API costs: ~$5-10 for the full dataset
Step 1: Setting Up a Basic RAG Pipeline
First, let's establish a baseline. We'll use a pre-chunked dataset of nine codebases (available in data/codebase_chunks.json) and 248 evaluation queries with known "golden chunks" (in data/evaluation_set.jsonl). Our metric is Pass@k—whether the golden chunk appears in the top-k retrieved results.
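The snippets below assume two variables, chunks and eval_queries, are already in memory. A minimal loader might look like this; the file paths come from the dataset described above, but the exact record schemas are an assumption made to match the snippets that follow:

```python
import json

def load_chunks(path: str) -> list:
    """Load the pre-chunked corpus: a JSON list of chunk records."""
    with open(path) as f:
        return json.load(f)

def load_eval_queries(path: str) -> list:
    """Load evaluation queries from JSON Lines (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage:
# chunks = load_chunks("data/codebase_chunks.json")
# eval_queries = load_eval_queries("data/evaluation_set.jsonl")
```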
```python
import voyageai
import numpy as np
from typing import List

# Initialize the Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# Load chunks and evaluation data
# (assume `chunks` and `eval_queries` are loaded from the JSON files)

# Embed all chunks. Voyage embeddings are normalized to unit length,
# so the dot product below is equivalent to cosine similarity.
# (For large corpora, embed in batches to stay under the API's batch limit.)
chunk_texts = [chunk["content"] for chunk in chunks]
chunk_embeddings = vo.embed(
    chunk_texts,
    model="voyage-2",
    input_type="document",
).embeddings

# For each query, embed it and return the chunk IDs of the top-k matches.
# (Assumes each chunk record carries an "id" matching golden_chunk_id.)
def search(query: str, k: int = 10) -> List[str]:
    query_emb = vo.embed(
        [query],
        model="voyage-2",
        input_type="query",
    ).embeddings[0]
    # Cosine similarity via dot product (embeddings are normalized)
    similarities = np.dot(chunk_embeddings, query_emb)
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [chunks[i]["id"] for i in top_indices]

# Evaluate Pass@10
pass_at_10 = 0
for query in eval_queries:
    results = search(query["query"], k=10)
    if query["golden_chunk_id"] in results:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.2%}")
# Expected: ~87%
```
This baseline gives us ~87% Pass@10. Not bad, but we can do better.
Step 2: Understanding Contextual Embeddings
The problem with basic chunking is context loss. A chunk from a function definition might say "def calculate_interest(principal, rate, time):" but the next chunk starts with "return principal * rate * time / 100"—and without the function signature, that chunk is meaningless.
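To see how this happens, here is a minimal sketch of naive fixed-size chunking; the splitter and chunk size are illustrative, not the dataset's actual chunking:

```python
# Naive fixed-size chunking: split text every `size` characters,
# ignoring structure. Later pieces lose the context that precedes them.
def naive_chunk(text: str, size: int) -> list:
    return [text[i:i + size] for i in range(0, len(text), size)]

source = (
    "def calculate_interest(principal, rate, time):\n"
    "    return principal * rate * time / 100\n"
)

pieces = naive_chunk(source, 48)
# The second piece holds the return statement but not the function
# signature that gives it meaning.
print(pieces[1])
```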
Contextual Embeddings fix this by using Claude to generate a short, chunk-specific context that explains what the chunk is about. This context is prepended to the chunk text before embedding. For example:

- Original chunk: return principal * rate * time / 100
- With context: This is from a function called 'calculate_interest' that computes simple interest. The code returns: return principal * rate * time / 100
Step 3: Implementing Contextual Embeddings
Here's where Claude shines. We'll use Claude to generate context for each chunk, and prompt caching to reduce costs by reusing the system prompt across multiple chunks.
```python
import anthropic

client = anthropic.Anthropic(api_key="your-anthropic-api-key")

# System prompt for context generation
SYSTEM_PROMPT = """You are a document context generator. Given a document and a chunk from it, generate a concise context (2-3 sentences) that explains what this chunk is about, including relevant surrounding information like function names, class names, or section headers."""

def generate_context(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": [
                    # Mark the document for caching: every chunk of the
                    # same document reuses this cached prefix instead of
                    # paying full price for it again
                    {
                        "type": "text",
                        "text": f"Document:\n{document}",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": f"Chunk:\n{chunk}\n\nGenerate context:",
                    },
                ],
            }
        ],
    )
    return response.content[0].text
```
```python
# Generate contexts for all chunks and prepend them to the chunk text
contextual_chunks = []
for chunk in chunks:
    context = generate_context(chunk["document"], chunk["content"])
    contextual_chunks.append(f"{context}\n\n{chunk['content']}")

# Embed the contextual chunks
contextual_embeddings = vo.embed(
    contextual_chunks,
    model="voyage-2",
    input_type="document",
).embeddings
```
```python
# Re-evaluate Pass@10 with the contextual embeddings
pass_at_10_contextual = 0
for query in eval_queries:
    query_emb = vo.embed([query["query"]], model="voyage-2", input_type="query").embeddings[0]
    similarities = np.dot(contextual_embeddings, query_emb)
    top_indices = np.argsort(similarities)[-10:][::-1]
    if query["golden_chunk_id"] in [chunks[i]["id"] for i in top_indices]:
        pass_at_10_contextual += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10_contextual / len(eval_queries):.2%}")
# Expected: ~95%
```
Why prompt caching matters: every context-generation call sends the full document along with a single chunk, so without caching you would pay for the whole document once per chunk. With prompt caching, the document is written to the cache on the first chunk and read back cheaply for every subsequent chunk of the same document, cutting input costs by roughly 90%.
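To make the savings concrete, here is a back-of-envelope estimate; the per-million-token prices and corpus sizes are illustrative assumptions, not quoted rates:

```python
# Back-of-envelope cost estimate for the repeated document tokens when
# generating contexts for one document. Prices (USD per million input
# tokens) and sizes are illustrative assumptions.
BASE_INPUT = 3.00    # normal input tokens
CACHE_WRITE = 3.75   # writing tokens to the cache (first chunk)
CACHE_READ = 0.30    # reading cached tokens (subsequent chunks)

doc_tokens = 8_000   # tokens in one document
n_chunks = 100       # chunks generated from that document

# Without caching: the full document is billed with every chunk
no_cache = n_chunks * doc_tokens * BASE_INPUT / 1_000_000

# With caching: pay the cache-write price once, then cheap reads
with_cache = (doc_tokens * CACHE_WRITE
              + (n_chunks - 1) * doc_tokens * CACHE_READ) / 1_000_000

savings = 1 - with_cache / no_cache
print(f"without: ${no_cache:.2f}, with: ${with_cache:.2f}, saved {savings:.0%}")
```

Note this only covers the repeated document tokens; the per-chunk text and the generated output are billed the same either way.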
Step 4: Adding Contextual BM25 for Hybrid Search
Contextual Embeddings improve semantic search, but BM25 (a keyword-based algorithm) can catch exact matches that embeddings miss. By applying the same context to BM25, we get Contextual BM25.
```python
# Using a simple BM25 implementation (the rank_bm25 library)
from rank_bm25 import BM25Okapi

# Tokenize the contextual chunks for BM25
tokenized_corpus = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[int]:
    # Semantic scores
    query_emb = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    semantic_scores = np.dot(contextual_embeddings, query_emb)
    # BM25 scores
    bm25_scores = bm25.get_scores(query.split())
    # Min-max normalize each score set (assumes the scores are not all
    # identical), then blend with weight alpha
    semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * semantic_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[-k:][::-1]
    return top_indices
```
```python
# Evaluate hybrid search (map the returned indices back to chunk IDs)
pass_at_10_hybrid = 0
for query in eval_queries:
    result_ids = [chunks[i]["id"] for i in hybrid_search(query["query"], k=10)]
    if query["golden_chunk_id"] in result_ids:
        pass_at_10_hybrid += 1

print(f"Hybrid Contextual Search Pass@10: {pass_at_10_hybrid / len(eval_queries):.2%}")
# Expected: ~96-97%
```
Step 5: Improving with Reranking
For even better results, add a reranking step using Cohere's rerank API. This reorders the top-20 results to push the most relevant chunks to the top.
```python
import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query: str, docs: List[str], top_k: int = 10) -> List[int]:
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0",
    )
    # Each result carries the index of the document in the input list
    return [result.index for result in results.results]
```
```python
# For each query: take the top-20 from hybrid search, rerank, keep the top-10
pass_at_10_reranked = 0
for query in eval_queries:
    top_20 = hybrid_search(query["query"], k=20)
    top_20_chunks = [contextual_chunks[i] for i in top_20]
    reranked_indices = rerank(query["query"], top_20_chunks, top_k=10)
    final_ids = [chunks[top_20[i]]["id"] for i in reranked_indices]
    if query["golden_chunk_id"] in final_ids:
        pass_at_10_reranked += 1

print(f"Reranked Contextual Search Pass@10: {pass_at_10_reranked / len(eval_queries):.2%}")
# Expected: ~98-99%
```
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option. This automates context generation for new documents added to your knowledge base.
Cost Management
- Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
- For large corpora, consider generating context only once and storing it alongside your chunks.
- Use smaller models (Claude 3 Haiku) for context generation if accuracy requirements are lower.
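For the second point, one approach is a small persistence layer that stores each generated context keyed by chunk ID, so re-runs only call the model for chunks that don't have a context yet. The file name and the chunk schema ("id", "document", "content") are illustrative assumptions:

```python
import json
import os

CONTEXT_STORE = "contexts.json"  # illustrative file name

def load_contexts(path: str = CONTEXT_STORE) -> dict:
    """Load previously generated contexts, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_contexts(contexts: dict, path: str = CONTEXT_STORE) -> None:
    with open(path, "w") as f:
        json.dump(contexts, f, indent=2)

def ensure_contexts(chunk_records, generate, path: str = CONTEXT_STORE) -> dict:
    """Generate a context for each chunk exactly once, reusing stored ones.

    `generate` is any callable taking (document, chunk_text), e.g. the
    generate_context function from Step 3.
    """
    contexts = load_contexts(path)
    for chunk in chunk_records:
        if chunk["id"] not in contexts:
            contexts[chunk["id"]] = generate(chunk["document"], chunk["content"])
    save_contexts(contexts, path)
    return contexts
```

Running ensure_contexts a second time over the same corpus then costs nothing, since every chunk ID is already in the store.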
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the context-loss problem in traditional RAG.
- Prompt caching makes this practical by reducing the cost of generating context for thousands of chunks by ~90%.
- Hybrid search with Contextual BM25 combines semantic and keyword matching for even better results, pushing Pass@10 from 87% to 96%+.
- Reranking adds the final polish, boosting Pass@10 to 98-99% by reordering the top candidates.
- This technique works with major cloud platforms—Anthropic provides ready-to-deploy Lambda functions for AWS Bedrock, with GCP Vertex AI support coming soon.