Mastering Contextual Retrieval: Boost RAG Accuracy with Claude and Contextual Embeddings
Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. Step-by-step guide with code examples for Contextual Embeddings and BM25.
This guide teaches you how to implement Contextual Retrieval—a technique that adds chunk-specific context before embedding—to dramatically improve RAG accuracy. You'll build a pipeline using Claude, Voyage AI, and Cohere, achieving up to 35% fewer retrieval failures.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to codebase Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet like def calculate_total(): means nothing without knowing it belongs to an Order class in an e-commerce system.
In this guide, you'll learn how to build a Contextual Retrieval system using Claude, Voyage AI embeddings, and Cohere reranking. We'll walk through the full pipeline—from basic RAG to production-ready Contextual Embeddings with BM25 hybrid search.
What You'll Need
Prerequisites
- Intermediate Python skills
- Basic understanding of RAG and vector databases
- Docker installed (optional, for BM25)
API Keys & Costs
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)
- Estimated API cost: $5–10 for the full dataset
- Time: 30–45 minutes
1. Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere numpy pandas
Load your API keys and prepare the dataset. We'll use Anthropic's pre-chunked codebase dataset (9 codebases, 248 queries with golden chunks):
import json
import os

import numpy as np
import voyageai
from anthropic import Anthropic

# Initialize clients
anthropic = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# Load data
with open("data/codebase_chunks.json") as f:
    chunks = json.load(f)

with open("data/evaluation_set.jsonl") as f:
    eval_queries = [json.loads(line) for line in f]
2. Building a Basic RAG Baseline
Before improving retrieval, establish a baseline. We'll use Pass@k as our metric—does the golden chunk appear in the top-k retrieved results?
def basic_rag_retrieve(query, chunks, top_k=10):
    # Generate embedding for query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Compute cosine similarity with all chunk embeddings
    # (assumes each chunk has a pre-computed "embedding")
    scores = []
    for chunk in chunks:
        chunk_emb = chunk["embedding"]
        similarity = cosine_similarity(query_embedding, chunk_emb)
        scores.append((similarity, chunk))

    # Return top-k chunks, highest similarity first
    scores.sort(reverse=True, key=lambda x: x[0])
    return [s[1] for s in scores[:top_k]]


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Evaluate Pass@10:
pass_at_10 = 0
for query in eval_queries:
    results = basic_rag_retrieve(query["query"], chunks)
    golden_id = query["golden_chunk_id"]
    if any(r["id"] == golden_id for r in results):
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.1%}")
Output: ~87%
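Pass@k can be reported at several cutoffs from a single ranked list per query. The following is a minimal sketch of that calculation, assuming you already have the retrieved chunk IDs for each query in rank order. The function name `pass_at_k` and its arguments are illustrative and not part of any library.

```python
def pass_at_k(ranked_ids_per_query, golden_ids, ks=(5, 10, 20)):
    """Fraction of queries whose golden chunk appears in the top-k results.

    ranked_ids_per_query: list of lists of chunk IDs, best match first.
    golden_ids: the expected chunk ID for each query, in the same order.
    """
    n_queries = len(golden_ids)
    results = {}
    for k in ks:
        hits = sum(
            1
            for ranked, golden in zip(ranked_ids_per_query, golden_ids)
            if golden in ranked[:k]
        )
        # Guard against an empty evaluation set
        results[k] = hits / n_queries if n_queries else 0.0
    return results
```

Reporting more than one cutoff shows whether a change moves the golden chunk toward the top of the list or only pulls it into the tail of the results.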
3. Implementing Contextual Embeddings
The core idea is simple: before embedding each chunk, prepend context that explains what the chunk is about.
How It Works
For each chunk, you ask Claude to generate a concise context (1–2 sentences) that situates the chunk within its parent document. This context is then prepended to the chunk text before embedding.
def generate_chunk_context(chunk_text, full_document):
    """Use Claude to generate context for a single chunk."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""

    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
Making It Production-Ready with Prompt Caching
Generating context for every chunk individually would be expensive. Prompt caching makes this practical by caching the full document prompt:
def generate_context_with_caching(chunk_text, full_document, document_id):
    """Use prompt caching to reduce costs."""
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": f"You are helping to generate context for chunks of this document: {full_document}",
            "cache_control": {"type": "ephemeral"},  # Cache the document
        }],
        messages=[{
            "role": "user",
            "content": f"Generate context for this chunk: {chunk_text}",
        }],
    )
    return response.content[0].text
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
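Caching only pays off when the chunks of one document are processed back to back, so that each request reuses the same cached document prefix. Here is a minimal sketch of that loop. It assumes each chunk carries a `doc_id` field and that `documents` maps each `doc_id` to the full document text; both are hypothetical names, so adapt them to your dataset's schema. The context generator is passed in as a callable so that either function above can be used (wrap the caching version if you need to supply `document_id`).

```python
from collections import defaultdict


def add_context_to_chunks(chunks, documents, generate_context):
    """Attach a "context" string to every chunk, one document at a time.

    chunks: list of dicts with (assumed) keys "doc_id" and "text".
    documents: dict mapping doc_id -> full document text (assumed shape).
    generate_context: callable(chunk_text, full_document) -> str.
    """
    # Group chunks by parent document so consecutive calls share a prefix
    chunks_by_doc = defaultdict(list)
    for chunk in chunks:
        chunks_by_doc[chunk["doc_id"]].append(chunk)

    chunks_with_context = []
    for doc_id, doc_chunks in chunks_by_doc.items():
        full_document = documents[doc_id]
        for chunk in doc_chunks:
            context = generate_context(chunk["text"], full_document)
            # Copy the chunk rather than mutating the caller's data
            chunks_with_context.append({**chunk, "context": context})
    return chunks_with_context
```

The returned list is ordered by document, not by the original chunk order, so build any index and any embedding list from this same list to keep positions aligned.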
Embedding with Context
Once you have context for each chunk, prepend it before embedding:
def embed_with_context(chunks_with_context):
    contextual_texts = [
        f"{c['context']}\n\n{c['text']}"
        for c in chunks_with_context
    ]
    embeddings = vo.embed(contextual_texts, model="voyage-2").embeddings
    return embeddings
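Embedding APIs usually cap how many texts a single request may contain, so a full corpus normally has to be sent in batches. The helper below is a sketch of that pattern. The name `embed_in_batches` is illustrative, and the default `batch_size=128` is an assumption, so check your provider's current limit.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches, preserving input order.

    embed_fn: callable that takes a list of strings and returns one
    embedding per string, in the same order.
    batch_size: assumed per-request limit; adjust to your provider.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings
```

With the Voyage client from the setup step, this could be called as `embed_in_batches(contextual_texts, lambda batch: vo.embed(batch, model="voyage-2").embeddings)`.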
Performance Results
After implementing Contextual Embeddings on our codebase dataset:
| Metric | Basic RAG | Contextual Embeddings |
|---|---|---|
| Pass@10 | 87% | 95% |
| Failure rate reduction | — | 35% |
4. Contextual BM25: Hybrid Search
Contextual Embeddings work with dense vectors. But you can also apply the same context to BM25 (a keyword-based retrieval method) for even better results.
Why BM25 + Context?
BM25 excels at exact keyword matching. By adding context to chunks before indexing, you give BM25 more relevant terms to match against queries.
# Install BM25 (requires Docker for production, or use rank-bm25 library)
pip install rank-bm25
from rank_bm25 import BM25Okapi


def build_contextual_bm25_index(chunks_with_context):
    # Tokenize contextualized chunks
    tokenized_corpus = [
        f"{c['context']} {c['text']}".split()
        for c in chunks_with_context
    ]
    return BM25Okapi(tokenized_corpus)
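BM25 only matches exact tokens, and a plain whitespace split is case-sensitive and keeps punctuation attached, so `calculate_total():` and `calculate_total` become different tokens. For code corpora it may help to normalize tokens first. The tokenizer below is one possible sketch, not the approach used by any particular library: it lowercases, drops punctuation, and splits snake_case and camelCase identifiers into their parts.

```python
import re

# Alphanumeric runs; underscores and punctuation act as separators
_WORD_RE = re.compile(r"[A-Za-z0-9]+")
# Pieces of a camelCase word: acronym, capitalized/lowercase word, digits
_PART_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def tokenize_for_bm25(text):
    """Split text into lowercase tokens suited to keyword search over code."""
    tokens = []
    for word in _WORD_RE.findall(text):
        tokens.extend(part.lower() for part in _PART_RE.findall(word))
    return tokens
```

If you adopt a tokenizer like this, apply it both when building the index and when tokenizing the query, since BM25 scores depend on the two sides producing the same tokens.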
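# Hybrid search: combine BM25 and embedding scores
bm25 = build_contextual_bm25_index(chunks_with_context)


def hybrid_search(query, chunks, bm25, alpha=0.5, top_k=20):
    # `chunks` must be in the same order as the corpus used to build `bm25`.
    # BM25 scores are unbounded, so min-max scale them to [0, 1]
    bm25_scores = np.asarray(bm25.get_scores(query.split()), dtype=float)
    score_range = bm25_scores.max() - bm25_scores.min()
    if score_range > 0:
        bm25_scores = (bm25_scores - bm25_scores.min()) / score_range

    # Embedding scores: cosine similarity against each chunk embedding
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = np.array([
        cosine_similarity(query_emb, c["embedding"])
        for c in chunks
    ])

    # Weighted sum: alpha weights BM25, (1 - alpha) weights embeddings
    combined = alpha * bm25_scores + (1 - alpha) * emb_scores

    # Return the top-k chunks, best first
    top_indices = np.argsort(combined)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]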
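A weighted sum of scores only behaves well when both score scales are comparable. A common alternative that avoids the scale question is reciprocal rank fusion (RRF), which combines rank positions instead of raw scores. This is a different technique from the weighted sum above, shown here as a sketch. The function name is illustrative, and `k=60` is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Fuse several ranked lists of chunk IDs into one ranking.

    rankings: list of ranked ID lists, each ordered best match first
              (for example, one from BM25 and one from embeddings).
    k: smoothing constant; larger values flatten rank differences.
    """
    fused_scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the IDs it contains
            fused_scores[chunk_id] = fused_scores.get(chunk_id, 0.0) + 1.0 / (k + rank)

    ordered = sorted(fused_scores, key=lambda cid: fused_scores[cid], reverse=True)
    return ordered[:top_k]
```

RRF has no per-retriever weight by default. If you need one, you can multiply each list's contribution by a weight, at the cost of one more parameter to tune.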
5. Improving with Reranking
For maximum accuracy, add a reranking step using Cohere's rerank API:
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])


def rerank_results(query, candidates, top_k=5):
    # Prepare documents for reranking
    docs = [c["text"] for c in candidates]

    # Rerank
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0",
    )

    # Map back to original chunks
    return [candidates[r.index] for r in results.results]
# Full pipeline
def contextual_rag_pipeline(query):
    # Step 1: Hybrid retrieval (top 20)
    candidates = hybrid_search(query, chunks, bm25, alpha=0.3)

    # Step 2: Rerank (top 5)
    top_results = rerank_results(query, candidates, top_k=5)

    # Step 3: Generate answer with Claude
    context = "\n\n".join([r["text"] for r in top_results])
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}",
        }],
    )
    return response.content[0].text
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The code is available in the contextual-rag-lambda-function directory of the cookbook repository.
Cost Optimization
- Prompt caching reduces context generation costs by ~50%
- Use Claude 3 Haiku for context generation (fastest/cheapest)
- Batch process chunks per document to maximize cache hits
- Consider using Claude 3.5 Sonnet only for the final answer generation
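To see where the savings come from, the sketch below estimates the input-token cost of generating context for one document with and without caching. It assumes cache writes are billed at 1.25x and cache reads at 0.1x the base input price. Those multipliers, the function name, and the example token counts are illustrative assumptions, so check current pricing before relying on the result. Output tokens are ignored.

```python
def context_generation_input_cost(
    doc_tokens,
    chunk_tokens,
    n_chunks,
    price_per_mtok,
    cached,
    write_multiplier=1.25,  # assumed surcharge for writing the cache
    read_multiplier=0.10,   # assumed discount for reading the cache
):
    """Estimate input cost of one context-generation call per chunk."""
    if n_chunks == 0:
        return 0.0
    per_token = price_per_mtok / 1_000_000

    if not cached:
        # Every call resends the full document plus the chunk
        return n_chunks * (doc_tokens + chunk_tokens) * per_token

    # First call writes the document to the cache; later calls read it
    write_cost = doc_tokens * write_multiplier * per_token
    read_cost = (n_chunks - 1) * doc_tokens * read_multiplier * per_token
    chunk_cost = n_chunks * chunk_tokens * per_token
    return write_cost + read_cost + chunk_cost
```

The saving depends on how large the document is relative to each chunk and how many chunks share it. With a single chunk per document, caching costs more than it saves because of the write surcharge.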
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by prepending chunk-specific context before embedding, which restores the surrounding context that chunks lose when a document is split.
- Combine Contextual Embeddings with BM25 for hybrid search that leverages both semantic and keyword matching, further improving accuracy.
- Prompt caching makes this practical at scale by caching the parent document, reducing API costs by approximately 50% for context generation.
- Reranking adds a final accuracy boost—using Cohere's rerank API on your top-20 results can push Pass@k performance even higher.
- Production-ready on major cloud platforms—the technique works with AWS Bedrock Knowledge Bases (via Lambda custom chunking) and GCP Vertex AI.