Contextual Retrieval: Boosting RAG Performance with Claude and Contextual Embeddings
Learn how to improve RAG accuracy by 35% using Contextual Embeddings and Contextual BM25 with Claude. A practical guide with code examples and evaluation metrics.
This guide shows you how to enhance RAG systems by adding context to document chunks before embedding, reducing retrieval failure rates by 35%. You'll implement Contextual Embeddings, Contextual BM25, and reranking using Claude and Voyage AI.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support chatbots to internal knowledge base Q&A. But traditional RAG has a blind spot: when you split documents into chunks, individual pieces often lose the surrounding context, leading to poor retrieval accuracy.
Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. In tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%. This guide walks you through implementing Contextual Embeddings and Contextual BM25 with Claude, complete with code examples and performance benchmarks.
What You'll Learn
- How to set up a basic RAG pipeline as a baseline
- What Contextual Embeddings are and why they work
- How to implement Contextual Embeddings with prompt caching to manage costs
- How to combine Contextual Embeddings with Contextual BM25 for hybrid search
- How to further improve performance with reranking
Prerequisites
Before diving in, make sure you have:
- Python 3.8+ installed
- API keys for Anthropic, Voyage AI, and Cohere
- Basic familiarity with RAG, vector databases, and embeddings
- Docker installed (optional, for BM25 search)
- About 30–45 minutes and ~$5–10 in API costs
1. Setting Up a Basic RAG Pipeline
We'll start with a simple RAG pipeline to establish a performance baseline. The dataset consists of 9 codebases, chunked using character splitting, with 248 evaluation queries—each with a "golden chunk" that should be retrieved.
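Throughout this guide, Pass@k is the metric: the fraction of queries whose golden chunk appears anywhere in the top k retrieved chunks. A minimal, self-contained sketch of the metric (toy data, not the article's dataset):

```python
def pass_at_k(retrieved_lists, golden_chunks, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(
        1 for retrieved, golden in zip(retrieved_lists, golden_chunks)
        if golden in retrieved[:k]
    )
    return hits / len(golden_chunks)

# Toy example: 2 of 3 queries have their golden chunk in the top-2 results
retrieved = [["a", "b"], ["c", "d"], ["e", "f"]]
golden = ["b", "x", "e"]
print(pass_at_k(retrieved, golden, k=2))  # → 0.666...
```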
Install Dependencies
pip install anthropic voyageai cohere numpy pandas
Load and Chunk Documents
import json

# Load pre-chunked codebase data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

# Load evaluation queries
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]
Generate Embeddings
We'll use Voyage AI's embedding model to vectorize each chunk.
import voyageai

vo = voyageai.Client(api_key='YOUR_VOYAGE_API_KEY')

# Embed all chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model='voyage-2').embeddings
Perform Retrieval
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def retrieve(query, embeddings, chunk_texts, k=10):
    query_emb = vo.embed([query], model='voyage-2').embeddings[0]
    similarities = cosine_similarity([query_emb], embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return [chunk_texts[i] for i in top_k_indices]
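The `np.argsort(similarities)[-k:][::-1]` idiom is worth unpacking: argsort returns indices in ascending score order, the last k entries are the k highest scores, and the reversal puts the best match first. A quick check on toy scores:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.5, 0.7])
top_2 = np.argsort(scores)[-2:][::-1]
print(top_2.tolist())  # → [1, 3], the indices of the 0.9 and 0.7 scores
```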
Evaluate Pass@10
pass_at_10 = 0
for item in eval_data:
    results = retrieve(item['query'], embeddings, chunk_texts, k=10)
    if item['golden_chunk'] in results:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_data):.2%}")
Expected output: Baseline Pass@10: ~87%
2. Contextual Embeddings: Adding Context to Each Chunk
The problem with basic RAG is that chunks are isolated. A chunk containing def calculate_interest(principal, rate, years): might be meaningless without knowing it's from a loan calculator app. Contextual Embeddings fix this by prepending a short context snippet to each chunk before embedding.
How It Works
For each chunk, we ask Claude to generate a concise context that explains what the chunk is about, based on the full document. This context is prepended to the chunk text before embedding.
Implementation with Prompt Caching
Generating context for thousands of chunks can be expensive. Prompt caching (available on the Anthropic API) dramatically reduces costs by letting repeated prompt prefixes, such as the full document shared by all of its chunks, be processed once and reused across calls.
import anthropic
client = anthropic.Anthropic(api_key='YOUR_ANTHROPIC_API_KEY')
def generate_context(chunk_text, full_document):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": "You are a document context generator. Given a chunk of text from a larger document, provide a brief (1-2 sentence) context that explains what this chunk is about and where it fits in the document.",
            },
            {
                # Marking the document with cache_control means repeat calls
                # for chunks of the same document reuse this cached prefix
                # instead of reprocessing it on every request.
                "type": "text",
                "text": f"Full document:\n{full_document}",
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": f"Chunk: {chunk_text}\n\nContext:"}
        ]
    )
    return response.content[0].text

# Generate contexts for all chunks
contexts = []
for i, chunk in enumerate(chunks):
    ctx = generate_context(chunk['text'], chunk['full_document'])
    contexts.append(ctx)
    print(f"Generated context for chunk {i+1}/{len(chunks)}")
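Looping over thousands of chunks will eventually hit API rate limits or transient network errors. A generic retry-with-backoff helper (not Anthropic-specific; wrap each context-generation call in it) might look like:

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo with a function that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # → ok
```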
Embed with Context
# Prepend context to each chunk before embedding
contextual_chunks = [f"{ctx}\n\n{chunk['text']}" for ctx, chunk in zip(contexts, chunks)]
contextual_embeddings = vo.embed(contextual_chunks, model='voyage-2').embeddings
Evaluate Contextual Embeddings
def contextual_retrieve(query, embeddings, contextual_chunks, k=10):
    query_emb = vo.embed([query], model='voyage-2').embeddings[0]
    similarities = cosine_similarity([query_emb], embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return [contextual_chunks[i] for i in top_k_indices]

pass_at_10 = 0
for item in eval_data:
    results = contextual_retrieve(item['query'], contextual_embeddings, contextual_chunks, k=10)
    # Substring check: retrieved chunks now carry a prepended context, so an
    # exact equality test against the golden chunk would always fail.
    if any(item['golden_chunk'] in r for r in results):
        pass_at_10 += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10 / len(eval_data):.2%}")
Expected output: Contextual Embeddings Pass@10: ~95% — a significant jump from 87%.
3. Contextual BM25: Hybrid Search for Even Better Results
BM25 is a text-based retrieval method that excels at exact keyword matching, which embeddings can miss. By indexing the same context-prefixed chunks with BM25, we get Contextual BM25.
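For intuition on why BM25 complements embeddings, here is a minimal from-scratch sketch of the scoring formula (the same family as `rank_bm25`'s `BM25Okapi`, which the implementation below actually uses): each query term contributes an IDF weight, damped by term frequency and document length.

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Minimal BM25 scoring over pre-tokenized docs (lists of terms)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        score = 0.0
        for t in query_terms:
            tf = doc.count(t)
            if tf == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the loan calculator computes interest".split(),
    "the weather today is sunny".split(),
]
scores = bm25_scores(["interest", "loan"], docs)
print(scores[0] > scores[1])  # → True: exact keyword overlap wins
```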
Setting Up BM25
from rank_bm25 import BM25Okapi
# Tokenize contextual chunks
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

def bm25_retrieve(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [contextual_chunks[i] for i in top_k_indices]
Hybrid Search: Combine Embeddings + BM25
def hybrid_retrieve(query, emb_embeddings, bm25, contextual_chunks, k=10, alpha=0.5):
    # Get embedding scores
    query_emb = vo.embed([query], model='voyage-2').embeddings[0]
    emb_scores = cosine_similarity([query_emb], emb_embeddings)[0]
    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Min-max normalize each score set, then blend with weight alpha
    emb_scores = (emb_scores - emb_scores.min()) / (emb_scores.max() - emb_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * emb_scores + (1 - alpha) * bm25_scores
    top_k_indices = np.argsort(combined)[-k:][::-1]
    return [contextual_chunks[i] for i in top_k_indices]
Evaluate hybrid search with the same Pass@10 loop as before; you should see another 1–2 percentage-point improvement over Contextual Embeddings alone.
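The min-max normalization step matters: cosine similarities live roughly in [-1, 1] while BM25 scores are unbounded, so blending raw values would let one signal dominate. A toy illustration of the fusion, using plain Python lists with made-up scores:

```python
def minmax(xs):
    """Rescale a list of scores to [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

emb = [0.2, 0.9, 0.5]     # cosine similarities, roughly bounded
bm25 = [0.0, 12.4, 3.1]   # BM25 scores, unbounded
alpha = 0.5
combined = [alpha * e + (1 - alpha) * b
            for e, b in zip(minmax(emb), minmax(bm25))]
best = max(range(len(combined)), key=combined.__getitem__)
print(best)  # → 1: that document ranks highest on both signals
```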
4. Reranking for Precision
Even with contextual retrieval, the top-10 results may contain irrelevant chunks. Reranking using Cohere's rerank API or Claude itself can push the golden chunk to position 1.
import cohere
co = cohere.Client('YOUR_COHERE_API_KEY')
def rerank(query, candidates, top_n=5):
    results = co.rerank(
        model='rerank-english-v2.0',
        query=query,
        documents=candidates,
        top_n=top_n
    )
    return [candidates[r.index] for r in results.results]
# Use reranking on top-20 results from hybrid search
pass_at_1 = 0
for item in eval_data:
    candidates = hybrid_retrieve(item['query'], contextual_embeddings, bm25, contextual_chunks, k=20)
    reranked = rerank(item['query'], candidates, top_n=5)
    # Substring check, since candidates carry a prepended context
    if reranked and item['golden_chunk'] in reranked[0]:
        pass_at_1 += 1

print(f"Pass@1 with Reranking: {pass_at_1 / len(eval_data):.2%}")
Cost Optimization with Prompt Caching
Generating context for thousands of chunks can be costly. Prompt caching reduces costs by up to 90% by reusing the system prompt across multiple API calls. This feature is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
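A back-of-envelope estimate shows where the savings come from when the full document is the cached prefix. The multipliers follow Anthropic's published pricing model for ephemeral caching (cache writes cost about 1.25x base input tokens, cache reads about 0.1x); the token counts are illustrative assumptions, not measurements:

```python
doc_tokens = 8000       # tokens in one source document (assumed)
chunks_per_doc = 40     # chunks generated from that document (assumed)

# Without caching, every chunk's context call reprocesses the whole document
without_cache = doc_tokens * chunks_per_doc

# With caching: one cache write (1.25x), then cache reads (0.1x) for the rest
with_cache = doc_tokens * 1.25 + doc_tokens * 0.1 * (chunks_per_doc - 1)

savings = 1 - with_cache / without_cache
print(f"{savings:.0%}")  # → 87%
```

The savings grow with document length and chunk count, approaching the ~90% ceiling set by the cache-read discount.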
For AWS Bedrock users, Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option when configuring a Bedrock Knowledge Base.
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by adding document-level context to each chunk before embedding.
- Prompt caching makes Contextual Embeddings production-ready by slashing API costs for context generation.
- Hybrid search (Contextual Embeddings + Contextual BM25) yields the best results, combining semantic and keyword-based retrieval.
- Reranking further improves precision, pushing the most relevant chunk to the top of results.
- This technique works across platforms — use it with Anthropic's API, AWS Bedrock, or GCP Vertex AI with minor customization.