How to Build a Contextual Retrieval System with Claude: A Practical Guide
Learn how to implement Contextual Embeddings and Contextual BM25 — techniques that add context to document chunks before embedding, reducing RAG retrieval failure rates by 35% — using Claude, Voyage AI, and Cohere.
Introduction
Retrieval Augmented Generation (RAG) is a powerful pattern that lets Claude answer questions using your internal knowledge bases, codebases, or any document corpus. But traditional RAG has a fundamental weakness: when you split documents into small chunks for efficient retrieval, individual chunks often lose the surrounding context they need to be meaningful.
Imagine searching for "the function returns an error" in a codebase. Without knowing which file or module that chunk belongs to, the embedding model can't accurately represent its meaning. This is where Contextual Retrieval comes in.
In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 — two techniques that add relevant context to each chunk before embedding or indexing. According to Anthropic's evaluations, this approach reduces the top-20-chunk retrieval failure rate by 35% on average across diverse datasets.
We'll walk through a complete implementation using a dataset of 9 codebases, showing you how to:
- Set up a baseline RAG pipeline
- Implement Contextual Embeddings with prompt caching to manage costs
- Add Contextual BM25 for hybrid search
- Improve results further with reranking
Prerequisites
Before diving in, make sure you have:
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG
- Familiarity with vector databases and embeddings

Environment:
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ RAM
- ~5–10 GB disk space for vector databases

Time and Cost:
- 30–45 minutes to complete
- ~$5–10 in API costs for the full dataset
Step 1: Setting Up a Basic RAG Pipeline
First, let's establish a baseline. We'll use a pre-chunked dataset of 9 codebases with 248 queries, each containing a "golden chunk" — the correct answer. Our metric is Pass@k, which checks if the golden chunk appears in the top-k retrieved results.
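The Pass@k metric itself is simple to compute. Here is a minimal sketch — `retrieve_fn` stands in for whatever retrieval function you are evaluating, and the field names `golden_chunk_id` and `id` mirror the dataset format described above:

```python
def pass_at_k(eval_items, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = 0
    for item in eval_items:
        retrieved_ids = [r["id"] for r in retrieve_fn(item["query"], k)]
        if item["golden_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_items)
```

We'll compute the same quantity inline in the evaluation loops below; this helper just makes the definition explicit.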
Install Dependencies
```bash
pip install anthropic voyageai cohere numpy pandas scikit-learn elasticsearch
```
Load and Prepare Data
```python
import json

# Load chunks and the evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]
```
Create Embeddings and Index
```python
import numpy as np
import voyageai
from sklearn.metrics.pairwise import cosine_similarity

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Build a simple vector index (using numpy for demonstration)
embedding_matrix = np.array(embeddings)
```
Evaluate Baseline Performance
```python
def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = cosine_similarity([query_emb], embedding_matrix)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [chunks[i] for i in top_indices]

# Evaluate Pass@10
correct = 0
for item in eval_data:
    results = retrieve(item['query'], k=10)
    if item['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Baseline Pass@10: {correct/len(eval_data)*100:.1f}%")
```
Expected output: ~87%
Step 2: Implementing Contextual Embeddings
The core idea is simple: before embedding each chunk, prepend a short context snippet that describes the chunk's origin. For codebases, this might include the file path, function name, and class name. For documents, it could be the section title, chapter, or surrounding paragraphs.
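Concretely, the transformation is just string concatenation. The snippet below shows what a contextualized chunk looks like — the context string is a hypothetical example of the kind of output Claude produces in the next step, and the file and function names are invented for illustration:

```python
# An isolated chunk: on its own, the embedding model has little to go on.
chunk = "if row is None:\n    return Error('user not found')"

# An illustrative context string (hypothetical — in practice Claude generates this).
context = ("This chunk is from get_user in auth/users.py; it handles the case "
           "where a user record lookup by id finds no matching row.")

# Contextual Retrieval simply prepends the context before embedding or indexing.
contextual_chunk = f"{context}\n\n{chunk}"
print(contextual_chunk)
```

The embedding model now sees the file, function, and purpose alongside the raw code, so the chunk's vector lands near queries about user lookups rather than near every generic error-handling snippet.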
Generate Context with Claude
We'll use Claude to generate a concise context for each chunk. With prompt caching, we can dramatically reduce costs by reusing the system prompt across multiple calls.
```python
from anthropic import Anthropic

client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

def generate_context(chunk, full_document):
    """Generate context for a chunk using Claude."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""

    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        system=[{
            "type": "text",
            "text": "You are a helpful assistant that provides context for document chunks.",
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```
Apply Context to All Chunks
```python
# Group chunks by their source document
documents = {}
for chunk in chunks:
    doc_id = chunk['document_id']
    if doc_id not in documents:
        documents[doc_id] = []
    documents[doc_id].append(chunk)

# Generate context for each chunk
contextual_chunks = []
for doc_id, doc_chunks in documents.items():
    full_document = "\n\n".join([c['content'] for c in doc_chunks])
    for chunk in doc_chunks:
        context = generate_context(chunk['content'], full_document)
        contextual_chunks.append({
            'id': chunk['id'],
            'content': f"{context}\n\n{chunk['content']}",
            'original_content': chunk['content'],
        })
```
Re-Embed and Evaluate
```python
# Embed the contextual chunks
contextual_embeddings = vo.embed(
    [c['content'] for c in contextual_chunks],
    model="voyage-2",
).embeddings
contextual_matrix = np.array(contextual_embeddings)

# Evaluate again
def contextual_retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = cosine_similarity([query_emb], contextual_matrix)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [contextual_chunks[i] for i in top_indices]

correct = 0
for item in eval_data:
    results = contextual_retrieve(item['query'], k=10)
    if item['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Contextual Embeddings Pass@10: {correct/len(eval_data)*100:.1f}%")
```
Expected output: ~95%
Step 3: Adding Contextual BM25
BM25 is a keyword-based retrieval method that complements dense embeddings. By applying the same context to BM25 indexing, we get Contextual BM25 — a hybrid approach that further improves recall.
Set Up BM25 Index
```bash
# Run Elasticsearch with Docker (security disabled for local development only)
docker run -d -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0
```
Index Contextual Chunks
```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index with explicit BM25 similarity settings
index_settings = {
    "settings": {
        "similarity": {
            "default": {
                "type": "BM25",
                "b": 0.75,
                "k1": 1.2,
            }
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "original_content": {"type": "text"},
        }
    },
}
es.indices.create(index="contextual_chunks", body=index_settings)

# Index the contextual chunks
for chunk in contextual_chunks:
    es.index(index="contextual_chunks", id=chunk['id'], body=chunk)

# Refresh so the newly indexed documents are immediately searchable
es.indices.refresh(index="contextual_chunks")
```
Hybrid Search with Weighted Scores
```python
def hybrid_search(query, k=10, alpha=0.5):
    # Dense retrieval
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    dense_scores = cosine_similarity([query_emb], contextual_matrix)[0]

    # Sparse retrieval (BM25)
    bm25_results = es.search(
        index="contextual_chunks",
        body={"query": {"match": {"content": query}}, "size": k},
    )

    # Combine scores (simplified)
    combined_scores = {}
    for i, score in enumerate(dense_scores):
        combined_scores[i] = alpha * score
    for hit in bm25_results['hits']['hits']:
        idx = next(i for i, c in enumerate(contextual_chunks) if c['id'] == hit['_id'])
        combined_scores[idx] = combined_scores.get(idx, 0) + (1 - alpha) * hit['_score']

    top_indices = sorted(combined_scores, key=combined_scores.get, reverse=True)[:k]
    return [contextual_chunks[i] for i in top_indices]
```
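One caveat about the simplified combination: cosine similarities fall roughly in [0, 1], while raw BM25 scores are unbounded and often an order of magnitude larger, so BM25 can dominate regardless of `alpha`. A common refinement is to min-max normalize each score list onto [0, 1] before mixing. A minimal sketch of that idea (not the only option — rank-based fusion such as reciprocal rank fusion is another popular choice):

```python
import numpy as np

def minmax(scores):
    """Scale a list of scores onto [0, 1]; a constant list maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def combine(dense_scores, sparse_scores, alpha=0.5):
    """Mix normalized dense and sparse scores, with weight alpha on the dense side."""
    return alpha * minmax(dense_scores) + (1 - alpha) * minmax(sparse_scores)
```

To use it, collect BM25 scores for the same candidate set as the dense scores (filling 0 for chunks BM25 did not return), then rank by the combined array instead of the raw weighted sum.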
Step 4: Improving with Reranking
Finally, we can use a cross-encoder reranker (like Cohere's) to refine the top-k results from hybrid search.
```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_k=5):
    results = co.rerank(
        query=query,
        documents=[c['content'] for c in candidates],
        top_n=top_k,
        model="rerank-english-v2.0",
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
def advanced_retrieve(query, k=10):
    initial_results = hybrid_search(query, k=20)  # Get more candidates, then rerank
    reranked = rerank(query, initial_results, top_k=k)
    return reranked
```
Cost Optimization with Prompt Caching
Generating context for thousands of chunks can be expensive. Use Claude's prompt caching to cache the system prompt and document context:
```python
# Cache the full document once, then reuse it for each chunk
cached_document = {
    "type": "text",
    "text": full_document,
    "cache_control": {"type": "ephemeral"},
}

# Subsequent calls reuse the cached document
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": "You are a helpful assistant...",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[
        {"role": "user", "content": [
            cached_document,
            {"type": "text", "text": f"<chunk>{chunk}</chunk>..."},
        ]},
    ],
)
```
This reduces API costs by up to 90% for large document sets.
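To see where the savings come from, consider the arithmetic: without caching, every chunk's request re-sends the full document as fresh input; with caching, the document is written to the cache once per document and each subsequent request pays only the much cheaper cache-read rate. The rates below are placeholder assumptions (a base input rate, a 25% cache-write premium, and a 90% cache-read discount) — substitute the current figures from Anthropic's pricing page:

```python
# Hypothetical per-million-token rates — check Anthropic's pricing page for real values.
INPUT_RATE = 0.25          # $/MTok, regular input (assumed)
CACHE_WRITE_RATE = 0.3125  # $/MTok, assuming a 25% premium over regular input
CACHE_READ_RATE = 0.025    # $/MTok, assuming a 90% discount on regular input

doc_tokens = 50_000     # example document size
chunks_per_doc = 100    # example chunk count

# Without caching: the full document is billed as fresh input for every chunk.
uncached = chunks_per_doc * doc_tokens / 1e6 * INPUT_RATE

# With caching: one cache write, then cache reads for the remaining chunks.
cached = (doc_tokens / 1e6 * CACHE_WRITE_RATE
          + (chunks_per_doc - 1) * doc_tokens / 1e6 * CACHE_READ_RATE)

print(f"without caching: ${uncached:.2f}  with caching: ${cached:.2f}")
```

Under these example numbers the document-token cost drops by roughly 90%, matching the claim above; the exact savings depend on document size, chunk count, and current pricing.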
Key Takeaways
- Contextual Embeddings reduce retrieval failure rates by 35% by adding relevant context to each chunk before embedding, making dense retrieval significantly more accurate.
- Combine Contextual Embeddings with Contextual BM25 for hybrid search that leverages both semantic meaning and keyword matching, further improving recall.
- Reranking with cross-encoders (like Cohere's) provides a final accuracy boost by re-scoring top candidates with a more powerful model.
- Prompt caching is essential for cost-effective implementation — it caches the system prompt and document context, reducing API costs by up to 90% when generating context for many chunks.
- This technique works on any platform — while demonstrated with Anthropic's API, you can implement Contextual Retrieval on AWS Bedrock (using the provided Lambda function) or GCP Vertex AI with minor customization.