How to Build a Contextual Retrieval System with Claude: A Practical Guide
Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25 with Claude. Includes code examples, evaluation metrics, and production tips.
This guide shows you how to enhance your RAG system by adding context to document chunks before embedding and BM25 indexing. You'll learn to reduce retrieval failure rates by 35% using Claude, Voyage AI, and Cohere.
Introduction
Retrieval Augmented Generation (RAG) is a powerful pattern that lets Claude answer questions using your own documents—codebases, internal wikis, customer support tickets, or any text corpus. But there's a catch: when you split documents into small chunks for retrieval, those chunks often lose their surrounding context. A chunk containing the line def calculate_interest() might be meaningless without knowing it belongs to a banking application.
In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using a dataset of 9 codebases. We'll walk through setup, baseline evaluation, implementation, and optimization—including how prompt caching makes this approach cost-effective in production.
Prerequisites
Before diving in, make sure you have:
- Python 3.8+ installed
- Docker (optional, for BM25 search)
- 4 GB+ RAM and ~5–10 GB of free disk space
- API keys for Anthropic, Voyage AI, and Cohere
- Basic understanding of RAG and vector databases
Step 1: Setting Up the Basic RAG Pipeline
First, let's establish a baseline. We'll use a pre-chunked dataset of 9 codebases (available in data/codebase_chunks.json) and 248 test queries with known "golden chunks" (in data/evaluation_set.jsonl). Our metric is Pass@k—whether the correct chunk appears in the top-k retrieved results.
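To make the data format concrete, here is roughly what one chunk record and one evaluation record look like. The field names below (id, content, surrounding_text, query, golden_chunk_id) are simply the ones assumed by the code in this guide, and the values are invented for illustration; adjust if your copy of the dataset differs.

# Illustrative records (field names as used by the code in this guide)
example_chunk = {
    "id": "chunk_0042",
    "content": "def calculate_interest(principal, rate):\n    ...",
    "surrounding_text": "full text of the source file this chunk came from"
}
example_query = {
    "query": "How is interest calculated?",
    "golden_chunk_id": "chunk_0042"
}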
Install Dependencies
pip install anthropic voyageai cohere numpy elasticsearch
Load and Embed Chunks
import json
import numpy as np
import voyageai

# Initialize the Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# Load the pre-chunked codebase dataset
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Embed all chunks (basic approach). For large corpora you may need to
# batch these calls to stay within the API's per-request limits.
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple in-memory vector index for the demo
index = np.array(embeddings)
Evaluate Baseline Performance
# Load evaluation queries
with open("data/evaluation_set.jsonl", "r") as f:
    eval_data = [json.loads(line) for line in f]

# For each query, find the top-10 chunks by cosine similarity
# (assuming unit-length embeddings, the dot product equals cosine similarity)
def search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(index, q_emb)
    top_k = np.argsort(scores)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]

pass_at_10 = 0
for item in eval_data:
    retrieved = search(item["query"], k=10)
    if item["golden_chunk_id"] in retrieved:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_data):.2%}")
Expected: ~87%
Step 2: Implementing Contextual Embeddings
The core idea is simple: for each chunk, ask Claude to generate a short piece of context that explains what the chunk is about and where it fits in the larger document. Then prepend that context to the chunk before embedding.
Generate Context for Each Chunk
import anthropic

client = anthropic.Anthropic(api_key="your-anthropic-api-key")

def generate_context(chunk_text, surrounding_text):
    """Ask Claude to generate context for a chunk."""
    prompt = f"""<document>
{surrounding_text}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

# Example: generate context for a single chunk
chunk = chunks[0]
context = generate_context(chunk["content"], chunk["surrounding_text"])
print(f"Context: {context}")
Use Prompt Caching to Reduce Costs
Generating context for thousands of chunks can get expensive. Anthropic's prompt caching feature lets you reuse the same document prefix across multiple calls, dramatically lowering costs.
# With prompt caching (Anthropic API): mark the shared document text as cacheable
# so it is processed once, then reused for every chunk from the same document.
# (Drop-in replacement for the messages.create call inside generate_context.)
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"<document>\n{surrounding_text}\n</document>",
             "cache_control": {"type": "ephemeral"}},
            {"type": "text",
             "text": f"Here is the chunk we want to situate within the whole document:\n<chunk>\n{chunk_text}\n</chunk>\nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."}
        ]
    }]
)
Re-embed with Context
# Generate a context string for each chunk and prepend it before embedding
contextual_chunks = []
for chunk in chunks:
    ctx = generate_context(chunk["content"], chunk["surrounding_text"])
    contextual_chunks.append(f"{ctx}\n\n{chunk['content']}")

# Re-embed the contextualized chunks
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings
contextual_index = np.array(contextual_embeddings)

# Re-evaluate
def contextual_search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(contextual_index, q_emb)
    top_k = np.argsort(scores)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]

pass_at_10 = 0
for item in eval_data:
    retrieved = contextual_search(item["query"], k=10)
    if item["golden_chunk_id"] in retrieved:
        pass_at_10 += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10 / len(eval_data):.2%}")
Expected: ~95%
Step 3: Adding Contextual BM25
BM25 is a keyword-based retrieval method that complements dense embeddings. By applying the same chunk-specific context to BM25 indexing, you get a "Contextual BM25" that outperforms standard BM25.
Set Up BM25 with Elasticsearch (Docker)
docker run -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.0
Index Contextual Chunks
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index that uses BM25 similarity (the Elasticsearch default)
mapping = {
    "settings": {
        "similarity": {
            "default": {
                "type": "BM25"
            }
        }
    },
    "mappings": {
        "properties": {
            "contextual_content": {"type": "text"}
        }
    }
}
es.indices.create(index="contextual_chunks", body=mapping)

# Index each chunk together with its generated context
for i, chunk in enumerate(chunks):
    es.index(index="contextual_chunks", id=i, body={
        "contextual_content": contextual_chunks[i]
    })
Hybrid Search (Dense + BM25)
Combine dense retrieval scores with BM25 scores for best results:
def hybrid_search(query, k=10, alpha=0.5):
    # Dense search
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    dense_scores = np.dot(contextual_index, q_emb)
    # BM25 search
    bm25_results = es.search(index="contextual_chunks", body={
        "query": {"match": {"contextual_content": query}},
        "size": k
    })
    # Normalize both score vectors to a comparable scale, then combine
    bm25_scores = np.zeros(len(chunks))
    for hit in bm25_results["hits"]["hits"]:
        bm25_scores[int(hit["_id"])] = hit["_score"]
    dense_scores = dense_scores / (np.abs(dense_scores).max() + 1e-9)
    bm25_scores = bm25_scores / (bm25_scores.max() + 1e-9)
    combined = alpha * dense_scores + (1 - alpha) * bm25_scores
    top_k = np.argsort(combined)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]
Step 4: Reranking for Final Precision
Even with contextual retrieval, the top-10 results may contain irrelevant chunks. Adding a reranker (e.g., Cohere's rerank model) can boost Pass@1 significantly.
import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query, candidates, top_k=3):
    results = co.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    # Map reranked results back to the original candidate texts by index
    return [candidates[r.index] for r in results.results]
# Example usage: rerank the top-10 hybrid results down to 3
query = "How does the authentication module work?"
top_10_ids = hybrid_search(query, k=10)
id_to_content = {chunk["id"]: chunk["content"] for chunk in chunks}
candidate_chunks = [id_to_content[cid] for cid in top_10_ids]
final_results = rerank(query, candidate_chunks, top_k=3)
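The claim that reranking boosts Pass@1 is easy to check against the same evaluation set. The sketch below is one way to do it, reusing hybrid_search, rerank, and the id_to_content lookup defined above; the rerank_top1_id helper is an illustrative name introduced here, not part of the dataset or any library.

# Sketch: measure Pass@1 for the full pipeline (hybrid search + reranking)
def rerank_top1_id(query, k=10):
    candidate_ids = hybrid_search(query, k=k)
    candidates = [id_to_content[cid] for cid in candidate_ids]
    best_text = rerank(query, candidates, top_k=1)[0]
    # Map the winning text back to its chunk id
    return candidate_ids[candidates.index(best_text)]

pass_at_1 = sum(
    rerank_top1_id(item["query"]) == item["golden_chunk_id"]
    for item in eval_data
)
print(f"Hybrid + rerank Pass@1: {pass_at_1 / len(eval_data):.2%}")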
Production Considerations
Cost Management with Prompt Caching
Generating context for every chunk is the most expensive step. Prompt caching reduces this cost by ~50–70% because the full document only needs to be processed once per document, not once per chunk.
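Before running a full indexing job, it can help to sanity-check the cost with a back-of-envelope estimate. The numbers below (chunks per document, token counts, per-token prices) are placeholders rather than measurements, so substitute current pricing and your own corpus statistics.

# Back-of-envelope cost estimate for context generation (all inputs are placeholders)
num_docs = 9
chunks_per_doc = 800
doc_tokens = 50_000          # tokens in one full document
chunk_tokens = 300           # tokens in one chunk plus instructions
output_tokens = 75           # generated context per chunk

input_price = 0.25 / 1_000_000   # $/input token (placeholder)
cached_price = 0.03 / 1_000_000  # $/cached input token (placeholder)
output_price = 1.25 / 1_000_000  # $/output token (placeholder)

def total_cost(doc_token_price):
    # Ignores the one-time cache-write premium paid on the first chunk per document
    per_chunk = (doc_tokens * doc_token_price
                 + chunk_tokens * input_price
                 + output_tokens * output_price)
    return num_docs * chunks_per_doc * per_chunk

print(f"Without caching: ${total_cost(input_price):.2f}")
print(f"With caching (document read from cache): ${total_cost(cached_price):.2f}")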
Deployment on AWS Bedrock
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function (provided in the cookbook's contextual-rag-lambda-function folder) as a custom chunking strategy. This lets you add context to chunks before they're indexed in Bedrock.
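At a very high level, such a Lambda receives batches of parsed document content, applies your chunking plus context generation, and returns the transformed chunks for Bedrock to index. The skeleton below only sketches that flow: the event and response shapes, field names, and the split_into_chunks and add_context helpers are illustrative assumptions, so follow the schema used in the cookbook's contextual-rag-lambda-function folder for a real deployment.

# Hypothetical skeleton of a custom chunking Lambda (schema is illustrative only)
def split_into_chunks(text, size=2000):
    # Naive fixed-size chunking; replace with your real chunking strategy
    return [text[i:i + size] for i in range(0, len(text), size)]

def add_context(document_text, chunk_text):
    # Call Claude here, e.g. with the generate_context prompt from Step 2
    return f"[context for this chunk]\n\n{chunk_text}"

def lambda_handler(event, context):
    transformed = []
    for file in event.get("inputFiles", []):            # assumed field name
        document_text = file.get("content", "")          # assumed field name
        for chunk in split_into_chunks(document_text):
            transformed.append({"content": add_context(document_text, chunk)})
    return {"outputFiles": transformed}                  # assumed response shape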
Choosing the Right Model
- Claude 3 Haiku is ideal for context generation—it's fast, cheap, and accurate enough for this task.
- Voyage 2 provides good general-purpose embeddings. For domain-specific data such as code, consider one of Voyage's domain-specific models (for example, voyage-code-2).
- Cohere Rerank is recommended for the final reranking step.
Conclusion
Contextual Retrieval is a simple but powerful upgrade to any RAG system. By adding a small amount of context to each chunk before embedding and BM25 indexing, you can reduce retrieval failures by over a third—without changing your underlying infrastructure.
Key Takeaways
- Context matters: Adding chunk-specific context before embedding reduces retrieval failure rates by 35% on average.
- Dual retrieval is better: Combining Contextual Embeddings with Contextual BM25 (hybrid search) outperforms either method alone.
- Prompt caching makes it practical: Use Anthropic's prompt caching to cut context-generation costs by 50–70%.
- Reranking adds polish: A final reranking step (e.g., Cohere) can further improve top-1 accuracy.
- Works with existing infrastructure: you can deploy this on AWS Bedrock (using a custom chunking Lambda), GCP Vertex AI, or any vector database.