How to Build a Contextual Retrieval System with Claude: A Practical Guide
This guide shows you how to improve RAG performance by adding context to document chunks before embedding. Using Contextual Embeddings and BM25, you can reduce retrieval failure rates by 35% and boost Pass@10 accuracy from 87% to 95%.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context they need to be useful. A chunk containing "the revenue increased by 20%" is meaningless without knowing which company, quarter, or product line it refers to.
Contextual Retrieval solves this by prepending a short, chunk-specific context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets, and a jump in Pass@10 accuracy from ~87% to ~95% on codebase queries.
In this guide, you'll build a complete Contextual Retrieval system using Claude, Voyage AI embeddings, and Cohere reranking. You'll learn:
- How to set up a basic RAG pipeline as a baseline
- Why Contextual Embeddings work and how prompt caching makes them affordable
- How to implement Contextual BM25 for hybrid search
- How reranking further boosts performance
Prerequisites
Before starting, make sure you have:
- Python 3.8+ installed
- Docker (optional, for BM25 search)
- 4GB+ RAM and ~5-10 GB disk space
- API keys for Anthropic, Voyage AI, and Cohere
- Basic familiarity with RAG, vector databases, and embeddings
Step 1: Set Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere pandas numpy rank_bm25
Then initialize your clients:
import anthropic
import voyageai
import cohere
# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client("YOUR_COHERE_KEY")
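Hardcoding keys is fine for a quick experiment, but it's safer to read them from environment variables. A minimal sketch, assuming you've exported the keys under the variable names shown (the names are a convention, not required by the SDKs):
import os

claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])
co = cohere.Client(os.environ["COHERE_API_KEY"])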
Step 2: Build a Basic RAG Baseline
Before improving retrieval, you need a baseline. Load your chunked dataset and create a simple vector index.
import json
# Load pre-chunked codebase data
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)
# Generate embeddings for each chunk
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings
# Store in a simple in-memory index (use FAISS or Pinecone for production)
import numpy as np
embedding_matrix = np.array(embeddings)
Now define a retrieval function:
def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(embedding_matrix, query_emb)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]
Evaluate using Pass@k. With a dataset of 248 queries (each with a known "golden chunk"), your baseline Pass@10 should land around 87%.
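The guide doesn't prescribe an evaluation harness, so here is a minimal Pass@k sketch. It assumes each evaluation entry holds the query text plus a golden_chunk_id that matches an id field on your chunks; adjust the field names to your dataset.
def pass_at_k(eval_queries, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = 0
    for item in eval_queries:
        retrieved = retrieve(item["query"], k=k)
        if any(c["id"] == item["golden_chunk_id"] for c in retrieved):
            hits += 1
    return hits / len(eval_queries)

# Example usage: print(f"Baseline Pass@10: {pass_at_k(eval_queries):.1%}")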
Step 3: Implement Contextual Embeddings
The core idea is simple: before embedding each chunk, ask Claude to generate a short piece of context that explains what the chunk is about.
The Context Generation Prompt
def generate_context(chunk_text, document_text):
    prompt = f"""<document>
{document_text}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Making It Cost-Effective with Prompt Caching
Generating context for every chunk individually would be expensive. Prompt caching solves this by reusing the document prefix across multiple chunk requests.
# Cache the full document as a system-prompt prefix. Repeated calls for chunks
# of the same document then reuse the cached prefix instead of re-processing
# the entire document each time.
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{"type": "text", "text": document_text, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": f"<chunk>{chunk_text}</chunk>\n\nGive a short succinct context to situate this chunk within the document above, for the purposes of improving search retrieval. Answer only with the succinct context."}]
)
chunk_context = response.content[0].text
With caching, the cost drops dramatically—often by 90% or more—making Contextual Embeddings viable for production.
Embed the Contextualized Chunks
contextualized_chunks = []
for chunk in chunks:
    context = generate_context(chunk["content"], chunk["document"])
    contextualized_text = f"{context}\n\n{chunk['content']}"
    contextualized_chunks.append(contextualized_text)
# Re-embed with Voyage AI and refresh the in-memory index
new_embeddings = vo.embed(contextualized_chunks, model="voyage-2").embeddings
embedding_matrix = np.array(new_embeddings)
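Because retrieve reads from embedding_matrix, the same Pass@k harness sketched in Step 2 now measures the contextualized index:
# Re-run the baseline evaluation over the contextualized embeddings
print(f"Contextual Pass@10: {pass_at_k(eval_queries):.1%}")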
After re-evaluating, your Pass@10 should jump to ~95%.
Step 4: Add Contextual BM25 for Hybrid Search
BM25 is a text-based retrieval method that complements semantic search. You can apply the same context to BM25 by indexing the contextualized chunks instead of raw chunks.
# Using a simple BM25 implementation (e.g., rank_bm25)
from rank_bm25 import BM25Okapi
tokenized_corpus = [chunk.split() for chunk in contextualized_chunks]
bm25 = BM25Okapi(tokenized_corpus)
def hybrid_search(query, k=10, alpha=0.5):
    # Semantic search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    semantic_scores = np.dot(embedding_matrix, query_emb)
    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize and combine
    combined = alpha * (semantic_scores / np.max(semantic_scores)) + \
               (1 - alpha) * (bm25_scores / np.max(bm25_scores))
    top_indices = np.argsort(combined)[-k:][::-1]
    return [chunks[i] for i in top_indices]
Hybrid search with Contextual BM25 typically yields another 2–5% improvement over Contextual Embeddings alone.
Step 5: Rerank for Final Precision
Even with excellent retrieval, the top-10 results may contain irrelevant chunks. A reranker (like Cohere's) re-orders results based on deeper relevance scoring.
def rerank(query, retrieved_chunks, k=10):
    results = co.rerank(
        query=query,
        documents=[chunk["content"] for chunk in retrieved_chunks],
        top_n=k,
        model="rerank-english-v2.0"
    )
    return [retrieved_chunks[r.index] for r in results.results]
Reranking can push Pass@5 to 98%+ and is especially valuable when you need high precision (e.g., legal or medical Q&A).
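Putting the pieces together, a reasonable final pipeline is: hybrid retrieval with a wide net, then rerank down to the final k. A sketch using the functions defined above; the over-retrieval factor of 2 and the example query are illustrative choices, not tuned values.
def contextual_retrieval(query, k=10):
    # Over-retrieve with hybrid search, then let the reranker pick the best k
    candidates = hybrid_search(query, k=k * 2)
    return rerank(query, candidates, k=k)

# Example query (illustrative)
top_chunks = contextual_retrieval("How does the scheduler assign tasks to workers?")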
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The function code is included in the Anthropic cookbook under contextual-rag-lambda-function/lambda_function.py. Select this Lambda as a custom chunking option when configuring your knowledge base.
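The exact event contract for a custom chunking Lambda is defined by Bedrock and implemented in the cookbook file above; the sketch below only illustrates the shape of the work such a function does. The event fields here are placeholders, not the real Bedrock schema.
def lambda_handler(event, context):
    # Placeholder event shape: one document plus its pre-split chunks.
    # See contextual-rag-lambda-function/lambda_function.py for the real contract.
    document_text = event["document_text"]
    contextualized = []
    for chunk_text in event["chunks"]:
        # generate_context is the same helper defined in Step 3; in a real
        # Lambda you would initialize the Anthropic client at module scope.
        chunk_context = generate_context(chunk_text, document_text)
        contextualized.append(f"{chunk_context}\n\n{chunk_text}")
    return {"contextualized_chunks": contextualized}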
Cost Management
- Prompt caching is essential. It's available on Anthropic's first-party API and coming soon to Bedrock and Vertex AI; see the cost sketch after this list for why it matters.
- Use Claude 3 Haiku for context generation—it's fast and cheap.
- Batch your embedding calls to minimize API overhead.
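To see where the savings come from, here is a back-of-the-envelope comparison for context generation over one document. The prices are illustrative placeholders, not authoritative; substitute the current figures from your provider's pricing page.
# Illustrative cost comparison for generating context over one document's chunks.
# All prices are example $/million input tokens; verify against current pricing.
DOC_TOKENS = 50_000        # tokens in the full document
CHUNKS_PER_DOC = 100       # chunks that need a context each
BASE_INPUT = 0.25          # regular input tokens
CACHE_WRITE = 0.30         # first request that writes the cached prefix
CACHE_READ = 0.03          # subsequent requests that hit the cache

without_caching = CHUNKS_PER_DOC * DOC_TOKENS / 1e6 * BASE_INPUT
with_caching = (DOC_TOKENS / 1e6 * CACHE_WRITE
                + (CHUNKS_PER_DOC - 1) * DOC_TOKENS / 1e6 * CACHE_READ)
print(f"without caching: ${without_caching:.2f}, with caching: ${with_caching:.2f}")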
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the "lost context" problem in traditional RAG.
- Prompt caching makes Contextual Embeddings production-ready, cutting context generation costs by up to 90%.
- Contextual BM25 + semantic search (hybrid retrieval) yields the best results, combining lexical and semantic matching.
- Reranking with Cohere pushes precision even higher, achieving Pass@5 above 98%.
- AWS Bedrock users can deploy this as a Lambda function for seamless integration with Knowledge Bases.