
How to Build a Contextual Retrieval System with Claude: A Practical Guide

Learn how to implement Contextual Embeddings and Contextual BM25 to reduce RAG retrieval failure rates by 35% using Claude, Voyage AI, and Cohere.

Quick Answer

This guide shows you how to enhance RAG performance by adding context to document chunks before embedding, reducing retrieval failure rates by 35% using Claude, Voyage AI, and Cohere.

Tags: Claude, RAG, Contextual Embeddings, Retrieval Augmented Generation, Prompt Caching

Introduction

Retrieval Augmented Generation (RAG) is a powerful pattern that lets Claude answer questions using your internal knowledge bases, codebases, or any document corpus. But traditional RAG has a fundamental weakness: when you split documents into small chunks for efficient retrieval, individual chunks often lose the surrounding context they need to be meaningful.

Imagine searching for "the function returns an error" in a codebase. Without knowing which file or module that chunk belongs to, the embedding model can't accurately represent its meaning. This is where Contextual Retrieval comes in.

In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 — two techniques that add relevant context to each chunk before embedding or indexing. According to Anthropic's evaluations, this approach reduces the top-20-chunk retrieval failure rate by 35% on average across diverse datasets.

We'll walk through a complete implementation using a dataset of 9 codebases, showing you how to:

  • Set up a baseline RAG pipeline
  • Implement Contextual Embeddings with prompt caching to manage costs
  • Add Contextual BM25 for hybrid search
  • Improve results further with reranking

Prerequisites

Before diving in, make sure you have:

Technical Skills:
  • Intermediate Python programming
  • Basic understanding of RAG
  • Familiarity with vector databases and embeddings
API Keys:
  • Anthropic API key (for Claude)
  • Voyage AI API key (for embeddings)
  • Cohere API key (for reranking)
System Requirements:
  • Python 3.8+
  • Docker (optional, for BM25 search)
  • 4GB+ RAM
  • ~5–10 GB disk space for vector databases
Time & Cost:
  • 30–45 minutes to complete
  • ~$5–10 in API costs for the full dataset

Step 1: Setting Up a Basic RAG Pipeline

First, let's establish a baseline. We'll use a pre-chunked dataset of 9 codebases with 248 queries, each containing a "golden chunk" — the correct answer. Our metric is Pass@k, which checks if the golden chunk appears in the top-k retrieved results.
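To make the metric concrete, here is a small helper that mirrors the evaluation loops used throughout this guide (the field names golden_chunk_id and id follow this guide's dataset format):

def pass_at_k(eval_data, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = 0
    for item in eval_data:
        results = retrieve_fn(item['query'], k=k)
        if item['golden_chunk_id'] in [r['id'] for r in results]:
            hits += 1
    return hits / len(eval_data)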

Install Dependencies

pip install anthropic voyageai cohere numpy pandas scikit-learn elasticsearch

Load and Prepare Data

import json

# Load chunks and evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

Create Embeddings and Index

import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Build a simple vector index (using numpy for demonstration)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embedding_matrix = np.array(embeddings)

Evaluate Baseline Performance

def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = cosine_similarity([query_emb], embedding_matrix)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [chunks[i] for i in top_indices]

# Evaluate Pass@10
correct = 0
for item in eval_data:
    results = retrieve(item['query'], k=10)
    if item['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Baseline Pass@10: {correct/len(eval_data)*100:.1f}%")

Expected output: ~87%

Step 2: Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend a short context snippet that describes the chunk's origin. For codebases, this might include the file path, function name, and class name. For documents, it could be the section title, chapter, or surrounding paragraphs.
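For example, a raw chunk and its contextualized version might look like this (a purely hypothetical codebase chunk, for illustration only):

Before (raw chunk):

    def process(self, payload):
        return self._retry(payload)

After (context prepended, matching the f"{context}\n\n{chunk}" format used below):

    This chunk is from src/queue/consumer.py, inside the MessageConsumer class; it retries failed payloads before acknowledging them.

    def process(self, payload):
        return self._retry(payload)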

Generate Context with Claude

We'll use Claude to generate a concise context for each chunk. With prompt caching, we can dramatically reduce costs by reusing shared prompt content across calls; the biggest savings come from caching the full document itself, as shown in the Cost Optimization section below.

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

def generate_context(chunk, full_document):
    """Generate context for a chunk using Claude."""
    prompt = f"""<document>
{full_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        system=[{
            "type": "text",
            "text": "You are a helpful assistant that provides context for document chunks.",
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Apply Context to All Chunks

# Group chunks by their source document
documents = {}
for chunk in chunks:
    doc_id = chunk['document_id']
    if doc_id not in documents:
        documents[doc_id] = []
    documents[doc_id].append(chunk)

# Generate context for each chunk
contextual_chunks = []
for doc_id, doc_chunks in documents.items():
    full_document = "\n\n".join([c['content'] for c in doc_chunks])
    for chunk in doc_chunks:
        context = generate_context(chunk['content'], full_document)
        contextual_chunks.append({
            'id': chunk['id'],
            'content': f"{context}\n\n{chunk['content']}",
            'original_content': chunk['content']
        })

Re-Embed and Evaluate

# Embed contextual chunks
contextual_embeddings = vo.embed(
    [c['content'] for c in contextual_chunks], 
    model="voyage-2"
).embeddings

contextual_matrix = np.array(contextual_embeddings)

# Evaluate again
def contextual_retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = cosine_similarity([query_emb], contextual_matrix)[0]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [contextual_chunks[i] for i in top_indices]

correct = 0
for item in eval_data:
    results = contextual_retrieve(item['query'], k=10)
    if item['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Contextual Embeddings Pass@10: {correct/len(eval_data)*100:.1f}%")

Expected output: ~95%

Step 3: Adding Contextual BM25

BM25 is a keyword-based retrieval method that complements dense embeddings. By applying the same context to BM25 indexing, we get Contextual BM25 — a hybrid approach that further improves recall.

Set Up BM25 Index

# Using Elasticsearch with Docker
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0
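If you'd rather not run Elasticsearch, an in-memory alternative is the rank_bm25 package (pip install rank-bm25). This is only a sketch under that assumption; the rest of this guide continues with Elasticsearch:

from rank_bm25 import BM25Okapi

# Build an in-memory BM25 index over the contextualized chunks from Step 2
tokenized_corpus = [c['content'].lower().split() for c in contextual_chunks]
bm25_index = BM25Okapi(tokenized_corpus)

def bm25_retrieve(query, k=10):
    scores = bm25_index.get_scores(query.lower().split())
    top_indices = np.argsort(scores)[-k:][::-1]
    return [contextual_chunks[i] for i in top_indices]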

Index Contextual Chunks

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create index with BM25 similarity
index_settings = {
    "settings": {
        "similarity": {
            "default": {"type": "BM25", "b": 0.75, "k1": 1.2}
        }
    },
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "original_content": {"type": "text"}
        }
    }
}

es.indices.create(index="contextual_chunks", body=index_settings)

# Index chunks
for chunk in contextual_chunks:
    es.index(index="contextual_chunks", id=chunk['id'], body=chunk)

Hybrid Search with Weighted Scores

def hybrid_search(query, k=10, alpha=0.5):
    # Dense retrieval
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    dense_scores = cosine_similarity([query_emb], contextual_matrix)[0]
    
    # Sparse retrieval (BM25)
    bm25_results = es.search(
        index="contextual_chunks",
        body={"query": {"match": {"content": query}}, "size": k}
    )
    
    # Combine scores (simplified)
    combined_scores = {}
    for i, score in enumerate(dense_scores):
        combined_scores[i] = alpha * score
    
    for hit in bm25_results['hits']['hits']:
        idx = next(i for i, c in enumerate(contextual_chunks) if c['id'] == hit['_id'])
        combined_scores[idx] = combined_scores.get(idx, 0) + (1 - alpha) * hit['_score']
    
    top_indices = sorted(combined_scores, key=combined_scores.get, reverse=True)[:k]
    return [contextual_chunks[i] for i in top_indices]
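The weighted sum above mixes cosine similarities (roughly 0 to 1) with unbounded BM25 scores, so alpha is only a rough heuristic. A common alternative is Reciprocal Rank Fusion (RRF), which combines ranks instead of raw scores. Here is a minimal sketch; the constant 60 is the conventional RRF default, not a value from this guide's dataset:

def rrf_search(query, k=10, rrf_k=60):
    # Rank all chunks by dense similarity
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    dense_scores = cosine_similarity([query_emb], contextual_matrix)[0]
    dense_rank = {idx: rank for rank, idx in enumerate(np.argsort(dense_scores)[::-1])}

    # Rank the chunks returned by BM25
    bm25_results = es.search(
        index="contextual_chunks",
        body={"query": {"match": {"content": query}}, "size": k * 2}
    )
    bm25_rank = {}
    for rank, hit in enumerate(bm25_results['hits']['hits']):
        idx = next(i for i, c in enumerate(contextual_chunks) if c['id'] == hit['_id'])
        bm25_rank[idx] = rank

    # RRF: each retriever contributes 1 / (rrf_k + rank); missing entries get a large rank
    default_rank = len(contextual_chunks)
    fused = {}
    for idx in set(dense_rank) | set(bm25_rank):
        fused[idx] = (1 / (rrf_k + dense_rank.get(idx, default_rank))
                      + 1 / (rrf_k + bm25_rank.get(idx, default_rank)))

    top_indices = sorted(fused, key=fused.get, reverse=True)[:k]
    return [contextual_chunks[i] for i in top_indices]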

Step 4: Improving with Reranking

Finally, we can use a cross-encoder reranker (like Cohere's) to refine the top-k results from hybrid search.

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_k=5):
    results = co.rerank(
        query=query,
        documents=[c['content'] for c in candidates],
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
def advanced_retrieve(query, k=10):
    initial_results = hybrid_search(query, k=20)  # Get more candidates
    reranked = rerank(query, initial_results, top_k=k)
    return reranked
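To see the end-to-end effect, reuse the same Pass@10 loop from the earlier steps with advanced_retrieve; the exact number depends on your dataset and model versions, so no figure is given here:

correct = 0
for item in eval_data:
    results = advanced_retrieve(item['query'], k=10)
    if item['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Hybrid + rerank Pass@10: {correct/len(eval_data)*100:.1f}%")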

Cost Optimization with Prompt Caching

Generating context for thousands of chunks can be expensive. Use Claude's prompt caching to cache the system prompt and document context:

# Cache the full document once, then reuse for each chunk
cached_document = {
    "type": "text",
    "text": full_document,
    "cache_control": {"type": "ephemeral"}
}

# Subsequent calls reuse the cached document
response = client.messages.create(
    model="claude-3-haiku-20240307",
    system=[{
        "type": "text",
        "text": "You are a helpful assistant...",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": [
            cached_document,
            {"type": "text", "text": f"<chunk>{chunk}</chunk>..."}
        ]
    }]
)

This reduces API costs by up to 90% for large document sets.
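As a rough illustration (assuming Anthropic's standard cache pricing of about 1.25× the base input rate for cache writes and about 0.1× for cache reads, and that a document's chunks are processed within the cache's five-minute lifetime): for a 50,000-token document split into 100 chunks, resending the document with every call costs about 100 × 50,000 = 5,000,000 input tokens at the base rate. With caching, you pay roughly 1.25 × 50,000 = 62,500 token-equivalents for the initial cache write plus 0.1 × 50,000 = 5,000 per cached read for the other 99 chunks (about 495,000), around 557,500 in total, or roughly an 89% reduction on the document portion of the bill.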

Key Takeaways

  • Contextual Embeddings reduce retrieval failure rates by 35% by adding relevant context to each chunk before embedding, making dense retrieval significantly more accurate.
  • Combine Contextual Embeddings with Contextual BM25 for hybrid search that leverages both semantic meaning and keyword matching, further improving recall.
  • Reranking with cross-encoders (like Cohere's) provides a final accuracy boost by re-scoring top candidates with a more powerful model.
  • Prompt caching is essential for cost-effective implementation — it caches the system prompt and document context, reducing API costs by up to 90% when generating context for many chunks.
  • This technique works on any platform — while demonstrated with Anthropic's API, you can implement Contextual Retrieval on AWS Bedrock (using the provided Lambda function) or GCP Vertex AI with minor customization.