
How to Build a Contextual Retrieval System with Claude: A Practical Guide

Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25 with Claude. Includes code examples, evaluation metrics, and production tips.

Quick Answer

This guide shows you how to enhance your RAG system by adding context to document chunks before embedding and BM25 indexing. You'll learn to reduce retrieval failure rates by 35% using Claude, Voyage AI, and Cohere.

RAG · Contextual Retrieval · Claude · Embeddings · Prompt Caching

Introduction

Retrieval Augmented Generation (RAG) is a powerful pattern that lets Claude answer questions using your own documents—codebases, internal wikis, customer support tickets, or any text corpus. But there's a catch: when you split documents into small chunks for retrieval, those chunks often lose their surrounding context. A chunk containing the line def calculate_interest() might be meaningless without knowing it belongs to a banking application.

Contextual Retrieval solves this problem by prepending a short, chunk-specific context to each piece before embedding and indexing. The result? More accurate retrieval, fewer missed documents, and better answers from Claude. In Anthropic's tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.
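Concretely, "adding context" just means prepending a sentence or two of explanation to the chunk text before it is embedded and indexed. The example below is invented for illustration (it is not from Anthropic's dataset):

Before (what a standard pipeline embeds):

    def calculate_interest(principal, rate, days):
        return principal * rate * days / 365

After (what Contextual Retrieval embeds and indexes):

    This chunk is from the loans module of a banking application; it computes
    simple daily interest used by the repayment scheduler.

    def calculate_interest(principal, rate, days):
        return principal * rate * days / 365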

In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using a dataset of 9 codebases. We'll walk through setup, baseline evaluation, implementation, and optimization—including how prompt caching makes this approach cost-effective in production.

Prerequisites

Before diving in, make sure you have:

  • Python 3.8+ installed
  • Docker (optional, for BM25 search)
  • 4GB+ RAM and ~5-10 GB free disk space
  • API keys for Anthropic, Voyage AI, and Cohere
  • Basic understanding of RAG and vector databases
Estimated time: 30–45 minutes
API cost: ~$5–10 to run the full dataset

Step 1: Setting Up the Basic RAG Pipeline

First, let's establish a baseline. We'll use a pre-chunked dataset of 9 codebases (available in data/codebase_chunks.json) and 248 test queries with known "golden chunks" (in data/evaluation_set.jsonl). Our metric is Pass@k—whether the correct chunk appears in the top-k retrieved results.
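For orientation, a single line of the evaluation set looks roughly like this (the field names match what the evaluation code in this guide expects; the actual file may carry additional fields):

{"query": "Where is request signing implemented?", "golden_chunk_id": "repo-3-chunk-0042"}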

Install Dependencies

pip install anthropic voyageai cohere numpy elasticsearch

Load and Embed Chunks

import json
import numpy as np
import voyageai

# Initialize Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# Load chunks
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Embed all chunks (basic approach)
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple vector index (in-memory for demo)
index = np.array(embeddings)
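One caveat on the in-memory index: the dot-product search used below matches cosine similarity only if the embeddings are unit-length. If you're unsure whether your embedding model returns normalized vectors, normalizing the index rows up front makes the dot-product ranking identical to a cosine ranking (the query's own norm is a per-query constant, so it doesn't change the ordering):

# Optional: L2-normalize the index so dot products rank chunks exactly like
# cosine similarity would (skip if your embeddings are already unit-length)
index = index / np.linalg.norm(index, axis=1, keepdims=True)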

Evaluate Baseline Performance

# Load evaluation queries
with open("data/evaluation_set.jsonl", "r") as f:
    eval_data = [json.loads(line) for line in f]

# For each query, find top-10 chunks by cosine similarity
def search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(index, q_emb)
    top_k = np.argsort(scores)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]

pass_at_10 = 0
for item in eval_data:
    retrieved = search(item["query"], k=10)
    if item["golden_chunk_id"] in retrieved:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_data):.2%}")

Expected: ~87%

Step 2: Implementing Contextual Embeddings

The core idea is simple: for each chunk, ask Claude to generate a short piece of context that explains what the chunk is about and where it fits in the larger document. Then prepend that context to the chunk before embedding.

Generate Context for Each Chunk

import anthropic

client = anthropic.Anthropic(api_key="your-anthropic-api-key")

def generate_context(chunk_text, surrounding_text):
    """Ask Claude to generate context for a chunk."""
    prompt = f"""<document>
{surrounding_text}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

# Example: generate context for a chunk
chunk = chunks[0]
context = generate_context(chunk["content"], chunk["surrounding_text"])
print(f"Context: {context}")

Use Prompt Caching to Reduce Costs

Generating context for thousands of chunks can get expensive. Anthropic's prompt caching feature lets you reuse the same document prefix across multiple calls, dramatically lowering costs.

# With prompt caching (Anthropic API): the full document goes into the cached
# prefix, so it is processed once and reused for every chunk of that document
chunk_prompt = f"""Here is the chunk we want to situate within the whole document:
<chunk>
{chunk["content"]}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>\n{chunk['surrounding_text']}\n</document>",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": chunk_prompt}]
)

Re-embed with Context

# Generate a context for every chunk and prepend it to the chunk content
contexts = [generate_context(c["content"], c["surrounding_text"]) for c in chunks]
contextual_chunks = [f"{ctx}\n\n{c['content']}" for ctx, c in zip(contexts, chunks)]

# Re-embed
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings
contextual_index = np.array(contextual_embeddings)

# Re-evaluate
def contextual_search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(contextual_index, q_emb)
    top_k = np.argsort(scores)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]

pass_at_10 = 0
for item in eval_data:
    retrieved = contextual_search(item["query"], k=10)
    if item["golden_chunk_id"] in retrieved:
        pass_at_10 += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10 / len(eval_data):.2%}")

Expected: ~95%

Step 3: Adding Contextual BM25

BM25 is a keyword-based retrieval method that complements dense embeddings. By applying the same chunk-specific context to BM25 indexing, you get a "Contextual BM25" that outperforms standard BM25.
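If you'd rather not run Elasticsearch at all (the prerequisites list Docker as optional), the idea can be sketched in-process with the rank_bm25 package. This is an illustrative alternative, not part of the original cookbook; the Elasticsearch setup below is the path the rest of this guide uses.

# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Index the same contextualized chunks with plain BM25 (whitespace tokenization
# keeps the sketch simple; a real tokenizer will do better on code)
tokenized = [text.lower().split() for text in contextual_chunks]
bm25 = BM25Okapi(tokenized)

def bm25_search(query, k=10):
    scores = bm25.get_scores(query.lower().split())
    top_k = np.argsort(scores)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]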

Set Up BM25 with Elasticsearch (Docker)

docker run -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.11.0

Index Contextual Chunks

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create index with BM25 similarity
mapping = {
    "settings": {"similarity": {"default": {"type": "BM25"}}},
    "mappings": {"properties": {"contextual_content": {"type": "text"}}}
}
es.indices.create(index="contextual_chunks", body=mapping)

# Index each chunk with its context
for i, chunk in enumerate(chunks):
    es.index(index="contextual_chunks", id=i, body={
        "contextual_content": contextual_chunks[i]
    })

Hybrid Search (Dense + BM25)

Combine dense retrieval scores with BM25 scores for best results:

def hybrid_search(query, k=10, alpha=0.5):
    # Dense search
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    dense_scores = np.dot(contextual_index, q_emb)

    # BM25 search
    bm25_results = es.search(index="contextual_chunks", body={
        "query": {"match": {"contextual_content": query}},
        "size": k
    })

    # Normalize BM25 scores to [0, 1] so they are comparable to cosine scores,
    # then combine the two signals with an alpha weighting
    bm25_scores = np.zeros(len(chunks))
    for hit in bm25_results["hits"]["hits"]:
        bm25_scores[int(hit["_id"])] = hit["_score"]
    if bm25_scores.max() > 0:
        bm25_scores = bm25_scores / bm25_scores.max()

    combined = alpha * dense_scores + (1 - alpha) * bm25_scores
    top_k = np.argsort(combined)[-k:][::-1]
    return [chunks[i]["id"] for i in top_k]
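As a quick usage check, the same Pass@10 loop from earlier can be rerun against hybrid_search (the 50/50 alpha weighting is a reasonable starting point rather than a tuned value):

pass_at_10 = 0
for item in eval_data:
    retrieved = hybrid_search(item["query"], k=10)
    if item["golden_chunk_id"] in retrieved:
        pass_at_10 += 1

print(f"Hybrid (embeddings + BM25) Pass@10: {pass_at_10 / len(eval_data):.2%}")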

Step 4: Reranking for Final Precision

Even with contextual retrieval, the top-10 results may contain irrelevant chunks. Adding a reranker (e.g., Cohere's rerank model) can boost Pass@1 significantly.

import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query, candidates, top_k=3):
    results = co.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    # Map the reranked positions back to the candidate texts
    return [candidates[r.index] for r in results.results]

# Example usage: rerank the top-10 candidates from hybrid retrieval
query = "How does the authentication module work?"
top_10_ids = hybrid_search(query, k=10)
id_to_chunk = {c["id"]: c for c in chunks}
candidate_chunks = [id_to_chunk[cid]["content"] for cid in top_10_ids]
final_results = rerank(query, candidate_chunks, top_k=3)

Production Considerations

Cost Management with Prompt Caching

Generating context for every chunk is the most expensive step. Prompt caching reduces this cost by ~50–70% because the full document only needs to be processed once per document, not once per chunk.
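A rough back-of-the-envelope token accounting shows why. All numbers below are illustrative assumptions (plug in your own corpus statistics and current pricing); actual dollar savings also depend on cache-write surcharges and output tokens, which is why they land in the 50–70% range rather than matching the raw token reduction:

# Effective input-token accounting for context generation (illustrative numbers)
doc_tokens = 8_000         # assumed average tokens per document
chunk_tokens = 200         # assumed average tokens per chunk prompt
chunks_per_doc = 40        # assumed average chunks per document
cache_read_discount = 0.1  # assumed ratio of cache-read price to normal input price

# Without caching, every call re-sends the full document
without_cache = chunks_per_doc * (doc_tokens + chunk_tokens)

# With caching, the document is paid for once, then read from cache on later calls
with_cache = (doc_tokens
              + (chunks_per_doc - 1) * doc_tokens * cache_read_discount
              + chunks_per_doc * chunk_tokens)

print(f"Effective input tokens per document: {without_cache:,} vs {with_cache:,}")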

Deployment on AWS Bedrock

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function (provided in the cookbook's contextual-rag-lambda-function folder) as a custom chunking strategy. This lets you add context to chunks before they're indexed in Bedrock.
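The exact request/response contract for a Bedrock custom chunking Lambda is defined by the Knowledge Bases integration and isn't reproduced here; the cookbook's Lambda handles that wiring. Conceptually, though, its core is just the context-generation step from Step 2 applied to each incoming chunk, as in this hypothetical skeleton (the event fields are placeholders, not the real Bedrock payload):

def lambda_handler(event, context):
    # Placeholder payload shape: a document plus its pre-split chunks.
    # Adapt the field names to the actual Bedrock Knowledge Bases contract.
    document_text = event["document"]
    contextualized = []
    for chunk in event["chunks"]:
        ctx = generate_context(chunk["content"], document_text)
        contextualized.append({"content": f"{ctx}\n\n{chunk['content']}"})
    return {"chunks": contextualized}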

Choosing the Right Model

  • Claude 3 Haiku is ideal for context generation—it's fast, cheap, and accurate enough for this task.
  • Voyage 2 provides good general-purpose embeddings. For domain-specific data, consider Voyage's fine-tuned models.
  • Cohere Rerank is recommended for the final reranking step.

Conclusion

Contextual Retrieval is a simple but powerful upgrade to any RAG system. By adding a small amount of context to each chunk before embedding and BM25 indexing, you can reduce retrieval failures by over a third—without changing your underlying infrastructure.

Key Takeaways

  • Context matters: Adding chunk-specific context before embedding reduces retrieval failure rates by 35% on average.
  • Dual retrieval is better: Combining Contextual Embeddings with Contextual BM25 (hybrid search) outperforms either method alone.
  • Prompt caching makes it practical: Use Anthropic's prompt caching to cut context-generation costs by 50–70%.
  • Reranking adds polish: A final reranking step (e.g., Cohere) can further improve top-1 accuracy.
  • Works with existing infrastructure: You can deploy this on AWS Bedrock, GCP Vertex AI, or any vector database with a custom chunking Lambda.