
Contextual Retrieval: Supercharge Your RAG System with Claude and Context-Aware Chunks

Learn how to reduce retrieval failure by 35% using Contextual Embeddings and BM25 with Claude. A practical guide to building high-performance RAG systems.

Quick Answer

This guide shows you how to improve RAG retrieval by adding context to each chunk before embedding. Using Contextual Embeddings and BM25, you can reduce top-20 retrieval failure by 35% and boost Pass@10 from 87% to 95%.

Tags: RAG · Contextual Embeddings · Claude · Prompt Caching · BM25

Introduction

Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A. But traditional RAG has a blind spot: when you split documents into chunks, those chunks often lose the surrounding context. A code snippet like def process(): means nothing without knowing it's part of a payment processing module.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding and indexing. The result? A 35% reduction in top-20 retrieval failure rate across diverse datasets. In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, Voyage AI, and Cohere—and see how prompt caching makes this approach production-ready.

What You'll Need

Technical Skills

  • Intermediate Python
  • Basic RAG understanding
  • Familiarity with vector databases
  • Command-line basics

System & API Requirements

  • Anthropic API key (context generation with Claude)
  • Voyage AI API key (embeddings)
  • Cohere API key (reranking)
  • Python 3.8+ with the anthropic, voyageai, cohere, and rank_bm25 packages

Time & Cost

  • Time: 30–45 minutes
  • Cost: ~$5–10 for the full dataset

Step 1: Basic RAG Pipeline (Baseline)

Let's start with a simple RAG pipeline to establish a performance baseline. We'll use a pre-chunked dataset of 9 codebases (248 queries with golden chunks).
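
The snippets below assume each chunk record carries an id, its content, and the full parent document, and that each query carries a golden_chunk_id. These field names are inferred from this guide's own code, so adjust them to your dataset:

# Illustrative record shapes (field names inferred from the snippets below)
example_chunk = {
    "id": "repo3_chunk_041",
    "content": "def process_refund(tx_id): ...",
    "document": "<full source file the chunk was cut from>",
}
example_query = {
    "query": "How does the payment module handle refunds?",
    "golden_chunk_id": "repo3_chunk_041",
}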

import json

import voyageai
from anthropic import Anthropic

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Load chunks and queries
with open("data/codebase_chunks.json") as f:
    chunks = json.load(f)

with open("data/evaluation_set.jsonl") as f:
    queries = [json.loads(line) for line in f]

# Embed all chunks
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Cosine similarity between two equal-length vectors
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# For each query, find the top-k chunks by embedding similarity
def search(query, k=10):
    q_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = [cosine_similarity(q_emb, e) for e in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

# Evaluate Pass@10
pass_at_10 = 0
for q in queries:
    results = search(q["query"])
    if any(r["id"] == q["golden_chunk_id"] for r in results):
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10/len(queries)*100:.1f}%")

Expected: ~87%

Step 2: Contextual Embeddings

The Problem

When you split a document, each chunk loses its broader context. A chunk containing def calculate_tax(): from a financial report is ambiguous—is it for payroll, sales, or corporate tax? Without context, the embedding vector is less precise.

The Solution

Before embedding, prepend a short context snippet to each chunk. Claude generates this context using the full document and the chunk itself.
def generate_context(chunk_content, full_document):
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_content}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
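
For the calculate_tax example above, the generated context might look something like this. The output below is made up for illustration, not an actual model response:

raw_chunk = "def calculate_tax(income):\n    return income * 0.21"

# Hypothetical context generated by Claude (illustrative output only)
context = (
    "This chunk is from the corporate tax module of the finance service; "
    "calculate_tax applies the flat 21% corporate rate."
)

augmented = f"{context}\n\n{raw_chunk}"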

Then embed the augmented chunk:

augmented_chunks = []
for chunk in chunks:
    context = generate_context(chunk["content"], chunk["document"])
    augmented_text = f"{context}\n\n{chunk['content']}"
    augmented_chunks.append(augmented_text)

# Embed the augmented chunks
contextual_embeddings = vo.embed(augmented_chunks, model="voyage-2").embeddings

Why Prompt Caching Matters

Generating context for thousands of chunks can be expensive. Prompt caching reduces cost by reusing the full document prefix across chunks from the same document.
# With prompt caching (Anthropic API)
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    system=[{"type": "text", "text": full_document, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": f"<chunk>{chunk_content}</chunk>..."}]
)

This reduces API costs by ~50–70% for large document sets.
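
As a rough back-of-envelope check, assume Anthropic's prompt-caching pricing at the time of writing (cache writes billed at 1.25x the base input rate, cache reads at 0.1x). Caching the shared document prefix then pays for itself after the very first chunk:

# Hypothetical document: D prefix tokens re-sent for each of N chunk requests
N, D = 10, 8_000

no_cache = N * D                            # every request pays full price for D
with_cache = 1.25 * D + (N - 1) * 0.1 * D   # one cache write, then N-1 cache reads

print(f"Savings: {1 - with_cache / no_cache:.0%}")  # ~78% fewer billed prefix tokens

Actual savings depend on chunk counts and sizes. Note also that the ephemeral cache expires after roughly five minutes of inactivity, so process all chunks of a document back-to-back to keep hitting it.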

Performance Lift

After implementing Contextual Embeddings, re-run the evaluation:
# Same search function, but using contextual_embeddings
pass_at_10 = 0
for q in queries:
    results = contextual_search(q["query"])
    if any(r["id"] == q["golden_chunk_id"] for r in results):
        pass_at_10 += 1

print(f"Contextual Pass@10: {pass_at_10/len(queries)*100:.1f}%")

Expected: ~95%

Result: Pass@10 jumps from ~87% to ~95%.

Step 3: Contextual BM25

BM25 is a keyword-based retrieval method. It benefits from context too. Use the same generated context to augment chunks for BM25 indexing.

from rank_bm25 import BM25Okapi

# Tokenize augmented chunks for BM25
tokenized_corpus = [augmented_text.split() for augmented_text in augmented_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_search(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

Hybrid Search: Combine Embeddings + BM25

For best results, combine both methods. A simple approach, shown below, normalizes each method's scores and takes a weighted sum; a rank-based alternative, reciprocal rank fusion (RRF), is sketched after the code:

import numpy as np

def normalize(scores):
    # Min-max normalize so both score lists share a 0-1 range
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

def hybrid_search(query, k=10, alpha=0.5):
    # Get scores from both methods
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(query_emb, e) for e in contextual_embeddings]
    bm25_scores = bm25.get_scores(query.split())

    # Normalize and combine with a weighted sum
    combined = alpha * normalize(emb_scores) + (1 - alpha) * normalize(bm25_scores)
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]
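
If you prefer fusing by rank rather than by score, here is a minimal reciprocal rank fusion sketch. The name rrf_search and the constant c=60 are illustrative choices, and it reuses the search and bm25_search functions defined earlier (swap in a contextual-embedding search for the full effect):

def rrf_search(query, k=10, c=60):
    # Score each chunk by its summed reciprocal rank across both rankings
    scores = {}
    for ranked in (search(query, k=50), bm25_search(query, k=50)):
        for rank, chunk in enumerate(ranked):
            scores[chunk["id"]] = scores.get(chunk["id"], 0.0) + 1.0 / (c + rank + 1)
    by_id = {chunk["id"]: chunk for chunk in chunks}
    top_ids = sorted(scores, key=scores.get, reverse=True)[:k]
    return [by_id[i] for i in top_ids]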

Step 4: Reranking for Final Precision

Even with 95% Pass@10, you can push further. Use Cohere's reranker to reorder the top-20 results:

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_k=10):
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["content"] for c in candidates],
        top_n=top_k,
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How does the payment module handle refunds?"
initial_results = hybrid_search(query, k=20)
final_results = rerank(query, initial_results, top_k=10)

Reranking typically adds 2–5% to Pass@10, pushing it toward 97–99%.

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, deploy the provided Lambda function (contextual-rag-lambda-function/lambda_function.py) as a custom chunking option. This automates context generation during ingestion.

Cost Optimization

  • Prompt caching: Essential for large document sets
  • Batch processing: Generate context for many chunks in parallel (see the sketch after this list)
  • Model choice: Use Claude 3 Haiku for context generation (fast, cheap); use Sonnet/Opus for final answer generation
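
A minimal sketch of parallel context generation using Python's standard library, assuming the generate_context helper from Step 2; the name contextualize_document and the worker count are illustrative. It generates the first chunk's context alone so the document prefix lands in the cache before the parallel requests fan out:

from concurrent.futures import ThreadPoolExecutor

def contextualize_document(doc_chunks, full_document, max_workers=4):
    # Warm the prompt cache with one request, then parallelize the rest
    first = generate_context(doc_chunks[0]["content"], full_document)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rest = list(pool.map(
            lambda c: generate_context(c["content"], full_document),
            doc_chunks[1:],
        ))
    contexts = [first] + rest
    return [f"{ctx}\n\n{c['content']}" for ctx, c in zip(contexts, doc_chunks)]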

Evaluation

Always measure Pass@k on your own dataset. The 35% failure reduction is an average—your mileage may vary. Build a golden dataset of at least 100 queries with known correct chunks.
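
To make that measurement repeatable, the two Pass@10 loops above generalize into a single helper (assuming the same query schema used throughout this guide):

def pass_at_k(queries, search_fn, k=10):
    # Fraction of queries whose golden chunk appears in the top-k results
    hits = sum(
        any(r["id"] == q["golden_chunk_id"] for r in search_fn(q["query"], k=k))
        for q in queries
    )
    return hits / len(queries)

print(f"Hybrid Pass@10: {pass_at_k(queries, hybrid_search) * 100:.1f}%")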

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by prepending document-level context to each chunk before embedding.
  • Prompt caching makes this approach cost-effective—reusing the full document across chunks cuts API costs by 50–70%.
  • Combine Contextual Embeddings with Contextual BM25 for hybrid search that outperforms either method alone.
  • Reranking adds the final polish—a Cohere reranker on top-20 results can push Pass@10 beyond 97%.
  • Production-ready on any cloud—the technique works on Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex AI.