
Contextual Retrieval: How to Supercharge RAG with Claude and Contextual Embeddings

Learn how to cut RAG retrieval failure rates by 35% using Contextual Embeddings and BM25 with Claude. A practical guide with code examples, cost-saving tips, and evaluation metrics.

Quick Answer

This guide shows you how to add context to each document chunk before embedding, reducing retrieval failure rates by 35%. You'll implement Contextual Embeddings, Contextual BM25, and reranking using Claude, Voyage AI, and Cohere.

RAG · Contextual Embeddings · Claude · Prompt Caching · BM25


Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful. A chunk that says "the API key is invalid" is useless if you don't know it's referring to the authentication step of a payment gateway.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets. In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, and how prompt caching makes this approach practical and cost-effective.

What You'll Need

Before diving in, make sure you have:

  • Python 3.8+ installed
  • Docker (optional, for BM25 search)
  • 4GB+ RAM and ~5-10 GB free disk space
  • API keys for Anthropic, Voyage AI, and Cohere
  • Basic familiarity with RAG, embeddings, and vector databases
Time & Cost: 30–45 minutes to run through the full dataset. API costs will be around $5–10.

1. Setting Up the Baseline: Basic RAG

We'll start by establishing a performance baseline using a dataset of 9 codebases, pre-chunked into smaller pieces. The evaluation set contains 248 queries, each with a "golden chunk"—the correct answer. Our metric is Pass@k, which checks whether the golden chunk appears in the top-k retrieved results.

First, install the required libraries:

pip install anthropic voyageai cohere

Load the dataset and create a simple vector store:

import json
from voyageai import Client

# Load chunks and evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize Voyage AI for embeddings
voyage_client = Client(api_key='your-voyage-api-key')

# Embed all chunks (simplified for illustration)
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = voyage_client.embed(chunk_texts, model='voyage-2').embeddings

# Store in a vector DB (e.g., Chroma, Pinecone).
# For brevity, we'll use a simple cosine similarity search.

With basic RAG, our Pass@10 was ~87%. That's decent, but we can do better.
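
If you want to reproduce that number, Pass@k itself is straightforward to compute. The sketch below assumes each evaluation item carries a query string and an identifier for its golden chunk, and that retrieved chunks carry the same identifier; the field names query, golden_chunk_id, and chunk_id are placeholders, so adapt them to the actual schema of your evaluation set.

def pass_at_k(eval_data, search_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = 0
    for item in eval_data:
        results = search_fn(item['query'], top_k=k)        # 'query' is a placeholder field name
        retrieved_ids = {c['chunk_id'] for c in results}   # 'chunk_id' is a placeholder field name
        if item['golden_chunk_id'] in retrieved_ids:       # 'golden_chunk_id' is a placeholder field name
            hits += 1
    return hits / len(eval_data)

# Pass whichever retrieval function you are evaluating, e.g. the basic vector
# search above or the hybrid_search defined later in this guide.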

2. Contextual Embeddings: Adding Context Before Embedding

The core idea is simple: for each chunk, use Claude to generate a short piece of context that explains what the chunk is about, then prepend that context to the chunk text before embedding.

Why It Works

When you embed a chunk like "def authenticate(): ...", the vector captures only the code syntax. But if you first prepend "This function handles user authentication for the payment gateway API", the embedding becomes much richer and more retrievable for relevant queries.

Implementation with Claude and Prompt Caching

Here's how to generate context for each chunk using Claude:

import anthropic

client = anthropic.Anthropic(api_key='your-anthropic-api-key')

def generate_context(chunk_text, full_document):
    """Generate context for a single chunk."""
    prompt = f"""<document>
{full_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        system="You are a helpful assistant that provides context for document chunks.",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Apply to all chunks
for chunk in chunks:
    context = generate_context(chunk['text'], chunk['full_document'])
    chunk['contextual_text'] = f"{context}\n\n{chunk['text']}"

But wait—this is expensive! Generating context for thousands of chunks could cost a fortune. That's where prompt caching comes in.

Making It Practical with Prompt Caching

Prompt caching allows you to reuse the full document context across multiple chunk requests. Since the full document is the same for all chunks within a document, you cache it once and pay only for the chunk-specific tokens.

# Enable prompt caching by adding the cache_control parameter
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    system=[{
        "type": "text",
        "text": "You are a helpful assistant that provides context for document chunks.",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"<document>{full_document}</document>",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"<chunk>{chunk_text}</chunk>\n\nPlease give a short succinct context..."
            }
        ]
    }]
)

With caching, the cost drops dramatically—often by 90% or more.
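
To make that figure concrete, here's a back-of-the-envelope sketch. The per-token price and the cache write/read multipliers below are illustrative assumptions rather than current list prices, and chunk-specific prompt tokens are ignored for simplicity; check Anthropic's pricing page before budgeting.

# Illustrative cost comparison for generating context over one document.
PRICE_INPUT = 0.25 / 1_000_000           # $ per input token (assumed)
PRICE_CACHE_WRITE = PRICE_INPUT * 1.25   # cache writes billed at a premium (assumed multiplier)
PRICE_CACHE_READ = PRICE_INPUT * 0.10    # cache reads heavily discounted (assumed multiplier)

doc_tokens = 50_000       # tokens in the full document
chunks_per_doc = 100      # number of chunk-context requests for this document

# Without caching: the full document is re-sent with every chunk request
uncached = chunks_per_doc * doc_tokens * PRICE_INPUT

# With caching: the first request writes the document to the cache, the rest read it
cached = doc_tokens * PRICE_CACHE_WRITE + (chunks_per_doc - 1) * doc_tokens * PRICE_CACHE_READ

print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}  savings: {1 - cached / uncached:.0%}")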

Results

After embedding the contextualized chunks and re-running our evaluation, Pass@10 improved from ~87% to ~95%, cutting the retrieval failure rate on this dataset by more than half.

3. Contextual BM25: Hybrid Search Gets Smarter

BM25 is a classic keyword-based retrieval algorithm. By applying the same contextual prefix to chunks before indexing them with BM25, we get Contextual BM25. This hybrid approach (vector + keyword) captures both semantic meaning and exact term matching.

from rank_bm25 import BM25Okapi
import numpy as np

# Tokenize contextual chunks for BM25
tokenized_corpus = [chunk['contextual_text'].split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_corpus)

# Embed the contextualized chunks for the vector side of the hybrid search
contextual_texts = [chunk['contextual_text'] for chunk in chunks]
all_embeddings = np.array(voyage_client.embed(contextual_texts, model='voyage-2').embeddings)

def cosine_similarity(query_vec, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query_vec = np.asarray(query_vec)
    return matrix @ query_vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec))

# Search function
def hybrid_search(query, top_k=10, alpha=0.5):
    # Vector search
    query_embedding = voyage_client.embed([query], model='voyage-2').embeddings[0]
    vector_scores = cosine_similarity(query_embedding, all_embeddings)
    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Combine scores (in practice, normalize both so the scales are comparable)
    combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined_scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]

Combining Contextual Embeddings with Contextual BM25 often yields the best results, especially for codebases where exact function names matter.

4. Reranking for Final Precision

Even with great retrieval, the top-10 results may contain irrelevant chunks. A reranker (like Cohere's) re-scores the top candidates for maximum precision.

import cohere

co = cohere.Client('your-cohere-api-key')

def rerank(query, candidates, top_k=5):
    results = co.rerank(
        model='rerank-english-v2.0',
        query=query,
        documents=[c['contextual_text'] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How do I authenticate a user?"
top_chunks = hybrid_search(query, top_k=20)
final_chunks = rerank(query, top_chunks, top_k=5)

Reranking adds minimal latency but can push Pass@5 from 85% to 95%+.

5. Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The code is available in the contextual-rag-lambda-function folder of the cookbook. This lets you use Contextual Retrieval without changing your existing Bedrock pipeline.
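
The real handler in the cookbook deals with Bedrock's specific ingestion event and response formats, so the sketch below only illustrates the core transformation step under simplified assumptions: the event is assumed to carry the document text and its chunks under hypothetical field names, and generate_context (with its Anthropic client) from section 2 is assumed to be bundled with the function.

def lambda_handler(event, _context):
    # Hypothetical payload shape; the real Bedrock Knowledge Base event differs.
    full_document = event['document_text']
    chunks = event['chunks']
    contextualized = []
    for chunk in chunks:
        # Reuses generate_context from section 2 to prepend context to each chunk
        context = generate_context(chunk['text'], full_document)
        contextualized.append({**chunk, 'text': f"{context}\n\n{chunk['text']}"})
    return {'chunks': contextualized}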

Cost Management

  • Prompt caching is your best friend. Use it aggressively.
  • Claude 3 Haiku is fast and cheap for context generation.
  • Batch your context generation to avoid rate limits (see the sketch below).
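
One simple way to batch the context-generation calls from section 2 is to process chunks in small parallel groups with a pause between groups. The batch size, worker count, and delay below are illustrative values to tune against your own rate limits; for very large corpora, Anthropic's Message Batches API is also worth considering.

import time
from concurrent.futures import ThreadPoolExecutor

def contextualize_all(chunks, batch_size=20, max_workers=5, delay_seconds=5):
    """Generate context for chunks in small parallel batches, pausing between batches."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            contexts = list(pool.map(
                lambda c: generate_context(c['text'], c['full_document']), batch))
        for chunk, context in zip(batch, contexts):
            chunk['contextual_text'] = f"{context}\n\n{chunk['text']}"
        time.sleep(delay_seconds)  # simple throttle between batches
    return chunks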

When to Use Contextual Retrieval

This technique shines when:

  • Your chunks are small (under 500 tokens)
  • Documents have a clear hierarchical structure (code files, legal contracts, technical manuals)
  • Queries are specific and context-dependent

Key Takeaways

  • Contextual Embeddings reduce retrieval failures by 35% by prepending relevant context to each chunk before embedding.
  • Prompt caching makes Contextual Retrieval cost-effective, reducing API costs by up to 90% for large document sets.
  • Combine Contextual Embeddings with Contextual BM25 for a hybrid search that captures both semantic meaning and exact keyword matches.
  • Reranking further boosts precision, pushing Pass@5 scores above 95% in many cases.
  • AWS Bedrock users can deploy a Lambda function to add context during ingestion, enabling this technique without architectural changes.