
Contextual Retrieval: How to Reduce RAG Failure Rates by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. A practical guide with code examples for Claude AI users.

Quick Answer

This guide shows you how to add chunk-specific context before embedding to improve RAG retrieval accuracy by 35%. You'll learn Contextual Embeddings, Contextual BM25, and reranking techniques using Claude, Voyage AI, and Cohere.

Tags: RAG · Contextual Retrieval · Claude · Embeddings · Prompt Caching


Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to codebase assistants. But there's a persistent problem: chunks lack context. When you split a document into pieces for embedding, each chunk becomes an orphan, stripped of the surrounding information that gives it meaning.

Anthropic's new Contextual Retrieval technique solves this. By prepending chunk-specific context before embedding, you can reduce retrieval failure rates by an average of 35% across diverse datasets. In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 in your own RAG pipeline.

What You'll Learn

  • Why traditional chunking hurts retrieval accuracy
  • How to generate and prepend context for each chunk using Claude
  • How to use prompt caching to keep costs practical
  • How to combine Contextual Embeddings with BM25 for hybrid search
  • How reranking further boosts performance

The Core Problem: Orphaned Chunks

In standard RAG, you split a document into chunks, embed each chunk into a vector, and store them in a vector database. When a user asks a question, you retrieve the most similar chunks and feed them to Claude.
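As a minimal sketch of that baseline splitting step (the fixed-size character splitter below is an assumption for illustration, not Anthropic's recipe; real pipelines often split on headings, functions, or sentences):

def split_into_chunks(text, chunk_size=800, overlap=100):
    # Naive fixed-size chunking with a small overlap between neighbors
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks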

But consider this chunk from a codebase:

def calculate_interest(principal, rate, time):
    return principal * (1 + rate * time)

Without context, the embedding model doesn't know this is part of a banking application that uses simple interest (not compound). A query like "How does our savings account calculate returns?" might miss this chunk entirely.

Contextual Retrieval solves this by adding a short, descriptive context to each chunk before embedding:
"This function is from the 'savings_account.py' module in a banking application. It calculates simple interest for fixed deposits."

Now the embedding captures both the code and its meaning.
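Concretely, the text you embed is just the generated context prepended to the raw chunk. A small sketch (the separator is a choice, not a requirement):

chunk_text = (
    "def calculate_interest(principal, rate, time):\n"
    "    return principal * (1 + rate * time)"
)
chunk_context = (
    "This function is from the 'savings_account.py' module in a banking "
    "application. It calculates simple interest for fixed deposits."
)

# The context-enriched string that actually gets embedded
contextualized_chunk = f"{chunk_context}\n\n{chunk_text}"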

How Contextual Embeddings Work

The process is straightforward:

  • Split your documents into chunks (as usual)
  • Generate context for each chunk using Claude (with a specific prompt)
  • Prepend the context to the chunk text
  • Embed the context-enriched chunk
  • Store in your vector database

At query time, you embed the user's question normally and search against the context-enriched vectors.

The Context Generation Prompt

Anthropic recommends this prompt for generating chunk context:

CONTEXT_PROMPT = """
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
"""

Notice the prompt includes the entire document plus the specific chunk. This is where prompt caching becomes essential.

Making It Practical with Prompt Caching

Generating context for every chunk by sending the full document each time would be expensive. But with prompt caching, you cache the full document once and reuse it across all chunks from that document.

Here's how to implement it with the Anthropic Python SDK:

import anthropic

client = anthropic.Anthropic()

# Cache the full document in the system prompt

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": full_document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"<chunk>{chunk}</chunk>\n\n{CONTEXT_PROMPT}"
        }
    ]
)

With caching, you pay the full document cost once, then only the chunk and output tokens for subsequent chunks. This makes Contextual Embeddings production-viable.

Step-by-Step Implementation

1. Set Up Your Environment

pip install anthropic voyageai cohere

Set your API keys:

import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
os.environ["VOYAGE_API_KEY"] = "pa-..."
os.environ["COHERE_API_KEY"] = "..."

2. Generate Context for Each Chunk

def generate_chunk_context(client, full_doc, chunk):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": full_doc,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"<chunk>{chunk}</chunk>\n\n{CONTEXT_PROMPT}"
            }
        ]
    )
    return response.content[0].text
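A minimal driver loop might look like the following. The chunks list and full_document variable are assumptions from your own splitting step; with prompt caching, the first chunk's call writes the document to the cache and later calls for the same document read from it:

# Illustrative loop: reuse the cached full document across all of its chunks
client = anthropic.Anthropic()

contextualized_chunks = []
for chunk in chunks:
    context = generate_chunk_context(client, full_document, chunk)
    contextualized_chunks.append(f"{context}\n\n{chunk}")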

3. Create Contextual Embeddings

import voyageai

vo = voyageai.Client()

def embed_chunks_with_context(chunks_with_context):
    # chunks_with_context is a list of strings: "context\n\nchunk_content"
    embeddings = vo.embed(
        chunks_with_context,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return embeddings
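Given the contextualized_chunks list built in the earlier loop (an assumed variable name), producing the document vectors is a single call:

doc_embeddings = embed_chunks_with_context(contextualized_chunks)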

4. Store and Search

Store the enriched embeddings in your vector database (Pinecone, Weaviate, Chroma, etc.). At query time, embed the user's question normally and perform similarity search.
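The storage side depends on your database, but here is a minimal self-contained sketch of the query path, with an in-memory cosine-similarity search standing in for the vector store (note that voyage-2 takes input_type="query" for questions):

import numpy as np

def search(query, doc_embeddings, chunks, top_k=10):
    # Embed the question as a query and rank chunks by cosine similarity
    query_emb = np.array(
        vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    )
    matrix = np.array(doc_embeddings)
    scores = matrix @ query_emb / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_emb)
    )
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_idx]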

Contextual BM25: The Text-Based Complement

Contextual Embeddings improve vector search. But you can also use the same chunk context to improve BM25 (keyword-based) search: simply prepend the context to each chunk before adding it to your BM25 index.

from rank_bm25 import BM25Okapi

# Tokenize the context-enriched chunks and build the BM25 index
tokenized_corpus = [
    tokenize(context + " " + chunk)
    for context, chunk in zip(contexts, chunks)
]
bm25 = BM25Okapi(tokenized_corpus)

# Search
tokenized_query = tokenize(user_query)
scores = bm25.get_scores(tokenized_query)

Combine BM25 and vector scores using reciprocal rank fusion (RRF) for a powerful hybrid search:

def reciprocal_rank_fusion(vector_results, bm25_results, k=60):
    combined_scores = {}
    for rank, doc_id in enumerate(vector_results):
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_results):
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
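As a quick illustration of the fusion step (the id lists below are made up; in practice they would be the chunk ids returned by each retriever, best match first):

# Hypothetical ranked result lists (chunk ids, best match first)
vector_results = [3, 7, 1, 12, 5]
bm25_results = [7, 3, 9, 1, 0]

fused = reciprocal_rank_fusion(vector_results, bm25_results)
top_chunk_ids = [doc_id for doc_id, score in fused[:10]]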

Reranking for Final Precision

Even with contextual retrieval, the top-10 results might contain irrelevant chunks. Add a reranking step using Cohere's rerank API:

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank_results(query, chunks, top_k=5):
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=chunks,
        top_n=top_k
    )
    return [chunks[r.index] for r in results.results]
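Tying the pieces together, one plausible final query path that reuses the sketches above (retrieve a wide candidate set, then rerank it down) might be:

# Hypothetical end-to-end query: hybrid retrieval feeds the reranker
candidates = search(user_query, doc_embeddings, contextualized_chunks, top_k=20)
final_chunks = rerank_results(user_query, candidates, top_k=5)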

Performance Results

Anthropic tested this approach on 9 codebases with 248 queries. Here's what they found:

Method                                      Pass@10
Basic RAG                                   ~87%
Contextual Embeddings                       ~95%
Contextual Embeddings + BM25 + Reranking    ~97%+

In Anthropic's broader evaluation across diverse datasets, Contextual Embeddings alone reduced the retrieval failure rate by 35% on average; the table above shows the effect on the codebase benchmark.

Production Considerations

  • Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
  • For AWS Bedrock, Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option.
  • Cost: Running the full evaluation dataset costs ~$5-10 in API calls.
  • Latency: Context generation adds ~1-2 seconds per chunk, but caching minimizes this for large documents.

Key Takeaways

  • Contextual Embeddings reduce retrieval failure rates by 35% by adding chunk-specific context before embedding.
  • Prompt caching makes this technique production-ready by reusing the full document across chunks.
  • Contextual BM25 extends the same idea to keyword search, and hybrid search with RRF gives the best results.
  • Reranking with Cohere or similar models can push Pass@10 above 97%.
  • This technique works with any embedding model and vector database—no infrastructure changes required.

Start by adding context to your most important documents. The improvement in retrieval quality will be immediately noticeable in your Claude-powered applications.