
Contextual Retrieval: How to Reduce RAG Failure Rates by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 with Claude to dramatically improve RAG retrieval accuracy. Includes code examples, cost optimization with prompt caching, and production deployment tips.

Quick Answer

Contextual Retrieval adds surrounding document context to each chunk before embedding, reducing retrieval failure rates by 35%. This guide shows you how to implement it with Claude, optimize costs using prompt caching, and combine it with BM25 and reranking for maximum accuracy.

Tags: RAG, Contextual Retrieval, Claude, Embeddings, Prompt Caching


Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base assistants. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A chunk that says "the function returns True" is useless if you don't know which function it's referring to.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The result? Anthropic's testing across multiple datasets shows a 35% reduction in top-20-chunk retrieval failure rates. In this guide, you'll learn how to implement this technique using Claude, optimize it with prompt caching, and combine it with BM25 search and reranking for production-grade performance.

What You'll Build

By the end of this guide, you'll have a complete Contextual Retrieval pipeline that:

  • Generates context for each chunk using Claude
  • Creates contextual embeddings via Voyage AI
  • Performs hybrid search with Contextual BM25
  • Reranks results for maximum precision

Prerequisites

  • Skills: Intermediate Python, basic RAG knowledge, familiarity with vector databases
  • API Keys: Anthropic, Voyage AI, and Cohere (all three are used in this guide)
  • Time & Cost: ~30-45 minutes, ~$5-10 in API costs

Step 1: Set Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere rank_bm25 pandas numpy

Initialize your clients:

import anthropic
import voyageai

claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")

Step 2: Understand the Problem with Basic RAG

In traditional RAG, you split documents into chunks and embed each chunk independently. Consider this code snippet:

def calculate_total(items):
    return sum(items)

If this chunk is retrieved alone, the embedding captures only the function signature and body. But if the surrounding context says "This function calculates the total price of items in a shopping cart," the embedding becomes far more meaningful.

The failure mode: Queries about "shopping cart total" might miss this chunk entirely because the embedding lacks contextual clues.
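
To see the gap concretely, here is a minimal sketch (not part of the original example) that embeds the bare chunk and a contextualized version of it, then compares both against a shopping-cart query using the vo client from Step 1. The context sentence and query are invented for illustration.

import numpy as np

bare_chunk = "def calculate_total(items):\n    return sum(items)"
contextual_chunk = (
    "This function calculates the total price of items in a shopping cart.\n\n"
    + bare_chunk
)

query = "how is the shopping cart total computed?"

# Embed the query and both chunk variants with Voyage AI
query_emb = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
chunk_embs = vo.embed(
    [bare_chunk, contextual_chunk], model="voyage-2", input_type="document"
).embeddings

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("bare chunk:      ", cosine(query_emb, chunk_embs[0]))
print("contextual chunk:", cosine(query_emb, chunk_embs[1]))
# Expect the contextualized chunk to score noticeably higher for this query.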

Step 3: Implement Contextual Embeddings

Contextual Embeddings work in three steps:

  • For each chunk, send the full document + the chunk to Claude
  • Ask Claude to generate a concise context (2-3 sentences) explaining what this chunk contains
  • Prepend that context to the chunk before embedding

Here's the core function:

def generate_chunk_context(document: str, chunk: str) -> str:
    """Generate context for a single chunk using Claude."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system="You are a document analyzer. Your task is to provide context for a chunk of text from a larger document. Give a 2-3 sentence explanation of what this chunk contains and its relevance to the overall document.",
        messages=[
            {
                "role": "user",
                "content": f"<document>{document}</document>\n\nHere is the chunk we want to situate within the whole document:\n<chunk>{chunk}</chunk>\n\nPlease give a short context for this chunk."
            }
        ]
    )
    return response.content[0].text

Then embed the contextualized chunk:

def embed_contextual_chunk(document: str, chunk: str) -> list:
    context = generate_chunk_context(document, chunk)
    contextualized_chunk = f"{context}\n\n{chunk}"
    embedding = vo.embed(
        texts=[contextualized_chunk],
        model="voyage-2",
        input_type="document"
    )
    return embedding.embeddings[0]
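
To tie the pieces together, here is a short usage sketch. The paragraph-based split_into_chunks helper and the my_doc.txt path are hypothetical stand-ins for whatever chunking strategy and corpus you actually use.

def split_into_chunks(document: str) -> list:
    """Naive chunker: split on blank lines (swap in your own strategy)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

with open("my_doc.txt") as f:   # hypothetical source document
    document = f.read()

chunks = split_into_chunks(document)

# Generate context and embed every chunk
contextual_embeddings = [embed_contextual_chunk(document, c) for c in chunks]
print(f"Embedded {len(contextual_embeddings)} contextualized chunks")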

Step 4: Optimize Costs with Prompt Caching

Generating context for every chunk individually can be expensive. Prompt caching reduces costs by reusing the document prefix across multiple chunk requests.

def generate_context_with_caching(document: str, chunks: list) -> list:
    """Generate contexts for all chunks using prompt caching."""
    contexts = []
    
    # The document is the common prefix - cache it
    for chunk in chunks:
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            system="You are a document analyzer...",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"<document>{document}</document>",
                            "cache_control": {"type": "ephemeral"}
                        },
                        {
                            "type": "text",
                            "text": f"Here is the chunk we want to situate...\n<chunk>{chunk}</chunk>"
                        }
                    ]
                }
            ]
        )
        contexts.append(response.content[0].text)
    
    return contexts

Cost savings: With prompt caching, you pay the cache write cost once for the document, then only the much smaller cache read cost for each subsequent chunk. For large documents, this can reduce costs by 70-90%.
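
As a rough, back-of-envelope illustration of where that range comes from, here is a small calculation sketch. The token counts are invented, and the prices are assumptions (Claude 3 Haiku input at about $0.25 per million tokens, cache writes billed at roughly 1.25x and cache reads at roughly 0.1x the base input rate); check current pricing before relying on the numbers.

# Illustrative cost comparison for generating context over one document
doc_tokens = 50_000      # tokens in the shared document prefix
chunk_tokens = 300       # tokens per chunk prompt (chunk + instructions)
n_chunks = 200
base = 0.25 / 1_000_000  # assumed $ per input token for Claude 3 Haiku

# Without caching, the full document is re-sent and re-billed for every chunk
without_caching = n_chunks * (doc_tokens + chunk_tokens) * base

# With caching: one cache write, then cheap cache reads for the remaining chunks
with_caching = (
    doc_tokens * base * 1.25                     # first request writes the cache
    + (n_chunks - 1) * doc_tokens * base * 0.1   # later requests read it
    + n_chunks * chunk_tokens * base             # chunk text itself is never cached
)

print(f"without caching: ${without_caching:.2f}")   # ~$2.52
print(f"with caching:    ${with_caching:.2f}")      # ~$0.28
print(f"savings:         {1 - with_caching / without_caching:.0%}")  # ~89%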

Step 5: Implement Contextual BM25

Contextual BM25 applies the same idea to keyword-based search. Instead of indexing raw chunks, index the contextualized chunks (context + chunk text).

from rank_bm25 import BM25Okapi

def build_contextual_bm25_index(documents: list, chunks_by_doc: dict):
    """Build a BM25 index using contextualized chunks."""
    contextualized_chunks = []
    for doc_id, doc in enumerate(documents):
        for chunk in chunks_by_doc[doc_id]:
            context = generate_chunk_context(doc, chunk)
            contextualized_chunks.append(f"{context} {chunk}")
    tokenized_corpus = [chunk.split() for chunk in contextualized_chunks]
    return BM25Okapi(tokenized_corpus), contextualized_chunks
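
A quick usage sketch for the index, assuming documents and chunks_by_doc hold your corpus (the query string is just an example):

# Build the index once at ingestion time
bm25_index, contextualized_chunks = build_contextual_bm25_index(documents, chunks_by_doc)

# Query it with plain keyword search over the contextualized text
query = "shopping cart total"
scores = bm25_index.get_scores(query.split())
top_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
for i in top_ids:
    print(f"{scores[i]:.2f}  {contextualized_chunks[i][:80]}")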

Step 6: Hybrid Search with Reranking

For maximum performance, combine Contextual Embeddings with Contextual BM25 using reciprocal rank fusion (RRF), then rerank with Cohere:

import cohere

def hybrid_search(query: str, k: int = 20):
    # Assumes vector_db, bm25_index, and chunks were built during ingestion
    # Get embedding results (chunk ids ranked by vector similarity)
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    embedding_results = vector_db.similarity_search(query_embedding, k=k*2)
    
    # Get BM25 results
    tokenized_query = query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    bm25_results = sorted(range(len(bm25_scores)), key=lambda i: bm25_scores[i], reverse=True)[:k*2]
    
    # Reciprocal Rank Fusion
    combined_scores = {}
    for rank, doc_id in enumerate(embedding_results):
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (rank + 60)
    for rank, doc_id in enumerate(bm25_results):
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (rank + 60)
    
    # Sort by combined score
    reranked = sorted(combined_scores.keys(), key=lambda x: combined_scores[x], reverse=True)[:k]
    
    # Final rerank with Cohere
    co = cohere.Client("YOUR_COHERE_KEY")
    rerank_results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[chunks[i] for i in reranked],
        top_n=10
    )
    
    return rerank_results
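
A brief usage example, with the caveat that it assumes vector_db, bm25_index, and chunks were built during ingestion and that Cohere's rerank response exposes each result's original index and relevance_score:

results = hybrid_search("How do I calculate the shopping cart total?")
for r in results.results:
    print(f"{r.relevance_score:.3f}  {r.index}")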

Performance Results

Anthropic's evaluation on a dataset of 9 codebases with 248 queries showed:

Method                                      Pass@10
Basic RAG                                   87%
Contextual Embeddings                       95%
Contextual Embeddings + BM25 + Reranking    97%+

Production Deployment Considerations

On AWS Bedrock

AWS provides a Lambda function for Contextual Retrieval that you can deploy as a custom chunking option in Bedrock Knowledge Bases. Find the code in the contextual-rag-lambda-function directory of the Anthropic cookbook.

Cost Management

  • Prompt caching is essential for production. It's available on Anthropic's first-party API and coming soon to Bedrock and Vertex AI.
  • Batch context generation during ingestion, not at query time.
  • Use Claude Haiku for context generation (fastest, cheapest) and Sonnet/Opus only for complex documents.

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by prepending document-level context to each chunk before embedding, solving the "chunk without context" problem.
  • Prompt caching makes this practical by dramatically reducing the cost of generating context for thousands of chunks—expect 70-90% cost savings on large documents.
  • Hybrid search with Contextual BM25 further improves results by combining semantic and keyword-based retrieval, then fusing them with reciprocal rank fusion.
  • Reranking adds the final polish—using a dedicated reranker like Cohere on the top 20 results can push Pass@10 accuracy above 97%.
  • Production-ready on major clouds—AWS Bedrock offers a Lambda-based custom chunking option, and similar patterns work on GCP Vertex AI.