Guide | Beginner | Best Practices | 2026-05-22

Contextual Retrieval: Boosting RAG Performance with Claude and Contextual Embeddings

Learn how to implement Contextual Retrieval with Claude AI to reduce retrieval failure rates by 35%. A practical guide with code examples for production RAG systems.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% in RAG systems using Claude AI.

Tags: RAG, Contextual Embeddings, Claude API, Retrieval, Prompt Caching

Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context, leading to poor search results.

Anthropic's Contextual Retrieval technique solves this by adding relevant context to each chunk before embedding. The results are impressive: a 35% reduction in retrieval failure rates across tested datasets. This guide walks you through implementing this technique with Claude AI, complete with code examples and production considerations.

What You'll Learn

Why standard chunking fails and how Contextual Embeddings fix it
How to implement Contextual Retrieval with Claude and Voyage AI
How to use Contextual BM25 for hybrid search improvements
How prompt caching makes this approach cost-effective at scale
How to evaluate retrieval performance with Pass@k metrics

Prerequisites

Technical Skills:

Intermediate Python programming
Basic understanding of RAG concepts
Familiarity with vector databases and embeddings

API Keys Needed:

Anthropic API key (free tier sufficient)
Voyage AI API key
Cohere API key (for reranking)

Estimated Time & Cost:

Setup and implementation: 30-45 minutes
API costs: ~$5-10 for the full dataset

The Problem with Traditional Chunking

In standard RAG pipelines, documents are split into smaller chunks for efficient vector search. This works well when chunks are self-contained, but fails when:

A chunk calls a function like process_data() without explaining what it does
A chunk references "the algorithm" without specifying which algorithm
A chunk contains a code snippet without the function signature or imports

Consider this chunk from a codebase:

def process():
    return transform(data, config)

Without context, the embedding for this chunk captures nothing about what transform does, what config contains, or what domain this code belongs to. A query like "How do I configure data transformation?" would likely miss this chunk entirely.
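
To make the failure concrete, here is a minimal sketch of a naive fixed-size chunker. The helper split_into_chunks and the billing/normalize.py source are hypothetical, invented only for illustration; real pipelines usually split on tokens or syntax rather than raw characters.

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Hypothetical source file: the header and import carry the domain information.
source = (
    "# billing/normalize.py\n"
    "from billing.rules import transform\n"
    "\n"
    "def process():\n"
    "    return transform(data, config)\n"
)

chunks = split_into_chunks(source, chunk_size=60)

# The last chunk holds the function body but none of the surrounding hints.
for chunk in chunks:
    print(repr(chunk))
```

The final chunk contains the function body but no mention of billing, so an embedding of that chunk alone has nothing to tie it to the domain it belongs to.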

What Are Contextual Embeddings?

Contextual Embeddings solve this by prepending a short, chunk-specific context to each chunk before generating the embedding vector. This context is generated by Claude, which understands the full document and can summarize the chunk's relevance.

The process:

Split your documents into chunks (as usual)
For each chunk, ask Claude: "What context does a reader need to understand this chunk?"
Prepend Claude's context to the chunk
Embed the context-augmented chunk
Store in your vector database

During retrieval, queries are embedded normally and compared against these enriched vectors.
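
The steps above can be sketched as one small pipeline function. This is an outline under stated assumptions, not a definitive implementation: IndexedChunk and build_contextual_index are hypothetical names, and the context generator and embedder are passed in as callables so the example runs offline with stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class IndexedChunk:
    text: str               # original chunk, passed to the LLM at answer time
    context: str            # chunk-specific context generated from the full document
    embedding: list[float]  # vector of the context-augmented text


def build_contextual_index(
    document: str,
    chunks: Sequence[str],
    generate_context: Callable[[str, str], str],
    embed: Callable[[str], list[float]],
) -> list[IndexedChunk]:
    """Contextualize each chunk, prepend the context, embed, and collect."""
    index = []
    for chunk in chunks:
        context = generate_context(chunk, document)
        augmented = f"{context}\n\n{chunk}"
        index.append(IndexedChunk(chunk, context, embed(augmented)))
    return index


# Stand-ins so the sketch runs offline; swap in real Claude and embedding calls.
def fake_context(chunk: str, document: str) -> str:
    return f"From a document of {len(document)} characters."


def fake_embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]


doc = "Alpha section. Beta section."
index = build_contextual_index(
    doc, ["Alpha section.", "Beta section."], fake_context, fake_embed
)
```

Keeping the original text separate from the context lets you decide later whether to show the generated context to the answering model or only use it for retrieval.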

Implementation: Contextual Embeddings with Claude

Step 1: Generate Context for Each Chunk

Here's how to generate context using Claude's API:

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

def generate_chunk_context(chunk_text, full_document):
    """Generate context for a single chunk using Claude."""
    response = client.messages.create(
        # Model IDs change over time; use one that your account supports
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[
            {
                "role": "user",
                "content": f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
            }
        ]
    )
    # The reply is a list of content blocks; the first one holds the text
    return response.content[0].text

Step 2: Create Contextual Embeddings

Once you have the context, prepend it to the chunk before embedding:

import voyageai

# Reads the API key from the VOYAGE_API_KEY environment variable
vo = voyageai.Client()

def create_contextual_embedding(chunk_text, context):
    """Create an embedding for a context-augmented chunk."""
    augmented_text = f"{context}\n\n{chunk_text}"
    embedding = vo.embed(
        texts=[augmented_text],
        model="voyage-2"
    )
    return embedding.embeddings[0]
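
Because the embedding call accepts a list of texts, a corpus is usually embedded in batches rather than one request per chunk. The sketch below shows one way to do that; batched and embed_in_batches are hypothetical helpers, the batch size of 64 is an assumed value (check your provider's per-request limit), and the embedding call is injected so the example runs without an API key.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 64,  # assumed value; check your provider's limit
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Offline stand-in for the real call, which could look like:
#   lambda batch: vo.embed(texts=list(batch), model="voyage-2").embeddings
calls = []


def fake_embed_batch(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2
)
```

With five texts and a batch size of two, the stand-in is called three times (batches of 2, 2, and 1) and the vectors come back in input order.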

Step 3: Store and Retrieve

Store the contextual embeddings in your vector database (e.g., Pinecone, Weaviate, or Chroma). During retrieval, query as usual—the enriched embeddings will naturally match relevant queries better.
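
For experimentation before committing to a vector database, a toy in-memory store is enough to exercise the flow. The class below is a hypothetical stand-in, not the API of Pinecone, Weaviate, or Chroma, and its linear scan only suits small corpora.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy stand-in for a vector database (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], dict]] = []

    def add(self, embedding: list[float], payload: dict) -> None:
        self._rows.append((embedding, payload))

    def search(self, query_embedding: list[float], k: int = 5) -> list[tuple[float, dict]]:
        """Return (score, payload) pairs for the k most similar rows, best first."""
        scored = [
            (cosine_similarity(query_embedding, embedding), payload)
            for embedding, payload in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add([1.0, 0.0], {"text": "def process(): ...", "context": "Billing normalization step."})
store.add([0.0, 1.0], {"text": "def render(): ...", "context": "HTML report rendering."})

top = store.search([0.9, 0.1], k=1)
```

Storing the original text and the generated context in the payload keeps both available at query time, whichever database you end up using.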

Making It Production-Ready with Prompt Caching

Generating context for every chunk can be expensive because each call carries the full source document. For a codebase with 10,000 chunks, that means 10,000 requests, each resending the document its chunk came from. Prompt caching makes this practical.

Anthropic's prompt caching allows you to cache the full document and reference it across multiple context generation calls:

# First call: cache the document
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": full_document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"<chunk>{chunk_1}</chunk>\n\nProvide context..."
        }
    ]
)
# Subsequent calls: reuse the cached document
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": full_document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": f"<chunk>{chunk_2}</chunk>\n\nProvide context..."
        }
    ]
)

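Because cache reads are billed at a fraction of the base input price, this can cut the input-token cost of the cached document by up to 90% on large collections. Cache entries expire after a few minutes without use, so generate context for all chunks of one document back to back.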
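
A back-of-the-envelope estimate shows where the savings come from. The function below is a sketch: the two multipliers (1.25x to write the cache, 0.10x to read it) are assumptions based on published pricing at the time of writing, so check current pricing before relying on them. It also assumes every call for a document lands while the cache entry is still alive.

```python
def estimate_input_token_units(
    doc_tokens: int,
    chunk_tokens: int,
    num_chunks: int,
    cache_write_multiplier: float = 1.25,  # assumed surcharge for writing the cache
    cache_read_multiplier: float = 0.10,   # assumed discount for reading the cache
) -> dict:
    """Compare input cost with and without caching, in base-price token units."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Without caching, every call pays full price for document plus chunk.
    uncached = num_chunks * (doc_tokens + chunk_tokens)
    cached = (
        doc_tokens * cache_write_multiplier                      # first call writes the cache
        + doc_tokens * cache_read_multiplier * (num_chunks - 1)  # later calls read it
        + chunk_tokens * num_chunks                              # chunk text is never cached
    )
    return {"uncached": uncached, "cached": cached, "savings": 1 - cached / uncached}


estimate = estimate_input_token_units(doc_tokens=8000, chunk_tokens=200, num_chunks=40)
print(f"{estimate['savings']:.0%} lower input cost")
```

For an 8,000-token document split into 40 chunks of 200 tokens, the estimate is about 85% lower input cost. The saving approaches the 90% ceiling as the document grows relative to its chunks and as the chunk count rises.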

Contextual BM25: Hybrid Search Enhancement

The same chunk-specific context can also improve BM25 (keyword) search. Traditional BM25 struggles with chunks that lack distinctive keywords. By adding context, you introduce relevant terms that improve keyword matching.

from rank_bm25 import BM25Okapi

def create_contextual_bm25_index(chunks_with_context):
    """Create a BM25 index from context-augmented chunks."""
    # Tokenize the context-augmented chunks
    tokenized_chunks = [
        f"{ctx['context']} {ctx['text']}".split()
        for ctx in chunks_with_context
    ]
    return BM25Okapi(tokenized_chunks)
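
Whitespace splitting leaves punctuation attached to tokens, which hurts keyword matching on code. A regex tokenizer is one possible refinement. The tokenize helper below is a hypothetical sketch, and whichever tokenizer you choose must be applied to both the indexed chunks and the query.

```python
import re

_TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split on any non-alphanumeric character."""
    return [token.lower() for token in _TOKEN_RE.findall(text)]


line = "return transform(data, config)"
whitespace_tokens = line.split()
regex_tokens = tokenize(line)

print(whitespace_tokens)  # ['return', 'transform(data,', 'config)']
print(regex_tokens)       # ['return', 'transform', 'data', 'config']
```

With whitespace splitting, a query token "transform" never matches "transform(data,". The regex version also splits snake_case identifiers such as process_data into "process" and "data", which may or may not be what you want for your corpus.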

Combine contextual embeddings with contextual BM25 for hybrid search:

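def hybrid_search(query, vector_db, bm25_index, alpha=0.5, k=20):
    """Combine vector and keyword rankings with a weighted reciprocal rank."""
    # Vector search. Assumption: similarity_search returns chunk indices,
    # best match first; adapt this call to your vector DB's API.
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    vector_results = vector_db.similarity_search(query_embedding, k=k)

    # BM25 search: rank chunk indices by keyword score
    bm25_scores = bm25_index.get_scores(query.split())
    bm25_results = sorted(
        range(len(bm25_scores)),
        key=lambda i: bm25_scores[i],
        reverse=True
    )[:k]

    # Combine rankings: alpha weights the vector side, 1 - alpha the BM25 side
    # (one simple fusion scheme; the details vary by vector DB)
    combined = {}
    for weight, ranking in ((alpha, vector_results), (1 - alpha, bm25_results)):
        for rank, idx in enumerate(ranking):
            combined[idx] = combined.get(idx, 0.0) + weight / (rank + 1)
    combined_results = sorted(combined, key=combined.get, reverse=True)[:k]
    return combined_results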

Performance Evaluation

Anthropic tested Contextual Retrieval on a dataset of 9 codebases with 248 queries. Results:

Method	Pass@10	Improvement
Basic RAG	~87%	Baseline
Contextual Embeddings	~95%	+8% absolute
+ Contextual BM25	~97%	+10% absolute
+ Reranking	~98%	+11% absolute

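Pass@k is the fraction of queries for which the correct chunk (the "golden chunk") appears in the top-k retrieved results, so the failure rate at k is 1 - Pass@k. In the table above, Contextual Embeddings cut the Pass@10 failure rate from ~13% to ~5%. The 35% headline figure is a relative reduction in failure rate reported by Anthropic, not an absolute gain in Pass@k, so it should not be read directly off this table.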
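
Pass@k is simple to compute once each query has a known golden chunk. The sketch below assumes exactly one golden chunk ID per query, which is a simplification; the function name and the toy data are illustrative only.

```python
from typing import Sequence


def pass_at_k(retrieved: Sequence[Sequence[str]], golden: Sequence[str], k: int) -> float:
    """Fraction of queries whose golden chunk ID is among the top-k retrieved IDs."""
    if len(retrieved) != len(golden):
        raise ValueError("need one golden chunk ID per query")
    if not golden:
        return 0.0
    hits = sum(1 for ranked, gold in zip(retrieved, golden) if gold in ranked[:k])
    return hits / len(golden)


# Toy data: ranked chunk IDs for three queries, plus the golden chunk for each.
retrieved = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
golden = ["c3", "c4", "c0"]

score_at_1 = pass_at_k(retrieved, golden, k=1)  # only the second query hits
score_at_3 = pass_at_k(retrieved, golden, k=3)  # first and second queries hit
failure_rate = 1 - score_at_3
```

Run the same query set against the baseline index and the contextual index, then compare the two failure rates to get a relative reduction for your own corpus.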

Production Considerations

For AWS Bedrock Users

Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option in Bedrock Knowledge Bases. This allows you to use Contextual Retrieval without managing your own infrastructure.

Cost Optimization

Prompt caching is essential for large document collections
Batch context generation during off-peak hours
Consider using smaller Claude models (Claude 3 Haiku) for context generation
Cache results to avoid regenerating context for unchanged documents
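
The last item can be sketched as a small cache keyed by a hash of the document and chunk. ContextCache is a hypothetical helper that keeps results in memory; a production version would persist them to disk or a database. The generator is injected so the example runs offline.

```python
import hashlib
from typing import Callable


class ContextCache:
    """In-memory cache keyed by a hash of document and chunk (hypothetical helper)."""

    def __init__(self, generate_context: Callable[[str, str], str]) -> None:
        self._generate = generate_context
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(chunk: str, document: str) -> str:
        digest = hashlib.sha256()
        digest.update(document.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
        digest.update(chunk.encode("utf-8"))
        return digest.hexdigest()

    def get(self, chunk: str, document: str) -> str:
        """Return cached context, generating it only on a cache miss."""
        key = self._key(chunk, document)
        if key not in self._store:
            self._store[key] = self._generate(chunk, document)
        return self._store[key]


calls = []


def fake_generate(chunk: str, document: str) -> str:
    calls.append(chunk)
    return f"Context for {chunk!r}"


cache = ContextCache(fake_generate)
first = cache.get("chunk A", "doc v1")
second = cache.get("chunk A", "doc v1")  # served from cache, no new call
third = cache.get("chunk A", "doc v2")   # document changed, so it regenerates
```

Hashing the document together with the chunk means any edit to the document invalidates the contexts of all its chunks, which is conservative but safe because the context depends on the whole document.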

When to Use Contextual Retrieval

Best for:

Codebases with implicit dependencies
Legal documents with cross-references
Technical documentation with domain-specific terminology
Any corpus where chunks lose meaning in isolation

Less critical for:

Self-contained chunks (e.g., individual FAQ entries)
Very short documents where the entire document fits in one chunk

Key Takeaways

Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the lost-context problem of traditional chunking
Prompt caching makes this practical at scale—cache the full document once and reuse it across hundreds of context generation calls, reducing API costs by up to 90%
Hybrid search with Contextual BM25 further improves results by combining semantic and keyword matching, each enriched with the same contextual information
Production-ready on major platforms—Anthropic provides Lambda functions for AWS Bedrock, and the technique works on GCP Vertex AI with minimal customization
Start with Pass@k evaluation—measure your baseline retrieval performance before and after implementing Contextual Retrieval to quantify the improvement in your specific use case