Guide | Beginner | Best Practices | 2026-05-12

Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI

Learn how to implement Contextual Embeddings and Contextual BM25 with Claude, Voyage AI, and Cohere to cut top-20 retrieval failure rates in RAG by 35%.

Quick Answer

This guide shows you how to implement Contextual Retrieval, which prepends a short, chunk-specific context (generated from the full document) to each chunk before embedding, to reduce top-20 retrieval failure rates by 35% in RAG applications built with Claude.

Tags: RAG, Contextual Embeddings, Retrieval, Claude API, Prompt Caching

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications — from customer support bots to internal knowledge base assistants. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful.

Enter Contextual Retrieval, a technique pioneered by Anthropic that adds relevant context to each chunk before embedding. The results speak for themselves: a 35% reduction in top-20 retrieval failure rates across diverse datasets.

In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, Voyage AI, and Cohere — and see measurable improvements in your RAG pipeline.

What You'll Need

Prerequisites

Intermediate Python skills
Basic understanding of RAG and vector databases
Docker installed (optional, for BM25)

API Keys

Anthropic API key (Claude)
Voyage AI API key (embeddings)
Cohere API key (reranking)

Time & Cost

Setup time: 30–45 minutes
API costs: ~$5–10 for the full dataset

The Problem: Lost Context in Chunked Retrieval

Traditional RAG splits documents into fixed-size chunks. This works well for many cases, but consider this scenario:

A chunk contains the code snippet def calculate_interest(): — but without the surrounding context, the retriever can't tell if this is for a banking app, a savings calculator, or a loan amortization tool.

When chunks lack context, retrieval accuracy suffers. The retriever returns irrelevant results, and Claude's responses become less reliable.

Solution: Contextual Embeddings

Contextual Embeddings solve this by prepending a concise, chunk-specific context to each chunk before generating the embedding vector. This context is generated by Claude itself, which understands the full document and can summarize what the chunk is about.

How It Works

Split your documents into chunks (e.g., 512 characters)
Generate context for each chunk using Claude (prompt: "What is this chunk about, given the full document?")
Prepend context to the chunk text
Embed the enriched chunk
Retrieve using standard vector search
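
The first step in the list above can be sketched as a simple character-based splitter. The function name `split_into_chunks` and the `overlap` parameter are illustrative assumptions rather than part of the original guide, and many pipelines split on sentence or token boundaries instead.

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks, optionally overlapping.

    Hypothetical helper: the chunk size of 512 characters mirrors the
    example in the list above; `overlap` is an added assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    # The final chunk may be shorter than chunk_size.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

A small overlap can help when a sentence straddles a chunk boundary, at the cost of a few more chunks to embed.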

Implementation

Here's how to generate context for each chunk using Claude:

import anthropic

# By default the client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()


def generate_chunk_context(chunk_text: str, full_document: str) -> str:
    """Generate context for a single chunk using Claude."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system="You are a document analysis assistant. Your task is to generate a brief, specific context for a chunk of text based on the full document it comes from.",
        messages=[
            {
                "role": "user",
                "content": f"Here is the full document:\n\n<document>{full_document}</document>\n\nHere is the chunk we need context for:\n\n<chunk>{chunk_text}</chunk>\n\nGenerate a concise context (1-2 sentences) that explains what this chunk is about in the context of the full document. Focus on the subject matter, not the structure."
            }
        ]
    )
    return response.content[0].text
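
To feed the later steps, each chunk needs to be stored together with its generated context. The sketch below shows one way to do that. The helper name `build_chunks_with_context` is an assumption, and the context generator is passed in as a callable so the loop can be tried with a stub before spending API credits. The `text` and `context` keys match the dictionaries used in the embedding step.

```python
from typing import Callable


def build_chunks_with_context(
    chunks: list[str],
    full_document: str,
    context_fn: Callable[[str, str], str],
) -> list[dict]:
    """Pair every chunk with the context that `context_fn` generates for it.

    `context_fn` takes (chunk_text, full_document) and returns a string,
    the same shape as generate_chunk_context above.
    """
    records = []
    for chunk_text in chunks:
        context = context_fn(chunk_text, full_document).strip()
        records.append({"text": chunk_text, "context": context})
    return records
```

In a real run you would call it as `build_chunks_with_context(chunks, full_document, generate_chunk_context)`.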

Making It Production-Ready with Prompt Caching

Generating context for thousands of chunks can be expensive. Prompt caching reduces costs by caching the full document across multiple chunk requests.

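# Enable prompt caching by marking the document as a "cache_control" breakpoint
def generate_chunk_context_cached(chunk_text: str, full_document: str) -> str:
    """Generate chunk context while caching the full document across requests."""
    # Reuses the `client` created in the previous snippet.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": "You are a document analysis assistant.",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # Identical for every chunk of this document, so it can be cached.
                        "type": "text",
                        "text": f"Full document:\n\n{full_document}",
                        "cache_control": {"type": "ephemeral"}
                    },
                    {
                        # Changes on every request, so it comes after the cache breakpoint.
                        "type": "text",
                        "text": f"Chunk:\n\n{chunk_text}\n\nGenerate context."
                    }
                ]
            }
        ]
    )
    return response.content[0].text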

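With caching, the full document is written to the cache on the first request. Later requests for other chunks of the same document read it from the cache instead of reprocessing it, provided they arrive before the short-lived (ephemeral) cache entry expires. This can reduce API costs by up to 90%.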
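
To see where the savings come from, here is a rough back-of-the-envelope sketch. The function name and the two multipliers (cache writes at 1.25x and cache reads at 0.10x of the base input price) are assumptions for illustration, so check them against current pricing. The sketch also assumes every request after the first hits the cache, and it ignores output tokens.

```python
def estimate_input_cost_units(
    doc_tokens: int,
    chunk_tokens: int,
    num_chunks: int,
    cache_write_multiplier: float = 1.25,  # assumed price factor for writing the cache
    cache_read_multiplier: float = 0.10,   # assumed price factor for reading the cache
) -> dict:
    """Compare input cost, in base-price token units, with and without caching."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Without caching, the whole document is billed at full price for every chunk.
    without_cache = num_chunks * (doc_tokens + chunk_tokens)
    # With caching: one cache write, then cache reads; chunk text is always full price.
    with_cache = (
        doc_tokens * cache_write_multiplier
        + (num_chunks - 1) * doc_tokens * cache_read_multiplier
        + num_chunks * chunk_tokens
    )
    return {
        "without_cache": without_cache,
        "with_cache": with_cache,
        "savings_fraction": 1 - with_cache / without_cache,
    }
```

Under these assumptions, a 10,000-token document split into 100 chunks of 100 tokens each saves roughly 88% of input cost. To confirm that caching is actually taking effect, inspect the `cache_creation_input_tokens` and `cache_read_input_tokens` fields on `response.usage`.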

Step 2: Embedding with Context

Once you have context for each chunk, prepend it to the chunk text before embedding:

import voyageai

vo = voyageai.Client(api_key="your-voyage-api-key")


def embed_with_context(chunks_with_context: list[dict]) -> list[list[float]]:
    """Embed chunks with their context prepended."""
    enriched_texts = [
        f"{chunk['context']}\n\n{chunk['text']}"
        for chunk in chunks_with_context
    ]

    # input_type="document" is for indexed text; embed queries with input_type="query".
    response = vo.embed(
        texts=enriched_texts,
        model="voyage-3-large",
        input_type="document"
    )
    return response.embeddings
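
Retrieval then comes down to comparing a query embedding against the stored chunk embeddings. The sketch below uses plain-Python cosine similarity to keep the idea visible. The function names are assumptions, and a vector database would normally do this step at scale. The query itself would be embedded with the same model, using `input_type="query"`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query_embedding: list[float],
    chunk_embeddings: list[list[float]],
    k: int = 20,
) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(chunk_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned indices point back into `chunks_with_context`, so the original chunk text can be recovered for the prompt.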

Step 3: Contextual BM25 for Hybrid Search

The same chunk-specific context can also improve BM25 (keyword-based) search. This creates a powerful hybrid search system:

Vector search captures semantic meaning
Contextual BM25 captures keyword relevance with enriched context

Implementation

from rank_bm25 import BM25Okapi


def build_contextual_bm25(chunks_with_context: list[dict]) -> BM25Okapi:
    """Build a BM25 index using context-enriched chunks."""
    enriched_texts = [
        f"{chunk['context']} {chunk['text']}"
        for chunk in chunks_with_context
    ]
    # Simple whitespace tokenization; tokenize queries the same way at search time.
    tokenized_corpus = [text.split() for text in enriched_texts]
    return BM25Okapi(tokenized_corpus)
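
The guide does not say how the vector and BM25 result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used for the reported results. The function name and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each inner list is ordered best-first, for example the chunk indices
    from vector search and the chunk indices from BM25. A chunk scores
    1 / (k + rank) in every list it appears in, and scores are summed.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
```

The BM25 ranking can be obtained by sorting chunk indices by the scores from `bm25.get_scores(query.split())`. Because RRF only looks at ranks, it avoids having to normalize cosine similarities and BM25 scores onto a common scale.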

Step 4: Reranking for Final Precision

Even with Contextual Embeddings, the top-10 results may contain irrelevant chunks. Adding a reranker (like Cohere's) improves final accuracy:

import cohere

co = cohere.Client("your-cohere-api-key")


def rerank_results(query: str, chunks: list[str], top_k: int = 5) -> list:
    """Rerank retrieved chunks using Cohere's reranker.

    Each returned result exposes `index` (the position in `chunks`) and
    `relevance_score`, ordered from most to least relevant.
    """
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=chunks,
        top_n=top_k
    )
    return response.results
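
The reranker returns positions and scores rather than the chunk text, so the results need to be mapped back to the candidate list. A minimal sketch follows, assuming each result exposes `index` and `relevance_score`. The helper name and the `min_score` threshold are illustrative additions.

```python
def select_reranked_chunks(results, chunks: list[str], min_score: float = 0.0) -> list[dict]:
    """Map rerank results back to chunk text, dropping low-scoring hits.

    `results` is the list returned by rerank_results; each item is expected
    to have `.index` (position in `chunks`) and `.relevance_score`.
    """
    selected = []
    for result in results:
        if result.relevance_score >= min_score:
            selected.append(
                {"text": chunks[result.index], "score": result.relevance_score}
            )
    return selected
```

The selected texts, in their reranked order, are what you would place in the prompt sent to Claude.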

Performance Results

On a dataset of 9 codebases with 248 queries, Contextual Retrieval delivered:

Method	Pass@10	Improvement over Basic RAG
Basic RAG	87%	baseline
Contextual Embeddings	95%	+8 percentage points
Contextual Embeddings + BM25	97%	+10 percentage points
Full Pipeline (with reranking)	99%	+12 percentage points
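
The guide does not spell out how Pass@10 is computed. A reasonable reading, used here as an assumption, is the fraction of queries whose known-correct ("golden") chunk appears among the top 10 retrieved chunks. The sketch assumes one golden chunk ID per query, and the function name is illustrative.

```python
def pass_at_k(retrieved: list[list[str]], golden: list[str], k: int = 10) -> float:
    """Fraction of queries whose golden chunk ID is in the top-k retrieved IDs.

    `retrieved[i]` is the ranked list of chunk IDs returned for query i,
    and `golden[i]` is the single correct chunk ID for that query.
    """
    if len(retrieved) != len(golden):
        raise ValueError("retrieved and golden must have the same length")
    if not golden:
        return 0.0
    hits = sum(
        1 for ranked_ids, gold_id in zip(retrieved, golden)
        if gold_id in ranked_ids[:k]
    )
    return hits / len(golden)
```

Running the same query set through each pipeline variant and comparing the resulting scores gives a table like the one above for your own data.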

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, Anthropic provides a Lambda function that adds context to each document during ingestion. Deploy the function from the contextual-rag-lambda-function directory and select it as a custom chunking option when configuring your knowledge base.

Cost Management

Prompt caching is essential for large document sets
Batch processing chunks reduces API calls
Voyage AI offers competitive embedding pricing
Cohere reranker is optional but adds ~$0.001 per query
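
Batching, mentioned in the list above, can be sketched with a small generator that slices a list into fixed-size groups. The helper name `iter_batches` is an assumption, and the right batch size depends on the limits of the API you are calling.

```python
from typing import Iterator


def iter_batches(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For example, looping over `iter_batches(chunks_with_context, 128)` and passing each batch to `embed_with_context` sends one embedding request per batch instead of one per chunk. The value 128 is a placeholder to verify against your provider's documented limits.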

Key Takeaways

Contextual Embeddings reduce top-20 retrieval failures by 35%: prepending a chunk-specific context, generated from the full document, to each chunk before embedding improves accuracy.
Prompt caching makes it practical: by caching the full document, you can generate context for thousands of chunks at a fraction of the uncached cost.
Hybrid search with Contextual BM25 boosts results further: combining vector and keyword search over context-enriched chunks raised Pass@10 from 95% to 97%.
Reranking adds the final polish: in the results above, adding a reranker raised Pass@10 from 97% to 99%.
Production-ready on major platforms: the technique works on Anthropic's API, AWS Bedrock, and GCP Vertex AI with minimal customization.