Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI Users
Learn how to improve RAG performance using Contextual Embeddings and BM25 with Claude AI. Step-by-step guide with code examples, evaluation metrics, and cost-saving tips.
This guide teaches you how to implement Contextual Retrieval—adding relevant context to document chunks before embedding—to reduce retrieval failure rates by up to 35% and improve Pass@10 accuracy from ~87% to ~95% in RAG systems using Claude AI.
Retrieval Augmented Generation (RAG) is a powerful pattern that enables Claude to answer questions using your internal knowledge bases, codebases, or any document corpus. However, traditional RAG systems often suffer from a fundamental problem: when documents are split into smaller chunks for efficient retrieval, individual chunks can lose their surrounding context, leading to poor retrieval accuracy.
In this guide, we'll explore Contextual Retrieval—a technique developed by Anthropic that significantly improves RAG performance by adding relevant context to each chunk before embedding. According to Anthropic's internal testing, this method reduces the top-20-chunk retrieval failure rate by an average of 35% across various data sources.
What You'll Learn
By the end of this guide, you'll know how to:
- Set up a basic RAG pipeline with Claude
- Implement Contextual Embeddings to improve chunk quality
- Use Contextual BM25 for hybrid search
- Apply reranking to further boost performance
- Leverage prompt caching to manage costs
Prerequisites
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings
- Basic command-line proficiency
Environment:
- Python 3.8+
- Docker installed and running (optional, for BM25 search)
- 4GB+ available RAM
- ~5-10 GB disk space for vector databases
API Keys:
- Anthropic API key (free tier sufficient)
- Voyage AI API key (for embeddings)
- Cohere API key (for reranking)
Time & Cost:
- Expected completion: 30-45 minutes
- API costs: ~$5-10 to run through the full dataset
Step 1: Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere rank_bm25 numpy
Initialize your clients:
import anthropic
import voyageai
import cohere
# Initialize clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client(api_key="YOUR_COHERE_KEY")
Step 2: Understanding the Problem with Basic RAG
In a basic RAG setup, documents are split into chunks using simple character or token splitting. While this works for many applications, it creates a critical issue: individual chunks lack surrounding context.
Consider a codebase chunk containing just def calculate_total():. Without context, an embedding model might not understand this is part of a financial calculation function. The result? Poor retrieval when a user asks about "financial calculations."
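To make the problem concrete, here is a minimal sketch of the kind of naive fixed-size splitting a basic RAG pipeline might use. The split_into_chunks helper and the 800-character chunk size are illustrative assumptions, not part of Anthropic's recipe:

def split_into_chunks(document, chunk_size=800):
    """Naively split a document into fixed-size character chunks (illustrative only)."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

# A chunk like "def calculate_total():" carries no hint that it belongs to, say,
# a billing module, so its embedding is unlikely to match "financial calculations".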
Step 3: Implementing Contextual Embeddings
Contextual Embeddings solve this by prepending relevant context to each chunk before embedding. Here's how it works:
3.1 Generate Context for Each Chunk
Use Claude to generate context for each chunk. The prompt should include the full document and the specific chunk:
def generate_chunk_context(document, chunk):
    """Generate context for a single chunk using Claude."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[
            {
                "role": "user",
                "content": f"""<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context."""
            }
        ]
    )
    return response.content[0].text
3.2 Create Contextual Embeddings
Once you have the context, prepend it to the chunk before embedding:
def create_contextual_embedding(context, chunk):
    """Create an embedding for a chunk with its context."""
    contextual_chunk = f"{context}\n\n{chunk}"
    embedding = vo.embed(
        texts=[contextual_chunk],
        model="voyage-2"
    ).embeddings[0]
    return embedding
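Putting the two pieces together, indexing a whole document looks roughly like the sketch below. It reuses the split_into_chunks helper from earlier; index_document is an illustrative name, and you would swap in your own chunking strategy and vector store in practice.

def index_document(document):
    """Generate context for every chunk and embed the context-prepended chunks."""
    chunks = split_into_chunks(document)
    contextual_chunks, embeddings = [], []
    for chunk in chunks:
        context = generate_chunk_context(document, chunk)
        contextual_chunks.append(f"{context}\n\n{chunk}")
        embeddings.append(create_contextual_embedding(context, chunk))
    return chunks, contextual_chunks, embeddings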
3.3 Optimize Costs with Prompt Caching
Generating context for thousands of chunks can be expensive. Use prompt caching to reduce costs by up to 90%:
def generate_chunk_context_cached(document, chunk):
    """Generate context using prompt caching for efficiency."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": f"You are helping to situate chunks within this document:\n\n{document}",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"Here is the chunk: {chunk}"
            }
        ]
    )
    return response.content[0].text
Note: Prompt caching is currently available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
Step 4: Implementing Contextual BM25
Contextual BM25 extends the same idea to keyword-based search. Instead of using raw chunks, you use the same context-prepended chunks for BM25 indexing:
from rank_bm25 import BM25Okapi

def build_contextual_bm25_index(chunks_with_context):
    """Build a BM25 index using contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in chunks_with_context]
    bm25 = BM25Okapi(tokenized_chunks)
    return bm25

def search_contextual_bm25(bm25_index, query, top_k=10):
    """Search using contextual BM25."""
    tokenized_query = query.split()
    scores = bm25_index.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return top_indices
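A quick usage sketch, assuming contextual_chunks comes from the index_document helper sketched in Step 3 and the query is illustrative:

# Index the context-prepended chunks and run a keyword query
bm25 = build_contextual_bm25_index(contextual_chunks)
hits = search_contextual_bm25(bm25, "financial calculations", top_k=10)
print([contextual_chunks[i][:80] for i in hits])  # preview the top matches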
Step 5: Hybrid Search with Reranking
For best results, combine Contextual Embeddings and Contextual BM25, then rerank:
import numpy as np

def cosine_similarity(query_embedding, embedding_index):
    """Cosine similarity between a query vector and a matrix of chunk embeddings."""
    q = np.asarray(query_embedding)
    m = np.asarray(embedding_index)
    return (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))

def hybrid_search(query, embedding_index, bm25_index, alpha=0.5, top_k=20):
    """Perform hybrid search combining embeddings and BM25."""
    # Get embedding scores
    query_embedding = vo.embed(texts=[query], model="voyage-2").embeddings[0]
    emb_scores = cosine_similarity(query_embedding, embedding_index)
    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    # Combine scores (consider normalizing both score ranges first)
    combined_scores = alpha * emb_scores + (1 - alpha) * bm25_scores
    top_indices = sorted(range(len(combined_scores)),
                         key=lambda i: combined_scores[i],
                         reverse=True)[:top_k]
    return top_indices
def rerank_results(query, chunks, indices, top_k=10):
    """Rerank results using Cohere's reranker."""
    candidates = [chunks[i] for i in indices]
    reranked = co.rerank(
        model="rerank-english-v3.0",  # specify the rerank model explicitly
        query=query,
        documents=candidates,
        top_n=top_k
    )
    return [indices[r.index] for r in reranked.results]
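End to end, retrieval then looks something like the sketch below, where embeddings, contextual_chunks, and bm25 are the artifacts built in the earlier steps and the query is illustrative:

query = "How are financial totals calculated?"
candidate_indices = hybrid_search(query, embeddings, bm25, alpha=0.5, top_k=20)
final_indices = rerank_results(query, contextual_chunks, candidate_indices, top_k=10)
top_chunks = [contextual_chunks[i] for i in final_indices]  # pass these to Claude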
Step 6: Measuring Performance
Use Pass@k as your evaluation metric. This measures whether the "golden chunk" (the correct answer) appears in the top-k retrieved results:
def evaluate_pass_at_k(retrieval_results, golden_chunks, k=10):
    """Calculate Pass@k accuracy."""
    correct = 0
    for query_results, golden in zip(retrieval_results, golden_chunks):
        if golden in query_results[:k]:
            correct += 1
    return correct / len(retrieval_results)
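For example, with two evaluation queries, where each entry in retrieval_results is the ranked list of retrieved chunk IDs and golden_chunks holds the expected chunk per query (the IDs below are made up):

retrieval_results = [
    ["chunk_12", "chunk_3", "chunk_7"],   # ranked results for query 1
    ["chunk_9", "chunk_41", "chunk_2"],   # ranked results for query 2
]
golden_chunks = ["chunk_3", "chunk_50"]   # expected chunk per query
print(evaluate_pass_at_k(retrieval_results, golden_chunks, k=3))  # 0.5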
Anthropic's testing showed that Contextual Embeddings improved Pass@10 from ~87% to ~95% on a codebase dataset.
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock, you can deploy a Lambda function to add context to documents automatically. The Anthropic cookbook includes a contextual-rag-lambda-function directory with ready-to-use code. Deploy this Lambda and select it as a custom chunking option when configuring a Bedrock Knowledge Base.
Cost Management
- Use Claude 3 Haiku for context generation (fastest and cheapest)
- Leverage prompt caching to avoid reprocessing the full document for each chunk
- Batch your API calls where possible (see the sketch after this list)
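For the batching point, one option is Anthropic's Message Batches API. The sketch below assumes your version of the anthropic SDK exposes claude.messages.batches.create and that chunks comes from the indexing step; check the current API docs for the exact interface before relying on it.

# Submit all context-generation requests as one asynchronous batch
batch = claude.messages.batches.create(
    requests=[
        {
            "custom_id": f"chunk-{i}",
            "params": {
                "model": "claude-3-haiku-20240307",
                "max_tokens": 100,
                "messages": [{"role": "user", "content": f"Here is the chunk: {chunk}"}],
            },
        }
        for i, chunk in enumerate(chunks)
    ]
)
print(batch.id)  # poll this batch until processing ends, then fetch the results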
Key Takeaways
- Contextual Embeddings dramatically improve retrieval accuracy: By prepending context to each chunk before embedding, you can reduce retrieval failure rates by 35% and improve Pass@10 from ~87% to ~95%.
- Contextual BM25 boosts hybrid search: Applying the same context to BM25 indexing improves keyword-based retrieval, making hybrid search even more effective.
- Prompt caching makes it practical: Without caching, generating context for thousands of chunks would be cost-prohibitive. Prompt caching reduces costs by up to 90%.
- Reranking adds the final polish: Combining Contextual Embeddings, Contextual BM25, and a reranker creates a robust RAG pipeline that handles edge cases well.
- Production-ready on major platforms: The technique works on Anthropic's API, AWS Bedrock (via Lambda), and GCP Vertex AI, making it accessible for enterprise deployments.