Guide | Beginner | Best Practices | 2026-05-15

Mastering Contextual Retrieval: Boost RAG Accuracy with Claude & Contextual Embeddings

Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. A practical guide with code examples for building production-ready RAG systems.

Quick Answer

This guide shows how to implement Contextual Retrieval, a technique that adds relevant context to document chunks before embedding, using Claude, Voyage AI, and prompt caching. In Anthropic's tests it reduced the top-20-chunk retrieval failure rate by 35%.

Tags: RAG, Contextual Embeddings, Retrieval, Prompt Caching, BM25

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context, leading to poor search results and inaccurate answers.

Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.

In this guide, you'll learn how to build a Contextual Retrieval system using Claude, Voyage AI embeddings, and BM25 search—complete with code examples and production-ready optimization strategies.

What You'll Need

Prerequisites

Intermediate Python knowledge
Basic understanding of RAG and vector databases
Familiarity with command-line tools

API Keys & Tools

Anthropic API key (free tier works)
Voyage AI API key
Cohere API key (for reranking)
Python 3.8+, Docker (optional for BM25), 4GB+ RAM

Time & Cost: 30–45 minutes; ~$5–10 in API costs for the full dataset.

1. The Problem: Lost Context in Chunked Documents

Traditional RAG pipelines split documents into fixed-size chunks. While efficient, this approach creates a critical flaw:

A chunk about a Python function calculate_interest() might not mention it belongs to a BankAccount class—so a query about "bank account interest calculation" may miss it entirely.
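
The effect is easy to reproduce. The sketch below assumes a naive fixed-size character splitter; the `split_fixed` helper and the `BankAccount` source are made up for illustration and are not part of any library.

```python
# Illustrative only: a made-up BankAccount class and a naive splitter.
SOURCE = '''class BankAccount:
    # A simple savings account.

    def __init__(self, balance, rate):
        self.balance = balance
        self.rate = rate

    def calculate_interest(self):
        return self.balance * self.rate
'''


def split_fixed(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = split_fixed(SOURCE, 120)

# The chunk that holds calculate_interest() no longer names its class.
orphan = next(c for c in chunks if "calculate_interest" in c)
print("BankAccount" in orphan)  # False
```

Nothing in the orphaned chunk says it belongs to a bank account, so neither an embedding model nor a keyword index has much to match against.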

This is where Contextual Embeddings shine.

2. What Are Contextual Embeddings?

Contextual Embeddings add a short, chunk-specific context string to each chunk before embedding. This context explains what the chunk is about and where it fits in the broader document.

How it works:

Split your document into chunks (e.g., 500 characters)
For each chunk, use Claude to generate a 50–100 word context description
Prepend this context to the chunk text
Embed the combined text (context + chunk)
Store in your vector database

When a user queries your RAG system, the embedding search now matches against enriched chunks that carry their full context.
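
The five steps above can be sketched end to end as follows. This is a minimal outline rather than a definitive implementation: `contextualize` and `embed` are hypothetical callables standing in for the Claude and embedding calls, and the record layout is an assumption.

```python
from typing import Callable, Dict, List


def split_into_chunks(document: str, chunk_size: int = 500) -> List[str]:
    """Step 1: naive fixed-size split by character count."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]


def build_enriched_records(
    document: str,
    contextualize: Callable[[str, str], str],
    embed: Callable[[str], List[float]],
    chunk_size: int = 500,
) -> List[Dict]:
    """Steps 2-4: contextualize, prepend, and embed each chunk."""
    records = []
    for i, chunk in enumerate(split_into_chunks(document, chunk_size)):
        context = contextualize(document, chunk)  # Step 2: e.g. a Claude call
        enriched = f"{context}\n\n{chunk}"        # Step 3: prepend the context
        records.append({
            "id": f"chunk_{i}",
            "original_chunk": chunk,
            "context": context,
            "enriched_text": enriched,
            "embedding": embed(enriched),         # Step 4: embed context + chunk
        })
    return records  # Step 5: the caller writes these to a vector database


# Stand-in callables so the sketch runs without API keys.
def fake_contextualize(document: str, chunk: str) -> str:
    return "Part of a made-up example document."


def fake_embed(text: str) -> List[float]:
    return [float(len(text))]


records = build_enriched_records("x" * 1200, fake_contextualize, fake_embed)
print(len(records))  # 3 chunks: 500 + 500 + 200 characters
```

Keeping the original chunk alongside the enriched text lets you show users the unmodified source while searching over the enriched version.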

Why It Works So Well

Semantic clarity: The embedding vector captures both the chunk's content and its role in the document
Disambiguation: Two similar-looking code snippets from different classes become distinct
Improved recall: Queries that reference high-level concepts now match relevant sub-chunks
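
A toy illustration of the recall point: the sketch below counts shared terms between a query and a chunk, with and without a prepended context. Term overlap is only a rough stand-in for embedding similarity, and the context string is hand-written for this example rather than generated by Claude.

```python
import re


def tokens(text):
    """Lowercase the text and split on anything that is not a letter."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(query, text):
    """Count the query terms that also appear in the text."""
    return len(tokens(query) & tokens(text))


query = "bank account interest calculation"
chunk = "def calculate_interest(self):\n    return self.balance * self.rate"
context = (
    "This method belongs to the BankAccount class and performs the "
    "interest calculation for a bank account."
)

raw_overlap = overlap(query, chunk)
enriched_overlap = overlap(query, f"{context}\n\n{chunk}")
print(raw_overlap, enriched_overlap)  # 1 4
```

The raw chunk shares only "interest" with the query, while the enriched chunk shares all four query terms.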

3. Implementation: Building Contextual Retrieval with Claude

Let's walk through the implementation using a dataset of 9 codebases with 248 queries, each labeled with the "golden" chunks that a correct retrieval should return.

Step 1: Generate Context for Each Chunk

import anthropic

client = anthropic.Anthropic(api_key="YOUR_API_KEY")


def generate_chunk_context(document_text, chunk_text):
    """Generate context for a single chunk using Claude."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        system="You are a document analyzer. Given a document and a chunk from it, "
               "provide a brief context (50-100 words) explaining what this chunk is about "
               "and how it relates to the broader document.",
        messages=[
            {
                "role": "user",
                # Only the first 2,000 characters of the document are sent here,
                # which keeps the request small but hides later sections from Claude.
                "content": f"Document:\n{document_text[:2000]}\n\n"
                           f"Chunk:\n{chunk_text}\n\n"
                           f"Provide context for this chunk:"
            }
        ]
    )
    # The first content block holds the generated context text.
    return response.content[0].text
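
With many chunks per document, the per-chunk calls can run concurrently. The sketch below is one possible approach, assuming a thread pool is acceptable for your rate limits. `contextualize_chunks` and `max_workers=4` are illustrative choices, and `generate` is any callable with the same signature as `generate_chunk_context`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def contextualize_chunks(document_text, chunks, generate, max_workers=4):
    """Return one context string per chunk, in the same order as `chunks`.

    `generate` is any callable taking (document_text, chunk_text), such as
    generate_chunk_context from Step 1.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order even if calls finish out of order.
        return list(pool.map(partial(generate, document_text), chunks))


# Stand-in generator so the sketch runs offline; swap in generate_chunk_context.
def fake_generate(document_text, chunk_text):
    return f"Context for a {len(chunk_text)}-character chunk."


chunks = ["alpha", "beta gamma", "delta"]
contexts = contextualize_chunks("full document text", chunks, fake_generate)
```

The resulting `chunks` and `contexts` are parallel lists, which is the shape the later steps iterate over.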

Step 2: Create Contextual Embeddings

import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")


def create_contextual_embedding(chunk_text, context):
    """Create embedding from context + chunk."""
    enriched_text = f"{context}\n\n{chunk_text}"
    result = vo.embed([enriched_text], model="voyage-2")
    return result.embeddings[0]
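
Embedding one chunk per request is slow for large corpora. The sketch below batches the enriched texts instead, assuming `vo` is a `voyageai.Client`. The `batched` and `embed_enriched_chunks` helpers are hypothetical, and `batch_size=64` is an arbitrary value rather than a documented limit.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_enriched_chunks(vo, chunks, contexts, model="voyage-2", batch_size=64):
    """Embed context + chunk pairs in batches and return one vector per chunk.

    `vo` is a voyageai.Client. batch_size=64 is an assumed value; check the
    limits that apply to your account before raising it.
    """
    enriched = [f"{context}\n\n{chunk}" for chunk, context in zip(chunks, contexts)]
    embeddings = []
    for batch in batched(enriched, batch_size):
        # input_type="document" marks these texts as corpus entries, not queries.
        result = vo.embed(batch, model=model, input_type="document")
        embeddings.extend(result.embeddings)
    return embeddings
```

A call such as `embed_enriched_chunks(vo, chunks, contexts)` returns the vectors in the same order as the input chunks.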

Step 3: Build Your Retrieval Pipeline

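import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB with Voyage embeddings.
# Because the collection has an embedding function, Chroma embeds each added
# document itself, so the enriched text is all that needs to be passed in.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="contextual_rag",
    embedding_function=embedding_functions.VoyageAIEmbeddingFunction(
        api_key="YOUR_VOYAGE_API_KEY",
        model_name="voyage-2"
    )
)

# Add chunks with context.
# `chunks` and `contexts` are parallel lists: one generated context per chunk.
for i, (chunk, context) in enumerate(zip(chunks, contexts)):
    enriched = f"{context}\n\n{chunk}"
    collection.add(
        documents=[enriched],
        ids=[f"chunk_{i}"],
        # Keep the unmodified chunk so it can be shown to users or passed to the LLM.
        metadatas=[{"original_chunk": chunk, "context": context}]
    )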
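
Once the collection is populated, retrieval is a single query call. The sketch below is a hypothetical `retrieve` helper that returns the original chunk text stored in the metadata above. The default `k=20` is an assumption chosen to mirror the top-20 metric mentioned earlier.

```python
def retrieve(collection, query, k=20):
    """Query a Chroma collection and return (id, original_chunk, distance) tuples."""
    results = collection.query(
        query_texts=[query],
        n_results=k,
        include=["metadatas", "distances"],
    )
    # Chroma returns one inner list per query text; exactly one query was sent.
    ids = results["ids"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        (doc_id, meta["original_chunk"], dist)
        for doc_id, meta, dist in zip(ids, metadatas, distances)
    ]
```

Chroma reports distances, so smaller values mean closer matches. Convert them to similarities before combining them with other scores.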

4. Optimizing Costs with Prompt Caching

Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing the document prefix across multiple context generation calls.

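# Enable prompt caching for the document prefix.
# The document sits in its own cached block; the chunk goes in a separate,
# uncached block after it. Content after the cache breakpoint can change
# between calls without invalidating the cached document prefix.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=150,
    system=[
        {
            "type": "text",
            "text": "You are a document analyzer...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Document:\n{document_text}",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": f"Chunk:\n{chunk_text}\n\nProvide context for this chunk:"
                }
            ]
        }
    ]
)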

Note: Prompt caching is available on Anthropic's first-party API. Availability on AWS Bedrock and GCP Vertex AI has changed over time, so check each platform's current documentation.
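
To get a feel for the savings, here is a rough back-of-the-envelope estimate for the repeated document prefix. The multipliers (cache writes at 1.25x and cache reads at 0.10x the base input price) and the price of 3.0 per million tokens are assumptions for illustration, so check the current pricing page before relying on them.

```python
def estimate_document_input_cost(
    doc_tokens,
    n_chunks,
    price_per_mtok,
    cache_write_multiplier=1.25,
    cache_read_multiplier=0.10,
):
    """Compare input cost for the repeated document prefix with and without caching.

    Returns (uncached_cost, cached_cost) in the currency of price_per_mtok.
    Chunk tokens, output tokens, and cache expiry are ignored for simplicity.
    """
    per_token = price_per_mtok / 1_000_000
    uncached = n_chunks * doc_tokens * per_token
    # The first call writes the cache; the other n_chunks - 1 calls read from it.
    cached = doc_tokens * per_token * (
        cache_write_multiplier + (n_chunks - 1) * cache_read_multiplier
    )
    return uncached, cached


# Hypothetical document: 8,000 tokens split into 40 chunks.
uncached, cached = estimate_document_input_cost(
    doc_tokens=8_000, n_chunks=40, price_per_mtok=3.0
)
print(f"uncached: {uncached:.4f}, cached: {cached:.4f}")
```

To confirm that caching is actually taking effect, inspect `response.usage.cache_creation_input_tokens` and `response.usage.cache_read_input_tokens` on each response. The read count should be non-zero from the second chunk of a document onward.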

5. Going Further: Contextual BM25 & Hybrid Search

Context isn't just for embeddings. You can apply the same context to BM25 (keyword-based search) for even better results.

Contextual BM25 Implementation

from rank_bm25 import BM25Okapi


def build_contextual_bm25(chunks, contexts):
    """Build BM25 index with contextualized chunks."""
    enriched_chunks = [f"{context}\n\n{chunk}" for chunk, context in zip(chunks, contexts)]
    # Whitespace tokenization; queries must be tokenized the same way.
    tokenized = [chunk.split() for chunk in enriched_chunks]
    return BM25Okapi(tokenized)


# Hybrid search: combine embedding similarity + BM25 scores
bm25 = build_contextual_bm25(chunks, contexts)


def hybrid_search(query, embedding_results, bm25_results, alpha=0.5):
    """Combine embedding and BM25 scores.

    Both result arguments are lists of (doc_id, score) pairs where a higher
    score is better. BM25 scores are unbounded, so put both lists on a
    comparable scale before mixing them. `query` is not used in the fusion.
    """
    combined_scores = {}
    for doc_id, score in embedding_results:
        combined_scores[doc_id] = alpha * score
    for doc_id, score in bm25_results:
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + (1 - alpha) * score
    return sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
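
A weighted sum only behaves well when both score lists are on the same scale, and raw BM25 scores are not bounded the way cosine similarities are. One common option, shown here as a sketch rather than the method used in Anthropic's tests, is min-max normalization before fusion. The `min_max_normalize` and `fuse` helpers and the scores are made up for illustration.

```python
def min_max_normalize(results):
    """Rescale (doc_id, score) pairs so that scores fall in [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal: treat every document as equally relevant.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in results]


def fuse(embedding_results, bm25_results, alpha=0.5):
    """Weighted sum of normalized embedding and BM25 scores, best first."""
    combined = {}
    for doc_id, score in min_max_normalize(embedding_results):
        combined[doc_id] = alpha * score
    for doc_id, score in min_max_normalize(bm25_results):
        combined[doc_id] = combined.get(doc_id, 0.0) + (1 - alpha) * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: cosine similarities next to unbounded BM25 scores.
embedding_results = [("chunk_0", 0.82), ("chunk_1", 0.80), ("chunk_2", 0.60)]
bm25_results = [("chunk_1", 12.0), ("chunk_2", 7.0), ("chunk_3", 2.0)]
ranked = fuse(embedding_results, bm25_results)
print([doc_id for doc_id, _ in ranked])
# ['chunk_1', 'chunk_0', 'chunk_2', 'chunk_3']
```

The BM25 side can be produced with `bm25.get_scores(query.split())`, paired with chunk ids. The embedding side must be a similarity where higher is better, so convert distances first if your vector store returns them.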

6. Performance Results & Reranking

In Anthropic's tests with 9 codebases and 248 queries:

Technique	Pass@10	Improvement over baseline
Basic RAG	87%	Baseline
Contextual Embeddings	95%	+8 pts
Contextual Embeddings + BM25	97%	+10 pts
+ Reranking (Cohere)	98%	+11 pts
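
To run the same kind of evaluation on your own data, you need a Pass@k metric. The sketch below assumes one reasonable definition: for each query, the fraction of its golden chunks that appear in the top k retrieved ids, averaged over all queries. The source does not spell out the exact formula, so treat this definition and the chunk ids as assumptions.

```python
def pass_at_k(retrieved_ids, golden_ids, k=10):
    """Fraction of golden chunks found in the top-k retrieved ids for one query."""
    if not golden_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for golden in golden_ids if golden in top_k) / len(golden_ids)


def mean_pass_at_k(runs, k=10):
    """Average pass@k over a list of (retrieved_ids, golden_ids) pairs."""
    if not runs:
        return 0.0
    return sum(pass_at_k(retrieved, golden, k) for retrieved, golden in runs) / len(runs)


# Made-up evaluation data: two queries with one golden chunk each.
runs = [
    (["c3", "c7", "c1"], ["c7"]),  # golden chunk retrieved
    (["c2", "c9", "c4"], ["c5"]),  # golden chunk missed
]
score = mean_pass_at_k(runs, k=10)
print(score)  # 0.5
```

Run the metric on the same query set before and after adding context so the comparison isolates the effect of the change.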

Adding Reranking

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")


def rerank_results(query, candidates, top_k=5):
    """Rerank retrieved chunks using Cohere's reranker."""
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k
    )
    # Each result carries the candidate's index in the input list and its score.
    return [(r.index, r.relevance_score) for r in results.results]
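
Because `rerank_results` returns indices into the candidate list, the usual pattern is to over-retrieve, rerank, and then map the indices back to text. The sketch below shows that flow with stand-in functions. `retrieve_then_rerank`, the fake helpers, and `n_candidates=50` are illustrative assumptions, not values from the tests above.

```python
def retrieve_then_rerank(query, retrieve, rerank, n_candidates=50, top_k=5):
    """Fetch a wide candidate set, rerank it, and return (chunk_text, score) pairs.

    `retrieve(query, n)` returns a list of chunk texts; `rerank(query, candidates,
    top_k)` returns (index, score) pairs such as those from rerank_results.
    """
    candidates = retrieve(query, n_candidates)
    ranked = rerank(query, candidates, top_k)
    return [(candidates[index], score) for index, score in ranked]


# Stand-ins so the sketch runs offline; swap in real retrieval and reranking.
def fake_retrieve(query, n):
    corpus = [
        "interest is paid monthly",
        "open a new account",
        "account interest rate table",
    ]
    return corpus[:n]


def fake_rerank(query, candidates, top_k=5):
    words = set(query.lower().split())
    scored = [
        (i, float(len(words & set(text.lower().split()))))
        for i, text in enumerate(candidates)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


top = retrieve_then_rerank("account interest", fake_retrieve, fake_rerank, top_k=2)
print(top[0])  # ('account interest rate table', 2.0)
```

Retrieving more candidates than you finally return gives the reranker room to promote chunks that the first-stage search ranked too low.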

7. Production Deployment on AWS Bedrock

For AWS users, Anthropic provides a Lambda function that implements contextual chunking for Bedrock Knowledge Bases. Deploy it as a custom chunking option:

Deploy the Lambda using contextual-rag-lambda-function/lambda_function.py
In Bedrock Knowledge Base creation, select "Custom chunking"
Point to your deployed Lambda
Your chunks are now automatically contextualized

Key Takeaways

Contextual Embeddings reduce retrieval failure by 35% by enriching chunks with document-level context before embedding
Combine with BM25 for hybrid search that leverages both semantic and keyword matching
Use prompt caching to make context generation cost-effective at scale (caches document prefixes across chunks)
Reranking adds roughly one more percentage point of Pass@10 in these tests (97% to 98%), which makes it worth implementing for high-stakes applications
Production-ready on AWS Bedrock via custom Lambda chunking—no need to rebuild your infrastructure

Start by contextualizing your most important document chunks. Even a 35% reduction in retrieval failures can dramatically improve your RAG application's reliability and user trust.