Guide | Beginner | Best Practices | 2026-05-22

Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG retrieval accuracy using Claude and Anthropic's prompt caching.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds chunk-specific context before embedding—to reduce RAG retrieval failure rates by 35% using Claude, Voyage AI, and prompt caching.

Tags: RAG, Contextual Embeddings, Prompt Caching, Retrieval, Claude API

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context that makes them meaningful.

Imagine searching through a codebase and finding a function called process_data(). Without knowing which module it belongs to or what data it expects, that chunk is nearly useless. Contextual Retrieval solves this by prepending relevant context to each chunk before embedding it.

In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude, Voyage AI, and Anthropic's prompt caching. The results speak for themselves: a 35% reduction in top-20 retrieval failure rates across tested datasets.

What You'll Need

Prerequisites

Intermediate Python skills
Basic understanding of RAG and vector databases
Docker installed (optional, for BM25 search)
4GB+ RAM and ~5-10GB disk space

API Keys

Anthropic API key (free tier works)
Voyage AI API key
Cohere API key (for reranking)

Time & Cost: 30-45 minutes to complete; ~$5-10 in API costs for the full dataset.

1. Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere pandas numpy chromadb rank-bm25

Initialize your clients:

import anthropic
import voyageai
# Initialize API clients
claude_client = anthropic.Anthropic(api_key="your-anthropic-key")
vo_client = voyageai.Client(api_key="your-voyage-key")
# Test connection
response = claude_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    messages=[{"role": "user", "content": "Hello"}]
)
print("Claude ready:", response.content[0].text)
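
Hard-coded keys are fine for a quick test, but a common alternative is to read them from environment variables. The following is a minimal sketch under stated assumptions: the helper `load_api_keys` is hypothetical (it is not part of any SDK), and the variable names `ANTHROPIC_API_KEY`, `VOYAGE_API_KEY`, and `CO_API_KEY` are assumed conventions that you may need to adjust.

```python
import os

# Hypothetical helper; the environment variable names below are assumptions.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "VOYAGE_API_KEY", "CO_API_KEY")


def load_api_keys(env=None):
    """Return the required API keys, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}
```

The returned dictionary can then be passed to the client constructors, for example `anthropic.Anthropic(api_key=keys["ANTHROPIC_API_KEY"])`.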

2. The Problem: Context-Starved Chunks

Traditional RAG splits documents into fixed-size chunks, embeds them, and stores them in a vector database. When a query comes in, it retrieves the most similar chunks. But consider this code chunk:

def calculate_metrics(data):
    return np.mean(data), np.std(data)

Without context, the retriever doesn't know:

This is from a financial analysis module
data represents stock price arrays
The function is used for risk assessment

Result: A query about "stock volatility calculation" might miss this chunk entirely.
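
To make the problem concrete, here is a small sketch of fixed-size chunking. The function `split_into_chunks`, the sample file contents, and the chunk size are all illustrative assumptions; real pipelines usually split on tokens or syntax boundaries rather than raw characters.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows with optional overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Hypothetical source file: the header comment carries the domain context.
document = (
    "# risk/volatility.py: risk assessment helpers for stock price arrays\n"
    "import numpy as np\n\n"
    "def calculate_metrics(data):\n"
    "    return np.mean(data), np.std(data)\n"
)

chunks = split_into_chunks(document, chunk_size=70, overlap=0)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

In this toy input, the word "stock" appears only in the header comment, so the chunk that holds `np.std` contains nothing a query about stock volatility could match.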

3. Contextual Embeddings: The Fix

Contextual Embeddings solve this by asking Claude to generate a brief context for each chunk before embedding. Here's the prompt:

CONTEXT_PROMPT = """
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
"""

Implementation with Prompt Caching

Anthropic's prompt caching makes this practical by caching the full document prefix across multiple chunk requests:

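def generate_chunk_context(chunk_text, full_document, chunk_index):
    """Generate context for a single chunk using Claude with prompt caching."""
    # chunk_index is not used in the request; it is kept for logging or debugging.

    # Split the template after the document so the document prefix is identical
    # across requests and can be cached; only the chunk part changes.
    doc_template, chunk_template = CONTEXT_PROMPT.split("</document>", 1)
    document_block = doc_template.replace("{{WHOLE_DOCUMENT}}", full_document) + "</document>"
    chunk_block = chunk_template.replace("{{CHUNK_CONTENT}}", chunk_text)

    response = claude_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        temperature=0,
        system="You are a context-generation assistant.",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": document_block,
                    # Cache breakpoint: everything up to and including the document is cached.
                    # Very short documents may fall below the minimum cacheable length.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": chunk_block},
            ],
        }],
    )

    return response.content[0].text.strip()

# Process all chunks with caching
full_doc = "..."  # Your full document
chunks = [...]  # Your pre-split chunks
contextual_chunks = []
for i, chunk in enumerate(chunks):
    context = generate_chunk_context(chunk, full_doc, i)
    contextual_chunks.append(f"{context}\n\n{chunk}")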

Why prompt caching matters: Without caching, generating context for 1,000 chunks would cost ~$15. With caching, it drops to ~$2-3 because the full document is cached and only the chunk changes between requests.
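
The arithmetic behind that saving can be sketched as follows. Every number here is an illustrative assumption rather than a quoted rate: the per-token price, the cache write and read multipliers, and the token counts are placeholders, and the sketch assumes all requests for a document arrive while the cache entry is still alive.

```python
def estimate_context_cost(
    n_chunks,
    doc_tokens,
    chunk_tokens,
    input_price_per_mtok,
    cache_write_multiplier=1.25,  # assumed surcharge for writing to the cache
    cache_read_multiplier=0.10,   # assumed discount for reading from the cache
):
    """Return (uncached, cached) input cost in dollars for one document.

    Output tokens are ignored because caching does not change them.
    """
    per_token = input_price_per_mtok / 1_000_000

    # Without caching, every request pays full price for document plus chunk.
    uncached = n_chunks * (doc_tokens + chunk_tokens) * per_token

    # With caching, the first request writes the document to the cache and the
    # remaining requests read it; chunk tokens are always billed at full price.
    cache_write = doc_tokens * cache_write_multiplier * per_token
    cache_reads = (n_chunks - 1) * doc_tokens * cache_read_multiplier * per_token
    chunk_cost = n_chunks * chunk_tokens * per_token
    cached = cache_write + cache_reads + chunk_cost
    return uncached, cached


uncached, cached = estimate_context_cost(
    n_chunks=100, doc_tokens=8_000, chunk_tokens=200, input_price_per_mtok=3.0
)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

The saving grows with the ratio of document length to chunk length, since the document is the part that gets the discounted rate.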

4. Embedding and Storing Contextual Chunks

Now embed the contextualized chunks using Voyage AI:

def embed_chunks(chunks, batch_size=128):
    """Batch embed chunks using Voyage AI."""
    all_embeddings = []
    
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        result = vo_client.embed(
            texts=batch,
            model="voyage-2",
            input_type="document"
        )
        all_embeddings.extend(result.embeddings)
    
    return all_embeddings
# Generate embeddings for contextual chunks
contextual_embeddings = embed_chunks(contextual_chunks)

Store these in your vector database (e.g., Pinecone, Weaviate, or Chroma):

import chromadb
client = chromadb.Client()
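# Use cosine distance so that 1 - distance can be read as a similarity score later
collection = client.create_collection("contextual_rag", metadata={"hnsw:space": "cosine"})
# Add chunks with metadata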
collection.add(
    embeddings=contextual_embeddings,
    documents=contextual_chunks,
    ids=[f"chunk_{i}" for i in range(len(contextual_chunks))],
    metadatas=[{"source": "codebase", "index": i} for i in range(len(contextual_chunks))]
)
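
Conceptually, the vector database ranks stored embeddings by their similarity to the query embedding. The sketch below shows that idea with plain Python lists and cosine similarity; the function names and the two-dimensional toy vectors are assumptions for illustration, and a real database uses an approximate index rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for real model output.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_by_cosine([1.0, 0.1], toy_embeddings, k=2))
```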

5. Contextual BM25: Hybrid Search

BM25 is a text-based retrieval method that works well for exact keyword matches. You can apply the same contextual prefix to improve BM25 performance:

from rank_bm25 import BM25Okapi
def build_contextual_bm25(contextual_chunks):
    """Build BM25 index from contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    return bm25
# Search with contextual BM25
bm25 = build_contextual_bm25(contextual_chunks)
query = "stock volatility calculation"
query_tokens = query.split()
bm25_scores = bm25.get_scores(query_tokens)
top_indices = sorted(range(len(bm25_scores)), 
                     key=lambda i: bm25_scores[i], 
                     reverse=True)[:10]
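
To see why the contextual prefix helps keyword search, it can be useful to look at BM25 scoring directly. The sketch below is a small from-scratch scorer using a Lucene-style IDF, which may differ in detail from what `rank_bm25` computes. The context sentence is hand-written for illustration (it is not Claude output), and the parameter defaults `k1=1.5` and `b=0.75` are conventional assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a Lucene-style BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


raw_chunk = "def calculate_metrics(data): return np.mean(data), np.std(data)"
# Hand-written stand-in for a generated context sentence.
context = "This function is from a financial analysis module and computes stock volatility for risk assessment."
docs = [
    raw_chunk.lower().split(),
    f"{context} {raw_chunk}".lower().split(),
    "user login handler validates session tokens".split(),
]
scores = bm25_scores("stock volatility calculation".split(), docs)
print(scores)
```

The raw chunk shares no tokens with the query and scores zero, while the contextualized version of the same chunk gets a positive score.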

Hybrid search combines vector similarity and BM25 scores:

def hybrid_search(query, vector_db, bm25, alpha=0.5):
    """Combine vector and BM25 scores."""
    # Get vector scores
    query_embedding = vo_client.embed([query], model="voyage-2", input_type="query").embeddings[0]
    vector_results = vector_db.query(query_embeddings=[query_embedding], n_results=20)
    
    # Get BM25 scores and scale them to [0, 1] so they are comparable to similarities
    bm25_scores = bm25.get_scores(query.split())
    max_bm25 = max(bm25_scores) if len(bm25_scores) else 0.0
    
    # Normalize and combine
    combined_scores = {}
    for i in range(len(vector_results['ids'][0])):
        idx = int(vector_results['ids'][0][i].split('_')[1])
        # Convert distance to similarity (assumes the collection uses cosine distance)
        vector_score = 1 - vector_results['distances'][0][i]
        bm25_score = bm25_scores[idx] / max_bm25 if max_bm25 > 0 else 0.0
        combined_scores[idx] = alpha * vector_score + (1 - alpha) * bm25_score
    
    # Return top k chunk indices
    top_k = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:10]
    return [idx for idx, score in top_k]
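
Weighted score combination depends on the two score scales being comparable. A common alternative, named plainly here because it is not the method used above, is Reciprocal Rank Fusion (RRF), which uses only the rank positions from each retriever. This is a minimal sketch; the function name, the constant `k=60`, and the toy rankings are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of ids into one list using RRF."""
    fused = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it contains.
            fused[item] = fused.get(item, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused, key=lambda item: fused[item], reverse=True)
    return ordered[:top_n]


# Toy chunk indices as they might come back from each retriever.
vector_ranking = [3, 1, 7, 4]
bm25_ranking = [1, 9, 3, 2]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking], top_n=3))
```

Items that appear near the top of both lists rise to the front, and no score normalization is needed.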

6. Measuring Performance: Pass@k

We evaluate using Pass@k—whether the correct chunk appears in the top-k results:

def evaluate_pass_at_k(retriever, queries, golden_chunks, k=10):
    """Calculate Pass@k metric."""
    successes = 0
    
    for query, golden in zip(queries, golden_chunks):
        results = retriever.search(query, k=k)
        if golden in results:
            successes += 1
    
    return successes / len(queries)
# Example results
print(f"Baseline Pass@10: {0.87:.2%}")  # ~87%
print(f"Contextual Embeddings Pass@10: {0.95:.2%}")  # ~95%

In tests across 9 codebases with 248 queries, Contextual Embeddings improved Pass@10 from ~87% to ~95%.
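
The metric can be tried end to end on toy data before wiring in a real retriever. In this sketch, `KeywordRetriever` is a hypothetical stand-in that ranks documents by shared words, and the documents and queries are invented for illustration. The helper `relative_failure_reduction` shows how a change in pass rate translates into a change in failure rate.

```python
class KeywordRetriever:
    """Toy stand-in retriever: ranks documents by words shared with the query."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, query, k=10):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.documents,
            key=lambda doc: len(query_words & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def pass_at_k(retriever, queries, golden_chunks, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(golden in retriever.search(q, k=k) for q, golden in zip(queries, golden_chunks))
    return hits / len(queries)


def relative_failure_reduction(baseline_pass, improved_pass):
    """Fraction by which the failure rate (1 - pass rate) shrinks."""
    baseline_fail = 1 - baseline_pass
    return (baseline_fail - (1 - improved_pass)) / baseline_fail


docs = ["stock volatility risk metrics", "user login session handler", "image resize thumbnail cache"]
queries = ["stock volatility", "resize image", "payment refund"]
golden = [docs[0], docs[2], docs[1]]
retriever = KeywordRetriever(docs)
print(pass_at_k(retriever, queries, golden, k=1), pass_at_k(retriever, queries, golden, k=3))
```

Applied to the Pass@10 figures above (87% to 95%), `relative_failure_reduction(0.87, 0.95)` gives roughly 0.62. That is a different measurement from the 35% headline figure, which refers to top-20 retrieval failures across datasets, so the two numbers are not directly comparable.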

7. Boosting Further with Reranking

For production systems, add a reranking step using Cohere:

import cohere
co_client = cohere.Client("your-cohere-key")

def rerank_results(query, candidates, top_k=5):
    """Rerank retrieved chunks using Cohere's reranker."""
    response = co_client.rerank(
        query=query,
        documents=candidates,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    
    # Map each result back to its candidate text via the returned index
    return [candidates[result.index] for result in response.results]

# Full pipeline
query = "How do we calculate portfolio risk?"
top_indices = hybrid_search(query, collection, bm25, alpha=0.5)
initial_results = [contextual_chunks[i] for i in top_indices]  # hybrid_search returns chunk indices
final_results = rerank_results(query, initial_results, top_k=5)
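
For offline tests of the pipeline, it can help to swap the hosted reranker for a local scoring function. The sketch below is such a stand-in and is not Cohere's model: `rerank_with_scorer` and `token_overlap` are hypothetical helpers, and word overlap is a deliberately crude relevance signal.

```python
def rerank_with_scorer(query, candidates, score_fn, top_k=5):
    """Reorder candidates by score_fn(query, candidate), highest first."""
    scored = [(score_fn(query, text), position, text) for position, text in enumerate(candidates)]
    # Sort by score descending; ties keep the original retrieval order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_k]]


def token_overlap(query, text):
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


candidates = [
    "login handler",
    "portfolio risk is calculated from volatility",
    "risk notes",
]
print(rerank_with_scorer("how do we calculate portfolio risk", candidates, token_overlap, top_k=2))
```

Because the scorer is passed in as an argument, the same function can wrap a call to a hosted reranker in production and a cheap heuristic in unit tests.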

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, deploy the provided Lambda function (contextual-rag-lambda-function/lambda_function.py) as a custom chunking option. This allows you to add context to each document chunk before it enters your knowledge base.

Cost Optimization

Method	Cost per 1,000 chunks
Without caching	~$15
With prompt caching	~$2-3

Prompt caching is available on Anthropic's first-party API. Availability on AWS Bedrock and GCP Vertex AI has changed over time, so check each platform's current documentation before relying on it there.

Key Takeaways

Contextual Embeddings reduce retrieval failure rates by 35% by adding chunk-specific context before embedding, solving the "context-starved chunk" problem in traditional RAG.
Prompt caching makes this practical for production, reducing costs by 80%+ by caching the full document across multiple chunk context generations.
Contextual BM25 extends the same idea to text-based retrieval, enabling hybrid search that combines vector similarity and keyword matching for even better results.
Reranking adds a final accuracy boost—use Cohere's reranker to refine top results after initial retrieval.
Start with your evaluation set—measure Pass@k before and after implementing Contextual Retrieval to quantify the improvement for your specific use case.