Guide | Beginner | Pricing | 2026-05-16

Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Includes code examples, cost optimization with prompt caching, and production deployment tips for Claude AI.

Quick Answer

This guide shows you how to improve RAG retrieval accuracy by adding context to document chunks before embedding. Using Claude with Contextual Embeddings reduces top-20 retrieval failure by 35%, and combining with Contextual BM25 and reranking pushes Pass@10 from 87% to 95%.

RAG, Contextual Embeddings, Prompt Caching, Retrieval, BM25

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering customer support bots, internal knowledge base Q&A, financial analysis tools, and code generation workflows. But there's a persistent problem: chunks lack context.

When you split a document into chunks for vector search, each chunk becomes an island. A chunk containing "the function returns True" loses the crucial context of which function and why. This leads to missed retrievals, hallucinated answers, and frustrated users.

Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's testing across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.
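To make the effect concrete, the toy sketch below scores one query against a bare chunk and against the same chunk with a context sentence prepended. It uses plain word overlap as a stand-in for an embedding model, and the file name, function name, and context sentence are invented for illustration only.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for what an embedding model sees."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def overlap_score(query, passage):
    """Jaccard overlap between query and passage tokens (toy similarity, not an embedding)."""
    q, p = tokens(query), tokens(passage)
    union = q | p
    return len(q & p) / len(union) if union else 0.0

chunk = "the function returns True"

# Hypothetical context of the kind a model might generate for this chunk.
context = (
    "This chunk is from auth.py and describes validate_token, "
    "which checks whether a session token has expired."
)
contextualized = f"{context}\n\n{chunk}"

query = "what does validate_token return for an expired token"
bare_score = overlap_score(query, chunk)
contextual_score = overlap_score(query, contextualized)
print(f"bare: {bare_score:.2f}, contextualized: {contextual_score:.2f}")
```

The bare chunk shares no words with the query, while the contextualized chunk shares the function name and the words "expired" and "token". Real embeddings are far less brittle than word overlap, but the same missing-information problem applies.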

In this guide, you'll learn how to implement Contextual Retrieval with Claude, optimize costs using prompt caching, and combine it with BM25 search and reranking for production-grade performance.

What You'll Build

By the end of this guide, you'll have a complete Contextual Retrieval pipeline that:

Adds context to each chunk using Claude
Embeds contextualized chunks for vector search
Combines with Contextual BM25 for hybrid retrieval
Reranks results for maximum accuracy

We'll use a dataset of 9 codebases with 248 evaluation queries. The baseline Pass@10 is ~87%. With Contextual Retrieval, we'll push that to ~95%.
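Pass@10 and failure rate describe the same results from opposite sides, so it helps to convert between them. The short sketch below does that arithmetic for the 87% and 95% figures quoted above; the helper name is an assumption made for this example.

```python
def relative_failure_reduction(pass_before_pct, pass_after_pct):
    """Fraction by which the failure rate (100 - pass rate) shrinks."""
    fail_before = 100 - pass_before_pct
    fail_after = 100 - pass_after_pct
    if fail_before <= 0:
        raise ValueError("baseline has no failures to reduce")
    return (fail_before - fail_after) / fail_before

# Pass@10 of 87% means 13% of queries miss; 95% means 5% miss.
reduction = relative_failure_reduction(87, 95)
print(f"Failure rate 13% -> 5%: {reduction:.1%} relative reduction")
```

This is a different measurement from the 35% headline figure, which refers to the top-20-chunk failure rate with Contextual Embeddings alone, so the two numbers should not be compared directly.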

Prerequisites

Skills:

Intermediate Python
Basic RAG understanding
Familiarity with embeddings and vector databases

API Keys:

Anthropic API key (free tier works)
Voyage AI API key (for embeddings)
Cohere API key (for reranking)

Time & Cost:

30-45 minutes to complete
~$5-10 in API costs for the full dataset

Step 1: Setting Up the Baseline RAG

First, let's establish a performance baseline with standard RAG.

import json
import voyageai
from anthropic import Anthropic

# Load your data
with open('data/codebase_chunks.json') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl') as f:
    eval_queries = [json.loads(line) for line in f]

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Embed all chunks (baseline - no context)
# For large datasets, embed in batches to stay within API request limits.
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Simple retrieval function
def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    # Score every chunk by cosine similarity (brute force, fine for small datasets)
    scores = [cosine_similarity(query_emb, emb) for emb in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms
    return sum(ai * bi for ai, bi in zip(a, b)) / (
        (sum(ai * ai for ai in a) ** 0.5) * (sum(bi * bi for bi in b) ** 0.5)
    )

# Evaluate Pass@10
correct = 0
for query in eval_queries:
    results = retrieve(query['query'], k=10)
    if query['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1
print(f"Baseline Pass@10: {correct/len(eval_queries)*100:.1f}%")
# Expected: ~87%
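
Since the same Pass@10 loop is repeated at each step, it can be convenient to wrap it in a helper that accepts any retrieval function. The sketch below is one way to do that; `pass_at_k`, `toy_retrieve`, and the toy data are names invented here so the example runs without API keys, while the `query`, `golden_chunk_id`, and `id` fields follow the evaluation code above.

```python
def pass_at_k(eval_queries, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    if not eval_queries:
        return 0.0
    hits = 0
    for q in eval_queries:
        results = retrieve_fn(q["query"], k)
        if q["golden_chunk_id"] in {r["id"] for r in results}:
            hits += 1
    return hits / len(eval_queries)

# Toy stand-ins (illustrative only): two chunks and a word-overlap retriever.
toy_chunks = [
    {"id": "a", "text": "parse config file"},
    {"id": "b", "text": "open network socket"},
]

def toy_retrieve(query, k):
    words = set(query.split())
    ranked = sorted(
        toy_chunks,
        key=lambda c: len(words & set(c["text"].split())),
        reverse=True,
    )
    return ranked[:k]

toy_eval = [
    {"query": "parse config", "golden_chunk_id": "a"},
    {"query": "network socket", "golden_chunk_id": "b"},
]
toy_score = pass_at_k(toy_eval, toy_retrieve, k=1)
print(f"Toy Pass@1: {toy_score:.0%}")
```

With a helper like this, each later step only needs to pass in its own retrieval function rather than repeating the loop.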

Step 2: Implementing Contextual Embeddings

The core idea is simple: for each chunk, ask Claude to generate a brief context that explains what this chunk is about, then prepend that context before embedding.

The Context Generation Prompt

CONTEXT_PROMPT = """You are helping to improve a retrieval system.
Given a document and a chunk from that document, write a brief context
(2-3 sentences) explaining what this chunk is about and how it relates
to the broader document. Focus on:
- What topic or concept this chunk covers
- How it connects to surrounding content
- Any important entities, functions, or terms mentioned

Document: {full_document}
Chunk: {chunk_text}
Context:"""

def generate_context(full_document, chunk_text):
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                full_document=full_document,
                chunk_text=chunk_text
            )
        }]
    )
    return response.content[0].text

Optimizing with Prompt Caching

Generating context for every chunk individually would be expensive. Prompt caching makes this practical by caching the full document across multiple chunk requests.

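def generate_context_cached(full_document, chunk_text):
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        # Keep the task instructions in the system prompt so they are identical
        # on every request and form part of the cached prefix.
        system=[{
            "type": "text",
            "text": (
                "You are helping to improve a retrieval system. "
                "Given a document and a chunk from that document, write a brief context "
                "(2-3 sentences) explaining what the chunk is about and how it relates "
                "to the broader document."
            )
        }],
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Document:\n{full_document}",
                    # Cache breakpoint: everything up to and including the document is cached
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    # The chunk changes on every request, so it sits after the breakpoint
                    "type": "text",
                    "text": f"Chunk:\n{chunk_text}\n\nContext:"
                }
            ]
        }]
    )
    return response.content[0].text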

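With prompt caching, the first request for a document pays to write the document into the cache, and later requests for chunks of the same document read it back at a reduced rate, so most of the remaining per-chunk cost comes from the chunk text and the generated context. This typically reduces costs by 60-80% for large documents. Cache entries expire after a short idle period (five minutes by default), so process all chunks of one document back to back.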
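
A rough way to see where the savings come from is to count input tokens with and without caching. The sketch below assumes cache writes are billed at 1.25x and cache reads at 0.10x the base input price; treat those multipliers, the token counts, and the function name as assumptions and check current pricing before budgeting. Output tokens are ignored because caching does not change them.

```python
def estimate_input_cost(doc_tokens, chunk_tokens, n_chunks,
                        write_multiplier=1.25, read_multiplier=0.10):
    """Return (naive, cached) input cost in units of one base-price input token.

    The multipliers are assumptions for illustration, not authoritative pricing.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Without caching, every request pays for the full document plus the chunk.
    naive = n_chunks * (doc_tokens + chunk_tokens)
    cached = (
        doc_tokens * write_multiplier                      # first request writes the cache
        + (n_chunks - 1) * doc_tokens * read_multiplier    # later requests read from it
        + n_chunks * chunk_tokens                          # chunk text is never cached
    )
    return naive, cached

# Hypothetical document: 8,000 tokens split into 20 chunks of 400 tokens each.
naive, cached = estimate_input_cost(doc_tokens=8000, chunk_tokens=400, n_chunks=20)
savings = 1 - cached / naive
print(f"naive={naive:.0f}, cached={cached:.0f}, input savings={savings:.0%}")
```

Under these assumptions the savings grow with the number of chunks per document, because the one-time cache write is spread over more reads.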

Building the Contextual Embedding Pipeline

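# Generate contextualized chunks
# Assumes `documents` is a list of {'text': ..., 'chunks': [...]} dicts.
# It is not created in Step 1, so load or build it from your own data first.
contextualized_chunks = []
for doc in documents:
    full_text = doc['text']
    for chunk in doc['chunks']:
        context = generate_context_cached(full_text, chunk['text'])
        contextualized_text = f"{context}\n\n{chunk['text']}"
        contextualized_chunks.append({
            'id': chunk['id'],
            'text': contextualized_text,
            'original_chunk': chunk
        })

# Embed contextualized chunks
contextual_embeddings = vo.embed(
    [c['text'] for c in contextualized_chunks],
    model="voyage-2"
).embeddings

# Retrieve against the contextual embeddings. The Step 1 retrieve() reads the
# baseline `embeddings` and `chunks`, so reusing it would re-measure the baseline.
def retrieve_contextual(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = [cosine_similarity(query_emb, emb) for emb in contextual_embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [contextualized_chunks[i] for i in top_indices]

# Evaluate (same retrieval logic, new embeddings)
correct = 0
for query in eval_queries:
    results = retrieve_contextual(query['query'], k=10)
    if query['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1
print(f"Contextual Embeddings Pass@10: {correct/len(eval_queries)*100:.1f}%")
# Expected: ~93-95%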
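
Because every generated context costs an API call, it may be worth saving contexts to disk so that an interrupted or repeated run does not pay for them twice. The sketch below shows one possible approach using a JSON file keyed by chunk id; the helper names, the file layout, and the `fake_generate` stand-in are all assumptions for illustration, and in practice `generate_fn` would be the context generation function from this step.

```python
import json
import os
import tempfile

def load_context_cache(path):
    """Load previously generated contexts, or start empty if the file is missing."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def get_context(cache, chunk_id, full_document, chunk_text, generate_fn):
    """Return a stored context if present, otherwise generate and store it."""
    key = str(chunk_id)
    if key not in cache:
        cache[key] = generate_fn(full_document, chunk_text)
    return cache[key]

def save_context_cache(cache, path):
    """Write to a temporary file first so a crash cannot leave a half-written cache."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(cache, f)
    os.replace(tmp_path, path)

# Demo with a fake generator that records how often it is called.
calls = []

def fake_generate(full_document, chunk_text):
    calls.append(chunk_text)
    return f"Context for: {chunk_text}"

cache_path = os.path.join(tempfile.mkdtemp(), "contexts.json")
cache = load_context_cache(cache_path)
first = get_context(cache, "chunk-1", "doc text", "def f(): ...", fake_generate)
save_context_cache(cache, cache_path)

# Simulate a second run: reload from disk and ask for the same chunk again.
cache = load_context_cache(cache_path)
second = get_context(cache, "chunk-1", "doc text", "def f(): ...", fake_generate)
print(f"generator calls: {len(calls)}")
```

Keying by chunk id assumes ids are stable across runs; if chunking changes, the stored contexts should be discarded.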

Step 3: Adding Contextual BM25

Contextual BM25 applies the same idea to keyword search. Use the generated context as input to a BM25 index alongside the original chunk.
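
To show why the added context helps keyword search, here is a toy BM25 scorer written from scratch. It is a minimal sketch of the Okapi BM25 formula with one common IDF variant, not a reproduction of any library's internals, and the sample passages and identifiers are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep identifier-like tokens such as validate_token."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query using Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "the function returns True",
    "This chunk describes validate_token in auth.py.\n\nthe function returns True",
    "open a network socket and send bytes",
]
scores = bm25_scores("validate_token returns", docs)
for doc, s in zip(docs, scores):
    print(f"{s:.3f}  {doc[:40]!r}")
```

The contextualized passage outranks the bare one because the rare term `validate_token` only appears in the prepended context, which is exactly the kind of exact-match signal BM25 rewards.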

from rank_bm25 import BM25Okapi

# Build BM25 index with contextualized text
bm25 = BM25Okapi([c['text'].split() for c in contextualized_chunks])

def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    vector_scores = [cosine_similarity(query_emb, emb) for emb in contextual_embeddings]

    # BM25 search
    bm25_scores = bm25.get_scores(query.split())

    # Normalize both score lists to [0, 1] so they can be combined
    vector_scores = normalize(vector_scores)
    bm25_scores = normalize(bm25_scores)

    # alpha weights the vector score; (1 - alpha) weights the BM25 score
    combined = [alpha * v + (1 - alpha) * b for v, b in zip(vector_scores, bm25_scores)]
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    # Indices refer to contextualized_chunks, which both indexes were built from
    return [contextualized_chunks[i] for i in top_indices]

def normalize(scores):
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        # All scores are equal, so avoid dividing by zero
        return [0.0 for _ in scores]
    return [(s - min_s) / (max_s - min_s) for s in scores]

Hybrid search typically adds another 2-3% improvement over vector search alone.
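
Weighted score fusion depends on normalizing two score scales that behave quite differently. An alternative worth knowing, swapped in here purely for comparison, is Reciprocal Rank Fusion (RRF), which combines ranked lists using positions only. The sketch below is a minimal version; the constant `k=60` is a commonly used default and the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids; each id scores sum(1 / (k + rank)) across lists."""
    fused = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda item_id: fused[item_id], reverse=True)

# Hypothetical top-3 results from each retriever, best first.
vector_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c3", "c1", "c4"]
fused_ranking = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused_ranking)
```

Chunks that appear near the top of both lists rise to the front, and no score normalization or `alpha` tuning is needed. Whether this beats weighted fusion on a given dataset is an empirical question.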

Step 4: Reranking for Maximum Accuracy

For the final polish, add a reranking step using Cohere's reranker:

import cohere
co = cohere.Client("YOUR_COHERE_API_KEY")
def retrieve_with_rerank(query, k=10):
    # Get initial candidates (e.g., top 50)
    candidates = hybrid_search(query, k=50)
    
    # Rerank
    reranked = co.rerank(
        query=query,
        documents=[c['text'] for c in candidates],
        model="rerank-english-v2.0",
        top_n=k
    )
    
    return [candidates[r.index] for r in reranked.results]

Reranking typically adds another 1-2% improvement, pushing Pass@10 to ~95%.
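
Since reranking adds a second network dependency to every query, one defensive pattern is to fall back to the first-stage order when the reranker is unavailable. The sketch below illustrates the idea with a pluggable `rerank_fn`, a hypothetical callable that returns candidate indices in ranked order; it is not the Cohere client's API, and the two demo rerankers are stand-ins.

```python
def rerank_with_fallback(query, candidates, rerank_fn, k=10):
    """Rerank candidates, keeping their original order if the reranker fails."""
    try:
        order = rerank_fn(query, [c["text"] for c in candidates], k)
        return [candidates[i] for i in order]
    except Exception:
        # Broad catch keeps the sketch short; real code should catch the
        # client's specific error types and log the failure.
        return candidates[:k]

candidates = [
    {"id": "a", "text": "short"},
    {"id": "b", "text": "a much longer passage"},
    {"id": "c", "text": "medium text"},
]

def longest_first(query, texts, k):
    """Stand-in reranker: prefers longer passages (illustrative only)."""
    return sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)[:k]

def broken_reranker(query, texts, k):
    raise RuntimeError("reranker unavailable")

reranked_ids = [c["id"] for c in rerank_with_fallback("q", candidates, longest_first, k=2)]
fallback_ids = [c["id"] for c in rerank_with_fallback("q", candidates, broken_reranker, k=2)]
print(reranked_ids, fallback_ids)
```

The fallback trades some accuracy for availability, which is usually the right call for user-facing search.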

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy Contextual Retrieval as a Lambda function for custom chunking:

# contextual-rag-lambda-function/lambda_function.py
def lambda_handler(event, context):
    """
    Custom chunking Lambda for Bedrock Knowledge Base.
    Expects event with 'document' and 'chunks' fields.
    Returns chunks with context prepended.
    """
    document = event['document']
    chunks = event['chunks']
    
    contextualized = []
    for chunk in chunks:
        context = generate_context_cached(document, chunk['text'])
        contextualized.append({
            **chunk,
            'text': f"{context}\n\n{chunk['text']}"
        })
    
    return {'chunks': contextualized}
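
Before deploying, the handler logic can be exercised locally with a stubbed context generator so that no API calls are made. The sketch below assumes the simplified event shape described in the docstring above (`document` and `chunks` fields); `contextualize_chunks`, `make_handler`, and `stub_context` are hypothetical names introduced for this test, and a real Bedrock integration may pass a different event structure.

```python
def contextualize_chunks(document, chunks, context_fn):
    """Core handler logic with the context generator injected so it can be stubbed."""
    return [
        {**chunk, "text": f"{context_fn(document, chunk['text'])}\n\n{chunk['text']}"}
        for chunk in chunks
    ]

def make_handler(context_fn):
    """Build a Lambda-style handler bound to a particular context generator."""
    def lambda_handler(event, context):
        return {
            "chunks": contextualize_chunks(event["document"], event["chunks"], context_fn)
        }
    return lambda_handler

def stub_context(document, chunk_text):
    """Stand-in for the real context generator; makes no network calls."""
    return f"[context for {len(chunk_text)}-char chunk]"

handler = make_handler(stub_context)
event = {
    "document": "full document text",
    "chunks": [{"id": "c1", "text": "def f(): pass"}],
}
result = handler(event, None)
print(result["chunks"][0]["text"])
```

Injecting the generator keeps the handler testable and makes it easy to swap in the cached version in production.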

Cost Management

Technique	Cost Impact	Pass@10 Gain
Contextual Embeddings	+$0.01-0.05 per chunk	+6-8%
Prompt Caching	60-80% lower context-generation cost	None (same accuracy)
Contextual BM25	No API cost (compute only)	+2-3%
Reranking	+$0.001 per query	+1-2%
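
The per-chunk and per-query figures in the table can be turned into a quick budget range. The sketch below is plain arithmetic on those figures; the function name and the chunk count of 200 are assumptions for illustration, while 248 is the number of evaluation queries used in this guide.

```python
def estimate_budget(n_chunks, n_queries,
                    per_chunk_usd=(0.01, 0.05), per_query_usd=0.001):
    """Return a (low, high) USD range for context generation plus reranking."""
    rerank_cost = n_queries * per_query_usd
    low = n_chunks * per_chunk_usd[0] + rerank_cost
    high = n_chunks * per_chunk_usd[1] + rerank_cost
    return low, high

# Hypothetical corpus of 200 chunks, evaluated with 248 queries.
low, high = estimate_budget(n_chunks=200, n_queries=248)
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Context generation dominates the total, which is why prompt caching matters more for cost than any other choice in the pipeline.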

Key Takeaways

Contextual Embeddings reduce the top-20-chunk retrieval failure rate by 35% by adding document-level context to each chunk before embedding, which addresses the "chunk isolation" problem in standard RAG.

Prompt caching makes Contextual Retrieval practical in production by caching the full document across chunk requests, reducing context-generation costs by 60-80% compared to a naive implementation.

Hybrid search with Contextual BM25 adds 2-3% improvement by combining semantic and keyword-based retrieval on the same contextualized chunks.

Reranking provides the final polish (1-2% improvement) and is worth implementing for production systems where every retrieval matters.

The technique works across platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex AI—making it accessible regardless of your cloud provider.
