BeClaude
Guide | Beginner | Pricing | 2026-05-16

Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Includes code examples, cost optimization with prompt caching, and production deployment tips for Claude AI.

Quick Answer

This guide shows you how to improve RAG retrieval accuracy by adding context to document chunks before embedding. Using Claude with Contextual Embeddings reduces top-20 retrieval failure by 35%, and combining with Contextual BM25 and reranking pushes Pass@10 from 87% to 95%.

Tags: RAG, Contextual Embeddings, Prompt Caching, Retrieval, BM25

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering customer support bots, internal knowledge base Q&A, financial analysis tools, and code generation workflows. But there's a persistent problem: chunks lack context.

When you split a document into chunks for vector search, each chunk becomes an island. A chunk containing "the function returns True" loses the crucial context of which function and why. This leads to missed retrievals, hallucinated answers, and frustrated users.

Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's testing across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.

In this guide, you'll learn how to implement Contextual Retrieval with Claude, optimize costs using prompt caching, and combine it with BM25 search and reranking for production-grade performance.

What You'll Build

By the end of this guide, you'll have a complete Contextual Retrieval pipeline that:

  • Adds context to each chunk using Claude
  • Embeds contextualized chunks for vector search
  • Combines with Contextual BM25 for hybrid retrieval
  • Reranks results for maximum accuracy
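We'll use a dataset of 9 codebases with 248 evaluation queries. The metric is Pass@10: the percentage of queries whose golden chunk appears in the top 10 retrieved results. The baseline Pass@10 is ~87%. With Contextual Retrieval, we'll push that to ~95%.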

Prerequisites

Skills:
  • Intermediate Python
  • Basic RAG understanding
  • Familiarity with embeddings and vector databases
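API Keys:
  • Anthropic API key (context generation with Claude)
  • Voyage AI API key (embeddings)
  • Cohere API key (reranking in Step 4)
Time & Cost: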
  • 30-45 minutes to complete
  • ~$5-10 in API costs for the full dataset

Step 1: Setting Up the Baseline RAG

First, let's establish a performance baseline with standard RAG.

import json
import voyageai
from anthropic import Anthropic

# Load your data
with open('data/codebase_chunks.json') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl') as f:
    eval_queries = [json.loads(line) for line in f]

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Embed all chunks (baseline - no context)
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = sum(ai * ai for ai in a) ** 0.5
    norm_b = sum(bi * bi for bi in b) ** 0.5
    return dot / (norm_a * norm_b)

# Simple retrieval function
def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    # Compute cosine similarity (simplified)
    scores = [cosine_similarity(query_emb, emb) for emb in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

# Evaluate Pass@10
correct = 0
for query in eval_queries:
    results = retrieve(query['query'], k=10)
    if query['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Baseline Pass@10: {correct/len(eval_queries)*100:.1f}%")
# Expected: ~87%
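
Embedding the whole corpus in a single `vo.embed` call may exceed the per-request limits of the embedding API. Below is a minimal sketch of batching. The limit of 128 texts per request is an assumption, so check the current Voyage AI documentation for the real value. `embed_in_batches` and its `embed_fn` callback are hypothetical names, and the stub embedder stands in for something like `lambda texts: vo.embed(texts, model="voyage-2").embeddings`.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches, preserving input order.

    embed_fn takes a list of strings and returns one vector per string.
    batch_size=128 is an assumed limit, not a documented guarantee.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


# Stub embedder for illustration: one 1-d "vector" per text, plus a call log.
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]

demo_vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2
)
```

Because the helper preserves order, the returned vectors line up index-for-index with the input list, which the retrieval function relies on.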

Step 2: Implementing Contextual Embeddings

The core idea is simple: for each chunk, ask Claude to generate a brief context that explains what this chunk is about, then prepend that context before embedding.
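
To make the idea concrete, the sketch below shows a chunk before and after contextualization. The chunk text is the example from the introduction. The context string, file name, and function name are invented for illustration and are not real output from Claude.

```python
def contextualize(context: str, chunk_text: str) -> str:
    """Prepend generated context to a chunk, separated by a blank line."""
    return f"{context.strip()}\n\n{chunk_text}"


# Hypothetical chunk and hand-written context (not real model output).
chunk_text = "the function returns True"
context = (
    "This chunk is from auth/validators.py and describes is_token_valid(), "
    "which checks whether a session token has expired."
)

contextualized = contextualize(context, chunk_text)
print(contextualized)
```

A query such as "token expiry check" now has terms to match against, which the bare chunk lacked.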

The Context Generation Prompt

CONTEXT_PROMPT = """You are helping to improve a retrieval system.
Given a document and a chunk from that document, write a brief context
(2-3 sentences) explaining what this chunk is about and how it relates
to the broader document. Focus on:
- What topic or concept this chunk covers
- How it connects to surrounding content
- Any important entities, functions, or terms mentioned

Document:
{full_document}

Chunk:
{chunk_text}

Context:"""

def generate_context(full_document, chunk_text):
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                full_document=full_document,
                chunk_text=chunk_text
            )
        }]
    )
    return response.content[0].text

Optimizing with Prompt Caching

Generating context for every chunk individually would be expensive, because each request resends the full document. Prompt caching makes this practical by caching the full document across multiple chunk requests.

def generate_context_cached(full_document, chunk_text):
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        system=[{
            "type": "text",
            "text": "You are helping to improve a retrieval system.",
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Document:\n{full_document}",
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": f"Chunk:\n{chunk_text}\n\nContext:"
                }
            ]
        }]
    )
    return response.content[0].text

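With prompt caching, the first request for a document pays to write that document into the cache. Subsequent chunk requests read it back at a discounted cached-input rate and pay the full rate only for the chunk-specific tokens. This typically reduces costs by 60-80% for large documents. Cache entries expire after a short period of inactivity (five minutes by default), so process all chunks of a document back to back.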
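
A rough back-of-the-envelope calculation shows where the savings come from. The sketch below counts input cost only, in units of uncached input tokens. It assumes cache writes are billed at 1.25x and cache reads at 0.1x the base input price, which were the published multipliers for the default cache at the time of writing, so verify them against current pricing. The token counts are made-up illustrative values.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # assumed: cache write vs. base input price
CACHE_READ_MULTIPLIER = 0.10   # assumed: cache read vs. base input price


def input_cost_units(doc_tokens, chunk_tokens, n_chunks, cached):
    """Input cost for contextualizing n_chunks of one document.

    The result is expressed in base-price input-token units.
    """
    if n_chunks == 0:
        return 0.0
    if not cached:
        # Every request resends the full document at the base rate
        return float(n_chunks * (doc_tokens + chunk_tokens))
    # First request writes the document to the cache
    first = CACHE_WRITE_MULTIPLIER * doc_tokens + chunk_tokens
    # Remaining requests read the document from the cache
    rest = (n_chunks - 1) * (CACHE_READ_MULTIPLIER * doc_tokens + chunk_tokens)
    return first + rest


# Illustrative workload: an 8,000-token document split into 40 chunks.
uncached = input_cost_units(8000, 200, 40, cached=False)
cached = input_cost_units(8000, 200, 40, cached=True)
savings = 1 - cached / uncached
print(f"input-cost savings: {savings:.0%}")
```

Output tokens are billed the same either way, so the saving on the total bill is smaller than this input-only figure.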

Building the Contextual Embedding Pipeline

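# Generate contextualized chunks.
# `documents` is assumed to be a list of dicts holding each full source under
# 'text' and its chunks under 'chunks'; build it from your own data.
contextualized_chunks = []
for doc_id, doc in enumerate(documents):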
    full_text = doc['text']
    for chunk in doc['chunks']:
        context = generate_context_cached(full_text, chunk['text'])
        contextualized_text = f"{context}\n\n{chunk['text']}"
        contextualized_chunks.append({
            'id': chunk['id'],
            'text': contextualized_text,
            'original_chunk': chunk
        })

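# Embed contextualized chunks
contextual_embeddings = vo.embed(
    [c['text'] for c in contextualized_chunks],
    model="voyage-2"
).embeddings

# Evaluate with the same retrieval function. retrieve() reads the module-level
# `embeddings` and `chunks`, so point both at the contextualized data first;
# otherwise this loop would re-measure the baseline.
embeddings = contextual_embeddings
chunks = contextualized_chunks

correct = 0
for query in eval_queries:
    results = retrieve(query['query'], k=10)
    if query['golden_chunk_id'] in [r['id'] for r in results]:
        correct += 1

print(f"Contextual Embeddings Pass@10: {correct/len(eval_queries)*100:.1f}%")
# Expected: ~93-95%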
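
To confirm that caching is taking effect during a run, the token usage on each response can be tallied. This sketch assumes the response's `usage` object exposes `cache_creation_input_tokens` and `cache_read_input_tokens` alongside `input_tokens`, as the Messages API reports when prompt caching is in use. `CacheStats` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real responses.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    written: int = 0   # tokens written to the cache
    read: int = 0      # tokens served from the cache
    uncached: int = 0  # regular (non-cached) input tokens

    def add(self, usage):
        # getattr with a default tolerates responses that omit the cache fields
        self.written += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.uncached += getattr(usage, "input_tokens", 0) or 0

    def hit_ratio(self):
        total = self.written + self.read + self.uncached
        return self.read / total if total else 0.0


# Stand-in usage records: the first request writes the cache, later ones read it.
stats = CacheStats()
stats.add(SimpleNamespace(input_tokens=200, cache_creation_input_tokens=8000,
                          cache_read_input_tokens=0))
stats.add(SimpleNamespace(input_tokens=200, cache_creation_input_tokens=0,
                          cache_read_input_tokens=8000))
stats.add(SimpleNamespace(input_tokens=200, cache_creation_input_tokens=0,
                          cache_read_input_tokens=8000))
print(f"cache hit ratio: {stats.hit_ratio():.2f}")
```

In the real pipeline, `generate_context_cached` would need to return or record `response.usage` so that `stats.add(response.usage)` can run after each request. A `read` count that stays at zero suggests the document is below the minimum cacheable length or that requests are arriving too far apart.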

Step 3: Adding Contextual BM25

Contextual BM25 applies the same idea to keyword search. Index the contextualized text (generated context plus original chunk) with BM25, so keyword matches can hit terms from either part.

from rank_bm25 import BM25Okapi

# Build BM25 index with contextualized text
bm25 = BM25Okapi([c['text'].split() for c in contextualized_chunks])

def normalize(scores):
    # Min-max scale to [0, 1]; return zeros when all scores are equal
    min_s, max_s = min(scores), max(scores)
    if max_s == min_s:
        return [0.0 for _ in scores]
    return [(s - min_s) / (max_s - min_s) for s in scores]

def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    vector_scores = [cosine_similarity(query_emb, emb) for emb in contextual_embeddings]
    # BM25 search
    bm25_scores = bm25.get_scores(query.split())
    # Normalize and combine (alpha weights vector search, 1 - alpha weights BM25)
    vector_scores = normalize(vector_scores)
    bm25_scores = normalize(bm25_scores)
    combined = [alpha * v + (1 - alpha) * b for v, b in zip(vector_scores, bm25_scores)]
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    # Indices refer to contextualized_chunks, so return items from that list
    return [contextualized_chunks[i] for i in top_indices]

Hybrid search typically adds another 2-3% improvement over vector search alone.
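
Min-max normalization is sensitive to outlier scores. A commonly used alternative for combining ranked lists is reciprocal rank fusion (RRF), which uses only the rank positions. This is a swapped-in technique, not the weighted sum used above. The constant `k=60` is a conventional default, and `rrf_fuse` is a hypothetical helper.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns the top_n ids by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id for deterministic output
    ordered = sorted(scores, key=lambda i: (-scores[i], i))
    return ordered[:top_n]


# Toy rankings: ids ordered best-first by each retriever.
vector_ranking = ["c3", "c1", "c8"]
bm25_ranking = ["c1", "c5", "c3"]
fused = rrf_fuse([vector_ranking, bm25_ranking], top_n=3)
```

Chunks that rank well in both lists rise to the top without any score scaling, at the cost of ignoring how large the score gaps were.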

Step 4: Reranking for Maximum Accuracy

For the final polish, add a reranking step using Cohere's reranker:

import cohere
co = cohere.Client("YOUR_COHERE_API_KEY")

def retrieve_with_rerank(query, k=10):
    # Get initial candidates (e.g., top 50)
    candidates = hybrid_search(query, k=50)
    # Rerank
    reranked = co.rerank(
        query=query,
        documents=[c['text'] for c in candidates],
        model="rerank-english-v2.0",
        top_n=k
    )
    return [candidates[r.index] for r in reranked.results]

Reranking typically adds another 1-2% improvement, pushing Pass@10 to ~95%.
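
To check how much each stage contributes on your own data, the same evaluation loop can be run against every retriever. This is a minimal sketch, assuming each retriever takes `(query, k)` and returns dicts with an `id` key, as in the steps above. `evaluate` and the stub retrievers are hypothetical.

```python
def evaluate(retrieve_fn, eval_queries, k=10):
    """Return Pass@k (in percent) for a retriever over an evaluation set."""
    if not eval_queries:
        return 0.0
    correct = 0
    for item in eval_queries:
        results = retrieve_fn(item["query"], k)
        if item["golden_chunk_id"] in {r["id"] for r in results}:
            correct += 1
    return 100.0 * correct / len(eval_queries)


# Stub queries and retrievers for illustration only.
toy_queries = [
    {"query": "q1", "golden_chunk_id": "a"},
    {"query": "q2", "golden_chunk_id": "b"},
]

def always_a(query, k):
    return [{"id": "a"}][:k]

def a_and_b(query, k):
    return [{"id": "a"}, {"id": "b"}][:k]

report = {
    name: evaluate(fn, toy_queries)
    for name, fn in [("stub_one", always_a), ("stub_two", a_and_b)]
}
print(report)
```

With the real functions, the calls become `evaluate(retrieve, eval_queries)`, `evaluate(hybrid_search, eval_queries)`, and `evaluate(retrieve_with_rerank, eval_queries)`.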

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy Contextual Retrieval as a Lambda function for custom chunking:

# contextual-rag-lambda-function/lambda_function.py
def lambda_handler(event, context):
    """
    Custom chunking Lambda for Bedrock Knowledge Base.
    Expects event with 'document' and 'chunks' fields.
    Returns chunks with context prepended.
    """
    document = event['document']
    chunks = event['chunks']
    
    contextualized = []
    for chunk in chunks:
        context = generate_context_cached(document, chunk['text'])
        contextualized.append({
            **chunk,
            'text': f"{context}\n\n{chunk['text']}"
        })
    
    return {'chunks': contextualized}

Cost Management

Technique              | Cost Impact           | Performance Gain
Contextual Embeddings  | +$0.01-0.05/chunk     | +6-8% Pass@10
Prompt Caching         | -60-80% context cost  | Same performance
Contextual BM25        | Free (compute only)   | +2-3% Pass@10
Reranking              | +$0.001/query         | +1-2% Pass@10
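
For budgeting, the figures in the table can be turned into a quick estimate. The sketch below is a rough planning aid, not a pricing calculator. It assumes the caching reduction applies to the whole per-chunk figure, and the workload sizes are hypothetical.

```python
# Figures taken from the table above; treat them as rough planning numbers.
CONTEXT_COST_PER_CHUNK = (0.01, 0.05)  # USD, low and high
CACHING_REDUCTION = (0.60, 0.80)       # fraction of context cost saved
RERANK_COST_PER_QUERY = 0.001          # USD


def estimate_costs(n_chunks, n_queries, use_caching=True):
    """Return ((low, high) one-time indexing cost, reranking cost) in USD."""
    low = n_chunks * CONTEXT_COST_PER_CHUNK[0]
    high = n_chunks * CONTEXT_COST_PER_CHUNK[1]
    if use_caching:
        # Best case pairs the cheapest chunks with the largest reduction;
        # worst case pairs the priciest chunks with the smallest reduction.
        low *= 1 - CACHING_REDUCTION[1]
        high *= 1 - CACHING_REDUCTION[0]
    rerank = n_queries * RERANK_COST_PER_QUERY
    return (low, high), rerank


# Hypothetical workload: 1,000 chunks indexed once, 10,000 queries served.
(index_low, index_high), rerank_cost = estimate_costs(1_000, 10_000)
print(f"indexing: ${index_low:.2f}-${index_high:.2f}, reranking: ${rerank_cost:.2f}")
```

Indexing is a one-time cost per corpus version, while reranking cost grows with query volume, so the two should be budgeted separately.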

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by adding document-level context to each chunk before embedding, solving the "chunk isolation" problem that plagues standard RAG.
  • Prompt caching makes Contextual Retrieval production-ready by caching the full document across chunk requests, reducing costs by 60-80% compared to naive implementation.
  • Hybrid search with Contextual BM25 adds 2-3% improvement by combining semantic and keyword-based retrieval on the same contextualized chunks.
  • Reranking provides the final polish (1-2% improvement) and is worth implementing for production systems where every retrieval matters.
  • The technique works across platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex AI—making it accessible regardless of your cloud provider.