Mastering Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude
Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Step-by-step guide with code examples, cost optimization tips, and production-ready strategies.
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35%. You'll learn Contextual Embeddings, Contextual BM25, and how to use prompt caching to keep costs practical.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to code analysis tools. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet that says def process() means nothing without knowing it's part of a payment processing module. A paragraph about "the merger" is useless if the chunk doesn't mention which companies are involved.
What You'll Build
By the end of this guide, you'll have:
- A basic RAG pipeline with performance baselines
- Contextual Embeddings implementation that boosts Pass@10 from ~87% to ~95%
- Contextual BM25 for hybrid search optimization
- A reranking layer for final precision
- Cost optimization strategies using prompt caching
Prerequisites
- Skills: Intermediate Python, basic RAG knowledge, familiarity with vector databases
- System: Python 3.8+, Docker (optional, for BM25), 4GB+ RAM, ~5-10GB disk space
- API keys:
  - Anthropic API key (free tier works)
  - Voyage AI API key for embeddings
  - Cohere API key for reranking
1. Setting Up Your Environment
First, install the required libraries:
```bash
pip install anthropic voyageai cohere rank_bm25 pandas numpy
```
Initialize your clients:
```python
import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")
co = cohere.Client(api_key="your-cohere-key")
```
For this guide, we'll use a dataset of 9 codebases with 248 queries, each containing a "golden chunk"—the correct document that should be retrieved. You can find the data at data/codebase_chunks.json and data/evaluation_set.jsonl.
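The exact file format isn't critical, but the code below assumes each chunk record carries its content, an id, and the full document it was cut from, and that each evaluation query names its golden chunk. Here is a sketch of the assumed shapes (field names inferred from how they are used later; adjust them to match your own data):

```python
# Assumed record shapes (illustrative values, not real data)
example_chunk = {
    "id": "chunk_0017",                        # unique chunk identifier
    "content": "def process(self, txn): ...",  # the chunk text itself
    "document": "<full source file text>",     # whole document the chunk came from
}
example_query = {
    "question": "Where are card payments processed?",
    "golden_chunk_id": "chunk_0017",           # the chunk that should be retrieved
}
```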
2. Building a Basic RAG Pipeline (Baseline)
Let's establish a performance baseline using standard chunking and embedding:
```python
import json
from typing import List, Dict

# Load your chunks
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Generate embeddings for each chunk
def embed_chunks(texts: List[str]) -> List[List[float]]:
    response = vo.embed(texts, model="voyage-2", input_type="document")
    return response.embeddings

chunk_embeddings = embed_chunks([c["content"] for c in chunks])

# Create a simple vector store (in-memory for demo)
vector_store = list(zip(chunks, chunk_embeddings))

# Search function
def search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    # Score each chunk by dot product (equivalent to cosine similarity for normalized embeddings)
    scores = []
    for chunk, emb in vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        scores.append((similarity, chunk))
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:k]]
```
Evaluate baseline performance using Pass@k (whether the golden chunk appears in the top-k results):
```python
def evaluate_pass_at_k(queries: List[Dict], k: int = 10, search_fn=search) -> float:
    correct = 0
    for query in queries:
        results = search_fn(query["question"], k=k)
        if query["golden_chunk_id"] in [r["id"] for r in results]:
            correct += 1
    return correct / len(queries)

# Load evaluation set
with open("data/evaluation_set.jsonl", "r") as f:
    eval_queries = [json.loads(line) for line in f]

baseline_pass_10 = evaluate_pass_at_k(eval_queries, k=10)
print(f"Baseline Pass@10: {baseline_pass_10:.2%}")
# Expected: ~87%
```
3. Implementing Contextual Embeddings
The core insight is simple: before embedding each chunk, prepend a short context snippet that explains what the chunk is about. You generate this context using Claude.
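Before writing any code, here is a toy before/after illustration; the prepended context line is made up for illustration, not real model output:

```
Original chunk:
    def process(self, transaction):
        ...

Contextualized chunk (what actually gets embedded):
    This chunk is the transaction handler of the PaymentProcessor class in
    the payments module; it validates and settles card payments.

    def process(self, transaction):
        ...
```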
Step 1: Generate Context for Each Chunk
```python
def generate_chunk_context(chunk: Dict, full_document: str) -> str:
    """Use Claude to generate context for a chunk."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk['content']}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Generate context for each chunk (this is the expensive part)
for chunk in chunks:
    chunk["context"] = generate_chunk_context(chunk, chunk["document"])
```
Step 2: Embed with Context
```python
# Create contextual chunks
contextual_chunks = [
    f"{chunk['context']}\n\n{chunk['content']}"
    for chunk in chunks
]

# Embed the contextualized versions
contextual_embeddings = embed_chunks(contextual_chunks)

# Rebuild vector store
contextual_vector_store = list(zip(chunks, contextual_embeddings))
```
Step 3: Evaluate Improvement
```python
def contextual_search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    scores = []
    for chunk, emb in contextual_vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        scores.append((similarity, chunk))
    scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in scores[:k]]

contextual_pass_10 = evaluate_pass_at_k(eval_queries, k=10, search_fn=contextual_search)
print(f"Contextual Pass@10: {contextual_pass_10:.2%}")
# Expected: ~95% (up from ~87%)
```
4. Cost Optimization with Prompt Caching
Generating context for every chunk can be expensive. Prompt caching cuts the cost by roughly 85%: the full document is written to the cache once, each subsequent per-chunk request reads it back at a steep discount, and you only pay full input price for the small part of the prompt that changes (the chunk itself):
```python
def generate_chunk_context_cached(chunk: Dict, full_document: str) -> str:
    """Use prompt caching to reduce costs."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": f"<document>{full_document}</document>",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{
            "role": "user",
            "content": f"<chunk>{chunk['content']}</chunk>\n\nGive succinct context for this chunk."
        }]
    )
    return response.content[0].text
```
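To sanity-check that the cache is actually being hit, inspect the usage metadata on a response for a single chunk. The field names below follow the current Anthropic SDK but treat them as an assumption and confirm against your installed version; note also that very short documents can fall under the minimum cacheable prompt length and never be cached:

```python
# Probe one chunk and inspect cache token counts on the response
sample = chunks[0]
probe = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>{sample['document']}</document>",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": f"<chunk>{sample['content']}</chunk>\n\nGive succinct context for this chunk.",
    }],
)
print("cache write tokens:", getattr(probe.usage, "cache_creation_input_tokens", None))
print("cache read tokens: ", getattr(probe.usage, "cache_read_input_tokens", None))
# The first request for a document should report cache writes; later requests for
# the same document (within the cache lifetime) should report cache reads.
```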
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
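To make the ~85% figure concrete, here is a rough back-of-the-envelope calculation. The per-token prices are illustrative assumptions (roughly Claude 3 Haiku list pricing at the time of writing); substitute current pricing and your own corpus statistics:

```python
# Illustrative cost estimate for contextualizing one large document.
# All prices are assumptions -- check current pricing before relying on them.
PRICE_INPUT = 0.25 / 1e6        # $ per regular input token (assumed)
PRICE_CACHE_WRITE = 0.30 / 1e6  # $ per token written to the cache (assumed)
PRICE_CACHE_READ = 0.03 / 1e6   # $ per cached token read back (assumed)
PRICE_OUTPUT = 1.25 / 1e6       # $ per output token (assumed)

doc_tokens = 60_000    # full document length
n_chunks = 75          # chunks cut from this document
chunk_tokens = 800     # tokens per chunk
context_tokens = 60    # generated context per chunk

# Without caching, the whole document is re-billed at full price for every chunk
no_cache = n_chunks * ((doc_tokens + chunk_tokens) * PRICE_INPUT
                       + context_tokens * PRICE_OUTPUT)

# With caching, the document is written once, then read back at a discount
with_cache = (doc_tokens * PRICE_CACHE_WRITE
              + (n_chunks - 1) * doc_tokens * PRICE_CACHE_READ
              + n_chunks * (chunk_tokens * PRICE_INPUT + context_tokens * PRICE_OUTPUT))

print(f"without caching: ${no_cache:.2f}   with caching: ${with_cache:.2f}")
print(f"savings: {1 - with_cache / no_cache:.0%}")  # roughly 85% with these numbers
```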
5. Contextual BM25: Hybrid Search
The same chunk context can improve BM25 (keyword-based) search. Combine it with embeddings for a hybrid approach:
```python
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_corpus = [contextual_chunk.split() for contextual_chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    # Get embedding scores
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    emb_scores = []
    for chunk, emb in contextual_vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        emb_scores.append(similarity)

    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Normalize and combine
    combined_scores = []
    for i in range(len(chunks)):
        normalized_emb = emb_scores[i] / max(emb_scores)
        normalized_bm25 = bm25_scores[i] / max(bm25_scores)
        combined = alpha * normalized_emb + (1 - alpha) * normalized_bm25
        combined_scores.append((combined, chunks[i]))

    combined_scores.sort(key=lambda x: x[0], reverse=True)
    return [chunk for _, chunk in combined_scores[:k]]
```
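You can measure the hybrid gain with the same evaluation helper, passing hybrid_search through the search_fn parameter added earlier:

```python
hybrid_pass_10 = evaluate_pass_at_k(eval_queries, k=10, search_fn=hybrid_search)
print(f"Hybrid Pass@10: {hybrid_pass_10:.2%}")
# Expected: ~96%
```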
6. Adding a Reranking Layer
For final precision, add a Cohere reranker:
```python
def rerank(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    # Prepare documents for reranking
    docs = [f"{c['context']}\n\n{c['content']}" for c in candidates]
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
def advanced_search(query: str) -> List[Dict]:
    # Step 1: Hybrid search for initial candidates
    candidates = hybrid_search(query, k=20)
    # Step 2: Rerank for precision
    final_results = rerank(query, candidates, top_k=5)
    return final_results
```
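Retrieval is only half of RAG. As a final step, here is a minimal sketch (the prompt wording is illustrative, not prescribed by the original guide) of handing the retrieved chunks to Claude to answer the user's question:

```python
def answer_question(query: str) -> str:
    # Retrieve the most relevant chunks, then ask Claude to answer from them only
    retrieved = advanced_search(query)
    context_block = "\n\n---\n\n".join(c["content"] for c in retrieved)
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Answer the question using only the context below.\n\n"
                f"<context>\n{context_block}\n</context>\n\n"
                f"Question: {query}"
            ),
        }],
    )
    return response.content[0].text
```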
Production Considerations
For AWS Bedrock Users
Anthropic and AWS have provided a Lambda function for Contextual Retrieval that integrates directly with Bedrock Knowledge Bases. You can find the code in the contextual-rag-lambda-function directory of the cookbook repository. Deploy this Lambda and select it as a custom chunking option when configuring your knowledge base.
Performance Summary
| Technique | Pass@10 | Improvement (percentage points) |
|---|---|---|
| Basic RAG | ~87% | Baseline |
| Contextual Embeddings | ~95% | +8 pts |
| + Contextual BM25 | ~96% | +9 pts |
| + Reranking | ~97% | +10 pts |
Key Takeaways
- Contextual Embeddings reduce retrieval failure rates by 35% by adding document-level context to each chunk before embedding, solving the "lost context" problem in traditional RAG.
- Prompt caching makes this practical by reducing the cost of generating context for thousands of chunks by approximately 85%.
- Contextual BM25 provides complementary improvements—combining it with contextual embeddings in a hybrid search yields the best results.
- A reranking layer adds final precision but comes with additional latency and cost; use it only when you need top-5 accuracy.
- AWS Bedrock users can deploy this as a Lambda function for seamless integration with existing knowledge bases, making production deployment straightforward.