
Mastering Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to dramatically improve RAG accuracy. Step-by-step guide with code examples, cost optimization tips, and production-ready strategies.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35%. You'll learn Contextual Embeddings, Contextual BM25, and how to use prompt caching to keep costs practical.

Tags: RAG, Contextual Embeddings, Claude, Retrieval Augmented Generation, Prompt Caching


Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to code analysis tools. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet that says def process() means nothing without knowing it's part of a payment processing module. A paragraph about "the merger" is useless if the chunk doesn't mention which companies are involved.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The results are dramatic: Anthropic's testing across multiple datasets shows a 35% reduction in top-20-chunk retrieval failure rates. This guide walks you through implementing Contextual Embeddings and Contextual BM25 in a production-ready RAG pipeline.
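
To make the idea concrete, here is roughly what a contextualized chunk looks like. The snippet and the generated context below are illustrative only, not taken from the evaluation dataset used later:

# Illustrative only: both the chunk text and the generated context are hypothetical.
original_chunk = "def process(amount, currency):\n    ..."

context = (
    "This chunk is from payments/processor.py in the billing service; "
    "process() charges a customer's card through the payment gateway."
)

# The contextualized chunk is what actually gets embedded and indexed.
contextualized_chunk = f"{context}\n\n{original_chunk}"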

What You'll Build

By the end of this guide, you'll have:

  • A basic RAG pipeline with performance baselines
  • Contextual Embeddings implementation that boosts Pass@10 from ~87% to ~95%
  • Contextual BM25 for hybrid search optimization
  • A reranking layer for final precision
  • Cost optimization strategies using prompt caching

Prerequisites

  • Skills: Intermediate Python, basic RAG knowledge, familiarity with vector databases
  • System: Python 3.8+, Docker (optional for BM25), 4GB+ RAM, ~5-10GB disk space
  • API Keys: Anthropic, Voyage AI, and Cohere
  • Time & Cost: 30-45 minutes, ~$5-10 in API costs

1. Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere rank_bm25 pandas numpy

Initialize your clients:

import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")
co = cohere.Client(api_key="your-cohere-key")

For this guide, we'll use a dataset of 9 codebases with 248 queries, each containing a "golden chunk"—the correct document that should be retrieved. You can find the data at data/codebase_chunks.json and data/evaluation_set.jsonl.
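
The code in this guide assumes records shaped roughly like the sketch below; the field names (id, content, document, question, golden_chunk_id) are inferred from how they are used in later snippets, so adjust them if your files differ:

# Hypothetical example records; field names inferred from the code in this guide.
example_chunk = {
    "id": "chunk_0017",                               # unique chunk identifier
    "document": "<full text of the source file>",     # whole document the chunk came from
    "content": "def process(amount, currency): ...",  # the chunk itself
}

example_query = {
    "question": "Which function charges the customer's card?",
    "golden_chunk_id": "chunk_0017",                  # the chunk a correct retrieval must return
}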

2. Building a Basic RAG Pipeline (Baseline)

Let's establish a performance baseline using standard chunking and embedding:

import json
from typing import List, Dict

# Load your chunks
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Generate embeddings for each chunk
def embed_chunks(chunks: List[str]) -> List[List[float]]:
    response = vo.embed(chunks, model="voyage-2", input_type="document")
    return response.embeddings

chunk_embeddings = embed_chunks([c["content"] for c in chunks])

# Create a simple vector store (in-memory for demo)
vector_store = list(zip(chunks, chunk_embeddings))

# Search function
def search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    # Cosine similarity search (dot product; embeddings assumed normalized)
    scores = []
    for chunk, emb in vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        scores.append((similarity, chunk))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in scores[:k]]

Evaluate baseline performance using Pass@k (whether the golden chunk appears in the top-k results):

def evaluate_pass_at_k(queries: List[Dict], k: int = 10, search_fn=search) -> float:
    """Fraction of queries whose golden chunk appears in the top-k results."""
    correct = 0
    for query in queries:
        results = search_fn(query["question"], k=k)
        if query["golden_chunk_id"] in [r["id"] for r in results]:
            correct += 1
    return correct / len(queries)

# Load evaluation set
with open("data/evaluation_set.jsonl", "r") as f:
    eval_queries = [json.loads(line) for line in f]

baseline_pass_10 = evaluate_pass_at_k(eval_queries, k=10)
print(f"Baseline Pass@10: {baseline_pass_10:.2%}")

Expected: ~87%

3. Implementing Contextual Embeddings

The core insight is simple: before embedding each chunk, prepend a short context snippet that explains what the chunk is about. You generate this context using Claude.

Step 1: Generate Context for Each Chunk

def generate_chunk_context(chunk: Dict, full_document: str) -> str:
    """Use Claude to generate context for a chunk."""
    prompt = f"""<document>
{full_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk['content']}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Generate context for each chunk (this is the expensive part)
for chunk in chunks:
    chunk["context"] = generate_chunk_context(chunk, chunk["document"])

Step 2: Embed with Context

# Create contextual chunks
contextual_chunks = [
    f"{chunk['context']}\n\n{chunk['content']}" 
    for chunk in chunks
]

# Embed the contextualized versions
contextual_embeddings = embed_chunks(contextual_chunks)

# Rebuild vector store
contextual_vector_store = list(zip(chunks, contextual_embeddings))

Step 3: Evaluate Improvement

def contextual_search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    
    scores = []
    for chunk, emb in contextual_vector_store:
        similarity = sum(a*b for a,b in zip(query_embedding, emb))
        scores.append((similarity, chunk))
    
    scores.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in scores[:k]]

contextual_pass_10 = evaluate_pass_at_k(eval_queries, k=10, search_fn=contextual_search)
print(f"Contextual Pass@10: {contextual_pass_10:.2%}")

Expected: ~95% (up from ~87%)

4. Cost Optimization with Prompt Caching

Generating context for every chunk can be expensive. Prompt caching reduces costs by ~85% by caching the full document and only sending the changing chunk:

def generate_chunk_context_cached(chunk: Dict, full_document: str) -> str:
    """Use prompt caching to reduce costs."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": f"<document>{full_document}</document>",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{
            "role": "user", 
            "content": f"<chunk>{chunk['content']}</chunk>\n\nGive succinct context for this chunk."
        }]
    )
    return response.content[0].text
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
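
If you want to confirm the cache is actually being hit, the Messages API reports cache activity in each response's usage block. Here is a minimal sketch built on the same call as above; the usage fields may differ by SDK version, so treat this as an assumption to verify:

# Sketch: check cache behavior on a contextualization call.
# Expect cache_creation_input_tokens > 0 on the first call for a document (cache write)
# and cache_read_input_tokens > 0 on later calls against the same document (cache hit).
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>{chunks[0]['document']}</document>",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{
        "role": "user",
        "content": f"<chunk>{chunks[0]['content']}</chunk>\n\nGive succinct context for this chunk."
    }],
)
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)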

5. Contextual BM25: Hybrid Search

The same chunk context can improve BM25 (keyword-based) search. Combine it with embeddings for a hybrid approach:

from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_corpus = [contextual_chunk.split() for contextual_chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    # Get embedding scores
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    emb_scores = []
    for chunk, emb in contextual_vector_store:
        similarity = sum(a * b for a, b in zip(query_embedding, emb))
        emb_scores.append(similarity)

    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Normalize and combine
    combined_scores = []
    for i in range(len(chunks)):
        normalized_emb = emb_scores[i] / max(emb_scores)
        normalized_bm25 = bm25_scores[i] / max(bm25_scores)
        combined = alpha * normalized_emb + (1 - alpha) * normalized_bm25
        combined_scores.append((combined, chunks[i]))

    combined_scores.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in combined_scores[:k]]

6. Adding a Reranking Layer

For final precision, add a Cohere reranker:

def rerank(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    # Prepare documents for reranking
    docs = [f"{c['context']}\n\n{c['content']}" for c in candidates]
    
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    
    return [candidates[r.index] for r in results.results]

# Full pipeline
def advanced_search(query: str) -> List[Dict]:
    # Step 1: Hybrid search for initial candidates
    candidates = hybrid_search(query, k=20)
    # Step 2: Rerank for precision
    final_results = rerank(query, candidates, top_k=5)
    return final_results
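
A quick end-to-end check of the assembled pipeline (the query string below is just an example; the id and content fields match the chunk records assumed earlier):

# Example usage of the full pipeline
results = advanced_search("How does the payment retry logic handle declined cards?")
for r in results:
    print(r["id"], "-", r["content"][:80])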

Production Considerations

For AWS Bedrock Users

Anthropic and AWS have provided a Lambda function for Contextual Retrieval that integrates directly with Bedrock Knowledge Bases. You can find the code in the contextual-rag-lambda-function directory of the cookbook repository. Deploy this Lambda and select it as a custom chunking option when configuring your knowledge base.

Performance Summary

Technique               | Pass@10 | Improvement
Basic RAG               | ~87%    | Baseline
Contextual Embeddings   | ~95%    | +8%
+ Contextual BM25       | ~96%    | +9%
+ Reranking             | ~97%    | +10%

Key Takeaways

  • Contextual Embeddings reduce retrieval failure rates by 35% by adding document-level context to each chunk before embedding, solving the "lost context" problem in traditional RAG.
  • Prompt caching makes this practical by reducing the cost of generating context for thousands of chunks by approximately 85%.
  • Contextual BM25 provides complementary improvements—combining it with contextual embeddings in a hybrid search yields the best results.
  • A reranking layer adds final precision but comes with additional latency and cost; use it only when you need top-5 accuracy.
  • AWS Bedrock users can deploy this as a Lambda function for seamless integration with existing knowledge bases, making production deployment straightforward.