
Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude and Contextual Embeddings

Learn how to implement Contextual Embeddings and Contextual BM25 to improve RAG retrieval accuracy. Includes code examples, prompt caching tips, and performance benchmarks.

Quick Answer

This guide shows you how to reduce RAG retrieval failure rates by 35% using Contextual Embeddings. You'll learn to add chunk-specific context before embedding, use prompt caching to control costs, and combine with Contextual BM25 and reranking for maximum accuracy.

Tags: RAG · Contextual Embeddings · Claude · Prompt Caching · Retrieval

Introduction

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A. But traditional RAG has a dirty secret: when you split documents into chunks, those chunks often lose their surrounding context. A chunk containing "the revenue grew by 20%" is useless if you don't know which company or quarter it refers to.

Contextual Retrieval solves this by prepending a short, chunk-specific context to each piece of text before embedding. The result? A 35% reduction in top-20 retrieval failure rates across multiple datasets, as demonstrated in Anthropic's research.

In this guide, you'll build a complete Contextual Retrieval system using a codebase dataset. You'll learn:

  • How to set up a baseline RAG pipeline
  • How to implement Contextual Embeddings with Claude
  • How to use prompt caching to keep costs practical
  • How to combine Contextual Embeddings with Contextual BM25 and reranking
Let's dive in.

Prerequisites

Before starting, make sure you have:

  • Python 3.8+ installed
  • Docker (optional, for BM25 search)
  • Anthropic API key (get one free)
  • Voyage AI API key (sign up)
  • Cohere API key (for reranking, optional)
  • ~4GB RAM and 5-10GB disk space
Time: 30–45 minutes
Cost: ~$5–10 in API calls for the full dataset

1. Baseline RAG Pipeline

First, let's establish a baseline. We'll use a pre-chunked dataset of 9 codebases and 248 evaluation queries, each paired with a "golden chunk" that should be retrieved.

Load the Data

import json

with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

print(f"Loaded {len(chunks)} chunks and {len(eval_data)} queries")

Generate Embeddings

We'll use Voyage AI's embedding model. Install the client first:

pip install voyageai

import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings
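If your chunk list is large, note that embedding APIs cap how many texts you can send in a single request (Voyage's limit is on the order of a hundred texts per call; check the current docs). A minimal batching sketch under that assumption:

# Embed in batches to stay under the API's per-request limit.
# batch_size=128 is an assumption; adjust to the provider's documented limit.
def embed_in_batches(texts, batch_size=128):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        all_embeddings.extend(vo.embed(texts[i:i + batch_size], model="voyage-2").embeddings)
    return all_embeddings

embeddings = embed_in_batches(chunk_texts)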

Evaluate with Pass@k

We'll use Pass@k—whether the golden chunk appears in the top-k results.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_pass_at_k(embeddings, chunks, eval_data, k=10):
    correct = 0
    for item in eval_data:
        query_emb = vo.embed([item['query']], model="voyage-2").embeddings[0]
        scores = cosine_similarity([query_emb], embeddings)[0]
        top_k_indices = np.argsort(scores)[-k:][::-1]
        top_k_chunks = [chunks[i]['id'] for i in top_k_indices]
        if item['golden_chunk_id'] in top_k_chunks:
            correct += 1
    return correct / len(eval_data)

baseline_pass10 = evaluate_pass_at_k(embeddings, chunks, eval_data, k=10)
print(f"Baseline Pass@10: {baseline_pass10:.2%}")

Expected: ~87%

2. Contextual Embeddings

Contextual Embeddings prepend a short context to each chunk before embedding. This context is generated by Claude and includes information like the surrounding document, section headers, or relevant metadata.

Generate Context with Claude

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

def generate_context(chunk_text, surrounding_text):
    prompt = f"""<document>
{surrounding_text}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short, succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

Example

chunk = chunks[0]
context = generate_context(chunk['text'], chunk['surrounding_text'])
print(f"Context: {context}")
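The call above contextualizes a single chunk. Before embedding, every chunk needs its own context field populated; a minimal loop for that (assuming, as in the example, that each chunk dict carries 'text' and 'surrounding_text' keys) looks like this:

# Populate a 'context' field on every chunk so it can be prepended before embedding.
# Assumes each chunk dict has 'text' and 'surrounding_text' keys.
for chunk in chunks:
    chunk['context'] = generate_context(chunk['text'], chunk['surrounding_text'])

The prompt caching section below shows how to keep this loop affordable.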

Embed Contextualized Chunks

contextualized_texts = [
    f"{chunk['context']}\n\n{chunk['text']}" 
    for chunk in chunks
]

contextual_embeddings = vo.embed(contextualized_texts, model="voyage-2").embeddings

Evaluate Again

contextual_pass10 = evaluate_pass_at_k(contextual_embeddings, chunks, eval_data, k=10)
print(f"Contextual Embeddings Pass@10: {contextual_pass10:.2%}")

Expected: ~95%

That's an 8 percentage point improvement in Pass@10, which on this dataset cuts the retrieval failure rate from roughly 13% to about 5%.

Managing Costs with Prompt Caching

Generating context for every chunk can be expensive. Prompt caching slashes costs by reusing the surrounding document across multiple chunks.

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": surrounding_text,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": f"<chunk>{chunk_text}</chunk>\n\nProvide context..."}]
)

With caching, you pay the full prompt cost only once per document, then a fraction for each subsequent chunk.
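One way to use this pattern is to wrap it as a drop-in replacement for generate_context. This is a sketch of that idea, not the only way to structure it:

def generate_context_cached(chunk_text, surrounding_text):
    # The full document goes in a system block with cache_control, so it is
    # written to the cache once and read cheaply on subsequent calls that
    # reuse the same document.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": surrounding_text,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": f"<chunk>{chunk_text}</chunk>\n\nGive a short, succinct context situating this chunk within the document above, to improve search retrieval. Answer only with the context string."
        }]
    )
    return response.content[0].text.strip()

The context-population loop shown earlier can call generate_context_cached instead; keeping chunks from the same document adjacent in the loop means consecutive requests share an identical cached system block.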

3. Contextual BM25

BM25 is a keyword-based retrieval method. By applying the same chunk-specific context to BM25, you can further improve hybrid search.

Set Up BM25

docker run -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.0

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index contextualized chunks
for i, chunk in enumerate(chunks):
    doc = {
        "text": f"{chunk['context']}\n\n{chunk['text']}",
        "id": chunk['id']
    }
    es.index(index="contextual_chunks", id=i, document=doc)

Hybrid Search

Combine BM25 scores with embedding similarity scores:

def hybrid_search(query, alpha=0.5):
    # BM25 score
    bm25_results = es.search(
        index="contextual_chunks",
        query={"match": {"text": query}},
        size=50
    )
    
    # Embedding score
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = cosine_similarity([query_emb], contextual_embeddings)[0]
    
    # Combine
    combined = []
    for hit in bm25_results['hits']['hits']:
        idx = int(hit['_id'])
        bm25_score = hit['_score']
        emb_score = emb_scores[idx]
        combined.append((idx, alpha * bm25_score + (1 - alpha) * emb_score))
    
    combined.sort(key=lambda x: x[1], reverse=True)
    return [chunks[i[0]] for i in combined[:10]]
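One caveat with this weighted sum: raw Elasticsearch BM25 scores and cosine similarities sit on different scales, so an unnormalized mix tends to be dominated by BM25. A simple option, sketched below, is to min-max normalize each score list before combining (rank-based fusion such as Reciprocal Rank Fusion is another common choice):

# Scale a list of scores to [0, 1] so BM25 and embedding scores are comparable.
def min_max_normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]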

4. Reranking for Final Polish

Even with contextual retrieval, a reranker can push accuracy further. Use Cohere's rerank API:

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_k=10):
    results = co.rerank(
        query=query,
        documents=[c['text'] for c in candidates],
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    return [candidates[r.index] for r in results.results]

Full Pipeline

def contextual_rag_pipeline(query):
    # Step 1: Hybrid search
    candidates = hybrid_search(query, alpha=0.5)
    
    # Step 2: Rerank
    top_chunks = rerank(query, candidates, top_k=5)
    
    # Step 3: Generate answer with Claude
    context = "\n\n".join([c['text'] for c in top_chunks])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
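Using it is then a single call (the question below is only a placeholder):

answer = contextual_rag_pipeline("Where are API request retries configured?")
print(answer)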

Performance Summary

Method                     Pass@10    Improvement
Baseline                   ~87%       –
Contextual Embeddings      ~95%       +8 pts
+ Contextual BM25          ~97%       +10 pts
+ Reranking                ~98%       +11 pts

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the "lost context" problem in traditional RAG.
  • Prompt caching makes this practical by reusing surrounding document context across chunks, dramatically lowering API costs.
  • Combine with Contextual BM25 for hybrid search that leverages both semantic and keyword signals, further boosting accuracy.
  • Reranking adds the final polish—a lightweight reranker can push Pass@10 from 95% to 98%.
  • Works on any platform—while demonstrated with Anthropic's API, the same technique can be adapted for AWS Bedrock (with the provided Lambda function) and GCP Vertex AI.

Next Steps

Happy building!