
Contextual Retrieval: How to Slash RAG Failure Rates by 35% with Claude and Contextual Embeddings

Learn how to implement Contextual Embeddings and Contextual BM25 to improve RAG retrieval accuracy. Includes code examples, prompt caching tips, and performance benchmarks.

Quick Answer

This guide shows you how to reduce RAG retrieval failure rates by 35% using Contextual Embeddings. You'll learn to add chunk-specific context before embedding, use prompt caching to control costs, and combine with Contextual BM25 and reranking for maximum accuracy.

Tags: RAG · Contextual Embeddings · Claude · Prompt Caching · Retrieval

Introduction

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A. But traditional RAG has a dirty secret: when you split documents into chunks, those chunks often lose their surrounding context. A chunk containing "the revenue grew by 20%" is useless if you don't know which company or quarter it refers to.

Contextual Retrieval solves this by prepending a short, chunk-specific context to each piece of text before embedding. The result? A 35% reduction in top-20 retrieval failure rates across multiple datasets, as demonstrated in Anthropic's research.

In this guide, you'll build a complete Contextual Retrieval system using a codebase dataset. You'll learn:

  • How to set up a baseline RAG pipeline
  • How to implement Contextual Embeddings with Claude
  • How to use prompt caching to keep costs practical
  • How to combine Contextual Embeddings with Contextual BM25 and reranking
Let's dive in.

Prerequisites

Before starting, make sure you have:

  • Python 3.8+ installed
  • Docker (optional, for BM25 search)
  • Anthropic API key (get one free)
  • Voyage AI API key (sign up)
  • Cohere API key (for reranking, optional)
  • ~4GB RAM and 5-10GB disk space
Time: 30–45 minutes
Cost: ~$5–10 in API calls for the full dataset

1. Baseline RAG Pipeline

First, let's establish a baseline. We'll use a pre-chunked dataset of 9 codebases and 248 evaluation queries, each paired with a "golden chunk" that should be retrieved.

Load the Data

import json

with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

print(f"Loaded {len(chunks)} chunks and {len(eval_data)} queries")

Generate Embeddings

We'll use Voyage AI's embedding model. Install the client first:

pip install voyageai

import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings
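If your chunk list is large, note that embedding APIs cap how many texts you can send in a single request (Voyage's limit is on the order of a hundred texts per call; check the current docs). A minimal batching sketch under that assumption:

# Embed in batches to stay under the API's per-request limit.
# batch_size=128 is an assumption; adjust to the provider's documented limit.
def embed_in_batches(texts, batch_size=128):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        all_embeddings.extend(vo.embed(texts[i:i + batch_size], model="voyage-2").embeddings)
    return all_embeddings

embeddings = embed_in_batches(chunk_texts)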

Evaluate with Pass@k

We'll use Pass@k—whether the golden chunk appears in the top-k results.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_pass_at_k(embeddings, chunks, eval_data, k=10):
    correct = 0
    for item in eval_data:
        query_emb = vo.embed([item['query']], model="voyage-2").embeddings[0]
        scores = cosine_similarity([query_emb], embeddings)[0]
        top_k_indices = np.argsort(scores)[-k:][::-1]
        top_k_chunks = [chunks[i]['id'] for i in top_k_indices]
        if item['golden_chunk_id'] in top_k_chunks:
            correct += 1
    return correct / len(eval_data)

baseline_pass10 = evaluate_pass_at_k(embeddings, chunks, eval_data, k=10)
print(f"Baseline Pass@10: {baseline_pass10:.2%}")

Expected: ~87%

2. Contextual Embeddings

Contextual Embeddings prepend a short context to each chunk before embedding. This context is generated by Claude and includes information like the surrounding document, section headers, or relevant metadata.

Generate Context with Claude

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

def generate_context(chunk_text, surrounding_text):
    prompt = f"""<document>
{surrounding_text}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short, succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()

Example

chunk = chunks[0]
context = generate_context(chunk['text'], chunk['surrounding_text'])
print(f"Context: {context}")
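The call above contextualizes a single chunk. Before embedding, every chunk needs its own context field populated; a minimal loop for that (assuming, as in the example, that each chunk dict carries 'text' and 'surrounding_text' keys) looks like this:

# Populate a 'context' field on every chunk so it can be prepended before embedding.
# Assumes each chunk dict has 'text' and 'surrounding_text' keys.
for chunk in chunks:
    chunk['context'] = generate_context(chunk['text'], chunk['surrounding_text'])

The prompt caching section below shows how to keep this loop affordable.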

Embed Contextualized Chunks

contextualized_texts = [
    f"{chunk['context']}\n\n{chunk['text']}" 
    for chunk in chunks
]

contextual_embeddings = vo.embed(contextualized_texts, model="voyage-2").embeddings

Evaluate Again

contextual_pass10 = evaluate_pass_at_k(contextual_embeddings, chunks, eval_data, k=10)
print(f"Contextual Embeddings Pass@10: {contextual_pass10:.2%}")

Expected: ~95%

That's an 8 percentage point improvement in Pass@10, which on this dataset cuts the retrieval failure rate from roughly 13% to about 5%.

Managing Costs with Prompt Caching

Generating context for every chunk can be expensive. Prompt caching slashes costs by reusing the surrounding document across multiple chunks.

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": surrounding_text,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": f"<chunk>{chunk_text}</chunk>\n\nProvide context..."}]
)

With caching, you pay the full prompt cost only once per document, then a fraction for each subsequent chunk.
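One way to use this pattern is to wrap it as a drop-in replacement for generate_context. This is a sketch of that idea, not the only way to structure it:

def generate_context_cached(chunk_text, surrounding_text):
    # The full document goes in a system block with cache_control, so it is
    # written to the cache once and read cheaply on subsequent calls that
    # reuse the same document.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": surrounding_text,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{
            "role": "user",
            "content": f"<chunk>{chunk_text}</chunk>\n\nGive a short, succinct context situating this chunk within the document above, to improve search retrieval. Answer only with the context string."
        }]
    )
    return response.content[0].text.strip()

The context-population loop shown earlier can call generate_context_cached instead; keeping chunks from the same document adjacent in the loop means consecutive requests share an identical cached system block.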

3. Contextual BM25

BM25 is a keyword-based retrieval method. By applying the same chunk-specific context to BM25, you can further improve hybrid search.

Set Up BM25

docker run -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.0

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index contextualized chunks
for i, chunk in enumerate(chunks):
    doc = {
        "text": f"{chunk['context']}\n\n{chunk['text']}",
        "id": chunk['id']
    }
    es.index(index="contextual_chunks", id=i, document=doc)

Hybrid Search

Combine BM25 scores with embedding similarity scores:

def hybrid_search(query, alpha=0.5):
    # BM25 score
    bm25_results = es.search(
        index="contextual_chunks",
        query={"match": {"text": query}},
        size=50
    )
    
    # Embedding score
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = cosine_similarity([query_emb], contextual_embeddings)[0]
    
    # Combine
    combined = []
    for hit in bm25_results['hits']['hits']:
        idx = int(hit['_id'])
        bm25_score = hit['_score']
        emb_score = emb_scores[idx]
        combined.append((idx, alpha * bm25_score + (1 - alpha) * emb_score))
    
    combined.sort(key=lambda x: x[1], reverse=True)
    return [chunks[i[0]] for i in combined[:10]]
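One caveat with this weighted sum: raw Elasticsearch BM25 scores and cosine similarities sit on different scales, so an unnormalized mix tends to be dominated by BM25. A simple option, sketched below, is to min-max normalize each score list before combining (rank-based fusion such as Reciprocal Rank Fusion is another common choice):

# Scale a list of scores to [0, 1] so BM25 and embedding scores are comparable.
def min_max_normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]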

4. Reranking for Final Polish

Even with contextual retrieval, a reranker can push accuracy further. Use Cohere's rerank API:

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_k=10):
    results = co.rerank(
        query=query,
        documents=[c['text'] for c in candidates],
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    return [candidates[r.index] for r in results.results]

Full Pipeline

def contextual_rag_pipeline(query):
    # Step 1: Hybrid search
    candidates = hybrid_search(query, alpha=0.5)
    
    # Step 2: Rerank
    top_chunks = rerank(query, candidates, top_k=5)
    
    # Step 3: Generate answer with Claude
    context = "\n\n".join([c['text'] for c in top_chunks])
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
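Using it is then a single call (the question below is only a placeholder):

answer = contextual_rag_pipeline("Where are API request retries configured?")
print(answer)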

Performance Summary

Method                     Pass@10    Improvement
Baseline                   ~87%       –
Contextual Embeddings      ~95%       +8 pts
+ Contextual BM25          ~97%       +10 pts
+ Reranking                ~98%       +11 pts

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the "lost context" problem in traditional RAG.
  • Prompt caching makes this practical by reusing surrounding document context across chunks, dramatically lowering API costs.
  • Combine with Contextual BM25 for hybrid search that leverages both semantic and keyword signals, further boosting accuracy.
  • Reranking adds the final polish—a lightweight reranker can push Pass@10 from 95% to 98%.
  • Works on any platform—while demonstrated with Anthropic's API, the same technique can be adapted for AWS Bedrock (with the provided Lambda function) and GCP Vertex AI.

Next Steps

Happy building!