Guide | Beginner | Pricing | 2026-05-15

Mastering Contextual Retrieval: Boost RAG Accuracy by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 with Claude to dramatically improve RAG retrieval accuracy. Includes code examples, cost optimization with prompt caching, and AWS Bedrock deployment.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% using Claude, Voyage AI, and prompt caching for cost efficiency.

Tags: RAG, Contextual Embeddings, Prompt Caching, Retrieval, Claude

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering customer support bots, internal knowledge base Q&A, legal document analysis, and code generation. But there's a persistent problem: chunked documents lose context. When you split a 50-page technical manual into 500-character chunks, each chunk becomes an orphan—a fragment without its parent document's narrative.

Anthropic's research team discovered a powerful fix: Contextual Retrieval. By prepending relevant context to each chunk before embedding, they reduced top-20-chunk retrieval failure rates by an average of 35% across diverse datasets. This guide walks you through implementing this technique with Claude, including cost-saving strategies using prompt caching and deployment options for AWS Bedrock.

What You'll Build

By the end of this guide, you'll have:

A basic RAG pipeline with baseline performance metrics
A Contextual Embeddings system that adds chunk-specific context
A Contextual BM25 hybrid search for even better retrieval
A reranking layer to maximize accuracy

Prerequisites

Technical Skills:

Intermediate Python programming
Basic understanding of RAG concepts
Familiarity with vector databases and embeddings

System Requirements:

Python 3.8+
Docker installed (optional, for BM25 search)
4GB+ available RAM
~5-10 GB disk space for vector databases

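API Keys:

Anthropic API key (Claude, for context generation)
Voyage AI API key (embeddings)
Cohere API key (reranking)
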
Time & Cost:

Expected completion: 30-45 minutes
API costs: ~$5-10 for the full dataset

Step 1: Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere rank-bm25 pandas numpy

Initialize your clients:

import anthropic
import voyageai
import cohere
# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client(api_key="YOUR_COHERE_KEY")
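Hard-coding keys in source files makes them easy to leak. A minimal sketch of one alternative, assuming the keys are exported as environment variables. The variable names (`ANTHROPIC_API_KEY`, `VOYAGE_API_KEY`, `COHERE_API_KEY`) and the helper `require_env` are illustrative choices, not something this setup requires:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Hypothetical usage with the clients from this step:
# claude = anthropic.Anthropic(api_key=require_env("ANTHROPIC_API_KEY"))
# vo = voyageai.Client(api_key=require_env("VOYAGE_API_KEY"))
# co = cohere.Client(api_key=require_env("COHERE_API_KEY"))
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by one of the API clients.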

Step 2: Building a Basic RAG Baseline

Before improving retrieval, establish a baseline. We'll use a dataset of 9 codebases (248 queries with known "golden chunks") and measure Pass@k—whether the correct chunk appears in the top-k results.

import json
import numpy as np
# Load your chunked dataset (each chunk is a dict with a "text" field, as used in Step 3)
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_queries = [json.loads(line) for line in f]
# Basic chunking (character-based split), for splitting your own documents
def basic_chunk(text, chunk_size=500, overlap=50):
    pieces = []
    for i in range(0, len(text), chunk_size - overlap):
        pieces.append(text[i:i + chunk_size])
    return pieces
# Cosine similarity between two embedding vectors
def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Embed chunk texts using Voyage AI (large datasets may need to be sent in batches)
chunk_embeddings = vo.embed(
    texts=[chunk["text"] for chunk in chunks],
    model="voyage-2",
    input_type="document"
).embeddings
# Simple cosine similarity search
def search(query, k=10):
    query_emb = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    similarities = [
        cosine_similarity(query_emb, chunk_emb)
        for chunk_emb in chunk_embeddings
    ]
    # Indices of the k most similar chunks, best first
    top_indices = sorted(
        range(len(similarities)),
        key=lambda i: similarities[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]

Baseline Result: Pass@10 ≈ 87%—decent, but we can do better.
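The Pass@k metric behind this number can be computed with a small harness. The sketch below assumes each evaluation record has a `query` string and a `golden_chunk_ids` list, and that the search function returns chunk ids. These field names and the function signature are hypothetical and should be adapted to the actual layout of your evaluation file. With a single golden chunk per query, the score reduces to a plain pass/fail average.

```python
def pass_at_k(eval_queries, search_fn, k=10):
    """Average, over queries, of the fraction of golden chunks found in the top-k results."""
    total = 0.0
    scored = 0
    for record in eval_queries:
        golden = set(record["golden_chunk_ids"])  # assumed field name
        if not golden:
            continue  # skip records without a known answer
        retrieved = set(search_fn(record["query"], k)[:k])  # assumed: returns chunk ids
        total += len(golden & retrieved) / len(golden)
        scored += 1
    return total / scored if scored else 0.0


# Toy example with a fake retriever that always returns the same ranking.
toy_queries = [
    {"query": "token refresh", "golden_chunk_ids": ["auth_3"]},
    {"query": "db pooling", "golden_chunk_ids": ["db_1"]},
]


def fake_search(query, k):
    return ["auth_3", "auth_1", "util_7"][:k]


score = pass_at_k(toy_queries, fake_search, k=3)
print(f"Pass@3 = {score:.2f}")  # Pass@3 = 0.50
```

Running the same harness after each later step keeps the comparisons between pipeline stages consistent.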

Step 3: Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend a short context that explains where the chunk comes from. This context is generated by Claude itself.

def generate_chunk_context(chunk, document_title, surrounding_text):
    """Use Claude to generate context for a chunk."""
    prompt = f"""You are helping to improve a RAG system. 
    
    Document: {document_title}
    
    Here is a chunk from this document:
    <chunk>{chunk}</chunk>
    
    Here is the surrounding text (100 chars before and after):
    <context>{surrounding_text}</context>
    
    Generate a brief context (2-3 sentences) that explains what this chunk is about 
    and how it fits into the larger document. Focus on:
    - What topic or concept this chunk covers
    - How it relates to adjacent content
    - Any key entities or references
    
    Return ONLY the context, no additional text."""
    
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
# Apply to all chunks
contextual_chunks = []
for chunk in chunks:
    context = generate_chunk_context(
        chunk["text"],
        chunk["document_title"],
        chunk["surrounding_text"]
    )
    contextual_chunks.append(f"{context}\n\n{chunk['text']}")
# Embed the contextualized chunks
contextual_embeddings = vo.embed(
    texts=contextual_chunks,
    model="voyage-2",
    input_type="document"
).embeddings

Result: Pass@10 jumps from ~87% to ~95%—a 62% reduction in retrieval failures.
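The 62% figure is a relative reduction in the failure rate (1 - Pass@10), not a difference in percentage points. The arithmetic, using the rounded numbers quoted above:

```python
def failure_reduction(pass_before, pass_after):
    """Relative reduction in failure rate, where failure = 1 - pass rate."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    return (fail_before - fail_after) / fail_before


# Failures drop from 13% to 5% of queries: (13 - 5) / 13
reduction = failure_reduction(0.87, 0.95)
print(f"{reduction:.0%}")  # 62%
```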

Cost Optimization with Prompt Caching

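Generating context for thousands of chunks can be expensive. Claude's prompt caching reduces the cost by reusing a shared prompt prefix across requests. Caching only takes effect when the cached prefix exceeds a model-dependent minimum length (on the order of a thousand tokens or more), so a short instruction on its own will not be cached. Put the content that is shared across chunks, such as the full source document, in the cached block, and send only the chunk-specific text in the uncached part.

# With prompt caching, the shared prefix (instructions plus the full document) is cached.
# document_text is assumed to hold the full text of the document the chunk came from.
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    system=[{
        "type": "text",
        "text": "You are a context generator for a RAG system...\n\n<document>" + document_text + "</document>",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150
)

Cached input tokens are billed at a steep discount, so prompt caching can reduce API costs by up to 90% for this use case, making Contextual Embeddings practical for production. Process chunks from the same document back to back so that each request reuses the cache before it expires.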
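To see where the savings come from, here is a rough cost model. It is a sketch under stated assumptions, not a pricing reference. It assumes cache writes are billed at 1.25x the base input price and cache reads at 0.1x (check current pricing for your model), that the full document is the cached prefix, and that every request after the first hits the cache. The token counts and the per-token price are made-up illustration values.

```python
def context_generation_input_cost(
    doc_tokens, chunk_tokens, num_chunks, price_per_token,
    cached=True, write_multiplier=1.25, read_multiplier=0.10,
):
    """Estimate the input-token cost of generating context for every chunk of one document."""
    if num_chunks <= 0:
        return 0.0
    if not cached:
        # Every request pays full price for the document plus the chunk.
        return num_chunks * (doc_tokens + chunk_tokens) * price_per_token
    # The first request writes the document to the cache; later requests read it.
    write_cost = doc_tokens * write_multiplier * price_per_token
    read_cost = (num_chunks - 1) * doc_tokens * read_multiplier * price_per_token
    chunk_cost = num_chunks * chunk_tokens * price_per_token  # chunk text is never cached
    return write_cost + read_cost + chunk_cost


# Illustration values (assumptions): an 8,000-token document split into 60 chunks
# of about 150 tokens each, at a hypothetical $0.25 per million input tokens.
price = 0.25 / 1_000_000
uncached = context_generation_input_cost(8000, 150, 60, price, cached=False)
cached = context_generation_input_cost(8000, 150, 60, price, cached=True)
savings = 1 - cached / uncached
print(f"uncached=${uncached:.4f} cached=${cached:.4f} savings={savings:.0%}")
```

Under these assumptions the savings grow with the number of chunks per document, because the one-time cache write is spread over more cache reads. Output tokens are not included in this estimate.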

Step 4: Contextual BM25 Hybrid Search

BM25 (a text-based retrieval algorithm) can also benefit from contextualized chunks. Combine it with vector search for a hybrid approach.

from rank_bm25 import BM25Okapi
# Tokenize contextual chunks for BM25
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)
# Hybrid search: combine BM25 and vector scores
def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    vector_scores = [
        cosine_similarity(query_emb, emb)
        for emb in contextual_embeddings
    ]

    # BM25 search (tokenize the query the same way as the chunks)
    bm25_scores = bm25.get_scores(query.split())

    # Normalize by the maximum score; fall back to 1.0 to avoid dividing by zero
    # when every score is zero (for example, no query term appears in any chunk)
    max_vector = max(vector_scores) or 1.0
    max_bm25 = max(bm25_scores) or 1.0
    combined = [
        alpha * (v / max_vector) +
        (1 - alpha) * (b / max_bm25)
        for v, b in zip(vector_scores, bm25_scores)
    ]

    top_indices = sorted(
        range(len(combined)),
        key=lambda i: combined[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]

Result: Hybrid Contextual Retrieval further improves Pass@10 by 2-3% over Contextual Embeddings alone.
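Dividing by the maximum score is sensitive to outliers, because BM25 and cosine scores live on different scales. A common alternative is Reciprocal Rank Fusion (RRF), which combines rankings instead of raw scores. The sketch below is a swapped-in technique, not the method used in the code above. The function name and the constant `k_rrf=60` (a conventional default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k_rrf=60, top_k=10):
    """Fuse several ranked lists of item ids into one list using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Each list contributes 1 / (k_rrf + rank) for every item it contains.
            scores[item] = scores.get(item, 0.0) + 1.0 / (k_rrf + rank)
    fused = sorted(scores, key=lambda item: scores[item], reverse=True)
    return fused[:top_k]


# Toy rankings of chunk indices, best first.
vector_ranking = [3, 1, 4, 2]
bm25_ranking = [1, 5, 3, 9]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking], top_k=3)
print(fused)  # [1, 3, 5]
```

Chunks that rank well in both lists rise to the top, and no score normalization is needed.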

Step 5: Adding a Reranking Layer

For maximum accuracy, add a Cohere reranker to reorder the top-20 results:

def rerank(query, candidates, top_k=10):
    results = co.rerank(
        model="rerank-english-v2.0",
        query=query,
        documents=candidates,
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]
# Full pipeline
query = "How does the authentication module handle token refresh?"
top_20 = hybrid_search(query, k=20)
final_results = rerank(query, top_20, top_k=10)

Final Result: Pass@10 reaches ~97%—near-perfect retrieval.

Deploying to AWS Bedrock

Anthropic provides a ready-to-deploy Lambda function for AWS Bedrock Knowledge Bases. The code is in contextual-rag-lambda-function/lambda_function.py. Deploy it as a custom chunking option:

Create a Lambda function with the provided code
Configure your Bedrock Knowledge Base to use it
Select "Custom chunking" and point to your Lambda ARN

This makes Contextual Retrieval production-ready on AWS without managing infrastructure.

Key Takeaways

Contextual Embeddings reduced top-20 retrieval failures by an average of 35% in Anthropic's evaluation by adding chunk-specific context before embedding, solving the "orphaned chunk" problem in traditional RAG.
Prompt caching makes it cost-effective—Claude's ephemeral caching reduces API costs by up to 90% for context generation, making this practical for production.
Hybrid search (Contextual Embeddings + Contextual BM25) outperforms either alone—combining semantic and keyword retrieval yields 2-3% additional improvement.
Reranking adds the final polish—a Cohere reranker on top-20 results pushes Pass@10 to ~97%.
AWS Bedrock deployment is straightforward—use the provided Lambda function for custom chunking in Bedrock Knowledge Bases.

Next Steps

Read the full Anthropic blog post on Contextual Retrieval for more performance evaluations
Experiment with different chunk sizes and overlap ratios
Try Claude 3.5 Sonnet for context generation (higher quality, at a noticeably higher per-token cost than Claude 3 Haiku)
Explore the complete notebook at Anthropic's Cookbook
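
As a starting point for the chunk-size experiments suggested above, this sketch reuses the character-based splitting logic from Step 2 and reports how many chunks each setting produces. The sample text and the grid of sizes are arbitrary. In practice you would re-embed the chunks and re-run the evaluation for each setting.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Character-based split with overlap (same logic as the Step 2 baseline)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 2000  # stand-in for a real document
counts = {}
for chunk_size, overlap in [(250, 25), (500, 50), (1000, 100)]:
    counts[(chunk_size, overlap)] = len(chunk_text(sample, chunk_size, overlap))
    print(chunk_size, overlap, counts[(chunk_size, overlap)])
```

Smaller chunks mean more context-generation calls and more embeddings to store, so chunk count is worth tracking alongside Pass@k.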