BeClaude
Guide · Beginner · Pricing · 2026-05-15

Mastering Contextual Retrieval: Boost RAG Accuracy by 35% with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 with Claude to dramatically improve RAG retrieval accuracy. Includes code examples, cost optimization with prompt caching, and AWS Bedrock deployment.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% using Claude, Voyage AI, and prompt caching for cost efficiency.

Tags: RAG, Contextual Embeddings, Prompt Caching, Retrieval, Claude

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering customer support bots, internal knowledge base Q&A, legal document analysis, and code generation. But there's a persistent problem: chunked documents lose context. When you split a 50-page technical manual into 500-character chunks, each chunk becomes an orphan—a fragment without its parent document's narrative.

Anthropic's research team discovered a powerful fix: Contextual Retrieval. By prepending relevant context to each chunk before embedding, they reduced top-20-chunk retrieval failure rates by an average of 35% across diverse datasets. This guide walks you through implementing this technique with Claude, including cost-saving strategies using prompt caching and deployment options for AWS Bedrock.

What You'll Build

By the end of this guide, you'll have:

  • A basic RAG pipeline with baseline performance metrics
  • A Contextual Embeddings system that adds chunk-specific context
  • A Contextual BM25 hybrid search for even better retrieval
  • A reranking layer to maximize accuracy

Prerequisites

Technical Skills:
  • Intermediate Python programming
  • Basic understanding of RAG concepts
  • Familiarity with vector databases and embeddings
System Requirements:
  • Python 3.8+
  • Docker installed (optional, for BM25 search)
  • 4GB+ available RAM
  • ~5-10 GB disk space for vector databases
API Keys:
  • Anthropic (Claude, for context generation)
  • Voyage AI (embeddings)
  • Cohere (reranking)
Time & Cost:
  • Expected completion: 30-45 minutes
  • API costs: ~$5-10 for the full dataset

Step 1: Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere rank_bm25 pandas numpy

Initialize your clients:

import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client(api_key="YOUR_COHERE_KEY")
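
Hard-coded keys are easy to leak into version control. As a minimal sketch, the keys could be read from environment variables instead. The helper `load_api_keys` and the three variable names are assumptions for illustration, not something this guide prescribes:

```python
import os

# Environment variable names are assumptions; adjust them to your setup.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "VOYAGE_API_KEY", "COHERE_API_KEY")


def load_api_keys(env=None):
    """Return the required API keys from the environment, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}
```

The returned values can then be passed to the client constructors, for example `anthropic.Anthropic(api_key=keys["ANTHROPIC_API_KEY"])`.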

Step 2: Building a Basic RAG Baseline

Before improving retrieval, establish a baseline. We'll use a dataset of 9 codebases (248 queries with known "golden chunks") and measure Pass@k—whether the correct chunk appears in the top-k results.

import json
import numpy as np

# Load your chunked dataset
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_queries = [json.loads(line) for line in f]

# Basic chunking (character-based split)
def basic_chunk(text, chunk_size=500, overlap=50):
    pieces = []
    for i in range(0, len(text), chunk_size - overlap):
        pieces.append(text[i:i + chunk_size])
    return pieces

# Embed chunks using Voyage AI. Each chunk is assumed to be a dict with a
# "text" field, matching how chunks are used in Step 3.
chunk_embeddings = vo.embed(
    texts=[chunk["text"] for chunk in chunks],
    model="voyage-2",
    input_type="document"
).embeddings

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple cosine similarity search
def search(query, k=10):
    query_emb = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    similarities = [
        cosine_similarity(query_emb, chunk_emb)
        for chunk_emb in chunk_embeddings
    ]
    top_indices = sorted(
        range(len(similarities)),
        key=lambda i: similarities[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]
Baseline Result: Pass@10 ≈ 87%—decent, but we can do better.
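
To make the Pass@k metric concrete, here is a minimal sketch of how it could be computed. It assumes each evaluation item is a `(query, golden_chunk_id)` pair and that the retriever returns chunk identifiers. Both shapes are hypothetical, so adapt them to the actual layout of your evaluation file:

```python
def pass_at_k(eval_items, search_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results.

    eval_items: iterable of (query, golden_chunk_id) pairs (hypothetical shape).
    search_fn:  callable (query, k) -> list of retrieved chunk ids.
    """
    eval_items = list(eval_items)
    if not eval_items:
        return 0.0
    hits = sum(
        1 for query, golden_id in eval_items
        if golden_id in search_fn(query, k)[:k]
    )
    return hits / len(eval_items)


# Toy check with a stub retriever (made-up data)
def toy_search(query, k):
    index = {"q1": ["a", "b"], "q2": ["c", "d"]}
    return index[query][:k]


score = pass_at_k([("q1", "b"), ("q2", "x")], toy_search, k=2)
print(score)  # 0.5: the golden chunk is found for q1 but not for q2
```

Keeping the retriever as a parameter means the same function can score the baseline, the contextual, and the hybrid pipelines without changes.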

Step 3: Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend a short context that explains where the chunk comes from. This context is generated by Claude itself.

def generate_chunk_context(chunk, document_title, surrounding_text):
    """Use Claude to generate context for a chunk."""
    prompt = f"""You are helping to improve a RAG system. 
    
    Document: {document_title}
    
    Here is a chunk from this document:
    <chunk>{chunk}</chunk>
    
    Here is the surrounding text (100 chars before and after):
    <context>{surrounding_text}</context>
    
    Generate a brief context (2-3 sentences) that explains what this chunk is about 
    and how it fits into the larger document. Focus on:
    - What topic or concept this chunk covers
    - How it relates to adjacent content
    - Any key entities or references
    
    Return ONLY the context, no additional text."""
    
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
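
The function above expects each chunk to come with a document title and some surrounding text. As a rough sketch of how such records might be built from a raw document, the helper below pairs each chunk with a window of nearby characters. The function name, the `margin` parameter, and the choice to include the chunk itself in the window are assumptions for illustration:

```python
def build_chunk_records(document_title, text, chunk_size=500, overlap=50, margin=100):
    """Split text into overlapping chunks and attach nearby text to each one."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap")
    records = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        records.append({
            "text": text[start:end],
            "document_title": document_title,
            # Up to `margin` characters on each side, including the chunk itself
            "surrounding_text": text[max(0, start - margin):end + margin],
        })
    return records
```

Each record then carries the three fields used in this step: `text`, `document_title`, and `surrounding_text`.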

# Apply to all chunks
contextual_chunks = []
for chunk in chunks:
    context = generate_chunk_context(
        chunk["text"],
        chunk["document_title"],
        chunk["surrounding_text"]
    )
    contextual_chunks.append(f"{context}\n\n{chunk['text']}")

# Embed the contextualized chunks
contextual_embeddings = vo.embed(
    texts=contextual_chunks,
    model="voyage-2",
    input_type="document"
).embeddings
Result: Pass@10 jumps from ~87% to ~95%. The failure rate falls from ~13% to ~5%, a relative reduction of roughly 62% in retrieval failures on this dataset.
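
The reduction figure follows from the failure rates (1 minus Pass@k), not from the pass rates themselves. A short sketch of the arithmetic, using the rounded numbers quoted above (the helper name is illustrative):

```python
def relative_failure_reduction(pass_before, pass_after):
    """Relative drop in failure rate, where failure = 1 - Pass@k."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    if fail_before <= 0:
        raise ValueError("baseline has no failures to reduce")
    return (fail_before - fail_after) / fail_before


reduction = relative_failure_reduction(0.87, 0.95)
print(f"{reduction:.0%}")  # 62%
```

The same helper can be applied to any pair of Pass@k measurements to compare pipeline variants on equal footing.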

Cost Optimization with Prompt Caching

Generating context for thousands of chunks can be expensive. Claude's prompt caching feature dramatically reduces costs by reusing shared prompt prefixes.

# With prompt caching, the system prompt is cached
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    system=[{
        "type": "text",
        "text": "You are a context generator for a RAG system...",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150
)

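Because cached prefix tokens are read back at a fraction of the normal input price, prompt caching can cut context-generation costs by up to 90%, making Contextual Embeddings practical for production. The savings grow with the size of the shared prefix, and a prefix shorter than the model's minimum cacheable length is not cached at all, so caching pays off most when the cached block carries a long shared input such as the full source document.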
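
One way to give the cache a long, stable prefix is to place the whole document in a cached content block and let only the chunk vary between calls. The sketch below builds the request arguments without sending them. The helper name, the prompt wording, and the `document_text` input are assumptions for illustration, not the guide's exact prompt:

```python
def build_context_request(document_text, chunk_text,
                          model="claude-3-haiku-20240307", max_tokens=150):
    """Build keyword arguments for claude.messages.create().

    The full document goes in a cached block (identical for every chunk of
    that document); only the short chunk block changes between calls.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": (
                        "Here is a chunk from the document above:\n"
                        f"<chunk>\n{chunk_text}\n</chunk>\n"
                        "Give a short context that situates this chunk within "
                        "the document. Return ONLY the context."
                    ),
                },
            ],
        }],
    }
```

A call would then look like `claude.messages.create(**build_context_request(doc, chunk))`. Processing all chunks of one document back to back keeps the short-lived cache warm, and the `cache_creation_input_tokens` and `cache_read_input_tokens` fields on `response.usage` show whether the cache was actually hit.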

Step 4: Contextual BM25 Hybrid Search

BM25 (a text-based retrieval algorithm) can also benefit from contextualized chunks. Combine it with vector search for a hybrid approach.

from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

# Hybrid search: combine BM25 and vector scores
def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]
    vector_scores = [
        cosine_similarity(query_emb, emb)
        for emb in contextual_embeddings
    ]
    # BM25 search
    bm25_scores = bm25.get_scores(query.split())
    # Normalize by the max score; "or 1.0" avoids division by zero
    # when no chunk shares a token with the query
    max_v = max(vector_scores) or 1.0
    max_b = max(bm25_scores) or 1.0
    combined = [
        alpha * (v / max_v) + (1 - alpha) * (b / max_b)
        for v, b in zip(vector_scores, bm25_scores)
    ]
    top_indices = sorted(
        range(len(combined)),
        key=lambda i: combined[i],
        reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]
Result: Hybrid Contextual Retrieval further improves Pass@10 by 2-3% over Contextual Embeddings alone.
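
To see how the `alpha` weight shifts the ranking between semantic and keyword evidence, here is a small standalone sketch of the same max-normalize-and-blend step. The scores are made-up numbers for three chunks, and the helper names are illustrative:

```python
def combine_scores(vector_scores, bm25_scores, alpha=0.5):
    """Max-normalize each score list, then blend with weight alpha."""
    max_v = max(vector_scores) or 1.0
    max_b = max(bm25_scores) or 1.0
    return [
        alpha * (v / max_v) + (1 - alpha) * (b / max_b)
        for v, b in zip(vector_scores, bm25_scores)
    ]


def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy scores: chunk 0 is the best semantic match, chunk 2 the best keyword match
vector_scores = [0.9, 0.6, 0.3]
bm25_scores = [0.0, 4.0, 8.0]

semantic_first = top_k_indices(combine_scores(vector_scores, bm25_scores, alpha=0.9), k=3)
keyword_first = top_k_indices(combine_scores(vector_scores, bm25_scores, alpha=0.1), k=3)
print(semantic_first)  # [0, 1, 2]
print(keyword_first)   # [2, 1, 0]
```

A high `alpha` favors the embedding ranking and a low `alpha` favors BM25. The default of 0.5 is a starting point that is worth tuning against your own evaluation set.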

Step 5: Adding a Reranking Layer

For maximum accuracy, add a Cohere reranker to reorder the top-20 results:

def rerank(query, candidates, top_k=10):
    # Candidates are chunk dicts; the reranker scores their text
    results = co.rerank(
        model="rerank-english-v2.0",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How does the authentication module handle token refresh?"
top_20 = hybrid_search(query, k=20)
final_results = rerank(query, top_20, top_k=10)
Final Result: Pass@10 reaches ~97%—near-perfect retrieval.

Deploying to AWS Bedrock

Anthropic provides a ready-to-deploy Lambda function for AWS Bedrock Knowledge Bases. The code is in contextual-rag-lambda-function/lambda_function.py. Deploy it as a custom chunking option:

  • Create a Lambda function with the provided code
  • Configure your Bedrock Knowledge Base to use it
  • Select "Custom chunking" and point to your Lambda ARN
This makes Contextual Retrieval production-ready on AWS without managing infrastructure.

Key Takeaways

  • Contextual Embeddings reduce retrieval failures by 35% by adding chunk-specific context before embedding, solving the "orphaned chunk" problem in traditional RAG.
  • Prompt caching makes it cost-effective—Claude's ephemeral caching reduces API costs by up to 90% for context generation, making this practical for production.
  • Hybrid search (Contextual Embeddings + Contextual BM25) outperforms either alone—combining semantic and keyword retrieval yields 2-3% additional improvement.
  • Reranking adds the final polish—a Cohere reranker on top-20 results pushes Pass@10 to ~97%.
  • AWS Bedrock deployment is straightforward—use the provided Lambda function for custom chunking in Bedrock Knowledge Bases.

Next Steps