Guide · Beginner · Best Practices · 2026-05-12

Supercharge Your RAG Pipeline: A Practical Guide to Contextual Retrieval with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to reduce retrieval failure rates by 35% in your Claude RAG applications. Step-by-step guide with code.

Quick Answer

This guide teaches you how to implement Contextual Retrieval, a technique that adds relevant context to document chunks before embedding. In Anthropic's tests it reduced the top-20-chunk retrieval failure rate by 35%, and prompt caching keeps the cost practical in Claude-powered RAG systems.

Tags: RAG, Contextual Embeddings, Retrieval, Prompt Caching, Claude API

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering everything from customer support chatbots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet without its function name, a paragraph without its section header—these orphans lead to failed retrievals and poor Claude responses.

Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%. In this guide, you'll learn exactly how to implement it.

What You'll Build

By the end of this guide, you'll have a production-ready Contextual Retrieval pipeline that:

Establishes a baseline RAG system for performance measurement
Implements Contextual Embeddings to boost retrieval accuracy
Adds Contextual BM25 for hybrid search improvements
Applies reranking for the final polish

We'll use a dataset of 9 codebases (248 queries with known "golden chunks") and measure performance using Pass@k—whether the correct chunk appears in the top-k retrieved results.
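
Pass@k can be computed in a few lines. The sketch below is a minimal illustration, assuming each query's results are a ranked list of chunk IDs and each query has exactly one known golden chunk ID. The function name, argument names, and toy IDs are hypothetical and not taken from the evaluation code.

```python
from typing import Sequence


def pass_at_k(ranked_ids: Sequence[Sequence[str]], golden_ids: Sequence[str], k: int) -> float:
    """Fraction of queries whose golden chunk ID appears in the top-k results."""
    if len(ranked_ids) != len(golden_ids):
        raise ValueError("need one golden ID per query")
    if not golden_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_ids, golden_ids) if gold in ranked[:k])
    return hits / len(golden_ids)


# Toy data (hypothetical IDs): two queries, with ranked results for each.
results = [["c3", "c1", "c7"], ["c2", "c9", "c4"]]
golden = ["c1", "c4"]
print(pass_at_k(results, golden, k=2))  # 0.5: only the first query hits in the top 2
print(pass_at_k(results, golden, k=3))  # 1.0
```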

Prerequisites

Skills:

Intermediate Python
Basic RAG understanding
Familiarity with vector databases and embeddings

System:

Python 3.8+
Docker (optional, for BM25)
4GB+ RAM, ~5-10GB disk space

API Keys:

Anthropic API key
Voyage AI API key (for embeddings)
Cohere API key (for reranking)

Time & Cost: ~30-45 minutes, ~$5-10 in API costs

Step 1: Setting Up the Basic RAG Pipeline

First, let's establish our baseline. We'll load the codebase chunks and evaluation dataset, then implement a simple retrieval system.

import json
from typing import Dict, List

import numpy as np
import voyageai

# Load data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize Voyage AI client
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
chunk_embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple cosine similarity search
def search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = [
        (i, cosine_similarity(query_embedding, chunk_embeddings[i]))
        for i in range(len(chunk_embeddings))
    ]
    top_k = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
    return [chunks[i] for i, _ in top_k]

Baseline Performance: With this basic setup, you'll likely see Pass@10 around 87%—meaning 13% of queries fail to retrieve the correct chunk in the top 10 results. Let's improve that.
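
Embedding APIs typically cap how many texts a single request may carry, so embedding a whole corpus in one call can fail on larger datasets. Here is a minimal batching sketch, assuming a per-request limit of 128 texts (an assumed figure; check your provider's current limit) and a hypothetical `embed_fn` that maps a list of texts to a list of vectors:

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 128,  # assumed limit, not taken from the guide
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder for illustration: one number per text (its length).
fake_embed = lambda batch: [[float(len(t))] for t in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

With the Voyage client from Step 1, `embed_fn` could be `lambda batch: vo.embed(batch, model="voyage-2").embeddings`.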

Step 2: Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend a short context snippet that explains where the chunk came from. For codebases, this context might include:

The file path
The function or class name
A brief description of the surrounding module

def create_contextual_chunk(chunk: Dict) -> str:
    """Add context to a chunk before embedding."""
    context_parts = []
    
    if chunk.get('file_path'):
        context_parts.append(f"File: {chunk['file_path']}")
    if chunk.get('function_name'):
        context_parts.append(f"Function: {chunk['function_name']}")
    if chunk.get('class_name'):
        context_parts.append(f"Class: {chunk['class_name']}")
    if chunk.get('description'):
        context_parts.append(f"Description: {chunk['description']}")
    
    context_str = " | ".join(context_parts)
    # Fall back to the bare content when no context fields are present
    if not context_str:
        return chunk['content']
    return f"{context_str}\n\n{chunk['content']}"

# Generate contextual chunks
contextual_chunks = [create_contextual_chunk(chunk) for chunk in chunks]

# Embed the contextual chunks
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings

Why This Works

When you search for "how to handle authentication errors," a bare chunk containing raise AuthenticationError("Invalid token") might not rank highly. But with context like File: auth/handler.py | Function: validate_token | Description: Handles JWT token validation, the same chunk becomes highly relevant to authentication-related queries.
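
To make this concrete, the sketch below builds the exact text that would be embedded for that example. The chunk dictionary is hypothetical, and `with_context` is a condensed stand-in for `create_contextual_chunk` so the snippet runs on its own.

```python
from typing import Dict

# Hypothetical chunk record using the field names from Step 2.
chunk: Dict[str, str] = {
    "file_path": "auth/handler.py",
    "function_name": "validate_token",
    "description": "Handles JWT token validation",
    "content": 'raise AuthenticationError("Invalid token")',
}

LABELS = (("file_path", "File"), ("function_name", "Function"), ("description", "Description"))


def with_context(chunk: Dict[str, str]) -> str:
    """Prefix the chunk content with whichever context fields are present."""
    parts = [f"{label}: {chunk[key]}" for key, label in LABELS if chunk.get(key)]
    prefix = " | ".join(parts)
    return f"{prefix}\n\n{chunk['content']}" if prefix else chunk["content"]


print(with_context(chunk))
# File: auth/handler.py | Function: validate_token | Description: Handles JWT token validation
#
# raise AuthenticationError("Invalid token")
```

The embedding model now sees the file path, the function name, and the description alongside the bare `raise` statement, which is what gives an authentication-related query something to match.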

Managing Costs with Prompt Caching

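The obvious concern is cost. Contextual chunks are longer, so they cost more to embed, and if you ask Claude to write each chunk's context from the full source document, that document is resent with every request. Prompt caching addresses the second cost. Available on Anthropic's first-party API (and coming soon to AWS Bedrock and GCP Vertex), prompt caching lets you reuse a shared prompt prefix, such as the system prompt and document context, across multiple requests.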

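import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Mark the system prompt as cacheable so later requests can reuse it.
# In practice the cached block should hold the large shared context (for
# example, the full document); prompts shorter than the model's minimum
# cacheable length are processed without caching.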
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": "You are a retrieval assistant. Use the following context to answer questions.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What is the authentication flow?"
        }
    ]
)
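
In Anthropic's description of Contextual Retrieval, the context for each chunk is written by Claude from the full source document. That is where caching pays off: the document is the same for every chunk, so it can be marked cacheable and reused. The sketch below only builds the request arguments, so it runs without an API key. The prompt wording, the function name `build_context_request`, and the `max_tokens` value are illustrative assumptions.

```python
from typing import Any, Dict


def build_context_request(document: str, chunk: str, model: str) -> Dict[str, Any]:
    """Build keyword arguments for `client.messages.create`.

    The full document goes in its own content block marked with
    `cache_control`, so requests for other chunks of the same document
    can reuse the cached prefix. Only the chunk block changes per request.
    """
    return {
        "model": model,
        "max_tokens": 200,  # assumed budget for a short context sentence
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": (
                            f"<chunk>\n{chunk}\n</chunk>\n"
                            "Write a short context that situates this chunk within "
                            "the document to improve search retrieval. "
                            "Answer with the context only."
                        ),
                    },
                ],
            }
        ],
    }


request = build_context_request(
    "def validate_token(t): ...", "raise AuthenticationError", "claude-3-5-sonnet-20241022"
)
print(request["messages"][0]["content"][0]["cache_control"])  # {'type': 'ephemeral'}
```

With the client from the block above, `client.messages.create(**request).content[0].text` would return the generated context. The response's `usage` field reports `cache_creation_input_tokens` and `cache_read_input_tokens`, so you can confirm the cache is actually being hit.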

Performance Boost: Contextual Embeddings alone improved Pass@10 from ~87% to ~95% in Anthropic's tests—a 62% reduction in retrieval failures.
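
The 62% figure is a relative reduction in the failure rate, not a change in percentage points. A quick check of the arithmetic, using the approximate Pass@10 values quoted above:

```python
def relative_failure_reduction(pass_before: float, pass_after: float) -> float:
    """Relative drop in failure rate when Pass@k rises from pass_before to pass_after."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    if fail_before <= 0:
        raise ValueError("baseline has no failures to reduce")
    return (fail_before - fail_after) / fail_before


# 87% -> 95% Pass@10 means failures fall from 13% to 5% of queries.
reduction = relative_failure_reduction(0.87, 0.95)
print(f"{reduction:.1%}")  # 61.5%, which rounds to the 62% quoted above
```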

Step 3: Adding Contextual BM25

BM25 is a traditional text-search algorithm that works well for exact keyword matching. By applying the same contextual prefix to chunks before BM25 indexing, you get Contextual BM25—a powerful complement to your embedding-based search.

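from rank_bm25 import BM25Okapi
from typing import Dict, List

# Tokenize contextual chunks for BM25
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

def normalize(scores) -> List[float]:
    """Min-max scale scores to [0, 1] so both signals are comparable.

    Assumption: the original helper is not shown, so min-max scaling is
    used here as one reasonable choice.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]
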
def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    """Combine embedding similarity and BM25 scores."""
    # Get embedding scores
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(query_embedding, e) for e in contextual_embeddings]
    
    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    
    # Normalize and combine
    emb_scores = normalize(emb_scores)
    bm25_scores = normalize(bm25_scores)
    
    combined = [
        alpha * emb + (1 - alpha) * lex
        for emb, lex in zip(emb_scores, bm25_scores)
    ]
    
    top_indices = sorted(
        range(len(combined)),
        key=lambda i: combined[i],
        reverse=True
    )[:k]
    
    return [chunks[i] for i in top_indices]

Why Both? Embeddings capture semantic meaning ("how to fix login issues"), while BM25 excels at exact matches ("AuthenticationError"). Together, they cover more retrieval scenarios.
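
Splitting on whitespace leaves punctuation attached and keeps identifiers such as `validate_token` or `AuthenticationError` as single tokens, so a query word like "token" cannot match them. A code-aware tokenizer is one possible refinement. The sketch below is an assumption about what suits code, not part of the original pipeline: it splits on non-alphanumeric characters, underscores, and camelCase boundaries, then lowercases.

```python
import re
from typing import List

# Matches, in order: an acronym followed by a capitalized word (the "JWT" in
# "JWTToken"), a word with an optional leading capital, an all-caps run,
# or a run of digits. Everything else acts as a separator.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def tokenize_code(text: str) -> List[str]:
    """Split text into lowercase word-level tokens suitable for BM25."""
    return [match.group(0).lower() for match in _TOKEN.finditer(text)]


print(tokenize_code('raise AuthenticationError("Invalid token")'))
# ['raise', 'authentication', 'error', 'invalid', 'token']
print(tokenize_code("auth/handler.py validate_token parseJWTToken"))
# ['auth', 'handler', 'py', 'validate', 'token', 'parse', 'jwt', 'token']
```

To try it with the hybrid search above, tokenize both the indexed chunks and the query with the same function, for example `BM25Okapi([tokenize_code(c) for c in contextual_chunks])` and `bm25.get_scores(tokenize_code(query))`.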

Step 4: Reranking for Precision

Even with hybrid search, your top-10 results might include near-misses. A reranker (like Cohere's) takes your top-20 results and re-scores them based on deeper semantic understanding.

import cohere
co = cohere.Client("YOUR_COHERE_API_KEY")
def rerank(query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    """Rerank candidates using Cohere's reranker."""
    candidate_texts = [c['content'] for c in candidates]
    
    results = co.rerank(
        query=query,
        documents=candidate_texts,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    
    # The rerank response exposes its ranked items on `.results`
    return [candidates[r.index] for r in results.results]

Final Pipeline:

Retrieve top-20 using hybrid Contextual Embeddings + Contextual BM25
Rerank to get the final top-10
Pass to Claude for answer generation
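
These three stages compose naturally. The sketch below wires them together with injected callables so it runs without any API keys. `retrieve`, `rerank`, and `generate` are hypothetical stand-ins for the hybrid search, the Cohere reranker, and the Claude call, and the stub implementations exist only for illustration.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[Chunk]],
    rerank: Callable[[str, List[Chunk], int], List[Chunk]],
    generate: Callable[[str, List[Chunk]], str],
    n_candidates: int = 20,
    n_final: int = 10,
) -> str:
    """Retrieve a wide candidate set, narrow it by reranking, then generate."""
    candidates = retrieve(query, n_candidates)
    top_chunks = rerank(query, candidates, n_final)
    return generate(query, top_chunks)


# Stubs standing in for the real components.
corpus = [{"content": f"chunk {i}"} for i in range(30)]
stub_retrieve = lambda q, k: corpus[:k]
stub_rerank = lambda q, cands, k: list(reversed(cands))[:k]
stub_generate = lambda q, chunks: f"{len(chunks)} chunks, first: {chunks[0]['content']}"

print(answer_query("What is the authentication flow?", stub_retrieve, stub_rerank, stub_generate))
# 10 chunks, first: chunk 19
```

With the functions from the earlier steps, `retrieve` could be `hybrid_search` and `rerank` the Cohere-backed `rerank`, since their positional arguments already line up with these signatures.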

AWS Bedrock Implementation

For AWS customers, Anthropic's team has provided a Lambda function that implements Contextual Retrieval as a custom chunking strategy for Bedrock Knowledge Bases. You'll find the code in the contextual-rag-lambda-function directory of the cookbook repository.

# lambda_function.py (simplified)
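def lambda_handler(event, context):
    chunk = event['chunk']
    # generate_context is a placeholder for your context-generation step (not shown).
    # A distinct variable name keeps the Lambda `context` argument from being shadowed.
    chunk_context = generate_context(chunk)
    contextual_chunk = f"{chunk_context}\n\n{chunk['content']}"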
    return {
        'chunkId': chunk['id'],
        'content': contextual_chunk,
        'metadata': chunk.get('metadata', {})
    }

Deploy this Lambda, select it as your custom chunking option when creating a Knowledge Base, and you're set.

Key Takeaways

Contextual Embeddings reduced top-20-chunk retrieval failures by 35% in Anthropic's tests by prepending relevant context (file path, function name, description) to each chunk before embedding
Pair with Contextual BM25 for hybrid search that combines semantic understanding with exact keyword matching
Use prompt caching to manage the increased token costs of contextual chunks—available on Anthropic's API, coming to Bedrock and Vertex
Add a reranker (like Cohere) as a final precision layer to eliminate near-misses from your top-k results
Start simple: even basic context (just the file path) provides significant improvements over bare chunks

Contextual Retrieval is one of the highest-ROI improvements you can make to a RAG system. The implementation is straightforward, the costs are manageable with prompt caching, and the performance gains are substantial. Your Claude-powered applications will thank you.