BeClaude
Guide · Beginner · Best Practices · 2026-05-12

Supercharge Your RAG Pipeline: A Practical Guide to Contextual Retrieval with Claude

Learn how to implement Contextual Embeddings and Contextual BM25 to reduce retrieval failure rates by 35% in your Claude RAG applications. Step-by-step guide with code.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to reduce retrieval failure rates by 35% in Claude-powered RAG systems, using prompt caching to keep costs practical.

RAG · Contextual Embeddings · Retrieval · Prompt Caching · Claude API

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—powering everything from customer support chatbots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet without its function name, a paragraph without its section header—these orphans lead to failed retrievals and poor Claude responses.

Contextual Retrieval solves this. By prepending relevant context to each chunk before embedding, you dramatically improve retrieval accuracy. In Anthropic's tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%. In this guide, you'll learn exactly how to implement it.

What You'll Build

By the end of this guide, you'll have a production-ready Contextual Retrieval pipeline that:

  • Establishes a baseline RAG system for performance measurement
  • Implements Contextual Embeddings to boost retrieval accuracy
  • Adds Contextual BM25 for hybrid search improvements
  • Applies reranking for the final polish

We'll use a dataset of 9 codebases (248 queries with known "golden chunks") and measure performance using Pass@k: whether the correct chunk appears in the top-k retrieved results.
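
To make the metric concrete, here is a minimal sketch of Pass@k as described above. It assumes each query has a single golden chunk identified by an ID; the function names and the `(retrieved_ids, golden_id)` pair layout are illustrative assumptions, not the evaluation set's actual schema.

```python
from typing import List, Sequence, Tuple


def hit_at_k(retrieved_ids: Sequence[str], golden_id: str, k: int) -> bool:
    """True if the golden chunk appears among the first k retrieved IDs."""
    return golden_id in retrieved_ids[:k]


def pass_at_k(results: List[Tuple[Sequence[str], str]], k: int) -> float:
    """Share of queries whose golden chunk is in the top-k results.

    `results` holds one (retrieved_ids, golden_id) pair per query
    (a hypothetical layout chosen for this sketch).
    """
    if not results:
        return 0.0
    hits = sum(hit_at_k(retrieved, golden, k) for retrieved, golden in results)
    return hits / len(results)


# Three toy queries: the golden chunk is ranked 2nd, ranked 3rd, and missing.
toy_results = [
    (["a", "b", "c"], "b"),
    (["x", "y", "z"], "z"),
    (["p", "q", "r"], "m"),
]
for k in (1, 2, 3):
    print(f"Pass@{k}: {pass_at_k(toy_results, k):.2f}")
```

If a query can have several golden chunks, you would need to decide whether a hit means any of them or all of them; this sketch sidesteps that choice.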

Prerequisites

Skills:
  • Intermediate Python
  • Basic RAG understanding
  • Familiarity with vector databases and embeddings

System:
  • Python 3.8+
  • Docker (optional, for BM25)
  • 4GB+ RAM, ~5-10GB disk space

API Keys (one for each service the code in this guide calls):
  • Anthropic (Claude)
  • Voyage AI (embeddings)
  • Cohere (reranking)

Time & Cost: ~30-45 minutes, ~$5-10 in API costs

Step 1: Setting Up the Basic RAG Pipeline

First, let's establish our baseline. We'll load the codebase chunks and evaluation dataset, then implement a simple retrieval system.

import json
from typing import Dict, List

import numpy as np
import voyageai

# Load the codebase chunks and the evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize the Voyage AI client
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks (for a large corpus, send the texts in batches,
# since a single request can only carry a limited number of texts)
chunk_texts = [chunk['content'] for chunk in chunks]
chunk_embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple cosine similarity search
def search(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    similarities = [
        (i, cosine_similarity(query_embedding, chunk_embeddings[i]))
        for i in range(len(chunk_embeddings))
    ]
    top_k = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
    return [chunks[i] for i, _ in top_k]

Baseline Performance: With this basic setup, you'll likely see Pass@10 around 87%, meaning 13% of queries fail to retrieve the correct chunk in the top 10 results. Let's improve that.

Step 2: Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend a short context snippet that explains where the chunk came from. For codebases, this context might include:

  • The file path
  • The function or class name
  • A brief description of the surrounding module

def create_contextual_chunk(chunk: Dict) -> str:
    """Add context to a chunk before embedding."""
    context_parts = []

    if chunk.get('file_path'):
        context_parts.append(f"File: {chunk['file_path']}")
    if chunk.get('function_name'):
        context_parts.append(f"Function: {chunk['function_name']}")
    if chunk.get('class_name'):
        context_parts.append(f"Class: {chunk['class_name']}")
    if chunk.get('description'):
        context_parts.append(f"Description: {chunk['description']}")

    # Fall back to the bare content when no metadata is available
    if not context_parts:
        return chunk['content']

    context_str = " | ".join(context_parts)
    return f"{context_str}\n\n{chunk['content']}"

# Generate contextual chunks
contextual_chunks = [create_contextual_chunk(chunk) for chunk in chunks]

# Embed the contextual chunks
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings

Why This Works

When you search for "how to handle authentication errors," a bare chunk containing raise AuthenticationError("Invalid token") might not rank highly. But with context like File: auth/handler.py | Function: validate_token | Description: Handles JWT token validation, the same chunk becomes highly relevant to authentication-related queries.
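
To see what actually gets embedded in that example, here is a small self-contained sketch that builds the contextual string for the chunk described above. The helper name `build_contextual_text` and the dictionary keys are assumptions for illustration; they mirror the metadata fields used in this step but are not a required schema.

```python
from typing import Dict


def build_contextual_text(chunk: Dict[str, str]) -> str:
    """Prefix a chunk with whatever metadata it carries (illustrative helper)."""
    labels = [
        ("file_path", "File"),
        ("function_name", "Function"),
        ("description", "Description"),
    ]
    parts = [f"{label}: {chunk[key]}" for key, label in labels if chunk.get(key)]
    prefix = " | ".join(parts)
    # With no metadata at all, embed the bare content unchanged.
    return f"{prefix}\n\n{chunk['content']}" if prefix else chunk["content"]


example_chunk = {
    "file_path": "auth/handler.py",
    "function_name": "validate_token",
    "description": "Handles JWT token validation",
    "content": 'raise AuthenticationError("Invalid token")',
}

contextual_text = build_contextual_text(example_chunk)
print(contextual_text)
```

The printed string is what the embedding model sees: the original line of code, plus the file, function, and description that tie it to authentication.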

Managing Costs with Prompt Caching

The obvious concern is cost: longer strings cost more to embed, and any Claude request that resends the same long context pays for those tokens each time. Prompt caching addresses the repeated-context part. Available on Anthropic's first-party API (and coming soon to AWS Bedrock and GCP Vertex), prompt caching lets you reuse the system prompt and context across multiple requests.

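import anthropic

client = anthropic.Anthropic()

# Mark the system prompt as cacheable so later requests can reuse it.
# Caching only takes effect once the cached prefix reaches the model's
# minimum cacheable length, so in practice this block would hold the long,
# shared context rather than a one-line instruction.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": "You are a retrieval assistant. Use the following context to answer questions.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What is the authentication flow?"
        }
    ]
)

Performance Boost: Contextual Embeddings alone improved Pass@10 from ~87% to ~95% in Anthropic's tests. That cuts the failure rate from about 13% to about 5%, a roughly 62% reduction in retrieval failures.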
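
Returning to prompt caching for a moment: one place it might pay off is if you have Claude write the context for each chunk instead of assembling it from metadata. In that setup, the full source document is the same for every chunk, so it can sit in a cached system block while only the chunk changes. The sketch below only builds the request arguments and makes no API call; the function name, prompt wording, and `max_tokens` value are assumptions for illustration.

```python
from typing import Any, Dict


def build_context_request(
    document: str,
    chunk: str,
    model: str = "claude-3-5-sonnet-20241022",
) -> Dict[str, Any]:
    """Build keyword arguments for client.messages.create(**request).

    The document goes in a cacheable system block (identical across chunks);
    the chunk goes in the user message (different on every call).
    """
    return {
        "model": model,
        "max_tokens": 200,
        "system": [
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": (
                    f"<chunk>\n{chunk}\n</chunk>\n"
                    "Write one or two sentences situating this chunk within "
                    "the document above, to improve search retrieval."
                ),
            }
        ],
    }


# Two chunks from the same (toy) file share an identical cacheable prefix.
source_file = "def validate_token(token):\n    ...\n\ndef refresh_token(token):\n    ..."
request_a = build_context_request(source_file, "def validate_token(token):\n    ...")
request_b = build_context_request(source_file, "def refresh_token(token):\n    ...")
print(request_a["system"] == request_b["system"])
```

You would send each request with `client.messages.create(**request)`. The `usage` object on the response reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which is a simple way to confirm the cache is being hit.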

Step 3: Adding Contextual BM25

BM25 is a traditional text-search algorithm that works well for exact keyword matching. By applying the same contextual prefix to chunks before BM25 indexing, you get Contextual BM25—a powerful complement to your embedding-based search.
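
Because BM25 matches exact tokens, how you tokenize matters, especially for code. The snippets in this step split on whitespace for brevity, which can leave punctuation glued to identifiers. Below is a hedged sketch of a regex tokenizer you could substitute; the name `tokenize_for_bm25` and the choice to lowercase everything are assumptions, not part of any library.

```python
import re
from typing import List

# Runs of letters, digits, and underscores count as one token.
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")


def tokenize_for_bm25(text: str) -> List[str]:
    """Lowercase the text and split it on anything that is not an identifier character."""
    return [token.lower() for token in _TOKEN_RE.findall(text)]


line = 'raise AuthenticationError("Invalid token")'
print(line.split())
print(tokenize_for_bm25(line))
```

With whitespace splitting, the query term `AuthenticationError` never equals the indexed token `AuthenticationError("Invalid`. With the regex version, both sides reduce to `authenticationerror`, provided you tokenize queries and chunks the same way.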

from typing import Dict, List

import numpy as np
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

def normalize(scores) -> np.ndarray:
    """Min-max scale scores to [0, 1] (one reasonable choice); equal scores map to 0."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    """Combine embedding similarity and BM25 scores."""
    # Get embedding scores: cosine similarity against every contextual chunk
    query_vec = np.asarray(vo.embed([query], model="voyage-2").embeddings[0])
    chunk_matrix = np.asarray(contextual_embeddings)
    emb_scores = chunk_matrix @ query_vec / (
        np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec)
    )

    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Normalize and combine; alpha is the weight on the embedding score
    combined = alpha * normalize(emb_scores) + (1 - alpha) * normalize(bm25_scores)

    top_indices = sorted(
        range(len(combined)), key=lambda i: combined[i], reverse=True
    )[:k]
    return [chunks[i] for i in top_indices]

Why Both? Embeddings capture semantic meaning ("how to fix login issues"), while BM25 excels at exact matches ("AuthenticationError"). Together, they cover more retrieval scenarios.
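
Score blending depends on the two score scales being comparable after normalization. A common alternative, shown here as a swapped-in technique rather than what this guide's pipeline uses, is reciprocal rank fusion (RRF), which combines the rankings and ignores raw scores. The function name and the constant `k=60` (a conventional default) are assumptions for this sketch.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Fuse several ranked lists of chunk indices into one ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked highly by several retrievers rise to the top.
    """
    scores: Dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_index in enumerate(ranking, start=1):
            scores[chunk_index] = scores.get(chunk_index, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk index for determinism.
    return sorted(scores, key=lambda idx: (-scores[idx], idx))


embedding_ranking = [2, 0, 1]  # chunk indices, best first, from embedding search
bm25_ranking = [2, 3, 0]       # chunk indices, best first, from BM25
fused = reciprocal_rank_fusion([embedding_ranking, bm25_ranking])
print(fused)
```

Chunk 2 tops both lists, so it stays first; chunk 0 appears in both and outranks chunks 3 and 1, which each appear in only one list.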

Step 4: Reranking for Precision

Even with hybrid search, your top-10 results might include near-misses. A reranker (like Cohere's) takes your top-20 results and re-scores them based on deeper semantic understanding.

from typing import Dict, List

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    """Rerank candidates using Cohere's reranker."""
    candidate_texts = [c['content'] for c in candidates]
    response = co.rerank(
        query=query,
        documents=candidate_texts,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    # Each result carries the index of the candidate it refers to
    return [candidates[r.index] for r in response.results]

Final Pipeline:
  • Retrieve top-20 using hybrid Contextual Embeddings + Contextual BM25
  • Rerank to get the final top-10
  • Pass to Claude for answer generation
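
The first two steps of that pipeline can be sketched as a small wrapper that takes the retriever and the reranker as arguments. This is a minimal sketch under stated assumptions: the wrapper name and the stub functions are hypothetical, and the stubs stand in for the real search and rerank calls so the example runs without API keys.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]


def retrieve_and_rerank(
    query: str,
    retrieve: Callable[[str, int], List[Chunk]],
    rerank: Callable[[str, List[Chunk], int], List[Chunk]],
    n_candidates: int = 20,
    top_k: int = 10,
) -> List[Chunk]:
    """Fetch a wide candidate set, then let the reranker pick the final top_k."""
    candidates = retrieve(query, n_candidates)
    return rerank(query, candidates, top_k)


# Stubs for demonstration only: a fake corpus and a keyword-overlap "reranker".
corpus = [{"content": f"chunk {i} about topic {i % 3}"} for i in range(30)]


def stub_retrieve(query: str, k: int) -> List[Chunk]:
    return corpus[:k]


def stub_rerank(query: str, candidates: List[Chunk], top_k: int) -> List[Chunk]:
    query_words = set(query.split())
    ranked = sorted(
        candidates,
        key=lambda c: len(query_words & set(c["content"].split())),
        reverse=True,
    )
    return ranked[:top_k]


final_chunks = retrieve_and_rerank("topic 2", stub_retrieve, stub_rerank)
print(len(final_chunks), final_chunks[0]["content"])
```

With the functions from this guide, the call would look like `retrieve_and_rerank(query, hybrid_search, rerank)`, since both accept their arguments in this order.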

AWS Bedrock Implementation

For AWS customers, Anthropic's team has provided a Lambda function that implements Contextual Retrieval as a custom chunking strategy for Bedrock Knowledge Bases. You'll find the code in the contextual-rag-lambda-function directory of the cookbook repository.

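# lambda_function.py (simplified)
# generate_context is assumed to be defined elsewhere in the function code
def lambda_handler(event, context):
    chunk = event['chunk']
    # Use a distinct name so the Lambda `context` argument is not overwritten
    chunk_context = generate_context(chunk)
    contextual_chunk = f"{chunk_context}\n\n{chunk['content']}"
    return {
        'chunkId': chunk['id'],
        'content': contextual_chunk,
        'metadata': chunk.get('metadata', {})
    }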

Deploy this Lambda, select it as your custom chunking option when creating a Knowledge Base, and you're set.

Key Takeaways

  • Contextual Embeddings prepend relevant context (file path, function name, description) to each chunk before embedding; in Anthropic's tests this reduced the top-20-chunk retrieval failure rate by 35%
  • Pair with Contextual BM25 for hybrid search that combines semantic understanding with exact keyword matching
  • Use prompt caching to manage the increased token costs of contextual chunks—available on Anthropic's API, coming to Bedrock and Vertex
  • Add a reranker (like Cohere) as a final precision layer to eliminate near-misses from your top-k results
  • Start simple: even basic context (just the file path) provides significant improvements over bare chunks

Contextual Retrieval is one of the highest-ROI improvements you can make to a RAG system. The implementation is straightforward, the costs are manageable with prompt caching, and the performance gains are substantial. Your Claude-powered applications will thank you.