Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI Users
Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25 with Claude AI. Includes code examples, evaluation metrics, and production tips.
This guide teaches you how to implement Contextual Retrieval (adding relevant context to document chunks before embedding) to reduce retrieval failure rates by an average of 35% and improve RAG accuracy with Claude AI.
Introduction
Retrieval Augmented Generation (RAG) is a powerful technique that enables Claude AI to answer questions using your internal knowledge bases, codebases, or any document corpus. However, traditional RAG systems often struggle when individual document chunks lack sufficient context—a problem that leads to missed retrievals and lower-quality answers.
Contextual Retrieval solves this by adding relevant context to each chunk before embedding. This simple but effective method improves the quality of each embedded chunk, allowing for more accurate retrieval and better overall performance. In tests across multiple data sources, Contextual Embeddings reduced the top-20-chunk retrieval failure rate by an average of 35%.

In this guide, you'll learn how to build and optimize a Contextual Retrieval system using Claude AI. We'll cover:
- Setting up a basic retrieval pipeline as a baseline
- Implementing Contextual Embeddings with prompt caching for cost efficiency
- Enhancing BM25 search with contextual information
- Improving results further with reranking
Prerequisites
Before starting, ensure you have:
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG concepts
- Familiarity with vector databases and embeddings
Environment:
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ available RAM
- ~5-10 GB disk space for vector databases

API Keys:
- Anthropic API key (free tier sufficient)
- Voyage AI API key
- Cohere API key (for reranking)

Time and Cost:
- Expected completion: 30-45 minutes
- API costs: ~$5-10 for the full dataset
Step 1: Setting Up the Basic RAG Pipeline
First, let's establish a baseline. We'll use a dataset of 9 codebases, pre-chunked using character splitting. The evaluation dataset contains 248 queries, each with a "golden chunk" that represents the correct answer.
import json
from typing import List, Dict

import numpy as np
import voyageai
from anthropic import Anthropic

# Load data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]

# Initialize clients
vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
claude = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Simple retrieval function
def retrieve(query: str, top_k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    # Score the query against every chunk embedding
    scores = [cosine_similarity(query_embedding, emb) for emb in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [chunks[i] for i in top_indices]
Evaluation Metric: Pass@k
We'll use Pass@k to measure performance—whether the golden chunk appears in the first k retrieved documents. Our baseline Pass@10 is approximately 87%.
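A quick way to compute this metric with the retrieve function above is sketched below. The field names query and golden_chunk_id (and the chunk's id field) are assumptions about the evaluation file's schema; adjust them to match your data.

def pass_at_k(eval_set: List[Dict], k: int = 10) -> float:
    """Fraction of queries whose golden chunk appears in the top-k retrieved chunks."""
    hits = 0
    for item in eval_set:
        # 'query', 'golden_chunk_id', and 'id' are assumed field names
        retrieved = retrieve(item['query'], top_k=k)
        if any(c['id'] == item['golden_chunk_id'] for c in retrieved):
            hits += 1
    return hits / len(eval_set)

print(f"Pass@10: {pass_at_k(eval_data, k=10):.1%}")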
Step 2: Implementing Contextual Embeddings
Contextual Embeddings add relevant context to each chunk before embedding. This context typically includes:
- The document title or source
- Surrounding chunk summaries
- Key entities or concepts from the broader document
def generate_chunk_context(chunk: Dict, full_document: str) -> str:
    """Generate context for a chunk using Claude."""
    # Truncate the document to its first 2000 characters to keep the prompt small
    prompt = f"""Given the following chunk from a codebase document, provide a brief context (2-3 sentences) that explains what this chunk is about and how it fits into the larger document.

Full document context:
{full_document[:2000]}

Chunk content:
{chunk['content']}

Context:"""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Apply to all chunks (see the prompt-caching example below for cost efficiency)
contextual_chunks = []
for chunk in chunks[:10]:  # Example: first 10 chunks
    context = generate_chunk_context(chunk, chunk.get('document', ''))
    contextual_chunks.append({
        'original': chunk,
        'context': context,
        'contextual_content': f"{context}\n\n{chunk['content']}"
    })
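With context generated, the enriched text is embedded in place of the raw chunk text; the resulting contextual_embeddings are what the hybrid search in Step 3 scores against. A minimal sketch, reusing the same Voyage client as above:

# Embed the context-enriched chunks; these vectors replace the plain-chunk embeddings
contextual_texts = [c['contextual_content'] for c in contextual_chunks]
contextual_embeddings = vo.embed(contextual_texts, model="voyage-2").embeddings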
Why Prompt Caching Matters
Generating context for every chunk individually can be expensive, because the full document is resent for each chunk. Prompt caching (available on Anthropic's first-party API) dramatically reduces costs: the document is placed in a cached system prompt once, and every subsequent chunk from that document reads it from the cache at a fraction of the normal input-token price. This makes Contextual Embeddings practical for production.
# Example with prompt caching: the full document lives in a cached system prompt,
# so only the short per-chunk user message changes between calls
system_prompt = f"<document>\n{full_document}\n</document>"  # illustrative system prompt
chunk_prompt = f"Situate this chunk within the document above:\n\n{chunk['content']}"  # illustrative per-chunk prompt

response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=150,
    system=[{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": chunk_prompt}]
)
Performance Improvement: After implementing Contextual Embeddings, our Pass@10 improved from ~87% to ~95%—a significant reduction in retrieval failures.
Step 3: Contextual BM25
BM25 is a traditional keyword-based retrieval method that complements embedding-based search. By applying the same chunk-specific context to BM25, we can further improve hybrid search performance.
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks for BM25
tokenized_contextual = [chunk['contextual_content'].split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_contextual)
# Hybrid search: combine BM25 and embedding scores
def hybrid_search(query: str, top_k: int = 10, alpha: float = 0.5) -> List[Dict]:
    # Embedding score against the contextual embeddings
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(query_embedding, emb) for emb in contextual_embeddings]
    # BM25 score over the contextual chunks
    bm25_scores = bm25.get_scores(query.split())
    # Normalize both score lists and take a weighted combination
    combined = [
        alpha * (emb_scores[i] / max(emb_scores)) +
        (1 - alpha) * (bm25_scores[i] / max(bm25_scores))
        for i in range(len(contextual_chunks))
    ]
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:top_k]
    # Return the original chunk dicts for the best-scoring contextual chunks
    return [contextual_chunks[i]['original'] for i in top_indices]
Step 4: Reranking for Final Precision
Reranking adds a final layer of accuracy by using a cross-encoder model to reorder the top-k results. This step is especially useful when you need high precision.
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    # Cross-encoder reranking of the candidate chunks
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c['content'] for c in candidates],
        top_n=top_k
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How does the authentication module handle JWT tokens?"
initial_results = hybrid_search(query, top_k=20)
final_results = rerank(query, initial_results, top_k=5)
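To close the loop from retrieval to generation, the reranked chunks can be passed to Claude as context for answering the query. A minimal sketch, reusing the Anthropic client from Step 1:

# Assemble the retrieved chunks into a single context block
context_block = "\n\n".join(c['content'] for c in final_results)

answer = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context_block}\n\n"
            f"Question: {query}"
        )
    }]
)
print(answer.content[0].text)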
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can implement Contextual Retrieval as a custom Lambda function for chunking. The AWS team provides a reference implementation in the contextual-rag-lambda-function directory.
# Lambda function skeleton (from contextual-rag-lambda-function/lambda_function.py)
def lambda_handler(event, context):
    # Extract document chunks from event
    chunks = event['chunks']
    # Generate context for each chunk using Claude
    # (generate_context wraps a Claude call like generate_chunk_context above)
    contextual_chunks = []
    for chunk in chunks:
        context = generate_context(chunk, event['document'])
        contextual_chunks.append({
            **chunk,
            'content': f"{context}\n\n{chunk['content']}"
        })
    return {'chunks': contextual_chunks}
Cost Optimization
- Prompt caching reduces context generation costs by up to 90%
- Batch chunks from the same document together so the cached document prefix is reused (see the sketch after this list)
- Use Claude Haiku for context generation (fastest/cheapest model)
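The batching pattern is sketched below: chunks are grouped by their source document so the document placed in the cached system prompt is written to the cache once, then read cheaply for every remaining chunk of that document. As in the earlier examples, this assumes each chunk carries its full source text in a document field; note also that very short documents may fall below Anthropic's minimum cacheable prompt length.

from collections import defaultdict

# Group chunks by their source document so the cached prefix is reused
chunks_by_doc = defaultdict(list)
for chunk in chunks:
    chunks_by_doc[chunk.get('document', '')].append(chunk)

for document, doc_chunks in chunks_by_doc.items():
    # The full document goes into the cached system prompt once per document
    cached_system = [{
        "type": "text",
        "text": f"<document>\n{document}\n</document>",
        "cache_control": {"type": "ephemeral"}
    }]
    for chunk in doc_chunks:
        # Only this short per-chunk message changes between calls
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            system=cached_system,
            messages=[{
                "role": "user",
                "content": f"Give a short context situating this chunk within the document:\n\n{chunk['content']}"
            }]
        )
        chunk['context'] = response.content[0].text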
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% on average by adding relevant context to each chunk before embedding, significantly improving RAG accuracy.
- Prompt caching makes Contextual Retrieval cost-effective for production by caching the full document once and reusing it across that document's chunks, reducing API costs by up to 90%.
- Hybrid search with Contextual BM25 combines the strengths of semantic and keyword-based retrieval, further improving performance over embeddings alone.
- Reranking adds final precision to your retrieval pipeline, ensuring the most relevant results appear at the top.
- AWS Bedrock Knowledge Bases support custom chunking via a Lambda function, making Contextual Retrieval deployable in enterprise environments.