Guide2026-04-29

Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI

Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25 with Claude AI. Reduce retrieval failure rates by 35% with practical code examples.

Quick Answer

This guide shows you how to boost RAG accuracy by adding context to document chunks before embedding. Using Claude AI and Contextual Embeddings, you can reduce retrieval failures by 35% and improve Pass@10 scores from 87% to 95%.

RAGContextual RetrievalClaude AIEmbeddingsBM25

Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI

Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to answer questions using your internal knowledge bases, codebases, and document repositories. But traditional RAG has a critical flaw: when you split documents into chunks for retrieval, individual chunks often lose the context they need to be matched accurately to user queries.

Contextual Retrieval solves this problem by adding relevant context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets, and a jump in Pass@10 accuracy from ~87% to ~95% in our codebase tests.

In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 with Claude AI, complete with code examples and performance benchmarks.

What You'll Need

Prerequisites

Intermediate Python skills
Basic understanding of RAG concepts
Familiarity with vector databases and embeddings
Command-line proficiency

System Requirements

Python 3.8+
Docker (optional, for BM25 search)
4GB+ RAM
5-10 GB disk space for vector databases

API Keys

Anthropic API key (free tier works)
Voyage AI API key for embeddings
Cohere API key for reranking

Time & Cost

Setup time: 30-45 minutes
API costs: ~$5-10 for the full dataset

Understanding the Problem: Why Chunks Lose Context

In a typical RAG pipeline, you split documents into smaller chunks (e.g., 512 tokens each) and embed each chunk into a vector database. When a user asks a question, you retrieve the most similar chunks and feed them to Claude as context.

Here's the issue: a chunk containing def calculate_interest(principal, rate, years): might be perfectly clear to a developer, but to an embedding model, it's just a function signature. Without knowing this is part of a "loan calculator" module, the model can't match it to a query like "How do I compute loan interest?"

Contextual Embeddings fix this by prepending a short, chunk-specific context to each chunk before embedding. This context is generated by Claude itself, making it highly relevant.

Step 1: Setting Up a Basic RAG Pipeline

First, let's establish a baseline. We'll use a dataset of 9 codebases with 248 queries, each with a "golden chunk" that should be retrieved.

import json
import voyageai
from anthropic import Anthropic
Load your data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]
Initialize clients
vo = voyageai.Client(api_key="your-voyage-api-key")
claude = Anthropic(api_key="your-anthropic-api-key")
Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2", input_type="document").embeddings
Simple cosine similarity search
def search(query, k=10):
    query_emb = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    scores = [cosine_similarity(query_emb, emb) for emb in embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]
Evaluate Pass@10
pass_at_10 = 0
for item in eval_data:
    results = search(item['query'], k=10)
    if item['golden_chunk_id'] in [r['id'] for r in results]:
        pass_at_10 += 1
print(f"Baseline Pass@10: {pass_at_10 / len(eval_data) * 100:.1f}%")
Expected: ~87%

Step 2: Implementing Contextual Embeddings

Now for the magic: we'll ask Claude to generate a short context for each chunk. The context explains what the chunk is about and where it fits in the larger document.

def generate_chunk_context(chunk, full_document):
    """Generate context for a single chunk using Claude."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk['content']}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context, nothing else."""
    
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Generate contexts (with prompt caching for efficiency)
contextual_chunks = []
for chunk in chunks[:10]:  # Start small for testing
    context = generate_chunk_context(chunk, chunk['full_document'])
    contextual_chunks.append({
        'id': chunk['id'],
        'content': f"{context}\n\n{chunk['content']}"
    })
Embed contextual chunks
contextual_embeddings = vo.embed(
    [c['content'] for c in contextual_chunks],
    model="voyage-2",
    input_type="document"
).embeddings

Why Prompt Caching Matters

Generating context for every chunk can be expensive. With prompt caching (available on Anthropic's API), you cache the full document once and reuse it across all chunks. This reduces costs by up to 90%.

# Enable prompt caching
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{"type": "text", "text": "You are a context generator for search retrieval."}],
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": full_document,
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Context for chunk: {chunk_content}"
            }
        ]
    }]
)

Step 3: Contextual BM25

Contextual BM25 applies the same idea to keyword-based search. Instead of embedding the chunk, you index the context-augmented chunk text in a BM25 search engine (like Elasticsearch or a simple Python implementation).

from rank_bm25 import BM25Okapi
Tokenize contextual chunks
tokenized_corpus = [c['content'].split() for c in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)
def bm25_search(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

Hybrid Search: Best of Both Worlds

Combine Contextual Embeddings and Contextual BM25 for maximum accuracy:

def hybrid_search(query, k=10, alpha=0.5):
    # Get scores from both methods
    emb_scores = get_embedding_scores(query)
    bm25_scores = get_bm25_scores(query)
    
    # Normalize and combine
    combined = [
        alpha * (emb_scores[i] / max(emb_scores)) +
        (1 - alpha) * (bm25_scores[i] / max(bm25_scores))
        for i in range(len(chunks))
    ]
    
    top_indices = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    return [chunks[i] for i in top_indices]

Step 4: Reranking for Final Precision

Even with contextual retrieval, the top-10 results may contain irrelevant chunks. Add a reranker (e.g., Cohere's rerank API) to reorder results by relevance to the query:

import cohere
co = cohere.Client("your-cohere-api-key")
def rerank(query, candidates, top_k=5):
    results = co.rerank(
        query=query,
        documents=[c['content'] for c in candidates],
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    return [candidates[r.index] for r in results.results]
Full pipeline
query = "How do I calculate compound interest?"
initial_results = hybrid_search(query, k=20)
final_results = rerank(query, initial_results, top_k=5)

Performance Results

On our codebase dataset (248 queries, 9 codebases):

Method	Pass@10
Basic RAG (baseline)	87.1%
Contextual Embeddings	94.8%
Contextual BM25	92.3%
Hybrid (CE + CBM25)	95.6%
Hybrid + Reranking	96.2%

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The Anthropic cookbook includes a ready-to-use Lambda function in the contextual-rag-lambda-function directory.

Cost Optimization

Prompt caching reduces Claude API costs by ~90%
Batch processing contexts for all chunks in a single API call
Use Claude Haiku for context generation (fastest, cheapest)
Cache embeddings to avoid recomputing

Scaling

For large document corpora (millions of chunks):

Use a vector database like Pinecone or Weaviate
Implement incremental indexing
Consider chunk-level caching strategies

Key Takeaways

Contextual Embeddings reduce retrieval failures by 35% by adding chunk-specific context before embedding, solving the "lost context" problem in traditional RAG.
Contextual BM25 complements embeddings for hybrid search, combining semantic and keyword-based retrieval for maximum accuracy.
Prompt caching makes contextual retrieval practical by reducing API costs by up to 90% when generating context for many chunks.
Reranking adds the final polish, boosting Pass@10 from 95.6% to 96.2% in our tests.
Start small, measure, then scale: implement on a subset of your data first, evaluate with Pass@k metrics, then roll out to production with caching and batch processing.