Claude Guide · Beginner · 2026-05-06

Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI

Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25 with Claude AI. Includes code examples, evaluation metrics, and production tips.

Quick Answer

This guide walks you through implementing Contextual Retrieval—adding relevant context to each document chunk before embedding—to reduce retrieval failure rates by up to 35% with Claude AI. You'll learn setup, evaluation, and production deployment tips.

RAG · Contextual Embeddings · Claude AI · Retrieval Augmented Generation · Prompt Caching


Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to answer questions based on your internal knowledge bases, codebases, or document repositories. However, traditional RAG systems often stumble when individual document chunks lack sufficient context—a problem that leads to missed retrievals and incomplete answers.

In this guide, we'll introduce Contextual Retrieval, a technique that dramatically improves retrieval accuracy by adding relevant context to each chunk before embedding. Based on Anthropic's research, this method reduces the top-20-chunk retrieval failure rate by an average of 35% across diverse datasets. We'll walk through implementation using Python, Claude, and supporting APIs, with practical code examples and evaluation metrics.

What You'll Learn

  • How to set up a baseline RAG pipeline for evaluation
  • The theory behind Contextual Embeddings and why they work
  • Step-by-step implementation of Contextual Embeddings with prompt caching
  • How to extend the technique to BM25 search (Contextual BM25)
  • How to further boost performance with reranking

Prerequisites

Technical Skills:
  • Intermediate Python programming
  • Basic understanding of RAG concepts
  • Familiarity with vector databases and embeddings
System Requirements:
  • Python 3.8+
  • Docker (optional, for BM25 search)
  • 4GB+ RAM, ~5-10 GB disk space
API Keys:
  • Anthropic (Claude), Voyage AI (embeddings), Cohere (reranking)
Time & Cost:
  • Setup: 30–45 minutes
  • API costs: ~$5–10 for the full dataset

1. Setting Up the Baseline RAG Pipeline

Before improving retrieval, we need a baseline. We'll use a pre-chunked dataset of 9 codebases (248 queries with golden chunks) and evaluate using Pass@k—whether the correct chunk appears in the top-k retrieved results.
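Pass@k itself is simple to compute; the helper below is a minimal sketch (the exact field names in your evaluation set may differ from the ones assumed here):

```python
def pass_at_k(retrieved_ids, golden_id, k):
    """True if the golden chunk ID appears among the top-k retrieved IDs."""
    return golden_id in retrieved_ids[:k]

def evaluate_pass_at_k(runs, k):
    """runs: list of (retrieved_ids, golden_id) pairs, one per query."""
    hits = sum(pass_at_k(ids, gold, k) for ids, gold in runs)
    return hits / len(runs)
```

We'll reuse this pattern for every configuration in this guide, so improvements are always measured the same way.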

Install Dependencies

pip install anthropic voyageai cohere

Load and Prepare Data

import json

# Load chunks and evaluation set
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    eval_queries = [json.loads(line) for line in f]

Create Embeddings and Vector Store

import voyageai
import numpy as np

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

# Embed all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple vector index (use FAISS or Chroma for production)
embedding_matrix = np.array(embeddings)
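For evaluation purposes, the "simple vector index" above can be a small in-memory class like the following sketch; for production scale you would swap in FAISS, Chroma, or a hosted vector database:

```python
import numpy as np

class SimpleVectorStore:
    """Exact dot-product search over an in-memory embedding matrix."""

    def __init__(self, embeddings):
        self.matrix = np.asarray(embeddings, dtype=np.float32)

    def search(self, query_embedding, k=10):
        # Dot product serves as a similarity score when embeddings are normalized
        scores = self.matrix @ np.asarray(query_embedding, dtype=np.float32)
        top_k = np.argsort(scores)[-k:][::-1]
        return top_k.tolist(), scores[top_k].tolist()
```

Exact search is fine at this dataset's size; approximate indexes only pay off at millions of vectors.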

Evaluate Baseline Performance

def search(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(embedding_matrix, query_emb)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]

# Pass@10 evaluation
pass_at_10 = 0
for q in eval_queries:
    results = search(q['query'], k=10)
    golden_id = q['golden_chunk_id']
    if any(r['id'] == golden_id for r in results):
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.2%}")

Typical baseline: ~87% Pass@10.

2. Understanding Contextual Embeddings

The core problem: when you split a document into chunks, each chunk loses its surrounding context. A code snippet like def calculate_total(): might be meaningless without knowing it belongs to an invoice processing module.

Contextual Embeddings solve this by prepending a short context description to each chunk before embedding. For example:
  • Original chunk: "def calculate_total(): return subtotal + tax"
  • Contextual chunk: "This function is part of the Invoice class in the billing module. It calculates the total amount including tax. Code: def calculate_total(): return subtotal + tax"
This enriched chunk produces a more meaningful embedding, leading to better retrieval.

Why It Works

  • Semantic enrichment: The embedding captures both the chunk's content and its role in the larger document.
  • Disambiguation: Similar chunks from different contexts become distinguishable.
  • Improved matching: Queries that reference the broader topic now match more accurately.
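A toy lexical-overlap check illustrates the disambiguation point. This uses plain token overlap rather than real embeddings, purely to show how added context makes two identical chunks distinguishable:

```python
def token_overlap(a, b):
    """Jaccard overlap between lowercase token sets -- a crude relevance proxy."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

chunk = "def calculate_total(): return subtotal + tax"
billing_chunk = "part of the Invoice class in the billing module " + chunk
cart_chunk = "part of the Cart class in the checkout module " + chunk
query = "how does the billing module compute invoice totals"

# The billing-contextualized copy overlaps the query more than the cart copy,
# even though the underlying code chunk is identical in both.
assert token_overlap(query, billing_chunk) > token_overlap(query, cart_chunk)
```

Real embeddings capture far richer signals than token overlap, but the mechanism is the same: the context tokens are what let the retriever tell the two copies apart.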

Managing Costs with Prompt Caching

Generating context for every chunk with Claude could be expensive. Prompt caching (available on Anthropic's first-party API) reduces costs by reusing the full document context across multiple chunk requests. The system prompt and document are cached once, and only the chunk-specific instruction changes.

import anthropic

client = anthropic.Anthropic(api_key="YOUR_API_KEY")

# Full document text
document = """... (entire document content) ..."""

# Generate context for each chunk using prompt caching.
# The document goes in a cached system block so it is processed once;
# only the chunk-specific user message changes per request.
contexts = []
for chunk in chunks:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": "You are a document context generator. Given a document and a chunk, provide a brief context (1-2 sentences) describing the chunk's role."
            },
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": f"Chunk: {chunk['content']}\n\nProvide context:"}
        ]
    )
    contexts.append(response.content[0].text)
Note: Prompt caching is available on Anthropic's API and coming soon to AWS Bedrock and GCP Vertex. For Bedrock, AWS provides a Lambda function for custom chunking (see contextual-rag-lambda-function/lambda_function.py in the cookbook).

3. Implementing Contextual Embeddings

Now let's implement the full pipeline.

Step 1: Generate Context for Each Chunk

def generate_context(document, chunk, client):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": "Provide a concise context (1-2 sentences) describing this chunk's role in the document. Focus on what the chunk does and where it fits."
            },
            {
                # Cache the full document so repeated calls reuse it
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": f"Chunk: {chunk}\n\nContext:"}
        ]
    )
    return response.content[0].text

Step 2: Create Contextual Embeddings

contextual_chunks = []
for chunk in chunks:
    context = generate_context(document, chunk['content'], client)
    contextual_text = f"{context}\n\n{chunk['content']}"
    contextual_chunks.append(contextual_text)

# Embed contextual chunks
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings

Step 3: Evaluate

# Re-run evaluation with contextual embeddings
contextual_matrix = np.array(contextual_embeddings)

def contextual_search(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(contextual_matrix, query_emb)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]

# Evaluate Pass@10
pass_at_10_contextual = 0
for q in eval_queries:
    results = contextual_search(q['query'], k=10)
    if any(r['id'] == q['golden_chunk_id'] for r in results):
        pass_at_10_contextual += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10_contextual / len(eval_queries):.2%}")

Expected improvement: ~87% → ~95% Pass@10.

4. Contextual BM25: Hybrid Search

Contextual retrieval isn't limited to embeddings. You can apply the same context to BM25 (a keyword-based search algorithm) and combine it with embeddings for hybrid search.

Implementing Contextual BM25

from rank_bm25 import BM25Okapi

# Tokenize contextual chunks
tokenized_contextual = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_contextual)

def bm25_search(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]

Hybrid Search (Embeddings + BM25)

def hybrid_search(query, k=10, alpha=0.5):
    # Get scores from both methods
    emb_scores = np.dot(contextual_matrix, vo.embed([query], model="voyage-2").embeddings[0])
    bm25_scores = bm25.get_scores(query.split())

    # Normalize both score sets to [0, 1]
    emb_scores = (emb_scores - emb_scores.min()) / (emb_scores.max() - emb_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())

    # Weighted combination
    combined = alpha * emb_scores + (1 - alpha) * bm25_scores
    top_k_indices = np.argsort(combined)[-k:][::-1]
    return [chunks[i] for i in top_k_indices]

Hybrid search often yields the best results, capturing both semantic and keyword matches.
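An alternative to min-max score blending is Reciprocal Rank Fusion (RRF), which combines ranked lists directly and so avoids the question of whether embedding and BM25 scores are on comparable scales. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk indices (best first) into one.

    Each item earns 1 / (k + rank + 1) per list it appears in; k=60 is the
    damping constant commonly used with RRF.
    """
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

With this, hybrid retrieval becomes `reciprocal_rank_fusion([emb_ranking, bm25_ranking])[:k]`, where each ranking is the ordered list of chunk indices produced by one retriever.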

5. Improving Performance with Reranking

Reranking adds a final layer of precision. After retrieving top-k candidates, a cross-encoder model re-scores them based on deeper semantic relevance to the query.

import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

def rerank(query, candidates, top_n=5):
    # Prepare documents for reranking
    docs = [c['content'] for c in candidates]
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n
    )
    return [candidates[r.index] for r in results.results]

# Use with hybrid search
initial_results = hybrid_search(query, k=20)
final_results = rerank(query, initial_results, top_n=5)

Reranking typically adds 2–5% to Pass@k and significantly improves user-perceived relevance.

Production Considerations

  • Prompt Caching: Essential for cost-effective context generation at scale. Cache the full document and system prompt; only vary the chunk.
  • AWS Bedrock Integration: Use the provided Lambda function (contextual-rag-lambda-function/lambda_function.py) as a custom chunking option in Bedrock Knowledge Bases.
  • Vector Database Choice: For production, use FAISS, Pinecone, or Weaviate with proper indexing.
  • Batch Processing: Generate contexts in batches to reduce API calls.
  • Monitoring: Track Pass@k over time to detect drift.
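For the monitoring bullet, even a crude check over logged Pass@k values can catch regressions early. The sketch below compares a recent window against the initial baseline; the window size and tolerance are illustrative, not recommendations:

```python
def detect_drift(history, window=5, tolerance=0.02):
    """Flag when the recent mean Pass@k falls more than `tolerance`
    below the baseline mean from the first `window` measurements."""
    if len(history) < 2 * window:
        return False  # not enough data to compare yet
    baseline = sum(history[:window]) / window
    recent = sum(history[-window:]) / window
    return baseline - recent > tolerance
```

In practice you would tie this to your document-ingestion pipeline, re-running the evaluation set on a schedule and alerting when the check fires.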

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by an average of 35% by enriching chunks with surrounding document context before embedding.
  • Prompt caching makes this technique cost-effective for production by caching the document and system prompt across multiple chunk requests.
  • Contextual BM25 extends the same idea to keyword search, and hybrid search (embeddings + BM25) often yields the best results.
  • Reranking adds a final precision layer, improving user-perceived relevance by 2–5%.
  • The technique works across platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex—making it accessible for enterprise deployments.