Mastering Contextual Retrieval: How to Supercharge RAG with Claude and Contextual Embeddings
This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to dramatically improve RAG accuracy. You'll learn to reduce retrieval failure rates by 35% using Claude, Voyage AI, and BM25 search.
Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support chatbots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A chunk containing "the revenue increased by 20%" is useless if the system doesn't know which company or quarter it refers to.
Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets. In this guide, you'll learn how to implement this technique using Claude, Voyage AI embeddings, and BM25 search—with practical code you can adapt for production.
What You'll Build
By the end of this guide, you'll have built a complete Contextual Retrieval pipeline that:
- Uses Claude to generate context for each document chunk
- Embeds chunks with their context for more accurate vector search
- Combines contextual embeddings with contextual BM25 for hybrid search
- Optionally adds a reranking step for maximum precision
Prerequisites
Technical Skills:
- Intermediate Python
- Basic understanding of RAG and vector databases
- Command-line proficiency
Software and Hardware:
- Python 3.8+
- Docker (optional, for BM25)
- 4GB+ RAM, 5-10 GB disk space
API Keys:
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)
Time and Cost:
- ~30-45 minutes to complete
- ~$5-10 in API costs for the full dataset
1. Setting Up Your Environment
First, install the required libraries:
```shell
pip install anthropic voyageai cohere numpy pandas rank_bm25 scikit-learn
```
Initialize your clients and load the dataset:
```python
import json

import anthropic
import voyageai

# Initialize API clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")

# Load your dataset (example structure)
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

with open("data/evaluation_set.jsonl", "r") as f:
    eval_queries = [json.loads(line) for line in f]
```
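The loading code assumes each chunk records which document it came from and each evaluation query names the chunks a correct retrieval should return. The exact schema lives in the cookbook's data files; here is a hypothetical sketch of the shapes (field names are illustrative, so check them against your actual files):

```python
# Hypothetical sketch of the data shapes the loading code assumes.
example_chunks = [
    {
        "doc_id": "doc_001",            # which document the chunk came from
        "chunk_id": "doc_001_chunk_0",  # unique chunk identifier
        "text": "def calculate_metrics(data): ...",
    },
]

example_queries = [
    {
        "query": "How do I evaluate my classifier?",
        # chunks a correct retrieval should return
        "golden_chunk_ids": ["doc_001_chunk_0"],
    },
]

# Every chunk's doc_id should resolve to a full document,
# because context generation needs the whole document as input
example_documents = {"doc_001": "full source file contents ..."}
assert all(c["doc_id"] in example_documents for c in example_chunks)
```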
2. The Problem: Contextless Chunks
In traditional RAG, you split documents into chunks and embed each chunk independently. Consider this chunk from a codebase:
```python
def calculate_metrics(data):
    return precision_score(data.y_true, data.y_pred)
```
Without context, the embedding doesn't capture that this function is part of a classification evaluation module. A query like "How do I evaluate my classifier?" might miss this chunk entirely.
Contextual Embeddings fix this by asking Claude to generate a concise context for each chunk:
```python
def generate_chunk_context(chunk_text, full_document):
    """Use Claude to generate context for a chunk."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context."""
        }]
    )
    return response.content[0].text
```
3. Implementing Contextual Embeddings
Now, let's build the full pipeline. We'll process each chunk, generate its context, and embed the combined text:
```python
import numpy as np
from typing import Dict, List

def build_contextual_embeddings(chunks: List[Dict], documents: Dict[str, str]) -> np.ndarray:
    """Generate contextual embeddings for all chunks."""
    contextual_chunks = []
    for chunk in chunks:
        doc_id = chunk["doc_id"]
        full_doc = documents[doc_id]
        # Generate context using Claude
        context = generate_chunk_context(chunk["text"], full_doc)
        # Prepend context to chunk
        contextual_text = f"{context}\n\n{chunk['text']}"
        contextual_chunks.append(contextual_text)
    # Batch embed with Voyage AI
    embeddings = vo.embed(
        contextual_chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return np.array(embeddings)
```
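With the embedding matrix built, retrieval itself is a cosine-similarity lookup: embed the query with `input_type="query"` and rank chunks by similarity. A minimal pure-NumPy sketch (the helper name is ours, not from the cookbook):

```python
import numpy as np

def top_k_chunks(query_embedding, chunk_embeddings, k=10):
    """Return indices of the k most similar chunks by cosine similarity."""
    q = np.asarray(query_embedding, dtype=float)
    m = np.asarray(chunk_embeddings, dtype=float)
    # Normalize rows so a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    # Indices of the k highest-scoring chunks, best first
    return np.argsort(scores)[::-1][:k]

# Toy example: chunk 1 points the same way as the query, so it ranks first
emb = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k_chunks([0.6, 0.8], emb, k=2))  # → [1 2]
```

In production you would swap this for a vector database, but the ranking logic is the same.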
Performance Results
On a dataset of 9 codebases with 248 queries, Contextual Embeddings improved Pass@10 from roughly 87% to 95%. The 35% figure quoted earlier is Anthropic's reported average reduction in retrieval failure rates across a broader set of datasets.
4. Managing Costs with Prompt Caching
Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing the document prefix across multiple context-generation calls:
```python
# Use prompt caching for the full document: the cached system block is
# reused across every chunk's context-generation call for that document
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>{full_document}</document>",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": f"<chunk>{chunk_text}</chunk>\n\nPlease give a short succinct context..."
    }]
)
```
Note: Prompt caching is available on Anthropic's first-party API; check the provider documentation for current availability on AWS Bedrock and GCP Vertex AI.
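To see why caching matters, here is a back-of-envelope cost sketch. The prices are function parameters, not real rates, so plug in current numbers from your provider's pricing page; the cache-write surcharge is also simplified away:

```python
def context_generation_cost(doc_tokens, chunk_tokens, n_chunks,
                            input_price, cached_price):
    """Rough input-token cost of contextualizing one document's chunks.

    Without caching, the full document is re-sent for every chunk.
    With caching, the document is sent once and read from cache afterwards.
    Prices are per token and must be supplied by the caller (illustrative only).
    """
    uncached = n_chunks * (doc_tokens + chunk_tokens) * input_price
    cached = (doc_tokens * input_price                     # first call fills the cache (approx.)
              + (n_chunks - 1) * doc_tokens * cached_price  # later calls read it cheaply
              + n_chunks * chunk_tokens * input_price)      # chunks are always uncached
    return uncached, cached

# Hypothetical: 8k-token doc, 50 chunks of 200 tokens, cache reads at 10% of input price
uncached, cached = context_generation_cost(8000, 200, 50, 1.0, 0.1)
print(f"relative savings: {1 - cached / uncached:.0%}")  # → relative savings: 86%
```

The larger the document relative to its chunks, the bigger the win, since the repeated document prefix dominates input tokens.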
5. Contextual BM25: Hybrid Search
The same context you generated for embeddings can also improve BM25 (keyword-based) search. This creates a powerful hybrid system:
```python
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity

def build_contextual_bm25(contextual_chunks: List[str]):
    """Build a BM25 index from contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    return BM25Okapi(tokenized_chunks)

def hybrid_search(query: str, vector_embeddings, bm25_index, alpha=0.5):
    """Combine vector and BM25 scores."""
    # Vector similarity
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    vector_scores = cosine_similarity([query_embedding], vector_embeddings)[0]
    # BM25 scores
    bm25_scores = bm25_index.get_scores(query.split())
    # Min-max normalize each score set to [0, 1], then blend
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    # Chunk indices sorted by descending combined score
    return combined.argsort()[::-1]
```
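Min-max score blending is sensitive to outliers in either score distribution. Reciprocal rank fusion (RRF) is a common, scale-free alternative that combines the two result lists by rank position rather than raw score; a sketch, where the `k=60` constant is the conventional default rather than anything this guide prescribes:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple rankings (lists of chunk indices, best first) with RRF."""
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            # Each list contributes 1/(k + rank) for the items it ranks highly
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Chunk 2 sits near the top of both lists, so it wins overall
vector_ranking = [0, 2, 1, 3]
bm25_ranking = [2, 3, 0, 1]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))  # → [2, 0, 3, 1]
```

Because RRF only looks at rank positions, it needs no per-query normalization and is a reasonable drop-in replacement for the weighted-sum combination above.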
6. Adding Reranking for Maximum Precision
For production systems, add a reranking step using Cohere's rerank API:
```python
import cohere

co = cohere.Client("your-cohere-key")

def rerank_results(query: str, candidates: List[str], top_k: int = 10):
    """Rerank retrieved chunks for maximum relevance."""
    results = co.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    # Map the returned indices back to the original candidate strings
    return [candidates[r.index] for r in results.results]
```
7. Putting It All Together
Here's the complete pipeline:
```python
def contextual_rag_pipeline(query: str, chunks, documents, vector_embeddings, bm25_index):
    # 1. Hybrid retrieval over contextual embeddings + contextual BM25.
    #    Queries are searched as-is; contextualization happens at indexing time.
    top_indices = hybrid_search(query, vector_embeddings, bm25_index)
    retrieved_chunks = [chunks[i] for i in top_indices[:20]]
    # 2. Rerank the top candidates down to the most relevant few
    reranked = rerank_results(query, [c["text"] for c in retrieved_chunks], top_k=5)
    # 3. Generate an answer with Claude, grounded in the reranked chunks
    context = "\n\n".join(reranked)
    response = claude.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```
Deployment Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context during chunking. The code is available in the contextual-rag-lambda-function directory of the cookbook repository. Configure it as a custom chunking option when creating your knowledge base.
Cost Optimization
- Use Claude 3 Haiku for context generation (fastest, cheapest)
- Batch your context generation calls
- Cache generated contexts in a database for reuse
- Use prompt caching to reduce token usage by up to 90%
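The "cache generated contexts" point can be as simple as a SQLite table keyed by a hash of the document and chunk, so re-indexing never pays for the same Claude call twice. A minimal sketch (the table name, keying scheme, and helper are illustrative, not from the cookbook):

```python
import hashlib
import sqlite3

def get_or_generate_context(conn, chunk_text, full_document, generate_fn):
    """Return a cached context if present, otherwise generate and store it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contexts (key TEXT PRIMARY KEY, context TEXT)"
    )
    key = hashlib.sha256((full_document + "\x00" + chunk_text).encode()).hexdigest()
    row = conn.execute("SELECT context FROM contexts WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]  # cache hit: no API call needed
    context = generate_fn(chunk_text, full_document)  # e.g. generate_chunk_context
    conn.execute("INSERT INTO contexts VALUES (?, ?)", (key, context))
    conn.commit()
    return context

# Demo with a stub generator that records how often it actually runs
calls = []
def stub_generate(chunk, doc):
    calls.append(chunk)
    return f"context for: {chunk}"

conn = sqlite3.connect(":memory:")
get_or_generate_context(conn, "chunk A", "doc", stub_generate)
get_or_generate_context(conn, "chunk A", "doc", stub_generate)  # served from cache
print(len(calls))  # → 1
```

Pointing `sqlite3.connect` at a file instead of `:memory:` makes the cache persist across indexing runs.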
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by prepending relevant context to each chunk before embedding, solving the "lost context" problem in traditional RAG
- Hybrid search with Contextual BM25 combines semantic and keyword matching for more robust retrieval—use both vector and BM25 scores
- Prompt caching makes this practical for production by dramatically reducing the cost of generating context for thousands of chunks
- Reranking adds a final precision boost—use Cohere's rerank API or Claude itself to reorder retrieved chunks before generation
- Start simple, then layer complexity: Begin with Contextual Embeddings alone, then add BM25 and reranking as needed for your use case