Guide · 2026-05-05

Mastering Contextual Retrieval: How to Supercharge RAG with Claude and Contextual Embeddings

Learn how to implement Contextual Retrieval with Claude AI to reduce retrieval failure rates by 35%. A step-by-step guide with code examples for production-ready RAG systems.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds relevant context to document chunks before embedding—to dramatically improve RAG accuracy. You'll learn to reduce retrieval failure rates by 35% using Claude, Voyage AI, and BM25 search.

Tags: RAG · Contextual Embeddings · Claude AI · Retrieval Augmented Generation · Prompt Caching


Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support chatbots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A chunk containing "the revenue increased by 20%" is useless if the system doesn't know which company or quarter it refers to.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets. In this guide, you'll learn how to implement this technique using Claude, Voyage AI embeddings, and BM25 search—with practical code you can adapt for production.
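
To make the idea concrete, here is a before/after view of a single chunk. The company name and the generated context are illustrative placeholders, not actual model output:

Original chunk (embedded as-is in traditional RAG):
    "The revenue increased by 20% compared to the previous period."

Contextualized chunk (context prepended before embedding):
    "This chunk is from ACME Corp's Q2 2023 filing and discusses quarterly revenue performance.
    The revenue increased by 20% compared to the previous period."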

What You'll Build

By the end of this guide, you'll have built a complete Contextual Retrieval pipeline that:

  • Uses Claude to generate context for each document chunk
  • Embeds chunks with their context for more accurate vector search
  • Combines contextual embeddings with contextual BM25 for hybrid search
  • Optionally adds a reranking step for maximum precision

Prerequisites

Technical Skills:
  • Intermediate Python
  • Basic understanding of RAG and vector databases
  • Command-line proficiency
System Requirements:
  • Python 3.8+
  • Docker (optional, for BM25)
  • 4GB+ RAM, 5-10 GB disk space
API Keys:
  • Anthropic (Claude), Voyage AI, and Cohere (optional, for reranking)
Time & Cost:
  • ~30-45 minutes to complete
  • ~$5-10 in API costs for the full dataset

1. Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere rank_bm25 scikit-learn numpy pandas

Initialize your clients:

import json

import anthropic
import voyageai

# Initialize API clients
claude = anthropic.Anthropic(api_key="your-anthropic-key")
vo = voyageai.Client(api_key="your-voyage-key")

# Load your dataset (example structure)
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

with open("data/evaluation_set.jsonl", "r") as f:
    eval_queries = [json.loads(line) for line in f]

2. The Problem: Contextless Chunks

In traditional RAG, you split documents into chunks and embed each chunk independently. Consider this chunk from a codebase:

def calculate_metrics(data):
    return precision_score(data.y_true, data.y_pred)

Without context, the embedding doesn't capture that this function is part of a classification evaluation module. A query like "How do I evaluate my classifier?" might miss this chunk entirely.

Contextual Embeddings fix this by asking Claude to generate a concise context for each chunk:

def generate_chunk_context(chunk_text, full_document):
    """Use Claude to generate context for a chunk."""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""<document>
{full_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context."""
        }]
    )
    return response.content[0].text
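
A quick sanity check on a single chunk. The file path and the printed context shown in the comment are placeholders; the real output is whatever Claude returns:

# Hypothetical single-chunk example; sample_module.py is a placeholder file
sample_doc = open("data/sample_module.py").read()
sample_chunk = "def calculate_metrics(data):\n    return precision_score(data.y_true, data.y_pred)"

context = generate_chunk_context(sample_chunk, sample_doc)
print(context)
# e.g. "Defines the metric-calculation helper used by the classification evaluation module."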

3. Implementing Contextual Embeddings

Now, let's build the full pipeline. We'll process each chunk, generate its context, and embed the combined text:

import numpy as np
from typing import List, Dict

def build_contextual_embeddings(chunks: List[Dict], documents: Dict[str, str]) -> np.ndarray:
    """Generate contextual embeddings for all chunks."""
    contextual_chunks = []
    for chunk in chunks:
        doc_id = chunk["doc_id"]
        full_doc = documents[doc_id]
        # Generate context using Claude
        context = generate_chunk_context(chunk["text"], full_doc)
        # Prepend context to chunk
        contextual_text = f"{context}\n\n{chunk['text']}"
        contextual_chunks.append(contextual_text)

    # Batch embed with Voyage AI
    embeddings = vo.embed(
        contextual_chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    return np.array(embeddings)
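
To call it, you also need the full document text keyed by doc_id. A minimal sketch, assuming a separate documents file and a "content" field (both the file name and field name are assumptions):

# Build a doc_id -> full text lookup
with open("data/codebase_docs.json", "r") as f:
    documents = {doc["doc_id"]: doc["content"] for doc in json.load(f)}

chunk_embeddings = build_contextual_embeddings(chunks, documents)
print(chunk_embeddings.shape)  # (num_chunks, embedding_dim)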

Performance Results

On a dataset of 9 codebases with 248 queries, Contextual Embeddings improved Pass@10 from roughly 87% to roughly 95%. Across Anthropic's broader evaluation on diverse datasets, the technique reduces retrieval failure rates by 35%.
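
If you want to reproduce a Pass@k number on your own evaluation set, here is a minimal sketch. It assumes each evaluation query lists the doc_ids it should retrieve (the "golden_doc_ids" field from the earlier sketch) and that retrieve_fn returns chunk indices in ranked order:

def pass_at_k(eval_queries, retrieve_fn, chunks, k=10):
    """Fraction of queries whose golden document appears among the top-k retrieved chunks."""
    hits = 0
    for q in eval_queries:
        top_indices = retrieve_fn(q["query"])[:k]
        retrieved_doc_ids = {chunks[i]["doc_id"] for i in top_indices}
        if any(gold in retrieved_doc_ids for gold in q["golden_doc_ids"]):
            hits += 1
    return hits / len(eval_queries)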

4. Managing Costs with Prompt Caching

Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing the document prefix across multiple context-generation calls:

# Use prompt caching for the full document
response = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"<document>{full_document}</document>",
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{
        "role": "user",
        "content": f"<chunk>{chunk_text}</chunk>\n\nPlease give a short succinct context..."
    }]
)
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
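
To confirm the cache is actually being hit, you can inspect the usage block on the response. The field names below match recent versions of the Anthropic Python SDK; the first call for a document writes the cache, and subsequent calls within the cache lifetime should read from it:

# Print cache statistics for a context-generation call
usage = response.usage
print("cache write tokens:", getattr(usage, "cache_creation_input_tokens", None))
print("cache read tokens:", getattr(usage, "cache_read_input_tokens", None))
print("uncached input tokens:", usage.input_tokens)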

5. Contextual BM25: Hybrid Search

The same context you generated for embeddings can also improve BM25 (keyword-based) search. This creates a powerful hybrid system:

from rank_bm25 import BM25Okapi

def build_contextual_bm25(contextual_chunks: List[str]):
    """Build BM25 index from contextual chunks."""
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    return BM25Okapi(tokenized_chunks)

# Hybrid search: combine vector and BM25 scores
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_search(query: str, vector_embeddings, bm25_index, alpha=0.5):
    """Combine vector and BM25 scores."""
    # Vector similarity
    query_embedding = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    vector_scores = cosine_similarity([query_embedding], vector_embeddings)[0]
    # BM25 scores
    bm25_scores = bm25_index.get_scores(query.split())
    # Normalize to [0, 1] and combine with a weighted sum
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    return combined.argsort()[::-1]
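
Wiring the two indexes together looks like this. Note that build_contextual_embeddings above only returns the embeddings, so in practice you will want to keep (or return) the contextual_chunks list as well; this sketch assumes you have it, and alpha is the vector-vs-keyword weight worth tuning on your own queries:

# Build the BM25 index over the same contextualized text used for the embeddings
bm25_index = build_contextual_bm25(contextual_chunks)

# Retrieve ranked chunk indices for a query
ranked_indices = hybrid_search("How do I evaluate my classifier?", chunk_embeddings, bm25_index, alpha=0.5)
top_20 = ranked_indices[:20]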

6. Adding Reranking for Maximum Precision

For production systems, add a reranking step using Cohere's rerank API:

import cohere
co = cohere.Client("your-cohere-key")

def rerank_results(query: str, candidates: List[str], top_k: int = 10):
    """Rerank retrieved chunks for maximum relevance."""
    results = co.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=top_k
    )
    return [r.document for r in results.results]

7. Putting It All Together

Here's the complete pipeline:

def contextual_rag_pipeline(query: str, chunks, documents, vector_db, bm25_index):
    # 1. Generate context for the query (optional, but helps)
    query_context = generate_chunk_context(query, "")
    contextual_query = f"{query_context}\n\n{query}"
    
    # 2. Hybrid retrieval
    top_indices = hybrid_search(contextual_query, vector_db, bm25_index)
    retrieved_chunks = [chunks[i] for i in top_indices[:20]]
    
    # 3. Rerank
    reranked = rerank_results(query, [c["text"] for c in retrieved_chunks], top_k=5)
    
    # 4. Generate answer with Claude
    context = "\n\n".join(reranked)
    response = claude.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
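
Calling the pipeline with the pieces built in the earlier sections (vector_db is just the numpy array of contextual embeddings):

answer = contextual_rag_pipeline(
    query="How do I evaluate my classifier?",
    chunks=chunks,
    documents=documents,
    vector_db=chunk_embeddings,
    bm25_index=bm25_index,
)
print(answer)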

Deployment Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context during chunking. The code is available in the contextual-rag-lambda-function directory of the cookbook repository. Configure it as a custom chunking option when creating your knowledge base.

Cost Optimization

  • Use Claude 3 Haiku for context generation (fastest, cheapest)
  • Batch your context generation calls
  • Cache generated contexts in a database for reuse (see the sketch after this list)
  • Use prompt caching to reduce token usage by up to 90%
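
A minimal sketch of the "cache generated contexts" idea above, using a local JSON file keyed by a hash of the chunk text. The file name and keying scheme are assumptions; swap in your own database:

import hashlib
import os

CACHE_PATH = "context_cache.json"  # hypothetical local cache file
_cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}

def cached_chunk_context(chunk_text, full_document):
    """Return a previously generated context for this chunk, or generate and store a new one."""
    key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = generate_chunk_context(chunk_text, full_document)
        with open(CACHE_PATH, "w") as f:
            json.dump(_cache, f)
    return _cache[key]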

Key Takeaways

  • Contextual Embeddings reduce retrieval failures by 35% by prepending relevant context to each chunk before embedding, solving the "lost context" problem in traditional RAG
  • Hybrid search with Contextual BM25 combines semantic and keyword matching for more robust retrieval—use both vector and BM25 scores
  • Prompt caching makes this practical for production by dramatically reducing the cost of generating context for thousands of chunks
  • Reranking adds a final precision boost—use Cohere's rerank API or Claude itself to reorder retrieved chunks before generation
  • Start simple, then layer complexity: Begin with Contextual Embeddings alone, then add BM25 and reranking as needed for your use case