Guide | Beginner | Best Practices | 2026-05-15

Mastering Contextual Retrieval: Boost RAG Accuracy with Claude and Contextual Embeddings

Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. Step-by-step guide with code examples for Contextual Embeddings and BM25.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds chunk-specific context before embedding—to dramatically improve RAG accuracy. You'll build a pipeline using Claude, Voyage AI, and Cohere, achieving up to 35% fewer retrieval failures.

RAG, Contextual Embeddings, Retrieval, Claude API, Prompt Caching

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to codebase Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet like def calculate_total(): means nothing without knowing it belongs to an Order class in an e-commerce system.

Contextual Retrieval solves this. By prepending chunk-specific context before embedding, you dramatically improve retrieval accuracy. In Anthropic's testing across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.

In this guide, you'll learn how to build a Contextual Retrieval system using Claude, Voyage AI embeddings, and Cohere reranking. We'll walk through the full pipeline—from basic RAG to production-ready Contextual Embeddings with BM25 hybrid search.

What You'll Need

Prerequisites

Intermediate Python skills
Basic understanding of RAG and vector databases
Docker installed (optional, for BM25)

API Keys & Costs

Anthropic API key (free tier works)
Voyage AI API key
Cohere API key (for reranking)
Estimated API cost: $5–10 for the full dataset
Time: 30–45 minutes

1. Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere numpy pandas

Load your API keys and prepare the dataset. We'll use Anthropic's pre-chunked codebase dataset (9 codebases, 248 queries with golden chunks):

import json
import os

import numpy as np
import voyageai
from anthropic import Anthropic

# Initialize clients (API keys are read from environment variables)
anthropic = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

# Load the pre-chunked codebase data and the evaluation queries
with open("data/codebase_chunks.json") as f:
    chunks = json.load(f)
with open("data/evaluation_set.jsonl") as f:
    eval_queries = [json.loads(line) for line in f]

2. Building a Basic RAG Baseline

Before improving retrieval, establish a baseline. We'll use Pass@k as our metric—does the golden chunk appear in the top-k retrieved results?

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def basic_rag_retrieve(query, chunks, top_k=10):
    # Embed the query with the same model used for the chunks
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Score every chunk by cosine similarity to the query
    # (assumes each chunk carries a pre-computed "embedding")
    scores = [
        (cosine_similarity(query_embedding, chunk["embedding"]), chunk)
        for chunk in chunks
    ]

    # Return the top-k chunks, highest similarity first
    scores.sort(reverse=True, key=lambda x: x[0])
    return [chunk for _, chunk in scores[:top_k]]

Evaluate Pass@10:

pass_at_10 = 0
for query in eval_queries:
    results = basic_rag_retrieve(query["query"], chunks)
    golden_id = query["golden_chunk_id"]
    if any(r["id"] == golden_id for r in results):
        pass_at_10 += 1
print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.1%}")
# Output: ~87%
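
The same check generalizes to several cutoffs at once. The sketch below is a hypothetical helper: `pass_at_k` and the injected `retrieve_fn` are names introduced here for illustration. It assumes each query has `query` and `golden_chunk_id` fields and each chunk an `id` field, as in the loop above.

```python
def pass_at_k(eval_queries, retrieve_fn, ks=(5, 10, 20)):
    """Fraction of queries whose golden chunk appears in the top-k results.

    retrieve_fn(query_text, top_k) must return a ranked list of chunk dicts.
    """
    max_k = max(ks)
    hits = {k: 0 for k in ks}
    for q in eval_queries:
        # Retrieve once at the largest cutoff, then slice for smaller ones
        ranked_ids = [c["id"] for c in retrieve_fn(q["query"], max_k)]
        for k in ks:
            if q["golden_chunk_id"] in ranked_ids[:k]:
                hits[k] += 1
    total = len(eval_queries)
    return {k: hits[k] / total for k in ks} if total else {}


# Toy check with a fake retriever that always returns the same ranking
fake_chunks = [{"id": i} for i in range(30)]
fake_retrieve = lambda query, top_k: fake_chunks[:top_k]
toy_queries = [
    {"query": "a", "golden_chunk_id": 3},
    {"query": "b", "golden_chunk_id": 12},
]
toy_scores = pass_at_k(toy_queries, fake_retrieve)
```

With the real retriever, one way to call it would be `pass_at_k(eval_queries, lambda q, k: basic_rag_retrieve(q, chunks, top_k=k))`.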

3. Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend context that explains what the chunk is about.

How It Works

For each chunk, you ask Claude to generate a concise context (1–2 sentences) that situates the chunk within its parent document. This context is then prepended to the chunk text before embedding.

def generate_chunk_context(chunk_text, full_document):
    """Use Claude to generate context for a single chunk."""
    prompt = f"""<document>
{full_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
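
The later snippets refer to a `chunks_with_context` list. One way to build it is sketched below, under a loudly assumed schema: each document is a dict with `content` and a `chunks` list, and each chunk has `id` and `text`. The real layout of `codebase_chunks.json` may differ, and `build_chunks_with_context` is a hypothetical helper. The context generator is passed in as `context_fn`, so `generate_chunk_context` can be plugged in directly.

```python
def build_chunks_with_context(documents, context_fn):
    """Attach a generated context string to every chunk.

    documents: list of {"content": str, "chunks": [{"id": ..., "text": str}]}
               (hypothetical schema, adjust to your data)
    context_fn: callable(chunk_text, full_document) -> str
    """
    chunks_with_context = []
    for doc in documents:
        # Handle one document at a time so all of its chunks are processed together
        for chunk in doc["chunks"]:
            context = context_fn(chunk["text"], doc["content"]).strip()
            chunks_with_context.append({**chunk, "context": context})
    return chunks_with_context


# Toy run with a stand-in for the Claude call
def fake_context_fn(chunk_text, full_document):
    return f"From a document of {len(full_document)} characters."

toy_docs = [{
    "content": "class Order:\n    def calculate_total(self): ...",
    "chunks": [{"id": "order-0", "text": "def calculate_total(self): ..."}],
}]
toy_result = build_chunks_with_context(toy_docs, fake_context_fn)
```

Iterating document by document also keeps requests for the same parent document close together in time.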

Making It Production-Ready with Prompt Caching

Generating context for every chunk individually would be expensive. Prompt caching makes this practical by caching the full document prompt:

def generate_context_with_caching(chunk_text, full_document, document_id):
    """Use prompt caching to reduce costs."""
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": f"You are helping to generate context for chunks of this document: {full_document}",
            "cache_control": {"type": "ephemeral"}  # Cache the document
        }],
        messages=[{
            "role": "user",
            "content": f"Generate context for this chunk: {chunk_text}"
        }]
    )
    return response.content[0].text

Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
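
To confirm that caching is taking effect, the usage block on each response can be inspected. The Messages API reports `cache_creation_input_tokens` and `cache_read_input_tokens` alongside `input_tokens`. The helper name `summarize_cache_usage` is an assumption made for this sketch, and it is exercised here against stand-in objects, not real API responses.

```python
from types import SimpleNamespace


def summarize_cache_usage(responses):
    """Sum cache writes and reads across a list of Messages API responses."""
    totals = {"cache_writes": 0, "cache_reads": 0, "uncached_input": 0}
    for r in responses:
        usage = r.usage
        # getattr with a default guards against missing or None fields
        totals["cache_writes"] += getattr(usage, "cache_creation_input_tokens", 0) or 0
        totals["cache_reads"] += getattr(usage, "cache_read_input_tokens", 0) or 0
        totals["uncached_input"] += getattr(usage, "input_tokens", 0) or 0
    return totals


# Stand-in responses: the first call writes the cache, the second reads it
fake_responses = [
    SimpleNamespace(usage=SimpleNamespace(
        input_tokens=40, cache_creation_input_tokens=5000, cache_read_input_tokens=0)),
    SimpleNamespace(usage=SimpleNamespace(
        input_tokens=42, cache_creation_input_tokens=0, cache_read_input_tokens=5000)),
]
cache_totals = summarize_cache_usage(fake_responses)
```

If `cache_reads` stays at zero across many chunks of the same document, the document may be below the model's minimum cacheable length, or the calls may be too far apart for the cache entry to survive.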

Embedding with Context

Once you have context for each chunk, prepend it before embedding:

def embed_with_context(chunks_with_context):
    contextual_texts = [
        f"{c['context']}\n\n{c['text']}" 
        for c in chunks_with_context
    ]
    embeddings = vo.embed(contextual_texts, model="voyage-2").embeddings
    return embeddings
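
Embedding APIs usually cap the number of texts per request, so a large corpus may need to be sent in batches. The sketch below is a hypothetical helper: `embed_in_batches` and the injected `embed_fn` are names introduced here, and the default `batch_size=128` is an assumption to check against Voyage's current documented limit.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches, preserving input order.

    embed_fn: callable(list[str]) -> list of embedding vectors
    batch_size: assumed per-request limit; adjust to your provider's documented cap
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


# Toy run with a fake embedder that records the batch sizes it receives
seen_batch_sizes = []

def fake_embed(batch):
    seen_batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

toy_embeddings = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2
)
```

With the Voyage client from the setup step, `embed_fn` could be `lambda batch: vo.embed(batch, model="voyage-2").embeddings`.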

Performance Results

After implementing Contextual Embeddings on our codebase dataset:

Metric	Basic RAG	Contextual Embeddings
Pass@10 (codebase dataset)	87%	95%
Top-20 failure rate reduction (Anthropic's multi-dataset tests)	n/a	35%
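
The Pass@10 values and the 35% figure describe different measurements, which a short calculation makes visible. The sketch below defines failure rate as 1 minus the pass rate. The function name is hypothetical, and the 5.7% to 3.7% top-20 failure rates are the values Anthropic published alongside the 35% figure, so they should be checked against the original announcement.

```python
def relative_failure_reduction(pass_before, pass_after):
    """Relative drop in failure rate, where failure rate = 1 - pass rate."""
    fail_before = 1.0 - pass_before
    fail_after = 1.0 - pass_after
    if fail_before <= 0:
        raise ValueError("pass_before must be below 1.0")
    return (fail_before - fail_after) / fail_before


# Rounded Pass@10 values from the table above
pass10_reduction = relative_failure_reduction(0.87, 0.95)

# A top-20 failure rate falling from 5.7% to 3.7%
top20_reduction = relative_failure_reduction(0.943, 0.963)

print(f"{pass10_reduction:.1%}, {top20_reduction:.1%}")
```

Taking the rounded table values at face value, the Pass@10 improvement on this dataset corresponds to roughly a 62% relative drop in failures, while the top-20 numbers give roughly 35%.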

4. Contextual BM25: Hybrid Search

Contextual Embeddings work with dense vectors. But you can also apply the same context to BM25 (a keyword-based retrieval method) for even better results.

Why BM25 + Context?

BM25 excels at exact keyword matching. By adding context to chunks before indexing, you give BM25 more relevant terms to match against queries.
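
A toy illustration of that effect follows. It is not BM25 itself, only the term overlap that keyword scoring builds on. The chunk, the generated context, and the query are all made up for this sketch, and `query_term_overlap` is a hypothetical helper.

```python
def query_term_overlap(query, text):
    """Return the set of lowercase query tokens that also occur in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return query_tokens & text_tokens


raw_chunk = "def calculate_total(self): return sum(item.price for item in self.items)"
# Hypothetical context of the kind Claude might generate for this chunk
context = "This method belongs to the Order class in the e-commerce checkout module."
query = "how is the order total computed at checkout"

overlap_raw = query_term_overlap(query, raw_chunk)
overlap_contextual = query_term_overlap(query, f"{context} {raw_chunk}")
```

The raw chunk shares no whitespace-delimited tokens with the query, while the contextualized chunk shares `order`, `checkout`, and `the`. A real index would use a better tokenizer and would down-weight common words such as `the`.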

# Install BM25 (requires Docker for production, or use rank-bm25 library)
pip install rank-bm25

from rank_bm25 import BM25Okapi

def build_contextual_bm25_index(chunks_with_context):
    # Tokenize contextualized chunks (simple whitespace tokenization)
    tokenized_corpus = [
        f"{c['context']} {c['text']}".split()
        for c in chunks_with_context
    ]
    return BM25Okapi(tokenized_corpus)

# Hybrid search: combine BM25 and embedding scores
bm25 = build_contextual_bm25_index(chunks_with_context)

def hybrid_search(query, chunks, bm25, alpha=0.5, top_k=20):
    # `chunks` must be in the same order as the corpus used to build `bm25`
    # Get BM25 scores, min-max scaled to [0, 1] so they are comparable to cosine scores
    bm25_scores = np.asarray(bm25.get_scores(query.split()), dtype=float)
    score_range = bm25_scores.max() - bm25_scores.min()
    if score_range > 0:
        bm25_scores = (bm25_scores - bm25_scores.min()) / score_range

    # Get embedding scores
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    emb_scores = [cosine_similarity(query_emb, c["embedding"]) for c in chunks]

    # Combine scores: alpha weights BM25, (1 - alpha) weights embeddings
    combined = [
        alpha * bm25_scores[i] + (1 - alpha) * emb_scores[i]
        for i in range(len(chunks))
    ]

    # Return top-k, best first
    top_indices = np.argsort(combined)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]
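
Weighted score sums are sensitive to the different scales of BM25 and cosine scores. A common alternative, swapped in here purely as an illustration and not taken from this guide's pipeline, is reciprocal rank fusion (RRF), which combines rank positions instead of raw scores. The function name, the `k=60` constant, and the chunk IDs below are assumptions for this sketch.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=20):
    """Fuse several ranked lists of chunk IDs into one ranking.

    rankings: list of lists, each ordered best-first
    k: damping constant (60 is a commonly used default)
    """
    fused = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties keep first-seen order
    ordered = sorted(fused, key=fused.get, reverse=True)
    return ordered[:top_k]


bm25_ranking = ["c3", "c1", "c7"]
embedding_ranking = ["c1", "c9", "c3"]
fused_ranking = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Chunks that rank well in both lists (`c1`, `c3`) rise above chunks that appear in only one, without any score normalization.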

5. Improving with Reranking

For maximum accuracy, add a reranking step using Cohere's rerank API:

import cohere
co = cohere.Client(os.environ["COHERE_API_KEY"])
def rerank_results(query, candidates, top_k=5):
    # Prepare documents for reranking
    docs = [c["text"] for c in candidates]
    
    # Rerank
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    
    # Map back to original chunks
    return [candidates[r.index] for r in results.results]
# Full pipeline
def contextual_rag_pipeline(query):
    # Step 1: Hybrid retrieval (top 20)
    candidates = hybrid_search(query, chunks, bm25, alpha=0.3)
    
    # Step 2: Rerank (top 5)
    top_results = rerank_results(query, candidates, top_k=5)
    
    # Step 3: Generate answer with Claude
    context = "\n\n".join([r["text"] for r in top_results])
    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The code is available in the contextual-rag-lambda-function directory of the cookbook repository.

Cost Optimization

Prompt caching reduces context generation costs by ~50%
Use Claude 3 Haiku for context generation (fastest/cheapest)
Batch process chunks per document to maximize cache hits
Consider using Claude 3.5 Sonnet only for the final answer generation
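
How much caching saves depends heavily on how many chunks share a parent document. The sketch below is a rough estimate under stated assumptions: cache writes cost 1.25x and cache reads 0.10x the base input-token price (verify against current pricing), only the repeated document tokens are counted, and all calls for a document land while the cache entry is still alive. `caching_savings` is a hypothetical helper.

```python
def caching_savings(chunks_per_doc, write_multiplier=1.25, read_multiplier=0.10):
    """Estimated fraction of document-input cost saved by prompt caching.

    Counts only the repeated document tokens; chunk text and output tokens
    are ignored. Multipliers are relative to the base input-token price and
    are assumptions to verify against current pricing.
    """
    if chunks_per_doc < 1:
        raise ValueError("chunks_per_doc must be at least 1")
    without_cache = chunks_per_doc * 1.0
    # The first call writes the cache, the remaining calls read from it
    with_cache = write_multiplier + (chunks_per_doc - 1) * read_multiplier
    return 1.0 - with_cache / without_cache


savings_by_chunk_count = {n: round(caching_savings(n), 3) for n in (1, 2, 10, 50)}
```

Under these assumptions a single-chunk document costs more with caching than without, two chunks save about a third, and the saving grows toward the read discount as chunk counts rise, so the real figure for a corpus depends on its document sizes.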

Key Takeaways

Contextual Embeddings reduced retrieval failures by 35% in Anthropic's testing by prepending chunk-specific context before embedding, which restores the surrounding context that chunks lose when a document is split.
Combine Contextual Embeddings with BM25 for hybrid search that leverages both semantic and keyword matching, further improving accuracy.
Prompt caching makes this practical at scale by caching the parent document, reducing API costs by approximately 50% for context generation.
Reranking adds a final accuracy boost—using Cohere's rerank API on your top-20 results can push Pass@k performance even higher.
Production-ready on major cloud platforms—the technique works with AWS Bedrock Knowledge Bases (via Lambda custom chunking) and GCP Vertex AI.