BeClaude
GuideBeginnerBest Practices2026-05-15

Mastering Contextual Retrieval: Boost RAG Accuracy with Claude and Contextual Embeddings

Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. Step-by-step guide with code examples for Contextual Embeddings and BM25.

Quick Answer

This guide teaches you how to implement Contextual Retrieval—a technique that adds chunk-specific context before embedding—to dramatically improve RAG accuracy. You'll build a pipeline using Claude, Voyage AI, and Cohere, achieving up to 35% fewer retrieval failures.

RAGContextual EmbeddingsRetrievalClaude APIPrompt Caching

Mastering Contextual Retrieval: Boost RAG Accuracy with Claude and Contextual Embeddings

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to codebase Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context. A code snippet like def calculate_total(): means nothing without knowing it belongs to an Order class in an e-commerce system.

Contextual Retrieval solves this. By prepending chunk-specific context before embedding, you dramatically improve retrieval accuracy. In Anthropic's testing across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%.

In this guide, you'll learn how to build a Contextual Retrieval system using Claude, Voyage AI embeddings, and Cohere reranking. We'll walk through the full pipeline—from basic RAG to production-ready Contextual Embeddings with BM25 hybrid search.

What You'll Need

Prerequisites

  • Intermediate Python skills
  • Basic understanding of RAG and vector databases
  • Docker installed (optional, for BM25)

API Keys & Costs

1. Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere numpy pandas

Load your API keys and prepare the dataset. We'll use Anthropic's pre-chunked codebase dataset (9 codebases, 248 queries with golden chunks):

import json
import os
from anthropic import Anthropic
import voyageai

Initialize clients

anthropic = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"]) vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

Load data

with open("data/codebase_chunks.json") as f: chunks = json.load(f)

with open("data/evaluation_set.jsonl") as f: eval_queries = [json.loads(line) for line in f]

2. Building a Basic RAG Baseline

Before improving retrieval, establish a baseline. We'll use Pass@k as our metric—does the golden chunk appear in the top-k retrieved results?

def basic_rag_retrieve(query, chunks, top_k=10):
    # Generate embedding for query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    
    # Compute cosine similarity with all chunk embeddings
    # (Assume chunks have pre-computed embeddings)
    scores = []
    for chunk in chunks:
        chunk_emb = chunk["embedding"]
        similarity = cosine_similarity(query_embedding, chunk_emb)
        scores.append((similarity, chunk))
    
    # Return top-k
    scores.sort(reverse=True, key=lambda x: x[0])
    return [s[1] for s in scores[:top_k]]

def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Evaluate Pass@10:

pass_at_10 = 0
for query in eval_queries:
    results = basic_rag_retrieve(query["query"], chunks)
    golden_id = query["golden_chunk_id"]
    if any(r["id"] == golden_id for r in results):
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.1%}")

Output: ~87%

3. Implementing Contextual Embeddings

The core idea is simple: before embedding each chunk, prepend context that explains what the chunk is about.

How It Works

For each chunk, you ask Claude to generate a concise context (1–2 sentences) that situates the chunk within its parent document. This context is then prepended to the chunk text before embedding.

def generate_chunk_context(chunk_text, full_document):
    """Use Claude to generate context for a single chunk."""
    prompt = f"""<document>
{full_document}
</document>

Here is the chunk we want to situate within the whole document: <chunk> {chunk_text} </chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string.""" response = anthropic.messages.create( model="claude-3-haiku-20240307", max_tokens=100, messages=[{"role": "user", "content": prompt}] ) return response.content[0].text

Making It Production-Ready with Prompt Caching

Generating context for every chunk individually would be expensive. Prompt caching makes this practical by caching the full document prompt:

def generate_context_with_caching(chunk_text, full_document, document_id):
    """Use prompt caching to reduce costs."""
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[{
            "type": "text",
            "text": f"You are helping to generate context for chunks of this document: {full_document}",
            "cache_control": {"type": "ephemeral"}  # Cache the document
        }],
        messages=[{
            "role": "user",
            "content": f"Generate context for this chunk: {chunk_text}"
        }]
    )
    return response.content[0].text
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.

Embedding with Context

Once you have context for each chunk, prepend it before embedding:

def embed_with_context(chunks_with_context):
    contextual_texts = [
        f"{c['context']}\n\n{c['text']}" 
        for c in chunks_with_context
    ]
    embeddings = vo.embed(contextual_texts, model="voyage-2").embeddings
    return embeddings

Performance Results

After implementing Contextual Embeddings on our codebase dataset:

MetricBasic RAGContextual Embeddings
Pass@1087%95%
Failure rate reduction35%

4. Contextual BM25: Hybrid Search

Contextual Embeddings work with dense vectors. But you can also apply the same context to BM25 (a keyword-based retrieval method) for even better results.

Why BM25 + Context?

BM25 excels at exact keyword matching. By adding context to chunks before indexing, you give BM25 more relevant terms to match against queries.

# Install BM25 (requires Docker for production, or use rank-bm25 library)

pip install rank-bm25

from rank_bm25 import BM25Okapi

def build_contextual_bm25_index(chunks_with_context): # Tokenize contextualized chunks tokenized_corpus = [ f"{c['context']} {c['text']}".split() for c in chunks_with_context ] return BM25Okapi(tokenized_corpus)

Hybrid search: combine BM25 and embedding scores

bm25 = build_contextual_bm25_index(chunks_with_context)

def hybrid_search(query, chunks, bm25, alpha=0.5): # Get BM25 scores bm25_scores = bm25.get_scores(query.split()) # Get embedding scores (normalized) query_emb = vo.embed([query]).embeddings[0] emb_scores = [ cosine_similarity(query_emb, c["embedding"]) for c in chunks ] # Combine scores combined = [ alpha bm25_scores[i] + (1 - alpha) emb_scores[i] for i in range(len(chunks)) ] # Return top-k top_indices = np.argsort(combined)[-10:][::-1] return [chunks[i] for i in top_indices]

5. Improving with Reranking

For maximum accuracy, add a reranking step using Cohere's rerank API:

import cohere
co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank_results(query, candidates, top_k=5): # Prepare documents for reranking docs = [c["text"] for c in candidates] # Rerank results = co.rerank( query=query, documents=docs, top_n=top_k, model="rerank-english-v2.0" ) # Map back to original chunks return [candidates[r.index] for r in results.results]

Full pipeline

def contextual_rag_pipeline(query): # Step 1: Hybrid retrieval (top 20) candidates = hybrid_search(query, chunks, bm25, alpha=0.3) # Step 2: Rerank (top 5) top_results = rerank_results(query, candidates, top_k=5) # Step 3: Generate answer with Claude context = "\n\n".join([r["text"] for r in top_results]) response = anthropic.messages.create( model="claude-3-sonnet-20240229", max_tokens=500, messages=[{ "role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}" }] ) return response.content[0].text

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The code is available in the contextual-rag-lambda-function directory of the cookbook repository.

Cost Optimization

  • Prompt caching reduces context generation costs by ~50%
  • Use Claude 3 Haiku for context generation (fastest/cheapest)
  • Batch process chunks per document to maximize cache hits
  • Consider using Claude 3.5 Sonnet only for the final answer generation

Key Takeaways

  • Contextual Embeddings reduce retrieval failures by 35% by prepending chunk-specific context before embedding, solving the "lost-in-the-middle" problem for RAG systems.
  • Combine Contextual Embeddings with BM25 for hybrid search that leverages both semantic and keyword matching, further improving accuracy.
  • Prompt caching makes this practical at scale by caching the parent document, reducing API costs by approximately 50% for context generation.
  • Reranking adds a final accuracy boost—using Cohere's rerank API on your top-20 results can push Pass@k performance even higher.
  • Production-ready on major cloud platforms—the technique works with AWS Bedrock Knowledge Bases (via Lambda custom chunking) and GCP Vertex AI.