How to Build a Contextual Retrieval System with Claude: A Practical Guide
Learn to reduce RAG retrieval failure rates by 35% using Contextual Embeddings and BM25 with Claude. Step-by-step guide with code examples, evaluation metrics, and cost optimization tips.
This guide shows you how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. You'll learn to add context to document chunks before embedding, use Contextual BM25, and apply reranking—all with practical code examples and cost-saving prompt caching techniques.
Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the surrounding context, leading to poor search results and inaccurate answers.
Contextual Retrieval solves this by adding relevant context to each chunk before embedding. The result? Anthropic's testing shows a 35% reduction in top-20-chunk retrieval failure rates across multiple datasets. In this guide, you'll learn how to implement this technique with Claude, complete with code examples and performance benchmarks.
What You'll Need
Prerequisites
- Intermediate Python skills
- Basic understanding of RAG and vector databases
- Command-line proficiency
System Requirements
- Python 3.8+
- 4GB+ RAM
- ~5–10 GB disk space for vector databases
- Docker (optional, for BM25 search)
API Keys
- Anthropic API key (free tier works)
- Voyage AI API key (for embeddings)
- Cohere API key (for reranking)
Step 1: Setting Up Your Environment
First, install the required libraries:
pip install anthropic voyageai cohere numpy pandas
Then initialize your clients:
import anthropic
import voyageai
import cohere
# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client("YOUR_COHERE_KEY")
Step 2: Building a Basic RAG Baseline
Before improving retrieval, you need a baseline. We'll use a dataset of 9 codebases (248 queries with golden chunks) to measure performance.
Load and Chunk Your Documents
import json

# Load pre-chunked codebase data
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Load evaluation queries
with open("data/evaluation_set.jsonl", "r") as f:
    eval_data = [json.loads(line) for line in f]
Create Embeddings and Index
# Generate embeddings for each chunk
chunk_texts = [chunk["text"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple vector index (use FAISS or Chroma for production)
import numpy as np

index = {}
for i, emb in enumerate(embeddings):
    index[i] = {
        "text": chunk_texts[i],
        "embedding": np.array(emb)
    }
Evaluate with Pass@k
We use Pass@k—does the golden chunk appear in the top-k results? Here's how to compute it:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = []
    for i, item in index.items():
        sim = cosine_similarity(query_emb, item["embedding"])
        scores.append((i, sim))
    scores.sort(key=lambda x: x[1], reverse=True)
    return [index[idx]["text"] for idx, _ in scores[:k]]

# Evaluate
pass_at_10 = 0
for item in eval_data:
    results = retrieve(item["query"], k=10)
    if item["golden_chunk"] in results:
        pass_at_10 += 1
print(f"Baseline Pass@10: {pass_at_10/len(eval_data)*100:.1f}%")
Expect a baseline around 87%—good, but we can do better.
Step 3: Implementing Contextual Embeddings
Contextual Embeddings add surrounding context to each chunk before embedding. This prevents chunks from being retrieved out of context.
How It Works
For each chunk, you ask Claude to generate a concise context snippet that includes:
- The document title or section heading
- The preceding content summary
- The chunk's role in the overall document
Generate Context with Claude
def generate_chunk_context(chunk_text, surrounding_text, doc_title):
    prompt = f"""
Document: {doc_title}
Surrounding text: {surrounding_text}
Chunk: {chunk_text}

Provide a brief context (1-2 sentences) explaining what this chunk is about
and how it fits into the document. Focus on key entities, topics, and purpose.
"""
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Optimize with Prompt Caching
Generating context for thousands of chunks can get expensive. Prompt caching reduces costs by reusing a shared prompt prefix across calls: the prefix is written to the cache once, then read at a much lower rate on every subsequent request. Note that cached prefixes must meet a minimum length (1,024 tokens for Haiku), so in practice you cache the full document text rather than a short instruction:
def generate_context_cached(chunks_batch, doc_title, doc_text):
    # The document text goes in the system prompt with cache_control, so it
    # is cached once and read cheaply on every later call in the batch.
    system_prompt = (
        f"You are a context generator for chunks from '{doc_title}'. "
        f"For each chunk, provide a 1-2 sentence context. "
        f"Here is the full document:\n\n{doc_text}"
    )
    contexts = []
    for chunk in chunks_batch:
        response = claude.messages.create(
            model="claude-3-haiku-20240307",
            system=[{"type": "text", "text": system_prompt,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": f"Chunk: {chunk}"}],
            max_tokens=100
        )
        contexts.append(response.content[0].text)
    return contexts
Note: Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex.
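To see why caching matters at scale, here is a back-of-envelope cost estimate. The per-token rates below are illustrative placeholders, not current Anthropic pricing; substitute the published rates for your model.

```python
# Rough estimate of context-generation cost with and without prompt caching.
# Without caching, the shared document prefix is billed at the full input
# rate on every call; with caching, reads of the cached prefix are billed
# at the (much lower) cached-read rate.

def estimate_cost(num_chunks, doc_tokens, chunk_tokens, output_tokens,
                  input_rate, cached_rate, output_rate):
    """Return (uncached, cached) dollar cost for one context per chunk."""
    uncached = num_chunks * ((doc_tokens + chunk_tokens) * input_rate
                             + output_tokens * output_rate)
    cached = num_chunks * (doc_tokens * cached_rate
                           + chunk_tokens * input_rate
                           + output_tokens * output_rate)
    return uncached, cached

# Example: 1,000 chunks sharing an 8,000-token document prefix,
# with placeholder per-token rates.
uncached, cached = estimate_cost(
    num_chunks=1000, doc_tokens=8000, chunk_tokens=800, output_tokens=100,
    input_rate=0.25e-6, cached_rate=0.03e-6, output_rate=1.25e-6,
)
print(f"without caching: ${uncached:.2f}, with caching: ${cached:.2f}")
```

The savings grow with document length, since the document prefix dominates the input token count.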
Embed Contextual Chunks
contextual_chunks = []
for chunk in chunks:
    context = generate_chunk_context(
        chunk["text"],
        chunk.get("surrounding_text", ""),
        chunk.get("doc_title", "Unknown")
    )
    contextual_chunks.append(f"{context}\n\n{chunk['text']}")

# Re-embed with context
contextual_embeddings = vo.embed(contextual_chunks, model="voyage-2").embeddings
Evaluate Again
Re-run the evaluation with your new contextual embeddings. With a baseline Pass@10 around 87%, a 35% reduction in retrieval failures cuts the failure rate from roughly 13% to about 8.5%, so expect Pass@10 in the low 90s.
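Since you will repeat this evaluation after each improvement (baseline, contextual embeddings, hybrid, reranked), it helps to factor it into a reusable helper. `retrieve_fn` is any callable mapping a query to a ranked list of chunk texts, so you can plug in `retrieve`, `hybrid_search`, or a reranked pipeline:

```python
def pass_at_k(eval_data, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = sum(
        1 for item in eval_data
        if item["golden_chunk"] in retrieve_fn(item["query"])[:k]
    )
    return hits / len(eval_data)

# Toy sanity check with a fake retriever (no API calls needed):
toy_eval = [{"query": "q1", "golden_chunk": "a"},
            {"query": "q2", "golden_chunk": "z"}]
fake_retrieve = lambda q: ["a", "b", "c"]
print(pass_at_k(toy_eval, fake_retrieve, k=3))  # 0.5: only q1's chunk found
```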
Step 4: Adding Contextual BM25
BM25 is a text-based retrieval method that complements embeddings. By applying the same contextual prefix to BM25, you get Contextual BM25—a hybrid approach that captures both semantic and keyword matches.
Set Up BM25
from rank_bm25 import BM25Okapi

# Tokenize contextual chunks
tokenized_corpus = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query, k=10, alpha=0.5):
    # Vector search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    vector_scores = [cosine_similarity(query_emb, emb) for emb in contextual_embeddings]
    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize and combine
    vector_scores = np.array(vector_scores) / max(vector_scores)
    bm25_scores = bm25_scores / max(bm25_scores)
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[-k:][::-1]
    return [chunks[i]["text"] for i in top_indices]
Hybrid search typically yields another 2–5% improvement over embeddings alone.
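The weighted blend above requires normalizing two very different score scales. Reciprocal rank fusion (RRF) is a common alternative that sidesteps normalization entirely by combining ranks instead of raw scores; a minimal sketch, with the conventional constant k=60:

```python
def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of doc ids via reciprocal rank fusion.

    `rankings` is a list of lists of doc ids, best-first. Each list
    contributes 1 / (k + rank) to a document's fused score, so documents
    ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: doc 2 ranks first in both lists, so it fuses to the top.
print(rrf_fuse([[2, 0, 1], [2, 1, 0]])[:1])  # [2]
```

RRF is robust when one retriever's scores are poorly calibrated, at the cost of discarding score magnitude information.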
Step 5: Reranking for Final Precision
Reranking reorders your top-k results using a more powerful model. Cohere's rerank API works well here:
def rerank(query, results, k=5):
    reranked = co.rerank(
        query=query,
        documents=results,
        top_n=k,
        model="rerank-english-v2.0"
    )
    # The rerank response returns indices into the input list, ordered by
    # relevance; map them back to the original documents.
    return [results[r.index] for r in reranked.results]

# Use in pipeline
def final_retrieve(query):
    initial_results = hybrid_search(query, k=20)  # Get more candidates
    return rerank(query, initial_results, k=5)    # Rerank to top 5
Reranking can push Pass@5 above 97% in many cases.
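With retrieval in place, the final step is handing the top chunks to Claude as grounding context. A minimal sketch: `final_retrieve` and `claude` are the objects defined earlier, and the prompt template is an illustrative assumption, not a fixed API.

```python
def build_rag_prompt(query, chunks):
    # Join retrieved chunks with a visible separator so the model can tell
    # where one passage ends and the next begins.
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer(query):
    # Retrieve, then generate: the standard RAG completion step.
    prompt = build_rag_prompt(query, final_retrieve(query))
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```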
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function (provided in the Anthropic cookbook) as a custom chunking option. The function adds context to each document chunk before storage.
Cost Management
- Use Claude 3 Haiku for context generation (fastest/cheapest)
- Enable prompt caching to reduce API calls by up to 90%
- Batch your context generation requests
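The batching point above can be sketched with a thread pool, since context-generation calls are I/O-bound. Here `context_fn` stands in for whatever generator you use (e.g. `generate_chunk_context` from Step 3); the worker count is an illustrative choice, and you should stay within your account's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_contexts_parallel(chunks, context_fn, max_workers=8):
    """Generate one context string per chunk, preserving input order."""
    def worker(chunk):
        return context_fn(
            chunk["text"],
            chunk.get("surrounding_text", ""),
            chunk.get("doc_title", "Unknown"),
        )
    # pool.map keeps results in the same order as the input chunks,
    # even though the underlying calls complete out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))
```

A simple stub makes the behavior easy to verify before wiring in real API calls: `generate_contexts_parallel([{"text": "t1"}], lambda t, s, d: f"ctx for {t}")` returns `["ctx for t1"]`.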
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by adding document context to each chunk before embedding, solving the lost-context problem that arises when documents are split into isolated chunks.
- Hybrid search (Contextual Embeddings + Contextual BM25) outperforms either method alone, capturing both semantic meaning and keyword precision.
- Reranking adds a final precision boost, pushing Pass@5 accuracy above 97% in many enterprise datasets.
- Prompt caching makes Contextual Retrieval cost-effective for production, reducing API overhead by reusing system prompts across batch operations.
- The technique works across cloud platforms—Anthropic API, AWS Bedrock (via Lambda), and GCP Vertex—making it accessible regardless of your infrastructure.