Guide · 2026-05-06

How to Build a Contextual Retrieval System with Claude: A Practical Guide

Learn how to implement Contextual Embeddings and Contextual BM25 to reduce RAG retrieval failure rates by 35% using Claude, Voyage AI, and Cohere.

Quick Answer

This guide shows you how to improve RAG performance by adding context to document chunks before embedding. Using Contextual Embeddings and BM25, you can reduce retrieval failure rates by 35% and boost Pass@10 accuracy from 87% to 95%.

Claude · RAG · Contextual Embeddings · Retrieval · Prompt Caching

Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support bots to internal knowledge base Q&A. But there's a persistent problem: when you split documents into chunks for retrieval, individual chunks often lose the context they need to be useful. A chunk containing "the revenue increased by 20%" is meaningless without knowing which company, quarter, or product line it refers to.

Contextual Retrieval solves this by prepending a short, chunk-specific context before embedding. The result? A 35% reduction in retrieval failure rates across diverse datasets, and a jump in Pass@10 accuracy from ~87% to ~95% on codebase queries.

In this guide, you'll build a complete Contextual Retrieval system using Claude, Voyage AI embeddings, and Cohere reranking. You'll learn:

  • How to set up a basic RAG pipeline as a baseline
  • Why Contextual Embeddings work and how prompt caching makes them affordable
  • How to implement Contextual BM25 for hybrid search
  • How reranking further boosts performance
Let's dive in.

Prerequisites

Before starting, make sure you have:

  • Python 3.8+ installed
  • Docker (optional, for BM25 search)
  • 4GB+ RAM and ~5-10 GB disk space
  • API keys for Anthropic, Voyage AI, and Cohere
  • Basic familiarity with RAG, vector databases, and embeddings
Time & Cost: 30–45 minutes to complete. API costs run about $5–10 for the full dataset.

Step 1: Set Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere pandas numpy

Then initialize your clients:

import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGE_KEY")
co = cohere.Client("YOUR_COHERE_KEY")

Step 2: Build a Basic RAG Baseline

Before improving retrieval, you need a baseline. Load your chunked dataset and create a simple vector index.

import json
import numpy as np

# Load pre-chunked codebase data
with open("data/codebase_chunks.json", "r") as f:
    chunks = json.load(f)

# Generate embeddings for each chunk
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Store in a simple in-memory index (use FAISS or Pinecone for production)
embedding_matrix = np.array(embeddings)

Now define a retrieval function:

def retrieve(query, k=10):
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(embedding_matrix, query_emb)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]

Evaluate using Pass@k. With a dataset of 248 queries (each with a known "golden chunk"), your baseline Pass@10 should land around 87%.
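The Pass@k evaluation itself can be sketched in a few lines. This is a hypothetical helper, not from the original dataset code: it assumes each query record carries a `golden_chunk_id` and that retrieved chunks expose an `id` field (adapt the field names to your data).

```python
def pass_at_k(queries, retrieve_fn, k=10):
    """Fraction of queries whose golden chunk appears in the top-k results.

    Assumes each query dict has a "text" field and a "golden_chunk_id",
    and that retrieve_fn returns chunk dicts carrying an "id" field.
    """
    hits = 0
    for q in queries:
        retrieved = retrieve_fn(q["text"], k=k)
        if any(chunk["id"] == q["golden_chunk_id"] for chunk in retrieved):
            hits += 1
    return hits / len(queries)

# Toy check with a fake retriever that always returns chunks 0..k-1
toy_chunks = [{"id": i} for i in range(20)]
toy_queries = [{"text": "q1", "golden_chunk_id": 3},
               {"text": "q2", "golden_chunk_id": 15}]
fake_retrieve = lambda text, k=10: toy_chunks[:k]
print(pass_at_k(toy_queries, fake_retrieve, k=10))  # only id 3 is in the top 10
```

Run this against both the baseline and the contextualized index so the 87% vs. 95% comparison uses identical queries and scoring.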

Step 3: Implement Contextual Embeddings

The core idea is simple: before embedding each chunk, ask Claude to generate a short piece of context that explains what the chunk is about.

The Context Generation Prompt

def generate_context(chunk_text, document_text):
    prompt = f"""<document>
{document_text}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk_text}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Making It Cost-Effective with Prompt Caching

Generating context for every chunk individually would be expensive. Prompt caching solves this by reusing the document prefix across multiple chunk requests.

# Cache the document prefix once
cached_doc = claude.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{"type": "text", "text": document_text, "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": f"<chunk>{chunk_text}</chunk>"}]
)

With caching, the cost drops dramatically—often by 90% or more—making Contextual Embeddings viable for production.
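To see where that saving comes from, here is a back-of-the-envelope cost model. The rates below are illustrative assumptions (check current Anthropic pricing): a base input rate, a 1.25x surcharge for cache writes, and a 0.1x rate for cache reads; output tokens are ignored for simplicity.

```python
def context_generation_cost(doc_tokens, chunk_tokens, n_chunks,
                            input_rate, cache_write_rate, cache_read_rate):
    """Compare per-document cost of context generation with and without caching.

    Rates are dollars per million input tokens.
    """
    # Without caching: the full document is re-sent for every chunk
    without_cache = n_chunks * (doc_tokens + chunk_tokens) * input_rate / 1e6
    # With caching: pay the cache-write rate once, then cheap reads per chunk
    with_cache = (doc_tokens * cache_write_rate
                  + (n_chunks - 1) * doc_tokens * cache_read_rate
                  + n_chunks * chunk_tokens * input_rate) / 1e6
    return without_cache, with_cache

# Illustrative rates: $0.25/MTok input, 1.25x cache write, 0.1x cache read
base, cached = context_generation_cost(
    doc_tokens=50_000, chunk_tokens=500, n_chunks=100,
    input_rate=0.25, cache_write_rate=0.3125, cache_read_rate=0.025)
print(f"without caching: ${base:.2f}, with caching: ${cached:.2f}")
```

With these assumed numbers, caching cuts the document-prefix cost by roughly 88%, in line with the savings claimed above.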

Embed the Contextualized Chunks

contextualized_chunks = []
for chunk in chunks:
    context = generate_context(chunk["content"], chunk["document"])
    contextualized_text = f"{context}\n\n{chunk['content']}"
    contextualized_chunks.append(contextualized_text)

# Re-embed with Voyage AI and rebuild the index
new_embeddings = vo.embed(contextualized_chunks, model="voyage-2").embeddings
embedding_matrix = np.array(new_embeddings)

After re-evaluating, your Pass@10 should jump to ~95%.

Step 4: Add Contextual BM25 for Hybrid Search

BM25 is a text-based retrieval method that complements semantic search. You can apply the same context to BM25 by indexing the contextualized chunks instead of raw chunks.

# Using a simple BM25 implementation (e.g., rank_bm25)
from rank_bm25 import BM25Okapi

tokenized_corpus = [chunk.split() for chunk in contextualized_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query, k=10, alpha=0.5):
    # Semantic search
    query_emb = vo.embed([query], model="voyage-2").embeddings[0]
    semantic_scores = np.dot(embedding_matrix, query_emb)
    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Normalize and combine
    combined = alpha * (semantic_scores / np.max(semantic_scores)) + \
               (1 - alpha) * (bm25_scores / np.max(bm25_scores))
    top_indices = np.argsort(combined)[-k:][::-1]
    return [chunks[i] for i in top_indices]

Hybrid search with Contextual BM25 typically yields another 2–5% improvement over Contextual Embeddings alone.
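Max-normalization is sensitive to score outliers, so a common alternative worth trying is reciprocal rank fusion (RRF), which combines rankings rather than raw scores. This is a sketch of the standard RRF formula, not code from the original pipeline; the constant `c=60` is the conventional default.

```python
import numpy as np

def reciprocal_rank_fusion(score_lists, k=10, c=60):
    """Combine rankings from multiple retrievers by reciprocal rank.

    RRF is scale-free: it ignores raw score magnitudes, so no
    normalization step is needed when mixing BM25 and cosine scores.
    """
    n_docs = len(score_lists[0])
    fused = np.zeros(n_docs)
    for scores in score_lists:
        # rank 0 = best document for this retriever
        order = np.argsort(scores)[::-1]
        ranks = np.empty(n_docs, dtype=int)
        ranks[order] = np.arange(n_docs)
        fused += 1.0 / (c + ranks + 1)
    return np.argsort(fused)[-k:][::-1]

# Toy scores on 4 documents from two retrievers with very different scales
semantic = np.array([0.9, 0.1, 0.8, 0.3])
bm25_s = np.array([2.0, 9.0, 8.0, 1.0])
print(reciprocal_rank_fusion([semantic, bm25_s], k=2))
```

Here document 0 wins overall despite a mediocre BM25 score, because it ranks first semantically; no per-retriever score scaling was needed.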

Step 5: Rerank for Final Precision

Even with excellent retrieval, the top-10 results may contain irrelevant chunks. A reranker (like Cohere's) re-orders results based on deeper relevance scoring.

def rerank(query, retrieved_chunks, k=10):
    results = co.rerank(
        query=query,
        documents=[chunk["content"] for chunk in retrieved_chunks],
        top_n=k,
        model="rerank-english-v2.0"
    )
    return [retrieved_chunks[r.index] for r in results.results]

Reranking can push Pass@5 to 98%+ and is especially valuable when you need high precision (e.g., legal or medical Q&A).

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context to each document during ingestion. The function code is included in the Anthropic cookbook under contextual-rag-lambda-function/lambda_function.py. Select this Lambda as a custom chunking option when configuring your knowledge base.

Cost Management

  • Prompt caching is essential. It's available on Anthropic's first-party API and coming soon to Bedrock and Vertex AI.
  • Use Claude 3 Haiku for context generation—it's fast and cheap.
  • Batch your embedding calls to minimize API overhead.
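The batching point above can be sketched with a small helper. This is a hypothetical wrapper, not part of the Voyage SDK: it assumes `embed_fn` takes a list of strings and returns one vector per string (e.g. `lambda batch: vo.embed(batch, model="voyage-2").embeddings`).

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Embed texts in fixed-size batches to stay under per-request limits.

    embed_fn is assumed to take a list of strings and return a list of
    vectors of the same length.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Toy check with a fake embedder that records batch sizes
calls = []
fake_embed = lambda batch: (calls.append(len(batch)), [[0.0]] * len(batch))[1]
vecs = embed_in_batches([f"t{i}" for i in range(300)], fake_embed, batch_size=128)
print(len(vecs), calls)
```

Check your provider's documented per-request limits before picking a batch size.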

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the "lost context" problem in traditional RAG.
  • Prompt caching makes Contextual Embeddings production-ready, cutting context generation costs by up to 90%.
  • Contextual BM25 + semantic search (hybrid retrieval) yields the best results, combining lexical and semantic matching.
  • Reranking with Cohere pushes precision even higher, achieving Pass@5 above 98%.
  • AWS Bedrock users can deploy this as a Lambda function for seamless integration with Knowledge Bases.
By implementing Contextual Retrieval, you can build RAG systems that find the right information faster, more accurately, and at a fraction of the cost you might expect.