
Contextual Retrieval: Boosting RAG Performance with Claude and Contextual Embeddings

Learn how to cut RAG retrieval failure rates by 35% using Contextual Embeddings and Contextual BM25 with Claude. A practical guide with code examples and evaluation metrics.

Quick Answer

This guide shows you how to enhance RAG systems by adding context to document chunks before embedding, reducing retrieval failure rates by 35%. You'll implement Contextual Embeddings, Contextual BM25, and reranking using Claude and Voyage AI.

Tags: RAG, Contextual Embeddings, Claude, Prompt Caching, BM25


Retrieval Augmented Generation (RAG) is the backbone of many enterprise AI applications—from customer support chatbots to internal knowledge base Q&A. But traditional RAG has a blind spot: when you split documents into chunks, individual pieces often lose the surrounding context, leading to poor retrieval accuracy.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. In tests across multiple datasets, this technique reduced the top-20-chunk retrieval failure rate by 35%. This guide walks you through implementing Contextual Embeddings and Contextual BM25 with Claude, complete with code examples and performance benchmarks.

What You'll Learn

  • How to set up a basic RAG pipeline as a baseline
  • What Contextual Embeddings are and why they work
  • How to implement Contextual Embeddings with prompt caching to manage costs
  • How to combine Contextual Embeddings with Contextual BM25 for hybrid search
  • How to further improve performance with reranking

Prerequisites

Before diving in, make sure you have:

  • Python 3.8+ installed
  • API keys for Anthropic, Voyage AI, and Cohere
  • Basic familiarity with RAG, vector databases, and embeddings
  • Docker installed (optional, for BM25 search)
  • About 30–45 minutes and ~$5–10 in API costs

1. Setting Up a Basic RAG Pipeline

We'll start with a simple RAG pipeline to establish a performance baseline. The dataset consists of 9 codebases, chunked using character splitting, with 248 evaluation queries—each with a "golden chunk" that should be retrieved.

Install Dependencies

pip install anthropic voyageai cohere numpy pandas scikit-learn rank-bm25

Load and Chunk Documents

import json

# Load pre-chunked codebase data
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

# Load evaluation queries
with open('data/evaluation_set.jsonl', 'r') as f:
    eval_data = [json.loads(line) for line in f]
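For reference, the rest of the guide assumes the data files have roughly the following shape. The field names are inferred from the code below, so treat them as assumptions rather than a documented schema.

# Assumed shape of a chunk record in data/codebase_chunks.json
example_chunk = {
    "doc_id": "repo_a/utils.py",
    "text": "def calculate_interest(principal, rate, years): ...",
    "full_document": "<entire source file the chunk came from>",
}

# Assumed shape of one line in data/evaluation_set.jsonl
example_query = {
    "query": "How is simple interest calculated in the loan calculator?",
    "golden_chunk": "def calculate_interest(principal, rate, years): ...",
}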

Generate Embeddings

We'll use Voyage AI's embedding model to vectorize each chunk.

import voyageai

vo = voyageai.Client(api_key='YOUR_VOYAGE_API_KEY')

# Embed all chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model='voyage-2').embeddings
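Note that a single embed call may exceed the provider's per-request batch limit on a large corpus. Here is a minimal batching sketch; the batch size of 128 is an assumption, so check Voyage AI's current limits before relying on it.

# Hypothetical batching helper; the batch_size value is an assumption
def embed_in_batches(texts, batch_size=128):
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        all_embeddings.extend(vo.embed(batch, model='voyage-2').embeddings)
    return all_embeddings

# embeddings = embed_in_batches(chunk_texts)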

Perform Retrieval

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def retrieve(query, embeddings, chunk_texts, k=10):
    query_emb = vo.embed([query], model='voyage-2').embeddings[0]
    similarities = cosine_similarity([query_emb], embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return [chunk_texts[i] for i in top_k_indices]

Evaluate Pass@10

pass_at_10 = 0
for item in eval_data:
    results = retrieve(item['query'], embeddings, chunk_texts, k=10)
    if item['golden_chunk'] in results:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_data):.2%}")

Expected output: Baseline Pass@10: ~87%

2. Contextual Embeddings: Adding Context to Each Chunk

The problem with basic RAG is that chunks are isolated. A chunk containing def calculate_interest(principal, rate, years): might be meaningless without knowing it's from a loan calculator app. Contextual Embeddings fix this by prepending a short context snippet to each chunk before embedding.

How It Works

For each chunk, we ask Claude to generate a concise context that explains what the chunk is about, based on the full document. This context is prepended to the chunk text before embedding.
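To make that concrete, here is a hypothetical before/after; the chunk and the generated context below are invented for illustration, not real model output.

# Original chunk (hypothetical example)
chunk_text = "def calculate_interest(principal, rate, years):\n    return principal * rate * years"

# Context Claude might generate for it (illustrative only)
generated_context = (
    "This chunk is from loan_calculator.py and defines the simple-interest "
    "calculation used throughout the loan calculator app."
)

# What actually gets embedded
contextualized_chunk = f"{generated_context}\n\n{chunk_text}"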

Implementation with Prompt Caching

Generating context for thousands of chunks can be expensive, because every call has to include the chunk's full source document. Prompt caching (available on Anthropic's API) dramatically reduces this cost: the document is written to the cache once and then reused across all of the per-chunk calls for that document.

import anthropic

client = anthropic.Anthropic(api_key='YOUR_ANTHROPIC_API_KEY')

def generate_context(chunk_text, full_document):
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        system=[
            {
                "type": "text",
                "text": "You are a document context generator. Given a chunk of text from a larger document, provide a brief (1-2 sentence) context that explains what this chunk is about and where it fits in the document.",
            },
            {
                # Cache the full document so the per-chunk calls for the same
                # document reuse it instead of paying for it on every request
                "type": "text",
                "text": f"Full document:\n{full_document}",
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": f"Chunk: {chunk_text}\n\nContext:"}
        ],
    )
    return response.content[0].text

# Generate contexts for all chunks
contexts = []
for i, chunk in enumerate(chunks):
    ctx = generate_context(chunk['text'], chunk['full_document'])
    contexts.append(ctx)
    print(f"Generated context for chunk {i+1}/{len(chunks)}")

Embed with Context

# Prepend context to each chunk before embedding
contextual_chunks = [f"{ctx}\n\n{chunk['text']}" for ctx, chunk in zip(contexts, chunks)]
contextual_embeddings = vo.embed(contextual_chunks, model='voyage-2').embeddings

Evaluate Contextual Embeddings

def contextual_retrieve(query, embeddings, contextual_chunks, k=10):
    query_emb = vo.embed([query], model='voyage-2').embeddings[0]
    similarities = cosine_similarity([query_emb], embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return [contextual_chunks[i] for i in top_k_indices]

pass_at_10 = 0
for item in eval_data:
    results = contextual_retrieve(item['query'], contextual_embeddings, contextual_chunks, k=10)
    # Results carry the prepended context, so check for the golden chunk as a
    # substring instead of exact equality
    if any(item['golden_chunk'] in r for r in results):
        pass_at_10 += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10 / len(eval_data):.2%}")

Expected output: Contextual Embeddings Pass@10: ~95% — a significant jump from 87%.

3. Contextual BM25: Hybrid Search for Even Better Results

BM25 is a lexical retrieval method that scores exact keyword matches, which complements embedding-based search wherever precise terms (function names, error codes) matter. By indexing the same context-prefixed chunks with BM25, we get Contextual BM25, and combining it with Contextual Embeddings gives hybrid search.

Setting Up BM25

from rank_bm25 import BM25Okapi

# Tokenize contextual chunks
tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_chunks)

def bm25_retrieve(query, k=10):
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_k_indices = np.argsort(scores)[-k:][::-1]
    return [contextual_chunks[i] for i in top_k_indices]

Hybrid Search: Combine Embeddings + BM25

def hybrid_retrieve(query, emb_embeddings, bm25, contextual_chunks, k=10, alpha=0.5):
    # Get embedding scores
    query_emb = vo.embed([query], model='voyage-2').embeddings[0]
    emb_scores = cosine_similarity([query_emb], emb_embeddings)[0]
    
    # Get BM25 scores
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    
    # Normalize and combine
    emb_scores = (emb_scores - emb_scores.min()) / (emb_scores.max() - emb_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * emb_scores + (1 - alpha) * bm25_scores
    
    top_k_indices = np.argsort(combined)[-k:][::-1]
    return [contextual_chunks[i] for i in top_k_indices]

Evaluate hybrid search—you should see another 1–2% improvement.
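A minimal evaluation sketch, mirroring the earlier Pass@10 loops:

# Evaluate hybrid search (same pattern as the earlier Pass@10 loops)
pass_at_10 = 0
for item in eval_data:
    results = hybrid_retrieve(item['query'], contextual_embeddings, bm25, contextual_chunks, k=10)
    # Results carry the prepended context, so check for the golden chunk as a substring
    if any(item['golden_chunk'] in r for r in results):
        pass_at_10 += 1

print(f"Hybrid Pass@10: {pass_at_10 / len(eval_data):.2%}")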

4. Reranking for Precision

Even with contextual retrieval, the top-10 results may contain irrelevant chunks. Reranking using Cohere's rerank API or Claude itself can push the golden chunk to position 1.

import cohere

co = cohere.Client('YOUR_COHERE_API_KEY')

def rerank(query, candidates, top_n=5):
    results = co.rerank(
        model='rerank-english-v2.0',
        query=query,
        documents=candidates,
        top_n=top_n
    )
    return [candidates[r.index] for r in results.results]

# Use reranking on top-20 results from hybrid search
pass_at_1 = 0
for item in eval_data:
    candidates = hybrid_retrieve(item['query'], contextual_embeddings, bm25, contextual_chunks, k=20)
    reranked = rerank(item['query'], candidates, top_n=5)
    # Candidates carry the prepended context, so check for the golden chunk
    # as a substring of the top result rather than exact equality
    if reranked and item['golden_chunk'] in reranked[0]:
        pass_at_1 += 1

print(f"Pass@1 with Reranking: {pass_at_1 / len(eval_data):.2%}")

Cost Optimization with Prompt Caching

Generating context for thousands of chunks can be costly, since each call includes the chunk's full source document. Prompt caching reduces those costs by up to 90% by writing the document to the cache once and reusing it across the per-chunk API calls. This feature is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
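To confirm the cache is actually being hit, you can inspect the usage metadata on each response. The snippet below reads the cache_creation_input_tokens and cache_read_input_tokens fields exposed by the Anthropic Python SDK; if your SDK version reports usage differently, adjust accordingly.

# Check cache usage for a single context-generation call
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": f"Full document:\n{chunks[0]['full_document']}",
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize this document in one sentence."}],
)
usage = response.usage
print("cache write tokens:", getattr(usage, "cache_creation_input_tokens", None))
print("cache read tokens: ", getattr(usage, "cache_read_input_tokens", None))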

For AWS Bedrock users, Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option when configuring a Bedrock Knowledge Base.

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by adding document-level context to each chunk before embedding.
  • Prompt caching makes Contextual Embeddings production-ready by slashing API costs for context generation.
  • Hybrid search (Contextual Embeddings + Contextual BM25) yields the best results, combining semantic and keyword-based retrieval.
  • Reranking further improves precision, pushing the most relevant chunk to the top of results.
  • This technique works across platforms — use it with Anthropic's API, AWS Bedrock, or GCP Vertex AI with minor customization.