BeClaude · Guide · 2026-05-06

Enhancing RAG with Contextual Retrieval: A Practical Guide for Claude AI Users

Learn how to improve RAG performance using Contextual Embeddings and BM25 with Claude AI. Includes code examples, evaluation metrics, and production tips.

Quick Answer

This guide shows you how to boost RAG accuracy by adding context to document chunks before embedding. You'll learn Contextual Embeddings, Contextual BM25, and reranking techniques to reduce retrieval failure rates by up to 35%.

Tags: RAG · Contextual Embeddings · Claude AI · Retrieval Optimization · Prompt Caching

Introduction

Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to answer questions using your internal knowledge bases, codebases, or document repositories. However, traditional RAG systems often stumble when individual document chunks lack sufficient context—a single code function or paragraph snippet may be meaningless on its own.

Contextual Retrieval solves this problem by enriching each chunk with relevant context before embedding. The result? More accurate retrieval, better answers, and a 35% reduction in top-20 retrieval failures across tested datasets.

In this guide, you'll learn how to implement Contextual Embeddings and Contextual BM25 using Claude and supporting APIs. We'll walk through a complete pipeline using a dataset of 9 codebases, evaluate performance with Pass@k metrics, and show you how prompt caching makes this approach production-ready.

What You'll Need

Prerequisites

  • Intermediate Python skills
  • Basic understanding of RAG and vector databases
  • Command-line proficiency

System Requirements

  • Python 3.8+
  • Docker (optional, for BM25 search)
  • 4GB+ RAM
  • 5–10 GB disk space for vector databases

API Keys

  • Anthropic API key (context generation with Claude)
  • Voyage AI API key (embeddings)
  • Cohere API key (optional, for reranking in Step 5)

Time & Cost

  • Setup: 30–45 minutes
  • API costs: ~$5–10 for the full dataset

Step 1: Basic RAG Pipeline (Baseline)

Before improving retrieval, establish a baseline. We'll split documents into chunks, embed them, and measure Pass@10 performance.

import voyageai
import numpy as np
from typing import List, Dict

# Initialize the Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# Load the pre-chunked dataset (from data/codebase_chunks.json).
# Each chunk has: id, content, source_file
chunks = load_chunks()  # your loading logic here

# Generate embeddings for all chunks
chunk_texts = [chunk["content"] for chunk in chunks]
embeddings = vo.embed(chunk_texts, model="voyage-2").embeddings

# Build a simple vector store (using numpy for demo)
vector_store = np.array(embeddings)

def retrieve(query: str, k: int = 10) -> List[Dict]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    scores = np.dot(vector_store, query_embedding)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]

Now evaluate with Pass@10: for each query in evaluation_set.jsonl (which pairs queries with golden chunk IDs), check whether the golden chunk appears in the top 10 retrieved results.

Expected baseline: ~87% Pass@10 on the codebase dataset.
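The evaluation loop can be sketched as follows, assuming evaluation_set.jsonl stores one JSON object per line with query and golden_chunk_id fields (the field names are assumptions — adjust them to your file's actual schema):

```python
import json
from typing import Callable, Dict, List

def load_eval_set(path: str) -> List[Dict]:
    """Load a JSONL evaluation file: one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def pass_at_k(eval_examples: List[Dict],
              retrieve_fn: Callable[[str, int], List[Dict]],
              k: int = 10) -> float:
    """Fraction of queries whose golden chunk appears in the top-k results."""
    hits = 0
    for example in eval_examples:
        retrieved_ids = {chunk["id"] for chunk in retrieve_fn(example["query"], k)}
        if example["golden_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_examples)
```

Run pass_at_k(load_eval_set("data/evaluation_set.jsonl"), retrieve) against the baseline retriever now, then re-run the same call after each of the following steps to quantify the gains.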

Step 2: Contextual Embeddings

Contextual Embeddings prepend a chunk-specific context to each chunk before embedding. This context is generated by Claude, which understands the chunk's role in the broader document.

How It Works

  • For each chunk, send the full document + chunk to Claude with a prompt like:
"Given the document below, provide a short context for this chunk that explains its purpose and relevance."
  • Prepend the generated context to the chunk text.
  • Embed the enriched chunk.
  • At query time, search against enriched embeddings.

Implementation

import anthropic

client = anthropic.Anthropic(api_key="your-anthropic-api-key")

def generate_chunk_context(document: str, chunk: str) -> str:
    """Generate context for a single chunk using Claude."""
    prompt = f"""<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the context string."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()

# Apply to all chunks
enriched_chunks = []
for chunk in chunks:
    context = generate_chunk_context(chunk["document"], chunk["content"])
    enriched_text = f"{context}\n\n{chunk['content']}"
    enriched_chunks.append({**chunk, "enriched_text": enriched_text})

# Embed enriched chunks
enriched_embeddings = vo.embed(
    [c["enriched_text"] for c in enriched_chunks],
    model="voyage-2",
).embeddings

Performance Boost

After implementing Contextual Embeddings, re-run your evaluation. Expect Pass@10 to jump from ~87% to ~95%—a significant reduction in retrieval failures.

Step 3: Optimizing Costs with Prompt Caching

Generating context for every chunk can be expensive. Prompt caching reduces costs by reusing the document prefix across chunk requests.

# Enable prompt caching by marking the document as a cache control point
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=100,
    temperature=0,
    system=[
        {
            "type": "text",
            "text": document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": f"<chunk>{chunk}</chunk>\n\nProvide context..."}]
)

With caching, you only pay the full document token cost once. Subsequent chunks reuse the cached prefix, slashing API costs by 50–80%.
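To see where the savings come from, here is a back-of-the-envelope comparison for a single 10,000-token document split into 50 chunks, using Anthropic's published multipliers (cache writes at 1.25x the base input rate, cache reads at 0.1x):

```python
# Hypothetical cost comparison: one 10,000-token document split into 50 chunks
doc_tokens = 10_000
n_chunks = 50

# Without caching: the full document is billed as fresh input for every chunk
uncached_tokens = doc_tokens * n_chunks

# With caching: one cache write (1.25x base input price), then cache reads
# (0.1x base input price) for the remaining chunks
cached_equivalent = doc_tokens * 1.25 + doc_tokens * 0.1 * (n_chunks - 1)

savings = 1 - cached_equivalent / uncached_tokens
print(f"Document-prefix savings: {savings:.0%}")  # prints: Document-prefix savings: 88%
```

This counts only the repeated document prefix; overall savings land in the 50–80% range because each chunk's own text and Claude's responses are still billed at normal rates.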

Note: Prompt caching is available on Anthropic's first-party API and coming soon to Amazon Bedrock and Google Cloud Vertex AI.

Step 4: Contextual BM25

BM25 is a keyword-based retrieval method that complements dense embeddings. Contextual BM25 applies the same chunk context to BM25 indexing.

Implementation

from rank_bm25 import BM25Okapi

# Tokenize enriched texts
enriched_texts = [c["enriched_text"] for c in enriched_chunks]
tokenized_corpus = [text.split() for text in enriched_texts]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_retrieve(query: str, k: int = 10) -> List[Dict]:
    tokenized_query = query.split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]

# Hybrid search: combine dense + BM25 scores
chunks_by_id = {chunk["id"]: chunk for chunk in chunks}

def hybrid_retrieve(query: str, k: int = 10, alpha: float = 0.5) -> List[Dict]:
    dense_results = retrieve(query, k * 2)   # get more candidates
    bm25_results = bm25_retrieve(query, k * 2)
    # Combine rank-based scores (simplified fusion)
    combined_scores = {}
    for i, chunk in enumerate(dense_results):
        combined_scores[chunk["id"]] = alpha * (1 - i / (k * 2))
    for i, chunk in enumerate(bm25_results):
        combined_scores[chunk["id"]] = (
            combined_scores.get(chunk["id"], 0.0) + (1 - alpha) * (1 - i / (k * 2))
        )
    sorted_ids = sorted(combined_scores, key=combined_scores.get, reverse=True)[:k]
    return [chunks_by_id[i] for i in sorted_ids]

Hybrid search with Contextual BM25 typically yields another 2–5% improvement in Pass@10.

Step 5: Reranking for Final Precision

Reranking applies a cross-encoder model to reorder the top-k results from your hybrid search. This adds a small latency cost but can push accuracy even higher.

import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query: str, candidates: List[Dict], top_k: int = 10) -> List[Dict]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["enriched_text"] for c in candidates],
        top_n=top_k,
    )
    return [candidates[r.index] for r in results.results]

# Full pipeline
query = "How does the authentication module handle token refresh?"
candidates = hybrid_retrieve(query, k=20)
final_results = rerank(query, candidates, top_k=10)

Reranking can push Pass@10 beyond 97% on well-structured datasets.

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, you can deploy a Lambda function that adds context during chunking. The Anthropic cookbook includes a ready-to-use Lambda in the contextual-rag-lambda-function folder. Configure it as a custom chunking option in your Bedrock Knowledge Base.

Latency vs. Accuracy Trade-offs

  • Contextual Embeddings: Adds upfront processing time but zero query-time latency.
  • Contextual BM25: Minimal query-time overhead.
  • Reranking: Adds 100–500ms per query but delivers the highest accuracy.

Scaling

For large corpora (millions of chunks), consider:

  • Batching context generation with Claude
  • Using approximate nearest neighbor (ANN) indexes like FAISS
  • Pre-computing BM25 indices

Key Takeaways

  • Contextual Embeddings reduce retrieval failures by 35% by enriching chunks with document-level context before embedding.
  • Prompt caching cuts costs by 50–80% when generating context for many chunks from the same document.
  • Hybrid search (dense + BM25) outperforms either method alone—Contextual BM25 adds another 2–5% improvement.
  • Reranking pushes accuracy to 97%+ but adds latency; use it when precision is critical.
  • Production-ready on AWS Bedrock via a custom Lambda function for chunking—no vendor lock-in.
By implementing Contextual Retrieval, you transform a basic RAG system into a high-precision knowledge engine that Claude can trust—and your users will love.