
Enhancing RAG with Contextual Retrieval: A Practical Guide to Smarter Document Chunking

Learn how to improve RAG performance using Contextual Embeddings and Contextual BM25. This guide covers setup, implementation, and optimization with Claude AI and Anthropic's ecosystem.

Quick Answer

This guide teaches you how to enhance RAG systems by adding context to document chunks before embedding, reducing retrieval failure rates by 35% using Contextual Embeddings and Contextual BM25 with Claude AI.

Tags: RAG · Contextual Embeddings · Claude AI · Retrieval Augmented Generation · Prompt Caching


Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to tap into your internal knowledge bases, code repositories, and document libraries. But traditional RAG has a blind spot: when you split documents into chunks for embedding, those chunks often lose the surrounding context that makes them meaningful. A chunk that reads "the function returns True" is useless without knowing which function or what condition it checks.

Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. This guide walks you through implementing Contextual Embeddings and Contextual BM25, showing how to reduce retrieval failure rates by up to 35%—all using Anthropic's ecosystem, including Claude and prompt caching to keep costs manageable.

What You'll Learn

  • How to set up a basic RAG pipeline as a baseline
  • What Contextual Embeddings are and why they work
  • How to implement Contextual Embeddings with Claude and Voyage AI
  • How to combine Contextual Embeddings with Contextual BM25 for hybrid search
  • How to further improve results with reranking

Prerequisites

Before diving in, make sure you have:

Technical Skills:
  • Intermediate Python programming
  • Basic understanding of RAG and vector databases
  • Familiarity with command-line tools
System Requirements:
  • Python 3.8+
  • Docker (optional, for BM25 search)
  • 4GB+ RAM, ~5-10 GB disk space
API Keys:
  • Anthropic (for Claude and prompt caching)
  • Voyage AI (for embeddings)
  • Cohere (for reranking in Step 5)
Time & Cost:
  • Setup: 30-45 minutes
  • API costs: ~$5-10 for the full dataset

Step 1: Setting Up a Basic RAG Pipeline

First, let's establish a baseline. We'll use a pre-chunked dataset of nine codebases (available in data/codebase_chunks.json) and 248 evaluation queries with known "golden chunks" (in data/evaluation_set.jsonl). Our metric is Pass@k—whether the golden chunk appears in the top-k retrieved results.
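Here's a minimal loading sketch to get `chunks` and `eval_queries` into memory. The exact JSON schema (field names like `chunks`, `content`, and `chunk_id`) is an assumption about the dataset layout, so adjust it to match the files you have:

```python
import json

# Hypothetical loading sketch -- field names assume each document entry
# carries its full text plus a list of chunks; adjust to the actual schema.
with open("data/codebase_chunks.json") as f:
    docs = json.load(f)

# Flatten into one list of chunks, keeping a reference to the parent document
chunks = [
    {"document": doc["content"], "content": c["content"], "chunk_id": c["chunk_id"]}
    for doc in docs
    for c in doc["chunks"]
]

with open("data/evaluation_set.jsonl") as f:
    eval_queries = [json.loads(line) for line in f]
```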

```python
import voyageai
import numpy as np
from typing import List

# Initialize Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# chunks and eval_queries as loaded in the sketch above

# Embed all chunks
# (For large corpora, embed in batches; Voyage caps the batch size per request)
chunk_texts = [chunk["content"] for chunk in chunks]
chunk_embeddings = vo.embed(
    chunk_texts,
    model="voyage-2",
    input_type="document"
).embeddings

# For each query, embed and find top-k matches
def search(query: str, k: int = 10) -> List[int]:
    query_emb = vo.embed(
        [query], model="voyage-2", input_type="query"
    ).embeddings[0]
    # Compute cosine similarity (Voyage embeddings are normalized,
    # so a dot product is equivalent)
    similarities = np.dot(chunk_embeddings, query_emb)
    top_indices = np.argsort(similarities)[-k:][::-1]
    return list(top_indices)

# Evaluate Pass@10
pass_at_10 = 0
for query in eval_queries:
    results = search(query["query"], k=10)
    if query["golden_chunk_id"] in results:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.2%}")
# Expected: ~87%
```

This baseline gives us ~87% Pass@10. Not bad, but we can do better.

Step 2: Understanding Contextual Embeddings

The problem with basic chunking is context loss. A chunk from a function definition might say "def calculate_interest(principal, rate, time):" but the next chunk starts with "return principal * rate * time / 100"—and without the function signature, that chunk is meaningless.

Contextual Embeddings fix this by using Claude to generate a short, chunk-specific context that explains what the chunk is about. This context is prepended to the chunk text before embedding. For example:
  • Original chunk: return principal * rate * time / 100
  • With context: This is from a function called 'calculate_interest' that computes simple interest. The code returns: return principal * rate * time / 100
This enriched chunk is far more likely to match relevant queries.
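To see the effect concretely, you can compare how each version scores against a natural-language query. A quick sketch using the Voyage client from Step 1 (the strings are the hypothetical example above, and exact scores will vary):

```python
# Compare the bare chunk and the contextualized chunk against the same query.
bare = "return principal * rate * time / 100"
enriched = ("This is from a function called 'calculate_interest' that computes "
            "simple interest. The code returns: return principal * rate * time / 100")
query = "How is simple interest calculated?"

embs = vo.embed([bare, enriched], model="voyage-2", input_type="document").embeddings
q = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
print("bare    :", np.dot(embs[0], q))
print("enriched:", np.dot(embs[1], q))  # typically noticeably higher
```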

Step 3: Implementing Contextual Embeddings

Here's where Claude shines. We'll use Claude to generate a short context for each chunk, with prompt caching so that the full document is sent (and billed at the full rate) only once, then served from the cache for every other chunk drawn from it.

```python
import anthropic

client = anthropic.Anthropic(api_key="your-anthropic-api-key")

# System prompt for context generation
SYSTEM_PROMPT = """You are a document context generator. Given a document and a chunk from it, generate a concise context (2-3 sentences) that explains what this chunk is about, including relevant surrounding information like function names, class names, or section headers."""

def generate_context(document: str, chunk: str) -> str:
    # The cache_control marker caches everything up to and including the
    # document block after the first request, so later chunks from the same
    # document read those tokens from the cache instead of re-sending them.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {"role": "user", "content": f"Chunk: {chunk}\n\nGenerate context:"}
        ],
    )
    return response.content[0].text

# Generate contexts for all chunks
contextual_chunks = []
for chunk in chunks:
    context = generate_context(chunk["document"], chunk["content"])
    contextual_chunks.append(f"{context}\n\n{chunk['content']}")

# Now embed the contextual chunks
contextual_embeddings = vo.embed(
    contextual_chunks,
    model="voyage-2",
    input_type="document"
).embeddings

# Re-evaluate Pass@10
pass_at_10_contextual = 0
for query in eval_queries:
    query_emb = vo.embed([query["query"]], model="voyage-2", input_type="query").embeddings[0]
    similarities = np.dot(contextual_embeddings, query_emb)
    top_indices = np.argsort(similarities)[-10:][::-1]
    if query["golden_chunk_id"] in top_indices:
        pass_at_10_contextual += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10_contextual / len(eval_queries):.2%}")
# Expected: ~95%
```

Why prompt caching matters: Without caching, generating context for thousands of chunks would mean re-sending the full document with every request. With prompt caching, the document is cached after the first request, and cached input tokens cost ~90% less for every subsequent chunk from the same document.
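To confirm caching is actually kicking in, you can inspect the usage block the Messages API returns. A quick sketch issuing two calls for consecutive chunks (this assumes, as in this dataset, that the first two chunks come from the same document):

```python
# Two calls over chunks of the same document: the first should write the
# document to the cache, the second should read it back.
for i, chunk in enumerate(chunks[:2]):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT},
            {
                "type": "text",
                "text": f"<document>\n{chunk['document']}\n</document>",
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": f"Chunk: {chunk['content']}\n\nGenerate context:"}],
    )
    u = response.usage
    print(f"call {i}: cache_creation={u.cache_creation_input_tokens}, "
          f"cache_read={u.cache_read_input_tokens}")
```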

Step 4: Adding Contextual BM25 for Hybrid Search

Contextual Embeddings improve semantic search, but BM25 (a keyword-based algorithm) can catch exact matches that embeddings miss. By applying the same context to BM25, we get Contextual BM25.

```python
# Using a simple BM25 implementation (e.g., the rank_bm25 library)
from rank_bm25 import BM25Okapi

# Tokenize the contextual chunks
tokenized_corpus = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[int]:
    # Semantic search scores
    query_emb = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    semantic_scores = np.dot(contextual_embeddings, query_emb)
    # BM25 scores
    bm25_scores = bm25.get_scores(query.split())
    # Min-max normalize both score sets, then blend with weight alpha
    semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * semantic_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[-k:][::-1]
    return list(top_indices)

# Evaluate hybrid search
pass_at_10_hybrid = 0
for query in eval_queries:
    results = hybrid_search(query["query"], k=10)
    if query["golden_chunk_id"] in results:
        pass_at_10_hybrid += 1

print(f"Hybrid Contextual Search Pass@10: {pass_at_10_hybrid / len(eval_queries):.2%}")
# Expected: ~96-97%
```
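Min-max blending is sensitive to score outliers. Reciprocal rank fusion (RRF) is a common alternative, not covered in this guide's evaluation, that combines the two result lists by rank position instead of raw score. A sketch reusing the embeddings and BM25 index from above (c=60 is the conventional RRF constant):

```python
def rrf_search(query: str, k: int = 10, c: int = 60) -> List[int]:
    # Rank chunks separately by semantic score and by BM25 score
    query_emb = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    semantic_rank = np.argsort(-np.dot(contextual_embeddings, query_emb))
    bm25_rank = np.argsort(-bm25.get_scores(query.split()))
    # Each chunk's fused score is the sum of 1/(c + rank) over both rankings,
    # which sidesteps score normalization entirely
    scores = np.zeros(len(contextual_chunks))
    for rank_list in (semantic_rank, bm25_rank):
        for rank, idx in enumerate(rank_list):
            scores[idx] += 1.0 / (c + rank)
    return list(np.argsort(scores)[-k:][::-1])
```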

Step 5: Improving with Reranking

For even better results, add a reranking step using Cohere's rerank API. This reorders the top-20 results to push the most relevant chunks to the top.

```python
import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query: str, chunks: List[str], top_k: int = 10) -> List[int]:
    results = co.rerank(
        query=query,
        documents=chunks,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    # Each result.index refers back to a position in the input `chunks` list
    return [result.index for result in results.results]

# For each query, get the top 20 from hybrid search, then rerank down to 10
pass_at_10_reranked = 0
for query in eval_queries:
    top_20 = hybrid_search(query["query"], k=20)
    top_20_chunks = [contextual_chunks[i] for i in top_20]
    reranked_indices = rerank(query["query"], top_20_chunks, top_k=10)
    final_indices = [top_20[i] for i in reranked_indices]
    if query["golden_chunk_id"] in final_indices:
        pass_at_10_reranked += 1

print(f"Reranked Contextual Search Pass@10: {pass_at_10_reranked / len(eval_queries):.2%}")
# Expected: ~98-99%
```

Production Considerations

AWS Bedrock Integration

If you're using AWS Bedrock Knowledge Bases, Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option. This automates context generation for new documents added to your knowledge base.

Cost Management

  • Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
  • For large corpora, consider generating context only once and storing it alongside your chunks (see the sketch after this list).
  • Use smaller models (Claude 3 Haiku) for context generation if accuracy requirements are lower.
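A minimal sketch of the store-once idea, assuming the generate_context helper from Step 3 and a local JSON file as the store (CONTEXT_CACHE is a hypothetical path):

```python
import hashlib
import json
import os

CONTEXT_CACHE = "contexts.json"  # hypothetical path for persisted contexts

def cached_contexts(chunks):
    # Load previously generated contexts, so re-indexing never re-pays
    # for Claude calls on chunks that haven't changed
    store = {}
    if os.path.exists(CONTEXT_CACHE):
        with open(CONTEXT_CACHE) as f:
            store = json.load(f)
    out = []
    for chunk in chunks:
        # Key each context by a hash of the chunk text
        key = hashlib.sha256(chunk["content"].encode()).hexdigest()
        if key not in store:
            store[key] = generate_context(chunk["document"], chunk["content"])
        out.append(f"{store[key]}\n\n{chunk['content']}")
    with open(CONTEXT_CACHE, "w") as f:
        json.dump(store, f)
    return out
```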

Key Takeaways

  • Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the context-loss problem in traditional RAG.
  • Prompt caching makes this practical by reducing the cost of generating context for thousands of chunks by ~90%.
  • Hybrid search with Contextual BM25 combines semantic and keyword matching for even better results, pushing Pass@10 from 87% to 96%+.
  • Reranking adds the final polish, boosting Pass@10 to 98-99% by reordering the top candidates.
  • This technique works with major cloud platforms—Anthropic provides ready-to-deploy Lambda functions for AWS Bedrock, with GCP Vertex AI support coming soon.