Enhancing RAG with Contextual Retrieval: A Practical Guide to Smarter Document Chunking
Learn how to improve RAG performance by adding context to document chunks before embedding. This guide covers setup, implementation, and optimization of Contextual Embeddings and Contextual BM25 with Claude and Anthropic's ecosystem, reducing retrieval failure rates by up to 35%.
Retrieval Augmented Generation (RAG) is a cornerstone of enterprise AI applications, enabling Claude to tap into your internal knowledge bases, code repositories, and document libraries. But traditional RAG has a blind spot: when you split documents into chunks for embedding, those chunks often lose the surrounding context that makes them meaningful. A chunk that reads "the function returns True" is useless without knowing which function or what condition it checks.
Contextual Retrieval solves this by prepending relevant context to each chunk before embedding. This guide walks you through implementing Contextual Embeddings and Contextual BM25, showing how to reduce retrieval failure rates by up to 35%, all using Anthropic's ecosystem, including Claude and prompt caching to keep costs manageable.
What You'll Learn
- How to set up a basic RAG pipeline as a baseline
- What Contextual Embeddings are and why they work
- How to implement Contextual Embeddings with Claude and Voyage AI
- How to combine Contextual Embeddings with Contextual BM25 for hybrid search
- How to further improve results with reranking
Prerequisites
Before diving in, make sure you have:
Technical Skills:
- Intermediate Python programming
- Basic understanding of RAG and vector databases
- Familiarity with command-line tools

Environment:
- Python 3.8+
- Docker (optional, for BM25 search)
- 4GB+ RAM, ~5-10 GB disk space

API Keys:
- Anthropic API key (free tier works)
- Voyage AI API key
- Cohere API key (for reranking)

Time & Cost:
- Setup: 30-45 minutes
- API costs: ~$5-10 for the full dataset
Step 1: Setting Up a Basic RAG Pipeline
First, let's establish a baseline. We'll use a pre-chunked dataset of nine codebases (available in data/codebase_chunks.json) and 248 evaluation queries with known "golden chunks" (in data/evaluation_set.jsonl). Our metric is Pass@k—whether the golden chunk appears in the top-k retrieved results.
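The snippets below assume two variables, chunks and eval_queries, are already in memory. A minimal loader might look like this; the file paths come from the dataset described above, but the exact record schemas are an assumption made to match the snippets that follow:

```python
import json

def load_chunks(path: str) -> list:
    """Load the pre-chunked corpus: a JSON list of chunk records."""
    with open(path) as f:
        return json.load(f)

def load_eval_queries(path: str) -> list:
    """Load evaluation queries from JSON Lines (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage:
# chunks = load_chunks("data/codebase_chunks.json")
# eval_queries = load_eval_queries("data/evaluation_set.jsonl")
```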
```python
import voyageai
import numpy as np
from typing import List

# Initialize the Voyage AI client
vo = voyageai.Client(api_key="your-voyage-api-key")

# Load chunks and evaluation data
# (assume `chunks` and `eval_queries` are loaded from the JSON files)

# Embed all chunks. Voyage embeddings are normalized to unit length,
# so the dot product below is equivalent to cosine similarity.
# (For large corpora, embed in batches to stay under the API's batch limit.)
chunk_texts = [chunk["content"] for chunk in chunks]
chunk_embeddings = vo.embed(
    chunk_texts,
    model="voyage-2",
    input_type="document",
).embeddings

# For each query, embed it and return the chunk IDs of the top-k matches.
# (Assumes each chunk record carries an "id" matching golden_chunk_id.)
def search(query: str, k: int = 10) -> List[str]:
    query_emb = vo.embed(
        [query],
        model="voyage-2",
        input_type="query",
    ).embeddings[0]
    # Cosine similarity via dot product (embeddings are normalized)
    similarities = np.dot(chunk_embeddings, query_emb)
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [chunks[i]["id"] for i in top_indices]

# Evaluate Pass@10
pass_at_10 = 0
for query in eval_queries:
    results = search(query["query"], k=10)
    if query["golden_chunk_id"] in results:
        pass_at_10 += 1

print(f"Baseline Pass@10: {pass_at_10 / len(eval_queries):.2%}")
# Expected: ~87%
```
This baseline gives us ~87% Pass@10. Not bad, but we can do better.
Step 2: Understanding Contextual Embeddings
The problem with basic chunking is context loss. A chunk from a function definition might say "def calculate_interest(principal, rate, time):" but the next chunk starts with "return principal * rate * time / 100"—and without the function signature, that chunk is meaningless.
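To see how this happens, here is a minimal sketch of naive fixed-size chunking; the splitter and chunk size are illustrative, not the dataset's actual chunking:

```python
# Naive fixed-size chunking: split text every `size` characters,
# ignoring structure. Later pieces lose the context that precedes them.
def naive_chunk(text: str, size: int) -> list:
    return [text[i:i + size] for i in range(0, len(text), size)]

source = (
    "def calculate_interest(principal, rate, time):\n"
    "    return principal * rate * time / 100\n"
)

pieces = naive_chunk(source, 48)
# The second piece holds the return statement but not the function
# signature that gives it meaning.
print(pieces[1])
```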
Contextual Embeddings fix this by using Claude to generate a short, chunk-specific context that explains what the chunk is about. This context is prepended to the chunk text before embedding. For example:

- Original chunk: return principal * rate * time / 100
- With context: This is from a function called 'calculate_interest' that computes simple interest. The code returns: return principal * rate * time / 100
Step 3: Implementing Contextual Embeddings
Here's where Claude shines. We'll use Claude to generate context for each chunk, and prompt caching to reduce costs by reusing the system prompt across multiple chunks.
```python
import anthropic

client = anthropic.Anthropic(api_key="your-anthropic-api-key")

# System prompt for context generation
SYSTEM_PROMPT = """You are a document context generator. Given a document and a chunk from it, generate a concise context (2-3 sentences) that explains what this chunk is about, including relevant surrounding information like function names, class names, or section headers."""

def generate_context(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=150,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": [
                    # Mark the document for caching: every chunk of the
                    # same document reuses this cached prefix instead of
                    # paying full price for it again
                    {
                        "type": "text",
                        "text": f"Document:\n{document}",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": f"Chunk:\n{chunk}\n\nGenerate context:",
                    },
                ],
            }
        ],
    )
    return response.content[0].text
```
```python
# Generate contexts for all chunks and prepend them to the chunk text
contextual_chunks = []
for chunk in chunks:
    context = generate_context(chunk["document"], chunk["content"])
    contextual_chunks.append(f"{context}\n\n{chunk['content']}")

# Embed the contextual chunks
contextual_embeddings = vo.embed(
    contextual_chunks,
    model="voyage-2",
    input_type="document",
).embeddings
```
```python
# Re-evaluate Pass@10 with the contextual embeddings
pass_at_10_contextual = 0
for query in eval_queries:
    query_emb = vo.embed([query["query"]], model="voyage-2", input_type="query").embeddings[0]
    similarities = np.dot(contextual_embeddings, query_emb)
    top_indices = np.argsort(similarities)[-10:][::-1]
    if query["golden_chunk_id"] in [chunks[i]["id"] for i in top_indices]:
        pass_at_10_contextual += 1

print(f"Contextual Embeddings Pass@10: {pass_at_10_contextual / len(eval_queries):.2%}")
# Expected: ~95%
```
Why prompt caching matters: every context-generation call sends the full document along with a single chunk, so without caching you would pay for the whole document once per chunk. With prompt caching, the document is written to the cache on the first chunk and read back cheaply for every subsequent chunk of the same document, cutting input costs by roughly 90%.
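To make the savings concrete, here is a back-of-envelope estimate; the per-million-token prices and corpus sizes are illustrative assumptions, not quoted rates:

```python
# Back-of-envelope cost estimate for the repeated document tokens when
# generating contexts for one document. Prices (USD per million input
# tokens) and sizes are illustrative assumptions.
BASE_INPUT = 3.00    # normal input tokens
CACHE_WRITE = 3.75   # writing tokens to the cache (first chunk)
CACHE_READ = 0.30    # reading cached tokens (subsequent chunks)

doc_tokens = 8_000   # tokens in one document
n_chunks = 100       # chunks generated from that document

# Without caching: the full document is billed with every chunk
no_cache = n_chunks * doc_tokens * BASE_INPUT / 1_000_000

# With caching: pay the cache-write price once, then cheap reads
with_cache = (doc_tokens * CACHE_WRITE
              + (n_chunks - 1) * doc_tokens * CACHE_READ) / 1_000_000

savings = 1 - with_cache / no_cache
print(f"without: ${no_cache:.2f}, with: ${with_cache:.2f}, saved {savings:.0%}")
```

Note this only covers the repeated document tokens; the per-chunk text and the generated output are billed the same either way.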
Step 4: Adding Contextual BM25 for Hybrid Search
Contextual Embeddings improve semantic search, but BM25 (a keyword-based algorithm) can catch exact matches that embeddings miss. By applying the same context to BM25, we get Contextual BM25.
```python
# Using a simple BM25 implementation (the rank_bm25 library)
from rank_bm25 import BM25Okapi

# Tokenize the contextual chunks for BM25
tokenized_corpus = [chunk.split() for chunk in contextual_chunks]
bm25 = BM25Okapi(tokenized_corpus)

def hybrid_search(query: str, k: int = 10, alpha: float = 0.5) -> List[int]:
    # Semantic scores
    query_emb = vo.embed([query], model="voyage-2", input_type="query").embeddings[0]
    semantic_scores = np.dot(contextual_embeddings, query_emb)
    # BM25 scores
    bm25_scores = bm25.get_scores(query.split())
    # Min-max normalize each score set (assumes the scores are not all
    # identical), then blend with weight alpha
    semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min())
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min())
    combined = alpha * semantic_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[-k:][::-1]
    return top_indices
```
```python
# Evaluate hybrid search (map the returned indices back to chunk IDs)
pass_at_10_hybrid = 0
for query in eval_queries:
    result_ids = [chunks[i]["id"] for i in hybrid_search(query["query"], k=10)]
    if query["golden_chunk_id"] in result_ids:
        pass_at_10_hybrid += 1

print(f"Hybrid Contextual Search Pass@10: {pass_at_10_hybrid / len(eval_queries):.2%}")
# Expected: ~96-97%
```
Step 5: Improving with Reranking
For even better results, add a reranking step using Cohere's rerank API. This reorders the top-20 results to push the most relevant chunks to the top.
```python
import cohere

co = cohere.Client("your-cohere-api-key")

def rerank(query: str, docs: List[str], top_k: int = 10) -> List[int]:
    results = co.rerank(
        query=query,
        documents=docs,
        top_n=top_k,
        model="rerank-english-v2.0",
    )
    # Each result carries the index of the document in the input list
    return [result.index for result in results.results]
```
```python
# For each query: take the top-20 from hybrid search, rerank, keep the top-10
pass_at_10_reranked = 0
for query in eval_queries:
    top_20 = hybrid_search(query["query"], k=20)
    top_20_chunks = [contextual_chunks[i] for i in top_20]
    reranked_indices = rerank(query["query"], top_20_chunks, top_k=10)
    final_ids = [chunks[top_20[i]]["id"] for i in reranked_indices]
    if query["golden_chunk_id"] in final_ids:
        pass_at_10_reranked += 1

print(f"Reranked Contextual Search Pass@10: {pass_at_10_reranked / len(eval_queries):.2%}")
# Expected: ~98-99%
```
Production Considerations
AWS Bedrock Integration
If you're using AWS Bedrock Knowledge Bases, Anthropic provides a Lambda function (contextual-rag-lambda-function/lambda_function.py) that you can deploy as a custom chunking option. This automates context generation for new documents added to your knowledge base.
Cost Management
- Prompt caching is available on Anthropic's first-party API and coming soon to AWS Bedrock and GCP Vertex AI.
- For large corpora, consider generating context only once and storing it alongside your chunks.
- Use smaller models (Claude 3 Haiku) for context generation if accuracy requirements are lower.
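For the second point, one approach is a small persistence layer that stores each generated context keyed by chunk ID, so re-runs only call the model for chunks that don't have a context yet. The file name and the chunk schema ("id", "document", "content") are illustrative assumptions:

```python
import json
import os

CONTEXT_STORE = "contexts.json"  # illustrative file name

def load_contexts(path: str = CONTEXT_STORE) -> dict:
    """Load previously generated contexts, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_contexts(contexts: dict, path: str = CONTEXT_STORE) -> None:
    with open(path, "w") as f:
        json.dump(contexts, f, indent=2)

def ensure_contexts(chunk_records, generate, path: str = CONTEXT_STORE) -> dict:
    """Generate a context for each chunk exactly once, reusing stored ones.

    `generate` is any callable taking (document, chunk_text), e.g. the
    generate_context function from Step 3.
    """
    contexts = load_contexts(path)
    for chunk in chunk_records:
        if chunk["id"] not in contexts:
            contexts[chunk["id"]] = generate(chunk["document"], chunk["content"])
    save_contexts(contexts, path)
    return contexts
```

Running ensure_contexts a second time over the same corpus then costs nothing, since every chunk ID is already in the store.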
Key Takeaways
- Contextual Embeddings reduce retrieval failure by 35% by adding chunk-specific context before embedding, solving the context-loss problem in traditional RAG.
- Prompt caching makes this practical by reducing the cost of generating context for thousands of chunks by ~90%.
- Hybrid search with Contextual BM25 combines semantic and keyword matching for even better results, pushing Pass@10 from 87% to 96%+.
- Reranking adds the final polish, boosting Pass@10 to 98-99% by reordering the top candidates.
- This technique works with major cloud platforms—Anthropic provides ready-to-deploy Lambda functions for AWS Bedrock, with GCP Vertex AI support coming soon.