BeClaude
Guide | Beginner | Best Practices | 2026-05-13

Mastering Contextual Retrieval: Boost RAG Accuracy with Claude and Contextual Embeddings

Learn how to implement Contextual Retrieval with Claude to reduce retrieval failure rates by 35%. Step-by-step guide with code examples for Contextual Embeddings and BM25.

Quick Answer

This guide shows you how to implement Contextual Retrieval, a technique that adds relevant context to each document chunk before embedding and reduces retrieval failure rates by 35%, using the Claude, Voyage AI, and Cohere APIs.

Tags: RAG, Contextual Embeddings, Retrieval, Claude API, Prompt Caching

Retrieval Augmented Generation (RAG) is the backbone of enterprise AI applications—from customer support bots to internal knowledge base Q&A systems. But traditional RAG has a critical flaw: when you split documents into chunks for retrieval, individual chunks often lose their surrounding context, leading to poor search results.

Enter Contextual Retrieval, a technique pioneered by Anthropic that adds relevant context to each chunk before embedding. The results speak for themselves: a 35% reduction in retrieval failure rates across diverse datasets.

In this guide, you'll learn how to implement Contextual Retrieval using Claude, Voyage AI embeddings, and Cohere reranking. We'll walk through building a complete system from scratch, with production-ready code and cost optimization strategies.

What You'll Build

By the end of this guide, you'll have a fully functional Contextual Retrieval system that:

  • Improves Pass@10 accuracy from ~87% to ~95%
  • Reduces retrieval failure rates by 35%
  • Works with both embedding-based and BM25 search
  • Includes reranking for maximum precision

Prerequisites

Skills:
  • Intermediate Python (3.8+)
  • Basic RAG understanding
  • Familiarity with vector embeddings
API Keys Needed:
  • Anthropic (Claude)
  • Voyage AI
  • Cohere
System:
  • 4GB+ RAM
  • 5-10GB disk space
  • Docker (optional, for BM25)

1. Setting Up Your Environment

First, install the required libraries:

pip install anthropic voyageai cohere numpy pandas rank_bm25

Initialize your clients:

import anthropic
import voyageai
import cohere

# Initialize API clients
claude = anthropic.Anthropic(api_key="sk-ant-...")
vo = voyageai.Client(api_key="pa-...")
co = cohere.Client(api_key="...")

2. The Problem: Context-Less Chunks

Traditional RAG splits documents into chunks like this:

def basic_chunk(text, chunk_size=500, overlap=50):
    """Simple character-based chunking"""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks

The problem? A chunk containing "def calculate_interest(principal, rate, time):" loses meaning when separated from the function's docstring and surrounding context.
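
To make the problem concrete, here is a small sketch that runs the same character-based chunking over a toy source file. The sample text and the deliberately tiny `chunk_size` are illustrative assumptions, chosen only so the cuts are easy to see when printed.

```python
def basic_chunk(text, chunk_size=500, overlap=50):
    """Simple character-based chunking (same logic as above)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Toy document, invented for this illustration
source = (
    '"""Loan utilities for the billing service."""\n\n'
    "def calculate_interest(principal, rate, time):\n"
    '    """Return simple interest for a loan."""\n'
    "    return principal * rate * time\n"
)

# A very small chunk size makes the boundaries visible
chunks = basic_chunk(source, chunk_size=60, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Printing the chunks shows that boundaries fall wherever the character count dictates, so a signature, its docstring, and its body can land in different chunks. The overlap repeats a few characters across the boundary but does not restore the missing context.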

3. Implementing Contextual Embeddings

Contextual Embeddings solve this by prepending relevant context to each chunk before embedding:

def generate_chunk_context(chunk, full_document, claude_client):
    """Generate context for a single chunk using Claude"""
    prompt = f"""<document>
{full_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    response = claude_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Production Optimization with Prompt Caching

For large codebases, generating context for every chunk individually is expensive. Use prompt caching to reduce costs:

def generate_contexts_with_caching(chunks, full_document, claude_client):
    """Generate contexts using prompt caching for efficiency"""
    # Mark the full document as cacheable. The first request writes the cache;
    # later requests that send the identical system block read from it, so no
    # separate warm-up call is needed.
    system = [{
        "type": "text",
        "text": f"<document>\n{full_document}\n</document>",
        "cache_control": {"type": "ephemeral"},
    }]
    instruction = (
        "Please give a short succinct context to situate this chunk within "
        "the overall document for the purposes of improving search retrieval "
        "of the chunk. Answer only with the succinct context and nothing else."
    )

    contexts = []
    for chunk in chunks:
        response = claude_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            system=system,
            messages=[{
                "role": "user",
                "content": f"<chunk>\n{chunk}\n</chunk>\n\n{instruction}",
            }],
        )
        contexts.append(response.content[0].text)

    return contexts
Cost Savings: Prompt caching reduces API costs by ~70-80% for large document collections.
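
To see where savings of this size come from, the sketch below estimates input-token cost with and without caching. The multipliers (cache writes at 1.25x the base input price, cache reads at 0.10x) and the token counts are assumptions for illustration; check them against current pricing. The estimate ignores output tokens and assumes every request arrives while the cache entry is still alive.

```python
def relative_input_cost(doc_tokens, chunk_tokens, n_chunks,
                        cache_write_mult=1.25, cache_read_mult=0.10):
    """Return (uncached, cached) input cost in units of base-price tokens."""
    # Without caching, every request pays full price for document + chunk
    uncached = n_chunks * (doc_tokens + chunk_tokens)
    cached = (
        doc_tokens * cache_write_mult                    # first request writes the cache
        + (n_chunks - 1) * doc_tokens * cache_read_mult  # later requests read it
        + n_chunks * chunk_tokens                        # per-chunk text is never cached
    )
    return uncached, cached

# Hypothetical workload: one 8,000-token document split into 40 chunks
uncached, cached = relative_input_cost(doc_tokens=8000, chunk_tokens=300, n_chunks=40)
savings = 1 - cached / uncached
print(f"Estimated input-cost savings: {savings:.0%}")
```

The savings grow with the number of chunks per document, because the one-time cache write is spread over more cheap reads. Short documents with few chunks benefit much less.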

4. Building the Retrieval Pipeline

Step 1: Create Contextual Embeddings

def create_contextual_embeddings(chunks, contexts):
    """Create embeddings for context-enriched chunks"""
    contextual_chunks = [
        f"{context}\n\n{chunk}" 
        for context, chunk in zip(contexts, chunks)
    ]
    
    # Generate embeddings using Voyage AI
    embeddings = vo.embed(
        texts=contextual_chunks,
        model="voyage-2",
        input_type="document"
    ).embeddings
    
    return contextual_chunks, embeddings
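
Embedding APIs usually cap how many texts a single request may contain, so a large corpus has to be sent in batches. The helper below is a minimal sketch: `embed_in_batches` is a hypothetical name, the default `batch_size` of 128 is an assumption to verify against the provider's documented limit, and `fake_embed` stands in for the real API call so the example runs offline.

```python
def embed_in_batches(texts, embed_fn, batch_size=128):
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in for the real embedding call; records the size of each batch it receives
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches([f"chunk {i}" for i in range(300)], fake_embed)
print(calls, len(vectors))
```

With the Voyage client from earlier, `embed_fn` could be `lambda batch: vo.embed(texts=batch, model="voyage-2", input_type="document").embeddings`.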

Step 2: Implement Hybrid Search with Contextual BM25

Combine embedding search with BM25 for better results:

import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, embeddings, bm25, chunks, alpha=0.5):
    """Hybrid search combining embeddings and BM25"""
    # Embedding search
    query_embedding = vo.embed(
        texts=[query],
        model="voyage-2",
        input_type="query"
    ).embeddings[0]

    # Dot product; this equals cosine similarity only for unit-length vectors
    embedding_scores = np.dot(np.asarray(embeddings), np.asarray(query_embedding))

    # BM25 search
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)

    # Min-max normalize each score array to [0, 1] before combining
    def normalize(scores):
        span = scores.max() - scores.min()
        # All-equal scores would divide by zero, so map them to 0
        return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores, dtype=float)

    hybrid_scores = alpha * normalize(embedding_scores) + (1 - alpha) * normalize(bm25_scores)

    # Return top-k results
    top_indices = np.argsort(hybrid_scores)[::-1][:10]
    return [chunks[i] for i in top_indices]
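
The weighted blend above depends on min-max normalization, which can be skewed by a single outlier score. A common alternative, shown here as a swapped-in technique rather than the method this guide uses, is reciprocal rank fusion (RRF), which combines the two result lists by rank instead of by raw score. The function name, the constant `k=60`, and the toy scores are assumptions for illustration.

```python
import numpy as np

def reciprocal_rank_fusion(score_lists, k=60, top_k=10):
    """Fuse several score arrays over the same chunks using their ranks."""
    n = len(score_lists[0])
    fused = np.zeros(n)
    for scores in score_lists:
        # Indices sorted from best to worst score
        order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
        ranks = np.empty(n, dtype=int)
        ranks[order] = np.arange(1, n + 1)  # rank 1 = best
        fused += 1.0 / (k + ranks)
    return np.argsort(-fused, kind="stable")[:top_k].tolist()

# Toy scores for four chunks from the two retrievers
embedding_scores = [0.91, 0.80, 0.75, 0.10]
bm25_scores = [2.0, 0.0, 7.5, 0.3]
top = reciprocal_rank_fusion([embedding_scores, bm25_scores], top_k=3)
print(top)  # [0, 2, 1]
```

Chunk 0 ranks first because it places near the top of both lists. Chunk 2 follows: it is the best BM25 match even though the embedding search ranks it only third.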

Step 3: Add Reranking for Precision

def rerank_results(query, candidates, cohere_client):
    """Rerank candidates using Cohere's rerank model"""
    results = cohere_client.rerank(
        query=query,
        documents=candidates,
        model="rerank-english-v2.0",
        top_n=5
    )
    return [candidates[r.index] for r in results.results]

5. Complete Pipeline in Action

def contextual_rag_pipeline(query, chunks, full_document):
    """Complete Contextual Retrieval pipeline"""
    # 1. Generate contexts
    contexts = generate_contexts_with_caching(chunks, full_document, claude)
    
    # 2. Create contextual embeddings
    contextual_chunks, embeddings = create_contextual_embeddings(chunks, contexts)
    
    # 3. Build BM25 index
    tokenized_chunks = [chunk.split() for chunk in contextual_chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    
    # 4. Hybrid search
    candidates = hybrid_search(query, embeddings, bm25, contextual_chunks)
    
    # 5. Rerank
    final_results = rerank_results(query, candidates, co)
    
    return final_results

# Example usage
query = "How does the interest calculation function work?"
results = contextual_rag_pipeline(query, codebase_chunks, full_codebase)
print(f"Top result: {results[0]}")
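
The pipeline function regenerates contexts and embeddings on every call, which is fine for a demo but wasteful when many queries hit the same corpus. One way to separate the two phases is sketched below. `ContextualIndex` and the three `fake_*` functions are hypothetical names; the fakes stand in for the Claude, Voyage, and search calls so the example runs without API keys.

```python
class ContextualIndex:
    """Builds the expensive per-corpus artifacts once so queries can reuse them."""

    def __init__(self, chunks, full_document, context_fn, embed_fn):
        contexts = context_fn(chunks, full_document)
        self.contextual_chunks = [
            f"{context}\n\n{chunk}" for context, chunk in zip(contexts, chunks)
        ]
        self.embeddings = embed_fn(self.contextual_chunks)

    def query(self, query, search_fn, top_k=10):
        # Only the search step runs per query
        return search_fn(query, self.embeddings, self.contextual_chunks)[:top_k]

build_calls = []

def fake_context_fn(chunks, full_document):
    build_calls.append("context")
    return [f"From a {len(full_document)}-character document." for _ in chunks]

def fake_embed_fn(texts):
    build_calls.append("embed")
    return [[float(len(text))] for text in texts]

def fake_search_fn(query, embeddings, chunks):
    # Naive keyword match standing in for hybrid search
    return [chunk for chunk in chunks if query.lower() in chunk.lower()]

index = ContextualIndex(
    ["interest is principal * rate", "tax tables"],
    "x" * 40,
    fake_context_fn,
    fake_embed_fn,
)
first = index.query("interest", fake_search_fn)
second = index.query("tax", fake_search_fn)
print(build_calls)  # ['context', 'embed']
```

In a real system, `context_fn` could wrap `generate_contexts_with_caching`, and the BM25 index would be built in `__init__` alongside the embeddings.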

Performance Results

On a dataset of 9 codebases with 248 queries:

Method                            Pass@10   Improvement (percentage points)
Basic RAG                         87.1%     Baseline
Contextual Embeddings             94.8%     +7.7
Contextual Embeddings + BM25      96.2%     +9.1
Full Pipeline (with reranking)    97.5%     +10.4
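
The improvement column is a difference in Pass@10, which is not the same thing as a reduction in failure rate. The short sketch below converts between the two using the figures from the table, taking failure rate to be 100% minus Pass@10 (an assumption about how the metric is defined). The function name is hypothetical.

```python
def relative_failure_reduction(baseline_pass, new_pass):
    """Fraction by which the failure rate (100 - Pass@k) shrinks versus the baseline."""
    baseline_failure = 100 - baseline_pass
    new_failure = 100 - new_pass
    return (baseline_failure - new_failure) / baseline_failure

# Pass@10 values copied from the table above
pass_at_10 = {
    "Contextual Embeddings": 94.8,
    "Contextual Embeddings + BM25": 96.2,
    "Full Pipeline (with reranking)": 97.5,
}
reductions = {
    method: relative_failure_reduction(87.1, score)
    for method, score in pass_at_10.items()
}
for method, reduction in reductions.items():
    print(f"{method}: {reduction:.1%} fewer failed retrievals than Basic RAG")
```

On this benchmark the relative reductions are larger than the 35% headline figure, which is quoted across diverse datasets and so need not match any single one.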

Production Considerations

AWS Bedrock Integration

For AWS users, deploy a Lambda function for automatic contextual chunking:

# lambda_function.py (simplified)
def lambda_handler(event, context):
    """AWS Lambda for contextual chunking in Bedrock Knowledge Bases"""
    document = event['document']
    chunks = event['chunks']
    
    contexts = generate_contexts_with_caching(chunks, document, claude)
    
    return {
        'statusCode': 200,
        'chunks': [
            {'chunk': chunk, 'context': context}
            for chunk, context in zip(chunks, contexts)
        ]
    }

Cost Optimization Tips

  • Use Claude Haiku for context generation (cheapest model)
  • Batch context generation to minimize API calls
  • Cache embeddings for static documents
  • Set appropriate chunk sizes (500-1000 tokens recommended)
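
The embedding-cache tip can be sketched with a small on-disk cache keyed by a hash of each text, so unchanged chunks are never re-embedded. `cached_embed`, the JSON-file layout, and `fake_embed` are assumptions for illustration; a production system would more likely use a vector database or key-value store.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_embed(texts, embed_fn, cache_dir):
    """Return embeddings for texts, calling embed_fn only for texts not cached yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = [None] * len(texts)
    missing = []
    for i, text in enumerate(texts):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        path = cache_dir / f"{key}.json"
        if path.exists():
            results[i] = json.loads(path.read_text())
        else:
            missing.append((i, path))
    if missing:
        # One call covers every uncached text
        new_vectors = embed_fn([texts[i] for i, _ in missing])
        for (i, path), vector in zip(missing, new_vectors):
            results[i] = list(vector)
            path.write_text(json.dumps(results[i]))
    return results

# Stand-in for the real embedding call; records what it was asked to embed
calls = []

def fake_embed(batch):
    calls.append(list(batch))
    return [[float(len(text))] for text in batch]

with tempfile.TemporaryDirectory() as tmp:
    first = cached_embed(["alpha", "beta"], fake_embed, tmp)
    second = cached_embed(["alpha", "gamma!"], fake_embed, tmp)

print(calls)  # [['alpha', 'beta'], ['gamma!']]
```

Because the key is a hash of the full contextual chunk, regenerating a chunk's context changes the key and triggers a fresh embedding automatically.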

Key Takeaways

  • Contextual Embeddings reduce retrieval failures by 35% by adding document-level context to each chunk before embedding
  • Prompt caching makes this practical by reducing API costs by 70-80% for large document collections
  • Hybrid search (embeddings + BM25) outperforms either method alone, especially for codebases and technical documentation
  • Reranking adds a further 1-2 percentage points of Pass@10 and is worth implementing for production systems
  • The technique works across platforms—Anthropic API, AWS Bedrock, and GCP Vertex AI all support contextual retrieval with minor customization

Ready to supercharge your RAG system? Start by implementing contextual embeddings on your most challenging dataset. A 35% reduction in retrieval failures can make a real difference to your application's reliability.