
Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude

Learn how to implement Contextual Embeddings, a technique that cuts Claude's retrieval failure rate by 35%. A step-by-step guide with code examples for better RAG systems.

Quick Answer

This guide teaches you to implement Contextual Embeddings, a technique that adds relevant context to document chunks before embedding, reducing Claude's retrieval failure rate by 35% compared to basic RAG systems. You'll learn setup, implementation, and optimization with practical code examples.

RAG, Contextual Embeddings, Retrieval, Claude API, Vector Search

Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals.

In this guide, we'll explore Contextual Embeddings, a technique that reduced the top-20-chunk retrieval failure rate by 35% in Anthropic's testing. We'll walk through implementation from basic RAG to optimized contextual retrieval with practical code examples.

Prerequisites and Setup

Before we begin, ensure you have:

Technical Requirements:
  • Python 3.8+
  • Basic understanding of RAG concepts
  • Familiarity with vector databases
  • Command-line proficiency
API Access: Anthropic and Voyage AI API keys (plus a Cohere key for the reranking step)
Install Required Libraries:
pip install anthropic voyageai cohere chromadb rank_bm25 nltk
Dataset: We'll use a dataset of 9 codebases with 248 queries, each with a known "golden chunk" for evaluation. You can find this in the Anthropic Cookbook repository.
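The loading and retrieval code below assumes each chunk record carries at least a text, source, and chunk_id field, roughly like this (the values are illustrative, not actual dataset contents):

# Illustrative chunk record; field names are inferred from the code that follows
example_chunk = {
    "text": "def calculate_total(items):\n    total = 0",
    "source": "shopping_cart.py",
    "chunk_id": "shopping_cart_0"
}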

1. Establishing a Baseline: Basic RAG

Let's first implement a traditional RAG system to understand our starting point. We'll use ChromaDB as our vector store and Voyage AI for embeddings.

import anthropic
import voyageai
from chromadb import PersistentClient
import json

Initialize APIs

vo = voyageai.Client(api_key="your_voyage_key")
client = anthropic.Anthropic(api_key="your_anthropic_key")

Load and chunk documents

def load_and_chunk_documents(filepath):
    with open(filepath, 'r') as f:
        chunks = json.load(f)
    return chunks

Create basic embeddings

def create_basic_embeddings(chunks):
    texts = [chunk["text"] for chunk in chunks]
    results = vo.embed(texts, model="voyage-code-2")
    return results.embeddings

Set up the vector database

def setup_vector_db(chunks, embeddings):
    chroma_client = PersistentClient(path="./chroma_db")
    collection = chroma_client.create_collection("basic_rag")
    # Add documents with metadata
    ids = [f"doc_{i}" for i in range(len(chunks))]
    metadatas = [{"source": chunk["source"], "chunk_id": chunk["chunk_id"]} for chunk in chunks]
    collection.add(
        embeddings=embeddings,
        documents=[chunk["text"] for chunk in chunks],
        metadatas=metadatas,
        ids=ids
    )
    return collection

Basic retrieval

def basic_retrieve(query, collection, k=10):
    query_embedding = vo.embed([query], model="voyage-code-2").embeddings[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    return results
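
To tie the baseline together, here is a minimal driver, assuming chunks.json matches the record format shown earlier (the file name and query string are illustrative):

# Run the baseline end to end (file name and query are assumptions)
chunks = load_and_chunk_documents("chunks.json")
embeddings = create_basic_embeddings(chunks)
collection = setup_vector_db(chunks, embeddings)
results = basic_retrieve("Where is the cart total computed?", collection, k=10)
print(results["documents"][0][:3])  # texts of the top 3 retrieved chunks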

This basic approach typically achieves ~87% Pass@10 accuracy (finding the golden chunk in top 10 results). Let's improve this.

2. Implementing Contextual Embeddings

Contextual Embeddings solve the "missing context" problem by adding relevant context to each chunk before creating embeddings. Here's how it works:

The Core Concept

Instead of embedding raw chunks like:
"def calculate_total(items):\n    total = 0"

We add context:

"Function from shopping_cart.py that calculates total price:\n\ndef calculate_total(items):\n    total = 0"

Implementation with Prompt Caching

Prompt caching makes context generation practical for production, since shared prompt content only has to be processed once and is then reused across requests. The function below adds a simple in-memory cache so duplicate chunks are never processed twice; a sketch of API-level prompt caching follows it:
def generate_contextual_chunks(chunks, use_caching=True):
    """Add context to chunks using Claude"""
    contextual_chunks = []
    cache = {} if use_caching else None
    
    for chunk in chunks:
        chunk_id = chunk["chunk_id"]
        
        # Check cache first
        if use_caching and chunk_id in cache:
            contextual_chunks.append(cache[chunk_id])
            continue
        
        # Generate context prompt
        context_prompt = f"""You are helping to add context to code chunks for better retrieval.
        
        Original chunk from {chunk['source']}:
        {chunk['text']}
        
        Provide 1-2 sentences of context about what this code does, what file it's from, 
        and its purpose. Return ONLY the context text."""
        
        # Get context from Claude
        response = client.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=100,
            messages=[{"role": "user", "content": context_prompt}]
        )
        
        context = response.content[0].text
        contextual_text = f"{context}\n\n{chunk['text']}"
        
        contextual_chunk = {
            **chunk,
            "contextual_text": contextual_text,
            "context": context
        }
        contextual_chunks.append(contextual_chunk)
        
        if use_caching:
            # Cache the full record so cache hits append the same structure
            cache[chunk_id] = contextual_chunk
    
    return contextual_chunks
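
The in-memory cache above only helps with duplicate chunks. Anthropic's prompt caching works at the API level: you mark the full document as a cacheable prefix, so its tokens are processed once and reused across the many per-chunk requests. A minimal sketch, assuming the whole document fits in the prompt and a recent anthropic SDK where cache_control is generally available:

def generate_context_with_prompt_caching(full_document, chunk_text):
    """Sketch: reuse a cached document prefix across per-chunk calls."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # a lighter model keeps context generation cheap
        max_tokens=100,
        system=[{
            "type": "text",
            "text": f"<document>\n{full_document}\n</document>",
            # Subsequent calls within the cache TTL reuse these tokens
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": f"Here is a chunk from the document above:\n{chunk_text}\n\n"
                       "Provide 1-2 sentences of context situating this chunk within "
                       "the document. Return ONLY the context text.",
        }],
    )
    return response.content[0].text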

Create contextual embeddings

def create_contextual_embeddings(contextual_chunks):
    texts = [chunk["contextual_text"] for chunk in contextual_chunks]
    results = vo.embed(texts, model="voyage-code-2")
    return results.embeddings

Performance Impact

When implemented, Contextual Embeddings improved Pass@10 performance from ~87% to ~95% on our codebase dataset—a significant improvement for production systems.

3. Enhancing with Contextual BM25

We can further improve results by combining contextual embeddings with BM25 keyword search. The twist is that the BM25 index is built over the contextual text rather than the raw chunks:

from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

def setup_contextual_bm25(contextual_chunks):
    """Create a BM25 index on the contextual text"""
    tokenized_corpus = []
    for chunk in contextual_chunks:
        tokens = word_tokenize(chunk["contextual_text"].lower())
        tokenized_corpus.append(tokens)
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25

def hybrid_retrieval(query, collection, bm25_index, contextual_chunks, alpha=0.5):
    """Combine vector and BM25 scores"""
    # Vector search
    vector_results = basic_retrieve(query, collection, k=20)
    # BM25 search on contextual text
    query_tokens = word_tokenize(query.lower())
    bm25_scores = bm25_index.get_scores(query_tokens)
    # Normalize and combine scores
    combined_scores = {}
    for i, chunk in enumerate(contextual_chunks):
        vector_score = 0
        doc_id = f"doc_{i}"  # matches the ids assigned in setup_vector_db
        if doc_id in vector_results["ids"][0]:
            idx = vector_results["ids"][0].index(doc_id)
            vector_score = vector_results["distances"][0][idx]
        # Normalize the BM25 score into a 0-1 range
        normalized_bm25 = (bm25_scores[i] - min(bm25_scores)) / \
            (max(bm25_scores) - min(bm25_scores) + 1e-8)
        # Lower distance means a better vector match, so invert it
        combined = alpha * (1 - vector_score) + (1 - alpha) * normalized_bm25
        combined_scores[chunk["chunk_id"]] = combined
    # Sort by combined score
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:10]
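
Usage is straightforward. The sketch below assumes a collection built from the contextual embeddings (e.g., by passing contextual_chunks and the output of create_contextual_embeddings through setup_vector_db); the query string is illustrative:

# Build the BM25 index once, then query the hybrid retriever
bm25 = setup_contextual_bm25(contextual_chunks)
top_hits = hybrid_retrieval("How is the cart total computed?", collection, bm25,
                            contextual_chunks, alpha=0.5)
for chunk_id, score in top_hits:
    print(f"{chunk_id}: {score:.3f}")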

4. Adding Reranking for Final Polish

For the best results, add a reranking step using models specifically trained for relevance:

import cohere

def rerank_results(query, retrieved_chunks, top_k=5):
    """Use Cohere's reranker to improve final ordering"""
    co = cohere.Client("your_cohere_key")
    documents = [chunk["contextual_text"] for chunk in retrieved_chunks]
    rerank_response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    reranked_chunks = []
    for result in rerank_response.results:
        reranked_chunks.append(retrieved_chunks[result.index])
    return reranked_chunks
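
Since hybrid_retrieval returns (chunk_id, score) pairs, map the ids back to full chunk records before reranking; a small glue sketch (variable names are assumptions):

# Map chunk ids back to full records, then rerank the candidates
chunk_by_id = {c["chunk_id"]: c for c in contextual_chunks}
candidates = [chunk_by_id[cid] for cid, _ in top_hits]
final_chunks = rerank_results("How is the cart total computed?", candidates, top_k=5)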

5. Production Considerations

AWS Bedrock Integration

For AWS Bedrock users, Anthropic provides a Lambda function for contextual chunking. Deploy this function and select it as a custom chunking option when configuring your Bedrock Knowledge Base.

Cost Management

Prompt caching is essential for cost-effective production use:
  • Cache context generation for identical chunks
  • Batch process documents offline
  • Use lighter models (Haiku) for context generation when possible, as in the sketch below
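
For example, parameterizing the model used for context generation makes it easy to drop down to Haiku for bulk offline runs; a hedged sketch:

def generate_context(chunk_text, source, model="claude-3-haiku-20240307"):
    """Generate chunk context with a configurable, cheaper model."""
    response = client.messages.create(
        model=model,
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Original chunk from {source}:\n{chunk_text}\n\n"
                       "Provide 1-2 sentences of context about what this code does "
                       "and its purpose. Return ONLY the context text.",
        }],
    )
    return response.content[0].text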

Evaluation Framework

Always measure performance with your specific dataset:
def evaluate_pass_at_k(retrieval_function, queries, golden_chunks, k=10):
    """Calculate Pass@k metric"""
    passes = 0
    total = len(queries)
    
    for query, golden_id in zip(queries, golden_chunks):
        results = retrieval_function(query, k=k)
        retrieved_ids = [r["chunk_id"] for r in results]
        
        if golden_id in retrieved_ids:
            passes += 1
    
    return passes / total
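
evaluate_pass_at_k expects a retrieval function that takes (query, k) and returns chunk records. A small adapter around the baseline retriever might look like this (the field access assumes the metadata stored in setup_vector_db):

def baseline_retrieval_fn(query, k=10):
    """Adapter: return chunk-id dicts from the baseline vector search."""
    results = basic_retrieve(query, collection, k=k)
    return [{"chunk_id": m["chunk_id"]} for m in results["metadatas"][0]]

score = evaluate_pass_at_k(baseline_retrieval_fn, queries, golden_chunks, k=10)
print(f"Pass@10: {score:.2%}")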

Key Takeaways

  • Contextual Embeddings cut retrieval failure rates by 35% in Anthropic's testing by adding relevant context to document chunks before embedding, addressing the "missing context" problem in traditional RAG.
  • Prompt caching makes this production-ready by allowing reuse of generated context, significantly reducing API costs and latency while maintaining performance benefits.
  • Hybrid search with Contextual BM25 further enhances results by combining semantic search with keyword matching on contextualized text, leveraging the strengths of both approaches.
  • Always evaluate with your specific data using metrics like Pass@k to measure actual performance improvements, as results can vary based on document type and query patterns.
  • The technique is platform-agnostic and can be adapted for AWS Bedrock, Google Vertex AI, or custom implementations, with Anthropic providing reference code for major platforms.
By implementing Contextual Embeddings, you can significantly improve your RAG system's accuracy, leading to better answers, reduced hallucinations, and more reliable AI applications using Claude.