
Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude

Learn how to implement Contextual Embeddings, a technique that cuts Claude's retrieval failure rate by 35%. A step-by-step guide with code examples for better RAG systems.

Quick Answer

This guide teaches you to implement Contextual Embeddings, a technique that adds relevant context to document chunks before embedding, reducing Claude's retrieval failure rate by 35% compared to basic RAG systems. You'll learn setup, implementation, and optimization with practical code examples.

RAG, Contextual Embeddings, Retrieval, Claude API, Vector Search

Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals.

In this guide, we'll explore Contextual Embeddings, a technique that reduced the top-20-chunk retrieval failure rate by 35% in Anthropic's testing. We'll walk through implementation from basic RAG to optimized contextual retrieval with practical code examples.

Prerequisites and Setup

Before we begin, ensure you have:

Technical Requirements:
  • Python 3.8+
  • Basic understanding of RAG concepts
  • Familiarity with vector databases
  • Command-line proficiency
API Access: Anthropic and Voyage AI API keys (plus a Cohere key for the reranking step)
Install Required Libraries:
pip install anthropic voyageai cohere chromadb rank_bm25 nltk
Dataset: We'll use a dataset of 9 codebases with 248 queries, each with a known "golden chunk" for evaluation. You can find this in the Anthropic Cookbook repository.
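The loading and retrieval code below assumes each chunk record carries at least a text, source, and chunk_id field, roughly like this (the values are illustrative, not actual dataset contents):

# Illustrative chunk record; field names are inferred from the code that follows
example_chunk = {
    "text": "def calculate_total(items):\n    total = 0",
    "source": "shopping_cart.py",
    "chunk_id": "shopping_cart_0"
}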

1. Establishing a Baseline: Basic RAG

Let's first implement a traditional RAG system to understand our starting point. We'll use ChromaDB as our vector store and Voyage AI for embeddings.

import anthropic
import voyageai
from chromadb import PersistentClient
import json

Initialize APIs

vo = voyageai.Client(api_key="your_voyage_key")
client = anthropic.Anthropic(api_key="your_anthropic_key")

Load and chunk documents

def load_and_chunk_documents(filepath):
    with open(filepath, 'r') as f:
        chunks = json.load(f)
    return chunks

Create basic embeddings

def create_basic_embeddings(chunks):
    texts = [chunk["text"] for chunk in chunks]
    results = vo.embed(texts, model="voyage-code-2")
    return results.embeddings

Set up the vector database

def setup_vector_db(chunks, embeddings):
    chroma_client = PersistentClient(path="./chroma_db")
    collection = chroma_client.create_collection("basic_rag")
    # Add documents with metadata
    ids = [f"doc_{i}" for i in range(len(chunks))]
    metadatas = [{"source": chunk["source"], "chunk_id": chunk["chunk_id"]} for chunk in chunks]
    collection.add(
        embeddings=embeddings,
        documents=[chunk["text"] for chunk in chunks],
        metadatas=metadatas,
        ids=ids
    )
    return collection

Basic retrieval

def basic_retrieve(query, collection, k=10):
    query_embedding = vo.embed([query], model="voyage-code-2").embeddings[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    return results
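
To tie the baseline together, here is a minimal driver, assuming chunks.json matches the record format shown earlier (the file name and query string are illustrative):

# Run the baseline end to end (file name and query are assumptions)
chunks = load_and_chunk_documents("chunks.json")
embeddings = create_basic_embeddings(chunks)
collection = setup_vector_db(chunks, embeddings)
results = basic_retrieve("Where is the cart total computed?", collection, k=10)
print(results["documents"][0][:3])  # texts of the top 3 retrieved chunks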

This basic approach typically achieves ~87% Pass@10 accuracy (finding the golden chunk in top 10 results). Let's improve this.

2. Implementing Contextual Embeddings

Contextual Embeddings solve the "missing context" problem by adding relevant context to each chunk before creating embeddings. Here's how it works:

The Core Concept

Instead of embedding raw chunks like:
"def calculate_total(items):\n    total = 0"

We add context:

"Function from shopping_cart.py that calculates total price:\n\ndef calculate_total(items):\n    total = 0"

Implementation with Prompt Caching

Prompt caching makes context generation practical for production, since shared prompt content only has to be processed once and is then reused across requests. The function below adds a simple in-memory cache so duplicate chunks are never processed twice; a sketch of API-level prompt caching follows it:
def generate_contextual_chunks(chunks, use_caching=True):
    """Add context to chunks using Claude"""
    contextual_chunks = []
    cache = {} if use_caching else None
    
    for chunk in chunks:
        chunk_id = chunk["chunk_id"]
        
        # Check cache first
        if use_caching and chunk_id in cache:
            contextual_chunks.append(cache[chunk_id])
            continue
        
        # Generate context prompt
        context_prompt = f"""You are helping to add context to code chunks for better retrieval.
        
        Original chunk from {chunk['source']}:
        {chunk['text']}
        
        Provide 1-2 sentences of context about what this code does, what file it's from, 
        and its purpose. Return ONLY the context text."""
        
        # Get context from Claude
        response = client.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=100,
            messages=[{"role": "user", "content": context_prompt}]
        )
        
        context = response.content[0].text
        contextual_text = f"{context}\n\n{chunk['text']}"
        
        contextual_chunk = {
            **chunk,
            "contextual_text": contextual_text,
            "context": context
        }
        contextual_chunks.append(contextual_chunk)
        
        if use_caching:
            # Cache the full record so cache hits append the same structure
            cache[chunk_id] = contextual_chunk
    
    return contextual_chunks
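
The in-memory cache above only helps with duplicate chunks. Anthropic's prompt caching works at the API level: you mark the full document as a cacheable prefix, so its tokens are processed once and reused across the many per-chunk requests. A minimal sketch, assuming the whole document fits in the prompt and a recent anthropic SDK where cache_control is generally available:

def generate_context_with_prompt_caching(full_document, chunk_text):
    """Sketch: reuse a cached document prefix across per-chunk calls."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # a lighter model keeps context generation cheap
        max_tokens=100,
        system=[{
            "type": "text",
            "text": f"<document>\n{full_document}\n</document>",
            # Subsequent calls within the cache TTL reuse these tokens
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": f"Here is a chunk from the document above:\n{chunk_text}\n\n"
                       "Provide 1-2 sentences of context situating this chunk within "
                       "the document. Return ONLY the context text.",
        }],
    )
    return response.content[0].text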

Create contextual embeddings

def create_contextual_embeddings(contextual_chunks):
    texts = [chunk["contextual_text"] for chunk in contextual_chunks]
    results = vo.embed(texts, model="voyage-code-2")
    return results.embeddings

Performance Impact

When implemented, Contextual Embeddings improved Pass@10 performance from ~87% to ~95% on our codebase dataset—a significant improvement for production systems.

3. Enhancing with Contextual BM25

We can further improve results by combining contextual embeddings with BM25 keyword search. The twist is that the BM25 index is built over the contextual text rather than the raw chunks:

from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

def setup_contextual_bm25(contextual_chunks):
    """Create a BM25 index on the contextual text"""
    tokenized_corpus = []
    for chunk in contextual_chunks:
        tokens = word_tokenize(chunk["contextual_text"].lower())
        tokenized_corpus.append(tokens)
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25

def hybrid_retrieval(query, collection, bm25_index, contextual_chunks, alpha=0.5):
    """Combine vector and BM25 scores"""
    # Vector search
    vector_results = basic_retrieve(query, collection, k=20)
    # BM25 search on contextual text
    query_tokens = word_tokenize(query.lower())
    bm25_scores = bm25_index.get_scores(query_tokens)
    # Normalize and combine scores
    combined_scores = {}
    for i, chunk in enumerate(contextual_chunks):
        vector_score = 0
        doc_id = f"doc_{i}"  # matches the ids assigned in setup_vector_db
        if doc_id in vector_results["ids"][0]:
            idx = vector_results["ids"][0].index(doc_id)
            vector_score = vector_results["distances"][0][idx]
        # Normalize the BM25 score into a 0-1 range
        normalized_bm25 = (bm25_scores[i] - min(bm25_scores)) / \
            (max(bm25_scores) - min(bm25_scores) + 1e-8)
        # Lower distance means a better vector match, so invert it
        combined = alpha * (1 - vector_score) + (1 - alpha) * normalized_bm25
        combined_scores[chunk["chunk_id"]] = combined
    # Sort by combined score
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:10]
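
Usage is straightforward. The sketch below assumes a collection built from the contextual embeddings (e.g., by passing contextual_chunks and the output of create_contextual_embeddings through setup_vector_db); the query string is illustrative:

# Build the BM25 index once, then query the hybrid retriever
bm25 = setup_contextual_bm25(contextual_chunks)
top_hits = hybrid_retrieval("How is the cart total computed?", collection, bm25,
                            contextual_chunks, alpha=0.5)
for chunk_id, score in top_hits:
    print(f"{chunk_id}: {score:.3f}")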

4. Adding Reranking for Final Polish

For the best results, add a reranking step using models specifically trained for relevance:

import cohere

def rerank_results(query, retrieved_chunks, top_k=5):
    """Use Cohere's reranker to improve final ordering"""
    co = cohere.Client("your_cohere_key")
    documents = [chunk["contextual_text"] for chunk in retrieved_chunks]
    rerank_response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    reranked_chunks = []
    for result in rerank_response.results:
        reranked_chunks.append(retrieved_chunks[result.index])
    return reranked_chunks
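
Since hybrid_retrieval returns (chunk_id, score) pairs, map the ids back to full chunk records before reranking; a small glue sketch (variable names are assumptions):

# Map chunk ids back to full records, then rerank the candidates
chunk_by_id = {c["chunk_id"]: c for c in contextual_chunks}
candidates = [chunk_by_id[cid] for cid, _ in top_hits]
final_chunks = rerank_results("How is the cart total computed?", candidates, top_k=5)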

5. Production Considerations

AWS Bedrock Integration

For AWS Bedrock users, Anthropic provides a Lambda function for contextual chunking. Deploy this function and select it as a custom chunking option when configuring your Bedrock Knowledge Base.

Cost Management

Prompt caching is essential for cost-effective production use:
  • Cache context generation for identical chunks
  • Batch process documents offline
  • Use lighter models (Haiku) for context generation when possible, as in the sketch below
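
For example, parameterizing the model used for context generation makes it easy to drop down to Haiku for bulk offline runs; a hedged sketch:

def generate_context(chunk_text, source, model="claude-3-haiku-20240307"):
    """Generate chunk context with a configurable, cheaper model."""
    response = client.messages.create(
        model=model,
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"Original chunk from {source}:\n{chunk_text}\n\n"
                       "Provide 1-2 sentences of context about what this code does "
                       "and its purpose. Return ONLY the context text.",
        }],
    )
    return response.content[0].text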

Evaluation Framework

Always measure performance with your specific dataset:
def evaluate_pass_at_k(retrieval_function, queries, golden_chunks, k=10):
    """Calculate Pass@k metric"""
    passes = 0
    total = len(queries)
    
    for query, golden_id in zip(queries, golden_chunks):
        results = retrieval_function(query, k=k)
        retrieved_ids = [r["chunk_id"] for r in results]
        
        if golden_id in retrieved_ids:
            passes += 1
    
    return passes / total
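
evaluate_pass_at_k expects a retrieval function that takes (query, k) and returns chunk records. A small adapter around the baseline retriever might look like this (the field access assumes the metadata stored in setup_vector_db):

def baseline_retrieval_fn(query, k=10):
    """Adapter: return chunk-id dicts from the baseline vector search."""
    results = basic_retrieve(query, collection, k=k)
    return [{"chunk_id": m["chunk_id"]} for m in results["metadatas"][0]]

score = evaluate_pass_at_k(baseline_retrieval_fn, queries, golden_chunks, k=10)
print(f"Pass@10: {score:.2%}")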

Key Takeaways

  • Contextual Embeddings cut retrieval failure rates by 35% in Anthropic's testing by adding relevant context to document chunks before embedding, addressing the "missing context" problem in traditional RAG.
  • Prompt caching makes this production-ready by allowing reuse of generated context, significantly reducing API costs and latency while maintaining performance benefits.
  • Hybrid search with Contextual BM25 further enhances results by combining semantic search with keyword matching on contextualized text, leveraging the strengths of both approaches.
  • Always evaluate with your specific data using metrics like Pass@k to measure actual performance improvements, as results can vary based on document type and query patterns.
  • The technique is platform-agnostic and can be adapted for AWS Bedrock, Google Vertex AI, or custom implementations, with Anthropic providing reference code for major platforms.
By implementing Contextual Embeddings, you can significantly improve your RAG system's accuracy, leading to better answers, reduced hallucinations, and more reliable AI applications using Claude.