Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude
This guide shows you how to implement Contextual Embeddings, a technique that adds surrounding context to document chunks before embedding, reducing retrieval failure rates by 35% and boosting Pass@10 performance from roughly 87% to 95%.
Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals and suboptimal responses.
In this guide, we'll walk through implementing Contextual Embeddings—a powerful technique that reduces top-20-chunk retrieval failure rates by 35% on average. We'll use a dataset of 9 codebases with 248 queries to demonstrate practical improvements, moving from ~87% to ~95% Pass@10 performance.
Prerequisites and Setup
Before diving in, ensure you have the following:
Technical Requirements:
- Python 3.8+ installed
- Intermediate Python programming skills
- Basic understanding of RAG concepts
- Familiarity with vector databases
- 4GB+ RAM and 5-10GB disk space
- Anthropic API key (Claude access)
- Voyage AI API key (embeddings)
- Cohere API key (reranking, optional)
- Completion time: 30-45 minutes
- Estimated API cost: $5-10 for full dataset processing
Installation and Initial Setup
```bash
# Install required libraries
pip install anthropic voyageai cohere chromadb pymupdf tiktoken
```

```python
# Import necessary modules
import anthropic
import voyageai
import cohere
import chromadb
from chromadb.utils.embedding_functions import VoyageAIEmbeddingFunction
import json
import numpy as np
from typing import List, Dict, Any

# Initialize API clients
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGEAI_API_KEY")
co = cohere.Client("YOUR_COHERE_API_KEY")
```
Establishing a Baseline: Basic RAG System
First, let's set up a traditional RAG system to establish performance benchmarks. We'll use a dataset of codebase chunks and evaluation queries.
```python
# Load the dataset
with open('data/codebase_chunks.json', 'r') as f:
    codebase_chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    evaluation_queries = [json.loads(line) for line in f]

# Basic chunk embedding function
def embed_chunks_basic(chunks: List[str]) -> List[List[float]]:
    """Generate embeddings for chunks without additional context."""
    embeddings = vo.embed(
        texts=chunks,
        model="voyage-code-2",
        input_type="document"
    ).embeddings
    return embeddings
```
```python
# Create a vector store with basic embeddings
def create_basic_vector_store(chunks: Dict[str, Any]):
    """Set up ChromaDB with traditional embeddings."""
    chroma_client = chromadb.Client()

    # Prepare documents and metadata
    documents = []
    metadatas = []
    ids = []
    for chunk_id, chunk_data in chunks.items():
        documents.append(chunk_data['text'])
        metadatas.append({"source": chunk_data['source']})
        ids.append(chunk_id)

    # Generate embeddings
    embeddings = embed_chunks_basic(documents)

    # Create the collection; the embedding function is used to embed query texts
    collection = chroma_client.create_collection(
        name="basic_rag",
        embedding_function=VoyageAIEmbeddingFunction(
            api_key="YOUR_VOYAGEAI_API_KEY",
            model_name="voyage-code-2"
        )
    )

    # Add the precomputed embeddings to the collection
    collection.add(
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
    return collection
```
```python
# Evaluate baseline performance
def evaluate_pass_at_k(collection, queries: List[Dict], k: int = 10) -> float:
    """Calculate Pass@k: the fraction of queries whose golden chunk
    appears in the top-k retrieved results."""
    correct = 0
    for query in queries:
        results = collection.query(
            query_texts=[query['query']],
            n_results=k
        )
        # Check whether the golden chunk is in the top-k results
        if query['golden_chunk_id'] in results['ids'][0]:
            correct += 1
    return correct / len(queries)

# Create and evaluate the baseline
basic_collection = create_basic_vector_store(codebase_chunks)
baseline_pass_10 = evaluate_pass_at_k(basic_collection, evaluation_queries, k=10)
print(f"Baseline Pass@10: {baseline_pass_10:.2%}")
```
Implementing Contextual Embeddings
Contextual Embeddings solve the context deficiency problem by adding relevant information to each chunk before embedding. This approach significantly improves retrieval accuracy.
How Contextual Embeddings Work
Traditional RAG splits documents into isolated chunks. Contextual Embeddings enrich each chunk with:
- Previous context (n chunks before)
- Current chunk (the main content)
- Following context (n chunks after)
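Before the full implementation, the enrichment can be illustrated on a toy list of three chunks with a window of one, using the same `[CONTEXT]`/`[CURRENT CHUNK]` markers as the function below (the chunk strings here are made up for illustration):

```python
# Toy illustration of the context window idea: embed chunk i together
# with its neighbors, tagged so they can be told apart.
toy_chunks = ["def load():", "    return db.read()", "def save(x):"]
i, window = 1, 1  # enrich the middle chunk with one neighbor on each side

parts = []
for j in range(max(0, i - window), min(len(toy_chunks), i + window + 1)):
    tag = "[CURRENT CHUNK]" if j == i else "[CONTEXT]"
    parts.append(f"{tag}\n{toy_chunks[j]}")
contextual_text = "\n\n".join(parts)
print(contextual_text)
```

The enriched string, not the bare chunk, is what gets embedded; the `max`/`min` clamping handles chunks at the start or end of a document.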
```python
def add_context_to_chunk(chunk_id: str, chunks: Dict[str, Any], context_window: int = 2) -> str:
    """Add surrounding context to a chunk."""
    # Get all chunks from the same source document
    source = chunks[chunk_id]['source']
    source_chunks = [
        (cid, data) for cid, data in chunks.items()
        if data['source'] == source
    ]

    # Sort by position within the document
    source_chunks.sort(key=lambda x: x[1].get('position', 0))

    # Find the current chunk's index
    chunk_ids = [cid for cid, _ in source_chunks]
    current_idx = chunk_ids.index(chunk_id)

    # Clamp the context window to the document boundaries
    start_idx = max(0, current_idx - context_window)
    end_idx = min(len(source_chunks), current_idx + context_window + 1)

    # Build the contextual chunk
    contextual_parts = []
    for idx in range(start_idx, end_idx):
        cid, data = source_chunks[idx]
        if idx == current_idx:
            contextual_parts.append(f"[CURRENT CHUNK]\n{data['text']}")
        else:
            contextual_parts.append(f"[CONTEXT]\n{data['text']}")
    return "\n\n".join(contextual_parts)
```
```python
def create_contextual_embeddings(chunks: Dict[str, Any], context_window: int = 2) -> Dict[str, List[float]]:
    """Generate embeddings with added context."""
    contextual_texts = []
    chunk_ids = []
    for chunk_id in chunks.keys():
        contextual_text = add_context_to_chunk(chunk_id, chunks, context_window)
        contextual_texts.append(contextual_text)
        chunk_ids.append(chunk_id)

    # Generate embeddings for the context-enriched texts
    embeddings = vo.embed(
        texts=contextual_texts,
        model="voyage-code-2",
        input_type="document"
    ).embeddings
    return dict(zip(chunk_ids, embeddings))
```
```python
# Create a contextual vector store
def create_contextual_vector_store(chunks: Dict[str, Any]):
    """Set up a vector store with contextual embeddings."""
    chroma_client = chromadb.Client()

    # Prepare documents and metadata
    documents = []
    metadatas = []
    ids = []
    for chunk_id, chunk_data in chunks.items():
        documents.append(chunk_data['text'])  # Store the original text
        metadatas.append({"source": chunk_data['source']})
        ids.append(chunk_id)

    # Generate contextual embeddings
    contextual_embeddings = create_contextual_embeddings(chunks)
    embeddings_list = [contextual_embeddings[cid] for cid in ids]

    # Create the collection; the embedding function is used to embed query texts
    collection = chroma_client.create_collection(
        name="contextual_rag",
        embedding_function=VoyageAIEmbeddingFunction(
            api_key="YOUR_VOYAGEAI_API_KEY",
            model_name="voyage-code-2"
        )
    )

    # Add the precomputed contextual embeddings to the collection
    collection.add(
        documents=documents,
        embeddings=embeddings_list,
        metadatas=metadatas,
        ids=ids
    )
    return collection
```
```python
# Evaluate contextual embeddings performance
contextual_collection = create_contextual_vector_store(codebase_chunks)
contextual_pass_10 = evaluate_pass_at_k(contextual_collection, evaluation_queries, k=10)
print(f"Contextual Embeddings Pass@10: {contextual_pass_10:.2%}")
print(f"Improvement: {contextual_pass_10 - baseline_pass_10:.2%} points")
```
Prompt Caching for Production Efficiency
Generating context and embeddings for every chunk can be expensive. Anthropic's prompt caching reduces the cost of repeated Claude calls when you use the model to generate chunk context; the example below complements it with a simple local embedding cache that avoids re-embedding unchanged chunks:
```python
# Example of a local embedding cache
class CachedContextualEmbedder:
    def __init__(self, chunks: Dict[str, Any], cache_file: str = "embedding_cache.json"):
        self.chunks = chunks
        self.cache_file = cache_file
        self.cache = self.load_cache()

    def load_cache(self) -> Dict[str, List[float]]:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get_embedding(self, chunk_id: str, context_window: int = 2) -> List[float]:
        """Get an embedding from the cache, or generate and cache a new one."""
        cache_key = f"{chunk_id}_ctx{context_window}"
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Generate a new embedding for the context-enriched chunk
        contextual_text = add_context_to_chunk(chunk_id, self.chunks, context_window)
        embedding = vo.embed(
            texts=[contextual_text],
            model="voyage-code-2",
            input_type="document"
        ).embeddings[0]

        # Cache the result
        self.cache[cache_key] = embedding
        self.save_cache()
        return embedding
```
Advanced Techniques
Contextual BM25 Hybrid Search
Combine contextual embeddings with BM25 for even better performance:
```python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

def create_contextual_bm25_index(chunks: Dict[str, Any], context_window: int = 1):
    """Create a BM25 index over context-enriched chunks."""
    tokenized_corpus = []
    chunk_ids = []
    for chunk_id in chunks.keys():
        contextual_text = add_context_to_chunk(chunk_id, chunks, context_window)
        tokens = word_tokenize(contextual_text.lower())
        tokenized_corpus.append(tokens)
        chunk_ids.append(chunk_id)
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25, chunk_ids
```
```python
def hybrid_search(query: str, bm25_index, chunk_ids, vector_collection, alpha: float = 0.5):
    """Combine BM25 and vector search scores with a weighted sum."""
    # BM25 search
    tokenized_query = word_tokenize(query.lower())
    bm25_scores = bm25_index.get_scores(tokenized_query)

    # Vector search
    vector_results = vector_collection.query(
        query_texts=[query],
        n_results=len(chunk_ids)
    )

    # Combine scores
    combined_scores = {}
    max_bm25 = max(bm25_scores) or 1
    for i, chunk_id in enumerate(chunk_ids):
        # Convert the vector distance to a similarity score
        if chunk_id in vector_results['ids'][0]:
            vector_idx = vector_results['ids'][0].index(chunk_id)
            vector_score = 1 - vector_results['distances'][0][vector_idx]
        else:
            vector_score = 0

        # Normalize the BM25 score to [0, 1]
        bm25_score = bm25_scores[i] / max_bm25

        # Weighted combination of the two signals
        combined_scores[chunk_id] = alpha * bm25_score + (1 - alpha) * vector_score

    # Sort by combined score, best first
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return [chunk_id for chunk_id, score in sorted_results]
```
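To see how `alpha` trades off the two signals, the weighted combination can be exercised on hypothetical, already-normalized scores for three chunks (the names and numbers here are made up for illustration):

```python
# Hypothetical normalized scores: BM25 favors chunk "a", vector similarity
# favors chunk "c".
bm25 = {"a": 1.0, "b": 0.4, "c": 0.1}
vec = {"a": 0.2, "b": 0.5, "c": 0.9}

def fuse(alpha):
    """The weighted sum used in hybrid_search, applied to the toy scores."""
    scores = {cid: alpha * bm25[cid] + (1 - alpha) * vec[cid] for cid in bm25}
    return sorted(scores, key=scores.get, reverse=True)

print(fuse(0.9))  # → ['a', 'b', 'c'] (BM25-dominated ranking)
print(fuse(0.1))  # → ['c', 'b', 'a'] (vector-dominated ranking)
```

In practice, tune `alpha` on a held-out query set rather than fixing it at 0.5.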
Reranking for Precision
Use Cohere's reranker to further improve results:
```python
def rerank_results(query: str, retrieved_chunks: List[str], top_k: int = 10) -> List[str]:
    """Use Cohere's reranker to improve result ordering."""
    if not retrieved_chunks:
        return []
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=retrieved_chunks,
        top_n=top_k
    )
    # Return the chunks in reranked order
    return [retrieved_chunks[result.index] for result in rerank_response.results]
```
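A common production shape is retrieve-wide-then-rerank-narrow. The sketch below wires the two stages together generically; `search_fn` and `rerank_fn` stand in for `hybrid_search` and `rerank_results`, and the stub functions are only for illustration:

```python
from typing import Callable, List

def retrieve_and_rerank(query: str,
                        search_fn: Callable[[str], List[str]],
                        rerank_fn: Callable[[str, List[str]], List[str]],
                        candidates: int = 150) -> List[str]:
    """Retrieve a wide candidate set, then rerank it down to a precise list."""
    retrieved = search_fn(query)[:candidates]
    return rerank_fn(query, retrieved)

# Stub stages for illustration; in practice plug in hybrid_search and
# rerank_results from above.
search_stub = lambda q: ["chunk_3", "chunk_1", "chunk_2"]
rerank_stub = lambda q, docs: sorted(docs)  # a real reranker scores with a model
print(retrieve_and_rerank("how is auth handled?", search_stub, rerank_stub))
# → ['chunk_1', 'chunk_2', 'chunk_3']
```

Retrieving ~100-200 candidates before reranking keeps recall high while the reranker handles precision.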
Production Considerations
AWS Bedrock Integration
For AWS Bedrock users, Anthropic provides a Lambda function for contextual chunking:
```python
# Example Lambda function structure for Bedrock
import json

def lambda_handler(event, context):
    """AWS Lambda function for contextual chunking in Bedrock Knowledge Bases."""
    # Parse the input
    chunk_text = event.get('chunkText', '')
    metadata = event.get('metadata', {})

    # Add context using surrounding chunks
    # (add_context_based_on_metadata is a placeholder for your own logic)
    contextual_chunk = add_context_based_on_metadata(chunk_text, metadata)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'contextualChunk': contextual_chunk,
            'metadata': metadata
        })
    }
```
Performance Optimization Tips
- Context Window Size: Start with 1-2 chunks before/after. Test to find optimal size for your data.
- Batch Processing: Process chunks in batches to optimize API calls.
- Cache Strategically: Cache embeddings for static documents, refresh for frequently updated content.
- Monitor Costs: Use prompt caching and track embedding generation costs.
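The batching tip can be sketched as a small helper that splits the input and calls an embed function once per batch; `embed_fn` stands in for a call like `vo.embed(...)`, and the batch size of 128 is only an illustrative default (Voyage batch limits vary by model):

```python
from typing import Callable, List

def embed_in_batches(texts: List[str],
                     embed_fn: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 128) -> List[List[float]]:
    """Embed texts in fixed-size batches to reduce per-request overhead."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stub embed function for illustration (one fake vector per text)
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # → [[1.0], [2.0], [3.0]]
```

The same loop slots into `create_contextual_embeddings` when the corpus exceeds a single request's limits.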
Key Takeaways
- Contextual Embeddings reduce retrieval failure rates by 35% on average by adding surrounding context to document chunks before embedding.
- Pass@10 performance jumps from ~87% to ~95% when implementing this technique with codebase datasets.
- Prompt caching is essential for production deployments to manage costs effectively—available on Anthropic's API and coming to AWS Bedrock/GCP Vertex.
- Hybrid approaches work best: Combine contextual embeddings with BM25 search and reranking for optimal results.
- The technique is platform-agnostic: Implementable on Anthropic's API, AWS Bedrock (via custom Lambda), and GCP Vertex AI with minor adjustments.