
Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude

Learn how to implement Contextual Embeddings to cut Claude's retrieval failure rate by 35%. Step-by-step guide with code examples for enhanced RAG systems.

Quick Answer

This guide teaches you to implement Contextual Embeddings, a technique that adds surrounding context to document chunks before embedding, reducing retrieval failure rates by 35% and boosting Pass@10 performance from ~87% to ~95%.

RAG · Contextual Embeddings · Retrieval · Claude API · Vector Search


Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals and suboptimal responses.

In this guide, we'll walk through implementing Contextual Embeddings—a powerful technique that reduces top-20-chunk retrieval failure rates by 35% on average. We'll use a dataset of 9 codebases with 248 queries to demonstrate practical improvements, moving from ~87% to ~95% Pass@10 performance.

Prerequisites and Setup

Before diving in, ensure you have the following:

Technical Requirements:
  • Python 3.8+ installed
  • Intermediate Python programming skills
  • Basic understanding of RAG concepts
  • Familiarity with vector databases
  • 4GB+ RAM and 5-10GB disk space
API Access:
  • Anthropic API key
  • Voyage AI API key
  • Cohere API key

Time & Cost:
  • Completion time: 30-45 minutes
  • Estimated API cost: $5-10 for full dataset processing

Installation and Initial Setup

# Install required libraries
!pip install anthropic voyageai cohere chromadb pymupdf tiktoken rank_bm25 nltk

# Import necessary modules
import anthropic
import voyageai
import cohere
import chromadb
from chromadb.utils.embedding_functions import VoyageAIEmbeddingFunction
import json
import numpy as np
from typing import List, Dict, Any

# Initialize API clients
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGEAI_API_KEY")
co = cohere.Client("YOUR_COHERE_API_KEY")

Establishing a Baseline: Basic RAG System

First, let's set up a traditional RAG system to establish performance benchmarks. We'll use a dataset of codebase chunks and evaluation queries.

# Load the dataset
with open('data/codebase_chunks.json', 'r') as f:
    codebase_chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    evaluation_queries = [json.loads(line) for line in f]
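
The code in the rest of this guide assumes the following shapes for these two files. Treat this as a sketch: the field names below (text, source, position, query, golden_chunk_id) are inferred from how the functions that follow consume the data.

# Assumed shape of data/codebase_chunks.json (chunk_id -> chunk data):
# {
#     "doc1_chunk_000": {"text": "...", "source": "repo/file.py", "position": 0},
#     ...
# }
#
# Assumed shape of each line in data/evaluation_set.jsonl:
# {"query": "How does the parser handle comments?", "golden_chunk_id": "doc1_chunk_000"}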

# Basic chunk embedding function
def embed_chunks_basic(chunks: List[str]) -> List[List[float]]:
    """Generate embeddings for chunks without additional context"""
    embeddings = vo.embed(
        texts=chunks,
        model="voyage-code-2",
        input_type="document"
    ).embeddings
    return embeddings

# Create vector store with basic embeddings
def create_basic_vector_store(chunks: Dict[str, Any]):
    """Set up ChromaDB with traditional embeddings"""
    chroma_client = chromadb.Client()

    # Prepare documents and metadata
    documents = []
    metadatas = []
    ids = []
    for chunk_id, chunk_data in chunks.items():
        documents.append(chunk_data['text'])
        metadatas.append({"source": chunk_data['source']})
        ids.append(chunk_id)

    # Generate embeddings
    embeddings = embed_chunks_basic(documents)

    # Create collection
    collection = chroma_client.create_collection(
        name="basic_rag",
        embedding_function=VoyageAIEmbeddingFunction(
            api_key=vo.api_key,
            model_name="voyage-code-2"
        )
    )

    # Add to collection
    collection.add(
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
    return collection

# Evaluate baseline performance
def evaluate_pass_at_k(collection, queries: List[Dict], k: int = 10) -> float:
    """Calculate Pass@k metric"""
    correct = 0
    for query in queries:
        results = collection.query(
            query_texts=[query['query']],
            n_results=k
        )
        # Check if golden chunk is in results
        if query['golden_chunk_id'] in results['ids'][0]:
            correct += 1
    return correct / len(queries)

# Create and evaluate baseline
basic_collection = create_basic_vector_store(codebase_chunks)
baseline_pass_10 = evaluate_pass_at_k(basic_collection, evaluation_queries, k=10)
print(f"Baseline Pass@10: {baseline_pass_10:.2%}")

Implementing Contextual Embeddings

Contextual Embeddings solve the context deficiency problem by adding relevant information to each chunk before embedding. This approach significantly improves retrieval accuracy.

How Contextual Embeddings Work

Traditional RAG splits documents into isolated chunks. Contextual Embeddings enrich each chunk with:

  • Previous context (n chunks before)
  • Current chunk (the main content)
  • Following context (n chunks after)

This creates a "context window" around each chunk, giving the embedding model a clearer picture of where the chunk sits in its document.

def add_context_to_chunk(chunk_id: str, chunks: Dict[str, Any], context_window: int = 2) -> str:
    """Add surrounding context to a chunk"""
    # Get all chunk IDs from the same source
    source = chunks[chunk_id]['source']
    source_chunks = [
        (cid, data) for cid, data in chunks.items() 
        if data['source'] == source
    ]
    
    # Sort by position
    source_chunks.sort(key=lambda x: x[1].get('position', 0))
    
    # Find current chunk index
    chunk_ids = [cid for cid, _ in source_chunks]
    current_idx = chunk_ids.index(chunk_id)
    
    # Get context window
    start_idx = max(0, current_idx - context_window)
    end_idx = min(len(source_chunks), current_idx + context_window + 1)
    
    # Build contextual chunk
    contextual_parts = []
    for idx in range(start_idx, end_idx):
        cid, data = source_chunks[idx]
        if idx == current_idx:
            contextual_parts.append(f"[CURRENT CHUNK]\n{data['text']}")
        else:
            contextual_parts.append(f"[CONTEXT]\n{data['text']}")
    
    return "\n\n".join(contextual_parts)

def create_contextual_embeddings(chunks: Dict[str, Any], context_window: int = 2) -> Dict[str, List[float]]:
    """Generate embeddings with added context"""
    contextual_texts = []
    chunk_ids = []
    for chunk_id in chunks.keys():
        contextual_text = add_context_to_chunk(chunk_id, chunks, context_window)
        contextual_texts.append(contextual_text)
        chunk_ids.append(chunk_id)

    # Generate embeddings
    embeddings = vo.embed(
        texts=contextual_texts,
        model="voyage-code-2",
        input_type="document"
    ).embeddings
    return dict(zip(chunk_ids, embeddings))

# Create contextual vector store
def create_contextual_vector_store(chunks: Dict[str, Any]):
    """Set up vector store with contextual embeddings"""
    chroma_client = chromadb.Client()

    # Prepare documents and metadata
    documents = []
    metadatas = []
    ids = []
    for chunk_id, chunk_data in chunks.items():
        documents.append(chunk_data['text'])  # Store original text
        metadatas.append({"source": chunk_data['source']})
        ids.append(chunk_id)

    # Generate contextual embeddings
    contextual_embeddings = create_contextual_embeddings(chunks)
    embeddings_list = [contextual_embeddings[cid] for cid in ids]

    # Create collection
    collection = chroma_client.create_collection(
        name="contextual_rag",
        embedding_function=VoyageAIEmbeddingFunction(
            api_key=vo.api_key,
            model_name="voyage-code-2"
        )
    )

    # Add to collection
    collection.add(
        documents=documents,
        embeddings=embeddings_list,
        metadatas=metadatas,
        ids=ids
    )
    return collection

# Evaluate contextual embeddings performance
contextual_collection = create_contextual_vector_store(codebase_chunks)
contextual_pass_10 = evaluate_pass_at_k(contextual_collection, evaluation_queries, k=10)
print(f"Contextual Embeddings Pass@10: {contextual_pass_10:.2%}")
print(f"Improvement: {contextual_pass_10 - baseline_pass_10:.2%} points")

Prompt Caching for Production Efficiency

Generating contextual embeddings for every chunk can be expensive. If you use Claude to generate chunk-specific context, prompt caching keeps those repeated document-prompt calls cheap; for the embedding step itself, a simple local cache avoids re-embedding unchanged chunks:

# Simple on-disk embedding cache to avoid regenerating embeddings
class CachedContextualEmbedder:
    def __init__(self, chunks: Dict[str, Any], cache_file: str = "embedding_cache.json"):
        self.chunks = chunks
        self.cache_file = cache_file
        self.cache = self.load_cache()
    
    def load_cache(self) -> Dict[str, List[float]]:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
    
    def save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)
    
    def get_embedding(self, chunk_id: str, context_window: int = 2) -> List[float]:
        """Get embedding from cache or generate new"""
        cache_key = f"{chunk_id}_ctx{context_window}"
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        # Generate new embedding
        contextual_text = add_context_to_chunk(chunk_id, self.chunks, context_window)
        embedding = vo.embed(
            texts=[contextual_text],
            model="voyage-code-2",
            input_type="document"
        ).embeddings[0]
        
        # Cache result
        self.cache[cache_key] = embedding
        self.save_cache()
        
        return embedding
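
Usage is a one-liner per chunk; the chunk ID below is hypothetical:

# Warm the cache once, then repeat lookups are free
embedder = CachedContextualEmbedder(codebase_chunks)
vector = embedder.get_embedding("doc1_chunk_003", context_window=2)  # hypothetical ID
print(len(vector))  # embedding dimensionality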

Advanced Techniques

Contextual BM25 Hybrid Search

Combine contextual embeddings with BM25 for even better performance:

from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

def create_contextual_bm25_index(chunks: Dict[str, Any], context_window: int = 1):
    """Create BM25 index with contextual chunks"""
    tokenized_corpus = []
    chunk_ids = []
    for chunk_id in chunks.keys():
        contextual_text = add_context_to_chunk(chunk_id, chunks, context_window)
        tokens = word_tokenize(contextual_text.lower())
        tokenized_corpus.append(tokens)
        chunk_ids.append(chunk_id)
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25, chunk_ids

def hybrid_search(query: str, bm25_index, chunk_ids, vector_collection, alpha: float = 0.5):
    """Combine BM25 and vector search scores"""
    # BM25 search
    tokenized_query = word_tokenize(query.lower())
    bm25_scores = bm25_index.get_scores(tokenized_query)

    # Vector search
    vector_results = vector_collection.query(
        query_texts=[query],
        n_results=len(chunk_ids)
    )

    # Combine scores
    combined_scores = {}
    for i, chunk_id in enumerate(chunk_ids):
        # Get vector score (distance converted to similarity)
        if chunk_id in vector_results['ids'][0]:
            vector_idx = vector_results['ids'][0].index(chunk_id)
            vector_score = 1 - vector_results['distances'][0][vector_idx]
        else:
            vector_score = 0

        # Normalize BM25 score
        bm25_score = bm25_scores[i] / (max(bm25_scores) or 1)

        # Combine
        combined_scores[chunk_id] = alpha * bm25_score + (1 - alpha) * vector_score

    # Sort by combined score
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return [chunk_id for chunk_id, score in sorted_results]
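
A minimal sketch of wiring the two together, reusing the contextual collection built earlier (the query string is invented for illustration):

# Build the BM25 index once, then issue hybrid queries
bm25, bm25_chunk_ids = create_contextual_bm25_index(codebase_chunks)
ranked_ids = hybrid_search(
    "where is the retry logic for failed API calls?",  # hypothetical query
    bm25, bm25_chunk_ids, contextual_collection, alpha=0.5
)
print(ranked_ids[:10])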

Reranking for Precision

Use Cohere's reranker to further improve results:

def rerank_results(query: str, retrieved_chunks: List[str], top_k: int = 10) -> List[str]:
    """Use Cohere reranker to improve result ordering"""
    if not retrieved_chunks:
        return []
    
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=retrieved_chunks,
        top_n=top_k
    )
    
    # Return reordered chunks
    return [retrieved_chunks[result.index] for result in rerank_response.results]
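
Putting it together: retrieve a generous candidate set with hybrid search, then let the reranker pick the final top 10. A sketch, assuming the objects defined in the hybrid-search example above:

# Two-stage retrieval: broad hybrid recall, then precise reranking
query = "where is the retry logic for failed API calls?"  # hypothetical query
candidate_ids = hybrid_search(query, bm25, bm25_chunk_ids, contextual_collection)[:50]
candidate_texts = [codebase_chunks[cid]['text'] for cid in candidate_ids]
top_chunks = rerank_results(query, candidate_texts, top_k=10)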

Production Considerations

AWS Bedrock Integration

For AWS Bedrock users, Knowledge Bases let you plug in a custom Lambda function for contextual chunking:

# Example Lambda function structure for Bedrock
import json

def lambda_handler(event, context):
    """AWS Lambda function for contextual chunking in Bedrock Knowledge Bases"""
    # Parse input
    chunk_text = event.get('chunkText', '')
    metadata = event.get('metadata', {})

    # Add context using surrounding chunks
    contextual_chunk = add_context_based_on_metadata(chunk_text, metadata)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'contextualChunk': contextual_chunk,
            'metadata': metadata
        })
    }
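
The helper add_context_based_on_metadata is left undefined above. A minimal sketch, assuming the event metadata carries neighboring chunk texts under hypothetical previousChunk/nextChunk keys (your Knowledge Base's actual payload shape may differ):

def add_context_based_on_metadata(chunk_text: str, metadata: dict) -> str:
    """Sketch: wrap a chunk with neighbor text when the event supplies it."""
    parts = []
    if metadata.get('previousChunk'):  # hypothetical key
        parts.append(f"[CONTEXT]\n{metadata['previousChunk']}")
    parts.append(f"[CURRENT CHUNK]\n{chunk_text}")
    if metadata.get('nextChunk'):  # hypothetical key
        parts.append(f"[CONTEXT]\n{metadata['nextChunk']}")
    return "\n\n".join(parts)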

Performance Optimization Tips

  • Context Window Size: Start with 1-2 chunks before/after. Test to find optimal size for your data.
  • Batch Processing: Process chunks in batches to optimize API calls (see the sketch after this list).
  • Cache Strategically: Cache embeddings for static documents, refresh for frequently updated content.
  • Monitor Costs: Use prompt caching and track embedding generation costs.
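
As a sketch of the batching tip, reusing the vo client from earlier; the batch size is a placeholder to tune against your provider's payload limits:

def embed_in_batches(texts: List[str], batch_size: int = 128) -> List[List[float]]:
    """Embed texts in fixed-size batches instead of one call per chunk."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings.extend(
            vo.embed(texts=batch, model="voyage-code-2", input_type="document").embeddings
        )
    return embeddings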

Key Takeaways

  • Contextual Embeddings reduce retrieval failure rates by 35% on average by adding surrounding context to document chunks before embedding.
  • Pass@10 performance jumps from ~87% to ~95% when implementing this technique with codebase datasets.
  • Prompt caching is essential for production deployments to manage costs effectively—available on Anthropic's API and coming to AWS Bedrock/GCP Vertex.
  • Hybrid approaches work best: Combine contextual embeddings with BM25 search and reranking for optimal results.
  • The technique is platform-agnostic: Implementable on Anthropic's API, AWS Bedrock (via custom Lambda), and GCP Vertex AI with minor adjustments.

By implementing Contextual Embeddings in your RAG pipeline, you'll significantly enhance Claude's ability to retrieve relevant information, leading to more accurate and helpful responses across your applications.