Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude
This guide shows you how to implement Contextual Embeddings, a technique that adds surrounding context to document chunks before embedding, reducing retrieval failure rates by 35% and boosting Pass@10 performance from roughly 87% to 95%.
Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals and suboptimal responses.
In this guide, we'll walk through implementing Contextual Embeddings—a powerful technique that reduces top-20-chunk retrieval failure rates by 35% on average. We'll use a dataset of 9 codebases with 248 queries to demonstrate practical improvements, moving from ~87% to ~95% Pass@10 performance.
Prerequisites and Setup
Before diving in, ensure you have the following:
Technical Requirements:
- Python 3.8+ installed
- Intermediate Python programming skills
- Basic understanding of RAG concepts
- Familiarity with vector databases
- 4GB+ RAM and 5-10GB disk space
- Anthropic API key (Claude access)
- Voyage AI API key (embeddings)
- Cohere API key (reranking, optional)
- Completion time: 30-45 minutes
- Estimated API cost: $5-10 for full dataset processing
Installation and Initial Setup
```bash
# Install required libraries
pip install anthropic voyageai cohere chromadb pymupdf tiktoken
```

```python
# Import necessary modules
import anthropic
import voyageai
import cohere
import chromadb
from chromadb.utils.embedding_functions import VoyageAIEmbeddingFunction
import json
import numpy as np
from typing import List, Dict, Any

# Initialize API clients
client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
vo = voyageai.Client(api_key="YOUR_VOYAGEAI_API_KEY")
co = cohere.Client("YOUR_COHERE_API_KEY")
```
Establishing a Baseline: Basic RAG System
First, let's set up a traditional RAG system to establish performance benchmarks. We'll use a dataset of codebase chunks and evaluation queries.
```python
# Load the dataset
with open('data/codebase_chunks.json', 'r') as f:
    codebase_chunks = json.load(f)

with open('data/evaluation_set.jsonl', 'r') as f:
    evaluation_queries = [json.loads(line) for line in f]

# Basic chunk embedding function
def embed_chunks_basic(chunks: List[str]) -> List[List[float]]:
    """Generate embeddings for chunks without additional context."""
    embeddings = vo.embed(
        texts=chunks,
        model="voyage-code-2",
        input_type="document"
    ).embeddings
    return embeddings
```
```python
# Create a vector store with basic embeddings
def create_basic_vector_store(chunks: Dict[str, Any]):
    """Set up ChromaDB with traditional embeddings."""
    chroma_client = chromadb.Client()

    # Prepare documents and metadata
    documents = []
    metadatas = []
    ids = []
    for chunk_id, chunk_data in chunks.items():
        documents.append(chunk_data['text'])
        metadatas.append({"source": chunk_data['source']})
        ids.append(chunk_id)

    # Generate embeddings
    embeddings = embed_chunks_basic(documents)

    # Create the collection; the embedding function is used to embed query texts
    collection = chroma_client.create_collection(
        name="basic_rag",
        embedding_function=VoyageAIEmbeddingFunction(
            api_key="YOUR_VOYAGEAI_API_KEY",
            model_name="voyage-code-2"
        )
    )

    # Add the precomputed embeddings to the collection
    collection.add(
        documents=documents,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids
    )
    return collection
```
```python
# Evaluate baseline performance
def evaluate_pass_at_k(collection, queries: List[Dict], k: int = 10) -> float:
    """Calculate Pass@k: the fraction of queries whose golden chunk
    appears in the top-k retrieved results."""
    correct = 0
    for query in queries:
        results = collection.query(
            query_texts=[query['query']],
            n_results=k
        )
        # Check whether the golden chunk is in the top-k results
        if query['golden_chunk_id'] in results['ids'][0]:
            correct += 1
    return correct / len(queries)

# Create and evaluate the baseline
basic_collection = create_basic_vector_store(codebase_chunks)
baseline_pass_10 = evaluate_pass_at_k(basic_collection, evaluation_queries, k=10)
print(f"Baseline Pass@10: {baseline_pass_10:.2%}")
```
Implementing Contextual Embeddings
Contextual Embeddings solve the context deficiency problem by adding relevant information to each chunk before embedding. This approach significantly improves retrieval accuracy.
How Contextual Embeddings Work
Traditional RAG splits documents into isolated chunks. Contextual Embeddings enrich each chunk with:
- Previous context (n chunks before)
- Current chunk (the main content)
- Following context (n chunks after)
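Before the full implementation, the enrichment can be illustrated on a toy list of three chunks with a window of one, using the same `[CONTEXT]`/`[CURRENT CHUNK]` markers as the function below (the chunk strings here are made up for illustration):

```python
# Toy illustration of the context window idea: embed chunk i together
# with its neighbors, tagged so they can be told apart.
toy_chunks = ["def load():", "    return db.read()", "def save(x):"]
i, window = 1, 1  # enrich the middle chunk with one neighbor on each side

parts = []
for j in range(max(0, i - window), min(len(toy_chunks), i + window + 1)):
    tag = "[CURRENT CHUNK]" if j == i else "[CONTEXT]"
    parts.append(f"{tag}\n{toy_chunks[j]}")
contextual_text = "\n\n".join(parts)
print(contextual_text)
```

The enriched string, not the bare chunk, is what gets embedded; the `max`/`min` clamping handles chunks at the start or end of a document.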
```python
def add_context_to_chunk(chunk_id: str, chunks: Dict[str, Any], context_window: int = 2) -> str:
    """Add surrounding context to a chunk."""
    # Get all chunks from the same source document
    source = chunks[chunk_id]['source']
    source_chunks = [
        (cid, data) for cid, data in chunks.items()
        if data['source'] == source
    ]

    # Sort by position within the document
    source_chunks.sort(key=lambda x: x[1].get('position', 0))

    # Find the current chunk's index
    chunk_ids = [cid for cid, _ in source_chunks]
    current_idx = chunk_ids.index(chunk_id)

    # Clamp the context window to the document boundaries
    start_idx = max(0, current_idx - context_window)
    end_idx = min(len(source_chunks), current_idx + context_window + 1)

    # Build the contextual chunk
    contextual_parts = []
    for idx in range(start_idx, end_idx):
        cid, data = source_chunks[idx]
        if idx == current_idx:
            contextual_parts.append(f"[CURRENT CHUNK]\n{data['text']}")
        else:
            contextual_parts.append(f"[CONTEXT]\n{data['text']}")
    return "\n\n".join(contextual_parts)
```
```python
def create_contextual_embeddings(chunks: Dict[str, Any], context_window: int = 2) -> Dict[str, List[float]]:
    """Generate embeddings with added context."""
    contextual_texts = []
    chunk_ids = []
    for chunk_id in chunks.keys():
        contextual_text = add_context_to_chunk(chunk_id, chunks, context_window)
        contextual_texts.append(contextual_text)
        chunk_ids.append(chunk_id)

    # Generate embeddings for the context-enriched texts
    embeddings = vo.embed(
        texts=contextual_texts,
        model="voyage-code-2",
        input_type="document"
    ).embeddings
    return dict(zip(chunk_ids, embeddings))
```
```python
# Create a contextual vector store
def create_contextual_vector_store(chunks: Dict[str, Any]):
    """Set up a vector store with contextual embeddings."""
    chroma_client = chromadb.Client()

    # Prepare documents and metadata
    documents = []
    metadatas = []
    ids = []
    for chunk_id, chunk_data in chunks.items():
        documents.append(chunk_data['text'])  # Store the original text
        metadatas.append({"source": chunk_data['source']})
        ids.append(chunk_id)

    # Generate contextual embeddings
    contextual_embeddings = create_contextual_embeddings(chunks)
    embeddings_list = [contextual_embeddings[cid] for cid in ids]

    # Create the collection; the embedding function is used to embed query texts
    collection = chroma_client.create_collection(
        name="contextual_rag",
        embedding_function=VoyageAIEmbeddingFunction(
            api_key="YOUR_VOYAGEAI_API_KEY",
            model_name="voyage-code-2"
        )
    )

    # Add the precomputed contextual embeddings to the collection
    collection.add(
        documents=documents,
        embeddings=embeddings_list,
        metadatas=metadatas,
        ids=ids
    )
    return collection
```
```python
# Evaluate contextual embeddings performance
contextual_collection = create_contextual_vector_store(codebase_chunks)
contextual_pass_10 = evaluate_pass_at_k(contextual_collection, evaluation_queries, k=10)
print(f"Contextual Embeddings Pass@10: {contextual_pass_10:.2%}")
print(f"Improvement: {contextual_pass_10 - baseline_pass_10:.2%} points")
```
Prompt Caching for Production Efficiency
Generating context and embeddings for every chunk can be expensive. Anthropic's prompt caching reduces the cost of repeated Claude calls when you use the model to generate chunk context; the example below complements it with a simple local embedding cache that avoids re-embedding unchanged chunks:
```python
# Example of a local embedding cache
class CachedContextualEmbedder:
    def __init__(self, chunks: Dict[str, Any], cache_file: str = "embedding_cache.json"):
        self.chunks = chunks
        self.cache_file = cache_file
        self.cache = self.load_cache()

    def load_cache(self) -> Dict[str, List[float]]:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get_embedding(self, chunk_id: str, context_window: int = 2) -> List[float]:
        """Get an embedding from the cache, or generate and cache a new one."""
        cache_key = f"{chunk_id}_ctx{context_window}"
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Generate a new embedding for the context-enriched chunk
        contextual_text = add_context_to_chunk(chunk_id, self.chunks, context_window)
        embedding = vo.embed(
            texts=[contextual_text],
            model="voyage-code-2",
            input_type="document"
        ).embeddings[0]

        # Cache the result
        self.cache[cache_key] = embedding
        self.save_cache()
        return embedding
```
Advanced Techniques
Contextual BM25 Hybrid Search
Combine contextual embeddings with BM25 for even better performance:
```python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

def create_contextual_bm25_index(chunks: Dict[str, Any], context_window: int = 1):
    """Create a BM25 index over context-enriched chunks."""
    tokenized_corpus = []
    chunk_ids = []
    for chunk_id in chunks.keys():
        contextual_text = add_context_to_chunk(chunk_id, chunks, context_window)
        tokens = word_tokenize(contextual_text.lower())
        tokenized_corpus.append(tokens)
        chunk_ids.append(chunk_id)
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25, chunk_ids
```
```python
def hybrid_search(query: str, bm25_index, chunk_ids, vector_collection, alpha: float = 0.5):
    """Combine BM25 and vector search scores with a weighted sum."""
    # BM25 search
    tokenized_query = word_tokenize(query.lower())
    bm25_scores = bm25_index.get_scores(tokenized_query)

    # Vector search
    vector_results = vector_collection.query(
        query_texts=[query],
        n_results=len(chunk_ids)
    )

    # Combine scores
    combined_scores = {}
    max_bm25 = max(bm25_scores) or 1
    for i, chunk_id in enumerate(chunk_ids):
        # Convert the vector distance to a similarity score
        if chunk_id in vector_results['ids'][0]:
            vector_idx = vector_results['ids'][0].index(chunk_id)
            vector_score = 1 - vector_results['distances'][0][vector_idx]
        else:
            vector_score = 0

        # Normalize the BM25 score to [0, 1]
        bm25_score = bm25_scores[i] / max_bm25

        # Weighted combination of the two signals
        combined_scores[chunk_id] = alpha * bm25_score + (1 - alpha) * vector_score

    # Sort by combined score, best first
    sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return [chunk_id for chunk_id, score in sorted_results]
```
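To see how `alpha` trades off the two signals, the weighted combination can be exercised on hypothetical, already-normalized scores for three chunks (the names and numbers here are made up for illustration):

```python
# Hypothetical normalized scores: BM25 favors chunk "a", vector similarity
# favors chunk "c".
bm25 = {"a": 1.0, "b": 0.4, "c": 0.1}
vec = {"a": 0.2, "b": 0.5, "c": 0.9}

def fuse(alpha):
    """The weighted sum used in hybrid_search, applied to the toy scores."""
    scores = {cid: alpha * bm25[cid] + (1 - alpha) * vec[cid] for cid in bm25}
    return sorted(scores, key=scores.get, reverse=True)

print(fuse(0.9))  # → ['a', 'b', 'c'] (BM25-dominated ranking)
print(fuse(0.1))  # → ['c', 'b', 'a'] (vector-dominated ranking)
```

In practice, tune `alpha` on a held-out query set rather than fixing it at 0.5.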
Reranking for Precision
Use Cohere's reranker to further improve results:
```python
def rerank_results(query: str, retrieved_chunks: List[str], top_k: int = 10) -> List[str]:
    """Use Cohere's reranker to improve result ordering."""
    if not retrieved_chunks:
        return []
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=retrieved_chunks,
        top_n=top_k
    )
    # Return the chunks in reranked order
    return [retrieved_chunks[result.index] for result in rerank_response.results]
```
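A common production shape is retrieve-wide-then-rerank-narrow. The sketch below wires the two stages together generically; `search_fn` and `rerank_fn` stand in for `hybrid_search` and `rerank_results`, and the stub functions are only for illustration:

```python
from typing import Callable, List

def retrieve_and_rerank(query: str,
                        search_fn: Callable[[str], List[str]],
                        rerank_fn: Callable[[str, List[str]], List[str]],
                        candidates: int = 150) -> List[str]:
    """Retrieve a wide candidate set, then rerank it down to a precise list."""
    retrieved = search_fn(query)[:candidates]
    return rerank_fn(query, retrieved)

# Stub stages for illustration; in practice plug in hybrid_search and
# rerank_results from above.
search_stub = lambda q: ["chunk_3", "chunk_1", "chunk_2"]
rerank_stub = lambda q, docs: sorted(docs)  # a real reranker scores with a model
print(retrieve_and_rerank("how is auth handled?", search_stub, rerank_stub))
# → ['chunk_1', 'chunk_2', 'chunk_3']
```

Retrieving ~100-200 candidates before reranking keeps recall high while the reranker handles precision.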
Production Considerations
AWS Bedrock Integration
For AWS Bedrock users, Anthropic provides a Lambda function for contextual chunking:
```python
# Example Lambda function structure for Bedrock
import json

def lambda_handler(event, context):
    """AWS Lambda function for contextual chunking in Bedrock Knowledge Bases."""
    # Parse the input
    chunk_text = event.get('chunkText', '')
    metadata = event.get('metadata', {})

    # Add context using surrounding chunks
    # (add_context_based_on_metadata is a placeholder for your own logic)
    contextual_chunk = add_context_based_on_metadata(chunk_text, metadata)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'contextualChunk': contextual_chunk,
            'metadata': metadata
        })
    }
```
Performance Optimization Tips
- Context Window Size: Start with 1-2 chunks before/after. Test to find optimal size for your data.
- Batch Processing: Process chunks in batches to optimize API calls.
- Cache Strategically: Cache embeddings for static documents, refresh for frequently updated content.
- Monitor Costs: Use prompt caching and track embedding generation costs.
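The batching tip can be sketched as a small helper that splits the input and calls an embed function once per batch; `embed_fn` stands in for a call like `vo.embed(...)`, and the batch size of 128 is only an illustrative default (Voyage batch limits vary by model):

```python
from typing import Callable, List

def embed_in_batches(texts: List[str],
                     embed_fn: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 128) -> List[List[float]]:
    """Embed texts in fixed-size batches to reduce per-request overhead."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stub embed function for illustration (one fake vector per text)
fake_embed = lambda batch: [[float(len(t))] for t in batch]
vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # → [[1.0], [2.0], [3.0]]
```

The same loop slots into `create_contextual_embeddings` when the corpus exceeds a single request's limits.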
Key Takeaways
- Contextual Embeddings reduce retrieval failure rates by 35% on average by adding surrounding context to document chunks before embedding.
- Pass@10 performance jumps from ~87% to ~95% when implementing this technique with codebase datasets.
- Prompt caching is essential for production deployments to manage costs effectively—available on Anthropic's API and coming to AWS Bedrock/GCP Vertex.
- Hybrid approaches work best: Combine contextual embeddings with BM25 search and reranking for optimal results.
- The technique is platform-agnostic: Implementable on Anthropic's API, AWS Bedrock (via custom Lambda), and GCP Vertex AI with minor adjustments.