Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude
This guide teaches you to implement Contextual Embeddings, a technique that adds relevant context to document chunks before embedding, reducing retrieval failure rates by 35% compared to basic RAG systems. You'll learn setup, implementation, and optimization with practical code examples.
Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals.
In this guide, we'll explore Contextual Embeddings, a technique that reduced the top-20-chunk retrieval failure rate by 35% across Anthropic's testing. We'll walk through implementation from basic RAG to optimized contextual retrieval with practical code examples.
Prerequisites and Setup
Before we begin, ensure you have:
Technical Requirements:

- Python 3.8+
- Basic understanding of RAG concepts
- Familiarity with vector databases
- Command-line proficiency

API Keys:

- Anthropic API key
- Voyage AI API key for embeddings
- Cohere API key for reranking (optional)

Install the required packages:

```bash
pip install anthropic voyageai cohere chromadb rank_bm25 nltk
```
Dataset:
We'll use a dataset of 9 codebases with 248 queries, each with a known "golden chunk" for evaluation. You can find this in the Anthropic Cookbook repository.
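The code in this guide assumes each chunk is a dict with `chunk_id`, `source`, and `text` fields. Here is a hypothetical illustration of that schema with a small validation helper; the Cookbook dataset's exact field names may differ, so adjust accordingly:

```python
# Hypothetical example of the chunk schema assumed throughout this guide;
# the actual Cookbook dataset may use slightly different field names.
sample_chunk = {
    "chunk_id": "doc_0",
    "source": "shopping_cart.py",
    "text": "def calculate_total(items):\n    total = 0",
}

def validate_chunk(chunk):
    """Verify a chunk carries the fields the retrieval code relies on."""
    required = {"chunk_id", "source", "text"}
    missing = required - chunk.keys()
    if missing:
        raise ValueError(f"chunk is missing fields: {sorted(missing)}")
    return True
```

Running `validate_chunk` over your dataset before indexing catches schema mismatches early, before they surface as KeyErrors deep in the pipeline.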
1. Establishing a Baseline: Basic RAG
Let's first implement a traditional RAG system to understand our starting point. We'll use ChromaDB as our vector store and Voyage AI for embeddings.
```python
import anthropic
import voyageai
from chromadb import PersistentClient
import json

# Initialize APIs
vo = voyageai.Client(api_key="your_voyage_key")
client = anthropic.Anthropic(api_key="your_anthropic_key")

# Load and chunk documents
def load_and_chunk_documents(filepath):
    with open(filepath, 'r') as f:
        chunks = json.load(f)
    return chunks

# Create basic embeddings
def create_basic_embeddings(chunks):
    texts = [chunk["text"] for chunk in chunks]
    results = vo.embed(texts, model="voyage-code-2")
    return results.embeddings

# Set up the vector database
def setup_vector_db(chunks, embeddings):
    chroma_client = PersistentClient(path="./chroma_db")
    collection = chroma_client.create_collection("basic_rag")
    # Add documents with metadata
    ids = [f"doc_{i}" for i in range(len(chunks))]
    metadatas = [{"source": chunk["source"], "chunk_id": chunk["chunk_id"]}
                 for chunk in chunks]
    collection.add(
        embeddings=embeddings,
        documents=[chunk["text"] for chunk in chunks],
        metadatas=metadatas,
        ids=ids
    )
    return collection

# Basic retrieval
def basic_retrieve(query, collection, k=10):
    query_embedding = vo.embed([query], model="voyage-code-2").embeddings[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    return results
```
This basic approach typically achieves ~87% Pass@10 accuracy (finding the golden chunk in top 10 results). Let's improve this.
2. Implementing Contextual Embeddings
Contextual Embeddings solve the "missing context" problem by adding relevant context to each chunk before creating embeddings. Here's how it works:
The Core Concept
Instead of embedding a raw chunk like:

```
def calculate_total(items):
    total = 0
```

we prepend context:

```
Function from shopping_cart.py that calculates total price:

def calculate_total(items):
    total = 0
```
Implementation with Caching
A local cache avoids regenerating context for identical chunks, which makes this practical for production:

```python
def generate_contextual_chunks(chunks, use_caching=True):
    """Add context to chunks using Claude"""
    contextual_chunks = []
    cache = {} if use_caching else None
    for chunk in chunks:
        chunk_id = chunk["chunk_id"]
        # Check the local cache first so identical chunks are only processed once
        if use_caching and chunk_id in cache:
            contextual_chunks.append(cache[chunk_id])
            continue
        # Build the context-generation prompt
        context_prompt = f"""You are helping to add context to code chunks for better retrieval.

Original chunk from {chunk['source']}:
{chunk['text']}

Provide 1-2 sentences of context about what this code does, what file it's from,
and its purpose. Return ONLY the context text."""
        # Get context from Claude
        response = client.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=100,
            messages=[{"role": "user", "content": context_prompt}]
        )
        context = response.content[0].text
        contextual_text = f"{context}\n\n{chunk['text']}"
        entry = {
            **chunk,
            "contextual_text": contextual_text,
            "context": context
        }
        contextual_chunks.append(entry)
        if use_caching:
            cache[chunk_id] = entry
    return contextual_chunks

# Create contextual embeddings
def create_contextual_embeddings(contextual_chunks):
    texts = [chunk["contextual_text"] for chunk in contextual_chunks]
    results = vo.embed(texts, model="voyage-code-2")
    return results.embeddings
```
Performance Impact
When implemented, Contextual Embeddings improved Pass@10 performance from ~87% to ~95% on our codebase dataset, a significant improvement for production systems.

3. Enhancing with Contextual BM25
We can further improve results by combining contextual embeddings with BM25 search. Instead of traditional keyword matching, we use the contextual text:
```python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')

def setup_contextual_bm25(contextual_chunks):
    """Create a BM25 index over the contextual text"""
    tokenized_corpus = []
    for chunk in contextual_chunks:
        tokens = word_tokenize(chunk["contextual_text"].lower())
        tokenized_corpus.append(tokens)
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25

def hybrid_retrieval(query, collection, bm25_index, contextual_chunks, alpha=0.5):
    """Combine vector and BM25 scores"""
    # Vector search
    vector_results = basic_retrieve(query, collection, k=20)
    # BM25 search on contextual text
    query_tokens = word_tokenize(query.lower())
    bm25_scores = bm25_index.get_scores(query_tokens)
    # Normalize and combine scores
    combined_scores = {}
    for i, chunk in enumerate(contextual_chunks):
        vector_score = 0
        # Collection ids follow the f"doc_{i}" pattern used in setup_vector_db
        doc_id = f"doc_{i}"
        if doc_id in vector_results["ids"][0]:
            idx = vector_results["ids"][0].index(doc_id)
            vector_score = vector_results["distances"][0][idx]
        # Min-max normalize the BM25 score into the 0-1 range
        normalized_bm25 = (bm25_scores[i] - min(bm25_scores)) / \
                          (max(bm25_scores) - min(bm25_scores) + 1e-8)
        # Distances are lower-is-better, so invert before weighting
        combined = alpha * (1 - vector_score) + (1 - alpha) * normalized_bm25
        combined_scores[chunk["chunk_id"]] = combined
    # Sort by combined score
    sorted_results = sorted(combined_scores.items(),
                            key=lambda x: x[1], reverse=True)
    return sorted_results[:10]
```
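To make the weighting concrete, here is the same combination rule applied to toy scores (`combine_scores` is a small illustrative helper, not part of the pipeline above):

```python
def combine_scores(vector_distance, bm25_score, bm25_min, bm25_max, alpha=0.5):
    """Blend a vector distance (lower is better) with a min-max
    normalized BM25 score (higher is better)."""
    normalized_bm25 = (bm25_score - bm25_min) / (bm25_max - bm25_min + 1e-8)
    return alpha * (1 - vector_distance) + (1 - alpha) * normalized_bm25

# A chunk that is close in vector space (distance 0.2) and tops the
# BM25 ranking scores about 0.9...
strong = combine_scores(0.2, 12.0, 0.0, 12.0)
# ...while a distant chunk (0.9) with a middling keyword score lands near 0.3.
weak = combine_scores(0.9, 6.0, 0.0, 12.0)
```

Tuning `alpha` shifts the balance: 1.0 is pure vector search, 0.0 is pure BM25, and 0.5 weights them equally.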
4. Adding Reranking for Final Polish
For the best results, add a reranking step using models specifically trained for relevance:
```python
import cohere

def rerank_results(query, retrieved_chunks, top_k=5):
    """Use Cohere's reranker to improve the final ordering"""
    co = cohere.Client("your_cohere_key")
    documents = [chunk["contextual_text"] for chunk in retrieved_chunks]
    rerank_response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_k,
        model="rerank-english-v2.0"
    )
    reranked_chunks = []
    for result in rerank_response.results:
        reranked_chunks.append(retrieved_chunks[result.index])
    return reranked_chunks
```
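Putting the stages together, the full query path is retrieve-then-rerank. A sketch with the stages passed in as functions, so it works with `hybrid_retrieval` and `rerank_results` above or any drop-in replacements:

```python
def retrieval_pipeline(query, retrieve, rerank, top_k=5):
    """Run retrieval, then hand the candidate chunks to the reranker.
    `retrieve` returns candidate chunks for a query; `rerank` reorders
    them by relevance and keeps the top_k."""
    candidates = retrieve(query)
    return rerank(query, candidates, top_k)

# Example wiring (hypothetical): hybrid_retrieval returns (chunk_id, score)
# pairs, so map the ids back to their chunks inside `retrieve` before
# passing rerank_results as the reranker.
```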
5. Production Considerations
AWS Bedrock Integration
For AWS Bedrock users, Anthropic provides a Lambda function for contextual chunking. Deploy this function and select it as a custom chunking option when configuring your Bedrock Knowledge Base.

Cost Management
Prompt caching is essential for cost-effective production use:

- Cache context generation for identical chunks
- Batch process documents offline
- Use lighter models (Haiku) for context generation when possible
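A back-of-envelope estimate helps when choosing between models and caching strategies. The rates below are placeholders, not current prices; substitute the published per-million-token rates for your chosen model, and note the cache-read discount is an assumption to vary:

```python
def estimate_context_cost(num_chunks, avg_input_tokens, avg_output_tokens,
                          input_per_mtok, output_per_mtok,
                          cache_hit_rate=0.0, cache_read_discount=0.9):
    """Rough cost (in dollars) of generating context for every chunk.
    Cache-hit input tokens are assumed to be billed at a steep discount."""
    input_tokens = num_chunks * avg_input_tokens
    output_tokens = num_chunks * avg_output_tokens
    # Blend full-price and discounted (cache-hit) input tokens
    effective_input_rate = input_per_mtok * (
        (1 - cache_hit_rate) + cache_hit_rate * (1 - cache_read_discount)
    )
    return (input_tokens / 1e6) * effective_input_rate + \
           (output_tokens / 1e6) * output_per_mtok

# 10,000 chunks at ~2,000 prompt tokens and ~80 output tokens each,
# with illustrative $0.25/$1.25 per-million-token rates:
no_cache = estimate_context_cost(10_000, 2_000, 80, 0.25, 1.25)
with_cache = estimate_context_cost(10_000, 2_000, 80, 0.25, 1.25,
                                   cache_hit_rate=0.9)
```

Comparing `no_cache` and `with_cache` for your corpus size makes the caching payoff concrete before you commit to a batch run.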
Evaluation Framework
Always measure performance with your specific dataset:

```python
def evaluate_pass_at_k(retrieval_function, queries, golden_chunks, k=10):
    """Calculate the Pass@k metric"""
    passes = 0
    total = len(queries)
    for query, golden_id in zip(queries, golden_chunks):
        results = retrieval_function(query, k=k)
        retrieved_ids = [r["chunk_id"] for r in results]
        if golden_id in retrieved_ids:
            passes += 1
    return passes / total
```
Key Takeaways
- Contextual Embeddings reduce retrieval failure rates by 35% on average by adding relevant context to document chunks before embedding, addressing the "missing context" problem in traditional RAG.
- Prompt caching makes this production-ready by allowing reuse of generated context, significantly reducing API costs and latency while maintaining performance benefits.
- Hybrid search with Contextual BM25 further enhances results by combining semantic search with keyword matching on contextualized text, leveraging the strengths of both approaches.
- Always evaluate with your specific data using metrics like Pass@k to measure actual performance improvements, as results can vary based on document type and query patterns.
- The technique is platform-agnostic and can be adapted for AWS Bedrock, Google Vertex AI, or custom implementations, with Anthropic providing reference code for major platforms.