Boost Your RAG Performance: A Practical Guide to Contextual Embeddings with Claude
This guide teaches you to implement Contextual Embeddings, a technique that adds relevant context to document chunks before embedding and reduces retrieval failures by 35%. You'll learn setup, implementation, and optimization with practical Python examples.
Retrieval Augmented Generation (RAG) has revolutionized how Claude interacts with your knowledge bases, enabling applications in customer support, legal analysis, code generation, and more. However, traditional RAG systems often struggle when document chunks lack sufficient context, leading to inaccurate retrievals.
In this guide, we'll explore Contextual Embeddings, a technique that reduces retrieval failure rates by 35% on average. We'll walk through implementation, optimization, and practical deployment considerations.
Prerequisites and Setup
Before diving in, ensure you have the following:
Technical Requirements:
- Python 3.8+
- Basic understanding of RAG and vector databases
- Intermediate Python programming skills
- Anthropic API key
- Voyage AI API key for embeddings
- Cohere API key for reranking (optional)
```bash
pip install anthropic voyageai cohere chromadb
```
Dataset:
We'll use a dataset of 9 codebases with 248 queries, each with a "golden chunk" for evaluation. You can find this in the Anthropic Cookbook repository.
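The code in this guide assumes each entry in the dataset JSON looks roughly like the example below. The field names (`text`, `source`) are inferred from the snippets later in this guide, and the values shown are illustrative; check the actual file in the cookbook repository.

```json
[
  {
    "text": "def calculate_total(items):\n    total = 0\n    ...",
    "source": "example_repo/cart.py"
  }
]
```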
Establishing a Baseline: Basic RAG
Let's first set up a traditional RAG system to understand our starting point. We'll use ChromaDB as our vector store and Voyage AI for embeddings.
```python
import json

import chromadb
from voyageai import Client as VoyageClient

# Initialize clients. PersistentClient replaces the deprecated
# Settings(chroma_db_impl="duckdb+parquet", ...) configuration
# in recent ChromaDB versions.
voyage_client = VoyageClient(api_key="your_voyage_key")
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Load pre-chunked documents
with open('data/codebase_chunks.json', 'r') as f:
    chunks = json.load(f)

# Generate embeddings for basic RAG
# (for large corpora, send texts in batches rather than all at once)
basic_embeddings = voyage_client.embed(
    texts=[chunk["text"] for chunk in chunks],
    model="voyage-code-2",
    input_type="document"
).embeddings

# Store in the vector database
collection = chroma_client.create_collection("basic_rag")
for i, (chunk, embedding) in enumerate(zip(chunks, basic_embeddings)):
    collection.add(
        embeddings=[embedding],
        documents=[chunk["text"]],
        metadatas=[{"source": chunk["source"]}],
        ids=[str(i)]
    )
```
Evaluation Metric: We'll use Pass@k, which measures whether the golden chunk appears in the top k retrieved documents. Our baseline achieves roughly 87% Pass@10.
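Pass@k itself is simple to compute. A minimal sketch (the function names and data layout here are ours, not from the cookbook):

```python
def pass_at_k(retrieved_ids, golden_id, k):
    """Return True if the golden chunk's ID appears in the top-k retrieved IDs."""
    return golden_id in retrieved_ids[:k]

def evaluate(results, k=10):
    """Average Pass@k over (retrieved_ids, golden_id) pairs."""
    hits = sum(pass_at_k(ids, gold, k) for ids, gold in results)
    return hits / len(results)
```

For example, `evaluate([(["a", "b"], "b"), (["a", "b"], "z")], k=2)` returns 0.5, since the golden chunk is found for one query out of two.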
Implementing Contextual Embeddings
Contextual Embeddings solve the context deficiency problem by adding relevant information to each chunk before embedding. Here's how it works:
The Core Concept
Instead of embedding raw chunks like:

```
def calculate_total(items):
    total = 0
```

We add context:

```
Function: calculate_total
Purpose: Sums all items in a shopping cart
Code:
def calculate_total(items):
    total = 0
```
Implementation Steps
- Generate Context for Each Chunk
```python
import anthropic

client = anthropic.Anthropic(api_key="your_anthropic_key")

def add_context_to_chunk(chunk_text, surrounding_chunks=None):
    """Add relevant context to a chunk using Claude."""
    surrounding = (
        "\n---\n".join(surrounding_chunks)
        if surrounding_chunks else "No additional context"
    )
    prompt = f"""You are a helpful coding assistant. Given the following code chunk, provide a concise summary that includes:
- The function/class name (if present)
- Its purpose
- Key parameters or variables
- Return value (if applicable)

Code chunk:
{chunk_text}

Context from surrounding code (if available):
{surrounding}

Provide only the summary, no additional commentary:"""

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    return f"Summary: {response.content[0].text}\n\nCode:\n{chunk_text}"

# Apply to all chunks
contextual_chunks = []
for i, chunk in enumerate(chunks):
    # Pass the text of the neighboring chunks (not the chunk itself) as context
    neighbors = [
        c["text"] for c in chunks[max(0, i - 1):i + 2] if c is not chunk
    ]
    contextual_text = add_context_to_chunk(chunk["text"], neighbors)
    contextual_chunks.append({
        "original_text": chunk["text"],
        "contextual_text": contextual_text,
        "source": chunk["source"]
    })
```
- Embed Contextualized Chunks
```python
# Generate embeddings for the contextualized chunks
contextual_embeddings = voyage_client.embed(
    texts=[chunk["contextual_text"] for chunk in contextual_chunks],
    model="voyage-code-2",
    input_type="document"
).embeddings

# Store in a separate collection
contextual_collection = chroma_client.create_collection("contextual_rag")
for i, (chunk, embedding) in enumerate(zip(contextual_chunks, contextual_embeddings)):
    contextual_collection.add(
        embeddings=[embedding],
        documents=[chunk["contextual_text"]],
        metadatas=[{
            "source": chunk["source"],
            "original_text": chunk["original_text"]
        }],
        ids=[f"contextual_{i}"]
    )
```
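The steps above build the collection but never query it. A minimal retrieval sketch (the function name `search_contextual` is ours; it assumes the `voyage_client` and `contextual_collection` created earlier):

```python
def search_contextual(query, voyage_client, collection, k=10):
    """Embed the query and retrieve the top-k contextual chunks."""
    query_embedding = voyage_client.embed(
        texts=[query],
        model="voyage-code-2",
        input_type="query",  # queries use a different input_type than documents
    ).embeddings[0]
    results = collection.query(query_embeddings=[query_embedding], n_results=k)
    # Chroma returns parallel lists; pair each document with its metadata
    return list(zip(results["documents"][0], results["metadatas"][0]))
```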
Cost Optimization with Prompt Caching
Generating context for every chunk can be expensive. Prompt caching (available on Anthropic's API) lets you cache a large shared prompt prefix, such as the full document the chunks come from, so you only pay full price for it once:
```python
# cache_control is set on a content block, not passed as a top-level
# argument to messages.create. Here the full source document is cached
# and reused across all chunk-level requests, while the chunk-specific
# prompt stays uncached. full_document_text and chunk_specific_prompt
# are placeholders for your own data.
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=150,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"<document>\n{full_document_text}\n</document>",
                "cache_control": {"type": "ephemeral"},  # cached prefix
            },
            {
                "type": "text",
                "text": chunk_specific_prompt,  # varies per chunk
            },
        ],
    }]
)
```
Contextual BM25: Hybrid Search Enhancement
Combine contextual embeddings with BM25 search for even better performance:
```python
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer data if needed
nltk.download('punkt')

def contextual_bm25_search(query, contextual_chunks, k=10):
    """Perform BM25 search on contextualized chunks.

    For simplicity this rebuilds the index on every call; in production,
    build the index once and reuse it across queries.
    """
    # Tokenize contextual texts
    tokenized_contexts = [
        word_tokenize(chunk["contextual_text"].lower())
        for chunk in contextual_chunks
    ]
    # Create BM25 index
    bm25 = BM25Okapi(tokenized_contexts)
    # Tokenize query
    tokenized_query = word_tokenize(query.lower())
    # Get scores
    scores = bm25.get_scores(tokenized_query)
    # Return top k results
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [contextual_chunks[i] for i in top_indices]
```
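To get a single ranked list, the embedding results and BM25 results still need to be merged. One common scheme is reciprocal rank fusion (RRF); the sketch below is a generic implementation, not code from this guide, and the constant 60 is the conventional RRF default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    rankings: list of ranked ID lists, best first.
    Each ID scores 1 / (k + rank) per list it appears in; IDs are
    returned sorted by their summed score, highest first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing an embedding ranking `["a", "b", "c"]` with a BM25 ranking `["b", "d", "a"]` favors documents that rank well in both lists.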
Reranking for Precision
After retrieval, use a reranker to improve final results:
```python
import cohere

co = cohere.Client("your_cohere_key")

def rerank_results(query, retrieved_chunks, top_n=5):
    """Rerank retrieved chunks using Cohere's reranker."""
    documents = [chunk["contextual_text"] for chunk in retrieved_chunks]
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model="rerank-english-v2.0"
    )
    # response.results holds rerank results whose .index points back
    # into the original documents list (exact shape varies by SDK version)
    return [retrieved_chunks[result.index] for result in response.results]
```
Deployment Considerations
AWS Bedrock Integration
For AWS Bedrock users, Anthropic provides a Lambda function for contextual chunking:
```python
# Example Lambda function (simplified)
def lambda_handler(event, context):
    """Add context to documents for a Bedrock Knowledge Base."""
    chunk = event['chunk']
    # Generate context using Claude via Bedrock
    contextual_chunk = add_context_to_chunk(chunk)
    return {
        'statusCode': 200,
        'body': {
            'contextualized_chunk': contextual_chunk
        }
    }
```
Deploy this function and select it as a custom chunking option when configuring your Bedrock Knowledge Base.
Production Best Practices
- Batch Processing: Process chunks in batches to optimize API calls
- Cache Strategically: Use prompt caching for identical or similar chunks
- Monitor Costs: Track embedding and context generation costs separately
- Update Strategy: Implement incremental updates rather than full re-embeddings
- Evaluation Pipeline: Regularly evaluate retrieval performance with new queries
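The batch-processing point above can be as simple as slicing the chunk list before calling the embedding API. A minimal sketch (the batch size is an arbitrary example; check your provider's current per-request limits):

```python
def batched(items, batch_size=64):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For example, embedding in batches instead of one request per chunk: `for batch in batched(texts): voyage_client.embed(texts=batch, model="voyage-code-2", input_type="document")`.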
Performance Results
Our implementation shows significant improvements:
- Pass@10: Improved from ~87% to ~95%
- Top-20-chunk retrieval failure rate: Reduced by 35%
- Query relevance: Noticeably improved for complex, context-dependent queries
Key Takeaways
- Contextual Embeddings reduce retrieval failures by 35% by adding relevant context to document chunks before embedding, solving the context deficiency problem in traditional RAG.
- Prompt caching is essential for cost management when generating context at scale, significantly reducing API costs for production deployments.
- Hybrid approaches work best—combine Contextual Embeddings with BM25 search and reranking for optimal performance across different query types.
- The technique is platform-agnostic and can be implemented on Anthropic's API, AWS Bedrock, or GCP Vertex AI with appropriate adaptations.
- Start with a baseline evaluation using metrics like Pass@k to measure improvements and justify the additional complexity and cost of contextual retrieval.