Mastering Document Summarization with Claude: From Basic Prompts to Advanced RAG Techniques
A practical guide to summarizing documents with Claude, covering basic prompts, multi-shot techniques, guided summarization, handling long documents via RAG, and evaluating summary quality with automated metrics.
Introduction
Summarization is one of the most powerful and practical applications of large language models like Claude. Whether you're a legal professional drowning in contracts, a researcher sifting through papers, or a business analyst reviewing quarterly reports, the ability to condense lengthy documents into concise, accurate summaries saves time and improves decision-making.
This guide walks you through the complete workflow of document summarization with Claude — from basic prompt techniques to advanced Retrieval-Augmented Generation (RAG) approaches for documents that exceed token limits. We'll also cover how to evaluate summary quality using automated metrics like ROUGE scores and tools like Promptfoo.
By the end, you'll have a reusable framework for building, testing, and refining summarization systems tailored to your specific domain.
Setup and Environment
Before we start, install the required packages:
```bash
pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn
```

Note that scikit-learn must be installed under that name (`pip install sklearn` fails), and Promptfoo is a Node.js CLI rather than a Python package: install it separately with `npm install -g promptfoo`.
You'll also need a valid Claude API key. Set it as an environment variable:
```bash
export ANTHROPIC_API_KEY="your-api-key-here"
```
Data Preparation
Most real-world documents come as PDFs. Here's a Python function to extract text from a PDF and clean it for summarization:
```python
import re

import pypdf


def extract_text_from_pdf(pdf_path):
    """Extract raw text from every page of a PDF."""
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # extract_text() can return None on some pages
    return text


def clean_text(text):
    """Normalize whitespace and strip non-ASCII characters."""
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove non-ASCII characters if needed
    text = text.encode('ascii', 'ignore').decode()
    return text.strip()


# Example usage
raw_text = extract_text_from_pdf("sublease_agreement.pdf")
document_text = clean_text(raw_text)  # named to avoid shadowing the clean_text function
print(f"Extracted {len(document_text)} characters")
```
If you don't have a PDF, you can skip this step and define `document_text` directly.
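For example, any plain string works as input for the rest of the guide (the text below is a placeholder; substitute your own document):

```python
# Placeholder text; substitute your own document
document_text = (
    "This Sublease Agreement is entered into by and between the Sublandlord "
    "and the Subtenant, and sets out the term, rent, and obligations of each party."
)
```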
Basic Summarization
Let's start with a simple summarization function using Claude's Messages API:
```python
import anthropic

client = anthropic.Anthropic()


def summarize_text(text, max_summary_length=200):
    """Summarize text with Claude; max_summary_length caps output tokens, not words."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_summary_length,
        messages=[
            {
                "role": "user",
                "content": f"Please summarize the following text concisely:\n\n{text}"
            }
        ]
    )
    return response.content[0].text


summary = summarize_text(document_text)
print(summary)
```
This basic approach works, but it has limitations: the summary may miss key details, and it won't handle documents longer than Claude's context window (200K tokens).
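If you're not sure whether a document fits, you can count tokens before sending it. A quick sketch, assuming an SDK version that exposes the Messages API token-counting endpoint (`client.messages.count_tokens`):

```python
# Pre-flight check: how many input tokens will this document consume?
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": document_text}],
)
print(f"Document is about {count.input_tokens} input tokens")
```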
Multi-Shot Basic Summarization
A simple improvement is a multi-shot approach, sometimes called map-reduce summarization: break the document into chunks, summarize each chunk, then summarize the summaries:
```python
def chunk_text(text, chunk_size=50000):
    """Split text into chunks of at most ~chunk_size characters, on word boundaries."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        current_length += len(word) + 1  # +1 for the joining space
        if current_length > chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word)
        else:
            current_chunk.append(word)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks


def multi_shot_summarize(text, chunk_size=50000):
    chunks = chunk_text(text, chunk_size)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i+1}/{len(chunks)}...")
        chunk_summaries.append(summarize_text(chunk, max_summary_length=300))
    # Now summarize the summaries
    combined_summaries = "\n\n".join(chunk_summaries)
    final_summary = summarize_text(
        f"Combine these section summaries into one coherent overall summary:\n\n{combined_summaries}",
        max_summary_length=400
    )
    return final_summary
```
This technique effectively extends Claude's summarization capability to arbitrarily long documents.
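Usage mirrors the basic function; expect one API call per chunk plus a final combining call:

```python
final_summary = multi_shot_summarize(document_text)
print(final_summary)
```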
Advanced Techniques
Guided Summarization
Instead of a generic "summarize this" prompt, guide Claude with specific instructions:
```python
def guided_summarize(text, focus_areas=None):
    prompt = "Summarize the following document."
    if focus_areas:  # only add the focus list when one is provided
        prompt += " Focus on:\n"
        for area in focus_areas:
            prompt += f"- {area}\n"
    prompt += f"\nDocument:\n{text}"
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text


# Example: legal document focus
summary = guided_summarize(document_text, focus_areas=[
    "Key obligations of each party",
    "Termination conditions",
    "Financial terms and payment schedules",
    "Liability and indemnification clauses"
])
```
Domain-Specific Guided Summarization
For specialized domains like legal or medical, include domain-specific instructions:
```python
def legal_summarize(text):
    instructions = """You are a legal document analyst. Summarize this contract with:
- PARTIES: Who are the involved parties?
- TERM: Duration and renewal terms
- OBLIGATIONS: Key responsibilities of each party
- FINANCIAL: Payment amounts, schedules, penalties
- TERMINATION: Conditions for early termination
- LIABILITY: Indemnification, limitations of liability
- GOVERNING LAW: Jurisdiction and dispute resolution
"""
    # Concatenate rather than str.format(), which breaks if the document contains braces
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=800,
        messages=[{"role": "user", "content": instructions + "\nDocument:\n" + text}]
    )
    return response.content[0].text
```
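Usage is the same as the earlier helpers:

```python
print(legal_summarize(document_text))
```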
Meta-Summarization
For very long documents, use a hierarchical approach: summarize sections, then summarize the summaries, and optionally extract metadata:
```python
def meta_summarize(text, chunk_size=30000):
    # Step 1: Extract metadata (title, date, parties, etc.) from the opening pages
    metadata_prompt = (
        "Extract key metadata from this document: title, date, "
        f"parties involved, document type.\n\n{text[:10000]}"
    )
    metadata = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": metadata_prompt}]
    ).content[0].text

    # Step 2: Chunk and summarize each section
    chunks = chunk_text(text, chunk_size)
    summaries = [summarize_text(chunk, max_summary_length=300) for chunk in chunks]

    # Step 3: Combine metadata and section summaries into a final summary
    combined = "\n\n".join(summaries)
    final_prompt = (
        f"Metadata:\n{metadata}\n\nSection Summaries:\n{combined}\n\n"
        "Create a cohesive final summary."
    )
    final_summary = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": final_prompt}]
    ).content[0].text
    return {"metadata": metadata, "summary": final_summary}
```
Summary Indexed Documents: An Advanced RAG Approach
When documents are extremely long (hundreds of pages), even chunked summarization can lose context. A more robust approach is to build a summary-indexed RAG system:
1. Chunk the document into sections
2. Summarize each chunk and store both the chunk and its summary
3. Index the summaries for retrieval
4. Retrieve relevant summaries based on a query, then use the corresponding full chunks for detailed answers
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class SummaryIndexedRAG:
    def __init__(self, text, chunk_size=20000):
        self.chunks = chunk_text(text, chunk_size)
        # Summarize each chunk and index the summaries with TF-IDF
        self.summaries = [
            summarize_text(chunk, max_summary_length=200) for chunk in self.chunks
        ]
        self.vectorizer = TfidfVectorizer().fit(self.summaries)
        self.summary_vectors = self.vectorizer.transform(self.summaries)

    def query(self, question, top_k=3):
        # Retrieve the chunks whose summaries best match the question
        question_vec = self.vectorizer.transform([question])
        similarities = cosine_similarity(question_vec, self.summary_vectors)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        context = ""
        for idx in top_indices:
            context += f"--- Section {idx+1} ---\n{self.chunks[idx]}\n\n"
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Based on the following document sections, answer: {question}\n\n{context}"
            }]
        )
        return response.content[0].text


# Usage
rag = SummaryIndexedRAG(document_text)
answer = rag.query("What are the termination conditions?")
print(answer)
```
Best Practices for Summarization RAG
- Chunk size: 10,000–30,000 characters works well for most documents
- Overlap: Add roughly 10% overlap between chunks to avoid cutting off important context (a minimal sketch follows this list)
- Summary granularity: Keep summaries concise (100–200 words) for fast retrieval
- Hybrid search: Combine semantic search (embeddings) with keyword search for better recall
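Here is a minimal sketch of a character-based chunker with overlap, as referenced above. The name `chunk_text_with_overlap` is an illustrative helper, and the 10% default is a starting point rather than a rule:

```python
def chunk_text_with_overlap(text, chunk_size=20000, overlap_ratio=0.1):
    """Split text into fixed-size character chunks, each overlapping the last."""
    step = int(chunk_size * (1 - overlap_ratio))  # how far each chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final chunk reached the end of the document
    return chunks
```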
Evaluations
Evaluating summary quality is notoriously difficult. Here are three practical methods:
1. ROUGE Scores
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares generated summaries against reference summaries:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "The sublease agreement outlines terms between parties A and B..."
generated = "This agreement defines the relationship between party A and party B..."

# score(target, prediction): reference first, generated summary second
scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")
```
2. Promptfoo for Custom Evaluation
Promptfoo allows you to define custom evaluation criteria in a YAML config:

```yaml
# promptfooconfig.yaml
prompts:
  - "Summarize: {{text}}"
  - "You are a legal expert. Summarize: {{text}}"

providers:
  - id: anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "..."
    assert:
      - type: llm-rubric
        value: "Does the summary include all key parties?"
      - type: llm-rubric
        value: "Is the summary factually accurate?"
      - type: cost
        threshold: 0.01
```
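Then run the evaluation from the directory containing the config, and open the results viewer:

```bash
npx promptfoo@latest eval
npx promptfoo@latest view
```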
3. Human Evaluation with Rubrics
Create a scoring rubric for human reviewers:
| Criterion | Score (1-5) | Description |
|---|---|---|
| Completeness | 1-5 | All key points covered |
| Accuracy | 1-5 | No factual errors |
| Conciseness | 1-5 | No unnecessary details |
| Coherence | 1-5 | Flows logically |
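To aggregate results, averaging each criterion across reviewers is usually enough. A short sketch using a hypothetical `reviews` structure (one dict per reviewer):

```python
# Hypothetical scores: one dict per reviewer, mapping criterion -> 1-5
reviews = [
    {"completeness": 4, "accuracy": 5, "conciseness": 3, "coherence": 4},
    {"completeness": 5, "accuracy": 4, "conciseness": 4, "coherence": 5},
]

for criterion in reviews[0]:
    mean = sum(r[criterion] for r in reviews) / len(reviews)
    print(f"{criterion}: {mean:.1f}")
```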
Iterative Improvement
Use evaluation results to iteratively improve your summarization pipeline:
- Baseline: Run basic summarization and measure ROUGE scores
- Prompt engineering: Refine prompts based on missing elements
- Chunking strategy: Adjust chunk size and overlap
- Domain tuning: Add domain-specific instructions
- Re-evaluate: Compare new scores against baseline
```python
def iterative_improvement(text, reference_summary, iterations=3):
    """Try several prompts and keep the one with the best average ROUGE F1."""
    best_score = 0.0
    best_prompt = ""
    prompts = [
        "Summarize the following:",
        "Provide a concise summary covering all key points:",
        "As an expert analyst, create a structured summary with sections:"
    ]
    for i in range(min(iterations, len(prompts))):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": f"{prompts[i]}\n\n{text}"}]
        )
        generated = response.content[0].text
        # Reuses the RougeScorer defined in the Evaluations section
        scores = scorer.score(reference_summary, generated)
        avg_score = (scores['rouge1'].fmeasure + scores['rouge2'].fmeasure
                     + scores['rougeL'].fmeasure) / 3
        if avg_score > best_score:
            best_score = avg_score
            best_prompt = prompts[i]
        print(f"Iteration {i+1}: avg ROUGE F1 = {avg_score:.3f}")
    return best_prompt, best_score
```
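Running it requires a trusted reference summary to score against:

```python
reference_summary = "..."  # a human-written summary of the same document
best_prompt, best_score = iterative_improvement(document_text, reference_summary)
print(f"Best prompt: {best_prompt!r} (avg ROUGE F1 = {best_score:.3f})")
```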
Conclusion and Best Practices
Summarization with Claude is both powerful and flexible. Here are the key takeaways:
- Start simple, then iterate: Begin with basic prompts and refine based on evaluation
- Guide with structure: Use domain-specific prompts to extract exactly what you need
- Handle long documents with chunking and RAG: Don't let token limits stop you
- Evaluate systematically: Combine automated metrics (ROUGE) with human review
- Optimize for your domain: Legal, medical, and technical documents each need tailored approaches
Key Takeaways
- Prompt engineering matters: Structured, domain-specific prompts produce significantly better summaries than generic "summarize this" requests
- Chunking + meta-summarization handles any document length: Break long texts into chunks, summarize each, then summarize the summaries for coherent results
- RAG-based summarization enables query-specific answers: Index chunk summaries for fast retrieval, then use full chunks for detailed responses
- Evaluate with both automated and human methods: ROUGE scores provide a quick baseline, but human review with rubrics catches nuance
- Iterative improvement is essential: Small prompt tweaks and chunking adjustments can dramatically improve summary quality over time