Mastering Document Summarization with Claude: From Basics to Advanced RAG Techniques
Learn how to summarize legal documents and long texts using Claude AI. Covers prompt engineering, ROUGE evaluation, meta-summarization, and RAG-based indexing for production workflows.
This guide teaches you to summarize long documents with Claude using basic prompts, guided summarization, meta-summarization, and RAG-based indexing. You'll also learn to evaluate summary quality with ROUGE scores and Promptfoo, then iteratively improve your results.
Summarization is one of the most powerful and practical applications of large language models. Whether you're a legal analyst drowning in contracts, a researcher sifting through papers, or a product manager reviewing customer feedback, the ability to condense lengthy documents into clear, actionable summaries saves time and improves decision-making.
Claude excels at summarization due to its large context window, nuanced language understanding, and strong instruction-following capabilities. In this guide, you'll learn a complete workflow: from basic summarization prompts to advanced techniques like meta-summarization and RAG-based indexing. We'll also cover how to evaluate summary quality using both automated metrics (ROUGE) and custom evaluation frameworks like Promptfoo.
By the end, you'll have a production-ready approach to summarization that you can adapt to any domain.
Why Summarization Is Hard (And Why Claude Helps)
Summarization evaluation is notoriously subjective. Unlike classification or extraction tasks, there's rarely a single "correct" summary. Different readers value different aspects: conciseness, factual accuracy, tone, or inclusion of specific details. Traditional metrics like ROUGE measure n-gram overlap but miss coherence, relevance, and faithfulness.
Claude addresses these challenges by:
- Understanding context and nuance in long documents
- Following detailed instructions about summary format and focus
- Handling documents well beyond typical token limits via its extended context window
Setup: What You'll Need
Before diving in, install the required packages:
pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn
Note that Promptfoo is a Node.js tool, not a Python package; install it separately with npm install -g promptfoo (or run it via npx).
You'll also need a Claude API key. Set it as an environment variable:
export ANTHROPIC_API_KEY=sk-ant-...
Data Preparation: Extracting Text from PDFs
For this guide, we'll use a publicly available Sublease Agreement from SEC.gov. Legal documents are ideal for testing summarization because they're dense, structured, and full of jargon.
Here's a Python function to extract text from a PDF:
import pypdf

def extract_text_from_pdf(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text
text = extract_text_from_pdf("sublease_agreement.pdf")
If you don't have a PDF, you can skip this step and define text = "your long text here".
Basic Summarization: Your First Claude Prompt
Let's start simple. Here's a minimal summarization function using the Claude API:
import anthropic
client = anthropic.Anthropic()
def summarize_basic(text, max_summary_length=200):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"Please summarize the following text in {max_summary_length} words or fewer.\n\n{text}"
            }
        ]
    )
    return response.content[0].text
summary = summarize_basic(text)
print(summary)
This works, but it's naive. The summary may miss key details, include irrelevant information, or fail to capture the document's structure. Let's improve it.
Multi-Shot Basic Summarization
A simple improvement is best-of-N sampling: ask Claude to generate multiple candidate summaries, then merge the best elements into one. Because the API samples its output, the candidates will differ, which reduces the impact of any single poor generation:
def summarize_multishot(text, num_shots=3):
    summaries = []
    for _ in range(num_shots):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": f"Summarize this document concisely:\n\n{text}"
                }
            ]
        )
        summaries.append(response.content[0].text)
    # Ask Claude to merge the best elements
    merge_prompt = f"Here are {num_shots} summaries of the same document. Combine them into one coherent, comprehensive summary:\n\n"
    for i, s in enumerate(summaries):
        merge_prompt += f"Summary {i+1}: {s}\n\n"
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": merge_prompt}]
    )
    return response.content[0].text
Advanced Techniques: Guided and Domain-Specific Summarization
Guided Summarization
Instead of a generic "summarize this" prompt, guide Claude with a structured outline:
def guided_summarize(text):
    prompt = f"""Please analyze the following legal document and provide a structured summary with these sections:
- Parties Involved: Who are the signatories?
- Key Dates: Effective date, termination date, renewal options
- Financial Terms: Rent amounts, payment schedules, deposits
- Obligations: Responsibilities of each party
- Termination Clauses: Conditions for early termination
- Risk Factors: Indemnification, liability limits, dispute resolution

Document:
{text}

Provide the summary in bullet points under each heading."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Domain-Specific Guided Summarization
For legal documents, you can add domain-specific instructions. For example, ask Claude to identify standard vs. non-standard clauses, flag unusual terms, or extract specific legal references:
def legal_summarize(text):
    prompt = f"""You are a legal document analyst. Summarize this contract with special attention to:
- Non-standard clauses: Any terms that deviate from industry norms
- Ambiguous language: Phrases that could be interpreted multiple ways
- Key risks: Financial or legal exposure for either party
- Action items: What each party must do and by when

Document:
{text}

Format your summary as a legal memo with a 'Risks & Recommendations' section."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Meta-Summarization: Including the Context of the Entire Document
When documents are extremely long (e.g., 100+ pages), you can use a hierarchical approach:
- Chunk the document into sections
- Summarize each chunk individually
- Summarize the summaries (meta-summarization)
def chunk_text(text, chunk_size=3000):
    # chunk_size is measured in words, not tokens
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk)
    return chunks

def hierarchical_summarize(text):
    chunks = chunk_text(text)
    chunk_summaries = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}]
        )
        chunk_summaries.append(response.content[0].text)
    # Meta-summarize: condense the per-chunk summaries into one overview
    combined = "\n\n".join(chunk_summaries)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Combine these section summaries into one coherent overall summary:\n\n{combined}"}]
    )
    return response.content[0].text
Summary Indexed Documents: An Advanced RAG Approach
For production systems where you need to query many documents, consider a Retrieval-Augmented Generation (RAG) approach with summary indexing. Instead of storing raw chunks, store summaries of each chunk. This improves retrieval quality because summaries are denser and more relevant.
# Pseudocode for summary-indexed RAG
1. Chunk each document
2. Generate a summary for each chunk
3. Embed the summaries (not the raw chunks)
4. Store in a vector database
5. At query time, retrieve relevant summaries
6. Use retrieved summaries + original chunks as context for Claude
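The indexing structure above can be sketched in plain Python. This is a minimal in-memory stand-in: a production system would embed the summaries with an embedding model and store them in a vector database, but here a simple term-count cosine similarity is enough to show the key idea, namely that queries match against the summaries while the original chunks travel along as context. The `SummaryIndex` class and its helpers are illustrative names, not part of any library:

```python
import math
from collections import Counter

def vectorize(text):
    # Term-count vector; a stand-in for a real embedding model
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two term-count vectors
    num = sum(a[t] * b[t] for t in a if t in b)
    denom = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / denom if denom else 0.0

class SummaryIndex:
    """Stores (summary, original_chunk) pairs; retrieval matches against summaries only."""

    def __init__(self):
        self.entries = []  # (summary_vector, summary, chunk)

    def add(self, summary, chunk):
        self.entries.append((vectorize(summary), summary, chunk))

    def query(self, question, top_k=2):
        qv = vectorize(question)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[0]), reverse=True)
        # Return summaries plus their source chunks as context for Claude
        return [(summary, chunk) for _, summary, chunk in ranked[:top_k]]
```

Swapping `vectorize` for real embeddings and `self.entries` for a vector store leaves the overall shape of the pipeline unchanged.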
Best Practices for Summarization RAG
- Summary granularity: Match summary length to chunk size. A 500-word chunk might get a 2-sentence summary.
- Metadata preservation: Include document title, date, and section headers in the indexed summary.
- Hybrid retrieval: Combine semantic search (embeddings) with keyword search for better recall.
- Re-ranking: After initial retrieval, use Claude to re-rank results by relevance to the query.
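The re-ranking step might be sketched as follows. The idea is to list the retrieved summaries in a prompt, ask Claude to reply with a comma-separated ordering, and reorder the candidates from that reply; the output format here is a convention chosen for this example, not an API feature:

```python
def build_rerank_prompt(query, candidates):
    # Ask Claude to order retrieved summaries by relevance to the query.
    lines = [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    return (
        f"Query: {query}\n\n"
        "Rank the following summaries from most to least relevant to the query. "
        "Reply with only the numbers, comma-separated (e.g. 2,1,3).\n\n"
        + "\n".join(lines)
    )

def apply_ranking(candidates, reply):
    # Parse a reply like "2,1,3" back into the reordered candidate list
    order = [int(n.strip()) - 1 for n in reply.split(",")]
    return [candidates[i] for i in order]
```

In practice you would send `build_rerank_prompt(...)` through `client.messages.create` and pass the model's reply to `apply_ranking`, with a fallback to the original order if the reply doesn't parse.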
Evaluating Summary Quality
You can't improve what you can't measure. Here are two evaluation approaches:
ROUGE Scores (Automated)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between a generated summary and reference summaries:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = "The sublease agreement transfers obligations from party A to party B."
candidate = "This contract transfers obligations from A to B."
scores = scorer.score(reference, candidate)
print(scores)
Custom Evaluation with Promptfoo
Promptfoo allows you to define custom evaluation criteria using LLM-as-judge. For example, you can ask Claude to rate summaries on:
- Completeness: Does it cover all key points?
- Conciseness: Is it free of unnecessary detail?
- Accuracy: Are all facts correct?
- Coherence: Does it flow logically?
# promptfoo config example
prompts:
  - "Summarize this document: {{text}}"
providers:
  - anthropic:claude-3-5-sonnet-20241022
tests:
  - vars:
      text: "..."
    assert:
      - type: llm-rubric
        value: "The summary must include all parties mentioned in the document"
      - type: llm-rubric
        value: "The summary must be under 150 words"
Iterative Improvement: A Practical Workflow
- Baseline: Run basic summarization on a test set of 5-10 documents.
- Evaluate: Use ROUGE and Promptfoo to score each summary.
- Analyze failures: Identify patterns (e.g., missing dates, incorrect party names).
- Refine prompts: Add specific instructions to address failures.
- Re-evaluate: Compare new scores against baseline.
- Repeat: Continue until scores plateau.
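A minimal harness for the evaluate-and-compare steps might look like this. The `unigram_f1` function is a crude stand-in for ROUGE-1 (use the `rouge_score` package in real runs), and tracking versions in a plain dict is just one simple way to make regressions visible:

```python
from collections import Counter

def unigram_f1(reference, candidate):
    # Crude stand-in for ROUGE-1 F1: unigram overlap between reference and candidate
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def track_run(history, prompt_version, scores):
    # Record the mean score per prompt version so you can compare against the baseline
    history[prompt_version] = sum(scores) / len(scores)
    return history
```

Each time you refine a prompt, score it on the same test set and record the result under a new version label; a drop relative to the baseline tells you the "improvement" made things worse.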
Conclusion and Best Practices
Summarization with Claude is both powerful and flexible. Here are the key takeaways:
- Start simple, then guide: Begin with a basic prompt, then add structure and domain-specific instructions.
- Use multi-shot for consistency: Generate multiple summaries and merge them for more reliable output.
- Handle long documents hierarchically: Chunk, summarize, then meta-summarize.
- Evaluate rigorously: Combine ROUGE scores with LLM-based evaluation for a complete picture.
- Iterate systematically: Track your prompt changes and their impact on quality metrics.
Key Takeaways
- Guided prompts outperform generic ones: Structuring your summary with specific sections (parties, dates, risks) yields more useful output.
- Hierarchical summarization scales: For long documents, chunk-then-summarize-then-meta-summarize is a proven pattern.
- Evaluation is essential: Use ROUGE for quick automated checks and Promptfoo for nuanced, criteria-based assessment.
- RAG with summary indexing improves retrieval: Storing summaries instead of raw chunks leads to higher-quality search results.
- Iterate with data: Track prompt versions and evaluation scores to systematically improve your summarization pipeline.