Mastering Document Summarization with Claude: From Basic Prompts to Advanced RAG
Learn how to summarize long documents using the Claude API. Covers prompt engineering, metadata extraction, handling token limits, ROUGE evaluation, and iterative improvement.
This guide teaches you to summarize long documents with Claude, including basic prompts, guided summarization, meta-summarization for token limits, and evaluation using ROUGE scores and Promptfoo.
Summarization is one of the most practical applications of large language models. Whether you're a legal professional drowning in contracts, a researcher scanning dozens of papers, or a product manager trying to distill customer feedback, Claude can help you extract the signal from the noise.
This guide walks you through a complete summarization workflow using the Claude API. We'll start with a simple prompt, then layer in advanced techniques like guided summarization, meta-summarization for long documents, and a summary-indexed RAG approach. Along the way, we'll cover how to evaluate summary quality using both automated metrics (ROUGE) and custom evaluation frameworks like Promptfoo.
Why Summarization Is Hard (and Why Claude Excels)
Summarization evaluation is notoriously subjective. Two human readers can disagree on what constitutes a "good" summary. Traditional metrics like ROUGE measure n-gram overlap but miss coherence, factual accuracy, and relevance. Claude's strength lies in its ability to follow nuanced instructions, maintain context over long passages, and generate summaries that are both concise and faithful to the source.
Setup: Installing Dependencies
Before you start, install the required packages:
pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn promptfoo
You'll also need a valid Anthropic API key. Set it as an environment variable:
export ANTHROPIC_API_KEY="sk-ant-..."
Data Preparation: Extracting Text from PDFs
For this guide, we'll use a publicly available Sublease Agreement from the SEC's EDGAR system. If you have your own PDF, you can adapt the file path.
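import pypdf

def extract_text_from_pdf(pdf_path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    reader = pypdf.PdfReader(pdf_path)
    # Fall back to "" for pages with no extractable text (e.g. scanned images)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

text = extract_text_from_pdf("sublease_agreement.pdf")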
If you prefer to work with a plain text blob, simply assign text = "...".
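Text extracted from PDFs often carries stray whitespace and bare page numbers that waste tokens. The cleanup step below is a minimal sketch under that assumption; `clean_extracted_text` is a hypothetical helper, and its page-number rule is a heuristic that would also drop any line consisting only of a number.

```python
import re

def clean_extracted_text(text: str) -> str:
    """Normalize whitespace in text pulled from a PDF (hypothetical helper)."""
    # Drop lines that hold only a page number, e.g. "12" or "Page 12"
    text = re.sub(r"(?im)^[ \t]*(page[ \t]+)?\d+[ \t]*$", "", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse runs of blank lines into one paragraph break
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```

You could apply it right after extraction, for example `text = clean_extracted_text(text)`, and inspect a sample of the output before relying on it.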
Basic Summarization: Your First Prompt
Let's start with a simple summarization function. Even this basic approach uses a system prompt to set Claude's role before the document is passed in the user message.
import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

def summarize_basic(text: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system="You are an expert summarizer. Provide a concise summary of the following document.",
        messages=[
            {"role": "user", "content": f"Please summarize this document:\n\n{text}"}
        ],
    )
    return response.content[0].text
This works, but it's naive. The summary might miss key details or include irrelevant information. Let's improve it.
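One way to tighten a basic call is to prefill the assistant role and set a stop sequence, so the reply contains only the summary. The sketch below builds the request arguments without sending them. It assumes the target model accepts a prefilled assistant turn; `build_summary_request` and the `<summary>` tag convention are illustrative choices, not part of the function above.

```python
def build_summary_request(text: str, model: str = "claude-3-5-sonnet-20241022") -> dict:
    """Return keyword arguments for client.messages.create (hypothetical helper)."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": "You are an expert summarizer.",
        "messages": [
            {
                "role": "user",
                "content": f"Summarize this document inside <summary> tags:\n\n{text}",
            },
            # Prefilled assistant turn: the model continues from this opening tag
            {"role": "assistant", "content": "<summary>"},
        ],
        # Generation stops as soon as the closing tag would be produced
        "stop_sequences": ["</summary>"],
    }
```

A call would then look like `client.messages.create(**build_summary_request(text))`, with the summary available as `response.content[0].text.strip()`.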
Multi-Shot Basic Summarization
Instead of a single prompt, you can chain two calls: first ask Claude to identify the document's key sections, then ask it to summarize each section and produce a consolidated summary. (Here "multi-shot" means multiple calls, not few-shot examples.)
def summarize_multishot(text: str) -> str:
    # Step 1: Identify key sections
    sections_response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[
            {"role": "user", "content": f"List the main sections of this document:\n\n{text}"}
        ],
    )
    sections = sections_response.content[0].text

    # Step 2: Summarize each section, then consolidate.
    # The document is passed again because the section list alone has no content to summarize.
    summary_response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"Document:\n{text}\n\nMain sections:\n{sections}\n\nSummarize each of these sections briefly, then provide a concise summary of the entire document."}
        ],
    )
    return summary_response.content[0].text
Advanced Techniques
Guided Summarization
Instead of a generic summary, guide Claude to extract specific information. This is especially useful for legal documents where you need to capture parties, dates, obligations, and termination clauses.
import json

def guided_summarize(text: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"""Extract the following metadata from this legal document. Return only a JSON object with these keys:
- parties_involved: list of all named parties
- effective_date: date the agreement takes effect
- termination_conditions: how the agreement can be terminated
- key_obligations: list of major obligations for each party
- governing_law: which jurisdiction's law applies
Document:\n{text}"""},
            # Prefill the reply with "{" so Claude continues the JSON object directly
            {"role": "assistant", "content": "{"},
        ],
    )
    # Assumes the reply is valid JSON with no trailing text; json.loads raises otherwise
    return json.loads("{" + response.content[0].text)
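Model replies sometimes wrap JSON in a sentence of preamble or a code fence, which breaks a direct parse. A tolerant parser can be sketched as follows; `parse_json_response` is a hypothetical helper, and it assumes the reply contains exactly one top-level JSON object.

```python
import json

def parse_json_response(raw: str) -> dict:
    """Extract the JSON object from a model reply (hypothetical helper)."""
    # Take everything from the first "{" to the last "}" and parse that span
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(raw[start:end + 1])
```

If the span is still not valid JSON, `json.loads` raises `json.JSONDecodeError`, which is a reasonable signal to retry the request or tighten the prompt.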
Domain-Specific Guided Summarization
For legal documents, you can add domain-specific instructions. For example, ask Claude to highlight unusual clauses, indemnification terms, or non-compete restrictions.
def legal_summarize(text: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"""You are a legal document analyst. Summarize this agreement with focus on:
- Risk factors (unusual clauses, penalties, auto-renewal)
- Financial terms (rent, fees, deposits)
- Termination rights
- Dispute resolution
Document:\n{text}"""}
        ],
    )
    return response.content[0].text
Meta-Summarization: Handling Long Documents
Claude has a large context window (200K tokens), but some documents—like multi-year contracts or regulatory filings—can still exceed that. The solution is meta-summarization: chunk the document, summarize each chunk, then summarize the summaries.
def chunk_text(text: str, chunk_size: int = 50000) -> list:
    """Split text into chunks of at most chunk_size words (words, not tokens)."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk)
    return chunks

def meta_summarize(text: str) -> str:
    chunks = chunk_text(text)
    # Summarize each chunk independently
    chunk_summaries = []
    for chunk in chunks:
        summary = summarize_basic(chunk)
        chunk_summaries.append(summary)
    # Summarize the summaries in one final call
    combined = "\n\n".join(chunk_summaries)
    final_summary = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"Combine these section summaries into one coherent executive summary:\n\n{combined}"}
        ],
    )
    return final_summary.content[0].text
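For very long inputs, the joined chunk summaries may themselves be too long for a single call. One option is to repeat the chunk-and-summarize step until the text fits in one chunk. The sketch below is a variation on the approach above, not the original method: `meta_summarize_recursive` and `split_words` are hypothetical names, and the summarizer is passed in as a callable so the logic can be exercised without API calls.

```python
from typing import Callable, List

def split_words(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def meta_summarize_recursive(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 50000,
    max_rounds: int = 5,
) -> str:
    """Summarize chunks, then repeat on the joined summaries until one chunk remains."""
    for _ in range(max_rounds):
        chunks = split_words(text, chunk_size)
        if len(chunks) <= 1:
            # The text fits in a single chunk: one final summarization pass
            return summarize(text)
        # Replace the text with the joined summaries of its chunks
        text = "\n\n".join(summarize(chunk) for chunk in chunks)
    raise RuntimeError("summaries did not shrink below chunk_size; check the summarizer")
```

With the functions defined earlier, a call might look like `meta_summarize_recursive(text, summarize_basic)`. The `max_rounds` guard stops the loop if the summarizer fails to shorten its input.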
Summary-Indexed Documents: An Advanced RAG Approach
For very large document collections, you can build a summary-indexed RAG system. Instead of indexing raw chunks, you index summaries of each document. When a user asks a question, you retrieve the most relevant document summaries and then use Claude to answer based on the full text of those documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_summary_index(documents: list) -> dict:
    """Build a dictionary mapping document IDs to their summaries."""
    index = {}
    for doc_id, doc_text in enumerate(documents):
        summary = summarize_basic(doc_text)
        index[doc_id] = {
            "summary": summary,
            "full_text": doc_text
        }
    return index

def retrieve_relevant_documents(query: str, index: dict, top_k: int = 3) -> list:
    """Return the full text of the top_k documents whose summaries best match the query."""
    entries = list(index.values())
    summaries = [entry["summary"] for entry in entries]
    vectorizer = TfidfVectorizer().fit(summaries + [query])
    query_vec = vectorizer.transform([query])
    summary_vecs = vectorizer.transform(summaries)
    similarities = cosine_similarity(query_vec, summary_vecs).flatten()
    # argsort is ascending, so take the last top_k indices and reverse them
    top_indices = similarities.argsort()[-top_k:][::-1]
    return [entries[i]["full_text"] for i in top_indices]
Best Practices for Summarization RAG
- Summary granularity: Summarize at the document level, not the chunk level, for better retrieval relevance.
- Hybrid search: Combine summary similarity with keyword matching for robust retrieval.
- Re-ranking: After retrieval, use Claude to re-rank results based on the specific query.
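The hybrid search practice can be sketched as a weighted blend of a similarity score and a keyword-overlap score. This is one possible formulation, not a prescribed one: `keyword_overlap`, `hybrid_rank`, and the `alpha` weight are hypothetical, and the similarity values are assumed to come from a step such as the cosine similarity computed during retrieval.

```python
from typing import Dict, List

def keyword_overlap(query: str, summary: str) -> float:
    """Fraction of distinct query words that also appear in the summary."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    summary_words = set(summary.lower().split())
    return len(query_words & summary_words) / len(query_words)

def hybrid_rank(
    query: str,
    summaries: Dict[int, str],
    similarity: Dict[int, float],
    alpha: float = 0.7,
    top_k: int = 3,
) -> List[int]:
    """Rank document IDs by alpha * similarity + (1 - alpha) * keyword overlap."""
    scores = {
        doc_id: alpha * similarity[doc_id] + (1 - alpha) * keyword_overlap(query, summary)
        for doc_id, summary in summaries.items()
    }
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A higher `alpha` favors the similarity score, while a lower one favors exact keyword matches. The right balance depends on the collection and is worth tuning against real queries.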
Evaluations: Measuring Summary Quality
ROUGE Scores
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between the generated summary and a reference summary.
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = "The sublease agreement transfers rights from Party A to Party B..."
generated = "Party A subleases property to Party B..."
scores = scorer.score(reference, generated)
print(scores)
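To see what the metric actually counts, here is a simplified ROUGE-1 written with the standard library. It is a sketch for intuition only: it lowercases, splits on whitespace, and skips stemming, so its numbers can differ from those of the `rouge_score` package. The function name `rouge1` is illustrative.

```python
from collections import Counter

def rouge1(reference: str, generated: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (simplified, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & gen_counts).values())
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": f1}
```

Because only word counts matter, a scrambled copy of the reference scores a perfect ROUGE-1 even though it is unreadable. That is why n-gram overlap needs to be paired with checks for coherence and factual accuracy.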
Custom Evaluation with Promptfoo
Promptfoo allows you to define custom evaluation criteria. For example, you can check that the summary includes all named parties, mentions the effective date, and does not hallucinate terms.
# promptfooconfig.yaml (fragment): assertions attach to each test case
tests:
  - assert:
      - type: contains
        value: "Party A"
      - type: contains
        value: "Party B"
      - type: llm-rubric
        value: "The summary contains no information that is absent from the original document."
Iterative Improvement
Summarization is rarely perfect on the first try. Use this feedback loop:
- Generate a summary using your current prompt.
- Evaluate using ROUGE, regex checks, or a human reviewer.
- Identify gaps: Is the summary missing key terms? Too verbose? Hallucinating?
- Refine the prompt: Add instructions to fix specific issues (e.g., "Always include the effective date" or "Use bullet points for obligations").
- Repeat until quality meets your threshold.
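The evaluate and identify-gaps steps of this loop can be partly automated with simple checks. The sketch below flags missing terms and excessive length; `find_gaps`, its parameters, and the default word limit are hypothetical, and the check cannot detect hallucinations, which still need an LLM-based or human review.

```python
from typing import List

def find_gaps(summary: str, required_terms: List[str], max_words: int = 200) -> List[str]:
    """Return a list of problems found in a summary; an empty list means it passed."""
    problems = []
    lowered = summary.lower()
    # Case-insensitive substring check for each required term
    for term in required_terms:
        if term.lower() not in lowered:
            problems.append(f"missing term: {term}")
    # Flag summaries that exceed the word budget
    word_count = len(summary.split())
    if word_count > max_words:
        problems.append(f"too long: {word_count} words (limit {max_words})")
    return problems
```

Each reported problem maps naturally to a prompt refinement, such as adding "Always include the effective date" when that term is reported missing.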
Conclusion and Best Practices
- Start simple, then iterate: A basic prompt often works well. Add complexity only when needed.
- Use guided summarization for structured output: JSON extraction makes downstream processing easier.
- Chunk and meta-summarize for long documents: Don't rely on the context window alone.
- Evaluate with multiple methods: Combine ROUGE with custom checks and human review.
- Tailor to your domain: Legal, medical, and technical documents each benefit from domain-specific instructions.
Key Takeaways
- Guided prompts outperform generic ones: Specify exactly what information you need (parties, dates, obligations) for better results.
- Meta-summarization scales past the context window: Chunk, summarize, then summarize the summaries.
- Summary-indexed RAG improves retrieval: Index document summaries, not raw chunks, for faster and more relevant search.
- Evaluate with ROUGE and custom checks: Automated metrics catch n-gram overlap; custom checks catch hallucinations and missing details.
- Iterate on your prompts: Small tweaks—like adding "use bullet points" or "include all named entities"—can dramatically improve quality.