Mastering Document Summarization with Claude: From Basics to Advanced RAG Techniques
Learn how to summarize legal documents and long texts using Claude AI. Covers prompt engineering, ROUGE evaluation, meta-summarization, and RAG-based indexing for production workflows.
This guide teaches you to summarize long documents with Claude using basic prompts, guided summarization, meta-summarization, and RAG-based indexing. You'll also learn to evaluate summary quality with ROUGE scores and Promptfoo, then iteratively improve your results.
Summarization is one of the most powerful and practical applications of large language models. Whether you're a legal analyst drowning in contracts, a researcher sifting through papers, or a product manager reviewing customer feedback, the ability to condense lengthy documents into clear, actionable summaries saves time and improves decision-making.
Claude excels at summarization due to its large context window, nuanced language understanding, and strong instruction-following capabilities. In this guide, you'll learn a complete workflow: from basic summarization prompts to advanced techniques like meta-summarization and RAG-based indexing. We'll also cover how to evaluate summary quality using both automated metrics (ROUGE) and custom evaluation frameworks like Promptfoo.
By the end, you'll have a production-ready approach to summarization that you can adapt to any domain.
Why Summarization Is Hard (And Why Claude Helps)
Summarization evaluation is notoriously subjective. Unlike classification or extraction tasks, there's rarely a single "correct" summary. Different readers value different aspects: conciseness, factual accuracy, tone, or inclusion of specific details. Traditional metrics like ROUGE measure n-gram overlap but miss coherence, relevance, and faithfulness.
Claude addresses these challenges by:
- Understanding context and nuance in long documents
- Following detailed instructions about summary format and focus
- Handling documents well beyond typical token limits via its extended context window
Setup: What You'll Need
Before diving in, install the required packages:
pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn
Note that Promptfoo is a Node.js tool, not a Python package; install it separately with npm install -g promptfoo (or run it via npx).
You'll also need a Claude API key. Set it as an environment variable:
export ANTHROPIC_API_KEY=sk-ant-...
Data Preparation: Extracting Text from PDFs
For this guide, we'll use a publicly available Sublease Agreement from SEC.gov. Legal documents are ideal for testing summarization because they're dense, structured, and full of jargon.
Here's a Python function to extract text from a PDF:
import pypdf

def extract_text_from_pdf(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text
text = extract_text_from_pdf("sublease_agreement.pdf")
If you don't have a PDF, you can skip this step and define text = "your long text here".
Basic Summarization: Your First Claude Prompt
Let's start simple. Here's a minimal summarization function using the Claude API:
import anthropic
client = anthropic.Anthropic()
def summarize_basic(text, max_summary_length=200):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": f"Please summarize the following text in {max_summary_length} words or fewer.\n\n{text}"
            }
        ]
    )
    return response.content[0].text
summary = summarize_basic(text)
print(summary)
This works, but it's naive. The summary may miss key details, include irrelevant information, or fail to capture the document's structure. Let's improve it.
Multi-Shot Basic Summarization
A simple improvement is best-of-N sampling: ask Claude to generate multiple candidate summaries, then merge the best elements into one. Because the API samples its output, the candidates will differ, which reduces the impact of any single poor generation:
def summarize_multishot(text, num_shots=3):
    summaries = []
    for _ in range(num_shots):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": f"Summarize this document concisely:\n\n{text}"
                }
            ]
        )
        summaries.append(response.content[0].text)
    # Ask Claude to merge the best elements
    merge_prompt = f"Here are {num_shots} summaries of the same document. Combine them into one coherent, comprehensive summary:\n\n"
    for i, s in enumerate(summaries):
        merge_prompt += f"Summary {i+1}: {s}\n\n"
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": merge_prompt}]
    )
    return response.content[0].text
Advanced Techniques: Guided and Domain-Specific Summarization
Guided Summarization
Instead of a generic "summarize this" prompt, guide Claude with a structured outline:
def guided_summarize(text):
    prompt = f"""Please analyze the following legal document and provide a structured summary with these sections:
- Parties Involved: Who are the signatories?
- Key Dates: Effective date, termination date, renewal options
- Financial Terms: Rent amounts, payment schedules, deposits
- Obligations: Responsibilities of each party
- Termination Clauses: Conditions for early termination
- Risk Factors: Indemnification, liability limits, dispute resolution

Document:
{text}

Provide the summary in bullet points under each heading."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Domain-Specific Guided Summarization
For legal documents, you can add domain-specific instructions. For example, ask Claude to identify standard vs. non-standard clauses, flag unusual terms, or extract specific legal references:
def legal_summarize(text):
    prompt = f"""You are a legal document analyst. Summarize this contract with special attention to:
- Non-standard clauses: Any terms that deviate from industry norms
- Ambiguous language: Phrases that could be interpreted multiple ways
- Key risks: Financial or legal exposure for either party
- Action items: What each party must do and by when

Document:
{text}

Format your summary as a legal memo with a 'Risks & Recommendations' section."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
Meta-Summarization: Including the Context of the Entire Document
When documents are extremely long (e.g., 100+ pages), you can use a hierarchical approach:
- Chunk the document into sections
- Summarize each chunk individually
- Summarize the summaries (meta-summarization)
def chunk_text(text, chunk_size=3000):
    # chunk_size is measured in words, not tokens
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append(chunk)
    return chunks

def hierarchical_summarize(text):
    chunks = chunk_text(text)
    chunk_summaries = []
    for chunk in chunks:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}]
        )
        chunk_summaries.append(response.content[0].text)
    # Meta-summarize: condense the per-chunk summaries into one overview
    combined = "\n\n".join(chunk_summaries)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"Combine these section summaries into one coherent overall summary:\n\n{combined}"}]
    )
    return response.content[0].text
Summary Indexed Documents: An Advanced RAG Approach
For production systems where you need to query many documents, consider a Retrieval-Augmented Generation (RAG) approach with summary indexing. Instead of storing raw chunks, store summaries of each chunk. This improves retrieval quality because summaries are denser and more relevant.
# Pseudocode for summary-indexed RAG
1. Chunk each document
2. Generate a summary for each chunk
3. Embed the summaries (not the raw chunks)
4. Store in a vector database
5. At query time, retrieve relevant summaries
6. Use retrieved summaries + original chunks as context for Claude
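The indexing structure above can be sketched in plain Python. This is a minimal in-memory stand-in: a production system would embed the summaries with an embedding model and store them in a vector database, but here a simple term-count cosine similarity is enough to show the key idea, namely that queries match against the summaries while the original chunks travel along as context. The `SummaryIndex` class and its helpers are illustrative names, not part of any library:

```python
import math
from collections import Counter

def vectorize(text):
    # Term-count vector; a stand-in for a real embedding model
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two term-count vectors
    num = sum(a[t] * b[t] for t in a if t in b)
    denom = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / denom if denom else 0.0

class SummaryIndex:
    """Stores (summary, original_chunk) pairs; retrieval matches against summaries only."""

    def __init__(self):
        self.entries = []  # (summary_vector, summary, chunk)

    def add(self, summary, chunk):
        self.entries.append((vectorize(summary), summary, chunk))

    def query(self, question, top_k=2):
        qv = vectorize(question)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[0]), reverse=True)
        # Return summaries plus their source chunks as context for Claude
        return [(summary, chunk) for _, summary, chunk in ranked[:top_k]]
```

Swapping `vectorize` for real embeddings and `self.entries` for a vector store leaves the overall shape of the pipeline unchanged.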
Best Practices for Summarization RAG
- Summary granularity: Match summary length to chunk size. A 500-word chunk might get a 2-sentence summary.
- Metadata preservation: Include document title, date, and section headers in the indexed summary.
- Hybrid retrieval: Combine semantic search (embeddings) with keyword search for better recall.
- Re-ranking: After initial retrieval, use Claude to re-rank results by relevance to the query.
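The re-ranking step might be sketched as follows. The idea is to list the retrieved summaries in a prompt, ask Claude to reply with a comma-separated ordering, and reorder the candidates from that reply; the output format here is a convention chosen for this example, not an API feature:

```python
def build_rerank_prompt(query, candidates):
    # Ask Claude to order retrieved summaries by relevance to the query.
    lines = [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    return (
        f"Query: {query}\n\n"
        "Rank the following summaries from most to least relevant to the query. "
        "Reply with only the numbers, comma-separated (e.g. 2,1,3).\n\n"
        + "\n".join(lines)
    )

def apply_ranking(candidates, reply):
    # Parse a reply like "2,1,3" back into the reordered candidate list
    order = [int(n.strip()) - 1 for n in reply.split(",")]
    return [candidates[i] for i in order]
```

In practice you would send `build_rerank_prompt(...)` through `client.messages.create` and pass the model's reply to `apply_ranking`, with a fallback to the original order if the reply doesn't parse.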
Evaluating Summary Quality
You can't improve what you can't measure. Here are two evaluation approaches:
ROUGE Scores (Automated)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between a generated summary and reference summaries:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
reference = "The sublease agreement transfers obligations from party A to party B."
candidate = "This contract transfers obligations from A to B."
scores = scorer.score(reference, candidate)
print(scores)
Custom Evaluation with Promptfoo
Promptfoo allows you to define custom evaluation criteria using LLM-as-judge. For example, you can ask Claude to rate summaries on:
- Completeness: Does it cover all key points?
- Conciseness: Is it free of unnecessary detail?
- Accuracy: Are all facts correct?
- Coherence: Does it flow logically?
# promptfoo config example
prompts:
  - "Summarize this document: {{text}}"
providers:
  - anthropic:claude-3-5-sonnet-20241022
tests:
  - vars:
      text: "..."
    assert:
      - type: llm-rubric
        value: "The summary must include all parties mentioned in the document"
      - type: llm-rubric
        value: "The summary must be under 150 words"
Iterative Improvement: A Practical Workflow
- Baseline: Run basic summarization on a test set of 5-10 documents.
- Evaluate: Use ROUGE and Promptfoo to score each summary.
- Analyze failures: Identify patterns (e.g., missing dates, incorrect party names).
- Refine prompts: Add specific instructions to address failures.
- Re-evaluate: Compare new scores against baseline.
- Repeat: Continue until scores plateau.
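A minimal harness for the evaluate-and-compare steps might look like this. The `unigram_f1` function is a crude stand-in for ROUGE-1 (use the `rouge_score` package in real runs), and tracking versions in a plain dict is just one simple way to make regressions visible:

```python
from collections import Counter

def unigram_f1(reference, candidate):
    # Crude stand-in for ROUGE-1 F1: unigram overlap between reference and candidate
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def track_run(history, prompt_version, scores):
    # Record the mean score per prompt version so you can compare against the baseline
    history[prompt_version] = sum(scores) / len(scores)
    return history
```

Each time you refine a prompt, score it on the same test set and record the result under a new version label; a drop relative to the baseline tells you the "improvement" made things worse.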
Conclusion and Best Practices
Summarization with Claude is both powerful and flexible. Here are the key takeaways:
- Start simple, then guide: Begin with a basic prompt, then add structure and domain-specific instructions.
- Use multi-shot for consistency: Generate multiple summaries and merge them for more reliable output.
- Handle long documents hierarchically: Chunk, summarize, then meta-summarize.
- Evaluate rigorously: Combine ROUGE scores with LLM-based evaluation for a complete picture.
- Iterate systematically: Track your prompt changes and their impact on quality metrics.
Key Takeaways
- Guided prompts outperform generic ones: Structuring your summary with specific sections (parties, dates, risks) yields more useful output.
- Hierarchical summarization scales: For long documents, chunk-then-summarize-then-meta-summarize is a proven pattern.
- Evaluation is essential: Use ROUGE for quick automated checks and Promptfoo for nuanced, criteria-based assessment.
- RAG with summary indexing improves retrieval: Storing summaries instead of raw chunks leads to higher-quality search results.
- Iterate with data: Track prompt versions and evaluation scores to systematically improve your summarization pipeline.