
Mastering Document Summarization with Claude: From Basic Prompts to Advanced RAG

Learn how to summarize long documents with Claude AI. This guide covers prompt engineering, metadata extraction, handling token limits, ROUGE evaluation, and iterative improvement techniques.

Quick Answer

This guide teaches you how to use Claude for summarizing complex documents like legal contracts. You'll learn basic summarization, guided extraction, handling long documents via chunking, evaluating quality with ROUGE scores, and building a summary-indexed RAG system.

Tags: Claude API, Summarization, Prompt Engineering, RAG, Evaluation


Summarization is one of the most powerful and practical applications of large language models. In a world drowning in information—legal contracts, research papers, financial reports—the ability to distill lengthy documents into concise, accurate summaries is invaluable.

This guide walks you through the complete workflow of building a summarization system with Claude. We'll start with a simple prompt, then progressively layer in advanced techniques: guided extraction, handling documents beyond token limits, evaluating summary quality, and finally, a summary-indexed RAG approach.

By the end, you'll have a production-ready framework you can adapt to your own use cases.

Why Summarization Is Hard (and Why Claude Excels)

Summarization evaluation is notoriously subjective. What one reader considers a perfect summary, another might find lacking. Traditional metrics like ROUGE measure n-gram overlap but miss coherence, factual accuracy, and relevance.

Claude excels here because of its long context window (up to 200K tokens) and its ability to follow nuanced instructions. You can guide Claude to focus on specific aspects—legal obligations, financial risks, key dates—rather than just producing a generic paragraph.

Setup and Data Preparation

First, install the required packages:

pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn sentence-transformers

Note that promptfoo, used later for evaluation, is a Node.js tool rather than a Python package; install it separately with npm install -g promptfoo (or run it ad hoc via npx).

You'll also need a Claude API key. Set it as an environment variable:

export ANTHROPIC_API_KEY="sk-ant-..."

Extracting Text from PDFs

For this guide, we'll use a publicly available Sublease Agreement from the SEC website. Here's how to extract text from a PDF:

import pypdf

def extract_text_from_pdf(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

# Load your document
text = extract_text_from_pdf("sublease_agreement.pdf")

If you don't have a PDF, you can simply define text = "your document content here".

Basic Summarization

Let's start with the simplest possible approach:

import anthropic

client = anthropic.Anthropic()

def basic_summarize(text, max_tokens=1000):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_tokens,
        system="You are an expert summarizer. Provide a concise summary of the key points.",
        messages=[
            {"role": "user", "content": f"Please summarize the following document:\n\n{text}"}
        ]
    )
    return response.content[0].text

summary = basic_summarize(text)
print(summary)

This works, but it has limitations:

  • No control over summary structure
  • No extraction of specific metadata
  • Fails if the document exceeds Claude's context window

Guided Summarization: Extracting Specific Information

For legal documents, you often need structured output—parties, dates, obligations, risks. Use Claude's ability to follow structured prompts:

def guided_summarize(text):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        system="You are a legal document analyst. Extract key information in a structured format.",
        messages=[
            {"role": "user", "content": f"""
Analyze this legal document and provide:
  • Parties Involved: List all named parties and their roles.
  • Effective Date: The date the agreement becomes active.
  • Key Obligations: Bullet list of each party's main responsibilities.
  • Financial Terms: Payment amounts, schedules, penalties.
  • Termination Conditions: How and when the agreement can be terminated.
  • Risk Factors: Any clauses that could pose legal or financial risk.
Document: {text}
"""}
        ]
    )
    return response.content[0].text

This approach gives you a structured, actionable output rather than a generic paragraph.
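If you need machine-readable output for downstream processing, you can push this one step further and ask for JSON. A minimal sketch (the field names are illustrative, and json.loads will fail if the model wraps the JSON in prose, so production code should handle that case):

import json

def guided_summarize_json(text):
    # Illustrative variant: same fields as above, returned as JSON.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1500,
        system="You are a legal document analyst. Respond with a single valid JSON object and nothing else.",
        messages=[
            {"role": "user", "content": (
                "Return a JSON object with keys: parties, effective_date, "
                "key_obligations, financial_terms, termination_conditions, "
                f"risk_factors.\n\nDocument: {text}"
            )}
        ]
    )
    # Raises ValueError if the model adds text around the JSON.
    return json.loads(response.content[0].text)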

Handling Long Documents: Chunking and Meta-Summarization

What if your document is 100 pages long? Claude's context window is large, but you may still hit limits or degrade quality. The solution: chunk and summarize hierarchically.

Step 1: Chunk the Document

def chunk_text(text, chunk_size=4000, overlap=200):
    # chunk_size and overlap count words, used here as a rough proxy
    # for tokens (one token is roughly 0.75 English words).
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

chunks = chunk_text(text)

Step 2: Summarize Each Chunk

def summarize_chunk(chunk):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[
            {"role": "user", "content": f"Summarize this section:\n\n{chunk}"}
        ]
    )
    return response.content[0].text

chunk_summaries = [summarize_chunk(c) for c in chunks]
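Chunk summaries are independent of each other, so for long documents you can issue the calls concurrently. A minimal sketch, assuming your API rate limits allow a few parallel requests (max_workers is a placeholder to tune):

from concurrent.futures import ThreadPoolExecutor

# Summarize chunks in parallel; tune max_workers to your rate limits.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunk_summaries = list(pool.map(summarize_chunk, chunks))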

Step 3: Meta-Summarization

Now summarize the summaries:

def meta_summarize(summaries):
    combined = "\n\n---\n\n".join(summaries)
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[
            {"role": "user", "content": f"Combine these section summaries into a coherent overall summary:\n\n{combined}"}
        ]
    )
    return response.content[0].text

final_summary = meta_summarize(chunk_summaries)

This technique preserves context across the entire document while staying within token limits.
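Putting the three steps together, here is a small convenience wrapper built from the functions above (the single-chunk shortcut is an assumption for efficiency, not a requirement):

def summarize_long_document(text, chunk_size=4000, overlap=200):
    # End-to-end pipeline: chunk, summarize each chunk, meta-summarize.
    chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
    if len(chunks) == 1:
        # Short documents fit in a single call; skip the hierarchy.
        return basic_summarize(text)
    chunk_summaries = [summarize_chunk(c) for c in chunks]
    return meta_summarize(chunk_summaries)

final_summary = summarize_long_document(text)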

Summary-Indexed RAG: An Advanced Approach

For even larger document collections, consider a summary-indexed Retrieval-Augmented Generation (RAG) system. Instead of indexing raw chunks, you index summaries of those chunks. This has two benefits:

  • Better retrieval: Summaries are denser and more relevant than raw text.
  • Faster search: Fewer tokens to embed and compare.

Implementation Sketch

from sentence_transformers import SentenceTransformer
import numpy as np

# Generate summaries for each chunk (reusing summarize_chunk from earlier)
chunk_summaries = [summarize_chunk(c) for c in chunks]

# Embed the summaries (normalize so dot products equal cosine similarity)
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(chunk_summaries, normalize_embeddings=True)

# On query, find the most relevant summary
query = "What are the termination conditions?"
query_embedding = model.encode([query], normalize_embeddings=True)
scores = np.dot(embeddings, query_embedding.T).flatten()
best_idx = np.argmax(scores)

# Use the corresponding raw chunk for the detailed answer
relevant_chunk = chunks[best_idx]
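To close the loop, pass the retrieved chunk back to Claude to answer the query. A minimal sketch of the generation step:

# Answer the query using only the retrieved chunk as context
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {"role": "user", "content": (
            f"Answer the question using only this excerpt.\n\n"
            f"Excerpt:\n{relevant_chunk}\n\nQuestion: {query}"
        )}
    ]
)
print(response.content[0].text)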

Best Practices for Summarization RAG

  • Chunk size: 2000-4000 tokens per chunk. Too small loses context; too large dilutes relevance.
  • Overlap: 10-20% overlap between chunks to avoid cutting off important sentences.
  • Summary length: Keep chunk summaries to 100-200 words—dense but complete.
  • Metadata: Include document title, section number, and date in each summary for traceability.
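To make the metadata point concrete, here is a minimal sketch of prepending traceability fields to each summary before embedding (the title and section numbering are placeholders):

def tag_summary(summary, title, section):
    # Prepend metadata so it is embedded along with the summary text.
    return f"[{title} | section {section}]\n{summary}"

tagged_summaries = [tag_summary(s, "Sublease Agreement", i + 1)
                    for i, s in enumerate(chunk_summaries)]
embeddings = model.encode(tagged_summaries, normalize_embeddings=True)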

Evaluating Summary Quality

You can't improve what you don't measure. Here are three evaluation methods:

1. ROUGE Scores

ROUGE measures n-gram overlap between your summary and a reference summary.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")

2. Promptfoo for Custom Evaluation

Promptfoo lets you define custom evaluation criteria. The factuality, completeness, and conciseness checks map onto its model-graded (llm-rubric) and JavaScript assertions; a config sketch:
# promptfooconfig.yaml
tests:
  - vars:
      document: file://sublease_agreement.txt
    assert:
      - type: llm-rubric
        value: "Does the summary contain any factual errors compared to the source?"
      - type: llm-rubric
        value: "Does the summary cover all key sections of the document?"
      - type: javascript
        value: output.split(/\s+/).length <= 200  # conciseness: max ~200 words
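Run the evaluation with npx promptfoo eval from the directory containing the config; promptfoo grades each output against the assertions and reports pass/fail per test.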

3. LLM-as-Judge

Use Claude itself to evaluate summaries:

def evaluate_summary(source, summary):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[
            {"role": "user", "content": f"""
Rate this summary on a scale of 1-5 for:
  • Accuracy: Does it contain any factual errors?
  • Completeness: Does it cover all major points?
  • Conciseness: Is it appropriately brief?
  • Clarity: Is it easy to understand?
Source document: {source[:2000]}...

Summary: {summary}

Provide scores and brief justifications.
"""}
        ]
    )
    return response.content[0].text

Iterative Improvement

Summarization is rarely perfect on the first try. Use this feedback loop:

  1. Generate a summary using your current prompt.
  2. Evaluate using automated metrics and human review.
  3. Identify gaps: Is it missing key facts? Too verbose? Inaccurate?
  4. Refine the prompt: Add instructions like "Focus on financial terms" or "Use bullet points."
  5. Repeat until quality meets your threshold.
Example refinement cycle:
# Version 1: Too verbose
prompt_v1 = "Summarize this document."

# Version 2: More specific
prompt_v2 = "Summarize this legal document in 3 paragraphs: parties, obligations, risks."

# Version 3: With constraints
prompt_v3 = """Summarize this legal document. Requirements:
- Maximum 200 words
- Use bullet points
- Include all dollar amounts
- Flag any ambiguous language
"""

Conclusion and Best Practices

Summarization with Claude is a skill that improves with practice and systematic refinement. Here are the key principles to remember:

  • Start simple, then guide: Begin with a basic prompt, then add structure and constraints.
  • Chunk strategically: For long documents, use overlapping chunks and meta-summarization.
  • Evaluate rigorously: Combine automated metrics (ROUGE) with LLM-as-judge and human review.
  • Iterate: Treat your prompt as a living artifact that evolves with your requirements.
  • Consider RAG: For document collections, summary-indexed retrieval dramatically improves relevance.

Key Takeaways

  • Guided prompts produce better summaries: Instruct Claude to extract specific fields (parties, dates, risks) rather than asking for a generic summary.
  • Chunk + meta-summarize handles any document length: Break long texts into overlapping chunks, summarize each, then combine.
  • Summary-indexed RAG outperforms raw chunk retrieval: Summaries are denser and more relevant, leading to better search results.
  • Evaluate with multiple methods: Use ROUGE for baseline, Promptfoo for custom checks, and Claude itself as a judge for nuanced quality.
  • Iterate relentlessly: Small prompt tweaks—adding constraints, changing output format—can dramatically improve summary quality.