BeClaude Guide
2026-04-28

Mastering Document Summarization with Claude: From Basic Prompts to Advanced RAG Techniques

Learn how to summarize long documents with Claude AI, including prompt engineering, handling token limits, RAG-based summarization, and automated quality evaluation using ROUGE scores.

Quick Answer

A practical guide to summarizing documents with Claude, covering basic prompts, multi-shot techniques, guided summarization, handling long documents via RAG, and evaluating summary quality with automated metrics.

summarization · prompt engineering · RAG · Claude API · evaluation

Introduction

Summarization is one of the most powerful and practical applications of large language models like Claude. Whether you're a legal professional drowning in contracts, a researcher sifting through papers, or a business analyst reviewing quarterly reports, the ability to condense lengthy documents into concise, accurate summaries saves time and improves decision-making.

This guide walks you through the complete workflow of document summarization with Claude — from basic prompt techniques to advanced Retrieval-Augmented Generation (RAG) approaches for documents that exceed token limits. We'll also cover how to evaluate summary quality using automated metrics like ROUGE scores and tools like Promptfoo.

By the end, you'll have a reusable framework for building, testing, and refining summarization systems tailored to your specific domain.

Setup and Environment

Before we start, install the required packages:

pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn

Promptfoo is a Node.js tool and is installed separately:

npm install -g promptfoo

You'll also need a valid Claude API key. Set it as an environment variable:

export ANTHROPIC_API_KEY="your-api-key-here"

Data Preparation

Most real-world documents come as PDFs. Here's a Python function to extract text from a PDF and clean it for summarization:

import pypdf
import re

def extract_text_from_pdf(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # extract_text() can return an empty result for image-only pages
        text += page.extract_text() or ""
    return text

def clean_text(text):
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove non-ASCII characters if needed
    text = text.encode('ascii', 'ignore').decode()
    return text.strip()

Example usage

raw_text = extract_text_from_pdf("sublease_agreement.pdf")
clean_text = clean_text(raw_text)
print(f"Extracted {len(clean_text)} characters")

If you don't have a PDF, you can skip this step and define a text variable directly.

Basic Summarization

Let's start with a simple summarization function using Claude's Messages API:

import anthropic

client = anthropic.Anthropic()

def summarize_text(text, max_summary_length=200):
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_summary_length,
        messages=[
            {
                "role": "user",
                "content": f"Please summarize the following text concisely:\n\n{text}"
            }
        ]
    )
    return response.content[0].text

summary = summarize_text(clean_text)
print(summary)

This basic approach works, but it has limitations: the summary may miss key details, and it won't handle documents longer than Claude's context window (200K tokens).
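Before sending a document, it helps to check whether it plausibly fits in the context window at all. A minimal sketch, assuming the common heuristic of roughly four characters per token for English text (for exact counts, use the Anthropic API's token-counting support; the function names here are illustrative):

```python
def estimate_tokens(text):
    # Rough heuristic: English text averages ~4 characters per token
    return len(text) // 4

def fits_in_context(text, context_window=200_000, reserved_for_output=4_096):
    # Leave headroom for the model's response when checking the budget
    return estimate_tokens(text) + reserved_for_output <= context_window
```

This is only a ballpark check; real token counts vary with language and formatting.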

Multi-Shot Basic Summarization

A simple improvement is to use a multi-shot approach — breaking the document into chunks, summarizing each chunk, then summarizing the summaries:

def chunk_text(text, chunk_size=50000):
    """Split text into chunks of roughly equal size."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        current_length += len(word) + 1
        if current_length > chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word)
        else:
            current_chunk.append(word)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

def multi_shot_summarize(text, chunk_size=50000):
    chunks = chunk_text(text, chunk_size)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i+1}/{len(chunks)}...")
        chunk_summaries.append(summarize_text(chunk, max_summary_length=300))
    # Now summarize the summaries
    combined_summaries = "\n\n".join(chunk_summaries)
    final_summary = summarize_text(
        f"Combine these section summaries into one coherent overall summary:\n\n{combined_summaries}",
        max_summary_length=400
    )
    return final_summary

This technique effectively extends Claude's summarization capability to arbitrarily long documents.

Advanced Techniques

Guided Summarization

Instead of a generic "summarize this" prompt, guide Claude with specific instructions:

def guided_summarize(text, focus_areas=None):
    prompt = "Summarize the following document. Focus on:\n"
    if focus_areas:
        for area in focus_areas:
            prompt += f"- {area}\n"
    prompt += f"\nDocument:\n{text}"
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Example: Legal document focus

summary = guided_summarize(clean_text, focus_areas=[
    "Key obligations of each party",
    "Termination conditions",
    "Financial terms and payment schedules",
    "Liability and indemnification clauses"
])

Domain-Specific Guided Summarization

For specialized domains like legal or medical, include domain-specific instructions:

def legal_summarize(text):
    prompt = """You are a legal document analyst. Summarize this contract with:
  • PARTIES: Who are the involved parties?
  • TERM: Duration and renewal terms
  • OBLIGATIONS: Key responsibilities of each party
  • FINANCIAL: Payment amounts, schedules, penalties
  • TERMINATION: Conditions for early termination
  • LIABILITY: Indemnification, limitations of liability
  • GOVERNING LAW: Jurisdiction and dispute resolution
Document:
{text}"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=800,
        messages=[{"role": "user", "content": prompt.format(text=text)}]
    )
    return response.content[0].text

Meta-Summarization

For very long documents, use a hierarchical approach: summarize sections, then summarize the summaries, and optionally extract metadata:

def meta_summarize(text, chunk_size=30000):
    # Step 1: Extract metadata (title, date, parties, etc.)
    metadata_prompt = f"Extract key metadata from this document: title, date, parties involved, document type.\n\n{text[:10000]}"
    metadata = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": metadata_prompt}]
    ).content[0].text
    
    # Step 2: Chunk and summarize
    chunks = chunk_text(text, chunk_size)
    summaries = []
    for chunk in chunks:
        summaries.append(summarize_text(chunk, max_summary_length=300))
    
    # Step 3: Combine into final summary
    combined = "\n\n".join(summaries)
    final_prompt = f"Metadata:\n{metadata}\n\nSection Summaries:\n{combined}\n\nCreate a cohesive final summary."
    final_summary = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": final_prompt}]
    ).content[0].text
    
    return {"metadata": metadata, "summary": final_summary}

Summary Indexed Documents: An Advanced RAG Approach

When documents are extremely long (hundreds of pages), even chunked summarization can lose context. A more robust approach is to build a summary-indexed RAG system:

  • Chunk the document into sections
  • Summarize each chunk and store both the chunk and its summary
  • Index the summaries for retrieval
  • Retrieve relevant summaries based on a query, then use the corresponding full chunks for detailed answers

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class SummaryIndexedRAG:
    def __init__(self, text, chunk_size=20000):
        self.chunks = chunk_text(text, chunk_size)
        self.summaries = []
        for chunk in self.chunks:
            self.summaries.append(summarize_text(chunk, max_summary_length=200))
        self.vectorizer = TfidfVectorizer().fit(self.summaries)
        self.summary_vectors = self.vectorizer.transform(self.summaries)

    def query(self, question, top_k=3):
        question_vec = self.vectorizer.transform([question])
        similarities = cosine_similarity(question_vec, self.summary_vectors)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        context = ""
        for idx in top_indices:
            context += f"--- Section {idx+1} ---\n{self.chunks[idx]}\n\n"
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Based on the following document sections, answer: {question}\n\n{context}"
            }]
        )
        return response.content[0].text

Usage

rag = SummaryIndexedRAG(clean_text)
answer = rag.query("What are the termination conditions?")
print(answer)

Best Practices for Summarization RAG

  • Chunk size: 10,000–30,000 characters works well for most documents
  • Overlap: Add 10% overlap between chunks to avoid cutting off important context
  • Summary granularity: Keep summaries concise (100–200 words) for fast retrieval
  • Hybrid search: Combine semantic search (embeddings) with keyword search for better recall
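The chunk_text function shown earlier splits on words with no overlap. As a sketch of the overlap recommendation above (the name chunk_text_with_overlap is my own, not from any library), here is a character-based variant where consecutive chunks share a fixed overlap:

```python
def chunk_text_with_overlap(text, chunk_size=20000, overlap=2000):
    """Character-based chunker where consecutive chunks share
    `overlap` characters, so context at boundaries is preserved."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so the next chunk repeats the tail of this one
        start = end - overlap
    return chunks
```

With chunk_size=20000 and overlap=2000 you get the ~10% overlap suggested above.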

Evaluations

Evaluating summary quality is notoriously difficult. Here are three practical methods:

1. ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares generated summaries against reference summaries:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "The sublease agreement outlines terms between parties A and B..."
generated = "This agreement defines the relationship between party A and party B..."

scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")

2. Promptfoo for Custom Evaluation

Promptfoo lets you define custom evaluation criteria in a YAML config and run them with `npx promptfoo eval`:

# promptfooconfig.yaml
prompts:
  - "Summarize: {{text}}"
  - "You are a legal expert. Summarize: {{text}}"

providers:
  - id: anthropic:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "..."
    assert:
      - type: llm-rubric
        value: "Does the summary include all key parties?"
      - type: llm-rubric
        value: "Is the summary factually accurate?"
      - type: cost
        threshold: 0.01

3. Human Evaluation with Rubrics

Create a scoring rubric for human reviewers:

Criteria       Score (1-5)   Description
Completeness   1-5           All key points covered
Accuracy       1-5           No factual errors
Conciseness    1-5           No unnecessary details
Coherence      1-5           Flows logically
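To track quality over time, reviewer scores against the rubric can be collapsed into a single number, optionally weighting some criteria more heavily. A small sketch; the function name and weighting scheme are illustrative, not from any library:

```python
def aggregate_rubric(scores, weights=None):
    """Collapse 1-5 rubric scores into a single 0-1 quality score.

    `scores` maps criterion name -> score; `weights` optionally maps
    criterion name -> relative weight (defaults to equal weighting).
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    # Rescale the weighted 1-5 average onto a 0-1 range
    return (weighted_sum / total_weight - 1) / 4

ratings = {"Completeness": 4, "Accuracy": 5, "Conciseness": 3, "Coherence": 4}
print(f"Overall quality: {aggregate_rubric(ratings):.2f}")  # → Overall quality: 0.75
```

Weighting accuracy more heavily than the other criteria is a reasonable default for legal and medical summaries, where factual errors are costlier than verbosity.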

Iterative Improvement

Use evaluation results to iteratively improve your summarization pipeline:

  • Baseline: Run basic summarization and measure ROUGE scores
  • Prompt engineering: Refine prompts based on missing elements
  • Chunking strategy: Adjust chunk size and overlap
  • Domain tuning: Add domain-specific instructions
  • Re-evaluate: Compare new scores against baseline

Example iterative loop:

def iterative_improvement(text, reference_summary, iterations=3):
    best_score = 0
    best_prompt = ""
    
    prompts = [
        "Summarize the following:",
        "Provide a concise summary covering all key points:",
        "As an expert analyst, create a structured summary with sections:"
    ]
    
    for i in range(min(iterations, len(prompts))):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": f"{prompts[i]}\n\n{text}"}]
        )
        generated = response.content[0].text
        scores = scorer.score(reference_summary, generated)
        avg_score = (scores['rouge1'].fmeasure + scores['rouge2'].fmeasure + scores['rougeL'].fmeasure) / 3
        
        if avg_score > best_score:
            best_score = avg_score
            best_prompt = prompts[i]
        
        print(f"Iteration {i+1}: ROUGE-L F1 = {scores['rougeL'].fmeasure:.3f}")
    
    return best_prompt, best_score

Conclusion and Best Practices

Summarization with Claude is both powerful and flexible. Here are the key takeaways:

  • Start simple, then iterate: Begin with basic prompts and refine based on evaluation
  • Guide with structure: Use domain-specific prompts to extract exactly what you need
  • Handle long documents with chunking and RAG: Don't let token limits stop you
  • Evaluate systematically: Combine automated metrics (ROUGE) with human review
  • Optimize for your domain: Legal, medical, and technical documents each need tailored approaches
The techniques in this guide give you a complete toolkit for building production-quality summarization systems with Claude. Adapt them to your specific use case, and you'll be able to extract insights from even the longest documents with confidence.

Key Takeaways

  • Prompt engineering matters: Structured, domain-specific prompts produce significantly better summaries than generic "summarize this" requests
  • Chunking + meta-summarization handles any document length: Break long texts into chunks, summarize each, then summarize the summaries for coherent results
  • RAG-based summarization enables query-specific answers: Index chunk summaries for fast retrieval, then use full chunks for detailed responses
  • Evaluate with both automated and human methods: ROUGE scores provide a quick baseline, but human review with rubrics catches nuance
  • Iterative improvement is essential: Small prompt tweaks and chunking adjustments can dramatically improve summary quality over time