Mastering Document Summarization with Claude: From Basic Prompts to Advanced RAG Techniques
A practical guide to summarizing documents with Claude, covering basic prompts, multi-shot techniques, guided summarization, handling long documents via RAG, and evaluating summary quality with automated metrics.
Introduction
Summarization is one of the most powerful and practical applications of large language models like Claude. Whether you're a legal professional drowning in contracts, a researcher sifting through papers, or a business analyst reviewing quarterly reports, the ability to condense lengthy documents into concise, accurate summaries saves time and improves decision-making.
This guide walks you through the complete workflow of document summarization with Claude — from basic prompt techniques to advanced Retrieval-Augmented Generation (RAG) approaches for documents that exceed token limits. We'll also cover how to evaluate summary quality using automated metrics like ROUGE scores and tools like Promptfoo.
By the end, you'll have a reusable framework for building, testing, and refining summarization systems tailored to your specific domain.
Setup and Environment
Before we start, install the required packages:
```bash
pip install anthropic pypdf pandas matplotlib scikit-learn numpy rouge-score nltk seaborn
```

Note that scikit-learn must be installed under that name (`pip install sklearn` fails), and Promptfoo is a Node.js CLI rather than a Python package: install it separately with `npm install -g promptfoo`.
You'll also need a valid Claude API key. Set it as an environment variable:
```bash
export ANTHROPIC_API_KEY="your-api-key-here"
```
Data Preparation
Most real-world documents come as PDFs. Here's a Python function to extract text from a PDF and clean it for summarization:
```python
import re

import pypdf


def extract_text_from_pdf(pdf_path):
    """Extract raw text from every page of a PDF."""
    reader = pypdf.PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # extract_text() can return None on some pages
    return text


def clean_text(text):
    """Normalize whitespace and strip non-ASCII characters."""
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove non-ASCII characters if needed
    text = text.encode('ascii', 'ignore').decode()
    return text.strip()


# Example usage
raw_text = extract_text_from_pdf("sublease_agreement.pdf")
document_text = clean_text(raw_text)  # named to avoid shadowing the clean_text function
print(f"Extracted {len(document_text)} characters")
```
If you don't have a PDF, you can skip this step and define `document_text` directly.
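For example, any plain string works as input for the rest of the guide (the text below is a placeholder; substitute your own document):

```python
# Placeholder text; substitute your own document
document_text = (
    "This Sublease Agreement is entered into by and between the Sublandlord "
    "and the Subtenant, and sets out the term, rent, and obligations of each party."
)
```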
Basic Summarization
Let's start with a simple summarization function using Claude's Messages API:
```python
import anthropic

client = anthropic.Anthropic()


def summarize_text(text, max_summary_length=200):
    """Summarize text with Claude; max_summary_length caps output tokens, not words."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_summary_length,
        messages=[
            {
                "role": "user",
                "content": f"Please summarize the following text concisely:\n\n{text}"
            }
        ]
    )
    return response.content[0].text


summary = summarize_text(document_text)
print(summary)
```
This basic approach works, but it has limitations: the summary may miss key details, and it won't handle documents longer than Claude's context window (200K tokens).
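If you're not sure whether a document fits, you can count tokens before sending it. A quick sketch, assuming an SDK version that exposes the Messages API token-counting endpoint (`client.messages.count_tokens`):

```python
# Pre-flight check: how many input tokens will this document consume?
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": document_text}],
)
print(f"Document is about {count.input_tokens} input tokens")
```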
Multi-Shot Basic Summarization
A simple improvement is a multi-shot approach, sometimes called map-reduce summarization: break the document into chunks, summarize each chunk, then summarize the summaries:
```python
def chunk_text(text, chunk_size=50000):
    """Split text into chunks of at most ~chunk_size characters, on word boundaries."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        current_length += len(word) + 1  # +1 for the joining space
        if current_length > chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_length = len(word)
        else:
            current_chunk.append(word)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks


def multi_shot_summarize(text, chunk_size=50000):
    chunks = chunk_text(text, chunk_size)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i+1}/{len(chunks)}...")
        chunk_summaries.append(summarize_text(chunk, max_summary_length=300))
    # Now summarize the summaries
    combined_summaries = "\n\n".join(chunk_summaries)
    final_summary = summarize_text(
        f"Combine these section summaries into one coherent overall summary:\n\n{combined_summaries}",
        max_summary_length=400
    )
    return final_summary
```
This technique effectively extends Claude's summarization capability to arbitrarily long documents.
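Usage mirrors the basic function; expect one API call per chunk plus a final combining call:

```python
final_summary = multi_shot_summarize(document_text)
print(final_summary)
```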
Advanced Techniques
Guided Summarization
Instead of a generic "summarize this" prompt, guide Claude with specific instructions:
```python
def guided_summarize(text, focus_areas=None):
    prompt = "Summarize the following document."
    if focus_areas:  # only add the focus list when one is provided
        prompt += " Focus on:\n"
        for area in focus_areas:
            prompt += f"- {area}\n"
    prompt += f"\nDocument:\n{text}"
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text


# Example: legal document focus
summary = guided_summarize(document_text, focus_areas=[
    "Key obligations of each party",
    "Termination conditions",
    "Financial terms and payment schedules",
    "Liability and indemnification clauses"
])
```
Domain-Specific Guided Summarization
For specialized domains like legal or medical, include domain-specific instructions:
```python
def legal_summarize(text):
    instructions = """You are a legal document analyst. Summarize this contract with:
- PARTIES: Who are the involved parties?
- TERM: Duration and renewal terms
- OBLIGATIONS: Key responsibilities of each party
- FINANCIAL: Payment amounts, schedules, penalties
- TERMINATION: Conditions for early termination
- LIABILITY: Indemnification, limitations of liability
- GOVERNING LAW: Jurisdiction and dispute resolution
"""
    # Concatenate rather than str.format(), which breaks if the document contains braces
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=800,
        messages=[{"role": "user", "content": instructions + "\nDocument:\n" + text}]
    )
    return response.content[0].text
```
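Usage is the same as the earlier helpers:

```python
print(legal_summarize(document_text))
```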
Meta-Summarization
For very long documents, use a hierarchical approach: summarize sections, then summarize the summaries, and optionally extract metadata:
```python
def meta_summarize(text, chunk_size=30000):
    # Step 1: Extract metadata (title, date, parties, etc.) from the opening pages
    metadata_prompt = (
        "Extract key metadata from this document: title, date, "
        f"parties involved, document type.\n\n{text[:10000]}"
    )
    metadata = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": metadata_prompt}]
    ).content[0].text

    # Step 2: Chunk and summarize each section
    chunks = chunk_text(text, chunk_size)
    summaries = [summarize_text(chunk, max_summary_length=300) for chunk in chunks]

    # Step 3: Combine metadata and section summaries into a final summary
    combined = "\n\n".join(summaries)
    final_prompt = (
        f"Metadata:\n{metadata}\n\nSection Summaries:\n{combined}\n\n"
        "Create a cohesive final summary."
    )
    final_summary = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": final_prompt}]
    ).content[0].text
    return {"metadata": metadata, "summary": final_summary}
```
Summary Indexed Documents: An Advanced RAG Approach
When documents are extremely long (hundreds of pages), even chunked summarization can lose context. A more robust approach is to build a summary-indexed RAG system:
1. Chunk the document into sections
2. Summarize each chunk and store both the chunk and its summary
3. Index the summaries for retrieval
4. Retrieve relevant summaries based on a query, then use the corresponding full chunks for detailed answers
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class SummaryIndexedRAG:
    def __init__(self, text, chunk_size=20000):
        self.chunks = chunk_text(text, chunk_size)
        # Summarize each chunk and index the summaries with TF-IDF
        self.summaries = [
            summarize_text(chunk, max_summary_length=200) for chunk in self.chunks
        ]
        self.vectorizer = TfidfVectorizer().fit(self.summaries)
        self.summary_vectors = self.vectorizer.transform(self.summaries)

    def query(self, question, top_k=3):
        # Retrieve the chunks whose summaries best match the question
        question_vec = self.vectorizer.transform([question])
        similarities = cosine_similarity(question_vec, self.summary_vectors)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        context = ""
        for idx in top_indices:
            context += f"--- Section {idx+1} ---\n{self.chunks[idx]}\n\n"
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Based on the following document sections, answer: {question}\n\n{context}"
            }]
        )
        return response.content[0].text


# Usage
rag = SummaryIndexedRAG(document_text)
answer = rag.query("What are the termination conditions?")
print(answer)
```
Best Practices for Summarization RAG
- Chunk size: 10,000–30,000 characters works well for most documents
- Overlap: Add roughly 10% overlap between chunks to avoid cutting off important context (a minimal sketch follows this list)
- Summary granularity: Keep summaries concise (100–200 words) for fast retrieval
- Hybrid search: Combine semantic search (embeddings) with keyword search for better recall
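Here is a minimal sketch of a character-based chunker with overlap, as referenced above. The name `chunk_text_with_overlap` is an illustrative helper, and the 10% default is a starting point rather than a rule:

```python
def chunk_text_with_overlap(text, chunk_size=20000, overlap_ratio=0.1):
    """Split text into fixed-size character chunks, each overlapping the last."""
    step = int(chunk_size * (1 - overlap_ratio))  # how far each chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final chunk reached the end of the document
    return chunks
```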
Evaluations
Evaluating summary quality is notoriously difficult. Here are three practical methods:
1. ROUGE Scores
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) compares generated summaries against reference summaries:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "The sublease agreement outlines terms between parties A and B..."
generated = "This agreement defines the relationship between party A and party B..."

# score(target, prediction): reference first, generated summary second
scores = scorer.score(reference, generated)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-2: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")
```
2. Promptfoo for Custom Evaluation
Promptfoo allows you to define custom evaluation criteria in a YAML config:

```yaml
# promptfooconfig.yaml
prompts:
  - "Summarize: {{text}}"
  - "You are a legal expert. Summarize: {{text}}"

providers:
  - id: anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      text: "..."
    assert:
      - type: llm-rubric
        value: "Does the summary include all key parties?"
      - type: llm-rubric
        value: "Is the summary factually accurate?"
      - type: cost
        threshold: 0.01
```
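Then run the evaluation from the directory containing the config, and open the results viewer:

```bash
npx promptfoo@latest eval
npx promptfoo@latest view
```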
3. Human Evaluation with Rubrics
Create a scoring rubric for human reviewers:
| Criterion | Score (1-5) | Description |
|---|---|---|
| Completeness | 1-5 | All key points covered |
| Accuracy | 1-5 | No factual errors |
| Conciseness | 1-5 | No unnecessary details |
| Coherence | 1-5 | Flows logically |
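To aggregate results, averaging each criterion across reviewers is usually enough. A short sketch using a hypothetical `reviews` structure (one dict per reviewer):

```python
# Hypothetical scores: one dict per reviewer, mapping criterion -> 1-5
reviews = [
    {"completeness": 4, "accuracy": 5, "conciseness": 3, "coherence": 4},
    {"completeness": 5, "accuracy": 4, "conciseness": 4, "coherence": 5},
]

for criterion in reviews[0]:
    mean = sum(r[criterion] for r in reviews) / len(reviews)
    print(f"{criterion}: {mean:.1f}")
```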
Iterative Improvement
Use evaluation results to iteratively improve your summarization pipeline:
- Baseline: Run basic summarization and measure ROUGE scores
- Prompt engineering: Refine prompts based on missing elements
- Chunking strategy: Adjust chunk size and overlap
- Domain tuning: Add domain-specific instructions
- Re-evaluate: Compare new scores against baseline
```python
def iterative_improvement(text, reference_summary, iterations=3):
    """Try several prompts and keep the one with the best average ROUGE F1."""
    best_score = 0.0
    best_prompt = ""
    prompts = [
        "Summarize the following:",
        "Provide a concise summary covering all key points:",
        "As an expert analyst, create a structured summary with sections:"
    ]
    for i in range(min(iterations, len(prompts))):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": f"{prompts[i]}\n\n{text}"}]
        )
        generated = response.content[0].text
        # Reuses the RougeScorer defined in the Evaluations section
        scores = scorer.score(reference_summary, generated)
        avg_score = (scores['rouge1'].fmeasure + scores['rouge2'].fmeasure
                     + scores['rougeL'].fmeasure) / 3
        if avg_score > best_score:
            best_score = avg_score
            best_prompt = prompts[i]
        print(f"Iteration {i+1}: avg ROUGE F1 = {avg_score:.3f}")
    return best_prompt, best_score
```
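Running it requires a trusted reference summary to score against:

```python
reference_summary = "..."  # a human-written summary of the same document
best_prompt, best_score = iterative_improvement(document_text, reference_summary)
print(f"Best prompt: {best_prompt!r} (avg ROUGE F1 = {best_score:.3f})")
```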
Conclusion and Best Practices
Summarization with Claude is both powerful and flexible. Here are the key takeaways:
- Start simple, then iterate: Begin with basic prompts and refine based on evaluation
- Guide with structure: Use domain-specific prompts to extract exactly what you need
- Handle long documents with chunking and RAG: Don't let token limits stop you
- Evaluate systematically: Combine automated metrics (ROUGE) with human review
- Optimize for your domain: Legal, medical, and technical documents each need tailored approaches
Key Takeaways
- Prompt engineering matters: Structured, domain-specific prompts produce significantly better summaries than generic "summarize this" requests
- Chunking + meta-summarization handles any document length: Break long texts into chunks, summarize each, then summarize the summaries for coherent results
- RAG-based summarization enables query-specific answers: Index chunk summaries for fast retrieval, then use full chunks for detailed responses
- Evaluate with both automated and human methods: ROUGE scores provide a quick baseline, but human review with rubrics catches nuance
- Iterative improvement is essential: Small prompt tweaks and chunking adjustments can dramatically improve summary quality over time