Master Document Summarization with Claude: From Basic Techniques to Advanced RAG
This guide teaches you to implement and refine document summarization using Claude AI. You'll learn effective prompting strategies, techniques for handling long documents, methods to evaluate summary quality, and how to build advanced summary-indexed RAG systems for complex texts like legal agreements.
Introduction
In today's information-saturated world, the ability to quickly distill lengthy documents into concise summaries is invaluable. Whether you're reviewing legal contracts, research papers, or business reports, summarization saves time and enhances comprehension. Claude AI excels at this task, offering sophisticated text understanding and generation capabilities.
This guide provides a practical framework for implementing document summarization with Claude, with particular emphasis on challenging documents like legal agreements. We'll progress from basic techniques to advanced approaches, including evaluation methods and RAG (Retrieval-Augmented Generation) systems built on summaries.
Prerequisites and Setup
Before beginning, ensure you have the necessary tools installed:
pip install anthropic pypdf pandas rouge-score nltk
You'll also need a Claude API key. Set it in your environment:
import anthropic
import os

client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY")
)
For document processing, we'll use a sample Sublease Agreement from SEC filings, but you can adapt the code for any PDF or text document.
Basic Summarization Techniques
Simple Single-Shot Summarization
Let's start with a fundamental approach using Claude's API:
def basic_summarize(text, max_tokens=500):
    """Basic summarization function for documents within token limits."""
    prompt = f"""Please summarize the following document concisely, focusing on the key points and main obligations:

{text}

Summary:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
This approach works well for documents under Claude's context window (typically 200K tokens). For longer documents, we need more sophisticated techniques.
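Before sending a document, it helps to check whether it is likely to fit. For an exact count you can use the API's token-counting support, but a rough character-based heuristic (about 4 characters per token for English text, an approximation rather than a guarantee) is often enough for a first pass:

```python
def fits_in_context(text, context_limit=200_000, chars_per_token=4):
    """Rough check of whether a document fits in the context window.

    Uses the common ~4-characters-per-token heuristic for English text;
    it is an estimate only, so leave headroom for the prompt and response.
    """
    estimated_tokens = len(text) // chars_per_token
    return estimated_tokens <= context_limit

# A 1,000,000-character document is roughly 250K tokens -- too large
print(fits_in_context("x" * 1_000_000))  # False
```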
Multi-Shot Prompting for Consistency
Multi-shot examples help Claude understand your specific summarization style and requirements:
def multi_shot_summarize(text):
    examples = """Example 1:
Document: A software license agreement between Company A and Customer B...
Summary: This agreement grants Customer B a non-exclusive license to use Software X for internal business purposes. Key terms include: 1-year term, $10,000 annual fee, limited technical support, and confidentiality obligations.

Example 2:
Document: An employment contract for a senior developer...
Summary: This contract outlines employment terms for John Doe as Senior Developer. Key elements: $120,000 annual salary, 2-year term, intellectual property assignment to Company, and standard termination clauses with 30-day notice."""
    prompt = f"""{examples}

Now summarize the following document in the same style:

{text}

Summary:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
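If you maintain many example pairs, hard-coding them into one long string becomes unwieldy. One way to keep them manageable (a hypothetical helper, not part of the Anthropic SDK) is to build the examples block from a list of (document, summary) pairs:

```python
def build_examples_block(pairs):
    """Format (document, summary) pairs into a few-shot examples block."""
    sections = []
    for i, (doc, summary) in enumerate(pairs, start=1):
        sections.append(f"Example {i}:\nDocument: {doc}\nSummary: {summary}")
    return "\n\n".join(sections)

pairs = [
    ("A software license agreement between Company A and Customer B...",
     "This agreement grants Customer B a non-exclusive license..."),
    ("An employment contract for a senior developer...",
     "This contract outlines employment terms for John Doe..."),
]
examples = build_examples_block(pairs)
```

This keeps the examples in data rather than in the prompt string, so you can grow or swap them per document type without editing the function.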
Advanced Summarization Strategies
Guided Summarization with Specific Instructions
For complex documents like legal agreements, guided summarization yields better results:
def guided_legal_summary(text):
    prompt = f"""Analyze this legal document and provide a structured summary with the following sections:

1. PARTIES: Identify all parties involved
2. KEY TERMS: List the 5-7 most important contractual terms
3. OBLIGATIONS: Summarize main obligations for each party
4. DURATION: Note the agreement term and renewal conditions
5. TERMINATION: Describe termination clauses and conditions
6. RISK FACTORS: Highlight any unusual or risky provisions

Document:
{text}

Structured Summary:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=800,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
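Because the guided prompt requests numbered, labeled sections, the response can be parsed back into a dictionary for downstream use. A minimal parser, assuming Claude follows the `1. PARTIES: ...` format requested above (model output can vary, so validate in practice):

```python
import re

def parse_structured_summary(summary_text):
    """Split a numbered 'N. HEADING: body' summary into a dict keyed by heading."""
    sections = {}
    # Match lines like "1. PARTIES:" and capture everything up to the next numbered heading
    pattern = r"\d+\.\s*([A-Z ]+):\s*(.*?)(?=\n\d+\.\s*[A-Z ]+:|\Z)"
    for heading, body in re.findall(pattern, summary_text, flags=re.DOTALL):
        sections[heading.strip()] = body.strip()
    return sections

example = """1. PARTIES: Acme Corp and Beta LLC
2. KEY TERMS: 5-year sublease, $5,000/month rent
3. OBLIGATIONS: Subtenant maintains premises"""
parsed = parse_structured_summary(example)
print(parsed["PARTIES"])  # Acme Corp and Beta LLC
```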
Handling Long Documents with Chunking
For documents exceeding Claude's context window, implement a chunk-and-summarize approach:
def summarize_long_document(full_text, chunk_size=100000):
    """Summarize documents longer than Claude's context window."""
    # Split document into manageable chunks (chunk_size is in characters)
    chunks = [full_text[i:i + chunk_size] for i in range(0, len(full_text), chunk_size)]
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        print(f"Summarizing chunk {i + 1}/{len(chunks)}")
        chunk_prompt = f"""Summarize this section of a larger document. Focus on the key points that would be important for an overall document summary:

{chunk}

Section Summary:"""
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=300,
            messages=[{"role": "user", "content": chunk_prompt}]
        )
        chunk_summaries.append(response.content[0].text)

    # Create a meta-summary from all chunk summaries
    combined_summaries = "\n\n".join(chunk_summaries)
    meta_prompt = f"""Below are summaries of different sections from a single document. Provide a comprehensive overall summary of the entire document:

{combined_summaries}

Overall Document Summary:"""
    final_response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": meta_prompt}]
    )
    return final_response.content[0].text
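Slicing by a fixed character count can cut a sentence or clause in half at a chunk boundary, which degrades the section summaries. A paragraph-aware splitter (a sketch; sizes are still in characters, not tokens) keeps paragraphs intact wherever possible:

```python
def chunk_by_paragraphs(text, chunk_size=100_000):
    """Split text into chunks of at most chunk_size characters,
    breaking on paragraph boundaries where possible."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        # A single paragraph longer than a whole chunk: fall back to hard slicing
        if len(para) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(para[i:i + chunk_size]
                          for i in range(0, len(para), chunk_size))
            continue
        # +2 accounts for the "\n\n" separator we re-insert when joining
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Swapping this in for the list comprehension in `summarize_long_document` leaves the rest of the pipeline unchanged.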
Building Summary-Indexed RAG Systems
An advanced application of summarization is creating a RAG system where summaries serve as the retrieval index:
class SummaryIndexedRAG:
    def __init__(self):
        self.document_store = []

    def add_document(self, doc_id, full_text):
        """Generate and store a summary for document retrieval."""
        summary = self._generate_search_summary(full_text)
        self.document_store.append({
            "id": doc_id,
            "summary": summary,
            "full_text": full_text,
            "metadata": self._extract_metadata(full_text)
        })

    def _generate_search_summary(self, text):
        """Generate a summary optimized for retrieval purposes."""
        prompt = f"""Create a detailed, keyword-rich summary of this document that would help in information retrieval. Include:

1. Document type and purpose
2. Key entities mentioned
3. Main topics covered
4. Important dates and numbers
5. Any unique or distinctive elements

Document:
{text}

Retrieval-Optimized Summary:"""
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=400,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def _extract_metadata(self, text):
        """Extract structured metadata from document."""
        # Use only the first 20K characters for metadata extraction
        prompt = f"""Extract the following metadata from this document as JSON:
- document_type
- parties (list)
- effective_date
- expiration_date
- key_terms (list of 5-7 terms)

Document:
{text[:20000]}

JSON Metadata:"""
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}]
        )
        # Parse the JSON response (implementation depends on your needs)
        return self._parse_json_response(response.content[0].text)

    def query(self, question, top_k=3):
        """Query the document store using summaries for retrieval."""
        # Keyword matching over summaries as a placeholder (in production, use embeddings)
        relevant_docs = self._find_relevant_docs(question, top_k)

        # Generate an answer from the retrieved documents
        context = "\n\n".join([doc["summary"] for doc in relevant_docs])
        answer_prompt = f"""Based on the following document summaries, answer the question:

Question: {question}

Document Summaries:
{context}

Answer:"""
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": answer_prompt}]
        )
        return {
            "answer": response.content[0].text,
            "source_docs": relevant_docs
        }
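The class above leaves `_find_relevant_docs` and `_parse_json_response` unimplemented. Here are minimal sketches of both, written as standalone functions: keyword overlap stands in for embedding-based retrieval, and JSON extraction is best-effort, so in production you would swap in an embedding model and stricter validation:

```python
import json
import re

def find_relevant_docs(document_store, question, top_k=3):
    """Rank stored documents by keyword overlap between question and summary."""
    question_words = set(re.findall(r"\w+", question.lower()))

    def score(doc):
        summary_words = set(re.findall(r"\w+", doc["summary"].lower()))
        return len(question_words & summary_words)

    ranked = sorted(document_store, key=score, reverse=True)
    return ranked[:top_k]

def parse_json_response(response_text):
    """Best-effort extraction of the first JSON object in a model response."""
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
```

To wire these into the class, add them as `_find_relevant_docs` and `_parse_json_response` methods (adding `self` and reading `self.document_store` instead of the explicit argument).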
Evaluating Summary Quality
Evaluating summaries is challenging but crucial. Here are practical approaches:
Using ROUGE Metrics
from rouge_score import rouge_scorer

def evaluate_summary_rouge(generated_summary, reference_summary):
    """Calculate ROUGE scores between generated and reference summaries."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference_summary, generated_summary)
    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure
    }
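To build intuition for what these scores measure, here is ROUGE-1 F-measure computed from scratch: the harmonic mean of unigram precision and recall between the two summaries (a simplified sketch that omits the stemming `rouge_score` applies):

```python
from collections import Counter

def rouge1_f(reference, generated):
    """ROUGE-1 F-measure: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Count each word at most as often as it appears in both texts
    overlap = sum((ref_counts & gen_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat on the mat", "the cat sat"))
# precision = 3/3 = 1.0, recall = 3/6 = 0.5, F = 2*1*0.5/1.5 ~ 0.667
```

Note that high lexical overlap is no guarantee of factual accuracy, which is why ROUGE is best paired with the LLM-based evaluation below.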
Custom Evaluation with Claude
You can also use Claude to evaluate summaries against specific criteria:
def evaluate_with_claude(original, summary, criteria):
    prompt = f"""Evaluate this summary based on the following criteria:

{criteria}

Original Document (excerpt):
{original[:5000]}

Generated Summary:
{summary}

Provide scores (1-5) for each criterion and brief justification:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
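Since the evaluation prompt asks for 1-5 scores, the free-text response can be turned into numbers for aggregation across documents. A simple extractor, assuming scores appear on lines like `Criterion: 4/5` or `Criterion: 4` (a hypothetical format; adjust the pattern to whatever your prompt actually elicits):

```python
import re

def extract_scores(evaluation_text):
    """Pull 'Criterion: N' or 'Criterion: N/5' style scores out of free text."""
    scores = {}
    for criterion, value in re.findall(
            r"^([A-Za-z ]+):\s*(\d)(?:/5)?\b", evaluation_text, flags=re.MULTILINE):
        scores[criterion.strip()] = int(value)
    return scores

sample = """Accuracy: 4/5 - captures the main obligations
Completeness: 3/5 - omits the renewal clause
Conciseness: 5"""
print(extract_scores(sample))
# {'Accuracy': 4, 'Completeness': 3, 'Conciseness': 5}
```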
Iterative Improvement Workflow
- Start Simple: Begin with basic summarization and evaluate results
- Add Structure: Implement guided summarization for complex documents
- Handle Scale: Add chunking for long documents
- Extract Metadata: Enhance with structured data extraction
- Build Systems: Create summary-indexed RAG applications
- Continuous Evaluation: Regularly assess quality and refine prompts
Key Takeaways
- Prompt engineering is crucial: Specific, structured prompts yield significantly better summaries than generic requests. Include examples and clear formatting instructions.
- Chunk strategically for long documents: When documents exceed context limits, implement a chunk-summarize-meta-summarize pipeline to maintain coherence.
- Summaries can power advanced systems: Use document summaries as indexes for RAG systems, enabling efficient retrieval and question-answering over large document collections.
- Evaluation requires multiple approaches: Combine automated metrics (ROUGE) with LLM-based evaluation and human review for comprehensive quality assessment.
- Iterate and refine: Treat summarization as an iterative process. Test different prompting strategies and document-specific adaptations for optimal results.