Guide · Intermediate · API · 2026-05-15

Claude Vision API: Complete Guide to Image and Multimodal Input

Learn how to use Claude's Vision capabilities to analyze images, extract data from PDFs, process screenshots, and build multimodal AI applications with the Claude API.

Quick Answer

Claude Vision lets you send images (PNG, JPEG, WEBP, GIF) alongside text prompts via the API, either as base64-encoded data or as URL references. Claude can analyze charts, extract text from documents, describe photos, and answer questions about visual content. Token cost scales with image dimensions and tops out at roughly 1,600 tokens per image on Sonnet 4 and Opus 4.6.

Tags: vision, multimodal, image-analysis, api, pdf

What is Claude Vision?

Claude Vision is Claude's multimodal capability that allows it to process and analyze images alongside text. Unlike text-only models, Claude can look at photographs, diagrams, charts, screenshots, PDFs, and handwritten notes — then answer questions about them, extract information, or take actions based on what it sees.

This capability transforms Claude from a language model into a multimodal assistant that can help with tasks ranging from document analysis to UI testing to scientific figure interpretation.

Supported Image Formats

Claude supports these image formats as input:

Format	MIME Type	Max Resolution	Use Case
PNG	`image/png`	8,000 x 8,000 px	Screenshots, diagrams, documents
JPEG	`image/jpeg`	8,000 x 8,000 px	Photos, scanned documents
WEBP	`image/webp`	8,000 x 8,000 px	Web images, optimized photos
GIF	`image/gif`	8,000 x 8,000 px	Simple animations (static frame)

Important limits:

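Maximum file size: 5 MB per image through the API, per Anthropic's documentation at the time of writing; confirm the current limit before relying on it
Base64 encoding inflates the payload by about a third, so a 3 MB file becomes roughly 4 MB on the wire and counts toward the overall request size limit
Very large images are automatically resized: images whose long edge exceeds about 1,568 px, or that would cost more than roughly 1,600 tokens, are scaled down before Claude processes them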
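
The size limits above can be checked before a request is sent. The following is a minimal sketch rather than anything provided by the SDK: `base64_length` and `scaled_dimensions` are hypothetical helper names, and `max_edge` is supplied by the caller because the exact internal target size is treated as an assumption here.

```python
def base64_length(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    # Base64 emits 4 output characters for every 3 input bytes, rounding up.
    return 4 * ((raw_bytes + 2) // 3)


def scaled_dimensions(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Dimensions after shrinking so the longer edge is at most max_edge.

    Aspect ratio is preserved; images that already fit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(base64_length(3_000_000))             # encoded size of a 3 MB file
    print(scaled_dimensions(4000, 3000, 1600))  # max_edge here is illustrative only
```

Resizing on your side before upload, using dimensions like these, keeps the payload small and avoids paying upload latency for pixels that would be discarded anyway.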

Getting Started with Image Analysis

Using the Claude API

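import base64

import anthropic

# The client reads the ANTHROPIC_API_KEY environment variable by default
client = anthropic.Anthropic()

# Read the image and base64-encode it; the API expects a string, not raw bytes
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # The image block comes first, followed by the question about it
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",  # must match the actual file format
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Describe this chart in detail. What are the key trends?"
            }
        ],
    }],
)

# The reply is a list of content blocks; the first one holds the text
print(response.content[0].text)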
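
When many images are involved, the encoding step can be wrapped in a small helper. This is a sketch under stated assumptions: `image_block` and `MEDIA_TYPES` are hypothetical names, not part of the `anthropic` package, and the media type is inferred from the file extension only, so a mislabeled file would still be sent with the wrong type.

```python
import base64
from pathlib import Path

# Extension-to-MIME mapping for the four supported formats.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def image_block(path: str) -> dict:
    """Build a base64 image content block for the Messages API from a file."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image extension: {suffix!r}")
    data = base64.b64encode(file.read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[suffix], "data": data},
    }
```

The returned dictionary can be placed directly in the `content` list of a user message, ahead of the text block that asks the question.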

Using Image URLs

You can also reference images by URL:

# Reuses the client created in the previous example
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/dashboard-screenshot.png"
                },
            },
            {
                "type": "text",
                "text": "What metrics are shown on this dashboard?"
            }
        ],
    }],
)
print(response.content[0].text)

URL requirements:

Must be publicly accessible (no authentication)
Must use HTTPS
Response must include Content-Type header with the image MIME type
Image must be served over a stable connection with reasonable latency
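
The HTTPS requirement is the only one of these that can be checked without fetching the image. A minimal sketch of that check follows; `url_image_block` is a hypothetical helper name, and it makes no attempt to verify public accessibility or the Content-Type header.

```python
from urllib.parse import urlparse


def url_image_block(url: str) -> dict:
    """Build a URL image content block, rejecting non-HTTPS URLs up front."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"Expected an absolute https:// URL, got: {url!r}")
    return {"type": "image", "source": {"type": "url", "url": url}}
```

Failing early on a malformed URL gives a clearer error message than waiting for the API to reject the request.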

Practical Vision Use Cases

1. Document and PDF Analysis

Claude Vision excels at extracting structured data from documents:

# Load the scanned invoice (hypothetical filename) and base64-encode it
with open("invoice.png", "rb") as f:
    invoice_image = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": invoice_image}},
            {"type": "text", "text": """Extract the following fields from this invoice:
- Invoice number
- Date
- Vendor name
- Line items (description, quantity, unit price, total)
- Subtotal, tax, grand total
- Payment terms

Format as JSON."""}
        ]
    }],
)
print(response.content[0].text)
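
Because the prompt asks for JSON, the reply text usually needs to be parsed before use. The sketch below assumes the model may wrap its JSON in a Markdown code fence, which is a common pattern but not guaranteed; `parse_json_reply` is a hypothetical helper name, and replies with extra prose around the JSON would still fail to parse.

```python
import json

# Markdown code-fence marker, built this way to avoid a literal fence in the example.
FENCE = "`" * 3


def parse_json_reply(text: str):
    """Parse JSON from a model reply, tolerating a surrounding Markdown code fence."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]  # drop the opening fence, with or without a language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
    # json.loads raises json.JSONDecodeError if what remains is not valid JSON.
    return json.loads("\n".join(lines))
```

A typical call would be `invoice = parse_json_reply(response.content[0].text)`, wrapped in a `try`/`except json.JSONDecodeError` so that a malformed reply can be retried or logged.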

2. Chart and Data Visualization Analysis

Perfect for analyzing business dashboards, scientific figures, and financial charts:

Extract the key data points from this line chart:
What is the trend for each series?
Identify any anomalies or outliers
What is the approximate value at each labeled point?

3. UI/UX Review and Testing

Claude can review screenshots of your application:

Review this UI screenshot for:
Visual alignment issues
Missing or inconsistent elements
Accessibility concerns (color contrast, font sizes)
Layout problems at this viewport size

4. Handwriting Recognition

Claude can read handwritten notes and forms:

Transcribe the handwritten text in this image.
Preserve the original formatting and layout where possible.
Note any words you're uncertain about with [brackets].
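
If the transcription follows the bracket convention requested in this prompt, the uncertain words can be pulled out for human review. This is a small sketch with a hypothetical function name, and it assumes brackets appear in the reply only as uncertainty markers.

```python
import re

# Matches text wrapped in square brackets, e.g. "[noon]".
BRACKETED = re.compile(r"\[([^\[\]]+)\]")


def uncertain_words(transcript: str) -> list[str]:
    """Return the bracketed spans of a transcript, in order of appearance."""
    return BRACKETED.findall(transcript)
```

A long list of flagged words is a useful signal that the scan should be retaken at higher resolution or contrast.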

Image Processing Costs

Vision requests are billed as input tokens, and the token count depends on the image's pixel dimensions. Anthropic's documentation gives the estimate tokens ≈ (width × height) / 750 for images that do not need resizing; larger images are scaled down so that they cost roughly 1,600 tokens at most:

Image Size	Approximate Token Cost (Sonnet 4 / Opus 4.6)
Small (< 500x500)	up to ~330 tokens
Medium (1000x1000)	~1,330 tokens
Large (2000x2000+)	~1,600 tokens (auto-resized)
Max (8000x8000)	~1,600 tokens (auto-resized)

Pricing example with Sonnet 4 ($3/M input tokens):

One medium image (~1,330 tokens) + 500 text tokens = ~1,830 input tokens
Cost per request: ~$0.0055
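
The same arithmetic generalizes to any mix of images and text. The function below is an illustrative sketch with a hypothetical name; it covers input tokens only, so the output tokens of the reply would be billed on top of it.

```python
def request_cost(image_tokens: list[int], text_tokens: int, price_per_million: float) -> float:
    """Input cost in dollars for one request.

    image_tokens holds the token cost of each image; price_per_million is the
    model's input price in dollars per million tokens.
    """
    total_tokens = sum(image_tokens) + text_tokens
    return total_tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    # One image at the ~1,600-token cap plus 500 text tokens at $3 per million.
    print(f"${request_cost([1600], 500, 3.0):.4f}")  # prints $0.0063
```

Passing one entry per image makes it easy to estimate multi-image requests, such as before/after screenshot comparisons.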

Compare model pricing on our Claude API Pricing Guide and use the Pricing Calculator to estimate your costs.

Best Practices for Vision Prompts

1. Be Specific About What to Look At

Instead of: "What's in this image?"
Try: "Look at the table in the bottom-right section of this dashboard. What are the top 3 rows by revenue?"

2. Provide Context

Give Claude context about what the image represents:

This is a screenshot of a customer support dashboard taken at 3:00 PM on a Monday.
The data shown is for the past 24 hours.

3. Request Structured Output

For extraction tasks, always specify the output format:

Extract the data from this table and format it as a markdown table.
Then provide a summary of the key insights in bullet points.

4. Use Multiple Images

You can send multiple images in a single request to compare or combine information:

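# before_image and after_image are base64-encoded PNG strings,
# prepared the same way as image_data in the first example
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_image}},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_image}},
            {"type": "text", "text": "Compare these two screenshots and list all the changes you notice."}
        ]
    }],
)
print(response.content[0].text)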
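
With more than two images, it can help to label each one so the prompt and the reply can refer to "Image 1", "Image 2", and so on. One possible way to build such a content list is sketched here; `labeled_image_content` is a hypothetical helper, and it assumes all images share a single media type.

```python
def labeled_image_content(
    images_b64: list[str], question: str, media_type: str = "image/png"
) -> list[dict]:
    """Interleave 'Image N:' labels with image blocks, then append the question."""
    content = []
    for index, data in enumerate(images_b64, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    content.append({"type": "text", "text": question})
    return content
```

The result goes in the `content` field of a single user message, for example `{"role": "user", "content": labeled_image_content([before_image, after_image], "What changed between Image 1 and Image 2?")}`.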

Vision with Claude Code

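Claude Code also supports image input. You can drag and drop an image into your terminal session, paste it from the clipboard, or reference its file path in the prompt:

# Reference the image by path in the prompt; Claude Code reads the file itself
claude "Analyze the UI mockup at ./mockup.png and generate React code for it"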

This is particularly useful for:

Generating code from design mockups
Debugging UI issues by sharing screenshots
Converting diagrams into code implementations

Model Comparison for Vision Tasks

Capability	Opus 4.6	Sonnet 4	Haiku
Image understanding	Best	Excellent	Good
Text extraction	Excellent	Excellent	Good
Chart analysis	Best	Excellent	Fair
Handwriting	Excellent	Very Good	Fair
Speed	Slowest	Fast	Fastest
Cost per image	~$0.024	~$0.0048	~$0.0004

Common Issues and Troubleshooting

Image Quality Issues

Blurry text: Ensure minimum 300 DPI for scanned documents
Small text: Claude works best with text at least 10px tall in screenshots
Low contrast: High contrast images produce better results

Error Handling

try:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[...]
    )
except anthropic.BadRequestError as e:
    if "image" in str(e).lower():
        print("Check image format, size, or encoding")
    raise
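
One plausible cause of image-related request errors is a declared `media_type` that does not match the file's actual contents, for example a JPEG saved with a `.png` extension. The sketch below inspects the leading bytes of the decoded file to catch that before sending; `sniff_media_type` is a hypothetical helper name, and it recognizes only the four formats this guide covers.

```python
def sniff_media_type(data: bytes):
    """Return the image MIME type implied by the file's leading bytes, or None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WEBP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None
```

Comparing the sniffed type against the one about to be sent, and raising locally on a mismatch or a `None` result, turns an opaque API error into a specific, actionable one.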

Key Takeaways

Claude Vision supports PNG, JPEG, WEBP, and GIF formats through base64 encoding or URL references
Be specific about what you want Claude to analyze in the image — don't rely on vague instructions
Image processing costs ~1,600 tokens max per image, making it economical for most use cases
Multiple images can be sent in a single request for comparison tasks
Best results come from high-contrast, well-lit images with readable text

For more API patterns and best practices, see our Getting Started with Claude API guide and Building AI Agents with Claude tutorial for combining vision with tool use.