Claude Vision API: Complete Guide to Image and Multimodal Input
Learn how to use Claude's Vision capabilities to analyze images, extract data from PDFs, process screenshots, and build multimodal AI applications with the Claude API.
Claude Vision lets you send images (PNG, JPEG, WEBP, GIF) alongside text prompts via the API, either as base64-encoded data or as URL references. Claude can analyze charts, extract text from documents, describe photos, and answer questions about visual content. Token usage depends on image dimensions rather than on the model: a full-size image uses roughly 1,600 input tokens on Sonnet 4 and Opus 4.6, and the dollar cost then follows from each model's per-token price.
What is Claude Vision?
Claude Vision is Claude's multimodal capability that allows it to process and analyze images alongside text. Unlike text-only models, Claude can look at photographs, diagrams, charts, screenshots, PDFs, and handwritten notes — then answer questions about them, extract information, or take actions based on what it sees.
This capability transforms Claude from a language model into a multimodal assistant that can help with tasks ranging from document analysis to UI testing to scientific figure interpretation.
Supported Image Formats
Claude supports these image formats as input:
| Format | MIME Type | Max Resolution | Use Case |
|---|---|---|---|
| PNG | image/png | 8,000 x 8,000 px | Screenshots, diagrams, documents |
| JPEG | image/jpeg | 8,000 x 8,000 px | Photos, scanned documents |
| WEBP | image/webp | 8,000 x 8,000 px | Web images, optimized photos |
| GIF | image/gif | 8,000 x 8,000 px | Simple animations (static frame) |
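- Maximum file size: 5 MB per image when sent through the API; base64 encoding inflates the payload by about 33%, which counts toward the overall request size limit
- For best results, resize or compress images before upload rather than sending files near the limit
- Large images are automatically downscaled before processing: if the long edge exceeds about 1,568 px, or the image would cost more than about 1,600 tokens, it is resized with its aspect ratio preserved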
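The format and size constraints above can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: `build_image_block` is a hypothetical helper name, and `MAX_IMAGE_BYTES` assumes a 5 MB per-image cap for API uploads, which should be adjusted to whatever limit the current documentation states.

```python
import base64
from pathlib import Path

# Assumption: per-image size cap for API uploads; adjust to the documented limit.
MAX_IMAGE_BYTES = 5 * 1024 * 1024

# File extensions mapped to the MIME types listed in the table above.
EXTENSION_TO_MEDIA_TYPE = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def build_image_block(path, max_bytes=MAX_IMAGE_BYTES):
    """Return a base64 image content block for a local file, or raise ValueError."""
    file_path = Path(path)
    media_type = EXTENSION_TO_MEDIA_TYPE.get(file_path.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported image extension: {file_path.suffix!r}")
    raw = file_path.read_bytes()
    if len(raw) > max_bytes:
        raise ValueError(f"Image is {len(raw)} bytes; limit is {max_bytes}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(raw).decode("ascii"),
        },
    }
```

The returned dictionary has the same shape as the image blocks used in the request examples in this guide, so it can be placed directly in a message's `content` list.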
Getting Started with Image Analysis
Using the Claude API
import anthropic
import base64

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Read the image and base64-encode it for the request body
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",  # must match the actual file format
                    "data": image_data,
                },
            },
            {
                "type": "text",
                "text": "Describe this chart in detail. What are the key trends?"
            }
        ],
    }],
)
print(response.content[0].text)
Using Image URLs
You can also reference images by URL:
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/dashboard-screenshot.png"
                },
            },
            {
                "type": "text",
                "text": "What metrics are shown on this dashboard?"
            }
        ],
    }],
)
URL requirements:
- Must be publicly accessible (no authentication)
- Must use HTTPS
- Response must include Content-Type header with the image MIME type
- Image must be served over a stable connection with reasonable latency
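Some of these requirements can be checked from the URL string alone, before any request is made. The sketch below is a hypothetical helper (`check_image_url` is not an SDK function) that only inspects the URL; it does not contact the server, so it cannot confirm the Content-Type header or that the image is actually reachable.

```python
from urllib.parse import urlparse


def check_image_url(url):
    """Return a list of problems found by inspecting the URL string only."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("URL must use HTTPS")
    if not parsed.netloc:
        problems.append("URL has no host")
    if parsed.username or parsed.password:
        problems.append("URL embeds credentials; the image must be publicly accessible")
    return problems


if __name__ == "__main__":
    for candidate in (
        "https://example.com/dashboard-screenshot.png",
        "http://example.com/dashboard-screenshot.png",
    ):
        print(candidate, check_image_url(candidate))
```

An empty list means no problem was detected at the string level, not that the URL is guaranteed to work.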
Practical Vision Use Cases
1. Document and PDF Analysis
Claude Vision excels at extracting structured data from documents:
# invoice_image: base64-encoded PNG data, prepared as in the first example
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": invoice_image}},
            {"type": "text", "text": """Extract the following fields from this invoice:
- Invoice number
- Date
- Vendor name
- Line items (description, quantity, unit price, total)
- Subtotal, tax, grand total
- Payment terms
Format as JSON."""}
        ]
    }],
)
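Because the prompt asks for JSON, the reply text usually needs to be parsed before it can be used. A reply may wrap the JSON in a Markdown code fence or add a sentence around it, so the following sketch (with a hypothetical helper name, `parse_json_reply`) extracts the outermost object before parsing. It assumes the reply contains a single JSON object.

```python
import json
import re

# Built at runtime so this example contains no literal code-fence marker.
_FENCE = "`" * 3
_FENCED_BLOCK = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def parse_json_reply(text):
    """Parse a JSON object from a model reply that may be wrapped in a code fence."""
    fenced = _FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    # Keep only the span from the first "{" to the last "}".
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")
    return json.loads(candidate[start : end + 1])
```

With the request above, this would be called as `parse_json_reply(response.content[0].text)`. A `ValueError` (or `json.JSONDecodeError`, which is a subclass) signals that the reply should be inspected or the request retried.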
2. Chart and Data Visualization Analysis
Perfect for analyzing business dashboards, scientific figures, and financial charts:
Extract the key data points from this line chart:
- What is the trend for each series?
- Identify any anomalies or outliers
- What is the approximate value at each labeled point?
3. UI/UX Review and Testing
Claude can review screenshots of your application:
Review this UI screenshot for:
- Visual alignment issues
- Missing or inconsistent elements
- Accessibility concerns (color contrast, font sizes)
- Layout problems at this viewport size
4. Handwriting Recognition
Claude can read handwritten notes and forms:
Transcribe the handwritten text in this image.
Preserve the original formatting and layout where possible.
Note any words you're uncertain about with [brackets].
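If the transcription follows the bracket convention requested in the prompt, the uncertain words can be collected for manual review. This is a small sketch under that assumption; `uncertain_words` is a hypothetical helper, and it will also pick up any brackets that were part of the original handwriting.

```python
import re


def uncertain_words(transcript):
    """Return the bracketed spans from a transcript, in order of appearance."""
    return [match.strip() for match in re.findall(r"\[([^\[\]]+)\]", transcript)]


if __name__ == "__main__":
    sample = "Meet at [noon] near the [old mill?] gate"
    print(uncertain_words(sample))
```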
Image Processing Costs
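Vision requests are billed as input tokens. Each image consumes tokens roughly in proportion to its pixel area (about width x height / 750) until the automatic resize cap is reached:
| Image Size | Approximate Token Cost (Sonnet 4 / Opus 4.6) |
|---|---|
| Small (500 x 500) | ~330 tokens |
| Medium (1,000 x 1,000) | ~1,330 tokens |
| Large (2,000 x 2,000+) | ~1,600 tokens (auto-resized) |
| Max (8,000 x 8,000) | ~1,600 tokens (auto-resized) |
Example calculation, assuming Sonnet 4 input pricing of $3 per million tokens:
- One medium image (~1,330 tokens) + 500 text tokens = ~1,830 input tokens
- Input cost per request: ~$0.0055 (output tokens are billed separately)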
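The arithmetic behind these estimates can be sketched in a few lines. The constants below are assumptions based on Anthropic's published sizing guidance (about 750 pixels per token, a long-edge limit of about 1,568 px, and a ceiling of about 1,600 tokens); actual usage is reported in the API response and may differ. The function names are hypothetical.

```python
import math

# Assumptions from published sizing guidance; treat all three as approximate.
MAX_LONG_EDGE_PX = 1568   # images with a longer edge are scaled down first
PIXELS_PER_TOKEN = 750    # tokens ~= (width * height) / 750
MAX_IMAGE_TOKENS = 1600   # approximate ceiling after resizing


def estimate_image_tokens(width, height):
    """Estimate input tokens for one image of the given pixel dimensions."""
    long_edge = max(width, height)
    if long_edge > MAX_LONG_EDGE_PX:
        # Scale both sides by the same factor to preserve the aspect ratio.
        scale = MAX_LONG_EDGE_PX / long_edge
        width, height = width * scale, height * scale
    tokens = math.ceil(width * height / PIXELS_PER_TOKEN)
    return min(tokens, MAX_IMAGE_TOKENS)


def estimate_input_cost(image_tokens, text_tokens, usd_per_million_input_tokens):
    """Estimate input cost in USD; output tokens are billed separately."""
    return (image_tokens + text_tokens) * usd_per_million_input_tokens / 1_000_000


if __name__ == "__main__":
    tokens = estimate_image_tokens(1000, 1000)
    print(tokens, estimate_input_cost(tokens, 500, 3.0))
```

The price is passed in as a parameter rather than hard-coded, since per-token rates differ by model and change over time.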
Best Practices for Vision Prompts
1. Be Specific About What to Look At
Instead of: "What's in this image?"
Try: "Look at the table in the bottom-right section of this dashboard. What are the top 3 rows by revenue?"
2. Provide Context
Give Claude context about what the image represents:
This is a screenshot of a customer support dashboard taken at 3:00 PM on a Monday.
The data shown is for the past 24 hours.
3. Request Structured Output
For extraction tasks, always specify the output format:
Extract the data from this table and format it as a markdown table.
Then provide a summary of the key insights in bullet points.
4. Use Multiple Images
You can send multiple images in a single request to compare or combine information:
messages=[{
    "role": "user",
    "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": before_image}},
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": after_image}},
        {"type": "text", "text": "Compare these two screenshots and list all the changes you notice."}
    ]
}]
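When several images are sent together, placing a short text label before each one makes it easier for both the prompt and the answer to refer to a specific image. The following sketch builds such a content list; `build_comparison_content` is a hypothetical helper, and the label wording is only an illustration.

```python
def build_comparison_content(images, question):
    """Build a content list with a text label before each image, then the question.

    `images` is a list of (label, media_type, base64_data) tuples.
    """
    content = []
    for label, media_type, data in images:
        content.append({"type": "text", "text": f"{label}:"})
        content.append({
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        })
    content.append({"type": "text", "text": question})
    return content
```

The result can be used as the `content` value of a user message, for example with labels such as "Before" and "After" and a question that refers to the images by those names.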
Vision with Claude Code
Claude Code also supports image input. You can drag and drop an image into your terminal session, or reference the image by its file path in the prompt:
# Claude will read and analyze the image at the given path
claude "Analyze the UI mockup in mockup.png and generate React code for it"
This is particularly useful for:
- Generating code from design mockups
- Debugging UI issues by sharing screenshots
- Converting diagrams into code implementations
Model Comparison for Vision Tasks
| Capability | Opus 4.6 | Sonnet 4 | Haiku |
|---|---|---|---|
| Image understanding | Best | Excellent | Good |
| Text extraction | Excellent | Excellent | Good |
| Chart analysis | Best | Excellent | Fair |
| Handwriting | Excellent | Very Good | Fair |
| Speed | Slowest | Fast | Fastest |
| Cost per image | ~$0.024 | ~$0.0048 | ~$0.0004 |
Common Issues and Troubleshooting
Image Quality Issues
- Blurry text: Ensure minimum 300 DPI for scanned documents
- Small text: Claude works best with text at least 10px tall in screenshots
- Low contrast: High contrast images produce better results
Error Handling
try:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[...]
    )
except anthropic.BadRequestError as e:
    if "image" in str(e).lower():
        print("Check image format, size, or encoding")
    raise
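One plausible cause of an image-related bad request is a declared `media_type` that does not match the file's actual contents, for example a JPEG saved with a `.png` extension. The sketch below guesses the type from the leading bytes of the file using the standard signatures for the four supported formats; `sniff_media_type` is a hypothetical helper, not an SDK function.

```python
def sniff_media_type(data):
    """Guess the image media type from the leading bytes, or return None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WEBP files are RIFF containers with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None
```

Calling this on the raw bytes before base64 encoding, and using its result as the `media_type`, avoids relying on the file extension. A return value of `None` indicates the file is not in one of the supported formats.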
Key Takeaways
- Claude Vision supports PNG, JPEG, WEBP, and GIF formats through base64 encoding or URL references
- Be specific about what you want Claude to analyze in the image — don't rely on vague instructions
- Image processing costs ~1,600 tokens max per image, making it economical for most use cases
- Multiple images can be sent in a single request for comparison tasks
- Best results come from high-contrast, well-lit images with readable text