Mastering Claude’s Extended Thinking: Adaptive Mode, Effort Budgets, and Fast Mode
This guide explains how to configure Claude’s Extended Thinking modes—Adaptive Thinking, Effort Budgets (beta), and Fast Mode (research preview)—to balance reasoning depth, token cost, and response speed in your API calls.
Claude’s Extended Thinking capability allows the model to perform deeper reasoning before generating a response. This is especially valuable for complex tasks like mathematical proofs, multi-step planning, code generation, and nuanced analysis. However, deeper thinking consumes more tokens and increases latency.
To give you fine-grained control, Anthropic has introduced three complementary features:
- Adaptive Thinking – Automatically adjusts thinking depth based on task complexity.
- Effort Budgets (beta) – Lets you set a maximum token limit for thinking.
- Fast Mode (research preview) – Reduces thinking overhead for simpler tasks.
---
Understanding Extended Thinking
Extended Thinking is not a separate model—it’s a parameter you enable in the Messages API. When activated, Claude spends additional tokens on internal reasoning before producing the final output. The thinking process is invisible to the user but improves the quality of complex responses.
Key Concepts
- Thinking tokens – Tokens used for internal reasoning. They are billed at the same rate as output tokens.
- Thinking budget – The maximum number of tokens Claude can use for thinking. This is separate from the output token limit.
- Adaptive mode – Lets Claude decide how many thinking tokens to use based on the prompt’s complexity.
- Effort Budgets – You explicitly cap the thinking tokens, giving you predictable cost and latency.
- Fast Mode – Minimizes thinking for straightforward tasks, reducing latency.
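Because thinking tokens are billed at the same rate as output tokens, a rough cost estimate should combine both counts. A minimal sketch of that arithmetic (the helper name and the per-million-token price are placeholders, not actual Anthropic rates):

```python
def estimate_output_cost(thinking_tokens: int, output_tokens: int,
                         price_per_mtok: float = 15.0) -> float:
    """Estimate output-side cost in dollars for one response.

    Thinking tokens are billed at the output rate, so both counts
    contribute to the billable total. price_per_mtok is a placeholder
    price per million tokens; substitute your model's actual rate.
    """
    billable = thinking_tokens + output_tokens
    return billable / 1_000_000 * price_per_mtok
```

Plugging in a response's `usage` numbers gives a quick per-call cost check before you commit to a budget.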
Adaptive Thinking: Let Claude Decide
Adaptive Thinking is the default behavior when you enable Extended Thinking without specifying a budget. Claude analyzes the prompt and allocates thinking tokens dynamically. This is ideal when you don’t know the exact complexity of the user’s request in advance.
When to Use Adaptive Thinking
- Open-ended Q&A where complexity varies
- Multi-turn conversations where later turns may require deeper reasoning
- Prototyping and experimentation
Python Example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled"},  # Adaptive mode
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ]
)

print(response.content[0].text)
TypeScript Example
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  thinking: { type: 'enabled' },
  messages: [
    { role: 'user', content: 'Prove that the square root of 2 is irrational.' }
  ]
});

console.log(response.content[0].text);
Note: When thinking is enabled, max_tokens must be at least 1,024 tokens greater than the thinking budget. In adaptive mode, Claude sets the budget automatically, but you still need a sufficiently high max_tokens.
---
Effort Budgets (Beta): Predictable Cost and Latency
Effort Budgets let you set an explicit maximum number of tokens Claude can use for thinking. This gives you predictable billing and response times. The budget must be less than max_tokens by at least 1,024 tokens.
When to Use Effort Budgets
- Cost-sensitive applications
- Real-time systems with strict latency SLAs
- Tasks with known complexity (e.g., code review, document summarization)
Python Example
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,
    thinking={
        "type": "enabled",
        "budget_tokens": 4096  # Cap thinking at 4096 tokens
    },
    messages=[
        {"role": "user", "content": "Write a detailed business plan for a SaaS startup."}
    ]
)
TypeScript Example
const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 8192,
  thinking: {
    type: 'enabled',
    budget_tokens: 4096
  },
  messages: [
    { role: 'user', content: 'Write a detailed business plan for a SaaS startup.' }
  ]
});
Best Practices for Effort Budgets
- Start with a budget of 50% of your max_tokens and adjust based on results.
- For simple tasks (e.g., translation, formatting), use a small budget (512–1024 tokens).
- For complex reasoning (e.g., legal analysis, multi-step math), use 4096–8192 tokens.
- Monitor the usage.thinking_tokens field in the response to see actual consumption.
Fast Mode (Beta: Research Preview): Speed Over Depth
Fast Mode is designed for tasks that don’t require deep reasoning. It reduces the thinking overhead, resulting in faster responses and lower token usage. This is a research preview feature, so behavior may change.
When to Use Fast Mode
- Simple Q&A (e.g., “What’s the capital of France?”)
- Text formatting or extraction
- High-throughput applications where latency is critical
- Tasks where you’ve verified that deep thinking doesn’t improve quality
Python Example
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    thinking={
        "type": "enabled",
        "fast_mode": True  # Enable fast mode
    },
    messages=[
        {"role": "user", "content": "Summarize this article in three bullet points."}
    ]
)
TypeScript Example
const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 2048,
  thinking: {
    type: 'enabled',
    fast_mode: true
  },
  messages: [
    { role: 'user', content: 'Summarize this article in three bullet points.' }
  ]
});
Important: Fast Mode is experimental. Always test your use case to ensure output quality remains acceptable.
---
Combining Features: A Practical Strategy
You can combine these features for optimal results. Here’s a decision framework:
| Task Type | Recommended Configuration |
|---|---|
| Complex reasoning (math, code, analysis) | Adaptive Thinking or Effort Budget (4096+) |
| Simple Q&A, extraction | Fast Mode |
| Mixed workloads | Adaptive Thinking (default) |
| Cost-sensitive production | Effort Budget with conservative cap |
| Real-time chat | Fast Mode + small Effort Budget |
Example: Hybrid Approach
def get_response(user_query: str, complexity: str):
    if complexity == "simple":
        thinking_config = {"type": "enabled", "fast_mode": True}
        max_tokens = 1024
    elif complexity == "complex":
        thinking_config = {"type": "enabled", "budget_tokens": 4096}
        max_tokens = 8192
    else:
        thinking_config = {"type": "enabled"}  # Adaptive
        max_tokens = 4096

    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=max_tokens,
        thinking=thinking_config,
        messages=[{"role": "user", "content": user_query}]
    )
---
Monitoring Thinking Token Usage
Always inspect the response object to understand how many tokens were used for thinking:
response = client.messages.create(...)
print(f"Thinking tokens: {response.usage.thinking_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
This data helps you calibrate your Effort Budgets and decide when Fast Mode is appropriate.
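One way to act on this data is to aggregate thinking-token consumption per task type, which reveals where Fast Mode or a tighter budget would be safe. A minimal in-memory sketch (the class, category labels, and 512-token threshold are all illustrative, not API-defined):

```python
from collections import defaultdict
from statistics import mean


class ThinkingUsageTracker:
    """Record thinking-token usage per task type to guide configuration."""

    def __init__(self) -> None:
        self._samples: dict[str, list[int]] = defaultdict(list)

    def record(self, task_type: str, thinking_tokens: int) -> None:
        self._samples[task_type].append(thinking_tokens)

    def average(self, task_type: str) -> float:
        samples = self._samples[task_type]
        return mean(samples) if samples else 0.0

    def suggest_fast_mode(self, task_type: str, threshold: int = 512) -> bool:
        """Suggest Fast Mode when average consumption stays low.

        The threshold is an arbitrary cutoff; calibrate it against
        your own quality checks.
        """
        samples = self._samples[task_type]
        return bool(samples) and mean(samples) < threshold
```

After each call, `record(task_type, response.usage.thinking_tokens)` builds the history; once a category's average stays well below the threshold, it is a candidate for Fast Mode.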
---
Limitations and Considerations
- Thinking is not streamable – When using Extended Thinking, streaming is disabled. The entire thinking process completes before any output is returned.
- Fast Mode is a research preview – It may not be available in all regions or models. Check the latest changelog for updates.
- Effort Budgets are a cap, not a target – Claude may use fewer tokens than the budget if the task doesn’t require it.
- Token counting – Thinking tokens count toward your output token usage for billing purposes.
Key Takeaways
- Adaptive Thinking is the easiest way to start—Claude automatically allocates thinking tokens based on task complexity.
- Effort Budgets give you predictable cost and latency by capping thinking tokens; use them in production for cost-sensitive workloads.
- Fast Mode reduces latency for simple tasks but is experimental—always validate output quality.
- Combine features strategically: use Fast Mode for simple queries, Effort Budgets for complex ones, and Adaptive Thinking for mixed workloads.
- Monitor usage.thinking_tokens to fine-tune your configuration and avoid overspending.