BeClaude Guide · 2026-04-27

Mastering Claude’s Extended Thinking: Adaptive Mode, Effort Budgets, and Fast Mode

Learn how to use Claude’s Extended Thinking features—Adaptive Thinking, Effort Budgets, and Fast Mode—to control reasoning depth, cost, and speed in your API applications.

Quick Answer

This guide explains how to configure Claude’s Extended Thinking modes—Adaptive Thinking, Effort Budgets (beta), and Fast Mode (research preview)—to balance reasoning depth, token cost, and response speed in your API calls.

Tags: Extended Thinking, Adaptive Thinking, Effort Budgets, Fast Mode, Claude API


Claude’s Extended Thinking capability allows the model to perform deeper reasoning before generating a response. This is especially valuable for complex tasks like mathematical proofs, multi-step planning, code generation, and nuanced analysis. However, deeper thinking consumes more tokens and increases latency.

To give you fine-grained control, Anthropic has introduced three complementary features:

  • Adaptive Thinking – Automatically adjusts thinking depth based on task complexity.
  • Effort Budgets (beta) – Lets you set a maximum token limit for thinking.
  • Fast Mode (beta: research preview) – Reduces thinking overhead for simpler tasks.

In this guide, you’ll learn how each mode works, when to use them, and how to implement them in your Claude API calls with practical Python and TypeScript examples.

---

Understanding Extended Thinking

Extended Thinking is not a separate model—it’s a parameter you enable in the Messages API. When activated, Claude spends additional tokens on internal reasoning before producing the final output. The thinking process is invisible to the user but improves the quality of complex responses.

Key Concepts

  • Thinking tokens – Tokens used for internal reasoning. They are billed at the same rate as output tokens.
  • Thinking budget – The maximum number of tokens Claude can use for thinking. This is separate from the output token limit.
  • Adaptive mode – Lets Claude decide how many thinking tokens to use based on the prompt’s complexity.
  • Effort Budgets – You explicitly cap the thinking tokens, giving you predictable cost and latency.
  • Fast Mode – Minimizes thinking for straightforward tasks, reducing latency.
---

Adaptive Thinking: Let Claude Decide

Adaptive Thinking is the default behavior when you enable Extended Thinking without specifying a budget. Claude analyzes the prompt and allocates thinking tokens dynamically. This is ideal when you don’t know the exact complexity of the user’s request in advance.

When to Use Adaptive Thinking

  • Open-ended Q&A where complexity varies
  • Multi-turn conversations where later turns may require deeper reasoning
  • Prototyping and experimentation

Python Example

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled"},  # Adaptive mode
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ]
)

print(response.content[0].text)

TypeScript Example

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  thinking: { type: 'enabled' },
  messages: [
    { role: 'user', content: 'Prove that the square root of 2 is irrational.' }
  ]
});

console.log(response.content[0].text);

Note: When thinking is enabled, max_tokens must be at least 1,024 tokens greater than the thinking budget. In adaptive mode, Claude sets the budget automatically, but you still need a sufficiently high max_tokens.
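If you later set an explicit budget (covered in the next section), it helps to validate this headroom before sending the request. Below is a minimal sketch of such a guard; the budget_tokens field is the one described in this guide, and the helper name build_thinking_request is purely illustrative.

def build_thinking_request(prompt: str, max_tokens: int, budget_tokens: int | None = None) -> dict:
    """Assemble Messages API keyword arguments, enforcing the headroom rule from the note above."""
    thinking = {"type": "enabled"}
    if budget_tokens is not None:
        # Per the note above, max_tokens must exceed the thinking budget by at least 1,024 tokens.
        if max_tokens < budget_tokens + 1024:
            raise ValueError("max_tokens must be at least 1,024 tokens greater than budget_tokens")
        thinking["budget_tokens"] = budget_tokens
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": max_tokens,
        "thinking": thinking,
        "messages": [{"role": "user", "content": prompt}],
    }

# Usage: pass the assembled kwargs straight to the Messages API
response = client.messages.create(**build_thinking_request("Summarize quicksort.", max_tokens=4096))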

---

Effort Budgets (Beta): Predictable Cost and Latency

Effort Budgets let you set an explicit maximum number of tokens Claude can use for thinking. This gives you predictable billing and response times. The budget must be less than max_tokens by at least 1,024 tokens.

When to Use Effort Budgets

  • Cost-sensitive applications
  • Real-time systems with strict latency SLAs
  • Tasks with known complexity (e.g., code review, document summarization)

Python Example

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,
    thinking={
        "type": "enabled",
        "budget_tokens": 4096  # Cap thinking at 4096 tokens
    },
    messages=[
        {"role": "user", "content": "Write a detailed business plan for a SaaS startup."}
    ]
)

TypeScript Example

const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 8192,
  thinking: {
    type: 'enabled',
    budget_tokens: 4096
  },
  messages: [
    { role: 'user', content: 'Write a detailed business plan for a SaaS startup.' }
  ]
});

Best Practices for Effort Budgets

  • Start with a budget of 50% of your max_tokens and adjust based on results (see the calibration sketch after this list).
  • For simple tasks (e.g., translation, formatting), use a small budget (512–1024 tokens).
  • For complex reasoning (e.g., legal analysis, multi-step math), use 4096–8192 tokens.
  • Monitor the usage.thinking_tokens field in the response to see actual consumption.
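
Putting the first and last practices together, you can start from a 50% budget and tighten it based on the consumption the API reports back. This is a minimal calibration sketch, not an official recipe: the budget_tokens parameter and usage.thinking_tokens field are the ones described in this guide, while the 60% threshold, 1.2× margin, and 512-token floor are illustrative choices.

def calibrate_budget(client, prompt: str, max_tokens: int = 8192, rounds: int = 3) -> int:
    """Start the thinking budget at ~50% of max_tokens, then shrink it toward observed usage."""
    budget = max_tokens // 2
    for _ in range(rounds):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=max_tokens,
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
        used = response.usage.thinking_tokens  # actual thinking consumption for this run
        if used < budget * 0.6:
            # Plenty of unused headroom: lower the cap, keeping a 20% margin and a small floor.
            budget = max(512, int(used * 1.2))
        else:
            break  # the budget is close to what the task actually needs
    return budget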
---

Fast Mode (Beta: Research Preview): Speed Over Depth

Fast Mode is designed for tasks that don’t require deep reasoning. It reduces the thinking overhead, resulting in faster responses and lower token usage. This is a research preview feature, so behavior may change.

When to Use Fast Mode

  • Simple Q&A (e.g., “What’s the capital of France?”)
  • Text formatting or extraction
  • High-throughput applications where latency is critical
  • Tasks where you’ve verified that deep thinking doesn’t improve quality

Python Example

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    thinking={
        "type": "enabled",
        "fast_mode": True  # Enable fast mode
    },
    messages=[
        {"role": "user", "content": "Summarize this article in three bullet points."}
    ]
)

TypeScript Example

const response = await client.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 2048,
  thinking: {
    type: 'enabled',
    fast_mode: true
  },
  messages: [
    { role: 'user', content: 'Summarize this article in three bullet points.' }
  ]
});
Important: Fast Mode is experimental. Always test your use case to ensure output quality remains acceptable.
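
One practical way to run that test is to send the same prompt with and without fast_mode and compare latency, thinking-token usage, and the answers themselves. A minimal sketch, assuming the fast_mode flag from the research preview described above:

import time

def compare_fast_mode(client, prompt: str, max_tokens: int = 2048) -> dict:
    """Run the same prompt with and without fast_mode and report latency, usage, and the answer."""
    results = {}
    for label, thinking in [
        ("fast", {"type": "enabled", "fast_mode": True}),
        ("default", {"type": "enabled"}),
    ]:
        start = time.perf_counter()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=max_tokens,
            thinking=thinking,
            messages=[{"role": "user", "content": prompt}],
        )
        results[label] = {
            "latency_s": round(time.perf_counter() - start, 2),
            "thinking_tokens": response.usage.thinking_tokens,
            "text": response.content[0].text,
        }
    return results  # review both answers manually before enabling Fast Mode in production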

---

Combining Features: A Practical Strategy

You can combine these features for optimal results. Here’s a decision framework:

| Task Type | Recommended Configuration |
|---|---|
| Complex reasoning (math, code, analysis) | Adaptive Thinking or Effort Budget (4096+) |
| Simple Q&A, extraction | Fast Mode |
| Mixed workloads | Adaptive Thinking (default) |
| Cost-sensitive production | Effort Budget with conservative cap |
| Real-time chat | Fast Mode + small Effort Budget |

Example: Hybrid Approach

def get_response(user_query: str, complexity: str):
    if complexity == "simple":
        thinking_config = {"type": "enabled", "fast_mode": True}
        max_tokens = 1024
    elif complexity == "complex":
        thinking_config = {"type": "enabled", "budget_tokens": 4096}
        max_tokens = 8192
    else:
        thinking_config = {"type": "enabled"}  # Adaptive
        max_tokens = 4096

    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=max_tokens,
        thinking=thinking_config,
        messages=[{"role": "user", "content": user_query}]
    )
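
Calling the helper then looks like the snippet below; classify_complexity is a hypothetical placeholder for whatever heuristic or upstream classifier your application uses to label the query.

# classify_complexity is a hypothetical routing step, e.g. a keyword heuristic or a small classifier
complexity = classify_complexity("What's the capital of France?")  # returns "simple", "complex", or something else
response = get_response("What's the capital of France?", complexity)
print(response.content[0].text)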

---

Monitoring Thinking Token Usage

Always inspect the response object to understand how many tokens were used for thinking:

response = client.messages.create(...)
print(f"Thinking tokens: {response.usage.thinking_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

This data helps you calibrate your Effort Budgets and decide when Fast Mode is appropriate.
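
Because thinking tokens are billed at the same rate as output tokens (see Key Concepts and Limitations), the same usage fields also let you estimate per-request spend. A minimal sketch; the price constant is an illustrative placeholder, so substitute the current output rate for your model from Anthropic's pricing page.

OUTPUT_PRICE_PER_MTOK = 15.00  # illustrative placeholder: USD per million output tokens, check current pricing

def estimate_output_cost(response) -> float:
    """Estimate output-side cost, counting thinking tokens at the output-token rate."""
    billable = response.usage.thinking_tokens + response.usage.output_tokens
    return billable / 1_000_000 * OUTPUT_PRICE_PER_MTOK

print(f"Estimated output-side cost: ${estimate_output_cost(response):.4f}")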

---

Limitations and Considerations

  • Thinking is not streamable – When using Extended Thinking, streaming is disabled. The entire thinking process completes before any output is returned.
  • Fast Mode is a research preview – It may not be available in all regions or models. Check the latest changelog for updates.
  • Effort Budgets are a cap, not a target – Claude may use fewer tokens than the budget if the task doesn’t require it.
  • Token counting – Thinking tokens count toward your output token usage for billing purposes.
---

Key Takeaways

  • Adaptive Thinking is the easiest way to start—Claude automatically allocates thinking tokens based on task complexity.
  • Effort Budgets give you predictable cost and latency by capping thinking tokens; use them in production for cost-sensitive workloads.
  • Fast Mode reduces latency for simple tasks but is experimental—always validate output quality.
  • Combine features strategically: use Fast Mode for simple queries, Effort Budgets for complex ones, and Adaptive Thinking for mixed workloads.
  • Monitor usage.thinking_tokens to fine-tune your configuration and avoid overspending.