Mastering Claude’s Extended Thinking: Adaptive Mode, Effort Budgets, and Fast Mode
Learn how to use Claude's Extended Thinking features—Adaptive Thinking, Effort Budgets, and Fast Mode—to control reasoning depth, speed, and cost in your API applications.
This guide explains Claude's Extended Thinking capabilities: Adaptive Thinking for dynamic reasoning depth, Effort Budgets to cap token usage, and Fast Mode for speed-critical tasks. You'll learn when to use each and how to implement them in Python.
Introduction
Claude’s Extended Thinking capabilities give you fine-grained control over how the model reasons through complex problems. Whether you need deep, step-by-step analysis for research or lightning-fast responses for real-time applications, understanding Adaptive Thinking, Effort Budgets, and Fast Mode is essential.
This guide breaks down each feature, explains when to use them, and provides practical code examples to integrate them into your Claude API workflows.
What Is Extended Thinking?
Extended Thinking refers to Claude’s ability to allocate additional computational resources to reasoning tasks. Instead of generating a single output, Claude can “think” through intermediate steps, explore multiple paths, and refine its answers. This is especially valuable for:
- Complex math and logic problems
- Multi-step reasoning tasks
- Code generation and debugging
- Research analysis and summarization
Claude offers three controls over this behavior:
- Adaptive Thinking – Automatically adjusts reasoning depth based on task complexity.
- Effort Budgets (beta) – Lets you set a maximum token limit for thinking.
- Fast Mode (beta, research preview) – Prioritizes speed over depth for simple tasks.
Adaptive Thinking: Let Claude Decide the Depth
Adaptive Thinking is the default mode for Extended Thinking. Claude dynamically determines how much reasoning is needed for each query. If you ask a simple question like “What is 2+2?”, Claude uses minimal thinking. For a complex prompt like “Prove Fermat’s Last Theorem for n=3,” it allocates more tokens to reasoning.
When to Use Adaptive Thinking
- You want the best balance of quality and speed.
- Your use case involves varied query complexity.
- You don’t want to manually tune thinking depth.
How to Enable Adaptive Thinking
In the API, you enable Extended Thinking by setting the `thinking` parameter in your request. Here's an example in Python:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={
        "type": "enabled",
        "budget_tokens": 2048,  # maximum tokens for thinking
    },
    messages=[
        {"role": "user", "content": "Solve the equation: 3x^2 + 5x - 2 = 0"}
    ],
)

# With thinking enabled, thinking blocks precede the answer in `content`,
# so read the final block rather than the first.
print(response.content[-1].text)
```
Note: The `budget_tokens` in Adaptive Thinking is a maximum cap. Claude will use fewer tokens if the task doesn't require full depth.
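Because thinking blocks are returned in `content` alongside the final answer, it helps to separate the two when parsing a response. Below is a minimal sketch that assumes each content block is a dict following the documented `"thinking"`/`"text"` shape; it uses a mocked payload rather than a live API call:

```python
# Sketch: separate thinking blocks from the final answer in a response.
# Assumes each content block is a dict with a "type" of "thinking" or "text".

def split_response_blocks(blocks):
    """Return (thinking_texts, answer_texts) from a list of content blocks."""
    thinking = [b["thinking"] for b in blocks if b["type"] == "thinking"]
    answers = [b["text"] for b in blocks if b["type"] == "text"]
    return thinking, answers

# Example with a mocked response payload:
mock_blocks = [
    {"type": "thinking", "thinking": "Apply the quadratic formula..."},
    {"type": "text", "text": "x = 1/3 or x = -2"},
]
thinking, answers = split_response_blocks(mock_blocks)
print(answers[0])  # the final answer, with reasoning kept separate
```

This keeps downstream code (logging, display, evaluation) agnostic to how many thinking blocks the model emitted.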
Effort Budgets (Beta): Take Control of Thinking Costs
Effort Budgets let you explicitly set the maximum number of tokens Claude can use for thinking. This is useful when you need to:
- Control API costs for high-volume applications.
- Ensure consistent response times.
- Limit reasoning depth for simpler tasks.
When to Use Effort Budgets
- You have strict cost constraints.
- You’re processing many similar queries (e.g., batch classification).
- You want to prevent overthinking on trivial tasks.
How to Set an Effort Budget
Effort Budgets are set via the `budget_tokens` field inside the `thinking` object. The value represents the maximum tokens Claude can use for internal reasoning before generating the final response.
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={
        "type": "enabled",
        "budget_tokens": 512,  # strict limit on thinking tokens
    },
    messages=[
        {"role": "user", "content": "Summarize this article in 3 bullet points: [text]"}
    ],
)
```
Best Practices for Effort Budgets
- Start with a higher budget (e.g., 2048 tokens) and reduce it iteratively.
- Monitor response quality as you lower the budget.
- For simple tasks like classification or extraction, 256–512 tokens is often sufficient.
- For complex reasoning (e.g., code generation), use 2048–4096 tokens.
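The tune-down loop described in the first two bullets can be sketched as a simple search: start high, halve the budget, and stop when quality drops. `run_task` and `meets_quality` are hypothetical callbacks you would supply (in practice, an API call at the given budget and an evaluation against reference outputs); here they are stand-ins:

```python
# Sketch of iterative budget tuning: halve the thinking budget until a
# quality check fails, then keep the last passing budget.

def tune_budget(run_task, meets_quality, start=2048, floor=256):
    """Return the smallest budget (>= floor) that still passes the quality check."""
    budget = start
    while budget > floor:
        candidate = budget // 2
        if not meets_quality(run_task(candidate)):
            break  # quality dropped; keep the last passing budget
        budget = candidate
    return budget

# Toy example: pretend quality holds down to 512 tokens.
best = tune_budget(run_task=lambda b: b, meets_quality=lambda b: b >= 512)
print(best)  # 512
```

In a real pipeline, `meets_quality` might compare outputs against a small labeled eval set; the halving schedule keeps the number of trial runs logarithmic in the budget range.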
Fast Mode (Beta: Research Preview): Speed Over Depth
Fast Mode is designed for scenarios where response speed is critical and deep reasoning is unnecessary. It reduces the thinking budget to a minimum, forcing Claude to generate answers quickly.
When to Use Fast Mode
- Real-time chatbots requiring sub-second responses.
- Simple Q&A (e.g., “What’s the weather in Tokyo?”).
- High-throughput batch processing where latency matters.
How to Enable Fast Mode
Fast Mode is activated by setting `"type": "fast"` in the `thinking` parameter. Note that this is a research preview and may have limited availability.
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    thinking={"type": "fast"},
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(response.content[0].text)
```
Trade-offs of Fast Mode
- Pros: Low latency, reduced token usage, lower cost.
- Cons: Reduced accuracy on complex tasks, no intermediate reasoning visible.
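To quantify the latency side of this trade-off on your own workload, a plain timing wrapper is enough. This is a generic sketch; `send` stands in for whatever function issues the request (e.g. a call to `client.messages.create` with the given `thinking` config), and here a stub replaces the real API call:

```python
import time

# Sketch: measure wall-clock latency of a request under a given thinking
# config, so fast vs. enabled modes can be compared on real prompts.

def timed(send, thinking_config):
    """Run `send(thinking_config)` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = send(thinking_config)
    return result, time.perf_counter() - start

# With a stub in place of a real API call:
_, fast_s = timed(lambda cfg: cfg, {"type": "fast"})
_, deep_s = timed(lambda cfg: cfg, {"type": "enabled", "budget_tokens": 2048})
```

Averaging `timed` over a representative sample of prompts gives a far better picture than a single request, since latency varies with prompt length and server load.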
Comparing the Three Modes
| Feature | Adaptive Thinking | Effort Budgets | Fast Mode |
|---|---|---|---|
| Control | Automatic | Manual (token cap) | Minimal |
| Best for | Mixed workloads | Cost-sensitive apps | Real-time apps |
| Reasoning depth | Dynamic | Fixed maximum | Shallow |
| Latency | Moderate | Predictable | Low |
| API parameter | `"type": "enabled"` | `"budget_tokens": N` | `"type": "fast"` |
Practical Example: Choosing the Right Mode
Let’s say you’re building a customer support bot. You might use:
- Fast Mode for greeting and simple FAQs.
- Effort Budgets (512 tokens) for order status lookups.
- Adaptive Thinking for complex refund or technical issues.
```python
def get_thinking_config(query_type):
    if query_type == "simple":
        return {"type": "fast"}
    elif query_type == "moderate":
        return {"type": "enabled", "budget_tokens": 512}
    else:
        return {"type": "enabled", "budget_tokens": 2048}

# Usage
config = get_thinking_config("complex")
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking=config,
    messages=[{"role": "user", "content": user_query}],
)
```
Key Takeaways
- Adaptive Thinking is the default and best for general use—it automatically balances depth and speed.
- Effort Budgets give you precise control over thinking token usage, ideal for cost management and predictable latency.
- Fast Mode sacrifices reasoning depth for speed, perfect for simple, real-time interactions.
- Always set `budget_tokens` to a reasonable maximum (e.g., 2048) even in Adaptive mode to prevent runaway costs.
- Test different modes with your specific use case to find the optimal balance of quality, speed, and cost.