Mastering Claude’s Extended Thinking: A Practical Guide to Adaptive Thinking, Effort Budgets, and Fast Mode
Learn how to use Claude’s Extended Thinking features—Adaptive Thinking, Effort Budgets, and Fast Mode—to control reasoning depth, cost, and speed in your AI applications.
This guide explains Claude’s Extended Thinking capabilities, including Adaptive Thinking (auto-adjusts reasoning depth), Effort Budgets (set max thinking tokens), and Fast Mode (speed over depth). You’ll learn when to use each and see practical API code examples.
Claude’s Extended Thinking capabilities give you fine-grained control over how the model reasons through complex problems. Whether you need deep, step-by-step analysis for research or lightning-fast responses for real-time chat, understanding these features is essential for building efficient, cost-effective AI applications.
In this guide, you’ll learn the practical differences between Adaptive Thinking, Effort Budgets, and Fast Mode, and see exactly how to implement them using the Claude API.
---
What Is Extended Thinking?
Extended Thinking refers to Claude’s ability to allocate additional computational resources—specifically, more “thinking tokens”—to reason through a problem before generating a final answer. This is distinct from the normal generation process: Claude can now spend extra tokens to plan, verify, and refine its reasoning internally.
The feature is especially valuable for:
- Complex math and logic problems
- Multi-step reasoning tasks
- Code generation and debugging
- Research analysis and summarization
- Any task where accuracy matters more than speed
| Mode | Behavior | Best For |
|---|---|---|
| Adaptive Thinking | Claude automatically decides how many thinking tokens to use based on the complexity of the input | General-purpose use, mixed workloads |
| Effort Budgets | You set a maximum number of thinking tokens (a “budget”) | Cost-sensitive applications, predictable latency |
| Fast Mode | Minimal thinking tokens; prioritizes speed over depth | Real-time chat, simple Q&A, high-throughput systems |
---
Adaptive Thinking (Default)
Adaptive Thinking is Claude’s default behavior when Extended Thinking is enabled. The model analyzes the input and dynamically allocates thinking tokens—spending more on complex queries and less on simple ones.
How It Works
When you send a request with extended_thinking enabled and no explicit budget, Claude uses its internal heuristics to determine the appropriate depth. For example:
- A simple question like “What is the capital of France?” might use 50 thinking tokens.
- A complex prompt like “Prove Fermat’s Last Theorem” might use 2,000 thinking tokens.
When to Use Adaptive Thinking
- Mixed workloads: Your application handles both simple and complex queries.
- No strict latency requirements: You’re okay with variable response times.
- You want maximum accuracy: Claude will spend as many tokens as it deems necessary.
API Example (Python)
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    extended_thinking=True,  # Enables Adaptive Thinking
    messages=[
        {"role": "user", "content": "Explain the difference between a monad and a functor in functional programming."}
    ],
)

print(response.content[0].text)
```
Note: When `extended_thinking` is `True` without a budget, Adaptive Thinking is active. The thinking tokens are included in your `max_tokens` count.
---
Effort Budgets (Beta)
Effort Budgets give you direct control over how many thinking tokens Claude can use. You set a maximum budget, and Claude will not exceed that limit—even if it means producing a less thorough answer.
How It Works
You specify a thinking_budget parameter (in tokens) alongside extended_thinking. Claude will reason up to that budget, then generate the final response. If the budget is too low for the task, Claude may produce a partial or less accurate answer.
When to Use Effort Budgets
- Cost control: You want predictable API costs per request.
- Latency-sensitive apps: You need consistent response times.
- Batch processing: You’re processing many requests and need to cap resource usage.
API Example (Python)
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    extended_thinking=True,
    thinking_budget=500,  # Max 500 thinking tokens
    messages=[
        {"role": "user", "content": "Write a Python function to solve the traveling salesman problem using dynamic programming."}
    ],
)

print(response.content[0].text)
```
Best Practices for Setting Budgets
- Start with a high budget (e.g., 2000 tokens) and monitor response quality.
- Gradually reduce the budget until you find the sweet spot between accuracy and cost.
- For simple tasks, budgets as low as 100–300 tokens may suffice.
- For complex reasoning tasks, budgets of 1000–4000 tokens are common.
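The tuning loop described above can be sketched generically. The helper below is hypothetical and not part of any SDK: you supply a callable that runs a request at a given budget (for example, a wrapper around `client.messages.create` plus your own quality check) and returns whether the answer was acceptable; the helper halves the budget until quality drops and reports the smallest budget that still passed.

```python
def tune_thinking_budget(run_at_budget, start=2000, floor=100, step=0.5):
    """Find the smallest thinking budget that still yields acceptable answers.

    run_at_budget(budget) -> (answer, acceptable) is supplied by the caller.
    Shrinks the budget by `step` each round; returns the last budget that
    passed, or None if even the starting budget failed.
    """
    budget = start
    best = None
    while budget >= floor:
        _, acceptable = run_at_budget(budget)
        if not acceptable:
            break
        best = budget
        budget = int(budget * step)
    return best
```

In practice, "acceptable" might mean an exact-match check against a small evaluation set, or a human spot-check on sampled responses.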
---
Fast Mode (Beta: Research Preview)
Fast Mode is the opposite of Extended Thinking: it minimizes thinking tokens to prioritize speed. This mode is ideal when you need quick, straightforward answers and can tolerate shallower reasoning.
How It Works
When Fast Mode is enabled, Claude skips most internal reasoning and jumps directly to generating the response. The model still performs basic comprehension but does not engage in deep analysis or multi-step verification.
When to Use Fast Mode
- Real-time chat: Customer support bots, conversational agents.
- High-throughput systems: Processing thousands of simple requests per minute.
- Simple Q&A: Factual lookups, definitions, translations.
- Prototyping: When you’re iterating quickly and don’t need perfect answers.
API Example (Python)
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    fast_mode=True,  # Enables Fast Mode
    messages=[
        {"role": "user", "content": "What is the weather in Tokyo today?"}
    ],
)

print(response.content[0].text)
```
Important: Fast Mode is still in research preview. It may not be available on all models or regions. Check the latest documentation for availability.
---
Choosing the Right Mode
Here’s a quick decision guide to help you pick:
- Is speed your top priority? → Use Fast Mode.
- Do you need maximum accuracy? → Use Adaptive Thinking (default).
- Do you need predictable cost/latency? → Use Effort Budgets.
- Are you handling mixed workloads? → Start with Adaptive Thinking, then add budgets for specific requests.
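The decision logic above can be encoded as a small helper that returns request parameters. This is a hypothetical sketch: the parameter names (`fast_mode`, `extended_thinking`, `thinking_budget`) follow this article's examples, and the 500-token budget is an illustrative default; verify both against the current API documentation.

```python
def pick_request_params(speed_critical: bool, needs_predictability: bool) -> dict:
    """Map the decision guide to keyword arguments for a messages request.

    Parameter names follow this article's examples and should be checked
    against the shipped SDK before use.
    """
    if speed_critical:
        return {"fast_mode": True}
    if needs_predictability:
        # 500 is an illustrative budget; tune it per workload
        return {"extended_thinking": True, "thinking_budget": 500}
    # Default: Adaptive Thinking for maximum accuracy
    return {"extended_thinking": True}
```

You could then splat the result into a request, e.g. `client.messages.create(model=..., max_tokens=..., **pick_request_params(True, False), messages=...)`.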
Practical Example: Building a Tiered System
You can combine these modes in a single application. For instance, a customer support bot might use:
- Fast Mode for greeting and simple FAQs.
- Effort Budgets for account-specific queries (e.g., “What’s my last bill?”).
- Adaptive Thinking for complex troubleshooting (e.g., “Why is my internet not working?”).
```python
import anthropic

client = anthropic.Anthropic()

def get_response(query, complexity):
    if complexity == "low":
        # Greetings and simple FAQs: prioritize speed
        return client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            fast_mode=True,
            messages=[{"role": "user", "content": query}],
        )
    elif complexity == "medium":
        # Account-specific queries: cap thinking for predictable cost
        return client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            extended_thinking=True,
            thinking_budget=500,
            messages=[{"role": "user", "content": query}],
        )
    else:  # high complexity
        # Complex troubleshooting: let Adaptive Thinking decide
        return client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            extended_thinking=True,
            messages=[{"role": "user", "content": query}],
        )
```
---
Monitoring Thinking Token Usage
To see how many thinking tokens Claude actually used, inspect the response object:
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    extended_thinking=True,
    thinking_budget=1000,
    messages=[{"role": "user", "content": "Solve this equation: 3x + 7 = 22"}],
)

# Check usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Thinking tokens: {response.usage.thinking_tokens}")  # Only available with Extended Thinking
```
This data helps you fine-tune your budgets and understand your application’s cost profile.
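To turn those usage numbers into a per-request cost estimate, you can fold thinking tokens into the output side. The helper below is a sketch: the prices are caller-supplied placeholders, not real rates, and the assumption that thinking tokens are billed at the output rate should be verified against the current pricing documentation.

```python
def estimate_request_cost(input_tokens, output_tokens, thinking_tokens,
                          input_price_per_mtok, output_price_per_mtok):
    """Estimate one request's cost in dollars.

    Assumes thinking tokens are billed at the output rate (verify against
    the pricing docs). Prices are expressed per million tokens.
    """
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * input_price_per_mtok
            + billed_output * output_price_per_mtok) / 1_000_000
```

Feeding it the `response.usage` fields from the snippet above (e.g. `estimate_request_cost(response.usage.input_tokens, ...)`) gives a quick per-request cost figure for capacity planning.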
---
Limitations and Caveats
- Fast Mode is a research preview—expect changes and potential instability.
- Effort Budgets are in beta; the exact token count may vary slightly from your budget.
- Extended Thinking counts toward your `max_tokens` limit. If you set `max_tokens=4096` and use 2,000 thinking tokens, only 2,096 tokens remain for the final response.
- Not all Claude models support Extended Thinking. Check the model’s capabilities in the documentation.
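The token arithmetic in that first caveat is worth making explicit when sizing requests. A minimal helper (hypothetical, for budgeting only) under the stated rule that thinking tokens come out of the same `max_tokens` pool:

```python
def remaining_response_tokens(max_tokens, thinking_tokens_used):
    """Tokens left for the final response after thinking, assuming both
    draw from the same max_tokens pool."""
    return max_tokens - thinking_tokens_used
```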
Key Takeaways
- Adaptive Thinking is Claude’s default mode—it automatically adjusts reasoning depth based on input complexity, making it ideal for general-purpose use.
- Effort Budgets let you cap thinking tokens for predictable cost and latency, perfect for production systems with strict SLAs.
- Fast Mode sacrifices depth for speed, best suited for real-time chat and high-throughput applications.
- Use the `response.usage.thinking_tokens` field to monitor actual token consumption and optimize your budgets.
- Combine modes in a single application to balance cost, speed, and accuracy across different query types.