1. Deep Reasoning as Default
Starting with Gemini 2.5, all models spend a thinking budget natively before producing a response. Unlike earlier models that responded immediately, modern Gemini models reason internally — analyzing the prompt, considering approaches, and planning their output before generating the final answer.
gemini-3.5-flash or gemini-3.1-pro includes internal reasoning. You can only reduce or disable it, not “turn it on.”
1.1 Default Reasoning Behavior
When you make a standard API call without specifying any thinking configuration, the model dynamically allocates a thinking budget based on prompt complexity:
from google import genai
client = genai.Client()
# Default call — model reasons internally before responding
response = client.models.generate_content(
model="gemini-3.5-flash",
contents="What are the implications of Gödel's incompleteness theorems for AI?"
)
print(response.text)
The model internally decomposes the question, considers multiple angles (mathematical logic, computability theory, philosophical implications), and synthesizes a coherent response — all before producing any output tokens.
1.2 Inspecting Thinking Tokens
Every response includes usage_metadata that reveals how many tokens the model spent on reasoning:
from google import genai
client = genai.Client()
response = client.models.generate_content(
model="gemini-3.5-flash",
contents="Explain the P vs NP problem and its practical significance."
)
# Inspect token usage breakdown
metadata = response.usage_metadata
print(f"Input tokens: {metadata.prompt_token_count}")
print(f"Thinking tokens: {metadata.thoughts_token_count}")
print(f"Output tokens: {metadata.candidates_token_count}")
print(f"Total tokens: {metadata.total_token_count}")
thoughts_token_count field shows how many tokens were consumed by internal reasoning. These tokens are never shown to the user but are billed at the output token rate. A complex math proof might use 2,000–8,000 thinking tokens; a simple factual lookup might use fewer than 100.
2. Controlling the Thinking Budget
While thinking is on by default, you have fine-grained control over how much reasoning the model performs via the ThinkingConfig parameter.
2.1 Budget Values & Effects
| Budget Value | Behavior | Use Case |
|---|---|---|
0 | Thinking disabled entirely | Simple lookups, translations, formatting |
-1 | Dynamic (model decides) | Default behavior — optimal for most tasks |
1024 | Light reasoning | Summarization, Q&A with clear answers |
4096 | Moderate reasoning | Multi-step analysis, code generation |
8192 | Deep reasoning | Complex math, proofs, research synthesis |
24576 | Maximum reasoning | PhD-level problems, novel algorithm design |
from google import genai
from google.genai import types
client = genai.Client()
# Disable thinking — fastest, cheapest, least accurate for complex tasks
response_none = client.models.generate_content(
model="gemini-3.5-flash",
contents="What is the capital of France?",
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=0)
)
)
print(f"No thinking: {response_none.text}")
print(f"Thinking tokens: {response_none.usage_metadata.thoughts_token_count}")
# Dynamic — let the model decide (equivalent to no config)
response_dynamic = client.models.generate_content(
model="gemini-3.5-flash",
contents="Compare merge sort and quicksort with Big-O analysis.",
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=-1)
)
)
print(f"\nDynamic thinking tokens: {response_dynamic.usage_metadata.thoughts_token_count}")
# High budget — force deep reasoning
response_deep = client.models.generate_content(
model="gemini-3.5-flash",
contents="Prove that the halting problem is undecidable using diagonalization.",
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=8192)
)
)
print(f"\nDeep thinking tokens: {response_deep.usage_metadata.thoughts_token_count}")
print(f"Answer preview: {response_deep.text[:300]}...")
2.2 When to Use Each Level
- Budget 0: Chatbots with simple greetings, data formatting, language translation
- Budget 1024: Content summarization, straightforward Q&A, text classification
- Budget 4096: Code generation, multi-step word problems, document analysis
- Budget 8192+: Mathematical proofs, research paper synthesis, complex debugging
- Budget -1: When you trust the model to allocate appropriately (production default)
from google import genai
from google.genai import types
client = genai.Client()
def generate_with_budget(prompt: str, budget: int) -> dict:
"""Helper to compare thinking budget effects."""
config = types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=budget)
)
response = client.models.generate_content(
model="gemini-3.5-flash",
contents=prompt,
config=config
)
return {
"budget": budget,
"thinking_tokens": response.usage_metadata.thoughts_token_count,
"output_tokens": response.usage_metadata.candidates_token_count,
"answer_length": len(response.text)
}
# Compare budgets on the same prompt
prompt = "What are three sorting algorithms and their time complexities?"
for budget in [0, 1024, 4096, 8192]:
result = generate_with_budget(prompt, budget)
print(f"Budget {result['budget']:>5}: "
f"thinking={result['thinking_tokens']:>4}, "
f"output={result['output_tokens']:>4}, "
f"chars={result['answer_length']:>4}")
Automated Competitive Intelligence
A consulting firm built a Gemini agent that monitors competitors daily: it searches news, analyzes financial filings, tracks product launches, and generates weekly briefings. The agent maintains a knowledge graph that grows over time, with each run building on previous findings.
3. Thought Signatures & Multi-Turn Preservation
Beginning with Gemini 3.5 Flash, the model preserves reasoning context from all previous turns via encrypted thought signatures. These opaque strings encode compressed reasoning state, allowing the model to maintain deep logical continuity across multi-turn conversations.
3.1 How Signatures Work
When the model responds with thinking enabled, each response part may include a thought_signature field:
from google import genai
from google.genai import types
client = genai.Client()
# First turn — model generates a thought signature
response = client.models.generate_content(
model="gemini-3.5-flash",
contents="Let's work through a complex optimization problem step by step. "
"I have a warehouse with 50 products and 3 trucks with different capacities."
)
# Inspect the response parts for thought signatures
for i, candidate in enumerate(response.candidates):
for j, part in enumerate(candidate.content.parts):
print(f"Part {j}: text length={len(part.text) if part.text else 0}")
if hasattr(part, 'thought_signature') and part.thought_signature:
print(f" → Thought signature present ({len(part.thought_signature)} bytes)")
3.2 Multi-Turn Preservation
To maintain reasoning continuity, you must include the thought signatures from previous model responses when building the conversation history:
from google import genai
from google.genai import types
client = genai.Client()
# Turn 1: Initial question
history = [
types.Content(role="user", parts=[
types.Part(text="I need to solve a system of 3 equations with 3 unknowns: "
"2x + y - z = 8, -3x - y + 2z = -11, -2x + y + 2z = -3")
])
]
response1 = client.models.generate_content(
model="gemini-3.5-flash",
contents=history,
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=4096)
)
)
print(f"Turn 1: {response1.text[:200]}...")
# CRITICAL: Append the model's response WITH thought signatures intact
history.append(response1.candidates[0].content)
# Turn 2: Follow-up that builds on previous reasoning
history.append(types.Content(role="user", parts=[
types.Part(text="Now verify the solution by substituting back into all three equations.")
]))
response2 = client.models.generate_content(
model="gemini-3.5-flash",
contents=history,
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=4096)
)
)
print(f"\nTurn 2: {response2.text[:200]}...")
# The model remembers its reasoning approach from Turn 1
# without re-deriving the solution from scratch
print(f"\nTurn 2 thinking tokens: {response2.usage_metadata.thoughts_token_count}")
3.3 Automatic Handling in Interactions API
The Interactions API eliminates manual thought signature management entirely. Since the server maintains conversation state, thought signatures are preserved automatically:
from google import genai
client = genai.Client()
# No manual history or signature management needed!
interaction1 = client.interactions.create(
model="gemini-3.5-flash",
input="Solve this step by step: If a train leaves at 3pm going 60mph, "
"and another at 4pm going 80mph on the same track, when do they meet "
"if they start 200 miles apart?"
)
print(f"Step 1: {interaction1.output_text[:200]}...")
# Continue — thought signatures handled server-side
interaction2 = client.interactions.create(
model="gemini-3.5-flash",
previous_interaction_id=interaction1.id,
input="What if the second train was going 90mph instead?"
)
print(f"\nModified: {interaction2.output_text[:200]}...")
# Model adjusts the previous calculation without re-solving from scratch
4. Strict Requirements for Function Calling
When combining thinking with function calling, thought signatures become mandatory. Omitting them from the conversation history causes 400 Bad Request validation errors.
from google import genai
from google.genai import types
client = genai.Client()
# Define a tool
weather_tool = types.Tool(function_declarations=[
types.FunctionDeclaration(
name="get_weather",
description="Get current weather for a city",
parameters=types.Schema(
type="OBJECT",
properties={
"city": types.Schema(type="STRING", description="City name"),
},
required=["city"]
)
)
])
# Turn 1: Ask a question that triggers tool use
history = [
types.Content(role="user", parts=[
types.Part(text="What's the weather like in Tokyo and should I bring an umbrella?")
])
]
response = client.models.generate_content(
model="gemini-3.5-flash",
contents=history,
config=types.GenerateContentConfig(
tools=[weather_tool],
thinking_config=types.ThinkingConfig(thinking_budget=1024)
)
)
# CRITICAL: Append the ENTIRE model response (including thought signatures)
history.append(response.candidates[0].content)
# Execute the function call
function_call = response.candidates[0].content.parts[0].function_call
print(f"Model requested: {function_call.name}({function_call.args})")
# Return the function result — thought signatures from above are preserved in history
history.append(types.Content(role="user", parts=[
types.Part(function_response=types.FunctionResponse(
name="get_weather",
response={"temperature": "22°C", "condition": "partly cloudy", "rain_chance": "15%"}
))
]))
# Model generates final answer using both reasoning context AND tool result
final_response = client.models.generate_content(
model="gemini-3.5-flash",
contents=history,
config=types.GenerateContentConfig(
tools=[weather_tool],
thinking_config=types.ThinkingConfig(thinking_budget=1024)
)
)
print(f"\nFinal answer: {final_response.text}")
5. Cost Implications
5.1 Cost Formula
Thinking tokens are billed at the output token rate, making them a significant cost factor for reasoning-heavy workloads:
total_cost = (input_tokens × input_rate) + (thinking_tokens × output_rate) + (output_tokens × output_rate)For Gemini 3.5 Flash at paid tier:
• Input: $0.15 per 1M tokens
• Thinking: $0.60 per 1M tokens (same as output)
• Output: $0.60 per 1M tokens
from google import genai
from google.genai import types
client = genai.Client()
# Calculate actual cost for a request
response = client.models.generate_content(
model="gemini-3.5-flash",
contents="Design a microservices architecture for a real-time trading platform "
"supporting 1M concurrent users with sub-10ms latency requirements.",
config=types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=8192)
)
)
metadata = response.usage_metadata
input_tokens = metadata.prompt_token_count
thinking_tokens = metadata.thoughts_token_count
output_tokens = metadata.candidates_token_count
# Gemini 3.5 Flash pricing (per token)
input_rate = 0.15 / 1_000_000 # $0.15 per 1M
output_rate = 0.60 / 1_000_000 # $0.60 per 1M (thinking uses this rate too)
cost = (input_tokens * input_rate) + (thinking_tokens * output_rate) + (output_tokens * output_rate)
print(f"Input tokens: {input_tokens:>6} → ${input_tokens * input_rate:.6f}")
print(f"Thinking tokens: {thinking_tokens:>6} → ${thinking_tokens * output_rate:.6f}")
print(f"Output tokens: {output_tokens:>6} → ${output_tokens * output_rate:.6f}")
print(f"{'─' * 40}")
print(f"Total cost: → ${cost:.6f}")
5.2 Optimization Strategies
from google import genai
from google.genai import types
client = genai.Client()
def smart_generate(prompt: str, complexity: str = "auto") -> str:
"""Route prompts to appropriate thinking budgets based on complexity."""
budget_map = {
"trivial": 0, # Simple lookups, formatting
"low": 1024, # Basic Q&A, summarization
"medium": 4096, # Code gen, analysis
"high": 8192, # Complex reasoning
"auto": -1 # Let model decide
}
budget = budget_map.get(complexity, -1)
config = types.GenerateContentConfig(
thinking_config=types.ThinkingConfig(thinking_budget=budget)
)
response = client.models.generate_content(
model="gemini-3.5-flash",
contents=prompt,
config=config
)
cost_per_thinking = response.usage_metadata.thoughts_token_count * (0.60 / 1_000_000)
print(f"[{complexity}] Thinking: {response.usage_metadata.thoughts_token_count} tokens "
f"(${cost_per_thinking:.6f})")
return response.text
# Route different tasks to appropriate budgets
print(smart_generate("What is 2 + 2?", "trivial"))
print(smart_generate("Summarize the benefits of microservices", "low"))
print(smart_generate("Write a binary search in Python", "medium"))
print(smart_generate("Prove the Pythagorean theorem three different ways", "high"))
- Use
thinking_budget=0for simple tasks (saves 60–90% on those calls) - Use context caching for repeated prompts — cached tokens cost ~75% less
- Batch simple queries together to amortize overhead
- Monitor
thoughts_token_countin production to detect unexpectedly expensive calls - Consider the Interactions API for multi-turn — automatic caching reduces repeated input costs
Next in the Gemini SDK Track
In Part 6: Function Calling & Tool Integration, we’ll declare custom tools with Python type hints, implement the agentic function calling loop, handle strict response matching requirements, return multimodal function responses, and orchestrate parallel tool calls.