Gemini SDK Track Part 5: Thinking, Reasoning & Thought Signatures

                        
                        What You’ll Learn: Gemini models reason internally before they answer, and that reasoning affects quality, latency, and cost. This article teaches you to work with thinking in the Gemini SDK: inspecting thinking tokens, controlling the thinking budget, preserving thought signatures across turns and function calls, and estimating what reasoning costs.
                    

1. Deep Reasoning as Default

Starting with Gemini 2.5, Gemini models think natively: they spend part of a thinking budget on internal reasoning before producing a response. Unlike earlier models that responded immediately, they analyze the prompt, consider approaches, and plan their output before generating the final answer.

                        
                        Key Insight: Thinking is ON by default. You do not need to enable it — every call to gemini-3.5-flash or gemini-3.1-pro includes internal reasoning. You can only reduce or disable it, not “turn it on.”
                    

1.1 Default Reasoning Behavior

When you make a standard API call without specifying any thinking configuration, the model dynamically allocates a thinking budget based on prompt complexity:

from google import genai

client = genai.Client()

# Default call — model reasons internally before responding
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="What are the implications of Gödel's incompleteness theorems for AI?"
)

print(response.text)

The model internally decomposes the question, considers multiple angles (mathematical logic, computability theory, philosophical implications), and synthesizes a coherent response — all before producing any output tokens.

1.2 Inspecting Thinking Tokens

Every response includes usage_metadata that reveals how many tokens the model spent on reasoning:

from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain the P vs NP problem and its practical significance."
)

# Inspect token usage breakdown
metadata = response.usage_metadata
print(f"Input tokens:    {metadata.prompt_token_count}")
print(f"Thinking tokens: {metadata.thoughts_token_count}")
print(f"Output tokens:   {metadata.candidates_token_count}")
print(f"Total tokens:    {metadata.total_token_count}")

                        
                        Token Accounting: The thoughts_token_count field shows how many tokens were consumed by internal reasoning. These tokens are never shown to the user but are billed at the output token rate. A complex math proof might use 2,000–8,000 thinking tokens; a simple factual lookup might use fewer than 100.
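
The accounting above can be sketched offline. This is a minimal illustration, not SDK code: `summarize_usage` is a hypothetical helper, the `SimpleNamespace` object stands in for `response.usage_metadata`, and the token counts are made up. It assumes thinking and output tokens are reported as separate counts, as the fields in this section suggest, and treats a missing count as 0.

```python
from types import SimpleNamespace

def summarize_usage(usage) -> dict:
    """Summarize a usage_metadata-like object; missing counts are treated as 0."""
    prompt = getattr(usage, "prompt_token_count", None) or 0
    thinking = getattr(usage, "thoughts_token_count", None) or 0
    output = getattr(usage, "candidates_token_count", None) or 0
    billed_as_output = thinking + output  # both are charged at the output rate
    share = thinking / billed_as_output if billed_as_output else 0.0
    return {
        "prompt": prompt,
        "thinking": thinking,
        "output": output,
        "billed_as_output": billed_as_output,
        "thinking_share": share,
    }

# Stand-in for response.usage_metadata, with made-up counts
fake_usage = SimpleNamespace(
    prompt_token_count=20, thoughts_token_count=3000, candidates_token_count=1000
)
summary = summarize_usage(fake_usage)
print(f"{summary['thinking_share']:.0%} of output-rate tokens were thinking")
```

With a real response, you would pass `response.usage_metadata` instead of the stand-in object.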
                    

2. Controlling the Thinking Budget

While thinking is on by default, you have fine-grained control over how much reasoning the model performs: pass a ThinkingConfig with a thinking_budget value in the thinking_config field of GenerateContentConfig.

2.1 Budget Values & Effects

Budget Value	Behavior	Use Case
`0`	Thinking disabled entirely	Simple lookups, translations, formatting
`-1`	Dynamic (model decides)	Default behavior — optimal for most tasks
`1024`	Light reasoning	Summarization, Q&A with clear answers
`4096`	Moderate reasoning	Multi-step analysis, code generation
`8192`	Deep reasoning	Complex math, proofs, research synthesis
`24576`	Maximum reasoning	PhD-level problems, novel algorithm design

from google import genai
from google.genai import types

client = genai.Client()

# Disable thinking — fastest, cheapest, least accurate for complex tasks
response_none = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="What is the capital of France?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    )
)
print(f"No thinking: {response_none.text}")
print(f"Thinking tokens: {response_none.usage_metadata.thoughts_token_count}")

# Dynamic — let the model decide (equivalent to no config)
response_dynamic = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Compare merge sort and quicksort with Big-O analysis.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=-1)
    )
)
print(f"\nDynamic thinking tokens: {response_dynamic.usage_metadata.thoughts_token_count}")

# High budget — force deep reasoning
response_deep = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Prove that the halting problem is undecidable using diagonalization.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=8192)
    )
)
print(f"\nDeep thinking tokens: {response_deep.usage_metadata.thoughts_token_count}")
print(f"Answer preview: {response_deep.text[:300]}...")
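
If budgets come from user input or configuration, it may help to validate them before building the request. The sketch below is a hypothetical helper, not part of the SDK. It assumes the special values 0 and -1 from the table above and uses the table's largest value, 24576, as an upper bound; the real limit may differ by model.

```python
MAX_BUDGET = 24576  # largest value in the table above; assumed, not model-specific

def normalize_budget(requested: int, max_budget: int = MAX_BUDGET) -> int:
    """Return a budget that is 0 (off), -1 (dynamic), or within 1..max_budget."""
    if requested in (0, -1):
        return requested          # special values pass through unchanged
    if requested < -1:
        raise ValueError(f"invalid thinking budget: {requested}")
    return min(requested, max_budget)  # cap oversized requests

for value in (0, -1, 512, 100_000):
    print(value, "->", normalize_budget(value))
```

The result can then be passed as `thinking_budget` in `types.ThinkingConfig`.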

2.2 When to Use Each Level

                        
                        Strategy Guide:
Budget 0: Chatbots with simple greetings, data formatting, language translation
Budget 1024: Content summarization, straightforward Q&A, text classification
Budget 4096: Code generation, multi-step word problems, document analysis
Budget 8192+: Mathematical proofs, research paper synthesis, complex debugging
Budget -1: When you trust the model to allocate appropriately (production default)

                    

from google import genai
from google.genai import types

client = genai.Client()

def generate_with_budget(prompt: str, budget: int) -> dict:
    """Helper to compare thinking budget effects."""
    config = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=budget)
    )
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=prompt,
        config=config
    )
    usage = response.usage_metadata
    return {
        "budget": budget,
        # Counts may be None (for example when thinking is disabled), so default to 0
        "thinking_tokens": usage.thoughts_token_count or 0,
        "output_tokens": usage.candidates_token_count or 0,
        "answer_length": len(response.text or "")
    }

# Compare budgets on the same prompt
prompt = "What are three sorting algorithms and their time complexities?"
for budget in [0, 1024, 4096, 8192]:
    result = generate_with_budget(prompt, budget)
    print(f"Budget {result['budget']:>5}: "
          f"thinking={result['thinking_tokens']:>4}, "
          f"output={result['output_tokens']:>4}, "
          f"chars={result['answer_length']:>4}")

Real-World Application

Automated Competitive Intelligence

A consulting firm built a Gemini agent that monitors competitors daily: it searches news, analyzes financial filings, tracks product launches, and generates weekly briefings. The agent maintains a knowledge graph that grows over time, with each run building on previous findings.


3. Thought Signatures & Multi-Turn Preservation

Beginning with Gemini 3.5 Flash, responses carry encrypted thought signatures that let the model preserve reasoning context from previous turns. These opaque strings encode compressed reasoning state; when you send them back with the conversation history, the model can maintain logical continuity across multi-turn conversations.

3.1 How Signatures Work

                        
                        Critical Concept: Thought signatures are encrypted, opaque byte strings. You cannot read, modify, or interpret them. Your only job is to pass them back unchanged in subsequent turns. Tampering with signatures causes validation errors.
                    

When the model responds with thinking enabled, each response part may include a thought_signature field:

from google import genai
from google.genai import types

client = genai.Client()

# First turn — model generates a thought signature
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Let's work through a complex optimization problem step by step. "
             "I have a warehouse with 50 products and 3 trucks with different capacities."
)

# Inspect the response parts for thought signatures
for i, candidate in enumerate(response.candidates):
    for j, part in enumerate(candidate.content.parts):
        print(f"Part {j}: text length={len(part.text) if part.text else 0}")
        if hasattr(part, 'thought_signature') and part.thought_signature:
            print(f"  → Thought signature present ({len(part.thought_signature)} bytes)")
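
One place where signatures are easy to damage is persistence: if you store conversation history as JSON, raw bytes cannot be written directly. The sketch below shows one way to round-trip them without changing a single byte. It assumes the signature arrives as Python `bytes` (the example above prints its length in bytes); `part_to_json`, `part_from_json`, and the `thought_signature_b64` field name are hypothetical, and the signature value is made up.

```python
import base64
import json

def part_to_json(text, thought_signature):
    """Serialize one response part; the signature bytes are base64-encoded, not altered."""
    record = {"text": text}
    if thought_signature is not None:
        record["thought_signature_b64"] = base64.b64encode(thought_signature).decode("ascii")
    return json.dumps(record)

def part_from_json(payload):
    """Restore (text, thought_signature) exactly as they were serialized."""
    record = json.loads(payload)
    encoded = record.get("thought_signature_b64")
    signature = base64.b64decode(encoded) if encoded is not None else None
    return record.get("text"), signature

# Made-up bytes standing in for a real signature
original = b"\x00\x9f\xfeopaque-signature"
stored = part_to_json("Step 1: list the truck capacities.", original)
text, restored = part_from_json(stored)
print(restored == original)
```

The point of the design is that the signature is only encoded for transport and decoded back; it is never inspected, trimmed, or re-generated.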

3.2 Multi-Turn Preservation

To maintain reasoning continuity, you must include the thought signatures from previous model responses when building the conversation history:

from google import genai
from google.genai import types

client = genai.Client()

# Turn 1: Initial question
history = [
    types.Content(role="user", parts=[
        types.Part(text="I need to solve a system of 3 equations with 3 unknowns: "
                       "2x + y - z = 8, -3x - y + 2z = -11, -2x + y + 2z = -3")
    ])
]

response1 = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=history,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=4096)
    )
)
print(f"Turn 1: {response1.text[:200]}...")

# CRITICAL: Append the model's response WITH thought signatures intact
history.append(response1.candidates[0].content)

# Turn 2: Follow-up that builds on previous reasoning
history.append(types.Content(role="user", parts=[
    types.Part(text="Now verify the solution by substituting back into all three equations.")
]))

response2 = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=history,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=4096)
    )
)
print(f"\nTurn 2: {response2.text[:200]}...")

# The model remembers its reasoning approach from Turn 1
# without re-deriving the solution from scratch
print(f"\nTurn 2 thinking tokens: {response2.usage_metadata.thoughts_token_count}")
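
The append-the-whole-response pattern can be wrapped in a small helper so it is hard to get wrong. This is a sketch under stated assumptions: `Turn` is a stand-in for `types.Content`, `ask` is a hypothetical helper, and `fake_generate` replaces the real model call so the example runs offline with made-up signatures.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """Stand-in for types.Content: a role plus parts passed along untouched."""
    role: str
    parts: list = field(default_factory=list)

def ask(history, user_text, generate):
    """Append the user turn, call the model, and append its reply content as-is."""
    history.append(Turn(role="user", parts=[{"text": user_text}]))
    model_turn = generate(history)  # expected to return the model's full content
    history.append(model_turn)      # appended whole, never rebuilt from text alone
    return model_turn

def fake_generate(history):
    """Hypothetical model call that returns text plus an opaque signature."""
    n = sum(1 for turn in history if turn.role == "user")
    return Turn(role="model", parts=[{"text": f"answer {n}", "thought_signature": b"sig-%d" % n}])

history = []
ask(history, "Solve the system.", fake_generate)
ask(history, "Now verify it.", fake_generate)
print([turn.role for turn in history])
print(history[1].parts[0]["thought_signature"])
```

With the real SDK, `generate` would call `client.models.generate_content(...)` and return `response.candidates[0].content`, as in the example above.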

                        
                        Why This Matters: Without thought signatures, each turn would reason from scratch. With them, the model resumes where it left off — enabling coherent multi-step problem solving (like debugging across multiple turns or iterative document editing).
                    

3.3 Automatic Handling in Interactions API

The Interactions API eliminates manual thought signature management entirely. Since the server maintains conversation state, thought signatures are preserved automatically:

from google import genai

client = genai.Client()

# No manual history or signature management needed!
interaction1 = client.interactions.create(
    model="gemini-3.5-flash",
    input="Solve this step by step: If a train leaves at 3pm going 60mph, "
          "and another at 4pm going 80mph on the same track, when do they meet "
          "if they start 200 miles apart?"
)
print(f"Step 1: {interaction1.output_text[:200]}...")

# Continue — thought signatures handled server-side
interaction2 = client.interactions.create(
    model="gemini-3.5-flash",
    previous_interaction_id=interaction1.id,
    input="What if the second train was going 90mph instead?"
)
print(f"\nModified: {interaction2.output_text[:200]}...")
# Model adjusts the previous calculation without re-solving from scratch

4. Strict Requirements for Function Calling

When combining thinking with function calling, thought signatures become mandatory. Omitting them from the conversation history causes 400 Bad Request validation errors.

                        
                        Hard Requirement: If thinking is active (any budget other than 0, including the dynamic -1 default) and the model returns function calls, you MUST include the complete model response, including all thought signature fields, when sending back function results. Stripping signatures or rebuilding the response manually will cause 400 Bad Request errors.
                    

from google import genai
from google.genai import types

client = genai.Client()

# Define a tool
weather_tool = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="get_weather",
        description="Get current weather for a city",
        parameters=types.Schema(
            type="OBJECT",
            properties={
                "city": types.Schema(type="STRING", description="City name"),
            },
            required=["city"]
        )
    )
])

# Turn 1: Ask a question that triggers tool use
history = [
    types.Content(role="user", parts=[
        types.Part(text="What's the weather like in Tokyo and should I bring an umbrella?")
    ])
]

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=history,
    config=types.GenerateContentConfig(
        tools=[weather_tool],
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    )
)

# CRITICAL: Append the ENTIRE model response (including thought signatures)
history.append(response.candidates[0].content)

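# Execute the function call. Scan the parts: the call is not guaranteed to be parts[0]
function_call = next(
    (part.function_call for part in response.candidates[0].content.parts
     if part.function_call),
    None,
)
if function_call is None:
    raise RuntimeError("Model answered without requesting a tool call")
print(f"Model requested: {function_call.name}({function_call.args})")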

# Return the function result — thought signatures from above are preserved in history
history.append(types.Content(role="user", parts=[
    types.Part(function_response=types.FunctionResponse(
        name="get_weather",
        response={"temperature": "22°C", "condition": "partly cloudy", "rain_chance": "15%"}
    ))
]))

# Model generates final answer using both reasoning context AND tool result
final_response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=history,
    config=types.GenerateContentConfig(
        tools=[weather_tool],
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    )
)
print(f"\nFinal answer: {final_response.text}")
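
A model turn can mix text parts with one or more function calls, so the execution step is often written as a loop over parts. The sketch below is a hypothetical dispatcher, not SDK code: `run_tool_calls` and the `registry` dictionary are illustrative names, the `SimpleNamespace` objects stand in for `response.candidates[0].content`, and the weather data is canned. It only reads the model turn, so the signatures already appended to the history are left untouched.

```python
from types import SimpleNamespace

def run_tool_calls(model_content, registry):
    """Run every function call found in a model turn; return name/response pairs."""
    results = []
    for part in model_content.parts:
        call = getattr(part, "function_call", None)
        if call is None:
            continue  # text or thought parts carry no call to execute
        handler = registry.get(call.name)
        if handler is None:
            payload = {"error": f"unknown tool: {call.name}"}
        else:
            payload = handler(**dict(call.args))
        results.append({"name": call.name, "response": payload})
    return results

def get_weather(city):
    """Hypothetical local implementation with canned data."""
    return {"city": city, "condition": "partly cloudy"}

# Stand-in for response.candidates[0].content: a text part, then one call
fake_content = SimpleNamespace(parts=[
    SimpleNamespace(text="Checking the forecast.", function_call=None),
    SimpleNamespace(text=None, function_call=SimpleNamespace(
        name="get_weather", args={"city": "Tokyo"})),
])
results = run_tool_calls(fake_content, {"get_weather": get_weather})
print(results)
```

Each result can then be wrapped in `types.Part(function_response=types.FunctionResponse(name=..., response=...))` and appended to the history, as in the example above.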

5. Cost Implications

5.1 Cost Formula

Thinking tokens are billed at the output token rate, making them a significant cost factor for reasoning-heavy workloads:

                        
                        Total Cost Formula:

                        total_cost = (input_tokens × input_rate) + (thinking_tokens × output_rate) + (output_tokens × output_rate)

                        For Gemini 3.5 Flash at the paid tier:

                        • Input: $0.15 per 1M tokens

                        • Thinking: $0.60 per 1M tokens (same as output)

                        • Output: $0.60 per 1M tokens

from google import genai
from google.genai import types

client = genai.Client()

# Calculate actual cost for a request
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Design a microservices architecture for a real-time trading platform "
             "supporting 1M concurrent users with sub-10ms latency requirements.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=8192)
    )
)

metadata = response.usage_metadata
# Counts may be None when a category is absent, so default each to 0
input_tokens = metadata.prompt_token_count or 0
thinking_tokens = metadata.thoughts_token_count or 0
output_tokens = metadata.candidates_token_count or 0

# Gemini 3.5 Flash pricing (per token)
input_rate = 0.15 / 1_000_000    # $0.15 per 1M
output_rate = 0.60 / 1_000_000   # $0.60 per 1M (thinking uses this rate too)

cost = (input_tokens * input_rate) + (thinking_tokens * output_rate) + (output_tokens * output_rate)

print(f"Input tokens:    {input_tokens:>6} → ${input_tokens * input_rate:.6f}")
print(f"Thinking tokens: {thinking_tokens:>6} → ${thinking_tokens * output_rate:.6f}")
print(f"Output tokens:   {output_tokens:>6} → ${output_tokens * output_rate:.6f}")
print(f"{'─' * 40}")
print(f"Total cost:              → ${cost:.6f}")
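
To see how the formula behaves without calling the API, here is a worked example with made-up token counts. `estimate_cost` is a hypothetical helper; the default rates are the Gemini 3.5 Flash figures quoted in this section and should be replaced with current pricing.

```python
def estimate_cost(input_tokens, thinking_tokens, output_tokens,
                  input_per_million=0.15, output_per_million=0.60):
    """Apply the cost formula; thinking tokens use the output rate."""
    input_cost = input_tokens * input_per_million / 1_000_000
    thinking_cost = thinking_tokens * output_per_million / 1_000_000
    output_cost = output_tokens * output_per_million / 1_000_000
    total = input_cost + thinking_cost + output_cost
    return {"input": input_cost, "thinking": thinking_cost,
            "output": output_cost, "total": total}

# Made-up token counts: a short prompt, heavy reasoning, a medium answer
costs = estimate_cost(input_tokens=1_000, thinking_tokens=4_000, output_tokens=1_000)
print(f"total=${costs['total']:.5f}, "
      f"thinking share={costs['thinking'] / costs['total']:.0%}")
```

In this scenario the reasoning accounts for roughly three quarters of the bill, which is why the thinking budget is usually the first thing to tune.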

5.2 Optimization Strategies

from google import genai
from google.genai import types

client = genai.Client()

def smart_generate(prompt: str, complexity: str = "auto") -> str:
    """Route prompts to appropriate thinking budgets based on complexity."""
    budget_map = {
        "trivial": 0,       # Simple lookups, formatting
        "low": 1024,        # Basic Q&A, summarization
        "medium": 4096,     # Code gen, analysis
        "high": 8192,       # Complex reasoning
        "auto": -1          # Let model decide
    }
    
    budget = budget_map.get(complexity, -1)
    
    config = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=budget)
    )
    
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=prompt,
        config=config
    )
    
    # thoughts_token_count may be None when thinking is disabled, so default to 0
    thinking_tokens = response.usage_metadata.thoughts_token_count or 0
    thinking_cost = thinking_tokens * (0.60 / 1_000_000)
    print(f"[{complexity}] Thinking: {thinking_tokens} tokens "
          f"(${thinking_cost:.6f})")
    
    return response.text

# Route different tasks to appropriate budgets
print(smart_generate("What is 2 + 2?", "trivial"))
print(smart_generate("Summarize the benefits of microservices", "low"))
print(smart_generate("Write a binary search in Python", "medium"))
print(smart_generate("Prove the Pythagorean theorem three different ways", "high"))

                        
                        Cost Optimization Tips:
Use thinking_budget=0 for simple tasks (saves 60–90% on those calls)
Use context caching for repeated prompts — cached tokens cost ~75% less
Batch simple queries together to amortize overhead
Monitor thoughts_token_count in production to detect unexpectedly expensive calls
Consider the Interactions API for multi-turn — automatic caching reduces repeated input costs
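
The monitoring tip can be sketched as a rolling-average check. This is one possible approach rather than a recommended threshold: `ThinkingMonitor`, its window size, and the 3x factor are all hypothetical choices, and the token counts fed in below are made up.

```python
from collections import deque

class ThinkingMonitor:
    """Track recent thinking-token counts and flag unusually large ones."""

    def __init__(self, window=100, factor=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # rolling window of recent counts
        self.factor = factor                 # how far above average counts as a spike
        self.min_samples = min_samples       # avoid flagging before a baseline exists

    def record(self, thoughts_token_count):
        """Record one call; return True if it exceeds factor x the rolling average."""
        count = thoughts_token_count or 0    # treat a missing count as 0
        is_spike = False
        if len(self.samples) >= self.min_samples:
            average = sum(self.samples) / len(self.samples)
            is_spike = average > 0 and count > self.factor * average
        self.samples.append(count)
        return is_spike

monitor = ThinkingMonitor()
flags = [monitor.record(n) for n in (400, 500, 450, 550, 600, 480, 6000)]
print(flags)
```

In production you would call `monitor.record(response.usage_metadata.thoughts_token_count)` after each request and log or alert when it returns True.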

                    

                        
                        Try It Yourself: Build a ‘research agent’ that can answer complex questions by breaking them into steps: (1) plan research strategy, (2) search for information using a web_search tool, (3) extract and validate facts, (4) synthesize a final answer with citations. Test with 3 complex questions that require multiple search steps.
                    

Next in the Gemini SDK Track

In Part 6: Function Calling & Tool Integration, we’ll declare custom tools with Python type hints, implement the agentic function calling loop, handle strict response matching requirements, return multimodal function responses, and orchestrate parallel tool calls.