
Gemini SDK Track Part 2: Text Generation & Structured Outputs

May 24, 2026 Wasil Zafar 35 min read

Master text generation with generate_content, enforce type-safe JSON responses with response_mime_type and response_schema, stream responses, count tokens, and work with Gemini’s 1M-token context window.

Table of Contents

  1. Text Generation Fundamentals
  2. Streaming Responses
  3. Structured Outputs & JSON Schema
  4. Token Counting & Usage
  5. Long Context (1M Tokens)
What You’ll Learn: Content generation with Gemini goes far beyond simple text completion. You can control creativity with temperature, guide output with system instructions, generate multiple candidates for comparison, stream responses as they are produced, enforce JSON schemas for machine-readable output, and track token usage across long contexts. This article teaches you to craft requests that produce exactly the output quality and format your application needs.

1. Text Generation Fundamentals

The generate_content method is the core of all Gemini text interactions. It accepts a model identifier, content (text, images, or multi-turn history), and an optional configuration object controlling generation behavior.

1.1 Basic Generation

The simplest pattern — a single text prompt producing a single text response:

from google import genai

client = genai.Client()

# Basic single-turn generation
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain the difference between TCP and UDP in 3 bullet points."
)

print(response.text)

You can also provide system instructions to set the model’s persona and behavior constraints:

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    config=types.GenerateContentConfig(
        system_instruction="You are a senior software architect. Answer concisely with concrete examples. Use bullet points."
    ),
    contents="What are the trade-offs between microservices and monoliths?"
)

print(response.text)

1.2 Multi-Turn Conversations

For multi-turn conversations, pass a list of Content objects alternating between user and model roles:

from google import genai
from google.genai import types

client = genai.Client()

# Build a conversation history
history = [
    types.Content(role="user", parts=[types.Part(text="My name is Alex and I'm building a REST API.")]),
    types.Content(role="model", parts=[types.Part(text="Hi Alex! I'd be happy to help with your REST API. What technology stack are you using?")]),
    types.Content(role="user", parts=[types.Part(text="FastAPI with Python. How should I structure authentication?")])
]

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=history
)

print(response.text)
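When you assemble history programmatically, a quick local check that turns alternate can catch mistakes before the request is sent. The sketch below builds turns as plain dictionaries shaped like types.Content ({"role": ..., "parts": [{"text": ...}]}), assuming the SDK accepts this dict form for contents as it does for its other typed objects. The helper names make_turn and check_alternation are hypothetical, and requiring the history to start and end with a user turn is this sketch's own conservative convention.

```python
def make_turn(role: str, text: str) -> dict:
    """Build one conversation turn shaped like types.Content."""
    if role not in ("user", "model"):
        raise ValueError(f"role must be 'user' or 'model', got {role!r}")
    return {"role": role, "parts": [{"text": text}]}


def check_alternation(history: list[dict]) -> None:
    """Raise ValueError unless turns alternate user/model, starting and ending with user."""
    if not history:
        raise ValueError("history is empty")
    if history[0]["role"] != "user" or history[-1]["role"] != "user":
        raise ValueError("history should start and end with a user turn")
    for prev, cur in zip(history, history[1:]):
        if prev["role"] == cur["role"]:
            raise ValueError(f"two consecutive {cur['role']!r} turns")


history = [
    make_turn("user", "My name is Alex and I'm building a REST API."),
    make_turn("model", "Hi Alex! What technology stack are you using?"),
    make_turn("user", "FastAPI with Python. How should I structure authentication?"),
]
check_alternation(history)  # passes silently; then pass history as contents=
```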

1.3 Generation Configuration

Fine-tune output behavior with GenerateContentConfig. Key parameters control randomness, length, and stop conditions:

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    config=types.GenerateContentConfig(
        temperature=0.2,          # Lower = more deterministic (0.0-2.0)
        top_p=0.8,                # Nucleus sampling threshold
        top_k=40,                 # Top-K token sampling
        max_output_tokens=1024,   # Maximum response length
        stop_sequences=["\n\n"],  # Stops at the first blank line, so multi-paragraph output is cut short
        candidate_count=1         # Number of response candidates
    ),
    contents="Write a Python function to validate an email address using regex."
)

print(response.text)
Temperature Guide: Use 0.0–0.3 for factual/code tasks (deterministic), 0.7–1.0 for creative writing (diverse), and 1.0–2.0 for brainstorming (highly varied). The default is model-dependent but typically around 1.0.
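To see why these knobs behave as described, it helps to apply them to a toy set of token scores. The sketch below is a generic illustration of temperature scaling, top-k, and top-p (nucleus) filtering, not Gemini's internal decoder; it applies temperature first, then top-k, then top-p, and real implementations differ in ordering and details. The function name sampling_distribution and the example logits are hypothetical.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Probability of each token after temperature, top-k and top-p filtering.

    A generic illustration of these sampling knobs, not Gemini's internal code.
    """
    n = len(logits)
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top logit.
        best = max(range(n), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(n)]
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token indices from most to least likely.
    ranked = sorted(range(n), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalise over the surviving tokens so the result sums to 1.
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(n)]


logits = [2.0, 1.0, 0.0, -1.0]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in sampling_distribution(logits, temperature=t)])
```

Running the loop shows probability concentrating on the top token at low temperature and spreading out at high temperature, which is the determinism-versus-diversity trade-off the guide describes.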

2. Streaming Responses

Streaming delivers response chunks as they are generated, so users see the first tokens without waiting for the full response and your UI can update in real time. This is essential for chat interfaces and long-form generation.

2.1 Python Streaming

from google import genai

client = genai.Client()

# Stream response chunks as they arrive
response = client.models.generate_content_stream(
    model="gemini-3.5-flash",
    contents="Write a comprehensive guide to Python decorators with examples."
)

# Iterate over streamed chunks
full_text = ""
for chunk in response:
    text = chunk.text or ""  # some chunks carry no text (e.g. metadata only)
    print(text, end="", flush=True)
    full_text += text

print(f"\n\n--- Total length: {len(full_text)} characters ---")
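The accumulation logic can be factored into a small helper that works on any iterable of chunks, which also makes it testable offline. This is a minimal sketch: collect_stream is a hypothetical name, and it assumes only that each chunk exposes a text attribute that may be None, as chunks from generate_content_stream can. The demo uses SimpleNamespace objects as stand-ins for real chunks.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Optional


def collect_stream(chunks: Iterable, on_text: Optional[Callable[[str], None]] = None) -> str:
    """Concatenate chunk.text across a stream, skipping chunks with no text."""
    parts = []
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if not text:
            continue  # e.g. a chunk that only carries metadata
        if on_text is not None:
            on_text(text)  # forward each piece to the UI as it arrives
        parts.append(text)
    return "".join(parts)  # one join instead of repeated string concatenation


# Offline demo with stand-in chunks; in real code pass the iterator returned
# by client.models.generate_content_stream(...).
fake_stream = [
    SimpleNamespace(text="Decorators "),
    SimpleNamespace(text=None),
    SimpleNamespace(text="wrap functions."),
]
full_text = collect_stream(fake_stream, on_text=lambda t: print(t, end="", flush=True))
print(f"\n--- Total length: {len(full_text)} characters ---")
```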

2.2 JavaScript Streaming

import { GoogleGenAI } from "@google/genai";

const client = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await client.models.generateContentStream({
    model: "gemini-3.5-flash",
    contents: "Explain event-driven architecture with a real-world example."
});

let fullText = "";
for await (const chunk of response) {
    const text = chunk.text ?? "";  // some chunks carry no text
    process.stdout.write(text);
    fullText += text;
}

console.log(`\n\n--- Total length: ${fullText.length} characters ---`);
Streaming & Structured Outputs: Streaming works with structured output mode. Each chunk delivers a partial JSON fragment. However, you should accumulate all chunks and parse the complete JSON only after the stream finishes — partial JSON is not valid for parsing.
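The accumulate-then-parse pattern looks like this in Python. The fragments below are hypothetical stand-ins for the chunk.text values of a structured-output stream, and parse_streamed_json is an illustrative helper name; the point is that json.loads runs once, on the joined buffer, after the loop ends.

```python
import json


def parse_streamed_json(fragments):
    """Join streamed JSON fragments and parse only once the stream is complete."""
    buffer = "".join(fragments)
    try:
        return json.loads(buffer)
    except json.JSONDecodeError as exc:
        # A stream cut short (for example by max_output_tokens) leaves invalid JSON.
        raise ValueError(f"incomplete JSON after {len(buffer)} characters") from exc


# Hypothetical fragments standing in for chunk.text values.
fragments = ['{"recipe_name": "French onion', ' soup", "prep_time_', 'minutes": 90}']
recipe = parse_streamed_json(fragments)
print(recipe["recipe_name"], recipe["prep_time_minutes"])

# Parsing a prefix of the stream fails, which is why you wait for the end.
try:
    json.loads(fragments[0])
except json.JSONDecodeError:
    print("first fragment alone is not valid JSON")
```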
Real-World Application

AI-Powered Marketing Agency

A digital agency generates 200+ social media posts daily using Gemini with tailored system instructions per brand. Each brand has a “voice profile” encoded in the system instruction, and temperature is tuned per content type (0.3 for product descriptions, 0.9 for engagement posts).

Content Generation · System Instructions · Temperature Tuning
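A setup like this usually comes down to a lookup from (brand, content type) to request settings. The sketch below is hypothetical: the brand names and voice text are invented, the temperatures are the ones from the scenario above, and it assumes the SDK accepts config as a plain dict with the same field names as types.GenerateContentConfig.

```python
# Hypothetical brand voice profiles; the names and wording are illustrative only.
BRAND_VOICES = {
    "acme_outdoors": "You write for Acme Outdoors: upbeat, practical, no jargon.",
    "nova_finance": "You write for Nova Finance: calm, precise, compliance-aware.",
}

# Temperatures per content type, taken from the scenario above.
TEMPERATURE_BY_TYPE = {
    "product_description": 0.3,
    "engagement_post": 0.9,
}


def build_config(brand: str, content_type: str) -> dict:
    """Return a GenerateContentConfig-shaped dict for one brand and content type.

    Raises KeyError for an unknown brand or content type.
    """
    return {
        "system_instruction": BRAND_VOICES[brand],
        "temperature": TEMPERATURE_BY_TYPE[content_type],
    }


config = build_config("nova_finance", "engagement_post")
print(config)
```

Keeping voices and temperatures in data rather than code means a new brand is a new dictionary entry, and the resulting dict can be passed as config= to generate_content.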

3. Structured Outputs & JSON Schema

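Structured outputs constrain the model to return JSON that conforms to a schema you define. This removes most post-processing, regex extraction, and retry loops, because a completed response is machine-parseable. You should still handle responses that stop early, for example when max_output_tokens is reached, since a truncated JSON document will not parse.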

3.1 JSON Mode with Inline Schema

Specify response_mime_type and response_schema in the config to enforce structured JSON output:

from google import genai
from google.genai import types

client = genai.Client()

# Define the schema inline
response = client.models.generate_content(
    model="gemini-3.5-flash",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "recipe_name": {"type": "string"},
                "prep_time_minutes": {"type": "integer"},
                "difficulty": {"type": "string", "enum": ["easy", "medium", "hard"]},
                "ingredients": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "quantity": {"type": "string"}
                        },
                        "required": ["name", "quantity"]
                    }
                },
                "steps": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["recipe_name", "prep_time_minutes", "difficulty", "ingredients", "steps"]
        }
    ),
    contents="Give me a recipe for classic French onion soup."
)

import json
recipe = json.loads(response.text)
print(f"Recipe: {recipe['recipe_name']}")
print(f"Difficulty: {recipe['difficulty']}")
print(f"Prep time: {recipe['prep_time_minutes']} min")
print(f"Ingredients: {len(recipe['ingredients'])}")
for step in recipe['steps'][:3]:
    print(f"  - {step}")

3.2 Pydantic Schema Approach

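For Python developers, you can pass a Pydantic BaseModel class directly as the schema. The SDK converts the class to a schema automatically and also exposes the parsed result as response.parsed (a MovieReview instance); the example below parses response.text with json.loads to match the previous section: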

from google import genai
from google.genai import types
from pydantic import BaseModel

client = genai.Client()

# Define schema as a Pydantic model
class MovieReview(BaseModel):
    title: str
    year: int
    genre: str
    rating: float
    summary: str
    pros: list[str]
    cons: list[str]

# Pass the class directly — SDK converts to JSON Schema
response = client.models.generate_content(
    model="gemini-3.5-flash",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=MovieReview
    ),
    contents="Review the movie 'Inception' (2010) by Christopher Nolan."
)

import json
review = json.loads(response.text)
print(f"{review['title']} ({review['year']}) — {review['rating']}/10")
print(f"Genre: {review['genre']}")
print(f"Summary: {review['summary'][:100]}...")
print(f"Pros: {', '.join(review['pros'][:3])}")

3.3 Supported Schema Types & Properties

The Gemini structured output system supports the following JSON Schema features:

| Type | Description | Extra properties |
|---|---|---|
| string | Text values | enum, format (date, uri, email) |
| number | Floating-point | minimum, maximum |
| integer | Whole numbers | minimum, maximum |
| boolean | true/false | |
| object | Nested objects | properties, required, additionalProperties |
| array | Lists | items, minItems, maxItems |
| null | Nullable fields | Use with anyOf for optional fields |
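Schema Limitations: Gemini supports only a subset of JSON Schema. Keywords such as $ref, oneOf, and allOf, and recursive schemas, have historically been unsupported or only partially supported, so check the current documentation for your model before relying on them. Large or deeply nested schemas may be rejected, so keep schemas shallow (one or two levels of nesting); for complex data, break the task into multiple API calls with simpler schemas.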
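Even with a schema in the request, re-checking the parsed result is cheap insurance, for example against truncated or hand-edited data. The stdlib-only sketch below covers just the keywords from the table above (type, enum, minimum/maximum, properties/required, items, minItems/maxItems); it is not a full JSON Schema validator, and the jsonschema package or Pydantic models are the usual production choices. The function name validate is illustrative.

```python
def validate(value, schema, path="$"):
    """Return a list of problems; an empty list means value matches schema.

    Covers only a subset of JSON Schema: type, enum, minimum/maximum,
    properties/required, items, minItems/maxItems.
    """
    checks = {
        "string": lambda v: isinstance(v, str),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "boolean": lambda v: isinstance(v, bool),
        "object": lambda v: isinstance(v, dict),
        "array": lambda v: isinstance(v, list),
        "null": lambda v: v is None,
    }
    expected = schema.get("type")
    if expected and not checks[expected](value):
        return [f"{path}: expected {expected}, got {type(value).__name__}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in {schema['enum']}")
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: {value} < minimum {schema['minimum']}")
    if "maximum" in schema and value > schema["maximum"]:
        errors.append(f"{path}: {value} > maximum {schema['maximum']}")
    if expected == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required key {key!r}")
        for key, sub in schema.get("properties", {}).items():
            if key in value:  # recurse into each present property
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    if expected == "array":
        if len(value) < schema.get("minItems", 0):
            errors.append(f"{path}: fewer than {schema['minItems']} items")
        if "maxItems" in schema and len(value) > schema["maxItems"]:
            errors.append(f"{path}: more than {schema['maxItems']} items")
        for i, item in enumerate(value):
            errors.extend(validate(item, schema.get("items", {}), f"{path}[{i}]"))
    return errors


schema = {
    "type": "object",
    "properties": {
        "recipe_name": {"type": "string"},
        "prep_time_minutes": {"type": "integer", "minimum": 0},
        "difficulty": {"type": "string", "enum": ["easy", "medium", "hard"]},
        "steps": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["recipe_name", "prep_time_minutes", "difficulty", "steps"],
}
print(validate({"recipe_name": "Soup", "prep_time_minutes": 90,
                "difficulty": "medium", "steps": ["Slice onions"]}, schema))  # → []
print(validate({"recipe_name": "Soup", "prep_time_minutes": -5,
                "difficulty": "expert", "steps": []}, schema))
```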

4. Token Counting & Usage Metadata

4.1 Count Tokens API

Before sending expensive prompts, use count_tokens to preview the token cost without making a generation call:

from google import genai

client = genai.Client()

# Count tokens before generation (no cost incurred)
prompt = "Explain the complete history of the Roman Empire from founding to fall."
token_count = client.models.count_tokens(
    model="gemini-3.5-flash",
    contents=prompt
)

print(f"Prompt tokens: {token_count.total_tokens}")
# $0.15 per 1M input tokens is an example rate; check current pricing for your model
print(f"Estimated cost at $0.15/1M input: ${token_count.total_tokens * 0.15 / 1_000_000:.6f}")
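The same arithmetic generalises to a full request once output tokens are included. A minimal sketch, assuming flat per-million-token prices: the $0.15 input default comes from the example above, while the output price is a hypothetical placeholder you should replace with the real rate for your model.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int = 0,
                      input_price_per_m: float = 0.15,
                      output_price_per_m: float = 0.0) -> float:
    """Estimate request cost in USD from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# 0.60 per 1M output tokens is a placeholder, not a published price.
print(f"${estimate_cost_usd(2_000, 500, output_price_per_m=0.60):.6f}")
```

In practice, feed it token_count.total_tokens before the call, or the usage_metadata counts after it.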

4.2 Usage Metadata from Responses

Every generation response includes detailed token usage in usage_metadata:

from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Write a haiku about distributed systems."
)

print(f"Response: {response.text}")
print(f"\n--- Token Usage ---")
print(f"Prompt tokens:    {response.usage_metadata.prompt_token_count}")
print(f"Output tokens:    {response.usage_metadata.candidates_token_count}")
print(f"Thinking tokens:  {response.usage_metadata.thoughts_token_count}")
print(f"Total tokens:     {response.usage_metadata.total_token_count}")
Multimodal Token Rates: Video inputs are tokenized at 263 tokens/second. Audio inputs at 32 tokens/second. A 1-minute video costs approximately 15,780 input tokens. Use count_tokens with multimodal content to get exact counts before generation.
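The quoted rates make a quick budget estimate easy before you upload anything. This sketch simply multiplies durations by the rates stated in the note above, treating video and audio as separate inputs; the rates may change between models, so count_tokens remains the authority. The constant and function names are illustrative.

```python
VIDEO_TOKENS_PER_SECOND = 263  # rate quoted in the note above
AUDIO_TOKENS_PER_SECOND = 32   # rate quoted in the note above


def estimate_media_tokens(video_seconds: float = 0, audio_seconds: float = 0) -> int:
    """Rough input-token estimate for media; use count_tokens for exact numbers."""
    return round(video_seconds * VIDEO_TOKENS_PER_SECOND
                 + audio_seconds * AUDIO_TOKENS_PER_SECOND)


print(estimate_media_tokens(video_seconds=60))   # → 15780
print(estimate_media_tokens(audio_seconds=600))  # → 19200
```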

5. Long Context (1M Tokens)

Gemini 3.5 Flash supports a 1,048,576 token context window — enough to process entire codebases, books, or hours of video in a single API call. This is one of Gemini’s most distinctive capabilities.

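5.1 When to Use Long Context

Long context suits tasks that need the whole document in view at once, such as summarizing a full specification or spotting inconsistencies between its sections: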

from google import genai

client = genai.Client()

# Example: Analyzing a large document in full context
with open("technical_spec.md", encoding="utf-8") as f:
    large_document = f.read()  # Imagine a 200-page document

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=f"""Analyze the following technical specification and provide:
1. A one-paragraph executive summary
2. The top 5 risks identified
3. Any inconsistencies between sections

Document:
{large_document}"""
)

print(response.text)
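Before sending a very large file, a cheap local pre-check can tell you whether it is anywhere near the window. This sketch uses the rough rule of thumb that one token is about four characters of English text, plus a safety margin because the heuristic can be far off for code or non-English text; the names and the 0.9 margin are assumptions, and count_tokens gives the exact figure.

```python
CONTEXT_WINDOW = 1_048_576  # window size quoted above for Gemini 3.5 Flash
CHARS_PER_TOKEN = 4         # rough rule of thumb for English text


def rough_token_estimate(text: str) -> int:
    """Very rough token estimate: ceil(len(text) / CHARS_PER_TOKEN)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, window: int = CONTEXT_WINDOW, margin: float = 0.9) -> bool:
    """True if the rough estimate stays under margin * window input tokens."""
    return rough_token_estimate(text) <= window * margin


doc = "x" * 400_000  # stand-in for a large document
print(rough_token_estimate(doc), fits_in_context(doc))
```

If the pre-check passes comfortably, send the request; if it is close to the limit, confirm with count_tokens or fall back to chunking or RAG.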

5.2 Long Context vs RAG

Choosing between stuffing the full context and using RAG (retrieval-augmented generation) depends on your use case:

| Factor | Long Context | RAG |
|---|---|---|
| Best for | Complete understanding of a document | Searching across millions of documents |
| Accuracy | Higher (sees everything) | Depends on retrieval quality |
| Cost per query | Higher (large input token count) | Lower (only relevant chunks) |
| Latency | Higher for first call | Lower per query |
| Setup complexity | Minimal (just send the text) | Requires embeddings, vector DB, indexing |
| Dynamic data | Always fresh (send latest) | Requires re-indexing |
Cost Optimization: Use context caching for repeated queries against the same large document. Cached tokens are billed at roughly a 75% discount compared with regular input tokens, although explicit caches also incur an hourly storage fee. Upload the document once, create a cache, then query it multiple times at reduced cost. Part 7 covers caching in detail.
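A back-of-the-envelope comparison shows why caching matters for repeated queries. This sketch uses the example $0.15/1M input price and the ~75% discount quoted above as defaults, and deliberately ignores output tokens, the request that creates the cache, and storage fees, so treat it as an illustration of the discount arithmetic rather than a billing calculator.

```python
def query_costs(doc_tokens: int, question_tokens: int, num_queries: int,
                price_per_m: float = 0.15,
                cache_discount: float = 0.75) -> tuple[float, float]:
    """Input-token cost (USD) of repeated queries without and with caching.

    Ignores output tokens, cache creation, and cache storage fees.
    """
    per_token = price_per_m / 1_000_000
    uncached = num_queries * (doc_tokens + question_tokens) * per_token
    # Only the document is cached; each new question is billed at the full rate.
    cached = num_queries * (doc_tokens * (1 - cache_discount) + question_tokens) * per_token
    return uncached, cached


uncached, cached = query_costs(doc_tokens=500_000, question_tokens=200, num_queries=100)
print(f"without cache: ${uncached:.2f}, with cache: ${cached:.2f}")
```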
Try It Yourself: Create a ‘blog post generator’ that demonstrates parameter control: (1) Generate the same topic with temperature=0.2 (factual), 0.7 (balanced), and 1.2 (creative), (2) Use system_instruction to set a specific writing style, (3) Generate 3 candidates and programmatically select the best one based on length and keyword presence.
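For step (3), one possible starting point is a simple scoring heuristic. The sketch below is hypothetical: it rewards keyword coverage and penalises distance from a target word count, with an arbitrary 0.5 weight. With a real response you would first collect the candidate texts, for example by joining the text of each part in response.candidates[i].content.parts.

```python
def score_candidate(text: str, keywords: list[str], target_words: int = 300) -> float:
    """Hypothetical heuristic: keyword coverage minus a length-distance penalty."""
    lowered = text.lower()
    coverage = sum(1 for kw in keywords if kw.lower() in lowered) / max(len(keywords), 1)
    words = len(text.split())
    length_penalty = abs(words - target_words) / target_words
    return coverage - 0.5 * length_penalty


def pick_best(candidates: list[str], keywords: list[str], target_words: int = 300) -> str:
    """Return the highest-scoring candidate text."""
    return max(candidates, key=lambda t: score_candidate(t, keywords, target_words))


drafts = [
    "Python decorators wrap functions.",
    "Decorators in Python let you wrap a function to add logging or caching " * 5,
    "Unrelated text about cooking " * 10,
]
best = pick_best(drafts, keywords=["decorator", "python", "caching"], target_words=60)
print(best[:60])
```

The second draft wins here because it mentions all three keywords and is close to the 60-word target; tune the weights or swap in a model-based judge once the basic loop works.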

Next in the Gemini SDK Track

In Part 3: Multimodal — Vision, Video & Documents, we’ll explore image understanding vs. generation, video analysis with timestamps, PDF processing, and the Files API for managing media assets.