
AI Application Development Mastery Part 2: LLM Fundamentals for Developers

April 1, 2026 Wasil Zafar 44 min read

Master the developer-facing fundamentals of large language models. Understand tokenization, context windows, sampling parameters, API patterns, model trade-offs, and build your first LLM-powered application from scratch.

Table of Contents

  1. Tokens & Tokenization
  2. Context Windows
  3. Sampling Parameters
  4. API Patterns
  5. Model Comparison
  6. LLM Limitations
  7. Your First LLM Application
  8. Exercises & Self-Assessment
  9. LLM Config Generator
  10. Conclusion & Next Steps

Introduction: Thinking Like an LLM Developer

Series Overview: This is Part 2 of our 20-part AI Application Development Mastery series. In Part 1 we traced the evolution from ELIZA to modern AI agents. Now we dive into the practical fundamentals every developer needs to build LLM-powered applications.

AI Application Development Mastery

Your 20-step learning path • Currently on Step 2
1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns (You Are Here)
3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines
6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking
7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning
10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
14. MCP in Production: Building servers, integrations, scaling, agent systems
15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking
16. Production AI Systems: APIs, queues, caching, streaming, scaling
17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack
20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS

Most developers interact with LLMs through API calls — but without understanding what happens between your request and the response, you'll make costly mistakes. You'll waste tokens (and money) on bloated prompts, get inconsistent outputs because you don't understand sampling, or hit context window limits at the worst possible time.

This article gives you the mental model every AI application developer needs. By the end, you'll understand exactly how LLMs process your text, how to control their behavior, and how to choose the right model for your use case.

Key Insight: An LLM is a next-token prediction machine. Everything it does — answering questions, writing code, analyzing sentiment, translating languages — emerges from repeatedly predicting the most likely next token given all previous tokens. Understanding this single concept unlocks your intuition for prompt engineering, debugging, and architecture.
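The prediction loop is easy to see in miniature. The sketch below is a toy, not a real LLM: it greedily picks the most likely next token from a hand-built bigram table, but it has exactly the shape of the decoding loop inside a real model.

```python
# Toy greedy decoding loop: repeatedly pick the most likely next token.
# The bigram "model" here is hand-built purely for illustration.

bigram_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.8, "still": 0.2},
}

def generate(start: str, max_tokens: int = 5) -> list:
    tokens = [start]
    for _ in range(max_tokens):
        dist = bigram_probs.get(tokens[-1])
        if dist is None:  # no known continuation -> stop generating
            break
        # Greedy decoding (temperature=0): take the argmax of the distribution
        tokens.append(max(dist, key=dist.get))
    return tokens

print(generate("the"))  # ['the', 'cat', 'sat', 'down']
```

A real model replaces the lookup table with a neural network that scores every token in a ~100K-token vocabulary, conditioned on the entire preceding sequence, but the loop is the same.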

1. Tokens & Tokenization

LLMs do not read text the way humans do — they process tokens, subword units that break text into a vocabulary the model was trained on. Understanding tokenization is essential for AI application developers because tokens directly determine cost (you pay per token), context limits (models have a maximum token capacity), and output quality (how the model "sees" your input). This section covers what tokens are, how the Byte Pair Encoding algorithm builds them, and how to count tokens in practice.

1.1 What Are Tokens?

LLMs don't read text the way humans do. Before any processing, your text is broken into tokens — subword units that the model treats as its vocabulary. A token might be a word, part of a word, a single character, or even a space.

# Understanding tokenization with tiktoken (OpenAI's tokenizer)
# pip install tiktoken
import tiktoken

# GPT-4o uses the o200k_base tokenizer
enc = tiktoken.encoding_for_model("gpt-4o")

# Example 1: Simple English text
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Example decoded pieces (exact token IDs vary by tokenizer version):
# Decoded: ['Hello', ',', ' world', '!']

# Example 2: Code is tokenized differently
code = "def calculate_total(items):"
tokens = enc.encode(code)
print(f"\nCode: '{code}'")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Common programming keywords get their own tokens
# Underscores, parentheses are separate tokens

# Example 3: Non-English text uses more tokens
japanese = "人工知能アプリケーション"  # "AI applications" in Japanese
tokens_en = enc.encode("AI applications")
tokens_jp = enc.encode(japanese)
print(f"\nEnglish 'AI applications': {len(tokens_en)} tokens")
print(f"Japanese equivalent: {len(tokens_jp)} tokens")
# Non-Latin scripts often require more tokens per character

# Example 4: Numbers can be surprising
numbers = "123456789"
tokens = enc.encode(numbers)
print(f"\n'{numbers}': {len(tokens)} tokens")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Numbers are often split into 1-3 digit chunks

Cost Alert: You pay per token — both input and output. A single API call with a 2,000-token prompt and 500-token response costs you 2,500 tokens. At GPT-4o's pricing ($2.50/1M input, $10/1M output), that's $0.005 input + $0.005 output = $0.01 per call. At 10,000 users making 10 calls/day, that's $30,000/month. Token awareness is cost awareness.
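The arithmetic behind that alert generalizes to a one-line helper. Prices here are illustrative — always check current provider pricing:

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of a single API call at the given per-million-token prices."""
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)

# A 2,000-token prompt + 500-token response at GPT-4o-style pricing
per_call = cost_per_call(2000, 500, 2.50, 10.00)
print(per_call)                      # ≈ $0.01 per call
print(per_call * 100_000 * 30)       # ≈ $30,000/month at 100k calls/day
```

Wiring a function like this into your request pipeline lets you log cost per request from day one, before volume surprises you.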

1.2 Byte Pair Encoding (BPE)

Most modern LLMs use Byte Pair Encoding (BPE) for tokenization. BPE starts with individual characters and iteratively merges the most frequent pairs into new tokens:

# Simplified BPE algorithm demonstration
# This shows how tokenizers build their vocabulary

def simple_bpe(text, num_merges=10):
    """
    Simplified BPE: start with characters, merge frequent pairs.
    Real BPE operates on bytes and handles much larger vocabularies.
    """
    # Start: split text into individual characters
    tokens = list(text)
    print(f"Initial tokens: {tokens}")

    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = {}
        for j in range(len(tokens) - 1):
            pair = (tokens[j], tokens[j + 1])
            pairs[pair] = pairs.get(pair, 0) + 1

        if not pairs:
            break

        # Find the most frequent pair
        best_pair = max(pairs, key=pairs.get)
        merged = best_pair[0] + best_pair[1]

        # Merge all occurrences
        new_tokens = []
        j = 0
        while j < len(tokens):
            if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == best_pair:
                new_tokens.append(merged)
                j += 2
            else:
                new_tokens.append(tokens[j])
                j += 1

        tokens = new_tokens
        print(f"Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}' -> '{merged}' | Tokens: {tokens}")

    return tokens

# Watch BPE build tokens from characters
result = simple_bpe("the cat sat on the mat", num_merges=5)
print(f"\nFinal tokens: {result}")

# Key insight: Common words and subwords get their own tokens
# Rare words are split into known subword pieces
# This is why "un" + "break" + "able" works but rare words get fragmented

1.3 Token Counting in Practice

| Content Type | Approx. Tokens per 1,000 Words | Rule of Thumb |
|---|---|---|
| English prose | ~1,300 tokens | 1 token = ~0.75 words (or ~4 characters) |
| Python code | ~1,500-2,000 tokens | Code is token-expensive due to indentation, symbols |
| JSON data | ~2,000-2,500 tokens | Curly braces, quotes, and keys add up fast |
| Markdown | ~1,400-1,600 tokens | Formatting characters add modest overhead |
| Non-Latin scripts | ~2,000-4,000 tokens | CJK characters often need 2-3 tokens each |
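When tiktoken isn't available (client-side code, or non-OpenAI models), the rules of thumb above can be turned into a rough estimator. The multipliers below mirror the table and are approximations for budgeting, not exact counts:

```python
# Rough token estimation from character count, using the ~4 chars/token
# rule for English prose plus content-type multipliers from the table above.
CHARS_PER_TOKEN = 4.0
MULTIPLIER = {
    "prose": 1.0,
    "code": 1.3,      # indentation and symbols tokenize poorly
    "json": 1.7,      # braces, quotes, repeated keys add up
    "markdown": 1.1,  # modest formatting overhead
}

def estimate_tokens(text: str, content_type: str = "prose") -> int:
    """Approximate token count; err slightly high for budget safety."""
    base = len(text) / CHARS_PER_TOKEN
    return int(base * MULTIPLIER.get(content_type, 1.0)) + 1

print(estimate_tokens("Hello, world! This is a test sentence."))
print(estimate_tokens('{"key": "value"}', "json"))
```

Use estimates like this only for pre-flight budget checks; for billing-accurate counts, always use the model's real tokenizer.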

2. Context Windows

Every LLM has a context window — the maximum number of tokens it can process in a single request, including both the input and the generated output. Think of it as the model's working memory: everything it needs to know (system prompt, conversation history, retrieved documents, and the user's question) must fit within this window. Managing context budget is one of the most critical skills in AI application development, because exceeding the window silently truncates information, while underusing it wastes the model's potential.

2.1 How Context Windows Work

The context window is the total number of tokens an LLM can process in a single request — including both your input (system prompt + user message + any injected context) and the model's output. Think of it as the model's "working memory" — everything it needs to know for this conversation must fit within this window.

# Context window budget management
# This is a critical skill for production AI applications

def calculate_context_budget(
    model_context_window: int,
    system_prompt_tokens: int,
    max_output_tokens: int,
    conversation_history_tokens: int = 0,
    rag_context_tokens: int = 0
) -> dict:
    """
    Calculate how many tokens are available for your actual content.

    Context Window = System Prompt + History + RAG Context + User Query + Output
    """
    used = system_prompt_tokens + conversation_history_tokens + rag_context_tokens
    reserved_for_output = max_output_tokens
    available_for_query = model_context_window - used - reserved_for_output

    budget = {
        "total_window": model_context_window,
        "system_prompt": system_prompt_tokens,
        "conversation_history": conversation_history_tokens,
        "rag_context": rag_context_tokens,
        "reserved_for_output": reserved_for_output,
        "available_for_query": available_for_query,
        "utilization": f"{(used / model_context_window) * 100:.1f}%"
    }

    if available_for_query < 500:
        budget["warning"] = "CRITICAL: Less than 500 tokens available for user query!"

    return budget

# Example: GPT-4o with a RAG application
budget = calculate_context_budget(
    model_context_window=128000,    # GPT-4o: 128K tokens
    system_prompt_tokens=500,        # Your system instructions
    max_output_tokens=4096,          # Max response length
    conversation_history_tokens=2000, # Previous messages
    rag_context_tokens=8000          # Retrieved documents
)

for key, value in budget.items():
    print(f"  {key}: {value}")
# available_for_query: 113,404 tokens — plenty of room!

# Example: Smaller model with aggressive RAG
budget_small = calculate_context_budget(
    model_context_window=8192,      # Smaller model: 8K tokens
    system_prompt_tokens=500,
    max_output_tokens=2048,
    conversation_history_tokens=2000,
    rag_context_tokens=4000
)
# available_for_query: -356 tokens — OVERFLOW! Need to reduce context

2.2 Managing Context Budget

Pro Tip: Context window size is not just about fitting more text — it directly impacts cost, latency, and quality. Larger contexts cost more (you pay per token), take longer to process (attention is O(n^2)), and can actually reduce quality because the model may attend to irrelevant information (the "lost in the middle" problem). Use the minimum context necessary.
| Strategy | When to Use | Implementation |
|---|---|---|
| Sliding window | Chat apps — keep last N messages | Drop oldest messages when approaching limit |
| Summary compression | Long conversations — preserve key information | Periodically summarize history into a compact form |
| RAG retrieval limiting | Document Q&A — control injected context | Retrieve top-K chunks, set max token budget for context |
| Prompt compression | Expensive prompts — reduce system prompt size | Use concise instructions, avoid repetition in prompts |
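The sliding-window strategy from the table can be sketched in a few lines. This version assumes a `count_tokens` callback (in practice you'd pass a tiktoken-based counter) and always preserves the system message:

```python
def sliding_window(messages: list, max_tokens: int, count_tokens) -> list:
    """Keep the system message plus the most recent messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept = []
    # Walk history newest-first, keeping messages while they fit the budget
    for msg in reversed(rest):
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        kept.insert(0, msg)
        budget -= cost
    return system + kept

# Crude token counter for the demo: ~1 token per 4 characters
approx = lambda text: len(text) // 4 + 1
history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "First question, long ago..."},
    {"role": "assistant", "content": "An old answer."},
    {"role": "user", "content": "The latest question."},
]
trimmed = sliding_window(history, max_tokens=15, count_tokens=approx)
print([m["content"] for m in trimmed])
# ['You are helpful.', 'An old answer.', 'The latest question.']
```

Note the trade-off: a pure sliding window forgets everything that falls out of the window, which is why long-running chats usually pair it with summary compression.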

3. Sampling Parameters

When an LLM generates text, it predicts a probability distribution over its entire vocabulary for each next token. Sampling parameters control how the model selects from this distribution — and they dramatically affect output quality, creativity, and consistency.

3.1 Temperature

Temperature controls the randomness of token selection. It scales the logits (raw scores) before applying softmax:

# Temperature: How it actually works
import numpy as np

def apply_temperature(logits, temperature):
    """
    Temperature scales the logits before softmax.
    - temperature=0: Always picks the highest-probability token (deterministic)
    - temperature=0.3: Conservative — strong preference for likely tokens
    - temperature=0.7: Balanced — some creativity, mostly coherent
    - temperature=1.0: Model's natural distribution
    - temperature=1.5+: Very creative/random — may lose coherence
    """
    if temperature == 0:
        # Greedy: always pick the most likely token
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs

    scaled = logits / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))  # Numerical stability
    return exp_scaled / np.sum(exp_scaled)

# Example: Vocabulary = ["the", "a", "this", "my", "one"]
# Raw logits from the model
logits = np.array([5.0, 3.0, 2.5, 1.0, 0.5])
tokens = ["the", "a", "this", "my", "one"]

print("Token probabilities at different temperatures:")
print(f"{'Token':<8} {'t=0.1':<10} {'t=0.5':<10} {'t=1.0':<10} {'t=1.5':<10}")
print("-" * 48)

# Compute probabilities for each temperature and display side by side
temperatures = [0.1, 0.5, 1.0, 1.5]
all_probs = {t: apply_temperature(logits, t) for t in temperatures}

for i, token in enumerate(tokens):
    print(f"{token:<8}", end="")
    for t in temperatures:
        print(f"{all_probs[t][i]:<10.3f}", end="")
    print()

# At t=0.1: "the" has ~100% probability (nearly deterministic)
# At t=0.5: "the" has ~98%, "a" has ~2%
# At t=1.0: "the" has ~80%, "a" has ~11%, others share the rest
# At t=1.5: "the" has ~64% — noticeably more random

3.2 Top-p (Nucleus Sampling) & Top-k

Top-p and top-k are alternative (or complementary) ways to control randomness by limiting which tokens are considered:

# Top-p (Nucleus Sampling) and Top-k explained
import numpy as np

def top_k_sampling(probs, k):
    """Keep only the top-k most likely tokens, zero out the rest."""
    top_k_indices = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[top_k_indices] = probs[top_k_indices]
    return filtered / filtered.sum()  # Renormalize

def top_p_sampling(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumulative = np.cumsum(sorted_probs)

    # Find cutoff: keep tokens until cumulative probability reaches p
    cutoff_idx = np.searchsorted(cumulative, p) + 1

    filtered = np.zeros_like(probs)
    filtered[sorted_indices[:cutoff_idx]] = probs[sorted_indices[:cutoff_idx]]
    return filtered / filtered.sum()

# Example distribution
tokens = ["the", "a", "this", "my", "one", "that", "our", "some"]
probs = np.array([0.40, 0.20, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02])

print("Original distribution:")
for t, p in zip(tokens, probs):
    print(f"  {t}: {p:.2f}")

print("\nTop-k=3 (only consider 3 most likely tokens):")
tk = top_k_sampling(probs, k=3)
for t, p in zip(tokens, tk):
    if p > 0:
        print(f"  {t}: {p:.2f}")

print("\nTop-p=0.8 (tokens covering 80% cumulative probability):")
tp = top_p_sampling(probs, p=0.8)
for t, p in zip(tokens, tp):
    if p > 0:
        print(f"  {t}: {p:.2f}")

# Top-p is generally preferred because it adapts:
# When the model is confident, fewer tokens pass the threshold
# When uncertain, more tokens are considered

3.3 Sampling Strategy Guide

| Use Case | Temperature | Top-p | Why |
|---|---|---|---|
| Code generation | 0 - 0.2 | 0.1 - 0.3 | Code must be syntactically correct — low randomness |
| Data extraction / JSON | 0 | 1.0 | Deterministic output needed for parsing |
| Question answering | 0.1 - 0.3 | 0.3 - 0.5 | Factual accuracy matters more than creativity |
| Conversational chat | 0.5 - 0.8 | 0.7 - 0.9 | Balance between coherence and natural variation |
| Creative writing | 0.8 - 1.2 | 0.9 - 1.0 | Want diverse, surprising outputs |
| Brainstorming | 1.0 - 1.5 | 0.95 - 1.0 | Maximum creativity, accept some incoherence |
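In application code, the table above works well as a small lookup so call sites name a use case instead of hard-coding raw numbers. The values below are mid-points of the ranges above — a starting point to tune, not gospel:

```python
# Suggested sampling defaults per use case (mid-points of the table above).
SAMPLING_PRESETS = {
    "code_generation":  {"temperature": 0.1, "top_p": 0.2},
    "data_extraction":  {"temperature": 0.0, "top_p": 1.0},
    "qa":               {"temperature": 0.2, "top_p": 0.4},
    "chat":             {"temperature": 0.7, "top_p": 0.8},
    "creative_writing": {"temperature": 1.0, "top_p": 0.95},
    "brainstorming":    {"temperature": 1.2, "top_p": 1.0},
}

def sampling_params(use_case: str) -> dict:
    # Fall back to balanced chat defaults for unknown use cases
    return SAMPLING_PRESETS.get(use_case, SAMPLING_PRESETS["chat"])

print(sampling_params("code_generation"))  # {'temperature': 0.1, 'top_p': 0.2}
```

Centralizing presets like this also makes it trivial to A/B test parameter changes across your whole application from one place.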

4. API Patterns

Interacting with LLMs in production happens through well-defined API patterns. The Chat Completions API is the standard interface: you send a list of messages (system, user, assistant) and receive a completion. Beyond basic request-response, two patterns are critical for production: streaming (sending tokens to the client as they are generated for real-time UX) and function calling (the LLM outputs structured JSON that your code executes, enabling tool use). This section covers all three patterns with working code.

4.1 Chat Completions API

The Chat Completions API is the standard interface for interacting with modern LLMs. It uses a message-based format with three roles:

# The Chat Completions API — the foundation of every LLM app
# pip install openai
from openai import OpenAI

# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()  # Uses OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",        # Sets the AI's behavior and persona
            "content": """You are a senior Python developer. You:
            - Write clean, well-documented code
            - Follow PEP 8 conventions
            - Include type hints
            - Explain your reasoning before writing code"""
        },
        {
            "role": "user",          # The human's message
            "content": "Write a function that validates email addresses."
        },
        {
            "role": "assistant",     # Previous AI response (for multi-turn)
            "content": "I'll write an email validator using regex..."
        },
        {
            "role": "user",
            "content": "Can you also add validation for common disposable email domains?"
        }
    ],
    temperature=0.2,          # Low temp for code generation
    max_tokens=1000,          # Maximum response length
    top_p=0.3,                # Conservative sampling for accuracy
    frequency_penalty=0.0,    # Don't penalize repeated tokens
    presence_penalty=0.0,     # Don't force topic changes
)

# Access the response
print(response.choices[0].message.content)
print(f"\nUsage: {response.usage.prompt_tokens} prompt + "
      f"{response.usage.completion_tokens} completion = "
      f"{response.usage.total_tokens} total tokens")

4.2 Streaming Responses

Streaming sends tokens to the client as they're generated, dramatically improving perceived latency. Without streaming, users stare at a blank screen for 2-10 seconds. With streaming, they see text appear immediately, word by word.

# Streaming — essential for any user-facing LLM application
# pip install openai
from openai import OpenAI

# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()

# Non-streaming: wait for entire response (poor UX)
# response = client.chat.completions.create(model="gpt-4o", messages=[...])

# Streaming: get tokens as they're generated (great UX)
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    stream=True  # Enable streaming
)

# Process tokens as they arrive
full_response = ""
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)  # Print immediately
        full_response += token

print("\n\n--- Streaming complete ---")
print(f"Total response length: {len(full_response)} characters")

# In a web app (FastAPI example):
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
#
# @app.post("/chat")
# async def chat(request: ChatRequest):
#     async def generate():
#         stream = client.chat.completions.create(
#             model="gpt-4o", messages=request.messages, stream=True
#         )
#         for chunk in stream:
#             if chunk.choices[0].delta.content:
#                 yield f"data: {chunk.choices[0].delta.content}\n\n"
#         yield "data: [DONE]\n\n"
#     return StreamingResponse(generate(), media_type="text/event-stream")

4.3 Function Calling (Tool Use)

Function calling allows the LLM to request that your application execute specific functions with structured arguments. This is the foundation of agent-based architectures — the model decides which tool to use and what arguments to pass.

# Function Calling — the bridge between LLMs and real-world actions
# pip install openai
import json
from openai import OpenAI

# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()

# Step 1: Define your tools (functions the LLM can call)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g., 'San Francisco, CA'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the product database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "category": {"type": "string", "description": "Product category"},
                    "max_results": {"type": "integer", "description": "Max results to return"}
                },
                "required": ["query"]
            }
        }
    }
]

# Simulated tool implementations (replace with real APIs in production)
def get_weather(location, unit="celsius"):
    """Simulate a weather API call."""
    return json.dumps({"location": location, "temperature": 22, "unit": unit, "condition": "Rainy"})

def search_database(query, category=None, max_results=5):
    """Simulate a product database search."""
    return json.dumps({"results": [{"name": f"Umbrella - {query}", "price": 25.99}]})

# Map function names to implementations
available_functions = {
    "get_weather": get_weather,
    "search_database": search_database,
}

# Step 2: Send message with tools
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the weather in Tokyo and find me some umbrellas?"}
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto"  # Let the model decide which tools to use
)

# Step 3: Process tool calls and feed results back to the model
message = response.choices[0].message
if message.tool_calls:
    # Append the assistant message with tool calls
    messages.append(message)

    for tool_call in message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)
        print(f"Model wants to call: {function_name}")
        print(f"  Arguments: {arguments}")

        # Execute the function
        func = available_functions[function_name]
        result = func(**arguments)

        # Append tool result so the model can generate a final response
        messages.append({
            "tool_call_id": tool_call.id,
            "role": "tool",
            "content": result,
        })

    # Step 4: Get the final response with tool results
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print("\nFinal response:", final_response.choices[0].message.content)

Key Insight: Function calling is not the LLM executing code — it's the LLM generating structured JSON that your application executes. The model decides "I should call get_weather with location='Tokyo'" and returns that as structured data. Your code then actually calls the weather API and feeds the result back. This separation of "reasoning" (LLM) and "execution" (your code) is fundamental to safe agent design.

5. Model Comparison

The LLM landscape is evolving rapidly, with new models from OpenAI, Anthropic, Google, Meta, and Mistral releasing every few months. Choosing the right model for your application requires balancing capability (reasoning quality, code generation, multilingual support), cost (per-token pricing can vary 100x between models), latency (response time for real-time apps), and context window (how much data the model can process at once). This section provides a comprehensive comparison and a decision framework to guide model selection.

5.1 Comprehensive Model Comparison

| Model | Provider | Context Window | Strengths | Best For | Approx. Cost (input/output per 1M tokens) |
|---|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | Multimodal, fast, strong reasoning | General-purpose, code, analysis | $2.50 / $10.00 |
| GPT-4o mini | OpenAI | 128K | Very fast, cheap, good quality | High-volume, cost-sensitive apps | $0.15 / $0.60 |
| Claude 3.5 Sonnet | Anthropic | 200K | Long context, strong reasoning, safety | Long documents, code, analysis | $3.00 / $15.00 |
| Claude 3 Haiku | Anthropic | 200K | Very fast, affordable | Simple tasks, high volume | $0.25 / $1.25 |
| Gemini 1.5 Pro | Google | 1M-2M | Massive context, native multimodal | Huge documents, video/audio analysis | $1.25 / $5.00 |
| Llama 3.1 405B | Meta (open) | 128K | Open weights, self-hostable | On-prem, privacy-sensitive, fine-tuning | Self-hosted cost varies |
| Llama 3.1 70B | Meta (open) | 128K | Strong open model, reasonable to host | Cost-effective self-hosting | Self-hosted / ~$0.50-1.00 via providers |
| Mistral Large | Mistral AI | 128K | Efficient, multilingual, EU-based | European compliance, multilingual apps | $2.00 / $6.00 |

5.2 Choosing the Right Model

Decision Framework: Start with the cheapest model that meets your quality bar. For most applications: prototype with GPT-4o (highest quality), then optimize by switching to GPT-4o mini or Claude Haiku for simpler subtasks. Use Llama/Mistral when you need data privacy or want to fine-tune. Use Gemini when you need massive context windows.

# Model routing: Use the right model for each task
# This pattern saves 60-80% on API costs in production

# pip install openai anthropic
# Set your API keys:
#   export OPENAI_API_KEY="sk-..."
#   export ANTHROPIC_API_KEY="sk-ant-..."
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()       # Uses OPENAI_API_KEY env var
anthropic_client = Anthropic()  # Uses ANTHROPIC_API_KEY env var

def route_to_model(task_type: str, content: str) -> str:
    """Route different tasks to the most cost-effective model."""

    if task_type == "classification":
        # Simple classification: cheapest model
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": content}],
            temperature=0,
            max_tokens=50
        )
        return response.choices[0].message.content

    elif task_type == "complex_reasoning":
        # Complex analysis: strongest model
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": content}],
            temperature=0.3,
            max_tokens=2000
        )
        return response.choices[0].message.content

    elif task_type == "long_document":
        # Long document analysis: largest context window
        message = anthropic_client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=4096,
            messages=[{"role": "user", "content": content}]
        )
        return message.content[0].text

    else:
        # Default: balanced model
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": content}],
            temperature=0.5
        )
        return response.choices[0].message.content

# Usage:
# result = route_to_model("classification", "Is this email spam? ...")
# result = route_to_model("complex_reasoning", "Analyze this contract...")

6. LLM Limitations

LLMs are powerful, but they have fundamental limitations that every application developer must understand and mitigate. Hallucinations — confident, plausible-sounding statements that are factually wrong — are the most critical risk, especially in applications that users trust for accurate information. Cost and latency constraints also shape architectural decisions: a high-quality model like GPT-4o can cost 15–20x more per token than a smaller model like GPT-4o mini. This section covers both failure modes and the practical strategies to address them.

6.1 Hallucinations

Hallucination is when an LLM generates plausible-sounding but factually incorrect information. This is not a bug — it's an inherent property of how LLMs work. Because LLMs predict the most likely next token based on patterns, they can generate text that "sounds right" but is wrong.

| Hallucination Type | Example | Mitigation Strategy |
|---|---|---|
| Factual | "The Eiffel Tower is 500 meters tall" (actual: 330m) | RAG with verified sources, fact-checking step |
| Citation | Inventing fake academic papers with plausible authors | Never trust LLM-generated citations without verification |
| Code | Using a function that doesn't exist in the library | Automated testing, code execution verification |
| Confident nonsense | Explaining a concept with complete confidence but wrong details | Temperature=0, explicit "say I don't know" instructions |
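One practical mitigation that needs no extra infrastructure is self-consistency: sample the same question several times at non-zero temperature and only trust an answer the model agrees with itself on. The sketch below stubs out the model (`ask_llm` is a placeholder for a real API call, simulated here with a canned reply cycle):

```python
from collections import Counter
import itertools

def self_consistent_answer(ask_llm, question: str,
                           n_samples: int = 5, min_agreement: float = 0.6):
    """Sample n answers; return the majority answer only if agreement is high.

    ask_llm is a placeholder for a real temperature>0 API call.
    """
    answers = [ask_llm(question) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / n_samples >= min_agreement:
        return answer
    return None  # high disagreement -> treat as "I don't know"

# Stubbed model that occasionally hallucinates, for the demo
replies = itertools.cycle(["330m", "330m", "500m", "330m", "330m"])
ask = lambda q: next(replies)

print(self_consistent_answer(ask, "How tall is the Eiffel Tower?"))  # 330m
```

This multiplies your API cost by the sample count, so reserve it for high-stakes answers; for factual grounding at scale, RAG with verified sources remains the primary defense.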

6.2 Cost & Latency

Production Reality: The #1 reason AI applications fail in production is not quality — it's cost and latency. A prototype that works beautifully at 10 queries/day can become prohibitively expensive at 10,000 queries/day. Always calculate your per-query cost and multiply by expected volume before choosing a model.

# Cost and latency estimation for production planning
def estimate_monthly_cost(
    queries_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float
) -> dict:
    """Estimate monthly API costs for an LLM application."""
    monthly_queries = queries_per_day * 30
    total_input_tokens = monthly_queries * avg_input_tokens
    total_output_tokens = monthly_queries * avg_output_tokens

    input_cost = (total_input_tokens / 1_000_000) * input_price_per_million
    output_cost = (total_output_tokens / 1_000_000) * output_price_per_million

    return {
        "monthly_queries": monthly_queries,
        "total_tokens": total_input_tokens + total_output_tokens,
        "input_cost": f"${input_cost:.2f}",
        "output_cost": f"${output_cost:.2f}",
        "total_cost": f"${input_cost + output_cost:.2f}",
        "cost_per_query": f"${(input_cost + output_cost) / monthly_queries:.4f}"
    }

# Compare models for the same workload
print("=== GPT-4o ===")
gpt4o = estimate_monthly_cost(
    queries_per_day=5000, avg_input_tokens=2000, avg_output_tokens=500,
    input_price_per_million=2.50, output_price_per_million=10.00
)
for k, v in gpt4o.items():
    print(f"  {k}: {v}")

print("\n=== GPT-4o mini ===")
gpt4o_mini = estimate_monthly_cost(
    queries_per_day=5000, avg_input_tokens=2000, avg_output_tokens=500,
    input_price_per_million=0.15, output_price_per_million=0.60
)
for k, v in gpt4o_mini.items():
    print(f"  {k}: {v}")

# GPT-4o: ~$1,500/month vs GPT-4o mini: ~$90/month
# That's a 16x cost difference for often-comparable quality!

7. Your First LLM Application

Let's build a complete, production-quality LLM application — a code review assistant that analyzes code, identifies issues, and suggests improvements.

# Complete LLM Application: Code Review Assistant
# This combines everything we've learned in this article

# pip install openai
import json
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional

# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()

@dataclass
class ReviewResult:
    """Structured output from the code review."""
    language: str
    issues: list
    suggestions: list
    quality_score: int  # 1-10
    summary: str

def review_code(
    code: str,
    language: str = "python",
    focus_areas: Optional[list] = None
) -> ReviewResult:
    """
    Review code using GPT-4o with structured output.

    Args:
        code: The source code to review
        language: Programming language
        focus_areas: Optional areas to focus on (security, performance, etc.)
    """
    focus_instruction = ""
    if focus_areas:
        focus_instruction = f"\nFocus especially on: {', '.join(focus_areas)}"

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a senior {language} code reviewer.
Analyze code for bugs, security issues, performance problems, and style.
{focus_instruction}

Return your review as JSON with this exact structure:
{{
    "language": "{language}",
    "issues": [
        {{"severity": "high|medium|low", "line": "line number or range",
          "description": "what's wrong", "fix": "how to fix it"}}
    ],
    "suggestions": ["improvement suggestion 1", "suggestion 2"],
    "quality_score": 7,
    "summary": "one paragraph overall assessment"
}}"""
                },
                {
                    "role": "user",
                    "content": f"Review this {language} code:\n\n```{language}\n{code}\n```"
                }
            ],
            temperature=0,       # Deterministic for consistent reviews
            max_tokens=2000,
            response_format={"type": "json_object"}  # Force JSON output
        )

        # Parse the structured response
        result = json.loads(response.choices[0].message.content)

        return ReviewResult(
            language=result["language"],
            issues=result["issues"],
            suggestions=result["suggestions"],
            quality_score=result["quality_score"],
            summary=result["summary"]
        )
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON response: {e}")
        raise
    except Exception as e:
        print(f"API call failed: {e}")
        raise

# Usage example
sample_code = """
def get_user(id):
    query = f"SELECT * FROM users WHERE id = {id}"
    result = db.execute(query)
    password = result['password']
    return {"id": id, "name": result['name'], "password": password}
"""

review = review_code(
    code=sample_code,
    language="python",
    focus_areas=["security", "best practices"]
)

print(f"Quality Score: {review.quality_score}/10")
print(f"Summary: {review.summary}")
print(f"\nIssues Found: {len(review.issues)}")
for issue in review.issues:
    print(f"  [{issue['severity'].upper()}] Line {issue['line']}: {issue['description']}")
    print(f"    Fix: {issue['fix']}")
print(f"\nSuggestions:")
for s in review.suggestions:
    print(f"  - {s}")
Architecture Note

What Makes This a Real Application

This code review assistant demonstrates several production patterns:

  • Structured output: Uses response_format={"type": "json_object"} to guarantee parseable JSON
  • Temperature=0: Ensures consistent, reproducible reviews
  • Typed data classes: Converts raw JSON into typed Python objects
  • Parameterized prompts: Language and focus areas are configurable
  • System/user separation: Instructions in system message, data in user message

8. Exercises & Self-Assessment

Exercise 1

Token Economics Calculator

Build a token cost calculator:

  1. Install tiktoken: pip install tiktoken
  2. Write a function that takes a prompt string and returns: token count, estimated cost for GPT-4o, GPT-4o mini, and Claude 3.5 Sonnet
  3. Test with prompts of different sizes: 100 words, 1,000 words, 10,000 words
  4. Calculate: How many queries can you make for $100/month with each model?
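A starter sketch for this exercise, using only the standard library: the prices dictionary is illustrative (check the providers' current pricing pages), and `rough_token_count` is a stand-in heuristic of ~4 characters per token — for exact counts, replace it with tiktoken once installed:

```python
# Illustrative per-million-input-token prices; verify against current pricing.
PRICES_PER_MILLION_INPUT = {
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
    "claude-3.5-sonnet": 3.00,
}

def rough_token_count(text: str) -> int:
    """Crude estimate: ~4 characters per token for English prose.
    Swap in tiktoken for exact counts."""
    return max(1, len(text) // 4)

def input_cost_table(prompt: str) -> dict:
    """Map model name -> estimated input cost in dollars for one prompt."""
    tokens = rough_token_count(prompt)
    return {
        model: (tokens / 1_000_000) * price
        for model, price in PRICES_PER_MILLION_INPUT.items()
    }

prompt = "word " * 1000  # a ~1,000-word prompt
for model, cost in sorted(input_cost_table(prompt).items()):
    print(f"{model}: ${cost:.6f} per query, "
          f"{int(100 / cost):,} queries for $100")
```

Extending this with output-token pricing and tiktoken-based counts completes the exercise.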
Exercise 2

Temperature Experiment

Systematically explore how temperature affects output:

  1. Choose a creative prompt: "Write a one-paragraph story about a robot who learns to cook"
  2. Generate 5 responses at each temperature: 0, 0.3, 0.7, 1.0, 1.5
  3. For each temperature level, measure: diversity (how different are the 5 responses?), coherence (do they make sense?), creativity (surprise factor)
  4. Repeat with a factual prompt: "Explain how photosynthesis works"
  5. Write a recommendation: what temperature would you use for each task type?
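For step 3, one crude but workable diversity metric is the average pairwise dissimilarity between the sampled responses. The sketch below uses stdlib difflib; embedding-based similarity would be more robust, and the function name `diversity_score` is our own:

```python
from difflib import SequenceMatcher
from itertools import combinations

def diversity_score(responses: list[str]) -> float:
    """Average pairwise dissimilarity in [0, 1]; higher = more diverse."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0  # fewer than two responses: nothing to compare
    dissim = [1 - SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(dissim) / len(dissim)

# Identical responses (roughly what temperature=0 produces) score 0:
print(diversity_score(["The robot stirred the soup."] * 5))  # 0.0
```

Plotting this score against temperature for both the creative and the factual prompt makes the trade-off in step 5 concrete.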
Exercise 3

Build a Streaming Chat Application

Create a terminal-based chat application with streaming:

  1. Implement a conversation loop that maintains message history
  2. Stream responses token-by-token to the terminal
  3. Track and display token usage after each response
  4. Implement a context window manager that summarizes old messages when history exceeds 4,000 tokens
  5. Add a /model command that lets users switch between GPT-4o and GPT-4o mini mid-conversation
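A sketch of the context window manager from step 4, under two stated simplifications: token counts use the ~4 chars/token heuristic instead of tiktoken, and the summary is a placeholder string where a real app would make a cheap LLM call. It assumes the first message is the system prompt:

```python
def estimate_tokens(text: str) -> int:
    """Heuristic: ~4 characters per token; use tiktoken for exact counts."""
    return max(1, len(text) // 4)

def compact_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the system message plus as many recent turns as fit the budget,
    collapsing dropped turns into one placeholder summary message."""
    system, turns = messages[0], messages[1:]
    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)
    dropped = []
    while turns and total([system] + turns) > budget:
        dropped.append(turns.pop(0))  # drop the oldest turn first
    if dropped:
        # In practice, summarize `dropped` with a cheap LLM call here.
        summary = f"[Summary of {len(dropped)} earlier messages]"
        return [system, {"role": "system", "content": summary}] + turns
    return [system] + turns
```

Calling `compact_history(history)` before each API request keeps the conversation inside the budget while preserving the most recent turns verbatim.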
Exercise 4

Reflective Questions

  1. Why is tokenization language-dependent? What implications does this have for building multilingual AI applications?
  2. Explain the trade-off between context window size and cost/latency. When would you deliberately use a smaller context window?
  3. A user reports that your AI chatbot gives different answers to the same question. What parameter would you adjust and why?
  4. Compare function calling and RAG as mechanisms for giving LLMs access to external information. When would you use each?
  5. Your application uses GPT-4o at $3,000/month. Your CEO wants to cut costs by 80%. What's your strategy?

9. LLM Configuration Document Generator

Generate a professional LLM configuration document for your application. Download as Word, Excel, PDF, or PowerPoint.


All data stays in your browser. Nothing is sent to or stored on any server.

10. Conclusion & Next Steps

You now have a solid developer-facing understanding of how LLMs work and how to use them effectively. Here are the key takeaways from Part 2:

  • Tokens are the fundamental unit — everything is tokenized before processing, and you pay per token. Always be token-aware
  • Context windows are your budget — plan your token allocation across system prompt, history, RAG context, user query, and output
  • Sampling parameters (temperature, top-p, top-k) control creativity vs. consistency. Use low values for factual tasks, higher for creative ones
  • Streaming is essential for user-facing applications — never make users wait for the full response
  • Function calling bridges LLMs and the real world — the model decides what to do, your code executes it
  • Model selection is a cost-quality trade-off. Start with GPT-4o for prototyping, then optimize with cheaper models for production
  • Hallucinations are inherent — plan for them with RAG, verification steps, and guardrails

Next in the Series

In Part 3: Prompt Engineering Mastery, we'll master the art and science of prompting — zero-shot, few-shot, chain-of-thought, Tree-of-Thoughts, ReAct, structured output enforcement with JSON mode and Pydantic, LangChain prompt templates, optimization techniques, and defending against prompt injection.
