Introduction: Thinking Like an LLM Developer
Series Overview: This is Part 2 of our 20-part AI Application Development Mastery series. In Part 1 we traced the evolution from ELIZA to modern AI agents. Now we dive into the practical fundamentals every developer needs to build LLM-powered applications.
| Part | Title | Topics |
| --- | --- | --- |
| 1 | Foundations & Evolution of AI Apps | Pre-LLM era, transformers, LLM revolution |
| 2 | LLM Fundamentals for Developers (You Are Here) | Tokens, context windows, sampling, API patterns |
| 3 | Prompt Engineering Mastery | Zero/few-shot, CoT, ReAct, structured outputs |
| 4 | LangChain Core Concepts | Chains, prompts, LLMs, tools, LCEL |
| 5 | Retrieval-Augmented Generation (RAG) | Embeddings, vector DBs, retrievers, RAG pipelines |
| 6 | Memory & Context Engineering | Buffer/summary/vector memory, chunking, re-ranking |
| 7 | Agents — Core of Modern AI Apps | ReAct, tool-calling, planner-executor agents |
| 8 | LangGraph — Stateful Agent Workflows | Nodes, edges, state, graph execution, cycles |
| 9 | Deep Agents & Autonomous Systems | Multi-step reasoning, self-reflection, planning |
| 10 | Multi-Agent Systems | Supervisor, swarm, debate, role-based collaboration |
| 11 | AI Application Design Patterns | RAG, chat+memory, workflow automation, agent loops |
| 12 | Ecosystem & Frameworks | LlamaIndex, Haystack, HuggingFace, vLLM |
| 13 | MCP Foundations & Architecture | Protocol design, Host/Client/Server, primitives, security |
| 14 | MCP in Production | Building servers, integrations, scaling, agent systems |
| 15 | Evaluation & LLMOps | Prompt eval, tracing, LangSmith, experiment tracking |
| 16 | Production AI Systems | APIs, queues, caching, streaming, scaling |
| 17 | Safety, Guardrails & Reliability | Input filtering, hallucination mitigation, prompt injection |
| 18 | Advanced Topics | Fine-tuning, tool learning, hybrid LLM+symbolic |
| 19 | Building Real AI Applications | Chatbot, document QA, coding assistant, full-stack |
| 20 | Future of AI Applications | Autonomous agents, self-improving, multi-modal, AI OS |
Most developers interact with LLMs through API calls — but without understanding what happens between your request and the response, you'll make costly mistakes. You'll waste tokens (and money) on bloated prompts, get inconsistent outputs because you don't understand sampling, or hit context window limits at the worst possible time.
This article gives you the mental model every AI application developer needs. By the end, you'll understand exactly how LLMs process your text, how to control their behavior, and how to choose the right model for your use case.
Key Insight: An LLM is a next-token prediction machine. Everything it does — answering questions, writing code, analyzing sentiment, translating languages — emerges from repeatedly predicting the most likely next token given all previous tokens. Understanding this single concept unlocks your intuition for prompt engineering, debugging, and architecture.
1. Tokens & Tokenization
LLMs do not read text the way humans do — they process tokens, subword units that break text into a vocabulary the model was trained on. Understanding tokenization is essential for AI application developers because tokens directly determine cost (you pay per token), context limits (models have a maximum token capacity), and output quality (how the model "sees" your input). This section covers what tokens are, how the Byte Pair Encoding algorithm builds them, and how to count tokens in practice.
1.1 What Are Tokens?
LLMs don't read text the way humans do. Before any processing, your text is broken into tokens — subword units that the model treats as its vocabulary. A token might be a word, part of a word, a single character, or even a space.
# Understanding tokenization with tiktoken (OpenAI's tokenizer)
# pip install tiktoken
import tiktoken
# GPT-4o uses the o200k_base tokenizer
enc = tiktoken.encoding_for_model("gpt-4o")
# Example 1: Simple English text
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: '{text}'")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Tokens: four integer IDs (exact values depend on the tokenizer version)
# Decoded: ['Hello', ',', ' world', '!']
# Example 2: Code is tokenized differently
code = "def calculate_total(items):"
tokens = enc.encode(code)
print(f"\nCode: '{code}'")
print(f"Token count: {len(tokens)}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Common programming keywords get their own tokens
# Underscores, parentheses are separate tokens
# Example 3: Non-English text uses more tokens
japanese = "AIアプリケーション"  # "AI applications" in Japanese
tokens_en = enc.encode("AI applications")
tokens_jp = enc.encode(japanese)
print(f"\nEnglish 'AI applications': {len(tokens_en)} tokens")
print(f"Japanese '{japanese}': {len(tokens_jp)} tokens")
# Non-Latin scripts often require more tokens per character
# Example 4: Numbers can be surprising
numbers = "123456789"
tokens = enc.encode(numbers)
print(f"\n'{numbers}': {len(tokens)} tokens")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Numbers are often split into 1-3 digit chunks
Cost Alert: You pay per token — both input and output. A single API call with a 2,000-token prompt and 500-token response costs you 2,500 tokens. At GPT-4o's pricing ($2.50/1M input, $10/1M output), that's $0.005 input + $0.005 output = $0.01 per call. At 10,000 users making 10 calls/day, that's $30,000/month. Token awareness is cost awareness.
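The arithmetic in that alert is worth encoding as a helper you can rerun whenever usage assumptions change. A minimal sketch, using GPT-4o's list prices quoted above as constants (verify current pricing before budgeting):

```python
# Per-call and monthly cost from token counts.
# Prices below are the GPT-4o figures quoted in the alert; check current pricing.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single API call."""
    return ((input_tokens / 1_000_000) * INPUT_PRICE_PER_M
            + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M)

call = cost_per_call(2_000, 500)   # the example from the alert
monthly = call * 10_000 * 10 * 30  # 10,000 users x 10 calls/day x 30 days
print(f"${call:.3f} per call, ${monthly:,.0f}/month")
```

Running this reproduces the alert's numbers: $0.010 per call, $30,000/month.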
1.2 Byte Pair Encoding (BPE)
Most modern LLMs use Byte Pair Encoding (BPE) for tokenization. BPE starts with individual characters and iteratively merges the most frequent pairs into new tokens:
# Simplified BPE algorithm demonstration
# This shows how tokenizers build their vocabulary
def simple_bpe(text, num_merges=10):
"""
Simplified BPE: start with characters, merge frequent pairs.
Real BPE operates on bytes and handles much larger vocabularies.
"""
# Start: split text into individual characters
tokens = list(text)
print(f"Initial tokens: {tokens}")
for i in range(num_merges):
# Count all adjacent pairs
pairs = {}
for j in range(len(tokens) - 1):
pair = (tokens[j], tokens[j + 1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
# Find the most frequent pair
best_pair = max(pairs, key=pairs.get)
merged = best_pair[0] + best_pair[1]
# Merge all occurrences
new_tokens = []
j = 0
while j < len(tokens):
if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == best_pair:
new_tokens.append(merged)
j += 2
else:
new_tokens.append(tokens[j])
j += 1
tokens = new_tokens
print(f"Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}' -> '{merged}' | Tokens: {tokens}")
return tokens
# Watch BPE build tokens from characters
result = simple_bpe("the cat sat on the mat", num_merges=5)
print(f"\nFinal tokens: {result}")
# Key insight: Common words and subwords get their own tokens
# Rare words are split into known subword pieces
# This is why "un" + "break" + "able" works but rare words get fragmented
1.3 Token Counting in Practice
| Content Type | Approx. Tokens per 1,000 Words | Rule of Thumb |
| --- | --- | --- |
| English prose | ~1,300 tokens | 1 token = ~0.75 words (or ~4 characters) |
| Python code | ~1,500-2,000 tokens | Code is token-expensive due to indentation, symbols |
| JSON data | ~2,000-2,500 tokens | Curly braces, quotes, and keys add up fast |
| Markdown | ~1,400-1,600 tokens | Formatting characters add modest overhead |
| Non-Latin scripts | ~2,000-4,000 tokens | CJK characters often need 2-3 tokens each |
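These ratios can serve as a quick pre-flight estimator before you have a tokenizer in the loop. A back-of-the-envelope sketch (the per-type ratios are midpoints of the ranges above; for billing-accurate counts, always use the model's real tokenizer such as tiktoken):

```python
# Rough token estimator based on the rule-of-thumb ratios in the table above.
# For exact counts, use the model's tokenizer (e.g., tiktoken).
TOKENS_PER_1000_WORDS = {
    "english_prose": 1300,
    "python_code": 1750,   # midpoint of 1,500-2,000
    "json": 2250,          # midpoint of 2,000-2,500
    "markdown": 1500,      # midpoint of 1,400-1,600
}

def estimate_tokens(text: str, content_type: str = "english_prose") -> int:
    """Estimate the token count from the word count using the table's ratios."""
    words = len(text.split())
    return round(words * TOKENS_PER_1000_WORDS[content_type] / 1000)

print(estimate_tokens("the quick brown fox jumps over the lazy dog"))  # 9 words -> 12
```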
2. Context Windows
Every LLM has a context window — the maximum number of tokens it can process in a single request, including both the input and the generated output. Think of it as the model's working memory: everything it needs to know (system prompt, conversation history, retrieved documents, and the user's question) must fit within this window. Managing context budget is one of the most critical skills in AI application development, because exceeding the window silently truncates information, while underusing it wastes the model's potential.
2.1 How Context Windows Work
The context window is the total number of tokens an LLM can process in a single request — including both your input (system prompt + user message + any injected context) and the model's output. Think of it as the model's "working memory" — everything it needs to know for this conversation must fit within this window.
# Context window budget management
# This is a critical skill for production AI applications
def calculate_context_budget(
model_context_window: int,
system_prompt_tokens: int,
max_output_tokens: int,
conversation_history_tokens: int = 0,
rag_context_tokens: int = 0
) -> dict:
"""
Calculate how many tokens are available for your actual content.
Context Window = System Prompt + History + RAG Context + User Query + Output
"""
used = system_prompt_tokens + conversation_history_tokens + rag_context_tokens
reserved_for_output = max_output_tokens
available_for_query = model_context_window - used - reserved_for_output
budget = {
"total_window": model_context_window,
"system_prompt": system_prompt_tokens,
"conversation_history": conversation_history_tokens,
"rag_context": rag_context_tokens,
"reserved_for_output": reserved_for_output,
"available_for_query": available_for_query,
"utilization": f"{(used / model_context_window) * 100:.1f}%"
}
if available_for_query < 500:
budget["warning"] = "CRITICAL: Less than 500 tokens available for user query!"
return budget
# Example: GPT-4o with a RAG application
budget = calculate_context_budget(
model_context_window=128000, # GPT-4o: 128K tokens
system_prompt_tokens=500, # Your system instructions
max_output_tokens=4096, # Max response length
conversation_history_tokens=2000, # Previous messages
rag_context_tokens=8000 # Retrieved documents
)
for key, value in budget.items():
print(f" {key}: {value}")
# available_for_query: 113,404 tokens — plenty of room!
# Example: Smaller model with aggressive RAG
budget_small = calculate_context_budget(
model_context_window=8192, # Smaller model: 8K tokens
system_prompt_tokens=500,
max_output_tokens=2048,
conversation_history_tokens=2000,
rag_context_tokens=4000
)
# available_for_query: -356 tokens — OVERFLOW! Need to reduce context
2.2 Managing Context Budget
Pro Tip: Context window size is not just about fitting more text — it directly impacts cost, latency, and quality. Larger contexts cost more (you pay per token), take longer to process (attention is O(n^2)), and can actually reduce quality because the model may attend to irrelevant information (the "lost in the middle" problem). Use the minimum context necessary.
| Strategy | When to Use | Implementation |
| --- | --- | --- |
| Sliding window | Chat apps — keep last N messages | Drop oldest messages when approaching limit |
| Summary compression | Long conversations — preserve key information | Periodically summarize history into a compact form |
| RAG retrieval limiting | Document Q&A — control injected context | Retrieve top-K chunks, set max token budget for context |
| Prompt compression | Expensive prompts — reduce system prompt size | Use concise instructions, avoid repetition in prompts |
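The first strategy in the table, the sliding window, fits in a few lines of code. This is a minimal sketch with an assumed `count_tokens` helper (a crude characters-to-tokens proxy; in production, count with the model's real tokenizer). It always keeps the system message and drops the oldest turns first:

```python
# Sliding-window history truncation (minimal sketch).
# count_tokens is a stand-in helper; use a real tokenizer in production.
def count_tokens(message: dict) -> int:
    """Crude proxy: roughly 1 token per 4 characters of content."""
    return max(1, len(message["content"]) // 4)

def sliding_window(messages: list, max_tokens: int) -> list:
    """Keep the system message plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    for m in reversed(history):           # walk newest-first
        cost = count_tokens(m)
        if cost > budget:
            break                         # stop at the first message that doesn't fit
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order

msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "First question, long ago " * 20},
    {"role": "assistant", "content": "An old answer " * 20},
    {"role": "user", "content": "The latest question"},
]
trimmed = sliding_window(msgs, max_tokens=50)
print([m["role"] for m in trimmed])  # ['system', 'user'] — old turns dropped
```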
3. Sampling Parameters
When an LLM generates text, it predicts a probability distribution over its entire vocabulary for each next token. Sampling parameters control how the model selects from this distribution — and they dramatically affect output quality, creativity, and consistency.
3.1 Temperature
Temperature controls the randomness of token selection. It scales the logits (raw scores) before applying softmax:
# Temperature: How it actually works
import numpy as np
def apply_temperature(logits, temperature):
"""
Temperature scales the logits before softmax.
- temperature=0: Always picks the highest-probability token (deterministic)
- temperature=0.3: Conservative — strong preference for likely tokens
- temperature=0.7: Balanced — some creativity, mostly coherent
- temperature=1.0: Model's natural distribution
- temperature=1.5+: Very creative/random — may lose coherence
"""
if temperature == 0:
# Greedy: always pick the most likely token
probs = np.zeros_like(logits)
probs[np.argmax(logits)] = 1.0
return probs
scaled = logits / temperature
exp_scaled = np.exp(scaled - np.max(scaled)) # Numerical stability
return exp_scaled / np.sum(exp_scaled)
# Example: Vocabulary = ["the", "a", "this", "my", "one"]
# Raw logits from the model
logits = np.array([5.0, 3.0, 2.5, 1.0, 0.5])
tokens = ["the", "a", "this", "my", "one"]
print("Token probabilities at different temperatures:")
print(f"{'Token':<8} {'t=0.1':<10} {'t=0.5':<10} {'t=1.0':<10} {'t=1.5':<10}")
print("-" * 48)
# Compute probabilities for each temperature and display side by side
temperatures = [0.1, 0.5, 1.0, 1.5]
all_probs = {t: apply_temperature(logits, t) for t in temperatures}
for i, token in enumerate(tokens):
print(f"{token:<8}", end="")
for t in temperatures:
print(f"{all_probs[t][i]:<10.3f}", end="")
print()
# At t=0.1: "the" has ~100% probability (nearly deterministic)
# At t=0.5: "the" has ~97%, "a" has ~2%
# At t=1.0: "the" has ~80%, "a" has ~11%, others share the rest
# At t=1.5: "the" has ~64% — much more random
3.2 Top-p (Nucleus Sampling) & Top-k
Top-p and top-k are alternative (or complementary) ways to control randomness by limiting which tokens are considered:
# Top-p (Nucleus Sampling) and Top-k explained
import numpy as np
def top_k_sampling(probs, k):
"""Keep only the top-k most likely tokens, zero out the rest."""
top_k_indices = np.argsort(probs)[-k:]
filtered = np.zeros_like(probs)
filtered[top_k_indices] = probs[top_k_indices]
return filtered / filtered.sum() # Renormalize
def top_p_sampling(probs, p):
"""Keep the smallest set of tokens whose cumulative probability >= p."""
sorted_indices = np.argsort(probs)[::-1]
sorted_probs = probs[sorted_indices]
cumulative = np.cumsum(sorted_probs)
# Find cutoff: keep tokens until cumulative probability reaches p
cutoff_idx = np.searchsorted(cumulative, p) + 1
filtered = np.zeros_like(probs)
filtered[sorted_indices[:cutoff_idx]] = probs[sorted_indices[:cutoff_idx]]
return filtered / filtered.sum()
# Example distribution
tokens = ["the", "a", "this", "my", "one", "that", "our", "some"]
probs = np.array([0.40, 0.20, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02])
print("Original distribution:")
for t, p in zip(tokens, probs):
print(f" {t}: {p:.2f}")
print("\nTop-k=3 (only consider 3 most likely tokens):")
tk = top_k_sampling(probs, k=3)
for t, p in zip(tokens, tk):
if p > 0:
print(f" {t}: {p:.2f}")
print("\nTop-p=0.8 (tokens covering 80% cumulative probability):")
tp = top_p_sampling(probs, p=0.8)
for t, p in zip(tokens, tp):
if p > 0:
print(f" {t}: {p:.2f}")
# Top-p is generally preferred because it adapts:
# When the model is confident, fewer tokens pass the threshold
# When uncertain, more tokens are considered
3.3 Sampling Strategy Guide
| Use Case | Temperature | Top-p | Why |
| --- | --- | --- | --- |
| Code generation | 0 - 0.2 | 0.1 - 0.3 | Code must be syntactically correct — low randomness |
| Data extraction / JSON | 0 | 1.0 | Deterministic output needed for parsing |
| Question answering | 0.1 - 0.3 | 0.3 - 0.5 | Factual accuracy matters more than creativity |
| Conversational chat | 0.5 - 0.8 | 0.7 - 0.9 | Balance between coherence and natural variation |
| Creative writing | 0.8 - 1.2 | 0.9 - 1.0 | Want diverse, surprising outputs |
| Brainstorming | 1.0 - 1.5 | 0.95 - 1.0 | Maximum creativity, accept some incoherence |
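The guide above maps naturally onto a small preset registry, so application code never hard-codes magic numbers. The preset names and the exact values are illustrative picks from within the table's ranges, not an official API:

```python
# Sampling presets derived from the strategy guide (illustrative values).
SAMPLING_PRESETS = {
    "code_generation": {"temperature": 0.1, "top_p": 0.2},
    "data_extraction": {"temperature": 0.0, "top_p": 1.0},
    "question_answering": {"temperature": 0.2, "top_p": 0.4},
    "chat": {"temperature": 0.7, "top_p": 0.8},
    "creative_writing": {"temperature": 1.0, "top_p": 0.95},
    "brainstorming": {"temperature": 1.3, "top_p": 1.0},
}

def sampling_params(use_case: str) -> dict:
    """Look up sampling parameters for a use case; fall back to chat defaults."""
    return SAMPLING_PRESETS.get(use_case, SAMPLING_PRESETS["chat"])

# The returned dict can be splatted straight into a chat.completions.create call:
print(sampling_params("code_generation"))  # {'temperature': 0.1, 'top_p': 0.2}
```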
4. API Patterns
Interacting with LLMs in production happens through well-defined API patterns. The Chat Completions API is the standard interface: you send a list of messages (system, user, assistant) and receive a completion. Beyond basic request-response, two patterns are critical for production: streaming (sending tokens to the client as they are generated for real-time UX) and function calling (the LLM outputs structured JSON that your code executes, enabling tool use). This section covers all three patterns with working code.
4.1 Chat Completions API
The Chat Completions API is the standard interface for interacting with modern LLMs. It uses a message-based format with three roles:
# The Chat Completions API — the foundation of every LLM app
# pip install openai
from openai import OpenAI
# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI() # Uses OPENAI_API_KEY environment variable
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system", # Sets the AI's behavior and persona
"content": """You are a senior Python developer. You:
- Write clean, well-documented code
- Follow PEP 8 conventions
- Include type hints
- Explain your reasoning before writing code"""
},
{
"role": "user", # The human's message
"content": "Write a function that validates email addresses."
},
{
"role": "assistant", # Previous AI response (for multi-turn)
"content": "I'll write an email validator using regex..."
},
{
"role": "user",
"content": "Can you also add validation for common disposable email domains?"
}
],
temperature=0.2, # Low temp for code generation
max_tokens=1000, # Maximum response length
top_p=0.3, # Conservative sampling for accuracy
frequency_penalty=0.0, # Don't penalize repeated tokens
presence_penalty=0.0, # Don't force topic changes
)
# Access the response
print(response.choices[0].message.content)
print(f"\nUsage: {response.usage.prompt_tokens} prompt + "
f"{response.usage.completion_tokens} completion = "
f"{response.usage.total_tokens} total tokens")
4.2 Streaming Responses
Streaming sends tokens to the client as they're generated, dramatically improving perceived latency. Without streaming, users stare at a blank screen for 2-10 seconds. With streaming, they see text appear immediately, word by word.
# Streaming — essential for any user-facing LLM application
# pip install openai
from openai import OpenAI
# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()
# Non-streaming: wait for entire response (poor UX)
# response = client.chat.completions.create(model="gpt-4o", messages=[...])
# Streaming: get tokens as they're generated (great UX)
stream = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
stream=True # Enable streaming
)
# Process tokens as they arrive
full_response = ""
for chunk in stream:
if chunk.choices[0].delta.content is not None:
token = chunk.choices[0].delta.content
print(token, end="", flush=True) # Print immediately
full_response += token
print("\n\n--- Streaming complete ---")
print(f"Total response length: {len(full_response)} characters")
# In a web app (FastAPI example):
# from fastapi import FastAPI
# from fastapi.responses import StreamingResponse
#
# @app.post("/chat")
# async def chat(request: ChatRequest):
# async def generate():
# stream = client.chat.completions.create(
# model="gpt-4o", messages=request.messages, stream=True
# )
# for chunk in stream:
# if chunk.choices[0].delta.content:
# yield f"data: {chunk.choices[0].delta.content}\n\n"
# yield "data: [DONE]\n\n"
# return StreamingResponse(generate(), media_type="text/event-stream")
4.3 Function Calling (Tool Use)
Function calling allows the LLM to request that your application execute specific functions with structured arguments. This is the foundation of agent-based architectures — the model decides which tool to use and what arguments to pass.
# Function Calling — the bridge between LLMs and real-world actions
# pip install openai
import json
from openai import OpenAI
# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()
# Step 1: Define your tools (functions the LLM can call)
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and state, e.g., 'San Francisco, CA'"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_database",
"description": "Search the product database",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"},
"category": {"type": "string", "description": "Product category"},
"max_results": {"type": "integer", "description": "Max results to return"}
},
"required": ["query"]
}
}
}
]
# Simulated tool implementations (replace with real APIs in production)
def get_weather(location, unit="celsius"):
"""Simulate a weather API call."""
return json.dumps({"location": location, "temperature": 22, "unit": unit, "condition": "Rainy"})
def search_database(query, category=None, max_results=5):
"""Simulate a product database search."""
return json.dumps({"results": [{"name": f"Umbrella - {query}", "price": 25.99}]})
# Map function names to implementations
available_functions = {
"get_weather": get_weather,
"search_database": search_database,
}
# Step 2: Send message with tools
messages = [
{"role": "system", "content": "You are a helpful assistant with access to tools."},
{"role": "user", "content": "What's the weather in Tokyo and find me some umbrellas?"}
]
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
tool_choice="auto" # Let the model decide which tools to use
)
# Step 3: Process tool calls and feed results back to the model
message = response.choices[0].message
if message.tool_calls:
# Append the assistant message with tool calls
messages.append(message)
for tool_call in message.tool_calls:
function_name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
print(f"Model wants to call: {function_name}")
print(f" Arguments: {arguments}")
# Execute the function
func = available_functions[function_name]
result = func(**arguments)
# Append tool result so the model can generate a final response
messages.append({
"tool_call_id": tool_call.id,
"role": "tool",
"content": result,
})
# Step 4: Get the final response with tool results
final_response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
)
print("\nFinal response:", final_response.choices[0].message.content)
Key Insight: Function calling is not the LLM executing code — it's the LLM generating structured JSON that your application executes. The model decides "I should call get_weather with location='Tokyo'" and returns that as structured data. Your code then actually calls the weather API and feeds the result back. This separation of "reasoning" (LLM) and "execution" (your code) is fundamental to safe agent design.
5. Model Comparison
The LLM landscape is evolving rapidly, with OpenAI, Anthropic, Google, Meta, and Mistral releasing new models every few months. Choosing the right model for your application requires balancing capability (reasoning quality, code generation, multilingual support), cost (per-token pricing can vary 100x between models), latency (response time for real-time apps), and context window (how much data the model can process at once). This section provides a comprehensive comparison and a decision framework to guide model selection.
5.1 Comprehensive Model Comparison
| Model | Provider | Context Window | Strengths | Best For | Approx. Cost (input/output per 1M tokens) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | 128K | Multimodal, fast, strong reasoning | General-purpose, code, analysis | $2.50 / $10.00 |
| GPT-4o mini | OpenAI | 128K | Very fast, cheap, good quality | High-volume, cost-sensitive apps | $0.15 / $0.60 |
| Claude 3.5 Sonnet | Anthropic | 200K | Long context, strong reasoning, safety | Long documents, code, analysis | $3.00 / $15.00 |
| Claude 3 Haiku | Anthropic | 200K | Very fast, affordable | Simple tasks, high volume | $0.25 / $1.25 |
| Gemini 1.5 Pro | Google | 1M-2M | Massive context, native multimodal | Huge documents, video/audio analysis | $1.25 / $5.00 |
| Llama 3.1 405B | Meta (open) | 128K | Open weights, self-hostable | On-prem, privacy-sensitive, fine-tuning | Self-hosted cost varies |
| Llama 3.1 70B | Meta (open) | 128K | Strong open model, reasonable to host | Cost-effective self-hosting | Self-hosted / ~$0.50-1.00 via providers |
| Mistral Large | Mistral AI | 128K | Efficient, multilingual, EU-based | European compliance, multilingual apps | $2.00 / $6.00 |
5.2 Choosing the Right Model
Decision Framework: Start with the cheapest model that meets your quality bar. For most applications: prototype with GPT-4o (highest quality), then optimize by switching to GPT-4o mini or Claude Haiku for simpler subtasks. Use Llama/Mistral when you need data privacy or want to fine-tune. Use Gemini when you need massive context windows.
# Model routing: Use the right model for each task
# This pattern saves 60-80% on API costs in production
# pip install openai anthropic
# Set your API keys:
# export OPENAI_API_KEY="sk-..."
# export ANTHROPIC_API_KEY="sk-ant-..."
from openai import OpenAI
from anthropic import Anthropic
openai_client = OpenAI() # Uses OPENAI_API_KEY env var
anthropic_client = Anthropic() # Uses ANTHROPIC_API_KEY env var
def route_to_model(task_type: str, content: str) -> str:
"""Route different tasks to the most cost-effective model."""
if task_type == "classification":
# Simple classification: cheapest model
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": content}],
temperature=0,
max_tokens=50
)
return response.choices[0].message.content
elif task_type == "complex_reasoning":
# Complex analysis: strongest model
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": content}],
temperature=0.3,
max_tokens=2000
)
return response.choices[0].message.content
elif task_type == "long_document":
# Long document analysis: largest context window
message = anthropic_client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=4096,
messages=[{"role": "user", "content": content}]
)
return message.content[0].text
else:
# Default: balanced model
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": content}],
temperature=0.5
)
return response.choices[0].message.content
# Usage:
# result = route_to_model("classification", "Is this email spam? ...")
# result = route_to_model("complex_reasoning", "Analyze this contract...")
6. LLM Limitations
LLMs are powerful, but they have fundamental limitations that every application developer must understand and mitigate. Hallucinations — confident, plausible-sounding statements that are factually wrong — are the most critical risk, especially in applications that users trust for accurate information. Cost and latency constraints also shape architectural decisions: a high-quality model like GPT-4o can cost 15–20x more per token than a smaller model like GPT-4o mini. This section covers both failure modes and the practical strategies to address them.
6.1 Hallucinations
Hallucination is when an LLM generates plausible-sounding but factually incorrect information. This is not a bug — it's an inherent property of how LLMs work. Because LLMs predict the most likely next token based on patterns, they can generate text that "sounds right" but is wrong.
| Hallucination Type | Example | Mitigation Strategy |
| --- | --- | --- |
| Factual | "The Eiffel Tower is 500 meters tall" (actual: 330m) | RAG with verified sources, fact-checking step |
| Citation | Inventing fake academic papers with plausible authors | Never trust LLM-generated citations without verification |
| Code | Using a function that doesn't exist in the library | Automated testing, code execution verification |
| Confident nonsense | Explaining a concept with complete confidence but wrong details | Temperature=0, explicit "say I don't know" instructions |
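One cheap mitigation that addresses several rows at once is grounding: instruct the model to answer only from supplied context and to admit uncertainty. A minimal prompt-builder sketch (the instruction wording here is an illustrative assumption, not a canonical template):

```python
# Grounded-answering prompt builder (illustrative wording, not a standard template).
def build_grounded_prompt(context: str, question: str) -> list:
    """Construct a message list that restricts answers to verified context."""
    system = (
        "Answer ONLY using the context below. "
        "If the context does not contain the answer, reply exactly: "
        "\"I don't know based on the provided context.\" "
        "Never invent facts, citations, or function names."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_grounded_prompt(
    context="The Eiffel Tower is 330 meters tall.",
    question="How tall is the Eiffel Tower?",
)
# Pair this with temperature=0 when calling the API for consistent, checkable answers.
print(messages[0]["content"][:40])
```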
6.2 Cost & Latency
Production Reality: The #1 reason AI applications fail in production is not quality — it's cost and latency. A prototype that works beautifully at 10 queries/day can become prohibitively expensive at 10,000 queries/day. Always calculate your per-query cost and multiply by expected volume before choosing a model.
# Cost and latency estimation for production planning
def estimate_monthly_cost(
queries_per_day: int,
avg_input_tokens: int,
avg_output_tokens: int,
input_price_per_million: float,
output_price_per_million: float
) -> dict:
"""Estimate monthly API costs for an LLM application."""
monthly_queries = queries_per_day * 30
total_input_tokens = monthly_queries * avg_input_tokens
total_output_tokens = monthly_queries * avg_output_tokens
input_cost = (total_input_tokens / 1_000_000) * input_price_per_million
output_cost = (total_output_tokens / 1_000_000) * output_price_per_million
return {
"monthly_queries": monthly_queries,
"total_tokens": total_input_tokens + total_output_tokens,
"input_cost": f"${input_cost:.2f}",
"output_cost": f"${output_cost:.2f}",
"total_cost": f"${input_cost + output_cost:.2f}",
"cost_per_query": f"${(input_cost + output_cost) / monthly_queries:.4f}"
}
# Compare models for the same workload
print("=== GPT-4o ===")
gpt4o = estimate_monthly_cost(
queries_per_day=5000, avg_input_tokens=2000, avg_output_tokens=500,
input_price_per_million=2.50, output_price_per_million=10.00
)
for k, v in gpt4o.items():
print(f" {k}: {v}")
print("\n=== GPT-4o mini ===")
gpt4o_mini = estimate_monthly_cost(
queries_per_day=5000, avg_input_tokens=2000, avg_output_tokens=500,
input_price_per_million=0.15, output_price_per_million=0.60
)
for k, v in gpt4o_mini.items():
print(f" {k}: {v}")
# GPT-4o: ~$1,500/month vs GPT-4o mini: ~$90/month
# That's a 16x cost difference for often-comparable quality!
7. Your First LLM Application
Let's build a complete, production-quality LLM application — a code review assistant that analyzes code, identifies issues, and suggests improvements.
# Complete LLM Application: Code Review Assistant
# This combines everything we've learned in this article
# pip install openai
import json
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional
````python
import json
from dataclasses import dataclass
from typing import Optional

from openai import OpenAI

# Set your API key: export OPENAI_API_KEY="sk-..."
client = OpenAI()


@dataclass
class ReviewResult:
    """Structured output from the code review."""
    language: str
    issues: list
    suggestions: list
    quality_score: int  # 1-10
    summary: str


def review_code(
    code: str,
    language: str = "python",
    focus_areas: Optional[list] = None,
) -> ReviewResult:
    """
    Review code using GPT-4o with structured output.

    Args:
        code: The source code to review
        language: Programming language
        focus_areas: Optional areas to focus on (security, performance, etc.)
    """
    focus_instruction = ""
    if focus_areas:
        focus_instruction = f"\nFocus especially on: {', '.join(focus_areas)}"

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"""You are a senior {language} code reviewer.
Analyze code for bugs, security issues, performance problems, and style.
{focus_instruction}
Return your review as JSON with this exact structure:
{{
  "language": "{language}",
  "issues": [
    {{"severity": "high|medium|low", "line": "line number or range",
      "description": "what's wrong", "fix": "how to fix it"}}
  ],
  "suggestions": ["improvement suggestion 1", "suggestion 2"],
  "quality_score": 7,
  "summary": "one paragraph overall assessment"
}}"""
                },
                {
                    "role": "user",
                    "content": f"Review this {language} code:\n\n```{language}\n{code}\n```"
                }
            ],
            temperature=0,  # Deterministic for consistent reviews
            max_tokens=2000,
            response_format={"type": "json_object"}  # Force JSON output
        )

        # Parse the structured response
        result = json.loads(response.choices[0].message.content)
        return ReviewResult(
            language=result["language"],
            issues=result["issues"],
            suggestions=result["suggestions"],
            quality_score=result["quality_score"],
            summary=result["summary"]
        )
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON response: {e}")
        raise
    except Exception as e:
        print(f"API call failed: {e}")
        raise


# Usage example
sample_code = """
def get_user(id):
    query = f"SELECT * FROM users WHERE id = {id}"
    result = db.execute(query)
    password = result['password']
    return {"id": id, "name": result['name'], "password": password}
"""

review = review_code(
    code=sample_code,
    language="python",
    focus_areas=["security", "best practices"]
)

print(f"Quality Score: {review.quality_score}/10")
print(f"Summary: {review.summary}")
print(f"\nIssues Found: {len(review.issues)}")
for issue in review.issues:
    print(f"  [{issue['severity'].upper()}] Line {issue['line']}: {issue['description']}")
    print(f"  Fix: {issue['fix']}")
print("\nSuggestions:")
for s in review.suggestions:
    print(f"  - {s}")
````
Architecture Note
What Makes This a Real Application
This code review assistant demonstrates several production patterns:
- Structured output: Uses response_format={"type": "json_object"} to guarantee syntactically valid JSON (schema conformance still needs your own validation)
- Temperature=0: Ensures consistent, reproducible reviews
- Typed data classes: Converts raw JSON into typed Python objects
- Parameterized prompts: Language and focus areas are configurable
- System/user separation: Instructions in system message, data in user message
Exercises & Self-Assessment
Exercise 1
Token Economics Calculator
Build a token cost calculator:
- Install tiktoken: pip install tiktoken
- Write a function that takes a prompt string and returns: token count, estimated cost for GPT-4o, GPT-4o mini, and Claude 3.5 Sonnet
- Test with prompts of different sizes: 100 words, 1,000 words, 10,000 words
- Calculate: How many queries can you make for $100/month with each model?
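A starting point for the calculator, as a sketch. The per-million-token prices below are illustrative assumptions, not current rates (check each provider's pricing page), and token counting is left to tiktoken as the exercise describes:

```python
# Illustrative per-1M-token prices in USD -- assumptions, not current rates.
PRICES = {
    "gpt-4o":            {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":       {"input": 0.15, "output": 0.60},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """Estimated USD cost of a single request for the given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def queries_per_budget(budget_usd: float, input_tokens: int,
                       output_tokens: int, model: str) -> int:
    """How many identical requests fit into a fixed budget."""
    return int(budget_usd / estimate_cost(input_tokens, output_tokens, model))

# Count input tokens with tiktoken, e.g.:
#   enc = tiktoken.encoding_for_model("gpt-4o")
#   input_tokens = len(enc.encode(prompt))
print(queries_per_budget(100.0, 1_000, 500, "gpt-4o-mini"))
```

Once you plug in real token counts and current prices, the gap between models becomes concrete: the same $100 buys orders of magnitude more queries on a small model.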
Exercise 2
Temperature Experiment
Systematically explore how temperature affects output:
- Choose a creative prompt: "Write a one-paragraph story about a robot who learns to cook"
- Generate 5 responses at each temperature: 0, 0.3, 0.7, 1.0, 1.5
- For each temperature level, measure: diversity (how different are the 5 responses?), coherence (do they make sense?), creativity (surprise factor)
- Repeat with a factual prompt: "Explain how photosynthesis works"
- Write a recommendation: what temperature would you use for each task type?
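For the diversity measurement, one simple option is average pairwise Jaccard distance over word sets. The metric choice is an assumption of this sketch — embedding-based similarity would be more sensitive, but this needs no extra dependencies:

```python
from itertools import combinations

def jaccard_distance(a: str, b: str) -> float:
    """1 minus word-set overlap: 0 = identical vocabulary, 1 = disjoint."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)

def diversity(responses: list[str]) -> float:
    """Average pairwise distance across all responses; 0 = all identical."""
    pairs = list(combinations(responses, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Two identical responses and one unrelated one: the average sits between 0 and 1.
print(diversity(["the robot cooked", "the robot cooked", "a chef dreamed"]))
```

Run this over the 5 responses at each temperature and plot diversity against temperature — you should see it climb as sampling gets hotter.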
Exercise 3
Build a Streaming Chat Application
Create a terminal-based chat application with streaming:
- Implement a conversation loop that maintains message history
- Stream responses token-by-token to the terminal
- Track and display token usage after each response
- Implement a context window manager that summarizes old messages when history exceeds 4,000 tokens
- Add a /model command that lets users switch between GPT-4o and GPT-4o mini mid-conversation
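A sketch of the context-window manager from step 4. It uses a rough 4-characters-per-token heuristic in place of tiktoken, and a hypothetical `summarize` callback — in the real exercise that callback would itself call the LLM:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], summarize, budget: int = 4_000) -> list[dict]:
    """When history exceeds the budget, fold the oldest messages into a summary.

    Keeps the system message (index 0) and the most recent exchanges intact.
    `summarize` is a callback: list[dict] -> str (e.g. an LLM summarization call).
    """
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total <= budget:
        return messages
    system, rest = messages[0], messages[1:]
    keep = rest[-4:]          # last two user/assistant exchanges stay verbatim
    old = rest[:-4]           # everything older gets summarized
    summary = {"role": "system",
               "content": f"Summary of earlier conversation: {summarize(old)}"}
    return [system, summary] + keep

history = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": "x" * 8_000},
    {"role": "assistant", "content": "y" * 8_000},
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "how"},
    {"role": "assistant", "content": "fine"},
]
trimmed = trim_history(history, summarize=lambda msgs: f"{len(msgs)} older messages")
print(len(trimmed))  # system + summary + the last four messages
```

The cutoff of "keep the last 4 messages" is arbitrary here; a production version would trim by token count rather than message count.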
Exercise 4
Reflective Questions
- Why is tokenization language-dependent? What implications does this have for building multilingual AI applications?
- Explain the trade-off between context window size and cost/latency. When would you deliberately use a smaller context window?
- A user reports that your AI chatbot gives different answers to the same question. What parameter would you adjust and why?
- Compare function calling and RAG as mechanisms for giving LLMs access to external information. When would you use each?
- Your application uses GPT-4o at $3,000/month. Your CEO wants to cut costs by 80%. What's your strategy?
Conclusion & Next Steps
You now have a solid developer-facing understanding of how LLMs work and how to use them effectively. Here are the key takeaways from Part 2:
- Tokens are the fundamental unit — everything is tokenized before processing, and you pay per token. Always be token-aware
- Context windows are your budget — plan your token allocation across system prompt, history, RAG context, user query, and output
- Sampling parameters (temperature, top-p, top-k) control creativity vs. consistency. Use low values for factual tasks, higher for creative ones
- Streaming is essential for user-facing applications — never make users wait for the full response
- Function calling bridges LLMs and the real world — the model decides what to do, your code executes it
- Model selection is a cost-quality trade-off. Start with GPT-4o for prototyping, then optimize with cheaper models for production
- Hallucinations are inherent — plan for them with RAG, verification steps, and guardrails
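The "context windows are your budget" takeaway can be made concrete with a small allocation plan. The numbers below are illustrative assumptions, not recommendations — the point is to decide the split explicitly rather than letting history grow until something gets truncated:

```python
CONTEXT_WINDOW = 128_000   # e.g. GPT-4o; illustrative

# Planned allocation in tokens -- adjust per application.
budget = {
    "system_prompt": 1_000,
    "conversation_history": 8_000,
    "rag_context": 6_000,
    "user_query": 1_000,
    "reserved_output": 4_000,   # should match max_tokens in the request
}

used = sum(budget.values())
assert used <= CONTEXT_WINDOW, "allocation exceeds the model's context window"
print(f"Planned: {used} tokens, headroom: {CONTEXT_WINDOW - used}")
```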
Next in the Series
In Part 3: Prompt Engineering Mastery, we'll master the art and science of prompting — zero-shot, few-shot, chain-of-thought, Tree-of-Thoughts, ReAct, structured output enforcement with JSON mode and Pydantic, LangChain prompt templates, optimization techniques, and defending against prompt injection.
Continue the Series
Part 1: Foundations & Evolution of AI Apps
From ELIZA to autonomous agents — trace the evolution of AI applications and the modern framework landscape.
Part 3: Prompt Engineering Mastery
Zero/few-shot, chain-of-thought, ReAct, structured outputs, LangChain templates, and prompt optimization.
Part 4: LangChain Core Concepts
Chains, prompts, LLMs, tools, LCEL, and building your first LangChain application.