Introduction: From Theory to Production
Series Overview: This is Part 19 of our 20-part AI Application Development Mastery series. After 18 parts of building foundational knowledge, frameworks, patterns, and advanced techniques, it is time to put everything together. In this part, we build four complete, production-ready AI applications from scratch.
1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns
3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines
6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking
7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning
10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
14. MCP in Production: Building servers, integrations, scaling, agent systems
15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking
16. Production AI Systems: APIs, queues, caching, streaming, scaling
17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack (You Are Here)
20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS
Reading about AI application architecture is one thing. Building production-ready applications is something else entirely. Every project in this part involves real decisions: which model to use, how to structure memory, how to handle failures, how to stream responses, how to manage costs, and how to deploy reliably.
Each project builds on concepts from earlier parts of this series. You will see RAG from Part 5, memory from Part 6, agents from Part 7, LangGraph from Part 8, design patterns from Part 11, production systems from Part 16, and safety from Part 17 all come together in working code.
Key Insight: The four projects are ordered by complexity. Start with the chatbot (simplest) and work your way to the research agent (most complex). Each project introduces new patterns that subsequent projects build upon. By the end, you will have a portfolio of real AI applications.
| Project | Core Concepts | Difficulty |
| --- | --- | --- |
| Chatbot with Memory | LangChain, Redis memory, streaming, session management | Intermediate |
| Document QA (RAG + FAISS) | Document loading, chunking, FAISS, re-ranking, citation | Intermediate-Advanced |
| AI Coding Assistant | Codebase indexing, AST parsing, code-aware RAG, agents | Advanced |
| Research Agent | LangGraph, web search, multi-step reasoning, report gen | Advanced |
1. Project 1: Chatbot with Memory
Our first project is a conversational AI chatbot with persistent memory — it remembers previous conversations across sessions, maintains user preferences, and provides contextually relevant responses. This is the foundation of every AI assistant product.
1.1 Architecture & Memory Design
The architecture uses Redis for persistent conversation storage, LangChain for LLM orchestration and memory management, and FastAPI for the API layer with Server-Sent Events (SSE) streaming. Each user gets a unique session key, and Redis stores both the conversation history and user preferences as serialized JSON — enabling the chatbot to resume context across sessions.
# Project 1: Chatbot with Persistent Memory
# Architecture: LangChain + Redis + FastAPI + Streaming
# pip install langchain-openai langchain-community langchain-core redis
import os
from datetime import datetime
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
# Ensure OPENAI_API_KEY is set: export OPENAI_API_KEY="your-key-here"
class SmartChatbot:
"""Production chatbot with multi-layer memory."""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis_url = redis_url
self.llm = ChatOpenAI(
model="gpt-4",
temperature=0.7,
streaming=True
)
# System prompt with personality and instructions
self.prompt = ChatPromptTemplate.from_messages([
("system",
"You are a helpful, knowledgeable AI assistant. "
"You remember previous conversations and user preferences. "
"Be concise but thorough. If you reference something from "
"a previous conversation, mention that you remember it. "
"Current date: {current_date}"),
MessagesPlaceholder(variable_name="history"),
("human", "{input}")
])
# Build the chain
self.chain = self.prompt | self.llm | StrOutputParser()
# Wrap with message history
self.chain_with_history = RunnableWithMessageHistory(
self.chain,
self._get_session_history,
input_messages_key="input",
history_messages_key="history"
)
def _get_session_history(self, session_id: str):
"""Get or create Redis-backed chat history for a session."""
return RedisChatMessageHistory(
session_id=session_id,
url=self.redis_url,
ttl=86400 * 30 # 30-day TTL for conversations
)
async def chat(self, message: str, session_id: str):
"""Send a message and get a streaming response."""
config = {"configurable": {"session_id": session_id}}
async for chunk in self.chain_with_history.astream(
{"input": message, "current_date": datetime.now().isoformat()},
config=config
):
yield chunk
def get_history(self, session_id: str) -> list:
"""Retrieve conversation history for a session."""
history = self._get_session_history(session_id)
return history.messages
1.2 Full Implementation
The full implementation exposes the chatbot through FastAPI endpoints: /chat streams tokens in real time via SSE, and /history/{session_id} returns a session's stored conversation. The server routes each message through the LangChain chain, while RedisChatMessageHistory persists conversation state after every exchange.
# FastAPI server for the chatbot
# pip install fastapi uvicorn
# Requires: SmartChatbot class from above
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import json
app = FastAPI(title="Smart Chatbot API")
chatbot = SmartChatbot()
class ChatRequest(BaseModel):
message: str
session_id: str
@app.post("/chat")
async def chat_endpoint(request: ChatRequest):
"""Streaming chat endpoint."""
async def generate():
async for chunk in chatbot.chat(
request.message, request.session_id
):
yield f"data: {json.dumps({'content': chunk})}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(
generate(),
media_type="text/event-stream"
)
@app.get("/history/{session_id}")
async def get_history(session_id: str):
"""Get conversation history for a session."""
messages = chatbot.get_history(session_id)
return {
"session_id": session_id,
"messages": [
{"role": m.type, "content": m.content}
for m in messages
]
}
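On the client side, the SSE frames the /chat endpoint emits (`data: {...}` blocks separated by blank lines, ending in a `[DONE]` sentinel) can be reassembled with plain string handling. A minimal sketch of the parsing logic, independent of any HTTP client library:

```python
import json

def parse_sse_stream(raw: str) -> str:
    """Reassemble content chunks from an SSE stream shaped like the
    /chat endpoint's output. Stops at the [DONE] sentinel."""
    content = []
    for frame in raw.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        content.append(json.loads(payload)["content"])
    return "".join(content)

# Example: two content frames followed by the done sentinel
stream = (
    'data: {"content": "Hel"}\n\n'
    'data: {"content": "lo!"}\n\n'
    "data: [DONE]\n\n"
)
print(parse_sse_stream(stream))  # Hello!
```

A real client would read frames incrementally off the response body rather than from a complete string, but the framing rules are the same.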
1.3 Persistent Memory with Redis
Memory Architecture: The chatbot uses a three-tier memory system:
- Short-term (buffer): Last 20 messages in the current session, kept in the LLM context window
- Medium-term (Redis): Full conversation history per session, stored in Redis with 30-day TTL
- Long-term (summary): Periodic summarization of old conversations, stored as a compressed context
# Enhanced memory with summarization for long conversations
# Requires: RedisChatMessageHistory from langchain_community
from langchain_core.messages import SystemMessage
from langchain_community.chat_message_histories import RedisChatMessageHistory
class EnhancedMemory:
"""Multi-tier memory with automatic summarization."""
def __init__(self, llm, redis_url, max_messages=20):
self.llm = llm
self.redis_url = redis_url
self.max_messages = max_messages
async def get_context(self, session_id: str) -> list:
"""Get optimized context for the LLM."""
history = RedisChatMessageHistory(
session_id=session_id, url=self.redis_url
)
messages = history.messages
if len(messages) <= self.max_messages:
return messages
# Summarize older messages
old_messages = messages[:-self.max_messages]
recent_messages = messages[-self.max_messages:]
summary = await self.llm.ainvoke(
f"Summarize this conversation concisely, preserving key facts, "
f"user preferences, and important context:\n\n"
+ "\n".join(f"{m.type}: {m.content}" for m in old_messages)
)
return [
SystemMessage(content=f"Previous conversation summary: {summary.content}"),
*recent_messages
]
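The windowing decision inside `get_context` (keep the newest `max_messages` verbatim, send everything older to the summarizer) is just a list split. Isolated here for clarity:

```python
def split_for_summary(messages: list, max_messages: int = 20):
    """Return (old, recent): old messages go to the summarizer,
    recent ones stay verbatim in the context window."""
    if len(messages) <= max_messages:
        return [], messages
    return messages[:-max_messages], messages[-max_messages:]

old, recent = split_for_summary(list(range(25)), max_messages=20)
print(len(old), len(recent))  # 5 20
```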
2. Project 2: Document QA (RAG + FAISS)
The document QA system allows users to upload PDFs, Word documents, or text files, then ask questions about them. It uses RAG with FAISS for fast, accurate retrieval, with re-ranking and citation tracking for production quality.
2.1 Ingestion & Retrieval Pipeline
The ingestion pipeline handles the complete document processing workflow: loading files from multiple formats (PDF, DOCX, TXT), splitting them into semantically meaningful chunks with overlap, generating embeddings, and storing them in a FAISS vector index. At query time, the retrieval pipeline finds the most relevant chunks and feeds them to the LLM as context for answer generation.
# Project 2: Document QA with RAG + FAISS
# pip install langchain-openai langchain-community langchain faiss-cpu pypdf docx2txt
import hashlib
from pathlib import Path
from langchain_community.document_loaders import (
PyPDFLoader, Docx2txtLoader, TextLoader, UnstructuredHTMLLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
# Ensure OPENAI_API_KEY is set: export OPENAI_API_KEY="your-key-here"
class DocumentQA:
"""Production document QA system with RAG + FAISS."""
def __init__(self, persist_dir: str = "./faiss_index"):
self.persist_dir = persist_dir
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
self.llm = ChatOpenAI(model="gpt-4", temperature=0)
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len
)
self.qa_prompt = ChatPromptTemplate.from_messages([
("system",
"Answer the question based ONLY on the following context. "
"If the context doesn't contain enough information, say so. "
"Always cite which document and section your answer comes from.\n\n"
"Context:\n{context}"),
("human", "{question}")
])
# Load existing index or create new
self.vectorstore = self._load_or_create_index()
def _load_or_create_index(self):
"""Load persisted FAISS index or create new one."""
index_path = Path(self.persist_dir)
if index_path.exists():
return FAISS.load_local(
self.persist_dir, self.embeddings,
allow_dangerous_deserialization=True
)
return None
def ingest(self, file_path: str) -> dict:
"""Ingest a document into the vector store."""
# Select appropriate loader
ext = Path(file_path).suffix.lower()
loaders = {
".pdf": PyPDFLoader,
".docx": Docx2txtLoader,
".txt": TextLoader,
".html": UnstructuredHTMLLoader
}
loader = loaders.get(ext)
if not loader:
raise ValueError(f"Unsupported file type: {ext}")
# Load and split
documents = loader(file_path).load()
chunks = self.text_splitter.split_documents(documents)
# Add metadata
doc_hash = hashlib.md5(file_path.encode()).hexdigest()[:8]
for i, chunk in enumerate(chunks):
chunk.metadata.update({
"source_file": Path(file_path).name,
"chunk_index": i,
"doc_id": doc_hash,
"total_chunks": len(chunks)
})
# Add to vector store
if self.vectorstore is None:
self.vectorstore = FAISS.from_documents(chunks, self.embeddings)
else:
self.vectorstore.add_documents(chunks)
# Persist index
self.vectorstore.save_local(self.persist_dir)
return {
"file": Path(file_path).name,
"chunks": len(chunks),
"status": "ingested"
}
def query(self, question: str, k: int = 5) -> dict:
"""Query the document store with re-ranking."""
if not self.vectorstore:
return {"error": "No documents ingested yet"}
# Retrieve with score
docs_with_scores = self.vectorstore.similarity_search_with_score(
question, k=k * 2 # Over-retrieve for re-ranking
)
# Filter by relevance threshold
relevant_docs = [
(doc, score) for doc, score in docs_with_scores
if score < 0.8 # Lower = more similar in FAISS L2
][:k]
# Build context with citations
context_parts = []
sources = []
for i, (doc, score) in enumerate(relevant_docs):
source = doc.metadata.get("source_file", "Unknown")
chunk_idx = doc.metadata.get("chunk_index", "?")
context_parts.append(
f"[Source {i+1}: {source}, Section {chunk_idx}]\n{doc.page_content}"
)
sources.append({
"source": source,
"chunk": chunk_idx,
"relevance_score": round(1 - score, 3),
"preview": doc.page_content[:200]
})
context = "\n\n".join(context_parts)
# Generate answer
chain = self.qa_prompt | self.llm
answer = chain.invoke({
"context": context,
"question": question
})
return {
"answer": answer.content,
"sources": sources,
"num_sources": len(sources)
}
2.2 FAISS Vector Store
Why FAISS? FAISS (Facebook AI Similarity Search) is the gold standard for in-memory vector search:
- Speed: Searches billions of vectors in milliseconds using GPU acceleration
- Memory: Efficient compression with Product Quantization (PQ) reduces memory by 4-64x
- Flexibility: Supports exact (Flat) and approximate (IVF, HNSW) search indexes
- No server: Runs as a library — no database server to manage
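To ground the "exact (Flat)" case: an IndexFlatL2 search is conceptually a brute-force squared-L2 scan over every stored vector, which FAISS accelerates with SIMD and GPUs. A plain-Python sketch of the same computation, which also shows why lower scores mean more similar (the reason `query()` above filters with `score < 0.8`):

```python
def l2_search(query: list, vectors: list, k: int = 2):
    """Brute-force L2 search: return (distance, index) pairs for the
    k nearest vectors. Lower distance = more similar."""
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    scored = sorted((sq_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1]]
print(l2_search([1.0, 0.0], vectors, k=2))  # nearest: index 1, distance 0.0
```

Approximate indexes (IVF, HNSW) trade a little recall for dramatically fewer distance computations by restricting this scan to promising regions of the space.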
2.3 Advanced RAG Techniques
Basic vector similarity search often misses relevant documents, especially when queries use different terminology than the source text. Hybrid search combines dense vector retrieval with sparse BM25 keyword matching, then re-ranks the combined results before generation. A cross-encoder is the classic re-ranker; the implementation below substitutes a simpler LLM-based scorer. Either way, re-ranking significantly improves retrieval accuracy for domain-specific document collections.
# Advanced RAG: Hybrid search + re-ranking
# pip install rank-bm25
# Requires: DocumentQA class from above
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
class AdvancedDocumentQA(DocumentQA):
"""Enhanced with hybrid search and cross-encoder re-ranking."""
def _build_hybrid_retriever(self):
"""Combine semantic (FAISS) and keyword (BM25) search."""
# Get all documents for BM25
all_docs = list(self.vectorstore.docstore._dict.values())
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 10
faiss_retriever = self.vectorstore.as_retriever(
search_kwargs={"k": 10}
)
# Ensemble: combine both retrievers with weights
return EnsembleRetriever(
retrievers=[bm25_retriever, faiss_retriever],
weights=[0.4, 0.6] # Favor semantic search
)
def _rerank(self, question: str, docs: list) -> list:
"""Re-rank documents using LLM-based scoring."""
scored = []
for doc in docs:
score_resp = self.llm.invoke(
f"Rate relevance 0-10 of this passage to the question.\n"
f"Question: {question}\nPassage: {doc.page_content[:500]}\n"
f"Reply with ONLY a number."
)
try:
score = float(score_resp.content.strip())
except ValueError:
score = 5.0
scored.append((doc, score))
scored.sort(key=lambda x: x[1], reverse=True)
return [doc for doc, _ in scored]
def query_advanced(self, question: str) -> dict:
"""Query with hybrid retrieval and re-ranking."""
retriever = self._build_hybrid_retriever()
docs = retriever.invoke(question)
# LLM-based re-ranking for precision
reranked = self._rerank(question, docs)
context = "\n\n".join(
f"[{d.metadata.get('source_file')}]: {d.page_content}"
for d in reranked[:5]
)
chain = self.qa_prompt | self.llm
answer = chain.invoke({"context": context, "question": question})
return {"answer": answer.content, "sources": reranked[:5]}
3. Project 3: AI Coding Assistant
The coding assistant understands your entire codebase through code-aware RAG, can answer questions about architecture, generate code that follows your patterns, and execute code in a sandboxed environment. Think of it as a local Copilot that knows your project intimately.
3.1 Codebase RAG & Indexing
The codebase RAG system indexes an entire repository by walking the file tree, chunking source files by logical boundaries (functions, classes), and storing them in a FAISS vector index with metadata (file path, language, line numbers). At query time, it retrieves the most relevant code snippets and feeds them to the LLM alongside the developer’s question for context-aware code assistance.
# Project 3: AI Coding Assistant with Codebase RAG
# pip install langchain-openai langchain-community langchain faiss-cpu
import ast
import os
from pathlib import Path
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Ensure OPENAI_API_KEY is set: export OPENAI_API_KEY="your-key-here"
class CodebaseIndexer:
"""Index a codebase for semantic search with AST awareness."""
LANGUAGE_MAP = {
".py": Language.PYTHON,
".js": Language.JS,
".ts": Language.TS,
".java": Language.JAVA,
".go": Language.GO,
".rs": Language.RUST,
}
def __init__(self):
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
def index_codebase(self, root_dir: str, ignore_patterns=None) -> FAISS:
"""Index an entire codebase with language-aware chunking."""
ignore_patterns = ignore_patterns or [
"node_modules", ".git", "__pycache__", ".venv",
"venv", "dist", "build", ".next"
]
all_chunks = []
for path in Path(root_dir).rglob("*"):
# Skip ignored directories
if any(p in str(path) for p in ignore_patterns):
continue
ext = path.suffix.lower()
if ext not in self.LANGUAGE_MAP:
continue
try:
content = path.read_text(encoding="utf-8")
except (UnicodeDecodeError, PermissionError):
continue
# Language-aware splitting
language = self.LANGUAGE_MAP[ext]
splitter = RecursiveCharacterTextSplitter.from_language(
language=language,
chunk_size=1000,
chunk_overlap=100
)
chunks = splitter.create_documents(
[content],
metadatas=[{
"source": str(path.relative_to(root_dir)),
"language": ext,
"functions": self._extract_functions(content, ext)
}]
)
all_chunks.extend(chunks)
# Build FAISS index
vectorstore = FAISS.from_documents(all_chunks, self.embeddings)
return vectorstore
def _extract_functions(self, code: str, ext: str) -> str:
"""Extract function/class names for metadata."""
if ext != ".py":
return ""
try:
tree = ast.parse(code)
names = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
names.append(f"def {node.name}")
elif isinstance(node, ast.ClassDef):
names.append(f"class {node.name}")
return ", ".join(names)
except SyntaxError:
return ""
3.2 Agent with Code Execution
The coding agent extends the RAG system with tool use — it can not only retrieve and explain code, but also execute code in a sandboxed environment, run tests, and iterate on solutions. The agent uses a ReAct loop: it reasons about the task, decides which tool to use (search codebase, execute code, run tests), observes the result, and continues until the task is complete.
# Coding assistant agent with code execution
# pip install langchain-openai langchain
# Requires: vectorstore from CodebaseIndexer above
import os
import subprocess
import tempfile
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
@tool
def search_codebase(query: str) -> str:
"""Search the indexed codebase for relevant code snippets."""
docs = vectorstore.similarity_search(query, k=5)
results = []
for doc in docs:
results.append(
f"File: {doc.metadata['source']}\n"
f"Functions: {doc.metadata.get('functions', 'N/A')}\n"
f"Code:\n{doc.page_content}\n"
)
return "\n---\n".join(results)
@tool
def execute_python(code: str) -> str:
    """Execute Python code in a subprocess with a 30s timeout.
    Note: a subprocess plus a timeout is process isolation, not a full
    sandbox; use containers or gVisor for untrusted code."""
    import sys
    # Write the code to a temp file; close it before running so the
    # subprocess can read it on all platforms.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
    try:
        result = subprocess.run(
            [sys.executable, f.name],  # run with the current interpreter
            capture_output=True, text=True, timeout=30,
            env={**os.environ, "PYTHONDONTWRITEBYTECODE": "1"}
        )
        output = result.stdout
        if result.stderr:
            output += f"\nSTDERR: {result.stderr}"
        return output or "Code executed successfully (no output)"
    except subprocess.TimeoutExpired:
        return "ERROR: Code execution timed out (30s limit)"
    finally:
        os.unlink(f.name)
@tool
def read_file(file_path: str) -> str:
"""Read the contents of a file in the codebase."""
try:
return Path(file_path).read_text(encoding="utf-8")[:5000]
except Exception as e:
return f"Error reading file: {e}"
# Build the coding assistant agent
coding_prompt = ChatPromptTemplate.from_messages([
("system",
"You are an expert AI coding assistant that understands the user's "
"codebase deeply. You can:\n"
"1. Search the codebase for relevant code\n"
"2. Read specific files\n"
"3. Execute Python code to test solutions\n\n"
"Always search the codebase first to understand existing patterns "
"before generating new code. Follow the project's coding style."),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad")
])
llm = ChatOpenAI(model="gpt-4", temperature=0)
tools = [search_codebase, execute_python, read_file]
agent = create_tool_calling_agent(llm, tools, coding_prompt)
coding_assistant = AgentExecutor(agent=agent, tools=tools, verbose=True)
3.3 Context-Aware Code Generation
Code Generation Best Practices:
- Always retrieve first: Search the codebase for existing patterns before generating new code
- Include imports: Analyze the project's import style and dependency versions
- Match style: Copy naming conventions, docstring format, and error handling patterns from the existing codebase
- Test generation: When generating a function, also generate unit tests that follow the project's testing framework
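The "retrieve first, then generate" practice ultimately comes down to prompt assembly: splice retrieved exemplars into the generation prompt so the model imitates the project's own conventions. A minimal, framework-free sketch (the template wording is illustrative, not from the original):

```python
def build_generation_prompt(task: str, retrieved_snippets: list) -> str:
    """Assemble a code-generation prompt that shows the model existing
    project patterns before stating the task."""
    examples = "\n\n".join(
        f"# From {path}\n{code}" for path, code in retrieved_snippets
    )
    return (
        "Follow the conventions shown in these existing project files:\n\n"
        f"{examples}\n\n"
        f"Task: {task}\n"
        "Generate code matching the style above, including tests."
    )

prompt = build_generation_prompt(
    "add a retry decorator",
    [("utils/http.py", "def fetch(url: str) -> bytes: ...")],
)
print(prompt.splitlines()[0])
```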
4. Project 4: Research Agent
The research agent is the most complex project. It performs multi-step research: searches the web, reads articles, extracts key findings, synthesizes information from multiple sources, identifies gaps, and generates a comprehensive research report — all orchestrated by a LangGraph state machine.
4.1 LangGraph Workflow Design
The research agent uses a LangGraph state machine with specialized nodes: a planner that decomposes the research question into sub-queries, a researcher that executes web searches and extracts key findings, a synthesizer that combines findings into a coherent analysis, and a reviewer that checks for completeness and triggers additional research if needed.
# Project 4: Research Agent with LangGraph
# pip install langgraph langchain-openai langchain-community tavily-python
import json
import operator
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
# Ensure these env vars are set:
# export OPENAI_API_KEY="your-key-here"
# export TAVILY_API_KEY="your-tavily-key-here"
class ResearchState(TypedDict):
"""State for the research agent workflow."""
topic: str
search_queries: List[str]
search_results: Annotated[List[dict], operator.add]
summaries: Annotated[List[str], operator.add]
gaps: List[str]
final_report: str
iteration: int
max_iterations: int
class ResearchAgent:
"""Multi-step research agent powered by LangGraph."""
def __init__(self):
self.llm = ChatOpenAI(model="gpt-4", temperature=0)
self.graph = self._build_graph()
def _build_graph(self) -> StateGraph:
"""Build the LangGraph research workflow."""
workflow = StateGraph(ResearchState)
# Add nodes
workflow.add_node("generate_queries", self.generate_queries)
workflow.add_node("web_search", self.web_search)
workflow.add_node("summarize_sources", self.summarize_sources)
workflow.add_node("identify_gaps", self.identify_gaps)
workflow.add_node("generate_report", self.generate_report)
# Define edges
workflow.set_entry_point("generate_queries")
workflow.add_edge("generate_queries", "web_search")
workflow.add_edge("web_search", "summarize_sources")
workflow.add_edge("summarize_sources", "identify_gaps")
# Conditional: loop back if gaps found and under iteration limit
workflow.add_conditional_edges(
"identify_gaps",
self._should_continue,
{"continue": "generate_queries", "finish": "generate_report"}
)
workflow.add_edge("generate_report", END)
return workflow.compile(checkpointer=MemorySaver())
def _should_continue(self, state: ResearchState) -> str:
"""Decide whether to do another research iteration."""
if state["iteration"] >= state["max_iterations"]:
return "finish"
if not state.get("gaps") or len(state["gaps"]) == 0:
return "finish"
return "continue"
async def generate_queries(self, state: ResearchState) -> dict:
"""Generate search queries based on topic and gaps."""
gaps_context = ""
if state.get("gaps"):
gaps_context = f"\nKnowledge gaps to fill: {', '.join(state['gaps'])}"
response = await self.llm.ainvoke(
f"Generate 3-5 diverse search queries to research: "
f"{state['topic']}{gaps_context}\n"
f"Return as a JSON array of strings."
)
queries = json.loads(response.content)
return {"search_queries": queries, "iteration": state["iteration"] + 1}
async def web_search(self, state: ResearchState) -> dict:
"""Execute web searches and collect results."""
from langchain_community.tools.tavily_search import TavilySearchResults
search_tool = TavilySearchResults(max_results=3)
results = []
for query in state["search_queries"]:
search_results = await search_tool.ainvoke(query)
results.extend(search_results)
return {"search_results": results}
async def summarize_sources(self, state: ResearchState) -> dict:
"""Summarize each search result."""
summaries = []
for result in state["search_results"][-10:]: # Last 10 results
summary = await self.llm.ainvoke(
f"Summarize the key findings from this source "
f"relevant to '{state['topic']}':\n\n"
f"Title: {result.get('title', 'N/A')}\n"
f"Content: {result.get('content', '')[:2000]}"
)
summaries.append(summary.content)
return {"summaries": summaries}
async def identify_gaps(self, state: ResearchState) -> dict:
"""Identify knowledge gaps in current research."""
all_summaries = "\n\n".join(state["summaries"])
response = await self.llm.ainvoke(
f"Given this research on '{state['topic']}':\n\n"
f"{all_summaries}\n\n"
f"Identify 0-3 important knowledge gaps that need more research. "
f"If the research is comprehensive, return an empty list. "
f"Return as a JSON array of strings."
)
gaps = json.loads(response.content)
return {"gaps": gaps}
async def generate_report(self, state: ResearchState) -> dict:
"""Generate the final research report."""
all_summaries = "\n\n".join(state["summaries"])
response = await self.llm.ainvoke(
f"Write a comprehensive research report on '{state['topic']}' "
f"based on these findings:\n\n{all_summaries}\n\n"
f"Structure: Executive Summary, Key Findings (numbered), "
f"Analysis, Conclusions, and Sources."
)
return {"final_report": response.content}
async def research(self, topic: str, max_iterations: int = 3) -> dict:
"""Run the full research workflow."""
initial_state = {
"topic": topic,
"search_queries": [],
"search_results": [],
"summaries": [],
"gaps": [],
"final_report": "",
"iteration": 0,
"max_iterations": max_iterations
}
config = {"configurable": {"thread_id": topic[:50]}}
result = await self.graph.ainvoke(initial_state, config=config)
return {
"topic": topic,
"report": result["final_report"],
"sources_count": len(result["search_results"]),
"iterations": result["iteration"]
}
4.2 Web Search & Summarization
Research Agent Design Principles:
- Iterative deepening: Start broad, identify gaps, search deeper. Each iteration refines understanding.
- Source diversity: Generate diverse search queries to avoid echo chamber effects.
- Citation tracking: Every claim in the final report should trace back to a specific source.
- Bounded iterations: Cap at 2-3 iterations to control cost. Diminishing returns after 3 rounds.
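Stripped of LangGraph, the bounded iterative-deepening loop these principles describe is a simple control structure. A hedged sketch where `search` and `find_gaps` are stand-ins for the graph's web-search and gap-analysis nodes:

```python
def research_loop(topic: str, search, find_gaps, max_iterations: int = 3):
    """Iterate: search each open gap, collect findings, and stop when
    no gaps remain or the iteration budget is spent."""
    findings, gaps, iteration = [], [topic], 0
    while gaps and iteration < max_iterations:
        iteration += 1
        for query in gaps:
            findings.extend(search(query))
        gaps = find_gaps(findings)
    return findings, iteration

# Toy stand-ins: one gap surfaces on the first pass, none afterwards.
search = lambda q: [f"finding about {q}"]
find_gaps = lambda fs: ["pricing data"] if len(fs) < 2 else []
findings, rounds = research_loop("vector databases", search, find_gaps)
print(rounds)  # 2
```

The `max_iterations` cap is what keeps cost bounded even when `find_gaps` keeps returning new questions.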
4.3 Report Generation Pipeline
The report generation pipeline transforms raw research findings into a structured, professional document. It uses structured output (Pydantic models) to enforce a consistent report format with sections, key findings, evidence citations, and confidence levels — then renders the structured data into Markdown with proper formatting.
# Enhanced report generation with structured output
# Requires: llm = ChatOpenAI instance from above
from pydantic import BaseModel, Field
from typing import List
class ResearchReport(BaseModel):
"""Structured research report output."""
executive_summary: str = Field(description="2-3 sentence overview")
key_findings: List[str] = Field(description="Numbered key findings")
analysis: str = Field(description="Detailed analysis")
conclusions: str = Field(description="Final conclusions")
sources: List[dict] = Field(description="List of sources with URLs")
confidence: float = Field(description="Confidence score 0-1")
# Use structured output for reliable report format
structured_llm = llm.with_structured_output(ResearchReport)
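Once the LLM returns a `ResearchReport`, rendering it to Markdown is deterministic string work. A sketch over a plain dict so it runs without pydantic; field names match the model above, and the `title`/`url` keys inside `sources` are an assumed shape:

```python
def render_report_markdown(report: dict) -> str:
    """Render structured report fields into a Markdown document."""
    findings = "\n".join(
        f"{i}. {f}" for i, f in enumerate(report["key_findings"], start=1)
    )
    sources = "\n".join(
        f"- [{s['title']}]({s['url']})" for s in report["sources"]
    )
    return (
        f"# Research Report\n\n"
        f"## Executive Summary\n{report['executive_summary']}\n\n"
        f"## Key Findings\n{findings}\n\n"
        f"## Analysis\n{report['analysis']}\n\n"
        f"## Conclusions\n{report['conclusions']}\n\n"
        f"## Sources\n{sources}\n\n"
        f"_Confidence: {report['confidence']:.0%}_"
    )

report = {
    "executive_summary": "Vector DBs are maturing fast.",
    "key_findings": ["pgvector adoption is growing", "FAISS leads in-memory"],
    "analysis": "Detailed analysis here.",
    "conclusions": "Conclusions here.",
    "sources": [{"title": "pgvector docs", "url": "https://example.com"}],
    "confidence": 0.85,
}
md = render_report_markdown(report)
print(md.splitlines()[0])  # # Research Report
```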
5. Full-Stack Architecture
A production AI application requires much more than just the AI logic. Here is the complete full-stack architecture that ties all four projects together into a deployable system.
5.1 React + FastAPI + LangChain + pgvector + Redis + Docker
| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | React + TypeScript + Tailwind CSS | Chat UI, document upload, streaming display |
| API Gateway | FastAPI + Pydantic | REST/WebSocket endpoints, validation, auth |
| AI Orchestration | LangChain + LangGraph | Chains, agents, RAG pipelines, workflows |
| Vector Database | PostgreSQL + pgvector | Persistent vector storage with SQL capabilities |
| Cache & Memory | Redis | Session memory, response cache, rate limiting |
| Container Orchestration | Docker Compose / Kubernetes | Service orchestration, scaling, deployment |
5.2 API Design & Streaming
The production API layer wraps the AI pipeline in FastAPI with proper error handling, request validation, rate limiting, and WebSocket streaming for real-time token delivery. The API design follows RESTful conventions for synchronous endpoints and WebSocket connections for streaming — both backed by Redis for session management and caching.
```python
# Production FastAPI application
# pip install fastapi uvicorn redis websockets
# Requires: SmartChatbot, init_vectorstore from project modules
from contextlib import asynccontextmanager

import redis.asyncio as redis
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.middleware.cors import CORSMiddleware


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifecycle management."""
    # Startup: initialize connections
    app.state.redis = redis.Redis.from_url("redis://redis:6379")
    app.state.vectorstore = await init_vectorstore()
    yield
    # Shutdown: cleanup
    await app.state.redis.close()


app = FastAPI(title="AI Application Platform", lifespan=lifespan)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_methods=["*"],
    allow_headers=["*"],
)


# WebSocket for real-time streaming
@app.websocket("/ws/chat/{session_id}")
async def websocket_chat(websocket: WebSocket, session_id: str):
    await websocket.accept()
    chatbot = SmartChatbot()
    try:
        while True:
            message = await websocket.receive_text()
            async for chunk in chatbot.chat(message, session_id):
                await websocket.send_json({"type": "chunk", "content": chunk})
            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass  # Client closed the connection; don't try to close it again
    except Exception:
        await websocket.close(code=1011)  # Internal error
```
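To exercise the streaming endpoint, a minimal client reads frames until the `done` marker arrives. This is a sketch, not part of the project code: it assumes the third-party `websockets` package and a server running on localhost:8000, and the names `render_frame` and `chat_once` are introduced here for illustration.

```python
# Minimal streaming client sketch for the /ws/chat endpoint above.
# pip install websockets
import asyncio
import json


def render_frame(raw: str):
    """Extract the text from one server frame; return None when the stream is done."""
    frame = json.loads(raw)
    if frame.get("type") == "chunk":
        return frame.get("content", "")
    return None  # "done" (or unknown) frame ends the stream


async def chat_once(uri: str, message: str) -> str:
    import websockets  # third-party; assumed installed

    async with websockets.connect(uri) as ws:
        await ws.send(message)
        parts = []
        async for raw in ws:
            text = render_frame(raw)
            if text is None:
                break
            parts.append(text)
        return "".join(parts)

# Usage (with the server running):
#   reply = asyncio.run(chat_once("ws://localhost:8000/ws/chat/demo", "Hello"))
```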
5.3 Docker Compose Deployment
Docker Compose orchestrates the complete application stack: the React frontend, the FastAPI application server, Redis for caching and session storage, and Postgres with pgvector for persistent vector storage. The configuration below uses environment variable injection for secrets and named volumes for data persistence; for full production readiness you would also add health checks, resource limits, and a monitoring stack (Prometheus and Grafana).
```yaml
# docker-compose.yml
version: "3.9"

services:
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      - REACT_APP_API_URL=http://localhost:8000
    depends_on:
      - api

  api:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://ai:secret@postgres:5432/aiapp
    depends_on:
      - redis
      - postgres
    volumes:
      - ./data:/app/data  # Persist FAISS indexes

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: aiapp
      POSTGRES_USER: ai
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  redis_data:
  pg_data:
```
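Health checks and resource limits are worth adding per service. A sketch for the `api` service is below; it assumes `curl` is available in the image and a `/health` endpoint exists in the FastAPI app, and note that `deploy.resources` limits are honored by Swarm and recent Compose versions but ignored by older ones.

```yaml
  api:
    # ...build, ports, environment as above...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2G
```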
6. Deployment Patterns
Deploying AI applications requires careful consideration of infrastructure choices across the three major cloud providers, scaling strategies for handling variable LLM latency, and monitoring patterns specific to AI workloads. This section covers cloud service mappings, Kubernetes configurations, and serverless deployment options for each project type.
6.1 Cloud Deployment (AWS/GCP/Azure)
| Component | AWS | GCP | Azure |
|---|---|---|---|
| API Server | ECS Fargate / Lambda | Cloud Run | Azure Container Apps |
| Vector DB | RDS + pgvector / OpenSearch | AlloyDB + pgvector | Azure PostgreSQL + pgvector |
| Cache | ElastiCache (Redis) | Memorystore | Azure Cache for Redis |
| Frontend | CloudFront + S3 | Cloud CDN + GCS | Azure CDN + Blob Storage |
| Queue | SQS | Pub/Sub | Service Bus |
6.2 Scaling Strategies
Scaling AI Applications:
- Horizontal scaling: Run multiple API server replicas behind a load balancer. Each replica is stateless (state in Redis/Postgres).
- Queue-based processing: For long-running tasks (research agent, document ingestion), use a job queue (Celery, Bull) to process asynchronously.
- Caching: Cache LLM responses for identical queries. A semantic cache can match similar (not just identical) queries.
- Connection pooling: Pool LLM API connections to avoid rate limits and reduce latency.
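The queue-based pattern can be sketched in-process with asyncio: producers enqueue jobs, a pool of workers drains the queue. In production the queue would be Celery or Bull backed by Redis so jobs survive restarts, but the shape is the same. Job names like `ingest:report.pdf` are illustrative only.

```python
# In-process sketch of queue-based processing for long-running AI tasks.
import asyncio


async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Drain jobs from the queue until cancelled."""
    while True:
        job = await queue.get()
        # A real worker would do slow work here (LLM calls, PDF parsing, ...)
        results.append(f"{name} processed {job}")
        queue.task_done()


async def run_jobs(jobs: list, n_workers: int = 2) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(f"w{i}", queue, results))
               for i in range(n_workers)]
    for job in jobs:
        await queue.put(job)
    await queue.join()   # Wait until every job has been marked done
    for w in workers:
        w.cancel()       # Workers loop forever; stop them explicitly
    return results


if __name__ == "__main__":
    done = asyncio.run(run_jobs(["ingest:report.pdf", "research:agent memory"]))
    print(done)
```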
7. Cost Estimation
Understanding the per-query cost of each AI application is critical for pricing decisions, budget planning, and optimization prioritization. Costs vary dramatically across project types — a simple chatbot query costs fractions of a cent, while a multi-step research agent query can cost 10-50x more due to multiple LLM calls and tool invocations.
7.1 Per-Query Cost Breakdown
| Component | Chatbot | Document QA | Coding Assistant | Research Agent |
|---|---|---|---|---|
| LLM (input) | $0.006 | $0.012 | $0.018 | $0.060 |
| LLM (output) | $0.012 | $0.018 | $0.024 | $0.090 |
| Embeddings | $0.000 | $0.0001 | $0.0002 | $0.0003 |
| Web search API | $0.000 | $0.000 | $0.000 | $0.015 |
| Total per query | ~$0.018 | ~$0.030 | ~$0.042 | ~$0.165 |
| Monthly (1K queries/day) | ~$540 | ~$900 | ~$1,260 | ~$4,950 |
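The figures above follow directly from token counts and per-token prices. The sketch below uses GPT-4-era list prices ($0.03 per 1K input tokens, $0.06 per 1K output tokens) and assumed token counts (roughly 200 each way for the chatbot) chosen to reproduce the table; your actual counts and current prices will differ.

```python
# Per-query cost from token counts. Prices and token counts are assumptions
# that reproduce the table above under GPT-4-era list pricing.
PRICE_IN = 0.03 / 1000   # $ per input token
PRICE_OUT = 0.06 / 1000  # $ per output token


def query_cost(input_tokens: int, output_tokens: int, extras: float = 0.0) -> float:
    """Cost of one query: LLM input + output plus embedding/search extras."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT + extras


chatbot = query_cost(200, 200)   # ≈ $0.018 per query
monthly = chatbot * 1000 * 30    # 1K queries/day ≈ $540/month
print(f"per query: ${chatbot:.3f}, monthly: ${monthly:.0f}")
```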
7.2 Cost Optimization Tips
The biggest cost savings come from caching (storing and reusing responses), model routing (sending simple queries to cheaper models), and prompt compression (reducing token count while preserving meaning). The implementation below sketches all three strategies with Redis-backed exact-match caching; a true semantic cache would additionally embed queries and match on a similarity threshold, so that similar (not just identical) queries hit the cache.
```python
# Cost optimization strategies
# pip install langchain-openai redis
import hashlib
import json

import redis.asyncio as redis
from langchain_openai import ChatOpenAI


class CostOptimizer:
    """Strategies to reduce AI application costs."""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")
        self.small_llm = ChatOpenAI(model="gpt-3.5-turbo")

    # 1. Response caching — avoid redundant LLM calls for repeated queries
    async def cached_query(self, query: str, cache: redis.Redis) -> str:
        cache_key = f"llm:{hashlib.md5(query.encode()).hexdigest()}"
        cached = await cache.get(cache_key)
        if cached:
            return json.loads(cached)  # Cache hit: free!
        result = await self.llm.ainvoke(query)
        # Store the message text — AIMessage objects are not JSON-serializable
        await cache.set(cache_key, json.dumps(result.content), ex=3600)
        return result.content

    # 2. Model routing — use cheaper models for simple queries
    def route_to_model(self, query: str, complexity: str) -> ChatOpenAI:
        if complexity == "simple":
            return self.small_llm  # roughly 10x cheaper than GPT-4
        return self.llm

    # 3. Prompt compression — reduce token count
    def compress_context(self, context: str, max_tokens: int = 1000) -> str:
        """Summarize context to fit within a token budget."""
        # Rough check: word count as a cheap proxy for token count
        if len(context.split()) < max_tokens:
            return context
        # Use a small model to compress
        compressed = self.small_llm.invoke(
            f"Compress this to under {max_tokens} tokens, "
            f"keeping all key facts:\n{context}"
        )
        return compressed.content
```
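A semantic cache matches similar rather than identical queries by comparing embeddings against a similarity threshold. The minimal in-memory sketch below uses a toy letter-frequency `toy_embed` as a stand-in for a real embedding model; a production version would embed with a model like OpenAI's text-embedding-3-small and store vectors in Redis or pgvector. All names here are illustrative.

```python
# Semantic cache sketch: cosine similarity over stored query embeddings.
import math


def toy_embed(text: str) -> list:
    """Toy letter-frequency embedding — stands in for a real embedding model."""
    text = text.lower()
    return [float(text.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Return a cached response when a query is similar enough to a stored one."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # embedding function
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query: str):
        qv = self.embed(query)
        best, best_sim = None, 0.0
        for vec, resp in self.entries:
            sim = cosine(qv, vec)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

With a real embedding model, paraphrases like "how do I reset my password" and "password reset steps" would land above the threshold and share one cached response.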
Exercises & Self-Assessment
Exercise 1
Build the Chatbot
- Implement the full chatbot with Redis memory. Test that conversations persist across server restarts.
- Add user authentication (session tokens) so different users have separate conversation histories.
- Implement the three-tier memory system (buffer, Redis, summary). Test with a 100+ message conversation.
Exercise 2
Build the Document QA
- Implement the document QA system. Ingest 3 different PDF documents and test with 20 questions spanning all documents.
- Compare basic FAISS retrieval vs. hybrid (FAISS + BM25) retrieval. Measure precision at k=5 for 10 queries.
- Add citation tracking: every sentence in the answer should reference a specific source document and page.
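For the retrieval comparison in Exercise 2, precision at k is simply the fraction of the top-k retrieved chunks that are actually relevant. A helper like this (names are illustrative) makes the measurement concrete:

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Example: 3 of the top 5 results are relevant
print(precision_at_k(["d1", "d2", "d9", "d4", "d7"], {"d1", "d2", "d4"}))  # → 0.6
```

Average this over your 10 test queries for each retriever to compare basic FAISS against the hybrid FAISS + BM25 setup.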
Exercise 3
Build the Coding Assistant
- Index an open-source Python project (e.g., Flask, Requests) and ask architectural questions about it.
- Test code generation: ask the assistant to add a new feature that follows the project's existing patterns.
- Implement the sandboxed code execution tool. Test with safe code and verify that dangerous operations are blocked.
Exercise 4
Build the Research Agent
- Implement the full LangGraph research agent. Run it on 3 different topics and evaluate report quality.
- Add a "fact-checking" node that verifies claims against multiple sources before including them in the report.
- Implement report export in multiple formats (Markdown, PDF, DOCX) with proper citations.
Exercise 5
Reflective Questions
- What are the trade-offs between FAISS (in-memory) and pgvector (database-backed) for vector storage? When would you choose each?
- How would you handle document updates in the RAG system? If a PDF is updated, how do you refresh only the changed chunks?
- Compare the cost of running these applications with GPT-4 vs. a locally deployed Llama model. At what scale does self-hosting become cheaper?
- How would you add multi-tenancy to the full-stack architecture? What changes are needed at each layer?
- Design a monitoring and alerting system for the research agent. What metrics would you track? What constitutes a "failed" research?
Conclusion & Next Steps
You have now built four complete, production-ready AI applications — each demonstrating different aspects of the AI application stack. Here are the key takeaways from Part 19:
- Chatbot with Memory — Redis-backed persistent memory with three-tier architecture (buffer, session, summary) creates conversational AI that truly remembers
- Document QA (RAG + FAISS) — Hybrid retrieval (semantic + keyword) with re-ranking and citation tracking produces accurate, verifiable answers from your documents
- AI Coding Assistant — Code-aware RAG with AST parsing, language-specific chunking, and sandboxed execution creates a project-aware development companion
- Research Agent — LangGraph orchestration with iterative deepening, gap identification, and structured report generation automates complex research workflows
- Full-Stack Architecture — React + FastAPI + LangChain + pgvector + Redis + Docker provides a production-grade foundation for any AI application
- Cost management — Semantic caching, model routing, and prompt compression can reduce LLM costs by 50-80%
Next in the Series
In Part 20: Future of AI Applications (FINAL), we look ahead to fully autonomous agents, self-improving systems with DSPy, multi-modal AI, AI-native operating systems, the evolution of MCP, agentic infrastructure at scale, frontier research directions, and the societal implications of AI applications.
Continue the Series
Part 20: Future of AI Applications (FINAL)
Explore autonomous agents, self-improving systems, multi-modal AI, AI-native operating systems, and the future of agentic infrastructure.
Part 1: Foundations & Evolution of AI Apps
Where it all began: from ELIZA to ChatGPT, the transformer revolution, and the modern AI application stack.
Part 5: Retrieval-Augmented Generation (RAG)
The core pattern powering most production AI applications: embeddings, vector databases, retrievers, and RAG pipelines.